The Great Data Grab: Reconciling AI Ambition with Individual Privacy

Artificial intelligence is rapidly transforming our world, fueled by vast quantities of data scraped from the internet. This automated harvesting of information, while enabling innovation, has ignited a fierce debate about privacy. Personal details, once shared with specific expectations, are now routinely extracted and repurposed, often without consent or even knowledge. This presents an urgent challenge: how do we reconcile the boundless ambition of AI with fundamental protections for individual privacy in an age of unprecedented data collection? The answers to this question have far-reaching consequences for individuals, communities, and the future of the internet itself.

When is this paper forthcoming?

This paper, titled “The Great Scrape: The Clash Between Scraping and Privacy,” is identified as forthcoming. Specifically, it is scheduled to appear in Volume 113 of the California Law Review in 2025, though the exact issue is unspecified. The mention of “forthcoming” is crucial as it places the paper within a timeline of academic publication, signaling its acceptance for future publication after undergoing peer review and revisions.

The draft available for reference is dated July 26, 2024. This draft date provides additional context, revealing the stage of completion the paper had reached at that time. Although the paper is marked as a draft, implying potential changes before the final publication, the existence of this draft with a specific date allows readers and researchers to access the authors’ work prior to its official release in the California Law Review. The authors are Daniel J. Solove and Woodrow Hartzog.

What is the main subject of this paper?

This paper primarily addresses the clash between the growing practice of data scraping and the fundamental principles of privacy law. Artificial intelligence (AI) systems increasingly rely on vast quantities of data, much of which is personal, gathered through automated extraction processes known as “scraping.” While scraping enables web searching, archival, and scientific research, its application in AI development raises significant concerns about fairness, lack of individual control, transparency, consent, data minimization, onward transfer, and data security. The authors argue that scraping must undergo a serious reckoning with privacy law, as it currently bypasses many of the key tenets designed to protect personal data. They emphasize the urgent need for a “great reconciliation” between scraping practices and privacy safeguards, given the zealous pursuit and astronomical growth of AI, which they term the “great scrape.”

The central argument revolves around the fundamental tension between scraping and established privacy principles. The paper highlights that scraping of personal data violates nearly every key privacy principle embodied in privacy laws, frameworks, and codes – including transparency, purpose limitation, data minimization, choice, access, deletion, portability, and protection. Scrapers act as if all publicly available data were free for the taking, yet privacy law is regularly conflicted about publicly available data. While some laws exclude such data, other laws such as the EU’s General Data Protection Regulation largely do not. The authors intend to demonstrate how the public availability of scraped data shouldn’t give scrapers a free pass. Because scraping involves the mass, unauthorized extraction of personal data for unspecified purposes without any limitations or protections, it stands in stark contrast to the values underpinning privacy law and erodes key principles that form the backbone of data privacy regulations.

In their analysis, the authors explore the inherent complexities in balancing the benefits of scraping with the need to protect individual privacy. They contend that most scholarship about scraping focuses on how scraping fares under particular laws, especially the Computer Fraud and Abuse Act (CFAA); this article is broader and more conceptual. A categorical ban on scraping would be undesirable and probably untenable if we want a usable internet, yet a fundamental tension exists between scraping and core, longstanding privacy principles. As the paper progresses, the authors propose re-conceptualizing the scraping of personal data as surveillance and treating protection against scraping as a duty of data security. Beyond the legal and ethical dimensions, the authors also discuss technological considerations, such as website defenses against scraping, to explore the multifaceted "scraping wars" in which website operators try to stop scraping while AI entities work to carry it out.

What is the primary type of data used by AI tools based on the text?

AI tools heavily rely on personal data, often acquired through a process called “scraping.” This data extraction method involves the automated collection of vast quantities of information from the internet. While scraping can gather various types of data, its application in AI development often centers on personal data, transforming it into what fuels technologies like facial recognition, deep fakes, and generative AI models. This reliance on personal data distinguishes AI development from other scraping applications, raising heightened concerns about privacy and individual rights.

The personal data scraped encompasses a wide array of information, including details shared on social media platforms, personal websites, and even professional networks. This information often includes names, images, contact details, personal interests, opinions, and behaviors. The automated nature of scraping enables the acquisition of such data easily and quickly, often bypassing traditional routes such as APIs designed for consensual data transfer. The ability to collect personal data at such scale makes it an appealing method to the developers of AI tools, despite the ethical and legal quandaries it presents.

Deep Dive: The Role of Personal Data in Training AI

Personal data from profiles on different platforms, blogs, media articles, etc., provides the foundation for training large language models (LLMs) and other types of generative AI. This training process allows the AI to learn patterns, behaviors, and relationships from human-generated text and images, enabling it to respond to prompts, generate new content, and perform tasks that require understanding and mimicking human intelligence. While developers scrape all sorts of data, large portions are personal and sensitive. The heavy reliance on scraping publicly available data for training has triggered numerous lawsuits and debates concerning data privacy, intellectual property, and the ethical considerations of AI development.

According to the text, how do organizations collect significant personal data?

Organizations collect significant personal data through a process known as scraping, which is the automated extraction of large amounts of data from the internet. Scrapers are designed to gather data from websites in an efficient and systematic manner. This method allows organizations to acquire vast quantities of information quickly and cheaply, without the need for direct interaction, notice, consent, or the opportunity for individuals to object. The personal data obtained through scraping is then used for a variety of purposes, including training artificial intelligence (AI) models, conducting market research, compiling feeds, monitoring competitor pricing and practices, and analyzing trends and activities.

The escalation of scraping is closely tied to the rise of AI, which requires massive amounts of training data. Organizations are either directly scraping data or purchasing scraped data to maintain a competitive edge. This has led to what the source material terms a “great scrape,” a frenzied data grab on a grand scale. Prominent examples include Clearview AI, which scraped billions of images to develop a facial recognition system, and OpenAI, which has been accused of scraping data from “hundreds of millions of internet users” to train its AI chatbot, ChatGPT. This activity often occurs without permission or authorization, raising significant privacy concerns. Platforms such as Facebook, X (formerly Twitter), and Reddit have also experienced extreme levels of data scraping, further highlighting the scale and pervasiveness of this practice.

The Role of Bots

Web scraping is often carried out by computer programs called "web crawlers," "spiders," or "bots." These bots systematically scour the internet, gathering information from webpages. For a long time, information-gathering bots operated in a generally courteous manner, but bots that ignore the simple text file known as robots.txt have grown in number, and their efforts have become more sophisticated.

For how long, as described in the text, has scraping been occurring?

Scraping has been a persistent presence on the World Wide Web for decades, with only pockets of resistance. Since the inception of the commercial internet in the early 1990s, bots have been deployed to scour the internet for data. While initially used to index websites, thus enabling search functionality, scraping evolved to encompass market research, feed compilation, competitive analysis, and trend identification. This continuous operation, although subject to varying legal and technological countermeasures, highlights its foundational role in the digital ecosystem from the onset of the web’s development.

The article suggests that the practice of scraping has not only been ongoing for a considerable period but also that its intensity and reach have significantly escalated in recent years, particularly with the rise of artificial intelligence. The demand for massive datasets to train AI models has fueled “the great scrape,” characterized by a frenzied and large-scale data acquisition. Previously, many bots respected the instructions offered in “robots.txt” files, but those days are coming to an end as automated collection becomes increasingly ubiquitous. This surge in scraping activity points to a departure from earlier, more restrained practices of data harvesting to an era where the sheer volume and velocity of scraping are unprecedented.

What is the definition of the term data scraping?

Data scraping, in essence, is automated online data harvesting. More technically, it occurs anytime "a computer program extracts data from output generated from another program." More specifically, data scraping means the "retrieval of content posted on the World Wide Web through the use of a program other than a web browser or an application programming interface (API)." This process is crucial for transforming unstructured data on the web into a structured format that can be stored and analyzed in a central, local database or spreadsheet, facilitating efficient data handling.

While “scraping” might colloquially refer to manual methods like copy-pasting, the contemporary understanding largely focuses on automated techniques involving programs known as “web crawlers,” “spiders,” or “bots.” These sophisticated computer programs systematically scour the Internet, collecting information from webpages with relative ease and minimal cost. This automated process has become increasingly ubiquitous, supporting numerous online activities from search engine indexing to market research. Any publicly accessible website can be scraped by these automated tools.
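
To make the distinction concrete, here is a minimal sketch, not taken from the paper, of how an automated scraper might turn unstructured HTML into structured records. The URL, page structure, and field names are hypothetical, and it assumes the third-party requests and beautifulsoup4 packages are installed.

```python
# Minimal illustration of automated scraping: turning unstructured HTML
# into structured rows. The URL and CSS selectors below are hypothetical.
import csv
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/members", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

rows = []
for card in soup.select("div.profile-card"):  # assumed page structure
    rows.append({
        "name": card.select_one("h2").get_text(strip=True),
        "bio": card.select_one("p.bio").get_text(strip=True),
    })

# Store the structured result locally, as the article describes.
with open("profiles.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "bio"])
    writer.writeheader()
    writer.writerows(rows)
```

A few dozen lines like these, looped over millions of pages, are what make mass collection cheap and easy compared with manual copying.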

The advent of artificial intelligence has driven an unprecedented rise in data scraping. AI systems rely on massive quantities of training data, often gathered through scraping techniques. Large language models (LLMs) and generative AI models demand vast amounts of data, leading organizations to either scrape data themselves or purchase it from specialized data scraping services. This demand has fueled exponential growth in the web scraping software market, underscoring how crucial scraped data has become to ongoing technological development across platforms. Moreover, the rise of bots-as-a-service offerings highlights the increasing accessibility and commercialization of data scraping.

What types of automated data extraction tools are discussed in the text?

The primary focus of the document is on automated web scraping, achieved through the use of computer programs referred to as “web crawlers,” “spiders,” or “bots.” These tools systematically gather information from webpages across the internet. The authors clarify that not all bots engage in web scraping, distinguishing the information-gathering bots from those used for activities like spamming, marketing, or launching denial-of-service attacks. The scraping bots are designed to efficiently transform unstructured data found on the web into structured data that can be stored and analyzed in databases or spreadsheets.

The text implicitly contrasts automated data extraction with manual methods. There is a passing reference to copy-and-paste as a traditional, “manual” technique sometimes colloquially described as scraping. However, the document primarily concentrates on the automated kind of scraping involving complex computer programs rather than simple human actions.

In addition to general web crawlers, the text also mentions application programming interfaces, or "APIs." APIs are designed for consensual extraction and sharing of data. Their mention implies they are tools for controlled data extraction, distinct from the more encompassing and less regulated web scraping undertaken by crawlers and bots. APIs are presented as channels for explicit, pre-approved data sharing, yet the large quantities of data needed for many functions are still largely obtained through scraping.
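
For contrast with the scraping sketch above, here is a hedged sketch of API-based retrieval: access is explicitly granted and the data arrives already structured. The endpoint, token, and field names are invented for illustration.

```python
# Consensual, API-based data access: a sanctioned, documented channel.
# The endpoint, token, and fields here are hypothetical.
import requests

resp = requests.get(
    "https://api.example.com/v1/posts",
    headers={"Authorization": "Bearer YOUR_TOKEN"},  # access granted explicitly
    params={"limit": 50},
    timeout=10,
)
resp.raise_for_status()

for post in resp.json()["items"]:  # already structured; no HTML parsing needed
    print(post["author"], post["created_at"])
```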

What has historically determined whether or not a bot crawls a website?

Bots have existed since the early days of the commercial internet, and their operation has historically relied on an unusual system of mutual respect. Websites typically employ a simple text file called "robots.txt." This file acts as a set of instructions to web crawlers, or "bots," politely indicating which parts of a website they are permitted to crawl or forbidden from accessing. The file has no specific legal or technical authority; it began as an understanding among internet pioneers to respect each other's wishes and build the internet collaboratively. This odd system depends on bot operators adhering to the guidelines set out in the robots.txt file, essentially honoring a gentleman's agreement. Until recently, this voluntary adherence has been the main factor determining whether a bot crawls a website.
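
As an illustration of how this voluntary system works in practice, here is a small sketch, not drawn from the paper, of a courteous crawler consulting robots.txt using Python's standard library urllib.robotparser. The site and user-agent name are placeholders.

```python
# How a courteous crawler consults robots.txt before fetching a page.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the site's crawling instructions

# can_fetch() reports whether the file permits this bot to crawl the path;
# nothing technically or legally forces the bot to honor the answer.
print(rp.can_fetch("ExampleBot", "https://example.com/private/profiles"))
print(rp.can_fetch("ExampleBot", "https://example.com/public/articles"))
```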

The Rise of Scraping and the Limitations of Robots.txt

Despite the widespread adoption of robots.txt files and the relative adherence of "good" bots to their directives, the system has inherent limitations. Critically, obeying the robots.txt file is entirely voluntary; there is no inherent legal or technical requirement for bot operators to adhere to the directives contained within. This means that "bad" bots, or those deployed by actors intent on scraping data regardless of expressed wishes, can freely ignore those instructions and continue crawling. Indeed, even large media companies that block scraper bots report that their sites are still being scraped contrary to robots.txt instructions. The distinction between "good" bots that respect the robots.txt file and "bad" bots that disregard it has become increasingly important in the age of AI, where the demand for training data has incentivized increasingly aggressive scraping.

As the digital landscape evolves and the stakes rise, the limitations of relying solely on the robots.txt file have become more apparent. The polite handshake deal that once regulated bot behavior is increasingly strained by entities unwilling to abide by its terms. As the value of data, and the volume required to train AI systems, has continued to explode, so has websites' use of more aggressive defensive techniques, including access restrictions, CAPTCHAs, rate limiting, browser fingerprinting, and banning users' accounts and IP addresses. The rise of the "scraping wars," with increasingly aggressive tactics both to scrape data and to defend against scraping, highlights that the robots.txt file is becoming less reliable.

What has led to a more intense focus on the conflict between scraping and privacy?

The intensifying conflict between scraping and privacy stems from a few critical developments. Historically, scraping existed in a kind of ethical twilight, tolerated as a potential evil but not fully condemned or critically examined. Now, the relentless pursuit of artificial intelligence development and the expansion of AI's capabilities require scraping on a grand scale. Companies are scraping and purchasing scraped data in order to compete. In effect, organizations engaged in widespread scraping are acting as though anything online is fair game for data collection, which puts them in conflict with basic privacy principles.

The growth of AI and the recognition of its potential impacts and risks have brought these privacy tensions to the forefront. Personal data is the fuel that drives AI, so its zealous pursuit can be objectionable or even harmful to individuals and society, directly and indirectly increasing exposure to surveillance, harassment, and automated decisions. Compounding the problem, social media presents a treasure trove to scrapers: platforms such as Facebook, X (formerly Twitter), Reddit, and LinkedIn host billions of photos and vast stores of personal data.

What are the key principles of data privacy?

Data privacy law, rooted in the Fair Information Practice Principles (FIPPs), aims to protect personal information through a framework of fairness, individual autonomy, and organizational accountability. Scraping, the automated extraction of data, often contravenes these core principles. The core tenets of the FIPPs include collecting and processing data only when necessary for a legitimate purpose spelled out in advance, keeping the data safe and accurate, and doing everything in a transparent and accountable way. These principles, developed in response to fears about the power of digital databases to easily collect, store, aggregate, search, and share information, serve as the foundation on which most data privacy laws are built. Scraping therefore comes into inherent, and arguably irreconcilable, conflict with those foundational principles.

Scraping creates tension with key privacy tenets: fairness dictates handling data as expected and avoiding adverse effects; individual rights empower people to control how their data is used, yet scraping proceeds without consent, stripping that agency; transparency mandates clear disclosure of data practices, whereas scraping often occurs surreptitiously. Purpose specification limits data use to originally stated aims, but scraping involves indiscriminate collection for unknown purposes. Data minimization requires collecting only necessary data, contrasting with scraping's broad acquisition. Restrictions on onward transfer are likewise thwarted, as are rules for data security and safe storage: those rules are meant to keep sensitive data in a protected environment, yet automated scraping, and the AI services built on scraped data, render them ineffective.

These principles create a vision for data privacy constructed around fairness, individual autonomy, and data processor accountability. Because most privacy laws include each of these tenets, the scraping of personal data is incompatible with core ideas in each of these areas. It is important, therefore, to craft rules and limitations for scraping in the age of artificial intelligence that reconcile traditional notions of data protection with modern technology. In short, scraping as currently practiced has proven incompatible with users' privacy expectations and the core principles of data protection, and it has created problems for effective and practical legal oversight.

What do individuals expect regarding their personal data?

Individuals have specific expectations when sharing their personal data online, and these expectations are deeply tied to the context in which the information is shared. Research consistently reveals a desire for control over personal data and a reasonable expectation that recipients will protect it from unauthorized access. This expectation is often violated by scraping, as data is extracted without knowledge or consent, disrupting the intended control and purpose initially envisioned by the individual. Personal preferences and design features like delete buttons, edit functions, and personalized news feeds further underscore that even public disclosures are intended to be limited and subject to modification. This expectation for control and context is a bedrock principle in many privacy discussions.

The loss of control over one’s personal information is a palpable consequence of data scraping, especially when it occurs without knowledge or consent. Existing privacy frameworks aim to provide individuals with certain rights, including access, correction, and deletion, but data scraping effectively nullifies these rights. The wide dissemination of scraped data across multiple platforms makes it virtually impossible to exercise control effectively, rendering deletion requests, for example, futile. The very notion of informational self-determination, a goal championed by many privacy laws, is undermined by the prevalence of automated data extraction, which treats individuals’ data more as a freely exploitable resource than a matter subject to their personal management and agency.

Many individuals are aware that websites collect some data, but they expect that data to be used for narrow purposes and retained by particular actors, such as the commercial website itself, rarely by an unknown third party. They are also often unaware that data can be retained indefinitely by any number of actors who collect it for purposes entirely unrelated to the original context. Scrapers often rely on an illusory bright-line distinction between public and private, ignoring the additional risks created when such material is gathered improperly. This further devalues individuals' expectations and exposes them to surveillance risks without a meaningful say in how their information will be protected.

What does the principle of data minimization require?

The principle of data minimization is a cornerstone of modern data protection legislation, and it demands that organizations collect and process only the personal data that is strictly necessary for a specified, legitimate purpose. This principle is a critical component of responsible data handling, aiming to reduce the potential for harm arising from the storage and use of excessive or irrelevant information. The core idea is that less data in circulation means fewer opportunities for breaches, misuse, and privacy violations. It dictates that data collection be proportional to the stated objective, ensuring that individuals’ privacy is not unnecessarily infringed upon. Data minimization thus reflects a commitment to responsible data governance, compelling organizations to carefully consider the extent and nature of the personal data they acquire and retain, while striving to handle only what is truly essential for their operations.

In the context of data scraping, the principle of data minimization is particularly relevant given the often broad and indiscriminate nature of the practice. Traditional scraping typically involves extracting large swaths of information from websites, with little regard for the necessity of each piece of data. This indiscriminate approach fundamentally clashes with the spirit of data minimization, raising significant privacy concerns and underlining the need for more focused and ethical scraping strategies. Effective data minimization would require scrapers to precisely define their objectives and to narrow the scope of their data collection to only those elements directly relevant to achieving those goals. It may also require them to limit retention of the scraped data, specifying how long they keep it in order to remain compliant. It would also create incentives for organizations to actively assess and refine their scraping practices, implementing procedural and regulatory mechanisms to ensure ongoing compliance. Scrapers collecting large amounts of data should also take special care to delete unnecessary data at the appropriate time.
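
To illustrate what minimization might look like in practice, here is a short, hypothetical sketch: only the fields needed for a stated purpose are retained, and each record carries a deletion deadline. The field names and retention window are assumptions for illustration, not requirements drawn from the paper.

```python
# A sketch of data minimization applied to scraped records: keep only the
# fields needed for a stated purpose and attach a deletion deadline.
# Field names and the retention window are hypothetical.
from datetime import datetime, timedelta, timezone

NEEDED_FIELDS = {"product_name", "price"}   # defined in advance by the stated purpose
RETENTION = timedelta(days=30)              # delete after this window

def minimize(record: dict) -> dict:
    kept = {k: v for k, v in record.items() if k in NEEDED_FIELDS}
    kept["delete_after"] = (datetime.now(timezone.utc) + RETENTION).isoformat()
    return kept

scraped = {
    "product_name": "Widget",
    "price": "19.99",
    "seller_name": "Alice Example",       # personal data not needed for price monitoring
    "seller_email": "alice@example.com",
}
print(minimize(scraped))  # only product_name, price, and the deadline survive
```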

Ultimately, the principle of data minimization acts as a crucial check on the scope and impact of data operations, including web scraping. When thoughtfully applied, it can protect personal data from overcollection and promote more careful, ethical data practices; for scrapers in particular, it can help them remain compliant with regulations while reducing the regulatory burden on their operations and processes. By limiting data collection to what is directly relevant and justified, organizations can reduce the risk of privacy harms and maintain greater accountability in how they handle personal information. Enshrining data minimization in practice will not only enhance data security and individual privacy, but also promote a more responsible and sustainable data ecosystem overall.

What does the principle of onward transfer encompass?

The principle of onward transfer, a cornerstone of many privacy laws, mandates that organizations transferring personal data to third parties must establish contractual and technical controls to ensure that the data continues to be protected downstream. This principle seeks to safeguard the privacy expectations of individuals, recognizing that when data is shared, individuals take into consideration the recipient’s identity, as well as those of intended and imagined audience members. It aims to create a chain-link confidentiality agreement among data processors, so that reasonable privacy is protected where possible, regardless of data location. The key requirement involves vetting third-party processors for compliance to security and privacy standards, as well as enforcing sufficient contractual agreements designed to protect consumer data.

Scraping demonstrably circumvents onward transfer, as it entails the appropriation of data by unauthorized entities, lacking any contractual bonds, stipulations, or individual consent. It subverts the promises made by companies concerning data usage and security, rendering security measures a farce if malicious actors can merely extract data without consequence. Regulatory frameworks like the GDPR and various U.S. privacy laws require data holders to obtain contracts from all secondary recipients of their data to uphold data protection; scraping undermines these standards, because scrapers do not respect rules designed to apply to all secondary transfers. The result is a bifurcated system in which law-abiding organizations comply with rules that organizations engaged in unauthorized scraping simply ignore. Any regulation that relies only on law-abiding companies protecting consumer data will be undercut by scrapers who bypass these protections and improperly sell personal data.

What are the requirements of the principle of data security?

The principle of data security, a cornerstone of modern privacy regulations, mandates that organizations processing personal data must implement appropriate technical and organizational measures to ensure its protection. This includes safeguarding against unauthorized or unlawful processing, as well as accidental loss, destruction, or damage. Data security extends beyond simply preventing data breaches; it encompasses maintaining the confidentiality, integrity, and availability of personal data throughout its lifecycle.

In the context of web scraping, the principle of data security places a specific onus on the organizations whose websites are targeted. These organizations must adopt proactive measures to mitigate the risk of unauthorized data extraction. These measures can range from implementing access restrictions and CAPTCHAs to employing rate limiting and browser fingerprinting. A robust data security strategy also involves continuous monitoring for suspicious activity and swift responses to detected scraping attempts, such as banning user accounts or IP addresses. Neglecting these security responsibilities essentially renders other privacy safeguards meaningless, as unauthorized scraping can circumvent transparency requirements, consent mechanisms, and limitations on data sharing or use.
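
As a concrete illustration of one such measure, here is a minimal sketch of per-IP rate limiting. The thresholds are hypothetical, and real deployments typically rely on a web application firewall or reverse proxy rather than application code like this.

```python
# Minimal per-IP rate limiting, one of the defensive measures mentioned above.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 100                 # allowed requests per IP per window (hypothetical)
_history = defaultdict(deque)      # ip -> timestamps of recent requests

def allow_request(ip: str) -> bool:
    now = time.monotonic()
    recent = _history[ip]
    while recent and now - recent[0] > WINDOW_SECONDS:
        recent.popleft()           # drop requests that fall outside the window
    if len(recent) >= MAX_REQUESTS:
        return False               # likely an aggressive bot: throttle or block
    recent.append(now)
    return True
```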

Scraping and Data Security Violations

Scraping of data that should be protected can itself constitute a breach, which includes improperly sharing data with third parties. Failing to implement or maintain reasonable protections against scrapers is a violation of data security obligations. The data must still be protected; as a rule, it must be safeguarded even when it is available to the public.

How is scraping best understood?

Scraping is best understood as automated online data harvesting, encompassing any instance where a computer program extracts data from another program’s output. More precisely, scraping is defined as the retrieval of content posted on the World Wide Web through programs excluding standard web browsers or APIs. It transforms unstructured web data into structured data for centralized storage and analysis. Though some might consider copying and pasting as scraping, the focus here is the automated kind done via web crawlers, spiders, or bots which facilitates cheap and easy mass collection of information. These computer programs systematically scour the internet for data, and their proliferation is only growing.

From a privacy and ethical standpoint, scraping exists in a gray area, neither wholly condoned nor fully condemned. While some view it as simple data gathering from publicly accessible sources, others frame it as an intrusion, a digital trespass in which scrapers pilfer data viewed as a form of property. Still others see it as a norm violation, much like taking more than one's fair share of free food samples. Ultimately, these metaphors are helpful frameworks but imperfect ones that do not capture every nuance. The most important point is the scale of automation: scraping drastically reduces the cost of collecting data at scale compared with non-automated collection. It is this stark contrast between gathering information manually and scraping it that sets the stage for current conflicts over privacy, commercial interest, and intellectual property.

Therefore, the key is to focus on scraping’s affordances – its inherent properties that determine how it can be used. Scraping dramatically lowers the cost of obtaining and keeping information at scale, something unimaginable with manual data collection. This difference sets the stage for the conflict it brings. Its unprecedented capacity transforms information gathering, requiring a paradigm shift in how we understand data collection and its implications.

What is the most important question to consider for scraping?

The most important question to consider for scraping isn’t simply whether it’s technically feasible or legally permissible, but rather, whether it aligns with the public interest. While the extraction of data might offer economic advantages or fuel artificial intelligence advancements, it is crucial to assess whether it causes unreasonable risk of harm to individuals, disadvantaged groups, or society at large. This assessment necessitates a careful balancing of potential benefits against the potential for privacy violations and the erosion of trust, considering that unchecked scraping can facilitate surveillance, enable discriminatory practices, and undermine the autonomy of individuals. Privacy rights often get scant consideration in litigation around scraping, so a framework is needed that starts and remains focused on privacy.

This public interest framework demands a focus on proportionality, requiring data harvesting to provide meaningful benefits that outweigh the risks and that are proportional to, or in excess of, any private, individual benefit derived by those doing the scraping. Too often, companies offer some modest, trivial benefit in exchange for lucrative information extraction that benefits only the scraper. For example, a mere efficiency gain for an AI system should not be considered justification for scraping. Such rules should require that the purported benefit be specific, grounded in reality, and necessary and proportional to the collection of information.

A more comprehensive decision-making process is key. Deciding what data uses are safe and reasonable requires input and representation that reflect the diversity of affected individuals. This includes individuals, stakeholders, legal counsel, policymakers, and technology advisors. Data laws rarely require such diversity.

The protection and respect owed to the data must be maintained. Scraped data must be afforded the same safeguards as other personal data under current privacy laws; it should not lose all protection simply because it was openly available. The law must also continue to limit harmful downstream uses of the data.

What is the primary goal of the Reasonable Risk of Harm Principle?

The Reasonable Risk of Harm Principle, within the context of regulating data scraping, seeks to prevent the collection, use, and transfer of scraped personal data if it poses an unreasonable risk of harm to individuals, disadvantaged groups, or society as a whole. This moves beyond simply identifying direct, individual-level harms to considering broader societal consequences, including oppressive surveillance and social harms to marginalized communities and society such as loss of trust and democratic failure. Such a risk-based approach explicitly recognizes that individuals’ right to privacy extends to not just the protection of identifiable personal information, but also the wider socio-economic dangers that data aggregation poses.

Ultimately, a robust conception of harm must balance economic and technological growth and innovation against the potential harms data scraping could unleash. While absolute certainty about future impact will always be unattainable, this principle encourages the adoption of measures to better assess risk and enables regulators to mitigate potential damage. It means having a response ready if downstream harms start to come into focus, with particular attention paid to impacts on the marginalized and disadvantaged.

What does the Proportional Benefits Principle require?

The Proportional Benefits Principle, as envisioned within the context of regulating data scraping, insists that the gathering, utilization, and conveyance of extracted personal information must yield tangible advantages for individuals, marginalized communities, and the broader society. These advantages must not only be meaningful but also quantitatively proportionate to, or even surpass, the benefits accruing to the entity conducting the scraping. This criterion is intended to prevent scenarios where a scraper derives substantial profits while leaving individuals, groups, or society with only minimal or nonexistent benefits.

This principle aims to address the inherent imbalance of power and value extraction in large-scale data harvesting. Beyond merely generating profit for a company, the substantive benefits must be publicly justified. Without this, AI development simply risks further entrenching inequality, where a small group of companies will profit from the free data of millions of people. To ensure benefits are aligned with the law, policy makers must create tangible and auditable means of validation. This means that lawmakers should require that the purported benefit from data scraping is grounded in reality and necessary and proportional to that collection of personal data. The principle forces the scraping entities to not only consider, but quantify and weigh the pros and cons of any proposal before implementation.

In balancing the ethical and practical considerations of data utilization, regulatory measures must be developed to uphold a model of reciprocity and societal impact. As the modern economy has come to depend on data collection and data analysis, a critical component of the conversation on public needs must include considerations of access. By creating an open marketplace where data is available to a myriad of different agencies — in an open source manner, for example — researchers and policy makers will be able to innovate, create novel techniques, and more reliably produce the sorts of benefits imagined by this model.
The unchecked proliferation of scraping poses a serious threat to individual privacy and societal well-being. While technological advancements and economic gains tempt us to prioritize unfettered data collection, a reckoning is overdue. A framework focused on the public interest, requiring demonstrable and proportional benefits that outweigh potential harm, offers a path forward. This necessitates not only robust data security but also a fundamental shift in how we perceive the value of personal information, acknowledging that it is not merely a resource to be mined but a fundamental aspect of individual autonomy deserving of meaningful protection. Only through such a comprehensive approach can we hope to reconcile the allure of data with the imperative of privacy in the age of AI.

Understanding and Mitigating AI Risks: A Framework for Systemic Safety and Ethical Development

Artificial intelligence is rapidly transforming our world, promising unprecedented benefits while simultaneously introducing unforeseen challenges. Ensuring its responsible development and deployment requires a clear understanding of potential pitfalls. This research delves into the multifaceted landscape of AI risks, seeking to identify and categorize the factors that can lead to systemic harm. It explores how these risks manifest across different areas, from biased decision-making to security vulnerabilities, and examines the underlying causes that drive their emergence. By providing a structured framework for analyzing AI’s potential downsides, this work aims to inform the ongoing efforts to build safer, more ethical, and beneficial AI systems.

What factors can create systemic risks?

Systemic risks in the context of AI arise from various interconnected factors that can amplify and propagate harm throughout a broader system. One crucial factor is the entity responsible for causing the risk. This could be an AI system acting autonomously, humans intentionally or unintentionally misusing AI, or external factors influencing AI behavior. Notably, a significant portion of identified AI risks stem from decisions made and actions taken by AI systems themselves, rather than direct human intervention. This highlights the potential for AI to generate risks independently, even without malicious human intent. Furthermore, whether a risk is an expected outcome of planned actions (intentional) or an unforeseen consequence (unintentional) plays a key role; intentional and unintentional risks appear in roughly equal measure, highlighting the challenge of predicting and mitigating unintended consequences while also accounting for risks designed to cause harm. The stage in the AI lifecycle when these risks manifest, whether pre-deployment during development or post-deployment after the system is in use, also shapes their systemic potential by revealing which actions are hardest to contain.

Specific characteristics of AI systems themselves contribute to systemic risks. A lack of robustness, especially when AI systems encounter unforeseen circumstances or biased or incomplete data, allows failures that can have far-reaching impacts across interconnected systems relying on them. Transparency, too, is crucial; challenges in interpreting AI decision-making processes can erode trust, impede accountability, and ultimately inhibit the ability to detect and correct errors before they escalate into systemic issues. Intertwined socioeconomic factors also fuel systemic risk. The concentration of power and resources in the hands of a few entities capable of affording sophisticated AI increases societal inequality, raising the potential for biased or manipulative AI systems. Widespread deployment can displace human workers or create exploitative labor conditions, causing instability that is hard to isolate in one part of life. Further compounding these concerns, the rapid pace of AI development can outstrip governance mechanisms and ethical standards, creating failures in regulation and oversight capable of exacerbating systemic harm.

What is the methodology of this study?

The methodology employed in this study is multi-faceted, incorporating a systematic literature search, expert consultation, and a unique “best fit” framework synthesis. The overall aim was to create a comprehensive AI Risk Repository that could serve as a common frame of reference for understanding and addressing risks associated with AI. The process began with an extensive search of academic databases, including Scopus, and various preprint servers such as arXiv and SSRN, to identify relevant research, articles, reports, and documents focused on proposing new frameworks, taxonomies, or other structured classifications of risks from AI. Pre-specified rules were used to define the studies to be included in this summary, which was facilitated by the use of active learning in ASReview for faster and more effective dual title/abstract screening.

Following the initial literature search, the researchers engaged in forward and backward citation searching, along with expert consultation, to identify additional relevant materials. This involved reviewing bibliographies of selected papers and reaching out to experts in the field for recommendations. The research team then extracted information about 777 different risks from 43 documents into a “living” database. To classify these risks, the researchers adopted a “best fit framework synthesis” approach. This involved selecting existing classification systems from the identified literature and iteratively adapting them to effectively categorize the risks in the database. Grounded theory methods were used during coding to analyze the data as presented in the original sources, without interpretation.

Taxonomy Development

The synthesis process ultimately led to the development of two distinct taxonomies: a high-level Causal Taxonomy of AI Risks, which classifies risks by their causal factors (Entity, Intentionality, and Timing), and a mid-level Domain Taxonomy of AI Risks, which categorizes risks into seven domains (Discrimination & toxicity, Privacy & security, etc.) and 23 subdomains. These taxonomies were developed iteratively, with initial frameworks being tested, refined, and expanded upon based on the data extracted from the literature. The goal was to create a unified classification system that could effectively capture the diverse perspectives on AI risks and facilitate a more coordinated approach to managing these risks.

What are the characteristics of the documents included in this study?

The study included a diverse range of documents, totaling 43, comprising 17 peer-reviewed articles, 16 preprints, 6 conference papers, and 4 other reports. The majority of identified literature was recent, with almost all documents published after 2020, reflecting the rapidly evolving landscape of AI risk research. The corresponding authors were primarily based in the USA, China, the United Kingdom, and Germany, indicating a significant concentration of research efforts in these regions. Affiliations varied, with most corresponding authors representing universities, followed by industry organizations, and a smaller number from government, international, or non-government organizations. A notable trend was the prevalence of narrative reviews or "surveys" as the most common methodology, followed by systematic and scoping reviews, suggesting an emphasis on synthesizing existing literature rather than conducting primary empirical investigation.

The included documents presented a compilation of 777 risk categories and sub-categories, demonstrating the breadth of AI risk considerations. However, two documents were excluded from later coding as they did not present distinct risk categories according to the study’s definitions. The framing of risk and AI risk varied significantly, with only a few documents explicitly defining risk. The classifications, frameworks, and taxonomies used varied terms to describe risks, including: “risks of/from AI,” “harms of AI,” “AI ethics,” “ethical issues/concerns/challenges,” “social impacts/harms,” and others, indicating a lack of standardization in terminology and scope of discussion. Considering the type of AI assessed, there was a lack of explicit definitions in most of the documents, but large language models emerged as the most frequent target of risk assessments. These characteristics of the documents highlight the heterogeneity and evolving nature of AI risk research, underscoring the need for a comprehensive repository that can accommodate and categorize the diverse perspectives and methodologies employed.

How are AI risks classified based on causal factors?

AI risks can be classified based on their underlying causal factors, providing a framework for understanding how, when, and why these risks may emerge. One such classification system, here termed the Causal Taxonomy of AI Risks, categorizes risks according to three primary dimensions: the Entity responsible for the risk, the Intent behind the action leading to the risk, and the Timing of the risk occurrence. These dimensions provide a structured approach to dissecting the origins and progression of AI-related harms. Considering these dimensions helps to differentiate for example, between risks originating from intentional malice versus unintentional errors in design, or risks that are primarily attributable to the AI versus those mainly stemming from human decisions.

The ‘Entity’ dimension identifies whether the risk is primarily caused by a human decision or action, by an AI system itself, or by some other or ambiguous cause. The ‘Intent’ dimension distinguishes between risks that arise as an expected outcome of pursuing a specific goal (intentional risks) and those that occur due to unexpected or unintended consequences (unintentional risks). A third option, ‘Other’, acknowledges that intentionality can be left unspecified in original categorizations of risk, for example when harm simply arises from environmental constraints. Meanwhile, the ‘Timing’ dimension categorizes risks based on whether they occur before the AI system is deployed (pre-deployment) or after it has been trained and put into use (post-deployment). The Timing classification also has an "Other" option, which captures risks for which a specific time of occurrence is not clearly presented, acknowledging that some descriptions of risk may span durations or contexts. By detailing when, how, and by whom AI risks may occur, this classification allows risk analysis to be more targeted and supports the development of comprehensive auditing systems.
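
One way to represent these three dimensions in code is as simple enumerations, following the categories named in the text. This is an illustrative data structure, not the repository's own schema, and the example classification at the end is hypothetical.

```python
# Illustrative encoding of the Causal Taxonomy's three dimensions.
from dataclasses import dataclass
from enum import Enum

class Entity(Enum):
    HUMAN = "human"
    AI = "ai"
    OTHER = "other"            # cause not clearly specified

class Intent(Enum):
    INTENTIONAL = "intentional"
    UNINTENTIONAL = "unintentional"
    OTHER = "other"            # intent not clearly specified

class Timing(Enum):
    PRE_DEPLOYMENT = "pre-deployment"
    POST_DEPLOYMENT = "post-deployment"
    OTHER = "other"            # timing not clearly specified

@dataclass
class CausalClassification:
    entity: Entity
    intent: Intent
    timing: Timing

# Hypothetical example: a misuse risk coded as human-caused, intentional,
# and occurring after the system is deployed.
example = CausalClassification(Entity.HUMAN, Intent.INTENTIONAL, Timing.POST_DEPLOYMENT)
```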

This approach to classifying AI risks also accounts for how certain domains of risk from AI are generally presented in research and analysis. By accounting for the “Entity”, “Intent”, and “Timing” most consistently presented in discussions of risk, one can identify a degree of consistency and coherence of the general presentation of those risks in research. Identifying which factors vary greatly when researchers present and discuss a specific type of risk can help provide insight into how the nature and origin of that risk is perceived, the type of solutions which may be called for in response, and so on.

Are there clear trends present in the data regarding Entity, Intent, and Timing as they relate to each domain of AI risk?

The analysis reveals varying trends across different AI risk domains concerning Entity (the cause of the risk: Human, AI, or Other), Intent (Intentional, Unintentional, or Other), and Timing (Pre-deployment, Post-deployment, or Other). For example, risks in the Discrimination & toxicity, Misinformation, and AI system safety, failures & limitations domains are more frequently presented as caused by AI systems rather than human actions. Specific subdomains exhibit this trait more clearly, such as the “False or misleading information” subdomain within Misinformation, where 94% of risks are attributed to AI. This contrasts sharply with subdomains like “Compromise of privacy by obtaining, leaking or correctly inferring sensitive information,” where AI is identified as the causal entity in only 62% of instances, indicating less consensus regarding the source of privacy risks. In contrast, Humans are presented as the primary cause for risks related to the Malicious actors & misuse domain, suggesting a perception that these risks stem from deliberate human actions rather than the AI’s inherent behavior. While some risks are attributed overwhelmingly to human decisions (e.g., “Power centralization and unfair distribution of benefits”), others receive mixed attribution, signifying divergent perspectives on accountability. These differences highlight varying perceptions and framings of AI risks depending on what aspect of harm people are concerned about.

Regarding Intent, the Malicious actors & misuse domain overwhelmingly associates risks with intentionality, while subdomains such as "Unfair discrimination and misrepresentation" and "Unequal performance across groups" attribute risks mainly to unintentional behavior. This divergence signifies varied understandings of how harm manifests: either as a deliberate outcome in domains like misuse or as an unintended consequence in domains like discrimination. Areas such as "Power centralization and unfair distribution of benefits" and "Governance failure," by contrast, receive ambiguous or mixed classifications of intent, reflecting a recognition that these risks may involve deliberate decisions about structure and governance, unintended accidents as systems and policies are put in place, or no clear intent at all, with ambiguous risks sometimes emerging spontaneously and without defined direction.

Most risks, across most domains, are presented as occurring post-deployment. However, a few subdomains show ambiguity or roughly equal measures of pre- and post-deployment concern. These timing ambiguities imply a complex interplay of risks throughout a system's lifecycle, not necessarily as isolated pre- or post-deployment events but as continuing occurrences with causes traceable to multiple stages of action and development. This detailed dissection exposes a nuanced depiction of the AI risk landscape, offering a refined comprehension essential for devising targeted mitigation strategies and policies across varied AI domains and contexts.

What are the primary domains of AI risk?

The risks associated with Artificial Intelligence (AI) are complex and multifaceted, spanning several key domains. This paper synthesizes a Domain Taxonomy of AI Risks, grouping these risks into seven primary domains to provide a comprehensive overview. These domains are discrimination and toxicity, covering issues of unfair bias and harmful content; privacy and security, addressing data breaches and system vulnerabilities; misinformation, focusing on the spread of false or misleading information; malicious actors and misuse, highlighting the potential for AI in cyberattacks and manipulation; human-computer interaction, exploring overreliance and loss of agency; socioeconomic and environmental harms, examining inequality and ecological impact; and AI system safety, failures, and limitations, including issues of goal misalignment and lack of robustness. Each domain offers a specific lens through which to understand and address the various ways in which AI can pose risks to individuals, society, and the environment.

Each primary domain is further divided into subdomains to provide more granular understanding of specific risks. For example, the discrimination and toxicity domain includes subdomains for unfair discrimination and misrepresentation, exposure to toxic content, and unequal performance across groups. Similarly, the privacy and security domain is broken down into compromise of privacy by obtaining, leaking, or correctly inferring sensitive information, and AI system security vulnerabilities and attacks. This detailed categorization allows for a more targeted approach to risk assessment and mitigation strategies. The prevalence of these domains in existing literature varies, with AI system safety, failures, & limitations, socioeconomic & environmental harms, and discrimination & toxicity being the most frequently discussed, suggesting these areas are of particular concern to researchers and practitioners.
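
A partial sketch of this structure as a simple mapping may help make the hierarchy concrete. Only the subdomains explicitly named in this summary are listed; the full taxonomy contains 23 subdomains across the seven domains.

```python
# Partial, illustrative view of the Domain Taxonomy (domains -> example subdomains).
# Domains with an empty list have subdomains not enumerated in this summary.
DOMAIN_TAXONOMY = {
    "Discrimination & toxicity": [
        "Unfair discrimination and misrepresentation",
        "Exposure to toxic content",
        "Unequal performance across groups",
    ],
    "Privacy & security": [
        "Compromise of privacy by obtaining, leaking or correctly inferring sensitive information",
        "AI system security vulnerabilities and attacks",
    ],
    "Misinformation": [
        "False or misleading information",
    ],
    "Malicious actors & misuse": [],
    "Human-computer interaction": [],
    "Socioeconomic & environmental harms": [],
    "AI system safety, failures & limitations": [],
}
```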

Most and least discussed areas

While many domains are well-explored, some remain relatively underexamined. AI welfare and rights, pollution of the information ecosystem and loss of consensus reality, and competitive dynamics all receive less focus in the current literature. This disparity highlights potential gaps in AI risk research, indicating areas that may require further attention and investigation. By understanding the scope and prevalence of risks within each domain, stakeholders can better prioritize their efforts and develop more effective strategies for mitigating the potential harms associated with AI development and deployment.

What implications does the AI Risk Repository have for key audiences such as policymakers?

The AI Risk Repository offers several specific benefits for policymakers navigating the complex landscape of AI regulation. Firstly, it provides a concrete foundation for operationalizing the frequently cited, yet often vaguely defined, terms “harm” and “risk” that appear in AI regulatory frameworks. By offering a detailed catalog of potential risks, the repository enables the development of clear, measurable compliance metrics. These metrics can then be used to effectively monitor adherence to established standards, ensuring that AI systems are developed and deployed responsibly. In essence, the repository brings clarity and specificity to regulatory language, facilitating more effective enforcement and risk mitigation.

Secondly, the AI Risk Repository fosters international collaboration by providing a common language and shared criteria for discussing AI risks. This is particularly important as AI technologies transcend national borders, requiring coordinated regulatory approaches. Bodies such as the EU-US Trade and Technology Council, which are working to develop shared repositories of metrics and methodologies for assessing AI trustworthiness, can leverage the AI Risk Repository to promote interoperability between regulatory frameworks. By harmonizing terminology and providing a unified classification system for AI risks, the repository facilitates the development of global standards that promote responsible AI innovation worldwide. Moreover, the AI Risk Repository offers a comprehensive, up-to-date database of AI risks that assists policymakers in effectively prioritizing resources, tracking emergent risk trends, and creating targeted training programs to address key vulnerabilities within the AI ecosystem.

What are the limitations of this study?

This study, while providing a valuable synthesis of AI risk frameworks, acknowledges certain limitations that should be considered when interpreting its findings. Firstly, the comprehensiveness of the AI Risk Repository relies heavily on the availability and quality of the documented taxonomies and classifications. The exclusion of domain-specific (e.g., healthcare) or location-specific (e.g., a particular country) frameworks limits the generalizability of the findings to a broader context. Additionally, the reliance on extraction and coding by single expert reviewers introduces the potential for subjective biases and interpretation errors, impacting the neutrality of the assembled classifications. Although efforts were made to capture risks as presented by the original authors and coders, ambiguities in the source materials may have led to unintentional misinterpretations or omissions, possibly influencing the final content of the repository.

Moreover, the AI Risk Repository is conceived as a foundation for general use, trading accuracy for clarity, simplicity, and exhaustiveness. As such, it may not be perfectly suited for all specific contexts or use cases, such as technical risk evaluations that require more nuanced analyses or granular categorizations. The binary labeling of pre- versus post-deployment risks could benefit from a more elaborate representation involving several distinct stages, covering the progression of AI systems from design to long-term application. Furthermore, the risk analysis lacks quantitative dimensions, failing to capture important aspects, such as the impact and likelihood of risks, as well as interdependencies across different risks and the crucial distinction between instrumental and terminal risks. These omissions limit its utility for prioritization or balancing risk mitigation with benefit maximization.

Going forward, future research can improve upon this work by refining the consistency, specificity, and coherence of the definitions used for AI risks, potentially integrating semantic tools, such as ontologies, to enable a more shared understanding. More attention could be given to risk areas that are relatively unexplored by the literature, such as AI agents beyond language models, pre-deployment risks caused by humans, and AI rights and welfare considerations. Future iterations might incorporate variables such as threat vectors (bio, cyber), AI classification (agentic, generative), open source or closed, organizational structures or types (big tech or startups), and types of harm, such as economic loss or danger to human life. By acknowledging and addressing these areas for improvement, future work can continue to foster the construction of a coordinated approach to defining, auditing, and managing the varied risks presented by AI systems.
Ultimately, this work illuminates the multifaceted nature of AI risk, moving beyond simplistic notions to reveal the complex interplay of causal factors like the responsible entity, the intent behind actions, and the timing of risk manifestation. The identified trends, or lack thereof, across different AI domains underscore the evolving and often inconsistent ways we perceive and frame these risks. By providing a structured yet adaptable framework, this repository empowers stakeholders to move towards a more unified understanding, fostering clearer communication, and, crucially, enabling the development of targeted strategies to proactively mitigate the potential harms of AI. This represents a crucial step towards responsibly navigating the promises and perils inherent in this rapidly advancing technology.

Categories
Guide

Generative AI: Remaking the Knowledge Ecosystem or Undermining Creativity?

A profound shift is underway, threatening the very foundations of how knowledge is created and shared. Generative artificial intelligence, unlike any technology before it, is producing outputs that directly compete with original human endeavors. This raises critical questions about the future of creative work, as algorithms learn from and replicate existing content at an unprecedented scale, potentially marginalizing the contributions of authors, artists, and journalists. At stake is the delicate balance that sustains a vibrant and diverse information landscape, and whether we can safeguard the incentives that drive human creativity in the age of intelligent machines.

How is generative AI impacting the existing knowledge ecosystem?

Generative AI is fundamentally reshaping the knowledge ecosystem, challenging the traditional roles and rights within it. Unlike past technological advancements, generative AI competes directly with human creators by producing content that substitutes for original works, potentially displacing authors, artists, and journalists in the marketplace. This raises serious concerns about the sustainability of the creative sector since most AI firms are not compensating creative workers for the use of their content. The reliance on copyrighted material for training these models, without proper consent or compensation, has created a copyright crisis that threatens to undermine incentives for the ongoing creation of original works, particularly fiction and nonfiction, leading to a destabilized knowledge ecosystem. This situation is further complicated by the scale and opacity of AI systems, which erode authors’ proprietary control over their creations, leading to what some describe as an “existential crisis” for creatives.

The impact of generative AI extends beyond mere replication; it also involves the wholesale appropriation of expressive dimensions of creative works. Unlike search engines or past technologies that primarily dealt with the non-expressive aspects of data, generative AI uses copyrighted works as training data to produce substitutes for those original works. This substitution effect reduces the marginal value of creative labor, potentially pushing it below subsistence levels. Major platforms like Alphabet, X, and Meta are heavily invested in AI development, further intensifying the challenges faced by human creators. Without a mechanism for fair compensation, or even the ability to opt out of having their work used, creatives face the risk of having their work and style replicated and exploited for commercial gain, leading to a decline in the quality and originality of training data available for future AI models.

Model Collapse

One of the most serious concerns is the potential for “model collapse,” a degenerative process in which AI models gradually lose touch with the true underlying data distribution because they learn from content generated by other models. This is particularly problematic when human creators are not adequately compensated: AI then comes to rely increasingly on its own outputs, and each generation’s selection and arrangement of key points skews the next. This degradation of data quality threatens the long-term viability of AI itself, potentially leading to homogenization and a decline in the richness and diversity of the information landscape. LLMs also face challenges because they are language models, not knowledge models. Without human interaction with the world, LLMs trained increasingly on earlier LLM outputs may become, after sufficient iterations, unrecognizably blurred and distorted. Intervention and new frameworks are therefore required to avoid the unfair and self-defeating outcome of AI copies crowding out human works. The sketch below illustrates this dynamic in miniature.
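The following is a minimal, hedged illustration of the model-collapse dynamic, not anything described in the paper itself. It assumes a deliberately simple “model” (a fitted Gaussian) so the effect is visible in a few lines: each generation is trained only on samples from the previous generation’s model, and the fitted distribution drifts away from the original human data.

```python
# Minimal sketch of "model collapse": each generation fits a simple Gaussian
# "model" to samples drawn from the previous generation's model rather than
# from the original human-generated data. Over iterations the fit drifts and
# its variance shrinks, mirroring the blurring and homogenization described
# above. The Gaussian stand-in is an illustrative assumption, not the authors' method.
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data drawn from the true distribution.
true_mean, true_std = 0.0, 1.0
data = rng.normal(true_mean, true_std, size=500)

for generation in range(1, 11):
    # Fit the current generation's "model" to the available data.
    fitted_mean, fitted_std = data.mean(), data.std()
    print(f"gen {generation:2d}: mean={fitted_mean:+.3f}, std={fitted_std:.3f}")
    # The next generation trains only on synthetic output of this model.
    data = rng.normal(fitted_mean, fitted_std, size=500)
```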

What is the proposed response to the challenges presented by generative AI?

The challenges presented by generative AI necessitate a multifaceted approach that balances the interests of copyright owners and AI developers. A proposed response involves coupling two key mechanisms: an opt-out right for copyright owners and a compensation levy on AI providers. This combination aims to address the concerns of creatives who fear that AI is unfairly exploiting their work and undermining their livelihoods, while also ensuring the long-term viability of AI by fostering a sustainable knowledge ecosystem. The goal is to forge a new grand bargain between copyright owners and AI firms, fostering a more equitable and sustainable future for both AI and the human creativity upon which it depends.

The opt-out mechanism would empower copyright owners to regain proprietary control over their works by allowing them to prevent the non-consensual use of those works for training AI models. After documenting copyright infringement, creatives could submit requests to AI providers to remove their works from training datasets. This streamlined process is modeled after the Digital Millennium Copyright Act (DMCA) “notice and action” procedures but adapted for the unique challenges presented by AI. Furthermore, AI providers would be obligated to actively take steps to prevent their systems from generating outputs that are identical or substantially similar to the relevant copyrighted works. If a copyright owner identifies that an output generated by an AI provider’s system contains a verbatim or substantially similar copy of their work, or a derivative work, the owner must provide documentation of ownership and of the infringement, and the AI provider must respond effectively.

For copyright owners who do not choose to opt out or license their work to AI providers, a compensation levy would ensure they receive a fair share of the economic value generated by AI systems. This levy, imposed on AI providers by a central authority, would be distributed to copyright owners whose work is used without a license. Determining the proper level of these levies requires careful consideration of various factors, including the economic value of copyrighted works used in AI training, the potential market substitution of human creativity by AI, and the dangers of “model collapse,” where AI systems degrade due to over-reliance on AI-generated content. Precedents in other creative industries, along with ongoing debates about valuation in transport, communications, and other infrastructure, could inform a system of levies and standards administered by pricing experts.

What are the fundamental principles that underpin copyright and consent within the knowledge ecosystem?

Copyright law forms a cornerstone of the knowledge ecosystem, incentivizing the creation and dissemination of original works. At its core, copyright grants authors exclusive rights to control the reproduction, distribution, and public performance of their creations. This exclusivity serves as a legal foundation for authors’ proprietary control, allowing them to protect their work from unauthorized use and to grant permissions for access and utilization, often in exchange for financial compensation such as royalties. Moreover, copyright law incentivizes intermediaries like publishers, performers, and broadcasting organizations to disseminate these works to the public, further enriching the knowledge ecosystem. The ability of authors to grant consent for the use of their works, coupled with the expectation of fair compensation, is thus central to a healthy and sustainable environment for creativity and knowledge creation. This ecosystem is however challenged by the opacity and scale of modern AI training techniques.

Generative artificial intelligence (AI) systems are disrupting this established knowledge ecosystem by significantly eroding authors’ proprietary control over their works, extending beyond previous digital challenges. Unlike past technologies that primarily focused on non-expressive aspects of works, AI often targets the expressive dimensions. While search engines direct users to original content and works themselves, AI tends to provide substitutions without adequate citation or acknowledgment of the sources that trained it. Many AI firms are using vast volumes of copyrighted works without direct authorization, bypassing human control in ways that raise difficult legal and ethical questions. This disregard for copyright and consent threatens to destabilize the knowledge ecosystem by devaluing human creativity, creating unfair competition, and potentially undermining the very incentives necessary for the ongoing creation of the datasets and works upon which further AI development and evolution depends. The fundamental principles, therefore—copyright protection, authorial consent, and fair compensation—are all vital for both human creators and the long-term viability of AI.

What are the core distinctions between historical web-scraping and AI’s current practices?

The emergence of generative AI poses a unique challenge to copyright law, markedly different from previous instances of technological disruption. While web-scraping has been a common practice for years, the focus has generally been on extracting the non-expressive aspects of works, such as factual data, or creating search tools that direct users to the original source. In contrast, AI’s current practices frequently target the expressive dimensions of copyrighted material. This means AI models are trained on, and generate outputs that closely resemble or directly replicate, existing creative works. Unlike traditional search engines that provide links, AI often offers substitutes for these works, frequently failing to acknowledge the sources or provide citations to the original creators. This shift from data extraction to expressive mimicry fundamentally alters the relationship between technology, copyright, and creative output.

The Erosion of Proprietary Control

Furthermore, the scale and opacity of AI systems exacerbate the erosion of authors’ proprietary control. Whereas past scraping practices, even at scale, tended to respect the ‘robots.txt’ convention that permitted content creators to restrict access by web crawlers, many AI firms have disregarded this convention in pursuit of high-quality datasets. Massive collections of copyrighted works, such as the Books3 dataset, have been used to train AI systems without obtaining explicit authorization from the authors. This disregard for established norms and copyright protections stands in stark contrast to the historical uses of web-scraping, which generally focused on information retrieval and aggregation rather than the direct replication or substitution of creative works. The ease and scale at which AI can copy and repurpose copyrighted content, coupled with the secrecy surrounding training practices, has led to understandable anger and frustration among creators fearing an existential crisis.

How and why does the free appropriation of copyrighted work by AI providers endanger the future of AI?

The unchecked appropriation of copyrighted work by AI providers fundamentally destabilizes the knowledge ecosystem upon which AI’s very existence depends. Myriad texts and images inform the models powering apps, highlighting AI’s parasitic relationship to training data. While parasitic relationships can exist in stable equilibria, AI threatens to overwhelm its host. The free appropriation of copyrighted work by AI providers not only devalues human creativity but also eliminates critical incentives for the ongoing creation of works necessary for further technological development. This creates a self-defeating cycle wherein the very source material that fuels AI’s progress is diminished due to a lack of proper acknowledgment and compensation.

A policy of free appropriation of copyrighted work may menace AI development itself. It is not sustainable to expect training data to persist as a renewable resource when it is being mined, without compensation, in part to create substitutes for itself. Scholars have identified the danger of LLMs “learning from data produced by other models,” a possibility that is more likely the less humans are compensated for their work. This pathological outcome, dubbed “model collapse,” is a degenerative process whereby models forget the true underlying data distribution. Because LLMs are language models, not knowledge models, and have no ability to independently reason about or reflect on the original intent and meaning behind the texts and images they process, LLMs increasingly based on earlier LLM output may become, after sufficient iterations, like faded analog copies — almost unrecognizably blurred and distorted over time.

What is the proposed opt-out mechanism and how would it work?

The authors propose a streamlined opt-out mechanism that enables copyright owners to reclaim proprietary control over their works concerning the training of AI models. The mechanism would require AI providers to remove objectors’ works from their databases once copyright infringement has been documented. This mechanism empowers authors to effectively prevent AI systems from generating outputs that appear identical or substantially similar to their copyrighted works through a “notice and action” procedure aimed at AI providers. Through this mechanism, copyright owners can submit requests to AI providers for the removal of their works from datasets that are used to train relevant AI systems and to prevent future similar uses.

Under this proposed mechanism, copyright owners would be entitled to send an initial notice to an AI provider upon identifying that an output generated by the provider’s AI system contains either a verbatim or substantially similar copy of their work or a derivative work. The owner’s notice would target AI-generated content that resembles or adapts the copyrighted work and thereby infringes the author’s rights. Importantly, the copyright owner would be obliged to document both the unauthorized reproduction of the work and the owner’s claim of copyright ownership. Upon receiving the notice, the AI provider must take action to prevent such infringing content from being generated by its system again. If necessary, the provider can remove the work from its datasets, deploy output filtering, or initiate a “machine unlearning” process. After completing the process, the provider must inform the author of the actions taken and provide an adequate explanation of their effects. A hypothetical sketch of this workflow appears below.
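To illustrate the sequence of obligations, here is a hypothetical Python sketch of a provider-side notice-and-action handler. All class names, fields, and steps are assumptions made for illustration; the proposal specifies the obligations, not any particular data format or API.

```python
# Hypothetical sketch of the "notice and action" workflow described above.
# Names and fields are illustrative assumptions, not a standard or proposed API.
from dataclasses import dataclass, field
from enum import Enum, auto


class Action(Enum):
    REMOVE_FROM_TRAINING_DATA = auto()
    OUTPUT_FILTERING = auto()
    MACHINE_UNLEARNING = auto()


@dataclass
class OptOutNotice:
    owner: str
    work_title: str
    proof_of_ownership: str          # e.g. a registration number or deposit copy
    infringing_output_sample: str    # the AI output alleged to copy the work


@dataclass
class ProviderResponse:
    actions_taken: list = field(default_factory=list)
    explanation: str = ""


def handle_notice(notice: OptOutNotice) -> ProviderResponse:
    """Sketch of the provider's obligations after a documented notice."""
    response = ProviderResponse()
    # 1. Remove the work from datasets used to train the relevant systems.
    response.actions_taken.append(Action.REMOVE_FROM_TRAINING_DATA)
    # 2. Prevent future identical or substantially similar outputs, e.g. via
    #    filtering or unlearning where removal alone is insufficient.
    response.actions_taken.append(Action.OUTPUT_FILTERING)
    # 3. Inform the author of the actions taken and their expected effects.
    response.explanation = (
        f"Removed '{notice.work_title}' from training datasets and enabled "
        "output filtering against substantially similar generations."
    )
    return response


# Example use with placeholder values.
notice = OptOutNotice(
    owner="Jane Author",
    work_title="Example Novel",
    proof_of_ownership="Copyright registration (placeholder)",
    infringing_output_sample="<substantially similar passage>",
)
print(handle_notice(notice).explanation)
```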

Normative Rationale

This mechanism effectively empowers authors to opt out of AI systems that generate content infringing their copyrights, enhancing control over their works and providing protection amid the surge in copyright infringement facilitated by AI systems. Additionally, the mechanism is designed to serve an information-forcing function, giving copyright owners the ability to address infringing activities and compelling AI providers to disclose information about how their works are used in model training and content generation. Overall, the mechanism offers authors more affordable, readily available resolutions in place of complex and often expensive litigation.

What justifications support the opt-out mechanism?

The proposed opt-out mechanism aims to empower authors to reclaim proprietary control of their works. The surge in copyright infringement facilitated by AI systems necessitates methods to enhance authors’ control and safeguard their interests. This mechanism, grounded in established “notice and action” procedures, allows copyright owners to request AI providers to remove their works from datasets and prevent the generation of infringing content. It serves to effectively deter substantial copying of copyrighted works by AI systems. Granting authors this right provides a means to protect their creative property in a sphere that increasingly relies on automated processes.

Drawing on Existing Copyright Frameworks

The rationale for this mechanism rests on several normative arguments. It draws heavily from existing copyright practices. Online intermediaries have been incentivized by DMCA safe harbors to adopt notice and takedown procedures, offering authors swift ways to address copyright infringements. The opt-out mechanism extends this logic to AI, granting authors a tool to manage the use of their work in these systems. The key difference is that the infringing content is generated by the AI firm itself rather than merely hosted by it. The opt-out mechanism also counters AI firms’ opacity by compelling them to disclose information on how copyrighted materials are used in model training and content generation. To be clear, the opt-out mechanism is not designed to undermine the AI firm’s fair use privileges.

The opt-out mechanism also addresses the problem that existing opt-out systems can be complex: some AI providers offer opt-outs so convoluted that they are effectively useless. More importantly, this approach creates an efficient alternative to the judicial process for dispute resolution. Copyright litigation is expensive, rendering it infeasible for many creators. The opt-out mechanism encourages conversation between the author and the AI provider, preventing potentially lengthy and costly disputes from arising. This system balances copyright protection with technological innovation.

What are pertinent licensing precedents for compensation within the creative industries?

The question of compensation for the use and production of works via AI is controversial, hinging on what levels constitute fair compensation for the copyright-protected inputs used to train AI models. The U.S. government has set prices for certain uses of music, and even more complex, higher-stakes economic arrangements are subject to multiple forms of administered pricing. Drawing from examples such as the blanket licenses administered by ASCAP, legislatures can work out appropriate compensation schemes and delegate their crafting to expert administrators. Similarly, the valuation of utilities such as transport and telecommunications infrastructure provides well-established procedures for calculation.

Precedents for Calculating Compensation

Looking to methods of valuation, levying AI providers that use copyrighted works is one approach to generating funds for compensation, as seen in the Audio Home Recording Act (AHRA). The AHRA imposed a levy on sales of recording devices and media, anticipating their use in unauthorized copying; much like the fears around AI today, these devices were seen as a threat to creative individuals and copyright owners. Not only sales but also the importation and distribution of devices were subject to the levy. This model demonstrates the potential for a tax on devices as a method of generating revenue to compensate creators. Given the complexity of the AI supply chain, policymakers will need to consider which elements should generate revenue. Levies might be imposed on the use of particular datasets, on model training, on the aggregate number of responses provided to users, or on paid subscriptions, and the levy level could be benchmarked to some percentage of AI providers’ expenditures or revenues.

The revenue-based model may raise copyright owners’ concerns about the adequacy of compensation, which would be valid if works were compulsorily licensed. A levy system coupled with an opt-out opportunity instead enables a “soft compulsion”: copyright holders can forgo participation if unsatisfied. AI providers that have fully licensed their content should not be required to pay a levy, and providers for whom a significant percentage of usage involves licensed works should be able to discount their levy obligations accordingly. Such discounting would require transparent accounting for the use of and payment for works, which may create spillover benefits by enabling external scrutiny.

What are the justifications for compensating copyright owners?

There are multifaceted justifications for compensating copyright owners when their works are used in the context of Generative AI. From a basic fairness perspective, the knowledge ecosystem relies on copyright law to incentivize authors by granting them exclusive rights, which allow them to control reproduction, distribution, and public performance of their works. Copyright law also encourages intermediaries to disseminate these works. However, the opacity and scale of AI systems disrupt this ecosystem by eroding authors’ control. Many AI firms utilize copyrighted works as training data without consent, jeopardizing the livelihoods of creatives and threatening the viability of the very knowledge ecosystem that AI depends upon. Compensation may be seen as a means of correcting this imbalance, ensuring creators receive a fair share of the economic benefits derived from their contributions to AI systems. Further, the need for human authorship in the creation of copyrightable works may be used to justify some level of wealth transfer away from entities developing AI, as may a growing power asymmetry between capital and labor.

Unjust enrichment provides another rationale for compensation. This equitable principle holds that it is unfair for one party to retain a benefit obtained at the expense of another. In the context of AI, firms may be unjustly enriched by leveraging copyrighted works without appropriate remuneration; avoiding windfalls attributable to another’s property or services is part of that rationale. There are strong ethical arguments for requiring firms to compensate those whose works have been used without consent. Requiring some form of wealth transfer away from the AI firms expropriating copyrighted works aligns such development with the public interest, promoting fair competition and reducing the likelihood that a new form of piracy will deter critical investments in the affected industries.

Several other real-world observations on the consequences of failing to compensate also provide justification. Without fair compensation, low-cost automated content may overwhelm human-created works, even where the latter have demonstrable societal value. Safeguards can and should be put in place now to promote human-centric creative endeavors and ensure the long-term production of knowledge for future generations. The prospect of AI-generated works overwhelming human-created works absent some legal rebalancing of rights and interests thus provides yet another rationale for compensating those works’ owners.

What methods of assessing the appropriate compensation level for copyright owners are proposed?

This section explores methods for assessing compensation to copyright owners for the use and production of works via AI. This involves addressing both the “why” of compensation, rooted in varied normative perspectives that justify payments to copyright holders whose work underpins AI, and the “how” of compensation, referencing historical precedents involving fixed payments or proportional revenue sharing for copyright owners. The ethical dilemma centers on a conflict between labor and capital, where AI’s potential to enrich capital at the expense of labor could accelerate wealth redistribution. Copyright doctrine favors human creation, thus providing a justification for human-centric compensation.

One proposed method for assessing compensation is inspired by examples of administered pricing throughout communications and infrastructure, but focuses on the unique context of copyright. A levy imposed on AI providers using copyrighted works to train their models could provide funds for compensating affected copyright owners. The AHRA offers a precedent, which levied sales of recording devices and media to address uncompensated copying of copyrighted work. While a per-device cost may not be feasible, levies on datasets or model training, calculated as a percentage of AI provider expenditures or revenues, represent other potential payment triggers. Based on a tripartite division of inputs, policymakers may conclude that training data is worth as much as the talent and computing equipment.

Drawing on industry-specific valuation methods, another approach would be premised on the revenue generated by firms providing AI. For example, legislators could mandate that a for-profit firm with $10 billion in revenue allocate 5% to a levy for copyright holders not engaged in alternative licensing arrangements. This kind of plan has precedent and would not be an open-ended demand on AI companies; a model can be borrowed from another area, such as online advertising, as in the recent estimate that Google and Facebook would owe $11.9 billion to content creators annually. It is also advisable to calibrate such a system to the uses and purposes of AI, distinguishing, say, a non-profit using data for purely research purposes from a news service producing composite articles. An entity leveraging AI in ways that demonstrably undercut the revenue streams of content creators should be held accountable in order to ensure the long-term production of knowledge. Although this strategy may create record-keeping burdens for copyright owners and providers, it would also allow for greater transparency in datasets and algorithmic governance, helping to address the social harms at stake. A hypothetical worked example of the levy arithmetic follows.
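As a worked illustration of the arithmetic, the sketch below applies the 5% levy to $10 billion in revenue and adds a hypothetical discount for already-licensed usage, reflecting the earlier suggestion that significant licensed use should reduce levy obligations. The function name and the discount formula are assumptions, not part of the proposal.

```python
# Hypothetical worked example of a revenue-based levy. The 5% rate and
# $10 billion revenue figure come from the text; the licensed-usage discount
# is an illustrative assumption.
def annual_levy(revenue: float, levy_rate: float, licensed_share: float) -> float:
    """Levy owed = revenue * rate, discounted by the share of usage
    already covered by voluntary licenses (0.0 to 1.0)."""
    return revenue * levy_rate * (1.0 - licensed_share)


# Provider with $10B revenue, a 5% levy, and no licensing deals: $500M owed.
print(annual_levy(10_000_000_000, 0.05, licensed_share=0.0))
# The same provider having licensed 40% of its training content: $300M owed.
print(annual_levy(10_000_000_000, 0.05, licensed_share=0.4))
```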

What objections might arise regarding this proposal?

Several potential objections could be raised against the proposed opt-out mechanism and levy structure for AI providers. One common concern is that such regulations might stifle innovation and hinder the development of beneficial AI technologies. It is argued that the added costs and administrative burdens associated with obtaining consent and providing compensation could disincentivize investment and slow down research and development. Critics might suggest that these measures are unduly favorable to copyright owners and excessively costly for AI companies.

Another objection often raised is the potential for a substantial number of copyright owners to withhold their works from AI training datasets, seeking higher payments than those obtained through a levy distribution. This could lead to a scarcity of high-quality training data, severely impeding the progress of AI. Furthermore, it is argued that smaller AI providers, particularly those focused on non-commercial or niche applications, might be disproportionately disadvantaged by the levy system, potentially skewing the AI landscape towards larger, more established players.

Additional Concerns

Concerns may also be raised over the long-term effects of the proposals, questioning whether the potential benefits of the mechanism outweigh the potential negative consequences on creativity and technology. Would limiting access to copyrighted work create a more equitable or accessible environment for innovation in the future? What will be the impact on training models if specific news outlets block AI data collection, intentionally or not? How well can the incentives put in place prevent harms like online bullying or fraudulent transactions? These and other such concerns will need to be considered and addressed in order to craft the most effective legislative solutions.

Response to Objections

Some might see our plan for an opt-out method and a levy as excessively favorable to copyright holders or too expensive for AI businesses. It has been claimed that compensating authors for the use of their work will halt AI development. However, it seems unlikely that a moderate annual fee would significantly affect the finances of the major corporations supporting much of today’s AI advancement. Additionally, voluntary licensing offers AI developers a “way out” from the levy we advise. Since OpenAI has previously engaged in licensing agreements with significant content providers, it is unlikely that ensuring creators are compensated will impede AI research any more than numerous other online policy changes have hindered the Internet.

Another worry is that a large group of copyright holders may take their work off the market in order to demand higher fees than they would receive from a levy; if enough do, it could seriously hamper future AI development. There are several ways to respond to this concern. While scholarly work on the interpretation of current copyright law has been dominated by a trade-off analysis of incentives versus access, the construction of future laws can and should be guided by a more sophisticated and inclusive set of policy goals, including industrial policy. Much depends on the proportion of opt-outs relative to holdings as a whole, the importance of the missing works to advances in training generative AI, and the social value of AI generally.

Furthermore, some copyright holders may have moral or other nonmonetary objections to the use of their work by certain companies. This does not indicate an outright rejection of AI; rather, holdouts may want to grant a commercial advantage to businesses more aligned with their own moral principles, or to aid small competitors of today’s AI behemoths. In many cases, this would be an entirely legitimate reason to exercise opt-out rights. It is exceedingly difficult even for those within the AI realm to forecast the medium- and long-term consequences of the changes in the relative costs of data that our proposal would likely bring. Uncertainty here commends a principle-centered, rather than results-centered, approach, with legislators continually re-evaluating effects.
The unchecked growth of generative AI presents a pivotal moment for the knowledge ecosystem. Failing to address the fundamental issues of copyright and fair compensation risks undermining the very foundations upon which future innovation rests. By implementing mechanisms for creators to control their work and receive equitable remuneration, we can steer AI development towards a more sustainable path, enabling the technology to flourish while safeguarding the vital contributions of human creativity. The ultimate goal is to cultivate a mutually beneficial relationship where AI and human ingenuity coexist and enrich one another, ensuring a vibrant and diverse future for knowledge creation.