A profound shift is underway, threatening the very foundations of how knowledge is created and shared. Generative artificial intelligence, unlike any technology before it, is producing outputs that directly compete with original human endeavors. This raises critical questions about the future of creative work, as algorithms learn from and replicate existing content at an unprecedented scale, potentially marginalizing the contributions of authors, artists, and journalists. At stake is the delicate balance that sustains a vibrant and diverse information landscape, and whether we can safeguard the incentives that drive human creativity in the age of intelligent machines.
How is generative AI impacting the existing knowledge ecosystem
Generative AI is fundamentally reshaping the knowledge ecosystem, challenging the traditional roles and rights within it. Unlike past technological advancements, generative AI competes directly with human creators by producing content that substitutes for original works, potentially displacing authors, artists, and journalists in the marketplace. This raises serious concerns about the sustainability of the creative sector since most AI firms are not compensating creative workers for the use of their content. The reliance on copyrighted material for training these models, without proper consent or compensation, has created a copyright crisis that threatens to undermine incentives for the ongoing creation of original works, particularly fiction and nonfiction, leading to a destabilized knowledge ecosystem. This situation is further complicated by the scale and opacity of AI systems, which erode authors’ proprietary control over their creations, leading to what some describe as an “existential crisis” for creatives.
The impact of generative AI extends beyond mere replication; it also involves the wholesale appropriation of expressive dimensions of creative works. Unlike search engines or past technologies that primarily dealt with the non-expressive aspects of data, generative AI uses copyrighted works as training data to produce substitutes for those original works. This substitution effect reduces the marginal value of creative labor, potentially pushing it below subsistence levels. Major platforms like Alphabet, X, and Meta are heavily invested in AI development, further intensifying the challenges faced by human creators. Without a mechanism for fair compensation, or even the ability to opt out of having their work used, creatives face the risk of having their work and style replicated and exploited for commercial gain, leading to a decline in the quality and originality of training data available for future AI models.
Model Collapse
One of the most serious concerns is the potential for “model collapse,” a degenerative process in which AI models gradually lose touch with the true underlying data distribution because they learn from content generated by other models. This is particularly problematic when human creators are not adequately compensated, leading AI to rely increasingly on its own outputs, with results skewed by each generation’s selection and arrangement of key points. This degradation of data quality threatens the long-term viability of AI itself, potentially leading to homogenization and a decline in the richness and diversity of the information landscape. AI also faces challenges arising from its nature as language models rather than knowledge models. Without human interaction with the world, LLMs become increasingly based on earlier LLM outputs, which may become, after sufficient iterations, unrecognizably blurred and distorted. Thus, intervention and new frameworks are required to avoid the unfair and self-defeating outcome of AI copies becoming prevalent over human works.
What is the proposed response to the challenges presented by generative AI
The challenges presented by generative AI necessitate a multifaceted approach that balances the interests of copyright owners and AI developers. A proposed response involves coupling two key mechanisms: an opt-out right for copyright owners and a compensation levy on AI providers. This combination aims to address the concerns of creatives who fear that AI is unfairly exploiting their work and undermining their livelihoods, while also ensuring the long-term viability of AI by fostering a sustainable knowledge ecosystem. The goal is to forge a new grand bargain between copyright owners and AI firms, fostering a more equitable and sustainable future for both AI and the human creativity upon which it depends.
The opt-out mechanism would empower copyright owners to regain proprietary control over their works by allowing them to prevent the non-consensual use of those works for training AI models. After documenting copyright infringement, creatives could submit requests to AI providers to remove their works from training datasets. This streamlined process is modeled on the Digital Millennium Copyright Act (DMCA) “notice and action” procedures but adapted to the unique challenges presented by AI. Furthermore, AI providers would be obligated to take active steps to prevent their systems from generating outputs that are identical or substantially similar to the relevant copyrighted works. If a copyright owner identifies an output generated by an AI provider’s system that contains a verbatim or substantially similar copy of his or her work, or a derivative work, the owner must provide documentation of ownership and of the infringement, and the AI provider must respond effectively.
For copyright owners who do not choose to opt out or license their work to AI providers, a compensation levy would ensure they receive a fair share of the economic value generated by AI systems. This levy, imposed on AI providers by a central authority, would be distributed to copyright owners whose work is used without a license. Determining the proper level of these levies requires careful consideration of various factors, including the economic value of copyrighted works used in AI training, the potential market substitution of human creativity by AI, and the dangers of “model collapse,” where AI systems degrade due to over-reliance on AI-generated content. Precedents in other creative industries and the ongoing debates about valuations in transport, communications and other infrastructure could inform a system of levies and standards administered by pricing experts.
What are the fundamental principles that underpin copyright and consent within the knowledge ecosystem
Copyright law forms a cornerstone of the knowledge ecosystem, incentivizing the creation and dissemination of original works. At its core, copyright grants authors exclusive rights to control the reproduction, distribution, and public performance of their creations. This exclusivity serves as a legal foundation for authors’ proprietary control, allowing them to protect their work from unauthorized use and to grant permissions for access and utilization, often in exchange for financial compensation such as royalties. Moreover, copyright law incentivizes intermediaries like publishers, performers, and broadcasting organizations to disseminate these works to the public, further enriching the knowledge ecosystem. The ability of authors to grant consent for the use of their works, coupled with the expectation of fair compensation, is thus central to a healthy and sustainable environment for creativity and knowledge creation. This ecosystem is however challenged by the opacity and scale of modern AI training techniques.
Generative artificial intelligence (AI) systems are disrupting this established knowledge ecosystem by significantly eroding authors’ proprietary control over their works, extending beyond previous digital challenges. Unlike past technologies that primarily focused on non-expressive aspects of works, AI often targets the expressive dimensions. While search engines direct users to original content and works themselves, AI tends to provide substitutions without adequate citation or acknowledgment of the sources that trained it. Many AI firms are using vast volumes of copyrighted works without direct authorization, bypassing human control in ways that raise difficult legal and ethical questions. This disregard for copyright and consent threatens to destabilize the knowledge ecosystem by devaluing human creativity, creating unfair competition, and potentially undermining the very incentives necessary for the ongoing creation of the datasets and works upon which further AI development and evolution depends. The fundamental principles, therefore—copyright protection, authorial consent, and fair compensation—are all vital for both human creators and the long-term viability of AI.
What are the core distinctions between historical web-scraping and AI’s current practices
The emergence of generative AI poses a unique challenge to copyright law, markedly different from previous instances of technological disruption. While web-scraping has been a common practice for years, the focus has generally been on extracting the non-expressive aspects of works, such as factual data, or creating search tools that direct users to the original source. In contrast, AI’s current practices frequently target the expressive dimensions of copyrighted material. This means AI models are trained on, and generate outputs that closely resemble or directly replicate, existing creative works. Unlike traditional search engines that provide links, AI often offers substitutes for these works, frequently failing to acknowledge the sources or provide citations to the original creators. This shift from data extraction to expressive mimicry fundamentally alters the relationship between technology, copyright, and creative output.
The Erosion of Proprietary Control
Furthermore, the scale and opacity of AI systems exacerbate the erosion of authors’ proprietary control. Whereas past scraping practices, even at scale, tended to respect the ‘robots.txt’ convention that permitted content creators to restrict access by web crawlers, many AI firms have disregarded this convention in pursuit of high-quality datasets. Massive collections of copyrighted works, such as the Books3 dataset, have been used to train AI systems without obtaining explicit authorization from the authors. This disregard for established norms and copyright protections stands in stark contrast to the historical uses of web-scraping, which generally focused on information retrieval and aggregation rather than the direct replication or substitution of creative works. The ease and scale at which AI can copy and repurpose copyrighted content, coupled with the secrecy surrounding training practices, has led to understandable anger and frustration among creators fearing an existential crisis.
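The ‘robots.txt’ convention mentioned above is a plain-text file served at a site’s root. The fragment below is an illustrative sketch, not drawn from the source; GPTBot and CCBot are real crawler user-agent names, but compliance with such directives is entirely voluntary, which is precisely the fragility the text describes.

```text
# robots.txt — served at https://example.com/robots.txt
# Disallow known AI training crawlers while permitting search indexing.

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Googlebot
Allow: /
```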
How and why does the free appropriation of copyrighted work by AI providers endanger the future of AI
The unchecked appropriation of copyrighted work by AI providers fundamentally destabilizes the knowledge ecosystem upon which AI’s very existence depends. Myriad texts and images inform the models powering apps, highlighting AI’s parasitic relationship to training data. While parasitic relationships can exist in stable equilibria, AI threatens to overwhelm its host. The free appropriation of copyrighted work by AI providers not only devalues human creativity but also eliminates critical incentives for the ongoing creation of works necessary for further technological development. This creates a self-defeating cycle wherein the very source material that fuels AI’s progress is diminished due to a lack of proper acknowledgment and compensation.
A policy of free appropriation of copyrighted work may menace AI development itself. It is not sustainable to expect training data to persist as a renewable resource when it is being mined, without compensation, in part to create substitutes for itself. Scholars have identified the danger of LLMs “learning from data produced by other models,” a possibility that is more likely the less humans are compensated for their work. This pathological outcome, dubbed “model collapse,” is a degenerative process whereby models forget the true underlying data distribution. Because LLMs are language models, not knowledge models, and have no ability to independently reason about or reflect on the original intent and meaning behind the texts and images they process, LLMs increasingly based on earlier LLM output may become, after sufficient iterations, like faded analog copies — almost unrecognizably blurred and distorted over time.
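The degenerative dynamic described above can be illustrated with a toy simulation that is not from the source: each “generation” fits a simple Gaussian model to samples drawn only from the previous generation’s model, with no fresh human-created data. The steady narrowing of the fitted distribution is one stylized form of model collapse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: stand-in for the "true" human-created data distribution.
mu, sigma = 0.0, 1.0
initial_sigma = sigma

n_samples = 20      # small training samples exaggerate the effect
generations = 500

for _ in range(generations):
    # Each new model trains only on the previous model's outputs...
    synthetic = rng.normal(mu, sigma, n_samples)
    # ...and re-estimates its parameters from that synthetic data.
    mu, sigma = synthetic.mean(), synthetic.std()

# After many generations the fitted distribution has drifted and narrowed:
# the model has "forgotten" the spread of the original data.
print(f"initial sigma: {initial_sigma:.3f}, final sigma: {sigma:.3f}")
```

Each refit introduces sampling error and a downward bias in the estimated spread, so successive generations lose the tails of the original distribution, which is the “faded analog copies” effect the scholars describe.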
What is the proposed opt-out mechanism and how would it work
The authors propose a streamlined opt-out mechanism that enables copyright owners to reclaim proprietary control over their works concerning the training of AI models. The mechanism would require AI providers to remove objectors’ works from their databases once copyright infringement has been documented. This mechanism empowers authors to effectively prevent AI systems from generating outputs that appear identical or substantially similar to their copyrighted works through a “notice and action” procedure aimed at AI providers. Through this mechanism, copyright owners can submit requests to AI providers for the removal of their works from datasets that are used to train relevant AI systems and to prevent future similar uses.
Under this proposed mechanism, copyright owners would be entitled to send an initial notice to an AI provider upon identifying that an output generated by the provider’s AI system contains either a verbatim or substantially similar copy of their work or a derivative work. The owner’s notice would target AI-generated content that resembles or adapts the copyrighted work, infringing the author’s rights. Importantly, the copyright owner would be obliged to document both the unauthorized reproduction of the work and the owner’s established claim of copyright ownership. Upon receiving the notice, the AI provider must take action to prevent such infringing content from being generated by its system again. If necessary, the provider can remove the work from its datasets, embed filtering technology, or initiate a “machine unlearning” process. After completing the process, the provider must inform the author of the actions taken and provide an adequate explanation of their effects.
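The notice-and-action sequence described above can be sketched as a simple workflow. All class, field, and action names below are illustrative assumptions for exposition, not terms drawn from the proposal itself.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Status(Enum):
    RECEIVED = auto()
    REJECTED = auto()       # documentation incomplete
    ACTION_TAKEN = auto()   # provider has acted and notified the owner


@dataclass
class OptOutNotice:
    work_title: str
    proof_of_ownership: bool      # owner documents the copyright claim
    proof_of_reproduction: bool   # owner documents the infringing output
    status: Status = Status.RECEIVED
    actions: list = field(default_factory=list)


def process_notice(notice: OptOutNotice) -> OptOutNotice:
    """Hypothetical provider-side handling of an opt-out notice."""
    # Both forms of documentation are required before any action is owed.
    if not (notice.proof_of_ownership and notice.proof_of_reproduction):
        notice.status = Status.REJECTED
        return notice
    # The provider may remove the work, add output filters, or unlearn it;
    # it must then explain the effect of those actions to the owner.
    notice.actions = ["remove_from_dataset", "add_output_filter",
                     "notify_owner_with_explanation"]
    notice.status = Status.ACTION_TAKEN
    return notice
```

A notice lacking either form of documentation is rejected, mirroring the documentation obligation the proposal places on the copyright owner before the provider’s duties attach.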
Normative Rationale
This mechanism effectively empowers authors to opt out of AI systems that generate content infringing their copyrights, enhancing control over their works and providing protection amid the surge in copyright infringement facilitated by AI systems. Additionally, the mechanism is designed to serve an information-forcing function, giving copyright owners the ability to address infringing activities and compelling AI providers to disclose information about the use of their works in training models and content generation. Overall, the mechanism is designed to offer authors more affordable, readily available resolution than complex, and often expensive, litigation.
What justifications support the opt-out mechanism
The proposed opt-out mechanism aims to empower authors to reclaim proprietary control of their works. The surge in copyright infringement facilitated by AI systems necessitates methods to enhance authors’ control and safeguard their interests. This mechanism, grounded in established “notice and action” procedures, allows copyright owners to request AI providers to remove their works from datasets and prevent the generation of infringing content. It serves to effectively deter substantial copying of copyrighted works by AI systems. Granting authors this right provides a means to protect their creative property in a sphere that increasingly relies on automated processes.
Drawing on Existing Copyright Frameworks
The rationale for this mechanism rests on several normative arguments. It draws heavily from existing copyright practices. Online intermediaries have been incentivized by DMCA safe harbors to adopt notice-and-takedown procedures, offering authors swift ways to address copyright infringements. The opt-out mechanism extends this logic to AI, granting authors a tool to manage the use of their work in these systems. The key difference is that the infringing content is generated by the AI firm itself rather than merely hosted by it. The opt-out mechanism also counters AI firms’ opacity by compelling them to disclose information on how copyrighted materials are used in model training and content generation. To be clear, the opt-out mechanism is not designed to undermine the fair use privileges of AI firms.
The opt-out mechanism also addresses the complexity of existing arrangements. Some AI providers offer opt-outs that are so convoluted as to be ineffective. More importantly, this approach creates an efficient alternative to the judicial process for dispute resolution. Copyright litigation is expensive, rendering it infeasible for many creators. The opt-out mechanism encourages conversation between the author and the AI provider, preventing potentially lengthy and costly disputes from arising. This system balances copyright protection and technological innovation.
What are pertinent licensing precedents for compensation within the creative industries
The question of compensation for the use and production of works via AI is controversial, hinging on what levels constitute fair compensation for the copyright-protected inputs used for training AI models. The U.S. government has set prices for certain uses of music, but more complex and higher-stakes economic arrangements are subject to multiple forms of administered pricing. Drawing from examples including the use of blanket licenses administered by ASCAP, legislatures can work out appropriate compensation schemes and delegate their crafting to expert administrators. Similarly, the administration of economic value and valuation for utilities such as transport and telecommunications infrastructure also provides well-established procedures for calculation.
Precedents for Calculating Compensation
Looking to methods of valuation, levying AI providers that use copyrighted works is one approach to generating funds for compensation, as seen in the Audio Home Recording Act (AHRA). The AHRA imposed a levy on sales of recording devices and media, anticipating their use in unauthorized copying. As with AI today, this copying was seen as a threat to creative individuals and copyright owners. Not only sales were levied, but also the importation and distribution of devices. This model demonstrates the potential for a levy on devices as a method of generating revenues to compensate creators. Given the complexity of the AI supply chain, policymakers will need to consider which elements should generate revenues. Levies may be imposed on the use of particular datasets, on model training, on the aggregate number of responses provided to users, or on paid subscriptions. The levy level could be benchmarked against some percentage of AI providers’ expenditures or revenues.
The revenue-based model may raise copyright owners’ concerns about the adequacy of compensation, which would be valid if works were compulsorily licensed. A levy system coupled with an opt-out opportunity enables a “soft compulsion,” meaning copyright holders can forgo participation if unsatisfied. AI providers that have fully licensed content should not be required to pay a levy, and providers whose use consists of a significant percentage of licensed works should be able to discount their levy obligations accordingly. Accounting for the use of and payment for works would require transparency, which may create spillover benefits by enabling external scrutiny.
What are the justifications for compensating copyright owners
There are multifaceted justifications for compensating copyright owners when their works are used in the context of Generative AI. From a basic fairness perspective, the knowledge ecosystem relies on copyright law to incentivize authors by granting them exclusive rights, which allow them to control reproduction, distribution, and public performance of their works. Copyright law also encourages intermediaries to disseminate these works. However, the opacity and scale of AI systems disrupt this ecosystem by eroding authors’ control. Many AI firms utilize copyrighted works as training data without consent, jeopardizing the livelihoods of creatives and threatening the viability of the very knowledge ecosystem that AI depends upon. Compensation may be seen as a means of correcting this imbalance, ensuring creators receive a fair share of the economic benefits derived from their contributions to AI systems. Further, the need for human authorship in the creation of copyrightable works may be used to justify some level of wealth transfer away from entities developing AI, as may a growing power asymmetry between capital and labor.
Unjust enrichment provides another rationale for compensation. This equitable principle holds that it is unfair for one party to retain a benefit obtained at the expense of another. In the context of AI, firms may be unjustly enriched by leveraging copyrighted works without appropriate remuneration. Avoiding windfalls attributable to another’s property or services is part of that rationale. There are strong ethical arguments for requiring firms to compensate those whose works have been used without consent. Requiring some form of wealth transfer away from the AI firms expropriating copyrighted works aligns such development with the public interest, promoting fair competition and reducing the likelihood that a new form of piracy will deter critical investments in the affected industries.
Several other real-world observations on the impact of failing to compensate also provide justification. Without fair compensation, low-cost automated content may overwhelm human-created works, even when the latter have demonstrable societal value. Safeguards can and should be put in place now to promote human-centric creative endeavors and ensure the long-term production of knowledge for future generations. The prospect of AI-generated works overwhelming human-created works, absent some legal rebalancing of rights and interests, thus provides yet another rationale for human-centric compensation to those works’ owners.
What methods of assessing the appropriate compensation level for copyright owners are proposed
This section explores methods for assessing compensation to copyright owners for the use and production of works via AI. This involves addressing both the “why” of compensation, rooted in varied normative perspectives that justify payments to copyright holders whose work underpins AI, and the “how” of compensation, referencing historical precedents involving fixed payments or proportional revenue sharing for copyright owners. The ethical dilemma centers on a conflict between labor and capital, where AI’s potential to enrich capital at the expense of labor could accelerate wealth redistribution. Copyright doctrine favors human creation, thus providing a justification for human-centric compensation.
One proposed method for assessing compensation is inspired by examples of administered pricing throughout communications and infrastructure, but focuses on the unique context of copyright. A levy imposed on AI providers using copyrighted works to train their models could provide funds for compensating affected copyright owners. The AHRA offers a precedent, which levied sales of recording devices and media to address uncompensated copying of copyrighted work. While a per-device cost may not be feasible, levies on datasets or model training, calculated as a percentage of AI provider expenditures or revenues, represent other potential payment triggers. Based on a tripartite division of inputs, policymakers may conclude that training data is worth as much as the talent and computing equipment.
Drawing on industry-specific valuation methods, another approach to valuation would be premised on the revenue generated by firms providing AI. For example, legislators could mandate that a for-profit firm with $10 billion in revenue allocate 5% to a levy for copyright holders not engaged in alternative licensing arrangements. This kind of plan has precedent and would not be an arbitrary demand on AI companies, for it is possible to adapt a plan from another area such as online advertising, as was done with the recent proposal that Google and Facebook owe $11.9 billion to content creators annually. It is also advisable to calibrate such a system to the uses and purposes of AI, distinguishing, say, a non-profit using data for purely research purposes from a new service producing composite articles. An entity leveraging AI in ways that demonstrably undercut the revenue streams of content creators should be held accountable in order to ensure the long-term production of knowledge. Although this approach may create record-keeping burdens, it would also allow for greater transparency in datasets and algorithmic governance, helping to address the social interests that are being negatively affected.
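The arithmetic of such a revenue-based levy can be sketched as follows. The 5% rate and the $10 billion figure come from the example in the text; the pro-rata discount for a provider’s licensed share is our illustrative assumption about how discounting might work, not a rule fixed by the proposal.

```python
def annual_levy(revenue: float, levy_rate: float = 0.05,
                licensed_share: float = 0.0) -> float:
    """Revenue-based levy with a pro-rata discount for licensed content.

    The default rate and the linear discount are illustrative assumptions.
    """
    if not 0.0 <= licensed_share <= 1.0:
        raise ValueError("licensed_share must be between 0 and 1")
    return revenue * levy_rate * (1.0 - licensed_share)


# The example from the text: $10 billion in revenue at a 5% rate.
base = annual_levy(10e9)
# A provider that has licensed 40% of its training content pays less.
discounted = annual_levy(10e9, licensed_share=0.4)
```

A fully licensed provider (licensed_share of 1.0) owes nothing, matching the text’s point that providers with fully licensed content should not pay the levy.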
What objections might arise regarding this proposal
Several potential objections could be raised against the proposed opt-out mechanism and levy structure for AI providers. One common concern is that such regulations might stifle innovation and hinder the development of beneficial AI technologies. It is argued that the added costs and administrative burdens associated with obtaining consent and providing compensation could disincentivize investment and slow down research and development. Critics might suggest that these measures are unduly favorable to copyright owners and excessively costly for AI companies.
Another objection often raised is the potential for a substantial number of copyright owners to withhold their works from AI training datasets, seeking higher payments than those obtained through a levy distribution. This could lead to a scarcity of high-quality training data, severely impeding the progress of AI. Furthermore, it is argued that smaller AI providers, particularly those focused on non-commercial or niche applications, might be disproportionately disadvantaged by the levy system, potentially skewing the AI landscape towards larger, more established players.
Additional Concerns
Concerns may also be raised over the long-term effects of the proposals, questioning whether the potential benefits of the mechanism outweigh the potential negative consequences on creativity and technology. Would limiting access to copyrighted work create a more equitable or accessible environment for innovation in the future? What will be the impact on training models if specific news outlets block AI data collection, intentionally or not? How well can the incentives put in place prevent harms like online bullying or fraudulent transactions? These and other such concerns will need to be considered and addressed in order to craft the most effective legislative solutions.
Response to Objections
Some might see our plan for an opt-out method and a levy as excessively favorable to copyright holders or too expensive for AI businesses. It has been claimed that compensating authors for the use of their work will halt AI development. However, it seems unlikely that a moderate annual fee would significantly affect the finances of the major corporations supporting much of today’s AI advancements. Additionally, voluntary licensing offers AI developers a “way out” from the levy we advise. Since OpenAI has previously engaged in licensing agreements with significant content providers, it is unlikely that ensuring creators are compensated will impede AI research any more than past online policy changes hindered the Internet.
Another worry is that a large group of copyright holders may withdraw their work in order to demand higher fees than they would receive from a levy. If they do, it could seriously hamper future AI development. There are several ways to respond to this issue. While scholarly analysis of current copyright law has been dominated by a trade-off between incentives and access, the construction of future laws can and should be guided by a more sophisticated and inclusive set of policy goals, including industrial policy. Much depends on the proportion of opt-outs relative to holdings as a whole, the importance of the missing works to advances in training generative AI, and the social value of AI generally.
Furthermore, some copyright holders may have moral or other nonmonetary objections to the use of their work by certain companies. This does not indicate an outright rejection of AI. Rather, holdouts may want to grant a commercial advantage to businesses more aligned with their own moral principles, or to aid small competitors of today’s AI behemoths. In many cases, this would be an entirely legitimate reason to exercise opt-out rights. It is exceedingly difficult, even for those within the AI realm, to forecast the medium- and long-term consequences of the changes in the relative costs of data that our proposal would likely bring. Such uncertainty commends a principle-centered, rather than results-centered, approach, with legislators continually re-evaluating effects.
The unchecked growth of generative AI presents a pivotal moment for the knowledge ecosystem. Failing to address the fundamental issues of copyright and fair compensation risks undermining the very foundations upon which future innovation rests. By implementing mechanisms for creators to control their work and receive equitable remuneration, we can steer AI development towards a more sustainable path, enabling the technology to flourish while safeguarding the vital contributions of human creativity. The ultimate goal is to cultivate a mutually beneficial relationship where AI and human ingenuity coexist and enrich one another, ensuring a vibrant and diverse future for knowledge creation.