Ok, I admit it, this article is slightly outside of this blog’s usual focus on cybersecurity. However, questions about AI have been everywhere lately, and as a blogger, podcaster, and general content creator, I’m interested in one specific, narrow portion of that discussion: AI’s interaction with, and dependence on, the created works of humans.

Rapid advancements in artificial intelligence (AI), particularly generative AI, and the even more rapid deployment of generative AI tools for public use have raised several legal questions about their use.

As AI-generated content becomes increasingly sophisticated and prevalent, understanding the implications of copyright becomes crucial. Copyright (and I’m going to default to copyright protection under US law) exists to protect those who produce creative works by granting them exclusive rights to benefit from those works. Benefits may be monetary or otherwise and, of course, may be assigned to others, but only by the original creator of the work.

Since AI models are trained on vast amounts of data, including copyrighted materials, there are fears that they could be used to infringe on the rights of content creators. Or, critically, that they already have.

None of these issues have been definitively resolved, or even fully addressed. While several lawsuits have been filed concerning this issue, it will likely be years before definitive court rulings determine whether and how the existing laws apply in this area. However, the ultimate implications of a finding of copyright infringement could affect not only the company that created the infringing tool but also anyone who used the tool(s) to generate material that relies on protected work.

A Brief Disclaimer

I will try to keep this discussion tied to what we know about the existing application of copyright law to the online posting and distribution of content, including text, images, and video.

However, you must always keep something in the back of your mind: copyright law was not designed, and has not been updated, to account for tools that make creative use of the elements of copyrighted material the way a human artist incorporates another artist’s style into their own creations.

That’s what these systems do.

However, they are not human creators, or in any way like them.

Generative AI systems do not create something new. They produce an output by combining what they have been taught as allowed by their algorithm. The algorithm is written by and operates for the ultimate benefit of the Generative AI system’s owner (even non-monetary benefits are benefits, which copyright law recognizes).

Imagining Stable Diffusion or Midjourney as a digital Van Gogh or Rembrandt is a false equivalence. ChatGPT isn’t answering your questions; it’s synthesizing a response based on how the materials it was trained on have answered that question.

As such, most of our current application of copyright law, especially the focus on the fair use exception to copyright infringement, may be entirely inappropriate (and inapplicable) to how these systems operate.

Despite reading many articles on this topic, I have yet to see many people acknowledge that simple fact: there may not be any basis for asserting fair use of copyrighted materials by a Generative AI system.

And yet, most of the coverage I’ve encountered tends to raise fair use, declare that it clearly applies, and quickly wave away any concerns about copyright. To do so, in my opinion, fails to apply both the letter and the purpose of copyright law.

The Intersection of AI and Copyright

Large language models, such as OpenAI’s GPT-4, are designed to understand and generate human-like text based on massive amounts of data. In GPT’s case, that appears to include data collected by crawling and scanning web pages, books, pictures, movies, and articles. Image-based models rely primarily on Generative Adversarial Networks or Variational Autoencoders and are trained on datasets consisting of (often) millions of images that have been categorized by human reviewers.

(It is important to remember that essentially none of these AI systems functions without the data fed into the models first being assigned specific values by human reviewers. It’s a step that far too many Generative AI industry advocates conveniently omit from their discussions about AI.)

All these systems rely on feeding a massive amount of relevant data through the appropriate machine learning or deep learning system. Modern Generative AI relies on its models the same way your computer relies on its hard drive: it can technically operate without it, but it can’t do much. The reality is that most of the major Generative AI systems in operation today have a significant amount of copyright-protected content in their datasets, the vast majority of it included not only without the creators’ permission but without their knowledge.
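
To make that concrete, here is a minimal, purely illustrative Python sketch of how scraped works end up copied wholesale into a training corpus, along with the labels assigned by human reviewers. Every name in it (the directory, the file names, the label map, the build_corpus helper) is hypothetical; this is not any vendor’s actual pipeline, just the general pattern.

```python
# Purely illustrative sketch of a training-data ingestion step.
# All paths, labels, and helper names are hypothetical.
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class TrainingExample:
    text: str                                     # the scraped work, reproduced in full
    labels: list = field(default_factory=list)    # tags assigned by human reviewers


def build_corpus(scraped_dir: str, labels_by_file: dict) -> list:
    """Copy every scraped document, verbatim, into an in-memory training set."""
    corpus = []
    for path in sorted(Path(scraped_dir).glob("*.txt")):
        # The complete text is ingested; nothing here checks for permission,
        # licensing, or whether the author even knows the work was collected.
        corpus.append(TrainingExample(
            text=path.read_text(encoding="utf-8"),
            labels=labels_by_file.get(path.name, []),
        ))
    return corpus


if __name__ == "__main__":
    # Hypothetical usage: the resulting corpus is what the model trains on,
    # so every protected work in it contributes directly to the model's weights.
    corpus = build_corpus("scraped_pages/", {"blog_post_001.txt": ["essay", "art"]})
    print(f"{len(corpus)} documents ingested")
```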

As AI-generated content becomes more widespread, it raises questions about the ownership and protection of intellectual property. For instance, does the use of copyrighted material in AI training data constitute copyright infringement? Current answer: maybe. Should we allow AI-generated content itself to be copyrighted, and if so, who holds the rights? Current answer: in the U.S., no, but elsewhere, yes.

Focusing on the use of copyrighted materials without the permission of the owner of the copyright, the most common defense that I’ve seen discussed is that the content generated by AI systems should qualify as fair use.

The Fair Use Doctrine

Fair use is a legal doctrine that allows for the limited use of copyrighted material without obtaining permission from the rights holder. For fair use to apply, the material must be subject to valid copyright. (Importantly, a fair use defense does not challenge the creator’s right to protect the material, nor does it suggest that the material has no value.)

The whole point of Fair Use is to balance the interests of copyright holders in benefiting from their works against the public benefit derived from someone else’s ability to use those works. More plainly, it is the recognition that there is a public benefit to allowing some uses of copyrighted material without requiring permission or payment to the creator for the use.

In the United States, the fair use analysis is guided by balancing four factors, as outlined in Section 107 of the Copyright Act:

  1. The purpose and character of the use;
  2. The nature of the copyrighted work;
  3. The amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
  4. The effect of the use upon the potential market for or value of the copyrighted work.

Applying the Fair Use Doctrine to AI-generated Content

Based on established precedent, a court will consider the question of fair use by determining how significantly any or all of the factors favor allowing the use of the protected work without requiring the permission of the creator (or payment of licensing fees to the creator). Generally speaking, the further the challenged use of the material is from a commercial reproduction of the original work, the more likely fair use applies.

Importantly, there is no hard-and-fast rule, no clear cutoff point where reproduction is definitely fair use. This is a legal balancing test, and due to considerable litigation over this issue, there are additional balancing tests and conditions of application within the various factors themselves.

The Purpose and Character of the Use

The first factor focuses on how the copyrighted material is used in the reproduction. A transformative use, one that adds new meaning, value, or purpose, may weigh in favor of fair use. Examples of transformative use include parody (using parts of the copyrighted work to make fun of the original work), repurposing the work entirely (e.g., the use of copyrighted student papers as the basis for a plagiarism detector was held to be transformative), and, importantly for the visual Generative AI tools, the use of copyrighted imagery in art (where the use involves alteration of the original to create works with a different aesthetic and purpose).

In the context of AI-generated content, the extent to which the copyrighted material is transformed may vary depending on the specific AI model and application.

For example, a large language model may generate text that includes phrases or sentences from copyrighted works, but the final output may be an entirely new piece of writing with its own unique purpose and meaning. In such cases, the transformative nature of the AI-generated content may weigh in favor of fair use. However, if the AI-generated content closely resembles the original copyrighted work, this factor may weigh against fair use.

Additionally, since it appears that the US Copyright Office is going to stand by its position that any work generated by AI cannot itself be copyrighted, another question of transformation arises. Can something that does not have the independent ability to “create” copyrighted material “transform” existing material at all? There is an argument that since it cannot, this entire prong of the analysis does not apply.

(Remember, Fair Use is an exception to copyright law. As such, the party alleging fair use must “win” the balancing test. Taking the question of transformation off the table is not neutral – it means the fair use exception is less likely to apply.)

The Nature of the Copyrighted Work

The second factor distinguishes between creative works, which receive greater protection under copyright law, and factual or informational works, which receive lesser protection. Another key distinction under this factor is the status of publication. Unpublished works are much more likely to receive protection – and thus weigh strongly against a finding of fair use – by the very fact that the author had not chosen to expose the work to public view.

AI-generated content, particularly from large language models, often includes a mix of creative and factual elements. In these cases, the fair use analysis may depend on the specific type and degree of creativity present in the copyrighted works used in the AI training data.

I am interested to see how courts address the published vs. unpublished question in the context of content shared on the internet. There is a clear, easily demonstrable difference between a book that has or has not been published, or a TV or radio program that has or has not been broadcast.

On the other hand, there are many different ways for content to be “published” on the internet that, unlike a published book or television broadcast, are not available to the public at large. This distinction is particularly important for content that has been published online only to a handful of individuals who are given access for a reason other than having paid for it in some way. Is a photo you shared only with your close friends on Facebook something that can be considered “published?”

The Amount and Substantiality of the Portion Used

The third factor examines the quantity and quality of the copyrighted material used in relation to the original copyrighted work as a whole. However, it is not strictly a numerical assessment. The analysis must be both quantitative (what % of the work is copied) and qualitative (to what extent is the copied portion the “heart” of the protected work). Additionally, under US copyright law, a collection of works is copyrighted both as one whole work and as individual copyrighted works.

Also, because it might be important depending on how the analysis goes, wholesale copying of a protected work is only going to be fair use in extremely limited, and usually non-commercial, circumstances (e.g., recording a television show for re-watching at home).

If we assume that the way copyright law will apply to AI-generated work is to directly compare ONE generated work with the copyrighted material used in its creation, then the analysis might look like this:

While it may be difficult to quantify the exact amount of copyrighted material used in training an AI model, the focus should be on the significance of the portion used in relation to the copyrighted work as a whole. If an AI model only uses small, non-essential portions of copyrighted works, this factor may weigh in favor of fair use. However, if the AI-generated content heavily relies on or reproduces substantial portions of copyrighted works, this factor may weigh against fair use.

However, this traditional analysis simply does not make sense for how Generative AI actually works. AI models are “trained” by copying content wholesale, with qualitative tags given to the copied material by human reviewers as it is added so that the AI can distinguish one type of content from another. This is an important step because the AI model doesn’t know what a dog looks like and doesn’t understand your question about toilet repair on its own. It must be taught.

This part is where, from my reading, the whole concept dives off the deep end, potentially eliminating the defense of fair use for AI-generated work.

The purpose of the fair use doctrine, according to the statute itself, is to allow limited use of copyrighted material, without permission, “for purposes such as criticism, comment, news reporting, teaching… scholarship, or research,” so as to foster creativity and promote the progress of knowledge.

The creation of images that do not in any way rely on applying human creativity but rely entirely on a database of previously created works does not seem to fit within the purpose of the statute. Arguments that user-entered prompts add elements of human creativity to the production, and thereby render the resulting product copyrightable, lack seriousness. User-entered prompts affect the final product only to the extent that what they describe is already in the system. Selecting from a menu is not a creative process.

Based on my review of copyright law, which I assure you is by no means exhaustive or definitive, the fair use doctrine likely does not apply to any such creations.

The Effect of the Use upon the Potential Market for or Value of the Copyrighted Work

The fourth factor concerns the impact of AI-generated content on the market for the original copyrighted work and has been described by numerous courts as “by far” the most important factor. This factor considers not only whether the AI-generated work at issue could harm the market for, or serve as a substitute for, the original work, but also whether it affects the market for potential derivative works. As such, the question is whether a finding that certain conduct constitutes “fair use” would lead the person generating the potentially infringing product, or others, to continue reproducing the copyrighted work, and whether that continued reproduction would harm the market for the original work.

The impact on the market will depend almost as much on the use of AI-generated work as it will on its nature. Obviously, an AI chatbot that crawls websites for information and then uses that information to provide answers to users without attribution is effectively seeking to replace the source website entirely. Both the use and nature of the reproduction directly impact the value and market for the copyrighted website material.

The question becomes more complicated with AI-generated artwork, audio, video, and other forms of creative expression. As far as the direct impact goes, if the replication is used as commentary on or parody of the original work, it will be more likely to be considered fair use than if it is simply intended to be, for example, a piece of art available for purchase.

The more significant question here is likely to be whether Generative AI products, if found to be non-infringing, would result in a massive proliferation of similar tools, and whether that proliferation would have an impact on the market for the original material. While it’s hard to say how this will be addressed, I have a hard time seeing an argument where 250 companies like Midjourney producing artwork based on copyrighted material wouldn’t end up having a negative impact on the market for the original protected work.

The Future of the AI Copyright War

As the capabilities of large language models and generative AI continue to evolve, the legal landscape will need to adapt accordingly. Policymakers, courts, and legal scholars must grapple with the complexities of AI-generated content and its implications on copyright law.

Unfortunately, the current debate over how the law should account for Generative AI is dominated by the voices of those who have a vested interest in existing Generative AI platforms. I strongly urge everyone to view with skepticism any suggestion that the law should be reshaped in a way that benefits the Generative AI platforms, or that grants copyright recognition to the products of these systems, without providing any compensation to the original rights-holders.

We also need to be wary of any arguments that frame the discussion in terms of innovation. Most people agree that innovation is good, but innovation that disregards everything other than the financial benefit of the innovators – including the public good and personal privacy – is not innovation. It’s greed.

The argument that we will “fall behind” if we don’t allow unfettered development of these tools also rings hollow. Having a technology sector that is less likely than some other country’s to constantly, arbitrarily, and opaquely suck up and sell off all of my personal data doesn’t actually sound like a bad thing to me.

Also, be wary of any arguments or suggestions that lack any specificity, particularly when they include sweeping statements or envision equality of benefits between copyright holders and Generative AI platform developers but offer no way to achieve either.

For example, be wary of arguments like:

AI-specific Copyright Legislation

“As AI-generated content becomes more prevalent, there may be a need for AI-specific copyright legislation that provides clearer guidelines for AI developers and users. Such legislation could help strike a balance between protecting the rights of copyright holders and fostering innovation in the field of AI.”

This suggestion appears to encourage the adoption of laws and regulations that would grant Generative AI platforms rights and privileges that no other similarly situated industry enjoys. Also, I tend to be highly skeptical of any argument that suggests that “fostering innovation” must go hand in hand with letting large companies, particularly those who currently dominate their industry, engage in questionable behavior.

Human-AI Collaboration

“Another possibility is to emphasize the role of human-AI collaboration in the creation of AI-generated content. By recognizing the creative input of human developers and users, it may be possible to establish a clearer path for copyright protection and ownership in the context of AI-generated content.”

This suggestion is a backdoor attempt to legitimize the creations of Generative AI by allowing them to be copyright-protected as though they were human creations. As I discussed above, there are very few instances where a user can provide inputs into what the AI creates that aren’t limited by the algorithm or the dataset on which the AI was trained. When considered in that way, the “input of human developers and users” amounts to writing down the items on a menu and selecting them off the menu. Moreover, if we consider the actions of a developer of the AI system along with the selections made by the user who is seeking to create something using the AI, who gets the copyright?

Are there any real options?

One potential solution to the copyright challenges posed by AI-generated content is the implementation of collective licensing models, like those used in the music and publishing industries. These models could allow AI developers to access and use copyrighted material for training purposes in exchange for royalty payments to rights holders.

This kind of arrangement, which already exists in some situations, is probably the most appropriate way to address the question of copyright. However, complying with these arrangements would significantly limit the datasets that AI models could train on, thereby limiting the quality of their outputs. Moreover, it’s not entirely clear how non-artistic models would be able to do this at all – does OpenAI have to pay every website owner to use their content?

Other significant questions come up in this context as well. How is consent determined? Can these contracts be canceled or withdrawn? How will disputes be handled? How will things like DMCA takedown notices be executed?

Another HUGE problem is the fact that the companies generating the AI models have exclusive access to the data needed to confirm whether they are complying with any law or contract addressing these problems. Consider Clearview AI, which has already been ordered to delete any images of EU citizens from its system – how can anyone verify that this has actually been done? Among the least persuasive options for regulation, in my opinion, is self-regulation. Particularly when most of the companies and individuals in leadership positions in this industry have previously, and repeatedly, proven incapable of actually regulating themselves.

Conclusion

Ultimately, it is important to recognize that large language models like GPT-4 are powerful tools that have the potential to revolutionize the way we create and consume content. However, it is also essential to approach these technologies with care and attention to ensure that they are used in ways that are ethical and respectful of the rights of all stakeholders. By taking a proactive and collaborative approach, we can help ensure that large language models are used to drive innovation and creativity while respecting the principles of copyright and intellectual property.
