
https://meta.stackexchange.com/q/388551/178179 mentions that SE will force some firms to pay to be allowed to train an AI model on the SE data dump (CC BY-SA licensed) and make commercial use of it without distributing the model under CC BY-SA.

This makes me wonder: Is it illegal for a firm to train an AI model on a CC BY-SA 4.0 corpus and make commercial use of it without distributing the model under CC BY-SA?

I found https://creativecommons.org/2021/03/04/should-cc-licensed-content-be-used-to-train-ai-it-depends/:

At CC, we believe that, as a matter of copyright law, the use of works to train AI should be considered non-infringing by default, assuming that access to the copyright works was lawful at the point of input.

Is that belief correct?

More specifically regarding the share-alike clause in CC licenses, from my understanding of https://creativecommons.org/faq/#artificial-intelligence-and-cc-licenses, it is legal for a firm to train an AI model on a CC BY-SA 4.0 corpus and make commercial use of it without distributing the model under CC BY-SA, unless perhaps the output is shared. Two questions: Is the output of an LLM considered an adaptation or derivative work under copyright? Does "output" in the flowchart below mean LLM output in the case of a trained LLM?

[Flowchart: decision tree summarizing when CC license obligations may apply to AI training, models, and their output]

Franck Dernoncourt
  • CC BY-SA doesn't forbid commercial use, so that part would be fine. (The training datasets likely contain lots of CC BY-NC-SA content, too, though, for which it wouldn't be OK.) CC BY-SA does require attribution and share-alike, which AI companies are not abiding by. The training dataset is a derivative work of all the works they scraped, the model itself is also a derivative work, and the output of the model is a derivative work, yet they don't provide attribution to the authors/artists who wrote/drew the content they used to create that output. – endolith Dec 07 '23 at 20:05

1 Answer


The flowchart included in the question is trying to summarize a rather large amount of legal uncertainty into one image. It must be emphasized that each decision point represents an unsettled area of law. Nobody knows which path through that flowchart the law will take, or even if different forms or implementations of AI might take different paths. The short and disappointing answer to your question is that nobody knows what is or isn't legal yet.

To further elaborate on each decision point:

  • The first point is asking whether the training process requires a license at all. There are two possible reasons to think that it does not:
    • AI training is protected by fair use (see 17 USC 107). This is a case-by-case inquiry that would have to be decided by a judge.
    • AI training is nothing more than the collection of statistical information relating to a work, and does not involve "copying" the work within the meaning of 17 USC 106 (except for a de minimis period which is similar to the caching done by a web browser, and therefore subject to a fair use defense).
  • The second point is, I think, asking whether the model is subject to copyright protection under Feist v. Rural and related caselaw. Because the model is trained by a purely automated process, there's a case to be made that the model is not the product of human creativity, and is therefore unprotected by copyright altogether.
    • Dicta in Feist suggest that the person or entity directing the training might be able to obtain a "thin" copyright in the "selection or organization" of training data, but no court has ever addressed this to my knowledge.
    • This branch can also be read as asking whether the output of the model is copyrightable, when the model is run with some prompt or input. The Copyright Office seems to think the answer to that question is "no, because a human didn't create it."
  • The third decision point is, uniquely, not a legal question, but a practical question: Do you intend to distribute anything, or are you just using it for your own private entertainment? This determines whether you need to consult the rest of the flowchart or not.
  • The final decision point is whether the "output" (i.e. either the model itself, or its output) is a derivative work of the training input.
    • This would likely be decided on the basis of substantial similarity, which is a rather complicated area of law. To grossly oversimplify, the trier of fact would be shown both the training input and the allegedly infringing output, and asked to determine whether the two items have enough copyrightable elements in common that copying can reasonably be inferred.
Kevin
  • and does not involve "copying" the work

    Isn't there a bunch of contrary evidence?

    https://cdn.arstechnica.net/wp-content/uploads/2023/03/948c88f4-e3d8-4123-ab42-7f681e70ad01_1600x1142.webp

    https://arxiv.org/abs/2202.07646

    https://arxiv.org/abs/2301.13188

    https://bair.berkeley.edu/blog/2020/12/20/lmmem/

    https://twitter.com/stefankarpinski/status/1410971061181681674

    etc.

    – endolith Nov 29 '23 at 17:10
  • except for a de minimis period which is similar to the caching done by a web browser, and therefore subject to a fair use defense

    But that fair use defense is contingent on caching being noncommercial and having "minimal impact on the potential market for the original work", which is not true of commercial language models used to generate content that mimics the original.

    – endolith Nov 29 '23 at 17:10
  • @endolith: Incorrect, fair use is not inherently contingent on being noncommercial. See Campbell v. Acuff-Rose Music, Inc., Author's Guild, Inc. v. Google, Inc., and several other cases. Furthermore, OpenAI is ultimately controlled by a nonprofit entity, so it may not make a difference in their case. – Kevin Nov 29 '23 at 20:23
  • Yes, but it's one of the four factors and is cited in the rationale for why web browser caches are fair use. OpenAI Global, LLC is a for-profit corporation and ChatGPT is a commercial product. – endolith Nov 29 '23 at 22:21
  • @endolith: No, it is part of one of the four factors, which in full is the "purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes." But under US case law, this factor also includes whether and to what extent the use is transformative. The AI companies can somewhat credibly argue that the model is significantly transformative, because it (the model, not its outputs) serves a fundamentally different function to the training works. This would mitigate any commercial motive the court might ascribe to the defendants. – Kevin Nov 30 '23 at 18:39
  • More to the point, Chrome is made by a for-profit company. So is Safari, as were many older browsers such as Netscape and MSIE. If commerciality were enough to defeat the fair use argument in this context, then web browsers' caches would not be fair use either. – Kevin Nov 30 '23 at 18:41
  • Did you read the answer I linked to, referencing Perfect 10 v. Google, Inc.? – endolith Nov 30 '23 at 22:06
  • @endolith: You told me that commerciality "is cited in the rationale for why web browser caches are fair use," and I responded to that. To respond to the fourth factor concern (which has nothing to do with commerciality) would be a whole different answer, but in short: I think you are reading the fourth factor much more broadly than it actually sweeps. We're not talking about the model's possible outputs, but about the model itself. The model is not a market substitute for any of its training inputs, even though it might be used to create such outputs (which could separately be infringing). – Kevin Nov 30 '23 at 23:18
  • Did you read the answer I linked to? – endolith Dec 01 '23 at 20:58
  • @endolith: Yes. Please do not make any further comments unless there is a specific improvement you want to suggest. – Kevin Dec 01 '23 at 21:28
  • So do you now agree that copying content in order to train a commercial AI does not fall under Fair Use in the way that copying content into a web browser cache does? – endolith Dec 03 '23 at 14:40
  • @endolith: I have read your arguments. I have explained why I disagree with them. Please stop pinging my inbox. – Kevin Dec 03 '23 at 22:30