
GitHub Copilot is an AI application developed by GitHub, a Microsoft subsidiary, and trained on code available online, including code hosted on GitHub itself. Given a prompt such as a function header, it outputs code that may solve the problem presented.

There is at least one instance of GitHub Copilot reproducing a verbatim copy of GPL'ed code: see the image below, this Twitter thread and the original source. There has been a question about the ownership of the generated code, but what about the Copilot application itself? From the point of view of an informatician, it would seem that to be able to replicate this code the Copilot program must contain the information within it, and would therefore be a derivative work of the code replicated. However, I do not know how that relates to the legal definition of a derivative work.

The code shown here is licensed under the GPL v2, which I think would not require Microsoft to release the source of a derivative work if they are not releasing the binary. If the copyright owner of code under a licence that did not allow this (for example, the Affero GPL) were to identify their work being reproduced by GitHub Copilot, would they be likely to argue successfully that Copilot is a derivative work, and so should be made available under the Affero GPL?

Image: GitHub Copilot generating Quake III code

User65535
  • The generator tool is a separate thing from the inputs and the outputs. From the way you describe it, the inputs to the training model are copyrighted works (many of which are GPL licensed), and the only question should be what is the status of the outputs -- are they derivatives of the inputs or not? The status of the tool itself should not be affected at all by the licenses of what goes into it as input. – Brandin Sep 08 '22 at 11:05
  • @Brandin You do get that the information that is in the inputs has been encapsulated in the tool? It is not getting this code from the repository when it outputs it, it is being generated by whatever logic makes up the copilot AI. It seems therefore that all the information, and the "creative expression" must be duplicated within that tool. – User65535 Sep 08 '22 at 11:08
  • Yes, it's an interesting question. Maybe it depends on the details of how the generator/trainer works. I've only built simple applications like this before, but in all those cases, the training data were separate from the application. So, if that trained data were derivative of GPL licensed code, and if I bundled up this trained data along with my application, I suppose you are asking if the GPL then obligates me to release my code as GPL as well? – Brandin Sep 08 '22 at 11:19
  • Also note that as long as no Affero GPL code is used, then there is probably no problem either way. The Affero GPL adds the requirement that you must give users a copy of the complete source code whenever they can access the services of the software online (and an API would qualify). For the plain GPL, however, there is no obligation in any case to offer your complete source code to the remote API users. For discussing the details of GPL vs. AGPL and so on, though, it's probably better to go to https://opensource.stackexchange.com/ – Brandin Sep 08 '22 at 11:28
  • @Brandin Almost. I do not see any need for the trained data to be bundled. AIUI all the information (and therefore creative content) is contained within the application; the training data is not used after the training is complete. Also, this particular example is not Affero GPL, but it would be amazing if Copilot did not contain some code under that licence. The question is really about what makes a derivative work; this is not specific to OS. Closed source code within GitHub would create the same issue. – User65535 Sep 08 '22 at 11:30
  • I think the way you've phrased part of the question reflects a bit of a misunderstanding about GPL v2 vs. GPL v3. Neither of those requires releasing source code if you don't actually release the binary code first. GPLv3 disallows so-called tivoization, but still allows you to host a modified version of GPL code behind a server, as Copilot may be doing. – Brandin Sep 08 '22 at 13:41
  • @Brandin I think you are right, I shall replace V3 with Affero GPL. – User65535 Sep 08 '22 at 13:43
  • As mentioned in your linked answer from 2021, when you post code on GitHub, you have already given GitHub permission to make copies of it 'to improve the Service', 'to display it to other users', etc. So the fact that it's Affero GPL-licensed or licensed in some other way effectively doesn't apply to GitHub in this case. – Brandin Sep 08 '22 at 13:59
  • @Brandin That does not give them rights to distribute derivative works. – User65535 Sep 08 '22 at 14:04
  • In my understanding, Copilot is not a work which is distributed, but is part of the internal GitHub service itself. So I think this would only be an answerable question if GitHub were actually distributing Copilot, say, to install on your server. – Brandin Sep 08 '22 at 14:14

1 Answer

Something used to create a derivative work is not itself a derivative work by virtue of that fact, any more than a copying machine is a derivative work of a book it is used to copy.

Derivative means "based upon". If the software used to make a derivative work was not itself based upon the original work, then the software isn't a derivative work of that original.

Furthermore, since software programs lack legal personality, the software program itself can't infringe on the copyright of another. Only a person utilizing the software program can infringe on someone else's copyright.

ohwilleke