Is there a license like AGPL for datasets and machine learning models?

Question

The GNU Affero General Public License is a modified version of the ordinary GNU GPL version 3. It has one added requirement: if you run a modified program on a server and let other users communicate with it there, your server must also allow them to download the source code corresponding to the modified version running there.

A lot of machine learning relies on fine-tuning existing models. Is there a license comparable to the AGPL designed for datasets and machine learning models? That is, is there a license such that, if I publish a dataset under it, anyone who trains models on that dataset must also release their models under the same license? ... And that requires any extensions or refinements to the dataset used in training their models to be published under the same license? ... And that requires any models fine-tuned from models that were trained on even one data element from my dataset to be published under this license?

I suppose that if a lot of data were under this sort of license, it might force complex models like GPT-3 to be publicly released if OpenAI had accidentally trained on a corpus under this license, instead of being offered only as a paid black-box service.

score 3 · Accepted Answer · answered Mar 26 '22 at 15:59

The problem is that open source licenses, including the AGPL, have legal force because copyright prevents copying and modifying software by default (software is treated as a “literary work” and therefore enjoys copyright protection). But copyright on datasets and ML models is very different, if it is recognized at all.

Thus, similar copyleft techniques would only work in particular jurisdictions. For example, the EU recognizes “database rights”. A copyleft license that accounts for this is CC-BY-SA-4.0, meaning that in an EU context, databases based on the original database would have to use the same license. However, database rights are not recognized in the US (facts are not copyrightable, and the “sweat of the brow” doctrine is not recognized). Thus, the CC-BY-SA-4.0 license would not have any copyleft effect with respect to databases in a US context. Independently of copyright in the database as a whole, the data in the database might be copyrighted material, for example if the database contains text or images, complicating matters further.

Machine learning models derived from a data set are much more difficult. Clearly, ML models that have been trained are not a creative work and are therefore not eligible for independent copyright protection. At most, it can be argued that the model is an automatically transformed version of the input data, so that copyright in the input data implies copyright in the model.¹ Perhaps hyperparameter choices could reflect some creative input. This is very much an active topic of debate. Given this uncertainty, it would be impossible to create a public license that works reliably.

¹ An interesting discussion topic is the potential effect on Microsoft's Copilot ML model, which was also trained on GPL-licensed source code.

Instead of deriving force from copyright law, it would be possible to impose conditions via a contract, i.e. EULA-like terms that only provide access to the material after the terms have been accepted. But again, this is difficult. Contract law differs wildly between jurisdictions. For example, a contract is defined by the “meeting of minds” in some jurisdictions; by offer, acceptance, and consideration in others. But how does such a contract ensure appropriate consideration? How can acceptance be ensured if the material is publicly available?

For these reasons, I think that unless broad international agreement emerges about IP protections for machine learning models, such a copyleft system for ML models is impossible.

It is worth noting that the lack of such protections is probably quite good for innovation and research, since researchers are free to improve each other's work without legal concerns. The idea of copyleft is a hack to subvert the “everything is forbidden by default” system of copyright, but “everything is allowed by default” might be better.²

² To continue the Microsoft Copilot example: some copyleft advocates like Bradley M Kuhn are sceptical about Copilot's GPL compliance, but remind us that copyleft maximalism means copyright maximalism, and that this is not the goal of open source. https://sfconservancy.org/blog/2022/feb/03/github-copilot-copyleft-gpl/
