The problem is that open source licenses, including the AGPL, have legal force because copyright prohibits copying and modifying software by default (software is treated as a “literary work” and therefore enjoys copyright protection). But copyright in datasets and ML models is very different, if it is recognized at all.
Thus, similar copyleft techniques would only work in particular jurisdictions. For example, the EU recognizes “database rights”. A copyleft license that accounts for this is CC-BY-SA-4.0, meaning that in an EU context, databases based on the original database would have to use the same license. However, database rights are not recognized in the US (facts are not copyrightable, and the “sweat of the brow” doctrine is not recognized). Thus, the CC-BY-SA-4.0 license would not have any copyleft effect with respect to databases in a US context. Independently of copyright in the database as a whole, the data in the database might itself be copyrighted material, for example if the database contains text or images, which complicates matters further.
Machine learning models derived from a data set are a much more difficult case. A trained ML model is arguably not a creative work in its own right and is therefore unlikely to be eligible for independent copyright protection. At most, it can be argued that the model is an automatically transformed version of the input data, so that copyright in the input data implies copyright in the model.[1] Perhaps hyperparameter choices could reflect some creative input. This is very much an active topic of debate. Given this uncertainty, it is currently impossible to write a public license that works reliably.
1. An interesting discussion topic is the potential effect on GitHub Copilot (GitHub is owned by Microsoft), an ML model that was trained in part on GPL-licensed source code.
Instead of deriving force from copyright law, it would be possible to impose conditions via a contract, i.e. EULA-like terms that only provide access to the material after the terms have been accepted. But again, this is difficult. Contract law differs wildly between jurisdictions. For example, a contract is defined by the “meeting of minds” in some jurisdictions; by offer, acceptance, and consideration in others. But how does such a contract ensure appropriate consideration? How can acceptance be ensured if the material is publicly available?
For these reasons, I think that unless broad international agreement emerges about IP protections for machine learning models, such a copyleft system for ML models is impossible.
It is worth noting that the lack of such protections is probably quite good for innovation and research, since researchers are free to improve each other's work without legal concerns. The idea of copyleft is a hack to subvert the “everything is forbidden by default” system of copyright, but “everything is allowed by default” might be better.[2]
2. To continue the Copilot example: some copyleft advocates, such as Bradley M. Kuhn, are sceptical about Copilot's GPL compliance, but remind us that copyleft maximalism means copyright maximalism, and that this is not the goal of open source. https://sfconservancy.org/blog/2022/feb/03/github-copilot-copyleft-gpl/