28

I am currently interested in doing some research that applies various measurements and algorithms to the most common words in the English language. I have found a few good word lists online that would be suitable, but I am concerned that they may be subject to copyright and would like to be sure about their status before I use them. The lists are in the form of a simple text file with one word per line.

I understand that collections of words such as dictionaries are subject to copyright because they contain a large number of words along with their definitions, which can be considered to require original work and creativity, but what about just words without any definitions or additional information?

I have seen this other Law Stack Exchange question, which mentions lists of words, but the author seems to have been interested in also using some short definitions, and I am not certain the answer refers to plain word lists.

Is a list of common words copyrightable? If so, would it still be considered fair use to use the word list to generate data not related to the words themselves?

NK1406
  • 20
    Name your jurisdiction. – TRiG Oct 07 '19 at 10:22
  • 11
    Depending on jurisdiction, database rights might actually be the relevant right rather than copyright. – Jasper Oct 07 '19 at 12:44
  • 1
    Relevant Wikipedia article: https://en.wikipedia.org/wiki/Sweat_of_the_brow – Golden Cuy Oct 08 '19 at 01:55
  • If you are only using the list, and not redistributing that list or a derivative work based on that list, then I don't see how its copyrightability would be relevant. Fair use would not be relevant either; that would only be needed if you actually copied or redistributed the list (assuming it was copyrightable in the first place). – Brandin Oct 09 '19 at 14:42

2 Answers

42

Depending on your jurisdiction, such lists may be protected, but not by copyright.

For example, in Germany there was a court decision that scanning all the country’s phone books and selling them on CD constituted “unfair competition” and was illegal, while hiring 1000 typists who would manually type in all this information would not be.

Databases are protected in many jurisdictions, and a list of the 1000 most commonly used English words could reasonably be called a database.

gnasher729
19

The words themselves are not protected by copyright, because they are "facts" (of the English language -- also, the list-maker didn't create the words). Lists of words created by an algorithm are "facts", and lack the speck of creativity that makes web pages protected. The corpora that underlie the lists are protected, as is the program that filters them to give token counts, but the resulting table of information is not; see Feist v. Rural Telephone.

user6726
  • 30
    You seem to be assuming one particular jurisdiction without ever naming which one. Also, it is not clear why you assume that this particular jurisdiction is the one and only, considering that the OP has not named any particular jurisdiction. In particular, in 99.5% of all jurisdictions, the reference you cited is completely and utterly irrelevant. – Jörg W Mittag Oct 07 '19 at 10:47
  • 6
    "The list-maker didn't create the words" is beside the point. Few people would try to argue that a novel is not copyright because the writer did not create any of the words in it. The arrangement of the words (and whether or not that is "trivial" and/or "common knowledge") is the important point. – alephzero Oct 07 '19 at 11:10
  • 4
    The sequence of most common English words is the output of a well-defined algorithm, not a creative work. It's hard to see how any jurisdiction with a reasonable definition of copyright would protect that list. – asgallant Oct 07 '19 at 16:18
  • 1
    @JörgWMittag While your comment has some validity, the Supreme Court is controlling in the vast majority of jurisdictions, weighted by how likely someone would be asking, on an English language website, what the law regarding English words is. – Acccumulation Oct 07 '19 at 17:39
  • 8
    @Acccumulation Which Supreme Court? There's at least a few, and there's nothing that says this question is about English law (only about the use of lists of English words - for all we know it could be comparative research taking place in Thailand comparing most used English words to Korean words), and this site has users all around the world. I mean, in all likelihood it's about US law since the Questioner's profile says they're in the US, but it's silly to assume US-centric for all questions on the site. – Delioth Oct 07 '19 at 19:21
  • 13
    Regarding "any jurisdiction with a reasonable definition of copyright", in fact, I'm not aware of any jurisdiction with what I would consider a reasonable definition nor implementation of copyright. – dotancohen Oct 07 '19 at 20:02
  • 1
    @asgallant: You are correct, that this list is probably not copyrighted anywhere. It might however be protected under database protection in several jurisdictions, including in Germany and in fact the entire EU. Even though the question explicitly asks about copyright only, I would consider any answer that doesn't mention database protection incomplete, just like I would consider any answer that doesn't mention that this depends entirely on the jurisdiction incomplete. – Jörg W Mittag Oct 07 '19 at 22:23
  • 2
    @JörgWMittag it's actually relevant to 4% of the world's population or 27% of the developed world – JonathanReez Oct 07 '19 at 23:25
  • 1
    @Delioth My comment was disputing the idea that there's only 0.5% chance that the OP is asking about the US. – Acccumulation Oct 08 '19 at 00:14
  • 1
    @asgallant selection of an appropriate corpus to point your algorithm at is a non-trivial (in an academic sense - I don't know enough about the legal sense) task; if that's done by algorithm, you have a non-trivial algorithm – Chris H Oct 08 '19 at 09:04
  • Many jurisdictions, including the US, have a concept of "sweat of the brow" (see https://en.wikipedia.org/wiki/Sweat_of_the_brow ), which would apply here, but didn't in the cited case. The basic idea being that if something requires significant effort to create, then the product of that work intrinsically attracts property rights, and is thence copyrightable. – james Oct 08 '19 at 14:11
  • @asgallant, also, "the" (as in the one and only) sequence of most common words probably doesn't even exist -- even the Wikipedia page has two sets of ranks. Plus a mention of the fact that a similarly-written word may have two different meanings, and that might be taken into account (or not). An algorithm to identify the different meanings of a single word probably isn't that trivial. – ilkkachu Oct 08 '19 at 18:02
  • 2
    @james Since Feist, the US has no "sweat of the brow" doctrine anymore. More importantly, in the US you cannot violate the copyright of a work you did not access. Ten people who independently create even precisely the same work are all entitled to copyright it. – David Schwartz Oct 08 '19 at 18:53
  • I’d like to see the “well-defined” algorithm to create a list of the 1,000 most used English words. – gnasher729 Oct 09 '19 at 07:13
  • @Acccumulation SCOTUS only has jurisdiction over the USA. If you only consider UK, Australia, Canada, and New Zealand, that still leaves SCOTUS with jurisdiction over only about 70% of the English speaking jurisdictions weighted by population. I don't really think that counts as "vast majority". – Martin Bonner supports Monica Oct 09 '19 at 14:14
  • @asgallant "The sequence of most common English words is the output of a well-defined algorithm". So is any novel: print "Once upon a time, [...]". More significantly, the input to the algorithm you allude to is unclear. Now, I'm not claiming that the list is copyrightable, but the idea that you and I could both run the same procedure and would necessarily come up with the same answer isn't really true. – David Richerby Oct 09 '19 at 18:38
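
To make the disagreement in the comments above concrete, here is a minimal sketch of the kind of frequency-ranking procedure being discussed. It is an illustration only, not any list-maker's actual method: the function name `most_common_words`, the tokenization rule, and the two tiny sample corpora are all assumptions made up for this example. It suggests that the counting step is mechanical, while the resulting list depends entirely on which corpus is chosen.

```python
import re
from collections import Counter


def most_common_words(text, n):
    """Return the n most frequent lowercase word tokens in text.

    Hypothetical helper for illustration; real word lists may use
    very different tokenization and normalization rules.
    """
    # Assumed tokenization: runs of ASCII letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


# Two made-up corpora standing in for different source texts.
corpus_a = "the cat sat on the mat and the dog sat on the rug"
corpus_b = "a court held that a list of facts is not a work of authorship"

print(most_common_words(corpus_a, 3))  # ['the', 'on', 'sat']
print(most_common_words(corpus_b, 3))  # ['a', 'of', 'authorship']
```

The same procedure produces different rankings for the two inputs, so choices such as the corpus, the tokenization, and the tie-breaking rule all shape the final list. Whether any of those choices matter legally is a separate question that this sketch does not address.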