20

Tools such as OSINT frameworks or plain old web scrapers let you harvest a lot of data. Let’s say you never harvest data in an illegal way, so technically all the data you compile is public information. However, you have extracted data from so many fragmented parts of the internet that the resulting data set is unusually useful and informative about private or sensitive topics, such as individuals’ personal information.

Is there any threshold one crosses where even if the data is legal to acquire, it becomes illegal to gather and store and share it, because the resulting data set is more or less a kind of privacy invasion or privacy threat?

Julius Hamilton
  • 4
    The jurisdiction tags in the answers implicitly deal with this point, but what is legal or not is not uniform globally. Different countries, and sometimes different states or regions within countries, each have their own laws on this subject. Sudan, China, France, California, and Alabama would each have their own legal regimes in this area. Determining which jurisdiction's law applies to particular conduct or a particular dispute is often a highly involved and non-trivial inquiry. It isn't always possible to know with certainty in advance which jurisdiction's law will apply. – ohwilleke Mar 13 '24 at 17:11
  • Your intent may matter here. I'm imagining a scenario where you collect a large quantity of public personal data with the goal of doxing someone. The collection and storage of the public data sources may not be illegal itself, but sharing it with the intent of inciting harassment may be, and scraping public databases for information on one particular person could possibly be considered stalking. The nebulous goal of "sharing information" could conceivably be illegal if the intent of doing so is to cause harm to someone. – Nuclear Hoagie Mar 13 '24 at 17:59
  • 3
    "and share it" now you're into fair use (or at least, the part where other parties start to care), which it either is or it isn't. If "public sources" means there's no copyright.... But it depends on who you're up against; "the NFL may successfully enforce false claims through sheer force of will." https://lawreview.unl.edu/warning-nfl-claims-copyright-ownership-%E2%80%A6-everything "But the warning used at the beginning of each game is not actually enforceable by law… it’s legally bunk." legal-smegal. If somebody cares, you get sent a DMCA notice; you either take it down or you hire a lawyer. – Mazura Mar 14 '24 at 01:43
  • 2
    Does the word "public" in this question mean "public domain", as in "not protected by copyright"? Or does it merely mean "publicly accessible without cost"? – Todd Wilcox Mar 14 '24 at 05:43
  • 2
    There's an existing question about scraping data on athletes but it's a bit narrower in scope so probably not a duplicate. – Stuart F Mar 14 '24 at 12:28
  • Not due to privacy, but some countries prohibit or try to prohibit distributing the locations of speed cameras, which I suppose would make OpenStreetMap technically illegal (at least OsmAnd tells users not to use the "warn for speed cameras" feature in those countries). – gerrit Mar 15 '24 at 08:18
  • Can you say how you managed to extract data from so many… parts?

    How could it matter that the net sum is unusually useful or informative?

    If you're suggesting data from three or five or 46 protected sources might be combined to reveal private or sensitive topics like individuals' information, why not show first how that might be technically possible… and isn't it clear that any question of legality in the circumstances you describe must depend on that, thus far unstated, technicality?

    – Robbie Goodwin Mar 15 '24 at 21:31
  • It seems to me that you are conflating "public domain" with "something that the public can access". These are two very different things. I can have a web server that "publishes" content under whatever license I want, and users must legally abide by it. On the other hand, I can have a paid subscription service that requires seven authentication factors, including two people turning physical keys in sync nuclear-launch style, and that might only give you access to public domain data that you could legally resell or publish yourself. – Bakuriu Mar 16 '24 at 09:41

5 Answers

26

Even if you gather from public sources, you need to comply with the GDPR to process personal data. That means you need a lawful basis to even be allowed to gather it. That you gathered it from all over just means you made other people's data identifiable. The threshold at which you have to comply with the GDPR is the moment you start to gather data about people in Europe.

As nvoigt correctly noted, a phone book is the easiest example in Europe: a person is listed in the phone book only because the phone book maker has an interest and usually consent. This consent is given to the phone book company alone and is not transferable. To process the data in a phone book for anything other than purely personal use (e.g. as a company), you need your own legal basis, such as a legitimate interest or consent from the data subject.

Data scraping under the GDPR is almost impossible to do lawfully and very risky:

Do you remember Equifax? They were struck with the worst of all punishments. Not a fine, but they had to destroy any part of their database that contained any data obtained without consent, because some of the data was obtained by illegal scraping.

Do you remember Clearview AI? In February 2022 Italy fined them €20 million for violating the GDPR, banned them from operating in Italy, and ordered them to delete all data on people in Italy. They had to delete their French database in 2021. And they were fined another £7.5 million in the UK in May 2022.

Database rights

Some jurisdictions, notably the EU, also have a copyright-like (sui generis) right in databases, which can prohibit scraping substantial parts of those databases.

Trish
16

There is no such law in the United States. It is legal to collect large amounts of raw data from public sources and share it.

Usually, even information protected by a non-disclosure agreement or a statutory privacy requirement can be legally collected and shared once it becomes a matter of public record or is made public (although this isn't true for certain national defense information, or for information obtained in confidential attorney-client communications if the attorney is the one seeking to share it).

There is no generally applicable right to privacy of information in the United States, although sometimes there are privacy rights associated with information disclosed in the context of certain specific kinds of relationships (e.g. banker-customer, attorney-client, health care provider-patient). Some U.S. states protect privacy in more kinds of relationships than others.

Some public data sources, such as PACER, the public database of the federal courts, do charge users for large downloads of data, though not for small ones.

Indeed, even privately collected assemblies of raw data are not protected by copyright. See Feist Publications, Inc. v. Rural Tel. Serv. Co., 499 U.S. 340 (1991).

ohwilleke
  • A long time ago, I was sent something from the feds and it pointed me to a website which had not only my information but information about many other people. IIRC, it even contained SSNs. I'm assuming this was a mistake or some sort of oversight. I would think (hope) that there are some laws preventing the distribution of mistakenly published data such as that, no? – JimmyJames Mar 13 '24 at 20:56
  • 4
    @JimmyJames Unless there is a legal privilege regarding the information or a statute expressly prohibiting it from being disclosed, probably not. – ohwilleke Mar 13 '24 at 21:26
  • 2
    Somewhat ironically, right now, the biggest thing stopping companies from abusing Americans' private information is... Europe's GDPR. If they accidentally gather information about a European in America, then Europe can take down the whole company. – Mooing Duck Mar 15 '24 at 04:22
  • 1
    @MooingDuck: I assume Europe could only prevent an American company from doing business in EU, not actually take it down entirely? (And there's probably a lot of companies that won't even notice being kicked out from Europe.) – user1686 Mar 15 '24 at 10:15
  • 1
    @user1686 They fine the company into the ground and demand deletion of the data, and... US courts have assisted in some cases in collecting the fines and enforcing the court-ordered measures. – Trish Mar 15 '24 at 12:02
  • 1
    I wonder: With that being said, why do companies complain that much when AI models use their seemingly public data to train their models? – U. Windl Mar 15 '24 at 12:23
  • The tl;dr is that the data isn't public. It's licensed data. The companies are giving us a license to view their data, but they believe that the license prevents the data from being used to train AI. For instance, Stack Exchange licenses everything we type here for readers to read and even reuse, but anything using the data must cite Stack Exchange as the source and use the same license. Virtually everything you see is licensed one way or another. – Mooing Duck Mar 15 '24 at 16:29
5

The goal of "sharing information" may possibly be illegal itself, depending on what you are sharing and why, regardless of where the information came from.

As an example, a student named Jack Sweeney is currently being threatened with legal action by Taylor Swift, because he runs social media accounts that compile public FAA data to publish tracking of celebrities' jets. All the data Sweeney uses is publicly available to anyone, but Swift's lawyers contend that aggregating and publishing the data with only 24 hours of delay amounts to unlawful harassment. It's unknown at this time whether such a claim would be successful, but it's within the realm of possibility. Here is a more detailed legal analysis of the claims in the case. The upshot is that it's not clear what laws, if any, Sweeney might be breaking, but the analysis does mention a few not-too-distant hypotheticals that would more likely be illegal, like using public information to stalk and harass celebrities in violation of anti-paparazzi laws.

Nuclear Hoagie
  • 5
    There are some specific doxing statutes related to public officials. But the lawsuit threatened by Taylor Swift is on very shaky legal ground and is unlikely to prevail. There is really no precedent for a lawsuit on this theory winning. – ohwilleke Mar 13 '24 at 19:13
  • 1
    And the article linked in the answer seems to concur. The legal scholar they interviewed repeatedly shoots down the potential legal claims against him. OTOH, defending yourself in court may be prohibitively expensive. – Barmar Mar 13 '24 at 20:30
  • 5
    Sweeny ➝ Sweeney, and probably worth mentioning Elon Musk if you're going to talk about him, see e.g. https://en.wikipedia.org/wiki/ElonJet – mrienstra Mar 14 '24 at 04:16
  • 2
    Actually, he isn't publishing real time data - it is time-shifted by a day or something. – MikeB Mar 14 '24 at 09:36
  • 3
    Information only: A while ago Google made available large volumes of search records. People demonstrated that they could "often enough" identify individuals and related personal information using this data. Google stopped doing it. || Highly related but not about what I said above: https://wiki2.org/en/Privacy_concerns_with_Google – Russell McMahon Mar 14 '24 at 09:36
  • @CGCampbell That is what I meant to say. Just a typo. – ohwilleke Mar 15 '24 at 15:18
3

It's perfectly possible to collate a set of public data that a government agency might feel it necessary to apply a restrictive security classification to. At that point you're in Official Secrets territory and doing anything with it comes with fairly horrible downsides.

Where this might apply to privacy is if the individual concerned has some national security significance. You might not be aware of this in advance.

regularfry
2

Note that the answers assume you are collecting the data from an unrestricted source. In the US, a fact cannot be copyrighted, but a collection of facts can be if its selection or arrangement is original, so a specific database may be copyrighted even if the data originally came from open sources, and extensive copying would make your version a derivative work at best.

It's not uncommon for databases that are copyrighted but exposed to include some "smoking gun" entries which don't affect use of the data but whose presence in another resource would demonstrate that it was bulk-copied from this one and hence (unless specifically authorized) a copyright violation.

The format of the data may also be copyrightable, or trademarkable, or qualify for a design patent.

Basically, if you aren't sure you have permission to use a dataset, ask. And then remember that they may be wrong. In either direction.

keshlam