Whenever you mention web scraping, you’re guaranteed to get mixed responses.
Some people love web scraping, others hate it.
The lovers will point to how using web data can make the world a better, more productive place, whereas the haters will point to the harm web scraping supposedly causes.
Regardless of your views on web scraping ethics, this argument nearly always boils down to one question:
“Is web scraping legal?”
With high-profile legal cases like HiQ vs LinkedIn bringing this question into the spotlight, we decided to write this guide to separate the passion from the facts and break down when web scraping is legal, and when it is illegal, in 2022.
Disclaimer: We are not your lawyers, and these comments are based solely on our experience working with thousands of clients to scrape the web. Please seek legal assistance if you are in doubt about your own particular project.
Is Web Scraping Legal?
Some people make blanket statements saying that web scraping is legal or illegal. These statements are often based on their own incentives, be it web scrapers arguing that web scraping is perfectly legal, or corporate lawyers and anti-bot companies arguing the opposite.
In truth, there isn’t an easy yes or no answer to this question.
It really depends on the particular situation and the web scraping definition that you’re using. Here we define web scraping simply as the process of collecting data from across the internet. Scraping data from other websites is a useful and essential part of many legitimate data analysis operations. Web data scraping itself isn’t illegal, but it can be illegal (or in a grey area) depending on these three things:
- The type of data you are scraping
- How you plan to use the scraped data
- How you extracted the data from the website
The first two are more clear-cut, so we will start there before tackling the third, the tricky one.
What Types of Data Are Illegal To Scrape?
Be it e-commerce, personal or article data, the type of data you are scraping and how you plan to use it can have a huge bearing on its legality.
Unbeknown to many, the final use case of the data often has a significant impact on whether or not it is legal to scrape.
Sometimes it can be perfectly legal to scrape a website, but how you intend to use the data can make it illegal.
The two types of data we need to worry about:
- Personal Data
- Copyrighted Data
If the data you are scraping doesn’t fall into either of these categories, then you are generally safe.
Data Type #1: Personal Data
Personal data, or personally identifiable information (PII) as it is technically known, is any data that could be used to directly or indirectly identify a specific individual.
With the introduction of GDPR in 2018, the California Consumer Privacy Act, and the outrage that accompanied scandals such as Cambridge Analytica’s interference in the 2016 US Presidential Election, the issue of personal data has become a hot topic and one that every web scraper must be cognisant of.
Every legal jurisdiction has different regulations governing personal data. In general, however, in jurisdictions with the latest consumer privacy legislation (the EU, California, etc.), it is illegal for companies to obtain, store and/or use someone’s personal data without their consent or without having a lawful reason for doing so.
Types of personal data include:
- Name
- Phone Number
- Address
- User Name
- IP Address
- Date of Birth
- Employment Info
- Bank or Credit Card Info
- Medical Data
- Biometric Data
In the vast majority of cases (lead generation, sales intelligence, etc.), when scraping personal data from a website you don’t have the consent of the data owner (the person whose data you are scraping) to scrape their data and it’s very hard to argue you have one of these lawful reasons to do so:
- Consent – the data subject consented to us having their data.
- Contract – the personal data is required for performance of a contract with the data subject.
- Compliance – necessary for compliance with a legal obligation.
- Vital Interest, Public Interest, or Official Authority – typically only applicable for state-run bodies where access to personal data is in the public’s interest.
- Legitimate Interest – necessary for our legitimate interests.
As a result, in most cases scraping the personal data of someone living in the EU or California could result in your web scraping being deemed illegal.
If you’re not extracting any personal data, or only the personal data of people outside the EU and California, then you are likely safe to keep scraping.
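One way to act on this is to screen scraped records for personal data before storing them. The sketch below is a minimal illustration, assuming each scraped record is a Python dict. The field names in PERSONAL_DATA_FIELDS and both helper functions are hypothetical, loosely mirroring the list of personal data types above. Matching on field names is crude and is no substitute for a legal review.

```python
# Hypothetical field names, loosely based on the personal data types listed above.
PERSONAL_DATA_FIELDS = {
    "name", "phone_number", "address", "user_name", "ip_address",
    "date_of_birth", "employment_info", "card_number",
    "medical_data", "biometric_data",
}


def find_personal_fields(record):
    """Return the sorted keys of `record` that match a known personal-data field name."""
    return sorted(key for key in record if key.lower() in PERSONAL_DATA_FIELDS)


def strip_personal_fields(record):
    """Return a copy of `record` without the fields flagged as personal data."""
    flagged = set(find_personal_fields(record))
    return {key: value for key, value in record.items() if key not in flagged}


# Example record mixing factual product data with personal data.
scraped = {"product": "Kettle", "price": 24.99, "name": "Jane Doe", "IP_Address": "203.0.113.7"}
print(find_personal_fields(scraped))
print(strip_personal_fields(scraped))
```

Here the "name" and "IP_Address" fields are flagged and removed, leaving only the product name and price.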
Data Type #2: Copyrighted Data
The second type of data you need to be careful of scraping is copyrighted data.
Copyrighted data is data owned by businesses and individuals with explicit control over its reproduction and capture.
Like the use of copyrighted images and songs, just because the data is publicly available on the internet doesn’t mean it is legal for it to be scraped without the owner’s consent. You could be infringing the owner’s copyright by scraping their data.
This generally applies to the following types of web data:
- Articles
- Videos
- Pictures
- Stories
- Music
- Databases
Scraping copyrighted data isn’t illegal in itself. It’s what you plan to do with the copyrighted data that could potentially make it illegal.
One person could scrape a copyrighted article perfectly legally, while someone else could scrape the same article and be found to have breached the owner’s copyright.
It really depends on how you plan to use the data after you’ve scraped it.
- Can you argue fair use? For example, you plan to use short snippets of the original article rather than replicating it in full.
- Can you argue that the data is factual, and therefore not copyrightable? Facts like product names, prices, features, etc. aren’t covered by copyright law.
A trickier aspect of copyright law, however, is the issue of database rights. A database is an organized collection of materials that permits a user to search for and access individual pieces of information contained within the materials.
This means that it can be illegal to scrape a full database from the web and then reproduce it exactly for your own purposes.
Again, the US and the EU have different regulations around what constitutes a database and what legal protections the database owner gets, so it is important to understand the rules and regulations for the legal jurisdictions you are scraping in.
The risks of infringing someone’s database rights can be mitigated by altering how the data is scraped and used. These two tips help ensure you’re conducting ethical data scraping with copyrighted data:
- Only scrape some of the available data
- Do not replicate the organisational structure of the original database
Okay, so far we’ve covered what types of data can be illegal to scrape, and have seen how your plans for the scraped data can affect its legality.
Next, we’re going to tackle the most contentious issue about the legality of web scraping: how you extract the data from the website.
Is Web Scraping Itself Illegal?
It’s pretty straightforward to determine if scraping personal or copyrighted data will make your web scraping illegal because there are clear laws that set out what is legal and what is illegal.
It gets a lot trickier when it comes to the act of web scraping itself, because no government has passed any law explicitly legalising or outlawing web scraping. Instead, we have to go off the verdicts of lawsuits between web scrapers and website owners, of which there are many.
The main issue in all these cases is whether the Terms of Service listed on many websites that forbid web scraping (or automated access) are legally enforceable. Of course, with websites that allow web scraping, there are no issues.
Although cases on the topic of web scraping have gone both ways, as of 2021 the courts are beginning to clarify the legality of data scraping for web scrapers.
The most recent of these, HiQ vs LinkedIn, found that scraping data from a website doesn’t violate anti-hacking laws as long as the data is public and the scraper hasn’t explicitly agreed to the website’s terms and conditions in advance.
What this means is that so long as the data is publicly available on a website, and doesn’t require the web scraper to log in and explicitly accept the terms and conditions of the website, the web scraper is within their rights to scrape the publicly available data.
So how does this affect web scrapers?
If you are scraping a website then you need to ask these questions to determine if it’s legal or not:
- Is the data publicly available? If the data isn’t hidden behind a login, then the website’s terms and conditions aren’t enforceable, so you can legally scrape the public data.
- Do you need to create an account and log in to access the data? If this is the case then you need to examine the terms and conditions you agreed to when you created the account, because by agreeing to them you made them legally enforceable.
A lot of websites state in their Terms and Conditions (which you agree to when you create an account with their site) that they forbid you to scrape content from their site. So as a rule of thumb, you should always assume that logging into a site and scraping is illegal unless you’ve examined their T&Cs and confirmed that they allow it.
That is why at ScraperAPI we forbid our users to scrape data from behind the login.
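A scraper can apply the "is it behind a login?" question as a rough automated check. The sketch below is only a heuristic under stated assumptions: the function name and the LOGIN_PATH_HINTS values are hypothetical, and it expects you to pass in the HTTP status code and the final URL (after redirects) that your HTTP client reported. A 403 can also come from anti-bot blocking, so treat a positive result as a prompt to look closer, not as a legal determination.

```python
from urllib.parse import urlparse

# Hypothetical path fragments that often indicate a login page; adjust per site.
LOGIN_PATH_HINTS = ("login", "signin", "sign-in")


def looks_login_gated(status_code, final_url):
    """Heuristically decide whether a response suggests the page sits behind a login.

    `status_code` is the HTTP status of the response and `final_url` is the URL
    reached after any redirects.
    """
    # 401 (Unauthorized) and 403 (Forbidden) signal that access is restricted.
    if status_code in (401, 403):
        return True
    # A redirect that ends on a login-style path also suggests gated content.
    path = urlparse(final_url).path.lower()
    return any(hint in path for hint in LOGIN_PATH_HINTS)


print(looks_login_gated(200, "https://example.com/products/42"))
print(looks_login_gated(200, "https://example.com/login?next=/profile"))
```

The first call describes a public product page and the second a request that was redirected to a login form.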
Your Own Legal Web Scraping Sanity Check
So there you go, we’ve discussed all the main issues that determine the legality of your web scraping. In the majority of cases we see, what companies want to scrape is perfectly legal.
However, we always advise them to double-check their plans to ensure they’re conducting both legal and ethical web scraping with these three simple checks:
- Am I scraping personal data?
- Am I scraping copyrighted data?
- Am I scraping data from behind a login?
If your answer to all three of these questions is “No”, then your web scraping is legal.
However, if you answer “Yes” to any of them, then you should take a step back and do a full legal review of your web scraping to ensure you’re not scraping the web illegally.
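The three checks can be written down as a small helper that a team runs before starting a project. This is a minimal sketch, not a legal tool: the function name and labels are hypothetical, and the three yes/no answers still have to come from a person who has looked at the target site and the data.

```python
def scraping_sanity_check(personal_data, copyrighted_data, behind_login):
    """Return the checks that were answered "Yes" and therefore call for a legal review."""
    answers = {
        "personal data": personal_data,
        "copyrighted data": copyrighted_data,
        "data behind a login": behind_login,
    }
    return [check for check, answer in answers.items() if answer]


# Example: a project that scrapes copyrighted articles from public pages.
flags = scraping_sanity_check(personal_data=False, copyrighted_data=True, behind_login=False)
if flags:
    print("Take a step back and do a legal review of:", ", ".join(flags))
else:
    print("All three answers are No.")
```

An empty list means all three answers were "No". Anything else names the areas to review.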