Want to know how to hide your IP address for web scraping?
Scrapers navigate the internet the way we do through our browsers.
Once we give a URL to our scraper, it’ll send an HTTP request with all the necessary information for the server to identify the traffic and know how to respond. For this connection, our scraper will use our network and our IP address. This little detail can have a big impact on the information we get access to, and even on whether we can get access at all.
So, in this article, we’ll discuss the role your IP address plays in your web scraper’s effectiveness and how you can hide or change it to avoid getting blocked. Let’s start from the beginning!
What’s an IP Address?
An IP address is a string of numbers, separated by periods, associated with a device connected to the internet. This address is unique and assigned by your internet service provider (ISP) to identify your device.
In the traditional format, the address is made of four numbers, each ranging from 0 to 255, and it’s assigned every time your device connects to a network. That’s why your device has a different IP address at home than at your workplace.
Note: There’s a more recent Internet Protocol (IP) version, IPv6, which takes a different format from the traditional 93.45.125.173 and looks more like this: 2001:0db8:85a3:0000:0000:8a2e:0370:7334. Check this page to learn more about their differences. Still, the principles and techniques we’ll discuss later on apply to this new format as well.
When we connect to the web, our devices need to communicate with one another to send and receive information. The IP address works the same way as a home address, allowing information to flow in the right direction. To check your IP, you can use https://whatismyipaddress.com/ to see both your IPv4 and IPv6 addresses.
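To make the two formats more concrete, here’s a small sketch that uses Node’s built-in net module to tell IPv4, IPv6, and invalid addresses apart. The ipVersion helper and the sample addresses are our own, picked just for illustration:

```javascript
const net = require("net");

// Classify a string as IPv4, IPv6, or not an IP address at all.
// net.isIP() returns 4, 6, or 0 respectively.
function ipVersion(address) {
  const version = net.isIP(address);
  return version === 0 ? "invalid" : `IPv${version}`;
}

const samples = [
  "93.45.125.173", // four numbers, each between 0 and 255
  "93.45.125.300", // 300 is out of range, so this is not an IP address
  "2001:db8::8a2e:370:7334", // IPv6: hexadecimal groups separated by colons
];

for (const sample of samples) {
  console.log(`${sample} -> ${ipVersion(sample)}`);
}
```

A check like this can be handy for validating a list of proxy addresses before handing them to a scraper.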
Why Should You Hide Your IP Address While Scraping the Web?
In terms of web scraping, your IP address reveals several details that are relevant to servers, including:
- Your internet provider
- Geolocation (country, region/state, city, and area code)
In short, it tells the server where the request originated. And because the address is unique, the server can use it to tell your requests apart from everyone else’s.
And that’s where problems begin.
Unlike organic users, a web scraper is designed to send a large number of requests in a short period of time, which can overwhelm the target servers and cause real problems for the websites we scrape. These problems were very common in the early days of web scraping, giving it a bad reputation and spurring the development of more anti-scraping techniques.
When a server recognizes a bot, it blocks the scraper from accessing its content, in most cases permanently, by banning the IP address. To mitigate the risk of getting your IP banned forever from certain sites, you need techniques to hide and change your IP address so you can still scrape the data you need.
Note: Of course, it’s important to apply web scraping best practices. Always make sure you’re extracting data legally to avoid problems.
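One of those best practices is simply slowing down. As a rough sketch, here’s how a Node.js scraper could space out its requests so it doesn’t hammer the target server. The function names and the one-second default delay are our own assumptions, and fetchPage stands in for whatever function actually downloads a page:

```javascript
// Pause for the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch URLs one at a time, waiting `delayMs` between consecutive requests.
// `fetchPage` is passed in so the pacing logic can be tried without a network.
async function fetchPolitely(urls, fetchPage, delayMs = 1000) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(delayMs); // no wait before the first request
    results.push(await fetchPage(urls[i]));
  }
  return results;
}
```

With node-fetch, fetchPage could be something like (url) => fetch(url).then((res) => res.text()). The right delay depends on the site, so treat the default as a starting point.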
Another reason you might want to change your IP is to access geo-specific information. For example, eCommerce marketplaces and search engines will show different results based on the geographical information attached to the requests. So if you are in the US but want to see how a competitor markets its products in France, changing your IP’s location will let you collect the data a French visitor would see.
4 Ways to Hide Your IP Address for Web Scraping
There are four ways to hide your IP, some more effective than others.
Let’s explore these options to see which fit our requirements:
1. Using The Onion Routing Project (TOR)
According to Investopedia, the onion routing project “is an open-source privacy network that enables anonymous web browsing. The worldwide Tor computer network uses secure, encrypted protocols to ensure that users’ online privacy is protected.” To explain Tor simply:
Instead of your computer connecting directly to your destination, the request is sent through several nodes in the network (layers) that work as intermediaries. The same happens with the response: the destination server sends its response back through Tor’s network before it reaches your device. Through this circuit of connections, your IP is masked (not changed), as the destination receives the request from the last node in the circuit (a volunteer-run machine known as the exit node) on your behalf. In terms of web scraping, this can work relatively fine, and it’s quite easy to implement.
Note: Here’s a guide on using Tor for web scraping.
However, the disadvantage of this approach is that Tor relies on volunteer servers across the globe, and your connection will pass through several random servers, which might not always be the best-performing machines or have the best network speeds. The combination of these two factors will slow down a scraper a lot.
Speed isn’t the only factor to weigh, though. The most important problem with Tor is that some websites won’t give access to traffic coming from the Tor network, making it an unreliable option for serious projects.
As the number of requests increases, you’ll quickly start experiencing a lot of 404 and 403 errors. Even without loading speed issues, a low success rate will ultimately make your web scraper unviable.
2. Using a Virtual Private Network (VPN)
A virtual private network or VPN is a privacy service that creates an encrypted tunnel for your data to move through, effectively hiding your IP address and allowing you to surf the internet anonymously.
VPN services will redirect your requests through their secure servers worldwide while encrypting all data sent and received. Unlike Tor, you’ll be able to choose which server you want to connect to, meaning your IP address will be the IP of the server you’re using.
This is very useful because it allows accessing geo-specific information by connecting to a server from another country, virtually changing your real IP, and keeping it secure from blacklists.
However, VPNs suffer from the same scalability and speed issues as Tor, just to a different degree. While Tor routes your requests through random servers (currently about 6,000), a VPN will send all your requests through the same server every time. In other words, you’ll be sending all your requests using the same IP address.
As you hit 100k to a million requests, websites will notice the high frequency of requests and block the VPN traffic. For example, famous VPNs like NordVPN can be detected by some websites, as you’re not the only one using these IPs. To see what we mean, turn on a VPN and try to connect to Amazon Prime or HBO Max.
This unreliability comes from the single alternative IP (or very limited number of alternative IPs) these services provide.
Note: Even with more expensive solutions and private servers, which come with a hefty price tag, VPNs won’t be reliable enough for large projects.
Of course, because VPNs reroute your connection, performance will naturally decrease. Exactly how slow it gets will depend on your network speed and the VPN’s server. However, because a private company maintains these servers, the difference shouldn’t be as wild as with Tor or, at the very least, performance will be more stable.
3. Proxies
The previous two technologies are, in essence, built upon proxy servers, but you can use proxies without them. A proxy is an intermediary server that swaps your IP for its own, allowing you to scrape websites without putting your IP address at risk.
Resource: 6 Types of Proxies You Need to Know.
Of course, one proxy isn’t enough. You’ll need to use a pool of proxies and create a system to rotate them after a set number of requests or after every request. The idea behind this is that servers will interpret your bot as thousands and thousands of users each sending one request, instead of millions of requests coming from one agent.
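To give you an idea of what the simplest rotation system looks like, here’s a minimal round-robin sketch. It isn’t a production setup: the proxy addresses are placeholders from a range reserved for documentation, and createProxyRotator is a name we made up for this example.

```javascript
// Hypothetical proxy pool: these addresses come from the reserved
// documentation range (203.0.113.0/24) and are placeholders, not real proxies.
const PROXY_POOL = [
  "http://203.0.113.10:8080",
  "http://203.0.113.11:8080",
  "http://203.0.113.12:8080",
];

// Returns a function that hands out proxies round-robin, moving to the next
// proxy after `requestsPerProxy` requests (1 = rotate on every request).
function createProxyRotator(proxies, requestsPerProxy = 1) {
  if (proxies.length === 0) throw new Error("The proxy pool is empty");
  let issued = 0; // how many proxies have been handed out so far
  return function nextProxy() {
    const index = Math.floor(issued / requestsPerProxy) % proxies.length;
    issued += 1;
    return proxies[index];
  };
}

// Rotate after every two requests and show which proxy each request would use.
const nextProxy = createProxyRotator(PROXY_POOL, 2);
for (let i = 1; i <= 7; i++) {
  console.log(`request ${i} -> ${nextProxy()}`);
}
```

The sketch only decides which proxy to use. To actually route a request through it, one common option is passing a proxy agent (for example, from the https-proxy-agent package) to node-fetch’s agent option. A real system would also need to drop proxies that get banned, which is where things start getting complicated.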
A well-built proxy management infrastructure will make web scraping easier, but it does require a lot of knowledge, time, and resources to build and maintain this solution in-house.
We wrote an in-depth analysis of the pros and cons of in-house proxy management solutions and what variables you need to consider when building one. In short, good proxy management will ensure a high success rate and make scraping data easier. If you have an experienced engineering team, creating an in-house solution will give you full control over your proxy system.
However, you’ll need to consider the time and money it takes to keep this system working properly. As websites change, IPs get blocked, and your needs evolve, you’ll need to build more and more complex systems like:
- IP rotation systems
- Dynamic rotations based on server feedback
- Delays and retries
- Handling dynamic content
- Geolocation
- Growing your IP pool
- Managing and implementing constant patches and security updates
Note: Without proper cybersecurity expertise on your team, in-house proxy clouds can become easy targets for hackers, putting your data at risk.
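To show how quickly these pieces add up, here’s a hedged sketch combining just two items from the list: retries with delays, and switching proxies based on server feedback. Everything here is an assumption for illustration, including the status codes we treat as “blocked,” the default delays, and sendRequest, a hypothetical helper that sends the request through a given proxy and resolves to an object with a status code.

```javascript
// Status codes treated here as "blocked or throttled" (an assumption; tune per site).
const BLOCKED_STATUSES = new Set([403, 429]);

const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Try a request up to `maxAttempts` times. After a blocked response, switch to
// the next proxy and wait twice as long as before (exponential backoff).
// `sendRequest(url, proxy)` is a hypothetical helper that resolves to an
// object with a numeric `status`, such as a node-fetch Response.
async function fetchWithRetries(url, proxies, sendRequest, options = {}) {
  const { maxAttempts = 4, baseDelayMs = 500 } = options;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const proxy = proxies[attempt % proxies.length];
    const response = await sendRequest(url, proxy);
    if (!BLOCKED_STATUSES.has(response.status)) return response;
    // Back off before the next attempt, but not after the final one.
    if (attempt < maxAttempts - 1) await wait(baseDelayMs * 2 ** attempt);
  }
  throw new Error(`Still blocked after ${maxAttempts} attempts: ${url}`);
}
```

Note that this sketch doesn’t handle network errors, CAPTCHAs, or proxies that should be retired from the pool for good. Each of those is another piece you’d have to build and maintain.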
The best option for projects that need to scale sustainably is to use an off-the-shelf solution instead.
4. Off-The-Shelf Proxy Management Solutions
As the name suggests, these are solutions built and maintained by third parties that dedicate all their resources to providing you and your team with the best and most robust solution possible.
These solutions span the whole technical implementation spectrum.
They range from services that provide and maintain the proxies you use to build your own scrapers, to fully off-the-shelf scraping solutions that only require you to specify the URLs and desired data.
As you can imagine, the closer you get to the done-for-you side, the less control you have. These solutions are also more expensive than their more code-based counterparts.
Not all scraping APIs and services are built the same. Some are built with a specific use in mind and won’t work nearly as well for other projects.
When choosing a scraping solution, answer the following questions:
- How much customization does your project need?
- How many requests (approximately) will you be sending?
- What’s your budget?
- Do you need any specific functionalities?
- What’s your team’s (or your) scraping knowledge?
- Is there a specific tech stack you want/need to use?
Having this information will make your decision easier and help you avoid making the wrong investment.
Beyond Hiding Your IP: Using ScraperAPI to Increase Success Rate, Avoid Blocks and Scrape Data More Efficiently
The truth is that just hiding your IP address won’t be enough to scale your web scraper.
Data has become more valuable than ever, so web admins use more complex anti-scraping techniques like browser behavior profiling, CAPTCHAs, and tracking request patterns to detect and block all bots.
You’ll have to handle JavaScript/dynamic content (which regular crawlers won’t be able to access), geo-specific content, retries, IP rotation, headers, etc., and that’s where ScraperAPI can help.
ScraperAPI is a scraping solution designed for scalability and customization. It uses a massive pool of data center, residential, mobile, and premium proxies in combination with machine learning, huge browser farms, and years of statistical analysis to rotate and choose the best proxy and header combinations to get a successful response.
In addition, it can be integrated with any tech stack (from Python and Node.js to PHP and C#), render dynamic content, access geo-specific content, and manage sessions and concurrent requests.
All of this by just sending your request through their servers. No complex or time-consuming setups.
Hiding Our IP with ScraperAPI
For this quick example, let’s use Node.js to send a request to http://httpbin.org/ip and console.log() the response, which will be your IP address:
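// Note: require() works with node-fetch v2; v3 is ESM-only and needs import instead.
const fetch = require("node-fetch");

const hide_ip = async () => {
  // httpbin.org/ip echoes back the IP address the request came from
  const response = await fetch("http://httpbin.org/ip");
  const html = await response.text();
  console.log(html);
};

hide_ip();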
In our case, here’s what we get:
{
"origin": "98.45.124.273"
}
Disclaimer: For privacy reasons, this IP address is just a placeholder (it isn’t even a valid address, since 273 is above the 255 limit), but you should see your real IP printed on your terminal.
Now, let’s add the ScraperAPI endpoint to the URL like so:
- ScraperAPI’s sync endpoint: https://api.scraperapi.com?
- Your API key: api_key=yourApiKey
- The target URL: url=http://httpbin.org/ip
To get your API key, create a free account and receive 5000 free API credits.
With everything put together, here’s what our code looks like now:
const fetch = require("node-fetch");

const hide_ip = async () => {
  const response = await fetch(
    "https://api.scraperapi.com?api_key=[yourApiKey]&url=http://httpbin.org/ip"
  );
  const html = await response.text();
  console.log(html);
};

hide_ip();
And here’s the response:
{
"origin": "107.165.192.39"
}
As you can see, our IP address changed from 98.45.124.273 to 107.165.192.39. When we send a request through ScraperAPI’s server (which is why we added its endpoint to the target URL), the API will send the request through one of its millions of IP addresses to ensure ours stays hidden. This keeps us anonymous and lets our scrapers access virtually any site without putting our own IP address at risk.
This IP rotation happens from request to request. So if we send another request:
{
"origin": "194.114.137.55"
}
… our IP address will be different every time. In other words, you’ll be able to access a pool of data center, mobile, and residential IP addresses without the overhead of maintaining them or their infrastructure.
Changing Your Geolocation
One of the reasons you might want to change your geolocation is to access geo-specific information from a website. Here’s an example: let’s create a simple Node.js scraper to access NewChic’s women’s t-shirts page.
Because NewChic’s site sells to several countries in different languages, their server will respond in the appropriate language based on our IP’s geographical location.
We’ll use Cheerio to build the parser:
const fetch = require("node-fetch");
const cheerio = require("cheerio");

const hide_ip = async () => {
  const response = await fetch(
    "https://api.scraperapi.com?api_key=[yourApiKey]&url=https://www.newchic.com/women-t-shirts-c-3666/"
  );
  const html = await response.text();
  const $ = cheerio.load(html);
  const allProducts = $(".mb-lg-32");
  allProducts.each((index, element) => {
    const productName = $(element).find(".product-item-name-js").text();
    console.log(productName);
  });
};

hide_ip();
Here’s the list printed on the terminal:
Solid Crew Neck Casual T-shirt
Stripe Pattern Half Placket T-shirt
Flower Print Casual T-Shirt
Casual Printed Overhead T-Shirt
Solid Long Sleeve V-neck T-shirt
Floral Embroidery Dish O-neck T-shirt
Figure Pattern Split Twist T-Shirt
Solid Color O-neck Long Sleeve T-Shirt
Contrast Color Button T-shirt
Solid Lace Patchwork Collar
Landscape Prints O-neck T-shirt
Solid O-neck Ruffle Patchwork T-Shirt
Floral Embroidery Short Sleeve T-Shirt
Floral Embroidery Casual T-Shirt
Cartoon Mushroom O-neck T-Shirt
Mushroom Print Loose O-neck T-Shirt
Contrast Color Long Sleeve T-shirt
Solid Lace Patchwork T-shirt
Abstract Geo Stripe Print Blouse
Floral Print O-neck Casual T-shirt
Corduroy Plaid Print Patchwork T-shirt
Corduroy Striped Print Patchwork T-shirt
Butterfly Graphic Print Casual T-shirt
Solid Tie-Back Crew Neck Casual T-shirt
Solid Button Short Sleeve T-shirt
Sun Graphic Crew Neck Casual T-shirt
Letters Graphic Crew Neck T-shirt
Graphic Curved Hem Casual T-shirt
Retro Graphic Curved Hem T-shirt
Letters Graphic Curved Hem T-shirt
Retro Graphic Curved Hem T-shirt
Solid Off-shoulder T-shirt
Solid Ripped Long Sleeve T-shirt
Geo Print O-neck Casual T-shirt
Mesh Stitch Cut Out T-shirt
Guipure Lace Mesh Stitch T-shirt
Solid Irregular Off Shoulder T-shirt
Butterfly Letter Graphic T-shirt
Letter Butterfly Graphic Casual T-shirt
Butterfly Letter Graphic T-shirt
Butterfly Graphic Short Sleeve T-shirt
Car Letters Graphic T-shirt
Figure Letters Graphic T-shirt
Butterfly Flower Letters Graphic T-shirt
Leisure Solid Asymmetrical Split T-Shirt
Leisure Solid Slit Short Sleeve T-Shirt
Leisure Solid Slit Short Sleeve T-Shirt
Tie Dye Short Sleeve T-shirt
Gradient Stripe Tie Dye T-shirt
Letters Animal Print Crew Neck T-shirts
Fan Embroidered Short Sleeve Blouse
Flower Print Long Sleeve T-shirt
Plaid Pattern Print Patchwork T-Shirt
Tie Dye Knitted Crop T-shirt
Solid Crisscross V-neck T-shirt
Solid Short Sleeve Irregular T-shirt
Butterfly Flower Graphic Casual T-shirt
Rainbow Rain Printed T-shirt
Letter Vegetation Printed O-neck T-shirt
Striped Print Slit Hem T-Shirt
When sending the request, ScraperAPI will randomly choose an IP (IP rotation) from the millions of available addresses. Because NewChic has a different version for certain countries, we could get different results based on the IP’s location.
Let’s take advantage of ScraperAPI and specify the country we want our request to be sent from. To make it obvious, let’s start by using the country_code=fr parameter in the URL like this:
const response = await fetch(
  "https://api.scraperapi.com?api_key=[yourApiKey]&url=https://www.newchic.com/women-t-shirts-c-3666/&country_code=fr"
);
Now the server responds with the French version of this page:
T-shirt décontracté uni à col rond
T-shirt à demi patte de boutonnage à rayures
T-shirt décontracté à imprimé fleuri
T-shirt uni à manches longues et col en V
T-shirt décontracté imprimé
T-shirt à col rond et broderie florale
T-shirt torsadé fendu à motif figure
T-shirt à manches longues et col rond de couleur unie
T-shirt à manches longues de couleur contrastante
T-shirt patchwork en dentelle unie
Chemisier à rayures géométriques abstraites
T-shirt décontracté à col rond et imprimé floral
T-shirt patchwork à carreaux en velours côtelé
T-shirt patchwork à rayures en velours côtelé
T-shirt décontracté à imprimé graphique papillon
T-shirt décontracté uni à col ras du cou et noué dans le dos
T-shirt uni à manches courtes et boutons
T-shirt décontracté à encolure ras du cou Sun Graphic
T-shirt ras du cou graphique lettres
T-shirt décontracté graphique à ourlet arrondi
T-shirt graphique rétro à ourlet arrondi
T-shirt graphique à ourlet arrondi avec lettres
T-shirt graphique rétro à ourlet arrondi
T-shirt uni à épaules dénudées
T-shirt uni à manches longues déchiré
T-shirt décontracté à col rond et imprimé géométrique
T-shirt à découpes en maille
T-shirt en maille et dentelle guipure
T-shirt uni irrégulier à épaules dénudées
T-shirt graphique à lettres papillons
T-shirt Décontracté Lettre Papillon Graphique
T-shirt graphique à lettres papillons
T-shirt graphique papillon à manches courtes
T-shirt graphique de lettres de voiture
T-shirt Graphique Lettres Chiffres
Papillon Fleur Lettres T-shirt graphique
T-shirt fendu asymétrique uni Leisure
T-shirt uni à manches courtes et fente Leisure
T-shirt uni à manches courtes et fente Leisure
T-shirt à manches courtes tie-dye
T-shirt tie-dye à rayures dégradées
T-shirts ras du cou à imprimé animal et lettres
Chemisier à manches courtes brodé d'éventails
T-shirt à manches longues à imprimé fleuri
T-shirt patchwork imprimé à carreaux
T-shirt court en maille tie-dye
T-shirt uni croisé à col en V
T-shirt irrégulier uni à manches courtes
T-shirt décontracté graphique à fleurs papillons
T-shirt rayé à ourlet fendu
T-shirt col rond imprimé multicolore
T-shirt col V imprimé multicolore
T-shirt à manches courtes de couleur contrastante
T-shirt boutonné à col rond et imprimé floral
T-shirt décontracté en coton patchwork
T-shirt fleuri à manches courtes et col rond
T-shirt ample à col en V et imprimés de paysages
T-shirt à fleurs papillon de couleur contrastée
T-shirt imprimé carte à manches courtes
T-shirt à col rond et imprimé floral
It’s important to use this feature to get geo-specific data from websites that change their content based on the IP location.
For example, search engines like Google will show different results based on location, so if we want to get the most accurate data possible from their result pages, it’s recommended to use the country parameter in the request.
Note: You can see a full list of features (like JS rendering with the render=true parameter) in our documentation.
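One caveat about the examples in this article: we glued the request URL together by hand, which works for simple targets but can break when the target URL has its own query string. As a hedged sketch (buildScraperApiUrl is our own helper name, not part of any SDK), here’s how you could assemble the same URL with URLSearchParams so every value gets percent-encoded:

```javascript
// Build a ScraperAPI request URL from its parts. URLSearchParams
// percent-encodes every value, so a target URL that carries its own query
// string (e.g. ?page=2&sort=asc) is not mistaken for extra API parameters.
function buildScraperApiUrl(apiKey, targetUrl, extraParams = {}) {
  const params = new URLSearchParams({
    api_key: apiKey,
    url: targetUrl,
    ...extraParams, // e.g. { country_code: "fr" } or { render: "true" }
  });
  return `https://api.scraperapi.com?${params.toString()}`;
}

// "yourApiKey" is a placeholder; swap in your real key.
console.log(
  buildScraperApiUrl("yourApiKey", "http://httpbin.org/ip", { country_code: "fr" })
);
```

The resulting string can be passed straight to fetch() in place of the hand-written URLs above.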
Wrapping Up
The logic behind web scraping is quite simple.
You could keep adding more elements to the parser and export all data to a CSV in a few more lines of code. The real challenge is scaling your project while keeping performance as high as possible.
The final goal is data, so a higher success rate and a flexible environment to adapt your scraper to the different challenges that will get in your way are key.
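As for the CSV export mentioned above, here’s one minimal way it could look, without any extra packages. The toCsv and toCsvField helpers are our own, the two product names come from the list we scraped earlier, and the position column is made up for the example:

```javascript
// Quote a single CSV field, doubling any embedded quotes.
function toCsvField(value) {
  return `"${String(value).replace(/"/g, '""')}"`;
}

// Turn an array of flat objects into CSV text with a header row.
function toCsv(rows, columns) {
  const header = columns.map(toCsvField).join(",");
  const lines = rows.map((row) =>
    columns.map((column) => toCsvField(row[column] ?? "")).join(",")
  );
  return [header, ...lines].join("\n") + "\n";
}

// Sample rows; in a real scraper these would be collected inside the
// Cheerio .each() loop instead of being logged.
const products = [
  { position: 1, name: "Solid Crew Neck Casual T-shirt" },
  { position: 2, name: "Stripe Pattern Half Placket T-shirt" },
];

console.log(toCsv(products, ["position", "name"]));
```

To save the result to disk, you could pass the returned string to fs.writeFileSync("products.csv", ...) instead of logging it.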
If you want to try ScraperAPI, you can create a free account at any point – no credit card required – and see how ScraperAPI will supercharge your process.
For inspiration and examples, our blog is full of use cases and ready-to-use code snippets.
Hope you learned a thing or two today. Until next time, happy scraping!