Python Requests: How to Use and Rotate Proxies

When you scrape the web, the target website may ban your IP address.

This is even more common when the target website uses anti-bot solutions provided by the likes of Cloudflare, Google, and Akamai. Therefore, you must use proxies to hide your real IP.

Because these proxies can also get banned by the target website, it is important to rotate them regularly. In this article, you will learn two ways to use and rotate proxies in Python: with ScraperAPI (the easier way) and with Requests alone (the more involved way).

1. Using and Rotating Proxies with Python (the Easy Way)

Using ScraperAPI is the easiest way to use and rotate proxies with Python. Without it, you need to constantly be on the lookout for fresh proxies that your target website has not already blocked. ScraperAPI takes this proxy rotation burden off your shoulders, so you can stop maintaining tooling and get the data you need straight away.

Sign up for ScraperAPI to get 5,000 free API credits. Next, get your API key from the dashboard.

Then you can use the ScraperAPI proxy using Requests like this:

import requests
proxies = {
  "http": "http://scraperapi:APIKEY@proxy-server.scraperapi.com:8001",
  "https": "http://scraperapi:APIKEY@proxy-server.scraperapi.com:8001"
}
r = requests.get('http://httpbin.org/ip', proxies=proxies, verify=False)
print(r.text)

Now, whenever you send a new request through the ScraperAPI proxy, ScraperAPI will rotate the proxy for you and use a new one for each request. It cannot get any simpler than this!
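If you would rather not hardcode the API key, one option is to read it from an environment variable and build the proxies dictionary from it. This is only a minimal sketch: the variable name SCRAPERAPI_KEY and the helper scraperapi_proxies are illustrative choices and not part of ScraperAPI, while the proxy URL format is the one from the listing above.

```python
import os


def scraperapi_proxies(api_key):
    """Build a Requests-style proxies dict for the ScraperAPI proxy endpoint."""
    proxy_url = f"http://scraperapi:{api_key}@proxy-server.scraperapi.com:8001"
    # The same proxy URL is used for both HTTP and HTTPS targets.
    return {"http": proxy_url, "https": proxy_url}


# SCRAPERAPI_KEY is a hypothetical variable name; "APIKEY" is a placeholder.
proxies = scraperapi_proxies(os.environ.get("SCRAPERAPI_KEY", "APIKEY"))
```

You can then pass proxies to requests.get exactly as in the previous listing.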

You just outsourced all of the proxy sourcing, validating, and rotating hassle to ScraperAPI and can focus more on your unique business logic. You can also rest assured that you will always have a fresh proxy available as ScraperAPI has access to 40 Million+ proxies!

2. Using and Rotating Proxies with Python (Traditional Way)

With the easy method out of the way, let’s look at a more traditional approach to proxy rotation in Python. As you will soon see, this method is way more involved and requires a lot of time and care to keep it running smoothly.

Step 1. Setting up the Prerequisites

Make sure you have Python installed on your system. You can use Python version 3.7 or higher for this tutorial. Go ahead and create a new directory where all the code for this project will be stored, and create an app.py file within it:

$ mkdir proxy_rotator
$ cd proxy_rotator
$ touch app.py

You also need to have requests installed. You can easily do that via pip:

$ pip install requests

Step 2. How to Source a Proxy List?

Before you can rotate proxies, you need a list of proxies. There are different lists available online. Some of them are paid, and some are free. Each has its own pros and cons. A very famous source of free proxies is Free Proxy List. The biggest issue with proxies from such free lists is that most of them might already be blocked by your target website, so you will have to do some testing to make sure the proxy you are using is unblocked.

You can download the proxy list from Free Proxy List into a txt file.

Note: If you go with the easy method as outlined earlier in the article, you will be happy to learn that ScraperAPI automatically monitors all of their proxies to make sure they are not blocked by the target website!

Step 3. Making a Request Without a Proxy

Let’s start by taking a look at how to make a request using requests without any proxy. You can do so in two different ways. You can either directly use the requests.get (or similar) method, or you can create a Session and use that to make requests.

The direct requests using requests.get can be made like this:

import requests 
html = requests.get("https://yasoob.me")
print(html.status_code)
# output: 200

The same request using Session can be made like this:

import requests
s = requests.Session()
html = s.get("https://yasoob.me")
print(html.status_code)
# Output: 200

It is important to discuss both of these methods as the process of using a proxy is slightly different for each of them.

Step 4. Using a Proxy with Requests

It is very straightforward to use a proxy with requests. You just need to provide requests with a dictionary containing the HTTP and HTTPS keys and their corresponding proxy URL. You can use the same proxy URL for both of these protocols.

Note: As this article uses free proxies, the proxy URLs in the code blocks might stop working by the time you are reading them. You can follow along by replacing the proxy URLs in the code samples with working proxies from Free Proxy List.

Here is some sample code for using a proxy in requests without creating a Session object:

import requests

proxies = {
   'http': 'http://47.245.97.176:9000',
   'https': 'http://47.245.97.176:9000',
}

response = requests.get('https://httpbin.org/ip', proxies=proxies)
print(response.text)
# Output: {
#  "origin": "47.245.97.176"
# }

And here is the same example with the Session object:

import requests

proxies = {
   'http': 'http://47.245.97.176:9000',
   'https': 'http://47.245.97.176:9000',
}

s = requests.Session()
s.proxies = proxies
response = s.get('https://httpbin.org/ip')
print(response.text)
# Output: {
#  "origin": "47.245.97.176"
# }

It is common to get the CERTIFICATE_VERIFY_FAILED SSL error while using free proxies. Here is what the error looks like:

requests.exceptions.SSLError: HTTPSConnectionPool(host='httpbin.org', port=443): Max retries exceeded with url: /ip (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)')))

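You can get around this error by passing verify=False to the get method. Keep in mind that this disables TLS certificate verification, so the proxy could read or alter the traffic; use it only for testing or non-sensitive data: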

requests.get('https://httpbin.org/ip', proxies=proxies, verify=False)

# or

s.get('https://httpbin.org/ip', verify=False)

Step 5. Using an Authenticated Proxy with Requests

It is equally straightforward to use authenticated proxies with requests. You just need to amend the proxies dictionary and provide the username and password for each proxy URL:

proxies = {
   'http': 'http://username:password@proxy.com:8080',
   'https': 'http://username:password@proxy.com:8081',
}

Replace username and password with working credentials and you are good to go. The rest of the code for making requests will stay as it is in the previous code samples.
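One detail worth checking is credentials that contain characters such as @ or :, which would otherwise be confused with the URL's own separators. A common approach is to percent-encode them before building the proxy URL. The sketch below assumes this is what your proxy provider expects; build_proxy_url is a hypothetical helper and the password is made up for illustration.

```python
from urllib.parse import quote


def build_proxy_url(username, password, host, port, scheme="http"):
    """Return a proxy URL with percent-encoded credentials."""
    # safe="" makes quote() encode every reserved character, including "/".
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"{scheme}://{user}:{secret}@{host}:{port}"


# "p@ss:word" is an illustrative password containing reserved characters.
proxy_url = build_proxy_url("username", "p@ss:word", "proxy.com", 8080)
proxies = {"http": proxy_url, "https": proxy_url}
print(proxy_url)
# Output: http://username:p%40ss%3Aword@proxy.com:8080
```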

Step 6. Setting a Proxy Via Environment Variables

You can also use proxies without adding any proxy-specific code to Python. This is possible by setting appropriate environment variables. requests honors the HTTP_PROXY and HTTPS_PROXY environment variables. If these are set, requests will use their corresponding value as the appropriate proxy URL.

You can set these environment variables on a Unix-like system by opening up the terminal and entering these commands:

export HTTP_PROXY='http://47.245.97.176:9000'
export HTTPS_PROXY='http://47.245.97.176:9000'

Now you can remove any proxy-specific code from your Python program and it will automatically use the proxy endpoint set via these environment variables!

Try it out by running this code and make sure the output matches the proxy endpoint set via the environment variables:

import requests

# No proxies argument: the proxy comes from HTTP_PROXY / HTTPS_PROXY
response = requests.get('https://httpbin.org/ip')
print(response.text)
# Output: {
#  "origin": "47.245.97.176"
# }
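If the output does not match, it can help to check which proxy settings are actually visible to your Python process. The sketch below sets the variables for the current process only and reads them back with the standard library's getproxies_environment. Treat it as a diagnostic aid: it shows what the standard library sees, and it assumes Requests resolves environment proxies in a comparable way. The lowercase variants are cleared first because the standard library prefers them when both are set.

```python
import os
from urllib.request import getproxies_environment

# Clear lowercase variants first; the standard library prefers them if set.
os.environ.pop("http_proxy", None)
os.environ.pop("https_proxy", None)

# Set the variables for this process only (same values as the export commands).
os.environ["HTTP_PROXY"] = "http://47.245.97.176:9000"
os.environ["HTTPS_PROXY"] = "http://47.245.97.176:9000"

# Map of scheme -> proxy URL as seen through the environment.
env_proxies = getproxies_environment()
print(env_proxies.get("https"))
```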

Step 7. Rotating Proxies with Each Request

As mentioned in the introduction, proxies can also get blocked. Therefore, it is important to rotate proxies and to try not to use a single proxy for multiple requests. Let’s take a look at how you can rotate proxies in Python using requests.

Loading proxies from a proxy list

To get started, save the proxies from Free Proxy List into a proxy_list.txt file in the proxy_rotator directory. Here is what the file will look like:

196.20.125.157:8083
47.245.97.176:9000
54.39.132.131:80
183.91.3.22:11022
154.236.179.226:1981
41.65.46.178:1981
89.175.26.210:80
61.216.156.222:60808
115.144.99.220:11116
...
167.99.184.232:3128

Now open up the app.py file and write the following code to load these proxies into a list:

def load_proxy_list():
    with open("proxy_list.txt", "r") as f:
        proxy_list = f.read().strip().split()
    return proxy_list
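Lists copied by hand often contain blank lines, duplicates, or stray text. If that is a concern, a slightly stricter loader could look like the sketch below. The parse_proxy_lines helper and its rules (skip comments, require a numeric port, drop duplicates) are assumptions made for illustration rather than part of the original rotator.

```python
def parse_proxy_lines(text):
    """Return unique 'host:port' entries from raw text, in first-seen order."""
    proxies = []
    seen = set()
    for line in text.splitlines():
        entry = line.strip()
        # Skip blank lines and comment lines.
        if not entry or entry.startswith("#"):
            continue
        host, sep, port = entry.rpartition(":")
        # Keep only entries shaped like host:port with a numeric port.
        if not (sep and host and port.isdigit()):
            continue
        if entry not in seen:
            seen.add(entry)
            proxies.append(entry)
    return proxies


def load_proxy_list(path="proxy_list.txt"):
    with open(path, "r") as f:
        return parse_proxy_lines(f.read())


sample = "196.20.125.157:8083\n\n# comment\n...\n196.20.125.157:8083\n47.245.97.176:9000\n"
print(parse_proxy_lines(sample))
# Output: ['196.20.125.157:8083', '47.245.97.176:9000']
```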

Verify the proxy works

Now that you have a list of proxies, it is important to test that all the proxies in the list are working and to get rid of the ones that are not working. You can test this by sending a request to httpbin via the proxy and making sure the response contains the proxy IP. If the request fails for some reason, you can discard the proxy.

You can make the discarding process more fine-grained by making sure the request failed due to an issue with the proxy and not because of an unrelated network issue. For now, let’s keep things simple and discard a proxy whenever there is any error (exception). Here is some code that does this:

def check_proxy(proxy_string):
    proxies = {
        'http': f'http://{proxy_string}',
        'https': f'http://{proxy_string}',
    }

    try:
        response = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=30)
        if response.json()['origin'] == proxy_string.split(":")[0]:
            # Proxy works
            return True
        # Proxy doesn't work
        return False
    except Exception:
        return False

The code is fairly straightforward. You pass a proxy string (e.g., 0.0.0.0:8080) to check_proxy as an argument, and check_proxy sends a request to httpbin.org/ip through that proxy. If the response contains the proxy IP, the function returns True; if it does not (or if the request fails), it returns False. The code also defines a timeout for each request. If no response arrives within the timeout, an exception is raised and the proxy is treated as not working, which keeps slow proxies out of your list.
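With a 30-second timeout, checking a long list one proxy at a time can take a while. One way to speed this up is to run the checks in a thread pool. The sketch below is an assumption about how you might do that: filter_working_proxies is a hypothetical helper that accepts any check function, and fake_check stands in for check_proxy so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_working_proxies(proxy_list, check, max_workers=20):
    """Return the proxies for which check(proxy) is truthy, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as proxy_list.
        results = list(executor.map(check, proxy_list))
    return [proxy for proxy, ok in zip(proxy_list, results) if ok]


# Stand-in check for demonstration; in app.py you would pass check_proxy instead.
def fake_check(proxy_string):
    return not proxy_string.endswith(":80")


print(filter_working_proxies(["47.245.97.176:9000", "54.39.132.131:80"], fake_check))
# Output: ['47.245.97.176:9000']
```

Because check_proxy catches every exception and returns False, a failing proxy does not interrupt the other checks.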

Rotating the proxy with each request

You can now couple the functions in the previous two code listings and use them to rotate the proxy with each request. Here is one potential way of doing it:

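from random import choice

import requests

def get_working_proxy():
    # Keep drawing random proxies until one passes check_proxy()
    while proxy_list:
        random_proxy = choice(proxy_list)
        if check_proxy(random_proxy):
            return random_proxy
        # Discard the proxy that failed the check
        proxy_list.remove(random_proxy)
    raise RuntimeError("No working proxies left in proxy_list")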

def load_url(url):
    proxy = get_working_proxy()
    proxies = {
        'http': f'http://{proxy}',
        'https': f'http://{proxy}',
    }
    response = requests.get(url, proxies=proxies)
    
    # parse the response
    # ...

    return response.status_code

urls_to_scrape = [
    "https://news.ycombinator.com/item?id=36580417",
    "https://news.ycombinator.com/item?id=36575784",
    "https://news.ycombinator.com/item?id=36577536",
    # ...
]
proxy_list = load_proxy_list()

for url in urls_to_scrape:
    print(load_url(url))

Let’s dissect this code a little. It contains a get_working_proxy() function that picks a random proxy from the proxy list, verifies that it works, and then returns it. If the proxy doesn’t work as expected, the function removes this proxy from the proxy list. Then there is the load_url() function. It gets a working proxy by calling the get_working_proxy() function and uses the returned proxy to route the request to the target URL. Finally, there is some code to start the scraping process. The important thing to note here is that a random proxy is used for each request, and this helps spread the scraping load over multiple proxies.
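A proxy that passed the check can still fail a moment later on the real request. One possible way to handle this is to retry with a different proxy when the request raises an exception. In the sketch below, fetch_with_rotation and its fetch parameter are hypothetical: fetch is any callable that takes a URL and a proxy string, which in app.py could wrap requests.get. fake_fetch simulates a dead proxy so the example runs offline.

```python
import random


def fetch_with_rotation(url, proxy_list, fetch, max_attempts=3):
    """Try up to max_attempts random proxies; drop each proxy that fails."""
    last_error = None
    for _ in range(max_attempts):
        if not proxy_list:
            break
        proxy = random.choice(proxy_list)
        try:
            return fetch(url, proxy)
        except Exception as exc:  # kept broad, like check_proxy above
            last_error = exc
            proxy_list.remove(proxy)
    raise RuntimeError(f"no working proxy for {url}") from last_error


# Stand-in for a real request function; "bad:1" simulates a dead proxy.
def fake_fetch(url, proxy):
    if proxy == "bad:1":
        raise ConnectionError("proxy down")
    return f"fetched {url} via {proxy}"


pool = ["bad:1", "good:2"]
print(fetch_with_rotation("https://example.com", pool, fake_fetch))
# Output: fetched https://example.com via good:2
```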

How to improve the proxy rotator

There are many things you can do to improve the naive proxy rotator you have created so far. First, you can revise the exception handling so that a proxy is discarded only when the proxy itself is at fault. You can also recheck discarded proxies after a while, because free proxies frequently cycle between working and non-working states. Finally, you can add logic to load the proxies directly from the Free Proxy List website rather than saving them manually to a txt file first.
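As a rough illustration of the "recheck after a while" idea, the sketch below benches a failed proxy for a cooldown period instead of deleting it for good. The ProxyPool class, its method names, and the 600-second default are all assumptions made for this example; the clock parameter exists only so the behavior can be tested without waiting.

```python
import random
import time


class ProxyPool:
    """Hands out random proxies and benches failed ones for a cooldown period."""

    def __init__(self, proxies, cooldown_seconds=600, clock=time.monotonic):
        self.active = list(proxies)
        self.benched = {}  # proxy -> time at which it may be retried
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock

    def _restore_expired(self):
        # Move proxies whose cooldown has elapsed back into rotation.
        now = self.clock()
        for proxy, retry_at in list(self.benched.items()):
            if retry_at <= now:
                del self.benched[proxy]
                self.active.append(proxy)

    def get(self):
        self._restore_expired()
        if not self.active:
            raise LookupError("no proxies available right now")
        return random.choice(self.active)

    def mark_failed(self, proxy):
        if proxy in self.active:
            self.active.remove(proxy)
            self.benched[proxy] = self.clock() + self.cooldown_seconds
```

In the rotator, you would call pool.get() in place of get_working_proxy() and pool.mark_failed(proxy) whenever a check or request fails.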

Conclusion

You have now learned how to use proxies with requests in Python and how to source, verify, and rotate them. You may be wondering which method is best. You can choose the more traditional way, but be prepared to tweak the code often and to keep your proxy list up to date. It can become time-consuming and interrupt your data collection. The best way is to use a tool that does proxy rotation for you, so you can get the data you need quickly and at a large scale.

About the author

Yasoob Khalid

Yasoob is a renowned author, blogger and a tech speaker. He writes regularly on his personal blog and has authored the Intermediate Python and Practical Python Projects books. He is currently working on Azure at Microsoft.
