Using proxies while web scraping allows you to access websites anonymously, helping you avoid issues like IP bans or rate limiting. By sending your requests through proxies, you effectively create a buffer between yourself and the target site, concealing your actual IP address.
In this article, we’ll dive into how to:
- Use proxies with the Python Requests library
- Rotate proxies to ensure we’re undetected
- Retry failed requests to make our scrapers more resilient
Allowing you to build consistent data pipelines for your team and projects.
ScraperAPI smart IP and header rotation lets you collect public web data consistently with a simple API call.
Let’s get started and make web scraping a breeze with proxies!
TL;DR: Using Proxies in Python
For those familiar with web scraping or API interaction in Python, using proxies in your Requests workflow is straightforward.
Here’s how to use python requests with proxies:
- Obtain a Proxy: Secure a proxy address. It usually looks something like “http://your_proxy:port.”
- Utilize Python Requests: Import requests and configure it to use your proxy.
- Configure Your Request with the Proxy: Pass your proxy to the proxies argument when making a request.
- Bypass Restrictions: Using proxies reduces the risk of being blocked by websites, enabling smoother data collection.
Here’s a snippet of how this looks in your code:
import requests
# Replace 'http://your_proxy:port' with your actual proxy address
proxies = {
    'http': 'http://your_proxy:port',
    'https': 'http://your_proxy:port',
}
# Target URL you want to scrape
url = 'http://example.com'
# Making a get request through the proxy
response = requests.get(url, proxies=proxies)
print(response.text)
This code sends a request to “http://example.com”
via the proxy
you specify in the proxies dictionary. You can adjust the URL to match the
target website you wish to scrape. Similarly, update the proxy dictionary with
your proxy server details.
However, this only solves part of the problem. You must still build, maintain, and prune a large proxy pool and write the logic for rotating your proxies.
To make things simpler, we recommend using a tool like ScraperAPI to:
- Access a pool of 40M+ proxies across 50+ countries
- Automate smart IP and header rotation
- Scrape geo-locked or localized data using ScraperAPI’s built-in geotargeting
And add a toolset of scraping solutions that’ll make scraping the web a breeze.
To get started, create a free ScraperAPI account, copy your API key, and send your requests through the API:
import requests
payload = {
    'api_key': 'YOUR_API_KEY',
    'url': 'www.example.com',
    'country': 'us'
}
r = requests.get('https://api.scraperapi.com', params=payload)
print(r.text)
Every time you send a request, ScraperAPI will use machine learning and years of statistical analysis to pick the right IP and header combination to ensure a successful request.
Want to go more in-depth into using proxies with requests? Keep reading!
How to Use a Proxy with Python Requests
Step 1: Select Your Ideal Proxy
The first step in your journey is to choose a suitable proxy. You might opt for a private HTTP or HTTPS proxy, depending on your specific needs. This proxy type offers a dedicated IP address, increasing stability and speed while providing a more secure and private connection.
Note: You can also use ScraperAPI proxy mode to access our IP pool.
Step 2: Import Python Requests
Before you can send requests through proxies, you’ll need to have the Python Requests library ready to go. You can install it using the command pip install requests.
Import it into your script to get started:
import requests
Step 3: Configure Your Proxy
Once you have your proxy, it’s time to use it in your code. Replace ‘http://your_proxy:port’ and ‘https://your_proxy:port’ with your proxy’s details:
import requests
proxies = {
    'http': 'http://your_proxy:port',
    'https': 'https://your_proxy:port',
}
response = requests.get("http://example.com", proxies=proxies)
print(response.text)
This code routes your requests through the proxy, concealing your real IP address.
Step 4: Authenticate Your Proxy
If your proxy needs a username and password, add them to the proxy URL:
proxies = {
    'http': 'http://user:password@your_proxy:port',
    'https': 'https://user:password@your_proxy:port',
}
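If you’d rather not hardcode credentials in your script, one option is to pull them from environment variables instead. Here’s a minimal sketch of that approach; the PROXY_USER, PROXY_PASS, and PROXY_HOST variable names are just placeholders for illustration:

import os
import requests

# Hypothetical environment variable names -- adjust them to match your own setup
proxy_user = os.environ["PROXY_USER"]
proxy_pass = os.environ["PROXY_PASS"]
proxy_host = os.environ["PROXY_HOST"]  # e.g. "your_proxy:port"

proxies = {
    'http': f'http://{proxy_user}:{proxy_pass}@{proxy_host}',
    'https': f'http://{proxy_user}:{proxy_pass}@{proxy_host}',
}

response = requests.get("http://example.com", proxies=proxies)
print(response.status_code)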
Step 5: Rotate Your Proxies (Advanced)
If you frequently scrape the same website, it would be a good practice to rotate your proxies. This means using a different proxy from a pool of proxies for each request.
Here’s a basic method to do it:
import random
# Your list of proxies
proxy_pool = [
    'http://proxy1:port',
    'http://proxy2:port',
]
# Selecting a random proxy
proxy = random.choice(proxy_pool)
proxies = {'http': proxy, 'https': proxy}
# Making your request with the selected proxy
response = requests.get("http://example.com", proxies=proxies)
We’ll explore managing a larger pool of proxies more in-depth later in the article, but for now, that’s it!
With these straightforward steps, you’re equipped to use proxies with Python Requests, making your web scraping efforts more effective.
Get access to a well-maintained pool of data center, residential, and mobile proxies.
Using ScraperAPI Proxy Mode with Requests
ScraperAPI with Python Requests simplifies your web scraping by letting it handle proxies, CAPTCHAs, and headers for you.
Just like with regular proxies, you would use ScraperAPI’s proxy port the same way:
import requests
proxies = {
    "http": "http://scraperapi:APIKEY@proxy-server.scraperapi.com:8001"
}
r = requests.get('http://httpbin.org/ip', proxies=proxies, verify=False)
print(r.text)
Just remember to replace APIKEY
with your real API key. You’re
now using ScraperAPI’s proxy pool and all its infrastructure.
Choosing ScraperAPI means you won’t have to juggle with free proxies anymore. It automatically changes IP addresses and tries again if a request doesn’t get through, making your scraping jobs more reliable and less of a headache.
Use ScraperAPI’s Proxy Mode with Selenium
When using Selenium to scrape dynamic content, ScraperAPI can enhance the efficiency of your setup by handling proxy management through Selenium Wire.
Selenium Wire extends Selenium’s capabilities, making it easier to customize request headers and use proxies.
Here’s how you can use ScraperAPI with Selenium Wire:
from seleniumwire import webdriver

# Replace 'YOUR_SCRAPERAPI_KEY' with your actual ScraperAPI key.
options = {
    'proxy': {
        'http': 'http://scraperapi:YOUR_SCRAPERAPI_KEY@proxy-server.scraperapi.com:8001',
        'https': 'http://scraperapi:YOUR_SCRAPERAPI_KEY@proxy-server.scraperapi.com:8001',
        'no_proxy': 'localhost,127.0.0.1'
    }
}

driver = webdriver.Chrome(seleniumwire_options=options)

# The website you're aiming to scrape.
driver.get("http://example.com")
This setup routes your Selenium-driven browser sessions through ScraperAPI’s robust proxy network and takes care of IP rotation and request retries. It significantly reduces the complexity of dealing with dynamic content and CAPTCHAs, making your scraping efforts more successful and less time-consuming than managing traditional proxies yourself.
But what if you want to do IP rotation yourself? Don’t worry, we’ll cover that too!
How to Rotate Proxies with Python
As your scraping projects get larger and more complex, you’ll notice that using the same proxy all the time can cause problems like IP blocks and rate limits. A great way to solve this is by rotating proxies, which means changing your IP addresses regularly to keep your scraping hidden and avoid detection.
Step 1: Gather Your Proxies
First, you’ll need to compile a list of proxies. Here’s a free proxy list you can use.
Keep in mind that if you’re using proxies outside of ScraperAPI’s pool, the site you plan to scrape might already have blocked them, so we need to test them before implementation.
For this, create a file named proxy_list.txt in a folder called proxy_rotator and paste the downloaded proxies there.
Here’s an example of what your file might look like:
103.105.196.212:80
38.145.211.246:8899
113.161.131.43:80
172.235.5.40:8888
116.203.28.43:80
172.105.219.4:80
35.72.118.126:80
139.99.244.154:80
50.222.245.42:80
50.222.245.50:80
Step 2: Load the Proxy List
Now, let’s define the function to load your list of proxies. We’ll create a
function called fetch_proxies()
that reads the contents of the
proxy_list.txt file and returns a list of proxies.
def fetch_proxies():
    proxies = []
    with open("proxy_list.txt") as file:
        for line in file:
            clean_line = line.strip()
            if clean_line:
                proxies.append(clean_line)
    return proxies
This function opens the proxy_list.txt file, reads each line, strips any leading or trailing whitespace, and adds the cleaned proxy to the proxies list. Finally, it returns the list of proxies.
Step 3: Validate Proxies
Now that we have our function to fetch proxies, we must ensure they are valid
and working before using them. We’ll create another function,
validate_proxy()
, to test each proxy’s functionality.
import requests

def validate_proxy(proxy):
    proxy_dict = {
        'http': f'http://{proxy}',
        'https': f'http://{proxy}',
    }
    try:
        response = requests.get('https://httpbin.org/ip', proxies=proxy_dict, timeout=30)
        if response.json()['origin'] == proxy.split(":")[0]:
            return True
        return False
    except Exception:
        return False
This function takes a proxy as input and attempts to make a request using that proxy to ‘https://httpbin.org/ip‘. If the request is successful and the returned IP matches the proxy’s IP, we consider the proxy valid. Otherwise, it’s invalid.
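If you prefer to weed out dead proxies up front rather than testing them lazily at request time, you could filter the whole pool once using the two helpers above. Here’s a quick optional sketch — keep in mind it can be slow with a long list, since each check waits up to 30 seconds:

proxies = fetch_proxies()

# Keep only the proxies that pass the httpbin.org check
working_proxies = [proxy for proxy in proxies if validate_proxy(proxy)]
print(f"{len(working_proxies)} of {len(proxies)} proxies are working")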
Step 4: Finding a Working Proxy
Next, create a function to find a working proxy from our list. We’ll call it
find_active_proxy()
. This function will randomly select a proxy
from a list of proxies, test it using validate_proxy()
, and keep
trying until it finds a valid one.
from random import choice

def find_active_proxy(proxies):
    selected_proxy = choice(proxies)
    while not validate_proxy(selected_proxy):
        proxies.remove(selected_proxy)
        if not proxies:
            raise Exception("No working proxies available.")
        selected_proxy = choice(proxies)
    return selected_proxy
Step 5: Fetching Data with the Active Proxy
It’s time to utilize the active proxy to fetch data from your target URL. The
fetch_url()
function selects an active proxy, configures it for
the request, and attempts to make a GET request to the specified URL using the
proxy. If successful, it returns the status code of the response.
This step ensures that we’re retrieving data through a rotated proxy, increasing reliability and keeping us anonymous in our web scraping tasks.
def fetch_url(url, proxies):
    proxy = find_active_proxy(proxies)
    proxy_setup = {
        'http': f'http://{proxy}',
        'https': f'http://{proxy}',
    }
    try:
        response = requests.get(url, proxies=proxy_setup)
        return response.status_code
    except requests.exceptions.RequestException:
        return "Failed to fetch URL"
Step 6: Rotate Through Your Proxy Pool
Now that we’ve set everything up, let’s bring it all together.
Let’s start by loading our proxies using the
fetch_proxies()
function. Once our proxies are ready, we’ll
iterate through a list of URLs, scraping each with a rotated proxy. You can
add as many URLs as you’d like to the urls_to_scrape
list.
proxies = fetch_proxies()

# (you can add more here)
urls_to_scrape = ["https://example.com/"]

for url in urls_to_scrape:
    print(fetch_url(url, proxies))
This setup routes each request through a randomly selected active proxy, ensuring smooth and efficient data retrieval from each URL. Rotating proxies like this increases reliability and helps prevent IP-based blocking, allowing for uninterrupted scraping.
ScraperAPI handles IP rotation using machine learning and statistical analysis techniques, letting you focus on what matters: data.
How to Rotate Proxies Using Asyncio and aiohttp
Using aiohttp
for asynchronous proxy rotation enhances the
efficiency of web scraping operations, allowing for multiple requests to be
handled simultaneously.
We’ll modify our previous example’s fetch_url()
function to use
aiohttp
for asynchronous proxy rotation. This allows multiple
requests to be handled simultaneously, which is essential when dealing with
large data collection tasks requiring high performance and avoiding detection.
import aiohttp
import asyncio

async def fetch_url(url, proxies):
    # Select an active proxy from the list each time the function is called
    proxy = find_active_proxy(proxies)
    proxy_url = f'http://{proxy}'
    print(f'Using proxy: {proxy}')
    # Create an HTTP client session
    async with aiohttp.ClientSession() as session:
        # Send a GET request to the URL using the selected proxy
        async with session.get(url, proxy=proxy_url) as response:
            print(f'Status: {response.status}')
            print(await response.text())
Now, let’s implement the main()
function to manage multiple URLs:
async def main(proxies):
    urls_to_scrape = [
        'http://httpbin.org/get'  # List the URLs you want to scrape here.
    ]
    for url in urls_to_scrape:
        await fetch_url(url, proxies)
Finally, you need to initialize and run the main function:
proxies = fetch_proxies()
asyncio.run(main(proxies))
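Note that awaiting each URL inside a loop still processes them one at a time. If you want the requests to actually run concurrently, one option is to schedule them together with asyncio.gather(). Here’s a minimal sketch of that variant, reusing the fetch_url() coroutine defined above:

async def main(proxies):
    urls_to_scrape = [
        'http://httpbin.org/get',
        'http://httpbin.org/ip',  # add as many URLs as you like
    ]
    # Schedule every request at once and wait for them all to complete
    await asyncio.gather(*(fetch_url(url, proxies) for url in urls_to_scrape))

proxies = fetch_proxies()
asyncio.run(main(proxies))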
Using aiohttp
for asynchronous proxy rotation ensures that our
web scraping tasks are more efficient, which speeds up the data extraction
process and significantly enhances the ability to manage large-scale scraping
tasks.
How to Rotate Proxies with Selenium
Using Selenium to rotate proxies is ideal for web scraping tasks requiring interaction with JavaScript-heavy websites or simulating user behavior.
We’ll modify the fetch_url()
function from our previous example
and use the Selenium library to achieve this.
from selenium import webdriver

def fetch_url(url, proxies):
    # Select an active proxy
    proxy = find_active_proxy(proxies)
    print(f"Using proxy: {proxy}")
    # Set up proxy for Selenium
    options = webdriver.ChromeOptions()
    options.add_argument(f'--proxy-server={proxy}')
    # Initialize Chrome driver with proxy
    driver = webdriver.Chrome(options=options)
    try:
        # Load the URL
        driver.get(url)
    except Exception as e:
        print(f"Failed to fetch URL: {str(e)}")
    finally:
        driver.quit()

# Load your initial list of proxies
proxies = fetch_proxies()

# URLs to scrape
urls_to_scrape = ["https://example.com/"]

# Scrape each URL with a rotated proxy using Selenium
for url in urls_to_scrape:
    fetch_url(url, proxies)
In the modified fetch_url()
function, we utilize Selenium’s
WebDriver to interact with the Chrome browser. We configure the WebDriver to
use the selected proxy for each request, enabling us to route our traffic
through different IP addresses.
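As written, fetch_url() only loads the page and then quits the browser. In practice you’ll usually want to capture something from the page before closing it. Here’s a small sketch of one variant that returns the rendered HTML; returning driver.page_source is just one option — you could also locate specific elements instead:

def fetch_url(url, proxies):
    proxy = find_active_proxy(proxies)
    print(f"Using proxy: {proxy}")
    options = webdriver.ChromeOptions()
    options.add_argument(f'--proxy-server={proxy}')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Grab the fully rendered HTML before closing the browser
        return driver.page_source
    except Exception as e:
        print(f"Failed to fetch URL: {str(e)}")
        return None
    finally:
        driver.quit()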
By combining Selenium with proxy rotation, we can conduct advanced web scraping tasks more effectively, ensuring reliability and anonymity throughout the process.
Proxy Rotation with ScraperAPI
Now that we’ve learned how to rotate proxies at a basic level, it’s clear that applying this to handling large datasets would involve a lot more complexity. Using a tool like ScraperAPI is a smart choice for a more straightforward and reliable way to manage rotated proxies.
Here’s why ScraperAPI can be a game-changer for your proxy management:
- Simplify Your Workflow: ScraperAPI takes on the heavy lifting of managing and rotating proxies so you can focus on what matters most—your data.
- Smart Proxy Rotation: ScraperAPI uses smart rotation based on machine learning and statistical analysis to intelligently rotate proxies, ensuring you always have the best connection for your needs.
- Maintain Proxy Health: You won’t have to worry about the upkeep of your proxies. ScraperAPI automatically weeds out non-working proxies, keeping your pool fresh.
- Ready to Scale: No matter the size of your project, ScraperAPI scales to meet your demands without missing a beat, which is perfect for growing projects.
By choosing ScraperAPI, you remove the complexity of manual proxy management and gain a straightforward, efficient tool that lets you focus on extracting and utilizing your data effectively.
Retry Failed Requests
Sometimes, we get failed requests due to network problems or other unexpected issues. In this section, we’ll explore two main ways to retry failed requests with Python Requests:
- Using an existing retry wrapper: This method is perfect for a quick and easy fix. It uses tools already available in Python to handle retries, saving you time and effort.
- Coding Your Own Retry Wrapper: If you need something more tailored to your specific needs, this method lets you build your own retry system from scratch.
However, before we decide on the best approach, we need to understand why our requests are failing.
Common Causes of Request Failures
Understanding the common problems that can cause your HTTP requests to fail will help you better prepare and implement effective retry strategies.
Here are three major causes of request failures:
Network Issues
Network issues are one of the most common reasons for failed HTTP requests. These can range from temporary disruptions in your internet connection to more significant network outages affecting larger areas. When the network is unstable, your requests might time out or get lost in transit, leading to failed attempts at retrieving or sending data.
Server Overload
Another typical cause of request failures is server overload. When the server you are trying to communicate with receives more requests than it can handle, it might start dropping incoming connections or take longer to respond. This delay can lead to timeouts, where your request isn’t processed in the expected time frame, causing it to fail.
Rate Limiting
Rate limiting is a control mechanism that APIs use to limit the number of requests a user can make in a certain period. If you send too many requests too quickly, the server might block your additional requests for a set period. This is a protective measure to prevent servers from being overwhelmed and ensure fair usage among all users.
Understanding the rate limits of the APIs you are working with is crucial, as exceeding these limits often results in failed requests.
By identifying and understanding these common issues, you can better tailor your retry logic to address specific failure scenarios, thereby improving the reliability of your HTTP requests.
Diagnosing Your Failed Requests
Once you understand the common causes of request failures, the next step is learning how to diagnose these issues when they occur. This involves identifying the problem and choosing the right strategy to handle it.
Identifying the Issue
One of the most straightforward methods to diagnose why a request failed is to look at the HTTP status codes returned. These codes are standard responses that tell you whether a request was successful and, if not, what went wrong. For instance:
- 5xx errors indicate server-side issues.
- 4xx errors suggest problems with the request, like unauthorized access or requests for nonexistent resources.
- Timeouts often do not come with a status code but are critical to identify as they indicate potential network or server overload issues.
Here are some of the most common status codes you might encounter while web scraping, which indicate different types of errors:
| Status code | Meaning |
| --- | --- |
| 200 OK | The request has succeeded. This status code indicates that the operation was successfully received, understood, and accepted. |
| 404 Not Found | The requested resource cannot be found on the server. This is common when the target webpage has been moved or deleted. However, it can also mean your scraper has been blocked. |
| 500 Internal Server Error | A generic error message returned when the server encounters an unexpected condition. |
| 502 Bad Gateway | The server received an invalid response from the upstream server it accessed while attempting to fulfill the request. |
| 503 Service Unavailable | The server is currently unable to handle the request due to a temporary overload or scheduled maintenance. |
| 429 Too Many Requests | This status code is crucial for web scrapers, as it indicates that you have hit the server’s rate limit. |
These status codes indicate what might be going wrong, allowing you to adjust your request strategy accordingly.
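To make this concrete, here’s a tiny sketch that sorts responses into “worth retrying” and “not worth retrying” based on the codes above — the exact grouping is a judgment call for your own project:

import requests

def should_retry(status_code):
    # Transient, server-side, and rate-limit errors are usually worth retrying
    if status_code in (429, 500, 502, 503, 504):
        return True
    # Most other 4xx errors (401, 403, 404, ...) point to a problem with the
    # request itself or a block, so repeating the same request rarely helps
    return False

response = requests.get('http://example.com')
if not response.ok and should_retry(response.status_code):
    print(f"Got {response.status_code}; this request is a good candidate for a retry")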
Tools and Techniques
To further diagnose network and server issues, consider using the following tools:
- Network diagnostic tools: Tools like Wireshark or Ping can help you understand whether network connectivity issues affect your requests.
- HTTP clients: Tools like Postman or curl allow you to manually send requests and inspect the detailed response from servers, including headers that may contain retry-after fields in case of rate limiting.
- Logging: Ensure your scraping scripts log enough details about failed requests. This can include the time of the request, the requested URL, the received status code, and any server response messages. This information is crucial for diagnosing persistent issues and improving the resilience of your scripts (see the sketch just after this list).
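For the logging point above, a minimal sketch of a wrapper that records failed requests might look like this; the scraper.log file name and the log format are arbitrary choices you’d adapt to your own setup:

import logging
import requests

logging.basicConfig(filename="scraper.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def logged_get(url, **kwargs):
    try:
        response = requests.get(url, **kwargs)
        if not response.ok:
            # Record the URL, status code, and a slice of the body for later diagnosis
            logging.warning("Request to %s failed with %s: %s",
                            url, response.status_code, response.text[:200])
        return response
    except requests.exceptions.RequestException as exc:
        logging.error("Request to %s raised %s", url, exc)
        return None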
By effectively using these diagnostic tools and techniques, you can quickly identify the causes of failed requests, making it easier to apply the appropriate solutions to maintain the efficiency and effectiveness of your web scraping tasks.
Solutions to Common Request Failures
There are two best ways to retry Python Requests:
- Use an existing retry wrapper like Python Sessions with HTTPAdapter.
- Coding your own retry wrapper.
The first option is the best for most situations because it’s straightforward and effective. However, the second option might be better if you need something more specific.
Implementing Retry Logic Using an Existing Retry Wrapper
A practical solution for handling retries using the Python Requests library is to use an existing retry wrapper, such as HTTPAdapter. This approach simplifies setting up retry mechanisms, making your HTTP requests less prone to failures.
Step 1: Import the Necessary Modules
Before you start, ensure the requests
and
urllib3
libraries are installed in your environment. If they are
not, you can install them using pip:
pip install requests urllib3
Then, import the necessary modules in your Python script:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
Step 2: Create an Instance of HTTPAdapter with Retry Parameters
Create an instance of HTTPAdapter and configure it with a Retry strategy. The Retry class provides several options to customize how retries are handled:
retry_strategy = Retry(
    total=3,  # Total number of retries to allow. This limits the number of consecutive failures before giving up.
    status_forcelist=[429, 500, 502, 503, 504],  # A set of HTTP status codes we should force a retry on.
    backoff_factor=2  # This determines the delay between retry attempts
)

adapter = HTTPAdapter(max_retries=retry_strategy)
This setup instructs the adapter to retry up to three times if the HTTP
request fails with one of the specified status codes. The
backoff_factor
introduces a delay between retries, which helps
when the server is temporarily overloaded or down.
Each retry attempt will wait for:
{backoff factor} * (2 ^ {number of total retries - 1})
seconds.
Step 3: Mount the HTTPAdapter to a Requests Session
After defining the retry strategy, attach the HTTPAdapter to
requests.Session()
. This ensures that all requests sent through
this session follow the retry rules you’ve set:
session = requests.Session()
session.mount("http://", adapter)
session.mount("https://", adapter)
Mounting the adapter to the session applies the retry logic to all types of HTTP and HTTPS requests made from this session.
Example Usage
Now, use the session to send requests. Here’s how to perform a GET request using your configured session:
url = 'http://example.com'
response = session.get(url)
print(response.status_code)
print(response.text)
This session object will automatically handle retries according to your defined settings. If it encounters errors like server unavailability or rate-limiting responses, it can retry the request up to three times, thus enhancing the reliability of your network interactions.
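Since this session behaves like any other Requests session, you can also combine it with the proxy setup from earlier in this article — the retry logic and the proxy routing are independent. A brief sketch, assuming 'http://your_proxy:port' is replaced with a real proxy:

proxies = {
    'http': 'http://your_proxy:port',
    'https': 'http://your_proxy:port',
}

# Retries and proxies work together: each retry attempt goes out through the proxy
response = session.get('http://example.com', proxies=proxies)
print(response.status_code)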
Submit your scraping requests to our Async Scraper and let us handle retries for you.
Coding Your Own Retry Wrapper
Creating your own retry wrapper gives you complete control over how retries are managed, which is great for situations that need special handling of HTTP request failures.
Making your own retry mechanism lets you adjust everything just the way you need, unlike our previous approach, which was quick to implement but less flexible.
Step 1: Set Up the Backoff Delay Function
First, let’s make a function called backoff_delay. This function determines how long to wait before sending a request again. It uses exponential backoff, which means the waiting time gets longer each time a request fails.
import time

def backoff_delay(backoff_factor, attempts):
    delay = backoff_factor * (2 ** attempts)
    return delay
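To see how the delay grows, here’s a quick check using a backoff factor of 1 — each additional failed attempt doubles the wait:

for attempt in range(4):
    print(f"Attempt {attempt}: wait {backoff_delay(1, attempt)} seconds")

# Prints waits of 1, 2, 4, and 8 seconds for attempts 0 through 3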
Step 2: Make the Retry Request Function
Next, we’ll create retry_request
, which uses the
backoff_delay
function to handle retrying HTTP requests. This
function tries to send a GET request to a given URL and will keep trying if it
gets certain types of error responses.
The function makes a request and checks the response’s HTTP status code. If this code is on a list of codes known for causing temporary issues (like server errors or rate limits), the function will plan to retry.
It then uses the backoff_delay
function to calculate how long to
wait before trying again, with the delay time increasing after each attempt
due to the exponential backoff strategy.
import requests

def retry_request(url, total=4, status_forcelist=[429, 500, 502, 503, 504], backoff_factor=1, **kwargs):
    for attempt in range(total):
        try:
            response = requests.get(url, **kwargs)
            if response.status_code in status_forcelist:
                print(f"Trying again because of error {response.status_code}...")
                time.sleep(backoff_delay(backoff_factor, attempt))
                continue
            return response  # If successful, return the response
        except requests.exceptions.ConnectionError as e:
            print(f"Network problem on try {attempt + 1}: {e}. Trying again in {backoff_delay(backoff_factor, attempt)} seconds...")
            time.sleep(backoff_delay(backoff_factor, attempt))
    return None  # Return None if all tries fail
Example Usage
Here’s how you might use the retry_request
function:
response = retry_request('https://example.com')

if response:
    print("Request was successful:", response.status_code)
    print(response.text)
else:
    print("Request failed after all retries.")
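Because you control the whole loop, it’s also easy to extend this wrapper with the proxy-rotation helpers from earlier in the article — for example, switching to a fresh proxy on every attempt. A rough sketch, assuming fetch_proxies() and find_active_proxy() are defined as shown above:

def retry_with_rotation(url, proxy_pool, total=4, status_forcelist=[429, 500, 502, 503, 504], backoff_factor=1):
    for attempt in range(total):
        # Pick a working proxy from the pool for each attempt (random choice, so it may occasionally repeat)
        proxy = find_active_proxy(proxy_pool)
        proxy_setup = {'http': f'http://{proxy}', 'https': f'http://{proxy}'}
        try:
            response = requests.get(url, proxies=proxy_setup)
            if response.status_code in status_forcelist:
                time.sleep(backoff_delay(backoff_factor, attempt))
                continue
            return response
        except requests.exceptions.RequestException:
            time.sleep(backoff_delay(backoff_factor, attempt))
    return None

proxy_pool = fetch_proxies()
response = retry_with_rotation('https://example.com', proxy_pool)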
Avoid Getting Blocked by Error 429 with Python Requests
When you conduct intensive web scraping or API polling, encountering an Error 429, which signifies “Too Many Requests,” is a common issue. You usually get this error when your requests exceed the rate limit set by the web server, leading to temporary blocking of your IP address or user-agent due to suspected automation.
To demonstrate this, let’s attempt to access an API that has known rate limits:
import requests
# Attempt to access a rate-limited API endpoint
response = requests.get('https://api.example.com/data')
print(response.text, response.status_code)
Running this code might give you a response indicating that you’ve been blocked with a 429 error:
{
"message": "Too many requests - try again later."
}
429
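If you want to handle this yourself, a common first step is to respect the Retry-After header that many servers send along with a 429, waiting the suggested amount of time before trying again. Here’s a minimal sketch; note that Retry-After can also be an HTTP date rather than a number of seconds, which this simple version doesn’t handle:

import time
import requests

def get_with_429_backoff(url, max_attempts=3, default_wait=30):
    for attempt in range(max_attempts):
        response = requests.get(url)
        if response.status_code != 429:
            return response
        retry_after = response.headers.get("Retry-After")
        # Fall back to a default wait if the header is missing or not numeric
        wait = int(retry_after) if retry_after and retry_after.isdigit() else default_wait
        print(f"Got 429, waiting {wait} seconds before retrying...")
        time.sleep(wait)
    return response

response = get_with_429_backoff('https://api.example.com/data')
print(response.status_code)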
Waiting out the limit, rotating proxies, or changing user agents may provide temporary relief in scenarios like this, but these tactics often fail to address the root problem and struggle to keep up with sophisticated rate-limiting mechanisms. Instead, a dedicated web scraping API like ScraperAPI can be more effective.
ScraperAPI provides features like automatic IP rotation and request throttling to stay within the rate limits of target servers.
Here’s how you can use it with Python Requests:
- Sign Up for ScraperAPI: Create a free ScraperAPI account and obtain your API key.
- Integrate with Python Requests: Use ScraperAPI to manage your requests.
Here’s a sample code snippet demonstrating how to use ScraperAPI:
import requests
# Replace 'YOUR_API_KEY' with your actual ScraperAPI API key
api_key = 'YOUR_API_KEY'
url = 'https://api.example.com/data'
params = {
    'api_key': api_key,
    'url': url,
    'render': 'true'  # Optional: helpful if JavaScript rendering is needed
}
# Send a request through ScraperAPI
response = requests.get('http://api.scraperapi.com', params=params)
print(response.text)
Using ScraperAPI to manage your requests helps you avoid the dreaded 429 error by keeping within site rate limits. It’s an excellent tool for collecting data regularly from websites or APIs with strict rules against too many rapid requests.
Wrapping Up
Proxies are an integral part of scraping the web. They let you hide your IP address and bypass rate limiting.
However, just using proxies isn’t enough. Sites can use many other mechanisms like CAPTCHAs and JavaScript-injected content to limit the data you can gather.
By sending your Python requests through ScraperAPI endpoints, you can:
- Automate proxy rotation
- Handle CAPTCHAs
- Render dynamic content
- Interact with sites without headless browsers
- Bypass anti-scraping mechanisms like DataDome and CF Turnstile
And much more.
If you have any questions, please contact our support team or reach out to us on Twitter/X.
Until next time, happy scraping!