In this guide, we’ll see how you can easily use ScraperAPI with the Python Requests library to scrape the web at scale. We will walk you through exactly how to create a scraper that will:
- Send requests to ScraperAPI using our API endpoint, Python SDK or proxy port.
- Automatically catch and retry failed requests returned by ScraperAPI.
- Spread your requests over multiple concurrent threads so you can scale up your scraping to millions of pages per day.
Full code examples can be found on GitHub here.
Getting Started: Sending Requests With ScraperAPI
Using ScraperAPI as your proxy solution is very straightforward. All you need to do is send the URL you want to scrape to us via our API endpoint, Python SDK or proxy port, and we will manage everything to do with proxy/header rotation, automatic retries, ban detection and CAPTCHA bypassing.
The following is a simple implementation that iterates through a list of URLs and requests each of them via ScraperAPI, returning the HTML as the response.
import requests
from urllib.parse import urlencode

API_KEY = 'INSERT_API_KEY_HERE'

list_of_urls = ['http://quotes.toscrape.com/page/1/', 'http://quotes.toscrape.com/page/2/']

for url in list_of_urls:
    params = {'api_key': API_KEY, 'url': url}
    response = requests.get('http://api.scraperapi.com/', params=urlencode(params))
    print(response.text)
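If you would rather use proxy mode than the API endpoint, the same request can be routed through our proxy port instead. The sketch below assumes the standard ScraperAPI proxy address (proxy-server.scraperapi.com:8001) with scraperapi as the username and your API key as the password; check your dashboard for the exact connection details. Note the verify=False flag, which is explained in the SSL Cert Verification point below.

## a minimal sketch of proxy mode, assuming the proxy-server.scraperapi.com:8001 address
import requests

API_KEY = 'INSERT_API_KEY_HERE'
proxies = {
    'http': f'http://scraperapi:{API_KEY}@proxy-server.scraperapi.com:8001',
    'https': f'http://scraperapi:{API_KEY}@proxy-server.scraperapi.com:8001',
}

## requests are sent to the target URL directly; ScraperAPI handles rotation behind the proxy
response = requests.get('http://quotes.toscrape.com/page/1/', proxies=proxies, verify=False)
print(response.text)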
Here are a few other points to note:
- Timeouts – When you send a request to the API, we automatically select the best proxy/header configuration to get a successful response. However, if the response isn’t valid (ban, CAPTCHA, taking too long), the API automatically retries the request with a different proxy/header configuration. We continue this cycle for up to 60 seconds until we either get a successful response or return a 500 error code to you. To ensure this process runs smoothly, make sure you don’t set a timeout, or set it to at least 60 seconds (see the sketch below).
- SSL Cert Verification – For your requests to work properly with the API when using proxy mode, your code must be configured not to verify SSL certificates. When using Python Requests, this is as simple as adding the flag verify=False to the request.
- Request Size – You can scrape images, PDFs or other files just as you would any other URL; just remember that there is a 2MB limit per request.
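For example, if you do want an explicit client-side timeout, a value of at least 60 seconds avoids cutting off the API’s own retry cycle. This is a minimal sketch against the API endpoint, assuming API_KEY is defined as in the snippet above:

## allow at least 60 seconds so ScraperAPI's internal retries can complete
response = requests.get(
    'http://api.scraperapi.com/',
    params={'api_key': API_KEY, 'url': 'http://quotes.toscrape.com/page/1/'},
    timeout=70,
)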
Configuring Your Code To Retry Failed Requests
For most sites, over 97% of your requests will be successful on the first try; however, it is inevitable that some requests will fail. For these failed requests, the API will return a 500 status code and won’t charge you for the request.
If you set your code to automatically retry these failed requests, 99.9% of them will be successful within 3 retries, unless there is an issue with the site itself.
Here is some example code showing how you can automatically retry failed requests returned by ScraperAPI. We recommend setting the number of retries to at least 3.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode

API_KEY = 'INSERT_API_KEY_HERE'
NUM_RETRIES = 3

list_of_urls = ['http://quotes.toscrape.com/page/1/', 'http://quotes.toscrape.com/page/2/']

scraped_quotes = []

for url in list_of_urls:
    params = {'api_key': API_KEY, 'url': url}

    ## send request to scraperapi, and automatically retry failed requests
    response = None
    for _ in range(NUM_RETRIES):
        try:
            response = requests.get('http://api.scraperapi.com/', params=urlencode(params))
            if response.status_code in [200, 404]:
                ## exit the retry loop if the API returns a successful response
                break
        except requests.exceptions.ConnectionError:
            response = None

    ## parse data if 200 status code (successful response)
    if response is not None and response.status_code == 200:
        """
        Insert the parsing code for your use case here...
        """
        ## Example: parse data with BeautifulSoup
        html_response = response.text
        soup = BeautifulSoup(html_response, "html.parser")
        quotes_sections = soup.find_all('div', class_="quote")

        ## loop through each quote section and extract the quote and author
        for quote_block in quotes_sections:
            quote = quote_block.find('span', class_='text').text
            author = quote_block.find('small', class_='author').text

            ## add scraped data to the "scraped_quotes" list
            scraped_quotes.append({
                'quote': quote,
                'author': author
            })

print(scraped_quotes)
As you might have noticed, there is no retry code needed if you are using the ScraperAPI SDK. This is because we’ve built retries into the SDK for you. The default setting is 3 retries; however, you can override this by setting the retry flag, for example retry=5.
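As a rough illustration, overriding the retry count with the SDK might look like the sketch below. This assumes the scraper_api package and its ScraperAPIClient interface; check the SDK documentation for the exact import path and parameters for your version.

## a minimal sketch, assuming the scraper_api package's ScraperAPIClient
from scraper_api import ScraperAPIClient

client = ScraperAPIClient('INSERT_API_KEY_HERE')

## the SDK handles proxy rotation and retries for you;
## here we override the default of 3 retries
result = client.get(url='http://quotes.toscrape.com/page/1/', retry=5)
print(result.text)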
Use Multiple Concurrent Threads To Increase Scraping Speed
ScraperAPI is designed to let you increase your scraping from a couple of hundred pages per day to millions of pages per day, simply by changing your plan to one with a higher concurrent thread limit.
The more concurrent threads you have, the more requests you can have active in parallel, and the faster you can scrape.
If you are new to high-volume scraping, it can sometimes be a bit tricky to set up your code to maximise the number of concurrent threads available in your plan. So, to make it as simple as possible to get set up, we’ve created an example scraper that you can easily adapt for your use case.
For the purposes of this example, we’re going to scrape Quotes to Scrape and save the scraped data to a CSV file (the CSV-writing step itself is sketched after the code below); however, this code will work on any website, aside from the site-specific parsing logic.
import requests
from bs4 import BeautifulSoup
import concurrent.futures
import csv
from urllib.parse import urlencode

API_KEY = 'INSERT_API_KEY_HERE'
NUM_RETRIES = 3
NUM_THREADS = 5

## Example list of urls to scrape
list_of_urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]

## we will store the scraped data in this list
scraped_quotes = []

def scrape_url(url):
    params = {'api_key': API_KEY, 'url': url}

    ## send request to scraperapi, and automatically retry failed requests
    response = None
    for _ in range(NUM_RETRIES):
        try:
            response = requests.get('http://api.scraperapi.com/', params=urlencode(params))
            if response.status_code in [200, 404]:
                ## exit the retry loop if the API returns a successful response
                break
        except requests.exceptions.ConnectionError:
            response = None

    ## parse data if 200 status code (successful response)
    if response is not None and response.status_code == 200:
        ## Example: parse data with BeautifulSoup
        html_response = response.text
        soup = BeautifulSoup(html_response, "html.parser")
        quotes_sections = soup.find_all('div', class_="quote")

        ## loop through each quote section and extract the quote and author
        for quote_block in quotes_sections:
            quote = quote_block.find('span', class_='text').text
            author = quote_block.find('small', class_='author').text

            ## add scraped data to the "scraped_quotes" list
            scraped_quotes.append({
                'quote': quote,
                'author': author
            })

## scrape urls in parallel, using up to NUM_THREADS concurrent threads
with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_url, list_of_urls)

print(scraped_quotes)
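The example above only prints the scraped quotes. To save them to a CSV file as described, you could append something like the following to the end of the script. This is a minimal sketch using Python’s built-in csv module and a hypothetical quotes.csv output filename:

## write the scraped quotes to a CSV file with a header row
with open('quotes.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=['quote', 'author'])
    writer.writeheader()
    writer.writerows(scraped_quotes)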