Google search results hold invaluable insights for businesses and SEO professionals alike. However, extracting this data can be a complex challenge due to Google’s robust anti-scraping measures.
In this web scraping tutorial, we’ll guide you through building a robust Google search scraper using Python’s Scrapy and ScraperAPI. Together, these web scraping tools help you bypass Google’s web scraping defenses and efficiently collect data from millions of search queries. We’ll also demonstrate how to transform this raw data into structured JSON format for easy analysis.
Additionally, we’ll show you how to extract SEO keyword data and Google Shopping data from Google, each with full web scraper code and a step-by-step tutorial.
Want to reduce the web scraping coding and parsing hassle? Check out ScraperAPI’s Google Search Scraper.
With DataPipeline, ScraperAPI allows you to automate data collection without writing a single line of code.
Scraping Google Search Results (SERPs) Without Getting Blocked
We’ll build a Google scraper using Python to collect competitors’ reviews. So, let’s imagine we’re a new startup building project management software, and we want to understand the state of the industry. Here’s the step-by-step guide on how to scrape Google SERPs using Python.
1. Choosing Your Target Keywords
Now that we know our main goal, it’s time to pick the keywords we want to scrape from Google search results to support it.
To pick your target keywords, think of the terms consumers could be searching to find your offering and identify your competitors.
In this example, we’ll target four keywords:
- “asana reviews”
- “clickup reviews”
- “best project management software”
- “best project management software for small teams”
We could add many more keywords to this list, but for this scraper tutorial, they’ll be more than enough.
Also, notice that the first two queries are related to direct competitors, while the last two will help us identify other competitors and get an initial knowledge of the state of the industry.
Note: For this stage, you can use a tool like Ahrefs or SEMrush to build a list of relevant keywords.
2. Setting Up Your Development Environment with Python
The next step is to get our machine ready to develop our Google Search scraper. For this, we’ll need a few things:
- Python version 3 or later
- Pip – to install Scrapy and other packages we might need
- ScraperAPI
Your machine may already have Python installed. Run python --version in your command prompt to check.
If you need to install everything from scratch, follow our tutorial on the basics of Scrapy in Python. We’ll be using the same setup, so get that done and come back.
Note: the team behind Scrapy recommends installing Scrapy in a virtual environment (VE) instead of globally on your PC or laptop. If you’re unfamiliar with this, the Python and Scrapy tutorial above shows you how to create the VE and install all dependencies.
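For reference, on macOS or Linux the setup typically looks something like this (Windows uses venv\Scripts\activate instead of the source command):

```bash
python -m venv venv           # create a virtual environment in the "venv" folder
source venv/bin/activate      # activate it (Windows: venv\Scripts\activate)
pip install scrapy            # install Scrapy inside the virtual environment
```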
In this Google web extraction tutorial, we’re also going to be using ScraperAPI to avoid any IP bans or repercussions.
Google doesn’t really want us to scrape their SERPs – especially for free. As such, they have implemented advanced anti-scraping mechanisms that’ll quickly identify any bots trying to extract data automatically.
ScraperAPI utilizes third-party proxies, machine learning, huge browser farms, and years of statistical data to ensure that our scraper won’t get blocked from any site by rotating our IP address for every request (when needed), setting wait times between requests and handling CAPTCHAs.
In other words, by just adding a few lines of code, ScraperAPI will supercharge our scraper, saving us headaches and hours of work.
All we need for this tutorial is to get our API Key from ScraperAPI. To get it, just create a free ScraperAPI account to redeem 5000 free API requests.
Related: What are the best Google SERP APIs? We’ve gathered and compared the 7 best free and paid Google SERP APIs to help you find the option that best suits your needs.
3. Creating Your Web Scraping Project’s Folder for The Google SERP Data
After installing Scrapy in your VE, enter this snippet into your terminal to create the necessary folders:
```bash
scrapy startproject google_scraper
cd google_scraper
scrapy genspider google api.scraperapi.com
```
- Scrapy will first create a new project folder called “google_scraper,” which is also the project’s name.
- Next, go into this folder and run the “genspider” command to create a web scraper named “google”.
We now have many configuration files, a “spiders” folder containing our scraper, and a Python modules folder containing package files.
4. Importing All Necessary Dependencies to Your google.py File
The next step in the Google SERP web data extraction process is to build a few components that will make our script as efficient as possible.
To do so, we’ll need to make our dependencies available to our scraper by adding them at the top of our file:
```python
import scrapy
from urllib.parse import urlencode
from urllib.parse import urlparse
import json
from datetime import datetime

API_KEY = 'YOUR_API_KEY'
```
With these dependencies in place, we can use them to build requests and handle JSON files. This last detail is important because we’ll be using ScraperAPI’s autoparse functionality.
After sending the HTTP request, it will return the data in JSON format, simplifying the process and making it so that we don’t have to write and maintain our own parser.
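To give you a sense of what to expect, the parsed JSON contains (among other fields) the organic results and pagination data we’ll use later in this tutorial. The snippet below is a simplified, illustrative shape with placeholder values, not the complete response:

```json
{
  "organic_results": [
    {
      "title": "Example result title",
      "snippet": "Example snippet text shown under the result...",
      "link": "https://www.example.com/some-result"
    }
  ],
  "pagination": {
    "nextPageUrl": "https://www.google.com/search?q=asana+reviews&start=100"
  }
}
```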
5. Constructing the Google Search Query
Google employs a standard and query-able URL structure. You just need to know the URL parameters for the data you need, and you can generate a URL to query Google with.
That said, the following makes up the URL structure for all Google search queries:
```
http://www.google.com/search
```
There are several standard parameters that make up Google search queries:
- q represents the search keyword parameter. http://www.google.com/search?q=tshirt, for example, will look for results containing the keyword “tshirt.”
- The offset point is specified by the start parameter. http://www.google.com/search?q=tshirt&start=100 is an example.
- hl is the language parameter. http://www.google.com/search?q=tshirt&hl=en is a good example.
- The as_sitesearch parameter allows you to limit results to a domain (or website). http://www.google.com/search?q=tshirt&as_sitesearch=amazon.com is one example.
- The number of results per page (maximum is 100) is specified by the num parameter. http://www.google.com/search?q=tshirt&num=50 is an example.
- The safe parameter generates only “safe” results. http://www.google.com/search?q=tshirt&safe=active is a good example.
Note: Moz’s comprehensive list of Google search parameters is incredibly useful for building a query-able URL. Bookmark it for more complex scraping projects in the future.
Alright, let’s define a method to construct our Google URL using this information:
```python
def create_google_url(query, site=''):
    google_dict = {'q': query, 'num': 100, }
    if site:
        web = urlparse(site).netloc
        google_dict['as_sitesearch'] = web
        return 'http://www.google.com/search?' + urlencode(google_dict)
    return 'http://www.google.com/search?' + urlencode(google_dict)
```
In our method, we’re setting q to the query argument because we’ll specify our actual keywords later in the script, making it easier to modify our Google scraper.
6. Defining the ScraperAPI Method
To use ScraperAPI, all we need to do is send our request through ScraperAPI’s server by appending our query URL to the proxy URL provided by ScraperAPI, using a payload dictionary and urlencode.
The code looks like this:
```python
def get_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'autoparse': 'true', 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url
```
Now that we have defined the logic our scraper will use to construct our target URLs, we’re ready to start scraping Google Search results.
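As a quick sanity check, you can call the two helpers above outside of Scrapy and print the proxied URL they produce. This is purely illustrative:

```python
# Illustrative only: build the proxied URL for a single keyword and inspect it
url = create_google_url('asana reviews')
print(get_url(url))
# Prints something like:
# http://api.scraperapi.com/?api_key=YOUR_API_KEY&url=http%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3Dasana%2Breviews%26num%3D100&autoparse=true&country_code=us
```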
7. Writing the Main Spider Class
In Scrapy, we can create different classes, called spiders, to scrape specific pages or groups of sites.
Thanks to this structure, we can build different spiders inside the same project, making it much easier to scale and maintain.
```python
class GoogleSpider(scrapy.Spider):
    name = 'google'
    allowed_domains = ['api.scraperapi.com']
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'LOG_LEVEL': 'INFO',
        'CONCURRENT_REQUESTS_PER_DOMAIN': 10,
        'RETRY_TIMES': 5}
```
We need to give our spider a name, as this is how Scrapy will determine which script you want to run.
The name you choose should be specific to what you’re trying to scrape, as projects with multiple spiders can get confusing if they aren’t clearly named.
Because our URLs will start with ScraperAPI’s domain, we’ll also need to add “api.scraperapi.com” to allowed_domains.
ScraperAPI will change the IP address and headers between every retry before returning a failed message (which doesn’t count against our total available API credits).
We also want to tell our scraper to ignore the directives in the robots.txt file, because by default Scrapy won’t scrape any page that the site’s robots.txt disallows.
Finally, we’ve set a few constraints so that we don’t exceed the limits of our free ScraperAPI account. As you can see in the custom_settings code above, we’re telling Scrapy to send at most 10 concurrent requests per domain and to retry up to 5 times after any failed response.
8. Sending the Initial Request to Google Search
It’s finally time to send our HTTP request. Doing this is very simple with the start_requests(self) method:
```python
def start_requests(self):
    queries = ['asana+reviews', 'clickup+reviews', 'best+project+management+software', 'best+project+management+software+for+small+teams']
    for query in queries:
        url = create_google_url(query)
        yield scrapy.Request(get_url(url), callback=self.parse, meta={'pos': 0})
```
- It will loop through a list of queries, passing each one to the create_google_url function as the query keyword.
- The query URL we created will then be sent to Google Search via the proxy connection we set up in the get_url() function, using Scrapy’s yield.
- The result will then be given to the parse function to be processed (it should be in JSON format).
- The {'pos': 0} key-value pair is also added to the meta parameter, which is used to count the number of pages scraped.
Note: when typing keywords, remember that every word in a keyword is separated by a + sign rather than a space to avoid confusion.
9. Parsing Google Search Results
Thanks to ScraperAPI’s auto parsing functionality, our scraper will return a JSON response to our request instead of the raw HTML. For this to work, make sure the autoparse parameter is set to true in the get_url() function.
Next, we’ll load the complete JSON response and cycle through each result, taking the data and combining it into a new item that we can utilize later.
```python
def parse(self, response):
    di = json.loads(response.text)
    pos = response.meta['pos']
    dt = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    for result in di['organic_results']:
        title = result['title']
        snippet = result['snippet']
        link = result['link']
        item = {'title': title, 'snippet': snippet, 'link': link, 'position': pos, 'date': dt}
        pos += 1
        yield item
    next_page = di['pagination']['nextPageUrl']
    if next_page:
        yield scrapy.Request(get_url(next_page), callback=self.parse, meta={'pos': pos})
```
This procedure checks whether another page of results is available. If one is present, the request is invoked again, and the cycle repeats until there are no more pages.
10. Start Scraping Google Search Results
Congratulations, you’ve built your first Google search result scraper!
If you’ve been following along, your google.py file should look like this by now:
```python
import scrapy
from urllib.parse import urlencode
from urllib.parse import urlparse
import json
from datetime import datetime

API_KEY = 'YOUR_API_KEY'

def get_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'autoparse': 'true', 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

def create_google_url(query, site=''):
    google_dict = {'q': query, 'num': 100, }
    if site:
        web = urlparse(site).netloc
        google_dict['as_sitesearch'] = web
        return 'http://www.google.com/search?' + urlencode(google_dict)
    return 'http://www.google.com/search?' + urlencode(google_dict)

class GoogleSpider(scrapy.Spider):
    name = 'google'
    allowed_domains = ['api.scraperapi.com']
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'LOG_LEVEL': 'INFO',
        'CONCURRENT_REQUESTS_PER_DOMAIN': 10,
        'RETRY_TIMES': 5}

    def start_requests(self):
        queries = ['asana+reviews', 'clickup+reviews', 'best+project+management+software', 'best+project+management+software+for+small+teams']
        for query in queries:
            url = create_google_url(query)
            yield scrapy.Request(get_url(url), callback=self.parse, meta={'pos': 0})

    def parse(self, response):
        di = json.loads(response.text)
        pos = response.meta['pos']
        dt = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        for result in di['organic_results']:
            title = result['title']
            snippet = result['snippet']
            link = result['link']
            item = {'title': title, 'snippet': snippet, 'link': link, 'position': pos, 'date': dt}
            pos += 1
            yield item
        next_page = di['pagination']['nextPageUrl']
        if next_page:
            yield scrapy.Request(get_url(next_page), callback=self.parse, meta={'pos': pos})
```
💡 Pro Tip
If you want to scrape Google SERPs from different countries (let’s say Italy), all you need to do is change the value of the country_code parameter in the get_url() function. Check out our documentation to learn every parameter you can customize in ScraperAPI.
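For example, switching the scraper to Italian results only requires changing that one value. Here is a small sketch of the modified helper; 'it' is the same country code used in the Google Shopping example later in this article:

```python
def get_url(url):
    # Same helper as before, but routing every request through Italian proxies
    payload = {'api_key': API_KEY, 'url': url, 'autoparse': 'true', 'country_code': 'it'}
    return 'http://api.scraperapi.com/?' + urlencode(payload)
```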
To run our scraper, navigate to the project’s folder inside the terminal and use the following command:
```bash
scrapy crawl google -o serps.csv
```
Now our spider will run and store all scraped data in a new CSV file named “serps.csv.” This feature is a big time saver and one more reason to use Scrapy for web scraping Google.
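Since we mentioned turning the raw data into structured JSON, note that Scrapy’s feed exports infer the format from the file extension, so exporting the same items as JSON is just a matter of changing the output file name:

```bash
scrapy crawl google -o serps.json
```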
How to Extract Keywords and Rankings From Google With Python
Google SEO keyword rankings are another data type that you can extract from Google search results. Similar to scraping for Google SERPs, you’ll also need to pay attention to the advanced anti-scraping mechanisms that may block your web scraper. This includes browser fingerprinting, CAPTCHAs, and rate limiting.
Go watch this video tutorial or read the detailed instructions on how to build a DIY web scraper for Google SEO rankings and keywords.
How to Extract Google Shopping Data With Python
Besides dealing with the usual anti-bot mechanisms, when extracting data from Google Shopping you’ll also need to address another challenge: the dynamically rendered results page.
For this, we recommend using a Google Shopping endpoint, like the one offered by ScraperAPI. This allows you to scrape Google Shopping search results in JSON format with just an API call without dealing with the technical difficulties you may get from building your own Google Shopping web scraper.
That said, if you prefer a DIY web data extractor, you can follow this Google Shopping web scraping tutorial, or use the full code for the Google Shopping search scraper below:
```python
import requests
import pandas as pd

keywords = [
    'katana set',
    'katana for kids',
    'katana stand',
]

for keyword in keywords:
    products = []
    payload = {
        'api_key': 'YOUR_API_KEY',
        'query': keyword,
        'country_code': 'it',
        'tld': 'it',
        'num': '100'
    }
    # Send the keyword to ScraperAPI's structured Google Shopping endpoint
    response = requests.get('https://api.scraperapi.com/structured/google/shopping', params=payload)
    data = response.json()
    organic_results = data['shopping_results']
    print(len(organic_results))
    for product in organic_results:
        products.append({
            'Name': product['title'],
            'ID': product['docid'],
            'Price': product['price'],
            'URL': product['link'],
            'Thumbnail': product['thumbnail']
        })
    # Save one CSV file per keyword
    df = pd.DataFrame(products)
    df.to_csv(f'{keyword}.csv')
```
Why Scrape Google Search Results?
Scraping Google search results allows you to build marketing and SEO tools, do market research, monitor your competitors’ organic presence, help PR teams track brand mentions and prevent potential PR disasters, and keep track of organic rankings, to name just a few use cases.
That said, let’s explore the most common applications for Google search data:
1. Collecting Customer Feedback Data to Inform Your Marketing
In the modern shopping experience, it is common for consumers to look for product reviews before deciding on a purchase.
With this in mind, you can scrape Google search results to collect reviews and customer feedback from your competitor’s products to understand what’s working and what’s not working for them.
You can use this insight to improve your product, differentiate yourself from the competition, or decide which features or experiences to highlight in your marketing campaigns.
2. Inform Your SEO and PPC Strategy
According to Oberlo, “Google has 91.9 percent of the market share as of 2022”.
With that many eyes on the SERPs, getting your business to the top of these pages for relevant keywords means a lot of money.
Web scraping is primarily an info-gathering tool. We can use it to know our positions in Google better and benchmark ourselves against the competition.
If we look at our positions and compare ourselves to the top pages, we can generate a strategy to outrank them.
The same goes for PPC campaigns.
Because ads appear at the top of every SERP – and sometimes at the bottom – we can tell our scraper to capture the name, description, and link of all the ads appearing in search results for our targeted keywords.
This research will help us find untargeted keywords, understand our competitors’ strategies, and evaluate the copy of their ads to differentiate ours.
Resource: How to Scrape Competitors’ Google Ads
3. Generate Content Ideas
Google also has many additional features in their SERPs like related searches, “people also ask” boxes, and more.
Scraping hundreds of keywords allows you to gather all this information in a couple of hours and organize it in an easy-to-analyze database.
Depending on the type of data and the ultimate goal you have, you can use a Google scraper for many different reasons.
4. Optimize Your Product and Service Portfolio
Extracting data from Google SERPs lets you see the types of products and services your competitors offer and how they compare to yours. For example, some direct competitors of your business may have started offering a new solution that you don’t have or something similar to your business’s unique selling proposition. Knowing what they have up their sleeves empowers you to counter their strategies before they hurt your bottom line.
5. Trend Analysis
Do you know what the hottest products on the market are right now? Understanding how the market moves, which products and services are booming, and the behavior of your customers enables you to anticipate market demands and ensure proper supply. By forecasting future trends, you can seize opportunities and maximize them to grow your business even further.
Extract Google Search Result Data with ScraperAPI’s Google Scraper
To make the most out of ScraperAPI and collect data faster and more simply, check out our new Google Search endpoint. You can reduce all the code we’ve built so far to just a couple of lines and get structured JSON data in seconds.
Learn how to use it with our structured data endpoints tutorial, or test it for yourself with a free ScraperAPI account and 5,000 free API credits!
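As a rough sketch of how little code that takes, a call to the structured Google Search endpoint could look something like the snippet below. The endpoint path and response fields here are assumptions modeled on the Google Shopping endpoint shown earlier, so confirm the exact parameters in the documentation:

```python
import requests

# Hypothetical sketch modeled on the Google Shopping endpoint used earlier in this article;
# verify the exact endpoint path, parameters, and response fields in ScraperAPI's docs.
payload = {
    'api_key': 'YOUR_API_KEY',
    'query': 'best project management software',
    'country_code': 'us'
}
response = requests.get('https://api.scraperapi.com/structured/google/search', params=payload)
results = response.json()
print(results.get('organic_results', [])[:3])  # inspect the first few organic results
```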
Until next time, happy scraping!
Check out other Google web scraping guides:
- How To Scrape Google Trends Data Using PyTrends
- How to Scrape Competitors’ Google Ads Data to Better Your Own
- How to Scrape AI Snippets in Google Search Results
FAQs About Google Search Results DIY Web Scraping Solution
Answering your questions about extracting and aggregating Google search result data, ScraperAPI, and Google web scraping.
1. How to Scrape Google Search Results Without Getting Blocked?
Several factors determine whether your web scraper gets blocked when extracting Google search result data. These include the robustness of your web scraper, its ability to handle CAPTCHAs, your headless browser configuration, and the quality of your proxies.
Follow this Google SERPs Python web scraping tutorial to maximize your chances of extracting Google data without getting blocked.
2. Why Should I Scrape Google Search Using Google SERP APIs?
Using Google SERP APIs to extract Google data has some advantages over using a DIY Google search result web scraper. For instance, if you use ScraperAPI’s Google SERP API you can get structured and parsed SERP data directly, customize the parameters to match your web scraping needs, and easily scale your web scraping projects without worrying about technical challenges.
Learn more about ScraperAPI’s Google SERP API and its advantages here.
3. Why Should I Use ScraperAPI to Help Scrape Google SERPs?
If you’re looking for a reliable web scraping tool to collect Google SERPs without breaking the bank, ScraperAPI is an excellent option. Not only does it offer smart handling of anti-bot measures, but it also provides Google SERP APIs and web scraping automation that lets you schedule up to 10,000 URLs. ScraperAPI’s pricing is affordable for businesses of any size, with plans starting from just $49 per month.
Explore more about ScraperAPI’s features and pricing.