Knowing how to scrape dynamic content from large websites is a game-changer. Whether you’re trying to extract prices from an online store, gather social media data, or analyze real-time trends, you might find it challenging to deal with pages loaded with JavaScript that don’t reveal all their content immediately.
In this article, I’ll walk you through:
- How to scrape dynamic content from sites that rely heavily on JavaScript
- Using ScraperAPI to render and interact with dynamic sites
- Comparing ScraperAPI’s approach to traditional headless browsers to render dynamic content
And much more.
By the end, you’ll have the know-how and tools to scrape dynamic content from even the trickiest websites.
TL;DR: How to Get Dynamic Data?
Here’s how you can scrape dynamic content using both ScraperAPI and Selenium. These approaches handle JavaScript, infinite scrolling, and complex user interactions, ensuring you get all the data you need.
Python and ScraperAPI
import requests
from bs4 import BeautifulSoup
API_KEY = 'your_scraperapi_key'
url = 'https://www.booking.com/searchresults.html?ss=New+York'
payload = {
    'url': url,
}
headers = {
    'x-sapi-api_key': API_KEY,
    'x-sapi-render': 'true',
    'x-sapi-instruction_set': '[{"type": "loop", "for": 5, "instructions": [{"type": "scroll", "direction": "y", "value": "bottom"}, {"type": "wait", "value": 5}]}]'
}
response = requests.get('https://api.scraperapi.com', params=payload, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
listings = soup.find_all('div', attrs={'data-testid': 'property-card'})
print(f"Found {len(listings)} hotel listings on Booking.com")
Python and Selenium
from seleniumwire import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup
API_KEY = 'your_scraperapi_key'
proxy_options = {
    'proxy': {
        'http': f'http://scraperapi:{API_KEY}@proxy-server.scraperapi.com:8001',
        'https': f'http://scraperapi:{API_KEY}@proxy-server.scraperapi.com:8001',
        'no_proxy': 'localhost,127.0.0.1'
    }
}
driver = webdriver.Chrome(seleniumwire_options=proxy_options)
url = 'https://www.booking.com/searchresults.html?ss=New+York'
driver.get(url)
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)
    time.sleep(10)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
listings = soup.find_all('div', attrs={'data-testid': 'property-card'})
print(f"Found {len(listings)} hotel listings on Booking.com")
Keep reading to learn more about these methods and when to use each for your web scraping projects.
Common Pitfalls When Scraping Dynamic Content and How to Avoid Them
Scraping dynamic content from websites can be challenging, but understanding and avoiding common pitfalls can make the process much smoother. Below are three critical pitfalls to watch out for:
1. Ignoring AJAX Requests
The Challenge:
Many websites use AJAX (Asynchronous JavaScript and XML) requests to load content asynchronously without refreshing the entire page. This means that even if the main HTML is fully loaded, the data you need might still be fetched separately via AJAX calls. If your scraper doesn’t account for these requests, you could miss out on essential data.
How to Avoid It:
Inspect the network activity in your browser’s developer tools to identify any AJAX requests that occur when new content appears. Once identified, replicate these requests directly in your scraper to ensure you capture all dynamically loaded data.
import requests
# Example of handling an AJAX request
ajax_url = 'https://www.example.com/ajax_endpoint'
response = requests.get(ajax_url)
data = response.json() # Assuming the response is in JSON format
print(data)
Best Practice:
Always check for AJAX requests when scraping websites. If the content you need is loaded through AJAX, replicate these requests in your scraper to retrieve the data directly.
Resource: We used this strategy to scrape public job listings from LinkedIn.
2. Failing to Handle Pagination
The Challenge:
Many websites, especially ecommerce sites or directories, distribute their content across multiple pages. If your scraper only collects data from the first page, you could miss out on a significant amount of data.
How to Avoid It:
Implement logic in your scraper to handle pagination by detecting and following the “Next” button or by constructing URLs for subsequent pages. This ensures that your scraper collects data from all pages, not just the first one.
import requests
for page in range(1, 11):  # Adjust the range based on the number of pages
    url = f'https://www.example.com/search?page={page}'
    response = requests.get(url)
    # Process each page's results here
    print(f'Page {page} data: {response.text}')
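If the site exposes a “Next” link instead of predictable page numbers, you can follow that link until it disappears. Here’s a minimal sketch of that approach, assuming a hypothetical page whose link text is literally “Next”; on a real site, inspect the markup and adjust the selector accordingly:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://www.example.com/search'  # hypothetical starting page
while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Process the current page's results here
    print(f'Scraped {url}')
    # Follow the link whose visible text is "Next"; stop when there isn't one
    next_link = soup.find('a', string='Next')
    url = urljoin(url, next_link['href']) if next_link else None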
Best Practice:
Always include pagination handling in your scraping logic to ensure comprehensive data collection across all pages.
Resource: Check our guide on how to handle pagination for web scraping.
3. Handling JavaScript Execution and Complex Interactions with ScraperAPI
The Challenge:
Many websites today use JavaScript to load content dynamically and handle complex interactions like form submissions, endless scrolling, and button clicks. If your scraper only retrieves the initial HTML without executing these scripts or interactions, you could miss out on crucial data.
How to Avoid It:
ScraperAPI’s rendering feature, combined with its powerful Render Instruction Set, allows you to process JavaScript and automate page interactions. This means you can simulate user actions like entering a search term, clicking a button, or scrolling through content, all within your scraping workflow.
How to Use the Render Instruction Set
The Render Instruction Set is a JSON object you send to ScraperAPI as part of the request headers. This set of instructions tells the browser exactly what actions to perform during page rendering—such as filling out a form, clicking a button, or waiting for specific content to load. These instructions enable complex user interactions to be automated on dynamic web pages without using headless browsers on your machine.
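Because hand-writing this JSON inside a Python string quickly gets error-prone, one convenient approach is to build the instructions as ordinary Python objects and serialize them with json.dumps. Here’s a minimal sketch of that pattern, reusing the scroll-and-wait instructions shown later in this article:
import json

# Build the Render Instruction Set as Python objects, then serialize it to JSON
instruction_set = [
    {
        "type": "loop",
        "for": 5,
        "instructions": [
            {"type": "scroll", "direction": "y", "value": "bottom"},
            {"type": "wait", "value": 5}
        ]
    }
]

headers = {
    'x-sapi-api_key': 'YOUR_API_KEY',
    'x-sapi-render': 'true',
    'x-sapi-instruction_set': json.dumps(instruction_set)
}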
Example: Scraping with and without Rendering and Interaction Instructions
Let’s examine an example of using ScraperAPI to automate a search on Wikipedia. The goal is to simulate entering the search term “cowboy boots” into the search bar, clicking the search button, and then waiting for the results to load.
Note: To run these snippets, create a free ScraperAPI account and replace 'YOUR_API_KEY' with your API key.
Without Rendering Enabled:
import requests
url = 'https://api.scraperapi.com/'
headers = {
    'x-sapi-api_key': 'YOUR_API_KEY',
    'x-sapi-instruction_set': '[{"type": "input", "selector": {"type": "css", "value": "#searchInput"}, "value": "cowboy boots"}, {"type": "click", "selector": {"type": "css", "value": "#search-form button[type=\\"submit\\"]"}}, {"type": "wait_for_selector", "selector": {"type": "css", "value": "#content"}}]'
}
payload = {
    'url': 'https://www.wikipedia.org'
}
response = requests.get(url, params=payload, headers=headers)
print(response.text)
In this code, we’re sending a request to ScraperAPI to scrape the Wikipedia homepage. The headers include the API key and the Render Instruction Set, but notice that the x-sapi-render header is missing.
While the instructions to input “cowboy boots” and click the search button are sent, the JavaScript necessary to render the search results will not be executed. The expected outcome is that the search action will not be completed, and the returned HTML will likely not include the search results.
With Rendering and Instructions Enabled:
import requests
url = 'https://api.scraperapi.com/'
headers = {
    'x-sapi-api_key': 'YOUR_API_KEY',
    'x-sapi-render': 'true',
    'x-sapi-instruction_set': '[{"type": "input", "selector": {"type": "css", "value": "#searchInput"}, "value": "cowboy boots"}, {"type": "click", "selector": {"type": "css", "value": "#search-form button[type=\\"submit\\"]"}}, {"type": "wait_for_selector", "selector": {"type": "css", "value": "#content"}}]'
}
payload = {
    'url': 'https://www.wikipedia.org'
}
response = requests.get(url, params=payload, headers=headers)
print(response.text)
We’ve added the x-sapi-render: 'true' header in this version. This instructs ScraperAPI to fully render the JavaScript on the page, ensuring that the search input, click action, and subsequent loading of search results are all executed as if a user were interacting with the browser directly. The expected outcome is that the returned HTML will include the search results for “cowboy boots,” reflecting the successful execution of the JavaScript.
Best Practice:
When scraping websites that rely on JavaScript for content loading or require user interactions, always include the x-sapi-render: 'true' header. This ensures that both the JavaScript execution and the interactions you define in the Render Instruction Set are carried out effectively, allowing you to capture all relevant data.
To learn more about using the Render Instruction Set and see additional examples, check out the ScraperAPI documentation.
Scrape Dynamic Content with ScraperAPI [Best Approach]
In this section, I’ll show you how to scrape dynamic hotel search results from Booking.com using ScraperAPI.
Booking.com is a perfect example of a site that loads content dynamically, like hotel listings, prices, and availability. I’ll guide you through the process, demonstrating how ScraperAPI tackles these challenges so you can capture all the data you need.
Step 1: Setting Up the Scraping Project
To start scraping hotel search results from Booking.com, you’ll need to set up your environment to use ScraperAPI with Python.
- Sign Up for ScraperAPI: If you haven’t already, sign up for ScraperAPI and get your free API key.
- Install the Requests Library: Install the requests library in Python to make HTTP requests. Run the following command:
pip install requests
Step 2: Import the Required Libraries
First, you need to import the necessary libraries.
Open your Python script and add the following lines:
import requests
from bs4 import BeautifulSoup
These libraries are important: requests will help you send HTTP requests, while BeautifulSoup will allow you to parse and extract data from the HTML content that ScraperAPI returns.
Step 3: Set Up Your ScraperAPI Key
Next, define your ScraperAPI key. This key will let you access ScraperAPI’s features:
# Your ScraperAPI key
api_key = 'YOUR_API_KEY'
Make sure to replace the placeholder with your actual API key. This is essential for authenticating your requests and using ScraperAPI’s services.
Step 4: Define the URL for Scraping
Now, specify the URL of the page you want to scrape. For this tutorial, we’ll be scraping hotel search results from Booking.com:
# The URL for a Booking.com hotel search query (e.g., hotels in New York)
url = 'https://www.booking.com/searchresults.html?ss=New+York'
You can customize this URL to target different locations by changing the query parameter (e.g., ss=New+York). This flexibility allows you to scrape data for various cities.
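As a quick illustration, here’s a small sketch that builds properly encoded search URLs for a hypothetical list of cities using Python’s standard urllib.parse module:
from urllib.parse import urlencode

cities = ['New York', 'San Francisco', 'Chicago']  # hypothetical targets
base_url = 'https://www.booking.com/searchresults.html'

# Build one properly encoded search URL per city
search_urls = [f"{base_url}?{urlencode({'ss': city})}" for city in cities]
print(search_urls)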
Step 5: Set Up the Payload for ScraperAPI
Next, set up the payload dictionary that will be sent to ScraperAPI. Within this payload, you’ll send the URL you want to scrape.
# Set up the parameters for ScraperAPI
payload = {
    'url': url
}
Step 6: Configure the Headers for the Request
Now, configure the headers for your request to ScraperAPI:
- The x-sapi-api_key header includes your ScraperAPI key for authentication
- The x-sapi-render header enables JavaScript rendering
- The x-sapi-instruction_set header provides detailed instructions, telling ScraperAPI to scroll down the page five times, pausing for five seconds each time. This ensures that all the dynamically loaded content is captured.
These headers will guide how ScraperAPI processes your request:
headers = {
    'x-sapi-api_key': api_key,
    'x-sapi-render': 'true',
    'x-sapi-instruction_set': '[{"type": "loop", "for": 5, "instructions": [{"type": "scroll", "direction": "y", "value": "bottom"}, {"type": "wait", "value": 5}]}]'
}
Step 7: Make the Request to ScraperAPI
With everything set up, it’s time to send the request to ScraperAPI. After sending the request, ScraperAPI processes the page according to your instructions and returns the fully rendered HTML, ready for you to scrape.
# Make the request to ScraperAPI
response = requests.get('https://api.scraperapi.com', params=payload, headers=headers)
Step 8: Parse the HTML Content
Once you have the response, you need to parse the HTML content. Use BeautifulSoup to do this:
soup = BeautifulSoup(response.text, 'html.parser')
This step converts the HTML text into a BeautifulSoup object, making it easy to navigate and extract the specific data you’re interested in, like hotel listings.
Step 9: Extract Hotel Listings
Now that the HTML is parsed, you can extract the hotel listings. Here’s how:
listings = soup.find_all('div', attrs={'data-testid': 'property-card'})
This finds all div elements with the attribute data-testid set to property-card, which is how hotel listings are identified on Booking.com. These cards are where you’ll extract details like hotel names, prices, and ratings.
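As a quick, hedged sketch of that last step, the loop below pulls a name and price out of each card. The data-testid="title" and data-testid="price-and-discounted-price" selectors are assumptions about Booking.com’s current markup and may change, so confirm them in your browser’s developer tools first:
for listing in listings:
    # These selectors are assumptions; inspect the live page to confirm them
    name_tag = listing.find('div', attrs={'data-testid': 'title'})
    price_tag = listing.find('span', attrs={'data-testid': 'price-and-discounted-price'})
    name = name_tag.get_text(strip=True) if name_tag else 'N/A'
    price = price_tag.get_text(strip=True) if price_tag else 'N/A'
    print(name, price)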
Step 10: Print the Results
Finally, let’s see what we’ve got. Print out the number of hotel listings found:
print(f"Found {len(listings)} hotel listings on Booking.com")
This will output the total number of listings, confirming that your scraping process worked and showing you how many hotels were found.
Scraping Dynamic Web Pages with Python and Selenium
Selenium provides another powerful method, particularly when you need precise control over browser actions.
In this section, I’ll walk you through scraping the same Booking.com hotel search results page using Selenium Wire. Selenium Wire extends Selenium’s capabilities by adding support for capturing requests and responses, as well as providing robust proxy integration, making it perfect for use with ScraperAPI.
Step 1: Setting Up the Scraping Project
To get started with Selenium, you’ll need to set up your environment:
- Install Selenium Wire: First, install Selenium Wire using pip:
pip install selenium-wire
- Download a WebDriver: Selenium requires a WebDriver to control the browser. If you’re using Chrome, download the ChromeDriver version that matches your Chrome installation.
- Install BeautifulSoup: You’ll also need BeautifulSoup for parsing HTML:
pip install beautifulsoup4
Step 2: Import the Required Libraries
Next, import the libraries necessary for this project.
from seleniumwire import webdriver # Use selenium-wire's webdriver for proxy support
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup
Step 3: Set Up ScraperAPI in Proxy Mode
Integrate ScraperAPI with Selenium as a proxy to make your scraper more resilient. Using ScraperAPI as a proxy will help rotate IP addresses and manage CAPTCHAs, making your scraping more efficient and reliable.
API_KEY = 'YOUR_API_KEY'
proxy_options = {
    'proxy': {
        'http': f'http://scraperapi:{API_KEY}@proxy-server.scraperapi.com:8001',
        'https': f'http://scraperapi:{API_KEY}@proxy-server.scraperapi.com:8001',
        'no_proxy': 'localhost,127.0.0.1'  # Bypass the proxy for local addresses
    }
}
driver = webdriver.Chrome(seleniumwire_options=proxy_options)
Step 4: Navigate to the Booking.com Page
With the WebDriver set up, you can now navigate to the Booking.com search results page:
url = 'https://www.booking.com/searchresults.html?ss=New+York'
driver.get(url)
Step 5: Handle Dynamic Content with Selenium
Booking.com loads additional content as you scroll down the page. To capture all the hotel listings, you need to scroll to the bottom of the page repeatedly until no more new content loads.
# Scroll to the bottom of the page to load more content
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)
    time.sleep(10)  # Wait for the page to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:  # Check if the bottom has been reached
        break
    last_height = new_height
This loop scrolls the page to the bottom and waits for additional content to load. It continues scrolling until no new content appears, ensuring all dynamic data is captured.
Step 6: Extract and Parse the HTML Content
Once the page has fully loaded and no new content is appearing, retrieve the HTML content and parse it with BeautifulSoup:
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
listings = soup.find_all('div', attrs={'data-testid': 'property-card'})
This captures the complete HTML of the page and extracts all the hotel listings.
Step 7: Print the Results
Finally, print out the number of hotel listings found on the page:
print(f"Found {len(listings)} hotel listings on Booking.com")
This output confirms how many hotel listings your script successfully scraped.
Why Use Selenium Wire with ScraperAPI?
Selenium Wire offers extended capabilities over standard Selenium, particularly in handling network requests and proxy configurations. By combining Selenium Wire with ScraperAPI, you gain:
- IP Rotation and Anonymity: ScraperAPI manages IP rotation, reducing the risk of being blocked.
- CAPTCHA Handling: ScraperAPI can automatically solve CAPTCHAs, allowing uninterrupted scraping.
- Enhanced Control: Selenium Wire gives you fine-grained control over network requests, making it easier to troubleshoot and optimize your scraping.
Together, these tools create a robust setup for scraping even the most dynamic and complex websites.
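As a small illustration of that network-level visibility, here’s a minimal sketch (continuing from the driver created earlier) that uses Selenium Wire’s driver.requests list to log the status code of each response captured during the session:
# Inspect the network traffic Selenium Wire captured during the session
for request in driver.requests:
    if request.response:  # Skip requests that never received a response
        print(request.url, request.response.status_code)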
Scrape Dynamic HTML Content Using the Hidden API
Sometimes, websites load dynamic content through hidden APIs, which can be more efficient to scrape than parsing HTML or handling JavaScript. This method involves directly accessing the structured data these APIs provide, which is particularly useful for platforms like LinkedIn.
How It’s Done:
- Monitor Network Activity:
  - Use your browser’s developer tools to observe the network activity when you interact with the site (like scrolling through LinkedIn).
  - Focus on XHR or Fetch requests that return JSON data—these often point to the hidden APIs.
- Identify and Extract API Endpoints:
  - Find the specific API requests that fetch the data you need.
  - Copy the URL of the API endpoint and note any parameters or headers required.
- Make Direct API Requests:
  - With Python’s Requests library, you can directly query these API endpoints.
  - The response will typically be in JSON format, which you can easily process.
Why It’s Effective: This method bypasses the complexity of scraping HTML and handling JavaScript by directly tapping into the structured data the site uses internally. It’s a cleaner, more reliable way to get the information you need, especially on sites with sophisticated front-end structures like LinkedIn.
By using hidden APIs, you can make your scraping process more efficient and less susceptible to breaking when the website changes its layout or JavaScript.
Of course, you still need to use a tool like ScraperAPI to keep your scrapers from getting banned and losing access to the site.
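Here’s a minimal sketch of this pattern. The endpoint and its parameters are hypothetical placeholders (substitute the ones you discover in your browser’s network tab), and the request is routed through ScraperAPI using the same headers shown earlier so it benefits from proxy rotation:
import requests

API_KEY = 'YOUR_API_KEY'
# Hypothetical hidden endpoint discovered in the browser's network tab
hidden_api_url = 'https://www.example.com/api/v1/listings?page=1'

headers = {
    'x-sapi-api_key': API_KEY
}
payload = {
    'url': hidden_api_url
}

response = requests.get('https://api.scraperapi.com', params=payload, headers=headers)
data = response.json()  # Hidden APIs usually return JSON
print(data)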
Wrapping Up: Selenium vs. ScraperAPI for Scraping JavaScript Content
When it comes to scraping JavaScript-heavy websites, both Selenium and ScraperAPI offer unique advantages, but they serve different needs depending on your project’s requirements. Below is a comparison of their features to help you decide which tool is best suited for your use case.
| Feature | ScraperAPI | Selenium |
| --- | --- | --- |
| Built-in Proxy Management | ✅ | ❌ |
| IP Rotation and Anonymity | ✅ | ❌ |
| Handling Infinite Scrolling | ✅ | ✅ |
| Clicking on Elements | ✅ | ✅ |
| Automatic CAPTCHA Handling | ✅ | ❌ |
| JavaScript Execution | ✅ | ✅ |
| Form Submission Automation | ✅ | ✅ |
| Network Request Monitoring | ✅ (via proxy) | ✅ (with Selenium Wire) |
| Ease of Use for Large-Scale Scraping | ✅ | ❌ |
| Direct Browser Interaction | ✅ | ✅ |
| Headless Browser Support | ❌ | ✅ |
- ScraperAPI has expanded its capabilities to include browser interactions such as clicking elements and submitting forms. This makes it a versatile tool that can handle almost all aspects of web scraping, from large-scale data extraction to complex interactions typically associated with Selenium.
- Selenium remains an excellent choice for projects that require direct, real-time interaction with the browser, particularly in a headless environment.
With ScraperAPI’s new features, it’s now possible to perform more complex scraping tasks without needing to switch to Selenium, making it a more comprehensive solution for a wide range of web scraping needs.
FAQs About Scraping Dynamic Content
What is dynamically loaded content?
Dynamically loaded content refers to elements on a webpage that are loaded after the initial HTML is rendered. This often happens via JavaScript, which pulls data from the server after the page is already displayed in your browser.
Examples include images, text, or interactive elements that load as you scroll or after a specific action like clicking a button. This approach allows websites to deliver a more responsive user experience, but it can complicate web scraping.
What tools can I use to scrape dynamic content?
Several tools can help you scrape dynamic content. ScraperAPI is a robust option that handles JavaScript rendering, CAPTCHA solving, and proxy rotation, making it ideal for large-scale scraping.
Selenium and Selenium Wire are also popular, allowing you to interact with web pages like a real user by automating browsers and handling dynamic interactions. Other tools include Playwright and Puppeteer, which are also used for browser automation and scraping complex sites.
How do I parse dynamic HTML content?
To parse dynamic HTML content, you can use libraries like BeautifulSoup together with tools that handle JavaScript rendering, such as ScraperAPI or Selenium. These tools load the full content, including dynamically generated elements, allowing you to then extract the data with parsers like BeautifulSoup or lxml.
The key is to ensure that the content is fully loaded before attempting to parse, which might require executing JavaScript or scrolling the page.