When scraping data from the web, one of the toughest challenges you’ll face is bot protection systems like AWS WAF Bot Control. It is widely used to filter out bots and ensure that only real users can access a site’s content. While this makes scraping more complex, it’s not impossible.
In this guide, I’ll show you:
- How to bypass AWS WAF Bot Control using Python and ScraperAPI
- How to scrape data from sites like Catch.com.au, which are protected by AWS WAF
- Which tools and techniques help you scrape AWS WAF-protected sites efficiently
Sound good? Let’s get started!
What is AWS WAF Bot Control?
AWS WAF (Web Application Firewall) Bot Control is part of the Amazon Web Services security suite, and it helps websites block automated traffic, like bots, while letting real users through. It analyzes incoming requests, monitors traffic behavior, and applies various techniques to filter out bots. Some of the key methods include JavaScript challenges, CAPTCHA enforcement, and IP-based blocking.
For scrapers, AWS WAF can create several challenges:
- CAPTCHA Challenges: CAPTCHAs are served when AWS suspects that a bot is making the request.
- IP Blocking: Sending too many requests from the same IP can get that IP blocked.
- Rate-Limiting: AWS WAF limits the number of requests that can be made within a specific time frame, throttling any requests that exceed the limit.
Understanding how AWS WAF Bot Control operates is crucial to bypass these detection methods.
In the next section, we’ll explore the different techniques AWS WAF uses to stop bots and how you can work around them.
Understanding AWS WAF Bot Control
When it comes to bot protection, AWS WAF Bot Control uses a multi-layered defense system to block unwanted traffic. It’s designed to handle everything from basic bots that don’t try to hide their presence to advanced bots that mimic actual user behavior.
Let’s take a closer look at how AWS WAF Bot Control works and the different techniques it uses to keep websites safe.
Common Bots vs. Targeted Bots
AWS WAF Bot Control is designed to handle two types of bots, each with different strategies for detection:
Common Bots
These are the basic bots that don’t try to hide what they are. AWS WAF filters these bots out with the following:
- Signature-based detection: AWS maintains a list of known bot signatures—patterns in how bots request web pages, including specific user agents or headers. If a request matches these patterns, it’s flagged as a bot.
- IP reputation lists: AWS has a constantly updated list of IP addresses associated with bot activity. Requests from these IPs are blocked or challenged.
- User-agent validation: It checks the user-agent string in each request to make sure it’s from a real browser, not a bot pretending to be one.
- Request pattern analysis: Even if a bot tries to fly under the radar, AWS can detect it by spotting unusual request rates (like too many requests quickly) or navigation patterns that don’t match human behavior.
Targeted Bots
More sophisticated bots try to behave like real users, making them harder to detect. AWS WAF counters these with advanced techniques like:
- Behavior-based detection: This involves analyzing traffic patterns to see if users behave like bots—such as clicking through pages too fast or accessing multiple pages in an unnatural sequence.
- Machine learning (ML): AWS uses machine learning models to adapt to new bot behaviors. The system continuously learns from past data and can spot patterns that hint at bot activity, even if the bot is well-disguised.
- Browser fingerprinting: AWS collects data from the user’s browser, such as screen size, installed plugins, and fonts. Bots often have trouble replicating an actual browser fingerprint, which gives them away.
- Browser interrogation: AWS can inject JavaScript code into the webpage to check whether the client can run scripts, move the mouse, or type on the keyboard. Bots struggle to replicate these actions accurately.
Both common and targeted bots can be challenged with CAPTCHAs if AWS WAF suspects their traffic. If a bot can’t solve the CAPTCHA, it gets blocked from further access.
These methods create serious barriers for scrapers, but understanding how they work is the first step in overcoming them.
Dynamic Request Validation
AWS WAF ensures that your requests look and behave like those of real users by using dynamic request validation, which includes:
- Header Validation: AWS checks that all the essential headers (like User-Agent, Accept, and Referer) are present and consistent with what a real browser would send. Missing or unusual headers can raise suspicion.
- Cookie Management: AWS tracks cookies throughout your session, expecting them to change in specific ways as you move between pages. If cookies are missing or don’t behave as expected, your request could be flagged.
- Dynamic Token Injection: AWS WAF can insert short-lived tokens (like CSRF tokens) into pages, which need to be included in your follow-up requests. Your request may be blocked if the token is missing, expired, or incorrect.
- Stateful Inspection: AWS monitors the sequence of your requests to ensure they make sense. For example, it expects you to visit a login page before accessing protected resources. AWS WAF will intervene if your request flow doesn’t follow this logic.
To stay undetected, you’ll need to manage cookies properly, rotate headers, include necessary tokens, and follow a natural request flow—just as a real user would.
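To make the token step more concrete, here is a minimal sketch that uses only Python's standard library to pull hidden form fields out of a page so they can be sent back with the next request. The field name csrf_token and the sample HTML are hypothetical; a real site will use its own names, and some tokens are set by JavaScript rather than placed in the markup.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collects name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        input_type = (attributes.get("type") or "").lower()
        if input_type == "hidden" and attributes.get("name"):
            self.hidden_fields[attributes["name"]] = attributes.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of the hidden form fields found in the given HTML."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.hidden_fields


# Hypothetical page fragment; real field names differ per site.
SAMPLE_HTML = """
<form action="/search" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="q">
</form>
"""

fields = extract_hidden_fields(SAMPLE_HTML)
# Merge the token into the data for the follow-up request.
payload = {"q": "headphones", **fields}
```

The resulting payload dictionary could then be passed as form data on the follow-up request, fetched with the same session that loaded the page.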
IP Blocking and Rate-Limiting
AWS WAF’s traffic management system can block or slow down your requests if you send too many too quickly. Here’s how it works:
- Adaptive Rate Limiting: AWS WAF learns what normal traffic patterns look like for each page and adjusts its limits. You could trigger the system’s defenses if you send requests too fast.
- IP Reputation Scoring: AWS WAF scores each IP address based on behavior. If your IP shows signs of suspicious activity, it will get a lower score, which can lead to increased scrutiny or outright blocking.
- Session-Based Rate Limits: AWS WAF doesn’t just watch IP addresses—it also tracks session activity. Simply rotating IPs isn’t enough; you must manage your session behavior carefully to stay undetected.
- Geolocation-Based Rules: AWS WAF applies stricter rules for traffic from certain regions known for higher bot activity. If your requests come from areas associated with bots, you might face tougher rate limits or even CAPTCHAs.
Switching IPs alone won’t cut it. To avoid detection, you’ll need to rotate IPs in a way that mimics normal traffic patterns, while also pacing your requests and keeping your session behavior consistent.
How Catch.com.au Uses AWS WAF Bot Control
A real-world example of AWS WAF Bot Control in action is Catch.com.au, a popular ecommerce platform. They use AWS WAF to block bots from scraping product data, attempting fraud, or disrupting user sessions. Here’s how they use AWS WAF:
- Common Bots: Catch.com.au blocks simple bots using AWS WAF’s signature-based detection and IP reputation lists.
- Targeted Bots: For more advanced bots, Catch.com.au uses behavior-based detection and browser fingerprinting to challenge suspicious traffic.
- Dynamic Request Validation: They enforce strict validation of headers and cookies and use dynamic tokens to confirm legitimate sessions.
- Incident Response: Catch.com.au uses rate limiting and IP blocklists in case of suspicious activity spikes. They also have a “break-glass” Geo-Block rule to restrict traffic to Australia and New Zealand.
- CAPTCHA: Any traffic that makes it through these defenses is challenged with CAPTCHA to ensure the user is human.
You can learn more about how Catch.com.au leverages AWS WAF in the AWS presentation here.
By understanding the general workings of AWS WAF Bot Control and its real-world implementation at Catch.com.au, you can see how website owners think about bot management.
Bypassing AWS WAF Bot Control
Now that you understand how AWS WAF Bot Control works, let’s get into the fun part: how to bypass it.
While AWS WAF puts up a strong defense with IP tracking, user-agent validation, and CAPTCHA challenges, there are ways to get around each. Let’s walk through what you’d need to do manually to scrape AWS WAF-protected sites effectively.
1. IP Rotation to Avoid Blocking
We already know that AWS WAF keeps a close eye on IP addresses, especially those making a lot of requests in a short time. If AWS notices too many requests from one IP, that IP gets flagged quickly. So, your job is to keep things fresh by rotating IPs and spreading your requests across many different addresses.
How to Handle It:
- Set up a pool of proxies so your scraper can use different IPs for each request.
- Ensure these proxies are high-quality (preferably residential) to avoid quick detection.
For sites like Catch, which rely heavily on AWS WAF to block unwanted traffic, rotating your IPs is necessary to avoid being shut out after just a few requests. Without this, you’ll hit a wall pretty fast.
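As a rough illustration of the proxy pool idea, the sketch below cycles through a list of proxies in round-robin order and returns each one in the dictionary shape that the requests library accepts for its proxies argument. The proxy URLs are placeholders, not real endpoints, and proxy_rotator is a hypothetical helper name.

```python
import itertools

# Placeholder proxy endpoints; substitute the ones from your provider.
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]


def proxy_rotator(pool):
    """Return a function that yields the next proxy mapping on each call."""
    cycle = itertools.cycle(pool)

    def next_proxies():
        proxy = next(cycle)
        # Route both plain and TLS traffic through the same proxy.
        return {"http": proxy, "https": proxy}

    return next_proxies


next_proxies = proxy_rotator(PROXY_POOL)

# Usage (not executed here):
#   requests.get(url, proxies=next_proxies(), timeout=30)
```

Round-robin is the simplest policy. A real setup would probably also drop proxies that start returning blocks.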
2. Rotate User-Agents and Headers
We’ve discussed how AWS WAF checks headers and user agents to ensure your traffic looks legitimate. The goal is to keep switching things up, making your scraper blend in with different browsers and devices.
How to Handle It:
- Rotate your User-Agent string for every request to simulate traffic from different browsers. Don’t forget to include common headers like Referer, Accept-Language, and Connection to make your requests look more realistic.
- For a more effective approach, ScraperAPI handles User-Agent rotation and header management for you. Each request gets a fresh User-Agent and the necessary headers, making your traffic look natural without the hassle of manual setup.
Mixing up your user agents and headers makes you much less likely to get caught by AWS WAF’s validation checks, which are designed to flag repetitive or incomplete headers.
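If you are rotating headers by hand, something along these lines is one way to do it. This is a sketch under assumptions: the User-Agent strings are a small illustrative pool that will go stale, the Accept and Accept-Language values are typical browser defaults rather than anything AWS WAF is known to require, and build_headers is a hypothetical helper name.

```python
import random

# Small illustrative pool; keep these current in a real project.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]


def build_headers(referer):
    """Build a browser-like header set with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-AU,en;q=0.9",
        "Referer": referer,
        "Connection": "keep-alive",
    }


headers = build_headers("https://www.catch.com.au/")

# Usage (not executed here):
#   requests.get(url, headers=headers, timeout=30)
```

One design note: changing the User-Agent in the middle of a cookie-based session can itself look inconsistent, so it may be safer to pick one header set per session and rotate between sessions.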
3. Keep an Eye on Sessions and Cookies
Cookies aren’t just for saving your login info—they help AWS WAF track whether requests are part of an actual browsing session. If you start making multiple requests without sending the proper session cookies, AWS will get suspicious, and your scraper could be blocked. To avoid this, you must keep track of cookies and ensure they’re consistent across requests.
How to Handle It:
- Use Python’s requests.Session() to store and manage cookies across multiple requests.
- Start your scraping by visiting the site’s initial pages and capturing session cookies, then send those cookies with your requests.
For example, when scraping Catch.com.au, AWS WAF expects cookies to stay consistent as users move from page to page. If your scraper doesn’t handle cookies properly, it’ll stick out like a sore thumb, and AWS will step in to block you.
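A minimal sketch of that pattern with requests.Session() might look like the following. The helper names make_session and browse are hypothetical, and the second URL in the usage comment is a made-up path used only for illustration.

```python
import requests


def make_session(headers):
    """Create a Session whose cookie jar and default headers persist across requests."""
    session = requests.Session()
    session.headers.update(headers)
    return session


def browse(session, urls, timeout=30):
    """Fetch URLs in order with one session, so cookies set by early pages are sent to later ones."""
    return [session.get(url, timeout=timeout) for url in urls]


session = make_session({"Accept-Language": "en-AU,en;q=0.9"})

# Usage (not executed here; the second path is hypothetical):
#   pages = browse(session, [
#       "https://www.catch.com.au/",
#       "https://www.catch.com.au/some-category-page",
#   ])
#   print(session.cookies.get_dict())  # cookies captured along the way
```

Because the session object owns the cookie jar, any cookie the homepage sets is sent automatically on the second request without extra bookkeeping.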
4. Deal with CAPTCHAs
AWS WAF uses CAPTCHAs to verify human interaction and block bot traffic. With ScraperAPI’s built-in CAPTCHA management, you can easily bypass these obstacles and keep your scraping workflow running smoothly.
By enabling render=true in ScraperAPI, you ensure your requests can handle JavaScript challenges and pass browser integrity checks. This approach mimics real user behavior, allowing you to maintain consistent access without additional CAPTCHA-solving services.
For sites heavily protected by AWS WAF, ScraperAPI’s render feature offers a straightforward way to overcome CAPTCHA challenges and access the data you need.
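If you are requesting a site directly rather than through ScraperAPI, it helps to recognize when you have been served a challenge instead of the page. The sketch below relies on behavior described in AWS's documentation as I understand it: CAPTCHA responses carry an x-amzn-waf-action: captcha header with HTTP 405, and silent challenges carry x-amzn-waf-action: challenge with HTTP 202. Treat those values as assumptions to verify against the current AWS docs. classify_waf_response is a hypothetical helper name.

```python
def classify_waf_response(status_code, headers):
    """Return 'captcha', 'challenge', 'blocked', or 'ok' for a direct response."""
    # Normalize header names, since a plain dict is case-sensitive.
    lowered = {name.lower(): value for name, value in headers.items()}
    action = lowered.get("x-amzn-waf-action", "").lower()

    # Prefer the explicit header; fall back to the status code as a hint.
    if action == "captcha" or status_code == 405:
        return "captcha"
    if action == "challenge" or status_code == 202:
        return "challenge"
    if status_code == 403:
        return "blocked"
    return "ok"


# Usage (not executed here):
#   response = requests.get(url, timeout=30)
#   outcome = classify_waf_response(response.status_code, response.headers)
```

The status-code fallback is deliberately crude, since a 405 or 202 can also come from the application itself. The header is the stronger signal.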
5. Slow Down Your Requests (Rate-Limiting)
AWS WAF isn’t just watching your IPs or headers—it’s also keeping tabs on how fast you’re making requests. If you send too many requests too quickly, you’ll trigger AWS WAF’s rate-limiting defenses. So, slowing down and adding delays between your requests is key to flying under the radar.
How to Handle It:
- Use Python’s time.sleep() to add random delays between requests.
- Mix things up by introducing different delays for each request, mimicking a more natural browsing pattern.
Slowing your request speed is crucial. It makes your scraper look more like a human casually browsing the site rather than a bot speed-running through product pages.
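Here is a small sketch of randomized pacing with time.sleep(). The 2 to 6 second range is an arbitrary assumption, not a threshold known to satisfy AWS WAF, and polite_pause is a hypothetical helper name. The sleep function is passed in as a parameter so the helper can be exercised without actually waiting.

```python
import random
import time


def polite_pause(min_seconds=2.0, max_seconds=6.0, sleep=time.sleep):
    """Sleep for a random interval in [min_seconds, max_seconds] and return the delay used."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Usage (not executed here):
#   for url in urls:
#       response = requests.get(url, timeout=30)
#       polite_pause()  # a different delay before each next request
```

Drawing a fresh random value each time avoids the fixed, evenly spaced rhythm that makes automated traffic easy to spot.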
How to Bypass AWS WAF with ScraperAPI
ScraperAPI makes bypassing AWS WAF Bot Control much more straightforward by handling some of the most complex challenges, like IP rotation, session management, and JavaScript rendering. You don’t have to worry about configuring all these manually—ScraperAPI does it for you behind the scenes.
Let’s walk through how to use ScraperAPI to scrape data from Catch.com.au using Python.
Here’s the script you can use:
import requests
from bs4 import BeautifulSoup

# ScraperAPI key and target URL
API_KEY = 'your_scraperapi_key'
URL = 'https://www.catch.com.au/'

# Parameters for the API request
params = {
    'api_key': API_KEY,
    'url': URL,
    'render': 'true'  # Ensures JavaScript is rendered, crucial for AWS WAF-protected sites
}

# Send the request to ScraperAPI
response = requests.get('http://api.scraperapi.com', params=params)

# Check the response status
if response.status_code == 200:
    print('Successfully bypassed AWS WAF and scraped the page.')
    # Parse the response content with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    # Print the page content
    print(soup.text)  # Visible text of the Catch homepage (use response.text for the raw HTML)
else:
    print(f'Failed to scrape page. Status code: {response.status_code}')
Here’s how it works:
Step 1: Set Up Your API Key
Replace 'your_scraperapi_key' with your actual ScraperAPI key. This key gives you access to ScraperAPI’s services, including IP rotation and JavaScript rendering.
Step 2: Configure Parameters for the Request
- api_key: This is your ScraperAPI key that authenticates the request.
- url: The target website you want to scrape, in this case, catch.com.au.
- render='true': Many AWS WAF-protected sites rely heavily on JavaScript to deliver content and validate users. By rendering JavaScript, ScraperAPI ensures that dynamic content, like product listings or token validation scripts, is loaded correctly, just as it would be in a real user’s browser. This makes your scraper appear more human-like and helps bypass JavaScript-based protections. Always use this parameter when scraping heavily JavaScript-dependent pages.
Step 3: Send the Request
Using requests.get(), send a GET request to ScraperAPI’s endpoint using the parameters we set up. ScraperAPI handles everything: executing JavaScript, rotating proxies, and managing cookies.
Step 4: Check the Response
If the response status code is 200, you’ve successfully bypassed AWS WAF! If you get another status code, such as 403 (Forbidden), the bypass didn’t work, and you may need to tweak your approach, such as retrying or increasing the delays between requests.
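One way to act on a failed status code is to retry with growing delays. The sketch below is a generic exponential backoff loop, not a ScraperAPI feature: fetch_with_backoff is a hypothetical helper, the attempt count and base delay are arbitrary assumptions, and fetch stands for any zero-argument callable that returns an object with a status_code attribute.

```python
import time


def fetch_with_backoff(fetch, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it returns status 200 or the attempts run out.

    Returns the last response received (or None if max_attempts is 0).
    """
    response = None
    for attempt in range(max_attempts):
        response = fetch()
        if response.status_code == 200:
            return response
        # Wait longer after each failure: base_delay, then 2x, then 4x, ...
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return response


# Usage (not executed here), reusing the params dict from the script above:
#   response = fetch_with_backoff(
#       lambda: requests.get('http://api.scraperapi.com', params=params)
#   )
```

Doubling the wait after each failure backs off quickly without giving up on a page after a single block.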
Step 5: Parse the Content
Once you get the response, use BeautifulSoup to parse the HTML content and extract the needed data. The script prints the visible text of the Catch.com.au homepage, but you can customize it to extract specific data like product details or prices.
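As a rough sketch of that customization, the example below extracts titles and prices from a small HTML fragment. The markup and the class names product-card, product-title, and product-price are invented for illustration. They are not Catch.com.au's actual structure, so you would need to inspect the real page and adjust the selectors.

```python
from bs4 import BeautifulSoup

# Invented sample markup; a real page will use different tags and classes.
SAMPLE_HTML = """
<div class="product-card">
  <h3 class="product-title">Wireless Headphones</h3>
  <span class="product-price">$49.00</span>
</div>
<div class="product-card">
  <h3 class="product-title">Coffee Grinder</h3>
  <span class="product-price">$29.95</span>
</div>
"""


def extract_products(html):
    """Return a list of {'title', 'price'} dicts, one per product card found."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for card in soup.select("div.product-card"):
        title = card.select_one(".product-title")
        price = card.select_one(".product-price")
        # Skip cards that are missing either field.
        if title and price:
            products.append({
                "title": title.get_text(strip=True),
                "price": price.get_text(strip=True),
            })
    return products


products = extract_products(SAMPLE_HTML)
```

In the main script, you would pass response.text to extract_products in place of the sample fragment.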
Why ScraperAPI is the Best Tool to Bypass AWS WAF
Bypassing AWS WAF Bot Control requires a solution that effortlessly handles advanced defenses. Here’s why ScraperAPI stands out and makes the process easier for you:
1. All-in-One Solution for Bot Protection
ScraperAPI is not just about proxies—it’s a full-fledged solution that handles everything from IP rotation and CAPTCHA solving to JavaScript rendering. AWS WAF uses multiple strategies like session management, header validation, and CAPTCHA challenges, and ScraperAPI is designed to handle all these layers seamlessly. There is no need to juggle multiple tools—ScraperAPI has you covered.
2. Simple to Integrate, Powerful in Action
Adding ScraperAPI to your Python script is a breeze. With just a few lines of code, you can bypass AWS WAF without the headache of manually setting up proxies, cookies, and headers. The simplicity of ScraperAPI’s integration means you spend less time troubleshooting and more time scraping the data you need.
3. Consistent Performance Against AWS WAF
AWS WAF constantly updates its defenses to stop bots, but ScraperAPI keeps up with these changes. Whether dealing with minor scrapes or large-scale operations, ScraperAPI’s reliable performance ensures fewer blocks and smoother scrapes. It’s built to handle AWS WAF’s evolving protections, giving you peace of mind.
4. Scalable for Projects Big and Small
No matter the size of your project, ScraperAPI can scale to meet your needs. Whether you’re scraping a handful of pages or millions, its automatic IP rotation and infrastructure handle large volumes of requests efficiently. You can scale your scraping without sacrificing speed or reliability.
5. Robust Support and Helpful Resources
Having good support is key when dealing with something as complex as AWS WAF. ScraperAPI provides detailed documentation and tutorials, and its support team is ready to help with any challenges. Whether you need technical guidance or help troubleshooting, ScraperAPI’s resources ensure you’re never stuck.
When it comes to bypassing AWS WAF Bot Control, ScraperAPI stands out as an all-in-one solution. Its simplicity, ease of integration, and reliable performance make it an ideal tool for scraping protected sites.
Ready to start scraping? Sign up for a free ScraperAPI account and receive 5,000 API credits to test its capabilities for seven days.