How to Bypass and Scrape PerimeterX Protected Sites with Python

Tutorial on how to bypass and scrape PerimeterX-protected sites with Python

Bypassing PerimeterX (now Human Security) can be incredibly challenging. Anyone who’s tried to set up their own scraping infrastructure with rotating proxies and headless browsers knows the struggle. You finally get something to work, and then PerimeterX makes some changes. Then, you’re back to square one, troubleshooting and updating your entire setup.

More importantly, PerimeterX is not a Content Delivery Network, so techniques aimed at the origin server of the website it protects won’t cut it. Currently, over 153,000 websites use PerimeterX, and major companies like Zillow rely on it for security and anti-bot protection of their data.

These services are all designed to block web scrapers and other automated scraping attempts. But in this article, we’ll explore how to bypass PerimeterX protection and successfully scrape websites using Python and ScraperAPI.

Never Get Blocked Again

ScraperAPI’s advanced bypassing techniques let you collect data from sites protected by all major bot blockers on the market.

What is PerimeterX (now Human Security)?

PerimeterX is a cybersecurity firm that provides tools to protect web applications from automated attacks, fraudulent activities, and web scraping. Using advanced machine learning algorithms and risk scores, PerimeterX analyzes request fingerprints and behavioral signals to detect and block bot attacks in real time.

One of its essential defenses, the HUMAN Bot Defender (formerly PerimeterX Bot Defender), is directly aimed at scraping bots. It sits directly on a website, monitoring and collecting information from incoming web requests. These requests are then analyzed according to predefined rules and later acted upon if found suspicious.

While PerimeterX does allow some legitimate bots, like Googlebot, to pass through, it remains a formidable barrier to unauthorized web scraping and other automated activities.

Recognize PerimeterX

To identify PerimeterX’s presence on a website, look for these characteristics (a quick automated check is sketched after the list):

  • Internal property: Check your browser’s console for the “window._pxAppId” property.
  • Collector XHR: PerimeterX can operate with or without an external server.

    • External Collector: Look for requests to domains like “px-cdn.net”, “pxchk.net”, or “px-client.net”.
    • Internal Endpoint: An internal endpoint will likely follow the format “/rf8vapwA/xhr/api/v2/collector”.
  • Cookies: PerimeterX sets cookies with names like “_px3”, “_pxhd”, and “_px_vid”.
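
If you want to automate this check, here is a minimal Python sketch using requests. It simply looks for the marker strings and cookie names from the list above; note that a protected site may block even this first request:


  import requests

  # Marker strings and cookie names taken from the list above
  PX_SCRIPT_MARKERS = ["window._pxAppId", "px-cdn.net", "pxchk.net", "px-client.net"]
  PX_COOKIE_NAMES = ["_px3", "_pxhd", "_px_vid"]

  def looks_like_perimeterx(url):
      response = requests.get(url, timeout=30)
      # Script references or inline sensor config embedded in the HTML
      if any(marker in response.text for marker in PX_SCRIPT_MARKERS):
          return True
      # PerimeterX also sets its own cookies on first contact
      return any(name in response.cookies for name in PX_COOKIE_NAMES)

  print(looks_like_perimeterx("https://www.neimanmarcus.com"))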

Note: Are you frustrated that your web scrapers get blocked again and again? ScraperAPI handles rotating proxies and headless browsers for you. Try it for FREE!

Popular PerimeterX Errors

When PerimeterX determines that a visitor is an automated scraper or bot, it will block access and serve a 403 Forbidden response along with branded block pages. These responses come from the PerimeterX sensor integrated with the site’s infrastructure.

Here are some common PerimeterX error messages you might encounter:

  • 403 Forbidden: PerimeterX has flagged your request as potentially automated.
  • “Please prove you are human”: the JavaScript browser test failed, and the visitor is prompted to verify their humanity.
  • “Enable cookies”: irregular or missing cookie headers.
  • “Try again later”: a temporary block due to suspicious activity patterns.
  • CAPTCHA challenge: triggered by failed fingerprinting and behavior checks, requiring additional verification to continue.

Remember, PerimeterX’s blocking behaviors are designed to be dynamic and may change over time. Always be prepared to adapt your scraping techniques as you encounter new types of blocks or challenges.
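
To make that adaptability concrete, here is a minimal retry sketch. It assumes the block-page phrases above appear in the response body, which may change over time:


  import time
  import requests

  # Phrases that typically appear on PerimeterX block pages (may change)
  BLOCK_HINTS = ("Please prove you are human", "Press & Hold")

  def fetch_with_retry(url, retries=3):
      for attempt in range(retries):
          response = requests.get(url, timeout=30)
          blocked = response.status_code == 403 or any(
              hint in response.text for hint in BLOCK_HINTS
          )
          if not blocked:
              return response
          # Back off before retrying; temporary blocks sometimes clear
          time.sleep(2 ** attempt)
      raise RuntimeError(f"Still blocked after {retries} attempts: {url}")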

How does PerimeterX detect bots?

You can recognize PerimeterX by the “Press & Hold” and “Please verify you are a human” messages similar to the image below:

Verification method from PerimeterX

To avoid anti-web scraping services like PerimeterX, we should first understand how they work, which really boils down to three categories of detection:

  • IP address
  • Javascript Fingerprint
  • Request details

Services like PerimeterX use these tools to calculate a trust score for every visitor. A low score means you’re most likely a bot, so you’ll either be required to solve a CAPTCHA challenge or denied access entirely. So, how do we get a high score?

IP Addresses / Proxies

PerimeterX analyzes IP addresses to identify suspicious activity, so ideally, we want to distribute our load through proxies. These can be datacenter, residential, or mobile proxies.

So, to maintain a high trust score, your scraper should rotate through a pool of residential or mobile proxies. To learn more, check our blog post on what residential proxies are and why use them for scraping.
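
Here is a minimal rotation sketch using requests. The proxy URLs are placeholders; substitute your own residential or mobile endpoints:


  import random
  import requests

  # Placeholder endpoints; replace with your own proxy pool
  PROXIES = [
      "http://user:pass@proxy1.example.com:8080",
      "http://user:pass@proxy2.example.com:8080",
  ]

  def get_with_rotation(url):
      # Pick a different proxy for each request to spread the load
      proxy = random.choice(PROXIES)
      return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)

  response = get_with_rotation("https://www.example.com")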

Javascript Fingerprints

This topic is quite extensive, especially for junior developers, but here is a quick summary.

Websites can use Javascript to fingerprint the connecting client (the scraper) because Javascript leaks data about the client, including the operating system, supported fonts, visual rendering capabilities, etc.

For example, if PerimeterX detects a large number of Linux clients connecting through 1280×720 windows, it can deduce that this sort of setup is likely a bot and give everyone with these fingerprint details low trust scores.

If you’re using Selenium to bypass PerimeterX, you need to patch many of these leaks to escape the low-trust zone. You can do this by modifying the browser to report fake fingerprint details or by using a patched driver such as undetected-chromedriver.
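
For illustration, here is a minimal Selenium sketch that patches two well-known leaks: the navigator.webdriver flag and the telltale headless window size. Treat it as a starting point only, since PerimeterX inspects far more signals than these:


  from selenium import webdriver

  options = webdriver.ChromeOptions()
  options.add_argument("--window-size=1920,1080")  # avoid the suspicious headless default

  driver = webdriver.Chrome(options=options)

  # Hide the webdriver flag before any page script runs
  driver.execute_cdp_cmd(
      "Page.addScriptToEvaluateOnNewDocument",
      {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
  )

  driver.get("https://www.example.com")  # the patch applies to every new page
  driver.quit()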

For more on this, check out our blog: 10 tips for web scraping without getting blocked.

Request Details

PerimeterX can still give us low trust scores if our connection patterns are unusual, even if we have a large pool of IP addresses and have patched our headless browser to prevent key fingerprint details from being leaked.

It observes various on-page events, such as mouse movement and scrolling, which suggests that it utilizes behavioral analysis. As a result, it’s important to be careful when scraping these sites: your scraper should scrape in non-obvious patterns and connect to non-target pages, like the website’s homepage, once in a while to appear more human-like.
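
Here is a minimal sketch of that pacing idea, with randomized delays and occasional visits to a non-target page (the URLs are placeholders):


  import random
  import time
  import requests

  HOMEPAGE = "https://www.example.com"  # placeholder non-target page

  def humanized_crawl(product_urls):
      for url in product_urls:
          if random.random() < 0.2:
              # Occasionally browse the homepage to look less like a bot
              requests.get(HOMEPAGE, timeout=30)
          requests.get(url, timeout=30)
          time.sleep(random.uniform(2, 8))  # irregular, human-like pauses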

Now that we understand how our scraper is being detected, we can start researching how to bypass these measures. Selenium, Playwright, and Puppeteer have large communities, and the keyword to look for here is “stealth.”

Unfortunately, this approach is not bulletproof, as PerimeterX can simply collect publicly known patches and adjust its service accordingly. If you’ve been trying this method manually, you’ve likely experienced this a couple of times and had to figure out many things yourself.

A better alternative is to use a web scraping API to scrape protected websites easily.

Bypassing PerimeterX with ScraperAPI

ScraperAPI helps you avoid getting your IP banned by rotating IP addresses, handling CAPTCHAs, and managing request limits. It simplifies the entire scraping process by providing reliable access to web pages, ensuring you can focus on extracting and analyzing data rather than handling anti-scraping measures.

Dashboard page from ScraperAPI site

Using Python and ScraperAPI makes scraping large amounts of data from PerimeterX-protected websites much easier.

In this guide, we’ll demonstrate how to scrape product information from Neiman Marcus, a well-known department store with an e-commerce website protected by PerimeterX.

Prerequisites

To follow along with this guide, ensure you have met the following prerequisites:

  • Python 3 and pip installed on your machine
  • A free ScraperAPI account (we’ll create one in Step 3)
  • Basic familiarity with Python and HTML

Step 1: Set Up Your Environment

First, let’s set up a virtual environment to avoid any conflicts with existing Python modules or libraries.

For macOS users:


  pip install virtualenv
  python3 -m virtualenv venv
  source venv/bin/activate

For Windows users:


  pip install virtualenv
  virtualenv venv
  venv\Scripts\activate

Once your virtual environment is activated, install the required libraries:


  pip install requests beautifulsoup4 lxml

Step 2: Analyze the Target Website

As mentioned, our target is Neiman Marcus’s e-commerce site, where we’ll try to scrape product prices.

Using tools like Wappalyzer, we can confirm that PerimeterX protects Neiman Marcus. To bypass this protection, we’ll use ScraperAPI.

Neiman Marcus website security with PerimeterX

We’ll focus on scraping product information from the men’s loafers category page.

Neiman Marcus men's loafers category page

Step 3: Set Up ScraperAPI

To get started, create a free ScraperAPI account and copy your API key from your dashboard.

Creating a free ScraperAPI account on ScraperAPI page

Step 4: Write the Scraping Script

Now, let’s create a Python script that uses ScraperAPI to bypass PerimeterX and scrape product information:


  import requests
  from bs4 import BeautifulSoup
  import json
  
  API_KEY = "YOUR_API_KEY"
  url = "https://www.neimanmarcus.com/en-ng/c/shoes-shoes-loafers-slip-ons-cat10580739?navpath=cat000000_cat000470_cat000550_cat10580739"
  
  # render=true tells ScraperAPI to load the page in a headless browser
  payload = {"api_key": API_KEY, "url": url, "render": "true"}
  
  def scrape_product_info():
      # Route the request through ScraperAPI instead of hitting the site directly
      response = requests.get("http://api.scraperapi.com", params=payload)
      if response.status_code != 200:
          print(f"Failed to retrieve the page. Status code: {response.status_code}")
          return []
  
      soup = BeautifulSoup(response.text, 'lxml')
      # Each product card on the category page sits in this container class
      products = soup.find_all('div', class_='product-thumbnail__details')
      
      product_info = []
      for product in products:
          designer = product.find('span', class_='designer')
          name = product.find('span', class_='name')
          price = product.find('span', class_='price-no-promo')
          badge = product.find('div', class_='gift-badge')
  
          info = {
              'designer': designer.text.strip() if designer else 'N/A',
              'name': name.text.strip() if name else 'N/A',
              'price': price.text.strip() if price else 'N/A',
              'batch_details': badge.text.strip() if badge else 'N/A'
          }
          product_info.append(info)
  
      return product_info
  
  def save_to_json(product_info, filename='neiman_marcus_loafers.json'):
      with open(filename, 'w', encoding='utf-8') as jsonfile:
          json.dump(product_info, jsonfile, ensure_ascii=False, indent=4)
  
  if __name__ == "__main__":
      print("Scraping product information...")
      product_info = scrape_product_info()
      
      if product_info:
          print(f"Found {len(product_info)} products.")
          save_to_json(product_info)
          print("Product information saved to neiman_marcus_loafers.json")
      else:
          print("No products found or there was an error scraping the website.")

Note: To run this example, replace YOUR_API_KEY with your actual API key from your dashboard.

Here, we’re making a request to ScraperAPI, passing our API key and the target URL. ScraperAPI will handle the complexities of rotating IP addresses, solving CAPTCHAs, and managing other PerimeterX anti-scraping measures. The render=true parameter within the payload instructs ScraperAPI to fully render the target webpage in a headless browser, ensuring that any dynamic content protected by PerimeterX is fully loaded and available for scraping.
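
If you need more control, the same payload accepts additional ScraperAPI parameters. For example, a geotargeted, rendered request could look like the sketch below; verify availability for your plan in ScraperAPI’s documentation:


  # Extra ScraperAPI parameters (verify against the official docs)
  payload = {
      "api_key": API_KEY,
      "url": url,
      "render": "true",      # render JavaScript in a headless browser
      "country_code": "us",  # route the request through US-based proxies
  }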

Once ScraperAPI returns the rendered HTML content, we use BeautifulSoup to parse it. We target specific HTML elements using their class names (product-thumbnail__details, designer, name, price-no-promo, gift-badge) to extract the desired product information.

Step 5: Run the Script

Save the script as neiman_marcus_scraper.py and run it:


  python neiman_marcus_scraper.py

This script will then scrape product information from the Neiman Marcus men’s loafers page, bypassing PerimeterX protection using ScraperAPI. The scraped data will be saved in a JSON file named neiman_marcus_loafers.json.


  [
      {
          "designer": "Tod's",
          "name": "Men's Suede Moccasin Slipper Loafers",
          "price": "$725",
          "batch_details": "N/A"
      },
      {
          "designer": "Ferragamo",
          "name": "Men's Cosimo Leather Gancini Bit Loafers",
          "price": "$895",
          "batch_details": "Best Seller"
      },
      {
          "designer": "Prada",
          "name": "Men's Monolith Patent Leather Loafers",
          "price": "$1,270",
          "batch_details": "Exclusive"
      },
      {
          "designer": "Santoni",
          "name": "Men's Kalvin Leather Penny Loafers",
          "price": "$1,330",
          "batch_details": "N/A"
      },
      ... (truncated)
  ]

Congratulations! If you have been coding along so far, you have successfully bypassed a website protected by HUMAN Security. This approach can also be adapted to scrape other PerimeterX-protected websites.

Final Thoughts

There’s no denying that PerimeterX (now HUMAN Security) is a sophisticated anti-bot system that employs a range of techniques to protect websites from unwanted scraping. From monitoring IP addresses and analyzing request headers to using various fingerprinting methods, PerimeterX presents a formidable challenge for web scrapers.

ScraperAPI’s team is constantly updating the API’s bypassing methods to keep your scrapers running, making it the best way to scrape data from PerimeterX-protected sites without getting blocked.

If you’re interested in learning how to scrape other popular websites, check out our other how-to-scrape guides.

FAQs about bypassing bot protection

Is it legal to bypass PerimeterX?

Bypassing PerimeterX itself does not inherently violate any laws. However, other factors like site terms, data usage, and automation disclosure may still apply legally. To learn more, visit our blog post: is web scraping legal? the complete guide.

Why use ScraperAPI instead of maintaining my own bypass?

ScraperAPI handles complex anti-bot evasion techniques, saving you time and effort in maintaining your own solution. It’s like having a team of developers constantly updating your scraper to ensure high success rates.

Can PerimeterX be bypassed reliably?

PerimeterX has very advanced bot detection capabilities but is not flawless. Well-designed scrapers using the tactics in this guide have proven successful in large-scale deployments.

What are the alternatives to PerimeterX?

Cloudflare, DataDome, Imperva, and Akamai are major competitors providing bot protection. The techniques discussed in this article apply broadly to various vendor solutions. You’ll also be able to bypass these anti-bots using ScraperAPI’s tools.

About the author

John Fáwọlé

John Fáwọlé is a technical writer and developer. He currently works as a freelance content marketer and consultant for tech startups.
