How to Automate Web Scraping in a Couple of Clicks

Collecting web data can be a complex and time-consuming task, so what if you could run automated website scraping tasks and build large datasets in a fraction of the time?

In this article, I’ll show you how to:

  • Automate web scraping tasks using DataPipeline
  • Collect product information from Amazon automatically
  • Submit a dynamic list of products to improve scraping efficiency
  • Set up a Webhook to receive the data

Automating Web Scraping with DataPipeline

DataPipeline is a user-friendly, low-code tool that simplifies and automates your web scraping tasks. With DataPipeline, you can build complex scrapers and run them at scale without maintaining custom scraping scripts, managing proxies, or handling CAPTCHAs – it’s all taken care of for you, reducing the engineering resources and costs involved.

web scraping amazon

To get started, simply create a free ScraperAPI account. This gives you access to DataPipeline and 5,000 API credits to start scraping. Plus, if you need any assistance, ScraperAPI’s support team is always happy to help.

Setting Up Our First DataPipeline Project

To show you how DataPipeline works, I’ll walk you through automating an Amazon product data scraping project from scratch.

For this project, I’ll use the Amazon Products scraper, which relies on ASINs (Amazon Standard Identification Numbers) to identify products. Don’t worry about the technical details; DataPipeline handles proxy requests, retries, and other challenges behind the scenes using ScraperAPI’s infrastructure.

Let’s get started!

Step 1: Create a New Amazon Project

First, I’ll log in to my ScraperAPI account and navigate to the dashboard. Then I’ll click the “Create a new DataPipeline project” option.

Creating a new DataPipeline project on ScraperAPI

This opens a new window where you can name your project and choose a template. Select the “Amazon Product Pages” template to get started.

Creating a new Amazon product page project

Step 2: Provide Input Data

Next, provide a list of ASINs for the products you want to scrape data from. With DataPipeline, you can scrape up to 10,000 items per project.

Providing a list of ASINs to scrape with DataPipeline

DataPipeline offers flexible options for providing this input data:

  • Manually enter the ASINs directly into the provided text box (one per line) – like I did in the screenshot above
  • Upload a CSV file containing the list of ASINs
  • Use a Webhook to feed the ASINs to DataPipeline

How to Get ASINs

If you’re not sure how to get ASINs, check out this blog post for a detailed guide on scraping Amazon ASINs at scale.

If you’re already familiar with ScraperAPI’s Structured Data Endpoints, you can use the following code to quickly retrieve a list of ASINs for a given search query:

  import requests
  import json

  asins = []

  payload = {
      'api_key': 'YOUR_API_KEY',
      'query': 'Airpods',
      'country': 'us',
  }

  response = requests.get('https://api.scraperapi.com/structured/amazon/search', params=payload)
  earphones = response.json()

  # Collect the ASIN of every product returned in the search results
  all_products = earphones['results']
  for product in all_products:
      if product.get('asin') is not None:
          asins.append(product['asin'])
      else:
          print('not found')

  # Save the ASINs so they can be fed into DataPipeline later
  with open('asins.json', 'w') as f:
      json.dump(asins, f)

Make sure you replace 'YOUR_API_KEY' with your actual ScraperAPI API key. This code will fetch search results for “Airpods,” extract the ASINs from the structured data, and save them to an asins.json file.
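
If you plan to use the text box or CSV upload options described above, you can quickly reshape asins.json into a one-ASIN-per-line file. Here’s a minimal sketch (the output filename is just an example):

  import json

  # Load the ASINs saved by the previous script
  with open('asins.json') as f:
      asins = json.load(f)

  # Write one ASIN per line so the list can be pasted into DataPipeline's
  # text box or uploaded as a single-column CSV file
  with open('asins.csv', 'w') as f:
      f.write('\n'.join(asins))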

Note: You can also use DataPipeline to gather ASINs automatically using the Amazon Search template.

Step 3: Customize Parameters

DataPipeline allows for customization to improve the accuracy and relevance of your scraped data. You can select different parameters and additional options depending on the specific data points you need.

Customizing DataPipeline parameters

For example, you can specify the country from which you want your requests to come (geotargeting) and the TLD (top-level domain) that you want DataPipeline to target.

Step 4: Select Output Format

Next, I’ll decide how I want to receive the scraped data. DataPipeline offers two options:

  1. Download the extracted data directly to your device
  2. Deliver the data to a webhook URL, automating your workflow by integrating the data with other applications or services.

Selecting the output format

You can also choose the format in which you want to receive the scraped data. DataPipeline supports both JSON and CSV formats for available templates. For arbitrary URLs, it’ll return HTML.

JSON is a great option if you need structured data that’s easy to parse and work with in programming languages. On the other hand, CSV is a suitable choice if you need a more spreadsheet-friendly format for analysis or reporting.

Choosing the output format between CSV and JSON
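
Either way, the results are easy to work with afterwards. Here’s a minimal sketch of loading each format, assuming hypothetical filenames for the downloaded files:

  import csv
  import json

  # JSON output: load the structured records directly into Python objects
  with open('amazon_products.json') as f:
      products = json.load(f)

  # CSV output: read one row per product (assuming a header row)
  with open('amazon_products.csv', newline='', encoding='utf-8') as f:
      rows = list(csv.DictReader(f))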

Step 5: Schedule Scraping Frequency

Next, I’ll determine how often I want DataPipeline to execute my scraping job. I can opt for a one-time scrape if I only need the data once.

Scheduling scraping frequency

If I need to keep the data up-to-date, I can schedule the scraper to run at regular intervals – daily, weekly, or at a custom frequency that suits my needs. This is the feature that appealed to me most, as it takes the bulk of recurring data collection work off my plate.

Unfolding the scraping frequency options

Step 6: Set Notification Preferences

ScraperAPI will email you notifications about the progress and completion status of your scraping jobs. You can configure your notification preferences to receive alerts about successful job completions, any errors encountered, or other relevant updates.

Setting notification preferences

This feature allows me to monitor my scraping projects without constantly checking the dashboard – which always saves a couple of minutes of work per day.

Step 7: Run Your Amazon Scraper

Once you’re ready, click “Review & Start Scraping.”

Running your Amazon scraper

Review your project configuration before clicking on the “START SCRAPING” button. You will see an estimate of the credits that will be used for each run, giving you a transparent view of the cost.

Reviewing project configuration before start scraping

You will then be redirected to the project dashboard.

Monitoring, editing and download from project

From this dashboard, you can monitor the progress of your scraper, cancel running jobs if needed, review your configurations, and download the scraped data once the jobs are complete.

Note: Check DataPipeline’s documentation to learn about the tool’s full feature set.

Automate Data Delivery Through Webhooks

One of the things I love about DataPipeline is its ability to deliver data directly to my applications using webhooks. The webhook acts as the endpoint where DataPipeline sends the scraped data, eliminating the need to download and transfer files manually.

Setting Up a Webhook

Before using the webhook functionality in DataPipeline, you must have a webhook URL set up. To make things easier, I’ll show you two options:

  • A no-code option – great for testing, or for production projects on its paid plan
  • An advanced webhook setup using Flask and Ngrok

No-Code Webhook Setup

A simple solution for quickly getting started with webhooks is Webhook.site. When you visit the site, it automatically generates a unique webhook URL. Copy this URL and use it as your webhook endpoint to receive the scraping results.

Setting up a webhook with Webhook.site

This is a quick and easy way to test and see the incoming data.
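
You can even push a test payload to the URL yourself to watch it arrive. Here’s a minimal sketch (the URL below is a placeholder for the one Webhook.site generates for you):

  import requests

  # Placeholder: replace with the unique URL Webhook.site generated for you
  webhook_url = 'https://webhook.site/your-unique-id'

  # Send a small test payload; it should show up on the Webhook.site page right away
  response = requests.post(webhook_url, json={'status': 'test', 'asin': 'B0CHJ5LJZG'})
  print(response.status_code)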

Webhook scraping results incoming data

However, without a paid plan, Webhook.site only retains the data for 7 days, after which it is discarded.

Setting Up a Webhook Using Flask and Ngrok

Forwarding webhooks to a tool like Webhook.site is handy for inspecting payloads, but you’ll often want to receive incoming webhooks directly in your own application.

Here’s how you can do that using Flask, a lightweight web framework for Python, and Ngrok, a tool that exposes local servers to the public internet:

  1. Install Flask and Ngrok

You can install Flask using pip:

  pip install Flask

To install Ngrok:

  • On Windows (using Chocolatey):
  choco install ngrok

  • On macOS (using Homebrew):
  brew install ngrok

For other operating systems, you can download Ngrok directly from the official website.

  2. Configure your auth token (one-time setup)

To use Ngrok, you need to connect it to your account using an auth token. Get your authtoken from your Ngrok dashboard.

Configuring your Authtoken

Open your terminal or command prompt and run the following command, replacing $YOUR_AUTHTOKEN with your actual auth token:

  ngrok config add-authtoken $YOUR_AUTHTOKEN

  3. Create a Flask webhook server

Let’s create a simple Flask application to handle incoming webhook requests. Save the following code in a file named webhook.py:

  from flask import Flask, request, jsonify
  import csv
  import io
  import zipfile
  from werkzeug.utils import secure_filename
  
  app = Flask(__name__)
  
  @app.route('/webhook', methods=['POST'])
  def webhook():
      content_type = request.headers.get('Content-Type', '')
  
      if content_type == 'application/json':
          data = request.json
          print("Received JSON data:", data)
      elif content_type == 'text/csv':
          data = request.data.decode('utf-8')
          csv_reader = csv.reader(io.StringIO(data))
          csv_data = [row for row in csv_reader]
          print("Received CSV data:", csv_data)
      elif content_type.startswith('multipart/form-data'):
          files = request.files
          for field_name, file in files.items():
              print(f"Received file: {field_name}")
              # Save the upload to disk, then extract it if it's a ZIP archive
              saved_name = secure_filename(file.filename)
              file.save(saved_name)
              if saved_name.endswith('.zip'):
                  with zipfile.ZipFile(saved_name, 'r') as zip_ref:
                      zip_ref.extractall('extracted_files')
                      print(f"Extracted files: {zip_ref.namelist()}")
      elif content_type == 'application/zip':
          file = request.data
          with open('received.zip', 'wb') as f:
              f.write(file)
          with zipfile.ZipFile(io.BytesIO(file), 'r') as zip_ref:
              zip_ref.extractall('extracted_files')
              print(f"Extracted files: {zip_ref.namelist()}")
      else:
          data = request.data.decode('utf-8')
          print("Received raw data:", data)
      # Process the data as needed, e.g., add it to a database or perform analysis
      return jsonify({'status': 'success', 'data': 'Processed data'}), 200
  
  if __name__ == '__main__':
      app.run(port=5000)

This code creates a simple Flask application that listens for incoming POST requests on the “/webhook” endpoint. It can handle different content types, such as JSON, CSV, raw data, and ZIP files. It also extracts and saves the received data in the current working directory. The data can be further processed as needed, such as saving it to a database or performing analysis.

  4. Run the Flask application

You can now run your Flask application:

  python webhook.py
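
Before exposing the server to the internet, you can optionally check that it responds as expected. Here’s a minimal sketch of a local test, assuming webhook.py is running on port 5000:

  import requests

  # Send a small JSON payload to the local /webhook endpoint and print the reply
  response = requests.post('http://localhost:5000/webhook', json={'status': 'test'})
  print(response.status_code, response.json())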

  5. Start Ngrok

Open a new terminal window and start Ngrok to expose your local Flask server:

  ngrok http 5000

Ngrok will generate a public URL that forwards requests to your local Flask application. Copy the “Forwarding” URL that Ngrok provides; you’ll use this URL as your webhook endpoint in DataPipeline.

Ngrok generating a URL forward request

  6. Add the Webhook to DataPipeline

In your DataPipeline project settings, find the output settings:

Adding the Webhook to DataPipeline

Select “Webhook” as your preferred method, paste the webhook URL provided by Ngrok, and append the “/webhook” path to it.

Selecting Webhook as output in DataPipeline

DataPipeline will automatically send the scraped data to your webhook URL. Once a job completes, you’ll see the incoming request logged in your Flask terminal:

  Received file: result
  127.0.0.1 - - [25/Jun/2024 15:32:36] "POST /webhook HTTP/1.1" 200 -
  Extracted files: ['job-11814822-result.jsonl']

You can configure your webhook service or server to handle the data as needed, whether saving it to a database, sending notifications, or triggering other automated actions. Here’s an example of the product data delivered for one of the ASINs:

  [
  {
      "input": "B0CHJ5LJZG",
      "result": {
          "name": "Wireless Earbuds, 2024 Wireless Headphones HiFi Stereo Earphones with 4 ENC Noise Canceling Mic, 42Hs Playtime In Ear Earbud, Bluetooth 5.3 Sport Earphones with LED Power Display for Android iOS White",
          "product_information": {
              "model_name": "A60PW-USM",
              "connectivity_technology": "Wireless",
              "wireless_communication_technology": "Bluetooth",
              "special_feature": "Lightweight, Noise Isolation, Volume-Control, Microphone Feature, Sports & Exercise",
              "included_components": "Charging case *1, Type-C Charging Cable*1, User Manual*1, 3 Pairs of Ear Tips(S M L ), Wireless Earbuds*2",
              "age_range_description": "Adult",
              "material": "Plastic",
              "specific_uses_for_product": "Sports & Exercise, Running",
              "charging_time": "1.5 Hours",
              "compatible_devices": "Compatible devices with Bluetooth 5.3 or earlier, … [TRUNCATED]",
              "control_type": "Media Control",
              "cable_feature": "Without Cable",
              "item_weight": "0.634 ounces",
              "control_method": "Touch, Voice",
              "number_of_items": "1",
              "audio_driver_type": "Dynamic Driver",
              "bluetooth_range": "15 Meters",
              "bluetooth_version": "5.3",
              "carrying_case_battery_charging_time": "1.5 Hours",
              "audio_driver_size": "13 Millimeters",
              "earpiece_shape": "Rounded Tip",
              "manufacturer": "Aoslen",
              "package_dimensions": "4.17 x 3.27 x 1.46 inches",
              "asin": "B0CHJ5LJZG",
              "item_model_number": "A60Pro",
              "batteries": "1 Lithium Ion batteries required. (included)",
              "customer_reviews": {
                  "ratings_count": 3154,
                  "stars": 4.5
              },
              "best_sellers_rank": [
                  "#267 in Electronics (See Top 100 in Electronics)",
                  "#48 in Earbud & In-Ear Headphones"
              ],
              "date_first_available": "September 7, 2023"
          },
          "brand": "Brand: Aoslen",
          "brand_url": "https://www.amazon.com/Aoslen/b/ref=bl_dp_s_web_38765055011?ie=UTF8&node=38765055011&field-lbr_brands_browse-bin=Aoslen",
          "full_description": "From the brand Aoslen Wireless Earbuds                                      If you have any questions, … [TRUNCATED]",
          "pricing": "$19.99",
          "list_price": "$49.99",
          "shipping_price": "FREE",
          "availability_status": "In Stock",
          "is_coupon_exists": false,
          "images": [
              "https://m.media-amazon.com/images/I/41nBiguTHfL.jpg", [TRUNCATED]
          ],
          "product_category": "Electronics \u203a Headphones, Earbuds & Accessories \u203a Headphones & Earbuds \u203a Earbud Headphones",
          "average_rating": 4.5,
          "feature_bullets": [
              "2024 New Bluetooth 5.3 Technology: Bluetooth in ear headphones equipped with new version of bluetooth 5.3 chip, using better chip and technology. Transmit high quality lossless audio coding, ensure more stable connection, lower latency and lower power consumption during data transmission. With a stable connection distance of up to 15 meters, you can easily control your music and phone at home, in the office and on the go", [TRUNCATED]
          ],
          "total_reviews": 3154,
          "model": "A60Pro",
          "customization_options": {
              "color": [
                  {
                      "is_selected": false,
                      "url": "https://www.amazon.com/dp/B0BVQG2LVW/ref=twister_B0D14MT3VB?_encoding=UTF8&psc=1",
                      "value": "Black",
                      "price_string": "$19.99",
                      "price": 19.99,
                      "image": "https://m.media-amazon.com/images/I/41cGPiRmLHL.jpg"
                  }, [TRUNCATED]
              ]
          },
          "ships_from": "Amazon",
          "sold_by": "Aoslen US",
          "aplus_present": true
      }
  },
]
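
From here, processing the delivered records is straightforward. Here’s a minimal sketch that reads the extracted JSONL file (the job ID in the filename will differ for each run) and pulls out a few of the fields shown above, assuming each line holds one record with the input/result structure from the sample:

  import json

  products = []

  # Each line of the extracted .jsonl file is assumed to be one JSON record
  with open('extracted_files/job-11814822-result.jsonl') as f:
      for line in f:
          record = json.loads(line)
          result = record.get('result', {})
          products.append({
              'asin': record.get('input'),
              'name': result.get('name'),
              'price': result.get('pricing'),
              'rating': result.get('average_rating'),
          })

  print(products[:3])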

Scrape a Dynamic List of Web Pages

DataPipeline’s ability to accept webhooks as input opens up a whole new level of automation for your web scraping projects. Now, you can scrape dynamic lists of URLs, ASINs, product IDs, or even search queries without having to manually update your DataPipeline projects.

Imagine you want to track Google search results for a list of perfume queries that change regularly. Instead of manually updating your project in DataPipeline every time your list changes, you can automate the entire process.

Create a Webhook to Provide Search Queries

Let’s create a webhook endpoint using Flask to provide the DataPipeline with a list of search queries. I’ll store my search queries in a CSV file, but you could easily adapt this to fetch queries from a database or any other dynamic source.

  1. Create a perfume_queries.csv file

Create a CSV file named perfume_queries.csv with a header row followed by your search queries, one per line. For example:

  Perfume
  Blue de Chanel
  Dior Sauvage
  Gucci Guilty 
  Versace Eros 
  Calvin Klein Euphoria
  Marc Jacobs Daisy
  Yves Saint Laurent Black Opium
  Tom Ford Black Orchid
  Initio Oud For Greatness
  Paco Rabanne 1 Million

  2. Create the Flask webhook:

Create a Python file (e.g., queries_webhook.py) and add the following Flask code:

  from flask import Flask, jsonify
  import csv
  
  app = Flask(__name__)
  
  def read_queries_from_csv(filepath):
      with open(filepath, newline='', encoding='utf-8') as csvfile:
          reader = csv.reader(csvfile)
          next(reader)  # Skip the header row
          queries = [row[0] for row in reader if row]  # One query per row
      return queries
  
  @app.route('/queries', methods=['GET'])
  def provide_queries(): 
      queries = read_queries_from_csv('perfume_queries.csv')
      return jsonify(queries)
  
  if __name__ == '__main__':
      app.run(debug=True, port=5000)

Here, we define a Flask app with a /queries route that reads the search queries from your CSV file and returns them in JSON format.

  3. Run the Flask app:

Start your Flask application:

  python queries_webhook.py
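
To sanity-check that the endpoint returns your queries, you can hit it locally before exposing it. A minimal sketch, assuming the app is running on port 5000:

  import requests

  # The endpoint should return the list of queries from perfume_queries.csv as JSON
  response = requests.get('http://localhost:5000/queries')
  print(response.json())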

  4. Expose with Ngrok:

Use Ngrok (as explained in the previous section) to create a publicly accessible URL for your local Flask webhook:

  ngrok http 5000

Copy the forwarding URL provided by Ngrok.

Ngrok generating a URL forward request

Create a Google Search Project in DataPipeline

Go to your DataPipeline dashboard and create a new project using the “Google Search” template.

Creating a Google search project in DataPipeline

Instead of manually entering search terms, paste the Ngrok forwarding URL (that points to your /queries webhook) into the input field in DataPipeline.

Entering the search terms in DataPipeline

DataPipeline will automatically call your webhook, fetch the latest list of search queries, and use them as input for scraping Google Search results. Now, you have a fully automated pipeline where any updates to your perfume_queries.csv file will be reflected in your scraped data without any manual intervention!
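
For example, adding a new query is as simple as appending a row to the CSV. A minimal sketch (the query below is just a hypothetical addition):

  import csv

  # Append a new search query; the next scheduled run will pick it up via the webhook
  with open('perfume_queries.csv', 'a', newline='', encoding='utf-8') as f:
      csv.writer(f).writerow(['Armani Acqua di Gio'])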

Also Read: How to Scrape Google Shopping with Python

Wrapping Up: Automated Web Scraping

In this guide, I’ve shown you how to automate web scraping using ScraperAPI’s DataPipeline. You learned how to:

  • Automate scraping projects with DataPipeline
  • Configure input data dynamically using webhooks
  • Automate data delivery through webhooks

DataPipeline takes the hassle out of web scraping, making it easier to gather the data you need at scale. Ready to give it a try? Sign up for a free ScraperAPI account and try DataPipeline today.

Until next time, happy scraping!

About the author

John Fawole

John Fáwọlé is a technical writer and developer. He currently works as a freelance content marketer and consultant for tech startups.
