4 Projects to Learn Web Scraping from Scratch

Fundamentals of web scraping using Python and the Scrapy framework

New to web scraping? This is the second part of our two-part web scraping guide. In part 1, we explored the basic concepts of web scraping and its main challenges.

In this entry, we'll dive into four web scraping projects that'll help you further understand how the Scrapy framework works by:

  • Collecting data from popular sites
  • Dealing with common challenges from large sites
  • Using ScraperAPI to bypass advanced anti-bots

Project 1: Scraping iTunes’ Popular Podcasts with Emails

Objective: Find the most popular podcasts and get the hosts’ contact email.

In this example, we combine classical web scraping with a REST API and a feed parser. Let’s start with some ideas and research.

First, we need to find a list of the most popular podcasts. As iTunes (Apple Podcasts) is still considered the canonical go-to place for podcasts, we start here.

We notice that each category (and sub-category) has its own list of popular podcasts. Each is linked to the details page of a podcast. Unfortunately, the details page does not contain the host’s email address. However, we have the iTunes ID of each podcast and can query the iTunes lookup API to get the feed URL.

Note: Feeds are a structured way to provide content in a machine-readable format. They are often used for news, blogs, and podcasts. The most common feed format is RSS (Really Simple Syndication).

Inside the feed, we find the publisher’s email address. Our retrieval strategy, therefore, goes as follows:

1. Compile a list of popular podcasts by category from the iTunes website

We find a link to each category on the overview page of Apple Podcasts.

iTunes website preview

There is a list of popular podcasts for each category.

iTunes podcast preview list

Unfortunately, the email address is not available on the details page. We can, however, get the iTunes ID from the URL.

Apple podcast preview

2. Look up the feed URL via the iTunes lookup API

Apple provides an API to look up podcasts using their iTunes ID. We can use this API to get the podcast feed URL.
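For example, a quick standalone sketch of that lookup (using the requests library and an arbitrary example iTunes ID) could look like this; the spider below does the same thing with a Scrapy request:

<pre class="wp-block-syntaxhighlighter-code">
import requests

# Look up a podcast by its iTunes ID (the ID here is just an example)
response = requests.get("https://itunes.apple.com/lookup", params={"id": "1200361736"})
data = response.json()

# The API returns a JSON document with a "results" list; each result
# contains podcast metadata, including the feed URL
results = data.get("results", [])
if results:
    print(results[0].get("feedUrl"))
</pre>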

3. Get the publisher’s email from the feed

The feed is the machine-readable version of the podcast. It contains the podcast episodes and metadata. We can extract the publisher’s email from the feed.
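As a quick illustration (with a placeholder feed URL), feedparser exposes the owner's email, when the feed provides one, via the publisher_detail attribute, which is the same attribute the spider below reads:

<pre class="wp-block-syntaxhighlighter-code">
import feedparser

# Parse a podcast feed; feedparser accepts a URL or a raw XML string
# (the URL below is a placeholder)
feed = feedparser.parse("https://example.com/podcast/feed.xml")

# Not every feed publishes an owner email, so fall back to None
email = feed.feed.get("publisher_detail", {}).get("email")
print(email)
</pre>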

Put together, the spider looks like this:

<pre class="wp-block-syntaxhighlighter-code">
import feedparser
import scrapy


class ItunespopularwithemailsSpider(scrapy.Spider):
    name = "ItunesPopularWithEmails"
    start_urls = ["https://podcasts.apple.com/us/genre/podcasts/id26"]

    def parse(self, response):
        # Follow every category link on the Apple Podcasts genre overview page
        category_links = response.css("div#genre-nav a::attr(href)").getall()
        for category_link in category_links:
            yield scrapy.Request(
                category_link,
                callback=self.parse_category,
            )

    def parse_category(self, response):
        # Each popular podcast links to a details page whose URL ends with the iTunes ID
        for a in response.css("div#selectedcontent a"):
            podcast_id = a.css("::attr(href)").get().split("/")[-1].replace("id", "")
            yield response.follow(
                f"https://itunes.apple.com/lookup?id={podcast_id}",
                self.parse_api,
            )

    def parse_api(self, response):
        # The lookup API returns JSON containing the podcast's feed URL
        data = response.json()
        result = data.get("results")[0]
        yield response.follow(result.get("feedUrl"), self.parse_feed)

    def parse_feed(self, response):
        # Parse the RSS feed and extract the publisher's email address, if present
        feed = feedparser.parse(response.text)
        try:
            email = feed.feed.publisher_detail.email
        except AttributeError:
            email = None
        if email:
            yield {
                "email": email,
                "feed": response.url,
            }
</pre>

Project 2: Scraping Amazon RAM Prices

Objective: Get a list of the 10 most popular RAM kits with 32GB and their prices from Amazon.

Amazon is a popular online marketplace where you can buy almost anything. In this example, we want to get the prices of the most popular RAM kits with 32GB.

You may want to collect a time series of prices to analyze the price development over time. Alternatively, you might choose a different product category or search term to conduct a competitive analysis if you sell similar products.

We start by searching for ram 32gb on Amazon. Just use your browser to search for the term and copy the URL.
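If you'd rather build the search URL programmatically than copy it from the browser, a small sketch with urllib produces the same kind of URL as the start_urls entry in the spider below:

<pre class="wp-block-syntaxhighlighter-code">
from urllib.parse import urlencode

# Build an Amazon search URL for the term "ram 32gb"
search_url = "https://www.amazon.com/s?" + urlencode({"k": "ram 32gb"})
print(search_url)  # https://www.amazon.com/s?k=ram+32gb
</pre>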

Our scraper looks like this:

<pre class="wp-block-syntaxhighlighter-code">
import scrapy


API_KEY = '...'


class EconomicsamazonrampriceswithscrapingapiSpider(scrapy.Spider):
    name = "EconomicsAmazonRAMPricesWithScrapingAPI"
    allowed_domains = ["amazon.com", "www.amazon.com"]
    start_urls = ["https://www.amazon.com/s?k=ram+32gb&"]

    def start_requests(self):
        # Route every request through the ScraperAPI proxy to avoid blocks
        meta = {"proxy": f"http://scraperapi:{API_KEY}@proxy-server.scraperapi.com:8001"}
        for start_url in self.start_urls:
            yield scrapy.Request(start_url, callback=self.parse, meta=meta)

    def parse(self, response):
        # Every search result is a div carrying a data-asin attribute
        items = response.xpath('//div[@data-asin]')
        for item in items:
            asin = item.xpath('@data-asin').extract_first()
            if not asin:
                continue
            price = item.xpath('.//span[@class="a-price"]/span/text()').extract_first()
            if not price:
                continue
            # Strip the currency symbol and thousands separators before converting
            price = float(price.replace("$", "").replace(",", ""))
            yield {
                "asin": asin,
                "title": item.xpath('.//h2/a/span/text()').extract_first(),
                "price": price,
            }
</pre>

💡 Pro Tip

You can simplify this process using ScraperAPI's Amazon Search endpoint. Send your search query through the API, and ScraperAPI will return all Amazon search results in JSON or CSV format.
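As a rough sketch, assuming the search endpoint follows the same pattern as the product endpoint shown later in this guide (a /structured/amazon/search path that takes a query parameter; check ScraperAPI's documentation for the exact specification), such a request could look like this:

<pre class="wp-block-syntaxhighlighter-code">
import requests

# Assumed endpoint path and parameter names; verify against ScraperAPI's docs
payload = {
    'api_key': '<YOUR API KEY>',
    'query': 'ram 32gb',
}

response = requests.get('https://api.scraperapi.com/structured/amazon/search', params=payload)
print(response.json())
</pre>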

Project 3: Scraping Public Holidays

Objective: Get a list of public holidays in the United States, Great Britain, Germany, France, Spain, Italy, and the Netherlands.

This one is a bit different. Sometimes, you'll have a special request, such as a list of public holidays, a list of international dialing codes, or a list of countries with their capitals.

For these kinds of things, GitHub is your friend: many repositories contain exactly this kind of data. For this example, I found a repository with public holiday data for many countries. The data is publicly available beyond GitHub and can be accessed via a simple API.
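As a quick standalone sketch, the same CSV export the spider below relies on can be fetched with plain requests:

<pre class="wp-block-syntaxhighlighter-code">
import csv
import io
import requests

# Fetch the 2023 public holidays for the US as CSV
# (same URL pattern the spider below uses)
url = "https://date.nager.at/PublicHoliday/Country/US/2023/CSV"
response = requests.get(url)

# Each CSV row describes one public holiday
for row in csv.DictReader(io.StringIO(response.text)):
    print(row)
</pre>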

Here is our spider code:

<pre class="wp-block-syntaxhighlighter-code">
import csv
import datetime
import io
import scrapy


class PublicholidaysSpider(scrapy.Spider):
    name = "PublicHolidays"
    allowed_domains = ["date.nager.at"]
    countries = ["US", "GB", "DE", "FR", "ES", "IT", "NL"]
    start_year = 2020

    def start_requests(self):
        # Request the CSV export for every country and every year since start_year
        current_year = datetime.datetime.now().year
        for country in self.countries:
            for year in range(self.start_year, current_year + 1):
                url = f"https://date.nager.at/PublicHoliday/Country/{country}/{year}/CSV"
                yield scrapy.Request(url, self.parse, cb_kwargs={"country": country, "year": year})

    def parse(self, response, year=None, country=None):
        # Parse the CSV response and enrich each row with its year and country
        csv_file = io.StringIO(response.text)
        reader = csv.DictReader(csv_file)
        for row in reader:
            row["year"] = year
            row["country"] = country
            yield row
</pre>

Project 4: Scraping NBA.com

Objective: Get a list of all NBA players along with their height and weight.

Fun time! Now, let's scrape the NBA website. This is a good example of how to scrape a website with JavaScript-rendered content.

We start by visiting the NBA website and inspecting the page. We see that the player data is loaded dynamically with JavaScript.

Our scraper will use Playwright to interact with the website and get the player data.

<pre class="wp-block-syntaxhighlighter-code">
import scrapy


class ExamplesnbaspiderSpider(scrapy.Spider):
    name = "ExamplesNBASpider"
    allowed_domains = ["www.nba.com"]
    start_urls = ["https://www.nba.com/stats/teams"]
    custom_settings = {
        "CONCURRENT_REQUESTS": 50,
        "PLAYWRIGHT_BROWSER_TYPE": "chromium",
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "headless": False,
            "timeout": 20 * 1000,  # 20 seconds
        },
        "PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT": 10 * 1000,
    }

    def start_requests(self):
        # Every request is rendered by Playwright; the page object is passed along
        for start_url in self.start_urls:
            yield scrapy.Request(
                url=start_url,
                meta={
                    "playwright": True,
                    "playwright_include_page": True,
                },
            )

    async def parse(self, response):
        page = response.meta['playwright_page']
        # Follow the link to every team's stats page
        team_links = response.css("a[class*='StatsTeamsList_teamLink__']")
        for team_link in team_links:
            yield scrapy.Request(
                url=response.urljoin(team_link.attrib.get("href")),
                callback=self.parse_team_page,
                meta={
                    "playwright": True,
                    "playwright_include_page": True,
                },
            )
        await page.close()

    async def parse_team_page(self, response):
        page = response.meta['playwright_page']
        team_name = response.css("div[class*='TeamHeader_name__']")
        team_name_text = "".join(team_name.css("*::text").getall())
        # The stats table class name starts with "Crom_table__" (e.g. Crom_table__p1iZz)
        table = response.css("table[class*='Crom_table__']").css("tbody")
        for row in table.css("tr"):
            columns = row.css("td")
            yield {
                "team": team_name_text,
                "player_name": columns[0].css("::text").get(),
                "player": columns[1].css("::text").get(),
                "position": columns[2].css("::text").get(),
                "height": columns[3].css("::text").get(),
                "weight": columns[4].css("::text").get(),
                "birthdate": columns[5].css("::text").get(),
            }
        await page.close()
</pre>

Scaling our Web Scraping Projects with ScraperAPI

ScraperAPI is a service that provides scalable infrastructure and advanced solutions for professional scraping projects.

Using an external service for scraping projects comes with several advantages:

1. Circumvent scraping limits

Using a (rotating) proxy can help circumvent limits imposed by the sites you scrape, such as geo-blocking, rate limits, or bot detection based on unusual behavior. An external service like ScraperAPI offers a wide range of proxies located all over the world, some even with IP addresses from residential ISP ranges, so that your requests look like they come from real human visitors.
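For instance, the same ScraperAPI proxy address used in the Amazon spider above can be plugged into a plain requests call. This is a minimal sketch; depending on your setup, ScraperAPI's documentation may recommend additional options, such as disabling certificate verification on the proxy port:

<pre class="wp-block-syntaxhighlighter-code">
import requests

API_KEY = '<YOUR API KEY>'

# Same proxy address as in the Amazon spider above
proxy = f"http://scraperapi:{API_KEY}@proxy-server.scraperapi.com:8001"
proxies = {"http": proxy, "https": proxy}

# verify=False may be needed in proxy mode; check ScraperAPI's docs
response = requests.get("https://httpbin.org/ip", proxies=proxies, verify=False)
print(response.text)
</pre>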

2. Automate JavaScript handling

When dealing with JavaScript-heavy pages whose content is generated client-side, we can use tools like Playwright, as seen in the examples section of this mini-guide. With ScraperAPI, you can forget about running resource-heavy browser setups on your end: you simply send the URL to the API and get the fully rendered page back.

3. Automatic Data Extraction

One of the most powerful features of ScraperAPI is the ability to query common e-commerce sites, job boards, and even search engine result pages as if they were a REST API. Instead of parsing the HTML yourself, ScraperAPI does it for you when you use its structured data endpoints.

Promo Code

As a reader of this MiniGuide, you’ll get 10% off their usual prices using the code BAS10 when signing up for a plan.

ScraperAPI Code Examples

Example 1: ScraperAPI Amazon SDE

Product data from Amazon can be retrieved from the Structured Data Endpoint (“SDE”) of ScraperAPI.

With this approach, there is no need to inspect the HTML structure of the Amazon product page or to deal with bot prevention systems and markup changes manually.

<pre class="wp-block-syntaxhighlighter-code">
import requests


product_data = []


payload = {
    'api_key': '<YOUR API KEY>',
    'asin': 'B09R93MDJX'
}


response = requests.get('https://api.scraperapi.com/structured/amazon/product', params=payload)
product = response.json()


product_data.append({
    "product_name": product["name"],
    "product_price": product["pricing"],
    "product_ASIN": product["product_information"]["asin"]
})


print(product_data)
</pre>

Example 2: JS Rendering

Similar to the Structured Data Endpoint, ScraperAPI also offers a JavaScript Rendering endpoint. This endpoint is useful when the website you want to scrape uses JavaScript to render its content.

Using this endpoint, you can get the rendered HTML of the website just as you would see it using Playwright.

<pre class="wp-block-syntaxhighlighter-code">
import requests

payload = {'api_key': 'APIKEY', 'url': 'https://httpbin.org/ip', 'render': 'true'}
r = requests.get('https://api.scraperapi.com', params=payload)
print(r.text)


# Scrapy users can simply wrap the URLs in their start_urls and parse function
# ('url' below stands for the page you want to scrape)
# ...other scrapy setup code
start_urls = ['https://api.scraperapi.com?api_key=APIKEY&url=' + url + '&render=true']


def parse(self, response):
    # ...your parsing logic here
    yield scrapy.Request('https://api.scraperapi.com/?api_key=APIKEY&url=' + url + '&render=true', self.parse)
</pre>

About the author

Bas Steins

Bas Steins is a seasoned software developer with over 20 years of experience. Holding a degree in Computer Science, Bas has worked extensively with Python, Golang, and .NET. As a trainer and consultant, Bas is passionate about helping teams create better software solutions. With a diverse background spanning healthcare, biotech, finance, and web development, Bas brings a wealth of knowledge to each project, guiding teams towards success. He publishes his thoughts on his blog, on LinkedIn, and on X.
