
Web Scraping with asyncio#

Web scraping is the process of extracting useful data from websites, and it becomes time-consuming when hundreds or thousands of pages are involved. A traditional synchronous scraper fetches one page at a time and spends most of its time waiting on the network. With asyncio, we can use asynchronous I/O to fetch many pages concurrently, which speeds up the process significantly; an asyncio event loop, however, runs in a single thread and can therefore use only one CPU core.

Modern computers have multiple CPU cores. With free-threaded Python, we can run multiple asyncio workers in separate threads, one event loop per thread, and take advantage of all available cores.
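
The pattern looks roughly like the sketch below: a ThreadPoolExecutor starts one thread per worker, and each thread drives its own event loop with asyncio.run(). The crawl() coroutine, the example.com URLs, and the worker count are illustrative placeholders rather than part of the example script that follows; on a free-threaded build these threads run in parallel, while on the default build they largely take turns under the GIL.

    import asyncio
    from concurrent.futures import ThreadPoolExecutor


    async def crawl(urls: list[str]) -> None:
        # Placeholder coroutine: imagine asynchronous fetching and parsing here.
        for url in urls:
            await asyncio.sleep(0)  # stands in for real async I/O


    def run_worker(urls: list[str]) -> None:
        # Each thread drives its own event loop via asyncio.run().
        asyncio.run(crawl(urls))


    urls = [f"https://example.com/page/{i}" for i in range(100)]
    workers = 4
    chunks = [urls[i::workers] for i in range(workers)]  # split the work across threads

    with ThreadPoolExecutor(max_workers=workers) as executor:
        for chunk in chunks:
            executor.submit(run_worker, chunk)  # one event loop per thread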

Web scraping example with free-threaded Python#

This example demonstrates how to use free-threaded Python to run multiple asyncio workers in parallel, allowing us to scrape numerous pages concurrently across multiple cores.

It uses aiohttp for asynchronous HTTP requests and BeautifulSoup (the beautifulsoup4 package, imported as bs4) for parsing HTML. The example script scrapes Hacker News stories and their comments, demonstrating how to scrape a large number of pages efficiently with asyncio and free-threaded Python.

  1. Install the required packages with:

    pip install aiohttp beautifulsoup4
    
  2. Create the script file:

    # scraper.py
    
    import aiohttp
    import asyncio
    from bs4 import BeautifulSoup
    from queue import Queue, Empty
    from concurrent.futures import ThreadPoolExecutor
    from time import perf_counter
    from argparse import ArgumentParser
    
    BASE_URL = "https://news.ycombinator.com/news?p={}"
    ITEM_URL = "https://news.ycombinator.com/item?id={}"
    
    
    async def fetch(session: aiohttp.ClientSession, url: str) -> str:
        # Fetch a page and return its HTML as text (100-second total timeout).
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=100)) as response:
            return await response.text()
    
    
    def parse_stories(html: str) -> list[dict]:
        # Extract id, title, and link for each story on a listing page.
        soup = BeautifulSoup(html, "html.parser")
        stories = []
    
        for item in soup.select(".athing"):
            title_tag = item.select_one(".titleline > a")
            story_id = item.get("id")
    
            if title_tag and story_id:
                title = title_tag.text.strip()
                link = title_tag["href"].strip()
                stories.append({"id": story_id, "title": title, "link": link})
    
        return stories
    
    
    def parse_comments(html: str) -> list[dict]:
        # Extract the username and text of each comment on a story page.
        soup = BeautifulSoup(html, "html.parser")
        comments = []
    
        for row in soup.select("tr.comtr"):
            user_tag = row.select_one(".hnuser")
            comment_tag = row.select_one(".commtext")
    
            if user_tag and comment_tag:
                user = user_tag.text.strip()
                text = comment_tag.get_text(separator=" ", strip=True)
                comments.append({"user": user, "text": text})
    
        return comments
    
    
    async def fetch_story_with_comments(
        session: aiohttp.ClientSession, story: dict
    ) -> dict:
        comment_html = await fetch(session, ITEM_URL.format(story["id"]))
        story["comments"] = parse_comments(comment_html)
        return story
    
    
    async def worker(queue: Queue, all_stories: list) -> None:
        # Each worker owns one ClientSession and pulls listing-page URLs from
        # the shared thread-safe queue until it is empty.
        async with aiohttp.ClientSession() as session:
            while True:
                async with asyncio.TaskGroup() as tg:
                    try:
                        page = queue.get(block=False)
                    except Empty:
                        break
                    html = await fetch(session, page)
                    stories = parse_stories(html)
                    if not stories:
                        break
                    # Fetch every story's comment page concurrently; the TaskGroup
                    # waits for all tasks before the with-block exits.
                    for story in stories:
                        tg.create_task(fetch_story_with_comments(session, story))
                all_stories.extend(stories)
    
    
    def main(multithreaded: bool) -> None:
        queue = Queue()
        all_stories = []
        # Enqueue the first 100 listing pages.
        for page in range(1, 101):
            queue.put(BASE_URL.format(page))
        start_time = perf_counter()
        if multithreaded:
            print("Using multithreading for fetching stories...")
            workers: int = 8  # number of worker threads; tune to your CPU core count
            with ThreadPoolExecutor(max_workers=workers) as executor:
                # Each thread runs its own asyncio event loop; the executor
                # waits for all of them when the with-block exits.
                for _ in range(workers):
                    executor.submit(lambda: asyncio.run(worker(queue, all_stories)))
        else:
            print("Using single thread for fetching stories...")
            asyncio.run(worker(queue, all_stories))
        end_time = perf_counter()
        print(
            f"Scraping speed: {len(all_stories) / (end_time - start_time):.0f} stories/sec"
        )
    
    
    if __name__ == "__main__":
        parser = ArgumentParser(description="Scrape Hacker News stories and comments.")
        parser.add_argument(
            "--multithreaded",
            action="store_true",
            default=False,
            help="Use multithreading for fetching stories.",
        )
        args = parser.parse_args()
        main(args.multithreaded)
    
  3. Run the script with a single thread:

    python scraper.py
    
  4. Run the script with multiple threads by using the --multithreaded flag:

    python scraper.py --multithreaded
    
  5. Run the script with multiple threads on a free-threaded Python build; the -X gil=0 option keeps the GIL disabled even if an extension module would otherwise re-enable it (a quick way to verify this is shown after the list):

    python -X gil=0 scraper.py --multithreaded
    
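To confirm which mode the interpreter is actually running in, you can check sys._is_gil_enabled() (available since Python 3.13); this is a quick sanity check rather than part of the example:

    python -c "import sys; print(sys._is_gil_enabled())"

This prints False when the GIL is disabled (as in the free-threaded run above) and True otherwise.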

Example results and explanation#

The table below compares the results of each run on a 12-core CPU:

Configuration                        Stories/sec
default build, single thread         12
default build, multithreaded         35
free-threaded build, multithreaded   80

The default build performs better with multiple threads than with a single thread because Python releases the GIL during blocking I/O operations, so other threads can run while one thread waits for a network response. This overlapping of network waits helps, but CPU-bound work such as HTML parsing is still serialized by the Global Interpreter Lock (GIL), which limits the speedup.
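
The overlap can be seen with a minimal, self-contained sketch (not part of the scraper): time.sleep() stands in for a network wait here and, like a blocking socket read, releases the GIL, so eight one-second waits spread over eight threads take about one second even on the default build.

    # Minimal sketch: blocking waits release the GIL, so threads overlap them.
    from concurrent.futures import ThreadPoolExecutor
    from time import perf_counter, sleep

    def wait_one_second() -> None:
        sleep(1)  # stands in for waiting on a network response

    start = perf_counter()
    with ThreadPoolExecutor(max_workers=8) as executor:
        for _ in range(8):
            executor.submit(wait_one_second)
    print(f"8 one-second waits took {perf_counter() - start:.1f}s")  # ~1.0s, not ~8.0s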

The free-threaded build enables true parallelism across multiple cores, which significantly increases the scraping speed. This example demonstrates how free-threaded Python can efficiently scrape large amounts of data from the web by leveraging multiple CPU cores.