Can Ceylan · Vienna-based, globally curious.

Why parallelising your scraper will get you silently banned

Anti-bot systems don't always return 403s. Sometimes they return empty results, and your logs look clean. Here's why sequential requests are the architecture, not a workaround.

2026-04-18·2 min read·intermediate

The trap

The instinct when you have 20 keywords to scrape is to fire them all at once. Parallel requests, faster results, done in seconds instead of minutes.

Anti-bot systems exploit exactly this instinct. When they detect burst traffic from a single session, they don't always block you. Sometimes they serve you empty results instead. No error. No log entry. Just zero products, every time.

Your code works. Your metrics look fine. Your data is silently wrong.

Why this happens

Modern anti-bot systems (DataDome, Cloudflare, PerimeterX) use behavioural fingerprinting, not just IP blocking. A human browsing a marketplace makes one request, pauses, then makes another. A bot sends ten requests in 200ms. The system doesn't ban you; it feeds you garbage until you go away.

This is called a shadow ban in scraping terminology. The session remains valid, the HTTP status codes stay 200, but the response payloads are sanitised to contain nothing useful.
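
Because the status code stays 200, the payload is the only place a shadow ban shows up. Here is a minimal sketch of a guard, assuming each scrape call gives you a status code and a list of parsed products. `ScrapeOutcome`, `is_suspicious`, `looks_shadow_banned`, and the threshold of three consecutive empties are illustrative choices, not part of any library:

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ScrapeOutcome:
    """Hypothetical record of one scrape call."""
    keyword: str
    status_code: int
    products: List[dict]


def is_suspicious(outcome: ScrapeOutcome) -> bool:
    """A 200 response with zero products is the shadow-ban signature."""
    return outcome.status_code == 200 and len(outcome.products) == 0


def looks_shadow_banned(outcomes: Sequence[ScrapeOutcome], threshold: int = 3) -> bool:
    """True if the last `threshold` outcomes were all clean-looking empties.

    The threshold is an assumption: a single empty result may just be a
    keyword with no matches, while a run of them is unlikely to be chance.
    """
    if len(outcomes) < threshold:
        return False
    return all(is_suspicious(o) for o in outcomes[-threshold:])
```

Logging the product count next to the status code for every request is what makes this check possible. A log that records only status codes will look clean throughout.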

The fix isn't a library, it's a constraint

The correct architecture is:

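import time

# searches, scrape_keyword and store_results are your own list and functions
for keyword in searches:  # one keyword at a time, never in parallel
    results = scrape_keyword(keyword)
    store_results(results)
    time.sleep(3)  # non-negotiable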

That sleep is load-bearing. It is not a politeness convention. Removing it will silently corrupt your dataset.

Key rules:

  • Never use asyncio.gather() or ThreadPoolExecutor for scrape calls
  • Delay between requests: 3s minimum; increase it if empty results start coming back
  • One scrape session; do not spawn multiple instances
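
The rules above can be sketched as one function. This is a minimal sketch, not a definitive implementation. `scrape_sequentially` is a hypothetical name, `scrape_keyword` is passed in as your own function, and the doubling backoff, the 60s cap, and the reset to the floor after a healthy response are assumptions rather than values any anti-bot vendor documents. The `sleep` parameter exists only so the loop can be tested without waiting:

```python
import time
from typing import Callable, Dict, Iterable, List


def scrape_sequentially(
    keywords: Iterable[str],
    scrape_keyword: Callable[[str], List[dict]],
    base_delay: float = 3.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, List[dict]]:
    """Scrape one keyword at a time, slowing down when results come back empty."""
    results: Dict[str, List[dict]] = {}
    delay = base_delay
    for keyword in keywords:
        products = scrape_keyword(keyword)
        results[keyword] = products
        if products:
            delay = base_delay  # healthy response: return to the 3s floor
        else:
            delay = min(delay * 2, max_delay)  # empty response: back off
        sleep(delay)
    return results
```

There is no `asyncio.gather()` and no thread pool anywhere in this function. The loop does not start the next request until the previous one has returned and the delay has elapsed.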

The signal that something is wrong

If you're seeing empty results and no errors, assume anti-bot interference before assuming a bug. The diagnostic is:

  1. Run a single manual request from your browser on the same IP
  2. Run the same keyword via your scraper
  3. Compare the counts

If the browser returns 40 products and the scraper returns 0, you've been shadow-banned. Increase the delay and restart with a fresh session.
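
The comparison in step 3 is simple enough to write down. A minimal sketch, assuming you count the products in the browser by hand and pass both numbers in; `diagnose` and its return strings are illustrative:

```python
def diagnose(browser_count: int, scraper_count: int) -> str:
    """Compare a manual browser count with the scraper's count for one keyword."""
    if browser_count > 0 and scraper_count == 0:
        # Same IP, same keyword, but only the scraper sees nothing
        return "shadow ban suspected"
    if browser_count == 0 and scraper_count == 0:
        # The keyword genuinely has no matches; not evidence of interference
        return "keyword has no results"
    return "scraper is receiving data"
```

With the numbers from the example, `diagnose(40, 0)` returns `"shadow ban suspected"`.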

The architectural implication

Sequential scraping means your scrape job takes minutes, not seconds. Design around this; don't treat it as a performance problem to optimise away. A scheduler that runs every 4 hours with a 3-minute sequential job is correct. A parallel job that completes in 10 seconds with corrupted data is wrong.
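
The arithmetic behind that budget is worth making explicit when you size the schedule. A rough sketch, assuming a fixed delay and an average per-request time that you measure yourself; the function names and the default of 1s per request are hypothetical:

```python
def estimate_job_seconds(
    keyword_count: int,
    delay_seconds: float = 3.0,
    avg_request_seconds: float = 1.0,  # assumption: measure your own average
) -> float:
    """Estimated wall-clock time for one sequential scrape job."""
    return keyword_count * (delay_seconds + avg_request_seconds)


def fits_schedule(job_seconds: float, interval_seconds: float) -> bool:
    """True if the job finishes before the scheduler fires again."""
    return job_seconds < interval_seconds
```

With 20 keywords, a 3s delay, and an assumed 6s per request, the estimate is 180 seconds, which fits comfortably inside a 4-hour interval.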

The right question is not "how do I make this faster?" but "how do I make this reliable?"
