Can Ceylan · Vienna-based, globally curious.

Why parallelising your scraper will get you silently banned

Anti-bot systems don't always return 403s. Sometimes they return empty results — and your logs look clean. Here's why sequential requests are the architecture, not a workaround.

2026-04-18 · 2 min read · intermediate

The trap

The instinct when you have 20 keywords to scrape is to fire them all at once. Parallel requests, faster results, done in seconds instead of minutes.

Anti-bot systems exploit exactly this instinct. When they detect burst traffic from a single session, they don't always block you — they serve you empty results. No error. No log entry. Just zero products, every time.

Your code works. Your metrics look fine. Your data is silently wrong.

Why this happens

Modern anti-bot systems (DataDome, Cloudflare, PerimeterX) use behavioural fingerprinting, not just IP blocking. A human browsing a marketplace makes one request, pauses, makes another. A bot sends ten requests in 200ms. The system doesn't ban you — it feeds you garbage until you go away.

This is called a shadow ban in scraping terminology. The session remains valid, the HTTP status codes stay 200, but the response payloads are sanitised to contain nothing useful.
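A single empty 200 is indistinguishable from a genuine no-results page, but a streak of them is a strong shadow-ban signal. A minimal sketch of that idea (the class name and the threshold of 3 are my assumptions, not from any library):

```python
class ShadowBanDetector:
    """Track consecutive empty 200-responses.

    A streak of empty-but-successful responses suggests the anti-bot
    system is sanitising payloads rather than returning errors.
    """

    def __init__(self, threshold: int = 3):
        self.threshold = threshold  # assumed value; tune per target site
        self.streak = 0

    def record(self, status_code: int, item_count: int) -> bool:
        """Record one response; return True once the streak hits the threshold."""
        if status_code == 200 and item_count == 0:
            self.streak += 1
        else:
            self.streak = 0
        return self.streak >= self.threshold
```

A genuine no-results keyword resets nothing by itself; only an unbroken run of empties trips the detector.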

The fix isn't a library — it's a constraint

The correct architecture is:

import time

for keyword in searches:
    results = scrape_keyword(keyword)
    store_results(results)
    time.sleep(3)  # non-negotiable

That sleep is load-bearing. It is not a politeness convention. Removing it will silently corrupt your dataset.

Key rules:

  • Never use asyncio.gather() or ThreadPoolExecutor for scrape calls
  • Delay between requests: 3s minimum; increase it if empty results start coming back
  • One scrape session — do not spawn multiple instances
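The rules above can be folded into one loop. This is a sketch, not a drop-in: scrape_keyword and store_results stand for your own functions, and the jitter range, backoff factor, and 60s cap are assumptions to illustrate the pattern.

```python
import random
import time

def scrape_sequentially(keywords, scrape_keyword, store_results,
                        base_delay: float = 3.0) -> None:
    """Sequential scrape loop: fixed base delay, jitter, backoff on empties.

    scrape_keyword and store_results are the caller's functions; the
    backoff doubling and 60s cap are illustrative choices.
    """
    delay = base_delay
    for keyword in keywords:
        results = scrape_keyword(keyword)
        store_results(results)
        if not results:
            # Empty payload: treat it as a warning sign and slow down.
            delay = min(delay * 2, 60.0)
        else:
            delay = base_delay
        # Jitter breaks the perfectly regular rhythm a fixed sleep creates.
        time.sleep(delay + random.uniform(0, 1))
```

The jitter matters for the same reason the delay does: a request every 3.000 seconds is its own fingerprint.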

The signal that something is wrong

If you're seeing empty results and no errors, assume anti-bot interference before assuming a bug. The diagnostic is:

  1. Run a single manual request from your browser on the same IP
  2. Run the same keyword via your scraper
  3. Compare the counts

If the browser returns 40 products and the scraper returns 0, you've been shadow-banned. Increase the delay and restart with a fresh session.
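The comparison in steps 1-3 reduces to a tiny helper. Function name, messages, and the "partial results" case are illustrative assumptions:

```python
def diagnose(browser_count: int, scraper_count: int) -> str:
    """Compare a manual browser count with the scraper's count for one keyword."""
    if browser_count > 0 and scraper_count == 0:
        # Classic shadow-ban signature: human sees products, bot sees none.
        return "shadow-banned: increase delay, restart with a fresh session"
    if scraper_count < browser_count:
        return "partial results: possibly throttled"
    return "ok"
```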

The architectural implication

Sequential scraping means your scrape job takes minutes, not seconds. Design around this — don't treat it as a performance problem to optimise away. A scheduler that runs every 4 hours with a 3-minute sequential job is correct. A parallel job that completes in 10 seconds with corrupted data is wrong.
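One way to express that design: a scheduler that measures the job's own duration and sleeps out the remainder of the interval, so a slow sequential job is simply absorbed. A sketch, assuming your sequential scrape is wrapped in a job() callable; the max_runs parameter exists only to keep the example testable:

```python
import time

def run_on_schedule(job, interval_hours: float = 4.0, max_runs=None) -> int:
    """Run job() every interval_hours, measured from the start of each run.

    job is the caller's sequential scrape function. Returns the number
    of completed runs (only finite when max_runs is set).
    """
    interval = interval_hours * 3600
    runs = 0
    while max_runs is None or runs < max_runs:
        started = time.monotonic()
        job()  # a 3-minute sequential job is expected and fine here
        runs += 1
        elapsed = time.monotonic() - started
        # Sleep out whatever is left of the interval, never a negative time.
        time.sleep(max(0.0, interval - elapsed))
    return runs
```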

The right question is not "how do I make this faster?" but "how do I make this reliable?"
