The trap
The instinct when you have 20 keywords to scrape is to fire them all at once. Parallel requests, faster results, done in seconds instead of minutes.
Anti-bot systems exploit exactly this instinct. When they detect burst traffic from a single session, they don't always block you — they serve you empty results. No error. No log entry. Just zero products, every time.
Your code works. Your metrics look fine. Your data is silently wrong.
Why this happens
Modern anti-bot systems (DataDome, Cloudflare, PerimeterX) use behavioural fingerprinting, not just IP blocking. A human browsing a marketplace makes one request, pauses, makes another. A bot sends ten requests in 200ms. The system doesn't ban you — it feeds you garbage until you go away.
This is called a shadow ban in scraping terminology. The session remains valid, the HTTP status codes stay 200, but the response payloads are sanitised to contain nothing useful.
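Because the status code is clean, the only way to spot a shadow ban is to inspect the payload itself. A minimal sketch of that check, assuming your scraper already has some `parse_products` function that extracts items from the page (the name and signature here are illustrative, not a prescribed API):

```python
import requests
from typing import Callable

def looks_shadow_banned(
    response: requests.Response,
    parse_products: Callable[[str], list],
) -> bool:
    # The shadow-ban signature: transport-level success, content-level emptiness.
    # A hard block usually announces itself with a 403, a 429, or a CAPTCHA page.
    return response.status_code == 200 and len(parse_products(response.text)) == 0
```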
The fix isn't a library — it's a constraint
The correct architecture is:
```python
import time

for keyword in searches:
    results = scrape_keyword(keyword)
    store_results(results)
    time.sleep(3)  # non-negotiable
```
That sleep is load-bearing. It is not a politeness convention. Removing it will silently corrupt your dataset.
Key rules:
- Never use `asyncio.gather()` or `ThreadPoolExecutor` for scrape calls
- Delay between requests: 3s minimum; increase it if empty results start coming back
- One scrape session: do not spawn multiple instances (a sketch implementing these rules follows)
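Here is one way to follow all three rules at once: a sequential loop with a small random jitter on top of the base delay, and a backoff whenever a page comes back empty. `scrape_keyword` and `store_results` stand in for your existing functions; their signatures are assumptions for the sketch, not a prescribed API:

```python
import random
import time

import requests

BASE_DELAY = 3.0  # seconds; the floor, never go below it

def scrape_all(keywords, scrape_keyword, store_results):
    session = requests.Session()          # one session for the whole job
    delay = BASE_DELAY
    for keyword in keywords:
        results = scrape_keyword(session, keyword)
        store_results(results)
        if not results:
            delay = min(delay * 2, 60.0)  # empty page: back off, suspect a shadow ban
        else:
            delay = BASE_DELAY            # healthy page: return to the floor
        time.sleep(delay + random.uniform(0.0, 1.0))  # jitter breaks the metronome
```

The jitter matters for the same reason the delay does: a perfectly regular 3.000s cadence is itself a behavioural fingerprint no human produces.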
The signal that something is wrong
If you're seeing empty results and no errors, assume anti-bot interference before assuming a bug. The diagnostic is:
- Run a single manual request from your browser on the same IP
- Run the same keyword via your scraper
- Compare the counts
If the browser returns 40 products and the scraper returns 0, you've been shadow-banned. Increase the delay and restart with a fresh session.
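That comparison is easy to script once you have noted the browser count. A sketch, where `expected_min` is the count you just saw in the browser and `scrape_keyword` is your existing fetch-and-parse function (both names are illustrative):

```python
def spot_check(keyword, expected_min, scrape_keyword):
    results = scrape_keyword(keyword)
    if not results and expected_min > 0:
        print(f"{keyword}: browser saw {expected_min}, scraper saw 0 -> likely shadow-banned")
        return False
    print(f"{keyword}: scraper returned {len(results)} items (browser baseline: {expected_min})")
    return True
```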
The architectural implication
Sequential scraping means your scrape job takes minutes, not seconds. Design around this — don't treat it as a performance problem to optimise away. A scheduler that runs every 4 hours with a 3-minute sequential job is correct. A parallel job that completes in 10 seconds with corrupted data is wrong.
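The shape of that correct design is a dumb outer loop around a slow inner job. A minimal in-process sketch; in practice cron or your platform's scheduler does the same work:

```python
import time

def run_every(job, interval_seconds=4 * 60 * 60):
    # Run the sequential scrape job on a fixed cadence. The job taking
    # minutes is expected; the interval absorbs it.
    while True:
        started = time.monotonic()
        job()
        time.sleep(max(0.0, interval_seconds - (time.monotonic() - started)))
```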
The right question is not "how do I make this faster?" but "how do I make this reliable?"