Pricing & Product Intelligence

Pricing intelligence at 250M pages a day.

I designed and built the distributed scraping and data platform behind a pricing-intelligence company, turning hundreds of millions of raw product pages a day into clean, normalized pricing intelligence. The hard part wasn't reading one page. It was reading 250 million of them, correctly, cheaply, every day.

Distributed crawling · Anti-bot resilience · Parsing & dedup pipeline · Cost control at scale
Oshri Cohen · Digital products delivered
The problem

Scale breaks everything that worked at small volume.

A scraper that works on a thousand pages is a script. At hundreds of millions a day, every assumption you made quietly becomes the bottleneck, the bill, or the outage.

Targets fight back

Sites rotate layouts, throttle traffic, fingerprint clients and deploy anti-bot defenses. At this volume you hit all of them, all the time, and a 1% failure rate is 2.5 million lost pages a day.

Raw pages aren't data

A captured page is noise: stale prices, duplicates, malformed markup, currency and unit mismatches. Pricing intelligence only exists after parsing, normalization and dedup, done reliably at the same scale.

Economics is the real constraint

At 250M pages a day, a few cents per thousand pages is the difference between a healthy margin and a platform that bankrupts the company. Correctness and cost are the same engineering problem.
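
To make the arithmetic behind that claim concrete, here is a small sketch that annualizes a per-thousand-pages unit cost at the stated volume. The unit costs of 1, 3 and 5 cents are illustrative assumptions, not measured figures from the platform.

```python
PAGES_PER_DAY = 250_000_000  # daily volume stated in the text


def annual_cost_usd(cost_per_thousand_pages_usd: float,
                    pages_per_day: int = PAGES_PER_DAY) -> float:
    """Yearly spend implied by a unit cost expressed per thousand pages."""
    daily_cost = pages_per_day / 1_000 * cost_per_thousand_pages_usd
    return daily_cost * 365


# Illustrative unit costs only (assumptions for this example).
for cents in (1, 3, 5):
    yearly = annual_cost_usd(cents / 100)
    print(f"{cents} cent(s) per 1,000 pages -> ${yearly:,.0f} per year")
```

At 3 cents per thousand pages the yearly bill is about $2.7 million; moving to 5 cents adds roughly $1.8 million a year for the same output.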

What I built

A platform engineered for correctness and economics.

Distributed crawling

A horizontally scaled crawl fleet with scheduling and back-pressure, so the system pulls hundreds of millions of pages a day without overwhelming targets or itself, and recovers cleanly from failure.
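
As a minimal single-process sketch of the back-pressure idea (not the production fleet), a bounded queue can stall the producer when workers fall behind, while a per-host semaphore caps concurrent requests against any one target. The `crawl` and `fake_fetch` names, the limits, and the example URLs are all hypothetical.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit


async def crawl(urls, fetch, queue_size=100, workers=8, per_host=2):
    """Fetch urls with a bounded queue and a per-host concurrency cap."""
    queue = asyncio.Queue(maxsize=queue_size)
    host_limits = defaultdict(lambda: asyncio.Semaphore(per_host))
    results = {}

    async def producer():
        for url in urls:
            await queue.put(url)   # waits while the queue is full: back-pressure
        for _ in range(workers):
            await queue.put(None)  # one stop signal per worker

    async def worker():
        while (url := await queue.get()) is not None:
            host = urlsplit(url).netloc
            async with host_limits[host]:  # never overwhelm a single target
                try:
                    results[url] = await fetch(url)
                except Exception as exc:
                    # Record the failure and keep going; one bad page
                    # must not take down the whole run.
                    results[url] = exc

    await asyncio.gather(producer(), *(worker() for _ in range(workers)))
    return results


async def fake_fetch(url):
    """Stand-in for a real HTTP client so the sketch runs offline."""
    await asyncio.sleep(0)
    if url.endswith("/bad"):
        raise ValueError("blocked")
    return f"<html>{url}</html>"


demo_urls = [f"https://shop-{i % 3}.example/item/{i}" for i in range(20)]
demo_urls.append("https://shop-0.example/bad")
demo_results = asyncio.run(crawl(demo_urls, fake_fetch, queue_size=5, workers=4))
```

A real fleet would spread this across machines and add retries with delay, but the shape is the same: bounded buffers between stages, and limits keyed by target.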

Resilience by design

Defenses against anti-bot systems, rotation, and layout drift. Sites change constantly; parsers and crawl strategies are built to detect breakage and degrade gracefully instead of silently emitting garbage.
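
One way to detect breakage rather than emit garbage is to validate every parsed record against an expected shape and watch the invalid ratio per source. The sketch below assumes a hypothetical record shape (`title`, `price`, `currency`) and hypothetical thresholds; it illustrates the idea and is not the platform's actual validator.

```python
from dataclasses import dataclass

REQUIRED_FIELDS = ("title", "price", "currency")  # hypothetical record shape


def is_valid(record: dict) -> bool:
    """Valid means all required fields are present and price is a positive number."""
    if any(not record.get(field) for field in REQUIRED_FIELDS):
        return False
    price = record["price"]
    return isinstance(price, (int, float)) and price > 0


@dataclass
class ParserHealth:
    """Tracks the share of invalid records produced by one source's parser."""
    max_invalid_ratio: float = 0.05  # assumed tolerance
    min_samples: int = 50            # avoid alarms on tiny samples
    total: int = 0
    invalid: int = 0

    def observe(self, record: dict) -> None:
        self.total += 1
        if not is_valid(record):
            self.invalid += 1

    @property
    def broken(self) -> bool:
        """True once enough samples show an invalid ratio above the tolerance."""
        if self.total < self.min_samples:
            return False
        return self.invalid / self.total > self.max_invalid_ratio


# Simulate a layout change: 10 of 100 records lose their price.
health = ParserHealth()
for i in range(100):
    record = {"title": f"Item {i}", "price": 9.99, "currency": "USD"}
    if i % 10 == 0:
        record["price"] = None
    health.observe(record)
```

When `health.broken` turns true, the source can be quarantined and its records held back instead of flowing downstream.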

Parsing & normalization

A pipeline that turns raw HTML into structured records: parsed, normalized to common product and currency shapes, and validated, so downstream pricing intelligence is comparable across sources.
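
To illustrate what normalizing to a common currency shape can involve, the sketch below turns differently formatted price strings into a `Decimal` amount plus a currency code. The symbol map and the separator heuristic are assumptions made for this example; in particular, treating "$" as USD is a simplification that a real pipeline would resolve from the site's locale.

```python
import re
from decimal import Decimal

# Assumption: "$" means USD. Real sources need locale or domain context.
CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}


def normalize_price(raw: str) -> tuple[Decimal, str]:
    """Parse strings like '$1,299.00' or '1.299,00 €' into (amount, currency)."""
    symbol = next((s for s in CURRENCY_SYMBOLS if s in raw), None)
    if symbol is None:
        raise ValueError(f"unknown currency in {raw!r}")

    digits = re.sub(r"[^\d.,]", "", raw)
    if not any(ch.isdigit() for ch in digits):
        raise ValueError(f"no amount in {raw!r}")

    # Heuristic: a final separator followed by 1-2 digits is the decimal mark;
    # every other separator is treated as a thousands separator.
    match = re.fullmatch(r"([\d.,]*?)[.,](\d{1,2})", digits)
    whole, frac = (match.group(1), match.group(2)) if match else (digits, "")
    whole = re.sub(r"[.,]", "", whole) or "0"

    amount = Decimal(f"{whole}.{frac}") if frac else Decimal(whole)
    return amount, CURRENCY_SYMBOLS[symbol]
```

With this in place, "$1,299.00" and "1.299,00 €" both normalize to the amount 1299.00, tagged USD and EUR respectively, so records from different sources become comparable.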

Dedup & freshness

Hundreds of millions of pages collapse into the unique products and price points that matter, with freshness tracking so customers see current prices, not yesterday's snapshot.
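
A minimal in-memory sketch of that collapse, assuming a price point is identified by a hypothetical `(source, sku)` key and that freshness means "observed within the last 24 hours". At real volume this would run in the warehouse or a streaming job rather than a Python dict.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class PricePoint:
    source: str
    sku: str
    price: str
    observed_at: datetime


def dedup_latest(points):
    """Keep only the most recent observation per (source, sku)."""
    latest = {}
    for point in points:
        key = (point.source, point.sku)
        if key not in latest or point.observed_at > latest[key].observed_at:
            latest[key] = point
    return latest


def stale_keys(latest, now, max_age=timedelta(hours=24)):
    """Keys whose newest observation is older than max_age."""
    return [key for key, point in latest.items() if now - point.observed_at > max_age]


now = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
observations = [
    PricePoint("shop-a", "sku-1", "19.99", now - timedelta(hours=30)),
    PricePoint("shop-a", "sku-1", "18.99", now - timedelta(hours=2)),
    PricePoint("shop-b", "sku-1", "21.00", now - timedelta(hours=40)),
]
latest = dedup_latest(observations)
stale = stale_keys(latest, now)
```

Here three observations collapse to two price points, and the `shop-b` entry is flagged as stale so it can be queued for a re-crawl.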

Warehouse & scheduling

A data warehouse sized for this throughput, fed by schedulers that decide what to crawl, how often, and in what order, balancing coverage, freshness and cost against finite capacity.
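
The trade-off a scheduler makes can be sketched as a budgeted priority queue: rank targets by how overdue they are relative to their desired refresh interval, then take the most urgent ones that fit the remaining budget. The `Target` fields, the urgency formula, and the relative cost units are assumptions for illustration only.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Target:
    url: str
    hours_since_crawl: float
    refresh_hours: float  # desired freshness for this target
    cost: float = 1.0     # relative cost units (assumed)


def plan_crawl(targets, budget):
    """Pick the most overdue targets first until the budget is spent."""
    # urgency > 1.0 means the target is past its refresh interval.
    heap = [(-(t.hours_since_crawl / t.refresh_hours), i, t)
            for i, t in enumerate(targets)]  # index i breaks ties
    heapq.heapify(heap)

    plan, spent = [], 0.0
    while heap:
        neg_urgency, _, target = heapq.heappop(heap)
        if -neg_urgency < 1.0:
            break  # everything left is still fresh enough
        if spent + target.cost > budget:
            continue  # does not fit; a cheaper target further down might
        plan.append(target.url)
        spent += target.cost
    return plan


targets = [
    Target("a", hours_since_crawl=48, refresh_hours=24),
    Target("b", hours_since_crawl=6, refresh_hours=24),
    Target("c", hours_since_crawl=30, refresh_hours=6, cost=2.0),
    Target("d", hours_since_crawl=25, refresh_hours=24),
]
plan = plan_crawl(targets, budget=3.0)
```

With a budget of 3 units, the fast-moving target `c` is crawled first, then `a`; `d` is overdue but no longer fits, and `b` is still fresh.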

Cost control at volume

Per-page economics tracked and tuned end to end across compute, bandwidth and storage, because at this scale efficiency is the product. This work sits at the core of how I think about AI and data at scale.
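
Tracking per-page economics by category can be as simple as a ledger that accumulates spend and page counts, then reports cost per thousand pages. The class name, categories and dollar amounts below are hypothetical; the point is only that each cost line is normalized to the same per-page unit.

```python
from collections import defaultdict


class CostLedger:
    """Accumulates spend by category and reports it per thousand pages."""

    def __init__(self):
        self.costs_usd = defaultdict(float)
        self.pages = 0

    def record(self, pages: int, **costs_usd: float) -> None:
        """Add one batch: pages processed and the spend attributed to each category."""
        self.pages += pages
        for category, amount in costs_usd.items():
            self.costs_usd[category] += amount

    def per_thousand_pages(self) -> dict:
        """Cost per 1,000 pages for each category seen so far."""
        if self.pages == 0:
            return {}
        return {category: amount / self.pages * 1_000
                for category, amount in self.costs_usd.items()}


# Made-up batch figures, for illustration only.
ledger = CostLedger()
ledger.record(600_000, compute=7.0, bandwidth=4.0, storage=1.0)
ledger.record(400_000, compute=5.0, bandwidth=2.0, storage=1.0)
unit_costs = ledger.per_thousand_pages()
total_per_thousand = sum(unit_costs.values())
```

Breaking the unit cost out by category shows where tuning pays off first, rather than hiding it in a single blended number.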

The result

Throughput that holds up every single day.

250M+/day
Product data pages processed into clean pricing intelligence
24/7
Continuous crawling with scheduling, back-pressure and failure recovery
Per-page
Cost tracked and tuned end to end so the unit economics actually work

At 250 million pages a day, correctness and cost are the same problem. A platform that reads the web cleanly but can't pay for itself is a science project, not a business.

Oshri Cohen · On data platforms at scale
Common questions

What teams ask about this.

How do you keep scraping reliable when sites are actively trying to block you?

You assume breakage is the normal state, not the exception. The crawl fleet rotates and adapts, parsers are validated against expected shapes so layout changes are detected rather than silently passed through, and the pipeline degrades gracefully and recovers instead of failing whole runs. At 250M pages a day a small failure rate is millions of lost pages, so resilience is a first-class design goal, not a retry loop bolted on at the end.

How do raw pages become actual pricing intelligence?

Through a pipeline: capture, parse, normalize and dedup. Raw HTML is parsed into structured records, normalized to common product and currency shapes, validated, then deduplicated down to the unique products and price points that matter, with freshness tracking so customers see current prices. The captured page is just noise until that pipeline runs reliably at the same scale as the crawl.

How do you control cost at that volume?

By treating per-page economics as a core metric and tuning compute, bandwidth and storage end to end. At 250M pages a day, a few cents per thousand pages decides whether the platform has a healthy margin or quietly bankrupts the company, so I engineer for correctness and economics together rather than optimizing one and discovering the other later.

Need a data platform that holds up at real scale?

Whether it's pricing intelligence, large-scale crawling, or any pipeline that has to be correct and cheap at extreme volume, let's map what it actually takes.

hello@oshricohen.me
(514) 777-3883
Canada · USA · Remote