
Why Sending Raw HTML to an LLM for Web Scraping Is a Mistake (and What to Do Instead)

2026-05-06 01:54:24

The Hidden Cost of Large DOM Inputs

When developers first attempt to build a web scraper using large language models (LLMs), the natural instinct is to feed the entire page's HTML into the model and ask it to extract relevant data. This seems straightforward, but it quickly reveals a major inefficiency: a typical product listing page contains 500–700 KB of raw DOM markup. Processing that much input means paying for approximately 150,000 tokens per request, enduring 15–30 seconds of latency, and frequently hitting context limits—especially for complex pages. Many projects stall at this first hurdle.
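The ~150,000-token figure follows directly from the payload size under the common rule of thumb of roughly four characters per token:

```python
# Rough token estimate for a 580 KB HTML payload, assuming ~4 characters
# per token (a widely used heuristic for English-heavy text; exact counts
# depend on the tokenizer).
html_bytes = 580 * 1024
approx_tokens = html_bytes // 4
print(approx_tokens)  # 148480, i.e. roughly the 150,000 tokens cited above
```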

Source: dev.to

The Reality Check: 15 Models, One Consistent Result

Over a four-month period, an exhaustive evaluation was conducted across 15 different models, including GPT-4, GPT-4o, Gemini 1.5 Pro, Gemini Ultra, Claude 3 Opus, Claude 3.5 Sonnet, Mistral Large, Llama 3 70B, Cohere Command R+, and several smaller fine-tuned variants. The results fell into a predictable pattern:

No model solved the core latency problem because the fundamental approach—sending massive, unprocessed HTML—was flawed from the start.

The Breakthrough: Pre-Processing the DOM

The real bottleneck was not the model's reasoning capability but the sheer volume of input data. To address this, a DOM pre-processor was developed with the following steps:

  1. Strip all <script>, <style>, and tracking pixel elements.
  2. Remove navigation, footer, and sidebar components.
  3. Collapse deeply nested wrappers that carry no semantic meaning.
  4. Apply SimHash to deduplicate structurally identical subtrees.

The result was a dramatic reduction from 580 KB to just 4.2 KB—a 99.3% decrease in input size. With a 4 KB input, every model became fast. More importantly, the reduced input made repeating structural patterns obvious: product cards, directory rows, and search results repeated 20, 50, or 100 times. This insight led to a fundamental shift in the architecture.
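The SimHash deduplication in step 4 can be sketched as follows. This is a toy fingerprint over token sequences, assuming each subtree has already been flattened into a list of tag/class tokens; the article does not specify the exact hashing scheme.

```python
import hashlib

def simhash(tokens, bits=64):
    """Minimal SimHash sketch (step 4): near-identical token sequences
    produce identical or close fingerprints, so structurally duplicate
    subtrees can be collapsed to a single representative."""
    v = [0] * bits
    for tok in tokens:
        # Hash each token to 64 bits and vote on every bit position.
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    # Positive vote => bit set in the final fingerprint.
    return sum(1 << i for i in range(bits) if v[i] > 0)

card_a = ["div", "img", "h3", "span.price"]
card_b = ["div", "img", "h3", "span.price"]
print(simhash(card_a) == simhash(card_b))  # True: duplicate cards collapse
```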

The Architecture Decision: Heuristics Before AI

Once the structural patterns were visible, it became clear that paying an LLM to detect those patterns was unnecessary. Instead, a heuristic detector was designed to find the repeating structures (product cards, directory rows, search results) directly in the cleaned DOM, with no model call at all.

Then, AI enters only after detection—not to identify the list, but to label fields and structure the output. This reduces the LLM's job from 150,000 tokens to approximately 200 tokens. The resulting performance is dramatic:

Step             Approach             Latency
List detection   Heuristics           0.2 ms
Field labeling   LLM (small input)    ~2 s
Total                                 ~2 s

Compare this to the naive LLM approach, which takes 25–35 seconds per page.
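The two-stage split can be illustrated with a toy version of the heuristic stage. The data model here (dicts with `tag` and `classes` keys, a repeat threshold of 3) is hypothetical; the point is that list detection is a counting problem, not a reasoning problem.

```python
from collections import Counter

def detect_lists(nodes):
    """Heuristic stage sketch: group sibling elements by a structural
    signature (tag + sorted class list) and flag any signature that
    repeats enough times to look like a list."""
    sig = lambda n: (n["tag"], tuple(sorted(n["classes"])))
    counts = Counter(sig(n) for n in nodes)
    return [s for s, c in counts.items() if c >= 3]

siblings = [
    {"tag": "div", "classes": ["card"]},
    {"tag": "div", "classes": ["card"]},
    {"tag": "div", "classes": ["card"]},
    {"tag": "div", "classes": ["banner"]},  # one-off banner, not a list
]
print(detect_lists(siblings))  # [('div', ('card',))]
```

Only the members of the detected list (a few hundred tokens at most) would then be handed to the LLM for field labeling.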

What Was Actually Shipped

This architecture became the foundation for Clura, a heuristic-first AI web scraper Chrome extension. On any page, Clura automatically detects every list using the heuristic engine. Users simply pick the desired list and the fields to extract; all records are retrieved in seconds. There are no prompts to describe data, no training phase, and no long waits. The heuristic layer handles detection; AI handles labeling.

The Lesson: LLMs Excel at Meaning, Not Scanning HTML

Large language models are exceptional at understanding what something means. They are terrible at scanning 600 KB of HTML to find where something is. That is a structural pattern problem—and structural pattern problems are what algorithms are built for. By combining fast, cheap heuristics for pattern detection with small, targeted LLM calls for semantic labeling, you can achieve speeds and accuracy that neither method can reach alone.
