Tutorial

How to Extract Structured Data from Any Website Using AI (No Selectors Needed)

Stop writing XPath and CSS selectors. Discover how Vision-Language Models (VLMs) and LLMs allow you to extract perfect JSON data from websites using only natural language prompts.

GYD Team · Engineering
April 20, 2026 · 6 min read

For twenty years, web scraping has been defined by one deeply frustrating task: inspecting the DOM, finding a fragile nested CSS class like div.product-card > span.price-tag-v2, and praying the developers never change it.

Spoiler alert: They always change it.

Selector drift is the bane of every data engineer's existence. But the era of brittle scraping is officially over. Today, you can use AI to extract perfect, structured JSON from any webpage without writing a single CSS selector.

The Evolution of Scraping

To understand why this is such a massive leap forward, look at the evolution of extraction:

  1. Regex (The Dark Ages): Trying to parse HTML with regular expressions. (Don't do this. You will summon Cthulhu.)
  2. DOM Parsing (The Standard Era): Using BeautifulSoup or Cheerio to query classes and IDs. Works well until the frontend framework re-compiles the CSS modules and changes .price to .sc-bczRLJ.
  3. Semantic AI Extraction (The Modern Era): We hand the raw HTML (or a screenshot) to an AI, give it a JSON schema, and tell it: "Find the product name, price, and rating, and format it exactly like this."

How AI Extraction Actually Works

There are two main ways modern platforms use AI to structure data:

1. HTML to Markdown to LLM

The most common approach for text-heavy pages (like news articles or company bios) is to convert the bloated HTML into clean Markdown. This removes the JavaScript, the CSS, and the nested divs, leaving only the semantic content.
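To make the cleanup step concrete, here is a toy sketch of what HTML-to-Markdown conversion does. This is an illustration only (and, ironically, uses regexes for brevity); production pipelines typically use a purpose-built library such as Turndown rather than hand-rolled string replacement.

```javascript
// Toy HTML-to-Markdown cleanup: strip scripts/styles, keep semantic content.
// Illustration only — real converters handle far more edge cases.
function htmlToMarkdown(html) {
  return html
    // Drop scripts, styles, and comments entirely
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, '')
    .replace(/<!--[\s\S]*?-->/g, '')
    // Convert a few semantic tags to Markdown equivalents
    .replace(/<h1[^>]*>([\s\S]*?)<\/h1>/gi, '# $1\n')
    .replace(/<h2[^>]*>([\s\S]*?)<\/h2>/gi, '## $1\n')
    .replace(/<li[^>]*>([\s\S]*?)<\/li>/gi, '- $1\n')
    .replace(/<p[^>]*>([\s\S]*?)<\/p>/gi, '$1\n\n')
    // Strip every remaining tag, then tidy whitespace
    .replace(/<[^>]+>/g, '')
    .replace(/\n{3,}/g, '\n\n')
    .trim();
}

const html =
  '<div><style>.x{}</style><h1>Acme Corp</h1><p>Jane Doe, CEO</p></div>';
console.log(htmlToMarkdown(html));
// → "# Acme Corp\nJane Doe, CEO"
```

The output is a fraction of the size of the raw HTML, which matters: LLM pricing and context limits are both token-based, so stripping markup before the model sees the page is what makes this approach economical.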

You then feed that Markdown into an LLM (like GPT-4o or Claude 3.5 Sonnet) along with a system prompt: "Extract the executive team members and their titles from this text into a JSON array."
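That prompt-plus-Markdown step looks roughly like this, sketched here in the shape of OpenAI's Chat Completions API (the markdown variable stands in for the cleaned page content; any LLM provider with a JSON output mode works the same way):

```javascript
// Sketch of the extraction prompt for an OpenAI-style chat endpoint.
// The markdown string is the cleaned page content from the previous step.
const markdown =
  '# Acme Corp\n\n## Leadership\n- Jane Doe, CEO\n- John Smith, CTO';

const payload = {
  model: 'gpt-4o',
  response_format: { type: 'json_object' }, // force the model to emit valid JSON
  messages: [
    {
      role: 'system',
      content:
        'Extract the executive team members and their titles from this ' +
        'text into a JSON array under the key "executives".'
    },
    { role: 'user', content: markdown }
  ]
};

// POST this to https://api.openai.com/v1/chat/completions with your API key;
// the model replies with e.g. {"executives": [{"name": "Jane Doe", "title": "CEO"}, ...]}
console.log(JSON.stringify(payload, null, 2));
```

Note the JSON output mode: without it, models occasionally wrap their answer in prose or Markdown fences, which breaks downstream parsing.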

2. Vision-Language Models (VLMs)

For highly visual sites (like e-commerce grids, dashboards, or complex tables), Markdown isn't enough. The visual layout *is* the data. Modern scrapers like GYD.AI take a screenshot of the fully rendered page, map the visual coordinates of text elements, and feed the image to a Vision Model.

The AI "looks" at the page just like a human does. It understands that a big bold number next to a product image is the price, regardless of what the CSS class is called.
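A vision request differs from the text version in one way: the screenshot rides along as an inline base64 image next to the instructions. Sketched again in OpenAI's Chat Completions format (screenshotBase64 is a placeholder; in practice you would capture it with a headless browser):

```javascript
// Sketch of a vision-model request: the rendered page is sent as a
// base64 data URL alongside the extraction instructions.
const screenshotBase64 = '<base64-encoded PNG of the rendered page>';

const visionPayload = {
  model: 'gpt-4o',
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text:
            'Find the product name, price, and rating on this page ' +
            'and return them as JSON.'
        },
        {
          type: 'image_url',
          image_url: { url: `data:image/png;base64,${screenshotBase64}` }
        }
      ]
    }
  ]
};

// One message, two content parts: the instructions and the screenshot.
console.log(visionPayload.messages[0].content.length); // → 2
```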

Building Your First AI Scraper

Let's look at how easy this is using an AI scraping API like GYD.AI. We don't need Playwright. We don't need BeautifulSoup. We just need to define what we want.

Imagine we want to scrape a real estate listing. Here is our exact schema:

{
  "address": "Full street address",
  "price": "Current asking price as an integer",
  "bedrooms": "Number of bedrooms",
  "bathrooms": "Number of bathrooms",
  "sqft": "Square footage as an integer",
  "realtor_name": "Name of the listing agent"
}

Instead of digging through Zillow's heavily obfuscated DOM, we just send that schema to the API along with the URL:

const response = await fetch('https://api.gyd.ai/v1/extract', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: "https://www.zillow.com/homedetails/...",
    schema: {
      address: "string",
      price: "number",
      bedrooms: "number",
      bathrooms: "number",
      sqft: "number",
      realtor_name: "string"
    },
    // The magic:
    instructions: "Extract the core property details. Ensure the price and sqft are pure numbers."
  })
});

const data = await response.json();
console.log(data);

The API handles the anti-bot bypass, the JavaScript rendering, the visual parsing, and the strict JSON formatting. You get back perfect, typed data:

{
  "address": "123 Tech Lane, Austin, TX 78701",
  "price": 850000,
  "bedrooms": 3,
  "bathrooms": 2,
  "sqft": 2100,
  "realtor_name": "Sarah Connor"
}
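Even with strict JSON formatting on the API side, it's good practice to sanity-check the types before the record enters your pipeline. A minimal runtime check, reusing the same schema shape we sent to the API:

```javascript
// Lightweight runtime type check on the extracted record.
// The schema mirrors the one sent in the extraction request.
const schema = {
  address: 'string',
  price: 'number',
  bedrooms: 'number',
  bathrooms: 'number',
  sqft: 'number',
  realtor_name: 'string'
};

function validate(record, schema) {
  const errors = [];
  for (const [field, expected] of Object.entries(schema)) {
    if (typeof record[field] !== expected) {
      errors.push(`${field}: expected ${expected}, got ${typeof record[field]}`);
    }
  }
  return errors; // empty array means the record is well-typed
}

const data = {
  address: '123 Tech Lane, Austin, TX 78701',
  price: 850000,
  bedrooms: 3,
  bathrooms: 2,
  sqft: 2100,
  realtor_name: 'Sarah Connor'
};

console.log(validate(data, schema)); // → []
```

For anything beyond a toy, a schema library like Zod serves the same purpose with richer error messages, but the principle is the same: trust the AI's formatting, verify the types.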

The Benefits of "Self-Healing" Scrapers

The most profound advantage of AI extraction isn't just that it's faster to write. It's that it requires almost zero maintenance.

If Zillow decides to redesign their entire website tomorrow, changing every single DOM element, a traditional scraper will instantly crash in production. You will wake up to PagerDuty alerts.

With an AI extractor, the AI simply looks at the new design, recognizes that the price is now in a blue box instead of a red box, and extracts it anyway. Your pipeline doesn't break. You don't have to rewrite any code. The scraper has "self-healed."

Stop writing selectors. The future of data extraction is entirely semantic.