Engineering · 9 min read

From Product Pages to Knowledge Graphs: Making Your Store AI-Discoverable

The Readability Problem

An AI agent tasked with finding the best wireless noise-cancelling headphones under $300 faces a fundamental challenge. It does not see the web the way a human does. Where a human sees a clean product page with a hero image, a bold price, and five gold stars, the agent receives a stream of HTML that looks something like this:

<div class="pdp-main" data-product-id="wh1000xm5">
  <div class="gallery-wrapper">
    <div class="swiper-container">
      <div class="swiper-slide active">
        <img src="/images/products/wh1000xm5-hero.jpg"
             data-zoom="/images/products/wh1000xm5-hero-4k.jpg"
             class="gallery__img pdp-gallery__main-img" />
      </div>
    </div>
  </div>
  <div class="pdp-info">
    <span class="brand-label">Sony</span>
    <h1 class="pdp-title">WH-1000XM5 Wireless Headphones</h1>
    <div class="price-block">
      <span class="pdp-price__current">$349.99</span>
      <span class="pdp-price__was">$399.99</span>
    </div>
    <div class="rating-stars" data-rating="4.7">★★★★★</div>
    <span class="review-count">(2,847 reviews)</span>
  </div>
</div>

The class names are arbitrary. pdp-price__current means nothing to a machine that has not been specifically trained on this particular site's naming conventions. Another store might use product__price--sale, offer-price, or just price. The agent must reverse-engineer the site's design system to extract basic product facts.
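To see how fragile this is, consider a minimal sketch of class-name-based extraction. The list of guessed class names and the helper names are hypothetical; a real scraper would need a separate list for every store it visits.

```python
from html.parser import HTMLParser

# Hypothetical guesses at price class names; every store needs its own list.
PRICE_CLASS_GUESSES = {"pdp-price__current", "product__price--sale", "offer-price", "price"}


class PriceGuesser(HTMLParser):
    """Collects the first text directly inside elements whose class matches a guess."""

    def __init__(self):
        super().__init__()
        self._capture = False
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        # Only the most recently opened tag counts, so nested markup is missed.
        self._capture = bool(classes & PRICE_CLASS_GUESSES)

    def handle_data(self, data):
        if self._capture and data.strip():
            self.candidates.append(data.strip())
            self._capture = False


def guess_prices(html):
    """Return every text fragment that looks like a price by class name alone."""
    parser = PriceGuesser()
    parser.feed(html)
    return parser.candidates
```

The sketch finds the price on the page above, but a store that labels the same element class="cost" yields an empty list, and nothing in the result says which currency the "$" refers to.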

This is the readability problem. HTML was designed for human visual rendering, not for machine comprehension. And as AI agents become primary consumers of product information, this architectural mismatch creates a growing barrier to commerce.

What Structured Data Looks Like

The solution is JSON-LD, typically embedded in the page's <head> section inside a <script type="application/ld+json"> tag. JSON-LD (JavaScript Object Notation for Linked Data) is a serialization format; paired with the Schema.org vocabulary, it describes entities in a way that any machine can parse without understanding the visual layout.

Here is the same product expressed as JSON-LD:

{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "WH-1000XM5 Wireless Headphones",
  "brand": {
    "@type": "Brand",
    "name": "Sony"
  },
  "image": "https://store.example.com/images/products/wh1000xm5-hero.jpg",
  "sku": "WH1000XM5-BLK",
  "gtin13": "4548736132610",
  "description": "Industry-leading noise cancellation with 30-hour battery life.",
  "offers": {
    "@type": "Offer",
    "price": 349.99,
    "priceCurrency": "USD",
    "availability": "https://schema.org/InStock",
    "seller": {
      "@type": "Organization",
      "name": "AudioGear Pro"
    }
  },
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": 4.7,
    "reviewCount": 2847
  }
}

Every field is unambiguous. The price is a number with a currency code. Availability is a defined Schema.org enum. The brand is a typed entity, not a CSS class name. An AI agent can consume this data with zero parsing ambiguity.
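As a rough illustration of how little work the consumer has to do, here is a sketch that pulls JSON-LD out of a page using only the Python standard library. The class and function names are made up for this example, and it assumes each script tag holds a single top-level object with a plain string @type; real pages may also use arrays or an @graph wrapper.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.documents.append(json.loads("".join(self._buffer)))


def find_product(html):
    """Return the first JSON-LD object whose @type is Product, or None."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    for doc in extractor.documents:
        if isinstance(doc, dict) and doc.get("@type") == "Product":
            return doc
    return None


page = """<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "WH-1000XM5 Wireless Headphones",
 "offers": {"@type": "Offer", "price": 349.99, "priceCurrency": "USD"}}
</script></head><body></body></html>"""

product = find_product(page)
print(product["offers"]["price"], product["offers"]["priceCurrency"])  # prints: 349.99 USD
```

No class names, no layout assumptions: the same function works on any store that publishes a Product block.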

The Schema.org Vocabulary for Commerce

Schema.org provides a rich set of types relevant to commerce. The most critical ones for AI agent discoverability are:

Product is the core entity. It describes an individual product with fields for name, description, images, SKU, GTIN/UPC, brand, color, material, weight, and more.

Offer represents a specific commercial offering of a product -- its price, currency, availability, condition (new, used, refurbished), delivery information, and the seller. A single Product can have multiple Offers from different sellers.

Brand provides a typed reference to the manufacturer or brand, enabling cross-merchant brand queries.

Review and AggregateRating give agents access to social proof data without scraping review widgets. An AggregateRating with ratingValue and reviewCount lets an agent factor reputation into recommendations.

BreadcrumbList describes the category hierarchy, allowing agents to understand where a product sits in a store's taxonomy. A breadcrumb trail of "Electronics > Audio > Headphones > Over-Ear" provides categorical context that improves search relevance.

FAQPage captures commonly asked questions and answers about a product. AI agents increasingly use FAQ data to answer follow-up questions during conversational shopping sessions.

JSON-LD is the preferred format over older alternatives like Microdata and RDFa, both of which are embedded in HTML attributes. The reason is practical: JSON-LD lives in a <script> tag, usually in the document head, completely separate from the visual DOM. This means structured data can be added, updated, or corrected without touching the markup that renders the page -- a significant advantage for stores that use locked-down CMS templates or third-party themes.
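The breadcrumb trail mentioned above can be turned into a BreadcrumbList and wrapped in its script tag with a few lines of code. This is a sketch only: the function names and the URL scheme (one lowercase, hyphenated path segment per category level) are assumptions for illustration, not something a particular store or platform guarantees.

```python
import json


def breadcrumb_jsonld(trail, base_url):
    """Build a Schema.org BreadcrumbList from an 'A > B > C' category trail."""
    names = [part.strip() for part in trail.split(">")]
    items = []
    path = ""
    for position, name in enumerate(names, start=1):
        # Hypothetical URL scheme: one lowercase, hyphenated segment per level.
        path += "/" + name.lower().replace(" ", "-")
        items.append({
            "@type": "ListItem",
            "position": position,
            "name": name,
            "item": base_url + path,
        })
    return {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": items,
    }


def as_script_tag(document):
    """Wrap a JSON-LD document in a script tag, separate from the visual DOM."""
    # Escape "</" so a string value cannot close the script element early.
    body = json.dumps(document, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


doc = breadcrumb_jsonld(
    "Electronics > Audio > Headphones > Over-Ear", "https://store.example.com"
)
print(as_script_tag(doc))
```

Because the output is a self-contained tag, it can be injected by a theme snippet, a tag manager, or an edge worker without editing the product template itself.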

From HTML to Knowledge Graph

ORBEXA's core transformation pipeline converts unstructured merchant store pages into a queryable knowledge graph. The process follows five steps.

Step 1: Crawl and identify product entities. The system fetches product pages from a merchant's store and identifies which pages contain product data versus informational content, collection pages, or blog posts. Entity identification uses URL patterns (e.g., /products/ paths on Shopify stores), HTML meta tags, and existing Schema.org markup when present.

Step 2: Extract attributes. From each identified product page, the extraction layer pulls key attributes: product name, price (current and compare-at), currency, availability status, images, variant options (size, color, material), descriptions, SKU, and category information. For platforms with structured APIs like Shopify, this extraction happens through the API. For custom stores, it relies on DOM analysis and ML-based extraction.

Step 3: Normalize to Schema.org vocabulary. Extracted attributes are mapped to Schema.org types and properties. A Shopify variant.price of "34999" (cents) becomes a Schema.org Offer with "price": 349.99 and "priceCurrency": "USD". Availability strings like "in_stock", "available", or "true" all map to "https://schema.org/InStock". Brand names are normalized against a canonical brand dictionary.

Step 4: Generate JSON-LD structured data. The normalized data is serialized into valid JSON-LD documents. Each product gets a complete structured data block that includes the Product entity, its Offers, Brand reference, AggregateRating (when reviews are available), and BreadcrumbList for category context. These JSON-LD blocks are validated against Schema.org specifications and Google's structured data requirements.

Step 5: Create AI-optimized query endpoints. The structured data is served through protocol-specific endpoints. MCP resources expose products as typed resources that Claude and other LLMs can discover via MCP. UCP endpoints serve RESTful product data following Google and Shopify's Universal Commerce Protocol specification. ACP methods provide JSON-RPC access for OpenAI agent integrations. Each endpoint formats the same underlying knowledge graph data for its target protocol.
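Steps 3 and 4 can be sketched in a few lines. This is not ORBEXA's implementation: the raw record shape, the function names, and the out-of-stock spellings are assumptions, and the price conversion assumes a currency with 100 minor units per major unit.

```python
import json
from decimal import Decimal

# Raw availability strings mapped to Schema.org ItemAvailability URLs.
# The in-stock spellings come from Step 3; the out-of-stock ones are assumed.
AVAILABILITY = {
    "in_stock": "https://schema.org/InStock",
    "available": "https://schema.org/InStock",
    "true": "https://schema.org/InStock",
    "out_of_stock": "https://schema.org/OutOfStock",
    "sold_out": "https://schema.org/OutOfStock",
    "false": "https://schema.org/OutOfStock",
}


def normalize_offer(price_cents, currency, availability):
    """Map raw platform values onto a Schema.org Offer (Step 3)."""
    key = str(availability).strip().lower()
    if key not in AVAILABILITY:
        # Fail loudly rather than guess at an unknown status.
        raise ValueError(f"unmapped availability value: {availability!r}")
    # Decimal keeps the cents-to-units division exact before serialization.
    price = Decimal(int(price_cents)) / 100
    return {
        "@type": "Offer",
        "price": float(price),
        "priceCurrency": currency,
        "availability": AVAILABILITY[key],
    }


def to_jsonld(raw):
    """Serialize a hypothetical raw product record as a JSON-LD string (Step 4)."""
    document = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": raw["title"],
        "brand": {"@type": "Brand", "name": raw["vendor"]},
        "offers": normalize_offer(raw["price_cents"], raw["currency"], raw["availability"]),
    }
    return json.dumps(document, indent=2)


raw_record = {
    "title": "WH-1000XM5 Wireless Headphones",
    "vendor": "Sony",
    "price_cents": "34999",
    "currency": "USD",
    "availability": "in_stock",
}
print(to_jsonld(raw_record))
```

Raising on unmapped availability values is a deliberate choice in this sketch: a product silently marked in stock is a worse failure for an agent than a product that is temporarily missing from the graph.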

The result is a knowledge graph where each product is a node with typed, validated attributes and relationships (to brands, categories, reviews, and related products). AI agents query this graph through protocol endpoints rather than scraping HTML.

What a Knowledge Graph Entry Looks Like

To make this concrete, here is what a single product looks like after ORBEXA's pipeline processes it. The merchant's original store page was a Shopify product page with standard Liquid template HTML.

The knowledge graph entry for this product:

{
  "id": "prod_8f2a4b1c",
  "source": "shopify",
  "merchantId": "merchant_abc123",
  "schema": {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Ceramic Pour-Over Coffee Dripper",
    "brand": { "@type": "Brand", "name": "Hario" },
    "image": [
      "https://cdn.shopify.com/s/files/1/ceramic-dripper-1.jpg",
      "https://cdn.shopify.com/s/files/1/ceramic-dripper-2.jpg"
    ],
    "description": "Handcrafted ceramic dripper with spiral ridges for optimal extraction. Fits standard #2 filters.",
    "sku": "HARIO-V60-02-WHT",
    "gtin13": "4977642723115",
    "offers": {
      "@type": "Offer",
      "price": 29.95,
      "priceCurrency": "USD",
      "availability": "https://schema.org/InStock",
      "itemCondition": "https://schema.org/NewCondition"
    },
    "aggregateRating": {
      "@type": "AggregateRating",
      "ratingValue": 4.8,
      "reviewCount": 342
    }
  },
  "quality": {
    "completeness": 95,
    "accuracy": 98,
    "freshness": 100,
    "consistency": 97,
    "composite": 97.3
  },
  "metadata": {
    "lastCrawled": "2026-02-18T08:30:00Z",
    "extractionMethod": "shopify_api",
    "variantCount": 3,
    "categoryPath": "Kitchen > Coffee & Tea > Pour-Over"
  }
}

This entry contains everything an AI agent needs: structured product data in Schema.org format, quality scores for trust calibration, and metadata about data provenance. An agent can use the composite quality score to weight this product's reliability relative to alternatives. The extractionMethod field tells the agent whether the data came from a reliable API or from less certain DOM parsing.
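The article does not spell out how the composite score is derived, so the following is only a sketch of one plausible approach: a weighted average of the four component scores, followed by a trust-based ranking on the agent side. The weights are hypothetical, chosen because they happen to reproduce the 97.3 in the entry above (a plain mean of the four scores would give 97.5), and the function names are invented for this example.

```python
# Hypothetical weights; they are NOT a documented ORBEXA formula.
WEIGHTS = {
    "completeness": 0.3,
    "accuracy": 0.3,
    "freshness": 0.2,
    "consistency": 0.2,
}


def composite_score(quality):
    """Weighted average of the component scores, rounded to one decimal place."""
    total = sum(quality[name] * weight for name, weight in WEIGHTS.items())
    return round(total, 1)


def rank_by_trust(entries, minimum=90.0):
    """Drop entries below the threshold and sort the rest by composite, best first."""
    trusted = [e for e in entries if e["quality"]["composite"] >= minimum]
    return sorted(trusted, key=lambda e: e["quality"]["composite"], reverse=True)


dripper_quality = {"completeness": 95, "accuracy": 98, "freshness": 100, "consistency": 97}
print(composite_score(dripper_quality))  # prints: 97.3
```

An agent comparing several merchants' listings for the same product could call rank_by_trust to prefer the listing whose data is most complete and most recently verified.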

The Platform Shift

The move toward structured, protocol-accessible commerce data is not just an ORBEXA initiative. It reflects a broader platform shift.

In the summer of 2025, Shopify activated default MCP endpoints on every store in its ecosystem. This meant that millions of merchants -- many of whom had never heard of MCP -- suddenly had AI-accessible protocol endpoints serving their product data. The move signaled that Shopify considers agent-readiness a fundamental platform capability, not an opt-in developer feature.

Google and Shopify's Universal Commerce Protocol (UCP), announced in January 2026, defined a standard for how structured product data should flow between merchant systems and AI agents. UCP specifies discovery mechanisms (via .well-known/ucp.json), catalog endpoints, search interfaces, and checkout flows. It provides the specification; platforms like ORBEXA implement it.

These are not isolated moves. They reflect a convergence around the idea that AI agents need structured, standardized data to function effectively in commerce -- and that the industry is standardizing on specific protocols to deliver it.

Multi-Tenant Architecture

ORBEXA is designed as a multi-tenant infrastructure platform. Each merchant's data is isolated at the application layer -- a merchant's API key only accesses their own product data -- but the underlying infrastructure is shared for efficiency and cross-merchant data improvements.
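Application-layer isolation of this kind usually comes down to one rule: every lookup is scoped by the tenant that the API key resolves to. The sketch below illustrates that rule with an in-memory store. The class and method names are hypothetical, and a production system would store hashed keys in a database rather than plain strings in a dictionary.

```python
class TenantStore:
    """In-memory stand-in for shared storage with per-merchant scoping."""

    def __init__(self):
        self._keys = {}      # api_key -> merchant_id
        self._products = {}  # merchant_id -> {product_id: record}

    def register(self, api_key, merchant_id):
        self._keys[api_key] = merchant_id
        self._products.setdefault(merchant_id, {})

    def add_product(self, merchant_id, product_id, record):
        self._products[merchant_id][product_id] = record

    def get_product(self, api_key, product_id):
        """Return a product only if it belongs to the key's own merchant."""
        merchant_id = self._keys.get(api_key)
        if merchant_id is None:
            raise PermissionError("unknown API key")
        # The lookup never leaves the caller's partition, so another merchant's
        # product looks exactly like a product that does not exist.
        return self._products[merchant_id][product_id]


store = TenantStore()
store.register("key-a", "merchant_abc123")
store.register("key-b", "merchant_xyz789")
store.add_product("merchant_abc123", "prod_8f2a4b1c", {"name": "Ceramic Pour-Over Coffee Dripper"})
```

Here key-a can read prod_8f2a4b1c, while key-b gets a KeyError for the same ID even though both merchants share one store.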

The multi-tenant architecture enables features that would be impractical in single-tenant deployments. Cross-merchant brand normalization, for instance, benefits from seeing how thousands of stores reference the same brands. Category taxonomy mapping improves with more training data from diverse merchants.

Custom domain support allows merchants to serve protocol endpoints from their own domain. A merchant configures a CNAME record pointing their domain to ORBEXA's edge, and SSL is automatically provisioned. From the AI agent's perspective, it is interacting directly with the merchant's domain, not with ORBEXA's infrastructure. This preserves brand identity while providing structured protocol access.

Real-Time vs. Batch Processing

Not all product data requires the same freshness guarantee. ORBEXA uses a tiered processing model.

Real-time processing handles price changes, inventory updates, and availability status transitions. These are the data points most likely to cause agent errors if stale. When a product goes out of stock, the knowledge graph must reflect this within minutes, not hours. Webhook-based integrations (Shopify, WooCommerce) enable near-instant propagation.

Batch processing handles less time-sensitive operations: full catalog re-crawls, enrichment passes (GTIN resolution, brand normalization), quality score recalculation, and schema drift detection. These run on scheduled intervals -- typically every few hours for active merchants.

The boundary between real-time and batch is configurable per merchant. A flash-sale retailer might need real-time processing for all data points. A bookstore with stable pricing can operate entirely on batch processing without meaningful accuracy loss.
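One way to picture the configurable boundary is a per-merchant set of fields that must propagate in real time, with everything else deferred to the next batch run. The sketch below assumes that model; the MerchantConfig class, the field names, and the route_change function are illustrative and not part of any published ORBEXA interface.

```python
from dataclasses import dataclass

# Default split taken from the description above: price, inventory, availability.
REALTIME_FIELDS = frozenset({"price", "inventory", "availability"})


@dataclass
class MerchantConfig:
    merchant_id: str
    # Hypothetical per-merchant override of which fields need real-time handling.
    realtime_fields: frozenset = REALTIME_FIELDS


def route_change(config, changed_fields):
    """Split changed field names into (realtime, batch) sets for one merchant."""
    changed = set(changed_fields)
    realtime = changed & config.realtime_fields
    return realtime, changed - realtime


default_merchant = MerchantConfig("merchant_abc123")
flash_sale = MerchantConfig(
    "merchant_flash",
    frozenset({"price", "inventory", "availability", "description", "images"}),
)
bookstore = MerchantConfig("merchant_books", frozenset())

for config in (default_merchant, flash_sale, bookstore):
    print(config.merchant_id, route_change(config, ["price", "description"]))
```

The same webhook payload is routed three different ways: the default merchant pushes only the price change immediately, the flash-sale retailer pushes both fields, and the bookstore defers everything to batch.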

The Impact

Stores with structured data served through standardized protocol endpoints appear in the surfaces that increasingly drive commerce: ChatGPT's shopping experience, Google's AI Mode in Search, Microsoft Copilot's product recommendations, and Perplexity's buy-enabled search results.

The transformation from HTML product pages to knowledge graph entries is not a cosmetic change. It is the difference between being visible and being invisible to the next generation of commerce interfaces. As AI agents grow from niche tools to primary shopping interfaces, the stores that invest in structured, protocol-accessible data will capture a disproportionate share of agent-driven transactions.

The knowledge graph is the new storefront. The protocol endpoint is the new front door.
