AI training data

Train your food AI on the largest commercially licensable US menu corpus.

44.3M+ structured menu items across 449k US restaurants. Methodology-documented, geo-coded, multi-source verified, available as a frozen snapshot or live API. Commercial license terms designed for model training and synthetic-data generation.

Sample delivered within 48 hours · 30-day Parquet preview

Frozen snapshots, methodology included

Training pipelines work better against fixed corpora. Get a Parquet dump, a methodology PDF, and a license PDF in one delivery — no API rate limits, no surprise schema changes.
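
One way to benefit from a frozen snapshot is to pin it by checksum, so every training run can confirm it is reading the same bytes. The sketch below is illustrative only: the file name and contents are stand-ins, and the delivery is not stated to include a checksum file.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so a large snapshot is never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-in bytes in a temp file; replace with the path of the delivered snapshot.
with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp) / "snapshot.parquet"  # hypothetical file name
    snapshot.write_bytes(b"example snapshot bytes")
    pinned = sha256_of_file(snapshot)   # record this next to the training config
    recheck = sha256_of_file(snapshot)  # recompute before each training run
    print(pinned == recheck)
```

Storing the pinned digest alongside the experiment config makes it easy to tell later which snapshot a given model was trained on.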

First-party POS tier

Filter to data_tier=primary for operator-published data only — higher signal for menu-understanding tasks, fewer aggregator-curated inaccuracies, defensible provenance.

Multi-source verified

Three feeds cross-validate each item, with source-tier provenance attached. The same fact backed by multiple sources is a stronger training signal than any single feed alone.
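
As a rough illustration of how source agreement could become a training signal, the sketch below counts how many feeds report the same price for an item and uses that count as a weight. The feed names, the tuple layout, and the weighting rule are all hypothetical, not the corpus's documented provenance format.

```python
from collections import Counter, defaultdict

# Hypothetical observations: (item_id, source_name, price).
observations = [
    ("item-1", "feed_a", 12.50),
    ("item-1", "feed_b", 12.50),
    ("item-1", "feed_c", 12.99),
    ("item-2", "feed_a", 8.00),
]

# Group the reported prices per item.
prices_by_item = defaultdict(list)
for item_id, _source, price in observations:
    prices_by_item[item_id].append(price)

# For each item, keep the most commonly reported price and how many
# sources back it; the count can serve as a per-example weight.
weights = {
    item_id: Counter(prices).most_common(1)[0]
    for item_id, prices in prices_by_item.items()
}

print(weights)
```

Here item-1 ends up with a price backed by two sources, while item-2 rests on a single feed and could be down-weighted or held out.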

# After purchase, you receive a signed S3 URL valid for 30 days.
# The path below assumes the file has already been downloaded locally.
import pyarrow.parquet as pq

table = pq.read_table("souslab-corpus-2026-05.parquet")
print(table.num_rows)  # → 44_312_589
print(table.schema.names[:8])
# → ['restaurant_id', 'name', 'cuisines', 'address', 'price', 'item_name', ...]
print(table.column("price").to_pandas().describe())  # price summary (needs pandas)

Pricing

Commercial licenses, sized to your buyer profile.

Indie AI startup: from $1k. Enterprise vertical AI: $5k–$50k. Foundation model lab: $25k+. Samples are delivered within 48 hours of request; a methodology PDF and a license PDF are included with every contract.

FAQ

Common questions

What does the snapshot contain?
Items, prices, modifiers, restaurant metadata (name, address, geo, cuisines), chain affiliation when known, and source-tier provenance.
What format?
Parquet by default. CSV / JSON on request.
Is the license commercial-use friendly?
Yes. Standard commercial license, no field-of-use restrictions for model training or synthetic-data generation. Resale of raw data is excluded.
Can I get refresh deliveries?
Yes — recurring snapshots monthly or quarterly. See /data for ongoing-delivery pricing.
What's the freshness?
Refreshed daily upstream; snapshot reflects the most recent crawl as of the delivery date.
Do you have synthetic-data variants?
Available on request for buyers building generation pipelines. Custom labeling (allergens, dietary categories, cuisine taxonomy) on top of the corpus is also available and priced separately.

Get the sample

Send one email and get the methodology PDF plus a 30-day Parquet preview back within 48 hours.