44.3M+ structured menu items across 449k US restaurants. Methodology-documented, geo-coded, multi-source verified, available as a frozen snapshot or live API. Commercial license terms designed for model training and synthetic-data generation.
Sample delivered within 48 hours · 30-day Parquet preview
Training pipelines work better against fixed corpora. Get a Parquet dump, a methodology PDF, and a license PDF in one delivery — no API rate limits, no surprise schema changes.
Filter to data_tier=primary for operator-published data only: higher signal for menu-understanding tasks, fewer aggregator-introduced inaccuracies, and defensible provenance.
Three feeds cross-validate each item, with source-tier provenance attached. The same fact backed by multiple sources is a stronger training signal than any single feed alone.
# After purchase, you receive a signed S3 URL valid for 30 days.
import pyarrow.parquet as pq

table = pq.read_table("souslab-corpus-2026-05.parquet")
print(table.num_rows)          # → 44_312_589
print(table.schema.names[:8])  # ['restaurant_id', 'name', 'cuisines',
                               #  'address', 'price', 'item_name', ...]
print(table.column("price").to_pandas().describe())