title: AI Training Data Deals
sdk: docker
app_port: 3000
pinned: false
license: mit
AI Training Data Deals Dashboard
A system for tracking AI training data licensing deals with automated discovery and extraction.
Features
- Deals Dashboard: Searchable, filterable deals table with market analytics
- Automated Discovery: Multi-source deal discovery (RSS, News API, SEC, Exa, Perplexity)
- 5-Stage Extraction Pipeline: Preprocessing → Regex → Normalization → Canonicalization → Deduplication
- Model Registry: Track AI models with token estimates and training data linkages
- Auto-Enrichment: Automatically infers missing metadata (deal type, pricing, duration, rights)
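To make the five stages concrete, here is a minimal Python sketch of how such a pipeline could be chained together. It is not the project's actual implementation: the regex, the `ALIASES` table, the `Deal` fields, and all function names are assumptions chosen for illustration, and the pattern only recognizes one simple sentence shape.

```python
import re
from dataclasses import dataclass

# Hypothetical alias table and units; the project's real data is not shown in this README.
ALIASES = {"open ai": "OpenAI", "news corporation": "News Corp"}
UNIT_MULTIPLIER = {"million": 1e6, "billion": 1e9}

# Hypothetical pattern covering a single phrasing, for illustration only.
DEAL_PATTERN = re.compile(
    r"(?P<licensee>[A-Z][A-Za-z ]*?) signed a licensing deal with "
    r"(?P<licensor>[A-Z][A-Za-z ]*?) worth \$(?P<amount>\d+(?:\.\d+)?) "
    r"(?P<unit>million|billion)"
)


@dataclass(frozen=True)
class Deal:
    licensee: str
    licensor: str
    amount_usd: float


def preprocess(text: str) -> str:
    """Stage 1: collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip()


def extract(text: str):
    """Stage 2: pull raw fields out of the text with a regex."""
    match = DEAL_PATTERN.search(text)
    return match.groupdict() if match else None


def normalize_amount(amount: str, unit: str) -> float:
    """Stage 3: convert '250' + 'million' into a USD figure."""
    return float(amount) * UNIT_MULTIPLIER[unit]


def canonicalize(name: str) -> str:
    """Stage 4: map known aliases onto one canonical company name."""
    cleaned = name.strip()
    return ALIASES.get(cleaned.lower(), cleaned)


def deduplicate(deals):
    """Stage 5: keep the first deal seen for each (licensee, licensor) pair."""
    seen, unique = set(), []
    for deal in deals:
        key = (deal.licensee, deal.licensor)
        if key not in seen:
            seen.add(key)
            unique.append(deal)
    return unique


def run_pipeline(articles):
    deals = []
    for article in articles:
        fields = extract(preprocess(article))
        if fields is None:
            continue  # no deal-like sentence found
        amount_usd = normalize_amount(fields["amount"], fields["unit"])
        licensee = canonicalize(fields["licensee"])
        licensor = canonicalize(fields["licensor"])
        deals.append(Deal(licensee, licensor, amount_usd))
    return deduplicate(deals)
```

Two articles that describe the same deal with different spellings and units collapse to a single record, which is the reason canonicalization runs before deduplication in this sketch.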
Tech Stack
- Frontend: Next.js 14 (App Router) + React Server Components
- Database: SQLite via Prisma ORM
- Styling: Tailwind CSS
- Ingestion: Python pipeline with Exa API integration
Quick Start
Prerequisites
- Node.js 18+
- Python 3.10+
Setup
Install dependencies:
npm install
cd ingestion && pip install -r requirements.txt && cd ..
cd registry && pip install -r requirements.txt && cd ..
Generate Prisma client:
npm run db:generate
Note: Python Prisma client is optional. If setup fails, just run npm run db:generate for the Node.js client only.
Initialize database:
npm run db:push
npm run db:seed
Configure API keys (create .env file):
DATABASE_URL=file:./prisma/dev.db
EXA_API_KEY=your_exa_api_key_here # Recommended - primary discovery engine
NEWS_API_KEY=your_news_api_key_here # Optional
PERPLEXITY_API_KEY=your_perplexity_api_key_here # Optional
Note: RSS feeds work without API keys. Exa API is recommended for best results.
Start development server:
npm run dev
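Before starting the server it can help to confirm that the .env file contains what the setup steps describe. The following is a small, hypothetical Python check, not a script shipped with the project. It uses a deliberately simplified parser (it does not handle `#` inside quoted values) and treats the `your_..._here` placeholders as unset.

```python
REQUIRED_KEYS = ("DATABASE_URL",)
OPTIONAL_KEYS = ("EXA_API_KEY", "NEWS_API_KEY", "PERPLEXITY_API_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=value lines, ignoring blank lines, comments, and inline comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing " # comment" and any surrounding quotes.
        value = value.split(" #", 1)[0].strip().strip("'\"")
        values[key.strip()] = value
    return values


def check_env(values: dict[str, str]) -> dict[str, str]:
    """Report each known key as 'set' or 'missing' (placeholders count as missing)."""
    report = {}
    for key in REQUIRED_KEYS + OPTIONAL_KEYS:
        value = values.get(key, "")
        is_placeholder = value.startswith("your_")
        report[key] = "set" if value and not is_placeholder else "missing"
    return report


if __name__ == "__main__":
    with open(".env", encoding="utf-8") as handle:
        for key, status in check_env(parse_env(handle.read())).items():
            print(f"{key}: {status}")
```

A missing optional key is not an error here, since RSS discovery works without any API keys.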
Using the App
Navigation
- Deals (/) - Main dashboard with searchable deals table and market analytics
- Timeline (/timeline) - Chronological view of deals by year
- Models (/models) - Model Registry with token estimates
- Linkages (/linkages) - Connections between deals and models
- Normalization (/normalization) - Pricing normalization tool
Key Features
- Deal Discovery: Click "Discover Deals" to trigger automated discovery
- Auto-Enrichment: Automatically enriches deals with missing metadata
- Pricing Normalization: Click prices to see normalized per-unit costs
- Tooltips: Hover over underlined terms for explanations
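As an illustration of what a normalized per-unit cost can mean, the sketch below divides a deal's total value by its duration and by a unit count (for example documents or tokens). The function name, fields, and formula are assumptions for illustration; the app's actual normalization logic is not described in this README.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NormalizedPrice:
    usd_per_year: float
    usd_per_unit: Optional[float]  # None when the unit count is unknown


def normalize_price(
    total_usd: float,
    duration_years: float = 1.0,
    units: Optional[float] = None,
) -> NormalizedPrice:
    """Express a deal's total value per year and, if known, per licensed unit."""
    if duration_years <= 0:
        raise ValueError("duration_years must be positive")
    if units is not None and units <= 0:
        raise ValueError("units must be positive when provided")
    per_unit = total_usd / units if units is not None else None
    return NormalizedPrice(total_usd / duration_years, per_unit)


# Made-up numbers: a $60M, 3-year deal covering 20M documents.
example = normalize_price(60_000_000, duration_years=3, units=20_000_000)
print(example.usd_per_year, example.usd_per_unit)
```

Keeping `usd_per_unit` optional reflects that many announced deals disclose a headline figure without a unit count.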
Discovery & Ingestion
Quick Start
# Discover deals from Exa API (90 days back)
npm run discover
# Discover from all sources
npm run discover:all
Discovery Sources
- RSS Feeds: Public feeds from OpenAI, Google, Anthropic, Meta (no API key required)
- Exa API: AI-powered search (recommended - get key from https://exa.ai/)
- News API: News articles (optional - get key from https://newsapi.org/)
- SEC Filings: SEC EDGAR framework
- Perplexity Feed: AI-powered feed acquisition (optional - get key from https://www.perplexity.ai/)
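For the RSS source, discovery can be as simple as scanning feed items for deal-related terms. The sketch below uses only the Python standard library and assumes RSS 2.0 `item` elements; the keyword list and function name are hypothetical, and the project's own scrapers may work differently (Atom feeds, for instance, would need namespace handling).

```python
import xml.etree.ElementTree as ET

# Hypothetical keyword list, for illustration only.
KEYWORDS = ("licensing", "training data", "content deal")


def find_candidate_items(rss_xml: str, keywords=KEYWORDS):
    """Return title/link pairs for RSS items whose text mentions any keyword."""
    root = ET.fromstring(rss_xml)
    candidates = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        description = (item.findtext("description") or "").strip()
        link = (item.findtext("link") or "").strip()
        haystack = f"{title} {description}".lower()
        if any(keyword in haystack for keyword in keywords):
            candidates.append({"title": title, "link": link})
    return candidates
```

Items that pass this filter are only candidates; they would still need to go through the extraction pipeline before being stored as deals.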
Model Registry
Model ingestion happens automatically when you visit /models. You can also:
- Manually ingest: npm run registry:ingest
- Create linkages: npm run registry:linkages
- Enrich dates: npm run registry:enrich-dates
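One plausible way to think about a linkage is as a pairing between a deal and a model from the same organization released after the deal was signed. The sketch below encodes that heuristic. It is an assumption made for illustration, not the rule used by `registry:linkages`, and all record types and field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DealRecord:
    licensee: str
    signed_on: date


@dataclass(frozen=True)
class ModelRecord:
    name: str
    developer: str
    released_on: date


def link_deals_to_models(deals, models):
    """Pair each deal with models from the same developer released on or after signing."""
    links = []
    for deal in deals:
        for model in models:
            same_org = deal.licensee.casefold() == model.developer.casefold()
            if same_org and model.released_on >= deal.signed_on:
                links.append((deal, model))
    return links
```

A heuristic like this only suggests that a model could have been trained on the licensed data; it does not establish that it was.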
Project Structure
├── app/                  # Next.js App Router
│   ├── api/              # API routes
│   ├── components/       # React components
│   │   ├── ui/           # Shared UI components
│   │   ├── deals/        # Deal-specific components
│   │   ├── models/       # Model-specific components
│   │   └── linkages/     # Linkage components
│   ├── deals/            # Deals pages
│   ├── models/           # Model Registry pages
│   └── linkages/         # Linkage pages
├── lib/                  # Shared utilities
│   ├── api/              # API client functions
│   ├── utils/            # Utility functions
│   └── types/            # TypeScript types
├── prisma/               # Database schema
├── ingestion/            # Python scraping pipeline
│   ├── pipeline/         # Extraction pipeline stages
│   ├── scrapers/         # Source scrapers
│   └── discovery/        # Discovery engines
├── registry/             # Model registry Python code
├── docker/               # Docker configuration files
└── config/               # Configuration files
Development
Frontend
- npm run dev - Start development server
- npm run db:studio - Open Prisma Studio
- npm run db:seed - Re-seed database
Backend Pipelines
- npm run pipeline:monitor - Run monitoring cycle
- npm run registry:ingest - Ingest priority models
- npm run registry:linkages - Create linkages
- npm run registry:enrich-dates - Enrich model release dates
Docker
# Development with hot reload
docker-compose -f docker/docker-compose.dev.yml up --build
# Production
docker-compose -f docker/docker-compose.yml up --build
Troubleshooting
Python Prisma Client Generation Fails
This is expected - the Python Prisma client is optional. Run npm run db:generate for Node.js client only.
Discovery Engine Not Finding Deals
- RSS feeds: Should work immediately without API keys
- Exa API: Verify EXA_API_KEY is set correctly in .env
- Check terminal output for error messages
- Ensure database is initialized: npm run db:push
Database Connection Issues
- Verify DATABASE_URL in .env points to: file:./prisma/dev.db
- Run npm run db:push to sync schema
- Check file permissions on database file
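The checks above can be scripted. The following sketch resolves a `file:` URL to a path, confirms the file exists and is writable, and lists its tables with the standard `sqlite3` module. The function names are hypothetical, and `base_dir` is an assumption: pass whichever directory your setup resolves the relative path against.

```python
import os
import sqlite3
from contextlib import closing
from pathlib import Path


def sqlite_path_from_url(database_url: str, base_dir: str = ".") -> Path:
    """Turn 'file:./prisma/dev.db' into a filesystem path under base_dir."""
    prefix = "file:"
    if not database_url.startswith(prefix):
        raise ValueError(f"expected a URL starting with {prefix!r}")
    relative = database_url[len(prefix):].split("?", 1)[0]  # drop any query string
    return (Path(base_dir) / relative).resolve()


def check_database(database_url: str, base_dir: str = ".") -> dict:
    """Report whether the SQLite file exists, is writable, and which tables it has."""
    path = sqlite_path_from_url(database_url, base_dir)
    report = {"path": str(path), "exists": path.exists(), "writable": False, "tables": []}
    if not report["exists"]:
        return report  # avoid sqlite3.connect, which would create an empty file
    report["writable"] = os.access(path, os.W_OK)
    with closing(sqlite3.connect(path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    report["tables"] = [name for (name,) in rows]
    return report


if __name__ == "__main__":
    print(check_database(os.environ.get("DATABASE_URL", "file:./prisma/dev.db")))
```

An existing file with an empty table list usually means the schema has not been pushed yet, in which case running npm run db:push is the next step.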
License
MIT