---
title: AI Training Data Deals
sdk: docker
app_port: 3000
pinned: false
license: mit
---
# AI Training Data Deals Dashboard
A system for tracking AI training data licensing deals with automated discovery and extraction.
## Features
- **Deals Dashboard**: Searchable, filterable deals table with market analytics
- **Automated Discovery**: Multi-source deal discovery (RSS, News API, SEC, Exa, Perplexity)
- **5-Stage Extraction Pipeline**: Preprocessing → Regex → Normalization → Canonicalization → Deduplication
- **Model Registry**: Track AI models with token estimates and training data linkages
- **Auto-Enrichment**: Automatically infers missing metadata (deal type, pricing, duration, rights)
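As a rough illustration of how the five extraction stages could chain together, here is a minimal Python sketch. Everything in it is hypothetical: the `Deal` fields, the regex, the alias table, and the sample documents (which use made-up company names) are assumptions for illustration, not the code in `ingestion/pipeline/`.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deal:
    """Hypothetical deal record; frozen so it is hashable for deduplication."""
    licensee: str
    licensor: str
    amount_usd: float


# Assumed sentence pattern: "<licensee> signs a $<amount> <unit> deal with <licensor>"
DEAL_RE = re.compile(
    r"(?P<licensee>[A-Za-z ]+?) signs a \$(?P<amount>\d+(?:\.\d+)?) "
    r"(?P<unit>million|billion) deal with (?P<licensor>[A-Za-z ]+)"
)
UNIT_SCALE = {"million": 1e6, "billion": 1e9}
# Hypothetical alias table mapping lower-cased spellings to canonical names.
ALIASES = {"acme": "Acme AI", "acme ai": "Acme AI", "example press": "Example Press"}


def preprocess(raw: str) -> str:
    """Stage 1: strip HTML tags and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


def extract(text: str) -> Optional[dict]:
    """Stage 2: pull raw fields out with a regex; None if nothing matches."""
    match = DEAL_RE.search(text)
    return match.groupdict() if match else None


def normalize(fields: dict) -> dict:
    """Stage 3: convert the amount to plain USD and trim the names."""
    return {
        "licensee": fields["licensee"].strip(),
        "licensor": fields["licensor"].strip(),
        "amount_usd": float(fields["amount"]) * UNIT_SCALE[fields["unit"]],
    }


def canonicalize(fields: dict) -> Deal:
    """Stage 4: map name variants onto one canonical spelling."""
    def canon(name: str) -> str:
        return ALIASES.get(name.lower(), name)
    return Deal(canon(fields["licensee"]), canon(fields["licensor"]), fields["amount_usd"])


def deduplicate(deals: list) -> list:
    """Stage 5: drop exact duplicates while preserving first-seen order."""
    seen = set()
    unique = []
    for deal in deals:
        if deal not in seen:
            seen.add(deal)
            unique.append(deal)
    return unique


def run_pipeline(raw_docs: list) -> list:
    deals = []
    for raw in raw_docs:
        fields = extract(preprocess(raw))
        if fields is not None:
            deals.append(canonicalize(normalize(fields)))
    return deduplicate(deals)


# Made-up sample input: two spellings of the same deal plus one unrelated document.
SAMPLE_DOCS = [
    "<p>Acme AI signs a $250 million deal with Example Press.</p>",
    "acme   signs a $0.25 billion deal with example press",
    "Unrelated article with no deal in it",
]
```

Running `run_pipeline(SAMPLE_DOCS)` on these samples collapses both spellings into a single `Deal`, which is the point of running canonicalization before deduplication.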
## Tech Stack
- **Frontend**: Next.js 14 (App Router) + React Server Components
- **Database**: SQLite via Prisma ORM
- **Styling**: Tailwind CSS
- **Ingestion**: Python pipeline with Exa API integration
## Quick Start
### Prerequisites
- Node.js 18+
- Python 3.10+
### Setup
1. **Install dependencies:**
```bash
npm install
cd ingestion && pip install -r requirements.txt && cd ..
cd registry && pip install -r requirements.txt && cd ..
```
2. **Generate Prisma client:**
```bash
npm run db:generate
```
Note: The Python Prisma client is optional. If its setup fails, just run `npm run db:generate` for the Node.js client only.
3. **Initialize database:**
```bash
npm run db:push
npm run db:seed
```
4. **Configure API keys** (create `.env` file):
```bash
DATABASE_URL=file:./prisma/dev.db
EXA_API_KEY=your_exa_api_key_here # Recommended - primary discovery engine
NEWS_API_KEY=your_news_api_key_here # Optional
PERPLEXITY_API_KEY=your_perplexity_api_key_here # Optional
```
Note: RSS feeds work without API keys. Exa API is recommended for best results.
5. **Start development server:**
```bash
npm run dev
```
Open http://localhost:3000
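To see which discovery sources the keys from step 4 would enable, a small check like the following can help. It is a sketch under assumptions: the variable names come from the `.env` example, but the label mapping and the rule that `your_...` placeholders count as unset are illustrative. It reads the process environment, so it assumes `.env` has already been loaded by whatever tooling you use.

```python
import os
from typing import Mapping

# Variable names are taken from the .env example; the labels are illustrative.
OPTIONAL_KEYS = {
    "EXA_API_KEY": "Exa API",
    "NEWS_API_KEY": "News API",
    "PERPLEXITY_API_KEY": "Perplexity",
}


def enabled_sources(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return labels of key-gated sources that look configured, plus RSS."""
    sources = ["RSS feeds"]  # RSS feeds need no API key
    for var, label in OPTIONAL_KEYS.items():
        value = env.get(var, "").strip()
        # Assumption: an empty value or a "your_..." placeholder means "not set".
        if value and not value.startswith("your_"):
            sources.append(label)
    return sources


if __name__ == "__main__":
    print("Configured sources:", ", ".join(enabled_sources()))
```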
## Using the App
### Navigation
- **Deals** (`/`) - Main dashboard with searchable deals table and market analytics
- **Timeline** (`/timeline`) - Chronological view of deals by year
- **Models** (`/models`) - Model Registry with token estimates
- **Linkages** (`/linkages`) - Connections between deals and models
- **Normalization** (`/normalization`) - Pricing normalization tool
### Key Features
- **Deal Discovery**: Click "Discover Deals" to trigger automated discovery
- **Auto-Enrichment**: Automatically enriches deals with missing metadata
- **Pricing Normalization**: Click prices to see normalized per-unit costs
- **Tooltips**: Hover over underlined terms for explanations
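The README does not say which units the pricing normalization uses, so the following Python sketch only illustrates the general idea of a per-unit cost. The `DealPricing` fields and the two output units (USD per year, USD per million tokens) are assumptions, not the app's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DealPricing:
    """Hypothetical pricing record for one deal."""
    total_usd: float
    duration_years: Optional[float] = None
    tokens: Optional[float] = None


def normalize_pricing(pricing: DealPricing) -> dict:
    """Compute whichever per-unit costs the available fields allow."""
    result = {}
    if pricing.duration_years and pricing.duration_years > 0:
        result["usd_per_year"] = pricing.total_usd / pricing.duration_years
    if pricing.tokens and pricing.tokens > 0:
        result["usd_per_million_tokens"] = pricing.total_usd / (pricing.tokens / 1_000_000)
    return result
```

For example, a hypothetical $10M deal over 4 years covering 2 billion tokens works out to $2.5M per year and $5,000 per million tokens. Fields that are missing or zero are skipped rather than guessed.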
## Discovery & Ingestion
### Quick Start
```bash
# Discover deals from Exa API (90 days back)
npm run discover
# Discover from all sources
npm run discover:all
```
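The 90-day lookback mentioned above amounts to a simple date window. A minimal sketch, assuming the window is inclusive at both ends (the function names are hypothetical and the actual discovery code may treat the boundaries differently):

```python
from datetime import date, timedelta


def lookback_start(today: date, days: int = 90) -> date:
    """First day of the lookback window."""
    return today - timedelta(days=days)


def within_window(published: date, today: date, days: int = 90) -> bool:
    """True if an article's publish date falls inside the window (inclusive)."""
    return lookback_start(today, days) <= published <= today
```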
### Discovery Sources
- **RSS Feeds**: Public feeds from OpenAI, Google, Anthropic, Meta (no API key required)
- **Exa API**: AI-powered search (recommended - get key from https://exa.ai/)
- **News API**: News articles (optional - get key from https://newsapi.org/)
- **SEC Filings**: SEC EDGAR framework
- **Perplexity Feed**: AI-powered feed acquisition (optional - get key from https://www.perplexity.ai/)
### Model Registry
Model ingestion happens automatically when you visit `/models`. You can also:
- Manually ingest: `npm run registry:ingest`
- Create linkages: `npm run registry:linkages`
- Enrich dates: `npm run registry:enrich-dates`
## Project Structure
```
├── app/                  # Next.js App Router
│   ├── api/              # API routes
│   ├── components/       # React components
│   │   ├── ui/           # Shared UI components
│   │   ├── deals/        # Deal-specific components
│   │   ├── models/       # Model-specific components
│   │   └── linkages/     # Linkage components
│   ├── deals/            # Deals pages
│   ├── models/           # Model Registry pages
│   └── linkages/         # Linkage pages
├── lib/                  # Shared utilities
│   ├── api/              # API client functions
│   ├── utils/            # Utility functions
│   └── types/            # TypeScript types
├── prisma/               # Database schema
├── ingestion/            # Python scraping pipeline
│   ├── pipeline/         # Extraction pipeline stages
│   ├── scrapers/         # Source scrapers
│   └── discovery/        # Discovery engines
├── registry/             # Model registry Python code
├── docker/               # Docker configuration files
└── config/               # Configuration files
```
## Development
### Frontend
- `npm run dev` - Start development server
- `npm run db:studio` - Open Prisma Studio
- `npm run db:seed` - Re-seed database
### Backend Pipelines
- `npm run pipeline:monitor` - Run monitoring cycle
- `npm run registry:ingest` - Ingest priority models
- `npm run registry:linkages` - Create linkages
- `npm run registry:enrich-dates` - Enrich model release dates
## Docker
```bash
# Development with hot reload
docker-compose -f docker/docker-compose.dev.yml up --build
# Production
docker-compose -f docker/docker-compose.yml up --build
```
## Troubleshooting
### Python Prisma Client Generation Fails
This is expected: the Python Prisma client is optional. Run `npm run db:generate` for the Node.js client only.
### Discovery Engine Not Finding Deals
- **RSS feeds**: Should work immediately without API keys
- **Exa API**: Verify `EXA_API_KEY` is set correctly in `.env`
- Check terminal output for error messages
- Ensure database is initialized: `npm run db:push`
### Database Connection Issues
- Verify `DATABASE_URL` in `.env` points to: `file:./prisma/dev.db`
- Run `npm run db:push` to sync schema
- Check file permissions on database file
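The checks above can be scripted. The sketch below uses only the Python standard library. Which directory a relative `file:` path resolves against depends on the tooling, so `base_dir` is an explicit, assumed parameter, and the function names are hypothetical.

```python
import os
import sqlite3
from pathlib import Path


def sqlite_path_from_url(url: str, base_dir: str = ".") -> Path:
    """Turn a `file:` style DATABASE_URL into a filesystem path."""
    if not url.startswith("file:"):
        raise ValueError(f"expected a file: URL, got {url!r}")
    return (Path(base_dir) / url[len("file:"):]).resolve()


def check_database(url: str, base_dir: str = ".") -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    path = sqlite_path_from_url(url, base_dir)
    if not path.exists():
        return [f"{path} does not exist; try `npm run db:push`"]

    problems = []
    if not os.access(path, os.R_OK | os.W_OK):
        problems.append(f"{path} is not readable and writable by this user")

    conn = sqlite3.connect(path)
    try:
        tables = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
        if not tables:
            problems.append("database has no tables; try `npm run db:push`")
    except sqlite3.DatabaseError as exc:
        problems.append(f"could not read {path} as SQLite: {exc}")
    finally:
        conn.close()
    return problems


if __name__ == "__main__":
    url = os.environ.get("DATABASE_URL", "file:./prisma/dev.db")
    for problem in check_database(url) or ["no problems found"]:
        print(problem)
```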
## License
MIT