title: AI Training Data Deals
sdk: docker
app_port: 3000
pinned: false
license: mit

AI Training Data Deals Dashboard

A system for tracking AI training data licensing deals with automated discovery and extraction.

Features

  • Deals Dashboard: Searchable, filterable deals table with market analytics
  • Automated Discovery: Multi-source deal discovery (RSS, News API, SEC, Exa, Perplexity)
  • 5-Stage Extraction Pipeline: Preprocessing → Regex → Normalization → Canonicalization → Deduplication
  • Model Registry: Track AI models with token estimates and training data linkages
  • Auto-Enrichment: Automatically infers missing metadata (deal type, pricing, duration, rights)
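
To make the five extraction stages listed above concrete, here is a minimal sketch of how they might chain together. It is not the code in ingestion/pipeline/: the `Deal` fields, the regex, and the alias table are all hypothetical and only illustrate the order of the stages.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deal:
    # Hypothetical record shape; the real schema lives in prisma/.
    licensor: str
    licensee: str
    amount_usd: float


# Illustrative alias table; a real one would be far larger.
ALIASES = {"openai inc": "OpenAI", "open ai": "OpenAI"}

# Toy pattern: "<licensee> signs $<n> million|billion deal with <licensor>".
DEAL_RE = re.compile(
    r"(?P<licensee>[\w .]+?) signs \$(?P<amount>[\d.]+) "
    r"(?P<unit>million|billion) deal with (?P<licensor>[\w .]+)",
    re.IGNORECASE,
)


def preprocess(text: str) -> str:
    """Stage 1: drop HTML tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def extract(text: str) -> Optional[dict]:
    """Stage 2: pull raw fields out with a regex."""
    match = DEAL_RE.search(text)
    return match.groupdict() if match else None


def normalize(raw: dict) -> dict:
    """Stage 3: convert "250 million" style amounts to plain USD."""
    scale = {"million": 1e6, "billion": 1e9}[raw["unit"].lower()]
    return {**raw, "amount_usd": float(raw["amount"]) * scale}


def canonicalize(name: str) -> str:
    """Stage 4: map name variants onto one canonical spelling."""
    cleaned = name.strip(" .")
    return ALIASES.get(cleaned.lower(), cleaned)


def run_pipeline(articles: list[str]) -> list[Deal]:
    seen: set[Deal] = set()
    deals: list[Deal] = []
    for article in articles:
        raw = extract(preprocess(article))
        if raw is None:
            continue
        fields = normalize(raw)
        deal = Deal(
            licensor=canonicalize(fields["licensor"]),
            licensee=canonicalize(fields["licensee"]),
            amount_usd=fields["amount_usd"],
        )
        if deal not in seen:  # Stage 5: drop exact duplicates
            seen.add(deal)
            deals.append(deal)
    return deals
```

Canonicalizing before deduplicating matters in this sketch: two articles that spell a company differently only collapse into one deal once both names map to the same canonical form.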

Tech Stack

  • Frontend: Next.js 14 (App Router) + React Server Components
  • Database: SQLite via Prisma ORM
  • Styling: Tailwind CSS
  • Ingestion: Python pipeline with Exa API integration

Quick Start

Prerequisites

  • Node.js 18+
  • Python 3.10+

Setup

  1. Install dependencies:

    npm install
    cd ingestion && pip install -r requirements.txt && cd ..
    cd registry && pip install -r requirements.txt && cd ..
    
  2. Generate Prisma client:

    npm run db:generate
    

    Note: The Python Prisma client is optional. If its setup fails, run npm run db:generate to generate the Node.js client only.

  3. Initialize database:

    npm run db:push
    npm run db:seed
    
  4. Configure API keys (create .env file):

    DATABASE_URL=file:./prisma/dev.db
    EXA_API_KEY=your_exa_api_key_here  # Recommended - primary discovery engine
    NEWS_API_KEY=your_news_api_key_here  # Optional
    PERPLEXITY_API_KEY=your_perplexity_api_key_here  # Optional
    

    Note: RSS feeds work without API keys. Exa API is recommended for best results.

  5. Start development server:

    npm run dev
    

    Open http://localhost:3000
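
Before running discovery, it can help to confirm which of the keys from step 4 are actually set. The sketch below is an assumption-laden helper, not part of the repository: the variable names come from the .env example, while the source labels and the rule that treats `your_...` placeholders as unset are illustrative. It reads the process environment (or a dict you pass in) and does not parse the .env file itself.

```python
import os
from typing import Mapping, Optional

# Environment variable names are taken from the .env example;
# the human-readable labels are illustrative.
SOURCE_KEYS = {
    "Exa": "EXA_API_KEY",
    "News API": "NEWS_API_KEY",
    "Perplexity": "PERPLEXITY_API_KEY",
}


def available_sources(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the discovery sources that appear to be configured."""
    env = os.environ if env is None else env
    sources = ["RSS"]  # RSS feeds work without an API key
    for label, key in SOURCE_KEYS.items():
        value = env.get(key, "").strip()
        # Treat the template placeholders (your_..._here) as unset.
        if value and not value.startswith("your_"):
            sources.append(label)
    return sources


if __name__ == "__main__":
    print("Configured sources:", ", ".join(available_sources()))
```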

Using the App

Navigation

  • Deals (/) - Main dashboard with searchable deals table and market analytics
  • Timeline (/timeline) - Chronological view of deals by year
  • Models (/models) - Model Registry with token estimates
  • Linkages (/linkages) - Connections between deals and models
  • Normalization (/normalization) - Pricing normalization tool

Key Features

  • Deal Discovery: Click "Discover Deals" to trigger automated discovery
  • Auto-Enrichment: Automatically enriches deals with missing metadata
  • Pricing Normalization: Click prices to see normalized per-unit costs
  • Tooltips: Hover over underlined terms for explanations
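
As a rough illustration of what "normalized per-unit costs" could mean, the sketch below divides a deal's total value by its duration and by its token volume. The function name, parameters, and output units are hypothetical; this README does not document the units the app actually uses.

```python
from typing import Optional


def normalize_price(
    total_usd: float,
    duration_years: Optional[float] = None,
    tokens: Optional[int] = None,
) -> dict:
    """Express a deal's total value in per-unit terms where inputs allow."""
    result = {}
    if duration_years:  # skip when missing or zero
        result["usd_per_year"] = total_usd / duration_years
    if tokens:  # skip when missing or zero
        result["usd_per_million_tokens"] = total_usd / (tokens / 1_000_000)
    return result


# A five-year, $250M deal works out to $50M per year.
print(normalize_price(250_000_000, duration_years=5))
```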

Discovery & Ingestion

Quick Start

# Discover deals from Exa API (90 days back)
npm run discover

# Discover from all sources
npm run discover:all

Discovery Sources

  • RSS Feeds: Public feeds from OpenAI, Google, Anthropic, Meta (no API key required)
  • Exa API: AI-powered search (recommended - get key from https://exa.ai/)
  • News API: News articles (optional - get key from https://newsapi.org/)
  • SEC Filings: SEC EDGAR framework
  • Perplexity Feed: AI-powered feed acquisition (optional - get key from https://www.perplexity.ai/)
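
Since RSS is the one source that needs no key, here is a minimal sketch of keyword-based filtering over an RSS 2.0 document using only the standard library. The keyword list and the `candidate_items` helper are assumptions for illustration; the project's scrapers in ingestion/scrapers/ may work quite differently.

```python
import xml.etree.ElementTree as ET

# Illustrative keywords; tune these for the feeds you follow.
KEYWORDS = ("licensing", "training data", "content deal")


def candidate_items(rss_xml: str) -> list[dict]:
    """Return title/link pairs for feed items that mention a keyword."""
    root = ET.fromstring(rss_xml)
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        summary = item.findtext("description", default="")
        text = f"{title} {summary}".lower()
        if any(keyword in text for keyword in KEYWORDS):
            hits.append({"title": title, "link": link})
    return hits
```

Plain substring matching like this is cheap but noisy, so items it returns are best treated as candidates for the extraction pipeline rather than confirmed deals.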

Model Registry

Model ingestion happens automatically when you visit /models. You can also:

  • Manually ingest: npm run registry:ingest
  • Create linkages: npm run registry:linkages
  • Enrich dates: npm run registry:enrich-dates

Project Structure

├── app/                    # Next.js App Router
│   ├── api/               # API routes
│   ├── components/        # React components
│   │   ├── ui/           # Shared UI components
│   │   ├── deals/        # Deal-specific components
│   │   ├── models/       # Model-specific components
│   │   └── linkages/     # Linkage components
│   ├── deals/            # Deals pages
│   ├── models/           # Model Registry pages
│   └── linkages/         # Linkage pages
├── lib/                   # Shared utilities
│   ├── api/              # API client functions
│   ├── utils/            # Utility functions
│   └── types/            # TypeScript types
├── prisma/                # Database schema
├── ingestion/             # Python scraping pipeline
│   ├── pipeline/         # Extraction pipeline stages
│   ├── scrapers/         # Source scrapers
│   └── discovery/        # Discovery engines
├── registry/              # Model registry Python code
├── docker/               # Docker configuration files
└── config/                # Configuration files

Development

Frontend

  • npm run dev - Start development server
  • npm run db:studio - Open Prisma Studio
  • npm run db:seed - Re-seed database

Backend Pipelines

  • npm run pipeline:monitor - Run monitoring cycle
  • npm run registry:ingest - Ingest priority models
  • npm run registry:linkages - Create linkages
  • npm run registry:enrich-dates - Enrich model release dates

Docker

# Development with hot reload
docker-compose -f docker/docker-compose.dev.yml up --build

# Production
docker-compose -f docker/docker-compose.yml up --build

Troubleshooting

Python Prisma Client Generation Fails

This is expected: the Python Prisma client is optional. Run npm run db:generate to generate the Node.js client only.

Discovery Engine Not Finding Deals

  • RSS feeds: Should work immediately without API keys
  • Exa API: Verify EXA_API_KEY is set correctly in .env
  • Check terminal output for error messages
  • Ensure database is initialized: npm run db:push

Database Connection Issues

  • Verify DATABASE_URL in .env points to: file:./prisma/dev.db
  • Run npm run db:push to sync schema
  • Check file permissions on database file
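
The three checks above can be scripted. The following is a hedged sketch rather than a project utility: `check_sqlite_url` is a hypothetical helper, and the base directory for relative paths is left as a parameter because it depends on how your setup resolves them.

```python
import os
import sqlite3
from pathlib import Path


def check_sqlite_url(url: str, base_dir: str = ".") -> list[str]:
    """Return a list of problems found with a file: style DATABASE_URL."""
    if not url.startswith("file:"):
        return [f"expected a file: URL, got {url!r}"]

    # base_dir is an assumption; adjust it to match your project layout.
    path = Path(base_dir) / url[len("file:"):]
    if not path.exists():
        return [f"{path} does not exist (try npm run db:push)"]
    if not os.access(path, os.R_OK | os.W_OK):
        return [f"{path} is not both readable and writable"]

    problems = []
    conn = sqlite3.connect(path)
    try:
        # Any query forces SQLite to read the file header.
        conn.execute("SELECT name FROM sqlite_master LIMIT 1").fetchall()
    except sqlite3.DatabaseError as exc:
        problems.append(f"{path} is not a usable SQLite database: {exc}")
    finally:
        conn.close()
    return problems


if __name__ == "__main__":
    url = os.environ.get("DATABASE_URL", "file:./prisma/dev.db")
    for problem in check_sqlite_url(url) or ["no problems found"]:
        print(problem)
```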

License

MIT