| title: AI Training Data Deals | |
| sdk: docker | |
| app_port: 3000 | |
| pinned: false | |
| license: mit | |
| # AI Training Data Deals Dashboard | |
| A system for tracking AI training data licensing deals with automated discovery and extraction. | |
| ## Features | |
| - **Deals Dashboard**: Searchable, filterable deals table with market analytics | |
| - **Automated Discovery**: Multi-source deal discovery (RSS, News API, SEC, Exa, Perplexity) | |
| - **5-Stage Extraction Pipeline**: Preprocessing → Regex → Normalization → Canonicalization → Deduplication | |
| - **Model Registry**: Track AI models with token estimates and training data linkages | |
| - **Auto-Enrichment**: Automatically infers missing metadata (deal type, pricing, duration, rights) | |
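The five extraction stages can be pictured with a minimal sketch. This is not the project's implementation: the regex, the alias table, the field names, and the company names are all hypothetical and only illustrate how the stages could chain together.

```python
import html
import re
from typing import Optional

# Hypothetical alias table: lowercase name variant -> canonical name.
ALIASES = {"acme media inc": "Acme Media", "example ai labs": "ExampleAI"}
MULTIPLIERS = {"million": 1_000_000, "billion": 1_000_000_000}

# Hypothetical pattern for sentences like "X licenses (its) data to Y for $N million".
DEAL_RE = re.compile(
    r"(?P<licensor>[A-Z][\w ]*?) licenses? (?:its )?data to (?P<licensee>[A-Z][\w ]*?)"
    r" for \$(?P<amount>[\d.]+) (?P<unit>million|billion)"
)


def preprocess(raw: str) -> str:
    """Stage 1: strip HTML tags, decode entities, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)
    return " ".join(html.unescape(text).split())


def extract(text: str) -> Optional[dict]:
    """Stage 2: pull raw fields out of the text with a regex."""
    match = DEAL_RE.search(text)
    return match.groupdict() if match else None


def normalize(fields: dict) -> dict:
    """Stage 3: convert the amount and unit into a plain USD integer."""
    amount_usd = round(float(fields["amount"]) * MULTIPLIERS[fields["unit"]])
    return {
        "licensor": fields["licensor"].strip(),
        "licensee": fields["licensee"].strip(),
        "amount_usd": amount_usd,
    }


def canonicalize(deal: dict) -> dict:
    """Stage 4: map name variants onto one canonical spelling."""
    def canon(name: str) -> str:
        return ALIASES.get(name.lower(), name)

    return {**deal, "licensor": canon(deal["licensor"]), "licensee": canon(deal["licensee"])}


def deduplicate(deals: list) -> list:
    """Stage 5: keep the first deal seen for each (licensor, licensee, amount) key."""
    seen, unique = set(), []
    for deal in deals:
        key = (deal["licensor"], deal["licensee"], deal["amount_usd"])
        if key not in seen:
            seen.add(key)
            unique.append(deal)
    return unique


def run_pipeline(documents: list) -> list:
    """Run all five stages over raw documents, skipping those with no match."""
    deals = []
    for raw in documents:
        fields = extract(preprocess(raw))
        if fields is not None:
            deals.append(canonicalize(normalize(fields)))
    return deduplicate(deals)
```

Running canonicalization before deduplication is what lets two articles that spell a company name differently collapse into a single deal record.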
| ## Tech Stack | |
| - **Frontend**: Next.js 14 (App Router) + React Server Components | |
| - **Database**: SQLite via Prisma ORM | |
| - **Styling**: Tailwind CSS | |
| - **Ingestion**: Python pipeline with Exa API integration | |
| ## Quick Start | |
| ### Prerequisites | |
| - Node.js 18+ | |
| - Python 3.10+ | |
| ### Setup | |
| 1. **Install dependencies:** | |
| ```bash | |
| npm install | |
| cd ingestion && pip install -r requirements.txt && cd .. | |
| cd registry && pip install -r requirements.txt && cd .. | |
| ``` | |
| 2. **Generate Prisma client:** | |
| ```bash | |
| npm run db:generate | |
| ``` | |
| Note: The Python Prisma client is optional. If its setup fails, run `npm run db:generate` to generate the Node.js client only. | |
| 3. **Initialize database:** | |
| ```bash | |
| npm run db:push | |
| npm run db:seed | |
| ``` | |
| 4. **Configure API keys** (create `.env` file): | |
| ```bash | |
| DATABASE_URL=file:./prisma/dev.db | |
| EXA_API_KEY=your_exa_api_key_here # Recommended - primary discovery engine | |
| NEWS_API_KEY=your_news_api_key_here # Optional | |
| PERPLEXITY_API_KEY=your_perplexity_api_key_here # Optional | |
| ``` | |
| Note: RSS feeds work without API keys. Exa API is recommended for best results. | |
| 5. **Start development server:** | |
| ```bash | |
| npm run dev | |
| ``` | |
| Open http://localhost:3000 | |
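Since only some discovery sources need an API key, it can help to check which ones are usable with the current configuration. The following is a rough sketch, not code from the repository: the source names and the `enabled_sources` helper are hypothetical, and SEC is left out because the README does not say whether it needs a key. It reads the process environment, so values in `.env` must already be loaded (for example by the shell or a dotenv loader).

```python
import os
from typing import Mapping, Optional

# Hypothetical source names mapped to the environment variable that holds
# their API key. None means the source works without a key.
SOURCE_KEYS = {
    "rss": None,
    "exa": "EXA_API_KEY",
    "news_api": "NEWS_API_KEY",
    "perplexity": "PERPLEXITY_API_KEY",
}


def enabled_sources(env: Optional[Mapping[str, str]] = None) -> list:
    """Return the sources that are usable given the available API keys."""
    env = os.environ if env is None else env
    return [
        name
        for name, key in SOURCE_KEYS.items()
        if key is None or env.get(key, "").strip()
    ]


if __name__ == "__main__":
    print("Usable discovery sources:", ", ".join(enabled_sources()))
```

With no keys set, only `rss` is reported, which matches the note that RSS feeds work without API keys.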
| ## Using the App | |
| ### Navigation | |
| - **Deals** (`/`) - Main dashboard with searchable deals table and market analytics | |
| - **Timeline** (`/timeline`) - Chronological view of deals by year | |
| - **Models** (`/models`) - Model Registry with token estimates | |
| - **Linkages** (`/linkages`) - Connections between deals and models | |
| - **Normalization** (`/normalization`) - Pricing normalization tool | |
| ### Key Features | |
| - **Deal Discovery**: Click "Discover Deals" to trigger automated discovery | |
| - **Auto-Enrichment**: Automatically fills in missing deal metadata | |
| - **Pricing Normalization**: Click prices to see normalized per-unit costs | |
| - **Tooltips**: Hover over underlined terms for explanations | |
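As an illustration of what a normalized per-unit cost could look like, here is a small sketch. It assumes the unit is either a year or a million tokens; the app's actual formula is not documented here, and the `DealTerms` type, its fields, and the figures in the usage note are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DealTerms:
    """Hypothetical deal terms used only for this illustration."""
    total_usd: float
    duration_years: float
    token_count: int

    def __post_init__(self) -> None:
        if self.total_usd < 0 or self.duration_years <= 0 or self.token_count <= 0:
            raise ValueError(
                "total must be non-negative; duration and token count must be positive"
            )


def usd_per_year(terms: DealTerms) -> float:
    """Annualized deal value."""
    return terms.total_usd / terms.duration_years


def usd_per_million_tokens(terms: DealTerms) -> float:
    """Deal value per one million licensed tokens."""
    return terms.total_usd / (terms.token_count / 1_000_000)
```

For example, a made-up $12M, 3-year deal covering 400 billion tokens works out to $4M per year and $30 per million tokens.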
| ## Discovery & Ingestion | |
| ### Quick Start | |
| ```bash | |
| # Discover deals via the Exa API (looking back 90 days) | |
| npm run discover | |
| # Discover from all sources | |
| npm run discover:all | |
| ``` | |
| ### Discovery Sources | |
| - **RSS Feeds**: Public feeds from OpenAI, Google, Anthropic, Meta (no API key required) | |
| - **Exa API**: AI-powered search (recommended - get key from https://exa.ai/) | |
| - **News API**: News articles (optional - get key from https://newsapi.org/) | |
| - **SEC Filings**: SEC EDGAR framework | |
| - **Perplexity Feed**: AI-powered feed acquisition (optional - get key from https://www.perplexity.ai/) | |
| ### Model Registry | |
| Model ingestion happens automatically when you visit `/models`. You can also: | |
| - Manually ingest: `npm run registry:ingest` | |
| - Create linkages: `npm run registry:linkages` | |
| - Enrich dates: `npm run registry:enrich-dates` | |
| ## Project Structure | |
| ``` | |
| ├── app/                    # Next.js App Router | |
| │   ├── api/                # API routes | |
| │   ├── components/         # React components | |
| │   │   ├── ui/             # Shared UI components | |
| │   │   ├── deals/          # Deal-specific components | |
| │   │   ├── models/         # Model-specific components | |
| │   │   └── linkages/       # Linkage components | |
| │   ├── deals/              # Deals pages | |
| │   ├── models/             # Model Registry pages | |
| │   └── linkages/           # Linkage pages | |
| ├── lib/                    # Shared utilities | |
| │   ├── api/                # API client functions | |
| │   ├── utils/              # Utility functions | |
| │   └── types/              # TypeScript types | |
| ├── prisma/                 # Database schema | |
| ├── ingestion/              # Python scraping pipeline | |
| │   ├── pipeline/           # Extraction pipeline stages | |
| │   ├── scrapers/           # Source scrapers | |
| │   └── discovery/          # Discovery engines | |
| ├── registry/               # Model registry Python code | |
| ├── docker/                 # Docker configuration files | |
| └── config/                 # Configuration files | |
| ``` | |
| ## Development | |
| ### Frontend | |
| - `npm run dev` - Start development server | |
| - `npm run db:studio` - Open Prisma Studio | |
| - `npm run db:seed` - Re-seed database | |
| ### Backend Pipelines | |
| - `npm run pipeline:monitor` - Run monitoring cycle | |
| - `npm run registry:ingest` - Ingest priority models | |
| - `npm run registry:linkages` - Create linkages | |
| - `npm run registry:enrich-dates` - Enrich model release dates | |
| ## Docker | |
| ```bash | |
| # Development with hot reload | |
| docker-compose -f docker/docker-compose.dev.yml up --build | |
| # Production | |
| docker-compose -f docker/docker-compose.yml up --build | |
| ``` | |
| ## Troubleshooting | |
| ### Python Prisma Client Generation Fails | |
| This is expected: the Python Prisma client is optional. Run `npm run db:generate` to generate the Node.js client only. | |
| ### Discovery Engine Not Finding Deals | |
| - **RSS feeds**: Should work immediately without API keys | |
| - **Exa API**: Verify `EXA_API_KEY` is set correctly in `.env` | |
| - Check terminal output for error messages | |
| - Ensure database is initialized: `npm run db:push` | |
| ### Database Connection Issues | |
| - Verify that `DATABASE_URL` in `.env` is set to `file:./prisma/dev.db` | |
| - Run `npm run db:push` to sync schema | |
| - Check file permissions on database file | |
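A quick way to confirm that the database file exists and that the schema has been pushed is to open it read-only and list its tables. This is a sketch using only Python's standard `sqlite3` module, not a script shipped with the project. The helper names are hypothetical, and the path is passed in explicitly because where a relative `file:` URL resolves depends on the tooling that reads it.

```python
import sqlite3
from pathlib import Path


def sqlite_path_from_url(database_url: str) -> Path:
    """Extract the file path from a 'file:...' style DATABASE_URL."""
    prefix = "file:"
    if not database_url.startswith(prefix):
        raise ValueError(f"expected a URL starting with {prefix!r}, got {database_url!r}")
    return Path(database_url[len(prefix):])


def list_tables(db_path: Path) -> list:
    """Open the database read-only and return its table names, sorted."""
    if not db_path.is_file():
        raise FileNotFoundError(f"database file not found: {db_path}")
    # mode=ro ensures this check can never create or modify the database.
    uri = db_path.resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    path = sqlite_path_from_url("file:./prisma/dev.db")
    print(f"{path}: tables = {list_tables(path)}")
```

An empty table list suggests the file exists but the schema has not been synced yet, in which case `npm run db:push` is the next step.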
| ## License | |
| MIT | |