# Ingest Pipeline

The ingest pipeline is the heart of OpenMark. It merges all your data, embeds everything, and writes to both ChromaDB and Neo4j.

---

## Command

```bash
python scripts/ingest.py [options]
```

| Flag | Default | Description |
|------|---------|-------------|
| `--provider local` | from `.env` | Use local pplx-embed models |
| `--provider azure` | from `.env` | Use Azure AI Foundry embeddings |
| `--fresh-raindrop` | off | Also pull live from Raindrop API during merge |
| `--skip-similar` | off | Skip SIMILAR_TO edge computation (saves ~30 min) |
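
The flags in the table could be parsed along these lines. This is only a sketch of one plausible layout, not the actual contents of `scripts/ingest.py`: the function names `build_parser` and `resolve_provider` are hypothetical, and the fallback to `local` when `EMBEDDING_PROVIDER` is unset is an assumption.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="ingest.py")
    parser.add_argument(
        "--provider",
        choices=["local", "azure"],
        default=None,  # None means: fall back to EMBEDDING_PROVIDER
        help="Embedding provider; overrides EMBEDDING_PROVIDER from .env",
    )
    parser.add_argument("--fresh-raindrop", action="store_true",
                        help="Also pull live from the Raindrop API during merge")
    parser.add_argument("--skip-similar", action="store_true",
                        help="Skip SIMILAR_TO edge computation")
    return parser


def resolve_provider(cli_value, env=os.environ):
    """The flag wins; otherwise use the environment (assumed default: local)."""
    return cli_value or env.get("EMBEDDING_PROVIDER", "local")


args = build_parser().parse_args(["--provider", "azure", "--skip-similar"])
print(resolve_provider(args.provider), args.skip_similar, args.fresh_raindrop)
```

Leaving `--provider` at `None` by default lets the code tell "flag not given" apart from an explicit choice, which is what makes the `.env` fallback possible.
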

---

## Pipeline Steps

### Step 1: Merge

Loads and deduplicates all sources:

- `CATEGORIZED.json`: pre-categorized bookmarks from Edge + Raindrop + daily.dev
- `linkedin_saved.json`: LinkedIn saved posts
- `youtube_MASTER.json`: liked videos, watch later, playlists (not subscriptions)

Deduplication is URL-based (case-insensitive, trailing slash stripped). If the same URL appears in multiple sources, the first occurrence wins.
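
The deduplication rule can be sketched as follows. The helper names `normalize_url` and `dedupe` are illustrative, and the item shape (a dict with a `url` key) is an assumption rather than something taken from the OpenMark code.

```python
def normalize_url(url: str) -> str:
    """Dedup key: lowercased URL with trailing slashes removed."""
    return url.lower().rstrip("/")


def dedupe(items):
    """Keep the first item seen for each normalized URL, preserving order."""
    seen = set()
    unique = []
    for item in items:
        key = normalize_url(item["url"])
        if key in seen:
            continue  # a later duplicate loses to the first occurrence
        seen.add(key)
        unique.append(item)
    return unique


items = [
    {"url": "https://Example.com/post/", "source": "raindrop"},
    {"url": "https://example.com/post", "source": "edge"},
    {"url": "https://example.com/other", "source": "linkedin"},
]
print([item["source"] for item in dedupe(items)])
```

Because the first occurrence wins, the order in which sources are loaded decides which copy's metadata survives.
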
Each item gets a `doc_text` field built for embedding:

```
{title} | {category} | {tag1 tag2 tag3} | {content/excerpt/channel}
```

This rich text is what gets embedded, not just the title.
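
A minimal sketch of how such a string might be assembled is shown here. The function name `build_doc_text`, the dictionary keys, and the fallback order for the last slot (content, then excerpt, then channel) are assumptions for illustration.

```python
def build_doc_text(item: dict) -> str:
    """Join title, category, space-separated tags, and a body field with ' | '."""
    tags = " ".join(item.get("tags", []))
    # Assumed fallback order for the last slot: content, then excerpt, then channel.
    body = item.get("content") or item.get("excerpt") or item.get("channel") or ""
    return " | ".join([item.get("title", ""), item.get("category", ""), tags, body])


print(build_doc_text({
    "title": "LangGraph tutorial",
    "category": "AI",
    "tags": ["agents", "rag"],
    "excerpt": "Build stateful agents",
}))
```
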
**Output:** ~8,000 normalized items in memory.

---

### Step 2: Embedding

Loads the embedding provider specified by `EMBEDDING_PROVIDER` in `.env` (or by the `--provider` flag).

**Local (pplx-embed):**

- Query model: `perplexity-ai/pplx-embed-v1-0.6b`, used for user search queries
- Document model: `perplexity-ai/pplx-embed-context-v1-0.6b`, used for bookmark documents
- Output dimension: 1024
- Downloaded once to the HuggingFace cache (~1.2 GB total), then reused at no cost on every subsequent run
- **Known compatibility issue:** pplx-embed requires `sentence-transformers==3.3.1` and two runtime patches (applied automatically in `local.py`). See [troubleshooting.md](troubleshooting.md) for details.

**Azure:**

- Uses `text-embedding-ada-002` (or the deployment configured in `AZURE_DEPLOYMENT_EMBED`)
- Output dimension: 1536
- Cost: ~€0.30 for 8,000 items (as of 2026)
- Batched in groups of 100 with progress logging
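
Batching with progress logging might look roughly like this. `embed_batch` is a placeholder for whichever provider call is in use (local or Azure); it is not a real OpenMark or Azure function, and the logger name is arbitrary.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingest")


def batched(seq, size=100):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def embed_all(texts, embed_batch, size=100):
    """Embed texts batch by batch; embed_batch stands in for the provider call."""
    vectors = []
    for batch in batched(texts, size):
        vectors.extend(embed_batch(batch))
        log.info("embedded %d/%d", len(vectors), len(texts))
    return vectors
```

With 8,000 items and a batch size of 100, this makes 80 provider calls.
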

---

### Step 3: ChromaDB Ingest

Embeds all documents in batches of 100 and stores them in ChromaDB.

- Skips items already in ChromaDB (resumable, so it is safe to re-run)
- Stores: URL (as ID), embedding vector, title, category, source, score, tags
- Uses cosine similarity space (`hnsw:space: cosine`)
- Database written to disk at `CHROMA_PATH` (default: `OpenMark/data/chroma_db/`)
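
The skip-if-present behaviour can be sketched against any object that behaves like a ChromaDB collection. The function name `ingest_batch`, the item keys, and the comma-joined tag format are assumptions; only `get(ids=...)` and `add(ids=..., embeddings=..., metadatas=...)` are relied on from the collection.

```python
def ingest_batch(collection, items, embed_batch):
    """Add only items whose URL is not yet stored; return how many were added.

    `collection` is expected to behave like a chromadb Collection:
    get(ids=[...]) returns a dict with an "ids" list of the IDs that exist.
    """
    ids = [item["url"] for item in items]
    existing = set(collection.get(ids=ids)["ids"])
    new_items = [item for item in items if item["url"] not in existing]
    if not new_items:
        return 0  # nothing to embed, so no provider call is made
    collection.add(
        ids=[item["url"] for item in new_items],
        embeddings=embed_batch([item["doc_text"] for item in new_items]),
        metadatas=[
            {
                "title": item["title"],
                "category": item["category"],
                "source": item["source"],
                "score": item["score"],
                # Metadata values are scalars, so tags are joined (assumed format).
                "tags": ",".join(item["tags"]),
            }
            for item in new_items
        ],
    )
    return len(new_items)
```

With the real library, a collection could be obtained via `chromadb.PersistentClient(path=...)` and `get_or_create_collection(name, metadata={"hnsw:space": "cosine"})`; the collection name is up to the project. Checking for existing IDs before embedding is what makes a re-run cheap, since already-stored items never reach the embedding provider.
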
**Timing:**

| Provider | 8K items | Notes |
|----------|----------|-------|
| Local pplx-embed (CPU) | ~20 min | No GPU detected = CPU inference |
| Local pplx-embed (GPU) | ~3 min | NVIDIA GPU with CUDA |
| Azure AI Foundry | ~5 min | Network bound |

---

### Step 4: Neo4j Ingest

Creates nodes and relationships in batches of 200.

**Nodes created:**

- `Bookmark`: url, title, score
- `Category`: name
- `Tag`: name
- `Source`: name (raindrop, linkedin, youtube_liked, edge, dailydev, etc.)
- `Domain`: extracted from URL (e.g. `github.com`, `medium.com`)

**Relationships created:**

- `(Bookmark)-[:IN_CATEGORY]->(Category)`
- `(Bookmark)-[:TAGGED]->(Tag)`
- `(Bookmark)-[:FROM_SOURCE]->(Source)`
- `(Bookmark)-[:FROM_DOMAIN]->(Domain)`
- `(Tag)-[:CO_OCCURS_WITH {count}]-(Tag)`: built after all nodes are written
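
The `count` on a `CO_OCCURS_WITH` edge can be read as the number of bookmarks that carry both tags. Under that assumption, the counts could be computed like this (the function name and item shape are illustrative):

```python
from collections import Counter
from itertools import combinations


def co_occurrence_counts(items):
    """Count how many bookmarks carry each unordered pair of distinct tags."""
    counts = Counter()
    for item in items:
        # Sorting gives each pair one canonical order, so (a, b) and (b, a) merge.
        for pair in combinations(sorted(set(item["tags"])), 2):
            counts[pair] += 1
    return counts


sample = [
    {"tags": ["rag", "llm"]},
    {"tags": ["llm", "rag", "agents"]},
    {"tags": ["agents"]},
]
print(co_occurrence_counts(sample))
```
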
**Timing:** ~3-5 minutes for 8K items.

**Idempotent:** Uses `MERGE` everywhere, so it is safe to re-run and won't create duplicates.
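
A batched, idempotent write might be sketched as below. The Cypher text and the `write_bookmarks` helper are illustrative assumptions, not OpenMark's actual queries, and only the category relationship is shown. `session` needs a `run(query, **params)` method, which sessions from the official `neo4j` Python driver provide.

```python
MERGE_BOOKMARKS = """
UNWIND $rows AS row
MERGE (b:Bookmark {url: row.url})
SET b.title = row.title, b.score = row.score
MERGE (c:Category {name: row.category})
MERGE (b)-[:IN_CATEGORY]->(c)
"""


def write_bookmarks(session, items, batch_size=200):
    """Send items in batches of `batch_size`; return the number of batches."""
    batches = 0
    for start in range(0, len(items), batch_size):
        rows = [
            {
                "url": item["url"],
                "title": item["title"],
                "score": item["score"],
                "category": item["category"],
            }
            for item in items[start:start + batch_size]
        ]
        # MERGE matches an existing node or creates it, so re-runs add nothing new.
        session.run(MERGE_BOOKMARKS, rows=rows)
        batches += 1
    return batches
```

Merging on `url` alone and setting the other properties afterwards keeps the match key stable even when a title or score changes between runs.
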

---

### Step 5: SIMILAR_TO Edges

This is the most powerful and most time-consuming step.

For each of the 8K bookmarks, OpenMark queries ChromaDB for its top-5 nearest semantic neighbors and writes those as `SIMILAR_TO` edges in Neo4j with a similarity score.

```
(Bookmark {url: "...langchain-docs..."})-[:SIMILAR_TO {score: 0.94}]->(Bookmark {url: "...langgraph-tutorial..."})
```

These edges encode **semantic connections you never manually created**. The knowledge graph becomes a web of meaning, not just a web of tags.

**Timing:** ~25-40 minutes on CPU for 8K items. This is the longest step.

**Skip it if you're in a hurry:**

```bash
python scripts/ingest.py --skip-similar
```

Everything else works without SIMILAR_TO edges. You only lose the `find_similar_bookmarks` tool in the agent and the graph traversal from those edges.

**Only edges with similarity > 0.5 are written.** Low-quality connections are discarded.
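
The neighbor filtering can be sketched as a pure function. It assumes the vector store returns cosine distances sorted from nearest to farthest, so that similarity is `1 - distance`; the function name and the rounding to four decimals are illustrative choices.

```python
def similar_edges(url, neighbor_ids, distances, threshold=0.5, top_k=5):
    """Turn one bookmark's nearest-neighbor result into (source, target, score) edges.

    Assumes cosine distances in ascending order, where similarity = 1 - distance.
    """
    edges = []
    for other, dist in zip(neighbor_ids, distances):
        if other == url:
            continue  # the bookmark itself usually comes back as its own nearest hit
        score = round(1.0 - dist, 4)
        if score > threshold:
            edges.append((url, other, score))
    return edges[:top_k]


print(similar_edges("a", ["a", "b", "c", "d"], [0.0, 0.06, 0.5, 0.7]))
```

Because the bookmark itself tends to appear in its own results, querying for one more neighbor than needed and dropping the self-match is a common way to still end up with five candidates.
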

---

## Re-running the Pipeline

The pipeline is safe to re-run at any time:

- **ChromaDB:** skips already-ingested URLs automatically
- **Neo4j:** uses `MERGE`, so no duplicates are created
- **SIMILAR_TO:** edges are overwritten (not duplicated) via `MERGE`

To add new bookmarks after the first run:

1. Update your source files (fresh Raindrop pull, new LinkedIn export, etc.)
2. Run `python scripts/ingest.py`; only new items get embedded and stored

---

## Checking What's Ingested

```bash
# Quick stats
python scripts/search.py --stats

# Search to verify
python scripts/search.py "RAG tools"

# Neo4j: open the browser UI at http://localhost:7474
# and run: MATCH (b:Bookmark) RETURN count(b)
```