# Ingest Pipeline
The ingest pipeline is the heart of OpenMark. It merges all your data, embeds everything, and writes to both ChromaDB and Neo4j.
---
## Command
```bash
python scripts/ingest.py [options]
```
| Flag | Default | Description |
|------|---------|-------------|
| `--provider {local,azure}` | from `.env` | Select the embedding provider: local pplx-embed models or Azure AI Foundry |
| `--fresh-raindrop` | off | Also pull live from Raindrop API during merge |
| `--skip-similar` | off | Skip SIMILAR_TO edge computation (saves ~30 min) |
---
## Pipeline Steps
### Step 1 – Merge
Loads and deduplicates all sources:
- `CATEGORIZED.json` – pre-categorized bookmarks from Edge + Raindrop + daily.dev
- `linkedin_saved.json` – LinkedIn saved posts
- `youtube_MASTER.json` – liked videos, watch later, playlists (not subscriptions)
Deduplication is URL-based (case-insensitive, trailing slash stripped). If the same URL appears in multiple sources, the first occurrence wins.
Each item gets a `doc_text` field built for embedding:
```
{title} | {category} | {tag1 tag2 tag3} | {content/excerpt/channel}
```
This rich text is what gets embedded – not just the title.
**Output:** ~8,000 normalized items in memory.
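The dedup key and `doc_text` construction described above can be sketched roughly as follows. This is illustrative, not the actual `ingest.py` internals – the function names and item fields (`tags`, `excerpt`) are assumptions:

```python
def normalize_url(url: str) -> str:
    """Dedup key: case-insensitive, trailing slash stripped."""
    return url.lower().rstrip("/")

def merge_sources(*sources: list[dict]) -> list[dict]:
    """Merge all sources; the first occurrence of a URL wins."""
    seen: set[str] = set()
    merged: list[dict] = []
    for source in sources:
        for item in source:
            key = normalize_url(item["url"])
            if key in seen:
                continue
            seen.add(key)
            # Rich text for embedding: title | category | tags | excerpt
            item["doc_text"] = " | ".join(filter(None, [
                item.get("title", ""),
                item.get("category", ""),
                " ".join(item.get("tags", [])),
                item.get("excerpt", ""),
            ]))
            merged.append(item)
    return merged
```

`filter(None, ...)` drops empty segments, so an item without tags or an excerpt still gets a clean `doc_text`.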
---
### Step 2 – Embedding
Loads the embedding provider specified by `EMBEDDING_PROVIDER` in `.env` (or by the `--provider` flag).
**Local (pplx-embed):**
- Query model: `perplexity-ai/pplx-embed-v1-0.6b` – used for user search queries
- Document model: `perplexity-ai/pplx-embed-context-v1-0.6b` – used for bookmark documents
- Output dimension: 1024
- Downloaded once to the HuggingFace cache (~1.2 GB total); subsequent runs load from cache with no re-download
- **Known compatibility issue:** pplx-embed requires `sentence-transformers==3.3.1` and two runtime patches (applied automatically in `local.py`). See [troubleshooting.md](troubleshooting.md) for details.
**Azure:**
- Uses `text-embedding-ada-002` (or configured `AZURE_DEPLOYMENT_EMBED`)
- Output dimension: 1536
- Cost: ~€0.30 for 8,000 items (as of 2026)
- Batched in groups of 100 with progress logging
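The batching with progress logging can be sketched like this; `embed_batch` stands in for whichever provider call is configured (local model or Azure), and the batch size matches the one used above:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingest")

BATCH_SIZE = 100  # both providers are called in groups of 100

def embed_all(texts: list[str], embed_batch) -> list[list[float]]:
    """Embed `texts` in fixed-size batches, logging progress as we go."""
    vectors: list[list[float]] = []
    for start in range(0, len(texts), BATCH_SIZE):
        batch = texts[start:start + BATCH_SIZE]
        vectors.extend(embed_batch(batch))
        log.info("embedded %d/%d", min(start + BATCH_SIZE, len(texts)), len(texts))
    return vectors
```

Passing the provider call in as a function keeps the batching logic identical for local and Azure runs.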
---
### Step 3 – ChromaDB Ingest
Embeds all documents in batches of 100 and stores in ChromaDB.
- Skips items already in ChromaDB (resumable – safe to re-run)
- Stores: URL (as ID), embedding vector, title, category, source, score, tags
- Uses cosine similarity space (`hnsw:space` set to `cosine`)
- Database written to disk at `CHROMA_PATH` (default: `OpenMark/data/chroma_db/`)
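The resumable skip amounts to a set difference between candidate IDs and what the collection already holds. A minimal sketch, where `existing_ids` and `store` are stand-ins for the corresponding ChromaDB calls (roughly `collection.get()["ids"]` and `collection.add(...)`):

```python
def resumable_ingest(items: list[dict], existing_ids, store, batch_size: int = 100) -> int:
    """Embed and store only items whose URL (the ChromaDB ID) is not yet present."""
    known = set(existing_ids())
    pending = [item for item in items if item["url"] not in known]
    for start in range(0, len(pending), batch_size):
        store(pending[start:start + batch_size])  # embed + add this batch
    return len(pending)
```

Because already-ingested URLs are filtered out before any embedding happens, re-running after an interruption only pays for the items that are still missing.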
**Timing:**
| Provider | 8K items | Notes |
|----------|----------|-------|
| Local pplx-embed (CPU) | ~20 min | Falls back to CPU inference when no GPU is detected |
| Local pplx-embed (GPU) | ~3 min | NVIDIA GPU with CUDA |
| Azure AI Foundry | ~5 min | Network-bound |
---
### Step 4 – Neo4j Ingest
Creates nodes and relationships in batches of 200.
**Nodes created:**
- `Bookmark` – url, title, score
- `Category` – name
- `Tag` – name
- `Source` – name (raindrop, linkedin, youtube_liked, edge, dailydev, etc.)
- `Domain` – extracted from URL (e.g. `github.com`, `medium.com`)
**Relationships created:**
- `(Bookmark)-[:IN_CATEGORY]->(Category)`
- `(Bookmark)-[:TAGGED]->(Tag)`
- `(Bookmark)-[:FROM_SOURCE]->(Source)`
- `(Bookmark)-[:FROM_DOMAIN]->(Domain)`
- `(Tag)-[:CO_OCCURS_WITH {count}]-(Tag)` – built after all nodes are written
**Timing:** ~3-5 minutes for 8K items.
**Idempotent:** Uses `MERGE` everywhere – safe to re-run, won't create duplicates.
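The idempotent batched write pattern is typically one `UNWIND` + `MERGE` Cypher statement per batch of 200. A sketch – the query text and helper are illustrative, not the exact statements `ingest.py` issues:

```python
# One parameterized statement handles a whole batch; MERGE makes it idempotent.
BOOKMARK_CYPHER = """
UNWIND $rows AS row
MERGE (b:Bookmark {url: row.url})
  SET b.title = row.title, b.score = row.score
MERGE (c:Category {name: row.category})
MERGE (b)-[:IN_CATEGORY]->(c)
"""

def batches(rows: list[dict], size: int = 200):
    """Yield fixed-size chunks; each chunk becomes one Cypher call."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

# With the official neo4j driver this would run roughly as:
#   with driver.session() as session:
#       for chunk in batches(rows):
#           session.run(BOOKMARK_CYPHER, rows=chunk)
```

`MERGE` matches an existing node on the given key before creating one, which is why re-running the whole pipeline never duplicates bookmarks, categories, or edges.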
---
### Step 5 – SIMILAR_TO Edges
This is the most powerful and most time-consuming step.
For each of the 8K bookmarks, OpenMark queries ChromaDB for its top-5 nearest semantic neighbors and writes those as `SIMILAR_TO` edges in Neo4j with a similarity score.
```
(Bookmark {url: "...langchain-docs..."})-[:SIMILAR_TO {score: 0.94}]->(Bookmark {url: "...langgraph-tutorial..."})
```
These edges encode **semantic connections you never manually created**. The knowledge graph becomes a web of meaning, not just a web of tags.
**Timing:** ~25-40 minutes on CPU for 8K items. This is the longest step.
**Skip it if you're in a hurry:**
```bash
python scripts/ingest.py --skip-similar
```
Everything else works without SIMILAR_TO edges. You only lose the `find_similar_bookmarks` tool in the agent and the graph traversal from those edges.
**Only edges with similarity > 0.5 are written.** Low-quality connections are discarded.
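The neighbor computation can be sketched like this; `query_neighbors` stands in for a ChromaDB `collection.query(...)` call, and the top-5 fan-out and 0.5 cutoff match the values above. With cosine space, similarity is taken as `1 - distance`:

```python
TOP_K = 5
MIN_SIMILARITY = 0.5

def similar_edges(urls, query_neighbors):
    """Yield (source, target, score) tuples worth writing as SIMILAR_TO edges."""
    for url in urls:
        # query_neighbors returns [(neighbor_url, cosine_distance), ...]
        for neighbor, distance in query_neighbors(url, TOP_K):
            score = 1.0 - distance
            if neighbor != url and score > MIN_SIMILARITY:
                yield (url, neighbor, round(score, 2))
```

The self-match (every document is its own nearest neighbor at distance 0) is filtered out along with the low-quality connections below the cutoff.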
---
## Re-running the Pipeline
The pipeline is safe to re-run at any time:
- **ChromaDB:** skips already-ingested URLs automatically
- **Neo4j:** uses `MERGE` – no duplicates created
- **SIMILAR_TO:** edges are overwritten (not duplicated) via `MERGE`
To add new bookmarks after the first run:
1. Update your source files (fresh Raindrop pull, new LinkedIn export, etc.)
2. Run `python scripts/ingest.py` – only new items get embedded and stored
---
## Checking What's Ingested
```bash
# Quick stats
python scripts/search.py --stats
# Search to verify
python scripts/search.py "RAG tools"
# Neo4j – open browser
# http://localhost:7474
# Run: MATCH (b:Bookmark) RETURN count(b)
```