# Ingest Pipeline

The ingest pipeline is the heart of OpenMark. It merges all bookmark sources, embeds every item, and writes the results to both ChromaDB (vectors) and Neo4j (knowledge graph).

---

## Command

```bash
python scripts/ingest.py [options]
```

| Flag | Default | Description |
|------|---------|-------------|
| `--provider local` | from `.env` | Use local pplx-embed models |
| `--provider azure` | from `.env` | Use Azure AI Foundry embeddings |
| `--fresh-raindrop` | off | Also pull live from Raindrop API during merge |
| `--skip-similar` | off | Skip SIMILAR_TO edge computation (saves ~30 min) |

---

## Pipeline Steps

### Step 1 — Merge

Loads and deduplicates all sources:
- `CATEGORIZED.json` — pre-categorized bookmarks from Edge + Raindrop + daily.dev
- `linkedin_saved.json` — LinkedIn saved posts
- `youtube_MASTER.json` — liked videos, watch later, playlists (not subscriptions)

Deduplication is URL-based (case-insensitive, trailing slash stripped). If the same URL appears in multiple sources, the first occurrence wins.
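A minimal sketch of that normalization and first-wins deduplication (function names are illustrative, not the actual `ingest.py` API):

```python
def normalize_url(url: str) -> str:
    """Case-insensitive dedup key with the trailing slash stripped."""
    return url.strip().lower().rstrip("/")

def merge_sources(*sources: list) -> list:
    """First occurrence wins: a URL already seen is never overwritten."""
    seen: dict = {}
    for items in sources:
        for item in items:
            # setdefault only stores the item if the key is new
            seen.setdefault(normalize_url(item["url"]), item)
    return list(seen.values())
```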

Each item gets a `doc_text` field built for embedding:
```
{title} | {category} | {tag1 tag2 tag3} | {content/excerpt/channel}
```
This rich text is what gets embedded — not just the title.
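A sketch of how such a `doc_text` could be assembled; the exact field names beyond those shown above are assumptions:

```python
def build_doc_text(item: dict) -> str:
    """Join title, category, tags, and the best available body text with ' | '."""
    tags = " ".join(item.get("tags", []))
    # Fall back through content -> excerpt -> channel, whichever exists
    body = item.get("content") or item.get("excerpt") or item.get("channel") or ""
    parts = [item.get("title", ""), item.get("category", ""), tags, body]
    return " | ".join(p for p in parts if p)
```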

**Output:** ~8,000 normalized items in memory.

---

### Step 2 — Embedding

Loads the embedding provider specified by `EMBEDDING_PROVIDER` in `.env` (or `--provider` flag).

**Local (pplx-embed):**
- Query model: `perplexity-ai/pplx-embed-v1-0.6b` — used for user search queries
- Document model: `perplexity-ai/pplx-embed-context-v1-0.6b` — used for bookmark documents
- Output dimension: 1024
- Downloaded once to HuggingFace cache (~1.2 GB total), free on every subsequent run
- **Known compatibility issue:** pplx-embed requires `sentence-transformers==3.3.1` and two runtime patches (applied automatically in `local.py`). See [troubleshooting.md](troubleshooting.md) for details.

**Azure:**
- Uses `text-embedding-ada-002` (or configured `AZURE_DEPLOYMENT_EMBED`)
- Output dimension: 1536
- Cost: ~€0.30 for 8,000 items (as of 2026)
- Batched in groups of 100 with progress logging
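Whichever provider is selected, documents are embedded in fixed-size batches. A hedged sketch of that loop, where `embed_fn` stands in for either provider's batch-embedding call:

```python
def embed_all(embed_fn, docs: list, batch_size: int = 100) -> list:
    """Embed docs in batches, logging progress after each model/API call."""
    vectors: list = []
    for i in range(0, len(docs), batch_size):
        vectors.extend(embed_fn(docs[i:i + batch_size]))
        print(f"embedded {len(vectors)}/{len(docs)}")
    return vectors
```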

---

### Step 3 — ChromaDB Ingest

Embeds all documents in batches of 100 and stores in ChromaDB.

- Skips items already in ChromaDB (resumable — safe to re-run)
- Stores: URL (as ID), embedding vector, title, category, source, score, tags
- Uses cosine similarity space (`hnsw:space: cosine`)
- Database written to disk at `CHROMA_PATH` (default: `OpenMark/data/chroma_db/`)
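A sketch of the resumable write, assuming the standard `chromadb` collection API (`get`/`add`). The collection itself would come from something like `chromadb.PersistentClient(path=CHROMA_PATH).get_or_create_collection("bookmarks", metadata={"hnsw:space": "cosine"})`; the metadata keys below mirror the list above but are illustrative:

```python
def ingest_chroma_batch(col, items: list, vectors: list) -> int:
    """Resumable ChromaDB write: skip URLs already stored, add the rest.

    `col` is a Chroma collection created with metadata={"hnsw:space": "cosine"};
    the URL doubles as the document ID.
    """
    existing = set(col.get(ids=[it["url"] for it in items])["ids"])
    new = [(it, v) for it, v in zip(items, vectors) if it["url"] not in existing]
    if new:
        col.add(
            ids=[it["url"] for it, _ in new],
            embeddings=[v for _, v in new],
            metadatas=[{"title": it.get("title", ""),
                        "category": it.get("category", ""),
                        "source": it.get("source", "")} for it, _ in new],
        )
    return len(new)  # number of newly ingested items
```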

**Timing:**
| Provider | 8K items | Notes |
|----------|----------|-------|
| Local pplx-embed (CPU) | ~20 min | Falls back to CPU when no GPU is detected |
| Local pplx-embed (GPU) | ~3 min | NVIDIA GPU with CUDA |
| Azure AI Foundry | ~5 min | Network bound |

---

### Step 4 — Neo4j Ingest

Creates nodes and relationships in batches of 200.

**Nodes created:**
- `Bookmark` — url, title, score
- `Category` — name
- `Tag` — name
- `Source` — name (raindrop, linkedin, youtube_liked, edge, dailydev, etc.)
- `Domain` — extracted from URL (e.g. `github.com`, `medium.com`)

**Relationships created:**
- `(Bookmark)-[:IN_CATEGORY]->(Category)`
- `(Bookmark)-[:TAGGED]->(Tag)`
- `(Bookmark)-[:FROM_SOURCE]->(Source)`
- `(Bookmark)-[:FROM_DOMAIN]->(Domain)`
- `(Tag)-[:CO_OCCURS_WITH {count}]-(Tag)` — built after all nodes are written

**Timing:** ~3-5 minutes for 8K items.

**Idempotent:** Uses `MERGE` everywhere — safe to re-run, won't create duplicates.
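A hedged sketch of one such idempotent batch. The Cypher covers only the `Bookmark`/`Category`/`Source` subset of the lists above, and `run` stands in for a bound `neo4j` session method:

```python
# Illustrative Cypher; property and label names follow the lists above.
# MERGE matches an existing node/relationship or creates it, so re-runs
# never produce duplicates.
MERGE_BATCH = """
UNWIND $rows AS row
MERGE (b:Bookmark {url: row.url})
  SET b.title = row.title, b.score = row.score
MERGE (c:Category {name: row.category})
MERGE (b)-[:IN_CATEGORY]->(c)
MERGE (s:Source {name: row.source})
MERGE (b)-[:FROM_SOURCE]->(s)
"""

def ingest_neo4j(run, items: list, batch_size: int = 200) -> int:
    """Send items in batches of 200; returns the number of batches executed."""
    batches = 0
    for i in range(0, len(items), batch_size):
        run(MERGE_BATCH, rows=items[i:i + batch_size])
        batches += 1
    return batches
```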

---

### Step 5 — SIMILAR_TO Edges

This is the most powerful and most time-consuming step.

For each of the 8K bookmarks, OpenMark queries ChromaDB for its top-5 nearest semantic neighbors and writes those as `SIMILAR_TO` edges in Neo4j with a similarity score.

```
(Bookmark {url: "...langchain-docs..."})-[:SIMILAR_TO {score: 0.94}]->(Bookmark {url: "...langgraph-tutorial..."})
```

These edges encode **semantic connections you never manually created**. The knowledge graph becomes a web of meaning, not just a web of tags.

**Timing:** ~25-40 minutes on CPU for 8K items. This is the longest step.

**Skip it if you're in a hurry:**
```bash
python scripts/ingest.py --skip-similar
```
Everything else works without SIMILAR_TO edges. You only lose the `find_similar_bookmarks` tool in the agent and the graph traversal from those edges.

**Only edges with similarity > 0.5 are written.** Low-quality connections are discarded.
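A sketch of the neighbor query and threshold. It assumes Chroma's cosine space, which returns *distance*, so similarity = 1 - distance; `col.get` and `col.query` follow the standard `chromadb` collection API, and the rest is illustrative:

```python
def similar_edges(col, urls: list, k: int = 5, min_score: float = 0.5) -> list:
    """For each URL, fetch its top-k semantic neighbors and keep strong edges."""
    edges = []
    for url in urls:
        vec = col.get(ids=[url], include=["embeddings"])["embeddings"][0]
        # Ask for k+1 results because the bookmark's own vector comes back too
        res = col.query(query_embeddings=[vec], n_results=k + 1)
        for nbr, dist in zip(res["ids"][0], res["distances"][0]):
            score = 1.0 - dist  # cosine distance -> similarity
            if nbr != url and score > min_score:
                edges.append((url, nbr, score))
    return edges
```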

---

## Re-running the Pipeline

The pipeline is safe to re-run at any time:

- **ChromaDB:** skips already-ingested URLs automatically
- **Neo4j:** uses `MERGE` — no duplicates created
- **SIMILAR_TO:** edges are overwritten (not duplicated) via `MERGE`

To add new bookmarks after the first run:
1. Update your source files (fresh Raindrop pull, new LinkedIn export, etc.)
2. Run `python scripts/ingest.py` — only new items get embedded and stored

---

## Checking What's Ingested

```bash
# Quick stats
python scripts/search.py --stats

# Search to verify
python scripts/search.py "RAG tools"

# Neo4j β€” open browser
# http://localhost:7474
# Run: MATCH (b:Bookmark) RETURN count(b)
```