Ingest Pipeline
The ingest pipeline is the heart of OpenMark. It merges all your data, embeds everything, and writes to both ChromaDB and Neo4j.
Command
python scripts/ingest.py [options]
| Flag | Default | Description |
|---|---|---|
| `--provider local` | from `.env` | Use local pplx-embed models |
| `--provider azure` | from `.env` | Use Azure AI Foundry embeddings |
| `--fresh-raindrop` | off | Also pull live from the Raindrop API during merge |
| `--skip-similar` | off | Skip SIMILAR_TO edge computation (saves ~30 min) |
Pipeline Steps
Step 1 – Merge
Loads and deduplicates all sources:
- `CATEGORIZED.json` – pre-categorized bookmarks from Edge + Raindrop + daily.dev
- `linkedin_saved.json` – LinkedIn saved posts
- `youtube_MASTER.json` – liked videos, watch later, playlists (not subscriptions)
Deduplication is URL-based (case-insensitive, trailing slash stripped). If the same URL appears in multiple sources, the first occurrence wins.
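The merge-and-dedup rule above can be sketched in a few lines. This is an illustrative stand-in, not the actual `ingest.py` internals; the function names and item shape are assumptions.

```python
# Sketch of the URL-based dedup: case-insensitive keys, trailing slash
# stripped, first occurrence wins across sources (illustrative only).

def normalize_url(url: str) -> str:
    """Comparison key: lowercase with the trailing slash stripped."""
    return url.lower().rstrip("/")

def merge_sources(*sources: list[dict]) -> list[dict]:
    """Concatenate sources in order; keep the first item per URL key."""
    seen: set[str] = set()
    merged: list[dict] = []
    for source in sources:
        for item in source:
            key = normalize_url(item["url"])
            if key in seen:
                continue  # later duplicates are dropped
            seen.add(key)
            merged.append(item)
    return merged
```

Because sources are processed in order, whichever file is loaded first supplies the canonical record for a duplicated URL.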
Each item gets a doc_text field built for embedding:
{title} | {category} | {tag1 tag2 tag3} | {content/excerpt/channel}
This rich text is what gets embedded – not just the title.
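A minimal builder for the `doc_text` template above might look like this. The field names (`tags`, `content`, `excerpt`, `channel`) follow the template shown, but the exact item schema is an assumption.

```python
# Hypothetical doc_text builder following the template
# {title} | {category} | {tag1 tag2 ...} | {content/excerpt/channel}

def build_doc_text(item: dict) -> str:
    tags = " ".join(item.get("tags", []))
    # Fall back through the body fields in the order the template lists them.
    body = item.get("content") or item.get("excerpt") or item.get("channel") or ""
    parts = [item.get("title", ""), item.get("category", ""), tags, body]
    return " | ".join(p for p in parts if p)  # skip empty segments
```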
Output: ~8,000 normalized items in memory.
Step 2 – Embedding
Loads the embedding provider specified by EMBEDDING_PROVIDER in .env (or --provider flag).
Local (pplx-embed):
- Query model: `perplexity-ai/pplx-embed-v1-0.6b` – used for user search queries
- Document model: `perplexity-ai/pplx-embed-context-v1-0.6b` – used for bookmark documents
- Output dimension: 1024
- Downloaded once to the HuggingFace cache (~1.2 GB total), free on every subsequent run
- Known compatibility issue: pplx-embed requires `sentence-transformers==3.3.1` and two runtime patches (applied automatically in `local.py`). See troubleshooting.md for details.
Azure:
- Uses `text-embedding-ada-002` (or the configured `AZURE_DEPLOYMENT_EMBED`)
- Output dimension: 1536
- Cost: ~€0.30 for 8,000 items (as of 2026)
- Batched in groups of 100 with progress logging
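Both providers process items in fixed-size batches. A generic chunking helper along these lines is all the batching amounts to (a sketch; the real pipeline's helper may differ):

```python
from typing import Iterator

def batched(items: list, size: int = 100) -> Iterator[list]:
    """Yield consecutive batches of `size`; the final batch may be shorter."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each yielded batch would then be passed to the embedding call and logged, which is what produces the per-batch progress output.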
Step 3 – ChromaDB Ingest
Embeds all documents in batches of 100 and stores in ChromaDB.
- Skips items already in ChromaDB (resumable – safe to re-run)
- Stores: URL (as ID), embedding vector, title, category, source, score, tags
- Uses cosine similarity space (`hnsw:space: cosine`)
- Database written to disk at `CHROMA_PATH` (default: `OpenMark/data/chroma_db/`)
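The resumability comes from filtering out URLs that are already stored before embedding. In pure-Python terms (with `existing_ids` standing in for the set of IDs already in the ChromaDB collection; this is a sketch, not the real ingest code):

```python
# Resumable-ingest sketch: only items whose URL (the ChromaDB ID) is not
# already stored get embedded on a re-run. `existing_ids` is assumed to
# come from a prior lookup against the collection.

def pending_items(items: list[dict], existing_ids: set[str]) -> list[dict]:
    """Return the subset of items that still need embedding and storage."""
    return [item for item in items if item["url"] not in existing_ids]
```

Because only the pending subset is embedded, re-running after a crash or an incremental update costs time proportional to the new items, not the whole collection.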
Timing:
| Provider | 8K items | Notes |
|---|---|---|
| Local pplx-embed (CPU) | ~20 min | Falls back to CPU inference when no GPU is detected |
| Local pplx-embed (GPU) | ~3 min | NVIDIA GPU with CUDA |
| Azure AI Foundry | ~5 min | Network bound |
Step 4 – Neo4j Ingest
Creates nodes and relationships in batches of 200.
Nodes created:
- `Bookmark` – url, title, score
- `Category` – name
- `Tag` – name
- `Source` – name (raindrop, linkedin, youtube_liked, edge, dailydev, etc.)
- `Domain` – extracted from the URL (e.g. `github.com`, `medium.com`)
Relationships created:
- `(Bookmark)-[:IN_CATEGORY]->(Category)`
- `(Bookmark)-[:TAGGED]->(Tag)`
- `(Bookmark)-[:FROM_SOURCE]->(Source)`
- `(Bookmark)-[:FROM_DOMAIN]->(Domain)`
- `(Tag)-[:CO_OCCURS_WITH {count}]-(Tag)` – built after all nodes are written
Timing: ~3-5 minutes for 8K items.
Idempotent: uses `MERGE` everywhere – safe to re-run, won't create duplicates.
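The idempotency comes from Cypher's `MERGE`, which matches an existing node or relationship instead of creating a second copy. A hedged illustration of what one bookmark's upsert query might look like (parameter names and exact properties are assumptions, not the pipeline's actual Cypher):

```python
# Illustrative Cypher for upserting one bookmark and its category.
# MERGE matches-or-creates, so re-running never duplicates nodes/edges.
UPSERT_BOOKMARK = """
MERGE (b:Bookmark {url: $url})
  SET b.title = $title, b.score = $score
MERGE (c:Category {name: $category})
MERGE (b)-[:IN_CATEGORY]->(c)
"""
```

Running this query twice with the same `$url` leaves exactly one `Bookmark` node and one `IN_CATEGORY` edge, which is why the batch ingest is safe to repeat.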
Step 5 – SIMILAR_TO Edges
This is the most powerful and most time-consuming step.
For each of the 8K bookmarks, OpenMark queries ChromaDB for its top-5 nearest semantic neighbors and writes those as SIMILAR_TO edges in Neo4j with a similarity score.
(Bookmark {url: "...langchain-docs..."})-[:SIMILAR_TO {score: 0.94}]->(Bookmark {url: "...langgraph-tutorial..."})
These edges encode semantic connections you never manually created. The knowledge graph becomes a web of meaning, not just a web of tags.
Timing: ~25-40 minutes on CPU for 8K items. This is the longest step.
Skip it if you're in a hurry:
python scripts/ingest.py --skip-similar
Everything else works without SIMILAR_TO edges. You only lose the find_similar_bookmarks tool in the agent and the graph traversal from those edges.
Only edges with similarity > 0.5 are written. Low-quality connections are discarded.
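Putting the two rules together (top-5 neighbors per bookmark, keep only edges with similarity > 0.5), the edge selection can be sketched as follows. `neighbors` stands in for a ChromaDB query result already converted to `(url, similarity)` pairs; this is an assumption about shape, not the real code.

```python
# Sketch of step 5's edge selection: rank candidate neighbors by
# similarity, take the top k, drop self-matches and weak connections.

def similar_to_edges(url: str,
                     neighbors: list[tuple[str, float]],
                     k: int = 5,
                     threshold: float = 0.5) -> list[tuple[str, str, float]]:
    """Return (source_url, neighbor_url, score) edges worth writing to Neo4j."""
    ranked = sorted(neighbors, key=lambda n: n[1], reverse=True)
    return [(url, other, score)
            for other, score in ranked[:k]
            if other != url and score > threshold]
```

Each returned triple would become one `SIMILAR_TO` relationship with its `score` property; everything below the threshold is simply never written.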
Re-running the Pipeline
The pipeline is safe to re-run at any time:
- ChromaDB: skips already-ingested URLs automatically
- Neo4j: uses `MERGE` – no duplicates created
- SIMILAR_TO: edges are overwritten (not duplicated) via `MERGE`
To add new bookmarks after the first run:
- Update your source files (fresh Raindrop pull, new LinkedIn export, etc.)
- Run `python scripts/ingest.py` – only new items get embedded and stored
Checking What's Ingested
# Quick stats
python scripts/search.py --stats
# Search to verify
python scripts/search.py "RAG tools"
# Neo4j – open the browser
# http://localhost:7474
# Run: MATCH (b:Bookmark) RETURN count(b)