
# Ingest Pipeline

The ingest pipeline is the heart of OpenMark. It merges all your data, embeds everything, and writes to both ChromaDB and Neo4j.


## Command

```
python scripts/ingest.py [options]
```

| Flag | Default | Description |
|------|---------|-------------|
| `--provider local` | from `.env` | Use local pplx-embed models |
| `--provider azure` | from `.env` | Use Azure AI Foundry embeddings |
| `--fresh-raindrop` | off | Also pull live from the Raindrop API during merge |
| `--skip-similar` | off | Skip SIMILAR_TO edge computation (saves ~30 min) |

## Pipeline Steps

### Step 1 — Merge

Loads and deduplicates all sources:

- `CATEGORIZED.json` — pre-categorized bookmarks from Edge + Raindrop + daily.dev
- `linkedin_saved.json` — LinkedIn saved posts
- `youtube_MASTER.json` — liked videos, watch later, playlists (not subscriptions)

Deduplication is URL-based (case-insensitive, trailing slash stripped). If the same URL appears in multiple sources, the first occurrence wins.

Each item gets a `doc_text` field built for embedding:

```
{title} | {category} | {tag1 tag2 tag3} | {content/excerpt/channel}
```

This rich text is what gets embedded — not just the title.

Output: ~8,000 normalized items in memory.
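The dedup key and `doc_text` assembly described above can be sketched roughly as follows. The helper names (`normalize_url`, `build_doc_text`, `merge`) are illustrative, not the actual function names in `ingest.py`:

```python
def normalize_url(url: str) -> str:
    """Dedup key: case-insensitive, trailing slash stripped."""
    return url.strip().lower().rstrip("/")


def build_doc_text(item: dict) -> str:
    """Assemble the rich text that gets embedded (not just the title)."""
    tags = " ".join(item.get("tags", []))
    body = item.get("content") or item.get("excerpt") or item.get("channel") or ""
    parts = [item.get("title", ""), item.get("category", ""), tags, body]
    return " | ".join(p for p in parts if p)


def merge(*sources: list) -> list:
    """First occurrence wins when the same URL appears in multiple sources."""
    seen, merged = set(), []
    for source in sources:
        for item in source:
            key = normalize_url(item["url"])
            if key in seen:
                continue
            seen.add(key)
            item["doc_text"] = build_doc_text(item)
            merged.append(item)
    return merged
```

Because the key is normalized, `https://Example.com/A/` and `https://example.com/a` count as the same bookmark, and whichever source is loaded first supplies the kept record.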


### Step 2 — Embedding

Loads the embedding provider specified by `EMBEDDING_PROVIDER` in `.env` (or the `--provider` flag).

**Local (pplx-embed):**

- Query model: `perplexity-ai/pplx-embed-v1-0.6b` — used for user search queries
- Document model: `perplexity-ai/pplx-embed-context-v1-0.6b` — used for bookmark documents
- Output dimension: 1024
- Downloaded once to the HuggingFace cache (~1.2 GB total); free on every subsequent run
- Known compatibility issue: pplx-embed requires `sentence-transformers==3.3.1` and two runtime patches (applied automatically in `local.py`). See `troubleshooting.md` for details.

**Azure:**

- Uses `text-embedding-ada-002` (or the configured `AZURE_DEPLOYMENT_EMBED`)
- Output dimension: 1536
- Cost: ~€0.30 for 8,000 items (as of 2026)
- Batched in groups of 100 with progress logging

### Step 3 — ChromaDB Ingest

Embeds all documents in batches of 100 and stores them in ChromaDB.

- Skips items already in ChromaDB (resumable — safe to re-run)
- Stores: URL (as ID), embedding vector, title, category, source, score, tags
- Uses cosine similarity space (`hnsw:space: cosine`)
- Database written to disk at `CHROMA_PATH` (default: `OpenMark/data/chroma_db/`)

Timing:

| Provider | 8K items | Notes |
|----------|----------|-------|
| Local pplx-embed (CPU) | ~20 min | No GPU detected = CPU inference |
| Local pplx-embed (GPU) | ~3 min | NVIDIA GPU with CUDA |
| Azure AI Foundry | ~5 min | Network bound |

### Step 4 — Neo4j Ingest

Creates nodes and relationships in batches of 200.

Nodes created:

- `Bookmark` — url, title, score
- `Category` — name
- `Tag` — name
- `Source` — name (raindrop, linkedin, youtube_liked, edge, dailydev, etc.)
- `Domain` — extracted from the URL (e.g. `github.com`, `medium.com`)

Relationships created:

- `(Bookmark)-[:IN_CATEGORY]->(Category)`
- `(Bookmark)-[:TAGGED]->(Tag)`
- `(Bookmark)-[:FROM_SOURCE]->(Source)`
- `(Bookmark)-[:FROM_DOMAIN]->(Domain)`
- `(Tag)-[:CO_OCCURS_WITH {count}]-(Tag)` — built after all nodes are written

Timing: ~3-5 minutes for 8K items.

Idempotent: uses `MERGE` everywhere — safe to re-run, won't create duplicates.
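An idempotent batch write in Neo4j typically uses the `UNWIND $rows ... MERGE` pattern: one transaction per 200-row batch, with `MERGE` matching on the node key so repeats update rather than duplicate. A sketch of what one bookmark batch might look like — the Cypher and helper below are illustrative, not copied from the script:

```python
# Illustrative Cypher for one bookmark batch; MERGE matches on the
# key property, so re-running updates nodes instead of duplicating them.
BOOKMARK_BATCH_CYPHER = """
UNWIND $rows AS row
MERGE (b:Bookmark {url: row.url})
  SET b.title = row.title, b.score = row.score
MERGE (c:Category {name: row.category})
MERGE (b)-[:IN_CATEGORY]->(c)
"""


def row_batches(rows: list, size: int = 200) -> list:
    """Split rows into the fixed-size batches written per transaction."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


# With the official neo4j driver this would run roughly as:
#   with driver.session() as session:
#       for batch in row_batches(rows):
#           session.run(BOOKMARK_BATCH_CYPHER, rows=batch)
```

8,000 bookmarks at 200 per batch means 40 transactions, which keeps each write small while avoiding per-item round trips.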


### Step 5 — SIMILAR_TO Edges

This is the most powerful and most time-consuming step.

For each of the 8K bookmarks, OpenMark queries ChromaDB for its top-5 nearest semantic neighbors and writes those as SIMILAR_TO edges in Neo4j with a similarity score.

```
(Bookmark {url: "...langchain-docs..."})-[:SIMILAR_TO {score: 0.94}]->(Bookmark {url: "...langgraph-tutorial..."})
```

These edges encode semantic connections you never manually created. The knowledge graph becomes a web of meaning, not just a web of tags.

Timing: ~25-40 minutes on CPU for 8K items. This is the longest step.

Skip it if you're in a hurry:

```
python scripts/ingest.py --skip-similar
```

Everything else works without SIMILAR_TO edges. You only lose the find_similar_bookmarks tool in the agent and the graph traversal from those edges.

Only edges with similarity > 0.5 are written. Low-quality connections are discarded.
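Assuming ChromaDB's cosine space returns *distances* (so similarity is `1 - distance`), turning one bookmark's top-k neighbor query into edge rows can be sketched as below. The function name and row shape are hypothetical; the thresholding matches the "> 0.5 only" rule above:

```python
def similar_edges(source_url: str, neighbor_ids: list,
                  distances: list, threshold: float = 0.5) -> list:
    """Convert one top-k neighbor query into SIMILAR_TO edge rows.

    ChromaDB's cosine space reports distances, so score = 1 - distance.
    Self-matches and scores at or below the threshold are discarded."""
    edges = []
    for url, dist in zip(neighbor_ids, distances):
        score = 1.0 - dist
        if url == source_url or score <= threshold:
            continue  # drop the bookmark itself and low-quality connections
        edges.append({"src": source_url, "dst": url, "score": round(score, 4)})
    return edges
```

Each bookmark's nearest neighbor is always itself at distance 0, which is why the self-match filter is needed before writing edges.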


## Re-running the Pipeline

The pipeline is safe to re-run at any time:

- ChromaDB: skips already-ingested URLs automatically
- Neo4j: uses `MERGE` — no duplicates created
- SIMILAR_TO: edges are overwritten (not duplicated) via `MERGE`

To add new bookmarks after the first run:

1. Update your source files (fresh Raindrop pull, new LinkedIn export, etc.)
2. Run `python scripts/ingest.py` — only new items get embedded and stored

## Checking What's Ingested

```shell
# Quick stats
python scripts/search.py --stats

# Search to verify
python scripts/search.py "RAG tools"

# Neo4j — open the browser at http://localhost:7474
# and run: MATCH (b:Bookmark) RETURN count(b)
```