# Ingest Pipeline

The ingest pipeline is the heart of OpenMark. It merges all your data, embeds everything, and writes to both ChromaDB and Neo4j.


## Command

```bash
python scripts/ingest.py [options]
```

| Flag | Default | Description |
| --- | --- | --- |
| `--provider local` | from `.env` | Use local pplx-embed models |
| `--provider azure` | from `.env` | Use Azure AI Foundry embeddings |
| `--fresh-raindrop` | off | Also pull live from the Raindrop API during merge |
| `--skip-similar` | off | Skip SIMILAR_TO edge computation (saves ~30 min) |

## Pipeline Steps

### Step 1: Merge

Loads and deduplicates all sources:

- `CATEGORIZED.json` – pre-categorized bookmarks from Edge + Raindrop + daily.dev
- `linkedin_saved.json` – LinkedIn saved posts
- `youtube_MASTER.json` – liked videos, watch later, playlists (not subscriptions)

Deduplication is URL-based (case-insensitive, trailing slash stripped). If the same URL appears in multiple sources, the first occurrence wins.

Each item gets a `doc_text` field built for embedding:

```
{title} | {category} | {tag1 tag2 tag3} | {content/excerpt/channel}
```

This rich text is what gets embedded, not just the title.

Output: ~8,000 normalized items in memory.
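The merge rules above (URL-based dedup, first occurrence wins, pipe-delimited `doc_text`) can be sketched roughly like this; the helper names are illustrative, not the actual functions in `scripts/ingest.py`:

```python
def normalize_url(url: str) -> str:
    # Dedup key: case-insensitive, trailing slash stripped
    return url.lower().rstrip("/")

def build_doc_text(item: dict) -> str:
    # {title} | {category} | {tag1 tag2 tag3} | {content/excerpt/channel}
    tags = " ".join(item.get("tags", []))
    body = item.get("content") or item.get("excerpt") or item.get("channel") or ""
    return " | ".join([item.get("title", ""), item.get("category", ""), tags, body])

def merge_sources(sources: list[list[dict]]) -> list[dict]:
    seen: dict[str, dict] = {}
    for source in sources:          # order matters: first occurrence wins
        for item in source:
            key = normalize_url(item["url"])
            if key not in seen:
                item["doc_text"] = build_doc_text(item)
                seen[key] = item
    return list(seen.values())
```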


### Step 2: Embedding

Loads the embedding provider specified by `EMBEDDING_PROVIDER` in `.env` (or by the `--provider` flag).

Local (pplx-embed):

- Query model: `perplexity-ai/pplx-embed-v1-0.6b` – used for user search queries
- Document model: `perplexity-ai/pplx-embed-context-v1-0.6b` – used for bookmark documents
- Output dimension: 1024
- Downloaded once to the HuggingFace cache (~1.2 GB total); free on every subsequent run
- Known compatibility issue: pplx-embed requires `sentence-transformers==3.3.1` and two runtime patches (applied automatically in `local.py`). See troubleshooting.md for details.

Azure:

- Uses `text-embedding-ada-002` (or the configured `AZURE_DEPLOYMENT_EMBED`)
- Output dimension: 1536
- Cost: ~€0.30 for 8,000 items (as of 2026)
- Batched in groups of 100 with progress logging
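Both providers process documents in groups of 100. A minimal sketch of that batching loop, with `embed_batch` as a hypothetical stand-in for whichever provider is configured:

```python
def batched(items: list, size: int = 100):
    # Yield consecutive slices of at most `size` items
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_all(items: list[dict], embed_batch) -> list[list[float]]:
    # embed_batch: callable mapping a list of doc_texts to a list of vectors
    vectors: list[list[float]] = []
    for n, batch in enumerate(batched(items, 100), start=1):
        vectors.extend(embed_batch([it["doc_text"] for it in batch]))
        print(f"batch {n}: {len(vectors)}/{len(items)} embedded")
    return vectors
```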

### Step 3: ChromaDB Ingest

Embeds all documents in batches of 100 and stores in ChromaDB.

- Skips items already in ChromaDB (resumable, safe to re-run)
- Stores: URL (as ID), embedding vector, title, category, source, score, tags
- Uses cosine similarity space (`hnsw:space: cosine`)
- Database written to disk at `CHROMA_PATH` (default: `OpenMark/data/chroma_db/`)

Timing:

| Provider | 8K items | Notes |
| --- | --- | --- |
| Local pplx-embed (CPU) | ~20 min | No GPU detected = CPU inference |
| Local pplx-embed (GPU) | ~3 min | NVIDIA GPU with CUDA |
| Azure AI Foundry | ~5 min | Network bound |
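The resumable skip-if-present logic can be sketched against a Chroma-style collection (`get`/`add` as in the chromadb client; `embed_batch` is a hypothetical stand-in for the configured provider):

```python
def ingest_chroma(collection, items: list[dict], embed_batch, batch_size: int = 100):
    """Embed and store items, skipping URLs already in the collection."""
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        ids = [it["url"] for it in batch]
        existing = set(collection.get(ids=ids)["ids"])   # already ingested
        new = [it for it in batch if it["url"] not in existing]
        if not new:
            continue                                     # whole batch done on an earlier run
        collection.add(
            ids=[it["url"] for it in new],
            embeddings=embed_batch([it["doc_text"] for it in new]),
            metadatas=[{"title": it["title"], "category": it["category"],
                        "source": it["source"], "score": it["score"],
                        "tags": " ".join(it["tags"])} for it in new],
        )
```

Because already-present IDs are filtered out before embedding, re-running after an interruption only pays for the items that are actually missing.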

### Step 4: Neo4j Ingest

Creates nodes and relationships in batches of 200.

Nodes created:

- `Bookmark` – url, title, score
- `Category` – name
- `Tag` – name
- `Source` – name (raindrop, linkedin, youtube_liked, edge, dailydev, etc.)
- `Domain` – extracted from the URL (e.g. github.com, medium.com)

Relationships created:

- `(Bookmark)-[:IN_CATEGORY]->(Category)`
- `(Bookmark)-[:TAGGED]->(Tag)`
- `(Bookmark)-[:FROM_SOURCE]->(Source)`
- `(Bookmark)-[:FROM_DOMAIN]->(Domain)`
- `(Tag)-[:CO_OCCURS_WITH {count}]-(Tag)` – built after all nodes are written

Timing: ~3-5 minutes for 8K items.

Idempotent: uses MERGE everywhere, so the step is safe to re-run and won't create duplicates.
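The batched, idempotent writes can be sketched like this; the Cypher text and `run_query` callable are illustrative (only the `Bookmark`/`Category` pair is shown, and the real script batches 200 rows per query across all node types):

```python
UPSERT_BOOKMARKS = """
UNWIND $rows AS row
MERGE (b:Bookmark {url: row.url})
SET b.title = row.title, b.score = row.score
MERGE (c:Category {name: row.category})
MERGE (b)-[:IN_CATEGORY]->(c)
"""

def ingest_neo4j(run_query, items: list[dict], batch_size: int = 200):
    # run_query: callable in the shape of neo4j's Session.run(query, rows=...)
    for i in range(0, len(items), batch_size):
        rows = [{"url": it["url"], "title": it["title"],
                 "score": it["score"], "category": it["category"]}
                for it in items[i:i + batch_size]]
        run_query(UPSERT_BOOKMARKS, rows=rows)
```

Because every node and relationship goes through `MERGE`, re-running the same batch matches the existing graph instead of duplicating it.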


### Step 5: SIMILAR_TO Edges

This is the most powerful and most time-consuming step.

For each of the 8K bookmarks, OpenMark queries ChromaDB for its top-5 nearest semantic neighbors and writes those as SIMILAR_TO edges in Neo4j with a similarity score.

```
(Bookmark {url: "...langchain-docs..."})-[:SIMILAR_TO {score: 0.94}]->(Bookmark {url: "...langgraph-tutorial..."})
```

These edges encode semantic connections you never manually created. The knowledge graph becomes a web of meaning, not just a web of tags.

Timing: ~25-40 minutes on CPU for 8K items. This is the longest step.

Skip it if you're in a hurry:

```bash
python scripts/ingest.py --skip-similar
```

Everything else works without SIMILAR_TO edges; you only lose the agent's `find_similar_bookmarks` tool and graph traversals over those edges.

Only edges with similarity > 0.5 are written. Low-quality connections are discarded.
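The whole pass reduces to: for each bookmark, fetch its top-k neighbors, drop self-matches and anything at or below the threshold, write the rest. A sketch with hypothetical `query_neighbors` and `write_edge` callables standing in for the ChromaDB query and the Neo4j MERGE:

```python
def build_similar_edges(urls: list[str], query_neighbors, write_edge,
                        k: int = 5, threshold: float = 0.5) -> int:
    # query_neighbors(url, k) -> list of (neighbor_url, similarity) pairs
    written = 0
    for url in urls:
        for neighbor, score in query_neighbors(url, k):
            if neighbor == url:
                continue            # a vector store returns the item itself as a top hit
            if score > threshold:   # discard low-quality connections
                write_edge(url, neighbor, score)
                written += 1
    return written
```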


## Re-running the Pipeline

The pipeline is safe to re-run at any time:

- ChromaDB: skips already-ingested URLs automatically
- Neo4j: uses MERGE, so no duplicates are created
- SIMILAR_TO: edges are overwritten (not duplicated) via MERGE

To add new bookmarks after the first run:

1. Update your source files (fresh Raindrop pull, new LinkedIn export, etc.)
2. Run `python scripts/ingest.py`; only new items get embedded and stored

## Checking What's Ingested

```bash
# Quick stats
python scripts/search.py --stats

# Search to verify
python scripts/search.py "RAG tools"

# Neo4j: open the browser at http://localhost:7474
# and run: MATCH (b:Bookmark) RETURN count(b)
```