OpenMark / docs /architecture.md
Architecture

Overview

OpenMark uses a dual-store architecture: two databases working together, each doing what it's best at.

                        User Query
                            │
                    LangGraph Agent
                    (gpt-4o-mini)
                   /              \
           ChromaDB               Neo4j
         (vector store)        (graph store)

         "find by meaning"     "find by connection"
         "what's similar?"     "how are things linked?"

Embedding Layer

The embedding layer is provider-agnostic: swap between local and cloud with one env var.

EMBEDDING_PROVIDER=local   →  LocalEmbedder  (pplx-embed, runs on your machine)
EMBEDDING_PROVIDER=azure   →  AzureEmbedder  (Azure AI Foundry, API call)
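A minimal sketch of what that switch might look like. The class internals are stand-ins (the real `LocalEmbedder` loads pplx-embed, the real `AzureEmbedder` calls Azure AI Foundry); only the env-var dispatch is the point here.

```python
# Hedged sketch of the provider switch keyed on EMBEDDING_PROVIDER.
# The stub classes only mark which backend was chosen; the real ones
# implement embed_query() / embed_documents() against their backend.
import os


class LocalEmbedder:
    """Stand-in: the real class loads the pplx-embed models locally."""
    name = "local"


class AzureEmbedder:
    """Stand-in: the real class calls the Azure AI Foundry embedding API."""
    name = "azure"


def get_embedder():
    """Pick the embedding backend from the environment; default to local."""
    provider = os.environ.get("EMBEDDING_PROVIDER", "local").lower()
    if provider == "azure":
        return AzureEmbedder()
    return LocalEmbedder()
```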

Why two pplx-embed models?

Perplexity AI ships two variants:

  • pplx-embed-v1-0.6b: for encoding queries (what the user types)
  • pplx-embed-context-v1-0.6b: for encoding documents (the bookmarks, where surrounding context matters)

Using the correct model for each role improves retrieval quality. Most implementations use a single model for both roles; the split query/document setup here follows the intended production pattern.

The compatibility patches:

pplx-embed models ship with custom Python code (st_quantize.py) that has two incompatibilities with modern libraries:

  1. sentence_transformers 4.x removed the Module base class that pplx-embed's code imports. Fixed by aliasing torch.nn.Module to sentence_transformers.models.Module before import.

  2. transformers 4.57 added list_repo_templates(), which looks for an additional_chat_templates folder in every model repo. pplx-embed doesn't have one, causing a hard 404 crash. Fixed by monkey-patching the function to return an empty list on exception.

Both patches are applied in openmark/embeddings/local.py before any model loading.
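The two shims can be sketched as small patch helpers. This is a hedged sketch, not the actual code in openmark/embeddings/local.py: the helpers take the module to patch as a parameter, whereas the real code patches the imported libraries in place before loading the model.

```python
# Illustrative versions of the two compatibility patches described above.


def patch_module_base(st_models_mod, fallback_base):
    """Patch 1: sentence_transformers 4.x removed the Module base class that
    pplx-embed's custom st_quantize.py imports. Restore it as an alias
    (the real code aliases torch.nn.Module)."""
    if not hasattr(st_models_mod, "Module"):
        st_models_mod.Module = fallback_base


def patch_repo_templates(hub_mod):
    """Patch 2: transformers 4.57 calls list_repo_templates(), which 404s on
    repos without an additional_chat_templates folder. Wrap it so any error
    degrades to 'no extra templates' instead of crashing."""
    original = hub_mod.list_repo_templates

    def safe_list_repo_templates(*args, **kwargs):
        try:
            return original(*args, **kwargs)
        except Exception:
            return []  # repo has no additional chat templates

    hub_mod.list_repo_templates = safe_list_repo_templates
```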

Why sentence-transformers==3.3.1 specifically?

Version 4.x removed the Module base class that pplx-embed depends on. Pin to 3.3.1.


ChromaDB

Local, file-based vector database. No server, no API key, no cloud.

Collection:        openmark_bookmarks
Similarity metric: cosine
Data path:         CHROMA_PATH in .env (default: OpenMark/data/chroma_db/)

What's stored per item:

{
    "id":       url,           # primary key
    "document": doc_text,      # rich text used for embedding
    "metadata": {
        "title":    str,
        "category": str,
        "source":   str,       # raindrop, linkedin, youtube_liked, edge, etc.
        "score":    float,     # quality score 1-10
        "tags":     str,       # comma-separated
        "folder":   str,
    },
    "embedding": [float x 1024]  # or 1536 for Azure
}
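A hedged sketch of mapping one normalized bookmark onto this record shape. The helper name `to_chroma_record` is illustrative, not from the codebase; only the field layout comes from the schema above.

```python
def to_chroma_record(bm, embedding):
    """Map a normalized bookmark dict onto Chroma's (id, document, metadata,
    embedding) fields; the URL doubles as the primary key."""
    return {
        "id": bm["url"],
        "document": bm["doc_text"],
        "metadata": {
            "title": bm["title"],
            "category": bm["category"],
            "source": bm["source"],
            "score": float(bm["score"]),
            "tags": ",".join(bm["tags"]),  # Chroma metadata values are scalars
            "folder": bm.get("folder", ""),
        },
        "embedding": embedding,
    }


# Usage against a collection (illustrative):
# rec = to_chroma_record(bm, embedder.embed_documents([bm["doc_text"]])[0])
# collection.add(ids=[rec["id"]], documents=[rec["document"]],
#                metadatas=[rec["metadata"]], embeddings=[rec["embedding"]])
```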

Querying:

import chromadb

client = chromadb.PersistentClient(path="data/chroma_db")   # CHROMA_PATH
collection = client.get_collection("openmark_bookmarks")

collection.query(
    query_embeddings=[embedder.embed_query("RAG tools")],
    n_results=10,
    where={"category": {"$eq": "RAG & Vector Search"}},  # optional filter
)

Neo4j Graph Schema

(:Bookmark {url, title, score})
    -[:IN_CATEGORY]->   (:Category {name})
    -[:TAGGED]->        (:Tag {name})
    -[:FROM_SOURCE]->   (:Source {name})
    -[:FROM_DOMAIN]->   (:Domain {name})
    -[:SIMILAR_TO {score}]->  (:Bookmark)  ← from embeddings

(:Tag)-[:CO_OCCURS_WITH {count}]-(:Tag)    ← tags that appear together
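A per-bookmark upsert against this schema might look like the following. This is a hedged sketch: the Cypher text and the helper name are illustrative, not lifted from the codebase, and `tx` stands for a Neo4j transaction (e.g. as passed by session.execute_write in the official neo4j driver).

```python
# Illustrative MERGE-based upsert for one bookmark and its category/tags.
UPSERT = """
MERGE (b:Bookmark {url: $url})
  SET b.title = $title, b.score = $score
MERGE (c:Category {name: $category})
MERGE (b)-[:IN_CATEGORY]->(c)
FOREACH (name IN $tags |
  MERGE (t:Tag {name: name})
  MERGE (b)-[:TAGGED]->(t))
"""


def upsert_bookmark(tx, bm):
    """Run the upsert inside a transaction; MERGE makes it idempotent."""
    tx.run(UPSERT, url=bm["url"], title=bm["title"],
           score=bm["score"], category=bm["category"], tags=bm["tags"])
```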

Useful Cypher queries:

// Count everything
MATCH (b:Bookmark) RETURN count(b) AS bookmarks
MATCH (t:Tag) RETURN count(t) AS tags

// Top categories
MATCH (b:Bookmark)-[:IN_CATEGORY]->(c:Category)
RETURN c.name, count(b) AS count ORDER BY count DESC

// All bookmarks tagged 'rag'
MATCH (b:Bookmark)-[:TAGGED]->(t:Tag {name: 'rag'})
RETURN b.title, b.url ORDER BY b.score DESC

// Find what connects to 'langchain' tag (2 hops)
MATCH (t:Tag {name: 'langchain'})-[:CO_OCCURS_WITH*1..2]-(related:Tag)
RETURN related.name, count(*) AS strength ORDER BY strength DESC

// Similar bookmarks to a URL
MATCH (b:Bookmark {url: 'https://...'})-[r:SIMILAR_TO]-(other)
RETURN other.title, other.url, r.score ORDER BY r.score DESC

// Most connected domains
MATCH (b:Bookmark)-[:FROM_DOMAIN]->(d:Domain)
RETURN d.name, count(b) AS saved ORDER BY saved DESC LIMIT 20

LangGraph Agent

Built with create_react_agent from LangGraph 1.0.x.

Model:  Azure gpt-4o-mini (streaming enabled)
Memory: MemorySaver (conversation history persists per thread_id within a session)

Tools:

Tool                     Store      Description
search_semantic          ChromaDB   Natural language vector search
search_by_category       ChromaDB   Filter by category + optional query
find_by_tag              Neo4j      Exact tag lookup
find_similar_bookmarks   Neo4j      SIMILAR_TO edge traversal
explore_tag_cluster      Neo4j      CO_OCCURS_WITH traversal (2 hops)
get_stats                Both       Count totals
run_cypher               Neo4j      Raw Cypher for power users

Agent routing: The LLM decides which tool(s) to call based on the query. For "what do I know about RAG" it will call search_semantic + search_by_category + find_by_tag. For "how does LangGraph connect to my Neo4j saves" it will call explore_tag_cluster and run_cypher.


Gradio UI

Three tabs:

Tab      What it does
Chat     Full LangGraph agent conversation. Remembers context within session.
Search   Direct ChromaDB search with category filter, min score slider, result count.
Stats    Neo4j category breakdown + top tags. Loads on startup.

Run: python openmark/ui/app.py → http://localhost:7860


Data Flow Summary

Source files (JSON, HTML)
        │
   merge.py → normalize.py
        │
   8,007 items with doc_text
        │
   EmbeddingProvider.embed_documents()
        │
   ┌────┴────┐
   │         │
ChromaDB   Neo4j
add()      MERGE nodes + relationships
           CO_OCCURS_WITH edges
           SIMILAR_TO edges (from ChromaDB top-5 per item)
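The last step, deriving SIMILAR_TO edges, can be sketched in pure Python. The pipeline itself asks ChromaDB for each item's top 5 neighbours rather than computing cosine similarity by hand, so this is only a model of the idea:

```python
# Sketch: derive (url, other_url, score) edges by taking each item's
# top-k nearest neighbours under cosine similarity.
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def similar_edges(items, k=5):
    """items: {url: embedding}. Return (url, other, score) triples,
    one group of up to k edges per item."""
    edges = []
    for url, emb in items.items():
        scored = sorted(
            ((other, cosine(emb, e)) for other, e in items.items() if other != url),
            key=lambda pair: pair[1], reverse=True)
        edges.extend((url, other, score) for other, score in scored[:k])
    return edges
```

Note the edges come out directed (a→b and b→a both appear); the graph store can treat SIMILAR_TO as effectively undirected when querying, as the Cypher examples above do with the undirected `-[r:SIMILAR_TO]-` pattern.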