OpenMark / docs /architecture.md
Architecture

Overview

OpenMark uses a dual-store architecture: two databases working together, each doing what it's best at.

                        User Query
                            │
                    LangGraph Agent
                    (gpt-4o-mini)
                   /              \
           ChromaDB               Neo4j
         (vector store)        (graph store)

         "find by meaning"     "find by connection"
         "what's similar?"     "how are things linked?"

Embedding Layer

The embedding layer is provider-agnostic: swap between local and cloud with one env var.

EMBEDDING_PROVIDER=local   →  LocalEmbedder  (pplx-embed, runs on your machine)
EMBEDDING_PROVIDER=azure   →  AzureEmbedder  (Azure AI Foundry, API call)
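A minimal sketch of what that switch might look like. The class internals are stand-ins (the real `LocalEmbedder` loads pplx-embed, the real `AzureEmbedder` calls Azure AI Foundry); only the env-var dispatch is the point here.

```python
# Hedged sketch of the provider switch keyed on EMBEDDING_PROVIDER.
# The stub classes only mark which backend was chosen; the real ones
# implement embed_query() / embed_documents() against their backend.
import os


class LocalEmbedder:
    """Stand-in: the real class loads the pplx-embed models locally."""
    name = "local"


class AzureEmbedder:
    """Stand-in: the real class calls the Azure AI Foundry embedding API."""
    name = "azure"


def get_embedder():
    """Pick the embedding backend from the environment; default to local."""
    provider = os.environ.get("EMBEDDING_PROVIDER", "local").lower()
    if provider == "azure":
        return AzureEmbedder()
    return LocalEmbedder()
```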

Why two pplx-embed models?

Perplexity AI ships two variants:

  • pplx-embed-v1-0.6b: for encoding queries (what the user types)
  • pplx-embed-context-v1-0.6b: for encoding documents (the bookmarks, where surrounding context matters)

Using the correct model for each role improves retrieval quality. Most implementations use a single model for both roles; the split query/document setup here follows the intended production pattern.

The compatibility patches:

pplx-embed models ship with custom Python code (st_quantize.py) that has two incompatibilities with modern libraries:

  1. sentence_transformers 4.x removed the Module base class that pplx-embed's code imports. Fixed by aliasing torch.nn.Module to sentence_transformers.models.Module before import.

  2. transformers 4.57 added list_repo_templates(), which looks for an additional_chat_templates folder in every model repo. pplx-embed doesn't have one, causing a hard 404 crash. Fixed by monkey-patching the function to return an empty list on exception.

Both patches are applied in openmark/embeddings/local.py before any model loading.
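The two shims can be sketched as small patch helpers. This is a hedged sketch, not the actual code in openmark/embeddings/local.py: the helpers take the module to patch as a parameter, whereas the real code patches the imported libraries in place before loading the model.

```python
# Illustrative versions of the two compatibility patches described above.


def patch_module_base(st_models_mod, fallback_base):
    """Patch 1: sentence_transformers 4.x removed the Module base class that
    pplx-embed's custom st_quantize.py imports. Restore it as an alias
    (the real code aliases torch.nn.Module)."""
    if not hasattr(st_models_mod, "Module"):
        st_models_mod.Module = fallback_base


def patch_repo_templates(hub_mod):
    """Patch 2: transformers 4.57 calls list_repo_templates(), which 404s on
    repos without an additional_chat_templates folder. Wrap it so any error
    degrades to 'no extra templates' instead of crashing."""
    original = hub_mod.list_repo_templates

    def safe_list_repo_templates(*args, **kwargs):
        try:
            return original(*args, **kwargs)
        except Exception:
            return []  # repo has no additional chat templates

    hub_mod.list_repo_templates = safe_list_repo_templates
```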

Why sentence-transformers==3.3.1 specifically?

Version 4.x removed the Module base class that pplx-embed depends on. Pin to 3.3.1.


ChromaDB

Local, file-based vector database. No server, no API key, no cloud.

Collection:        openmark_bookmarks
Similarity metric: cosine
Data path:         CHROMA_PATH in .env (default: OpenMark/data/chroma_db/)

What's stored per item:

{
    "id":       url,           # primary key
    "document": doc_text,      # rich text used for embedding
    "metadata": {
        "title":    str,
        "category": str,
        "source":   str,       # raindrop, linkedin, youtube_liked, edge, etc.
        "score":    float,     # quality score 1-10
        "tags":     str,       # comma-separated
        "folder":   str,
    },
    "embedding": [float x 1024]  # or 1536 for Azure
}
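A hedged sketch of mapping one normalized bookmark onto this record shape. The helper name `to_chroma_record` is illustrative, not from the codebase; only the field layout comes from the schema above.

```python
def to_chroma_record(bm, embedding):
    """Map a normalized bookmark dict onto Chroma's (id, document, metadata,
    embedding) fields; the URL doubles as the primary key."""
    return {
        "id": bm["url"],
        "document": bm["doc_text"],
        "metadata": {
            "title": bm["title"],
            "category": bm["category"],
            "source": bm["source"],
            "score": float(bm["score"]),
            "tags": ",".join(bm["tags"]),  # Chroma metadata values are scalars
            "folder": bm.get("folder", ""),
        },
        "embedding": embedding,
    }


# Usage against a collection (illustrative):
# rec = to_chroma_record(bm, embedder.embed_documents([bm["doc_text"]])[0])
# collection.add(ids=[rec["id"]], documents=[rec["document"]],
#                metadatas=[rec["metadata"]], embeddings=[rec["embedding"]])
```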

Querying:

import chromadb

client = chromadb.PersistentClient(path="data/chroma_db")   # CHROMA_PATH
collection = client.get_collection("openmark_bookmarks")

collection.query(
    query_embeddings=[embedder.embed_query("RAG tools")],
    n_results=10,
    where={"category": {"$eq": "RAG & Vector Search"}},  # optional filter
)

Neo4j Graph Schema

(:Bookmark {url, title, score})
    -[:IN_CATEGORY]->   (:Category {name})
    -[:TAGGED]->        (:Tag {name})
    -[:FROM_SOURCE]->   (:Source {name})
    -[:FROM_DOMAIN]->   (:Domain {name})
    -[:SIMILAR_TO {score}]->  (:Bookmark)  ← from embeddings

(:Tag)-[:CO_OCCURS_WITH {count}]-(:Tag)    ← tags that appear together
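A per-bookmark upsert against this schema might look like the following. This is a hedged sketch: the Cypher text and the helper name are illustrative, not lifted from the codebase, and `tx` stands for a Neo4j transaction (e.g. as passed by session.execute_write in the official neo4j driver).

```python
# Illustrative MERGE-based upsert for one bookmark and its category/tags.
UPSERT = """
MERGE (b:Bookmark {url: $url})
  SET b.title = $title, b.score = $score
MERGE (c:Category {name: $category})
MERGE (b)-[:IN_CATEGORY]->(c)
FOREACH (name IN $tags |
  MERGE (t:Tag {name: name})
  MERGE (b)-[:TAGGED]->(t))
"""


def upsert_bookmark(tx, bm):
    """Run the upsert inside a transaction; MERGE makes it idempotent."""
    tx.run(UPSERT, url=bm["url"], title=bm["title"],
           score=bm["score"], category=bm["category"], tags=bm["tags"])
```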

Useful Cypher queries:

// Count everything
MATCH (b:Bookmark) RETURN count(b) AS bookmarks
MATCH (t:Tag) RETURN count(t) AS tags

// Top categories
MATCH (b:Bookmark)-[:IN_CATEGORY]->(c:Category)
RETURN c.name, count(b) AS count ORDER BY count DESC

// All bookmarks tagged 'rag'
MATCH (b:Bookmark)-[:TAGGED]->(t:Tag {name: 'rag'})
RETURN b.title, b.url ORDER BY b.score DESC

// Find what connects to 'langchain' tag (2 hops)
MATCH (t:Tag {name: 'langchain'})-[:CO_OCCURS_WITH*1..2]-(related:Tag)
RETURN related.name, count(*) AS strength ORDER BY strength DESC

// Similar bookmarks to a URL
MATCH (b:Bookmark {url: 'https://...'})-[r:SIMILAR_TO]-(other)
RETURN other.title, other.url, r.score ORDER BY r.score DESC

// Most connected domains
MATCH (b:Bookmark)-[:FROM_DOMAIN]->(d:Domain)
RETURN d.name, count(b) AS saved ORDER BY saved DESC LIMIT 20

LangGraph Agent

Built with create_react_agent from LangGraph 1.0.x.

Model:  Azure gpt-4o-mini (streaming enabled)
Memory: MemorySaver (conversation history persists per thread_id within a session)

Tools:

Tool                     Store      Description
search_semantic          ChromaDB   Natural language vector search
search_by_category       ChromaDB   Filter by category + optional query
find_by_tag              Neo4j      Exact tag lookup
find_similar_bookmarks   Neo4j      SIMILAR_TO edge traversal
explore_tag_cluster      Neo4j      CO_OCCURS_WITH traversal (2 hops)
get_stats                Both       Count totals
run_cypher               Neo4j      Raw Cypher for power users

Agent routing: The LLM decides which tool(s) to call based on the query. For "what do I know about RAG" it will call search_semantic + search_by_category + find_by_tag. For "how does LangGraph connect to my Neo4j saves" it will call explore_tag_cluster and run_cypher.


Gradio UI

Three tabs:

Tab      What it does
Chat     Full LangGraph agent conversation. Remembers context within session.
Search   Direct ChromaDB search with category filter, min score slider, result count.
Stats    Neo4j category breakdown + top tags. Loads on startup.

Run: python openmark/ui/app.py → http://localhost:7860


Data Flow Summary

Source files (JSON, HTML)
        │
   merge.py → normalize.py
        │
   8,007 items with doc_text
        │
   EmbeddingProvider.embed_documents()
        │
   ┌────┴────┐
   │         │
ChromaDB   Neo4j
add()      MERGE nodes + relationships
           CO_OCCURS_WITH edges
           SIMILAR_TO edges (from ChromaDB top-5 per item)
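The last step, deriving SIMILAR_TO edges, can be sketched in pure Python. The pipeline itself asks ChromaDB for each item's top 5 neighbours rather than computing cosine similarity by hand, so this is only a model of the idea:

```python
# Sketch: derive (url, other_url, score) edges by taking each item's
# top-k nearest neighbours under cosine similarity.
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def similar_edges(items, k=5):
    """items: {url: embedding}. Return (url, other, score) triples,
    one group of up to k edges per item."""
    edges = []
    for url, emb in items.items():
        scored = sorted(
            ((other, cosine(emb, e)) for other, e in items.items() if other != url),
            key=lambda pair: pair[1], reverse=True)
        edges.extend((url, other, score) for other, score in scored[:k])
    return edges
```

Note the edges come out directed (a→b and b→a both appear); the graph store can treat SIMILAR_TO as effectively undirected when querying, as the Cypher examples above do with the undirected `-[r:SIMILAR_TO]-` pattern.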