| # Architecture | |
| ## Overview | |
| OpenMark uses a **dual-store architecture**: two databases working together, each doing what it's best at. | |
| ``` | |
|                  User Query | |
|                       ↓ | |
|                LangGraph Agent | |
|                 (gpt-4o-mini) | |
|                  /         \ | |
|       ChromaDB                  Neo4j | |
|    (vector store)           (graph store) | |
|   "find by meaning"     "find by connection" | |
|   "what's similar?"   "how are things linked?" | |
| ``` | |
| --- | |
| ## Embedding Layer | |
| The embedding layer is **provider-agnostic**: swap between local and cloud with one env var. | |
| ``` | |
| EMBEDDING_PROVIDER=local → LocalEmbedder (pplx-embed, runs on your machine) | |
| EMBEDDING_PROVIDER=azure → AzureEmbedder (Azure AI Foundry, API call) | |
| ``` | |
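The switch above could be wired up roughly as follows. This is a minimal sketch, not the project's actual code: `get_embedder`, the `_StubEmbedder` base, and the `local` default are assumptions, and the two embedder classes are stand-ins that return zero vectors of the dimensions listed in the ChromaDB section.

```python
import os
from typing import Protocol


class EmbeddingProvider(Protocol):
    """Interface both embedders are assumed to share."""

    def embed_query(self, text: str) -> list[float]: ...
    def embed_documents(self, texts: list[str]) -> list[list[float]]: ...


class _StubEmbedder:
    """Hypothetical stand-in: returns zero vectors of the right size."""

    dim = 0

    def embed_query(self, text: str) -> list[float]:
        return [0.0] * self.dim

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [[0.0] * self.dim for _ in texts]


class LocalEmbedder(_StubEmbedder):
    dim = 1024  # pplx-embed output size, per the ChromaDB section


class AzureEmbedder(_StubEmbedder):
    dim = 1536  # Azure output size, per the ChromaDB section


_PROVIDERS = {"local": LocalEmbedder, "azure": AzureEmbedder}


def get_embedder(env=None) -> EmbeddingProvider:
    """Pick an embedder from EMBEDDING_PROVIDER (default 'local' is assumed)."""
    env = os.environ if env is None else env
    name = env.get("EMBEDDING_PROVIDER", "local").strip().lower()
    try:
        return _PROVIDERS[name]()
    except KeyError:
        raise ValueError(
            f"Unknown EMBEDDING_PROVIDER {name!r}; expected one of {sorted(_PROVIDERS)}"
        ) from None
```

Keeping the lookup in one function means the rest of the pipeline only ever depends on `embed_query` and `embed_documents`, so a third provider would be a one-line addition to the mapping.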
| **Why two pplx-embed models?** | |
| Perplexity AI ships two variants: | |
| - `pplx-embed-v1-0.6b`: for encoding **queries** (what the user types) | |
| - `pplx-embed-context-v1-0.6b`: for encoding **documents** (the bookmarks, where surrounding context matters) | |
| Using the correct model for each role improves retrieval quality. Most implementations use one model for both; pairing a query model with a context model, as done here, is the correct production pattern. | |
| **The compatibility patches:** | |
| pplx-embed models ship with custom Python code (`st_quantize.py`) that has two incompatibilities with modern libraries: | |
| 1. **`sentence_transformers 4.x` removed the `Module` base class**: pplx-embed's code imports it. Fixed by aliasing `torch.nn.Module` to `sentence_transformers.models.Module` before import. | |
| 2. **`transformers 4.57` added `list_repo_templates()`**: it looks for an `additional_chat_templates` folder in every model repo. pplx-embed doesn't have one, causing a hard 404 crash. Fixed by monkey-patching the function to return an empty list on exception. | |
| Both patches are applied in `openmark/embeddings/local.py` before any model loading. | |
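The shape of the two patches can be illustrated without the real libraries. The sketch below is an assumption about how such patches might look, not a copy of `openmark/embeddings/local.py`: the helper names are hypothetical, and `fake_models` / `fake_hub` are stand-ins for `sentence_transformers.models` and whichever `transformers` module holds `list_repo_templates`.

```python
import functools
import types


def alias_missing_attr(module, name, replacement):
    """Expose `replacement` as `module.<name>` only if it is absent."""
    if not hasattr(module, name):
        setattr(module, name, replacement)


def empty_list_on_error(func):
    """Wrap `func` so any exception yields [] instead of propagating."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            return []

    return wrapper


# Stand-ins so the sketch runs without the real packages installed.
fake_models = types.SimpleNamespace()  # plays sentence_transformers.models
fake_hub = types.SimpleNamespace()     # plays the module owning list_repo_templates


class FakeTorchModule:
    """Plays torch.nn.Module."""


def _list_repo_templates(repo_id):
    # Simulates the hard failure when the templates folder is missing.
    raise FileNotFoundError("404: additional_chat_templates")


fake_hub.list_repo_templates = _list_repo_templates

# Patch 1: restore the removed base class name.
alias_missing_attr(fake_models, "Module", FakeTorchModule)
# Patch 2: make the template lookup fail soft.
fake_hub.list_repo_templates = empty_list_on_error(fake_hub.list_repo_templates)
```

Both patches have to run before the model's custom code is imported, since the import itself is what triggers the failures.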
| **Why `sentence-transformers==3.3.1` specifically?** | |
| Version 4.x removed the `Module` base class that pplx-embed depends on. Pin to 3.3.1. | |
| --- | |
| ## ChromaDB | |
| Local, file-based vector database. No server, no API key, no cloud. | |
| **Collection:** `openmark_bookmarks` | |
| **Similarity metric:** cosine | |
| **Data path:** `CHROMA_PATH` in `.env` (default: `OpenMark/data/chroma_db/`) | |
| **What's stored per item:** | |
| ```python | |
| { | |
| "id": url, # primary key | |
| "document": doc_text, # rich text used for embedding | |
| "metadata": { | |
| "title": str, | |
| "category": str, | |
| "source": str, # raindrop, linkedin, youtube_liked, edge, etc. | |
| "score": float, # quality score 1-10 | |
| "tags": str, # comma-separated | |
| "folder": str, | |
| }, | |
| "embedding": [float x 1024] # or 1536 for Azure | |
| } | |
| ``` | |
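As an illustration of how one normalized bookmark might be mapped onto that record, here is a small sketch. The function name and the input field names (`url`, `doc_text`, `tags` as a list, and so on) are assumptions about the normalized data, not confirmed by the source.

```python
def to_chroma_record(bookmark: dict, embedding: list[float]) -> dict:
    """Build the per-item record described above from a bookmark dict."""
    tags = bookmark.get("tags") or []
    return {
        "id": bookmark["url"],              # URL doubles as the primary key
        "document": bookmark["doc_text"],   # text that was embedded
        "metadata": {
            "title": bookmark.get("title", ""),
            "category": bookmark.get("category", ""),
            "source": bookmark.get("source", ""),
            "score": float(bookmark.get("score", 0)),
            "tags": ",".join(tags),         # list flattened to one string
            "folder": bookmark.get("folder", ""),
        },
        "embedding": embedding,
    }


record = to_chroma_record(
    {
        "url": "https://example.com/rag",
        "doc_text": "Intro to RAG pipelines",
        "title": "RAG intro",
        "category": "RAG & Vector Search",
        "source": "raindrop",
        "score": 8,
        "tags": ["rag", "langchain"],
        "folder": "AI",
    },
    embedding=[0.0] * 1024,
)
```

Flattening the tag list into a comma-separated string keeps every metadata value a plain scalar, which is presumably why the schema stores `tags` as `str`.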
| **Querying:** | |
| ```python | |
| collection.query( | |
| query_embeddings=[embedder.embed_query("RAG tools")], | |
| n_results=10, | |
| where={"category": {"$eq": "RAG & Vector Search"}}, # optional filter | |
| ) | |
| ``` | |
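`collection.query` returns parallel lists nested one level per query, so callers usually flatten them before display. The helper below is a hypothetical sketch of that step; it assumes the default result fields (`ids`, `metadatas`, `distances`) and that, with the cosine metric, the reported distance is `1 - similarity`. The sample dict is made up for illustration.

```python
def flatten_hits(result: dict) -> list[dict]:
    """Turn the first query's nested result lists into one dict per hit."""
    ids = result["ids"][0]
    metadatas = result["metadatas"][0]
    distances = result["distances"][0]
    hits = []
    for url, meta, dist in zip(ids, metadatas, distances):
        hits.append(
            {
                "url": url,
                "title": meta.get("title", ""),
                "similarity": 1.0 - dist,  # cosine distance -> similarity
            }
        )
    return hits


# Hand-written sample in the shape of a single-query result.
sample = {
    "ids": [["https://a.example", "https://b.example"]],
    "metadatas": [[{"title": "A"}, {"title": "B"}]],
    "distances": [[0.25, 0.5]],
}
hits = flatten_hits(sample)
```

Results arrive ordered by ascending distance, so the flattened list is already ranked from most to least similar.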
| --- | |
| ## Neo4j Graph Schema | |
| ``` | |
| (:Bookmark {url, title, score}) | |
| -[:IN_CATEGORY]-> (:Category {name}) | |
| -[:TAGGED]-> (:Tag {name}) | |
| -[:FROM_SOURCE]-> (:Source {name}) | |
| -[:FROM_DOMAIN]-> (:Domain {name}) | |
| -[:SIMILAR_TO {score}]-> (:Bookmark) ← from embeddings | |
| (:Tag)-[:CO_OCCURS_WITH {count}]-(:Tag) ← tags that appear together | |
| ``` | |
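The `count` property on `CO_OCCURS_WITH` can be derived by counting tag pairs across bookmarks. The following is a minimal sketch of that counting step under the assumption that each bookmark carries a list of tags; `tag_cooccurrence` is a hypothetical name, and how the project actually computes the counts is not shown in the source.

```python
from collections import Counter
from itertools import combinations


def tag_cooccurrence(tag_lists):
    """Count how often each unordered pair of tags shares a bookmark."""
    counts = Counter()
    for tags in tag_lists:
        # Deduplicate and sort so (a, b) and (b, a) land on the same key.
        unique = sorted(set(tags))
        counts.update(combinations(unique, 2))
    return counts


pairs = tag_cooccurrence(
    [
        ["rag", "langchain", "neo4j"],
        ["rag", "langchain"],
        ["neo4j"],
    ]
)
```

Because the relationship is undirected in the schema, storing each pair once with sorted endpoints avoids writing the same edge twice.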
| **Useful Cypher queries:** | |
| ```cypher | |
| // Count everything (two separate statements) | |
| MATCH (b:Bookmark) RETURN count(b) AS bookmarks; | |
| MATCH (t:Tag) RETURN count(t) AS tags; | |
| // Top categories | |
| MATCH (b:Bookmark)-[:IN_CATEGORY]->(c:Category) | |
| RETURN c.name, count(b) AS count ORDER BY count DESC | |
| // All bookmarks tagged 'rag' | |
| MATCH (b:Bookmark)-[:TAGGED]->(t:Tag {name: 'rag'}) | |
| RETURN b.title, b.url ORDER BY b.score DESC | |
| // Find what connects to 'langchain' tag (2 hops) | |
| MATCH (t:Tag {name: 'langchain'})-[:CO_OCCURS_WITH*1..2]-(related:Tag) | |
| RETURN related.name, count(*) AS strength ORDER BY strength DESC | |
| // Similar bookmarks to a URL | |
| MATCH (b:Bookmark {url: 'https://...'})-[r:SIMILAR_TO]-(other) | |
| RETURN other.title, other.url, r.score ORDER BY r.score DESC | |
| // Most connected domains | |
| MATCH (b:Bookmark)-[:FROM_DOMAIN]->(d:Domain) | |
| RETURN d.name, count(b) AS saved ORDER BY saved DESC LIMIT 20 | |
| ``` | |
| --- | |
| ## LangGraph Agent | |
| Built with `create_react_agent` from LangGraph 1.0.x. | |
| **Model:** Azure gpt-4o-mini (streaming enabled) | |
| **Memory:** `MemorySaver`: conversation history persists per `thread_id` within a session | |
| **Tools:** | |
| | Tool | Store | Description | | |
| |------|-------|-------------| | |
| | `search_semantic` | ChromaDB | Natural language vector search | | |
| | `search_by_category` | ChromaDB | Filter by category + optional query | | |
| | `find_by_tag` | Neo4j | Exact tag lookup | | |
| | `find_similar_bookmarks` | Neo4j | SIMILAR_TO edge traversal | | |
| | `explore_tag_cluster` | Neo4j | CO_OCCURS_WITH traversal (2 hops) | | |
| | `get_stats` | Both | Count totals | | |
| | `run_cypher` | Neo4j | Raw Cypher for power users | | |
| **Agent routing:** The LLM decides which tool(s) to call based on the query. For "what do I know about RAG" it typically calls `search_semantic` + `search_by_category` + `find_by_tag`. For "how does LangGraph connect to my Neo4j saves" it typically calls `explore_tag_cluster` and `run_cypher`. | |
| --- | |
| ## Gradio UI | |
| Three tabs: | |
| | Tab | What it does | | |
| |-----|-------------| | |
| | Chat | Full LangGraph agent conversation. Remembers context within session. | | |
| | Search | Direct ChromaDB search with category filter, min score slider, result count. | | |
| | Stats | Neo4j category breakdown + top tags. Loads on startup. | | |
| Run: `python openmark/ui/app.py` → `http://localhost:7860` | |
| --- | |
| ## Data Flow Summary | |
| ``` | |
| Source files (JSON, HTML) | |
|         ↓ | |
| merge.py → normalize.py | |
|         ↓ | |
| 8,007 items with doc_text | |
|         ↓ | |
| EmbeddingProvider.embed_documents() | |
|         ↓ | |
|    ┌────┴────┐ | |
|    ↓         ↓ | |
| ChromaDB   Neo4j | |
| add()      MERGE nodes + relationships | |
|            CO_OCCURS_WITH edges | |
|            SIMILAR_TO edges (from ChromaDB top-5 per item) | |
| ``` | |
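The last step of the flow, deriving `SIMILAR_TO` edges from each item's nearest neighbours, can be sketched in plain Python. Note that this swaps in a brute-force cosine comparison for the ChromaDB top-5 lookup the pipeline uses, purely so the example is self-contained; `similar_edges` and its input shape (URL mapped to embedding) are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similar_edges(embeddings: dict, k: int = 5):
    """Return (source_url, target_url, score) for each item's top-k neighbours."""
    edges = []
    for url, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in embeddings.items()
            if other != url  # never link a bookmark to itself
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        edges.extend((url, other, score) for score, other in scored[:k])
    return edges


edges = similar_edges(
    {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 0.0]},
    k=1,
)
```

Each tuple maps directly onto one `(:Bookmark)-[:SIMILAR_TO {score}]->(:Bookmark)` relationship. Brute force is quadratic in the number of items, which is why a vector index is the practical choice at the 8,007-item scale.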