# Architecture
## Overview
OpenMark uses a **dual-store architecture**: two databases working together, each doing what it does best.
```
        User Query
            ↓
      LangGraph Agent
       (gpt-4o-mini)
         /      \
ChromaDB              Neo4j
(vector store)        (graph store)
"find by meaning"     "find by connection"
"what's similar?"     "how are things linked?"
```
---
## Embedding Layer
The embedding layer is **provider-agnostic**: swap between local and cloud with one env var.
```
EMBEDDING_PROVIDER=local → LocalEmbedder (pplx-embed, runs on your machine)
EMBEDDING_PROVIDER=azure → AzureEmbedder (Azure AI Foundry, API call)
```
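A minimal sketch of how such an env-var dispatch might look. The class names match the table above, but the bodies here are placeholder stubs, not the real implementation:

```python
import os


class LocalEmbedder:
    """Stub for the pplx-embed-backed embedder (runs on your machine)."""


class AzureEmbedder:
    """Stub for the Azure AI Foundry embedder (API call)."""


def get_embedder():
    # EMBEDDING_PROVIDER selects the backend; "local" is assumed as the default.
    provider = os.environ.get("EMBEDDING_PROVIDER", "local").lower()
    if provider == "azure":
        return AzureEmbedder()
    if provider == "local":
        return LocalEmbedder()
    raise ValueError(f"Unknown EMBEDDING_PROVIDER: {provider!r}")
```

Because the rest of the pipeline only sees the returned object, swapping providers never touches calling code.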
**Why two pplx-embed models?**
Perplexity AI ships two variants:
- `pplx-embed-v1-0.6b` encodes **queries** (what the user types)
- `pplx-embed-context-v1-0.6b` encodes **documents** (the bookmarks, where surrounding context matters)
Using the right model for each role improves retrieval quality. Many implementations use a single model for both, but the dedicated query/document split is the intended production pattern.
**The compatibility patches:**
pplx-embed models ship with custom Python code (`st_quantize.py`) that has two incompatibilities with modern libraries:
1. **`sentence_transformers` 4.x removed the `Module` base class** that pplx-embed's code imports. Fixed by aliasing `torch.nn.Module` to `sentence_transformers.models.Module` before import.
2. **`transformers` 4.57 added `list_repo_templates()`**, which looks for an `additional_chat_templates` folder in every model repo. pplx-embed doesn't have one, causing a hard 404 crash. Fixed by monkey-patching the function to return an empty list on exception.
Both patches are applied in `openmark/embeddings/local.py` before any model loading.
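The second patch boils down to wrapping a function so that any failure yields an empty list instead of raising. A generic sketch of that pattern (hypothetical decorator name, not the actual code in `local.py`):

```python
from functools import wraps


def fallback_to(default):
    """Wrap a function so any exception returns `default` instead of raising."""
    def patch(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                # e.g. the hard 404 from a missing additional_chat_templates folder
                return default
        return wrapper
    return patch


# Applied to a module attribute, e.g.:
#   some_module.list_repo_templates = fallback_to([])(some_module.list_repo_templates)
```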
**Why `sentence-transformers==3.3.1` specifically?**
Version 4.x removed the `Module` base class that pplx-embed depends on. Pin to 3.3.1.
---
## ChromaDB
Local, file-based vector database. No server, no API key, no cloud.
**Collection:** `openmark_bookmarks`
**Similarity metric:** cosine
**Data path:** `CHROMA_PATH` in `.env` (default: `OpenMark/data/chroma_db/`)
**What's stored per item:**
```python
{
"id": url, # primary key
"document": doc_text, # rich text used for embedding
"metadata": {
"title": str,
"category": str,
"source": str, # raindrop, linkedin, youtube_liked, edge, etc.
"score": float, # quality score 1-10
"tags": str, # comma-separated
"folder": str,
},
"embedding": [float x 1024] # or 1536 for Azure
}
```
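A hypothetical helper showing how a bookmark record could be flattened into that shape. Field names follow the schema above; ChromaDB metadata values must be scalars, hence the comma-joined tags:

```python
def to_chroma_item(bookmark: dict, embedding: list[float]) -> dict:
    """Flatten a bookmark record into the per-item shape stored in the collection."""
    return {
        "id": bookmark["url"],            # URL doubles as the primary key
        "document": bookmark["doc_text"], # rich text used for embedding
        "metadata": {
            "title": bookmark["title"],
            "category": bookmark["category"],
            "source": bookmark["source"],
            "score": float(bookmark["score"]),
            "tags": ",".join(bookmark["tags"]),  # Chroma metadata must be scalar
            "folder": bookmark["folder"],
        },
        "embedding": embedding,           # 1024 dims local, 1536 for Azure
    }
```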
**Querying:**
```python
collection.query(
query_embeddings=[embedder.embed_query("RAG tools")],
n_results=10,
where={"category": {"$eq": "RAG & Vector Search"}}, # optional filter
)
```
---
## Neo4j Graph Schema
```
(:Bookmark {url, title, score})
-[:IN_CATEGORY]-> (:Category {name})
-[:TAGGED]-> (:Tag {name})
-[:FROM_SOURCE]-> (:Source {name})
-[:FROM_DOMAIN]-> (:Domain {name})
  -[:SIMILAR_TO {score}]-> (:Bookmark)      ← from embeddings
(:Tag)-[:CO_OCCURS_WITH {count}]-(:Tag)     ← tags that appear together
```
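The `SIMILAR_TO {score}` edges can be derived from embedding cosine similarity. A stdlib-only sketch, assuming an in-memory `url → vector` dict; the `k` and `threshold` parameters are illustrative, not the project's actual settings:

```python
from math import sqrt


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def similar_edges(embeddings: dict, k: int = 5, threshold: float = 0.5):
    """For each URL, return (url, neighbour, score) for its top-k neighbours."""
    edges = []
    for u in embeddings:
        scored = sorted(
            ((cosine(embeddings[u], embeddings[v]), v)
             for v in embeddings if v != u),
            reverse=True,
        )
        edges.extend((u, v, s) for s, v in scored[:k] if s >= threshold)
    return edges
```

Each returned triple maps directly onto a `MERGE (a)-[:SIMILAR_TO {score: $s}]->(b)` statement.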
**Useful Cypher queries:**
```cypher
// Count everything
MATCH (b:Bookmark) RETURN count(b) AS bookmarks
MATCH (t:Tag) RETURN count(t) AS tags
// Top categories
MATCH (b:Bookmark)-[:IN_CATEGORY]->(c:Category)
RETURN c.name, count(b) AS count ORDER BY count DESC
// All bookmarks tagged 'rag'
MATCH (b:Bookmark)-[:TAGGED]->(t:Tag {name: 'rag'})
RETURN b.title, b.url ORDER BY b.score DESC
// Find what connects to 'langchain' tag (2 hops)
MATCH (t:Tag {name: 'langchain'})-[:CO_OCCURS_WITH*1..2]-(related:Tag)
RETURN related.name, count(*) AS strength ORDER BY strength DESC
// Similar bookmarks to a URL
MATCH (b:Bookmark {url: 'https://...'})-[r:SIMILAR_TO]-(other)
RETURN other.title, other.url, r.score ORDER BY r.score DESC
// Most connected domains
MATCH (b:Bookmark)-[:FROM_DOMAIN]->(d:Domain)
RETURN d.name, count(b) AS saved ORDER BY saved DESC LIMIT 20
```
---
## LangGraph Agent
Built with `create_react_agent` from LangGraph 1.0.x.
**Model:** Azure gpt-4o-mini (streaming enabled)
**Memory:** `MemorySaver`; conversation history persists per `thread_id` within a session
**Tools:**
| Tool | Store | Description |
|------|-------|-------------|
| `search_semantic` | ChromaDB | Natural language vector search |
| `search_by_category` | ChromaDB | Filter by category + optional query |
| `find_by_tag` | Neo4j | Exact tag lookup |
| `find_similar_bookmarks` | Neo4j | SIMILAR_TO edge traversal |
| `explore_tag_cluster` | Neo4j | CO_OCCURS_WITH traversal (2 hops) |
| `get_stats` | Both | Count totals |
| `run_cypher` | Neo4j | Raw Cypher for power users |
**Agent routing:** The LLM decides which tool(s) to call based on the query. For "what do I know about RAG" it will typically call `search_semantic` + `search_by_category` + `find_by_tag`; for "how does LangGraph connect to my Neo4j saves" it will typically call `explore_tag_cluster` and `run_cypher`.
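The tools themselves are ordinary functions whose names and docstrings tell the LLM when to call them. A toy registry illustrating the idea (stubbed return value, not the real Neo4j-backed tool):

```python
def find_by_tag(tag: str) -> list[dict]:
    """Exact tag lookup in Neo4j. (Stub: the real tool runs a Cypher query.)"""
    return []  # placeholder result


# Name → callable mapping; the agent framework exposes these to the LLM,
# which selects a tool by name based on its description.
TOOLS = {"find_by_tag": find_by_tag}


def call_tool(name: str, **kwargs):
    """Dispatch a tool call the way the agent loop would."""
    return TOOLS[name](**kwargs)
```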
---
## Gradio UI
Three tabs:
| Tab | What it does |
|-----|-------------|
| Chat | Full LangGraph agent conversation. Remembers context within session. |
| Search | Direct ChromaDB search with category filter, min score slider, result count. |
| Stats | Neo4j category breakdown + top tags. Loads on startup. |
Run `python openmark/ui/app.py`, then open `http://localhost:7860`.
---
## Data Flow Summary
```
Source files (JSON, HTML)
        ↓
merge.py → normalize.py
        ↓
8,007 items with doc_text
        ↓
EmbeddingProvider.embed_documents()
        ↓
   ┌────┴────┐
   ↓         ↓
ChromaDB    Neo4j
add()       MERGE nodes + relationships
            CO_OCCURS_WITH edges
            SIMILAR_TO edges (from ChromaDB top-5 per item)
```
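The `CO_OCCURS_WITH` counts in the last step amount to counting tag pairs that appear on the same bookmark. A small sketch, assuming each item carries a tag list:

```python
from collections import Counter
from itertools import combinations


def tag_cooccurrence(bookmarks: list[dict]) -> Counter:
    """Count undirected tag pairs appearing on the same bookmark."""
    counts = Counter()
    for item in bookmarks:
        tags = sorted(set(item["tags"]))  # sort so each pair has one canonical order
        counts.update(combinations(tags, 2))
    return counts
```

Each `(tag_a, tag_b) → count` entry maps onto a `MERGE (a)-[:CO_OCCURS_WITH {count: $n}]-(b)` statement.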