# Architecture

## Overview

OpenMark uses a **dual-store architecture**: two databases working together, each doing what it's best at.

```
                        User Query
                            │
                    LangGraph Agent
                    (gpt-4o-mini)
                   /              \
          ChromaDB               Neo4j
        (vector store)        (graph store)

        "find by meaning"     "find by connection"
        "what's similar?"     "how are things linked?"
```

---

## Embedding Layer

The embedding layer is **provider-agnostic**: swap between local and cloud with one env var.

```
EMBEDDING_PROVIDER=local   →  LocalEmbedder  (pplx-embed, runs on your machine)
EMBEDDING_PROVIDER=azure   →  AzureEmbedder  (Azure AI Foundry, API call)
```
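A minimal sketch of how one env var can select the provider. The `get_embedder` factory, the `"local"` default, and the dummy vectors are assumptions made for illustration; only the class names, the `embed_query` / `embed_documents` method names, and the `EMBEDDING_PROVIDER` variable come from this document.

```python
import os
from typing import List, Optional, Protocol


class EmbeddingProvider(Protocol):
    """Interface both providers are assumed to share."""

    def embed_query(self, text: str) -> List[float]: ...

    def embed_documents(self, texts: List[str]) -> List[List[float]]: ...


class LocalEmbedder:
    """Stand-in for the local provider; returns dummy 1024-dim vectors."""

    def embed_query(self, text: str) -> List[float]:
        return [0.0] * 1024

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [[0.0] * 1024 for _ in texts]


class AzureEmbedder:
    """Stand-in for the Azure provider; returns dummy 1536-dim vectors."""

    def embed_query(self, text: str) -> List[float]:
        return [0.0] * 1536

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [[0.0] * 1536 for _ in texts]


_PROVIDERS = {"local": LocalEmbedder, "azure": AzureEmbedder}


def get_embedder(provider: Optional[str] = None) -> EmbeddingProvider:
    """Pick a provider by name, falling back to EMBEDDING_PROVIDER (hypothetical helper)."""
    name = (provider or os.getenv("EMBEDDING_PROVIDER", "local")).strip().lower()
    if name not in _PROVIDERS:
        raise ValueError(f"Unknown EMBEDDING_PROVIDER: {name!r}")
    return _PROVIDERS[name]()
```

Because callers only depend on the two `embed_*` methods, the rest of the pipeline does not need to know which provider is active.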

**Why two pplx-embed models?**

Perplexity AI ships two variants:
- `pplx-embed-v1-0.6b`: for encoding **queries** (what the user types)
- `pplx-embed-context-v1-0.6b`: for encoding **documents** (the bookmarks, where surrounding context matters)

Using the correct model for each role improves retrieval quality. Most implementations use one model for both; pairing each model with its intended role is the correct production pattern.

**The compatibility patches:**

pplx-embed models ship with custom Python code (`st_quantize.py`) that has two incompatibilities with modern libraries:

1. **`sentence_transformers 4.x` removed the `Module` base class**, which pplx-embed's code imports. Fixed by aliasing `torch.nn.Module` to `sentence_transformers.models.Module` before import.

2. **`transformers 4.57` added `list_repo_templates()`**, which looks for an `additional_chat_templates` folder in every model repo. pplx-embed doesn't have one, causing a hard 404 crash. Fixed by monkey-patching the function to return an empty list on exception.

Both patches are applied in `openmark/embeddings/local.py` before any model loading.
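The second patch follows a general pattern: wrap a function so that it returns an empty list instead of raising, then rebind it on its module before anything calls it. The sketch below shows that pattern on a hypothetical stand-in module (`fake_library` and `list_templates` are invented for illustration); it is not the code in `openmark/embeddings/local.py`.

```python
import functools
import types


def return_empty_list_on_error(func):
    """Wrap func so that any exception yields [] instead of propagating."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            return []

    return wrapper


# Hypothetical stand-in for a library module whose function raises when a
# repo lacks the folder it looks for.
fake_library = types.SimpleNamespace()


def _list_templates(repo_id):
    raise FileNotFoundError(f"404: no templates folder in {repo_id}")


fake_library.list_templates = _list_templates

# The monkey-patch: rebind the attribute before any caller uses it.
fake_library.list_templates = return_empty_list_on_error(fake_library.list_templates)
```

A patch like this only affects callers that look the function up through the patched attribute, which is one reason to apply it before any model loading.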

**Why `sentence-transformers==3.3.1` specifically?**

Version 4.x removed the `Module` base class that pplx-embed depends on. Pin to 3.3.1.

---

## ChromaDB

Local, file-based vector database. No server, no API key, no cloud.

**Collection:** `openmark_bookmarks`
**Similarity metric:** cosine
**Data path:** `CHROMA_PATH` in `.env` (default: `OpenMark/data/chroma_db/`)

**What's stored per item:**
```python
{
    "id":       url,           # primary key
    "document": doc_text,      # rich text used for embedding
    "metadata": {
        "title":    str,
        "category": str,
        "source":   str,       # raindrop, linkedin, youtube_liked, edge, etc.
        "score":    float,     # quality score 1-10
        "tags":     str,       # comma-separated
        "folder":   str,
    },
    "embedding": list[float],  # length 1024, or 1536 for Azure
}
```
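The schema stores tags as one comma-separated string, so a list of tags has to be joined on the way in and split on the way out. A minimal sketch of that mapping follows. The input item shape (a dict with a `tags` list) and the helper names `to_chroma_record` and `split_tags` are assumptions, not part of OpenMark's documented API.

```python
from typing import Dict, List


def to_chroma_record(item: Dict) -> Dict:
    """Flatten a bookmark dict into the id/document/metadata layout above."""
    return {
        "id": item["url"],
        "document": item["doc_text"],
        "metadata": {
            "title": item.get("title", ""),
            "category": item.get("category", ""),
            "source": item.get("source", ""),
            "score": float(item.get("score", 0)),
            "tags": ",".join(item.get("tags", [])),
            "folder": item.get("folder", ""),
        },
    }


def split_tags(tags: str) -> List[str]:
    """Turn the stored comma-separated string back into a list of tags."""
    return [t.strip() for t in tags.split(",") if t.strip()]
```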

**Querying:**
```python
collection.query(
    query_embeddings=[embedder.embed_query("RAG tools")],
    n_results=10,
    where={"category": {"$eq": "RAG & Vector Search"}},  # optional filter
)
```
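When more than one metadata condition is needed (for example a category plus a minimum `score`), ChromaDB expects the conditions to be combined under a single `$and` key. The helper below is a hypothetical sketch of building such a filter; `build_where` is not an OpenMark function.

```python
from typing import Optional


def build_where(category: Optional[str] = None,
                min_score: Optional[float] = None) -> Optional[dict]:
    """Build a ChromaDB `where` filter, or None when no filter applies."""
    conditions = []
    if category:
        conditions.append({"category": {"$eq": category}})
    if min_score is not None:
        conditions.append({"score": {"$gte": min_score}})

    if not conditions:
        return None            # omit the filter entirely
    if len(conditions) == 1:
        return conditions[0]   # a single condition needs no wrapper
    return {"$and": conditions}
```

The result can be passed as `where=build_where(...)` to `collection.query`.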

---

## Neo4j Graph Schema

```
(:Bookmark {url, title, score})
    -[:IN_CATEGORY]->   (:Category {name})
    -[:TAGGED]->        (:Tag {name})
    -[:FROM_SOURCE]->   (:Source {name})
    -[:FROM_DOMAIN]->   (:Domain {name})
    -[:SIMILAR_TO {score}]->  (:Bookmark)  ← from embeddings

(:Tag)-[:CO_OCCURS_WITH {count}]-(:Tag)    ← tags that appear together
```
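One plausible way to derive the `count` on `CO_OCCURS_WITH` edges is to count, for each unordered pair of tags, how many bookmarks carry both. The sketch below assumes that definition; the document does not say how OpenMark actually computes it, and `tag_cooccurrence` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations
from typing import List


def tag_cooccurrence(bookmark_tags: List[List[str]]) -> Counter:
    """Count how many bookmarks contain each unordered pair of tags."""
    counts: Counter = Counter()
    for tags in bookmark_tags:
        # Deduplicate and sort so each pair has one canonical (a, b) key.
        unique = sorted(set(tags))
        for a, b in combinations(unique, 2):
            counts[(a, b)] += 1
    return counts
```

Each `(a, b)` key with its count would then map to one undirected `CO_OCCURS_WITH` relationship between two `Tag` nodes.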

**Useful Cypher queries:**

```cypher
// Count everything
MATCH (b:Bookmark) RETURN count(b) AS bookmarks
MATCH (t:Tag) RETURN count(t) AS tags

// Top categories
MATCH (b:Bookmark)-[:IN_CATEGORY]->(c:Category)
RETURN c.name, count(b) AS count ORDER BY count DESC

// All bookmarks tagged 'rag'
MATCH (b:Bookmark)-[:TAGGED]->(t:Tag {name: 'rag'})
RETURN b.title, b.url ORDER BY b.score DESC

// Find what connects to 'langchain' tag (2 hops)
MATCH (t:Tag {name: 'langchain'})-[:CO_OCCURS_WITH*1..2]-(related:Tag)
RETURN related.name, count(*) AS strength ORDER BY strength DESC

// Similar bookmarks to a URL
MATCH (b:Bookmark {url: 'https://...'})-[r:SIMILAR_TO]-(other)
RETURN other.title, other.url, r.score ORDER BY r.score DESC

// Most connected domains
MATCH (b:Bookmark)-[:FROM_DOMAIN]->(d:Domain)
RETURN d.name, count(b) AS saved ORDER BY saved DESC LIMIT 20
```

---

## LangGraph Agent

Built with `create_react_agent` from LangGraph 1.0.x.

**Model:** Azure gpt-4o-mini (streaming enabled)
**Memory:** `MemorySaver`, so conversation history persists per `thread_id` within a session

**Tools:**

| Tool | Store | Description |
|------|-------|-------------|
| `search_semantic` | ChromaDB | Natural language vector search |
| `search_by_category` | ChromaDB | Filter by category + optional query |
| `find_by_tag` | Neo4j | Exact tag lookup |
| `find_similar_bookmarks` | Neo4j | SIMILAR_TO edge traversal |
| `explore_tag_cluster` | Neo4j | CO_OCCURS_WITH traversal (2 hops) |
| `get_stats` | Both | Count totals |
| `run_cypher` | Neo4j | Raw Cypher for power users |

**Agent routing:** The LLM decides which tool(s) to call based on the query. For "what do I know about RAG" it typically calls `search_semantic`, `search_by_category`, and `find_by_tag`. For "how does LangGraph connect to my Neo4j saves" it typically calls `explore_tag_cluster` and `run_cypher`.

---

## Gradio UI

Three tabs:

| Tab | What it does |
|-----|-------------|
| Chat | Full LangGraph agent conversation. Remembers context within session. |
| Search | Direct ChromaDB search with category filter, min score slider, result count. |
| Stats | Neo4j category breakdown + top tags. Loads on startup. |

Run: `python openmark/ui/app.py` → `http://localhost:7860`

---

## Data Flow Summary

```
Source files (JSON, HTML)
        │
   merge.py → normalize.py
        │
   8,007 items with doc_text
        │
   EmbeddingProvider.embed_documents()
        │
   ┌────┴────┐
   │         │
ChromaDB   Neo4j
add()      MERGE nodes + relationships
           CO_OCCURS_WITH edges
           SIMILAR_TO edges (from ChromaDB top-5 per item)
```
```
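The last step, deriving `SIMILAR_TO` edges from each item's top-5 nearest neighbours, can be sketched with a brute-force cosine search. This is a stand-in for the ChromaDB lookup the pipeline uses, not OpenMark's code; `top_k_similar` and the in-memory `embeddings` dict are assumptions for illustration.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(embeddings: Dict[str, List[float]],
                  k: int = 5) -> Dict[str, List[Tuple[str, float]]]:
    """For each URL, return its k most similar other URLs with their scores."""
    edges = {}
    for url, vec in embeddings.items():
        scored = [
            (other, cosine(vec, other_vec))
            for other, other_vec in embeddings.items()
            if other != url  # never link a bookmark to itself
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        edges[url] = scored[:k]
    return edges
```

Each `(other, score)` pair corresponds to one `SIMILAR_TO {score}` relationship. The brute-force loop is quadratic in the number of items, which is why a vector index is the practical choice at 8,007 items.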