SandyVeliz committed 760e282 (verified) · 1 parent: becb252

Upload README.md with huggingface_hub

README.md (+316 -204)
@@ -1,204 +1,316 @@
- ---
- license: apache-2.0
- language:
-   - en
-   - es
- base_model: Qwen/Qwen3.5-9B
- tags:
-   - knowledge-graph
-   - entity-extraction
-   - relation-extraction
-   - structured-output
-   - json
-   - topic-detection
-   - acervo
-   - fine-tuned
-   - LoRA
- datasets:
-   - custom
- pipeline_tag: text-generation
- library_name: transformers
- model-index:
-   - name: acervo-extractor-v2
-     results:
-       - task:
-           type: structured-output
-           name: Knowledge Graph Extraction
-         metrics:
-           - name: JSON Parse Rate
-             type: accuracy
-             value: 100
-           - name: Extraction Accuracy
-             type: accuracy
-             value: 85
- ---
-
- # Acervo Extractor v2
-
- A fine-tuned version of [Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) specialized in **knowledge graph extraction** from conversations. Given a conversation turn and existing graph context, the model outputs structured JSON with intent classification, topic detection, retrieval decision, entities, relations, and facts.
-
- > **Base model:** Qwen3.5-9B | **Method:** QLoRA (4-bit, r=16, alpha=32) | **Training:** ~1,000 examples, 3 epochs
-
- Built for [Acervo](https://github.com/SandyVeliz/acervo) — a semantic compression layer for AI agents that replaces raw conversation history with compressed knowledge graph nodes.
-
- ## What it does
-
- **Input:** A conversation message + existing graph nodes as context
-
- **Output:** Structured JSON with:
- - **Intent classification**: overview / specific / chat / followup
- - **Topic classification**: same / subtopic / changed
- - **Retrieval decision**: summary_only / with_chunks
- - **Entities**: people, projects, technologies, events, places, etc.
- - **Relations**: uses_technology, maintains, part_of, participated_in, etc.
- - **Facts**: specific claims attached to existing entities
-
- ### Example
-
- **Input:**
- ```
- EXISTING NODES:
- [{"id": "beacon", "label": "Beacon", "type": "project", "layer": "PERSONAL"}]
-
- TOPIC HINT: same (high confidence from keyword match)
- CURRENT TOPIC: Beacon development
-
- PREVIOUS ASSISTANT: How's the project going?
- USER: Beacon ya tiene 50 mil usuarios y estamos migrando a Kubernetes.
- ```
-
- **Output:**
- ```json
- {
-   "topic": {"action": "same"},
-   "entities": [
-     {
-       "id": "kubernetes",
-       "label": "Kubernetes",
-       "type": "technology",
-       "layer": "UNIVERSAL",
-       "attributes": {},
-       "facts": [],
-       "existing_id": null
-     }
-   ],
-   "relations": [
-     {"source": "beacon", "target": "kubernetes", "relation": "uses_technology"}
-   ],
-   "facts": [
-     {"entity": "beacon", "text": "Has 50,000 users", "speaker": "user"}
-   ]
- }
- ```
-
- ## Key capabilities
-
- | Capability | Description |
- |---|---|
- | **Bilingual** | Handles English and Spanish input natively |
- | **Empty output** | Returns empty arrays for small talk and pure queries (no hallucinated entities) |
- | **Dedup awareness** | References existing nodes via `existing_id` instead of creating duplicates |
- | **Implicit references** | Maps "our project", "the app", "Alice's work" to existing graph nodes |
- | **Event extraction** | Creates event nodes with participants, narrative position, and chronological markers |
- | **Controlled vocabulary** | Uses strict enums for types (8) and relations (15) |
- | **Topic detection** | Classifies same/subtopic/changed with optional hint from upstream classifiers |
-
- ## Training details
-
- | Parameter | Value |
- |---|---|
- | **Base model** | Qwen/Qwen3.5-9B |
- | **Method** | LoRA (QLoRA 4-bit) |
- | **Framework** | Unsloth + Transformers |
- | **Dataset size** | ~582 examples (450 base + 112 supplementary + 20 stress test) |
- | **Training** | Initial 3 epochs (lr=2e-4) + incremental 2 epochs (lr=5e-5) |
- | **Max sequence length** | 2048 |
- | **Languages** | English (~70%), Spanish (~30%) |
-
- ### Dataset composition
-
- | Category | % | Description |
- |---|---|---|
- | Facts about existing entities | 30% | "Our project has 50k users" → fact on existing node |
- | New entity extraction | 20% | First mentions of people, projects, technologies |
- | Empty output (small talk / queries) | 15% | "Thanks!", "What tech does X use?" → `[]` |
- | Topic changes | 10% | Implicit and explicit topic switches |
- | Subtopic shifts | 10% | Diving deeper into an aspect |
- | Literary events | 5% | Events with narrative_position and chronological_marker |
- | Corrections / updates | 5% | "We switched from React to Vue" |
- | Dedup / existing references | 5% | "nuestro proyecto" → existing_id: "beacon" |
-
- ## Schema
-
- ### Entity types (enum)
- ```
- person, organization, project, technology, place, event, document, concept
- ```
-
- ### Relation types (enum)
- ```
- part_of, created_by, maintains, works_at, member_of,
- uses_technology, depends_on, alternative_to,
- located_in, deployed_on, produces, serves, documented_in,
- participated_in, triggered_by, resulted_in
- ```
-
- ### Layers
- - **PERSONAL** — user owns, created, or directly uses it
- - **UNIVERSAL** — public knowledge (technologies, fictional characters, cities)
-
- ## Usage
-
- ### With Transformers + LoRA
-
- ```python
- from peft import PeftModel
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-9B", device_map="auto")
- model = PeftModel.from_pretrained(base_model, "SandyVeliz/acervo-extractor-qwen3.5-9b")
- tokenizer = AutoTokenizer.from_pretrained("SandyVeliz/acervo-extractor-qwen3.5-9b")
-
- messages = [
-     {"role": "system", "content": "You are a knowledge extractor for a personal knowledge graph. Analyze the conversation and return a single JSON object with topic classification, entities, relations, and facts. Output valid JSON only, no markdown, no explanation."},
-     {"role": "user", "content": "EXISTING NODES:\n[]\n\nTOPIC HINT: unresolved\nCURRENT TOPIC: null\n\nPREVIOUS ASSISTANT: null\nUSER: I work at Acme Corp building a React app called Beacon with PostgreSQL."}
- ]
-
- inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
- outputs = model.generate(inputs.to(model.device), max_new_tokens=1024, temperature=0.1)
- print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
- ```
-
- ### With Unsloth (recommended for inference)
-
- ```python
- from unsloth import FastLanguageModel
-
- model, tokenizer = FastLanguageModel.from_pretrained(
-     "SandyVeliz/acervo-extractor-qwen3.5-9b",
-     max_seq_length=2048, load_in_4bit=True,
- )
- FastLanguageModel.for_inference(model)
- ```
-
- ### With Acervo (intended use)
-
- ```python
- from acervo import Acervo, OpenAIClient
-
- llm = OpenAIClient(base_url="http://localhost:1234/v1", model="acervo-extractor")
- memory = Acervo(llm=llm, owner="user")
- ```
-
- ## Intended use
-
- This model is designed as the extraction component inside [Acervo](https://github.com/SandyVeliz/acervo), a semantic compression layer for AI agents. It replaces general-purpose LLM calls for topic detection and entity extraction with a specialized, faster model.
-
- It can also be used standalone for:
- - Building knowledge graphs from conversations
- - Structured entity/relation extraction from text
- - Topic detection in multi-turn dialogues
-
- ## License
-
- Apache 2.0, same as the base model.
+ ---
+ license: apache-2.0
+ language:
+   - en
+   - es
+ base_model: Qwen/Qwen3.5-9B
+ tags:
+   - knowledge-graph
+   - entity-extraction
+   - relation-extraction
+   - intent-classification
+   - structured-output
+   - json
+   - topic-detection
+   - acervo
+   - fine-tuned
+   - LoRA
+ datasets:
+   - custom
+ pipeline_tag: text-generation
+ library_name: transformers
+ model-index:
+   - name: acervo-extractor-v2
+     results:
+       - task:
+           type: structured-output
+           name: Knowledge Graph Extraction
+         metrics:
+           - name: JSON Parse Rate
+             type: accuracy
+             value: 100
+           - name: Extraction Accuracy
+             type: accuracy
+             value: 85
+ ---
+
+ # Acervo Extractor v2
+
+ A fine-tuned version of [Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) specialized in **knowledge graph extraction** from conversations. Given a conversation turn and existing graph context, the model outputs structured JSON with intent classification, topic detection, retrieval decision, entities, relations, and facts.
+
+ > **Base model:** Qwen3.5-9B | **Method:** QLoRA (4-bit, r=16, alpha=32) | **Training:** ~1,000 examples, 3 epochs
+
+ Built for [Acervo](https://github.com/SandyVeliz/acervo) — a semantic compression layer for AI agents that replaces raw conversation history with compressed knowledge graph nodes.
+
+ > **Supersedes:** [acervo-extractor-qwen3.5-9b](https://huggingface.co/SandyVeliz/acervo-extractor-qwen3.5-9b) (v1, deprecated)
+
+ ## What's new in v2
+
+ v1 only handled topic detection and entity extraction. v2 adds **intent classification** and **retrieval decision**, two fields that were previously handled by regex/keyword heuristics outside the model.
+
+ | Feature | v1 | v2 |
+ |---|---|---|
+ | Topic detection | same / subtopic / changed | same / subtopic / changed |
+ | **Intent classification** | - | overview / specific / chat / followup |
+ | **Retrieval decision** | - | summary_only / with_chunks |
+ | Entity extraction | 8 types, 15 relations | 8 types, 15 relations |
+ | **Code extraction** | - | Extract entities from code snippets |
+ | **Document extraction** | - | Extract from READMEs, changelogs, docs |
+ | **Prose extraction** | - | Extract characters, locations from literature |
+ | Training examples | 612 | ~1,000 |
+ | S1 Intent accuracy | 78% | 92%+ (target) |
+
+ ### Why intent matters
+
+ v1 benchmarks showed **78% intent accuracy**: overview questions misclassified as specific accounted for 6 of the 9 failures. This cascaded: wrong intent led to wrong retrieval strategy (56% S2 accuracy) and wrong budget allocation (32% S3 accuracy).
+
+ v2 trains the model to classify intent directly, replacing the external regex classifier.
+
+ ### Why retrieval matters
+
+ The `retrieval` field tells the system whether to fetch full document chunks or just use node summaries:
+ - `summary_only` — for overview questions, chat, conceptual queries (cheaper, faster)
+ - `with_chunks` — for code lookups, specific facts, detailed analysis (needs raw content)
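
A minimal sketch of how a caller might act on the `retrieval` field. The `fetch_summary` and `fetch_chunks` callables are hypothetical stand-ins for whatever storage layer is in use, not part of Acervo's API. Including the summary alongside the chunks, and treating any unknown value as `summary_only`, are assumptions of this sketch.

```python
from typing import Callable


def build_context(
    extraction: dict,
    node_id: str,
    fetch_summary: Callable[[str], str],        # hypothetical: node id -> summary text
    fetch_chunks: Callable[[str], list],        # hypothetical: node id -> raw chunks
) -> list:
    """Return the context pieces to pass to the downstream LLM."""
    # Assumption: the node summary is always included (the cheap path).
    context = [fetch_summary(node_id)]
    # Only pay for raw document chunks when the model asks for them.
    if extraction.get("retrieval") == "with_chunks":
        context.extend(fetch_chunks(node_id))
    return context
```

Falling back to the summary keeps a malformed or missing `retrieval` value from triggering the more expensive fetch.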
+
+ ## Output schema
+
+ ### v1 output (deprecated)
+ ```json
+ {
+   "topic": {"action": "same|changed|subtopic", "label": "..."},
+   "entities": [...],
+   "relations": [...],
+   "facts": [...]
+ }
+ ```
+
+ ### v2 output (new fields highlighted)
+ ```json
+ {
+   "intent": "overview|specific|chat|followup", // NEW
+   "topic": {"action": "same|changed|subtopic", "label": "..."},
+   "retrieval": "summary_only|with_chunks", // NEW
+   "entities": [...],
+   "relations": [...],
+   "facts": [...]
+ }
+ ```
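
The v2 shape above can be checked after generation. The following is a sketch only: `parse_extraction` is a hypothetical helper name, and the choice to raise `ValueError` on any unexpected value is an assumption about how strict a consumer wants to be.

```python
import json

INTENTS = {"overview", "specific", "chat", "followup"}
TOPIC_ACTIONS = {"same", "changed", "subtopic"}
RETRIEVAL_MODES = {"summary_only", "with_chunks"}


def parse_extraction(raw: str) -> dict:
    """Parse raw model output and check the v2 top-level fields."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if data.get("intent") not in INTENTS:
        raise ValueError(f"unexpected intent: {data.get('intent')!r}")
    topic = data.get("topic") or {}
    if topic.get("action") not in TOPIC_ACTIONS:
        raise ValueError(f"unexpected topic action: {topic.get('action')!r}")
    if data.get("retrieval") not in RETRIEVAL_MODES:
        raise ValueError(f"unexpected retrieval: {data.get('retrieval')!r}")
    # The three extraction fields are always arrays, possibly empty.
    for key in ("entities", "relations", "facts"):
        if not isinstance(data.get(key), list):
            raise ValueError(f"{key} must be a list")
    return data
```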
+
+ ## Intent types
+
+ | Intent | Description | Examples |
+ |---|---|---|
+ | `overview` | High-level summary, counts, listings, general info | "What is this project?", "How many files?", "Give me a summary" |
+ | `specific` | Precise detail, specific code, particular fact | "How does auth work?", "Show me the controller", "What's the deadline?" |
+ | `chat` | Casual conversation, acknowledgments, opinions | "Thanks", "That's interesting", "Ok", "Good job" |
+ | `followup` | Continuing previous topic with more depth | "Tell me more", "What about the other one?", "Expand on that" |
+
+ ## Examples
+
+ ### Intent: overview
+ ```
+ USER: What is this project about?
+ ```
+ ```json
+ {
+   "intent": "overview",
+   "topic": {"action": "same", "label": null},
+   "retrieval": "summary_only",
+   "entities": [],
+   "relations": [],
+   "facts": []
+ }
+ ```
+
+ ### Intent: specific (with extraction)
+ ```
+ USER: Beacon ya tiene 50 mil usuarios y estamos migrando a Kubernetes.
+ ```
+ ```json
+ {
+   "intent": "specific",
+   "topic": {"action": "same", "label": null},
+   "retrieval": "with_chunks",
+   "entities": [
+     {
+       "id": "kubernetes",
+       "label": "Kubernetes",
+       "type": "technology",
+       "layer": "UNIVERSAL",
+       "attributes": {},
+       "facts": [],
+       "existing_id": null
+     }
+   ],
+   "relations": [
+     {"source": "beacon", "target": "kubernetes", "relation": "uses_technology"}
+   ],
+   "facts": [
+     {"entity": "beacon", "text": "Has 50,000 users", "speaker": "user"}
+   ]
+ }
+ ```
+
+ ### Intent: chat (empty output)
+ ```
+ USER: That's interesting, thanks!
+ ```
+ ```json
+ {
+   "intent": "chat",
+   "topic": {"action": "same", "label": null},
+   "retrieval": "summary_only",
+   "entities": [],
+   "relations": [],
+   "facts": []
+ }
+ ```
+
+ ### Intent: followup
+ ```
+ PREVIOUS ASSISTANT: The auth module uses JWT tokens with 24-hour expiry.
+ USER: Tell me more about that.
+ ```
+ ```json
+ {
+   "intent": "followup",
+   "topic": {"action": "same", "label": null},
+   "retrieval": "with_chunks",
+   "entities": [],
+   "relations": [],
+   "facts": []
+ }
+ ```
+
+ ## Key capabilities
+
+ | Capability | Description |
+ |---|---|
+ | **Intent classification** | Classifies user intent to drive retrieval strategy |
+ | **Retrieval decision** | Decides summary_only vs with_chunks for downstream pipeline |
+ | **Bilingual** | Handles English and Spanish input natively |
+ | **Empty output** | Returns empty arrays for small talk and pure queries (no hallucinated entities) |
+ | **Dedup awareness** | References existing nodes via `existing_id` instead of creating duplicates |
+ | **Code extraction** | Extracts technologies, patterns, and dependencies from code snippets |
+ | **Document extraction** | Extracts entities from READMEs, changelogs, sprint reviews, API docs |
+ | **Prose extraction** | Extracts characters, locations, events from literature and narratives |
+ | **Controlled vocabulary** | Uses strict enums for types (8) and relations (15) |
+ | **Topic detection** | Classifies same/subtopic/changed with optional hint from upstream classifiers |
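
To illustrate the dedup behaviour, here is a rough sketch of merging one extraction into an in-memory graph. The graph layout (a dict keyed by node id, each node holding a `facts` list) and the `merge_extraction` name are hypothetical; only the `existing_id`, `id`, `label`, `type`, `layer`, and `facts` field names come from the examples in this card. Relations are left out for brevity.

```python
def merge_extraction(graph: dict, extraction: dict) -> dict:
    """Merge extracted entities and facts into graph (node id -> node dict)."""
    for entity in extraction.get("entities", []):
        # A non-null existing_id points at a node that is already in the graph,
        # so no duplicate node is created for it.
        node_id = entity.get("existing_id") or entity["id"]
        if node_id not in graph:
            graph[node_id] = {
                "label": entity["label"],
                "type": entity["type"],
                "layer": entity["layer"],
                "facts": [],
            }
    for fact in extraction.get("facts", []):
        # Assumption: facts that reference an unknown node are skipped.
        node = graph.get(fact["entity"])
        if node is not None and fact["text"] not in node["facts"]:
            node["facts"].append(fact["text"])
    return graph
```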
+
+ ## Training details
+
+ | Parameter | Value |
+ |---|---|
+ | **Base model** | Qwen/Qwen3.5-9B |
+ | **Method** | LoRA (QLoRA 4-bit, r=16, alpha=32) |
+ | **Framework** | Unsloth + Transformers + TRL |
+ | **Dataset size** | ~1,000 examples |
+ | **Training** | v1 base (3 epochs, lr=2e-4) + v2 incremental (2 epochs, lr=5e-5) + v3 intent+retrieval (3 epochs, lr=5e-5) |
+ | **Max sequence length** | 2048 |
+ | **Languages** | English (~65%), Spanish (~35%) |
+ | **Hardware** | NVIDIA RTX 5070 Ti (16GB VRAM) |
+
+ ### Dataset composition
+
+ | Category | Count | Description |
+ |---|---|---|
+ | Conversation extraction (v1) | 350 | Facts, entities, relations from conversations |
+ | Topic detection (v1) | 120 | Topic changes, subtopics |
+ | Empty output (v1) | 90 | Small talk, queries with no extraction |
+ | Corrections / dedup (v1) | 52 | "We switched from React to Vue", existing references |
+ | Stress / edge cases (v1) | 22 | Edge cases from v1 testing |
+ | **Intent classification (v2)** | **100** | Overview, specific, chat, followup examples |
+ | **Retrieval decision (v2)** | **80** | summary_only vs with_chunks |
+ | **Code extraction (v2)** | **50** | TypeScript, Python, YAML, Docker, SQL |
+ | **Literature extraction (v2)** | **40** | Characters, locations, events from prose |
+ | **Documentation extraction (v2)** | **40** | READMEs, changelogs, sprint reviews, API docs |
+ | **S1.5 improvement (v2)** | **30** | Extracting from assistant responses |
+ | **S1 failure variations (v2)** | **50** | Variations of 9 v0.4 benchmark failures |
+
+ ## Schema
+
+ ### Entity types (enum)
+ ```
+ person, organization, project, technology, place, event, document, concept
+ ```
+
+ ### Relation types (enum)
+ ```
+ part_of, created_by, maintains, works_at, member_of,
+ uses_technology, depends_on, alternative_to,
+ located_in, deployed_on, produces, serves, documented_in,
+ participated_in, triggered_by, resulted_in
+ ```
+
+ ### Layers
+ - **PERSONAL** — user owns, created, or directly uses it
+ - **UNIVERSAL** — public knowledge (technologies, fictional characters, cities)
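
Because the vocabulary is closed, a consumer can check each extraction against it. The sets below are copied verbatim from the lists above; the `schema_violations` helper and its message format are hypothetical, offered as one possible way to surface out-of-vocabulary values rather than as part of the model or of Acervo.

```python
ENTITY_TYPES = {
    "person", "organization", "project", "technology",
    "place", "event", "document", "concept",
}
RELATION_TYPES = {
    "part_of", "created_by", "maintains", "works_at", "member_of",
    "uses_technology", "depends_on", "alternative_to",
    "located_in", "deployed_on", "produces", "serves", "documented_in",
    "participated_in", "triggered_by", "resulted_in",
}
LAYERS = {"PERSONAL", "UNIVERSAL"}


def schema_violations(extraction: dict) -> list:
    """Return human-readable problems; an empty list means the output conforms."""
    problems = []
    for entity in extraction.get("entities", []):
        if entity.get("type") not in ENTITY_TYPES:
            problems.append(f"entity {entity.get('id')!r}: unknown type {entity.get('type')!r}")
        if entity.get("layer") not in LAYERS:
            problems.append(f"entity {entity.get('id')!r}: unknown layer {entity.get('layer')!r}")
    for relation in extraction.get("relations", []):
        if relation.get("relation") not in RELATION_TYPES:
            problems.append(f"unknown relation {relation.get('relation')!r}")
    return problems
```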
+
+ ## Usage
+
+ ### With LM Studio / Ollama (GGUF)
+
+ Download the GGUF file from the `gguf/` folder and load it in LM Studio. The model appears as **acervo-extractor-v2**.
+
+ ### With Transformers + LoRA
+
+ ```python
+ from peft import PeftModel
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-9B", device_map="auto")
+ model = PeftModel.from_pretrained(base_model, "SandyVeliz/acervo-extractor-v2")
+ tokenizer = AutoTokenizer.from_pretrained("SandyVeliz/acervo-extractor-v2")
+
+ messages = [
+     {"role": "system", "content": "You are a knowledge extractor for a personal knowledge graph. Analyze the conversation and return a single JSON object with: intent, topic, retrieval, entities, relations, and facts.\n\nIntent — classify the user's intent:\n- \"overview\": user wants a high-level summary, project description, general information, counts, or listings.\n- \"specific\": user wants a precise detail, specific code, a particular fact, or a specific section.\n- \"chat\": casual conversation, greetings, acknowledgments, opinions, or thanks.\n- \"followup\": continuing the previous topic with more depth, \"tell me more\", or referencing something just discussed.\n\nRetrieval — decide what data the system should fetch:\n- \"summary_only\": the node summary is enough (overview, chat, conceptual questions).\n- \"with_chunks\": the user needs specific content from documents (code lookups, specific facts, detailed analysis).\n\nOutput valid JSON only, no markdown, no explanation."},
+     {"role": "user", "content": "EXISTING NODES:\n[]\n\nTOPIC HINT: unresolved\nCURRENT TOPIC: null\n\nPREVIOUS ASSISTANT: null\nUSER: I work at Acme Corp building a React app called Beacon with PostgreSQL."}
+ ]
+
+ inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
+ # temperature only takes effect when sampling is enabled
+ outputs = model.generate(inputs.to(model.device), max_new_tokens=1024, do_sample=True, temperature=0.1)
+ print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
+ ```
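
The user message in the example above follows a fixed layout. A small helper along these lines could assemble it; `build_user_prompt` and its parameter names are hypothetical, and rendering missing values as the literal string `null` is inferred from the example rather than from a documented spec.

```python
import json
from typing import Optional


def build_user_prompt(
    user_message: str,
    existing_nodes: Optional[list] = None,
    topic_hint: str = "unresolved",
    current_topic: Optional[str] = None,
    previous_assistant: Optional[str] = None,
) -> str:
    """Assemble the user-turn text in the layout shown in the usage example."""
    # ensure_ascii=False keeps Spanish labels readable in the prompt.
    nodes_json = json.dumps(existing_nodes or [], ensure_ascii=False)
    topic = current_topic if current_topic is not None else "null"
    previous = previous_assistant if previous_assistant is not None else "null"
    return (
        f"EXISTING NODES:\n{nodes_json}\n\n"
        f"TOPIC HINT: {topic_hint}\n"
        f"CURRENT TOPIC: {topic}\n\n"
        f"PREVIOUS ASSISTANT: {previous}\n"
        f"USER: {user_message}"
    )
```

Calling it with only the user message reproduces the `content` string of the user turn in the Transformers example.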
+
+ ### With Unsloth (recommended for inference)
+
+ ```python
+ from unsloth import FastLanguageModel
+
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     "SandyVeliz/acervo-extractor-v2",
+     max_seq_length=2048, load_in_4bit=True,
+ )
+ FastLanguageModel.for_inference(model)
+ ```
+
+ ### With Acervo (intended use)
+
+ ```python
+ from acervo import Acervo, OpenAIClient
+
+ llm = OpenAIClient(base_url="http://localhost:1234/v1", model="acervo-extractor-v2")
+ memory = Acervo(llm=llm, owner="user")
+ ```
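
The `base_url` above suggests a local OpenAI-compatible server, which is what LM Studio provides. Under that assumption, the request can also be built by hand with the standard library. `build_chat_request` is a hypothetical helper, and the `/chat/completions` route, `temperature`, and `max_tokens` fields follow the OpenAI chat API convention rather than anything stated in this card.

```python
import json
import urllib.request


def build_chat_request(
    system_prompt: str,
    user_prompt: str,
    base_url: str = "http://localhost:1234/v1",
    model: str = "acervo-extractor-v2",
) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for a local server."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.1,
        "max_tokens": 1024,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it; on an OpenAI-compatible server the generated JSON is expected under `choices[0].message.content` in the response body.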
+
+ ## Intended use
+
+ This model is designed as the extraction component inside [Acervo](https://github.com/SandyVeliz/acervo), a semantic compression layer for AI agents. It replaces general-purpose LLM calls for topic detection, intent classification, and entity extraction with a specialized, faster model.
+
+ It can also be used standalone for:
+ - Building knowledge graphs from conversations
+ - Structured entity/relation extraction from text
+ - Topic detection in multi-turn dialogues
+ - Intent classification for conversational AI
+ - Retrieval strategy decisions (RAG pipelines)
+
+ ## Version history
+
+ | Version | Repo | Examples | Key changes |
+ |---|---|---|---|
+ | v1 | [acervo-extractor-qwen3.5-9b](https://huggingface.co/SandyVeliz/acervo-extractor-qwen3.5-9b) | 612 | Topic detection + entity extraction |
+ | **v2** | **acervo-extractor-v2** | **~1,000** | **+ Intent classification, retrieval decision, code/doc/prose extraction** |
+
+ ## License
+
+ Apache 2.0 — same as the base model.