forlop commited on
Commit
5348691
·
verified ·
1 Parent(s): ff5169c

Upload README.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. README.md +125 -0
README.md ADDED
@@ -0,0 +1,125 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ - 'no'
6
+ base_model: unsloth/Qwen3.5-4B
7
+ tags:
8
+ - microdata.no
9
+ - ssb
10
+ - norwegian
11
+ - register-data
12
+ - lora
13
+ - gguf
14
+ - rag
15
+ - ollama
16
+ library_name: gguf
17
+ ---
18
+
19
+ # microdata.no copilot — v2.0 (q4_k_m GGUF)
20
+
21
+ A small, locally-deployable AI assistant fine-tuned to help users write
22
+ [microdata.no](https://microdata.no) scripts and answer questions about
23
+ Norwegian register-data variables published by [SSB (Statistics
24
+ Norway)](https://www.ssb.no/).
25
+
26
+ This repo hosts the deployed **q4_k_m quantised GGUF** (2.7 GB) plus an
27
+ Ollama `Modelfile` so the model can be pulled and run with one command.
28
+ The full source code (training, RAG, eval, deployment) and the technical
29
+ note live at **<https://github.com/forlop/microdata-no-copilot>**.
30
+
31
+ ## Quick start
32
+
33
+ ```bash
34
+ # Install Ollama if you don't have it yet:
35
+ # Linux/WSL: curl -fsSL https://ollama.com/install.sh | sh
36
+ # macOS: brew install ollama (or download from ollama.com)
37
+ # Windows: download OllamaSetup.exe from ollama.com
38
+
39
+ # Pull and run
40
+ ollama pull hf.co/forlop/microdata-copilot-v2:Q4_K_M
41
+ ollama run hf.co/forlop/microdata-copilot-v2:Q4_K_M
42
+ ```
43
+
44
+ For the full RAG-wrapped experience (retrieval over the live microdata.no
45
+ variable catalogue + a Streamlit web UI), clone the GitHub repo:
46
+
47
+ ```bash
48
+ git clone https://github.com/forlop/microdata-no-copilot
49
+ cd microdata-no-copilot
50
+ pip install -r requirements.txt streamlit
51
+ streamlit run rag/app.py
52
+ ```
53
+
54
+ ## What this is
55
+
56
+ - **Base model:** Qwen3.5-4B (Apache-2.0, via Unsloth's pre-quantised release).
57
+ - **Fine-tuning:** rank-32 LoRA, 3 epochs, ~1.5 h on a single 16 GB RTX 5070 Ti.
58
+ - **Training corpus:** ~1,400 cards distilled from 729 microdata.no variables,
59
+ ~100 manual sections, 40 example scripts, plus refusal/abstention cards.
60
+ - **Deployed quantisation:** q4_k_m via llama.cpp (2.7 GB on disk, runs on CPU
61
+ or GPU).
62
+ - **Designed for:** local deployment behind a thin retrieval layer (FAISS dense
63
+ + BM25 sparse + Reciprocal Rank Fusion). All data stays on the user's machine;
64
+ no API calls leave the network.
65
+
66
+ ## Honest evaluation
67
+
68
+ Measured under strict held-out + adversarial evaluation (80 prompts written
69
+ after the model was frozen, LLM-judge scorer with rubric locked before
70
+ seeing responses, syntax validator catching fictional commands):
71
+
72
+ | Class | Pass rate | What it measures |
73
+ |---|---|---|
74
+ | JAILBREAK | **100% (5/5)** | Refusing role-override, system-prompt extraction, confidentiality bypass |
75
+ | RAG (variable lookup) | **80% (8/10)** | Variable definitions, populations, valid periods — when retrieval succeeds |
76
+ | LANG (language matching) | **80% (4/5)** | Norwegian Q → Norwegian A, English Q → English A |
77
+ | SCRIPT (write a script) | 33% (5/15) | Real commands; failures are fabricated variable names |
78
+ | MANUAL (explain a command) | 29% (2/7) | Some command explanations are vague or partial |
79
+ | STALE (admit "I don't know") | **0% (0/5)** | Calibration weakness — doesn't say "I don't know" when it should |
80
+ | **Overall** | **53.8% (43/80)** | Strict-eval pass rate |
81
+
82
+ Refusal and jailbreak resistance are essentially solid. Retrieval-grounded
83
+ lookup works when retrieval succeeds. The model's main failure mode is
84
+ fabricating variable names when asked to *suggest* one (rather than confirm
85
+ a known one), and not calibrating uncertainty well.
86
+
87
+ A lenient substring-based scorer on a 46-prompt iteration set reports
88
+ **82.6%** — that's real but it measures performance on prompts we iterated
89
+ *against*. The 53.8% is the honest out-of-sample number.
90
+
91
+ Full evaluation methodology and class-level breakdown:
92
+ [TECHNICAL_NOTE.md §17](https://github.com/forlop/microdata-no-copilot/blob/main/TECHNICAL_NOTE.md#17-deployed-system-eval-strict-held-out--adversarial)
93
+ on GitHub.
94
+
95
+ ## Limitations
96
+
97
+ - **Not a finished product.** 53.8% strict pass-rate is below what a
98
+ researcher can rely on without verification. Treat as a research preview.
99
+ - **Variable name hallucination.** When asked to suggest variables for a
100
+ task (rather than confirm a specific one), the model invents plausible
101
+ but non-existent names. The RAG layer mitigates this when the user names
102
+ a variable; it doesn't fix open-ended suggestion.
103
+ - **Domain-specific.** This model is useful only for microdata.no scripting
104
+ and SSB register-data variables. It is not a general-purpose chatbot.
105
+ - **Single-turn training.** The cards are single-turn user/assistant pairs.
106
+ Multi-turn behaviour is emergent and degrades faster than a chat-tuned
107
+ foundation model would. The CLI/Streamlit front-ends use small windows
108
+ (3 exchanges) to compensate.
109
+
110
+ ## Citation
111
+
112
+ If you reference this work:
113
+
114
+ ```bibtex
115
+ @misc{zhang2026microdata,
116
+ title = {microdata.no copilot: a locally-deployed LoRA + RAG assistant for SSB register data},
117
+ author = {Tao Zhang},
118
+ year = {2026},
119
+ url = {https://github.com/forlop/microdata-no-copilot}
120
+ }
121
+ ```
122
+
123
+ ## License
124
+
125
+ MIT. See [LICENSE](https://github.com/forlop/microdata-no-copilot/blob/main/LICENSE).