pawlaszc committed · verified · Commit 90c0a2c · Parent(s): e1fbcfc

Update README.md

  - type: accuracy
    value: 61.8
    name: Hard Queries Accuracy
---

# ForensicSQL-Llama-3.2-3B

## Model Description

**ForensicSQL** is a fine-tuned Llama 3.2 3B model specialised for generating SQLite queries for mobile forensics databases. The model converts natural language forensic investigation requests into executable SQL queries across various mobile app databases (WhatsApp, Signal, iOS Health, Android SMS, etc.).

This model was developed as part of a master's thesis investigating LLM fine-tuning for forensic database analysis.

## Model Details

- **Base Model:** Llama 3.2 3B Instruct
- **Fine-tuning Method:** LoRA (Low-Rank Adaptation)
- **Training Dataset:** 768 forensic SQL examples across 148 categories
- **Training Framework:** Hugging Face Transformers + PEFT
- **Model Size:**
  - Full (FP16): ~6 GB
  - GGUF Q4_K_M: ~2.3 GB
  - GGUF Q5_K_M: ~2.8 GB
  - GGUF Q8_0: ~3.8 GB

## Performance

### Overall Results
- **Overall Accuracy:** 79.0%
- **Schema Generation Errors:** 0% (completely eliminated)
- **Executable Queries:** 79%

### Breakdown by Difficulty
| Difficulty             | Accuracy | Examples |
|------------------------|----------|----------|
| Easy (single-table)    | 94.3%    | 33/35    |
| Medium (simple joins)  | 80.6%    | 25/31    |
| Hard (complex queries) | 61.8%    | 21/34    |

### Error Analysis
| Error Type           | Percentage | Description                        |
|----------------------|------------|------------------------------------|
| Column Hallucination | 18%        | References non-existent columns    |
| Syntax Errors        | 3%         | Invalid SQL syntax                 |
| Schema Generation    | 0%         | Eliminated through proper training |

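Because column hallucination dominates the error profile, many bad queries can be caught before execution by compiling them against the schema. A minimal sketch using Python's built-in `sqlite3` module (`query_is_valid` and the toy schema are illustrative, not part of the released tooling):

```python
import sqlite3

def query_is_valid(schema_sql: str, query: str) -> bool:
    """Check a generated query against the schema without running it on real data.

    Compiling the statement with EXPLAIN forces SQLite to resolve every
    table and column name, so hallucinated columns raise an error before
    any evidence database is touched.
    """
    con = sqlite3.connect(":memory:")
    try:
        con.executescript(schema_sql)
        con.execute("EXPLAIN " + query)
        return True
    except sqlite3.Error:
        return False
    finally:
        con.close()

schema = "CREATE TABLE messages (_id INTEGER PRIMARY KEY, body TEXT, read INTEGER);"
print(query_is_valid(schema, "SELECT body FROM messages WHERE read = 0"))  # → True
print(query_is_valid(schema, "SELECT sender FROM messages"))               # → False (hallucinated column)
```

A check like this rejects the 18% hallucinated-column cases and the 3% syntax-error cases before an investigator ever sees them.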
## Intended Use

### Primary Use Cases
- Mobile forensics investigations
- Automated SQL query generation for forensic databases
- Educational tool for learning forensic database analysis
- Research in text-to-SQL for specialized domains

### Out-of-Scope Use
- General-purpose SQL generation (use specialized models)
- Production systems requiring >95% accuracy
- Real-time critical forensic decisions without human review

## How to Use

### Quick Start (Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "pawlaszc/ForensicSQL-Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Prepare input
schema = """
CREATE TABLE messages (
    _id INTEGER PRIMARY KEY,
    address TEXT,
    body TEXT,
    date INTEGER,
    read INTEGER
);
"""

request = "Find all unread messages from yesterday"

prompt = f"""Generate a valid SQLite query for this forensic database request.

Database Schema:
{schema}

Request: {request}

SQLite Query:
"""

# Generate SQL
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=False,
    )

# Decode only the generated part
input_length = inputs['input_ids'].shape[1]
generated_tokens = outputs[0][input_length:]
sql = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print(sql.strip())
# Example output: SELECT * FROM messages WHERE read = 0 AND date > ...
```

### Using GGUF Files (llama.cpp / Ollama)

**With llama.cpp:**
```bash
# Download GGUF file
wget https://huggingface.co/pawlaszc/ForensicSQL-Llama-3.2-3B/resolve/main/forensic-sql-q4_k_m.gguf

# Run inference
./llama-cli -m forensic-sql-q4_k_m.gguf -p "Generate SQL..."
```

**With Ollama:**

Create a `Modelfile`:
```
FROM ./forensic-sql-q4_k_m.gguf
PARAMETER temperature 0
PARAMETER top_p 0.9
```

Then import and run:
```bash
ollama create forensic-sql -f Modelfile
ollama run forensic-sql "Schema: ...\nRequest: Find messages\nSQL:"
```

### Python Helper Class

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class ForensicSQLGenerator:
    def __init__(self, model_name="pawlaszc/ForensicSQL-Llama-3.2-3B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.model.eval()

    def generate_sql(self, schema: str, request: str) -> str:
        prompt = f"""Generate a valid SQLite query for this forensic database request.

Database Schema:
{schema}

Request: {request}

SQLite Query:
"""
        inputs = self.tokenizer(
            prompt,
            return_tensors="pt",
            truncation=True,
            max_length=2048
        )
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}

        input_length = inputs['input_ids'].shape[1]

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=200,
                do_sample=False,
            )

        generated_tokens = outputs[0][input_length:]
        sql = self.tokenizer.decode(generated_tokens, skip_special_tokens=True)

        # Keep only the first generated line and normalise the trailing semicolon
        return sql.strip().split("\n")[0].strip().rstrip(";") + ";"


# Usage
generator = ForensicSQLGenerator()
sql = generator.generate_sql(schema, request)
```

## Training Details

### Training Data
- **Size:** 768 examples (original) → 2,304 examples (with augmentation)
- **Categories:** 148 forensic database categories
- **Sources:** WhatsApp, Signal, iMessage, SMS, iOS apps, Android apps
- **Augmentation:** 3x paraphrase augmentation per example

### Training Procedure
- **Method:** LoRA fine-tuning
- **LoRA Rank:** 16
- **LoRA Alpha:** 32
- **Target Modules:** q_proj, k_proj, v_proj, o_proj
- **Epochs:** 5
- **Learning Rate:** 2e-5
- **Batch Size:** 1 (gradient accumulation: 4)
- **Max Sequence Length:** 2048 (critical for preventing truncation)
- **Optimizer:** AdamW
- **Scheduler:** Cosine with warmup (10%)
- **Hardware:** Apple M2 (MPS)
- **Training Time:** ~3.5 hours

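The LoRA hyperparameters above map onto a PEFT `LoraConfig` along these lines. This is a sketch, not the exact training script; `lora_dropout` and `bias` are assumptions, since the card does not report them:

```python
from peft import LoraConfig

# r, lora_alpha, and target_modules are taken from the list above;
# lora_dropout and bias are assumed values, not reported in this card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,  # assumed
    bias="none",        # assumed
    task_type="CAUSAL_LM",
)
```

With PEFT, a config like this is applied via `get_peft_model(base_model, lora_config)` before handing the model to the trainer.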
### Key Training Insights

**Critical Discovery: Sequence Length Matters**

Initial training attempts with `max_seq_length=512` resulted in only 50% accuracy because 92% of training examples were truncated. The model learned to generate schema definitions (CREATE TABLE) instead of queries.

Increasing to `max_seq_length=2048` eliminated truncation and improved accuracy from 50% to 79% (+29pp).

**Lesson:** Data preprocessing and proper sequence length configuration are critical for fine-tuning success.

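A quick length audit before training surfaces this kind of problem immediately. A minimal sketch of such a check — `truncation_report` is an illustrative helper, and the whitespace split stands in for the real Llama tokenizer's `encode`:

```python
def truncation_report(examples, tokenize, max_seq_length):
    """Return the fraction of examples longer than max_seq_length,
    plus the longest observed length, under the given tokenizer."""
    lengths = [len(tokenize(ex)) for ex in examples]
    over = sum(1 for n in lengths if n > max_seq_length)
    return over / len(lengths), max(lengths)

# Whitespace split stands in for tokenizer.encode in this sketch.
examples = ["SELECT * FROM messages;", "word " * 600, "word " * 300]
frac, longest = truncation_report(examples, str.split, 512)
print(f"{frac:.0%} of examples would be truncated (longest: {longest} tokens)")
# → 33% of examples would be truncated (longest: 600 tokens)
```

Running the same report with the actual tokenizer over the training set would have flagged the 92% truncation rate before any GPU time was spent.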
## Limitations

### Known Issues
1. **Column Hallucination (18%):** Model sometimes references non-existent columns
2. **Complex Joins:** Performance drops on multi-table queries requiring JOINs (61.8% on hard queries)
3. **Schema Understanding:** Limited understanding of foreign key relationships

### When to Use Human Review
- Complex multi-table queries
- Critical forensic investigations
- Queries involving data deletion or modification
- When accuracy >95% is required

## Evaluation

### Test Set
- **Size:** 100 queries (random sample from held-out data)
- **Seed:** 42 (reproducible)
- **Evaluation Metric:** Exact match (query results must match expected results)

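The exact-match metric can be computed by executing both the predicted and the reference query and comparing result sets. A minimal sketch (`exact_match` and the toy database are illustrative; this version ignores row order, which the actual evaluation may treat differently):

```python
import sqlite3

def exact_match(con, predicted_sql: str, gold_sql: str) -> bool:
    """Execution-based check: the predicted query counts as correct
    only if it returns the same rows as the reference query."""
    try:
        pred = sorted(con.execute(predicted_sql).fetchall())
    except sqlite3.Error:
        return False  # non-executable predictions count as failures
    gold = sorted(con.execute(gold_sql).fetchall())
    return pred == gold

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE messages (_id INTEGER PRIMARY KEY, body TEXT, read INTEGER);
INSERT INTO messages (body, read) VALUES ('hi', 0), ('ok', 1);
""")
print(exact_match(con, "SELECT body FROM messages WHERE read = 0",
                  "SELECT body FROM messages WHERE read != 1"))  # → True
```

Note that execution-based matching accepts syntactically different but semantically equivalent queries, as in the example above.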
### Ablation Studies

| Configuration                 | Accuracy | Notes          |
|-------------------------------|----------|----------------|
| Zero-shot baseline            | 45%      | No fine-tuning |
| Final training (max_len=2048) | 79%      | No truncation  |

## Citation

If you use this model in your research, please cite:

```bibtex
@mastersthesis{forensicsql2025,
  author = {Dirk Pawlaszczyk},
  title  = {Fine-Tuning Large Language Models for Forensic SQL Query Generation},
  school = {Hochschule Mittweida University of Applied Sciences},
  year   = {2026},
  type   = {Master's thesis}
}
```

## Model Card Authors

Dirk Pawlaszczyk

## Model Card Contact

For questions or issues, please open an issue on the GitHub repository (https://github.com/pawlaszczyk/forensic-sql) or contact pawlaszc@hs-mittweida.de.

## License

This model is released under the Apache 2.0 License, following the base Llama 3.2 license.

## Acknowledgments

- Base model: Meta's Llama 3.2 3B Instruct
- Training framework: Hugging Face Transformers, PEFT
- Dataset creation: Custom forensic database schemas
- Inspiration: Text-to-SQL research community

## Additional Resources

- **Dataset:** pawlaszc/mobile-forensics-sql
- **GitHub:** https://github.com/pawlaszc/forensic-sql
- **Paper:** [Link when published]
- **Demo:** [HuggingFace Space if you create one]

---

**Disclaimer:** This model is intended for research and educational purposes. Always validate generated SQL queries before execution in production forensic investigations. The model may produce incorrect queries that could lead to data loss or incorrect conclusions if used without proper review.