pawlaszc committed · Commit 8ec29e4 · verified · Parent: 8597c7c

Upload MODEL_CARD.md

Files changed (1): MODEL_CARD.md (+335 −0, added)
---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- sql
- forensics
- text-to-sql
- llama
- fine-tuned
base_model: unsloth/Llama-3.2-3B-Instruct
datasets:
- pawlaszc/mobile-forensics-sql
metrics:
- accuracy
model-index:
- name: ForensicSQL-Llama-3.2-3B
  results:
  - task:
      type: text-to-sql
      name: Text-to-SQL Generation
    dataset:
      type: mobile-forensics
      name: Mobile Forensics SQL Dataset
    metrics:
    - type: accuracy
      value: 79.0
      name: Overall Accuracy
    - type: accuracy
      value: 94.3
      name: Easy Queries Accuracy
    - type: accuracy
      value: 80.6
      name: Medium Queries Accuracy
    - type: accuracy
      value: 61.8
      name: Hard Queries Accuracy
---

# ForensicSQL-Llama-3.2-3B

## Model Description

**ForensicSQL** is a fine-tuned Llama 3.2 3B model specialized in generating SQLite queries for mobile forensics databases. The model converts natural-language forensic investigation requests into executable SQL queries across a range of mobile app databases (WhatsApp, Signal, iOS Health, Android SMS, etc.).

This model was developed as part of a master's thesis investigating LLM fine-tuning for forensic database analysis.

## Model Details

- **Base Model:** Llama 3.2 3B Instruct
- **Fine-tuning Method:** LoRA (Low-Rank Adaptation)
- **Training Dataset:** 768 forensic SQL examples across 148 categories
- **Training Framework:** Hugging Face Transformers + PEFT
- **Model Size:**
  - Full (FP16): ~6 GB
  - GGUF Q4_K_M: ~2.3 GB
  - GGUF Q5_K_M: ~2.8 GB
  - GGUF Q8_0: ~3.8 GB

## Performance

### Overall Results
- **Overall Accuracy:** 79.0%
- **Schema Generation Errors:** 0% (completely eliminated)
- **Executable Queries:** 79%

### Breakdown by Difficulty
| Difficulty | Accuracy | Examples |
|------------------------|----------|----------|
| Easy (single-table) | 94.3% | 33/35 |
| Medium (simple joins) | 80.6% | 25/31 |
| Hard (complex queries) | 61.8% | 21/34 |

### Error Analysis
| Error Type | Percentage | Description |
|----------------------|------------|------------------------------------|
| Column Hallucination | 18% | References non-existent columns |
| Syntax Errors | 3% | Invalid SQL syntax |
| Schema Generation | 0% | Eliminated through proper training |

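Because column hallucination and syntax errors dominate the failure modes, generated queries can be pre-checked cheaply before they touch evidence. A minimal sketch (not part of the released tooling) using Python's built-in `sqlite3`: compile the candidate query with `EXPLAIN` against an empty copy of the schema, which rejects unknown columns and invalid syntax without executing anything.

```python
import sqlite3

def validate_query(schema_ddl: str, sql: str):
    """Compile a generated query against an empty copy of the schema.

    EXPLAIN forces SQLite to compile the statement without executing it,
    so hallucinated columns ("no such column") and syntax errors surface
    here rather than against real evidence data.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)   # build empty tables from the DDL
        conn.execute(f"EXPLAIN {sql}")   # compile only, nothing is run
        return True, "ok"
    except sqlite3.OperationalError as exc:
        return False, str(exc)
    finally:
        conn.close()

schema = """
CREATE TABLE messages (
    _id INTEGER PRIMARY KEY,
    address TEXT,
    body TEXT,
    date INTEGER,
    read INTEGER
);
"""

ok, msg = validate_query(schema, "SELECT body FROM messages WHERE read = 0")
bad, err = validate_query(schema, "SELECT sender FROM messages")  # hallucinated column
```

A failed check is a signal to regenerate or fall back to human review, not a guarantee of semantic correctness: a query can compile cleanly and still answer the wrong question.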
## Intended Use

### Primary Use Cases
- Mobile forensics investigations
- Automated SQL query generation for forensic databases
- Educational tool for learning forensic database analysis
- Research in text-to-SQL for specialized domains

### Out-of-Scope Use
- General-purpose SQL generation (use specialized models)
- Production systems requiring >95% accuracy
- Real-time critical forensic decisions without human review

## How to Use

### Quick Start (Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "pawlaszc/ForensicSQL-Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Prepare input
schema = """
CREATE TABLE messages (
    _id INTEGER PRIMARY KEY,
    address TEXT,
    body TEXT,
    date INTEGER,
    read INTEGER
);
"""

request = "Find all unread messages from yesterday"

prompt = f"""Generate a valid SQLite query for this forensic database request.

Database Schema:
{schema}

Request: {request}

SQLite Query:
"""

# Generate SQL
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=False,
    )

# Decode only the generated part
input_length = inputs['input_ids'].shape[1]
generated_tokens = outputs[0][input_length:]
sql = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print(sql.strip())
# Output: SELECT * FROM messages WHERE read = 0 AND date > ...
```

### Using GGUF Files (llama.cpp / Ollama)

**With llama.cpp:**
```bash
# Download GGUF file
wget https://huggingface.co/pawlaszc/ForensicSQL-Llama-3.2-3B/resolve/main/forensic-sql-q4_k_m.gguf

# Run inference
./llama-cli -m forensic-sql-q4_k_m.gguf -p "Generate SQL..."
```

**With Ollama:** create a `Modelfile` with the following contents:
```
FROM ./forensic-sql-q4_k_m.gguf
PARAMETER temperature 0
PARAMETER top_p 0.9
```

Then import and run the model:
```bash
# Import
ollama create forensic-sql -f Modelfile

# Use
ollama run forensic-sql "Schema: ...\nRequest: Find messages\nSQL:"
```

### Python Helper Class

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class ForensicSQLGenerator:
    def __init__(self, model_name="pawlaszc/ForensicSQL-Llama-3.2-3B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.model.eval()

    def generate_sql(self, schema: str, request: str) -> str:
        # Prompt layout matches the Quick Start example; the body stays
        # flush-left so no extra indentation enters the prompt.
        prompt = f"""Generate a valid SQLite query for this forensic database request.

Database Schema:
{schema}

Request: {request}

SQLite Query:
"""
        inputs = self.tokenizer(
            prompt,
            return_tensors="pt",
            truncation=True,
            max_length=2048
        )
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}

        input_length = inputs['input_ids'].shape[1]

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=200,
                do_sample=False,
            )

        generated_tokens = outputs[0][input_length:]
        sql = self.tokenizer.decode(generated_tokens, skip_special_tokens=True)

        # Keep only the first generated line and normalize the semicolon
        return sql.strip().split("\n")[0].strip().rstrip(";") + ";"


# Usage
generator = ForensicSQLGenerator()
sql = generator.generate_sql(schema, request)
```

## Training Details

### Training Data
- **Size:** 768 examples (original) → 2,304 examples (with augmentation)
- **Categories:** 148 forensic database categories
- **Sources:** WhatsApp, Signal, iMessage, SMS, iOS apps, Android apps
- **Augmentation:** 3x paraphrase augmentation per example

### Training Procedure
- **Method:** LoRA fine-tuning
- **LoRA Rank:** 16
- **LoRA Alpha:** 32
- **Target Modules:** q_proj, k_proj, v_proj, o_proj
- **Epochs:** 5
- **Learning Rate:** 2e-5
- **Batch Size:** 1 (gradient accumulation: 4)
- **Max Sequence Length:** 2048 (critical for preventing truncation)
- **Optimizer:** AdamW
- **Scheduler:** Cosine with warmup (10%)
- **Hardware:** Apple M2 (MPS)
- **Training Time:** ~3.5 hours

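The hyperparameters above translate directly into a PEFT `LoraConfig`. A minimal sketch for illustration, not the thesis training script; `lora_dropout` is an assumed placeholder, since the card does not report a dropout value:

```python
from peft import LoraConfig

# LoRA settings as listed above; only the attention projections
# receive trainable adapters.
lora_config = LoraConfig(
    r=16,                                   # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,                      # assumed, not from the card
    bias="none",
    task_type="CAUSAL_LM",
)
```

Wrapping the base model with `get_peft_model(base_model, lora_config)` then leaves only the adapter weights trainable, which is what makes fine-tuning a 3B model feasible on an Apple M2.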
### Key Training Insights

**Critical Discovery: Sequence Length Matters**

Initial training runs with `max_seq_length=512` reached only 50% accuracy because 92% of training examples were truncated; the model learned to generate schema definitions (CREATE TABLE statements) instead of queries.

Increasing to `max_seq_length=2048` eliminated truncation and improved accuracy from 50% to 79% (+29 pp).

**Lesson:** Data preprocessing and a sequence length long enough to hold full examples are critical for fine-tuning success.

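This truncation problem can be detected before training by tokenizing every example and counting overflows. A stand-alone sketch of the check (the whitespace `count_tokens` default is a stand-in; with the real tokenizer you would pass e.g. `lambda s: len(tokenizer(s)["input_ids"])`, since the model's own token counts are what matter):

```python
def truncation_rate(examples, max_seq_length, count_tokens=lambda s: len(s.split())):
    """Fraction of examples whose token count exceeds max_seq_length."""
    too_long = sum(1 for ex in examples if count_tokens(ex) > max_seq_length)
    return too_long / len(examples)

# Stand-in corpus: at max_seq_length=512 two of three examples overflow,
# mirroring (in miniature) the 92% truncation the card reports at 512.
examples = ["short example", "word " * 600, "word " * 700]
rate_512 = truncation_rate(examples, 512)
rate_2048 = truncation_rate(examples, 2048)
```

A rate near zero at the chosen `max_seq_length` is the condition the final training run satisfied.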
## Limitations

### Known Issues
1. **Column Hallucination (18%):** The model sometimes references non-existent columns
2. **Complex Joins:** Performance drops to 61.8% on multi-table queries requiring JOINs
3. **Schema Understanding:** Limited understanding of foreign key relationships

### When to Use Human Review
- Complex multi-table queries
- Critical forensic investigations
- Queries involving data deletion or modification
- When accuracy >95% is required

## Evaluation

### Test Set
- **Size:** 100 queries (random sample from held-out data)
- **Seed:** 42 (reproducible)
- **Evaluation Metric:** Exact match (query results must match expected results)

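"Exact match" here is an execution match: the predicted query must return the same rows as the reference query when run on the test database. A minimal sketch of that comparison (the sorted-row convention and miss-on-error handling are assumptions; the actual evaluation harness may differ):

```python
import sqlite3

def execution_match(conn, predicted_sql, gold_sql):
    """True if both queries return the same multiset of rows."""
    try:
        pred_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # an unexecutable prediction counts as a miss
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)

# Throwaway database standing in for the real forensic test set
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE messages (_id INTEGER, read INTEGER);"
    "INSERT INTO messages VALUES (1, 0), (2, 1), (3, 0);"
)

hit = execution_match(conn, "SELECT _id FROM messages WHERE read = 0",
                      "SELECT _id FROM messages WHERE read = 0 ORDER BY _id")
miss = execution_match(conn, "SELECT _id FROM messages",
                       "SELECT _id FROM messages WHERE read = 0")
```

Comparing sorted rows makes the metric insensitive to row ordering, so a prediction that differs from the reference only by an `ORDER BY` clause still counts as a hit.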
### Ablation Studies

| Configuration | Accuracy | Notes |
|--------------------------------|----------|----------------------|
| Zero-shot baseline | 45% | No fine-tuning |
| Final training (max_len=2048) | 79% | No truncation |

## Citation

If you use this model in your research, please cite:

```bibtex
@mastersthesis{forensicsql2025,
  author = {Dirk Pawlaszczyk},
  title  = {Fine-Tuning Large Language Models for Forensic SQL Query Generation},
  school = {Hochschule Mittweida University of Applied Sciences},
  year   = {2026},
  type   = {Master's thesis}
}
```

## Model Card Authors

Dirk Pawlaszczyk

## Model Card Contact

For questions or issues, please open an issue on the GitHub repository (https://github.com/pawlaszczyk/forensic-sql) or contact pawlaszc@hs-mittweida.de.

## License

This model is released under the Apache 2.0 License. Note that the base Llama 3.2 weights are distributed under Meta's Llama 3.2 Community License, whose terms continue to apply to the underlying model.

## Acknowledgments

- Base model: Meta's Llama 3.2 3B Instruct
- Training framework: Hugging Face Transformers, PEFT
- Dataset creation: Custom forensic database schemas
- Inspiration: Text-to-SQL research community

## Additional Resources

- **Dataset:** pawlaszc/mobile-forensics-sql
- **GitHub:** https://github.com/pawlaszc/forensic-sql
- **Paper:** [Link when published]
- **Demo:** [HuggingFace Space if you create one]

---

**Disclaimer:** This model is intended for research and educational purposes. Always validate generated SQL queries before execution in production forensic investigations. The model may produce incorrect queries that could lead to data loss or incorrect conclusions if used without proper review.