# gemma3n-qa-v4-fixed

A fine-tuned Gemma 3n model for document-grounded question answering that sharply reduces hallucination and knows when to say "I don't know."

| Metric | This Model | Baseline | Improvement |
|---|---|---|---|
| Exact Match | 83.2% | 22.0% | +61.2 pts |
| Token F1 | 90.0% | 34.8% | +55.2 pts |
| Abstention F1 | 98.9% | ~0% | +98.9 pts |
## TL;DR
This model answers questions only from provided context. When the answer isn't there, it says NOT FOUND IN DOCUMENTS instead of making things up.
**The problem it solves:** the baseline Gemma 3n hallucinates answers that are not in the context. Ask "Who is the president of France?" with context about the Eiffel Tower, and the baseline confidently says "Emmanuel Macron", information it made up. This fine-tuned version correctly responds "NOT FOUND IN DOCUMENTS".
## Quick Start

### With Ollama
```bash
# Download the model
curl -L -o gemma3n-qa-v4-fixed.gguf https://huggingface.co/adorosario/gemma3n-qa-v4-fixed/resolve/main/gemma3n-qa-v4-fixed-q4_k_m.gguf

# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./gemma3n-qa-v4-fixed.gguf
TEMPLATE """<bos><start_of_turn>user
{{ .System }}
{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ .Response }}<end_of_turn>"""
PARAMETER stop "<end_of_turn>"
PARAMETER stop "<eos>"
PARAMETER temperature 0
EOF

# Create and run
ollama create gemma3n-qa-v4-fixed -f Modelfile
ollama run gemma3n-qa-v4-fixed
```
### Python API (Ollama)
```python
import requests

def ask_document(question: str, context: str) -> str:
    prompt = f"""You are a helpful assistant that answers questions based on provided context.
If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS".

Question: {question}

Context:
{context}"""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma3n-qa-v4-fixed",
            "prompt": prompt,
            "stream": False,
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["response"]

# Example
answer = ask_document(
    question="When was the Eiffel Tower built?",
    context="The Eiffel Tower was built from 1887 to 1889 by Gustave Eiffel.",
)
print(answer)  # Output: "from 1887 to 1889"
```
## The Hallucination Problem (Why This Model Exists)

### Baseline Behavior (Bad)
```
Question: Who is the president of France?
Context: The Eiffel Tower is in Paris. It was built by Gustave Eiffel.

Baseline Response: "Emmanuel Macron"   ← HALLUCINATED! Not in context!
```
### Fine-tuned Behavior (Good)
```
Question: Who is the president of France?
Context: The Eiffel Tower is in Paris. It was built by Gustave Eiffel.

Fine-tuned Response: "NOT FOUND IN DOCUMENTS"   ← Correct abstention!
```
This is critical for RAG applications where you need the model to be honest about what it doesn't know.
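This behavior is easy to sanity-check programmatically. A minimal check, reusing the `ask_document()` helper from the Quick Start above (assumes the model is already running in Ollama):

```python
# The context deliberately lacks the answer, so the expected output is the
# abstention sentinel. Exact-match on the sentinel follows the model card;
# production code may prefer a more defensive substring check.
answer = ask_document(
    question="Who is the president of France?",
    context="The Eiffel Tower is in Paris. It was built by Gustave Eiffel.",
)
assert answer.strip() == "NOT FOUND IN DOCUMENTS", f"unexpected answer: {answer!r}"
```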
## Prompt Format (Required)
The model requires this specific prompt format to work correctly:
```
You are a helpful assistant that answers questions based on provided context.
If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS".

Question: {your question}

Context:
{your context}
```
Without the abstention instruction, the model may not properly refuse to answer questions outside the context.
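To keep the format identical across call sites, it can help to centralize it in one helper. `build_prompt` below is a hypothetical convenience function, not part of any shipped package:

```python
# Sentinel string the model was fine-tuned to emit when the answer is absent.
ABSTAIN = "NOT FOUND IN DOCUMENTS"

def build_prompt(question: str, context: str) -> str:
    """Assemble the exact prompt format the model expects."""
    return (
        "You are a helpful assistant that answers questions based on provided context.\n"
        f'If the answer is not found in the context, respond with "{ABSTAIN}".\n\n'
        f"Question: {question}\n\n"
        f"Context:\n{context}"
    )
```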
## Performance

### Benchmark Results (6,046 test examples)
| Metric | Value | Description |
|---|---|---|
| Exact Match | 83.2% | Answer exactly matches gold standard |
| Token F1 | 90.0% | Token overlap with gold answer |
| Abstention Precision | 98.2% | When it abstains, it's correct |
| Abstention Recall | 99.7% | It catches almost all unanswerable questions |
| Abstention F1 | 98.9% | Combined abstention performance |
### Comparison with Baseline
| Metric | Fine-tuned | Baseline (gemma3n:e4b) | Improvement |
|---|---|---|---|
| Exact Match | 83.2% | 22.0% | +61.2 pts (+278%) |
| Token F1 | 90.0% | 34.8% | +55.2 pts (+159%) |
| Abstention F1 | 98.9% | ~0% | Model learned abstention |
### Statistical Significance
- p-value: < 0.00001 (highly significant)
- 95% CI: 82.3% - 84.1% (fine-tuned) vs 13.9% - 30.1% (baseline)
- The confidence intervals do not overlap, so the improvement is not explained by sampling noise
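The fine-tuned interval is consistent with a standard normal-approximation (Wald) confidence interval for a binomial proportion over the 6,046 test examples; the sketch below reproduces it. The baseline's much wider interval suggests it was scored on a smaller sample.

```python
import math

def wald_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% normal-approximation CI for a binomial proportion."""
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

lo, hi = wald_ci(0.832, 6046)
print(f"{lo:.1%} - {hi:.1%}")  # ~82.3% - 84.1%, matching the reported interval
```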
## Hardware Requirements
| Hardware | Supported | Latency (per query) | Notes |
|---|---|---|---|
| CPU only (8 cores, 32GB RAM) | Yes | 4-6 sec | Validated on n2-standard-8 |
| NVIDIA T4 (16GB) | Yes | <1 sec | Recommended |
| Consumer GPU (8GB) | Yes | 1-2 sec | Works with Q4_K_M |
| Apple Silicon | Yes | 1-3 sec | Via llama.cpp |
Memory requirement: ~10 GB RAM for inference
## Training Details

### Base Model

- Model: Google Gemma 3n E4B (4B effective parameters)
- Source: `unsloth/gemma-3n-E4B-it-unsloth-bnb-4bit`
### Fine-tuning Configuration
| Parameter | Value |
|---|---|
| Method | LoRA (Low-Rank Adaptation) |
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.05 |
| Learning Rate | 2e-4 |
| Epochs | 3 |
| Batch Size | 4 (effective: 16 with grad accum) |
| Precision | bfloat16 |
| Training Time | ~20 hours on A100 40GB |
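For readers reproducing the run, the table maps directly onto a PEFT-style `LoraConfig`. This is an illustrative sketch, not the actual training script: the target modules are an assumption (Unsloth typically adapts all attention and MLP projections), and the real run used Unsloth's wrappers.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                  # rank, per the table above
    lora_alpha=64,         # alpha
    lora_dropout=0.05,     # dropout
    bias="none",
    task_type="CAUSAL_LM",
    # Assumed target modules; not confirmed by the training logs.
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```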
### Training Data
- Dataset: adorosario/gemma3n-qa-synthetic
- Size: 57,081 examples (45,220 train / 5,815 val / 6,046 test)
- Composition: 73% answerable QA, 27% abstention examples
- Source: Synthetic generation from SimpleQA-Verified knowledge base
- Generation: GPT-4o-mini
- Cost: ~$15-20 USD
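The dataset can be pulled straight from the Hub with the `datasets` library; iterating over the returned splits avoids hard-coding split names, which are not documented here:

```python
from datasets import load_dataset

# Loads whatever splits the repository defines; the sizes above suggest
# train / validation / test partitions.
ds = load_dataset("adorosario/gemma3n-qa-synthetic")
print({name: len(split) for name, split in ds.items()})
```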
### Critical Implementation Detail

The v4 success came from manual label masking: computing the training loss only on the model's response tokens, not on the prompt. Previous versions (v1, v3) failed because this masking wasn't properly implemented.
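In a minimal sketch (illustrative, not the actual training code), masking means setting every prompt-token label to -100, the index PyTorch's cross-entropy loss ignores, so gradients flow only from response tokens:

```python
def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Build labels for causal-LM fine-tuning that skip the prompt.

    -100 is the ignore_index of torch.nn.CrossEntropyLoss, so prompt
    positions contribute nothing to the loss.
    """
    labels = list(input_ids)
    labels[:prompt_len] = [-100] * prompt_len
    return labels
```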
## How-To Guides

### Use with llama.cpp
```bash
# Download
wget https://huggingface.co/adorosario/gemma3n-qa-v4-fixed/resolve/main/gemma3n-qa-v4-fixed-q4_k_m.gguf

# Run
./llama-cli -m gemma3n-qa-v4-fixed-q4_k_m.gguf \
  -p "You are a helpful assistant...\n\nQuestion: ...\n\nContext:\n..." \
  --temp 0
```
### Use in a RAG Pipeline
```python
from langchain.llms import Ollama  # on newer LangChain: from langchain_community.llms import Ollama

llm = Ollama(model="gemma3n-qa-v4-fixed", temperature=0)

def rag_query(question: str, retrieved_docs: list[str]) -> str:
    context = "\n\n".join(retrieved_docs)
    prompt = f"""You are a helpful assistant that answers questions based on provided context.
If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS".

Question: {question}

Context:
{context}"""
    return llm.invoke(prompt)
```
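A usage sketch; the documents below are placeholders standing in for real retriever output:

```python
docs = [
    "The Eiffel Tower was built from 1887 to 1889 by Gustave Eiffel.",
    "It is located on the Champ de Mars in Paris.",
]
print(rag_query("When was the Eiffel Tower built?", docs))
# Expected: "from 1887 to 1889"
print(rag_query("Who is the president of France?", docs))
# Expected: "NOT FOUND IN DOCUMENTS"
```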
### Use with AnythingLLM

1. Import the GGUF into Ollama (see Quick Start)
2. In AnythingLLM, select `gemma3n-qa-v4-fixed` as the model
3. Set the system prompt to include the abstention instruction
4. Set temperature to 0
## Limitations

### What This Model Does Well
- Extracting answers from provided context
- Knowing when to abstain ("NOT FOUND IN DOCUMENTS")
- Running on CPU-only hardware
- Fast inference (4-6 seconds on CPU)
### What This Model Does NOT Do
- Generate answers beyond the context (by design)
- Multi-hop reasoning requiring external knowledge
- Non-English languages (trained on English only)
- Long contexts beyond 4096 tokens
- Multi-turn conversation (single-turn QA only)
### Known Issues
- Requires the specific prompt format for abstention (a defensive check for the abstention string is sketched below)
- ~2% quality loss from Q4_K_M quantization
- May struggle with heavily paraphrased answers
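Because abstention is signalled in-band as a literal string, downstream code should detect it defensively rather than by exact match. A hypothetical helper:

```python
def is_abstention(answer: str) -> bool:
    """True if the model declined to answer.

    Substring matching guards against stray whitespace or trailing
    punctuation around the sentinel.
    """
    return "NOT FOUND IN DOCUMENTS" in answer.strip().upper()
```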
## Files

| File | Size | Description |
|---|---|---|
| `gemma3n-qa-v4-fixed-q4_k_m.gguf` | 7.68 GB | Main model (Q4_K_M quantization) |
## Citation

```bibtex
@misc{gemma3n-qa-v4-fixed-2025,
  author = {Do Rosario, Alden},
  title = {gemma3n-qa-v4-fixed: Fine-tuned Gemma 3n for Document-Grounded QA with Abstention},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/adorosario/gemma3n-qa-v4-fixed},
  note = {Fine-tuned for extractive QA with learned abstention behavior}
}
```
## Related Resources
- Training Dataset: adorosario/gemma3n-qa-synthetic
- Base Model: Google Gemma 3n
- Training Framework: Unsloth
## Acknowledgments
- Google for the Gemma 3n base model
- Unsloth team for efficient fine-tuning tools
- OpenAI for GPT-4o-mini used in synthetic data generation