gemma3n-qa-v4-fixed

A fine-tuned Gemma 3n model for document-grounded question answering that eliminates hallucination and knows when to say "I don't know."

| Metric | This Model | Baseline | Improvement |
|---|---|---|---|
| Exact Match | 83.2% | 22.0% | +61.2 pts |
| Token F1 | 90.0% | 34.8% | +55.2 pts |
| Abstention F1 | 98.9% | ~0% | +98.9 pts |

TL;DR

This model answers questions only from provided context. When the answer isn't there, it says NOT FOUND IN DOCUMENTS instead of making things up.

The problem it solves: The baseline Gemma 3n hallucinates answers that are not in the context. Ask "Who is the president of France?" with context about the Eiffel Tower, and the baseline confidently answers "Emmanuel Macron", information it made up. This fine-tuned version correctly responds "NOT FOUND IN DOCUMENTS."


Quick Start

With Ollama

# Download the model
curl -L -o gemma3n-qa-v4-fixed.gguf https://huggingface.co/adorosario/gemma3n-qa-v4-fixed/resolve/main/gemma3n-qa-v4-fixed-q4_k_m.gguf

# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./gemma3n-qa-v4-fixed.gguf
TEMPLATE """<bos><start_of_turn>user
{{ .System }}

{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ .Response }}<end_of_turn>"""
PARAMETER stop <end_of_turn>
PARAMETER stop <eos>
PARAMETER temperature 0
EOF

# Create and run
ollama create gemma3n-qa-v4-fixed -f Modelfile
ollama run gemma3n-qa-v4-fixed

Python API (Ollama)

import requests

def ask_document(question: str, context: str) -> str:
    prompt = f"""You are a helpful assistant that answers questions based on provided context.
If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS".

Question: {question}

Context:
{context}"""

    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma3n-qa-v4-fixed",
            "prompt": prompt,
            "stream": False
        }
    )
    return response.json()["response"]

# Example
answer = ask_document(
    question="When was the Eiffel Tower built?",
    context="The Eiffel Tower was built from 1887 to 1889 by Gustave Eiffel."
)
print(answer)  # Output: "from 1887 to 1889"
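
The same helper can be used to verify the abstention behavior. When the answer is not in the context, the model should return the abstention string rather than guess:

# The context says nothing about French politics, so the model should abstain
answer = ask_document(
    question="Who is the president of France?",
    context="The Eiffel Tower was built from 1887 to 1889 by Gustave Eiffel."
)
print(answer)  # Output: "NOT FOUND IN DOCUMENTS"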

The Hallucination Problem (Why This Model Exists)

Baseline Behavior (Bad)

Question: Who is the president of France?
Context: The Eiffel Tower is in Paris. It was built by Gustave Eiffel.

Baseline Response: "Emmanuel Macron"  ← HALLUCINATED! Not in context!

Fine-tuned Behavior (Good)

Question: Who is the president of France?
Context: The Eiffel Tower is in Paris. It was built by Gustave Eiffel.

Fine-tuned Response: "NOT FOUND IN DOCUMENTS"  ← Correct abstention!

This is critical for RAG applications where you need the model to be honest about what it doesn't know.


Prompt Format (Required)

The model requires this specific prompt format to work correctly:

You are a helpful assistant that answers questions based on provided context.
If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS".

Question: {your question}

Context:
{your context}

Without the abstention instruction, the model may not properly refuse to answer questions outside the context.
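
In application code it is safer to branch on the abstention string explicitly rather than pass it through to users. A minimal sketch, assuming the template above; the answer_or_none helper is illustrative, not part of the model:

from typing import Optional

PROMPT_TEMPLATE = """You are a helpful assistant that answers questions based on provided context.
If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS".

Question: {question}

Context:
{context}"""

ABSTENTION = "NOT FOUND IN DOCUMENTS"

def answer_or_none(raw_response: str) -> Optional[str]:
    # Treat an abstention as "no answer" so callers can fall back to
    # retrieving more documents or telling the user nothing was found.
    text = raw_response.strip()
    return None if ABSTENTION in text else text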


Performance

Benchmark Results (6,046 test examples)

| Metric | Value | Description |
|---|---|---|
| Exact Match | 83.2% | Answer exactly matches the gold standard |
| Token F1 | 90.0% | Token overlap with the gold answer |
| Abstention Precision | 98.2% | When it abstains, it is correct |
| Abstention Recall | 99.7% | It catches almost all unanswerable questions |
| Abstention F1 | 98.9% | Combined abstention performance |
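
For reference, Token F1 is the standard SQuAD-style token-overlap score. A minimal sketch of the conventional computation (this illustrates the metric, not necessarily the exact evaluation script used here):

from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Count tokens shared between prediction and gold answer
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("from 1887 to 1889", "1887 to 1889"))  # ~0.857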

Comparison with Baseline

| Metric | Fine-tuned | Baseline (gemma3n:e4b) | Improvement |
|---|---|---|---|
| Exact Match | 83.2% | 22.0% | +61.2 pts (+278%) |
| Token F1 | 90.0% | 34.8% | +55.2 pts (+159%) |
| Abstention F1 | 98.9% | ~0% | Model learned abstention |

Statistical Significance

  • p-value: < 0.00001 (highly significant)
  • 95% CI: 82.3% - 84.1% (fine-tuned) vs 13.9% - 30.1% (baseline)
  • Confidence intervals don't overlap
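
As a sanity check, these numbers are consistent with a two-proportion z-test on the exact-match rates. The sketch below is my re-derivation, not the authors' exact test; the baseline's wide CI suggests it was scored on a smaller subsample, so n2 is an assumption:

from math import sqrt, erf

n1, n2 = 6046, 100            # fine-tuned test set; baseline subsample (assumed)
p1, p2 = 0.832, 0.220         # exact-match rates
p = (p1 * n1 + p2 * n2) / (n1 + n2)           # pooled proportion
se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))    # standard error of the difference
z = (p1 - p2) / se
p_value = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))  # two-sided
print(f"z = {z:.1f}, p < 0.00001: {p_value < 1e-5}")  # z ≈ 15.9, True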

Hardware Requirements

| Hardware | Supported | Latency | Notes |
|---|---|---|---|
| CPU only (8 cores, 32GB RAM) | Yes | 4-6 sec | Validated on n2-standard-8 |
| NVIDIA T4 (16GB) | Yes | <1 sec | Recommended |
| Consumer GPU (8GB) | Yes | 1-2 sec | Works with Q4_K_M |
| Apple Silicon | Yes | 1-3 sec | Via llama.cpp |

Memory requirement: ~10 GB RAM for inference


Training Details

Base Model

  • Model: Google Gemma 3n E4B (4B effective parameters)
  • Source: unsloth/gemma-3n-E4B-it-unsloth-bnb-4bit

Fine-tuning Configuration

| Parameter | Value |
|---|---|
| Method | LoRA (Low-Rank Adaptation) |
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.05 |
| Learning Rate | 2e-4 |
| Epochs | 3 |
| Batch Size | 4 (effective: 16 with gradient accumulation) |
| Precision | bfloat16 |
| Training Time | ~20 hours on A100 40GB |
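
These hyperparameters map directly onto a PEFT LoraConfig. A minimal sketch of the equivalent configuration; the target modules are a typical choice for Gemma-style attention/MLP projections and are an assumption here, not confirmed by the card:

from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                 # LoRA rank
    lora_alpha=64,        # scaling factor
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Assumed target modules, typical for Gemma-style models:
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)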

Training Data

  • Dataset: adorosario/gemma3n-qa-synthetic
  • Size: 57,081 examples (45,220 train / 5,815 val / 6,046 test)
  • Composition: 73% answerable QA, 27% abstention examples
  • Source: Synthetic generation from SimpleQA-Verified knowledge base
  • Generation: GPT-4o-mini
  • Cost: ~$15-20 USD
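
The dataset is on the Hugging Face Hub and can be loaded directly; the split names below are assumed to follow the standard convention:

from datasets import load_dataset

ds = load_dataset("adorosario/gemma3n-qa-synthetic")
print(ds)  # expect roughly 45,220 train / 5,815 validation / 6,046 test examples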

Critical Implementation Detail

The v4 success came from manual label masking: training only on the model responses, not on the prompt. Previous versions (v1, v3) failed because this wasn't properly implemented.
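
A minimal sketch of what that masking looks like with a Hugging Face tokenizer (names here are illustrative; the card does not publish the exact training code). Prompt tokens get the label -100, which the cross-entropy loss ignores, so gradients flow only through the response tokens:

def build_labels(tokenizer, prompt: str, response: str) -> dict:
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token,
                             add_special_tokens=False)["input_ids"]
    # -100 is the ignore_index used by Hugging Face loss functions:
    # prompt tokens contribute nothing, so the model trains only on its response.
    labels = [-100] * len(prompt_ids) + response_ids
    return {"input_ids": prompt_ids + response_ids, "labels": labels}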


How-To Guides

Use with llama.cpp

# Download
wget https://huggingface.co/adorosario/gemma3n-qa-v4-fixed/resolve/main/gemma3n-qa-v4-fixed-q4_k_m.gguf

# Run
./llama-cli -m gemma3n-qa-v4-fixed-q4_k_m.gguf \
  -p "You are a helpful assistant...\n\nQuestion: ...\n\nContext:\n..." \
  --temp 0

Use in a RAG Pipeline

from langchain_community.llms import Ollama  # moved here from langchain.llms in langchain 0.1+

llm = Ollama(model="gemma3n-qa-v4-fixed", temperature=0)

def rag_query(question: str, retrieved_docs: list) -> str:
    context = "\n\n".join(retrieved_docs)
    prompt = f"""You are a helpful assistant that answers questions based on provided context.
If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS".

Question: {question}

Context:
{context}"""
    return llm.invoke(prompt)
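
For example, with illustrative documents:

docs = [
    "The Eiffel Tower was built from 1887 to 1889 by Gustave Eiffel.",
    "It is located on the Champ de Mars in Paris.",
]
print(rag_query("When was the Eiffel Tower built?", docs))  # "from 1887 to 1889"
print(rag_query("Who is the president of France?", docs))   # "NOT FOUND IN DOCUMENTS"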

Use with AnythingLLM

  1. Import the GGUF into Ollama (see Quick Start)
  2. In AnythingLLM, select gemma3n-qa-v4-fixed as the model
  3. Set system prompt to include the abstention instruction
  4. Set temperature to 0

Limitations

What This Model Does Well

  • Extracting answers from provided context
  • Knowing when to abstain ("NOT FOUND IN DOCUMENTS")
  • Running on CPU-only hardware
  • Fast inference (4-6 seconds on CPU)

What This Model Does NOT Do

  • Generate answers beyond the context (by design)
  • Multi-hop reasoning requiring external knowledge
  • Non-English languages (trained on English only)
  • Long contexts beyond 4096 tokens
  • Multi-turn conversation (single-turn QA only)

Known Issues

  • Requires specific prompt format for abstention
  • ~2% quality loss from Q4_K_M quantization
  • May struggle with heavily paraphrased answers

Files

| File | Size | Description |
|---|---|---|
| gemma3n-qa-v4-fixed-q4_k_m.gguf | 7.68 GB | Main model (Q4_K_M quantization) |

Citation

@misc{gemma3n-qa-v4-fixed-2025,
  author = {Do Rosario, Alden},
  title = {gemma3n-qa-v4-fixed: Fine-tuned Gemma 3n for Document-Grounded QA with Abstention},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/adorosario/gemma3n-qa-v4-fixed},
  note = {Fine-tuned for extractive QA with learned abstention behavior}
}


Acknowledgments

  • Google for the Gemma 3n base model
  • Unsloth team for efficient fine-tuning tools
  • OpenAI for GPT-4o-mini used in synthetic data generation