gemma3n-qa-v4-fixed

A fine-tuned Gemma 3n model for document-grounded question answering that eliminates hallucination and knows when to say "I don't know."

| Metric | This Model | Baseline | Improvement |
|---|---|---|---|
| Exact Match | 83.2% | 22.0% | +61.2 pts |
| Token F1 | 90.0% | 34.8% | +55.2 pts |
| Abstention F1 | 98.9% | ~0% | +98.9 pts |

TL;DR

This model answers questions only from provided context. When the answer isn't there, it says NOT FOUND IN DOCUMENTS instead of making things up.

The problem it solves: The baseline Gemma 3n hallucinates answers that are not in the context. Ask "Who is the president of France?" with context about the Eiffel Tower, and the baseline confidently answers "Emmanuel Macron", information it made up. This fine-tuned version correctly responds "NOT FOUND IN DOCUMENTS."


Quick Start

With Ollama

# Download the model
curl -L -o gemma3n-qa-v4-fixed.gguf https://huggingface.co/adorosario/gemma3n-qa-v4-fixed/resolve/main/gemma3n-qa-v4-fixed-q4_k_m.gguf

# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./gemma3n-qa-v4-fixed.gguf
TEMPLATE """<bos><start_of_turn>user
{{ .System }}

{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ .Response }}<end_of_turn>"""
PARAMETER stop <end_of_turn>
PARAMETER stop <eos>
PARAMETER temperature 0
EOF

# Create and run
ollama create gemma3n-qa-v4-fixed -f Modelfile
ollama run gemma3n-qa-v4-fixed

Python API (Ollama)

import requests

def ask_document(question: str, context: str) -> str:
    prompt = f"""You are a helpful assistant that answers questions based on provided context.
If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS".

Question: {question}

Context:
{context}"""

    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma3n-qa-v4-fixed",
            "prompt": prompt,
            "stream": False
        }
    )
    return response.json()["response"]

# Example
answer = ask_document(
    question="When was the Eiffel Tower built?",
    context="The Eiffel Tower was built from 1887 to 1889 by Gustave Eiffel."
)
print(answer)  # Output: "from 1887 to 1889"
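
The same helper can be used to verify the abstention behavior. When the answer is not in the context, the model should return the abstention string rather than guess:

# The context says nothing about French politics, so the model should abstain
answer = ask_document(
    question="Who is the president of France?",
    context="The Eiffel Tower was built from 1887 to 1889 by Gustave Eiffel."
)
print(answer)  # Output: "NOT FOUND IN DOCUMENTS"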

The Hallucination Problem (Why This Model Exists)

Baseline Behavior (Bad)

Question: Who is the president of France?
Context: The Eiffel Tower is in Paris. It was built by Gustave Eiffel.

Baseline Response: "Emmanuel Macron"  ← HALLUCINATED! Not in context!

Fine-tuned Behavior (Good)

Question: Who is the president of France?
Context: The Eiffel Tower is in Paris. It was built by Gustave Eiffel.

Fine-tuned Response: "NOT FOUND IN DOCUMENTS"  ← Correct abstention!

This is critical for RAG applications where you need the model to be honest about what it doesn't know.


Prompt Format (Required)

The model requires this specific prompt format to work correctly:

You are a helpful assistant that answers questions based on provided context.
If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS".

Question: {your question}

Context:
{your context}

Without the abstention instruction, the model may not properly refuse to answer questions outside the context.
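
In application code it is safer to branch on the abstention string explicitly rather than pass it through to users. A minimal sketch, assuming the template above; the answer_or_none helper is illustrative, not part of the model:

from typing import Optional

PROMPT_TEMPLATE = """You are a helpful assistant that answers questions based on provided context.
If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS".

Question: {question}

Context:
{context}"""

ABSTENTION = "NOT FOUND IN DOCUMENTS"

def answer_or_none(raw_response: str) -> Optional[str]:
    # Treat an abstention as "no answer" so callers can fall back to
    # retrieving more documents or telling the user nothing was found.
    text = raw_response.strip()
    return None if ABSTENTION in text else text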


Performance

Benchmark Results (6,046 test examples)

| Metric | Value | Description |
|---|---|---|
| Exact Match | 83.2% | Answer exactly matches the gold standard |
| Token F1 | 90.0% | Token overlap with the gold answer |
| Abstention Precision | 98.2% | When it abstains, it is correct |
| Abstention Recall | 99.7% | It catches almost all unanswerable questions |
| Abstention F1 | 98.9% | Combined abstention performance |
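
For reference, Token F1 is the standard SQuAD-style token-overlap score. A minimal sketch of the conventional computation (this illustrates the metric, not necessarily the exact evaluation script used here):

from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Count tokens shared between prediction and gold answer
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("from 1887 to 1889", "1887 to 1889"))  # ~0.857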

Comparison with Baseline

| Metric | Fine-tuned | Baseline (gemma3n:e4b) | Improvement |
|---|---|---|---|
| Exact Match | 83.2% | 22.0% | +61.2 pts (+278%) |
| Token F1 | 90.0% | 34.8% | +55.2 pts (+159%) |
| Abstention F1 | 98.9% | ~0% | Model learned abstention |

Statistical Significance

  • p-value: < 0.00001 (highly significant)
  • 95% CI: 82.3% - 84.1% (fine-tuned) vs 13.9% - 30.1% (baseline)
  • Confidence intervals don't overlap
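
As a sanity check, these numbers are consistent with a two-proportion z-test on the exact-match rates. The sketch below is my re-derivation, not the authors' exact test; the baseline's wide CI suggests it was scored on a smaller subsample, so n2 is an assumption:

from math import sqrt, erf

n1, n2 = 6046, 100            # fine-tuned test set; baseline subsample (assumed)
p1, p2 = 0.832, 0.220         # exact-match rates
p = (p1 * n1 + p2 * n2) / (n1 + n2)           # pooled proportion
se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))    # standard error of the difference
z = (p1 - p2) / se
p_value = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))  # two-sided
print(f"z = {z:.1f}, p < 0.00001: {p_value < 1e-5}")  # z ≈ 15.9, True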

Hardware Requirements

| Hardware | Supported | Latency | Notes |
|---|---|---|---|
| CPU only (8 cores, 32GB RAM) | Yes | 4-6 sec | Validated on n2-standard-8 |
| NVIDIA T4 (16GB) | Yes | <1 sec | Recommended |
| Consumer GPU (8GB) | Yes | 1-2 sec | Works with Q4_K_M |
| Apple Silicon | Yes | 1-3 sec | Via llama.cpp |

Memory requirement: ~10 GB RAM for inference


Training Details

Base Model

  • Model: Google Gemma 3n E4B (4B effective parameters)
  • Source: unsloth/gemma-3n-E4B-it-unsloth-bnb-4bit

Fine-tuning Configuration

| Parameter | Value |
|---|---|
| Method | LoRA (Low-Rank Adaptation) |
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.05 |
| Learning Rate | 2e-4 |
| Epochs | 3 |
| Batch Size | 4 (effective: 16 with gradient accumulation) |
| Precision | bfloat16 |
| Training Time | ~20 hours on A100 40GB |
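
These hyperparameters map directly onto a PEFT LoraConfig. A minimal sketch of the equivalent configuration; the target modules are a typical choice for Gemma-style attention/MLP projections and are an assumption here, not confirmed by the card:

from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                 # LoRA rank
    lora_alpha=64,        # scaling factor
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Assumed target modules, typical for Gemma-style models:
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)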

Training Data

  • Dataset: adorosario/gemma3n-qa-synthetic
  • Size: 57,081 examples (45,220 train / 5,815 val / 6,046 test)
  • Composition: 73% answerable QA, 27% abstention examples
  • Source: Synthetic generation from SimpleQA-Verified knowledge base
  • Generation: GPT-4o-mini
  • Cost: ~$15-20 USD
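
The dataset is on the Hugging Face Hub and can be loaded directly; the split names below are assumed to follow the standard convention:

from datasets import load_dataset

ds = load_dataset("adorosario/gemma3n-qa-synthetic")
print(ds)  # expect roughly 45,220 train / 5,815 validation / 6,046 test examples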

Critical Implementation Detail

The v4 success came from manual label masking: training only on the model responses, not on the prompt. Previous versions (v1, v3) failed because this wasn't properly implemented.
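
A minimal sketch of what that masking looks like with a Hugging Face tokenizer (names here are illustrative; the card does not publish the exact training code). Prompt tokens get the label -100, which the cross-entropy loss ignores, so gradients flow only through the response tokens:

def build_labels(tokenizer, prompt: str, response: str) -> dict:
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token,
                             add_special_tokens=False)["input_ids"]
    # -100 is the ignore_index used by Hugging Face loss functions:
    # prompt tokens contribute nothing, so the model trains only on its response.
    labels = [-100] * len(prompt_ids) + response_ids
    return {"input_ids": prompt_ids + response_ids, "labels": labels}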


How-To Guides

Use with llama.cpp

# Download
wget https://huggingface.co/adorosario/gemma3n-qa-v4-fixed/resolve/main/gemma3n-qa-v4-fixed-q4_k_m.gguf

# Run
./llama-cli -m gemma3n-qa-v4-fixed-q4_k_m.gguf \
  -p "You are a helpful assistant...\n\nQuestion: ...\n\nContext:\n..." \
  --temp 0

Use in a RAG Pipeline

from langchain_community.llms import Ollama  # moved here from langchain.llms in langchain 0.1+

llm = Ollama(model="gemma3n-qa-v4-fixed", temperature=0)

def rag_query(question: str, retrieved_docs: list) -> str:
    context = "\n\n".join(retrieved_docs)
    prompt = f"""You are a helpful assistant that answers questions based on provided context.
If the answer is not found in the context, respond with "NOT FOUND IN DOCUMENTS".

Question: {question}

Context:
{context}"""
    return llm.invoke(prompt)
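
For example, with illustrative documents:

docs = [
    "The Eiffel Tower was built from 1887 to 1889 by Gustave Eiffel.",
    "It is located on the Champ de Mars in Paris.",
]
print(rag_query("When was the Eiffel Tower built?", docs))  # "from 1887 to 1889"
print(rag_query("Who is the president of France?", docs))   # "NOT FOUND IN DOCUMENTS"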

Use with AnythingLLM

  1. Import the GGUF into Ollama (see Quick Start)
  2. In AnythingLLM, select gemma3n-qa-v4-fixed as the model
  3. Set system prompt to include the abstention instruction
  4. Set temperature to 0

Limitations

What This Model Does Well

  • Extracting answers from provided context
  • Knowing when to abstain ("NOT FOUND IN DOCUMENTS")
  • Running on CPU-only hardware
  • Fast inference (4-6 seconds on CPU)

What This Model Does NOT Do

  • Generate answers beyond the context (by design)
  • Multi-hop reasoning requiring external knowledge
  • Non-English languages (trained on English only)
  • Long contexts beyond 4096 tokens
  • Multi-turn conversation (single-turn QA only)

Known Issues

  • Requires specific prompt format for abstention
  • ~2% quality loss from Q4_K_M quantization
  • May struggle with heavily paraphrased answers

Files

| File | Size | Description |
|---|---|---|
| gemma3n-qa-v4-fixed-q4_k_m.gguf | 7.68 GB | Main model (Q4_K_M quantization) |

Citation

@misc{gemma3n-qa-v4-fixed-2025,
  author = {Do Rosario, Alden},
  title = {gemma3n-qa-v4-fixed: Fine-tuned Gemma 3n for Document-Grounded QA with Abstention},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/adorosario/gemma3n-qa-v4-fixed},
  note = {Fine-tuned for extractive QA with learned abstention behavior}
}


Acknowledgments

  • Google for the Gemma 3n base model
  • Unsloth team for efficient fine-tuning tools
  • OpenAI for GPT-4o-mini used in synthetic data generation