---
title: Chronos-3B
emoji: 🕰️
colorFrom: gray
colorTo: indigo
sdk: gradio
sdk_version: 4.44.0
app_file: pipeline.py
pinned: true
license: apache-2.0
datasets:
- TinyModels/chronos-wiki-corpus
language:
- en
tags:
- rag
- qwen2.5
- history
- world-war
- 20th-century
- retrieval-augmented-generation
pipeline_tag: text-generation
---

# Chronos 🕰️ — Your 20th-Century Historian

> *"Those who cannot remember the past are condemned to repeat it."*
> — George Santayana, philosopher, and apparently someone who never met a hallucinating LLM.

**Chronos** is a retrieval-augmented generation (RAG) AI that knows the 20th century the way your history teacher wished they did — but without the monotone voice and the overhead projector.

It pairs a **Qwen 2.5 3B** language model with a **FAISS-powered knowledge base** built from hundreds of Wikipedia articles: World War I, World War II, the Cold War, the Space Race, major political upheavals, key inventions, and everything in between. Ask it something historical and it digs through its archives like a librarian who actually enjoys their job. Ask it something casual and it just... talks to you. Like a person. Imagine that.

---

## What Chronos actually is

Most history bots either hallucinate confidently or refuse to answer anything fun. Chronos tries to do neither. It was built with one goal: give accurate, evidence-backed historical answers while still being a conversation worth having.

It will not make up that Churchill and Stalin went to the same barber. It will tell you, with sources, what actually happened at Yalta. And if you just want to say hi, it'll say hi back.

---

## 🧠 How it works

Chronos runs a two-layer architecture depending on what you ask:

**Layer 1 — Casual chat**
For greetings, small talk, or anything outside the historical lane, the Qwen 3B model answers directly. No retrieval, no database, just the raw language model being friendly. This is also the layer that handles "who are you?" before the bot accidentally goes looking through WWII chunks for an answer about itself. (Yes, that happened. No, it was not funny at the time.)

**Layer 2 — Historical RAG**
When the question touches 20th-century history, the pipeline kicks in:

1. A keyword detector flags the query as historical
2. The **e5-base-v2** bi-encoder retrieves 30 candidate chunks from the FAISS index
3. A **cross-encoder** (`ms-marco-MiniLM-L-12-v2`) re-ranks them and keeps the top 4
4. Those 4 chunks are packed into a prompt alongside the question
5. Qwen generates an answer grounded in the retrieved context

There's also a **confidence threshold** — if even the best-ranked chunk scores too low, Chronos says "I don't have enough information" rather than inventing something. This is called honesty. More AI systems should try it.

A small **hard-coded safety net** handles a handful of ultra-high-stakes questions (think: "Who led Nazi Germany?") before retrieval even begins, guaranteeing accuracy on the facts that really cannot be wrong.

---

## 📦 What's inside

| Component | What it is |
|---|---|
| **Base LLM** | Qwen/Qwen2.5-3B-Instruct (4-bit quantized) |
| **Bi-encoder** | intfloat/e5-base-v2 |
| **Cross-encoder** | cross-encoder/ms-marco-MiniLM-L-12-v2 |
| **FAISS index** | jjk_index.faiss — historical Wikipedia chunks |
| **Chunks** | chunks.txt (~12 MB, one paragraph per line) |
| **Pipeline** | pipeline.py — one class, one `.ask()` method, done |
| **Config** | rag_config.json |

---

## 🚀 Quick start

```python
from huggingface_hub import snapshot_download
from pipeline import Chronos

model_dir = snapshot_download("QuantaSparkLabs/Chronos-3B")
bot = Chronos(model_dir)

# historical question
print(bot.ask("What caused World War I?"))

# casual
print(bot.ask("Hey, what's up?"))
```

> **Requirements:** 4-bit quantization means you need `bitsandbytes` and a GPU with at least 6 GB VRAM. CPU inference works but you'll age noticeably while waiting.

---

## 📊 Evaluation Results

These are internal evaluations run manually on Chronos-3B — no benchmark leaderboard, no cherry-picked test set, just honest testing across real question types. Take them at face value.

| Category | Score | Notes |
|---|---|---|
| **Factual Accuracy** (hard facts) | ✅ 10/10 | Critical questions — leaders, dates, core events — are answered instantly by the built-in safety net with zero error |
| **Historical Knowledge** (open-ended) | 🔶 8/10 | Most open questions (causes of wars, event explanations, country lists) are answered correctly via the knowledge base. Occasionally the confidence filter returns "I don't know" when the retrieval score is borderline |
| **Hallucination Control** | ✅ 9/10 | The confidence threshold + cross-encoder combination means Chronos almost never invents false history. It prefers to admit ignorance over guessing |
| **Casual Friendliness** | ✅ 9/10 | Greetings, identity questions, and small talk are handled with a warm, lively personality. It never feels robotic |
| **Consistency** (multi-turn) | 🔶 7/10 | Follow-up questions work, but the model can lose the thread across a long conversation. Multi-turn memory is a known limitation and a future goal |
| **Speed** (T4 GPU) | ✅ 8/10 | Answers generate in 2–5 seconds. Initial model download is heavy (~6 GB), but subsequent inferences are fast |

**Overall: 8/10**

Chronos is reliable, personable, and historically faithful. It occasionally needs a second prompt on very obscure questions, but it does not fabricate harmful falsehoods. The known shortcomings — multi-turn coherence and occasional over-caution on borderline retrievals — are well-understood and can be tuned further.

> These results reflect self-evaluation on a curated internal test set. Independent benchmarking on formal QA datasets (e.g. TriviaQA, NaturalQuestions) is a planned next step.

---

## 🔧 The bugs we fought (and eventually won)

Look, no project ships clean. Here's what actually happened during development, because pretending otherwise helps nobody.

---

**Bug 1 — The cryptic list comparison crash**

```
TypeError: '<=' not supported between instances of 'list' and 'int'
```

Every single answer crashed with this. Took an embarrassing amount of time to realize Gradio's `ChatInterface` passes `(message, history)` to your function, and our function was accidentally catching the history list as `max_new_tokens`. The fix was three words: fix the signature. The debugging took considerably longer.

---

**Bug 2 — Qwen's corrupted generation config**

The upstream Qwen 3B repo had `max_new_tokens` stored as a list in `generation_config.json` instead of an integer. This is the kind of bug that makes you question everything you know about software. We fixed it by loading a clean `GenerationConfig` manually and overwriting the bad file in our upload. Not glamorous. Worked perfectly.

---

**Bug 3 — The LFS ghost files**

Upload kept failing with "your push was rejected because an LFS pointer pointed to a file that does not exist." The cause was leftover LFS metadata from a previous interrupted upload haunting the repository like a very technical ghost. Solution: nuke the repo, start fresh, upload every file individually with `upload_file` instead of `upload_folder`. Tedious. Effective.

---

**Bug 4 — The Finnish shipwreck incident**

Asked "Who was the leader of Germany during WWI?" and got a confident paragraph about a Finnish shipwreck. This is what happens when a retriever fetches irrelevant chunks and a language model tries to connect them anyway. The fix was the confidence threshold — if the cross-encoder score is too low, Chronos admits it doesn't know. Hallucinations are not a feature.

*(We still don't know where the shipwreck came from. Some mysteries are better left unsolved.)*

---

**Bug 5 — The identity crisis**

"Who are you?" triggered a historical retrieval search, found nothing relevant, and the bot replied "I don't have enough information." Chronos literally did not know who it was. We fixed this by adding an identity handler at the very top of `ask()`, before any retrieval logic runs. An AI having an existential crisis is only funny in retrospect.

---

## 🌐 Run it locally

```python
import gradio as gr
from pipeline import Chronos

bot = Chronos("path/to/downloaded/model")

def chat(message, history):
    return bot.ask(message)

gr.ChatInterface(fn=chat, title="Chronos 🕰️").launch()
```

Or deploy on Hugging Face Spaces — the repo already includes `pipeline.py`. Point `app_file` at it and you're done.

---

## 🤝 Contributing

Found a historical inaccuracy? Want to add more chunks to the knowledge base? Think Chronos got something wrong about the Battle of Stalingrad?

- Open a discussion on the [Community Tab](https://huggingface.co/QuantaSparkLabs/Chronos-3B/discussions)
- Submit a PR with new or corrected chunks
- Flag wrong answers in any Gradio demo — we review them periodically

All contributions welcome. History is big. The knowledge base can always be bigger.

---

## A final note

This project took longer than expected, broke in ways that felt personal, and shipped anyway. That's the job. Chronos is dedicated to everyone who has ever stared at a stack trace at 2am and decided to keep going — and to everyone who genuinely loves history and thinks it deserves better than a model that makes things up.

It does. You do. Here it is.

---

*Built with perseverance, caffeine, and a deep respect for the 20th century.*
*QuantaSparkLabs*