---
title: Irminsul
emoji: 🌿
colorFrom: green
colorTo: green
sdk: docker
app_port: 7860
pinned: false
---

# Irminsul
**A production-shaped LLMOps stack — QLoRA fine-tuning on Colab, RAG pipeline, containerized serving, and cloud deployment.**
[Live Demo](https://huggingface.co/spaces/MukulRay/Irminsul) · [GitHub](https://github.com/MukulRay1603/Irminsul) · [Corpus Repo](https://github.com/MukulRay1603/irminsul-corpus) · [License](LICENSE)
---
Most LLM projects stop at inference. This one builds the full stack: a QLoRA fine-tuned Llama 3.1 8B served through a RAG pipeline, with guardrails, a domain-specific knowledge base, and a containerized FastAPI server designed for cloud deployment.
**[→ Try the live demo](https://huggingface.co/spaces/MukulRay/Irminsul)**
---
## About Irminsul
Irminsul is a domain-specific AI assistant for Genshin Impact — built not because Genshin needed an AI assistant, but because it provided a concrete, evaluable knowledge domain to build an LLMOps pipeline around. Every component was chosen deliberately:
- A knowledge domain rich enough to evaluate retrieval quality (characters, mechanics, lore)
- Ground truth data available (KQM Theorycrafting Library, game stat APIs) to measure hallucination
- Community signal data (patch notes, meta shifts) to test corpus freshness
The domain is the test harness. The pipeline is the project.
---
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                           User Query                             │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                        Guardrails Layer                          │
│   • Injection detection (pattern matching)                       │
│   • Domain validation (cosine similarity vs anchor embeddings)   │
│   • Output sanitization                                          │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                        FastAPI /generate                         │
└─────────────┬───────────────────────────────────────────────────┘
              │
              ├─── Embed query (sentence-transformers, local, CPU)
              │                         │
              │                         ▼
              │       Pinecone ── semantic search ──► top-k chunks
              │                         │
              ▼                         ▼
          LangChain RetrievalQA (stuff chain)
                          │
                          ▼
        ┌────────────────────────────────────┐
        │            LLM Backend             │
        │                                    │
        │   Groq (live demo)                 │
        │   llama-3.3-70b-versatile          │
        │   ~300 tok/s, free tier            │
        │            ──── OR ────            │
        │   Local (fine-tuned)               │
        │   Llama 3.1 8B QLoRA               │
        │   4-bit NF4, RTX 3060 6GB          │
        │   (inference only — trained on     │
        │    Colab A100)                     │
        └────────────────────────────────────┘
                          │
                          ▼
        Grounded answer + source attribution
```
---
## Components
### Fine-Tuned Model
Llama 3.1 8B Instruct fine-tuned with QLoRA on the Stanford Alpaca dataset (52K instruction-following examples), trained on Google Colab Pro (A100). Local inference runs in 4-bit NF4 quantization on an RTX 3060 6GB.
**[→ View training notebook on Colab](https://colab.research.google.com/drive/1wXz6V196IXEEU3FKwxDJ7BBxRh79QqEF?usp=sharing)**
| Parameter | Value |
|---|---|
| Base model | `meta-llama/Llama-3.1-8B-Instruct` |
| Dataset | Stanford Alpaca (`tatsu-lab/alpaca`, 52K examples) |
| Method | QLoRA via PEFT |
| Rank / Alpha | r=16, α=32, dropout=0.05 |
| Target modules | q_proj, v_proj, k_proj, o_proj |
| Learning rate | 2e-4 (cosine schedule, warmup 3%) |
| Batch size | 4 per device × 4 grad accumulation = effective 16 |
| Epochs | 2 |
| Optimizer | paged_adamw_32bit |
| Quantization (inference) | 4-bit NF4, bfloat16 compute dtype |
| Training infra | Google Colab Pro (A100 40GB) |
| Experiment tracking | MLflow (3 runs) |
**[→ Download the exp2_lr2e-4_r16 model](https://drive.google.com/drive/folders/1vAVXDXzT5lThnvlgQwXRi0ParmyB3V0P?usp=sharing)**
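For orientation, the hyperparameters above map onto PEFT and bitsandbytes roughly like this. This is a minimal sketch: variable names and the dataset/trainer wiring are illustrative, and the exact code lives in the Colab notebook.

```python
# Sketch of the QLoRA setup implied by the hyperparameter table above.
# Dataset formatting and Trainer wiring are omitted; see the notebook.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization with bfloat16 compute (also used for local inference)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters on the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="exp2_lr2e-4_r16",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size 16
    num_train_epochs=2,
    optim="paged_adamw_32bit",
    bf16=True,
)
```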
Three experiments were run sequentially, each tracked in MLflow:
| Experiment | LR | Rank | Result |
|---|---|---|---|
| exp1_lr1e-4_r16 | 1e-4 | 16 | Conservative baseline |
| exp2_lr2e-4_r16 | 2e-4 | 16 | **Winner** — best loss/quality balance |
| exp3_lr2e-4_r8 | 2e-4 | 8 | Tests if rank=16 is worth the extra params |
The winning checkpoint (`exp2_lr2e-4_r16`) was selected on faithfulness (0.826) and ROUGE-L (0.466), both computed locally (cosine similarity and token overlap) against a held-out eval set.
### RAG Pipeline
Documents are chunked, embedded locally with `sentence-transformers/all-MiniLM-L6-v2` (384-dim, zero API cost), and stored in Pinecone serverless. Retrieval is semantic, top-k configurable per query.
| Component | Choice | Reason |
|---|---|---|
| Embedder | all-MiniLM-L6-v2 | Runs locally, strong semantic retrieval, 384-dim fits free Pinecone tier |
| Vector DB | Pinecone serverless | Zero ops, cosine similarity, free tier sufficient for corpus size |
| Chunking | Word-level, 300 words, 40-word overlap | Preserves semantic units across chunk boundaries |
| Chain | LangChain RetrievalQA (stuff) | Simple, inspectable, returns source documents |
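A compressed sketch of that ingest path (word-level chunking, local MiniLM embeddings, Pinecone upsert). The index name and metadata keys here are illustrative; `ingest.py` is the source of truth.

```python
# Illustrative ingest path: chunk -> embed locally -> upsert to Pinecone.
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone

def chunk_words(text: str, size: int = 300, overlap: int = 40) -> list[str]:
    """Word-level chunks with overlap so semantic units survive chunk boundaries."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim, CPU
pc = Pinecone(api_key="...")
index = pc.Index("irminsul")  # hypothetical index name

def ingest(doc_id: str, text: str) -> None:
    chunks = chunk_words(text)
    vectors = embedder.encode(chunks)  # shape: (n_chunks, 384)
    index.upsert(
        vectors=[
            (f"{doc_id}-{i}", vec.tolist(), {"source": doc_id, "text": chunk})
            for i, (vec, chunk) in enumerate(zip(vectors, chunks))
        ]
    )
```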
### Knowledge Corpus
The corpus is maintained in a [separate repository](https://github.com/MukulRay1603/irminsul-corpus) with an autonomous update pipeline. It ingests from three tiers of sources with different trust levels:
| Tier | Source | Files | Trust |
|---|---|---|---|
| 1 — Ground Truth | KQM Theorycrafting Library (peer-reviewed mechanics) | ~305 | Highest — cite in builds |
| 1 — Ground Truth | genshin-db API (exact character/weapon/artifact stats) | ~406 | Highest — exact game data |
| 2 — Expert Synthesis | Gemini-authored prose grounded in Tier 1 | ~83 | High — no hallucinated stats |
| 3 — Community Signal | Official patch notes, banner history, event calendar | ~80 | Medium — tagged explicitly |
A GitHub Actions workflow runs every Sunday at 2am UTC, pulls fresh data, commits the docs, and re-ingests ~4,000 vectors to Pinecone automatically.
### Guardrails
Two layers of input validation before any LLM call:
1. **Injection detection** — pattern matching against known jailbreak phrases (`ignore previous instructions`, `act as`, `DAN mode`, etc.)
2. **Domain validation** — cosine similarity between the query embedding and a set of Genshin-domain anchor sentences. Queries scoring below threshold (0.35) are rejected with a domain-scoped error message before touching the LLM.
Output is sanitized to strip generation artifacts (stray special tokens, trailing whitespace) and length-checked.
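Condensed, the two input checks look roughly like this. The 0.35 threshold and the example patterns come from the description above; the anchor sentences are placeholders, not the real set.

```python
# Illustrative version of the guardrails: pattern-based injection check plus
# embedding-based domain check. Anchor sentences here are examples only.
import re
from sentence_transformers import SentenceTransformer, util

INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"\bact as\b",
    r"\bDAN mode\b",
]

DOMAIN_ANCHORS = [
    "Genshin Impact character builds and team compositions",
    "Elemental reactions, artifacts, and weapon stats",
]
DOMAIN_THRESHOLD = 0.35

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
anchor_embeddings = embedder.encode(DOMAIN_ANCHORS, convert_to_tensor=True)

def validate_query(query: str) -> tuple[bool, str]:
    """Returns (allowed, reason). Runs before any LLM call."""
    if any(re.search(p, query, re.IGNORECASE) for p in INJECTION_PATTERNS):
        return False, "Query rejected: possible prompt injection."
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    similarity = util.cos_sim(query_embedding, anchor_embeddings).max().item()
    if similarity < DOMAIN_THRESHOLD:
        return False, "Query rejected: outside the Genshin Impact domain."
    return True, "ok"
```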
### Serving Layer
FastAPI with:
- Async lifespan model loading (model loads once at startup, not per request)
- Typed Pydantic request/response models with `blocked` flag for guardrail rejections
- CORS enabled for cross-origin UI
- `/health` endpoint reporting model load status
- Browser UI served from the same process (no separate frontend server)
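A skeletal version of that serving pattern, with the same request/response fields as the API section below. This is a simplified sketch, not the actual `main.py`, which wires in the RAG chain and guardrails.

```python
# Skeleton of the serving pattern: load heavy objects once in the lifespan,
# expose typed endpoints. Handler bodies are stubbed.
from contextlib import asynccontextmanager
from fastapi import FastAPI
from pydantic import BaseModel

state: dict = {"ready": False}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # In main.py the embedder and RAG chain are loaded here, once at startup.
    state["ready"] = True
    yield
    state["ready"] = False

app = FastAPI(lifespan=lifespan)

class GenerateRequest(BaseModel):
    query: str
    top_k: int = 3

class GenerateResponse(BaseModel):
    answer: str
    sources: list[str]
    latency_ms: float
    blocked: bool = False

@app.get("/health")
async def health():
    return {"ready": state["ready"]}

@app.post("/generate", response_model=GenerateResponse)
async def generate(req: GenerateRequest) -> GenerateResponse:
    # Real handler: guardrails -> retrieval -> LLM call -> source attribution.
    return GenerateResponse(answer="(stub)", sources=[], latency_ms=0.0, blocked=False)
```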
---
## Stack
| Layer | Technology |
|---|---|
| Base model | Llama 3.1 8B Instruct |
| Fine-tuning | QLoRA via PEFT (r=16, α=32, lr=2e-4) |
| Experiment tracking | MLflow |
| Quantization | BitsAndBytes 4-bit NF4, bfloat16 compute |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 |
| Vector DB | Pinecone serverless (cosine, 384-dim) |
| RAG chain | LangChain RetrievalQA |
| Serving | FastAPI + Uvicorn |
| Containerization | Docker (python:3.12-slim) |
| Live demo hosting | HuggingFace Spaces (CPU Basic) |
| Production deployment | Azure Container Apps + ACR |
| LLM backend (demo) | Groq API (llama-3.3-70b-versatile) |
| Corpus pipeline | GitHub Actions (weekly, autonomous) |
---
## Quickstart
### Option 1 — Groq backend (no GPU required)
```bash
# 1. Clone
git clone https://github.com/MukulRay1603/Irminsul.git
cd Irminsul
# 2. Install
python -m venv venv && source venv/bin/activate
# Windows: venv\Scripts\activate
pip install -r requirements.txt
# 3. Configure
cp .env.example .env
# Set PINECONE_API_KEY and GROQ_API_KEY in .env
# LLM_BACKEND=groq is the default
# 4. Ingest corpus (or use pre-ingested Pinecone index)
python ingest.py --dir ./docs --chunk-size 300 --chunk-overlap 40
# 5. Run
uvicorn main:app --reload --port 8000
# UI: http://localhost:8000
# API docs: http://localhost:8000/docs
```
### Option 2 — Local fine-tuned model (GPU required for inference, 6GB+ VRAM)
```bash
# Same steps 1–3, then:
# Set LLM_BACKEND=local and MODEL_PATH in .env
# 4. Download model
# Place the merged QLoRA model at: ./models/merged/exp2_lr2e-4_r16/
# (Or update MODEL_PATH in .env)
# 5. Run
uvicorn main:app --reload --port 8000
```
### Docker
```bash
# Groq backend (no GPU)
docker build -t irminsul:latest .
docker run -p 8000:8000 \
  -e PINECONE_API_KEY=your_key \
  -e GROQ_API_KEY=your_key \
  -e LLM_BACKEND=groq \
  irminsul:latest
```
---
## API
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/` | Browser UI |
| `GET` | `/health` | Model load status + ready flag |
| `POST` | `/generate` | RAG query → grounded answer + sources |
| `POST` | `/ingest` | Ingest docs from a local directory path |
**Request:**
```json
{
  "query": "What weapons should Hu Tao use on a budget?",
  "top_k": 3
}
```
**Response:**
```json
{
  "answer": "For Hu Tao on a budget, Dragon's Bane is the strongest F2P option — it scales with Elemental Mastery and deals significant bonus damage on vaporized hits. White Tassel is the best 3-star alternative for pure Normal Attack scaling.",
  "sources": ["docs/generated/characters/hu_tao.md", "docs/tcl/characters/pyro/hutao.md"],
  "latency_ms": 1240.5,
  "blocked": false
}
```
If a query is rejected by guardrails, `blocked: true` is returned with the rejection reason in `answer`. No LLM call is made.
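Calling the endpoint from Python, assuming a server running locally on port 8000 as in the Quickstart:

```python
# Minimal client call against a locally running server.
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={"query": "What weapons should Hu Tao use on a budget?", "top_k": 3},
    timeout=60,
)
data = resp.json()
if data["blocked"]:
    print("Rejected by guardrails:", data["answer"])
else:
    print(data["answer"])
    print("Sources:", data["sources"])
```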
---
## Deployment
See **[DEPLOYMENT.md](DEPLOYMENT.md)** for the full guide covering:
- Local development setup
- Docker (local and cloud)
- Azure Container Apps (one-shot `deploy_azure.sh`)
- Cost breakdown and the reasoning behind the demo setup
- GPU serving path for the fine-tuned model
**Why the live demo runs on HuggingFace + Groq, not Azure GPU:**
Serving the fine-tuned Llama 3.1 8B requires a GPU instance. The minimum viable option on Azure (NC4as T4 v3) costs ~$360/month — not justified for a portfolio project. The Dockerfile and `deploy_azure.sh` are written for the Azure path; the live demo swaps the LLM backend to Groq via a single environment variable. The RAG pipeline, guardrails, and serving layer are identical.
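The backend swap itself is a small branch on that environment variable. A sketch of the idea follows; the real selection logic lives in `rag.py` and may use different LangChain wrappers or extra quantization arguments.

```python
# Illustrative backend switch: only LLM_BACKEND changes between the Groq demo
# and the local GPU path. Not the actual rag.py.
import os

def build_llm():
    backend = os.getenv("LLM_BACKEND", "groq")
    if backend == "groq":
        from langchain_groq import ChatGroq
        return ChatGroq(model="llama-3.3-70b-versatile")
    # Local path: fine-tuned merged checkpoint, 4-bit inference (GPU required).
    from langchain_huggingface import HuggingFacePipeline
    return HuggingFacePipeline.from_model_id(
        model_id=os.getenv("MODEL_PATH", "./models/merged/exp2_lr2e-4_r16"),
        task="text-generation",
    )
```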
---
## Project Structure
```
Irminsul/
├── main.py               # FastAPI app: endpoints, lifespan, CORS, response models
├── rag.py                # LangChain RAG chain, dual backend (Groq / local Llama)
├── embedder.py           # sentence-transformers singleton (loads once, reused)
├── ingest.py             # Doc loader → word chunker → Pinecone upsert
├── guardrails.py         # Input validation: injection detection + domain cosine check
├── index.html            # Browser UI: dark Dendro theme, query history, source display
│
├── LLMOps_Pipeline.ipynb # Full training notebook: QLoRA, MLflow, eval (Colab A100)
│
├── Dockerfile            # python:3.12-slim, model NOT baked in
├── deploy_azure.sh       # One-shot ACR build + Container Apps deploy
├── .env.example          # Environment variable reference
│
├── DEPLOYMENT.md         # Full deployment guide + cost analysis
├── requirements.txt
├── assets/               # Screenshots and assets used in this README
│   ├── banner.png
│   ├── ui_main.png
│   ├── ui_response.png
│   └── mlflow_runs.png
└── models/               # gitignored — place merged model here locally
    └── merged/
        └── exp2_lr2e-4_r16/
```
---
## Evaluation
The winning checkpoint was evaluated against a held-out set using a custom local eval (cosine similarity for faithfulness, token overlap for ROUGE-L). RAGAS was attempted but hit async timeout issues on Colab, so the custom eval was used instead; results are fully reproducible from the notebook.
| Metric | Score | Method |
|---|---|---|
| Faithfulness | 0.826 | Cosine similarity: ground truth → answer embedding |
| ROUGE-L | 0.466 | Token overlap vs reference answers |
Full RAG pipeline evaluation (context recall, answer relevance) is a planned addition — see [What's Next](#whats-next).
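Both metrics reduce to a few lines. A sketch of the computation is below, with the `rouge_score` package standing in for the notebook's own token-overlap implementation; the eval set and preprocessing live in the notebook.

```python
# Sketch of the two local metrics; shapes of the computation only.
from sentence_transformers import SentenceTransformer, util
from rouge_score import rouge_scorer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def faithfulness(reference: str, answer: str) -> float:
    """Cosine similarity between ground-truth and generated-answer embeddings."""
    emb = embedder.encode([reference, answer], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def rouge_l(reference: str, answer: str) -> float:
    """ROUGE-L F-measure against the reference answer."""
    return scorer.score(reference, answer)["rougeL"].fmeasure
```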
---
## What's Next
- [ ] **End-to-end RAG evaluation** — RAGAS pipeline measuring faithfulness, context recall,
and answer relevance on a held-out Genshin question set; results logged to MLflow
alongside fine-tuning metrics for a single unified eval story
- [ ] **Smarter chunking** — swap word-level splitter for `MarkdownHeaderTextSplitter`
so retrieval respects document structure (character sections stay together,
stat tables don't get split mid-row)
- [ ] **Streaming responses** — SSE endpoint for lower perceived latency on long answers
- [ ] **CI/CD on push** — GitHub Actions → ACR build → `az containerapp update --image`
for zero-downtime rolling deploys to Azure on every merge to main
---
## Related: irminsul-corpus
The knowledge base is maintained in a companion repository:
**[MukulRay1603/irminsul-corpus](https://github.com/MukulRay1603/irminsul-corpus)**
It runs a fully autonomous weekly pipeline: pulls fresh game data from the KQM Theorycrafting Library and genshin-db API, synthesizes prose with Gemini 2.5 Flash, commits ~800 documents to the repo, and re-ingests ~4,000 vectors to Pinecone — without any manual intervention.
---
## License
MIT — see [LICENSE](LICENSE) for details.
Genshin Impact is owned by HoYoverse. This project is not affiliated with or endorsed by HoYoverse.
---
Built to learn the full MLOps lifecycle — fine-tuning on Colab, quantized inference on consumer hardware, retrieval, serving, and cloud deployment. Every component chosen deliberately, not for hype.