---
title: Irminsul
emoji: 🌿
colorFrom: green
colorTo: green
sdk: docker
app_port: 7860
pinned: false
---
![Irminsul Banner](assets/banner.png)

# Irminsul

A production-shaped LLMOps stack – QLoRA fine-tuning on Colab, RAG pipeline, containerized serving, and cloud deployment.



Most LLM projects stop at inference. This one builds the full stack: a QLoRA fine-tuned Llama 3.1 8B served through a RAG pipeline, with guardrails, a domain-specific knowledge base, and a containerized FastAPI server designed for cloud deployment.

→ Try the live demo


## About Irminsul

Irminsul is a domain-specific AI assistant for Genshin Impact – built not because Genshin needed an AI assistant, but because it provided a concrete, evaluable knowledge domain to build an LLMOps pipeline around. Every component was chosen deliberately:

  • A knowledge domain rich enough to evaluate retrieval quality (characters, mechanics, lore)
  • Ground truth data available (KQM Theorycrafting Library, game stat APIs) to measure hallucination
  • Community signal data (patch notes, meta shifts) to test corpus freshness

The domain is the test harness. The pipeline is the project.


## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         User Query                              │
└─────────────────────┬───────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Guardrails Layer                             │
│  • Injection detection (pattern matching)                       │
│  • Domain validation (cosine similarity vs anchor embeddings)   │
│  • Output sanitization                                          │
└─────────────────────┬───────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────────┐
│                  FastAPI /generate                              │
└──────────┬──────────────────────────────────────────────────────┘
           │
           ├─── Embed query (sentence-transformers, local, CPU)
           │              │
           │              ▼
           │         Pinecone ── semantic search ──► top-k chunks
           │              │
           ▼              ▼
     LangChain RetrievalQA (stuff chain)
           │
           ▼
  ┌────────────────────────────────────┐
  │          LLM Backend               │
  │                                    │
  │  Groq (live demo)                  │
  │  llama-3.3-70b-versatile           │
  │  ~300 tok/s, free tier             │
  │            ──── OR ────            │
  │  Local (fine-tuned)                │
  │  Llama 3.1 8B QLoRA                │
  │  4-bit NF4, RTX 3060 6GB           │
  │  (inference only - trained on      │
  │   Colab A100)                      │
  └────────────────────────────────────┘
           │
           ▼
     Grounded answer + source attribution
```

## Components

### Fine-Tuned Model

Llama 3.1 8B Instruct fine-tuned with QLoRA on the Stanford Alpaca dataset (52K instruction-following examples), trained on Google Colab Pro (A100). Local inference runs in 4-bit NF4 quantization on an RTX 3060 6GB.

→ View the training notebook on Colab

| Parameter | Value |
|---|---|
| Base model | `meta-llama/Llama-3.1-8B-Instruct` |
| Dataset | Stanford Alpaca (`tatsu-lab/alpaca`, 52K examples) |
| Method | QLoRA via PEFT |
| Rank / Alpha | r=16, α=32, dropout=0.05 |
| Target modules | `q_proj`, `v_proj`, `k_proj`, `o_proj` |
| Learning rate | 2e-4 (cosine schedule, 3% warmup) |
| Batch size | 4 per device × 4 grad accumulation = effective 16 |
| Epochs | 2 |
| Optimizer | `paged_adamw_32bit` |
| Quantization (inference) | 4-bit NF4, bfloat16 compute dtype |
| Training infra | Google Colab Pro (A100 40GB) |
| Experiment tracking | MLflow (3 runs) |
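For reference, a minimal sketch of this configuration using the standard `transformers`/`peft` APIs; the training notebook is authoritative and may differ in details:

```python
# Hyperparameters taken from the table above; a sketch, not the notebook verbatim.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NF4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 compute dtype
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters train
```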

→ Download the exp2_lr2e-4_r16 model

Three experiments were run sequentially, each tracked in MLflow:

| Experiment | LR | Rank | Result |
|---|---|---|---|
| exp1_lr1e-4_r16 | 1e-4 | 16 | Conservative baseline |
| exp2_lr2e-4_r16 | 2e-4 | 16 | **Winner** – best loss/quality balance |
| exp3_lr2e-4_r8 | 2e-4 | 8 | Tests whether rank=16 is worth the extra params |

The winning checkpoint (exp2_lr2e-4_r16) was selected on faithfulness (0.826) and ROUGE-L (0.466), both computed locally via cosine similarity and token overlap against a held-out eval set.

### RAG Pipeline

Documents are chunked, embedded locally with `sentence-transformers/all-MiniLM-L6-v2` (384-dim, zero API cost), and stored in Pinecone serverless. Retrieval is semantic, with top-k configurable per query.
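A minimal sketch of the word-level chunking described here (the actual `ingest.py` may differ in details):

```python
def chunk_words(text: str, size: int = 300, overlap: int = 40) -> list[str]:
    """Split text into 300-word chunks with a 40-word overlap across boundaries."""
    words = text.split()
    step = size - overlap  # advance 260 words per chunk
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]
```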

| Component | Choice | Reason |
|---|---|---|
| Embedder | all-MiniLM-L6-v2 | Runs locally, strong semantic retrieval, 384-dim fits free Pinecone tier |
| Vector DB | Pinecone serverless | Zero ops, cosine similarity, free tier sufficient for corpus size |
| Chunking | Word-level, 300 words, 40-word overlap | Preserves semantic units across chunk boundaries |
| Chain | LangChain RetrievalQA (stuff) | Simple, inspectable, returns source documents |
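And a sketch of the retrieval step under those choices; the index name and metadata fields below are illustrative, not taken from the repo:

```python
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim, local CPU
index = Pinecone(api_key="YOUR_PINECONE_API_KEY").Index("irminsul")        # illustrative index name

query_vec = embedder.encode("Best artifacts for Hu Tao?").tolist()
results = index.query(vector=query_vec, top_k=3, include_metadata=True)
for match in results.matches:
    print(f"{match.score:.3f}", match.metadata.get("source"))
```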

### Knowledge Corpus

The corpus is maintained in a separate repository with an autonomous update pipeline. It ingests from three tiers of sources with different trust levels:

| Tier | Source | Files | Trust |
|---|---|---|---|
| 1 – Ground Truth | KQM Theorycrafting Library (peer-reviewed mechanics) | ~305 | Highest – cite in builds |
| 1 – Ground Truth | genshin-db API (exact character/weapon/artifact stats) | ~406 | Highest – exact game data |
| 2 – Expert Synthesis | Gemini-authored prose grounded in Tier 1 | ~83 | High – no hallucinated stats |
| 3 – Community Signal | Official patch notes, banner history, event calendar | ~80 | Medium – tagged explicitly |

A GitHub Actions workflow runs every Sunday at 2am UTC, pulls fresh data, commits the docs, and re-ingests ~4,000 vectors to Pinecone automatically.

### Guardrails

Two layers of input validation before any LLM call:

1. **Injection detection** – pattern matching against known jailbreak phrases ("ignore previous instructions", "act as", "DAN mode", etc.)
2. **Domain validation** – cosine similarity between the query embedding and a set of Genshin-domain anchor sentences. Queries scoring below the threshold (0.35) are rejected with a domain-scoped error message before touching the LLM. (See the sketch below.)
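A minimal sketch of both checks; the pattern list, anchor sentences, and function names here are illustrative, and `guardrails.py` is the source of truth:

```python
import re
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Layer 1: pattern matching against known jailbreak phrasing
INJECTION_PATTERNS = [
    r"ignore (all |the )?previous instructions",
    r"\bact as\b",
    r"\bDAN mode\b",
]

def looks_like_injection(query: str) -> bool:
    return any(re.search(p, query, re.IGNORECASE) for p in INJECTION_PATTERNS)

# Layer 2: cosine similarity against domain anchor sentences
ANCHORS = [
    "Which artifacts are best for this Genshin Impact character?",
    "How does elemental mastery affect vaporize damage?",
]
anchor_embs = embedder.encode(ANCHORS, convert_to_tensor=True)

def is_in_domain(query: str, threshold: float = 0.35) -> bool:
    q = embedder.encode(query, convert_to_tensor=True)
    # The best similarity to any anchor must clear the threshold
    return util.cos_sim(q, anchor_embs).max().item() >= threshold
```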

Output is sanitized to strip generation artifacts (`</s>` tokens, trailing whitespace) and length-checked.

### Serving Layer

FastAPI with:

- Async lifespan model loading (the model loads once at startup, not per request) – see the sketch after this list
- Typed Pydantic request/response models with a `blocked` flag for guardrail rejections
- CORS enabled for the cross-origin UI
- `/health` endpoint reporting model load status
- Browser UI served from the same process (no separate frontend server)

## Stack

| Layer | Technology |
|---|---|
| Base model | Llama 3.1 8B Instruct |
| Fine-tuning | QLoRA via PEFT (r=16, α=32, lr=2e-4) |
| Experiment tracking | MLflow |
| Quantization | BitsAndBytes 4-bit NF4, bfloat16 compute |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 |
| Vector DB | Pinecone serverless (cosine, 384-dim) |
| RAG chain | LangChain RetrievalQA |
| Serving | FastAPI + Uvicorn |
| Containerization | Docker (python:3.12-slim) |
| Live demo hosting | HuggingFace Spaces (CPU Basic) |
| Production deployment | Azure Container Apps + ACR |
| LLM backend (demo) | Groq API (llama-3.3-70b-versatile) |
| Corpus pipeline | GitHub Actions (weekly, autonomous) |

## Quickstart

### Option 1 – Groq backend (no GPU required)

```bash
# 1. Clone
git clone https://github.com/MukulRay1603/Irminsul.git
cd Irminsul

# 2. Install
python -m venv venv && source venv/bin/activate
# Windows: venv\Scripts\activate
pip install -r requirements.txt

# 3. Configure
cp .env.example .env
# Set PINECONE_API_KEY and GROQ_API_KEY in .env
# LLM_BACKEND=groq is the default

# 4. Ingest corpus (or use the pre-ingested Pinecone index)
python ingest.py --dir ./docs --chunk-size 300 --chunk-overlap 40

# 5. Run
uvicorn main:app --reload --port 8000
# UI: http://localhost:8000
# API docs: http://localhost:8000/docs
```

### Option 2 – Local fine-tuned model (GPU required for inference, 6GB+ VRAM)

```bash
# Same steps 1–3, then:
# Set LLM_BACKEND=local and MODEL_PATH in .env

# 4. Download model
# Place the merged QLoRA model at: ./models/merged/exp2_lr2e-4_r16/
# (Or update MODEL_PATH in .env)

# 5. Run
uvicorn main:app --reload --port 8000
```
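Under the hood, loading the merged checkpoint in 4-bit looks roughly like this (a sketch using the standard `transformers` API; `rag.py` is authoritative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_PATH = "./models/merged/exp2_lr2e-4_r16"  # or read MODEL_PATH from .env

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    quantization_config=bnb,
    device_map="auto",  # 4-bit NF4 fits the 8B model on an RTX 3060 6GB
)
```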

### Docker

```bash
# Groq backend (no GPU)
docker build -t irminsul:latest .
docker run -p 8000:8000 \
  -e PINECONE_API_KEY=your_key \
  -e GROQ_API_KEY=your_key \
  -e LLM_BACKEND=groq \
  irminsul:latest
```

## API

| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Browser UI |
| GET | `/health` | Model load status + ready flag |
| POST | `/generate` | RAG query → grounded answer + sources |
| POST | `/ingest` | Ingest docs from a local directory path |

Request:

```json
{
  "query": "What weapons should Hu Tao use on a budget?",
  "top_k": 3
}
```

Response:

```json
{
  "answer": "For Hu Tao on a budget, Dragon's Bane is the strongest F2P option – it scales with Elemental Mastery and deals significant bonus damage on vaporized hits. White Tassel is the best 3-star alternative for pure Normal Attack scaling.",
  "sources": ["docs/generated/characters/hu_tao.md", "docs/tcl/characters/pyro/hutao.md"],
  "latency_ms": 1240.5,
  "blocked": false
}
```

If a query is rejected by guardrails, `blocked: true` is returned with the rejection reason in `answer`; no LLM call is made.
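For example, calling the endpoint from Python (assuming the server from the quickstart is running locally and the `requests` package is installed):

```python
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={"query": "What weapons should Hu Tao use on a budget?", "top_k": 3},
    timeout=60,
)
data = resp.json()
if data["blocked"]:
    print("Rejected by guardrails:", data["answer"])
else:
    print(data["answer"])
    print("Sources:", data["sources"])
```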


## Deployment

See [DEPLOYMENT.md](DEPLOYMENT.md) for the full guide covering:

  • Local development setup
  • Docker (local and cloud)
  • Azure Container Apps (one-shot deploy_azure.sh)
  • Cost breakdown and the reasoning behind the demo setup
  • GPU serving path for the fine-tuned model

**Why the live demo runs on HuggingFace + Groq, not Azure GPU:**

Serving the fine-tuned Llama 3.1 8B requires a GPU instance. The minimum viable option on Azure (NC4as T4 v3) costs ~$360/month – not justified for a portfolio project. The Dockerfile and `deploy_azure.sh` are written for the Azure path; the live demo swaps the LLM backend to Groq via a single environment variable. The RAG pipeline, guardrails, and serving layer are identical.


## Project Structure

```
Irminsul/
├── main.py                  # FastAPI app: endpoints, lifespan, CORS, response models
├── rag.py                   # LangChain RAG chain, dual backend (Groq / local Llama)
├── embedder.py              # sentence-transformers singleton (loads once, reused)
├── ingest.py                # Doc loader → word chunker → Pinecone upsert
├── guardrails.py            # Input validation: injection detection + domain cosine check
├── index.html               # Browser UI: dark Dendro theme, query history, source display
│
├── LLMOps_Pipeline.ipynb    # Full training notebook: QLoRA, MLflow, eval (Colab A100)
│
├── Dockerfile               # python:3.12-slim, model NOT baked in
├── deploy_azure.sh          # One-shot ACR build + Container Apps deploy
├── .env.example             # Environment variable reference
│
├── DEPLOYMENT.md            # Full deployment guide + cost analysis
├── requirements.txt
├── assets/                  # Screenshots and assets used in this README
│   ├── banner.png
│   ├── ui_main.png
│   ├── ui_response.png
│   └── mlflow_runs.png
└── models/                  # gitignored - place merged model here locally
    └── merged/
        └── exp2_lr2e-4_r16/
```

## Evaluation

The winning checkpoint was evaluated against a held-out set using a custom local eval (cosine similarity for faithfulness, token overlap for ROUGE-L). RAGAS was attempted but hit async timeout issues on Colab, so the custom eval was used instead; results are fully reproducible from the notebook.

| Metric | Score | Method |
|---|---|---|
| Faithfulness | 0.826 | Cosine similarity: ground truth → answer embedding |
| ROUGE-L | 0.466 | Token overlap vs reference answers |
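A sketch of how these two metrics can be computed locally; the `rouge_score` package and the function names here are assumptions, and the notebook is authoritative:

```python
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def faithfulness(ground_truth: str, answer: str) -> float:
    """Cosine similarity between ground-truth and answer embeddings."""
    embs = embedder.encode([ground_truth, answer], convert_to_tensor=True)
    return util.cos_sim(embs[0], embs[1]).item()

def rouge_l(reference: str, answer: str) -> float:
    """Token-overlap F-measure against the reference answer."""
    return scorer.score(reference, answer)["rougeL"].fmeasure
```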

Full RAG pipeline evaluation (context recall, answer relevance) is a planned addition – see What's Next.


## What's Next

  • End-to-end RAG evaluation β€” RAGAS pipeline measuring faithfulness, context recall, and answer relevance on a held-out Genshin question set; results logged to MLflow alongside fine-tuning metrics for a single unified eval story
  • Smarter chunking β€” swap word-level splitter for MarkdownHeaderTextSplitter so retrieval respects document structure (character sections stay together, stat tables don't get split mid-row)
  • Streaming responses β€” SSE endpoint for lower perceived latency on long answers
  • CI/CD on push β€” GitHub Actions β†’ ACR build β†’ az containerapp update --image for zero-downtime rolling deploys to Azure on every merge to main

## Related: irminsul-corpus

The knowledge base is maintained in a companion repository:

[MukulRay1603/irminsul-corpus](https://github.com/MukulRay1603/irminsul-corpus)

It runs a fully autonomous weekly pipeline: it pulls fresh game data from the KQM Theorycrafting Library and genshin-db API, synthesizes prose with Gemini 2.5 Flash, commits ~800 documents to the repo, and re-ingests ~4,000 vectors to Pinecone – without any manual intervention.


## License

MIT – see LICENSE for details.

Genshin Impact is owned by HoYoverse. This project is not affiliated with or endorsed by HoYoverse.


Built to learn the full MLOps lifecycle – fine-tuning on Colab, quantized inference on consumer hardware, retrieval, serving, and cloud deployment. Every component chosen deliberately, not for hype.