---
title: Irminsul
emoji: 🌿
colorFrom: green
colorTo: green
sdk: docker
app_port: 7860
pinned: false
---
![Irminsul Banner](assets/banner.png)

# Irminsul

**A production-shaped LLMOps stack — QLoRA fine-tuning on Colab, RAG pipeline, containerized serving, and cloud deployment.**

[![Live Demo](https://img.shields.io/badge/Live_Demo-HuggingFace_Spaces-FFD21E?style=flat&logo=huggingface)](https://huggingface.co/spaces/MukulRay/Irminsul)
[![GitHub](https://img.shields.io/badge/GitHub-MukulRay1603-181717?style=flat&logo=github)](https://github.com/MukulRay1603/Irminsul)
[![Corpus Pipeline](https://img.shields.io/badge/Corpus_Pipeline-irminsul--corpus-2ea44f?style=flat&logo=github)](https://github.com/MukulRay1603/irminsul-corpus)
[![License](https://img.shields.io/badge/License-MIT-green?style=flat)](LICENSE)
---

Most LLM projects stop at inference. This one builds the full stack: a QLoRA fine-tuned Llama 3.1 8B served through a RAG pipeline, with guardrails, a domain-specific knowledge base, and a containerized FastAPI server designed for cloud deployment.

**[→ Try the live demo](https://huggingface.co/spaces/MukulRay/Irminsul)**

---

## About Irminsul

Irminsul is a domain-specific AI assistant for Genshin Impact — built not because Genshin needed an AI assistant, but because it provided a concrete, evaluable knowledge domain to build an LLMOps pipeline around. Every component was chosen deliberately:

- A knowledge domain rich enough to evaluate retrieval quality (characters, mechanics, lore)
- Ground truth data available (KQM Theorycrafting Library, game stat APIs) to measure hallucination
- Community signal data (patch notes, meta shifts) to test corpus freshness

The domain is the test harness. The pipeline is the project.

---

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                           User Query                            │
└─────────────────────┬───────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────────┐
│                        Guardrails Layer                         │
│  • Injection detection (pattern matching)                       │
│  • Domain validation (cosine similarity vs anchor embeddings)   │
│  • Output sanitization                                          │
└─────────────────────┬───────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────────┐
│                        FastAPI /generate                        │
└──────────┬──────────────────────────────────────────────────────┘
           │
           ├─── Embed query (sentence-transformers, local, CPU)
           │                     │
           │                     ▼
           │     Pinecone ── semantic search ──► top-k chunks
           │                                         │
           ▼                                         ▼
          LangChain RetrievalQA (stuff chain)
                      │
                      ▼
        ┌────────────────────────────────────┐
        │            LLM Backend             │
        │                                    │
        │          Groq (live demo)          │
        │      llama-3.3-70b-versatile       │
        │       ~300 tok/s, free tier        │
        │            ──── OR ────            │
        │         Local (fine-tuned)         │
        │         Llama 3.1 8B QLoRA         │
        │      4-bit NF4, RTX 3060 6GB       │
        │    (inference only — trained on    │
        │            Colab A100)             │
        └────────────────────────────────────┘
                      │
                      ▼
          Grounded answer + source attribution
```

---

## Components

### Fine-Tuned Model

Llama 3.1 8B Instruct fine-tuned with QLoRA on the Stanford Alpaca dataset (52K instruction-following examples), trained on Google Colab Pro (A100). Local inference runs in 4-bit NF4 quantization on an RTX 3060 6GB.
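For reference, a minimal sketch of how a merged checkpoint like this is typically loaded for 4-bit NF4 inference with `transformers` and `bitsandbytes` — not necessarily the repo's exact loading code; the model path mirrors the project structure below, adjust to your `MODEL_PATH`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_PATH = "./models/merged/exp2_lr2e-4_r16"  # merged QLoRA checkpoint (see Project Structure)

# 4-bit NF4 quantization with bfloat16 compute, matching the inference config described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    quantization_config=bnb_config,
    device_map="auto",  # the 8B model in 4-bit fits an RTX 3060 6GB
)
```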
**[→ View training notebook on Colab](https://colab.research.google.com/drive/1wXz6V196IXEEU3FKwxDJ7BBxRh79QqEF?usp=sharing)**

| Parameter | Value |
|---|---|
| Base model | `meta-llama/Llama-3.1-8B-Instruct` |
| Dataset | Stanford Alpaca (`tatsu-lab/alpaca`, 52K examples) |
| Method | QLoRA via PEFT |
| Rank / Alpha | r=16, α=32, dropout=0.05 |
| Target modules | q_proj, v_proj, k_proj, o_proj |
| Learning rate | 2e-4 (cosine schedule, warmup 3%) |
| Batch size | 4 per device × 4 grad accumulation = effective 16 |
| Epochs | 2 |
| Optimizer | paged_adamw_32bit |
| Quantization (inference) | 4-bit NF4, bfloat16 compute dtype |
| Training infra | Google Colab Pro (A100 40GB) |
| Experiment tracking | MLflow (3 runs) |

**[→ Download the exp2_lr2e-4_r16 model](https://drive.google.com/drive/folders/1vAVXDXzT5lThnvlgQwXRi0ParmyB3V0P?usp=sharing)**

Three experiments were run sequentially, each tracked in MLflow:

| Experiment | LR | Rank | Result |
|---|---|---|---|
| exp1_lr1e-4_r16 | 1e-4 | 16 | Conservative baseline |
| exp2_lr2e-4_r16 | 2e-4 | 16 | **Winner** — best loss/quality balance |
| exp3_lr2e-4_r8 | 2e-4 | 8 | Tests whether rank=16 is worth the extra params |

The winning checkpoint (`exp2_lr2e-4_r16`) was selected on faithfulness (0.826) and ROUGE-L (0.466), both computed locally via cosine similarity and token overlap against a held-out eval set.

### RAG Pipeline

Documents are chunked, embedded locally with `sentence-transformers/all-MiniLM-L6-v2` (384-dim, zero API cost), and stored in Pinecone serverless. Retrieval is semantic, with top-k configurable per query.

| Component | Choice | Reason |
|---|---|---|
| Embedder | all-MiniLM-L6-v2 | Runs locally, strong semantic retrieval, 384-dim fits free Pinecone tier |
| Vector DB | Pinecone serverless | Zero ops, cosine similarity, free tier sufficient for corpus size |
| Chunking | Word-level, 300 words, 40-word overlap | Preserves semantic units across chunk boundaries |
| Chain | LangChain RetrievalQA (stuff) | Simple, inspectable, returns source documents |

### Knowledge Corpus

The corpus is maintained in a [separate repository](https://github.com/MukulRay1603/irminsul-corpus) with an autonomous update pipeline. It ingests from three tiers of sources with different trust levels:

| Tier | Source | Files | Trust |
|---|---|---|---|
| 1 — Ground Truth | KQM Theorycrafting Library (peer-reviewed mechanics) | ~305 | Highest — cite in builds |
| 1 — Ground Truth | genshin-db API (exact character/weapon/artifact stats) | ~406 | Highest — exact game data |
| 2 — Expert Synthesis | Gemini-authored prose grounded in Tier 1 | ~83 | High — no hallucinated stats |
| 3 — Community Signal | Official patch notes, banner history, event calendar | ~80 | Medium — tagged explicitly |

A GitHub Actions workflow runs every Sunday at 2am UTC, pulls fresh data, commits the docs, and re-ingests ~4,000 vectors to Pinecone automatically.

### Guardrails

Two layers of input validation run before any LLM call:

1. **Injection detection** — pattern matching against known jailbreak phrases (`ignore previous instructions`, `act as`, `DAN mode`, etc.)
2. **Domain validation** — cosine similarity between the query embedding and a set of Genshin-domain anchor sentences. Queries scoring below the threshold (0.35) are rejected with a domain-scoped error message before touching the LLM.

Output is sanitized to strip generation artifacts (special tokens, trailing whitespace) and length-checked.
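A minimal sketch of the domain-validation idea — not the repo's `guardrails.py`, and the anchor sentences here are purely illustrative:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative anchors — the real set lives in guardrails.py
ANCHOR_SENTENCES = [
    "Genshin Impact character builds, weapons, and artifacts",
    "Elemental reactions, team compositions, and game mechanics",
]
DOMAIN_THRESHOLD = 0.35  # queries scoring below this are rejected before any LLM call

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
anchor_embs = embedder.encode(ANCHOR_SENTENCES, convert_to_tensor=True)

def is_in_domain(query: str) -> bool:
    query_emb = embedder.encode(query, convert_to_tensor=True)
    # Best cosine similarity against any anchor decides domain membership
    return util.cos_sim(query_emb, anchor_embs).max().item() >= DOMAIN_THRESHOLD
```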
### Serving Layer

FastAPI with:

- Async lifespan model loading (model loads once at startup, not per request)
- Typed Pydantic request/response models with a `blocked` flag for guardrail rejections
- CORS enabled for cross-origin UI
- `/health` endpoint reporting model load status
- Browser UI served from the same process (no separate frontend server)

---

## Stack

| Layer | Technology |
|---|---|
| Base model | Llama 3.1 8B Instruct |
| Fine-tuning | QLoRA via PEFT (r=16, α=32, lr=2e-4) |
| Experiment tracking | MLflow |
| Quantization | BitsAndBytes 4-bit NF4, bfloat16 compute |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 |
| Vector DB | Pinecone serverless (cosine, 384-dim) |
| RAG chain | LangChain RetrievalQA |
| Serving | FastAPI + Uvicorn |
| Containerization | Docker (python:3.12-slim) |
| Live demo hosting | HuggingFace Spaces (CPU Basic) |
| Production deployment | Azure Container Apps + ACR |
| LLM backend (demo) | Groq API (llama-3.3-70b-versatile) |
| Corpus pipeline | GitHub Actions (weekly, autonomous) |

---

## Quickstart

### Option 1 — Groq backend (no GPU required)

```bash
# 1. Clone
git clone https://github.com/MukulRay1603/Irminsul.git
cd Irminsul

# 2. Install
python -m venv venv && source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt

# 3. Configure
cp .env.example .env
# Set PINECONE_API_KEY and GROQ_API_KEY in .env
# LLM_BACKEND=groq is the default

# 4. Ingest corpus (or use the pre-ingested Pinecone index)
python ingest.py --dir ./docs --chunk-size 300 --chunk-overlap 40

# 5. Run
uvicorn main:app --reload --port 8000
# UI:       http://localhost:8000
# API docs: http://localhost:8000/docs
```

### Option 2 — Local fine-tuned model (GPU required for inference, 6GB+ VRAM)

```bash
# Same steps 1–3, then:
# Set LLM_BACKEND=local and MODEL_PATH in .env

# 4. Download model
# Place the merged QLoRA model at: ./models/merged/exp2_lr2e-4_r16/
# (Or update MODEL_PATH in .env)

# 5. Run
uvicorn main:app --reload --port 8000
```

### Docker

```bash
# Groq backend (no GPU)
docker build -t irminsul:latest .
docker run -p 8000:8000 \
  -e PINECONE_API_KEY=your_key \
  -e GROQ_API_KEY=your_key \
  -e LLM_BACKEND=groq \
  irminsul:latest
```

---

## API

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/` | Browser UI |
| `GET` | `/health` | Model load status + ready flag |
| `POST` | `/generate` | RAG query → grounded answer + sources |
| `POST` | `/ingest` | Ingest docs from a local directory path |

**Request:**

```json
{
  "query": "What weapons should Hu Tao use on a budget?",
  "top_k": 3
}
```

**Response:**

```json
{
  "answer": "For Hu Tao on a budget, Dragon's Bane is the strongest F2P option — it scales with Elemental Mastery and deals significant bonus damage on vaporized hits. White Tassel is the best 3-star alternative for pure Normal Attack scaling.",
  "sources": ["docs/generated/characters/hu_tao.md", "docs/tcl/characters/pyro/hutao.md"],
  "latency_ms": 1240.5,
  "blocked": false
}
```

If a query is rejected by guardrails, `blocked: true` is returned with the rejection reason in `answer`. No LLM call is made.
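For a quick smoke test against a locally running server — a sketch using `requests`, assuming the Quickstart defaults on port 8000; the field names follow the response schema shown above:

```python
import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={"query": "What weapons should Hu Tao use on a budget?", "top_k": 3},
    timeout=60,
)
body = resp.json()  # fields follow the response schema above
if body["blocked"]:
    print("Rejected by guardrails:", body["answer"])
else:
    print(body["answer"])
    print("Sources:", body["sources"])
```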
---

## Deployment

See **[DEPLOYMENT.md](DEPLOYMENT.md)** for the full guide covering:

- Local development setup
- Docker (local and cloud)
- Azure Container Apps (one-shot `deploy_azure.sh`)
- Cost breakdown and the reasoning behind the demo setup
- GPU serving path for the fine-tuned model

**Why the live demo runs on HuggingFace + Groq, not Azure GPU:** Serving the fine-tuned Llama 3.1 8B requires a GPU instance. The minimum viable option on Azure (NC4as T4 v3) costs ~$360/month — not justified for a portfolio project. The Dockerfile and `deploy_azure.sh` are written for the Azure path; the live demo swaps the LLM backend to Groq via a single environment variable. The RAG pipeline, guardrails, and serving layer are identical.

---

## Project Structure

```
Irminsul/
├── main.py               # FastAPI app: endpoints, lifespan, CORS, response models
├── rag.py                # LangChain RAG chain, dual backend (Groq / local Llama)
├── embedder.py           # sentence-transformers singleton (loads once, reused)
├── ingest.py             # Doc loader → word chunker → Pinecone upsert
├── guardrails.py         # Input validation: injection detection + domain cosine check
├── index.html            # Browser UI: dark Dendro theme, query history, source display
│
├── LLMOps_Pipeline.ipynb # Full training notebook: QLoRA, MLflow, eval (Colab A100)
│
├── Dockerfile            # python:3.12-slim, model NOT baked in
├── deploy_azure.sh       # One-shot ACR build + Container Apps deploy
├── .env.example          # Environment variable reference
│
├── DEPLOYMENT.md         # Full deployment guide + cost analysis
├── requirements.txt
├── assets/               # Screenshots and assets used in this README
│   ├── banner.png
│   ├── ui_main.png
│   ├── ui_response.png
│   └── mlflow_runs.png
└── models/               # gitignored — place merged model here locally
    └── merged/
        └── exp2_lr2e-4_r16/
```

---

## Evaluation

The winning checkpoint was evaluated against a held-out set using a custom local eval (cosine similarity for faithfulness, token overlap for ROUGE-L). RAGAS was attempted but hit async timeout issues on Colab, so the custom eval was used instead; results are fully reproducible from the notebook.

| Metric | Score | Method |
|---|---|---|
| Faithfulness | 0.826 | Cosine similarity: ground truth → answer embedding |
| ROUGE-L | 0.466 | Token overlap vs reference answers |
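For concreteness, an illustrative re-implementation of what the two metrics compute — not the notebook's exact code:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def faithfulness(reference: str, answer: str) -> float:
    """Cosine similarity between ground-truth and generated-answer embeddings."""
    ref_emb, ans_emb = embedder.encode([reference, answer], convert_to_tensor=True)
    return util.cos_sim(ref_emb, ans_emb).item()

def rouge_l(reference: str, answer: str) -> float:
    """ROUGE-L F-measure via longest common subsequence over whitespace tokens."""
    ref, ans = reference.split(), answer.split()
    if not ref or not ans:
        return 0.0
    # Classic LCS dynamic program over the two token sequences
    dp = [[0] * (len(ans) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref):
        for j, a in enumerate(ans):
            dp[i + 1][j + 1] = dp[i][j] + 1 if r == a else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    return 2 * lcs / (len(ref) + len(ans))  # harmonic mean of LCS precision and recall
```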
Full RAG pipeline evaluation (context recall, answer relevance) is a planned addition — see [What's Next](#whats-next).

---

## What's Next

- [ ] **End-to-end RAG evaluation** — RAGAS pipeline measuring faithfulness, context recall, and answer relevance on a held-out Genshin question set; results logged to MLflow alongside fine-tuning metrics for a single unified eval story
- [ ] **Smarter chunking** — swap the word-level splitter for `MarkdownHeaderTextSplitter` so retrieval respects document structure (character sections stay together, stat tables don't get split mid-row)
- [ ] **Streaming responses** — SSE endpoint for lower perceived latency on long answers
- [ ] **CI/CD on push** — GitHub Actions → ACR build → `az containerapp update --image` for zero-downtime rolling deploys to Azure on every merge to main

---

## Related: irminsul-corpus

The knowledge base is maintained in a companion repository: **[MukulRay1603/irminsul-corpus](https://github.com/MukulRay1603/irminsul-corpus)**

It runs a fully autonomous weekly pipeline: pulls fresh game data from the KQM Theorycrafting Library and genshin-db API, synthesizes prose with Gemini 2.5 Flash, commits ~800 documents to the repo, and re-ingests ~4,000 vectors to Pinecone — without any manual intervention.

---

## License

MIT — see [LICENSE](LICENSE) for details.

Genshin Impact is owned by HoYoverse. This project is not affiliated with or endorsed by HoYoverse.

---

Built to learn the full MLOps lifecycle — fine-tuning on Colab, quantized inference on consumer hardware, retrieval, serving, and cloud deployment. Every component chosen deliberately, not for hype.