MukulRay committed on Mar 22

Commit

3fbe97b

1 Parent(s): 3898be2

docs: overhaul README, add DEPLOYMENT.md, deploy_azure.sh, .env.example

Files changed (5)

.gitignore +0 -0
DEPLOYMENT.md +208 -0
README.md +310 -90
deploy_azure.sh +149 -0
requirements.txt +11 -8

.gitignore CHANGED

Binary files a/.gitignore and b/.gitignore differ

DEPLOYMENT.md ADDED

	@@ -0,0 +1,208 @@

+# Deployment Guide
+This document covers all deployment options for Irminsul, the cost tradeoffs between them, and the architectural decisions behind the live demo setup.
+---
+## Deployment Options
+Irminsul supports two LLM backends and multiple hosting targets. Choose based on your infrastructure and budget.
+| Backend | Where to Run | GPU Required | Cost |
+|---|---|---|---|
+| **Groq** (recommended) | Anywhere — no GPU | No | Free tier available |
+| **Local Llama** (fine-tuned model) | Local machine / GPU VM | Yes (6GB+ VRAM) | Hardware cost / ~$0.50–1.50/hr on Azure |
+---
+## Live Demo: HuggingFace Spaces + Groq
+**Why this is the live demo environment:**
+The fine-tuned Llama 3.1 8B model is 16GB on disk and requires a GPU-enabled instance to serve at acceptable latency. On Azure, the minimum viable GPU instance for this model is the **NC4as T4 v3** (~$0.50/hr, ~$360/month). Running this persistently for a portfolio project is not cost-effective.
+The live demo instead uses:
+- **HuggingFace Spaces**: free CPU hosting for the FastAPI container
+- **Groq API**: runs `llama-3.3-70b-versatile` on Groq's Language Processing Units (LPUs) at ~300 tokens/second, at no cost on Groq's free tier
+This demonstrates the identical RAG architecture — the LLM backend is swapped via a single environment variable (`LLM_BACKEND=groq`). The retrieval pipeline, guardrails, response format, and API contract are unchanged.
+```
+Live demo:  https://huggingface.co/spaces/MukulRay/Irminsul
+```
+---
+## Option A: Local Development
+The full stack including the fine-tuned model runs locally on an RTX 3060 6GB:
+```bash
+# 1. Clone and install
+git clone https://github.com/MukulRay1603/Irminsul.git
+cd Irminsul
+python -m venv venv && source venv/bin/activate
+pip install -r requirements.txt
+# 2. Configure
+cp .env.example .env
+# Edit .env: set LLM_BACKEND=local (the default is groq), MODEL_PATH, PINECONE_API_KEY
+# 3. Ingest corpus
+python ingest.py --dir ./docs --chunk-size 300 --chunk-overlap 40
+# 4. Serve
+uvicorn main:app --host 0.0.0.0 --port 8000
+```
+**Memory profile:**
+| Component | VRAM |
+|---|---|
+| Llama 3.1 8B @ 4-bit NF4 | ~4.5 GB |
+| all-MiniLM-L6-v2 embedder | ~90 MB |
+| Inference headroom | ~1.2 GB |
+| **Total** | **~5.8 GB** |
+The model loads with `max_memory={0: "5.5GiB", "cpu": "24GiB"}` — layers that don't fit on GPU overflow to RAM automatically via `accelerate`.
+---
+## Option B: Docker (Local or Any Cloud)
+The Dockerfile is intentionally slim — the model is **not baked in**. It's injected at runtime via `MODEL_PATH`.
+```bash
+# Build
+docker build -t irminsul:latest .
+# Run with Groq backend (no GPU needed)
+docker run -p 8000:8000 \
+  -e PINECONE_API_KEY=your_key \
+  -e GROQ_API_KEY=your_key \
+  -e PINECONE_INDEX=llmops-rag \
+  -e LLM_BACKEND=groq \
+  irminsul:latest
+# Run with local model (GPU required)
+docker run -p 8000:8000 \
+  --gpus all \
+  -v /path/to/models:/app/models \
+  -e PINECONE_API_KEY=your_key \
+  -e MODEL_PATH=/app/models/merged/exp2_lr2e-4_r16 \
+  -e LLM_BACKEND=local \
+  irminsul:latest
+```
+---
+## Option C: Azure Container Apps
+Azure Container Apps (ACA) is the production deployment target. The `deploy_azure.sh` script provisions the full stack in one command.
+### Prerequisites
+```bash
+# Install Azure CLI
+# macOS:
+brew install azure-cli
+# Linux:
+curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
+# Windows: https://aka.ms/installazurecliwindows
+# Log in
+az login
+az account show  # confirm your subscription
+```
+### One-shot deploy
+```bash
+export PINECONE_API_KEY=your_pinecone_key
+export GROQ_API_KEY=your_groq_key
+chmod +x deploy_azure.sh
+./deploy_azure.sh
+```
+The script:
+1. Creates resource group `irminsul-rg` in East US
+2. Creates Azure Container Registry `irminsulacr`
+3. Builds the Docker image via **ACR Tasks** — the source code is uploaded and built in Azure's cloud; no local Docker daemon needed
+4. Creates a Container Apps environment
+5. Deploys the app with secrets injected as environment variables
+6. Outputs the live HTTPS URL
+### Tearing down
+```bash
+# Delete everything — stops all billing immediately
+az group delete --name irminsul-rg --yes --no-wait
+```
+### Cost breakdown (Groq backend, no GPU)
+| Resource | SKU | Cost |
+|---|---|---|
+| Container Apps | Consumption plan | Free (180k vCPU-s/month) |
+| ACR | Basic | ~$5/month |
+| Outbound bandwidth | First 100GB | Free |
+| **Total** | | **~$5/month** |
+On Azure for Students ($100 credit), this runs for ~20 months.
+### Why not GPU on Azure?
+To serve the fine-tuned Llama model in production, a GPU instance is required:
+| Instance | GPU | VRAM | Cost |
+|---|---|---|---|
+| NC4as T4 v3 | Tesla T4 | 16 GB | ~$0.50/hr = **~$360/month** |
+| NC6s v3 | Tesla V100 | 16 GB | ~$0.90/hr = **~$648/month** |
+At these prices, a portfolio project running 24/7 would exhaust the $100 student credit in roughly 5 to 8 days ($100 / $0.90/hr ≈ 111 hours; $100 / $0.50/hr = 200 hours). The Groq backend delivers the same RAG functionality at zero marginal cost, making it the right engineering tradeoff.
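The cost figures in this section follow from simple arithmetic. Below is a minimal Python sketch for checking them, assuming a 720-hour month (30 days x 24 hours) and the approximate hourly rates quoted in the table above. The function names are illustrative and not part of the repository.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: 720-hour month, which matches the table's totals


def monthly_cost(hourly_rate: float) -> float:
    """Cost of keeping one instance up 24/7 for a month."""
    return hourly_rate * HOURS_PER_MONTH


def runway_days(credit: float, hourly_rate: float) -> float:
    """Days of continuous uptime that a fixed credit buys at an hourly rate."""
    return credit / hourly_rate / 24


def runway_months(credit: float, monthly_rate: float) -> float:
    """Months that a fixed credit lasts at a flat monthly rate."""
    return credit / monthly_rate


if __name__ == "__main__":
    for name, rate in [("NC4as T4 v3", 0.50), ("NC6s v3", 0.90)]:
        print(
            f"{name}: ~${monthly_cost(rate):.0f}/month, "
            f"$100 lasts ~{runway_days(100, rate):.1f} days"
        )
    print(f"Groq path at ~$5/month: $100 lasts ~{runway_months(100, 5):.0f} months")
```

At $0.50/hr the $100 credit covers about 8.3 days of continuous uptime, and at $0.90/hr about 4.6 days; the ~$5/month Groq path stretches the same credit to about 20 months.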
+### Serving the fine-tuned model on Azure (production path)
+If cost were not a constraint, the correct architecture is:
+1. **Upload model to Azure Blob Storage** (~$0.02/GB/month for 16GB = ~$0.32/month)
+2. **Mount as a volume** in Container Apps — the container sees it at `/app/models/`
+3. **Switch to GPU SKU** — replace `--cpu 1.0 --memory 2.0Gi` in `deploy_azure.sh` with a GPU-enabled workload profile
+4. **Set `LLM_BACKEND=local`** in env vars
+The Docker image and application code require zero changes for this path. The abstraction was designed for it.
+---
+## Environment Variables Reference
+| Variable | Required | Default | Description |
+|---|---|---|---|
+| `PINECONE_API_KEY` | Yes | — | Pinecone serverless API key |
+| `PINECONE_INDEX` | No | `llmops-rag` | Pinecone index name |
+| `LLM_BACKEND` | No | `groq` | `groq` or `local` |
+| `GROQ_API_KEY` | If Groq | — | Groq API key |
+| `GROQ_MODEL` | No | `llama-3.3-70b-versatile` | Groq model name |
+| `MODEL_PATH` | If local | `./models/merged/exp2_lr2e-4_r16` | Path to merged model |
+| `EMBED_MODEL` | No | `sentence-transformers/all-MiniLM-L6-v2` | Embedding model |
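One way to read these variables with the defaults from the table is sketched below. This is a hypothetical loader, not the code in `main.py` or `rag.py`: the `Settings` class and `load_settings` function are illustrative names, and the validation rules are inferred from the "Required" column.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    pinecone_api_key: str
    pinecone_index: str
    llm_backend: str
    groq_api_key: Optional[str]
    groq_model: str
    model_path: str
    embed_model: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the variables from the table above, applying its defaults."""
    env = os.environ if env is None else env
    backend = env.get("LLM_BACKEND", "groq").strip().lower()
    if backend not in ("groq", "local"):
        raise ValueError(f"LLM_BACKEND must be 'groq' or 'local', got {backend!r}")
    if not env.get("PINECONE_API_KEY"):
        raise ValueError("PINECONE_API_KEY is required")
    if backend == "groq" and not env.get("GROQ_API_KEY"):
        raise ValueError("GROQ_API_KEY is required when LLM_BACKEND=groq")
    return Settings(
        pinecone_api_key=env["PINECONE_API_KEY"],
        pinecone_index=env.get("PINECONE_INDEX", "llmops-rag"),
        llm_backend=backend,
        groq_api_key=env.get("GROQ_API_KEY"),
        groq_model=env.get("GROQ_MODEL", "llama-3.3-70b-versatile"),
        model_path=env.get("MODEL_PATH", "./models/merged/exp2_lr2e-4_r16"),
        embed_model=env.get("EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
    )
```

Failing fast at startup on a missing key keeps a misconfigured container from passing its health check and then erroring on the first query.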
+---
+## CI/CD (Planned)
+The intended CI/CD pipeline:
+```
+git push main
+    │
+    ▼
+GitHub Actions
+    ├── Run tests
+    ├── Build Docker image
+    ├── Push to ACR
+    └── az containerapp update --image new-tag
+```
+This would give zero-downtime rolling deploys on every push to main. Currently, re-running `deploy_azure.sh` achieves the same result with a cold start.

README.md CHANGED

@@ -6,95 +6,235 @@ sdk: docker
 pinned: false
 ---
 # Irminsul
-> Fine-tuned Llama 3.1 8B · QLoRA · Pinecone RAG · FastAPI · Azure Container Apps
-A full end-to-end LLMOps serving stack — from a QLoRA fine-tuned model running in 4-bit NF4 on consumer hardware, through a retrieval-augmented generation pipeline, to a containerized API deployed on Azure. Built to be production-shaped, not just a demo.
-**[→ Live Demo](https://mukulray1603.github.io/Irminsul/demo.html)**
 ---
-## About Irminsul
-Most LLM projects stop at inference. This one goes further:
-- **Fine-tuned model** — Llama 3.1 8B fine-tuned with QLoRA (rank 16, lr 2e-4) on a custom dataset, merged and served locally in 4-bit NF4 quantization on an RTX 3060 6GB
-- **RAG pipeline** — Documents ingested, chunked, embedded with `sentence-transformers/all-MiniLM-L6-v2` (fully local, zero API cost), and stored in Pinecone. Retrieval is semantic, top-k configurable at query time
-- **Serving layer** — FastAPI with async lifespan model loading, typed Pydantic request/response models, CORS, health check, and a clean browser UI served from the same process
-- **Containerized** — Dockerfile built for slim Python 3.12, model loaded at runtime via env-configurable path (not baked in)
-- **Cloud-ready** — One-shot Azure deployment via ACR + Container Apps, with Pinecone key injected as a secret
-- **Domain knowledge** — RAG corpus built around Genshin Impact lore, character builds, and elemental mechanics, serving as a rich real-world knowledge base for retrieval evaluation
 ---
 ## Architecture
 ```
-User query
-    │
-    ▼
-FastAPI  /generate
-    │
-    ├── Embed query (sentence-transformers, local)
-    │       │
-    │       ▼
-    │   Pinecone — semantic search → top-k chunks
-    │       │
-    ▼       ▼
-  LangChain RetrievalQA
-    │
-    ▼
-Llama 3.1 8B (QLoRA fine-tuned, 4-bit NF4)
-    │
-    ▼
-Grounded answer + source attribution
 ```
 ---
 ## Stack
 | Layer | Technology |
 |---|---|
 | Base model | Llama 3.1 8B Instruct |
 | Fine-tuning | QLoRA via PEFT (r=16, α=32, lr=2e-4) |
 | Quantization | BitsAndBytes 4-bit NF4, bfloat16 compute |
 | Embeddings | sentence-transformers/all-MiniLM-L6-v2 |
-| Vector DB | Pinecone (serverless, cosine similarity) |
 | RAG chain | LangChain RetrievalQA |
 | Serving | FastAPI + Uvicorn |
 | Containerization | Docker (python:3.12-slim) |
-| Cloud | Azure Container Apps + ACR |
 ---
 ## Quickstart
 ```bash
-# 1. Clone and set up environment
 git clone https://github.com/MukulRay1603/Irminsul.git
 cd Irminsul
-python -m venv venv && source venv/bin/activate  # Windows: venv\Scripts\activate
 pip install -r requirements.txt
-# 2. Configure environment
 cp .env.example .env
-# Fill in PINECONE_API_KEY in .env
-# 3. Add your fine-tuned model
-# Place merged model at: ./models/merged/exp2_lr2e-4_r16
-# Or update MODEL_PATH in .env to point to your model
-# 4. Ingest documents
 python ingest.py --dir ./docs --chunk-size 300 --chunk-overlap 40
-# 5. Start the server
 uvicorn main:app --reload --port 8000
-# UI available at http://localhost:8000
-# API docs at http://localhost:8000/docs
 ```
 ---
@@ -104,81 +244,161 @@ uvicorn main:app --reload --port 8000
 | Method | Endpoint | Description |
 |---|---|---|
 | `GET` | `/` | Browser UI |
-| `GET` | `/health` | Model load status |
-| `POST` | `/generate` | RAG query → grounded answer |
-| `POST` | `/ingest` | Ingest docs from local directory |
-**Example:**
-```bash
-curl -X POST http://localhost:8000/generate \
-  -H "Content-Type: application/json" \
-  -d '{"query": "What weapons should Hu Tao use on a budget?", "top_k": 3}'
 ```
 ```json
 {
   "answer": "For Hu Tao on a budget, Dragon's Bane is the strongest F2P option — it scales with Elemental Mastery and deals significant bonus damage on vaporized hits. White Tassel is the best 3-star alternative for pure Normal Attack scaling.",
-  "sources": ["docs/character_builds.md"],
-  "latency_ms": 4821.3
 }
 ```
 ---
-## Memory profile (RTX 3060 6GB)
-| Component | VRAM |
-|---|---|
-| Llama 3.1 8B @ 4-bit NF4 | ~4.5 GB |
-| all-MiniLM-L6-v2 embedder | ~90 MB |
-| Inference headroom | ~1.2 GB |
-Running the embedder on CPU frees ~90MB if needed — set `device_map="cpu"` in `rag.py`.
 ---
-## Deploy to Azure
-```bash
-export PINECONE_API_KEY=your_key
-chmod +x deploy_azure.sh
-./deploy_azure.sh
 ```
-The script provisions a resource group, builds and pushes the image via ACR Tasks (no local Docker build needed), creates a Container Apps environment, and deploys with the Pinecone key injected as a secret. Prints the live HTTPS endpoint on completion.
-**Model in Azure:** The merged model (~16GB) isn't baked into the image. Recommended approach: mount from Azure Blob Storage as a volume for cheapest cold start on student credits.
 ---
-## Project structure
-```
-Irminsul/
-├── main.py           # FastAPI app — endpoints, lifespan, CORS
-├── rag.py            # Model loading, 4-bit config, LangChain RAG chain
-├── embedder.py       # sentence-transformers singleton wrapper
-├── ingest.py         # Doc loader → chunker → Pinecone upsert
-├── index.html        # Browser UI (dark theme, query history, source display)
-├── Dockerfile
-├── deploy_azure.sh   # One-shot Azure Container Apps deploy
-├── requirements.txt
-├── .env.example
-└── docs/             # Corpus + GitHub Pages demo
-    └── demo.html
-```
 ---
-## What's next
-- [ ] Swap naive word chunker for `MarkdownHeaderTextSplitter` for better retrieval precision
-- [ ] Add metadata filtering to Pinecone queries (filter by character, content type)
-- [ ] Streaming response via SSE for lower perceived latency
-- [ ] Expand corpus — per-character deep dives with stat thresholds and rotation guides
-- [ ] CI/CD pipeline — GitHub Actions → ACR build → Container Apps deploy on push
 ---
-Built while learning the full MLOps lifecycle — fine-tuning, quantization, retrieval, serving, and cloud deployment — on consumer hardware. Every component chosen deliberately, not for hype.

 pinned: false
 ---
+<div align="center">
+<img src="docs/assets/banner.png" alt="Irminsul Banner" width="100%">
+<!-- PLACEHOLDER: Add a banner image. Can be a dark-themed graphic with the Irminsul logo/name.
+     Recommended: 1280x320px, dark green/forest aesthetic matching the UI.
+     Tools: Figma, Canva, or even a screenshot of the UI works. -->
 # Irminsul
+**A production-shaped LLMOps stack — QLoRA fine-tuning on Colab, RAG pipeline, containerized serving, and cloud deployment.**
+[![Live Demo](https://img.shields.io/badge/Live_Demo-HuggingFace_Spaces-FFD21E?style=flat&logo=huggingface)](https://huggingface.co/spaces/MukulRay/Irminsul)
+[![GitHub](https://img.shields.io/badge/GitHub-MukulRay1603-181717?style=flat&logo=github)](https://github.com/MukulRay1603/Irminsul)
+[![Corpus Pipeline](https://img.shields.io/badge/Corpus_Pipeline-irminsul--corpus-2ea44f?style=flat&logo=github)](https://github.com/MukulRay1603/irminsul-corpus)
+[![License](https://img.shields.io/badge/License-MIT-green?style=flat)](LICENSE)
+</div>
+---
+Most LLM projects stop at inference. This one builds the full stack: a QLoRA fine-tuned Llama 3.1 8B served through a RAG pipeline, with guardrails, a domain-specific knowledge base, and a containerized FastAPI server designed for cloud deployment.
+**[→ Try the live demo](https://huggingface.co/spaces/MukulRay/Irminsul)**
 ---
+## What This Is
+Irminsul is a domain-specific AI assistant for Genshin Impact — built not because Genshin needed an AI assistant, but because it provided a concrete, evaluable knowledge domain to build an LLMOps pipeline around. Every component was chosen deliberately:
+- A knowledge domain rich enough to evaluate retrieval quality (characters, mechanics, lore)
+- Ground truth data available (KQM Theorycrafting Library, game stat APIs) to measure hallucination
+- Community signal data (patch notes, meta shifts) to test corpus freshness
+The domain is the test harness. The pipeline is the project.
 ---
 ## Architecture
 ```
+┌─────────────────────────────────────────────────────────────────┐
+│                         User Query                              │
+└─────────────────────┬───────────────────────────────────────────┘
+                      │
+                      ▼
+┌─────────────────────────────────────────────────────────────────┐
+│                    Guardrails Layer                             │
+│  • Injection detection (pattern matching)                       │
+│  • Domain validation (cosine similarity vs anchor embeddings)   │
+│  • Output sanitization                                          │
+└─────────────────────┬───────────────────────────────────────────┘
+                      │
+                      ▼
+┌─────────────────────────────────────────────────────────────────┐
+│                  FastAPI /generate                               │
+└──────────┬──────────────────────────────────────────────────────┘
+           │
+           ├─── Embed query (sentence-transformers, local, CPU)
+           │              │
+           │              ▼
+           │         Pinecone ── semantic search ──► top-k chunks
+           │              │
+           ▼              ▼
+    LangChain RetrievalQA (stuff chain)
+           │
+           ▼
+  ┌────────────────────────────────────┐
+  │         LLM Backend                │
+  │                                    │
+  │  Groq (live demo)                  │
+  │  llama-3.3-70b-versatile           │
+  │  ~300 tok/s, free tier             │
+  │             ──── OR ────           │
+  │  Local (fine-tuned)                │
+  │  Llama 3.1 8B QLoRA               │
+  │  4-bit NF4, RTX 3060 6GB          │
+  │  (inference only — trained on     │
+  │   Colab A100)                     │
+  └────────────────────────────────────┘
+           │
+           ▼
+    Grounded answer + source attribution
 ```
 ---
+## Components
+### Fine-Tuned Model
+Llama 3.1 8B Instruct fine-tuned with QLoRA on a custom instruction dataset, trained on Google Colab Pro (A100). Local inference runs in 4-bit NF4 quantization on an RTX 3060 6GB.
+**[→ View training notebook on Colab](https://colab.research.google.com/drive/YOUR_NOTEBOOK_LINK_HERE)**
+<!-- PLACEHOLDER: Replace YOUR_NOTEBOOK_LINK_HERE with your actual Colab share link
+     File → Share → Copy link (set to "Anyone with the link can view") -->
+| Parameter | Value |
+|---|---|
+| Base model | `meta-llama/Llama-3.1-8B-Instruct` |
+| Method | QLoRA via PEFT |
+| Rank / Alpha | r=16, α=32 |
+| Learning rate | 2e-4 |
+| Quantization (inference) | 4-bit NF4, bfloat16 compute |
+| Training infra | Google Colab Pro (A100) |
+| Experiment tracking | MLflow (3 runs) |
+Three experiments were tracked in MLflow. The winning checkpoint was selected by faithfulness score (0.826) and ROUGE-L (0.466) on a held-out eval set.
+<!-- PLACEHOLDER: Add MLflow experiment screenshot here
+     docs/assets/mlflow_experiments.png
+     A screenshot of your MLflow UI showing the 3 runs and metrics comparison -->
+### RAG Pipeline
+Documents are chunked, embedded locally with `sentence-transformers/all-MiniLM-L6-v2` (384-dim, zero API cost), and stored in Pinecone serverless. Retrieval is semantic, top-k configurable per query.
+| Component | Choice | Reason |
+|---|---|---|
+| Embedder | all-MiniLM-L6-v2 | Runs locally, strong semantic retrieval, 384-dim fits free Pinecone tier |
+| Vector DB | Pinecone serverless | Zero ops, cosine similarity, free tier sufficient for corpus size |
+| Chunking | Word-level, 300 words, 40-word overlap | Preserves semantic units across chunk boundaries |
+| Chain | LangChain RetrievalQA (stuff) | Simple, inspectable, returns source documents |
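The word-level chunking in the table can be sketched as a sliding window. The snippet below is an illustration only: it assumes whitespace tokenization and the 300/40 settings quoted above, and `chunk_words` is a hypothetical name rather than the function in `ingest.py`.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 300, overlap: int = 40) -> List[str]:
    """Split text into chunks of `chunk_size` words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end; avoid a redundant tail chunk
    return chunks
```

With these settings a 600-word document yields three chunks starting at words 0, 260, and 520, so a sentence that straddles a boundary appears intact in at least one chunk.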
+### Knowledge Corpus
+The corpus is maintained in a [separate repository](https://github.com/MukulRay1603/irminsul-corpus) with an autonomous update pipeline. It ingests from three tiers of sources with different trust levels:
+| Tier | Source | Files | Trust |
+|---|---|---|---|
+| 1 — Ground Truth | KQM Theorycrafting Library (peer-reviewed mechanics) | ~305 | Highest — cite in builds |
+| 1 — Ground Truth | genshin-db API (exact character/weapon/artifact stats) | ~406 | Highest — exact game data |
+| 2 — Expert Synthesis | Gemini-authored prose grounded in Tier 1 | ~83 | High — no hallucinated stats |
+| 3 — Community Signal | Official patch notes, banner history, event calendar | ~80 | Medium — tagged explicitly |
+A GitHub Actions workflow runs every Sunday at 2am UTC, pulls fresh data, commits the docs, and re-ingests ~4,000 vectors to Pinecone automatically.
+### Guardrails
+Two layers of input validation before any LLM call:
+1. **Injection detection** — pattern matching against known jailbreak phrases (`ignore previous instructions`, `act as`, `DAN mode`, etc.)
+2. **Domain validation** — cosine similarity between the query embedding and a set of Genshin-domain anchor sentences. Queries scoring below threshold (0.35) are rejected with a domain-scoped error message before touching the LLM.
+Output is sanitized to strip generation artifacts (`</s>` tokens, trailing whitespace) and length-checked.
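The two input checks described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the contents of `guardrails.py`: the pattern list contains only the three phrases quoted above, embeddings are passed in as precomputed vectors, and the best-matching anchor is assumed to decide the domain score (the repository may aggregate differently). All function names are hypothetical.

```python
import math
import re
from typing import Sequence, Tuple

# Phrases quoted in the README; the real list is presumably longer.
INJECTION_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"ignore\s+previous\s+instructions",
        r"\bact\s+as\b",
        r"\bDAN\s+mode\b",
    )
]
DOMAIN_THRESHOLD = 0.35  # threshold quoted above


def looks_like_injection(query: str) -> bool:
    """True if the query matches any known jailbreak phrase."""
    return any(p.search(query) for p in INJECTION_PATTERNS)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def in_domain(query_vec, anchor_vecs, threshold: float = DOMAIN_THRESHOLD) -> bool:
    # Assumption: the closest anchor decides; averaging is another option.
    best = max((cosine(query_vec, a) for a in anchor_vecs), default=0.0)
    return best >= threshold


def check_query(query: str, query_vec, anchor_vecs) -> Tuple[bool, str]:
    """Return (blocked, reason); reason is empty when the query passes."""
    if looks_like_injection(query):
        return True, "possible prompt injection"
    if not in_domain(query_vec, anchor_vecs):
        return True, "query is outside the Genshin Impact domain"
    return False, ""
```

Plain phrase matching is cheap but blunt: a legitimate question such as "Can Xiao act as a main DPS?" would trip the `act as` pattern, so a production list likely needs tighter patterns or an allowlist.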
+### Serving Layer
+FastAPI with:
+- Async lifespan model loading (model loads once at startup, not per request)
+- Typed Pydantic request/response models with `blocked` flag for guardrail rejections
+- CORS enabled for cross-origin UI
+- `/health` endpoint reporting model load status
+- Browser UI served from the same process (no separate frontend server)
+---
 ## Stack
 | Layer | Technology |
 |---|---|
 | Base model | Llama 3.1 8B Instruct |
 | Fine-tuning | QLoRA via PEFT (r=16, α=32, lr=2e-4) |
+| Experiment tracking | MLflow |
 | Quantization | BitsAndBytes 4-bit NF4, bfloat16 compute |
 | Embeddings | sentence-transformers/all-MiniLM-L6-v2 |
+| Vector DB | Pinecone serverless (cosine, 384-dim) |
 | RAG chain | LangChain RetrievalQA |
 | Serving | FastAPI + Uvicorn |
 | Containerization | Docker (python:3.12-slim) |
+| Live demo hosting | HuggingFace Spaces (CPU Basic) |
+| Production deployment | Azure Container Apps + ACR |
+| LLM backend (demo) | Groq API (llama-3.3-70b-versatile) |
+| Corpus pipeline | GitHub Actions (weekly, autonomous) |
 ---
 ## Quickstart
+### Option 1 — Groq backend (no GPU required)
 ```bash
+# 1. Clone
 git clone https://github.com/MukulRay1603/Irminsul.git
 cd Irminsul
+# 2. Install
+python -m venv venv && source venv/bin/activate
+# Windows: venv\Scripts\activate
 pip install -r requirements.txt
+# 3. Configure
 cp .env.example .env
+# Set PINECONE_API_KEY and GROQ_API_KEY in .env
+# LLM_BACKEND=groq is the default
+# 4. Ingest corpus (or use pre-ingested Pinecone index)
 python ingest.py --dir ./docs --chunk-size 300 --chunk-overlap 40
+# 5. Run
+uvicorn main:app --reload --port 8000
+# UI: http://localhost:8000
+# API docs: http://localhost:8000/docs
+```
+### Option 2 — Local fine-tuned model (GPU required for inference, 6GB+ VRAM)
+```bash
+# Same steps 1–3, then:
+# Set LLM_BACKEND=local and MODEL_PATH in .env
+# 4. Download model
+# Place the merged QLoRA model at: ./models/merged/exp2_lr2e-4_r16/
+# (Or update MODEL_PATH in .env)
+# 5. Run
 uvicorn main:app --reload --port 8000
+```
+### Docker
+```bash
+# Groq backend (no GPU)
+docker build -t irminsul:latest .
+docker run -p 8000:8000 \
+  -e PINECONE_API_KEY=your_key \
+  -e GROQ_API_KEY=your_key \
+  -e LLM_BACKEND=groq \
+  irminsul:latest
 ```
 ---
 | Method | Endpoint | Description |
 |---|---|---|
 | `GET` | `/` | Browser UI |
+| `GET` | `/health` | Model load status + ready flag |
+| `POST` | `/generate` | RAG query → grounded answer + sources |
+| `POST` | `/ingest` | Ingest docs from a local directory path |
+**Request:**
+```json
+{
+  "query": "What weapons should Hu Tao use on a budget?",
+  "top_k": 3
+}
 ```
+**Response:**
 ```json
 {
   "answer": "For Hu Tao on a budget, Dragon's Bane is the strongest F2P option — it scales with Elemental Mastery and deals significant bonus damage on vaporized hits. White Tassel is the best 3-star alternative for pure Normal Attack scaling.",
+  "sources": ["docs/generated/characters/hu_tao.md", "docs/tcl/characters/pyro/hutao.md"],
+  "latency_ms": 1240.5,
+  "blocked": false
 }
 ```
+If a query is rejected by guardrails, `blocked: true` is returned with the rejection reason in `answer`. No LLM call is made.
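A client for this contract might look like the following sketch, which uses only the Python standard library. The field names come from the request and response examples above; the helper names are hypothetical, and treating `sources` and `latency_ms` as optional on blocked responses is an assumption, since the README does not show a blocked payload.

```python
import json
from dataclasses import dataclass
from typing import List
from urllib import request


@dataclass
class GenerateResult:
    answer: str
    sources: List[str]
    latency_ms: float
    blocked: bool


def build_generate_request(base_url: str, query: str, top_k: int = 3) -> request.Request:
    """Build the POST /generate request without sending it."""
    payload = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return request.Request(
        base_url.rstrip("/") + "/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_generate_response(body: bytes) -> GenerateResult:
    """Decode a /generate response body into a typed result."""
    data = json.loads(body)
    return GenerateResult(
        answer=data["answer"],
        sources=list(data.get("sources", [])),  # assumption: may be absent when blocked
        latency_ms=float(data.get("latency_ms", 0.0)),
        blocked=bool(data.get("blocked", False)),
    )


def ask(base_url: str, query: str, top_k: int = 3, timeout: float = 30.0) -> GenerateResult:
    """Send the query to a running server and return the parsed result."""
    req = build_generate_request(base_url, query, top_k)
    with request.urlopen(req, timeout=timeout) as resp:
        return parse_generate_response(resp.read())
```

A caller would check `result.blocked` first: when it is true, `result.answer` carries the rejection reason rather than a grounded answer, so it should not be displayed as model output.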
 ---
+## Deployment
+See **[DEPLOYMENT.md](DEPLOYMENT.md)** for the full guide covering:
+- Local development setup
+- Docker (local and cloud)
+- Azure Container Apps (one-shot `deploy_azure.sh`)
+- Cost breakdown and the reasoning behind the demo setup
+- GPU serving path for the fine-tuned model
+**Why the live demo runs on HuggingFace + Groq, not Azure GPU:**
+Serving the fine-tuned Llama 3.1 8B requires a GPU instance. The minimum viable option on Azure (NC4as T4 v3) costs ~$360/month — not justified for a portfolio project. The Dockerfile and `deploy_azure.sh` are written for the Azure path; the live demo swaps the LLM backend to Groq via a single environment variable. The RAG pipeline, guardrails, and serving layer are identical.
 ---
+## Project Structure
 ```
+Irminsul/
+├── main.py              # FastAPI app: endpoints, lifespan, CORS, response models
+├── rag.py               # LangChain RAG chain, dual backend (Groq / local Llama)
+├── embedder.py          # sentence-transformers singleton (loads once, reused)
+├── ingest.py            # Doc loader → word chunker → Pinecone upsert
+├── guardrails.py        # Input validation: injection detection + domain cosine check
+├── index.html           # Browser UI: dark Dendro theme, query history, source display
+│
+├── Dockerfile           # python:3.12-slim, model NOT baked in
+├── deploy_azure.sh      # One-shot ACR build + Container Apps deploy
+├── .env.example         # Environment variable reference
+│
+├── DEPLOYMENT.md        # Full deployment guide + cost analysis
+├── requirements.txt
+├── images/              # Screenshots and assets used in this README
+│   ├── banner.png
+│   ├── ui_main.png
+│   ├── ui_response.png
+│   └── mlflow_runs.png
+└── docs/
+    ├── corpus/          # Legacy manual corpus docs
+    └── demo.html        # GitHub Pages demo page
+```
+---
+## Evaluation
+<!-- PLACEHOLDER: Fill this section once you have eval numbers ready.
+     Consider running a small eval set (20-50 questions) with:
+     - Faithfulness: Does the answer contradict the retrieved context?
+     - Answer relevance: Does the answer address the question?
+     - Context recall: Did retrieval find the right documents?
+     Tools to consider: RAGAS (pip install ragas) against your Pinecone index.
+     Example format:
+| Metric | Score | Method |
+|---|---|---|
+| Faithfulness | 0.826 | Custom eval, n=50 |
+| ROUGE-L | 0.466 | vs reference answers |
+| Context recall | TBD | RAGAS |
+| Answer relevance | TBD | RAGAS |
+     The fine-tuned model numbers (0.826 faithfulness, 0.466 ROUGE-L) came from
+     your MLflow eval during training — pull those into this table.
+-->
+The fine-tuned model was evaluated during training with a held-out set:
+| Metric | Score |
+|---|---|
+| Faithfulness | 0.826 |
+| ROUGE-L | 0.466 |
+Full RAG pipeline evaluation (context recall, answer relevance) is a planned addition — see [What's Next](#whats-next).
 ---
+## Screenshots
+<!-- PLACEHOLDER: Add screenshots once you have them.
+     Save to images/ and uncomment these lines:
+![Irminsul UI](images/ui_main.png)
+![Response with sources](images/ui_response.png)
+![MLflow experiment runs](images/mlflow_runs.png)
+     Tips:
+     - ui_main.png: screenshot of http://localhost:8000 before any query
+     - ui_response.png: run a query (try "best build for Hu Tao") so the answer + sources section is visible
+     - mlflow_runs.png: from your Colab — the experiment comparison table showing 3 runs
+-->
+*Screenshots coming soon — [try the live demo](https://huggingface.co/spaces/MukulRay/Irminsul) to see it in action.*
 ---
+## What's Next
+- [ ] **RAGAS evaluation** — systematic RAG eval (faithfulness, context recall, answer relevance) on a held-out question set
+- [ ] **MarkdownHeaderTextSplitter** — replace naive word chunker for section-aware chunking that respects document structure
+- [ ] **Metadata filtering** — filter Pinecone queries by character, content tier, or topic category
+- [ ] **Streaming responses** — SSE for lower perceived latency on long answers
+- [ ] **CI/CD pipeline** — GitHub Actions → ACR build → `az containerapp update` on push to main
+- [ ] **Corpus expansion** — constellation effects, rotation guides, and ER/EM thresholds per character
 ---
+## Related: irminsul-corpus
+The knowledge base is maintained in a companion repository:
+**[MukulRay1603/irminsul-corpus](https://github.com/MukulRay1603/irminsul-corpus)**
+It runs a fully autonomous weekly pipeline: pulls fresh game data from the KQM Theorycrafting Library and genshin-db API, synthesizes prose with Gemini 2.5 Flash, commits ~800 documents to the repo, and re-ingests ~4,000 vectors to Pinecone — without any manual intervention.
+---
+## License
+MIT — see [LICENSE](LICENSE) for details.
+Genshin Impact is owned by HoYoverse. This project is not affiliated with or endorsed by HoYoverse.
+---
+<div align="center">
+Built to learn the full MLOps lifecycle (fine-tuning, quantization, retrieval, serving, and cloud deployment) with inference running on consumer hardware. Every component chosen deliberately, not for hype.
+</div>

deploy_azure.sh ADDED

	@@ -0,0 +1,149 @@

+#!/bin/bash
+# ─────────────────────────────────────────────────────────────────────────────
+# deploy_azure.sh — One-shot Azure Container Apps deployment for Irminsul
+#
+# Prerequisites:
+#   - Azure CLI installed: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli
+#   - Logged in: az login
+#   - Subscription active: az account show
+#
+# Usage:
+#   export PINECONE_API_KEY=your_key
+#   export GROQ_API_KEY=your_key
+#   chmod +x deploy_azure.sh
+#   ./deploy_azure.sh
+#
+# What this script does:
+#   1. Creates a resource group in East US
+#   2. Creates an Azure Container Registry (ACR)
+#   3. Builds the Docker image via ACR Tasks (no local Docker build needed)
+#   4. Creates a Container Apps environment
+#   5. Deploys the container with secrets injected as env vars
+#   6. Prints the live HTTPS URL
+#
+# Cost note:
+#   This stack (Groq backend, no GPU) runs on a consumption-plan Container App.
+#   Estimated cost: ~$5/month. The Container App fits within the free grant
+#   (180,000 vCPU-seconds/month), but the Basic ACR created below is billed (~$5/month).
+#   GPU-accelerated inference (local Llama backend) requires NC-series instances
+#   (~$0.50-1.50/hr) which is not cost-effective for a portfolio project.
+#   See DEPLOYMENT.md for the full cost analysis.
+# ─────────────────────────────────────────────────────────────────────────────
+set -e  # exit on any error
+# ── Configuration ──────────────────────────────────────────────────────────────
+RESOURCE_GROUP="irminsul-rg"
+LOCATION="eastus"
+ACR_NAME="irminsulacr"             # must be globally unique, lowercase alphanumeric
+ENVIRONMENT="irminsul-env"
+APP_NAME="irminsul"
+IMAGE_TAG="latest"
+# ── Validate required secrets ──────────────────────────────────────────────────
+if [[ -z "$PINECONE_API_KEY" ]]; then
+  echo "ERROR: PINECONE_API_KEY environment variable is not set."
+  echo "  export PINECONE_API_KEY=your_key"
+  exit 1
+fi
+if [[ -z "$GROQ_API_KEY" ]]; then
+  echo "ERROR: GROQ_API_KEY environment variable is not set."
+  echo "  export GROQ_API_KEY=your_key"
+  exit 1
+fi
+echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
+echo "  Irminsul — Azure Container Apps Deployment"
+echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
+echo ""
+# ── Step 1: Resource Group ─────────────────────────────────────────────────────
+echo "[1/5] Creating resource group: $RESOURCE_GROUP"
+az group create \
+  --name "$RESOURCE_GROUP" \
+  --location "$LOCATION" \
+  --output none
+echo "      ✓ Resource group ready"
+# ── Step 2: Azure Container Registry ──────────────────────────────────────────
+echo "[2/5] Creating container registry: $ACR_NAME"
+az acr create \
+  --resource-group "$RESOURCE_GROUP" \
+  --name "$ACR_NAME" \
+  --sku Basic \
+  --admin-enabled true \
+  --output none
+echo "      ✓ ACR created"
+# ── Step 3: Build image via ACR Tasks (cloud build — no local Docker needed) ───
+echo "[3/5] Building Docker image via ACR Tasks..."
+echo "      This uploads your source code to Azure and builds in the cloud."
+az acr build \
+  --registry "$ACR_NAME" \
+  --image "${APP_NAME}:${IMAGE_TAG}" \
+  .
+echo "      ✓ Image built and pushed: ${ACR_NAME}.azurecr.io/${APP_NAME}:${IMAGE_TAG}"
+# ── Step 4: Container Apps Environment ────────────────────────────────────────
+echo "[4/5] Creating Container Apps environment: $ENVIRONMENT"
+az containerapp env create \
+  --name "$ENVIRONMENT" \
+  --resource-group "$RESOURCE_GROUP" \
+  --location "$LOCATION" \
+  --output none
+echo "      ✓ Environment ready"
+# ── Step 5: Deploy Container App ──────────────────────────────────────────────
+echo "[5/5] Deploying container app: $APP_NAME"
+# Get ACR credentials for pulling the image
+ACR_LOGIN_SERVER=$(az acr show --name "$ACR_NAME" --query loginServer --output tsv)
+ACR_USERNAME=$(az acr credential show --name "$ACR_NAME" --query username --output tsv)
+ACR_PASSWORD=$(az acr credential show --name "$ACR_NAME" --query "passwords[0].value" --output tsv)
+az containerapp create \
+  --name "$APP_NAME" \
+  --resource-group "$RESOURCE_GROUP" \
+  --environment "$ENVIRONMENT" \
+  --image "${ACR_LOGIN_SERVER}/${APP_NAME}:${IMAGE_TAG}" \
+  --registry-server "$ACR_LOGIN_SERVER" \
+  --registry-username "$ACR_USERNAME" \
+  --registry-password "$ACR_PASSWORD" \
+  --target-port 8000 \
+  --ingress external \
+  --min-replicas 0 \
+  --max-replicas 3 \
+  --cpu 1.0 \
+  --memory 2.0Gi \
+  --env-vars \
+      PINECONE_API_KEY=secretref:pinecone-key \
+      GROQ_API_KEY=secretref:groq-key \
+      PINECONE_INDEX=llmops-rag \
+      LLM_BACKEND=groq \
+      EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2 \
+  --secrets \
+      pinecone-key="$PINECONE_API_KEY" \
+      groq-key="$GROQ_API_KEY" \
+  --output none
+echo "      ✓ Container app deployed"
+# ── Print live URL ─────────────────────────────────────────────────────────────
+LIVE_URL=$(az containerapp show \
+  --name "$APP_NAME" \
+  --resource-group "$RESOURCE_GROUP" \
+  --query "properties.configuration.ingress.fqdn" \
+  --output tsv)
+echo ""
+echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
+echo "  Deployment complete!"
+echo ""
+echo "  Live URL:  https://${LIVE_URL}"
+echo "  Health:    https://${LIVE_URL}/health"
+echo "  API docs:  https://${LIVE_URL}/docs"
+echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
+echo ""
+echo "  To tear down everything and stop billing:"
+echo "  az group delete --name $RESOURCE_GROUP --yes --no-wait"
+echo ""

requirements.txt CHANGED

@@ -1,23 +1,26 @@
-# Core serving
 fastapi==0.115.0
 uvicorn[standard]==0.30.6
 pydantic==2.7.4
-# LLM + fine-tuned model
 torch==2.3.1
 transformers==4.51.3
 peft==0.15.2
-bitsandbytes==0.43.3
 accelerate==1.6.0
-# RAG
 langchain==1.2.13
 langchain-community==0.4.1
-langchain-classic==1.0.3
 langchain-groq==1.1.2
 groq==0.37.1
-pinecone-client==3.2.2
 sentence-transformers==4.1.0
-# Utilities
-python-dotenv==1.0.1

+# ── Core serving ───────────────────────────────────────────────────────────────
 fastapi==0.115.0
 uvicorn[standard]==0.30.6
 pydantic==2.7.4
+# ── LLM + fine-tuned model (local backend) ─────────────────────────────────────
+# Only needed when LLM_BACKEND=local
+# A GPU with 6GB+ VRAM is strongly recommended; the model loads on CPU but inference is very slow
 torch==2.3.1
 transformers==4.51.3
 peft==0.15.2
+bitsandbytes==0.43.3          # 4-bit NF4 quantization (official Windows wheels since 0.43.0)
 accelerate==1.6.0
+# ── RAG ────────────────────────────────────────────────────────────────────────
 langchain==1.2.13
 langchain-community==0.4.1
+langchain-classic==1.0.3      # provides langchain_classic.chains.RetrievalQA
 langchain-groq==1.1.2
 groq==0.37.1
+pinecone-client==3.2.2        # langchain-community expects the v3 Pinecone client API
 sentence-transformers==4.1.0
+# ── Utilities ──────────────────────────────────────────────────────────────────
+python-dotenv==1.0.1
+requests>=2.31.0