docs: overhaul README, add DEPLOYMENT.md, deploy_azure.sh, .env.example
- .gitignore +0 -0
- DEPLOYMENT.md +208 -0
- README.md +310 -90
- deploy_azure.sh +149 -0
- requirements.txt +11 -8
.gitignore
CHANGED
|
Binary files a/.gitignore and b/.gitignore differ
|
|
|
DEPLOYMENT.md
ADDED
|
@@ -0,0 +1,208 @@
| 1 |
+
# Deployment Guide
|
| 2 |
+
|
| 3 |
+
This document covers all deployment options for Irminsul, the cost tradeoffs between them, and the architectural decisions behind the live demo setup.
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## Deployment Options
|
| 8 |
+
|
| 9 |
+
Irminsul supports two LLM backends and multiple hosting targets. Choose based on your infrastructure and budget.
|
| 10 |
+
|
| 11 |
+
| Backend | Where to Run | GPU Required | Cost |
|---|---|---|---|
| **Groq** (recommended) | Anywhere | No | Free tier available |
| **Local Llama** (fine-tuned model) | Local machine / GPU VM | Yes (6GB+ VRAM) | Hardware cost / ~$0.50-1.50/hr on Azure |
|
| 15 |
+
|
| 16 |
+
---
|
| 17 |
+
|
| 18 |
+
## Live Demo: HuggingFace Spaces + Groq
|
| 19 |
+
|
| 20 |
+
**Why this is the live demo environment:**
|
| 21 |
+
|
| 22 |
+
The fine-tuned Llama 3.1 8B model is 16GB on disk and requires a GPU-enabled instance to serve at acceptable latency. On Azure, the minimum viable GPU instance for this model is the **NC4as T4 v3** (~$0.50/hr, ~$360/month). Running this persistently for a portfolio project is not cost-effective.
|
| 23 |
+
|
| 24 |
+
The live demo instead uses:
|
| 25 |
+
- **HuggingFace Spaces**: free CPU hosting for the FastAPI container
- **Groq API**: runs `llama-3.3-70b-versatile` on Groq's Language Processing Units (LPUs) at ~300 tokens/second, free under the public tier
|
| 27 |
+
|
| 28 |
+
This demonstrates the identical RAG architecture: the LLM backend is swapped via a single environment variable (`LLM_BACKEND=groq`). The retrieval pipeline, guardrails, response format, and API contract are unchanged.
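
A minimal sketch of how an environment-variable switch like this could be read, assuming a hypothetical `select_backend` helper. The project's real wiring lives in `rag.py` and may differ; the names below are illustrative only.

```python
import os

# Hypothetical list of backends; the values mirror the documented
# LLM_BACKEND options ("groq" or "local").
SUPPORTED_BACKENDS = ("groq", "local")


def select_backend(env=None):
    """Return the backend name chosen via LLM_BACKEND (default: groq)."""
    env = os.environ if env is None else env
    backend = env.get("LLM_BACKEND", "groq").strip().lower()
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unsupported LLM_BACKEND: {backend!r}")
    return backend
```

Failing fast on an unknown value keeps a typo in `.env` from silently falling back to the wrong backend.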
|
| 29 |
+
|
| 30 |
+
```
|
| 31 |
+
Live demo: https://huggingface.co/spaces/MukulRay/Irminsul
|
| 32 |
+
```
|
| 33 |
+
|
| 34 |
+
---
|
| 35 |
+
|
| 36 |
+
## Option A: Local Development
|
| 37 |
+
|
| 38 |
+
The full stack including the fine-tuned model runs locally on an RTX 3060 6GB:
|
| 39 |
+
|
| 40 |
+
```bash
|
| 41 |
+
# 1. Clone and install
|
| 42 |
+
git clone https://github.com/MukulRay1603/Irminsul.git
|
| 43 |
+
cd Irminsul
|
| 44 |
+
python -m venv venv && source venv/bin/activate
|
| 45 |
+
pip install -r requirements.txt
|
| 46 |
+
|
| 47 |
+
# 2. Configure
|
| 48 |
+
cp .env.example .env
|
| 49 |
+
# Edit .env: set LLM_BACKEND=local, MODEL_PATH, PINECONE_API_KEY
|
| 50 |
+
|
| 51 |
+
# 3. Ingest corpus
|
| 52 |
+
python ingest.py --dir ./docs --chunk-size 300 --chunk-overlap 40
|
| 53 |
+
|
| 54 |
+
# 4. Serve
|
| 55 |
+
uvicorn main:app --host 0.0.0.0 --port 8000
|
| 56 |
+
```
|
| 57 |
+
|
| 58 |
+
**Memory profile:**
|
| 59 |
+
|
| 60 |
+
| Component | VRAM |
|
| 61 |
+
|---|---|
|
| 62 |
+
| Llama 3.1 8B @ 4-bit NF4 | ~4.5 GB |
|
| 63 |
+
| all-MiniLM-L6-v2 embedder | ~90 MB |
|
| 64 |
+
| Inference headroom | ~1.2 GB |
|
| 65 |
+
| **Total** | **~5.8 GB** |
|
| 66 |
+
|
| 67 |
+
The model loads with `max_memory={0: "5.5GiB", "cpu": "24GiB"}`, so layers that don't fit on GPU overflow to RAM automatically via `accelerate`.
|
| 68 |
+
|
| 69 |
+
---
|
| 70 |
+
|
| 71 |
+
## Option B: Docker (Local or Any Cloud)
|
| 72 |
+
|
| 73 |
+
The Dockerfile is intentionally slim: the model is **not baked in**. It's injected at runtime via `MODEL_PATH`.
|
| 74 |
+
|
| 75 |
+
```bash
|
| 76 |
+
# Build
|
| 77 |
+
docker build -t irminsul:latest .
|
| 78 |
+
|
| 79 |
+
# Run with Groq backend (no GPU needed)
|
| 80 |
+
docker run -p 8000:8000 \
|
| 81 |
+
-e PINECONE_API_KEY=your_key \
|
| 82 |
+
-e GROQ_API_KEY=your_key \
|
| 83 |
+
-e PINECONE_INDEX=llmops-rag \
|
| 84 |
+
-e LLM_BACKEND=groq \
|
| 85 |
+
irminsul:latest
|
| 86 |
+
|
| 87 |
+
# Run with local model (GPU required)
|
| 88 |
+
docker run -p 8000:8000 \
|
| 89 |
+
--gpus all \
|
| 90 |
+
-v /path/to/models:/app/models \
|
| 91 |
+
-e PINECONE_API_KEY=your_key \
|
| 92 |
+
-e MODEL_PATH=/app/models/merged/exp2_lr2e-4_r16 \
|
| 93 |
+
-e LLM_BACKEND=local \
|
| 94 |
+
irminsul:latest
|
| 95 |
+
```
|
| 96 |
+
|
| 97 |
+
---
|
| 98 |
+
|
| 99 |
+
## Option C: Azure Container Apps
|
| 100 |
+
|
| 101 |
+
Azure Container Apps (ACA) is the production deployment target. The `deploy_azure.sh` script provisions the full stack in one command.
|
| 102 |
+
|
| 103 |
+
### Prerequisites
|
| 104 |
+
|
| 105 |
+
```bash
|
| 106 |
+
# Install Azure CLI
|
| 107 |
+
# macOS:
|
| 108 |
+
brew install azure-cli
|
| 109 |
+
|
| 110 |
+
# Linux:
|
| 111 |
+
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
|
| 112 |
+
|
| 113 |
+
# Windows: https://aka.ms/installazurecliwindows
|
| 114 |
+
|
| 115 |
+
# Log in
|
| 116 |
+
az login
|
| 117 |
+
az account show # confirm your subscription
|
| 118 |
+
```
|
| 119 |
+
|
| 120 |
+
### One-shot deploy
|
| 121 |
+
|
| 122 |
+
```bash
|
| 123 |
+
export PINECONE_API_KEY=your_pinecone_key
|
| 124 |
+
export GROQ_API_KEY=your_groq_key
|
| 125 |
+
chmod +x deploy_azure.sh
|
| 126 |
+
./deploy_azure.sh
|
| 127 |
+
```
|
| 128 |
+
|
| 129 |
+
The script:
|
| 130 |
+
1. Creates resource group `irminsul-rg` in East US
|
| 131 |
+
2. Creates Azure Container Registry `irminsulacr`
|
| 132 |
+
3. Builds the Docker image via **ACR Tasks**: the source code is uploaded and built in Azure's cloud; no local Docker daemon needed
|
| 133 |
+
4. Creates a Container Apps environment
|
| 134 |
+
5. Deploys the app with secrets injected as environment variables
|
| 135 |
+
6. Outputs the live HTTPS URL
|
| 136 |
+
|
| 137 |
+
### Tearing down
|
| 138 |
+
|
| 139 |
+
```bash
|
| 140 |
+
# Delete everything; billing stops once the resource group is gone (--no-wait returns before deletion finishes)
|
| 141 |
+
az group delete --name irminsul-rg --yes --no-wait
|
| 142 |
+
```
|
| 143 |
+
|
| 144 |
+
### Cost breakdown (Groq backend, no GPU)
|
| 145 |
+
|
| 146 |
+
| Resource | SKU | Cost |
|
| 147 |
+
|---|---|---|
|
| 148 |
+
| Container Apps | Consumption plan | Free (180k vCPU-s/month) |
|
| 149 |
+
| ACR | Basic | ~$5/month |
|
| 150 |
+
| Outbound bandwidth | First 100GB | Free |
|
| 151 |
+
| **Total** | | **~$5/month** |
|
| 152 |
+
|
| 153 |
+
On Azure for Students ($100 credit), this runs for ~20 months.
|
| 154 |
+
|
| 155 |
+
### Why not GPU on Azure?
|
| 156 |
+
|
| 157 |
+
To serve the fine-tuned Llama model in production, a GPU instance is required:
|
| 158 |
+
|
| 159 |
+
| Instance | GPU | VRAM | Cost |
|
| 160 |
+
|---|---|---|---|
|
| 161 |
+
| NC4as T4 v3 | Tesla T4 | 16 GB | ~$0.50/hr = **~$360/month** |
|
| 162 |
+
| NC6s v3 | Tesla V100 | 16 GB | ~$0.90/hr = **~$648/month** |
|
| 163 |
+
|
| 164 |
+
At these prices, a portfolio project running 24/7 would exhaust the $100 student credit in roughly five to eight days ($100 buys about 8.3 days at $0.50/hr and about 4.6 days at $0.90/hr). The Groq backend delivers the same RAG functionality at zero marginal cost, making it the right engineering tradeoff.
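
The monthly figures and the credit runway can be checked with a few lines of arithmetic. This is a sketch that assumes a 30-day month and the approximate hourly rates quoted in the table; actual Azure pricing varies by region and over time.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, instance never stopped


def monthly_cost(hourly_rate):
    """Cost of running one instance around the clock for a month."""
    return hourly_rate * HOURS_PER_MONTH


def days_of_credit(credit, hourly_rate):
    """How many days a fixed credit lasts at a given hourly rate."""
    return credit / (hourly_rate * 24)


for name, rate in [("NC4as T4 v3", 0.50), ("NC6s v3", 0.90)]:
    print(
        f"{name}: ~${monthly_cost(rate):.0f}/month, "
        f"$100 lasts ~{days_of_credit(100, rate):.1f} days"
    )
```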
|
| 165 |
+
|
| 166 |
+
### Serving the fine-tuned model on Azure (production path)
|
| 167 |
+
|
| 168 |
+
If cost were not a constraint, the correct architecture is:
|
| 169 |
+
|
| 170 |
+
1. **Upload model to Azure Blob Storage** (~$0.02/GB/month for 16GB = ~$0.32/month)
|
| 171 |
+
2. **Mount as a volume** in Container Apps: the container sees it at `/app/models/`
3. **Switch to GPU SKU**: replace `--cpu 1.0 --memory 2.0Gi` in `deploy_azure.sh` with a GPU-enabled workload profile
|
| 173 |
+
4. **Set `LLM_BACKEND=local`** in env vars
|
| 174 |
+
|
| 175 |
+
The Docker image and application code require zero changes for this path. The abstraction was designed for it.
|
| 176 |
+
|
| 177 |
+
---
|
| 178 |
+
|
| 179 |
+
## Environment Variables Reference
|
| 180 |
+
|
| 181 |
+
| Variable | Required | Default | Description |
|---|---|---|---|
| `PINECONE_API_KEY` | Yes | - | Pinecone serverless API key |
| `PINECONE_INDEX` | No | `llmops-rag` | Pinecone index name |
| `LLM_BACKEND` | No | `groq` | `groq` or `local` |
| `GROQ_API_KEY` | If Groq | - | Groq API key |
| `GROQ_MODEL` | No | `llama-3.3-70b-versatile` | Groq model name |
| `MODEL_PATH` | If local | `./models/merged/exp2_lr2e-4_r16` | Path to merged model |
| `EMBED_MODEL` | No | `sentence-transformers/all-MiniLM-L6-v2` | Embedding model |
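
The table above can be read as a small validation routine. The sketch below is an assumption about how such a loader might look (the `load_settings` name and the dictionary layout are hypothetical, not taken from the project); the defaults are copied from the table.

```python
import os

# Defaults copied from the reference table above.
DEFAULTS = {
    "PINECONE_INDEX": "llmops-rag",
    "LLM_BACKEND": "groq",
    "GROQ_MODEL": "llama-3.3-70b-versatile",
    "MODEL_PATH": "./models/merged/exp2_lr2e-4_r16",
    "EMBED_MODEL": "sentence-transformers/all-MiniLM-L6-v2",
}


def load_settings(env=None):
    """Merge defaults with the environment and check required keys."""
    env = os.environ if env is None else env
    settings = dict(DEFAULTS)
    settings.update({k: v for k, v in env.items() if k in DEFAULTS})
    settings["PINECONE_API_KEY"] = env.get("PINECONE_API_KEY", "")
    settings["GROQ_API_KEY"] = env.get("GROQ_API_KEY", "")

    missing = []
    if not settings["PINECONE_API_KEY"]:
        missing.append("PINECONE_API_KEY")
    # GROQ_API_KEY is only required when the Groq backend is selected.
    if settings["LLM_BACKEND"] == "groq" and not settings["GROQ_API_KEY"]:
        missing.append("GROQ_API_KEY")
    if missing:
        raise RuntimeError("Missing required variables: " + ", ".join(missing))
    return settings
```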
|
| 190 |
+
|
| 191 |
+
---
|
| 192 |
+
|
| 193 |
+
## CI/CD (Planned)
|
| 194 |
+
|
| 195 |
+
The intended CI/CD pipeline:
|
| 196 |
+
|
| 197 |
+
```
git push main
     │
     ▼
GitHub Actions
  ├── Run tests
  ├── Build Docker image
  ├── Push to ACR
  └── az containerapp update --image new-tag
```
|
| 207 |
+
|
| 208 |
+
This would give zero-downtime rolling deploys on every push to main. Currently, re-running `deploy_azure.sh` achieves the same result with a cold start.
|
README.md
CHANGED
|
@@ -6,95 +6,235 @@ sdk: docker
|
|
| 6 |
pinned: false
|
| 7 |
---
|
| 8 |
| 9 |
# Irminsul
|
| 10 |
|
| 11 |
-
| 12 |
|
| 13 |
-
|
| 14 |
|
| 15 |
-
**[β
|
| 16 |
|
| 17 |
---
|
| 18 |
|
| 19 |
-
##
|
| 20 |
|
| 21 |
-
|
| 22 |
|
| 23 |
-
-
|
| 24 |
-
-
|
| 25 |
-
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
- **Domain knowledge**: RAG corpus built around Genshin Impact lore, character builds, and elemental mechanics, serving as a rich real-world knowledge base for retrieval evaluation
|
| 29 |
|
| 30 |
---
|
| 31 |
|
| 32 |
## Architecture
|
| 33 |
|
| 34 |
```
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
| 53 |
```
|
| 54 |
|
| 55 |
---
|
| 56 |
| 57 |
## Stack
|
| 58 |
|
| 59 |
| Layer | Technology |
|
| 60 |
|---|---|
|
| 61 |
| Base model | Llama 3.1 8B Instruct |
|
| 62 |
| Fine-tuning | QLoRA via PEFT (r=16, α=32, lr=2e-4) |
|
|
|
|
| 63 |
| Quantization | BitsAndBytes 4-bit NF4, bfloat16 compute |
|
| 64 |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 |
|
| 65 |
-
| Vector DB | Pinecone
|
| 66 |
| RAG chain | LangChain RetrievalQA |
|
| 67 |
| Serving | FastAPI + Uvicorn |
|
| 68 |
| Containerization | Docker (python:3.12-slim) |
|
| 69 |
-
|
|
|
|
|
|
|
|
|
|
|
| 70 |
|
| 71 |
---
|
| 72 |
|
| 73 |
## Quickstart
|
| 74 |
|
|
|
|
|
|
|
| 75 |
```bash
|
| 76 |
-
# 1. Clone
|
| 77 |
git clone https://github.com/MukulRay1603/Irminsul.git
|
| 78 |
cd Irminsul
|
| 79 |
-
|
|
|
|
|
|
|
|
|
|
| 80 |
pip install -r requirements.txt
|
| 81 |
|
| 82 |
-
#
|
| 83 |
cp .env.example .env
|
| 84 |
-
#
|
|
|
|
| 85 |
|
| 86 |
-
#
|
| 87 |
-
# Place merged model at: ./models/merged/exp2_lr2e-4_r16
|
| 88 |
-
# Or update MODEL_PATH in .env to point to your model
|
| 89 |
-
|
| 90 |
-
# 4. Ingest documents
|
| 91 |
python ingest.py --dir ./docs --chunk-size 300 --chunk-overlap 40
|
| 92 |
|
| 93 |
-
# 5.
| 94 |
uvicorn main:app --reload --port 8000
|
|
|
|
|
|
|
|
|
|
| 95 |
|
| 96 |
-
|
| 97 |
-
#
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 98 |
```
|
| 99 |
|
| 100 |
---
|
|
@@ -104,81 +244,161 @@ uvicorn main:app --reload --port 8000
|
|
| 104 |
| Method | Endpoint | Description |
|
| 105 |
|---|---|---|
|
| 106 |
| `GET` | `/` | Browser UI |
|
| 107 |
-
| `GET` | `/health` | Model load status |
|
| 108 |
-
| `POST` | `/generate` | RAG query → grounded answer |
|
| 109 |
-
| `POST` | `/ingest` | Ingest docs from local directory |
|
| 110 |
-
|
| 111 |
-
**Example:**
|
| 112 |
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
|
|
|
|
|
|
|
| 117 |
```
|
| 118 |
|
|
|
|
| 119 |
```json
|
| 120 |
{
|
| 121 |
"answer": "For Hu Tao on a budget, Dragon's Bane is the strongest F2P option: it scales with Elemental Mastery and deals significant bonus damage on vaporized hits. White Tassel is the best 3-star alternative for pure Normal Attack scaling.",
|
| 122 |
-
"sources": ["docs/
|
| 123 |
-
"latency_ms":
|
|
|
|
| 124 |
}
|
| 125 |
```
|
| 126 |
|
|
|
|
|
|
|
| 127 |
---
|
| 128 |
|
| 129 |
-
##
|
| 130 |
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
|
|
|
|
|
|
| 136 |
|
| 137 |
-
|
|
|
|
|
|
|
| 138 |
|
| 139 |
---
|
| 140 |
|
| 141 |
-
##
|
| 142 |
|
| 143 |
-
```bash
|
| 144 |
-
export PINECONE_API_KEY=your_key
|
| 145 |
-
chmod +x deploy_azure.sh
|
| 146 |
-
./deploy_azure.sh
|
| 147 |
```
|
| 148 |
|
| 149 |
-
| 150 |
|
| 151 |
-
| 152 |
|
| 153 |
---
|
| 154 |
|
| 155 |
-
##
|
| 156 |
|
| 157 |
-
|
| 158 |
-
|
| 159 |
-
|
| 160 |
-
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
|
| 170 |
-
|
| 171 |
|
| 172 |
---
|
| 173 |
|
| 174 |
-
## What's
|
| 175 |
|
| 176 |
-
- [ ]
|
| 177 |
-
- [ ]
|
| 178 |
-
- [ ]
|
| 179 |
-
- [ ]
|
| 180 |
-
- [ ] CI/CD pipeline: GitHub Actions → ACR build →
|
|
|
|
| 181 |
|
| 182 |
---
|
| 183 |
|
| 184 |
-
| 6 |
pinned: false
|
| 7 |
---
|
| 8 |
|
| 9 |
+
<div align="center">
|
| 10 |
+
|
| 11 |
+
<img src="docs/assets/banner.png" alt="Irminsul Banner" width="100%">
|
| 12 |
+
<!-- PLACEHOLDER: Add a banner image. Can be a dark-themed graphic with the Irminsul logo/name.
|
| 13 |
+
Recommended: 1280x320px, dark green/forest aesthetic matching the UI.
|
| 14 |
+
Tools: Figma, Canva, or even a screenshot of the UI works. -->
|
| 15 |
+
|
| 16 |
# Irminsul
|
| 17 |
|
| 18 |
+
**A production-shaped LLMOps stack: QLoRA fine-tuning on Colab, RAG pipeline, containerized serving, and cloud deployment.**
|
| 19 |
+
|
| 20 |
+
[Live demo](https://huggingface.co/spaces/MukulRay/Irminsul)
[GitHub repository](https://github.com/MukulRay1603/Irminsul)
[Corpus repository](https://github.com/MukulRay1603/irminsul-corpus)
[License](LICENSE)
|
| 24 |
+
|
| 25 |
+
</div>
|
| 26 |
+
|
| 27 |
+
---
|
| 28 |
|
| 29 |
+
Most LLM projects stop at inference. This one builds the full stack: a QLoRA fine-tuned Llama 3.1 8B served through a RAG pipeline, with guardrails, a domain-specific knowledge base, and a containerized FastAPI server designed for cloud deployment.
|
| 30 |
|
| 31 |
+
**[→ Try the live demo](https://huggingface.co/spaces/MukulRay/Irminsul)**
|
| 32 |
|
| 33 |
---
|
| 34 |
|
| 35 |
+
## What This Is
|
| 36 |
|
| 37 |
+
Irminsul is a domain-specific AI assistant for Genshin Impact, built not because Genshin needed an AI assistant, but because it provided a concrete, evaluable knowledge domain to build an LLMOps pipeline around. Every component was chosen deliberately:
|
| 38 |
|
| 39 |
+
- A knowledge domain rich enough to evaluate retrieval quality (characters, mechanics, lore)
|
| 40 |
+
- Ground truth data available (KQM Theorycrafting Library, game stat APIs) to measure hallucination
|
| 41 |
+
- Community signal data (patch notes, meta shifts) to test corpus freshness
|
| 42 |
+
|
| 43 |
+
The domain is the test harness. The pipeline is the project.
|
|
|
|
| 44 |
|
| 45 |
---
|
| 46 |
|
| 47 |
## Architecture
|
| 48 |
|
| 49 |
```
┌─────────────────────────────────────────────────────────────────┐
│                           User Query                            │
└─────────────────────┬───────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────────┐
│                        Guardrails Layer                         │
│  • Injection detection (pattern matching)                       │
│  • Domain validation (cosine similarity vs anchor embeddings)   │
│  • Output sanitization                                          │
└─────────────────────┬───────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────────┐
│                        FastAPI /generate                        │
└──────────┬──────────────────────────────────────────────────────┘
           │
           ├── Embed query (sentence-transformers, local, CPU)
           │              │
           │              ▼
           │    Pinecone ── semantic search ──► top-k chunks
           │              │
           ▼              ▼
    LangChain RetrievalQA (stuff chain)
           │
           ▼
┌────────────────────────────────────┐
│            LLM Backend             │
│                                    │
│  Groq (live demo)                  │
│  llama-3.3-70b-versatile           │
│  ~300 tok/s, free tier             │
│  ──── OR ────                      │
│  Local (fine-tuned)                │
│  Llama 3.1 8B QLoRA                │
│  4-bit NF4, RTX 3060 6GB           │
│  (inference only, trained on       │
│   Colab A100)                      │
└────────────────────────────────────┘
           │
           ▼
  Grounded answer + source attribution
```
|
| 93 |
|
| 94 |
---
|
| 95 |
|
| 96 |
+
## Components
|
| 97 |
+
|
| 98 |
+
### Fine-Tuned Model
|
| 99 |
+
|
| 100 |
+
Llama 3.1 8B Instruct fine-tuned with QLoRA on a custom instruction dataset, trained on Google Colab Pro (A100). Local inference runs in 4-bit NF4 quantization on an RTX 3060 6GB.
|
| 101 |
+
|
| 102 |
+
**[→ View training notebook on Colab](https://colab.research.google.com/drive/YOUR_NOTEBOOK_LINK_HERE)**
|
| 103 |
+
<!-- PLACEHOLDER: Replace YOUR_NOTEBOOK_LINK_HERE with your actual Colab share link
|
| 104 |
+
File → Share → Copy link (set to "Anyone with the link can view") -->
|
| 105 |
+
|
| 106 |
+
| Parameter | Value |
|
| 107 |
+
|---|---|
|
| 108 |
+
| Base model | `meta-llama/Llama-3.1-8B-Instruct` |
|
| 109 |
+
| Method | QLoRA via PEFT |
|
| 110 |
+
| Rank / Alpha | r=16, α=32 |
|
| 111 |
+
| Learning rate | 2e-4 |
|
| 112 |
+
| Quantization (inference) | 4-bit NF4, bfloat16 compute |
|
| 113 |
+
| Training infra | Google Colab Pro (A100) |
|
| 114 |
+
| Experiment tracking | MLflow (3 runs) |
|
| 115 |
+
|
| 116 |
+
Three experiments were tracked in MLflow. Winning checkpoint selected by faithfulness score (0.826) and ROUGE-L (0.466) on a held-out eval set.
|
| 117 |
+
|
| 118 |
+
<!-- PLACEHOLDER: Add MLflow experiment screenshot here
|
| 119 |
+
docs/assets/mlflow_experiments.png
|
| 120 |
+
A screenshot of your MLflow UI showing the 3 runs and metrics comparison -->
|
| 121 |
+
|
| 122 |
+
### RAG Pipeline
|
| 123 |
+
|
| 124 |
+
Documents are chunked, embedded locally with `sentence-transformers/all-MiniLM-L6-v2` (384-dim, zero API cost), and stored in Pinecone serverless. Retrieval is semantic, top-k configurable per query.
|
| 125 |
+
|
| 126 |
+
| Component | Choice | Reason |
|
| 127 |
+
|---|---|---|
|
| 128 |
+
| Embedder | all-MiniLM-L6-v2 | Runs locally, strong semantic retrieval, 384-dim fits free Pinecone tier |
|
| 129 |
+
| Vector DB | Pinecone serverless | Zero ops, cosine similarity, free tier sufficient for corpus size |
|
| 130 |
+
| Chunking | Word-level, 300 words, 40-word overlap | Preserves semantic units across chunk boundaries |
|
| 131 |
+
| Chain | LangChain RetrievalQA (stuff) | Simple, inspectable, returns source documents |
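
The word-level chunking in the table can be sketched as a sliding window. This is an illustration under stated assumptions rather than the code in `ingest.py`: the `chunk_words` name is hypothetical, and whitespace splitting is assumed as the definition of a "word".

```python
def chunk_words(text, chunk_size=300, overlap=40):
    """Split text into chunks of `chunk_size` words sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # each window starts 260 words after the last
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final window already reached the end of the text
    return chunks
```

With the defaults, each chunk repeats the last 40 words of its predecessor, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.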
|
| 132 |
+
|
| 133 |
+
### Knowledge Corpus
|
| 134 |
+
|
| 135 |
+
Corpus is maintained in a [separate repository](https://github.com/MukulRay1603/irminsul-corpus) with an autonomous update pipeline. It ingests from three tiers of sources with different trust levels:
|
| 136 |
+
|
| 137 |
+
| Tier | Source | Files | Trust |
|---|---|---|---|
| 1: Ground Truth | KQM Theorycrafting Library (peer-reviewed mechanics) | ~305 | Highest; cite in builds |
| 1: Ground Truth | genshin-db API (exact character/weapon/artifact stats) | ~406 | Highest; exact game data |
| 2: Expert Synthesis | Gemini-authored prose grounded in Tier 1 | ~83 | High; no hallucinated stats |
| 3: Community Signal | Official patch notes, banner history, event calendar | ~80 | Medium; tagged explicitly |
|
| 143 |
+
|
| 144 |
+
A GitHub Actions workflow runs every Sunday at 2am UTC, pulls fresh data, commits the docs, and re-ingests ~4,000 vectors to Pinecone automatically.
|
| 145 |
+
|
| 146 |
+
### Guardrails
|
| 147 |
+
|
| 148 |
+
Two layers of input validation before any LLM call:
|
| 149 |
+
|
| 150 |
+
1. **Injection detection**: pattern matching against known jailbreak phrases (`ignore previous instructions`, `act as`, `DAN mode`, etc.)
2. **Domain validation**: cosine similarity between the query embedding and a set of Genshin-domain anchor sentences. Queries scoring below the threshold (0.35) are rejected with a domain-scoped error message before touching the LLM.
|
| 152 |
+
|
| 153 |
+
Output is sanitized to strip generation artifacts (`</s>` tokens, trailing whitespace) and length-checked.
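
The guardrail checks described above could look roughly like the sketch below. It is not the contents of `guardrails.py`: the function names are hypothetical, the pattern list only covers the three phrases named in the text, and taking the maximum similarity over the anchors is an assumption (the source does not say how anchor scores are combined). Embeddings are passed in as plain lists so the example stays dependency-free.

```python
import math
import re

# Only the phrases named in the text; a real list would be longer.
INJECTION_PATTERNS = [
    r"ignore previous instructions",
    r"\bact as\b",
    r"\bDAN mode\b",
]
DOMAIN_THRESHOLD = 0.35


def looks_like_injection(query):
    """True if the query matches any known jailbreak phrase."""
    return any(re.search(p, query, re.IGNORECASE) for p in INJECTION_PATTERNS)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def in_domain(query_vec, anchor_vecs, threshold=DOMAIN_THRESHOLD):
    """Assumption: a query passes if its best anchor match meets the threshold."""
    best = max((cosine(query_vec, a) for a in anchor_vecs), default=0.0)
    return best >= threshold


def sanitize(answer):
    """Strip end-of-sequence tokens and surrounding whitespace."""
    return answer.replace("</s>", "").strip()
```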
|
| 154 |
+
|
| 155 |
+
### Serving Layer
|
| 156 |
+
|
| 157 |
+
FastAPI with:
|
| 158 |
+
- Async lifespan model loading (model loads once at startup, not per request)
|
| 159 |
+
- Typed Pydantic request/response models with `blocked` flag for guardrail rejections
|
| 160 |
+
- CORS enabled for cross-origin UI
|
| 161 |
+
- `/health` endpoint reporting model load status
|
| 162 |
+
- Browser UI served from the same process (no separate frontend server)
|
| 163 |
+
|
| 164 |
+
---
|
| 165 |
+
|
| 166 |
## Stack
|
| 167 |
|
| 168 |
| Layer | Technology |
|
| 169 |
|---|---|
|
| 170 |
| Base model | Llama 3.1 8B Instruct |
|
| 171 |
| Fine-tuning | QLoRA via PEFT (r=16, α=32, lr=2e-4) |
|
| 172 |
+
| Experiment tracking | MLflow |
|
| 173 |
| Quantization | BitsAndBytes 4-bit NF4, bfloat16 compute |
|
| 174 |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 |
|
| 175 |
+
| Vector DB | Pinecone serverless (cosine, 384-dim) |
|
| 176 |
| RAG chain | LangChain RetrievalQA |
|
| 177 |
| Serving | FastAPI + Uvicorn |
|
| 178 |
| Containerization | Docker (python:3.12-slim) |
|
| 179 |
+
| Live demo hosting | HuggingFace Spaces (CPU Basic) |
|
| 180 |
+
| Production deployment | Azure Container Apps + ACR |
|
| 181 |
+
| LLM backend (demo) | Groq API (llama-3.3-70b-versatile) |
|
| 182 |
+
| Corpus pipeline | GitHub Actions (weekly, autonomous) |
|
| 183 |
|
| 184 |
---
|
| 185 |
|
| 186 |
## Quickstart
|
| 187 |
|
| 188 |
+
### Option 1: Groq backend (no GPU required)
|
| 189 |
+
|
| 190 |
```bash
|
| 191 |
+
# 1. Clone
|
| 192 |
git clone https://github.com/MukulRay1603/Irminsul.git
|
| 193 |
cd Irminsul
|
| 194 |
+
|
| 195 |
+
# 2. Install
|
| 196 |
+
python -m venv venv && source venv/bin/activate
|
| 197 |
+
# Windows: venv\Scripts\activate
|
| 198 |
pip install -r requirements.txt
|
| 199 |
|
| 200 |
+
# 3. Configure
|
| 201 |
cp .env.example .env
|
| 202 |
+
# Set PINECONE_API_KEY and GROQ_API_KEY in .env
|
| 203 |
+
# LLM_BACKEND=groq is the default
|
| 204 |
|
| 205 |
+
# 4. Ingest corpus (or use pre-ingested Pinecone index)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 206 |
python ingest.py --dir ./docs --chunk-size 300 --chunk-overlap 40
|
| 207 |
|
| 208 |
+
# 5. Run
|
| 209 |
+
uvicorn main:app --reload --port 8000
|
| 210 |
+
# UI: http://localhost:8000
|
| 211 |
+
# API docs: http://localhost:8000/docs
|
| 212 |
+
```
|
| 213 |
+
|
| 214 |
+
### Option 2: Local fine-tuned model (GPU required for inference, 6GB+ VRAM)
|
| 215 |
+
|
| 216 |
+
```bash
|
| 217 |
+
# Same steps 1-3, then:
|
| 218 |
+
# Set LLM_BACKEND=local and MODEL_PATH in .env
|
| 219 |
+
|
| 220 |
+
# 4. Download model
|
| 221 |
+
# Place the merged QLoRA model at: ./models/merged/exp2_lr2e-4_r16/
|
| 222 |
+
# (Or update MODEL_PATH in .env)
|
| 223 |
+
|
| 224 |
+
# 5. Run
|
| 225 |
uvicorn main:app --reload --port 8000
|
| 226 |
+
```
|
| 227 |
+
|
| 228 |
+
### Docker
|
| 229 |
|
| 230 |
+
```bash
|
| 231 |
+
# Groq backend (no GPU)
|
| 232 |
+
docker build -t irminsul:latest .
|
| 233 |
+
docker run -p 8000:8000 \
|
| 234 |
+
-e PINECONE_API_KEY=your_key \
|
| 235 |
+
-e GROQ_API_KEY=your_key \
|
| 236 |
+
-e LLM_BACKEND=groq \
|
| 237 |
+
irminsul:latest
|
| 238 |
```
|
| 239 |
|
| 240 |
---
|
|
|
|
| 244 |
| Method | Endpoint | Description |
|
| 245 |
|---|---|---|
|
| 246 |
| `GET` | `/` | Browser UI |
|
| 247 |
+
| `GET` | `/health` | Model load status + ready flag |
|
| 248 |
+
| `POST` | `/generate` | RAG query → grounded answer + sources |
|
| 249 |
+
| `POST` | `/ingest` | Ingest docs from a local directory path |
|
|
|
|
|
|
|
| 250 |
|
| 251 |
+
**Request:**
|
| 252 |
+
```json
|
| 253 |
+
{
|
| 254 |
+
"query": "What weapons should Hu Tao use on a budget?",
|
| 255 |
+
"top_k": 3
|
| 256 |
+
}
|
| 257 |
```
|
| 258 |
|
| 259 |
+
**Response:**
|
| 260 |
```json
|
| 261 |
{
|
| 262 |
"answer": "For Hu Tao on a budget, Dragon's Bane is the strongest F2P option: it scales with Elemental Mastery and deals significant bonus damage on vaporized hits. White Tassel is the best 3-star alternative for pure Normal Attack scaling.",
|
| 263 |
+
"sources": ["docs/generated/characters/hu_tao.md", "docs/tcl/characters/pyro/hutao.md"],
|
| 264 |
+
"latency_ms": 1240.5,
|
| 265 |
+
"blocked": false
|
| 266 |
}
|
| 267 |
```
|
| 268 |
|
| 269 |
+
If a query is rejected by guardrails, `blocked: true` is returned with the rejection reason in `answer`. No LLM call is made.
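
A client can branch on the `blocked` flag before using the answer. The sketch below uses only the standard library and the field names shown in the request and response examples; the helper names and the base URL are assumptions for illustration.

```python
import json
from urllib import request


def build_generate_request(base_url, query, top_k=3):
    """Build a POST request for /generate with the documented JSON body."""
    payload = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return request.Request(
        f"{base_url}/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize_response(body):
    """Turn a /generate response body into one human-readable line."""
    data = json.loads(body)
    if data.get("blocked"):
        # Guardrail rejection: `answer` carries the reason, no LLM call was made.
        return f"Blocked by guardrails: {data['answer']}"
    sources = ", ".join(data.get("sources", []))
    return f"{data['answer']} (sources: {sources})"


# Usage against a locally running server (assumed at port 8000):
#   req = build_generate_request("http://localhost:8000", "Best Hu Tao weapon?")
#   with request.urlopen(req) as resp:
#       print(summarize_response(resp.read()))
```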
|
| 270 |
+
|
| 271 |
---
|
| 272 |
|
| 273 |
+
## Deployment
|
| 274 |
|
| 275 |
+
See **[DEPLOYMENT.md](DEPLOYMENT.md)** for the full guide covering:
|
| 276 |
+
|
| 277 |
+
- Local development setup
|
| 278 |
+
- Docker (local and cloud)
|
| 279 |
+
- Azure Container Apps (one-shot `deploy_azure.sh`)
|
| 280 |
+
- Cost breakdown and the reasoning behind the demo setup
|
| 281 |
+
- GPU serving path for the fine-tuned model
|
| 282 |
|
| 283 |
+
**Why the live demo runs on HuggingFace + Groq, not Azure GPU:**
|
| 284 |
+
|
| 285 |
+
Serving the fine-tuned Llama 3.1 8B requires a GPU instance. The minimum viable option on Azure (NC4as T4 v3) costs ~$360/month, which is not justified for a portfolio project. The Dockerfile and `deploy_azure.sh` are written for the Azure path; the live demo swaps the LLM backend to Groq via a single environment variable. The RAG pipeline, guardrails, and serving layer are identical.
|
| 286 |
|
| 287 |
---
|
| 288 |
|
| 289 |
+
## Project Structure
|
| 290 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 291 |
```
Irminsul/
├── main.py              # FastAPI app: endpoints, lifespan, CORS, response models
├── rag.py               # LangChain RAG chain, dual backend (Groq / local Llama)
├── embedder.py          # sentence-transformers singleton (loads once, reused)
├── ingest.py            # Doc loader → word chunker → Pinecone upsert
├── guardrails.py        # Input validation: injection detection + domain cosine check
├── index.html           # Browser UI: dark Dendro theme, query history, source display
│
├── Dockerfile           # python:3.12-slim, model NOT baked in
├── deploy_azure.sh      # One-shot ACR build + Container Apps deploy
├── .env.example         # Environment variable reference
│
├── DEPLOYMENT.md        # Full deployment guide + cost analysis
├── requirements.txt
├── images/              # Screenshots and assets used in this README
│   ├── banner.png
│   ├── ui_main.png
│   ├── ui_response.png
│   └── mlflow_runs.png
└── docs/
    ├── corpus/          # Legacy manual corpus docs
    └── demo.html        # GitHub Pages demo page
```
|
| 315 |
+
|
| 316 |
+
---
|
| 317 |
+
|
| 318 |
+
## Evaluation
|
| 319 |
|
| 320 |
+
<!-- PLACEHOLDER: Fill this section once you have eval numbers ready.
|
| 321 |
+
Consider running a small eval set (20-50 questions) with:
|
| 322 |
+
- Faithfulness: Does the answer contradict the retrieved context?
|
| 323 |
+
- Answer relevance: Does the answer address the question?
|
| 324 |
+
- Context recall: Did retrieval find the right documents?
|
| 325 |
+
|
| 326 |
+
Tools to consider: RAGAS (pip install ragas) against your Pinecone index.
|
| 327 |
+
|
| 328 |
+
Example format:
|
| 329 |
|
| 330 |
+
| Metric | Score | Method |
|
| 331 |
+
|---|---|---|
|
| 332 |
+
| Faithfulness | 0.826 | Custom eval, n=50 |
|
| 333 |
+
| ROUGE-L | 0.466 | vs reference answers |
|
| 334 |
+
| Context recall | TBD | RAGAS |
|
| 335 |
+
| Answer relevance | TBD | RAGAS |
|
| 336 |
+
|
| 337 |
+
The fine-tuned model numbers (0.826 faithfulness, 0.466 ROUGE-L) came from
|
| 338 |
+
your MLflow eval during training β pull those into this table.
|
| 339 |
+
-->
|
| 340 |
+
|
| 341 |
+
The fine-tuned model was evaluated during training with a held-out set:
|
| 342 |
+
|
| 343 |
+
| Metric | Score |
|
| 344 |
+
|---|---|
|
| 345 |
+
| Faithfulness | 0.826 |
|
| 346 |
+
| ROUGE-L | 0.466 |
|
| 347 |
+
|
| 348 |
+
Full RAG pipeline evaluation (context recall, answer relevance) is a planned addition β see [What's Next](#whats-next).
|
| 349 |
|
| 350 |
---
|
| 351 |
|
| 352 |
+
## Screenshots
|
| 353 |
|
| 354 |
+
<!-- PLACEHOLDER: Add screenshots once you have them.
|
| 355 |
+
Save to images/ and uncomment these lines:
|
| 356 |
+
|
| 357 |
+

|
| 358 |
+

|
| 359 |
+

|
| 360 |
+
|
| 361 |
+
Tips:
|
| 362 |
+
- ui_main.png: screenshot of http://localhost:8000 before any query
|
| 363 |
+
- ui_response.png: run a query (try "best build for Hu Tao") so the answer + sources section is visible
|
| 364 |
+
- mlflow_runs.png: from your Colab, the experiment comparison table showing 3 runs
|
| 365 |
+
-->
|
| 366 |
+
|
| 367 |
+
*Screenshots coming soon β [try the live demo](https://huggingface.co/spaces/MukulRay/Irminsul) to see it in action.*
|
| 368 |
|
| 369 |
---
|
| 370 |
|
| 371 |
+
## What's Next
|
| 372 |
|
| 373 |
+
- [ ] **RAGAS evaluation** β systematic RAG eval (faithfulness, context recall, answer relevance) on a held-out question set
|
| 374 |
+
- [ ] **MarkdownHeaderTextSplitter**: replace the naive word chunker with section-aware chunking that respects document structure
|
| 375 |
+
- [ ] **Metadata filtering**: filter Pinecone queries by character, content tier, or topic category
|
| 376 |
+
- [ ] **Streaming responses**: SSE for lower perceived latency on long answers
|
| 377 |
+
- [ ] **CI/CD pipeline**: GitHub Actions → ACR build → `az containerapp update` on push to main
|
| 378 |
+
- [ ] **Corpus expansion**: constellation effects, rotation guides, and ER/EM thresholds per character
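
To illustrate what section-aware chunking means in the list above, here is a rough pure-Python sketch that splits a Markdown document on `#` headers and keeps the header path with each chunk. It is not the LangChain `MarkdownHeaderTextSplitter` and not the project's current chunker; `split_by_headers` is a hypothetical helper, and the sketch ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re

# Matches ATX-style Markdown headers such as "## Weapons".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown_text):
    """Split Markdown into sections, each tagged with its header path."""
    sections = []
    path = {}   # header level -> current title at that level
    body = []   # lines collected for the section in progress

    def flush():
        text = "\n".join(body).strip()
        if text:
            headers = [path[level] for level in sorted(path)]
            sections.append({"headers": headers, "text": text})
        body.clear()

    for line in markdown_text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header closes any open headers at the same or deeper level.
            for stale in [k for k in path if k >= level]:
                del path[stale]
            path[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


# Made-up sample document, for illustration only.
sample = "# Hu Tao\nIntro\n## Weapons\nStaff of Homa\n## Artifacts\nCrimson Witch\n"
for section in split_by_headers(sample):
    print(" > ".join(section["headers"]), "|", section["text"])
```

Each chunk carries its header path, so the path could also be stored as vector metadata, which would pair naturally with the metadata filtering item in the same list.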
|
| 379 |
|
| 380 |
---
|
| 381 |
|
| 382 |
+
## Related: irminsul-corpus
|
| 383 |
+
|
| 384 |
+
The knowledge base is maintained in a companion repository:
|
| 385 |
+
|
| 386 |
+
**[MukulRay1603/irminsul-corpus](https://github.com/MukulRay1603/irminsul-corpus)**
|
| 387 |
+
|
| 388 |
+
It runs an autonomous weekly pipeline that pulls fresh game data from the KQM Theorycrafting Library and the genshin-db API, synthesizes prose with Gemini 2.5 Flash, commits ~800 documents to the repo, and re-ingests ~4,000 vectors into Pinecone, with no manual intervention.
|
| 389 |
+
|
| 390 |
+
---
|
| 391 |
+
|
| 392 |
+
## License
|
| 393 |
+
|
| 394 |
+
MIT. See [LICENSE](LICENSE) for details.
|
| 395 |
+
|
| 396 |
+
Genshin Impact is owned by HoYoverse. This project is not affiliated with or endorsed by HoYoverse.
|
| 397 |
+
|
| 398 |
+
---
|
| 399 |
+
|
| 400 |
+
<div align="center">
|
| 401 |
+
|
| 402 |
+
Built to learn the full MLOps lifecycle (fine-tuning, quantization, retrieval, serving, and cloud deployment) on consumer hardware. Every component was chosen deliberately, not for hype.
|
| 403 |
+
|
| 404 |
+
</div>
|
deploy_azure.sh
ADDED
|
@@ -0,0 +1,149 @@
|
| 1 |
+
#!/bin/bash
|
| 2 |
+
# ─────────────────────────────────────────────────────────────────────────────
|
| 3 |
+
# deploy_azure.sh - One-shot Azure Container Apps deployment for Irminsul
|
| 4 |
+
#
|
| 5 |
+
# Prerequisites:
|
| 6 |
+
# - Azure CLI installed: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli
|
| 7 |
+
# - Logged in: az login
|
| 8 |
+
# - Subscription active: az account show
|
| 9 |
+
#
|
| 10 |
+
# Usage:
|
| 11 |
+
# export PINECONE_API_KEY=your_key
|
| 12 |
+
# export GROQ_API_KEY=your_key
|
| 13 |
+
# chmod +x deploy_azure.sh
|
| 14 |
+
# ./deploy_azure.sh
|
| 15 |
+
#
|
| 16 |
+
# What this script does:
|
| 17 |
+
# 1. Creates a resource group in East US
|
| 18 |
+
# 2. Creates an Azure Container Registry (ACR)
|
| 19 |
+
# 3. Builds the Docker image via ACR Tasks (no local Docker build needed)
|
| 20 |
+
# 4. Creates a Container Apps environment
|
| 21 |
+
# 5. Deploys the container with secrets injected as env vars
|
| 22 |
+
# 6. Prints the live HTTPS URL
|
| 23 |
+
#
|
| 24 |
+
# Cost note:
|
| 25 |
+
# This stack (Groq backend, no GPU) runs on a consumption-plan Container App.
|
| 26 |
+
# Estimated cost: ~$0/month on free tier (180,000 vCPU-seconds/month free).
|
| 27 |
+
# GPU-accelerated inference (local Llama backend) requires NC-series instances
|
| 28 |
+
# (~$0.50-1.50/hr) which is not cost-effective for a portfolio project.
|
| 29 |
+
# See DEPLOYMENT.md for the full cost analysis.
|
| 30 |
+
# ─────────────────────────────────────────────────────────────────────────────
|
| 31 |
+
|
| 32 |
+
set -e # exit on any error
|
| 33 |
+
|
| 34 |
+
# ── Configuration ─────────────────────────────────────────────────────────────
|
| 35 |
+
RESOURCE_GROUP="irminsul-rg"
|
| 36 |
+
LOCATION="eastus"
|
| 37 |
+
ACR_NAME="irminsulacr" # must be globally unique, lowercase alphanumeric
|
| 38 |
+
ENVIRONMENT="irminsul-env"
|
| 39 |
+
APP_NAME="irminsul"
|
| 40 |
+
IMAGE_TAG="latest"
|
| 41 |
+
|
| 42 |
+
# ── Validate required secrets ─────────────────────────────────────────────────
|
| 43 |
+
if [[ -z "$PINECONE_API_KEY" ]]; then
|
| 44 |
+
echo "ERROR: PINECONE_API_KEY environment variable is not set."
|
| 45 |
+
echo " export PINECONE_API_KEY=your_key"
|
| 46 |
+
exit 1
|
| 47 |
+
fi
|
| 48 |
+
|
| 49 |
+
if [[ -z "$GROQ_API_KEY" ]]; then
|
| 50 |
+
echo "ERROR: GROQ_API_KEY environment variable is not set."
|
| 51 |
+
echo " export GROQ_API_KEY=your_key"
|
| 52 |
+
exit 1
|
| 53 |
+
fi
|
| 54 |
+
|
| 55 |
+
echo "──────────────────────────────────────────────────"
|
| 56 |
+
echo " Irminsul - Azure Container Apps Deployment"
|
| 57 |
+
echo "──────────────────────────────────────────────────"
|
| 58 |
+
echo ""
|
| 59 |
+
|
| 60 |
+
# ── Step 1: Resource Group ────────────────────────────────────────────────────
|
| 61 |
+
echo "[1/5] Creating resource group: $RESOURCE_GROUP"
|
| 62 |
+
az group create \
|
| 63 |
+
--name "$RESOURCE_GROUP" \
|
| 64 |
+
--location "$LOCATION" \
|
| 65 |
+
--output none
|
| 66 |
+
echo " ✓ Resource group ready"
|
| 67 |
+
|
| 68 |
+
# ── Step 2: Azure Container Registry ──────────────────────────────────────────
|
| 69 |
+
echo "[2/5] Creating container registry: $ACR_NAME"
|
| 70 |
+
az acr create \
|
| 71 |
+
--resource-group "$RESOURCE_GROUP" \
|
| 72 |
+
--name "$ACR_NAME" \
|
| 73 |
+
--sku Basic \
|
| 74 |
+
--admin-enabled true \
|
| 75 |
+
--output none
|
| 76 |
+
echo " ✓ ACR created"
|
| 77 |
+
|
| 78 |
+
# ── Step 3: Build image via ACR Tasks (cloud build, no local Docker needed) ───
|
| 79 |
+
echo "[3/5] Building Docker image via ACR Tasks..."
|
| 80 |
+
echo " This uploads your source code to Azure and builds in the cloud."
|
| 81 |
+
az acr build \
|
| 82 |
+
--registry "$ACR_NAME" \
|
| 83 |
+
--image "${APP_NAME}:${IMAGE_TAG}" \
|
| 84 |
+
.
|
| 85 |
+
echo " ✓ Image built and pushed: ${ACR_NAME}.azurecr.io/${APP_NAME}:${IMAGE_TAG}"
|
| 86 |
+
|
| 87 |
+
# ── Step 4: Container Apps Environment ────────────────────────────────────────
|
| 88 |
+
echo "[4/5] Creating Container Apps environment: $ENVIRONMENT"
|
| 89 |
+
az containerapp env create \
|
| 90 |
+
--name "$ENVIRONMENT" \
|
| 91 |
+
--resource-group "$RESOURCE_GROUP" \
|
| 92 |
+
--location "$LOCATION" \
|
| 93 |
+
--output none
|
| 94 |
+
echo " ✓ Environment ready"
|
| 95 |
+
|
| 96 |
+
# ── Step 5: Deploy Container App ──────────────────────────────────────────────
|
| 97 |
+
echo "[5/5] Deploying container app: $APP_NAME"
|
| 98 |
+
|
| 99 |
+
# Get ACR credentials for pulling the image
|
| 100 |
+
ACR_LOGIN_SERVER=$(az acr show --name "$ACR_NAME" --query loginServer --output tsv)
|
| 101 |
+
ACR_USERNAME=$(az acr credential show --name "$ACR_NAME" --query username --output tsv)
|
| 102 |
+
ACR_PASSWORD=$(az acr credential show --name "$ACR_NAME" --query "passwords[0].value" --output tsv)
|
| 103 |
+
|
| 104 |
+
az containerapp create \
|
| 105 |
+
--name "$APP_NAME" \
|
| 106 |
+
--resource-group "$RESOURCE_GROUP" \
|
| 107 |
+
--environment "$ENVIRONMENT" \
|
| 108 |
+
--image "${ACR_LOGIN_SERVER}/${APP_NAME}:${IMAGE_TAG}" \
|
| 109 |
+
--registry-server "$ACR_LOGIN_SERVER" \
|
| 110 |
+
--registry-username "$ACR_USERNAME" \
|
| 111 |
+
--registry-password "$ACR_PASSWORD" \
|
| 112 |
+
--target-port 8000 \
|
| 113 |
+
--ingress external \
|
| 114 |
+
--min-replicas 0 \
|
| 115 |
+
--max-replicas 3 \
|
| 116 |
+
--cpu 1.0 \
|
| 117 |
+
--memory 2.0Gi \
|
| 118 |
+
--env-vars \
|
| 119 |
+
PINECONE_API_KEY=secretref:pinecone-key \
|
| 120 |
+
GROQ_API_KEY=secretref:groq-key \
|
| 121 |
+
PINECONE_INDEX=llmops-rag \
|
| 122 |
+
LLM_BACKEND=groq \
|
| 123 |
+
EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2 \
|
| 124 |
+
--secrets \
|
| 125 |
+
pinecone-key="$PINECONE_API_KEY" \
|
| 126 |
+
groq-key="$GROQ_API_KEY" \
|
| 127 |
+
--output none
|
| 128 |
+
|
| 129 |
+
echo " ✓ Container app deployed"
|
| 130 |
+
|
| 131 |
+
# ── Print live URL ────────────────────────────────────────────────────────────
|
| 132 |
+
LIVE_URL=$(az containerapp show \
|
| 133 |
+
--name "$APP_NAME" \
|
| 134 |
+
--resource-group "$RESOURCE_GROUP" \
|
| 135 |
+
--query "properties.configuration.ingress.fqdn" \
|
| 136 |
+
--output tsv)
|
| 137 |
+
|
| 138 |
+
echo ""
|
| 139 |
+
echo "──────────────────────────────────────────────────"
|
| 140 |
+
echo " Deployment complete!"
|
| 141 |
+
echo ""
|
| 142 |
+
echo " Live URL: https://${LIVE_URL}"
|
| 143 |
+
echo " Health: https://${LIVE_URL}/health"
|
| 144 |
+
echo " API docs: https://${LIVE_URL}/docs"
|
| 145 |
+
echo "──────────────────────────────────────────────────"
|
| 146 |
+
echo ""
|
| 147 |
+
echo " To tear down everything and stop billing:"
|
| 148 |
+
echo " az group delete --name $RESOURCE_GROUP --yes --no-wait"
|
| 149 |
+
echo ""
|
requirements.txt
CHANGED
|
@@ -1,23 +1,26 @@
|
|
| 1 |
-
# Core serving
|
| 2 |
fastapi==0.115.0
|
| 3 |
uvicorn[standard]==0.30.6
|
| 4 |
pydantic==2.7.4
|
| 5 |
|
| 6 |
-
# LLM + fine-tuned model
|
|
| 7 |
torch==2.3.1
|
| 8 |
transformers==4.51.3
|
| 9 |
peft==0.15.2
|
| 10 |
-
bitsandbytes==0.43.3
|
| 11 |
accelerate==1.6.0
|
| 12 |
|
| 13 |
-
# RAG
|
| 14 |
langchain==1.2.13
|
| 15 |
langchain-community==0.4.1
|
| 16 |
-
langchain-classic==1.0.3
|
| 17 |
langchain-groq==1.1.2
|
| 18 |
groq==0.37.1
|
| 19 |
-
pinecone-client==3.2.2
|
| 20 |
sentence-transformers==4.1.0
|
| 21 |
|
| 22 |
-
# Utilities
|
| 23 |
-
python-dotenv==1.0.1
|
| 1 |
+
# ── Core serving ──────────────────────────────────────────────────────────────
|
| 2 |
fastapi==0.115.0
|
| 3 |
uvicorn[standard]==0.30.6
|
| 4 |
pydantic==2.7.4
|
| 5 |
|
| 6 |
+
# ── LLM + fine-tuned model (local backend) ────────────────────────────────────
|
| 7 |
+
# Only needed when LLM_BACKEND=local
|
| 8 |
+
# GPU with 6GB+ VRAM recommended; loads on CPU, but inference is very slow
|
| 9 |
torch==2.3.1
|
| 10 |
transformers==4.51.3
|
| 11 |
peft==0.15.2
|
| 12 |
+
bitsandbytes==0.43.3  # 4-bit NF4 quantization (official Windows wheels since 0.43)
|
| 13 |
accelerate==1.6.0
|
| 14 |
|
| 15 |
+
# ── RAG ───────────────────────────────────────────────────────────────────────
|
| 16 |
langchain==1.2.13
|
| 17 |
langchain-community==0.4.1
|
| 18 |
+
langchain-classic==1.0.3 # provides langchain_classic.chains.RetrievalQA
|
| 19 |
langchain-groq==1.1.2
|
| 20 |
groq==0.37.1
|
| 21 |
+
pinecone-client==3.2.2 # langchain-community expects the v3 Pinecone client API
|
| 22 |
sentence-transformers==4.1.0
|
| 23 |
|
| 24 |
+
# ── Utilities ─────────────────────────────────────────────────────────────────
|
| 25 |
+
python-dotenv==1.0.1
|
| 26 |
+
requests>=2.31.0
|