MukulRay committed · Commit 3fbe97b · 1 Parent(s): 3898be2

docs: overhaul README, add DEPLOYMENT.md, deploy_azure.sh, .env.example

Files changed (5)
  1. .gitignore +0 -0
  2. DEPLOYMENT.md +208 -0
  3. README.md +310 -90
  4. deploy_azure.sh +149 -0
  5. requirements.txt +11 -8
.gitignore CHANGED
Binary files a/.gitignore and b/.gitignore differ
 
DEPLOYMENT.md ADDED
@@ -0,0 +1,208 @@
+ # Deployment Guide
+
+ This document covers all deployment options for Irminsul, the cost tradeoffs between them, and the architectural decisions behind the live demo setup.
+
+ ---
+
+ ## Deployment Options
+
+ Irminsul supports two LLM backends and multiple hosting targets. Choose based on your infrastructure and budget.
+
+ | Backend | Where to Run | GPU Required | Cost |
+ |---|---|---|---|
+ | **Groq** (recommended) | Anywhere (no GPU) | No | Free tier available |
+ | **Local Llama** (fine-tuned model) | Local machine / GPU VM | Yes (6GB+ VRAM) | Hardware cost / ~$0.50–1.50/hr on Azure |
+
+ ---
+
+ ## Live Demo: HuggingFace Spaces + Groq
+
+ **Why this is the live demo environment:**
+
+ The fine-tuned Llama 3.1 8B model is 16GB on disk and requires a GPU-enabled instance to serve at acceptable latency. On Azure, the minimum viable GPU instance for this model is the **NC4as T4 v3** (~$0.50/hr, ~$360/month). Running this persistently for a portfolio project is not cost-effective.
+
+ The live demo instead uses:
+ - **HuggingFace Spaces**: free CPU hosting for the FastAPI container
+ - **Groq API**: runs `llama-3.3-70b-versatile` on Groq's Language Processing Units (LPUs) at ~300 tokens/second, free under the public tier
+
+ This demonstrates the identical RAG architecture: the LLM backend is swapped via a single environment variable (`LLM_BACKEND=groq`). The retrieval pipeline, guardrails, response format, and API contract are unchanged.
+
+ ```
+ Live demo: https://huggingface.co/spaces/MukulRay/Irminsul
+ ```
+
+ ---
+
+ ## Option A: Local Development
+
+ The full stack, including the fine-tuned model, runs locally on an RTX 3060 6GB:
+
+ ```bash
+ # 1. Clone and install
+ git clone https://github.com/MukulRay1603/Irminsul.git
+ cd Irminsul
+ python -m venv venv && source venv/bin/activate
+ pip install -r requirements.txt
+
+ # 2. Configure
+ cp .env.example .env
+ # Edit .env: set LLM_BACKEND=local (the default is groq), MODEL_PATH, PINECONE_API_KEY
+
+ # 3. Ingest corpus
+ python ingest.py --dir ./docs --chunk-size 300 --chunk-overlap 40
+
+ # 4. Serve
+ uvicorn main:app --host 0.0.0.0 --port 8000
+ ```
+
+ **Memory profile:**
+
+ | Component | VRAM |
+ |---|---|
+ | Llama 3.1 8B @ 4-bit NF4 | ~4.5 GB |
+ | all-MiniLM-L6-v2 embedder | ~90 MB |
+ | Inference headroom | ~1.2 GB |
+ | **Total** | **~5.8 GB** |
+
+ The model loads with `max_memory={0: "5.5GiB", "cpu": "24GiB"}`, so layers that don't fit on the GPU overflow to RAM automatically via `accelerate`.
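The memory profile can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the component sizes come from the table, the 6 GB card size from the RTX 3060 named in this section, and the helper `vram_budget` is a hypothetical name, not part of the repository.

```python
# Hypothetical helper (not part of the Irminsul codebase): sums the VRAM
# estimates from the memory-profile table and compares them to the card size.

COMPONENTS_GB = {
    "Llama 3.1 8B @ 4-bit NF4": 4.5,
    "all-MiniLM-L6-v2 embedder": 0.09,  # ~90 MB
    "Inference headroom": 1.2,
}


def vram_budget(components: dict[str, float], card_gb: float) -> tuple[float, float]:
    """Return (total_gb, spare_gb) for the given component estimates."""
    total = sum(components.values())
    return total, card_gb - total


total_gb, spare_gb = vram_budget(COMPONENTS_GB, card_gb=6.0)
print(f"total ~{total_gb:.1f} GB, spare ~{spare_gb:.2f} GB on a 6 GB card")
```

With only about 0.2 GB to spare on these estimates, the `max_memory` cap and the CPU overflow are presumably what keep the load from failing when the real footprint runs slightly higher.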
+
+ ---
+
+ ## Option B: Docker (Local or Any Cloud)
+
+ The Dockerfile is intentionally slim: the model is **not baked in**. It's injected at runtime via `MODEL_PATH`.
+
+ ```bash
+ # Build
+ docker build -t irminsul:latest .
+
+ # Run with Groq backend (no GPU needed)
+ docker run -p 8000:8000 \
+   -e PINECONE_API_KEY=your_key \
+   -e GROQ_API_KEY=your_key \
+   -e PINECONE_INDEX=llmops-rag \
+   -e LLM_BACKEND=groq \
+   irminsul:latest
+
+ # Run with local model (GPU required)
+ docker run -p 8000:8000 \
+   --gpus all \
+   -v /path/to/models:/app/models \
+   -e PINECONE_API_KEY=your_key \
+   -e MODEL_PATH=/app/models/merged/exp2_lr2e-4_r16 \
+   -e LLM_BACKEND=local \
+   irminsul:latest
+ ```
+
+ ---
+
+ ## Option C: Azure Container Apps
+
+ Azure Container Apps (ACA) is the production deployment target. The `deploy_azure.sh` script provisions the full stack in one command.
+
+ ### Prerequisites
+
+ ```bash
+ # Install Azure CLI
+ # macOS:
+ brew install azure-cli
+
+ # Linux:
+ curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
+
+ # Windows: https://aka.ms/installazurecliwindows
+
+ # Log in
+ az login
+ az account show # confirm your subscription
+ ```
+
+ ### One-shot deploy
+
+ ```bash
+ export PINECONE_API_KEY=your_pinecone_key
+ export GROQ_API_KEY=your_groq_key
+ chmod +x deploy_azure.sh
+ ./deploy_azure.sh
+ ```
+
+ The script:
+ 1. Creates resource group `irminsul-rg` in East US
+ 2. Creates Azure Container Registry `irminsulacr`
+ 3. Builds the Docker image via **ACR Tasks**: the source code is uploaded and built in Azure's cloud, so no local Docker daemon is needed
+ 4. Creates a Container Apps environment
+ 5. Deploys the app with secrets injected as environment variables
+ 6. Outputs the live HTTPS URL
+
+ ### Tearing down
+
+ ```bash
+ # Delete everything in the resource group, which stops all billing for it
+ az group delete --name irminsul-rg --yes --no-wait
+ ```
+
+ ### Cost breakdown (Groq backend, no GPU)
+
+ | Resource | SKU | Cost |
+ |---|---|---|
+ | Container Apps | Consumption plan | Free (180k vCPU-s/month) |
+ | ACR | Basic | ~$5/month |
+ | Outbound bandwidth | First 100GB | Free |
+ | **Total** | | **~$5/month** |
+
+ On Azure for Students ($100 credit), ~$5/month works out to about 20 months of runtime, assuming the credit does not expire first.
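The "Free" entry for Container Apps holds only while usage stays inside the monthly grant. As a rough check, the sketch below assumes a 1.0 vCPU replica (the `--cpu 1.0` allocation this guide attributes to `deploy_azure.sh`) and counts vCPU-seconds only; the function names are illustrative.

```python
# Illustrative arithmetic only; the figures are the ones quoted in the table.

FREE_VCPU_SECONDS_PER_MONTH = 180_000  # Container Apps consumption grant
ACR_BASIC_PER_MONTH = 5.0              # USD, approximate
STUDENT_CREDIT = 100.0                 # USD


def free_active_hours(vcpus: float) -> float:
    """Hours per month a replica of the given size can be active for free."""
    return FREE_VCPU_SECONDS_PER_MONTH / vcpus / 3600


def months_of_credit(credit: float, monthly_cost: float) -> float:
    """How many months the credit lasts at a flat monthly cost."""
    return credit / monthly_cost


print(free_active_hours(1.0))                                  # 50.0
print(months_of_credit(STUDENT_CREDIT, ACR_BASIC_PER_MONTH))   # 20.0
```

On these assumptions the grant covers about 50 active hours per month, so the ~$5/month total likely depends on light demo traffic and on the app scaling down when idle.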
+
+ ### Why not GPU on Azure?
+
+ To serve the fine-tuned Llama model in production, a GPU instance is required:
+
+ | Instance | GPU | VRAM | Cost |
+ |---|---|---|---|
+ | NC4as T4 v3 | Tesla T4 | 16 GB | ~$0.50/hr = **~$360/month** |
+ | NC6s v3 | Tesla V100 | 16 GB | ~$0.90/hr = **~$648/month** |
+
+ At these prices, a portfolio project running 24/7 would exhaust the $100 student credit in roughly 5 to 8 days. The Groq backend delivers the same RAG functionality at zero marginal cost, making it the right engineering tradeoff.
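The monthly figures in the table follow from a 720-hour month. A small sketch of that arithmetic, using the approximate hourly rates quoted above (the function names are illustrative):

```python
# Illustrative arithmetic; hourly rates are the approximate figures from the table.

HOURS_PER_MONTH = 24 * 30


def monthly_cost(hourly_rate: float) -> float:
    """Cost of running one instance around the clock for a 30-day month."""
    return hourly_rate * HOURS_PER_MONTH


def days_until_exhausted(credit: float, hourly_rate: float) -> float:
    """Days of 24/7 runtime a fixed credit buys at the given hourly rate."""
    return credit / (hourly_rate * 24)


for name, rate in {"NC4as T4 v3": 0.50, "NC6s v3": 0.90}.items():
    print(f"{name}: ${monthly_cost(rate):.0f}/month, "
          f"$100 lasts {days_until_exhausted(100, rate):.1f} days")
```

At these rates the T4 instance would use up $100 in a little over 8 days and the V100 instance in under 5.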
+
+ ### Serving the fine-tuned model on Azure (production path)
+
+ If cost were not a constraint, the intended architecture would be:
+
+ 1. **Upload model to Azure Blob Storage** (~$0.02/GB/month for 16GB = ~$0.32/month)
+ 2. **Mount as a volume** in Container Apps, so the container sees it at `/app/models/`
+ 3. **Switch to GPU SKU**: replace `--cpu 1.0 --memory 2.0Gi` in `deploy_azure.sh` with a GPU-enabled workload profile
+ 4. **Set `LLM_BACKEND=local`** in env vars
+
+ The Docker image and application code require zero changes for this path. The abstraction was designed for it.
+
+ ---
+
+ ## Environment Variables Reference
+
+ | Variable | Required | Default | Description |
+ |---|---|---|---|
+ | `PINECONE_API_KEY` | Yes | (none) | Pinecone serverless API key |
+ | `PINECONE_INDEX` | No | `llmops-rag` | Pinecone index name |
+ | `LLM_BACKEND` | No | `groq` | `groq` or `local` |
+ | `GROQ_API_KEY` | If Groq | (none) | Groq API key |
+ | `GROQ_MODEL` | No | `llama-3.3-70b-versatile` | Groq model name |
+ | `MODEL_PATH` | If local | `./models/merged/exp2_lr2e-4_r16` | Path to merged model |
+ | `EMBED_MODEL` | No | `sentence-transformers/all-MiniLM-L6-v2` | Embedding model |
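The table can be read as a small validation routine: apply defaults, then require the keys the chosen backend needs. The sketch below is one way to express that. The variable names and defaults come from the table, but `load_settings` is a hypothetical helper and is not claimed to match how `main.py` or `rag.py` actually read their configuration.

```python
# A sketch, not the project's actual loader: names and defaults are taken
# from the table above; load_settings itself is a hypothetical helper.
import os
from collections.abc import Mapping

DEFAULTS = {
    "PINECONE_INDEX": "llmops-rag",
    "LLM_BACKEND": "groq",
    "GROQ_MODEL": "llama-3.3-70b-versatile",
    "MODEL_PATH": "./models/merged/exp2_lr2e-4_r16",
    "EMBED_MODEL": "sentence-transformers/all-MiniLM-L6-v2",
}


def load_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Apply defaults, then check the variables each backend requires."""
    settings = {**DEFAULTS, **{k: v for k, v in env.items() if k in DEFAULTS}}
    backend = settings["LLM_BACKEND"]
    if backend not in ("groq", "local"):
        raise ValueError(f"LLM_BACKEND must be 'groq' or 'local', got {backend!r}")

    # PINECONE_API_KEY is always required; GROQ_API_KEY only for the Groq backend.
    required = ["PINECONE_API_KEY"]
    if backend == "groq":
        required.append("GROQ_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError(f"missing required variables: {', '.join(missing)}")

    for name in required:
        settings[name] = env[name]
    return settings
```

Passing a plain dict instead of `os.environ` makes the check easy to exercise, for example `load_settings({"PINECONE_API_KEY": "pk", "LLM_BACKEND": "local"})`.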
+
+ ---
+
+ ## CI/CD (Planned)
+
+ The intended CI/CD pipeline:
+
+ ```
+ git push main
+      │
+      ▼
+ GitHub Actions
+   ├── Run tests
+   ├── Build Docker image
+   ├── Push to ACR
+   └── az containerapp update --image new-tag
+ ```
+
+ This would give zero-downtime rolling deploys on every push to main. Currently, re-running `deploy_azure.sh` achieves the same result with a cold start.
README.md CHANGED
@@ -6,95 +6,235 @@ sdk: docker
  pinned: false
  ---

  # Irminsul

- > Fine-tuned Llama 3.1 8B · QLoRA · Pinecone RAG · FastAPI · Azure Container Apps

- A full end-to-end LLMOps serving stack, from a QLoRA fine-tuned model running in 4-bit NF4 on consumer hardware, through a retrieval-augmented generation pipeline, to a containerized API deployed on Azure. Built to be production-shaped, not just a demo.

- **[→ Live Demo](https://mukulray1603.github.io/Irminsul/demo.html)**

  ---

- ## About Irminsul

- Most LLM projects stop at inference. This one goes further:

- - **Fine-tuned model**: Llama 3.1 8B fine-tuned with QLoRA (rank 16, lr 2e-4) on a custom dataset, merged and served locally in 4-bit NF4 quantization on an RTX 3060 6GB
- - **RAG pipeline**: Documents ingested, chunked, embedded with `sentence-transformers/all-MiniLM-L6-v2` (fully local, zero API cost), and stored in Pinecone. Retrieval is semantic, top-k configurable at query time
- - **Serving layer**: FastAPI with async lifespan model loading, typed Pydantic request/response models, CORS, health check, and a clean browser UI served from the same process
- - **Containerized**: Dockerfile built for slim Python 3.12, model loaded at runtime via env-configurable path (not baked in)
- - **Cloud-ready**: One-shot Azure deployment via ACR + Container Apps, with Pinecone key injected as a secret
- - **Domain knowledge**: RAG corpus built around Genshin Impact lore, character builds, and elemental mechanics, serving as a rich real-world knowledge base for retrieval evaluation

  ---

  ## Architecture

  ```
- User query
-     │
-     ▼
- FastAPI /generate
-     │
-     ├── Embed query (sentence-transformers, local)
-     │         │
-     │         ▼
-     │   Pinecone: semantic search → top-k chunks
-     │         │
-     ▼         ▼
- LangChain RetrievalQA
-     │
-     ▼
- Llama 3.1 8B (QLoRA fine-tuned, 4-bit NF4)
-     │
-     ▼
- Grounded answer + source attribution
  ```

  ---

  ## Stack

  | Layer | Technology |
  |---|---|
  | Base model | Llama 3.1 8B Instruct |
  | Fine-tuning | QLoRA via PEFT (r=16, α=32, lr=2e-4) |
  | Quantization | BitsAndBytes 4-bit NF4, bfloat16 compute |
  | Embeddings | sentence-transformers/all-MiniLM-L6-v2 |
- | Vector DB | Pinecone (serverless, cosine similarity) |
  | RAG chain | LangChain RetrievalQA |
  | Serving | FastAPI + Uvicorn |
  | Containerization | Docker (python:3.12-slim) |
- | Cloud | Azure Container Apps + ACR |

  ---

  ## Quickstart

  ```bash
- # 1. Clone and set up environment
  git clone https://github.com/MukulRay1603/Irminsul.git
  cd Irminsul
- python -m venv venv && source venv/bin/activate # Windows: venv\Scripts\activate
  pip install -r requirements.txt

- # 2. Configure environment
  cp .env.example .env
- # Fill in PINECONE_API_KEY in .env

- # 3. Add your fine-tuned model
- # Place merged model at: ./models/merged/exp2_lr2e-4_r16
- # Or update MODEL_PATH in .env to point to your model
-
- # 4. Ingest documents
  python ingest.py --dir ./docs --chunk-size 300 --chunk-overlap 40

- # 5. Start the server
  uvicorn main:app --reload --port 8000

- # UI available at http://localhost:8000
- # API docs at http://localhost:8000/docs
  ```

  ---
@@ -104,81 +244,161 @@ uvicorn main:app --reload --port 8000
  | Method | Endpoint | Description |
  |---|---|---|
  | `GET` | `/` | Browser UI |
- | `GET` | `/health` | Model load status |
- | `POST` | `/generate` | RAG query → grounded answer |
- | `POST` | `/ingest` | Ingest docs from local directory |
-
- **Example:**

- ```bash
- curl -X POST http://localhost:8000/generate \
-   -H "Content-Type: application/json" \
-   -d '{"query": "What weapons should Hu Tao use on a budget?", "top_k": 3}'
  ```

  ```json
  {
    "answer": "For Hu Tao on a budget, Dragon's Bane is the strongest F2P option - it scales with Elemental Mastery and deals significant bonus damage on vaporized hits. White Tassel is the best 3-star alternative for pure Normal Attack scaling.",
-   "sources": ["docs/character_builds.md"],
-   "latency_ms": 4821.3
  }
  ```

  ---

- ## Memory profile (RTX 3060 6GB)

- | Component | VRAM |
- |---|---|
- | Llama 3.1 8B @ 4-bit NF4 | ~4.5 GB |
- | all-MiniLM-L6-v2 embedder | ~90 MB |
- | Inference headroom | ~1.2 GB |

- Running the embedder on CPU frees ~90MB if needed: set `device_map="cpu"` in `rag.py`.

  ---

- ## Deploy to Azure

- ```bash
- export PINECONE_API_KEY=your_key
- chmod +x deploy_azure.sh
- ./deploy_azure.sh
  ```

- The script provisions a resource group, builds and pushes the image via ACR Tasks (no local Docker build needed), creates a Container Apps environment, and deploys with the Pinecone key injected as a secret. Prints the live HTTPS endpoint on completion.

- **Model in Azure:** The merged model (~16GB) isn't baked into the image. Recommended approach: mount from Azure Blob Storage as a volume for cheapest cold start on student credits.

  ---

- ## Project structure

- ```
- Irminsul/
- ├── main.py            # FastAPI app: endpoints, lifespan, CORS
- ├── rag.py             # Model loading, 4-bit config, LangChain RAG chain
- ├── embedder.py        # sentence-transformers singleton wrapper
- ├── ingest.py          # Doc loader → chunker → Pinecone upsert
- ├── index.html         # Browser UI (dark theme, query history, source display)
- ├── Dockerfile
- ├── deploy_azure.sh    # One-shot Azure Container Apps deploy
- ├── requirements.txt
- ├── .env.example
- └── docs/              # Corpus + GitHub Pages demo
-     └── demo.html
- ```

  ---

- ## What's next

- - [ ] Swap naive word chunker for `MarkdownHeaderTextSplitter` for better retrieval precision
- - [ ] Add metadata filtering to Pinecone queries (filter by character, content type)
- - [ ] Streaming response via SSE for lower perceived latency
- - [ ] Expand corpus with per-character deep dives covering stat thresholds and rotation guides
- - [ ] CI/CD pipeline: GitHub Actions → ACR build → Container Apps deploy on push

  ---

- Built while learning the full MLOps lifecycle (fine-tuning, quantization, retrieval, serving, and cloud deployment) on consumer hardware. Every component chosen deliberately, not for hype.
  pinned: false
  ---

+ <div align="center">
+
+ <img src="docs/assets/banner.png" alt="Irminsul Banner" width="100%">
+ <!-- PLACEHOLDER: Add a banner image. Can be a dark-themed graphic with the Irminsul logo/name.
+ Recommended: 1280x320px, dark green/forest aesthetic matching the UI.
+ Tools: Figma, Canva, or even a screenshot of the UI works. -->
+
  # Irminsul

+ **A production-shaped LLMOps stack: QLoRA fine-tuning on Colab, RAG pipeline, containerized serving, and cloud deployment.**
+
+ [![Live Demo](https://img.shields.io/badge/Live_Demo-HuggingFace_Spaces-FFD21E?style=flat&logo=huggingface)](https://huggingface.co/spaces/MukulRay/Irminsul)
+ [![GitHub](https://img.shields.io/badge/GitHub-MukulRay1603-181717?style=flat&logo=github)](https://github.com/MukulRay1603/Irminsul)
+ [![Corpus Pipeline](https://img.shields.io/badge/Corpus_Pipeline-irminsul--corpus-2ea44f?style=flat&logo=github)](https://github.com/MukulRay1603/irminsul-corpus)
+ [![License](https://img.shields.io/badge/License-MIT-green?style=flat)](LICENSE)
+
+ </div>
+
+ ---

+ Most LLM projects stop at inference. This one builds the full stack: a QLoRA fine-tuned Llama 3.1 8B served through a RAG pipeline, with guardrails, a domain-specific knowledge base, and a containerized FastAPI server designed for cloud deployment.

+ **[→ Try the live demo](https://huggingface.co/spaces/MukulRay/Irminsul)**

  ---

+ ## What This Is

+ Irminsul is a domain-specific AI assistant for Genshin Impact, built not because Genshin needed an AI assistant, but because it provided a concrete, evaluable knowledge domain to build an LLMOps pipeline around. Every component was chosen deliberately:

+ - A knowledge domain rich enough to evaluate retrieval quality (characters, mechanics, lore)
+ - Ground truth data available (KQM Theorycrafting Library, game stat APIs) to measure hallucination
+ - Community signal data (patch notes, meta shifts) to test corpus freshness
+
+ The domain is the test harness. The pipeline is the project.

  ---

  ## Architecture

  ```
+ ┌─────────────────────────────────────────────────────────────────┐
+ │                           User Query                            │
+ └─────────────────────┬───────────────────────────────────────────┘
+                       │
+                       ▼
+ ┌─────────────────────────────────────────────────────────────────┐
+ │                        Guardrails Layer                         │
+ │  • Injection detection (pattern matching)                       │
+ │  • Domain validation (cosine similarity vs anchor embeddings)   │
+ │  • Output sanitization                                          │
+ └─────────────────────┬───────────────────────────────────────────┘
+                       │
+                       ▼
+ ┌─────────────────────────────────────────────────────────────────┐
+ │                        FastAPI /generate                        │
+ └──────────┬──────────────────────────────────────────────────────┘
+            │
+            ├─── Embed query (sentence-transformers, local, CPU)
+            │         │
+            │         ▼
+            │    Pinecone ── semantic search ──► top-k chunks
+            │         │
+            ▼         ▼
+     LangChain RetrievalQA (stuff chain)
+            │
+            ▼
+ ┌────────────────────────────────────┐
+ │            LLM Backend             │
+ │                                    │
+ │  Groq (live demo)                  │
+ │  llama-3.3-70b-versatile           │
+ │  ~300 tok/s, free tier             │
+ │  ──── OR ────                      │
+ │  Local (fine-tuned)                │
+ │  Llama 3.1 8B QLoRA                │
+ │  4-bit NF4, RTX 3060 6GB           │
+ │  (inference only, trained on       │
+ │  Colab A100)                       │
+ └────────────────────────────────────┘
+            │
+            ▼
+ Grounded answer + source attribution
  ```

  ---

+ ## Components
+
+ ### Fine-Tuned Model
+
+ Llama 3.1 8B Instruct, fine-tuned with QLoRA on a custom instruction dataset on Google Colab Pro (A100). Local inference runs in 4-bit NF4 quantization on an RTX 3060 6GB.
+
+ **[→ View training notebook on Colab](https://colab.research.google.com/drive/YOUR_NOTEBOOK_LINK_HERE)**
+ <!-- PLACEHOLDER: Replace YOUR_NOTEBOOK_LINK_HERE with your actual Colab share link
+ File → Share → Copy link (set to "Anyone with the link can view") -->
+
+ | Parameter | Value |
+ |---|---|
+ | Base model | `meta-llama/Llama-3.1-8B-Instruct` |
+ | Method | QLoRA via PEFT |
+ | Rank / Alpha | r=16, α=32 |
+ | Learning rate | 2e-4 |
+ | Quantization (inference) | 4-bit NF4, bfloat16 compute |
+ | Training infra | Google Colab Pro (A100) |
+ | Experiment tracking | MLflow (3 runs) |
+
+ Three experiments were tracked in MLflow. The winning checkpoint was selected by faithfulness score (0.826) and ROUGE-L (0.466) on a held-out eval set.
+
+ <!-- PLACEHOLDER: Add MLflow experiment screenshot here
+ docs/assets/mlflow_experiments.png
+ A screenshot of your MLflow UI showing the 3 runs and metrics comparison -->
+
+ ### RAG Pipeline
+
+ Documents are chunked, embedded locally with `sentence-transformers/all-MiniLM-L6-v2` (384-dim, zero API cost), and stored in Pinecone serverless. Retrieval is semantic, top-k configurable per query.
+
+ | Component | Choice | Reason |
+ |---|---|---|
+ | Embedder | all-MiniLM-L6-v2 | Runs locally, strong semantic retrieval, 384-dim fits free Pinecone tier |
+ | Vector DB | Pinecone serverless | Zero ops, cosine similarity, free tier sufficient for corpus size |
+ | Chunking | Word-level, 300 words, 40-word overlap | Preserves semantic units across chunk boundaries |
+ | Chain | LangChain RetrievalQA (stuff) | Simple, inspectable, returns source documents |
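The chunking row (300 words with a 40-word overlap) can be illustrated with a short sliding-window sketch. This is an assumption about the general technique, not a copy of `ingest.py`; the function name `chunk_words` is hypothetical.

```python
# A sketch of word-level chunking with overlap; chunk_words is a hypothetical
# name and may differ from what ingest.py actually does.

def chunk_words(text: str, chunk_size: int = 300, overlap: int = 40) -> list[str]:
    """Split text into chunks of `chunk_size` words, each sharing `overlap`
    words with the previous chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


doc = " ".join(f"w{i}" for i in range(700))
chunks = chunk_words(doc)
print([len(c.split()) for c in chunks])  # [300, 300, 180]
```

Each window starts 260 words after the previous one, so the last 40 words of one chunk reappear as the first 40 of the next; a sentence that straddles a boundary stays retrievable from at least one chunk.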
+
+ ### Knowledge Corpus
+
+ The corpus is maintained in a [separate repository](https://github.com/MukulRay1603/irminsul-corpus) with an autonomous update pipeline. It ingests from three tiers of sources with different trust levels:
+
+ | Tier | Source | Files | Trust |
+ |---|---|---|---|
+ | 1: Ground Truth | KQM Theorycrafting Library (peer-reviewed mechanics) | ~305 | Highest: cite in builds |
+ | 1: Ground Truth | genshin-db API (exact character/weapon/artifact stats) | ~406 | Highest: exact game data |
+ | 2: Expert Synthesis | Gemini-authored prose grounded in Tier 1 | ~83 | High: no hallucinated stats |
+ | 3: Community Signal | Official patch notes, banner history, event calendar | ~80 | Medium: tagged explicitly |
+
+ A GitHub Actions workflow runs every Sunday at 2am UTC, pulls fresh data, commits the docs, and re-ingests ~4,000 vectors to Pinecone automatically.
+
+ ### Guardrails
+
+ Two layers of input validation run before any LLM call:
+
+ 1. **Injection detection**: pattern matching against known jailbreak phrases (`ignore previous instructions`, `act as`, `DAN mode`, etc.)
+ 2. **Domain validation**: cosine similarity between the query embedding and a set of Genshin-domain anchor sentences. Queries scoring below threshold (0.35) are rejected with a domain-scoped error message before touching the LLM.
+
+ Output is sanitized to strip generation artifacts (`</s>` tokens, trailing whitespace) and length-checked.
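The two input checks and the output cleanup can be sketched in plain Python. Everything below is an illustrative assumption rather than the contents of `guardrails.py`: the regex patterns, the choice to score a query by its best-matching anchor, and the toy vectors standing in for real sentence embeddings.

```python
# A minimal sketch of the guardrail checks; patterns, function names and the
# "best anchor" scoring rule are assumptions, not the project's actual code.
import math
import re

INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.IGNORECASE),
    re.compile(r"\bact\s+as\b", re.IGNORECASE),
    re.compile(r"\bDAN\s+mode\b", re.IGNORECASE),
]
DOMAIN_THRESHOLD = 0.35  # threshold quoted in the README


def looks_like_injection(query: str) -> bool:
    """True if any known jailbreak phrase appears in the query."""
    return any(p.search(query) for p in INJECTION_PATTERNS)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def in_domain(query_vec, anchor_vecs, threshold=DOMAIN_THRESHOLD) -> bool:
    """Accept the query if its best match among the anchors clears the threshold."""
    return max(cosine(query_vec, a) for a in anchor_vecs) >= threshold


def sanitize(text: str) -> str:
    """Strip end-of-sequence tokens and surrounding whitespace."""
    return text.replace("</s>", "").strip()
```

Phrase lists like this tend to produce false positives: a legitimate question such as "Does Bennett act as a healer?" would match the `act as` pattern, which is one reason to pair pattern matching with the embedding-based domain check.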
+
+ ### Serving Layer
+
+ FastAPI with:
+ - Async lifespan model loading (model loads once at startup, not per request)
+ - Typed Pydantic request/response models with `blocked` flag for guardrail rejections
+ - CORS enabled for cross-origin UI
+ - `/health` endpoint reporting model load status
+ - Browser UI served from the same process (no separate frontend server)
+
+ ---
+
  ## Stack

  | Layer | Technology |
  |---|---|
  | Base model | Llama 3.1 8B Instruct |
  | Fine-tuning | QLoRA via PEFT (r=16, α=32, lr=2e-4) |
+ | Experiment tracking | MLflow |
  | Quantization | BitsAndBytes 4-bit NF4, bfloat16 compute |
  | Embeddings | sentence-transformers/all-MiniLM-L6-v2 |
+ | Vector DB | Pinecone serverless (cosine, 384-dim) |
  | RAG chain | LangChain RetrievalQA |
  | Serving | FastAPI + Uvicorn |
  | Containerization | Docker (python:3.12-slim) |
+ | Live demo hosting | HuggingFace Spaces (CPU Basic) |
+ | Production deployment | Azure Container Apps + ACR |
+ | LLM backend (demo) | Groq API (llama-3.3-70b-versatile) |
+ | Corpus pipeline | GitHub Actions (weekly, autonomous) |

  ---

  ## Quickstart

+ ### Option 1: Groq backend (no GPU required)
+
  ```bash
+ # 1. Clone
  git clone https://github.com/MukulRay1603/Irminsul.git
  cd Irminsul
+
+ # 2. Install
+ python -m venv venv && source venv/bin/activate
+ # Windows: venv\Scripts\activate
  pip install -r requirements.txt

+ # 3. Configure
  cp .env.example .env
+ # Set PINECONE_API_KEY and GROQ_API_KEY in .env
+ # LLM_BACKEND=groq is the default

+ # 4. Ingest corpus (or use pre-ingested Pinecone index)
  python ingest.py --dir ./docs --chunk-size 300 --chunk-overlap 40

+ # 5. Run
+ uvicorn main:app --reload --port 8000
+ # UI: http://localhost:8000
+ # API docs: http://localhost:8000/docs
+ ```
+
+ ### Option 2: Local fine-tuned model (GPU required for inference, 6GB+ VRAM)
+
+ ```bash
+ # Same steps 1–3, then:
+ # Set LLM_BACKEND=local and MODEL_PATH in .env
+
+ # 4. Download model
+ # Place the merged QLoRA model at: ./models/merged/exp2_lr2e-4_r16/
+ # (Or update MODEL_PATH in .env)
+
+ # 5. Run
  uvicorn main:app --reload --port 8000
+ ```
+
+ ### Docker

+ ```bash
+ # Groq backend (no GPU)
+ docker build -t irminsul:latest .
+ docker run -p 8000:8000 \
+   -e PINECONE_API_KEY=your_key \
+   -e GROQ_API_KEY=your_key \
+   -e LLM_BACKEND=groq \
+   irminsul:latest
  ```

  ---
 
  | Method | Endpoint | Description |
  |---|---|---|
  | `GET` | `/` | Browser UI |
+ | `GET` | `/health` | Model load status + ready flag |
+ | `POST` | `/generate` | RAG query → grounded answer + sources |
+ | `POST` | `/ingest` | Ingest docs from a local directory path |

+ **Request:**
+ ```json
+ {
+   "query": "What weapons should Hu Tao use on a budget?",
+   "top_k": 3
+ }
  ```

+ **Response:**
  ```json
  {
    "answer": "For Hu Tao on a budget, Dragon's Bane is the strongest F2P option - it scales with Elemental Mastery and deals significant bonus damage on vaporized hits. White Tassel is the best 3-star alternative for pure Normal Attack scaling.",
+   "sources": ["docs/generated/characters/hu_tao.md", "docs/tcl/characters/pyro/hutao.md"],
+   "latency_ms": 1240.5,
+   "blocked": false
  }
  ```

+ If a query is rejected by guardrails, `blocked: true` is returned with the rejection reason in `answer`. No LLM call is made.
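A client can use the `blocked` flag to tell a guardrail rejection apart from a normal answer. The sketch below uses only the standard library; the field names follow the request and response examples above, the base URL assumes the local quickstart server, and both function names are hypothetical.

```python
# Illustrative client sketch; field names follow the request/response
# examples above, and the base URL assumes the local quickstart server.
import json
import urllib.request


def build_generate_request(query: str, top_k: int = 3,
                           base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a POST request for the /generate endpoint."""
    body = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(response: dict) -> str:
    """Turn a /generate response into a one-line summary, honoring `blocked`."""
    if response.get("blocked"):
        return f"rejected by guardrails: {response['answer']}"
    sources = ", ".join(response.get("sources", []))
    return f"{response['answer']} (sources: {sources})"


# To send for real (requires a running server):
#   with urllib.request.urlopen(build_generate_request("best build for Hu Tao")) as resp:
#       print(summarize(json.load(resp)))
```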
+
  ---

+ ## Deployment

+ See **[DEPLOYMENT.md](DEPLOYMENT.md)** for the full guide covering:
+
+ - Local development setup
+ - Docker (local and cloud)
+ - Azure Container Apps (one-shot `deploy_azure.sh`)
+ - Cost breakdown and the reasoning behind the demo setup
+ - GPU serving path for the fine-tuned model

+ **Why the live demo runs on HuggingFace + Groq, not Azure GPU:**
+
+ Serving the fine-tuned Llama 3.1 8B requires a GPU instance. The minimum viable option on Azure (NC4as T4 v3) costs ~$360/month, which is not justified for a portfolio project. The Dockerfile and `deploy_azure.sh` are written for the Azure path; the live demo swaps the LLM backend to Groq via a single environment variable. The RAG pipeline, guardrails, and serving layer are identical.

  ---

+ ## Project Structure

  ```
+ Irminsul/
+ ├── main.py            # FastAPI app: endpoints, lifespan, CORS, response models
+ ├── rag.py             # LangChain RAG chain, dual backend (Groq / local Llama)
+ ├── embedder.py        # sentence-transformers singleton (loads once, reused)
+ ├── ingest.py          # Doc loader → word chunker → Pinecone upsert
+ ├── guardrails.py      # Input validation: injection detection + domain cosine check
+ ├── index.html         # Browser UI: dark Dendro theme, query history, source display
+ │
+ ├── Dockerfile         # python:3.12-slim, model NOT baked in
+ ├── deploy_azure.sh    # One-shot ACR build + Container Apps deploy
+ ├── .env.example       # Environment variable reference
+ │
+ ├── DEPLOYMENT.md      # Full deployment guide + cost analysis
+ ├── requirements.txt
+ ├── images/            # Screenshots and assets used in this README
+ │   ├── banner.png
+ │   ├── ui_main.png
+ │   ├── ui_response.png
+ │   └── mlflow_runs.png
+ └── docs/
+     ├── corpus/        # Legacy manual corpus docs
+     └── demo.html      # GitHub Pages demo page
+ ```
+
+ ---
+
+ ## Evaluation

+ <!-- PLACEHOLDER: Fill this section once you have eval numbers ready.
+ Consider running a small eval set (20-50 questions) with:
+ - Faithfulness: Does the answer contradict the retrieved context?
+ - Answer relevance: Does the answer address the question?
+ - Context recall: Did retrieval find the right documents?
+
+ Tools to consider: RAGAS (pip install ragas) against your Pinecone index.
+
+ Example format:

+ | Metric | Score | Method |
+ |---|---|---|
+ | Faithfulness | 0.826 | Custom eval, n=50 |
+ | ROUGE-L | 0.466 | vs reference answers |
+ | Context recall | TBD | RAGAS |
+ | Answer relevance | TBD | RAGAS |
+
+ The fine-tuned model numbers (0.826 faithfulness, 0.466 ROUGE-L) came from
+ your MLflow eval during training, so pull those into this table.
+ -->
+
+ The fine-tuned model was evaluated during training with a held-out set:
+
+ | Metric | Score |
+ |---|---|
+ | Faithfulness | 0.826 |
+ | ROUGE-L | 0.466 |
+
+ Full RAG pipeline evaluation (context recall, answer relevance) is a planned addition; see [What's Next](#whats-next).
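For readers unfamiliar with ROUGE-L, it scores a candidate answer against a reference by the length of their longest common subsequence (LCS) of tokens. The sketch below is a from-scratch F1 variant over whitespace tokens; the project's evaluation may have used a library with different tokenization or weighting, so it should be read as an illustration of the metric rather than a reproduction of the 0.466 figure.

```python
# A from-scratch sketch of ROUGE-L (F1 over the longest common subsequence of
# whitespace tokens). Illustrative only; not the project's evaluation code.

def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between a candidate answer and a reference answer."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("a b c d", "a c d e"))  # 0.75
```

In the example the LCS is `a c d` (length 3), giving precision and recall of 3/4 each and therefore an F1 of 0.75.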
349
 
350
  ---
351
 
352
+ ## Screenshots
353
 
354
+ <!-- PLACEHOLDER: Add screenshots once you have them.
355
+ Save to images/ and uncomment these lines:
356
+
357
+ ![Irminsul UI](images/ui_main.png)
358
+ ![Response with sources](images/ui_response.png)
359
+ ![MLflow experiment runs](images/mlflow_runs.png)
360
+
361
+ Tips:
362
+ - ui_main.png: screenshot of http://localhost:8000 before any query
363
+ - ui_response.png: run a query (try "best build for Hu Tao") so the answer + sources section is visible
364
+ - mlflow_runs.png: from your Colab β€” the experiment comparison table showing 3 runs
365
+ -->
366
+
367
+ *Screenshots coming soon β€” [try the live demo](https://huggingface.co/spaces/MukulRay/Irminsul) to see it in action.*
368
 
369
  ---
370
 
+ ## What's Next
 
+ - [ ] **RAGAS evaluation**: systematic RAG eval (faithfulness, context recall, answer relevance) on a held-out question set
+ - [ ] **MarkdownHeaderTextSplitter**: replace the naive word chunker with section-aware chunking that respects document structure
+ - [ ] **Metadata filtering**: filter Pinecone queries by character, content tier, or topic category
+ - [ ] **Streaming responses**: SSE for lower perceived latency on long answers
+ - [ ] **CI/CD pipeline**: GitHub Actions → ACR build → `az containerapp update` on push to main
+ - [ ] **Corpus expansion**: constellation effects, rotation guides, and ER/EM thresholds per character
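
As a rough illustration of the section-aware chunking item in the list above, the sketch below splits a Markdown document on its headers using plain Python. It is not LangChain's `MarkdownHeaderTextSplitter` and not the project's current chunker; `split_by_headers` and the chunk dictionary layout are hypothetical names chosen for this example.

```python
import re

# ATX-style headers only ("# Title" through "###### Title").
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headers(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per header-delimited section.

    Each chunk keeps the trail of headers above it, so a retriever can use
    that trail as metadata. Simplification: lines starting with '#' inside
    fenced code blocks are also treated as headers.
    """
    chunks: list[dict] = []
    path: list[str] = []   # current header trail
    body: list[str] = []   # lines collected for the current section

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"headers": list(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]          # drop headers at this depth or deeper
            path.append(match.group(2))
        else:
            body.append(line)
    flush()
    return chunks
```

Keeping the header trail alongside each chunk is one possible way to feed the metadata-filtering item on the same list, since the trail can be stored as vector metadata at ingestion time.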
 
  ---
 
+ ## Related: irminsul-corpus
+
+ The knowledge base is maintained in a companion repository:
+
+ **[MukulRay1603/irminsul-corpus](https://github.com/MukulRay1603/irminsul-corpus)**
+
+ It runs a fully autonomous weekly pipeline: it pulls fresh game data from the KQM Theorycrafting Library and the genshin-db API, synthesizes prose with Gemini 2.5 Flash, commits ~800 documents to the repo, and re-ingests ~4,000 vectors into Pinecone, all without manual intervention.
+
+ ---
+
+ ## License
+
+ MIT. See [LICENSE](LICENSE) for details.
+
+ Genshin Impact is owned by HoYoverse. This project is not affiliated with or endorsed by HoYoverse.
+
+ ---
+
+ <div align="center">
+
+ Built to learn the full MLOps lifecycle (fine-tuning, quantization, retrieval, serving, and cloud deployment) on consumer hardware. Every component chosen deliberately, not for hype.
+
+ </div>
deploy_azure.sh ADDED
@@ -0,0 +1,149 @@
+ #!/bin/bash
+ # ─────────────────────────────────────────────────────────────────────────────
+ # deploy_azure.sh - One-shot Azure Container Apps deployment for Irminsul
+ #
+ # Prerequisites:
+ #   - Azure CLI installed: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli
+ #   - Logged in: az login
+ #   - Subscription active: az account show
+ #
+ # Usage:
+ #   export PINECONE_API_KEY=your_key
+ #   export GROQ_API_KEY=your_key
+ #   chmod +x deploy_azure.sh
+ #   ./deploy_azure.sh
+ #
+ # What this script does:
+ #   1. Creates a resource group in East US
+ #   2. Creates an Azure Container Registry (ACR)
+ #   3. Builds the Docker image via ACR Tasks (no local Docker build needed)
+ #   4. Creates a Container Apps environment
+ #   5. Deploys the container with secrets injected as env vars
+ #   6. Prints the live HTTPS URL
+ #
+ # Cost note:
+ #   This stack (Groq backend, no GPU) runs on a consumption-plan Container App.
+ #   Estimated compute cost: ~$0/month on free tier (180,000 vCPU-seconds/month free).
+ #   The Basic-tier ACR created in step 2 is billed separately from that free grant.
+ #   GPU-accelerated inference (local Llama backend) requires NC-series instances
+ #   (~$0.50-1.50/hr), which is not cost-effective for a portfolio project.
+ #   See DEPLOYMENT.md for the full cost analysis.
+ # ─────────────────────────────────────────────────────────────────────────────
+
+ set -e # exit on any error
+
+ # ── Configuration ──────────────────────────────────────────────────────────────
+ RESOURCE_GROUP="irminsul-rg"
+ LOCATION="eastus"
+ ACR_NAME="irminsulacr" # must be globally unique, lowercase alphanumeric
+ ENVIRONMENT="irminsul-env"
+ APP_NAME="irminsul"
+ IMAGE_TAG="latest"
+
+ # ── Validate required secrets ──────────────────────────────────────────────────
+ if [[ -z "$PINECONE_API_KEY" ]]; then
+   echo "ERROR: PINECONE_API_KEY environment variable is not set."
+   echo " export PINECONE_API_KEY=your_key"
+   exit 1
+ fi
+
+ if [[ -z "$GROQ_API_KEY" ]]; then
+   echo "ERROR: GROQ_API_KEY environment variable is not set."
+   echo " export GROQ_API_KEY=your_key"
+   exit 1
+ fi
+
+ echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
+ echo " Irminsul - Azure Container Apps Deployment"
+ echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
+ echo ""
+
+ # ── Step 1: Resource Group ─────────────────────────────────────────────────────
+ echo "[1/5] Creating resource group: $RESOURCE_GROUP"
+ az group create \
+   --name "$RESOURCE_GROUP" \
+   --location "$LOCATION" \
+   --output none
+ echo " ✓ Resource group ready"
+
+ # ── Step 2: Azure Container Registry ──────────────────────────────────────────
+ echo "[2/5] Creating container registry: $ACR_NAME"
+ az acr create \
+   --resource-group "$RESOURCE_GROUP" \
+   --name "$ACR_NAME" \
+   --sku Basic \
+   --admin-enabled true \
+   --output none
+ echo " ✓ ACR created"
+
+ # ── Step 3: Build image via ACR Tasks (cloud build, no local Docker needed) ────
+ echo "[3/5] Building Docker image via ACR Tasks..."
+ echo " This uploads your source code to Azure and builds in the cloud."
+ az acr build \
+   --registry "$ACR_NAME" \
+   --image "${APP_NAME}:${IMAGE_TAG}" \
+   .
+ echo " ✓ Image built and pushed: ${ACR_NAME}.azurecr.io/${APP_NAME}:${IMAGE_TAG}"
+
+ # ── Step 4: Container Apps Environment ────────────────────────────────────────
+ echo "[4/5] Creating Container Apps environment: $ENVIRONMENT"
+ az containerapp env create \
+   --name "$ENVIRONMENT" \
+   --resource-group "$RESOURCE_GROUP" \
+   --location "$LOCATION" \
+   --output none
+ echo " ✓ Environment ready"
+
+ # ── Step 5: Deploy Container App ──────────────────────────────────────────────
+ echo "[5/5] Deploying container app: $APP_NAME"
+
+ # Get ACR credentials for pulling the image
+ ACR_LOGIN_SERVER=$(az acr show --name "$ACR_NAME" --query loginServer --output tsv)
+ ACR_USERNAME=$(az acr credential show --name "$ACR_NAME" --query username --output tsv)
+ ACR_PASSWORD=$(az acr credential show --name "$ACR_NAME" --query "passwords[0].value" --output tsv)
+
+ az containerapp create \
+   --name "$APP_NAME" \
+   --resource-group "$RESOURCE_GROUP" \
+   --environment "$ENVIRONMENT" \
+   --image "${ACR_LOGIN_SERVER}/${APP_NAME}:${IMAGE_TAG}" \
+   --registry-server "$ACR_LOGIN_SERVER" \
+   --registry-username "$ACR_USERNAME" \
+   --registry-password "$ACR_PASSWORD" \
+   --target-port 8000 \
+   --ingress external \
+   --min-replicas 0 \
+   --max-replicas 3 \
+   --cpu 1.0 \
+   --memory 2.0Gi \
+   --env-vars \
+     PINECONE_API_KEY=secretref:pinecone-key \
+     GROQ_API_KEY=secretref:groq-key \
+     PINECONE_INDEX=llmops-rag \
+     LLM_BACKEND=groq \
+     EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2 \
+   --secrets \
+     pinecone-key="$PINECONE_API_KEY" \
+     groq-key="$GROQ_API_KEY" \
+   --output none
+
+ echo " ✓ Container app deployed"
+
+ # ── Print live URL ─────────────────────────────────────────────────────────────
+ LIVE_URL=$(az containerapp show \
+   --name "$APP_NAME" \
+   --resource-group "$RESOURCE_GROUP" \
+   --query "properties.configuration.ingress.fqdn" \
+   --output tsv)
+
+ echo ""
+ echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
+ echo " Deployment complete!"
+ echo ""
+ echo " Live URL: https://${LIVE_URL}"
+ echo " Health: https://${LIVE_URL}/health"
+ echo " API docs: https://${LIVE_URL}/docs"
+ echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
+ echo ""
+ echo " To tear down everything and stop billing:"
+ echo " az group delete --name $RESOURCE_GROUP --yes --no-wait"
+ echo ""
requirements.txt CHANGED
@@ -1,23 +1,26 @@
- # Core serving
  fastapi==0.115.0
  uvicorn[standard]==0.30.6
  pydantic==2.7.4
 
- # LLM + fine-tuned model
  torch==2.3.1
  transformers==4.51.3
  peft==0.15.2
- bitsandbytes==0.43.3
  accelerate==1.6.0
 
- # RAG
  langchain==1.2.13
  langchain-community==0.4.1
- langchain-classic==1.0.3
  langchain-groq==1.1.2
  groq==0.37.1
- pinecone-client==3.2.2
  sentence-transformers==4.1.0
 
- # Utilities
- python-dotenv==1.0.1
+ # ── Core serving ───────────────────────────────────────────────────────────────
  fastapi==0.115.0
  uvicorn[standard]==0.30.6
  pydantic==2.7.4
 
+ # ── LLM + fine-tuned model (local backend) ─────────────────────────────────────
+ # Only needed when LLM_BACKEND=local
+ # GPU with 6GB+ VRAM needed for usable latency; loads on CPU but is very slow
  torch==2.3.1
  transformers==4.51.3
  peft==0.15.2
+ bitsandbytes==0.43.3 # 4-bit NF4 quantization (official Windows wheels since 0.43)
  accelerate==1.6.0
 
+ # ── RAG ────────────────────────────────────────────────────────────────────────
  langchain==1.2.13
  langchain-community==0.4.1
+ langchain-classic==1.0.3 # provides langchain_classic.chains.RetrievalQA
  langchain-groq==1.1.2
  groq==0.37.1
+ pinecone-client==3.2.2 # langchain-community expects the v3 Pinecone client API
  sentence-transformers==4.1.0
 
+ # ── Utilities ──────────────────────────────────────────────────────────────────
+ python-dotenv==1.0.1
+ requests>=2.31.0