# Kanika's AI Learning Repo
A personal, hands-on learning repository for experimenting with open-source LLMs via the Hugging Face `transformers` library, using Meta Llama 3.1 8B Instruct as the working model.
The repo is organized as a series of phases, each one building on the previous to introduce a new concept or pattern. Phase 1 is the warm-up; Phase 2 establishes reusable inference building blocks; Phase 3 is the current focus and ties them together into a working RAG system.
## Repository structure
```
kanikatestmodel/
├── phase-1-model-verification/      # warm-up: get the model running locally
│   ├── ai_experiment.py
│   ├── ai_llama_cleanchat.py
│   └── README.md
├── phase-2-inference-prompting/     # reusable inference + prompt engineering
│   ├── inference/                   # LlamaInference, chat template, structured output
│   │   ├── base_inference.py
│   │   ├── chat_template_example.py
│   │   └── structured_output.py
│   ├── prompts/                     # markdown prompt catalogues
│   │   ├── system_prompts.md
│   │   └── instruction_prompts.md
│   ├── tests/                       # assertion + observational tests
│   │   ├── test_structured_output.py
│   │   ├── test_temperature_vs_top_p.py
│   │   └── test_max_tokens.py
│   └── README.md
└── phase-3-rag/                     # current focus: end-to-end RAG
    ├── data/sample_docs/            # 4 short markdown source documents
    ├── ingestion/                   # chunker → embedder → build_index
    ├── retrieval/                   # Retriever (top-K over Chroma)
    ├── rag/                         # RagPipeline (retrieval + grounded prompt + Llama)
    ├── prompts/                     # system_prompts.md (rag_grounded_assistant)
    ├── tests/                       # test_retriever.py, test_rag_grounded.py
    └── README.md
```
## Phase 3 – RAG (current)
Pairs the `LlamaInference` wrapper from Phase 2 with a vector database so the model answers from a small corpus of source documents instead of from its pretraining alone.
The whole architecture is composition: `Retriever` (MiniLM + Chroma) + grounded prompt + `LlamaInference`. There is no extra magic.
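A minimal sketch of that composition, using the class and method names described in this README (`Retriever.query`, `LlamaInference.generate`); the real implementation lives in `phase-3-rag/rag/pipeline.py` and may differ in detail:

```python
# Sketch only: mirrors the composition described above, not the repo's exact code.
class RagPipeline:
    def __init__(self, retriever, llm, system_prompt):
        self.retriever = retriever          # MiniLM embeddings + Chroma top-K
        self.llm = llm                      # LlamaInference wrapper from Phase 2
        self.system_prompt = system_prompt  # the rag_grounded_assistant prompt

    def answer(self, question: str) -> dict:
        hits = self.retriever.query(question, k=4)
        context = "\n\n".join(f"[{h['source']}] {h['text']}" for h in hits)
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ]
        answer = self.llm.generate(messages, temperature=0.2, max_tokens=512)
        return {
            "answer": answer,
            "sources": sorted({h["source"] for h in hits}),
            "hits": hits,
        }
```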
### The two phases of a RAG system
| Phase | When it runs | What it does |
|---|---|---|
| Ingestion | Once per document | chunk → embed → write to vector DB |
| Query | Every user question | embed query → top-K search → grounded prompt → LLM |
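For the ingestion side, here is a self-contained sketch of the chunk → embed → persist flow using the libraries from the Quick start (`sentence-transformers`, `chromadb`); the file path, collection name, persist path, and window sizes are illustrative assumptions, not the repo's actual values:

```python
import chromadb
from sentence_transformers import SentenceTransformer

def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Pure word-window splitter in the spirit of chunker.py (sizes are assumptions)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim vectors
client = chromadb.PersistentClient(path="chroma_db")                      # assumed persist path
collection = client.get_or_create_collection("sample_docs")               # assumed collection name

source = "data/sample_docs/example.md"                                    # hypothetical document
text = open(source, encoding="utf-8").read()
chunks = chunk_words(text)

collection.add(
    ids=[f"{source}-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embedder.encode(chunks).tolist(),
    metadatas=[{"source": source, "chunk_index": i} for i in range(len(chunks))],
)
```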
### Highlights
- `ingestion/` – `chunker.py` (pure word-window splitter), `embedder.py` (`sentence-transformers/all-MiniLM-L6-v2`, 384-dim), `build_index.py` (CLI that walks a folder, chunks, batch-embeds, persists to Chroma).
- `retrieval/retriever.py` – `Retriever.query(question, k=4)` returning `[{"text", "source", "chunk_index", "score"}, ...]`. Hard-fails if no index exists, with a hint pointing at `build_index` (see the sketch after this list).
- `rag/pipeline.py` – `RagPipeline.answer(question)` returning `{"answer", "sources", "hits"}`. The grounded prompt explicitly tells the model to refuse with a fixed phrase when the context is silent; that refusal-on-silence is what makes RAG more than autocomplete.
- `tests/` – `test_retriever.py` (no LLM, fast); `test_rag_grounded.py` (loads Llama, asserts both that good context produces a grounded answer and that off-topic questions trigger the refusal phrase, not a hallucination from pretraining).
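The query side can be sketched the same way. This assumes the same persist path and collection name as the ingestion sketch above, and it returns Chroma's raw distances as the `score` field (whether the repo's score is a distance or a similarity is not stated):

```python
import chromadb
from sentence_transformers import SentenceTransformer

class Retriever:
    def __init__(self, persist_path: str = "chroma_db", collection: str = "sample_docs"):
        self.embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
        client = chromadb.PersistentClient(path=persist_path)
        try:
            self.collection = client.get_collection(collection)
        except Exception as exc:
            # Hard-fail with a hint, as described above
            raise RuntimeError(
                "No index found - run `python -m ingestion.build_index` first"
            ) from exc

    def query(self, question: str, k: int = 4) -> list[dict]:
        embedding = self.embedder.encode([question]).tolist()
        res = self.collection.query(query_embeddings=embedding, n_results=k)
        return [
            {"text": doc, "source": meta["source"],
             "chunk_index": meta["chunk_index"], "score": dist}
            for doc, meta, dist in zip(
                res["documents"][0], res["metadatas"][0], res["distances"][0]
            )
        ]
```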
See phase-3-rag/ for the full per-module walkthrough and design rationale.
## Phase 2 – Inference & Prompting (foundation)
The bulk of this repo. Phase 2 turns the throwaway scripts of Phase 1 into reusable building blocks and uses them to explore prompt engineering, structured output, and the effects of generation parameters.
### inference/ – building blocks
- `base_inference.py` – a `LlamaInference` class that loads the model once (~16 GB in bfloat16) and exposes a clean `.generate(messages, temperature, top_p, max_tokens, do_sample)` method. This is the foundation everything else in Phase 2 imports (a rough sketch of the underlying `transformers` calls follows this list).
- `chat_template_example.py` – demonstrates the proper multi-role chat template (system + user + assistant + user) so the model can answer follow-up questions with full conversational context.
- `structured_output.py` – a `generate_json(...)` helper that combines a strict system prompt with low temperature to force the model to return parseable JSON, then loads it into a real Python `dict`.
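As a rough idea of what a `LlamaInference`-style wrapper does under the hood with `transformers` (the actual class in `base_inference.py` may differ; the model id and default parameters here are assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # gated; requires huggingface-cli login

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate(messages, temperature=0.7, top_p=0.9, max_tokens=256, do_sample=True) -> str:
    # apply_chat_template inserts the Llama 3.1 role headers and the assistant cue
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(
        inputs,
        max_new_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        do_sample=do_sample,
    )
    # Strip the prompt tokens so only the new assistant text is returned
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain RAG in one sentence."},
]
print(generate(messages))
```

The same `messages` list format carries the multi-role conversation in `chat_template_example.py`: appending prior assistant/user turns to the list is all that follow-up context requires.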
### prompts/ – content separated from code
A two-file catalogue based on a clear semantic split:
| File | Purpose | Example names | Sent as |
|---|---|---|---|
| `system_prompts.md` | Who the model is: persona, tone, global rules | `json_extractor`, `python_tutor`, `creative_writer` | system message |
| `instruction_prompts.md` | What task to perform on this turn | `summarize`, `extract_movie_info`, `classify_sentiment` | user message |
The two combine like LEGO: any `system_prompt` × any `instruction_prompt` = a working, reusable inference call.
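In practice the combination is just a two-element `messages` list. The prompt strings below are stand-ins for catalogue entries (the `load_prompt(name)` helper listed under "What's next" does not exist yet):

```python
# Stand-in strings; in the repo the real text lives in prompts/system_prompts.md
# and prompts/instruction_prompts.md.
system_prompt = "You are json_extractor. Respond with valid JSON only - no prose, no code fences."
instruction = (
    "Extract the title, director, and year from this review:\n"
    "'Dune: Part Two (2024), directed by Denis Villeneuve, is a triumph.'"
)

messages = [
    {"role": "system", "content": system_prompt},  # who the model is
    {"role": "user", "content": instruction},      # what to do on this turn
]
# Any system prompt x any instruction prompt -> one reusable call, e.g.:
# llm.generate(messages, temperature=0.1)
```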
### tests/ – verify behavior, build intuition
Two flavors of test, both runnable directly from the command line:
- Assertion tests (`test_structured_output.py`) – must pass. Calls `generate_json(...)` and asserts the returned dict has the expected fields and types. Catches regressions when prompts or models change.
- Observational tests (`test_temperature_vs_top_p.py`, `test_max_tokens.py`) – no assertions; they send the same prompt with different generation knobs and print the outputs side-by-side, making the abstract effects of `temperature`, `top_p`, and `max_tokens` concrete (see the sketch after this list).
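A minimal observational test might look like this, assuming the `LlamaInference` wrapper imports from `inference.base_inference` and takes no constructor arguments (both assumptions about the repo's actual code):

```python
# Sketch in the spirit of test_temperature_vs_top_p.py; not the repo's actual test.
from inference.base_inference import LlamaInference  # assumed import path

llm = LlamaInference()  # assumed no-argument constructor
messages = [{"role": "user", "content": "Describe a rainy city street in two sentences."}]

for temperature in (0.2, 0.7, 1.2):
    print(f"\n--- temperature={temperature}, top_p=0.9 ---")
    print(llm.generate(messages, temperature=temperature, top_p=0.9, max_tokens=80))
```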
See phase-2-inference-prompting/ for code, prompts, and how to run each module.
## Phase 1 – Model Verification (warm-up)
The starting point: get Meta Llama 3.1 8B Instruct downloaded, loaded onto the GPU, and producing output end-to-end. Two minimal scripts:
- `ai_experiment.py` – a single-shot pirate-themed chat completion. Verifies the model loads and runs (a minimal equivalent is sketched below).
- `ai_llama_cleanchat.py` – an interactive console chat loop with warnings/logs silenced for a cleaner UX.
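A single-shot smoke test in the same spirit as `ai_experiment.py` can be written with the `transformers` `pipeline` API; the model id and prompt below are illustrative, and access to the gated Llama 3.1 repo must already be granted:

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # gated model; requires huggingface-cli login
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always answers in pirate speak."},
    {"role": "user", "content": "Who are you?"},
]
out = pipe(messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])  # the newly generated assistant turn
```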
See phase-1-model-verification/ for the scripts and the full setup walkthrough (gated-model access, virtual env, dependencies, troubleshooting).
## Quick start
```powershell
# 1. Clone
git clone https://huggingface.co/kanika23oct/kanikatestmodel
cd kanikatestmodel

# 2. Set up Python environment
python -m venv venv
.\venv\Scripts\Activate.ps1
pip install transformers torch accelerate     # Phase 2
pip install sentence-transformers chromadb    # Phase 3 (added)

# 3. Authenticate with Hugging Face (Llama 3.1 is gated)
huggingface-cli login

# 4. Run a Phase 2 module
cd phase-2-inference-prompting
python -m inference.structured_output
python -m tests.test_structured_output

# 5. Run the Phase 3 RAG pipeline
cd ..\phase-3-rag
python -m ingestion.build_index      # build the vector store (one-time per docs change)
python -m tests.test_retriever       # fast sanity check (no LLM)
python -m tests.test_rag_grounded    # end-to-end RAG with Llama (slow on CPU)
```
A GPU with ≥ 16 GB VRAM is strongly recommended for any phase that loads Llama; CPU inference works but is impractically slow (multiple minutes per generation).
## What's next
Tracked in SESSION_NOTES.md. Headline items:
- Phase 3 wrap-up – actually run `test_rag_grounded.py` end-to-end (CPU torch makes it slow), citation parsing, a bigger corpus, optional re-ranking.
- Phase 4 (FastAPI) – wrap `RagPipeline.answer` behind a `POST /rag` HTTP endpoint (a hypothetical sketch follows this list).
- Phase 5 (Agents) – give the model tool-calling so it decides when to retrieve, when to answer directly, and when to call something else.
- Phase 2 wrap-up – few-shot, guardrails, quantization, and the `load_prompt(name)` helper that wires the markdown catalogues to code.
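For the planned Phase 4, a hypothetical FastAPI wrapper could be as small as the sketch below; none of this exists in the repo yet, and the import path and constructor are assumptions:

```python
from fastapi import FastAPI
from pydantic import BaseModel

from rag.pipeline import RagPipeline  # assumed import path within phase-3-rag

app = FastAPI()
rag_pipeline = RagPipeline()  # assumed no-argument constructor

class RagRequest(BaseModel):
    question: str

@app.post("/rag")
def rag(req: RagRequest) -> dict:
    # Returns {"answer", "sources", "hits"} as described in Phase 3
    return rag_pipeline.answer(req.question)
```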