# Kanika's AI Learning Repo
A personal, hands-on learning repository for experimenting with open-source LLMs via the Hugging Face `transformers` library, using Meta Llama 3.1 8B Instruct as the working model.
The repo is organized as a series of phases, each one building on the previous to introduce a new concept or pattern. Phase 1 is the warm-up; Phase 2 establishes reusable inference building blocks; Phase 3 is the current focus and ties them together into a working RAG system.
## Repository structure
```
kanikatestmodel/
├── phase-1-model-verification/      # warm-up: get the model running locally
│   ├── ai_experiment.py
│   ├── ai_llama_cleanchat.py
│   └── README.md
├── phase-2-inference-prompting/     # reusable inference + prompt engineering
│   ├── inference/                   # LlamaInference, chat template, structured output
│   │   ├── base_inference.py
│   │   ├── chat_template_example.py
│   │   └── structured_output.py
│   ├── prompts/                     # markdown prompt catalogues
│   │   ├── system_prompts.md
│   │   └── instruction_prompts.md
│   ├── tests/                       # assertion + observational tests
│   │   ├── test_structured_output.py
│   │   ├── test_temperature_vs_top_p.py
│   │   └── test_max_tokens.py
│   └── README.md
└── phase-3-rag/                     # current focus: end-to-end RAG
    ├── data/sample_docs/            # 4 short markdown source documents
    ├── ingestion/                   # chunker → embedder → build_index
    ├── retrieval/                   # Retriever (top-K over Chroma)
    ├── rag/                         # RagPipeline (retrieval + grounded prompt + Llama)
    ├── prompts/                     # system_prompts.md (rag_grounded_assistant)
    ├── tests/                       # test_retriever.py, test_rag_grounded.py
    └── README.md
```
## Phase 3 – RAG (current)
Pairs the `LlamaInference` wrapper from Phase 2 with a vector database so the model answers from a small corpus of source documents instead of from its pretraining alone.
The whole architecture is composition: `Retriever` (MiniLM + Chroma) + grounded prompt + `LlamaInference`. There is no extra magic.
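A minimal sketch of that composition, using the class and method names described in this README (`Retriever.query`, `LlamaInference.generate`); the real implementation lives in `phase-3-rag/rag/pipeline.py` and may differ in detail:

```python
# Sketch only: mirrors the composition described above, not the repo's exact code.
class RagPipeline:
    def __init__(self, retriever, llm, system_prompt):
        self.retriever = retriever          # MiniLM embeddings + Chroma top-K
        self.llm = llm                      # LlamaInference wrapper from Phase 2
        self.system_prompt = system_prompt  # the rag_grounded_assistant prompt

    def answer(self, question: str) -> dict:
        hits = self.retriever.query(question, k=4)
        context = "\n\n".join(f"[{h['source']}] {h['text']}" for h in hits)
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ]
        answer = self.llm.generate(messages, temperature=0.2, max_tokens=512)
        return {
            "answer": answer,
            "sources": sorted({h["source"] for h in hits}),
            "hits": hits,
        }
```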
### The two phases of a RAG system
| Phase | When it runs | What it does |
|---|---|---|
| Ingestion | Once per document | chunk → embed → write to vector DB |
| Query | Every user question | embed query → top-K search → grounded prompt → LLM |
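For the ingestion side, here is a self-contained sketch of the chunk → embed → persist flow using the libraries from the Quick start (`sentence-transformers`, `chromadb`); the file path, collection name, persist path, and window sizes are illustrative assumptions, not the repo's actual values:

```python
import chromadb
from sentence_transformers import SentenceTransformer

def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Pure word-window splitter in the spirit of chunker.py (sizes are assumptions)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim vectors
client = chromadb.PersistentClient(path="chroma_db")                      # assumed persist path
collection = client.get_or_create_collection("sample_docs")               # assumed collection name

source = "data/sample_docs/example.md"                                    # hypothetical document
text = open(source, encoding="utf-8").read()
chunks = chunk_words(text)

collection.add(
    ids=[f"{source}-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embedder.encode(chunks).tolist(),
    metadatas=[{"source": source, "chunk_index": i} for i in range(len(chunks))],
)
```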
### Highlights
- `ingestion/` – `chunker.py` (pure word-window splitter), `embedder.py` (`sentence-transformers/all-MiniLM-L6-v2`, 384-dim), `build_index.py` (CLI that walks a folder, chunks, batch-embeds, persists to Chroma).
- `retrieval/retriever.py` – `Retriever.query(question, k=4)` returning `[{"text", "source", "chunk_index", "score"}, ...]`. Hard-fails if no index exists, with a hint pointing at `build_index` (see the sketch after this list).
- `rag/pipeline.py` – `RagPipeline.answer(question)` returning `{"answer", "sources", "hits"}`. The grounded prompt explicitly tells the model to refuse with a fixed phrase when the context is silent; that refusal-on-silence is what makes RAG more than autocomplete.
- `tests/` – `test_retriever.py` (no LLM, fast); `test_rag_grounded.py` (loads Llama, asserts both that good context produces a grounded answer and that off-topic questions trigger the refusal phrase, not a hallucination from pretraining).
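The query side can be sketched the same way. This assumes the same persist path and collection name as the ingestion sketch above, and it returns Chroma's raw distances as the `score` field (whether the repo's score is a distance or a similarity is not stated):

```python
import chromadb
from sentence_transformers import SentenceTransformer

class Retriever:
    def __init__(self, persist_path: str = "chroma_db", collection: str = "sample_docs"):
        self.embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
        client = chromadb.PersistentClient(path=persist_path)
        try:
            self.collection = client.get_collection(collection)
        except Exception as exc:
            # Hard-fail with a hint, as described above
            raise RuntimeError(
                "No index found - run `python -m ingestion.build_index` first"
            ) from exc

    def query(self, question: str, k: int = 4) -> list[dict]:
        embedding = self.embedder.encode([question]).tolist()
        res = self.collection.query(query_embeddings=embedding, n_results=k)
        return [
            {"text": doc, "source": meta["source"],
             "chunk_index": meta["chunk_index"], "score": dist}
            for doc, meta, dist in zip(
                res["documents"][0], res["metadatas"][0], res["distances"][0]
            )
        ]
```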
See phase-3-rag/ for the full per-module walkthrough and design rationale.
## Phase 2 – Inference & Prompting (foundation)
The bulk of this repo. Phase 2 turns the throwaway scripts of Phase 1 into reusable building blocks and uses them to explore prompt engineering, structured output, and the effects of generation parameters.
### inference/ – building blocks
- `base_inference.py` – a `LlamaInference` class that loads the model once (~16 GB in bfloat16) and exposes a clean `.generate(messages, temperature, top_p, max_tokens, do_sample)` method. This is the foundation everything else in Phase 2 imports (a rough sketch of the underlying `transformers` calls follows this list).
- `chat_template_example.py` – demonstrates the proper multi-role chat template (system + user + assistant + user) so the model can answer follow-up questions with full conversational context.
- `structured_output.py` – a `generate_json(...)` helper that combines a strict system prompt with low temperature to force the model to return parseable JSON, then loads it into a real Python `dict`.
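As a rough idea of what a `LlamaInference`-style wrapper does under the hood with `transformers` (the actual class in `base_inference.py` may differ; the model id and default parameters here are assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # gated; requires huggingface-cli login

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate(messages, temperature=0.7, top_p=0.9, max_tokens=256, do_sample=True) -> str:
    # apply_chat_template inserts the Llama 3.1 role headers and the assistant cue
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(
        inputs,
        max_new_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        do_sample=do_sample,
    )
    # Strip the prompt tokens so only the new assistant text is returned
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain RAG in one sentence."},
]
print(generate(messages))
```

The same `messages` list format carries the multi-role conversation in `chat_template_example.py`: appending prior assistant/user turns to the list is all that follow-up context requires.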
### prompts/ – content separated from code
A two-file catalogue based on a clear semantic split:
| File | Purpose | Example names | Sent as |
|---|---|---|---|
| `system_prompts.md` | Who the model is: persona, tone, global rules | `json_extractor`, `python_tutor`, `creative_writer` | system message |
| `instruction_prompts.md` | What task to perform on this turn | `summarize`, `extract_movie_info`, `classify_sentiment` | user message |
The two combine like LEGO: any `system_prompt` × any `instruction_prompt` = a working, reusable inference call.
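In practice the combination is just a two-element `messages` list. The prompt strings below are stand-ins for catalogue entries (the `load_prompt(name)` helper listed under "What's next" does not exist yet):

```python
# Stand-in strings; in the repo the real text lives in prompts/system_prompts.md
# and prompts/instruction_prompts.md.
system_prompt = "You are json_extractor. Respond with valid JSON only - no prose, no code fences."
instruction = (
    "Extract the title, director, and year from this review:\n"
    "'Dune: Part Two (2024), directed by Denis Villeneuve, is a triumph.'"
)

messages = [
    {"role": "system", "content": system_prompt},  # who the model is
    {"role": "user", "content": instruction},      # what to do on this turn
]
# Any system prompt x any instruction prompt -> one reusable call, e.g.:
# llm.generate(messages, temperature=0.1)
```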
### tests/ – verify behavior, build intuition
Two flavors of test, both runnable directly from the command line:
- Assertion tests (`test_structured_output.py`) – must pass. Calls `generate_json(...)` and asserts the returned dict has the expected fields and types. Catches regressions when prompts or models change.
- Observational tests (`test_temperature_vs_top_p.py`, `test_max_tokens.py`) – no assertions; they send the same prompt with different generation knobs and print the outputs side-by-side, making the abstract effects of `temperature`, `top_p`, and `max_tokens` concrete (see the sketch after this list).
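A minimal observational test might look like this, assuming the `LlamaInference` wrapper imports from `inference.base_inference` and takes no constructor arguments (both assumptions about the repo's actual code):

```python
# Sketch in the spirit of test_temperature_vs_top_p.py; not the repo's actual test.
from inference.base_inference import LlamaInference  # assumed import path

llm = LlamaInference()  # assumed no-argument constructor
messages = [{"role": "user", "content": "Describe a rainy city street in two sentences."}]

for temperature in (0.2, 0.7, 1.2):
    print(f"\n--- temperature={temperature}, top_p=0.9 ---")
    print(llm.generate(messages, temperature=temperature, top_p=0.9, max_tokens=80))
```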
See phase-2-inference-prompting/ for code, prompts, and how to run each module.
## Phase 1 – Model Verification (warm-up)
The starting point: get Meta Llama 3.1 8B Instruct downloaded, loaded onto the GPU, and producing output end-to-end. Two minimal scripts:
- `ai_experiment.py` – a single-shot pirate-themed chat completion. Verifies the model loads and runs (a minimal equivalent is sketched below).
- `ai_llama_cleanchat.py` – an interactive console chat loop with warnings/logs silenced for a cleaner UX.
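A single-shot smoke test in the same spirit as `ai_experiment.py` can be written with the `transformers` `pipeline` API; the model id and prompt below are illustrative, and access to the gated Llama 3.1 repo must already be granted:

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # gated model; requires huggingface-cli login
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always answers in pirate speak."},
    {"role": "user", "content": "Who are you?"},
]
out = pipe(messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])  # the newly generated assistant turn
```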
See phase-1-model-verification/ for the scripts and the full setup walkthrough (gated-model access, virtual env, dependencies, troubleshooting).
## Quick start
```powershell
# 1. Clone
git clone https://huggingface.co/kanika23oct/kanikatestmodel
cd kanikatestmodel

# 2. Set up Python environment
python -m venv venv
.\venv\Scripts\Activate.ps1
pip install transformers torch accelerate     # Phase 2
pip install sentence-transformers chromadb    # Phase 3 (added)

# 3. Authenticate with Hugging Face (Llama 3.1 is gated)
huggingface-cli login

# 4. Run a Phase 2 module
cd phase-2-inference-prompting
python -m inference.structured_output
python -m tests.test_structured_output

# 5. Run the Phase 3 RAG pipeline
cd ..\phase-3-rag
python -m ingestion.build_index      # build the vector store (one-time per docs change)
python -m tests.test_retriever       # fast sanity check (no LLM)
python -m tests.test_rag_grounded    # end-to-end RAG with Llama (slow on CPU)
```
A GPU with ≥ 16 GB VRAM is strongly recommended for any phase that loads Llama; CPU inference works but is impractically slow (multiple minutes per generation).
## What's next
Tracked in SESSION_NOTES.md. Headline items:
- Phase 3 wrap-up – actually run `test_rag_grounded.py` end-to-end (CPU torch makes it slow), citation parsing, a bigger corpus, optional re-ranking.
- Phase 4 (FastAPI) – wrap `RagPipeline.answer` behind a `POST /rag` HTTP endpoint (a hypothetical sketch follows this list).
- Phase 5 (Agents) – give the model tool-calling so it decides when to retrieve, when to answer directly, and when to call something else.
- Phase 2 wrap-up – few-shot, guardrails, quantization, and the `load_prompt(name)` helper that wires the markdown catalogues to code.
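For the planned Phase 4, a hypothetical FastAPI wrapper could be as small as the sketch below; none of this exists in the repo yet, and the import path and constructor are assumptions:

```python
from fastapi import FastAPI
from pydantic import BaseModel

from rag.pipeline import RagPipeline  # assumed import path within phase-3-rag

app = FastAPI()
rag_pipeline = RagPipeline()  # assumed no-argument constructor

class RagRequest(BaseModel):
    question: str

@app.post("/rag")
def rag(req: RagRequest) -> dict:
    # Returns {"answer", "sources", "hits"} as described in Phase 3
    return rag_pipeline.answer(req.question)
```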