Kanika's AI Learning Repo

A personal, hands-on learning repository for experimenting with open-source LLMs via the Hugging Face transformers library, using Meta Llama 3.1 8B Instruct as the working model.

The repo is organized as a series of phases, each one building on the previous to introduce a new concept or pattern. Phase 1 is the warm-up; Phase 2 establishes reusable inference building blocks; Phase 3 is the current focus and ties them together into a working RAG system.

Repository structure

kanikatestmodel/
├── phase-1-model-verification/      ← warm-up: get the model running locally
│   ├── ai_experiment.py
│   ├── ai_llama_cleanchat.py
│   └── README.md
├── phase-2-inference-prompting/     ← reusable inference + prompt engineering
│   ├── inference/                    LlamaInference, chat template, structured output
│   │   ├── base_inference.py
│   │   ├── chat_template_example.py
│   │   └── structured_output.py
│   ├── prompts/                      markdown prompt catalogues
│   │   ├── system_prompts.md
│   │   └── instruction_prompts.md
│   ├── tests/                        assertion + observational tests
│   │   ├── test_structured_output.py
│   │   ├── test_temperature_vs_top_p.py
│   │   └── test_max_tokens.py
│   └── README.md
└── phase-3-rag/                      ← current focus: end-to-end RAG
    ├── data/sample_docs/             4 short markdown source documents
    ├── ingestion/                    chunker → embedder → build_index
    ├── retrieval/                    Retriever (top-K over Chroma)
    ├── rag/                          RagPipeline (retrieval + grounded prompt + Llama)
    ├── prompts/                      system_prompts.md (rag_grounded_assistant)
    ├── tests/                        test_retriever.py, test_rag_grounded.py
    └── README.md

Phase 3 – RAG (current)

Phase 3 pairs the LlamaInference wrapper from Phase 2 with a vector database so the model answers from a small corpus of source documents instead of relying on its pretraining alone.

The whole architecture is composition: Retriever (MiniLM + Chroma) + grounded prompt + LlamaInference. There is no extra magic.

The two phases of a RAG system

| Phase | When it runs | What it does |
| --- | --- | --- |
| Ingestion | Once per document | chunk → embed → write to vector DB |
| Query | Every user question | embed query → top-K search → grounded prompt → LLM |
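
Concretely, the ingestion phase can be as small as the sketch below: the same chunk → embed → write shape, using the MiniLM embedder the repo uses. The window size, overlap, collection name, and function names are illustrative placeholders, not the repo's actual ingestion/ code.

```python
import chromadb
from sentence_transformers import SentenceTransformer

def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Pure word-window splitter: fixed-size windows with a small overlap."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def build_index(docs: dict[str, str], persist_dir: str = "chroma_db") -> None:
    """Chunk, embed, and persist a {source_name: text} mapping to a local Chroma collection."""
    embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim vectors
    client = chromadb.PersistentClient(path=persist_dir)
    collection = client.get_or_create_collection("docs")

    for source, text in docs.items():
        chunks = chunk_words(text)
        collection.add(
            ids=[f"{source}-{i}" for i in range(len(chunks))],
            documents=chunks,
            embeddings=embedder.encode(chunks).tolist(),
            metadatas=[{"source": source, "chunk_index": i} for i in range(len(chunks))],
        )
```

At query time the same embedder encodes the question and Chroma returns the nearest chunks; the repo wraps that step in the Retriever described under Highlights.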

Highlights

  • ingestion/ – chunker.py (pure word-window splitter), embedder.py (sentence-transformers/all-MiniLM-L6-v2, 384-dim), build_index.py (CLI that walks a folder, chunks, batch-embeds, persists to Chroma).
  • retrieval/retriever.py – Retriever.query(question, k=4) returning [{"text", "source", "chunk_index", "score"}, ...]. Hard-fails if no index exists, with a hint pointing at build_index.
  • rag/pipeline.py – RagPipeline.answer(question) returning {"answer", "sources", "hits"}. The grounded prompt explicitly tells the model to refuse with a fixed phrase when the context is silent; that refusal-on-silence is what makes RAG more than autocomplete. (Usage sketch after this list.)
  • tests/ – test_retriever.py (no LLM, fast); test_rag_grounded.py (loads Llama, asserts both that good context produces a grounded answer AND that off-topic questions trigger the refusal phrase, not a hallucination from pretraining).
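
A usage sketch of the two entry points above. Only the method names and return shapes come from this README; the zero-argument constructors are an assumption.

```python
from retrieval.retriever import Retriever
from rag.pipeline import RagPipeline

question = "How does the chunker split documents?"

# Fast path: retrieval only, no LLM loaded.
retriever = Retriever()
for hit in retriever.query(question, k=4):
    print(f'{hit["score"]:.3f}  {hit["source"]}#{hit["chunk_index"]}  {hit["text"][:60]}')

# Full path: retrieval + grounded prompt + Llama.
pipeline = RagPipeline()
result = pipeline.answer(question)
print(result["answer"])
print("sources:", result["sources"])
```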

See phase-3-rag/ for the full per-module walkthrough and design rationale.


Phase 2 – Inference & Prompting (foundation)

The bulk of this repo. Phase 2 turns the throwaway scripts of Phase 1 into reusable building blocks and uses them to explore prompt engineering, structured output, and the effects of generation parameters.

inference/ – building blocks

  • base_inference.py – a LlamaInference class that loads the model once (~16 GB in bfloat16) and exposes a clean .generate(messages, temperature, top_p, max_tokens, do_sample) method. This is the foundation everything else in Phase 2 imports.
  • chat_template_example.py – demonstrates the proper multi-role chat template (system + user + assistant + user) so the model can answer follow-up questions with full conversational context.
  • structured_output.py – a generate_json(...) helper that combines a strict system prompt with low temperature to force the model to return parseable JSON, then loads it into a real Python dict. Rough sketches of the wrapper and the JSON helper follow.
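
The shape of the wrapper, inferred only from this README (load once in bfloat16, expose .generate(messages, ...)); the body below is a hedged sketch, not the repo's actual base_inference.py.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class LlamaInference:
    """Load Meta Llama 3.1 8B Instruct once, then reuse it for every call."""

    def __init__(self, model_id: str = "meta-llama/Meta-Llama-3.1-8B-Instruct"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,  # ~16 GB of weights
            device_map="auto",
        )

    def generate(self, messages, temperature=0.7, top_p=0.9,
                 max_tokens=256, do_sample=True) -> str:
        # Render the multi-role chat template and tokenize in one step.
        input_ids = self.tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to(self.model.device)
        output = self.model.generate(
            input_ids,
            max_new_tokens=max_tokens,
            do_sample=do_sample,
            temperature=temperature,
            top_p=top_p,
        )
        # Return only the newly generated tokens, not the echoed prompt.
        return self.tokenizer.decode(
            output[0][input_ids.shape[-1]:], skip_special_tokens=True
        )
```

On top of that, a generate_json(...) helper only needs a strict system prompt, a low temperature, and json.loads. The signature and prompt wording here are guesses at what structured_output.py does:

```python
import json

def generate_json(llm: LlamaInference, instruction: str, schema_hint: str) -> dict:
    """Force parseable JSON via a strict system prompt and low temperature."""
    messages = [
        {"role": "system",
         "content": "You are a JSON extractor. Respond with one valid JSON object "
                    "and nothing else. " + schema_hint},
        {"role": "user", "content": instruction},
    ]
    raw = llm.generate(messages, temperature=0.1, max_tokens=256)
    return json.loads(raw)  # raises if the model drifted from strict JSON
```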

prompts/ – content separated from code

A two-file catalogue based on a clear semantic split:

| File | Purpose | Example names | Sent as |
| --- | --- | --- | --- |
| system_prompts.md | Who the model is: persona, tone, global rules | json_extractor, python_tutor, creative_writer | system message |
| instruction_prompts.md | What task to perform on this turn | summarize, extract_movie_info, classify_sentiment | user message |

The two combine like LEGO: any system_prompt × any instruction_prompt = a working, reusable inference call.
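
For example (the prompt texts below are paraphrased stand-ins, not copied from the catalogue files):

```python
llm = LlamaInference()  # from the sketch above

review = "The battery died after two hours. Not what I expected from the ads."
messages = [
    # system prompt: who the model is (a classifier-style persona)
    {"role": "system", "content": "You are a precise sentiment classifier. Answer with a single word."},
    # instruction prompt: what to do on this turn (e.g. classify_sentiment)
    {"role": "user", "content": "Classify this review as positive, negative, or neutral:\n\n" + review},
]
print(llm.generate(messages, temperature=0.2, max_tokens=10))
```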

tests/ – verify behavior, build intuition

Two flavors of test, both runnable directly from the command line:

  • Assertion tests (test_structured_output.py) – must pass. Calls generate_json(...) and asserts the returned dict has the expected fields and types. Catches regressions when prompts or models change.
  • Observational tests (test_temperature_vs_top_p.py, test_max_tokens.py) – no assertions; sends the same prompt with different generation knobs and prints the outputs side by side. Designed to make the abstract effects of temperature, top_p, and max_tokens concrete. Both flavors are sketched below.
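
Minimal sketches of both flavors, reusing the LlamaInference and generate_json sketches above; the concrete prompts and fields are illustrative, not copied from the repo's tests.

```python
llm = LlamaInference()

# Assertion-style: the structured output must parse and carry the expected fields.
movie = generate_json(
    llm,
    instruction="Extract the movie info from: 'Inception (2010), directed by Christopher Nolan.'",
    schema_hint='Return {"title": str, "year": int, "director": str}.',
)
assert isinstance(movie, dict)
assert isinstance(movie["title"], str) and isinstance(movie["year"], int)

# Observational: same prompt, different knobs, outputs printed side by side.
prompt = [{"role": "user", "content": "Describe a rainy day in one sentence."}]
for temperature in (0.2, 0.7, 1.2):
    print(f"--- temperature={temperature} ---")
    print(llm.generate(prompt, temperature=temperature, max_tokens=60))
```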

See phase-2-inference-prompting/ for code, prompts, and how to run each module.


Phase 1 – Model Verification (foundation)

The starting point: get Meta Llama 3.1 8B Instruct downloaded, loaded onto the GPU, and producing output end-to-end. Two minimal scripts:

  • ai_experiment.py – a single-shot pirate-themed chat completion. Verifies the model loads and runs (roughly the pattern sketched below).
  • ai_llama_cleanchat.py – an interactive console chat loop with warnings/logs silenced for a cleaner UX.
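
Roughly the pattern the Phase 1 smoke test verifies, shown here with the stock transformers pipeline API; the repo's actual script may differ.

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always answers in pirate speak."},
    {"role": "user", "content": "Who are you?"},
]
out = pipe(messages, max_new_tokens=80)
print(out[0]["generated_text"][-1]["content"])  # the assistant's reply
```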

See phase-1-model-verification/ for the scripts and the full setup walkthrough (gated-model access, virtual env, dependencies, troubleshooting).


Quick start

# 1. Clone
git clone https://huggingface.co/kanika23oct/kanikatestmodel
cd kanikatestmodel

# 2. Set up Python environment
python -m venv venv
.\venv\Scripts\Activate.ps1
pip install transformers torch accelerate                # Phase 2
pip install sentence-transformers chromadb               # Phase 3 (added)

# 3. Authenticate with Hugging Face (Llama 3.1 is gated)
huggingface-cli login

# 4. Run a Phase 2 module
cd phase-2-inference-prompting
python -m inference.structured_output
python -m tests.test_structured_output

# 5. Run the Phase 3 RAG pipeline
cd ..\phase-3-rag
python -m ingestion.build_index        # build the vector store (one-time per docs change)
python -m tests.test_retriever         # fast sanity check (no LLM)
python -m tests.test_rag_grounded      # end-to-end RAG with Llama (slow on CPU)

A GPU with ≥ 16 GB VRAM is strongly recommended for any phase that loads Llama; CPU inference works but is impractically slow (multiple minutes per generation).

What's next

Tracked in SESSION_NOTES.md. Headline items:

  • Phase 3 wrap-up – actually run test_rag_grounded.py end-to-end (CPU torch makes it slow), citation parsing, bigger corpus, optional re-ranking.
  • Phase 4 (FastAPI) – wrap RagPipeline.answer behind a POST /rag HTTP endpoint.
  • Phase 5 (Agents) – give the model tool-calling so it decides when to retrieve, when to answer directly, and when to call something else.
  • Phase 2 wrap-up – few-shot, guardrails, quantization, the load_prompt(name) helper that wires the markdown catalogues to code.