AashishAIHub committed on
Commit
4145e94
·
1 Parent(s): c3413c4

feat: Expand Hugging Face module to comprehensive deep-dive


- 9 in-depth concept sections: Transformers, Pipelines, Tokenizers,
Datasets, Trainer API, Accelerate, Model Hub, Libraries, Spaces
- 7 comprehensive code examples: Pipelines, Tokenizers, Datasets,
Model Loading (basic/fp16/4bit/flash), Trainer, Gradio, Hub API
- 6 expert interview questions covering from_pretrained vs pipeline,
device_map sharding, Arrow internals, chat templates, gated models,
safetensors security

Files changed (1)
  1. GenAI-AgenticAI/app.js +257 -42
GenAI-AgenticAI/app.js CHANGED
@@ -169,81 +169,296 @@ attn_layer0 = outputs.attentions[<span class="number">0</span>] <span class="co
169
  'huggingface': {
170
  concepts: `
171
  <div class="section">
172
- <h2>🤗 Hugging Face Ecosystem</h2>
173
  <div class="info-box">
174
  <div class="box-title">⚡ The GitHub of AI</div>
175
- <div class="box-content">Hugging Face (HF) is the central hub for the ML community. With 500,000+ models, 100,000+ datasets, and libraries like <strong>Transformers</strong>, <strong>Diffusers</strong>, and <strong>PEFT</strong>, it's the standard toolchain for working with LLMs — from experimentation to production.</div>
176
  </div>
177
- <h3>Core Libraries</h3>
178
  <table>
179
- <tr><th>Library</th><th>Purpose</th><th>Key Classes</th></tr>
180
- <tr><td><code>transformers</code></td><td>Load & run any model</td><td>AutoModel, Pipeline, Trainer</td></tr>
181
- <tr><td><code>datasets</code></td><td>Load & process datasets</td><td>load_dataset, Dataset, DatasetDict</td></tr>
182
- <tr><td><code>tokenizers</code></td><td>Fast tokenization (Rust)</td><td>AutoTokenizer, PreTrainedTokenizerFast</td></tr>
183
- <tr><td><code>peft</code></td><td>Parameter-efficient fine-tuning</td><td>LoraConfig, get_peft_model</td></tr>
184
- <tr><td><code>accelerate</code></td><td>Distributed training / mixed precision</td><td>Accelerator, prepare()</td></tr>
185
- <tr><td><code>huggingface_hub</code></td><td>Interact with Model Hub</td><td>hf_hub_download, push_to_hub</td></tr>
186
  </table>
187
- <h3>Pipelines — The Fastest Path</h3>
188
- <p>The <code>pipeline()</code> function wraps tokenization + model + post-processing into one call. Perfect for quickly testing a model. Under the hood: tokenize → model forward pass → decode output. Supports 20+ tasks: text-generation, sentiment-analysis, NER, summarization, translation, image-classification, and more.</p>
189
- <h3>AutoClasses — Flexible Model Loading</h3>
190
- <p><code>AutoModelForCausalLM.from_pretrained()</code> auto-detects the model architecture from its <code>config.json</code>. Key arguments: <code>torch_dtype=torch.float16</code> (half precision), <code>device_map="auto"</code> (auto-shard across GPUs), <code>load_in_4bit=True</code> (quantize at load).</p>
191
- <h3>Spaces — Deploy in One Click</h3>
192
- <p>HF Spaces lets you deploy Gradio/Streamlit apps on free-tier hardware. For ML demos, use <strong>Gradio</strong> (built in — HF supports it natively). Spaces support: CPU free tier, GPU T4 ($0.60/hr), A100 ($3/hr). You can host your fine-tuned models with a web UI at no cost on the free tier.</p>
193
  </div>`,
194
  code: `
195
  <div class="section">
196
- <h2>💻 Hugging Face Code Examples</h2>
197
- <h3>Pipelines — Zero Boilerplate</h3>
198
  <div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> pipeline
199
 
200
- <span class="comment"># Text generation</span>
201
  gen = pipeline(<span class="string">"text-generation"</span>, model=<span class="string">"meta-llama/Llama-3.2-1B-Instruct"</span>)
202
  result = gen(<span class="string">"Explain RAG in one paragraph:"</span>, max_new_tokens=<span class="number">200</span>)
203
  <span class="function">print</span>(result[<span class="number">0</span>][<span class="string">"generated_text"</span>])
204
 
205
- <span class="comment"># Sentiment analysis</span>
206
  sa = pipeline(<span class="string">"sentiment-analysis"</span>)
207
- <span class="function">print</span>(sa(<span class="string">"This model is absolutely incredible!"</span>))
208
 
209
- <span class="comment"># Summarization</span>
210
- summ = pipeline(<span class="string">"summarization"</span>, model=<span class="string">"facebook/bart-large-cnn"</span>)
211
- <span class="function">print</span>(summ(long_article, max_length=<span class="number">130</span>))</div>
212
- <h3>Loading Models with BitsAndBytes Quantization</h3>
213
  <div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
214
  <span class="keyword">import</span> torch
215
 
216
  bnb_config = BitsAndBytesConfig(
217
  load_in_4bit=<span class="keyword">True</span>,
218
- bnb_4bit_quant_type=<span class="string">"nf4"</span>,
219
  bnb_4bit_compute_dtype=torch.bfloat16,
220
- bnb_4bit_use_double_quant=<span class="keyword">True</span> <span class="comment"># QLoRA-style</span>
221
  )
222
-
223
  model = AutoModelForCausalLM.from_pretrained(
224
  <span class="string">"meta-llama/Llama-3.1-8B-Instruct"</span>,
225
  quantization_config=bnb_config,
226
  device_map=<span class="string">"auto"</span>
227
  )
228
- tokenizer = AutoTokenizer.from_pretrained(<span class="string">"meta-llama/Llama-3.1-8B-Instruct"</span>)</div>
229
- <h3>Push Model to Hub</h3>
230
- <div class="code-block"><span class="keyword">from</span> huggingface_hub <span class="keyword">import</span> HfApi
231
- <span class="keyword">from</span> transformers <span class="keyword">import</span> AutoModelForCausalLM
232
 
233
- <span class="comment"># Login first: huggingface-cli login</span>
234
- model.push_to_hub(<span class="string">"your-username/my-finetuned-llama"</span>)
235
- tokenizer.push_to_hub(<span class="string">"your-username/my-finetuned-llama"</span>)
236
 
237
- <span class="comment"># Load it anywhere</span>
238
- model = AutoModelForCausalLM.from_pretrained(<span class="string">"your-username/my-finetuned-llama"</span>)</div>
239
  </div>`,
240
  interview: `
241
  <div class="section">
242
- <h2>🎯 Hugging Face Interview Questions</h2>
243
- <div class="interview-box"><strong>Q1: What's the difference between <code>from_pretrained</code> and <code>pipeline</code>?</strong><p><strong>Answer:</strong> <code>pipeline()</code> is a high-level convenience wrapper that handles tokenization, model forward pass, and output decoding automatically. <code>from_pretrained()</code> gives you raw access to the model and tokenizer for customization. Use pipelines for quick experiments; use raw classes for fine-tuning, custom inference loops, or production.</p></div>
244
- <div class="interview-box"><strong>Q2: What is <code>device_map="auto"</code>?</strong><p><strong>Answer:</strong> It uses the <code>accelerate</code> library to automatically shard a model across available devices (multiple GPUs, CPU, disk). It creates a "device map" placing layers on available memory. Essential for loading 70B+ models that don't fit on a single GPU. Uses <code>offload_folder</code> to spill overflow to CPU/disk.</p></div>
245
- <div class="interview-box"><strong>Q3: What are HF Datasets and why use them over pandas?</strong><p><strong>Answer:</strong> HF Datasets uses <strong>Apache Arrow</strong> for memory-mapped, zero-copy access. A 100GB dataset can be iterated without loading into RAM. It supports streaming (<code>streaming=True</code>), map operations that run in parallel, automatic caching, and integration with the Trainer API. Much better than pandas for large-scale ML data.</p></div>
246
- <div class="interview-box"><strong>Q4: How do you run inference with a gated model (like Llama)?</strong><p><strong>Answer:</strong> (1) Accept the license on the model page at huggingface.co, (2) Generate a HF token at hf.co/settings/tokens, (3) Run <code>huggingface-cli login</code> or pass <code>token=os.environ["HF_TOKEN"]</code> to <code>from_pretrained()</code>. In production, set the HF_TOKEN as an environment secret.</p></div>
247
  </div>`
248
  },
249
  'finetuning': {
 
169
  'huggingface': {
170
  concepts: `
171
  <div class="section">
172
+ <h2>🤗 Hugging Face Deep Dive — The Complete Ecosystem</h2>
173
  <div class="info-box">
174
  <div class="box-title">⚡ The GitHub of AI</div>
175
+ <div class="box-content">Hugging Face (HF) is the central hub for the ML community. With <strong>700,000+ models</strong>, <strong>150,000+ datasets</strong>, and 15+ libraries, it's the standard toolchain for modern AI — from experimentation to production deployment. Understanding HF deeply is essential for any GenAI practitioner.</div>
176
  </div>
177
+
178
+ <h3>1. Transformers Library β€” The Core Engine</h3>
179
+ <p>The <code>transformers</code> library provides a unified API to load, run, and fine-tune any model architecture. It wraps 200+ architectures (GPT, LLaMA, Mistral, Gemma, T5, BERT, ViT, Whisper, etc.) behind <strong>Auto Classes</strong> that detect architecture automatically from <code>config.json</code>.</p>
180
+ <table>
181
+ <tr><th>AutoClass</th><th>Use Case</th><th>Example Models</th></tr>
182
+ <tr><td>AutoModelForCausalLM</td><td>Text generation (decoder-only)</td><td>LLaMA, GPT-2, Mistral, Gemma</td></tr>
183
+ <tr><td>AutoModelForSeq2SeqLM</td><td>Translation, summarization</td><td>T5, BART, mT5, Flan-T5</td></tr>
184
+ <tr><td>AutoModelForSequenceClassification</td><td>Text classification, sentiment</td><td>BERT, RoBERTa, DeBERTa</td></tr>
185
+ <tr><td>AutoModelForTokenClassification</td><td>NER, POS tagging</td><td>BERT-NER, SpanBERT</td></tr>
186
+ <tr><td>AutoModelForQuestionAnswering</td><td>Extractive QA</td><td>BERT-QA, RoBERTa-QA</td></tr>
187
+ <tr><td>AutoModel (base)</td><td>Embeddings, custom heads</td><td>Any backbone</td></tr>
188
+ </table>
189
+ <div class="callout tip">
190
+ <div class="callout-title">💡 Key from_pretrained() Arguments</div>
191
+ <p><code>torch_dtype=torch.bfloat16</code> — half precision, saves 50% memory<br>
192
+ <code>device_map="auto"</code> — auto-shard across GPUs/CPU/disk<br>
193
+ <code>load_in_4bit=True</code> — 4-bit quantization via BitsAndBytes<br>
194
+ <code>attn_implementation="flash_attention_2"</code> — use FlashAttention for 2-4x faster inference<br>
195
+ <code>trust_remote_code=True</code> — needed for custom architectures with code on the Hub</p>
196
+ </div>
197
+
198
+ <h3>2. Pipelines — 20+ Tasks in One Line</h3>
199
+ <p>The <code>pipeline()</code> function wraps tokenization + model + post-processing into a single call. Under the hood: tokenize → model forward pass → decode/format output. Supports these key tasks:</p>
200
+ <table>
201
+ <tr><th>Task</th><th>Pipeline Name</th><th>Typical Model</th></tr>
202
+ <tr><td>Text Generation</td><td><code>"text-generation"</code></td><td>gpt2</td></tr>
203
+ <tr><td>Sentiment Analysis</td><td><code>"sentiment-analysis"</code></td><td>distilbert-sst2</td></tr>
204
+ <tr><td>Named Entity Recognition</td><td><code>"ner"</code></td><td>dbmdz/bert-large-NER</td></tr>
205
+ <tr><td>Summarization</td><td><code>"summarization"</code></td><td>sshleifer/distilbart-cnn</td></tr>
206
+ <tr><td>Translation</td><td><code>"translation_en_to_fr"</code></td><td>Helsinki-NLP/opus-mt</td></tr>
207
+ <tr><td>Zero-Shot Classification</td><td><code>"zero-shot-classification"</code></td><td>facebook/bart-large-mnli</td></tr>
208
+ <tr><td>Feature Extraction (Embeddings)</td><td><code>"feature-extraction"</code></td><td>sentence-transformers</td></tr>
209
+ <tr><td>Image Classification</td><td><code>"image-classification"</code></td><td>google/vit-base</td></tr>
210
+ <tr><td>Speech Recognition</td><td><code>"automatic-speech-recognition"</code></td><td>openai/whisper-base</td></tr>
211
+ <tr><td>Text-to-Image</td><td><code>"text-to-image"</code> (via diffusers)</td><td>stabilityai/sdxl</td></tr>
212
+ </table>
213
+
214
+ <h3>3. Tokenizers Library — Rust-Powered Speed</h3>
215
+ <p>The <code>tokenizers</code> library is written in Rust with Python bindings. It's 10-100x faster than pure-Python tokenizers. Key tokenization algorithms used by modern LLMs:</p>
216
  <table>
217
+ <tr><th>Algorithm</th><th>Used By</th><th>How It Works</th></tr>
218
+ <tr><td>BPE (Byte-Pair Encoding)</td><td>GPT-2, GPT-4, LLaMA</td><td>Repeatedly merges most frequent byte pairs. "unbelievable" → ["un", "believ", "able"]</td></tr>
219
+ <tr><td>SentencePiece (Unigram)</td><td>T5, ALBERT, XLNet</td><td>Statistical model that finds optimal subword segmentation probabilistically</td></tr>
220
+ <tr><td>WordPiece</td><td>BERT, DistilBERT</td><td>Greedy algorithm; splits by maximizing likelihood. Uses "##" prefix for sub-tokens</td></tr>
221
  </table>
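The BPE row above can be illustrated with one toy training step in plain Python — the sketch below uses a made-up three-word corpus (not any real tokenizer's vocabulary) to show the core "count pairs, merge the winner" loop:

```python
# Toy illustration of one BPE training step: find the most frequent
# adjacent symbol pair across the corpus and merge it everywhere.
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Hypothetical corpus: word -> frequency, words pre-split into characters
words = {("l", "o", "w"): 5, ("l", "o", "g"): 4, ("n", "e", "w"): 3}
pair = most_frequent_pair(words)   # ("l", "o") occurs 9 times
words = merge_pair(words, pair)
print(pair, sorted(words))
```

Real tokenizers repeat this loop tens of thousands of times over a large corpus; the learned merge list is what `tokenizer.json` stores.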
222
+ <div class="callout warning">
223
+ <div class="callout-title">⚠️ Tokenizer Gotchas</div>
224
+ <p>• Numbers tokenize unpredictably: "1234" might be 1-4 tokens depending on the model<br>
225
+ • Whitespace matters: " hello" and "hello" produce different tokens in GPT<br>
226
+ • Non-English languages use more tokens per word (higher cost per concept)<br>
227
+ • Always use the model's own tokenizer — never mix tokenizers between models</p>
228
+ </div>
229
+
230
+ <h3>4. Datasets Library — Apache Arrow Under the Hood</h3>
231
+ <p><code>datasets</code> uses <strong>Apache Arrow</strong> for columnar, memory-mapped storage. A 100GB dataset can be iterated without loading into RAM. Key features:</p>
232
+ <p><strong>Memory Mapping:</strong> Data stays on disk; only accessed rows are loaded into memory. <strong>Streaming:</strong> <code>load_dataset(..., streaming=True)</code> returns an iterable — process terabytes with constant memory. <strong>Map/Filter:</strong> Apply transformations with automatic caching and multiprocessing. <strong>Hub Integration:</strong> 150,000+ datasets available via <code>load_dataset("dataset_name")</code>.</p>
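The memory-mapping mechanism Arrow relies on can be sketched with Python's stdlib `mmap` — a stand-in binary file here, not Arrow itself: slicing pulls in only the touched pages instead of reading the whole file:

```python
# Stdlib sketch of memory-mapping: read a slice from the middle of a
# file without loading the rest into RAM.
import mmap
import os
import tempfile

# A ~4 MB file standing in for a large on-disk dataset
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"row-" * 1_000_000)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Slice anywhere in the file; the OS pages in only what is touched
    chunk = mm[2_000_000:2_000_008]
    print(chunk)  # b"row-row-"
    mm.close()
```

Arrow adds a typed columnar layout on top of this, so a row or column slice is an offset computation rather than a copy.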
233
+
234
+ <h3>5. Trainer API — High-Level Training Loop</h3>
235
+ <p>The <code>Trainer</code> class handles: training loop, evaluation, checkpointing, logging (to TensorBoard/W&B), mixed precision, gradient accumulation, distributed training, and early stopping. You just provide model + dataset + TrainingArguments. For instruction-tuning LLMs, use <strong>TRL's SFTTrainer</strong> (built on top of Trainer) which handles chat templates and packing automatically.</p>
236
+
237
+ <h3>6. Accelerate — Distributed Training Made Easy</h3>
238
+ <p><code>accelerate</code> abstracts away multi-GPU, TPU, and mixed-precision complexity. Write your training loop once; run on 1 GPU or 64 GPUs with zero code changes. Key feature: <code>Accelerator</code> class wraps your model, optimizer, and dataloader. It handles data sharding, gradient synchronization, and device placement automatically.</p>
239
+
240
+ <h3>7. Model Hub — Everything Is a Git Repo</h3>
241
+ <p>Every model on HF Hub is a <strong>Git LFS repo</strong> containing: <code>config.json</code> (architecture), <code>model.safetensors</code> (weights), <code>tokenizer.json</code>, and a <code>README.md</code> (model card). You can push your own models with <code>model.push_to_hub()</code>. The Hub supports: model versioning (Git branches/tags), automatic model cards, gated access (license agreements), and API inference endpoints.</p>
242
+
243
+ <h3>8. Additional HF Libraries</h3>
244
+ <table>
245
+ <tr><th>Library</th><th>Purpose</th><th>Key Feature</th></tr>
246
+ <tr><td><code>peft</code></td><td>Parameter-efficient fine-tuning</td><td>LoRA, QLoRA, Adapters, Prompt Tuning</td></tr>
247
+ <tr><td><code>trl</code></td><td>RLHF and alignment training</td><td>SFTTrainer, DPOTrainer, PPOTrainer, RewardTrainer</td></tr>
248
+ <tr><td><code>diffusers</code></td><td>Image/video generation</td><td>Stable Diffusion, SDXL, ControlNet, IP-Adapter</td></tr>
249
+ <tr><td><code>evaluate</code></td><td>Metrics computation</td><td>BLEU, ROUGE, accuracy, perplexity, and 100+ metrics</td></tr>
250
+ <tr><td><code>gradio</code></td><td>Build ML demos</td><td>Web UI for any model in 5 lines of code</td></tr>
251
+ <tr><td><code>smolagents</code></td><td>Lightweight AI agents</td><td>Code-based tool calling, HF model integration</td></tr>
252
+ <tr><td><code>safetensors</code></td><td>Safe model format</td><td>Fast, safe, and efficient tensor serialization (replaces pickle)</td></tr>
253
+ <tr><td><code>huggingface_hub</code></td><td>Hub API client</td><td>Download files, push models, create repos, manage spaces</td></tr>
254
+ </table>
255
+
256
+ <h3>9. Spaces — Deploy ML Apps Free</h3>
257
+ <p>HF Spaces lets you deploy <strong>Gradio</strong> or <strong>Streamlit</strong> apps on managed infrastructure. Free CPU tier for demos; upgrade to T4 ($0.60/hr) or A100 ($3.09/hr) for GPU workloads. Spaces support Docker, static HTML, and custom environments. They auto-build from a Git repo with a simple <code>requirements.txt</code>. Ideal for: model demos, portfolio projects, internal tools, and quick prototypes.</p>
258
  </div>`,
259
  code: `
260
  <div class="section">
261
+ <h2>💻 Hugging Face — Comprehensive Code Examples</h2>
262
+
263
+ <h3>1. Pipelines — Every Task</h3>
264
  <div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> pipeline
265
 
266
+ <span class="comment"># ─── Text Generation ───</span>
267
  gen = pipeline(<span class="string">"text-generation"</span>, model=<span class="string">"meta-llama/Llama-3.2-1B-Instruct"</span>)
268
  result = gen(<span class="string">"Explain RAG in one paragraph:"</span>, max_new_tokens=<span class="number">200</span>)
269
  <span class="function">print</span>(result[<span class="number">0</span>][<span class="string">"generated_text"</span>])
270
 
271
+ <span class="comment"># ─── Sentiment Analysis ───</span>
272
  sa = pipeline(<span class="string">"sentiment-analysis"</span>)
273
+ <span class="function">print</span>(sa(<span class="string">"Hugging Face is amazing!"</span>))
274
+ <span class="comment"># [{'label': 'POSITIVE', 'score': 0.9998}]</span>
275
+
276
+ <span class="comment"># ─── Named Entity Recognition ───</span>
277
+ ner = pipeline(<span class="string">"ner"</span>, grouped_entities=<span class="keyword">True</span>)
278
+ <span class="function">print</span>(ner(<span class="string">"Elon Musk founded SpaceX in California"</span>))
279
+ <span class="comment"># [{'entity_group': 'PER', 'word': 'Elon Musk'}, ...]</span>
280
+
281
+ <span class="comment"># ─── Zero-Shot Classification (no training needed!) ───</span>
282
+ zsc = pipeline(<span class="string">"zero-shot-classification"</span>)
283
+ result = zsc(<span class="string">"I need to fix a bug in my Python code"</span>,
284
+ candidate_labels=[<span class="string">"programming"</span>, <span class="string">"cooking"</span>, <span class="string">"sports"</span>])
285
+ <span class="function">print</span>(result[<span class="string">"labels"</span>][<span class="number">0</span>]) <span class="comment"># "programming"</span>
286
+
287
+ <span class="comment"># ─── Speech Recognition (Whisper) ───</span>
288
+ asr = pipeline(<span class="string">"automatic-speech-recognition"</span>, model=<span class="string">"openai/whisper-large-v3"</span>)
289
+ <span class="function">print</span>(asr(<span class="string">"audio.mp3"</span>)[<span class="string">"text"</span>])</div>
290
+
291
+ <h3>2. Tokenizers Deep Dive</h3>
292
+ <div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> AutoTokenizer
293
+
294
+ tokenizer = AutoTokenizer.from_pretrained(<span class="string">"meta-llama/Llama-3.1-8B-Instruct"</span>)
295
+
296
+ <span class="comment"># Basic tokenization</span>
297
+ text = <span class="string">"Hugging Face transformers are powerful!"</span>
298
+ tokens = tokenizer.tokenize(text)
299
+ ids = tokenizer.encode(text)
300
+ <span class="function">print</span>(<span class="string">f"Tokens: {tokens}"</span>)
301
+ <span class="function">print</span>(<span class="string">f"IDs: {ids}"</span>)
302
+ <span class="function">print</span>(<span class="string">f"Decoded: {tokenizer.decode(ids)}"</span>)
303
+
304
+ <span class="comment"># Chat template (critical for instruction models)</span>
305
+ messages = [
306
+ {<span class="string">"role"</span>: <span class="string">"system"</span>, <span class="string">"content"</span>: <span class="string">"You are a helpful assistant."</span>},
307
+ {<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">"What is LoRA?"</span>}
308
+ ]
309
+ formatted = tokenizer.apply_chat_template(messages, tokenize=<span class="keyword">False</span>)
310
+ <span class="function">print</span>(formatted) <span class="comment"># Proper &lt;|start_header_id|&gt; format for Llama</span>
311
+
312
+ <span class="comment"># Batch tokenization with padding</span>
313
+ batch = tokenizer(
314
+ [<span class="string">"short"</span>, <span class="string">"a much longer sentence here"</span>],
315
+ padding=<span class="keyword">True</span>,
316
+ truncation=<span class="keyword">True</span>,
317
+ max_length=<span class="number">512</span>,
318
+ return_tensors=<span class="string">"pt"</span> <span class="comment"># Returns PyTorch tensors</span>
319
+ )
320
+ <span class="function">print</span>(batch.keys()) <span class="comment"># input_ids, attention_mask</span></div>
321
+
322
+ <h3>3. Datasets Library — Load, Process, Stream</h3>
323
+ <div class="code-block"><span class="keyword">from</span> datasets <span class="keyword">import</span> load_dataset, Dataset
324
+
325
+ <span class="comment"># Load from Hub</span>
326
+ ds = load_dataset(<span class="string">"imdb"</span>)
327
+ <span class="function">print</span>(ds) <span class="comment"># DatasetDict with 'train' and 'test' splits</span>
328
+ <span class="function">print</span>(ds[<span class="string">"train"</span>][<span class="number">0</span>]) <span class="comment"># First example</span>
329
+
330
+ <span class="comment"># Streaming (constant memory for huge datasets)</span>
331
+ stream = load_dataset(<span class="string">"allenai/c4"</span>, split=<span class="string">"train"</span>, streaming=<span class="keyword">True</span>)
332
+ <span class="keyword">for</span> i, example <span class="keyword">in</span> enumerate(stream):
333
+ <span class="keyword">if</span> i >= <span class="number">5</span>: <span class="keyword">break</span>
334
+ <span class="function">print</span>(example[<span class="string">"text"</span>][:<span class="number">100</span>])
335
+
336
+ <span class="comment"># Map with parallel processing</span>
+ <span class="keyword">from</span> transformers <span class="keyword">import</span> AutoTokenizer
+ tokenizer = AutoTokenizer.from_pretrained(<span class="string">"bert-base-uncased"</span>)
337
+ <span class="keyword">def</span> <span class="function">tokenize_fn</span>(examples):
338
+ <span class="keyword">return</span> tokenizer(examples[<span class="string">"text"</span>], truncation=<span class="keyword">True</span>, max_length=<span class="number">512</span>)
339
+
340
+ tokenized = ds[<span class="string">"train"</span>].map(tokenize_fn, batched=<span class="keyword">True</span>, num_proc=<span class="number">4</span>)
341
+
342
+ <span class="comment"># Create custom dataset from dict/pandas</span>
343
+ my_data = Dataset.from_dict({
344
+ <span class="string">"text"</span>: [<span class="string">"Hello world"</span>, <span class="string">"AI is great"</span>],
345
+ <span class="string">"label"</span>: [<span class="number">1</span>, <span class="number">0</span>]
346
+ })
347
+
348
+ <span class="comment"># Push your dataset to Hub</span>
349
+ my_data.push_to_hub(<span class="string">"your-username/my-dataset"</span>)</div>
350
 
351
+ <h3>4. Model Loading — From Basic to Production</h3>
352
  <div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
353
  <span class="keyword">import</span> torch
354
 
355
+ <span class="comment"># ─── Basic Loading (full precision) ───</span>
356
+ model = AutoModelForCausalLM.from_pretrained(<span class="string">"gpt2"</span>)
357
+
358
+ <span class="comment"># ─── Half Precision (saves 50% VRAM) ───</span>
359
+ model = AutoModelForCausalLM.from_pretrained(
360
+ <span class="string">"meta-llama/Llama-3.1-8B-Instruct"</span>,
361
+ torch_dtype=torch.bfloat16,
362
+ device_map=<span class="string">"auto"</span>
363
+ )
364
+
365
+ <span class="comment"># ─── 4-bit Quantization (QLoRA-ready) ───</span>
366
  bnb_config = BitsAndBytesConfig(
367
  load_in_4bit=<span class="keyword">True</span>,
368
+ bnb_4bit_quant_type=<span class="string">"nf4"</span>, <span class="comment"># NormalFloat4 — better than uniform int4</span>
369
  bnb_4bit_compute_dtype=torch.bfloat16,
370
+ bnb_4bit_use_double_quant=<span class="keyword">True</span> <span class="comment"># Quantize the quantization constants too</span>
371
  )
372
  model = AutoModelForCausalLM.from_pretrained(
373
  <span class="string">"meta-llama/Llama-3.1-8B-Instruct"</span>,
374
  quantization_config=bnb_config,
375
  device_map=<span class="string">"auto"</span>
376
  )
377
 
378
+ <span class="comment"># ─── Flash Attention 2 (2-4x faster) ───</span>
379
+ model = AutoModelForCausalLM.from_pretrained(
380
+ <span class="string">"meta-llama/Llama-3.1-8B-Instruct"</span>,
381
+ torch_dtype=torch.bfloat16,
382
+ attn_implementation=<span class="string">"flash_attention_2"</span>,
383
+ device_map=<span class="string">"auto"</span>
384
+ )</div>
385
+
386
+ <h3>5. Trainer API — Full Training Loop</h3>
387
+ <div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
388
+ <span class="keyword">from</span> datasets <span class="keyword">import</span> load_dataset
389
+
390
+ <span class="comment"># Load model, tokenizer, and dataset</span>
391
+ model = AutoModelForSequenceClassification.from_pretrained(<span class="string">"bert-base-uncased"</span>, num_labels=<span class="number">2</span>)
+ tokenizer = AutoTokenizer.from_pretrained(<span class="string">"bert-base-uncased"</span>)
392
+ ds = load_dataset(<span class="string">"imdb"</span>)
393
+ tokenized = ds.map(<span class="keyword">lambda</span> x: tokenizer(x[<span class="string">"text"</span>], truncation=<span class="keyword">True</span>, max_length=<span class="number">512</span>), batched=<span class="keyword">True</span>)
394
+
395
+ <span class="comment"># Configure training</span>
396
+ args = TrainingArguments(
397
+ output_dir=<span class="string">"./results"</span>,
398
+ num_train_epochs=<span class="number">3</span>,
399
+ per_device_train_batch_size=<span class="number">16</span>,
400
+ per_device_eval_batch_size=<span class="number">64</span>,
401
+ learning_rate=<span class="number">2e-5</span>,
402
+ weight_decay=<span class="number">0.01</span>,
403
+ eval_strategy=<span class="string">"epoch"</span>,
404
+ save_strategy=<span class="string">"epoch"</span>,
405
+ load_best_model_at_end=<span class="keyword">True</span>,
406
+ fp16=<span class="keyword">True</span>, <span class="comment"># Mixed precision</span>
407
+ gradient_accumulation_steps=<span class="number">4</span>,
408
+ logging_steps=<span class="number">100</span>,
409
+ report_to=<span class="string">"wandb"</span>, <span class="comment"># Log to Weights & Biases</span>
410
+ )
411
+
412
+ trainer = Trainer(model=model, args=args, train_dataset=tokenized[<span class="string">"train"</span>], eval_dataset=tokenized[<span class="string">"test"</span>])
413
+ trainer.train()
414
+ trainer.push_to_hub() <span class="comment"># Push trained model directly</span></div>
415
+
416
+ <h3>6. Gradio — Build a Demo in 5 Lines</h3>
417
+ <div class="code-block"><span class="keyword">import</span> gradio <span class="keyword">as</span> gr
418
+ <span class="keyword">from</span> transformers <span class="keyword">import</span> pipeline
419
+
420
+ pipe = pipeline(<span class="string">"sentiment-analysis"</span>)
421
+
422
+ <span class="keyword">def</span> <span class="function">analyze</span>(text):
423
+ result = pipe(text)[<span class="number">0</span>]
424
+ <span class="keyword">return</span> <span class="string">f"{result['label']} ({result['score']:.2%})"</span>
425
+
426
+ gr.Interface(fn=analyze, inputs=<span class="string">"text"</span>, outputs=<span class="string">"text"</span>,
427
+ title=<span class="string">"Sentiment Analyzer"</span>).launch()
428
+ <span class="comment"># Runs at http://localhost:7860 — deploy to HF Spaces for free!</span></div>
429
+
430
+ <h3>7. Hub API — Programmatic Access</h3>
431
+ <div class="code-block"><span class="keyword">from</span> huggingface_hub <span class="keyword">import</span> HfApi, hf_hub_download, login
432
+
433
+ <span class="comment"># Login</span>
434
+ login(token=<span class="string">"hf_your_token"</span>) <span class="comment"># or: huggingface-cli login</span>
435
+
436
+ api = HfApi()
437
+
438
+ <span class="comment"># List models by task</span>
439
+ models = api.list_models(filter=<span class="string">"text-generation"</span>, sort=<span class="string">"downloads"</span>, limit=<span class="number">5</span>)
440
+ <span class="keyword">for</span> m <span class="keyword">in</span> models:
441
+ <span class="function">print</span>(<span class="string">f"{m.id}: {m.downloads} downloads"</span>)
442
+
443
+ <span class="comment"># Download specific file</span>
444
+ path = hf_hub_download(repo_id=<span class="string">"meta-llama/Llama-3.1-8B"</span>, filename=<span class="string">"config.json"</span>)
445
+
446
+ <span class="comment"># Push model to Hub</span>
447
+ model.push_to_hub(<span class="string">"your-username/my-model"</span>)
448
+ tokenizer.push_to_hub(<span class="string">"your-username/my-model"</span>)
449
 
450
+ <span class="comment"># Create a new Space</span>
451
+ api.create_repo(<span class="string">"your-username/my-demo"</span>, repo_type=<span class="string">"space"</span>, space_sdk=<span class="string">"gradio"</span>)</div>
452
  </div>`,
453
  interview: `
454
  <div class="section">
455
+ <h2>🎯 Hugging Face — In-Depth Interview Questions</h2>
456
+ <div class="interview-box"><strong>Q1: What's the difference between <code>from_pretrained</code> and <code>pipeline</code>?</strong><p><strong>Answer:</strong> <code>pipeline()</code> is a high-level convenience wrapper — it auto-detects the task, loads both model + tokenizer, handles tokenization/decoding, and returns human-readable output. <code>from_pretrained()</code> gives raw access to model weights for: custom inference loops, fine-tuning, extracting embeddings, modifying the model architecture, or anything beyond standard inference. Rule: prototyping → pipeline, production/training → from_pretrained.</p></div>
457
+ <div class="interview-box"><strong>Q2: What is <code>device_map="auto"</code> and how does model sharding work?</strong><p><strong>Answer:</strong> It uses the <code>accelerate</code> library to automatically distribute model layers across available hardware. The algorithm: (1) Measure available memory on each GPU, CPU, and disk; (2) Place layers sequentially, filling GPU first, spilling to CPU, then disk. For an 80-layer model split across two GPUs: layers 0-39 on GPU 0, layers 40-79 on GPU 1. CPU/disk offloading adds latency but enables running models that don't fit in GPU memory at all. Use the <code>max_memory</code> param to control allocation.</p></div>
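The greedy placement described above can be simulated in plain Python. This is an illustrative sketch, not accelerate's actual implementation; the per-layer size and memory budgets below are made-up numbers:

```python
# Illustrative simulation of greedy layer placement (NOT accelerate's
# real code): fill each device in order until its budget is exhausted.
def build_device_map(n_layers, layer_gb, budgets):
    """budgets: dict of device name -> free memory in GB, in fill order."""
    device_map, devices = {}, list(budgets.items())
    idx, free = 0, devices[0][1]
    for layer in range(n_layers):
        while free < layer_gb:        # current device is full -> move on
            idx += 1
            free = devices[idx][1]
        device_map[f"layers.{layer}"] = devices[idx][0]
        free -= layer_gb
    return device_map

# 80 layers at ~1.75 GB each (roughly a 70B model's layers in fp16)
budgets = {"cuda:0": 24, "cuda:1": 24, "cpu": 200}
dm = build_device_map(80, 1.75, budgets)
print(dm["layers.0"], dm["layers.13"], dm["layers.27"])
```

With 24 GB per GPU, 13 layers fit on each card and the remainder spills to CPU, which mirrors why `device_map="auto"` on undersized GPUs silently makes inference much slower.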
458
+ <div class="interview-box"><strong>Q3: Why use HF Datasets over pandas, and how does Apache Arrow help?</strong><p><strong>Answer:</strong> Datasets uses <strong>Apache Arrow</strong> — a columnar, memory-mapped format. Key advantages: (1) <strong>Memory mapping:</strong> A 100GB dataset uses near-zero RAM — data stays on disk but accessed at near-RAM speed via OS page cache. (2) <strong>Zero-copy:</strong> Slicing doesn't duplicate data. (3) <strong>Streaming:</strong> Process datasets larger than disk with <code>streaming=True</code>. (4) <strong>Parallel map:</strong> <code>num_proc=N</code> for multi-core preprocessing. (5) <strong>Caching:</strong> Processed results are automatically cached to disk. Pandas loads everything into RAM — impossible for large-scale ML datasets.</p></div>
459
+ <div class="interview-box"><strong>Q4: What is a chat template and why does it matter?</strong><p><strong>Answer:</strong> Each instruction-tuned model is trained with a specific format for system/user/assistant messages. Llama uses <code>&lt;|begin_of_text|&gt;&lt;|start_header_id|&gt;system&lt;|end_header_id|&gt;</code>, while ChatML uses <code>&lt;|im_start|&gt;system</code>. If you format input incorrectly, the model behaves like a base model (no instruction following). <code>tokenizer.apply_chat_template()</code> auto-formats messages correctly for any model. This is the #1 mistake beginners make — using raw text instead of the chat template.</p></div>
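To make the answer concrete, here is a hand-rolled sketch of a Llama-3-style template. This is simplified and hypothetical (`render_llama3` is not a real API; the real template is a Jinja string shipped in the model's tokenizer config), but it shows the special tokens `apply_chat_template` inserts:

```python
# Simplified, hand-rolled Llama-3-style chat template (illustration only;
# real models ship their own template in tokenizer_config.json).
def render_llama3(messages):
    out = "<|begin_of_text|>"
    for m in messages:
        out += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
        out += m["content"] + "<|eot_id|>"
    # Trailing header cues the model to generate the assistant turn
    out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out

msgs = [{"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is LoRA?"}]
prompt = render_llama3(msgs)
print(prompt)
```

Feeding the raw string "What is LoRA?" instead of this formatted prompt is exactly the beginner mistake the answer describes.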
460
+ <div class="interview-box"><strong>Q5: How do you handle gated models (Llama, Gemma) in production?</strong><p><strong>Answer:</strong> (1) Accept the model license on the Hub model page. (2) Create a read token at hf.co/settings/tokens. (3) For local: <code>huggingface-cli login</code>. (4) In CI/CD: set the <code>HF_TOKEN</code> environment variable. (5) In code: pass <code>token="hf_xxx"</code> to <code>from_pretrained()</code>. For Docker: inject the token as a build or runtime secret; never bake it into the image. For Kubernetes: use a Secret mounted as an env var. The token is only needed for download — once cached locally, no token is needed for inference.</p></div>
461
+ <div class="interview-box"><strong>Q6: What is safetensors and why replace pickle?</strong><p><strong>Answer:</strong> Traditional PyTorch models use Python's <code>pickle</code> format, which can execute arbitrary code during loading — a <strong>security vulnerability</strong>. A malicious model file could run code on your machine when loaded. <code>safetensors</code> is a safe, fast tensor format that: (1) Cannot execute code (pure data), (2) Supports zero-copy loading (memory-mapped), (3) Is 2-5x faster to load than pickle, (4) Supports lazy loading (load only specific tensors). It's now the default format on HF Hub.</p></div>
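The safetensors layout (an 8-byte little-endian header length, a JSON header mapping tensor names to dtype/shape/byte offsets, then raw tensor bytes) is simple enough to build and re-parse by hand with the stdlib, which makes the "cannot execute code" claim tangible: parsing only ever JSON-decodes and slices bytes. A minimal sketch of that file layout, not the `safetensors` library itself:

```python
# Build and re-parse a minimal safetensors-style blob by hand to show
# why it's safe: it is pure data (JSON header + raw bytes).
import json
import struct

# One fp32 tensor "w" of shape (2,), stored as 8 raw little-endian bytes
data = struct.pack("<2f", 1.0, 2.0)
header = json.dumps({"w": {"dtype": "F32", "shape": [2],
                           "data_offsets": [0, 8]}}).encode()
blob = struct.pack("<Q", len(header)) + header + data

# Parsing: read header length, decode JSON, slice the tensor's bytes
n = struct.unpack("<Q", blob[:8])[0]
meta = json.loads(blob[8:8 + n])
start, end = meta["w"]["data_offsets"]
values = struct.unpack("<2f", blob[8 + n + start:8 + n + end])
print(meta["w"]["shape"], values)
```

Because the offsets are explicit in the header, a loader can also memory-map the file and pull out a single tensor lazily, which is where the zero-copy and lazy-loading benefits in the answer come from.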
462
  </div>`
463
  },
464
  'finetuning': {