AashishAIHub committed on
Commit
4145e94
·
1 Parent(s): c3413c4

feat: Expand Hugging Face module to comprehensive deep-dive


- 9 in-depth concept sections: Transformers, Pipelines, Tokenizers,
Datasets, Trainer API, Accelerate, Model Hub, Libraries, Spaces
- 7 comprehensive code examples: Pipelines, Tokenizers, Datasets,
Model Loading (basic/fp16/4bit/flash), Trainer, Gradio, Hub API
- 6 expert interview questions covering from_pretrained vs pipeline,
device_map sharding, Arrow internals, chat templates, gated models,
safetensors security

Files changed (1)
  1. GenAI-AgenticAI/app.js +257 -42
GenAI-AgenticAI/app.js CHANGED
@@ -169,81 +169,296 @@ attn_layer0 = outputs.attentions[<span class="number">0</span>] <span class="co
169
  'huggingface': {
170
  concepts: `
171
  <div class="section">
172
- <h2>🤗 Hugging Face Ecosystem</h2>
173
  <div class="info-box">
174
  <div class="box-title">⚡ The GitHub of AI</div>
175
- <div class="box-content">Hugging Face (HF) is the central hub for the ML community. With 500,000+ models, 100,000+ datasets, and libraries like <strong>Transformers</strong>, <strong>Diffusers</strong>, and <strong>PEFT</strong>, it's the standard toolchain for working with LLMs — from experimentation to production.</div>
176
  </div>
177
- <h3>Core Libraries</h3>
178
  <table>
179
- <tr><th>Library</th><th>Purpose</th><th>Key Classes</th></tr>
180
- <tr><td><code>transformers</code></td><td>Load & run any model</td><td>AutoModel, Pipeline, Trainer</td></tr>
181
- <tr><td><code>datasets</code></td><td>Load & process datasets</td><td>load_dataset, Dataset, DatasetDict</td></tr>
182
- <tr><td><code>tokenizers</code></td><td>Fast tokenization (Rust)</td><td>AutoTokenizer, PreTrainedTokenizerFast</td></tr>
183
- <tr><td><code>peft</code></td><td>Parameter-efficient fine-tuning</td><td>LoraConfig, get_peft_model</td></tr>
184
- <tr><td><code>accelerate</code></td><td>Distributed training / mixed precision</td><td>Accelerator, prepare()</td></tr>
185
- <tr><td><code>huggingface_hub</code></td><td>Interact with Model Hub</td><td>hf_hub_download, push_to_hub</td></tr>
186
  </table>
187
- <h3>Pipelines — The Fastest Path</h3>
188
- <p>The <code>pipeline()</code> function wraps tokenization + model + post-processing into one call. Perfect for quickly testing a model. Under the hood: tokenize → model forward pass → decode output. Supports 20+ tasks: text-generation, sentiment-analysis, NER, summarization, translation, image-classification, and more.</p>
189
- <h3>AutoClasses — Flexible Model Loading</h3>
190
- <p><code>AutoModelForCausalLM.from_pretrained()</code> auto-detects the model architecture from its <code>config.json</code>. Key arguments: <code>torch_dtype=torch.float16</code> (half precision), <code>device_map="auto"</code> (auto-shard across GPUs), <code>load_in_4bit=True</code> (quantize at load).</p>
191
- <h3>Spaces — Deploy in One Click</h3>
192
- <p>HF Spaces lets you deploy Gradio/Streamlit apps on free-tier hardware. For ML demos, use <strong>Gradio</strong> (built in — HF supports it natively). Spaces support: CPU free tier, GPU T4 ($0.60/hr), A100 ($3/hr). You can host your fine-tuned models with a web UI at no cost on the free tier.</p>
193
  </div>`,
194
  code: `
195
  <div class="section">
196
- <h2>💻 Hugging Face Code Examples</h2>
197
- <h3>Pipelines — Zero Boilerplate</h3>
198
  <div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> pipeline
199
 
200
- <span class="comment"># Text generation</span>
201
  gen = pipeline(<span class="string">"text-generation"</span>, model=<span class="string">"meta-llama/Llama-3.2-1B-Instruct"</span>)
202
  result = gen(<span class="string">"Explain RAG in one paragraph:"</span>, max_new_tokens=<span class="number">200</span>)
203
  <span class="function">print</span>(result[<span class="number">0</span>][<span class="string">"generated_text"</span>])
204
 
205
- <span class="comment"># Sentiment analysis</span>
206
  sa = pipeline(<span class="string">"sentiment-analysis"</span>)
207
- <span class="function">print</span>(sa(<span class="string">"This model is absolutely incredible!"</span>))
208
 
209
- <span class="comment"># Summarization</span>
210
- summ = pipeline(<span class="string">"summarization"</span>, model=<span class="string">"facebook/bart-large-cnn"</span>)
211
- <span class="function">print</span>(summ(long_article, max_length=<span class="number">130</span>))</div>
212
- <h3>Loading Models with BitsAndBytes Quantization</h3>
213
  <div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
214
  <span class="keyword">import</span> torch
215
 
216
  bnb_config = BitsAndBytesConfig(
217
  load_in_4bit=<span class="keyword">True</span>,
218
- bnb_4bit_quant_type=<span class="string">"nf4"</span>,
219
  bnb_4bit_compute_dtype=torch.bfloat16,
220
- bnb_4bit_use_double_quant=<span class="keyword">True</span> <span class="comment"># QLoRA-style</span>
221
  )
222
-
223
  model = AutoModelForCausalLM.from_pretrained(
224
  <span class="string">"meta-llama/Llama-3.1-8B-Instruct"</span>,
225
  quantization_config=bnb_config,
226
  device_map=<span class="string">"auto"</span>
227
  )
228
- tokenizer = AutoTokenizer.from_pretrained(<span class="string">"meta-llama/Llama-3.1-8B-Instruct"</span>)</div>
229
- <h3>Push Model to Hub</h3>
230
- <div class="code-block"><span class="keyword">from</span> huggingface_hub <span class="keyword">import</span> HfApi
231
- <span class="keyword">from</span> transformers <span class="keyword">import</span> AutoModelForCausalLM
232
 
233
- <span class="comment"># Login first: huggingface-cli login</span>
234
- model.push_to_hub(<span class="string">"your-username/my-finetuned-llama"</span>)
235
- tokenizer.push_to_hub(<span class="string">"your-username/my-finetuned-llama"</span>)
236
 
237
- <span class="comment"># Load it anywhere</span>
238
- model = AutoModelForCausalLM.from_pretrained(<span class="string">"your-username/my-finetuned-llama"</span>)</div>
239
  </div>`,
240
  interview: `
241
  <div class="section">
242
- <h2>🎯 Hugging Face Interview Questions</h2>
243
- <div class="interview-box"><strong>Q1: What's the difference between <code>from_pretrained</code> and <code>pipeline</code>?</strong><p><strong>Answer:</strong> <code>pipeline()</code> is a high-level convenience wrapper that handles tokenization, model forward pass, and output decoding automatically. <code>from_pretrained()</code> gives you raw access to the model and tokenizer for customization. Use pipelines for quick experiments; use raw classes for fine-tuning, custom inference loops, or production.</p></div>
244
- <div class="interview-box"><strong>Q2: What is <code>device_map="auto"</code>?</strong><p><strong>Answer:</strong> It uses the <code>accelerate</code> library to automatically shard a model across available devices (multiple GPUs, CPU, disk). It creates a "device map" placing layers on available memory. Essential for loading 70B+ models that don't fit on a single GPU. Uses <code>offload_folder</code> to spill overflow to CPU/disk.</p></div>
245
- <div class="interview-box"><strong>Q3: What are HF Datasets and why use them over pandas?</strong><p><strong>Answer:</strong> HF Datasets uses <strong>Apache Arrow</strong> for memory-mapped, zero-copy access. A 100GB dataset can be iterated without loading into RAM. It supports streaming (<code>streaming=True</code>), map operations that run in parallel, automatic caching, and integration with the Trainer API. Much better than pandas for large-scale ML data.</p></div>
246
- <div class="interview-box"><strong>Q4: How do you run inference with a gated model (like Llama)?</strong><p><strong>Answer:</strong> (1) Accept the license on the model page at huggingface.co, (2) Generate a HF token at hf.co/settings/tokens, (3) Run <code>huggingface-cli login</code> or pass <code>token=os.environ["HF_TOKEN"]</code> to <code>from_pretrained()</code>. In production, set the HF_TOKEN as an environment secret.</p></div>
247
  </div>`
248
  },
249
  'finetuning': {
 
169
  'huggingface': {
170
  concepts: `
171
  <div class="section">
172
+ <h2>🤗 Hugging Face Deep Dive — The Complete Ecosystem</h2>
173
  <div class="info-box">
174
  <div class="box-title">⚡ The GitHub of AI</div>
175
+ <div class="box-content">Hugging Face (HF) is the central hub for the ML community. With <strong>700,000+ models</strong>, <strong>150,000+ datasets</strong>, and 15+ libraries, it's the standard toolchain for modern AI — from experimentation to production deployment. Understanding HF deeply is essential for any GenAI practitioner.</div>
176
  </div>
177
+
178
+ <h3>1. Transformers Library β€” The Core Engine</h3>
179
+ <p>The <code>transformers</code> library provides a unified API to load, run, and fine-tune any model architecture. It wraps 200+ architectures (GPT, LLaMA, Mistral, Gemma, T5, BERT, ViT, Whisper, etc.) behind <strong>Auto Classes</strong> that detect architecture automatically from <code>config.json</code>.</p>
180
+ <table>
181
+ <tr><th>AutoClass</th><th>Use Case</th><th>Example Models</th></tr>
182
+ <tr><td>AutoModelForCausalLM</td><td>Text generation (decoder-only)</td><td>LLaMA, GPT-2, Mistral, Gemma</td></tr>
183
+ <tr><td>AutoModelForSeq2SeqLM</td><td>Translation, summarization</td><td>T5, BART, mT5, Flan-T5</td></tr>
184
+ <tr><td>AutoModelForSequenceClassification</td><td>Text classification, sentiment</td><td>BERT, RoBERTa, DeBERTa</td></tr>
185
+ <tr><td>AutoModelForTokenClassification</td><td>NER, POS tagging</td><td>BERT-NER, SpanBERT</td></tr>
186
+ <tr><td>AutoModelForQuestionAnswering</td><td>Extractive QA</td><td>BERT-QA, RoBERTa-QA</td></tr>
187
+ <tr><td>AutoModel (base)</td><td>Embeddings, custom heads</td><td>Any backbone</td></tr>
188
+ </table>
189
+ <div class="callout tip">
190
+ <div class="callout-title">💡 Key from_pretrained() Arguments</div>
191
+ <p><code>torch_dtype=torch.bfloat16</code> — half precision, saves 50% memory<br>
192
+ <code>device_map="auto"</code> — auto-shard across GPUs/CPU/disk<br>
193
+ <code>load_in_4bit=True</code> — 4-bit quantization via BitsAndBytes<br>
194
+ <code>attn_implementation="flash_attention_2"</code> — use FlashAttention for 2-4x faster inference<br>
195
+ <code>trust_remote_code=True</code> — needed for custom architectures with code on the Hub</p>
196
+ </div>
197
+
198
+ <h3>2. Pipelines — 20+ Tasks in One Line</h3>
199
+ <p>The <code>pipeline()</code> function wraps tokenization + model + post-processing into a single call. Under the hood: tokenize → model forward pass → decode/format output. Supports these key tasks:</p>
200
+ <table>
201
+ <tr><th>Task</th><th>Pipeline Name</th><th>Typical Model</th></tr>
202
+ <tr><td>Text Generation</td><td><code>"text-generation"</code></td><td>gpt2</td></tr>
203
+ <tr><td>Sentiment Analysis</td><td><code>"sentiment-analysis"</code></td><td>distilbert-sst2</td></tr>
204
+ <tr><td>Named Entity Recognition</td><td><code>"ner"</code></td><td>dbmdz/bert-large-NER</td></tr>
205
+ <tr><td>Summarization</td><td><code>"summarization"</code></td><td>sshleifer/distilbart-cnn</td></tr>
206
+ <tr><td>Translation</td><td><code>"translation_en_to_fr"</code></td><td>Helsinki-NLP/opus-mt</td></tr>
207
+ <tr><td>Zero-Shot Classification</td><td><code>"zero-shot-classification"</code></td><td>facebook/bart-large-mnli</td></tr>
208
+ <tr><td>Feature Extraction (Embeddings)</td><td><code>"feature-extraction"</code></td><td>sentence-transformers</td></tr>
209
+ <tr><td>Image Classification</td><td><code>"image-classification"</code></td><td>google/vit-base</td></tr>
210
+ <tr><td>Speech Recognition</td><td><code>"automatic-speech-recognition"</code></td><td>openai/whisper-base</td></tr>
211
+ <tr><td>Text-to-Image</td><td><code>"text-to-image"</code> (via diffusers)</td><td>stabilityai/sdxl</td></tr>
212
+ </table>
213
+
214
+ <h3>3. Tokenizers Library — Rust-Powered Speed</h3>
215
+ <p>The <code>tokenizers</code> library is written in Rust with Python bindings. It's 10-100x faster than pure-Python tokenizers. Key tokenization algorithms used by modern LLMs:</p>
216
  <table>
217
+ <tr><th>Algorithm</th><th>Used By</th><th>How It Works</th></tr>
218
+ <tr><td>BPE (Byte-Pair Encoding)</td><td>GPT-2, GPT-4, LLaMA</td><td>Repeatedly merges most frequent byte pairs. "unbelievable" → ["un", "believ", "able"]</td></tr>
219
+ <tr><td>SentencePiece (Unigram)</td><td>T5, ALBERT, XLNet</td><td>Statistical model that finds optimal subword segmentation probabilistically</td></tr>
220
+ <tr><td>WordPiece</td><td>BERT, DistilBERT</td><td>Greedy algorithm; splits by maximizing likelihood. Uses "##" prefix for sub-tokens</td></tr>
221
  </table>
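The BPE row above can be illustrated with one toy training step in plain Python — the sketch below uses a made-up three-word corpus (not any real tokenizer's vocabulary) to show the core "count pairs, merge the winner" loop:

```python
# Toy illustration of one BPE training step: find the most frequent
# adjacent symbol pair across the corpus and merge it everywhere.
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Hypothetical corpus: word -> frequency, words pre-split into characters
words = {("l", "o", "w"): 5, ("l", "o", "g"): 4, ("n", "e", "w"): 3}
pair = most_frequent_pair(words)   # ("l", "o") occurs 9 times
words = merge_pair(words, pair)
print(pair, sorted(words))
```

Real tokenizers repeat this loop tens of thousands of times over a large corpus; the learned merge list is what `tokenizer.json` stores.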
222
+ <div class="callout warning">
223
+ <div class="callout-title">⚠️ Tokenizer Gotchas</div>
224
+ <p>• Numbers tokenize unpredictably: "1234" might be 1-4 tokens depending on the model<br>
225
+ • Whitespace matters: " hello" and "hello" produce different tokens in GPT<br>
226
+ • Non-English languages use more tokens per word (higher cost per concept)<br>
227
+ • Always use the model's own tokenizer — never mix tokenizers between models</p>
228
+ </div>
229
+
230
+ <h3>4. Datasets Library — Apache Arrow Under the Hood</h3>
231
+ <p><code>datasets</code> uses <strong>Apache Arrow</strong> for columnar, memory-mapped storage. A 100GB dataset can be iterated without loading into RAM. Key features:</p>
232
+ <p><strong>Memory Mapping:</strong> Data stays on disk; only accessed rows are loaded into memory. <strong>Streaming:</strong> <code>load_dataset(..., streaming=True)</code> returns an iterable — process terabytes with constant memory. <strong>Map/Filter:</strong> Apply transformations with automatic caching and multiprocessing. <strong>Hub Integration:</strong> 150,000+ datasets available via <code>load_dataset("dataset_name")</code>.</p>
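The memory-mapping mechanism Arrow relies on can be sketched with Python's stdlib `mmap` — a stand-in binary file here, not Arrow itself: slicing pulls in only the touched pages instead of reading the whole file:

```python
# Stdlib sketch of memory-mapping: read a slice from the middle of a
# file without loading the rest into RAM.
import mmap
import os
import tempfile

# A ~4 MB file standing in for a large on-disk dataset
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"row-" * 1_000_000)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Slice anywhere in the file; the OS pages in only what is touched
    chunk = mm[2_000_000:2_000_008]
    print(chunk)  # b"row-row-"
    mm.close()
```

Arrow adds a typed columnar layout on top of this, so a row or column slice is an offset computation rather than a copy.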
233
+
234
+ <h3>5. Trainer API — High-Level Training Loop</h3>
235
+ <p>The <code>Trainer</code> class handles: training loop, evaluation, checkpointing, logging (to TensorBoard/W&B), mixed precision, gradient accumulation, distributed training, and early stopping. You just provide model + dataset + TrainingArguments. For instruction-tuning LLMs, use <strong>TRL's SFTTrainer</strong> (built on top of Trainer) which handles chat templates and packing automatically.</p>
236
+
237
+ <h3>6. Accelerate — Distributed Training Made Easy</h3>
238
+ <p><code>accelerate</code> abstracts away multi-GPU, TPU, and mixed-precision complexity. Write your training loop once; run on 1 GPU or 64 GPUs with zero code changes. Key feature: <code>Accelerator</code> class wraps your model, optimizer, and dataloader. It handles data sharding, gradient synchronization, and device placement automatically.</p>
239
+
240
+ <h3>7. Model Hub — Everything Is a Git Repo</h3>
241
+ <p>Every model on HF Hub is a <strong>Git LFS repo</strong> containing: <code>config.json</code> (architecture), <code>model.safetensors</code> (weights), <code>tokenizer.json</code>, and a <code>README.md</code> (model card). You can push your own models with <code>model.push_to_hub()</code>. The Hub supports: model versioning (Git branches/tags), automatic model cards, gated access (license agreements), and API inference endpoints.</p>
242
+
243
+ <h3>8. Additional HF Libraries</h3>
244
+ <table>
245
+ <tr><th>Library</th><th>Purpose</th><th>Key Feature</th></tr>
246
+ <tr><td><code>peft</code></td><td>Parameter-efficient fine-tuning</td><td>LoRA, QLoRA, Adapters, Prompt Tuning</td></tr>
247
+ <tr><td><code>trl</code></td><td>RLHF and alignment training</td><td>SFTTrainer, DPOTrainer, PPOTrainer, RewardTrainer</td></tr>
248
+ <tr><td><code>diffusers</code></td><td>Image/video generation</td><td>Stable Diffusion, SDXL, ControlNet, IP-Adapter</td></tr>
249
+ <tr><td><code>evaluate</code></td><td>Metrics computation</td><td>BLEU, ROUGE, accuracy, perplexity, and 100+ metrics</td></tr>
250
+ <tr><td><code>gradio</code></td><td>Build ML demos</td><td>Web UI for any model in 5 lines of code</td></tr>
251
+ <tr><td><code>smolagents</code></td><td>Lightweight AI agents</td><td>Code-based tool calling, HF model integration</td></tr>
252
+ <tr><td><code>safetensors</code></td><td>Safe model format</td><td>Fast, safe, and efficient tensor serialization (replaces pickle)</td></tr>
253
+ <tr><td><code>huggingface_hub</code></td><td>Hub API client</td><td>Download files, push models, create repos, manage spaces</td></tr>
254
+ </table>
255
+
256
+ <h3>9. Spaces — Deploy ML Apps Free</h3>
257
+ <p>HF Spaces lets you deploy <strong>Gradio</strong> or <strong>Streamlit</strong> apps on managed infrastructure. Free CPU tier for demos; upgrade to T4 ($0.60/hr) or A100 ($3.09/hr) for GPU workloads. Spaces support Docker, static HTML, and custom environments. They auto-build from a Git repo with a simple <code>requirements.txt</code>. Ideal for: model demos, portfolio projects, internal tools, and quick prototypes.</p>
258
  </div>`,
259
  code: `
260
  <div class="section">
261
+ <h2>💻 Hugging Face — Comprehensive Code Examples</h2>
262
+
263
+ <h3>1. Pipelines — Every Task</h3>
264
  <div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> pipeline
265
 
266
+ <span class="comment"># ─── Text Generation ───</span>
267
  gen = pipeline(<span class="string">"text-generation"</span>, model=<span class="string">"meta-llama/Llama-3.2-1B-Instruct"</span>)
268
  result = gen(<span class="string">"Explain RAG in one paragraph:"</span>, max_new_tokens=<span class="number">200</span>)
269
  <span class="function">print</span>(result[<span class="number">0</span>][<span class="string">"generated_text"</span>])
270
 
271
+ <span class="comment"># ─── Sentiment Analysis ───</span>
272
  sa = pipeline(<span class="string">"sentiment-analysis"</span>)
273
+ <span class="function">print</span>(sa(<span class="string">"Hugging Face is amazing!"</span>))
274
+ <span class="comment"># [{'label': 'POSITIVE', 'score': 0.9998}]</span>
275
+
276
+ <span class="comment"># ─── Named Entity Recognition ───</span>
277
+ ner = pipeline(<span class="string">"ner"</span>, grouped_entities=<span class="keyword">True</span>)
278
+ <span class="function">print</span>(ner(<span class="string">"Elon Musk founded SpaceX in California"</span>))
279
+ <span class="comment"># [{'entity_group': 'PER', 'word': 'Elon Musk'}, ...]</span>
280
+
281
+ <span class="comment"># ─── Zero-Shot Classification (no training needed!) ───</span>
282
+ zsc = pipeline(<span class="string">"zero-shot-classification"</span>)
283
+ result = zsc(<span class="string">"I need to fix a bug in my Python code"</span>,
284
+ candidate_labels=[<span class="string">"programming"</span>, <span class="string">"cooking"</span>, <span class="string">"sports"</span>])
285
+ <span class="function">print</span>(result[<span class="string">"labels"</span>][<span class="number">0</span>]) <span class="comment"># "programming"</span>
286
+
287
+ <span class="comment"># ─── Speech Recognition (Whisper) ───</span>
288
+ asr = pipeline(<span class="string">"automatic-speech-recognition"</span>, model=<span class="string">"openai/whisper-large-v3"</span>)
289
+ <span class="function">print</span>(asr(<span class="string">"audio.mp3"</span>)[<span class="string">"text"</span>])</div>
290
+
291
+ <h3>2. Tokenizers Deep Dive</h3>
292
+ <div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> AutoTokenizer
293
+
294
+ tokenizer = AutoTokenizer.from_pretrained(<span class="string">"meta-llama/Llama-3.1-8B-Instruct"</span>)
295
+
296
+ <span class="comment"># Basic tokenization</span>
297
+ text = <span class="string">"Hugging Face transformers are powerful!"</span>
298
+ tokens = tokenizer.tokenize(text)
299
+ ids = tokenizer.encode(text)
300
+ <span class="function">print</span>(<span class="string">f"Tokens: {tokens}"</span>)
301
+ <span class="function">print</span>(<span class="string">f"IDs: {ids}"</span>)
302
+ <span class="function">print</span>(<span class="string">f"Decoded: {tokenizer.decode(ids)}"</span>)
303
+
304
+ <span class="comment"># Chat template (critical for instruction models)</span>
305
+ messages = [
306
+ {<span class="string">"role"</span>: <span class="string">"system"</span>, <span class="string">"content"</span>: <span class="string">"You are a helpful assistant."</span>},
307
+ {<span class="string">"role"</span>: <span class="string">"user"</span>, <span class="string">"content"</span>: <span class="string">"What is LoRA?"</span>}
308
+ ]
309
+ formatted = tokenizer.apply_chat_template(messages, tokenize=<span class="keyword">False</span>)
310
+ <span class="function">print</span>(formatted) <span class="comment"># Proper &lt;|start_header_id|&gt; format for Llama</span>
311
+
312
+ <span class="comment"># Batch tokenization with padding</span>
313
+ batch = tokenizer(
314
+ [<span class="string">"short"</span>, <span class="string">"a much longer sentence here"</span>],
315
+ padding=<span class="keyword">True</span>,
316
+ truncation=<span class="keyword">True</span>,
317
+ max_length=<span class="number">512</span>,
318
+ return_tensors=<span class="string">"pt"</span> <span class="comment"># Returns PyTorch tensors</span>
319
+ )
320
+ <span class="function">print</span>(batch.keys()) <span class="comment"># input_ids, attention_mask</span></div>
321
+
322
+ <h3>3. Datasets Library — Load, Process, Stream</h3>
323
+ <div class="code-block"><span class="keyword">from</span> datasets <span class="keyword">import</span> load_dataset, Dataset
324
+
325
+ <span class="comment"># Load from Hub</span>
326
+ ds = load_dataset(<span class="string">"imdb"</span>)
327
+ <span class="function">print</span>(ds) <span class="comment"># DatasetDict with 'train' and 'test' splits</span>
328
+ <span class="function">print</span>(ds[<span class="string">"train"</span>][<span class="number">0</span>]) <span class="comment"># First example</span>
329
+
330
+ <span class="comment"># Streaming (constant memory for huge datasets)</span>
331
+ stream = load_dataset(<span class="string">"allenai/c4"</span>, split=<span class="string">"train"</span>, streaming=<span class="keyword">True</span>)
332
+ <span class="keyword">for</span> i, example <span class="keyword">in</span> enumerate(stream):
333
+ <span class="keyword">if</span> i >= <span class="number">5</span>: <span class="keyword">break</span>
334
+ <span class="function">print</span>(example[<span class="string">"text"</span>][:<span class="number">100</span>])
335
+
336
+ <span class="comment"># Map with parallel processing</span>
+ <span class="keyword">from</span> transformers <span class="keyword">import</span> AutoTokenizer
+ tokenizer = AutoTokenizer.from_pretrained(<span class="string">"bert-base-uncased"</span>)
337
+ <span class="keyword">def</span> <span class="function">tokenize_fn</span>(examples):
338
+ <span class="keyword">return</span> tokenizer(examples[<span class="string">"text"</span>], truncation=<span class="keyword">True</span>, max_length=<span class="number">512</span>)
339
+
340
+ tokenized = ds[<span class="string">"train"</span>].map(tokenize_fn, batched=<span class="keyword">True</span>, num_proc=<span class="number">4</span>)
341
+
342
+ <span class="comment"># Create custom dataset from dict/pandas</span>
343
+ my_data = Dataset.from_dict({
344
+ <span class="string">"text"</span>: [<span class="string">"Hello world"</span>, <span class="string">"AI is great"</span>],
345
+ <span class="string">"label"</span>: [<span class="number">1</span>, <span class="number">0</span>]
346
+ })
347
+
348
+ <span class="comment"># Push your dataset to Hub</span>
349
+ my_data.push_to_hub(<span class="string">"your-username/my-dataset"</span>)</div>
350
 
351
+ <h3>4. Model Loading — From Basic to Production</h3>
352
  <div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
353
  <span class="keyword">import</span> torch
354
 
355
+ <span class="comment"># ─── Basic Loading (full precision) ───</span>
356
+ model = AutoModelForCausalLM.from_pretrained(<span class="string">"gpt2"</span>)
357
+
358
+ <span class="comment"># ─── Half Precision (saves 50% VRAM) ───</span>
359
+ model = AutoModelForCausalLM.from_pretrained(
360
+ <span class="string">"meta-llama/Llama-3.1-8B-Instruct"</span>,
361
+ torch_dtype=torch.bfloat16,
362
+ device_map=<span class="string">"auto"</span>
363
+ )
364
+
365
+ <span class="comment"># ─── 4-bit Quantization (QLoRA-ready) ───</span>
366
  bnb_config = BitsAndBytesConfig(
367
  load_in_4bit=<span class="keyword">True</span>,
368
+ bnb_4bit_quant_type=<span class="string">"nf4"</span>, <span class="comment"># NormalFloat4 — better than uniform int4</span>
369
  bnb_4bit_compute_dtype=torch.bfloat16,
370
+ bnb_4bit_use_double_quant=<span class="keyword">True</span> <span class="comment"># Quantize the quantization constants too</span>
371
  )
372
  model = AutoModelForCausalLM.from_pretrained(
373
  <span class="string">"meta-llama/Llama-3.1-8B-Instruct"</span>,
374
  quantization_config=bnb_config,
375
  device_map=<span class="string">"auto"</span>
376
  )
377
 
378
+ <span class="comment"># ─── Flash Attention 2 (2-4x faster) ───</span>
379
+ model = AutoModelForCausalLM.from_pretrained(
380
+ <span class="string">"meta-llama/Llama-3.1-8B-Instruct"</span>,
381
+ torch_dtype=torch.bfloat16,
382
+ attn_implementation=<span class="string">"flash_attention_2"</span>,
383
+ device_map=<span class="string">"auto"</span>
384
+ )</div>
385
+
386
+ <h3>5. Trainer API — Full Training Loop</h3>
387
+ <div class="code-block"><span class="keyword">from</span> transformers <span class="keyword">import</span> AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
388
+ <span class="keyword">from</span> datasets <span class="keyword">import</span> load_dataset
389
+
390
+ <span class="comment"># Load model, tokenizer, and dataset</span>
391
+ model = AutoModelForSequenceClassification.from_pretrained(<span class="string">"bert-base-uncased"</span>, num_labels=<span class="number">2</span>)
+ tokenizer = AutoTokenizer.from_pretrained(<span class="string">"bert-base-uncased"</span>)
392
+ ds = load_dataset(<span class="string">"imdb"</span>)
393
+ tokenized = ds.map(<span class="keyword">lambda</span> x: tokenizer(x[<span class="string">"text"</span>], truncation=<span class="keyword">True</span>, max_length=<span class="number">512</span>), batched=<span class="keyword">True</span>)
394
+
395
+ <span class="comment"># Configure training</span>
396
+ args = TrainingArguments(
397
+ output_dir=<span class="string">"./results"</span>,
398
+ num_train_epochs=<span class="number">3</span>,
399
+ per_device_train_batch_size=<span class="number">16</span>,
400
+ per_device_eval_batch_size=<span class="number">64</span>,
401
+ learning_rate=<span class="number">2e-5</span>,
402
+ weight_decay=<span class="number">0.01</span>,
403
+ eval_strategy=<span class="string">"epoch"</span>,
404
+ save_strategy=<span class="string">"epoch"</span>,
405
+ load_best_model_at_end=<span class="keyword">True</span>,
406
+ fp16=<span class="keyword">True</span>, <span class="comment"># Mixed precision</span>
407
+ gradient_accumulation_steps=<span class="number">4</span>,
408
+ logging_steps=<span class="number">100</span>,
409
+ report_to=<span class="string">"wandb"</span>, <span class="comment"># Log to Weights & Biases</span>
410
+ )
411
+
412
+ trainer = Trainer(model=model, args=args, train_dataset=tokenized[<span class="string">"train"</span>], eval_dataset=tokenized[<span class="string">"test"</span>])
413
+ trainer.train()
414
+ trainer.push_to_hub() <span class="comment"># Push trained model directly</span></div>
415
+
416
+ <h3>6. Gradio — Build a Demo in 5 Lines</h3>
417
+ <div class="code-block"><span class="keyword">import</span> gradio <span class="keyword">as</span> gr
418
+ <span class="keyword">from</span> transformers <span class="keyword">import</span> pipeline
419
+
420
+ pipe = pipeline(<span class="string">"sentiment-analysis"</span>)
421
+
422
+ <span class="keyword">def</span> <span class="function">analyze</span>(text):
423
+ result = pipe(text)[<span class="number">0</span>]
424
+ <span class="keyword">return</span> <span class="string">f"{result['label']} ({result['score']:.2%})"</span>
425
+
426
+ gr.Interface(fn=analyze, inputs=<span class="string">"text"</span>, outputs=<span class="string">"text"</span>,
427
+ title=<span class="string">"Sentiment Analyzer"</span>).launch()
428
+ <span class="comment"># Runs at http://localhost:7860 — deploy to HF Spaces for free!</span></div>
429
+
430
+ <h3>7. Hub API — Programmatic Access</h3>
431
+ <div class="code-block"><span class="keyword">from</span> huggingface_hub <span class="keyword">import</span> HfApi, hf_hub_download, login
432
+
433
+ <span class="comment"># Login</span>
434
+ login(token=<span class="string">"hf_your_token"</span>) <span class="comment"># or: huggingface-cli login</span>
435
+
436
+ api = HfApi()
437
+
438
+ <span class="comment"># List models by task</span>
439
+ models = api.list_models(filter=<span class="string">"text-generation"</span>, sort=<span class="string">"downloads"</span>, limit=<span class="number">5</span>)
440
+ <span class="keyword">for</span> m <span class="keyword">in</span> models:
441
+ <span class="function">print</span>(<span class="string">f"{m.id}: {m.downloads} downloads"</span>)
442
+
443
+ <span class="comment"># Download specific file</span>
444
+ path = hf_hub_download(repo_id=<span class="string">"meta-llama/Llama-3.1-8B"</span>, filename=<span class="string">"config.json"</span>)
445
+
446
+ <span class="comment"># Push model to Hub</span>
447
+ model.push_to_hub(<span class="string">"your-username/my-model"</span>)
448
+ tokenizer.push_to_hub(<span class="string">"your-username/my-model"</span>)
449
 
450
+ <span class="comment"># Create a new Space</span>
451
+ api.create_repo(<span class="string">"your-username/my-demo"</span>, repo_type=<span class="string">"space"</span>, space_sdk=<span class="string">"gradio"</span>)</div>
452
  </div>`,
453
  interview: `
454
  <div class="section">
455
+ <h2>🎯 Hugging Face — In-Depth Interview Questions</h2>
456
+ <div class="interview-box"><strong>Q1: What's the difference between <code>from_pretrained</code> and <code>pipeline</code>?</strong><p><strong>Answer:</strong> <code>pipeline()</code> is a high-level convenience wrapper — it auto-detects the task, loads both model + tokenizer, handles tokenization/decoding, and returns human-readable output. <code>from_pretrained()</code> gives raw access to model weights for: custom inference loops, fine-tuning, extracting embeddings, modifying the model architecture, or anything beyond standard inference. Rule: prototyping → pipeline, production/training → from_pretrained.</p></div>
457
+ <div class="interview-box"><strong>Q2: What is <code>device_map="auto"</code> and how does model sharding work?</strong><p><strong>Answer:</strong> It uses the <code>accelerate</code> library to automatically distribute model layers across available hardware. The algorithm: (1) Measure available memory on each GPU, CPU, and disk; (2) Place layers sequentially, filling GPU first, spilling to CPU, then disk. For an 80-layer model split across two GPUs: layers 0-39 on GPU 0, layers 40-79 on GPU 1. CPU/disk offloading adds latency but enables running models that don't fit in GPU memory at all. Use the <code>max_memory</code> param to control allocation.</p></div>
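The greedy placement described above can be simulated in plain Python. This is an illustrative sketch, not accelerate's actual implementation; the per-layer size and memory budgets below are made-up numbers:

```python
# Illustrative simulation of greedy layer placement (NOT accelerate's
# real code): fill each device in order until its budget is exhausted.
def build_device_map(n_layers, layer_gb, budgets):
    """budgets: dict of device name -> free memory in GB, in fill order."""
    device_map, devices = {}, list(budgets.items())
    idx, free = 0, devices[0][1]
    for layer in range(n_layers):
        while free < layer_gb:        # current device is full -> move on
            idx += 1
            free = devices[idx][1]
        device_map[f"layers.{layer}"] = devices[idx][0]
        free -= layer_gb
    return device_map

# 80 layers at ~1.75 GB each (roughly a 70B model's layers in fp16)
budgets = {"cuda:0": 24, "cuda:1": 24, "cpu": 200}
dm = build_device_map(80, 1.75, budgets)
print(dm["layers.0"], dm["layers.13"], dm["layers.27"])
```

With 24 GB per GPU, 13 layers fit on each card and the remainder spills to CPU, which mirrors why `device_map="auto"` on undersized GPUs silently makes inference much slower.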
458
+ <div class="interview-box"><strong>Q3: Why use HF Datasets over pandas, and how does Apache Arrow help?</strong><p><strong>Answer:</strong> Datasets uses <strong>Apache Arrow</strong> — a columnar, memory-mapped format. Key advantages: (1) <strong>Memory mapping:</strong> A 100GB dataset uses near-zero RAM — data stays on disk but accessed at near-RAM speed via OS page cache. (2) <strong>Zero-copy:</strong> Slicing doesn't duplicate data. (3) <strong>Streaming:</strong> Process datasets larger than disk with <code>streaming=True</code>. (4) <strong>Parallel map:</strong> <code>num_proc=N</code> for multi-core preprocessing. (5) <strong>Caching:</strong> Processed results are automatically cached to disk. Pandas loads everything into RAM — impossible for large-scale ML datasets.</p></div>
459
+ <div class="interview-box"><strong>Q4: What is a chat template and why does it matter?</strong><p><strong>Answer:</strong> Each instruction-tuned model is trained with a specific format for system/user/assistant messages. Llama uses <code>&lt;|begin_of_text|&gt;&lt;|start_header_id|&gt;system&lt;|end_header_id|&gt;</code>, while ChatML uses <code>&lt;|im_start|&gt;system</code>. If you format input incorrectly, the model behaves like a base model (no instruction following). <code>tokenizer.apply_chat_template()</code> auto-formats messages correctly for any model. This is the #1 mistake beginners make — using raw text instead of the chat template.</p></div>
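To make the answer concrete, here is a hand-rolled sketch of a Llama-3-style template. This is simplified and hypothetical (`render_llama3` is not a real API; the real template is a Jinja string shipped in the model's tokenizer config), but it shows the special tokens `apply_chat_template` inserts:

```python
# Simplified, hand-rolled Llama-3-style chat template (illustration only;
# real models ship their own template in tokenizer_config.json).
def render_llama3(messages):
    out = "<|begin_of_text|>"
    for m in messages:
        out += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
        out += m["content"] + "<|eot_id|>"
    # Trailing header cues the model to generate the assistant turn
    out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out

msgs = [{"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is LoRA?"}]
prompt = render_llama3(msgs)
print(prompt)
```

Feeding the raw string "What is LoRA?" instead of this formatted prompt is exactly the beginner mistake the answer describes.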
460
+ <div class="interview-box"><strong>Q5: How do you handle gated models (Llama, Gemma) in production?</strong><p><strong>Answer:</strong> (1) Accept the model license on the Hub model page. (2) Create a read token at hf.co/settings/tokens. (3) For local: <code>huggingface-cli login</code>. (4) In CI/CD: set the <code>HF_TOKEN</code> environment variable. (5) In code: pass <code>token="hf_xxx"</code> to <code>from_pretrained()</code>. For Docker: inject the token as a build or runtime secret; never bake it into the image. For Kubernetes: use a Secret mounted as an env var. The token is only needed for download — once cached locally, no token is needed for inference.</p></div>
461
+ <div class="interview-box"><strong>Q6: What is safetensors and why replace pickle?</strong><p><strong>Answer:</strong> Traditional PyTorch models use Python's <code>pickle</code> format, which can execute arbitrary code during loading — a <strong>security vulnerability</strong>. A malicious model file could run code on your machine when loaded. <code>safetensors</code> is a safe, fast tensor format that: (1) Cannot execute code (pure data), (2) Supports zero-copy loading (memory-mapped), (3) Is 2-5x faster to load than pickle, (4) Supports lazy loading (load only specific tensors). It's now the default format on HF Hub.</p></div>
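The safetensors layout (an 8-byte little-endian header length, a JSON header mapping tensor names to dtype/shape/byte offsets, then raw tensor bytes) is simple enough to build and re-parse by hand with the stdlib, which makes the "cannot execute code" claim tangible: parsing only ever JSON-decodes and slices bytes. A minimal sketch of that file layout, not the `safetensors` library itself:

```python
# Build and re-parse a minimal safetensors-style blob by hand to show
# why it's safe: it is pure data (JSON header + raw bytes).
import json
import struct

# One fp32 tensor "w" of shape (2,), stored as 8 raw little-endian bytes
data = struct.pack("<2f", 1.0, 2.0)
header = json.dumps({"w": {"dtype": "F32", "shape": [2],
                           "data_offsets": [0, 8]}}).encode()
blob = struct.pack("<Q", len(header)) + header + data

# Parsing: read header length, decode JSON, slice the tensor's bytes
n = struct.unpack("<Q", blob[:8])[0]
meta = json.loads(blob[8:8 + n])
start, end = meta["w"]["data_offsets"]
values = struct.unpack("<2f", blob[8 + n + start:8 + n + end])
print(meta["w"]["shape"], values)
```

Because the offsets are explicit in the header, a loader can also memory-map the file and pull out a single tensor lazily, which is where the zero-copy and lazy-loading benefits in the answer come from.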
462
  </div>`
463
  },
464
  'finetuning': {