Commit 5ad7fc4 (parent: e6d6a49)
feat: massive expansion of CME 295 content into a comprehensive technical deep-dive

CME295-Transformers/index.html (+45 −36)
@@ -503,19 +503,19 @@
 <h3>Fundamental NLP Components</h3>
 <div class="list-item">
 <div class="list-num">01</div>
-<div><strong>Tokenization:</strong>
 </div>
 <div class="list-item">
 <div class="list-num">02</div>
-<div><strong>
 </div>
 <div class="list-item">
 <div class="list-num">03</div>
-<div><strong>The RNN
 </div>
 <div class="list-item">
 <div class="list-num">04</div>
-<div><strong>Self-Attention:</strong> The
 </div>
 `,
 math: `
@@ -573,19 +573,19 @@
 <h3>Architectural Optimizations</h3>
 <div class="list-item">
 <div class="list-num">01</div>
-<div><strong>Multi-Head Attention (MHA):</strong>
 </div>
 <div class="list-item">
 <div class="list-num">02</div>
-<div><strong>
 </div>
 <div class="list-item">
 <div class="list-num">03</div>
-<div><strong>RoPE (Rotary Position Embeddings):</strong>
 </div>
 <div class="list-item">
 <div class="list-num">04</div>
-<div><strong>
 </div>
 `
 },
@@ -597,7 +597,15 @@
 <h3>LLM Internals & Inference</h3>
 <div class="list-item">
 <div class="list-num">01</div>
-<div><strong>Mixture of Experts (MoE):</strong>
 </div>
 <div class="list-item">
 <div class="list-num">02</div>
@@ -628,22 +636,22 @@
 <p>How are these giant models built? The lifecycle of training a Foundation Model spans months and costs millions.</p>
 `,
 concepts: `
-<h3>
 <div class="list-item">
 <div class="list-num">01</div>
-<div><strong>
 </div>
 <div class="list-item">
 <div class="list-num">02</div>
-<div><strong>
 </div>
 <div class="list-item">
 <div class="list-num">03</div>
-<div><strong>
 </div>
 <div class="list-item">
 <div class="list-num">04</div>
-<div><strong>
 </div>
 `
 },
@@ -652,18 +660,22 @@
 <p>Even an SFT model might generate harmful, politically biased, or unsafe answers. Alignment refers to shaping the model's outputs to heavily mirror human preferences.</p>
 `,
 concepts: `
-<h3>
 <div class="list-item">
 <div class="list-num">01</div>
-<div><strong>
 </div>
 <div class="list-item">
 <div class="list-num">02</div>
-<div><strong>
 </div>
 <div class="list-item">
 <div class="list-num">03</div>
-<div><strong>
 </div>
 `
 },
@@ -676,17 +688,18 @@
 </div>
 `,
 concepts: `
-<h3>
-<div class="callout insight">
-<p><strong>Test-Time Compute:</strong> The insight that giving a model more time to generate hidden "thinking tokens" drastically improves its output quality. Unlike parametric training, you can scale intelligence infinitely by simply letting the model think longer.</p>
-</div>
 <div class="list-item">
 <div class="list-num">01</div>
-<div><strong>
 </div>
 <div class="list-item">
 <div class="list-num">02</div>
-<div><strong>
 </div>
 `
 },
@@ -715,22 +728,18 @@
 <p>Evaluating an LLM's intelligence is extremely complex since language is subjective. This module covers benchmarks and the transition to algorithmic judges.</p>
 `,
 concepts: `
-<h3>
 <div class="list-item">
 <div class="list-num">01</div>
-<div><strong>The
 </div>
 <div class="list-item">
 <div class="list-num">02</div>
-<div><strong>
 </div>
 <div class="list-item">
 <div class="list-num">03</div>
-<div><strong>
-</div>
-<div class="list-item">
-<div class="list-num">04</div>
-<div><strong>Key Benchmarks:</strong> MMLU (General multi-subject knowledge), GSM8K (Middle school math), HumanEval (Python coding).</div>
 </div>
 `
 },
@@ -739,18 +748,18 @@
 <p>The bleeding edge. The integration of image, audio, and generation models merging into singular foundation interfaces.</p>
 `,
 concepts: `
-<h3>
 <div class="list-item">
 <div class="list-num">01</div>
-<div><strong>Vision Transformers (ViT):</strong>
 </div>
 <div class="list-item">
 <div class="list-num">02</div>
-<div><strong>
 </div>
 <div class="list-item">
 <div class="list-num">03</div>
-<div><strong>
 </div>
 `
 }
@@ -503,19 +503,19 @@
 <h3>Fundamental NLP Components</h3>
 <div class="list-item">
 <div class="list-num">01</div>
+<div><strong>Tokenization & BPE:</strong> Transition from word-level to sub-word level. Byte Pair Encoding (BPE) uses a frequency-based merge strategy to build a vocabulary. This solves the "Out of Vocabulary" problem and allows the model to handle morphologically rich languages by breaking words into meaningful prefixes/suffixes.</div>
 </div>
 <div class="list-item">
 <div class="list-num">02</div>
+<div><strong>Word2Vec & Vector Spaces:</strong> Word embeddings map tokens into a high-dimensional continuous space (e.g., 512 dimensions). These vectors capture semantic relationships via dot products; tokens with similar meanings are geometrically close.</div>
 </div>
 <div class="list-item">
 <div class="list-num">03</div>
+<div><strong>The RNN Vanishing Gradient:</strong> Recurrent architectures (LSTMs, GRUs) process tokens sequentially. For long sequences, the gradient signals diminish, meaning the model "forgets" initial context (long-range dependencies). Transformers solve this by processing all tokens simultaneously.</div>
 </div>
 <div class="list-item">
 <div class="list-num">04</div>
+<div><strong>Self-Attention Mechanism:</strong> The breakthrough. Instead of fixed weights, the model computes dynamic weights for each token pair. It maps an input into Query (Q), Key (K), and Value (V) projections, enabling the model to "attend" to precisely the right context regardless of distance.</div>
 </div>
 `,
 math: `
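The frequency-based merge strategy the added Tokenization item describes can be sketched in a few lines. This is a naive toy implementation for illustration only (the string-based merge can mis-fire on pathological symbol names, and real tokenizers work on byte sequences):

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    # Naive merge: replace every "a b" occurrence with the fused symbol "ab".
    merged, joined = " ".join(pair), "".join(pair)
    return {word.replace(merged, joined): freq for word, freq in words.items()}

def learn_bpe(words, num_merges):
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

# Toy corpus: space-separated characters plus an end-of-word marker.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges, vocab = learn_bpe(corpus, 3)
```

On this corpus the first three learned merges are `e+s`, `es+t`, and `est+</w>`, so the frequent suffix "est" becomes a single sub-word token.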
@@ -573,19 +573,19 @@
 <h3>Architectural Optimizations</h3>
 <div class="list-item">
 <div class="list-num">01</div>
+<div><strong>Multi-Head Attention (MHA):</strong> By projecting Q, K, and V into multiple lower-dimensional "heads," the model can attend to information from different representation subspaces simultaneously. One head might focus on syntax, while another focuses on sentiment.</div>
 </div>
 <div class="list-item">
 <div class="list-num">02</div>
+<div><strong>KV Cache & GQA:</strong> Autoregressive generation requires re-calculating attention for every new token. Storing K and V vectors (KV Cache) is memory intensive. Grouped-Query Attention (GQA) uses a many-to-one mapping for K/V pairs to Queries, drastically reducing memory bandwidth requirements in production.</div>
 </div>
 <div class="list-item">
 <div class="list-num">03</div>
+<div><strong>RoPE (Rotary Position Embeddings):</strong> Absolute positional encodings (sinusoidal) fail at extrapolation. RoPE applies a rotation to the Query and Key vectors in the complex plane, encoding <em>relative</em> distance between tokens. This allows models to handle much longer context windows than they were trained on.</div>
 </div>
 <div class="list-item">
 <div class="list-num">04</div>
+<div><strong>Normalization (LayerNorm vs RMSNorm):</strong> Llama models utilize RMSNorm (Root Mean Square Layer Normalization), which is computationally cheaper than standard LayerNorm: it skips mean-centering and computes only a root-mean-square scaling factor.</div>
 </div>
 `
 },
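The formula behind every head above is scaled dot-product attention, softmax(QK^T/√d_k)·V; MHA simply runs it once per head on sliced projections. A minimal single-head sketch with toy vectors and no learned weights:

```python
import math

def softmax(xs):
    # Numerically stable softmax.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

# A query aligned with the first key attends almost entirely to its value.
Q = [[10.0, 0.0]]
K = [[10.0, 0.0], [0.0, 10.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, K, V)
```

The dynamic part is exactly what the list item describes: the weights `w` are recomputed per query from Q·K similarities rather than being fixed parameters.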
@@ -597,7 +597,15 @@
 <h3>LLM Internals & Inference</h3>
 <div class="list-item">
 <div class="list-num">01</div>
+<div><strong>Mixture of Experts (MoE):</strong> Scaling model capacity without scaling compute. For each token, a "Router" network selects only the top-K (usually 2) specialized Feed-Forward networks ("Experts") to process the token. This allows models like Mixtral 8x7B to have the knowledge of a large model with the speed of a small one.</div>
+</div>
+<div class="list-item">
+<div class="list-num">02</div>
+<div><strong>KV Cache Management:</strong> During inference, the memory bottleneck is the Key-Value cache. Optimization techniques like PagedAttention (vLLM) allocate memory for the cache in blocks (like OS virtual memory), eliminating fragmentation and increasing throughput by up to ~20x.</div>
+</div>
+<div class="list-item">
+<div class="list-num">03</div>
+<div><strong>Continuous Batching:</strong> Unlike static batching where the user must wait for all requests in a batch to finish, continuous batching allows new requests to be added to the running batch as soon as any request generates a "Stop" token.</div>
 </div>
 <div class="list-item">
 <div class="list-num">02</div>
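The block-based allocation idea behind PagedAttention can be illustrated with a toy pager. This sketch only shows on-demand page growth (the point of the analogy to OS virtual memory); it has none of vLLM's actual memory manager, block tables, or sharing:

```python
BLOCK_SIZE = 4  # KV entries per block in this toy (vLLM uses e.g. 16 tokens/page)

class PagedKVCache:
    """Toy block allocator: the cache grows one fixed-size page at a time,
    so no large contiguous buffer is reserved up front for the full context."""
    def __init__(self):
        self.blocks = []  # each block holds up to BLOCK_SIZE (k, v) entries

    def append(self, kv):
        if not self.blocks or len(self.blocks[-1]) == BLOCK_SIZE:
            self.blocks.append([])  # allocate a new page on demand
        self.blocks[-1].append(kv)

    def __len__(self):
        return sum(len(b) for b in self.blocks)

cache = PagedKVCache()
for t in range(10):  # decode 10 tokens
    cache.append((f"k{t}", f"v{t}"))
```

Ten tokens land in three pages (4 + 4 + 2); only the final page has any slack, which is what eliminates the fragmentation of per-request max-length buffers.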
@@ -628,22 +636,22 @@
 <p>How are these giant models built? The lifecycle of training a Foundation Model spans months and costs millions.</p>
 `,
 concepts: `
+<h3>Pretraining & Scaling Deep Dive</h3>
 <div class="list-item">
 <div class="list-num">01</div>
+<div><strong>Scaling Laws (Chinchilla):</strong> DeepMind's research showed that model performance depends on both the number of parameters (N) and the amount of training data (D). For a given compute budget, N and D should be scaled proportionally. A common target for modern models is ~20 tokens per parameter for compute-optimal training.</div>
 </div>
 <div class="list-item">
 <div class="list-num">02</div>
+<div><strong>Data Mixing & Quality:</strong> The "garbage in, garbage out" rule applies. Models are trained on a mixture of Web Text (Common Crawl), Code (GitHub), and high-quality Books. Effective deduplication (MinHash, LSH) and quality filtering (using small classifiers) are as important as the model architecture itself.</div>
 </div>
 <div class="list-item">
 <div class="list-num">03</div>
+<div><strong>Learning Rate Schedules:</strong> Foundation models usually use AdamW with a Cosine Learning Rate decay and a linear warmup period to maintain training stability across billions of parameters.</div>
 </div>
 <div class="list-item">
 <div class="list-num">04</div>
+<div><strong>Compute Hardware:</strong> Training a 70B model requires thousands of synchronized H100 GPUs using Parallelism strategies (Tensor, Pipeline, and Data Parallelism) to split the model across hundreds of nodes.</div>
 </div>
 `
 },
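The Chinchilla rule of thumb above reduces to simple arithmetic. Both constants here are the commonly quoted approximations (~20 tokens per parameter, ~6 FLOPs per parameter per training token), not exact values from the paper's fitted laws:

```python
TOKENS_PER_PARAM = 20  # approximate compute-optimal Chinchilla ratio

def optimal_tokens(n_params):
    # Compute-optimal training data scales linearly with model size.
    return TOKENS_PER_PARAM * n_params

def approx_training_flops(n_params, n_tokens):
    # Widely used estimate: ~6 FLOPs per parameter per training token.
    return 6 * n_params * n_tokens

n_params = 70e9                       # a 70B-parameter model
n_tokens = optimal_tokens(n_params)   # -> 1.4 trillion tokens
flops = approx_training_flops(n_params, n_tokens)
```

So a compute-optimal 70B model wants roughly 1.4T training tokens, on the order of 6e23 FLOPs, which is why such runs need thousands of synchronized GPUs.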
@@ -652,18 +660,22 @@
 <p>Even an SFT model might generate harmful, politically biased, or unsafe answers. Alignment refers to shaping the model's outputs to heavily mirror human preferences.</p>
 `,
 concepts: `
+<h3>Fine-Tuning & Alignment</h3>
 <div class="list-item">
 <div class="list-num">01</div>
+<div><strong>SFT (Supervised Fine-Tuning):</strong> Converting a "next-token predictor" into a "helpful assistant." The model is trained on (Prompt, Completion) pairs. This stage teaches the model the "Instruction Following" format and the appropriate tone for interaction.</div>
 </div>
 <div class="list-item">
 <div class="list-num">02</div>
+<div><strong>PEFT (Parameter-Efficient Fine-Tuning):</strong> Techniques like LoRA (Low-Rank Adaptation) freeze the base weights and only train small low-rank "delta" matrices. This dramatically reduces the VRAM requirement and allows multiple specialized adapters to share the same base weights in memory.</div>
 </div>
 <div class="list-item">
 <div class="list-num">03</div>
+<div><strong>Alignment (RLHF & DPO):</strong> Aligning model behavior with human values. (1) <strong>RLHF</strong> uses a Reward Model to train a Policy via PPO. (2) <strong>DPO</strong> (Direct Preference Optimization) is a mathematically elegant alternative that treats preference data as a supervised learning problem, avoiding the need for a separate reward model.</div>
+</div>
+<div class="list-item">
+<div class="list-num">04</div>
+<div><strong>Catastrophic Forgetting:</strong> A major challenge in fine-tuning where a model loses its general knowledge (pretraining facts) while learning a specific task. Mitigation strategies include lower learning rates and mixing in pretraining data during the fine-tuning phase.</div>
 </div>
 `
 },
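The LoRA idea in the PEFT item is just a low-rank additive update to a frozen matrix: W_eff = W + α·B·A, where only A (r×d) and B (d×r) are trained. A toy sketch with d=4 and rank r=1 (8 trainable numbers instead of 16):

```python
def matmul(A, B):
    # Plain nested-list matrix multiply.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_effective_weight(W, A, B, alpha=1.0):
    # W stays frozen; only the low-rank factors A and B would receive gradients.
    delta = matmul(B, A)  # (d x r) @ (r x d) -> d x d, but rank <= r
    return [[w + alpha * dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

d = 4
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base
A = [[0.1, 0.0, 0.0, 0.0]]          # r x d, trainable
B = [[1.0], [0.0], [0.0], [0.0]]    # d x r, trainable
W_eff = lora_effective_weight(W, A, B)
```

Because the delta factorizes, serving systems can keep one copy of W and swap in many small (A, B) adapter pairs, which is the memory-sharing point made above.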
@@ -676,17 +688,18 @@
 </div>
 `,
 concepts: `
+<h3>Logic, Math & Verification</h3>
 <div class="list-item">
 <div class="list-num">01</div>
+<div><strong>Reasoning Trajectories:</strong> In "Chain of Thought" reasoning, the model generates an internal monologue to solve problems. New "Reasoning Models" (like o1/R1) are trained explicitly to spend more "test-time compute" to arrive at the correct answer by exploring multiple logical paths.</div>
 </div>
 <div class="list-item">
 <div class="list-num">02</div>
+<div><strong>Verifiable Rewards (RL on Math/Code):</strong> Unlike subjective chat data, math and code have ground-truth answers. Reinforcement Learning can optimize a model purely on output verification (does it pass the test cases?), allowing the model to "learn to think" without human-labeled reasoning paths.</div>
+</div>
+<div class="list-item">
+<div class="list-num">03</div>
+<div><strong>AlphaCode & Strawberry Logic:</strong> Strategies for searching through potential solution spaces. The model generates many candidates, verifies them, and uses the correct ones to further tune its internal policy. This moves LLMs from "System 1" (instinctive) to "System 2" (deliberative) thinking.</div>
 </div>
 `
 },
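The verifiable-rewards idea reduces to scoring candidates against ground-truth test cases. A minimal sketch with hypothetical sampled "programs" for the task "return the square of x" (the lambdas stand in for model-generated code):

```python
def verify(candidate_fn, test_cases):
    # Reward comes purely from ground truth: pass all tests or get nothing.
    try:
        return all(candidate_fn(x) == y for x, y in test_cases)
    except Exception:
        return False  # crashing candidates earn zero reward

# Hypothetical sampled candidates for "square x":
candidates = [lambda x: x + x,   # wrong: doubles instead of squares
              lambda x: x * x,   # correct
              lambda x: x ** 3]  # wrong: cubes
tests = [(2, 4), (3, 9), (0, 0)]
rewards = [1.0 if verify(c, tests) else 0.0 for c in candidates]
```

Note the first candidate even passes the `(2, 4)` case; only the full test suite separates it from the correct solution, which is why multiple test cases matter for this style of training signal.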
@@ -715,22 +728,18 @@
 <p>Evaluating an LLM's intelligence is extremely complex since language is subjective. This module covers benchmarks and the transition to algorithmic judges.</p>
 `,
 concepts: `
+<h3>Benchmarks & Human Calibration</h3>
 <div class="list-item">
 <div class="list-num">01</div>
+<div><strong>Benchmark Saturation:</strong> Models are getting so good at standard tests (MMLU, GSM8K) that researchers are moving toward "live" leaderboards like Chatbot Arena (LMSYS), where models are ranked via Elo systems based on blind human voting on real-world prompts.</div>
 </div>
 <div class="list-item">
 <div class="list-num">02</div>
+<div><strong>Benchmark Contamination:</strong> A critical issue where benchmark questions are accidentally included in the model's pretraining data (the model "memorizes" the test). Modern evals use private, fresh, or dynamically generated datasets to prevent this.</div>
 </div>
 <div class="list-item">
 <div class="list-num">03</div>
+<div><strong>Eval-as-a-Service:</strong> Tools like DeepEval or RAGAS focus on specific metrics for agents, such as Hallucination Rate, Factuality, and Context Adherence, rather than just general conversational quality.</div>
 </div>
 `
 },
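The Elo ranking mentioned for Chatbot Arena uses the standard update rule: the winner gains rating in proportion to how surprising the win was. This is the textbook version with K=32 (the real leaderboard fits a Bradley-Terry model over many battles, so treat this as illustrative):

```python
K = 32  # update step size

def expected_score(r_a, r_b):
    # Probability that a beats b under the Elo/logistic model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_winner, r_loser):
    # The winner gains K * (1 - expected); the loser loses the same amount.
    e = expected_score(r_winner, r_loser)
    return r_winner + K * (1 - e), r_loser - K * (1 - e)

# An upset: a 1000-rated model wins a blind vote against a 1200-rated one.
new_low, new_high = elo_update(1000, 1200)
```

An upset moves ratings by nearly the full K points, while an expected win barely moves them; total rating is conserved.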
@@ -739,18 +748,18 @@
 <p>The bleeding edge. The integration of image, audio, and generation models merging into singular foundation interfaces.</p>
 `,
 concepts: `
+<h3>The Multi-Sensory Architecture</h3>
 <div class="list-item">
 <div class="list-num">01</div>
+<div><strong>Vision Transformers (ViT):</strong> "An Image is Worth 16x16 Words." Transformers process images by slicing them into patches, flattening them into tokens, and applying the exact same attention architecture used for text. This unified architecture is why LLMs are becoming natively multi-modal.</div>
 </div>
 <div class="list-item">
 <div class="list-num">02</div>
+<div><strong>Contrastive Learning (CLIP):</strong> Training an image encoder and text encoder to output similar vectors for the same concept (e.g., a photo of a cat and the word "cat"). This creates the bridge between visual and verbal latent spaces.</div>
 </div>
 <div class="list-item">
 <div class="list-num">03</div>
+<div><strong>Autoregressive Vision:</strong> A likely future state where models generate pixels token-by-token (just like text) or use latent diffusion bridges to generate extremely consistent high-definition visual narratives.</div>
 </div>
 `
 }
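The ViT patching step is essentially a reshape. A toy patchify on a 4x4 "image" with 2x2 patches (a real ViT then linearly projects each flattened patch and adds a position embedding, which this sketch omits):

```python
def patchify(image, P):
    # Slice an H x W image (nested lists) into non-overlapping P x P patches,
    # flattening each patch into one "token" vector of length P*P.
    H, W = len(image), len(image[0])
    assert H % P == 0 and W % P == 0, "image must divide evenly into patches"
    tokens = []
    for i in range(0, H, P):
        for j in range(0, W, P):
            tokens.append([image[i + di][j + dj]
                           for di in range(P) for dj in range(P)])
    return tokens

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 "image"
tokens = patchify(image, 2)  # -> 4 tokens, each of length 4
```

At paper scale the same reshape turns a 224x224 image with 16x16 patches into (224/16)^2 = 196 tokens, the "16x16 words" of the title.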