Add model card YAML metadata
README.md CHANGED
```diff
@@ -1,3 +1,23 @@
+---
+license: llama3.2
+language:
+- en
+pipeline_tag: text-generation
+base_model: meta-llama/Llama-3.2-1B-Instruct
+base_model_relation: finetune
+datasets:
+- HuggingFaceFW/fineweb-edu
+- HuggingFaceH4/ultrachat_200k
+tags:
+- llama
+- llama-3.2
+- instruct
+- educational
+- original-format
+- multi-head-attention
+- absolute-positional-embeddings
+---
+
 ## Llama 3.2 1B Simplified
 
 This repo contains a simplified variant of the Llama 3.2 1B Instruct model, built for the [Introduction to Modern AI](https://modernaicourse.org) course. The model is intended for instructional purposes only, specifically to test the Transformer implementation for Homework 4.
@@ -6,5 +26,4 @@ The differences from the normal Llama 3.2 1B model are:
 1. The model replaces RoPE with an absolute positional embedding. RoPE typically works slightly better, but is somewhat cumbersome and unintuitive to implement in an introductory class.
 2. The model uses normal multi-head attention instead of grouped-query attention. Grouped-query attention is a minor architectural optimization that adds complexity with little instructional value.
 
-To build this model, we made these two architecture changes then finetuned the model to recover
-
+To build this model, we made these two architecture changes and then finetuned the model to recover Llama 3.2 Instruct behavior, using a KL distillation loss and a next-token loss on a mixture of FineWeb-Edu (`HuggingFaceFW/fineweb-edu`, `sample-350BT`) and UltraChat 200k (`HuggingFaceH4/ultrachat_200k`).
```
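For readers mapping change 1 onto code: a learned absolute positional embedding is just a second embedding table indexed by position, added to the token embeddings before the first Transformer block, so attention itself needs no RoPE rotation. The sketch below is a minimal PyTorch illustration with made-up names; it is not code from this repo.

```python
import torch
import torch.nn as nn

class TokenAndPositionEmbedding(nn.Module):
    """Hypothetical input layer: token embedding + learned absolute positions."""

    def __init__(self, vocab_size: int, max_seq_len: int, dim: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(max_seq_len, dim)  # one learned vector per position

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # input_ids: (batch, seq_len)
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        # Broadcast the (seq_len, dim) position rows across the batch.
        return self.tok(input_ids) + self.pos(positions)
```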
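Change 2 shows up in the shapes of the attention projections. In grouped-query attention the K/V projections have fewer heads than the query projection, and each K/V head is repeated across a group of query heads; making the K/V head count equal to the query head count recovers plain multi-head attention. Again a hedged sketch with assumed names, not this repo's actual module layout:

```python
import torch
import torch.nn as nn

def make_attention_projections(dim: int, n_heads: int, n_kv_heads: int):
    # GQA: n_kv_heads < n_heads, so K and V are cheaper to compute and cache.
    # MHA (this model): n_kv_heads == n_heads.
    head_dim = dim // n_heads
    q_proj = nn.Linear(dim, n_heads * head_dim, bias=False)
    k_proj = nn.Linear(dim, n_kv_heads * head_dim, bias=False)
    v_proj = nn.Linear(dim, n_kv_heads * head_dim, bias=False)
    return q_proj, k_proj, v_proj

def repeat_kv(kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    # kv: (batch, n_kv_heads, seq_len, head_dim). Repeat each K/V head
    # n_rep = n_heads // n_kv_heads times so it lines up with the query
    # heads; with n_rep == 1 (plain MHA) this is a no-op.
    return kv.repeat_interleave(n_rep, dim=1)
```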
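The recovery finetune described in the last changed line combines two terms: a KL divergence pushing the simplified student toward the frozen original meta-llama/Llama-3.2-1B-Instruct teacher's next-token distributions, plus the usual next-token cross-entropy on the data mixture. One plausible form is sketched below; the mixing weight `alpha` and all names are assumptions, not the actual training code.

```python
import torch
import torch.nn.functional as F

def recovery_loss(student_logits: torch.Tensor,
                  teacher_logits: torch.Tensor,
                  labels: torch.Tensor,
                  alpha: float = 0.5) -> torch.Tensor:
    # Logits: (batch, seq_len, vocab). Drop the last position so every
    # kept position is a prediction of the *next* token.
    s = student_logits[:, :-1]
    t = teacher_logits[:, :-1]
    # Distillation term: KL(teacher || student) over the vocabulary.
    kl = F.kl_div(
        F.log_softmax(s, dim=-1),
        F.log_softmax(t, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    # Standard next-token cross-entropy against the shifted gold tokens.
    ce = F.cross_entropy(
        s.reshape(-1, s.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    return alpha * kl + (1.0 - alpha) * ce
```

In use, the teacher would run under `torch.no_grad()` on the same batch, and `alpha` would trade off matching the teacher against fitting the data.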