zkolter committed · verified
Commit beae094 · 1 Parent(s): 25eb6e2

Add model card YAML metadata

Files changed (1)
  1. README.md +21 -2
README.md CHANGED
@@ -1,3 +1,23 @@
+ ---
+ license: llama3.2
+ language:
+ - en
+ pipeline_tag: text-generation
+ base_model: meta-llama/Llama-3.2-1B-Instruct
+ base_model_relation: finetune
+ datasets:
+ - HuggingFaceFW/fineweb-edu
+ - HuggingFaceH4/ultrachat_200k
+ tags:
+ - llama
+ - llama-3.2
+ - instruct
+ - educational
+ - original-format
+ - multi-head-attention
+ - absolute-positional-embeddings
+ ---
+
## Llama 3.2 1B Simplified

This repo contains a simplified variant of the Llama 3.2 1B Instruct model, built for teaching the [Introduction to Modern AI](https://modernaicourse.org) course. The model is intended for instructional purposes only, specifically to test the implementation of a Transformer for Homework 4.
@@ -6,5 +26,4 @@ The differences with the normal Llama 3.2 1B model are:
1. The model replaces RoPE with an absolute positional embedding. RoPE typically works slightly better, but is somewhat cumbersome and unintuitive to implement for an introductory class.
2. The model uses normal multihead attention instead of grouped query attention. Grouped query attention is a minor architectural optimization that adds complexity with little instructional value.

- To build this model, we made these two architecture changes then finetuned the model to recover the Llama 3.2 Instruct performance, matching with a KL loss on a calibration set involving FineWebEDU and UltraChat200K.
-
+ To build this model, we made these two architecture changes and then finetuned the model to recover Llama 3.2 Instruct behavior using a KL distillation loss and a next-token loss on a mixture of FineWebEDU (`HuggingFaceFW/fineweb-edu`, `sample-350BT`) and UltraChat200K (`HuggingFaceH4/ultrachat_200k`).
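The diff above describes two architectural simplifications. As a hedged illustration only (a sketch, not this repo's actual implementation or the course's reference code), the following PyTorch snippet shows what the two changes amount to; the dimensions (`vocab_size=128256`, `d_model=2048`, `n_heads=32`) follow the Llama 3.2 1B configuration, and all names are assumptions.

```python
import torch
import torch.nn as nn

class SimplifiedEmbedding(nn.Module):
    """Token embedding plus a learned absolute positional embedding.

    Positions enter the model once, here at the input, instead of RoPE
    rotating queries and keys inside every attention layer.
    """
    def __init__(self, vocab_size=128256, d_model=2048, max_len=2048):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, input_ids):  # input_ids: (batch, seq_len)
        pos = torch.arange(input_ids.size(1), device=input_ids.device)
        return self.tok_emb(input_ids) + self.pos_emb(pos)

# Standard multi-head attention: unlike grouped-query attention, every
# query head gets its own key/value head, so K and V project to the full
# model width instead of a reduced number of shared KV heads.
d_model, n_heads, seq_len = 2048, 32, 16
attn = nn.MultiheadAttention(d_model, n_heads, bias=False, batch_first=True)
x = SimplifiedEmbedding()(torch.randint(0, 128256, (1, seq_len)))
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
out, _ = attn(x, x, x, attn_mask=causal_mask)
print(out.shape)  # torch.Size([1, 16, 2048])
```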
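The new final paragraph says the model was finetuned to recover the original behavior with a KL distillation loss plus a next-token loss. A minimal sketch of such an objective is below, assuming the teacher and student are causal LMs producing logits of shape `(batch, seq_len, vocab)`; the name `recovery_loss`, the weight `alpha`, and the `-100` ignore index are illustrative assumptions, not details taken from the commit.

```python
import torch.nn.functional as F

def recovery_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """KL distillation term plus ordinary next-token cross-entropy."""
    # KL(teacher || student) over the vocabulary at every position,
    # pulling the student's predicted distribution toward the teacher's
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    # standard next-token prediction loss on the calibration data
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
    return alpha * kl + (1.0 - alpha) * ce
```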