gdvstd committed on
Commit b63df18 · verified · 1 Parent(s): 0e1d6b0

fix: correct vocab size (Llama-3.2-1B uses 128256, not 32000)

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -30,7 +30,7 @@ Submission for **CAS4133 Assignment 1** (Yonsei).
 
 ## Tokenizer Extension
 
-- **+100 Korean morpheme tokens** added to the LLaMA tokenizer (extend mode, vocab 32000 -> 32100)
+- **+100 Korean morpheme tokens** added to the LLaMA tokenizer (extend mode, vocab 128,256 -> 128,356)
 - POS whitelist: `[NNG, NNP, VV, VA, MAG]` (content words only — common/proper nouns, verbs, adjectives, adverbs)
 - Functional morphemes (조사, 어미) deliberately excluded — they caused NaN/inf grad explosions on the all-POS variant
 - Selection: `freq_natural` (top-k by surface-form frequency, `min_freq=10`) over the filtered training corpus
@@ -82,6 +82,6 @@ adapter_id = "gdvstd/llama-3.2-1b-ko-cpt"
 
 tok = AutoTokenizer.from_pretrained(adapter_id) # extended tokenizer
 base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
-base.resize_token_embeddings(len(tok)) # 32100
+base.resize_token_embeddings(len(tok)) # 128356
 model = PeftModel.from_pretrained(base, adapter_id)
 ```
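
The corrected figures in this commit follow from simple arithmetic: a base vocabulary of 128,256 entries plus 100 new tokens gives 128,356, which is the value `len(tok)` should report before `resize_token_embeddings` is called. The sketch below reproduces that count with a toy dictionary instead of a real tokenizer. The helper `extended_vocab_size` and the placeholder token names are hypothetical, and the sketch assumes that extend mode skips tokens already present in the vocabulary rather than adding them twice.

```python
# Figures taken from the commit message; nothing here loads a real tokenizer.
BASE_VOCAB_SIZE = 128_256  # Llama-3.2-1B base vocabulary size, per the commit
NUM_NEW_TOKENS = 100       # Korean morpheme tokens added in extend mode


def extended_vocab_size(base_vocab: dict[str, int], new_tokens: list[str]) -> int:
    """Return the vocabulary size after appending tokens not already present.

    Hypothetical helper: it mimics an "extend" step in which each genuinely
    new token receives the next free id and duplicates are ignored.
    """
    vocab = dict(base_vocab)  # copy so the caller's vocabulary is untouched
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)  # next free id
    return len(vocab)


# Toy stand-ins for the real vocabulary and the selected morphemes.
base_vocab = {f"base_{i}": i for i in range(BASE_VOCAB_SIZE)}
new_tokens = [f"ko_{i}" for i in range(NUM_NEW_TOKENS)]

expected_size = extended_vocab_size(base_vocab, new_tokens)
print(expected_size)
```

In a real run, comparing `len(tok)` against this expected value before resizing would catch a mismatch such as the 32000-based figure that this commit removes.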