gdvstd/llama-3.2-1b-ko-cpt

@@ -30,7 +30,7 @@ Submission for **CAS4133 Assignment 1** (Yonsei).
 ## Tokenizer Extension
-- **+100 Korean morpheme tokens** added to the LLaMA tokenizer (extend mode, vocab 32000 -> 32100)
+- **+100 Korean morpheme tokens** added to the LLaMA tokenizer (extend mode, vocab 128,256 -> 128,356)
 - POS whitelist: `[NNG, NNP, VV, VA, MAG]` (content words only — common/proper nouns, verbs, adjectives, adverbs)
 - Functional morphemes (조사, 어미) deliberately excluded — they caused NaN/inf grad explosions on the all-POS variant
 - Selection: `freq_natural` (top-k by surface-form frequency, `min_freq=10`) over the filtered training corpus
@@ -82,6 +82,6 @@ adapter_id = "gdvstd/llama-3.2-1b-ko-cpt"
 tok = AutoTokenizer.from_pretrained(adapter_id)  # extended tokenizer
 base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
-base.resize_token_embeddings(len(tok))           # 32100
+base.resize_token_embeddings(len(tok))           # 128356
 model = PeftModel.from_pretrained(base, adapter_id)
 ```
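
For readers unfamiliar with the selection step named in the Tokenizer Extension bullets, the following is a minimal sketch of what a POS-whitelisted, top-k-by-frequency selection with a `min_freq` cutoff could look like. It is not the repository's `freq_natural` implementation, which this diff does not show. The function name `select_new_tokens`, the `(surface, pos)` input format, and the toy corpus are all assumptions made for illustration; only the whitelist, `min_freq=10`, and the count of 100 tokens come from the README.

```python
from collections import Counter

# POS whitelist quoted from the README; everything else here is hypothetical.
POS_WHITELIST = {"NNG", "NNP", "VV", "VA", "MAG"}


def select_new_tokens(tagged_morphemes, top_k=100, min_freq=10):
    """Return up to top_k whitelisted surface forms, most frequent first.

    tagged_morphemes: iterable of (surface, pos) pairs, assumed to come from
    a morphological analyzer run over the filtered training corpus.
    """
    # Count surface forms, keeping only content-word POS tags.
    counts = Counter(
        surface for surface, pos in tagged_morphemes if pos in POS_WHITELIST
    )
    # most_common() sorts by descending frequency; drop anything below min_freq.
    frequent = [surface for surface, freq in counts.most_common() if freq >= min_freq]
    return frequent[:top_k]


# Toy corpus standing in for analyzer output (hypothetical data).
corpus = (
    [("학교", "NNG")] * 12
    + [("먹", "VV")] * 11
    + [("는", "ETM")] * 50   # functional morpheme: excluded by the whitelist
    + [("서울", "NNP")] * 3   # whitelisted but below min_freq: dropped
)
new_tokens = select_new_tokens(corpus, top_k=100, min_freq=10)
print(new_tokens)  # ['학교', '먹']
```

A list produced this way could then be handed to the tokenizer's `add_tokens` method, after which the base model's embedding matrix has to be resized to `len(tok)` before the adapter is loaded, as the usage snippet in the diff does.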