fix: correct vocab size (Llama-3.2-1B uses 128256, not 32000)
README.md
CHANGED
@@ -30,7 +30,7 @@ Submission for **CAS4133 Assignment 1** (Yonsei).
 
 ## Tokenizer Extension
 
-- **+100 Korean morpheme tokens** added to the LLaMA tokenizer (extend mode, vocab
+- **+100 Korean morpheme tokens** added to the LLaMA tokenizer (extend mode, vocab 128,256 -> 128,356)
 - POS whitelist: `[NNG, NNP, VV, VA, MAG]` (content words only — common/proper nouns, verbs, adjectives, adverbs)
 - Functional morphemes (조사, 어미) deliberately excluded — they caused NaN/inf grad explosions on the all-POS variant
 - Selection: `freq_natural` (top-k by surface-form frequency, `min_freq=10`) over the filtered training corpus
@@ -82,6 +82,6 @@ adapter_id = "gdvstd/llama-3.2-1b-ko-cpt"
 
 tok = AutoTokenizer.from_pretrained(adapter_id)  # extended tokenizer
 base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
-base.resize_token_embeddings(len(tok))  #
+base.resize_token_embeddings(len(tok))  # 128356
 model = PeftModel.from_pretrained(base, adapter_id)
 ```
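
The Tokenizer Extension bullets in this diff describe a frequency-based selection over POS-filtered morphemes, and the corrected comment gives the resulting vocabulary size. The sketch below is one possible reading of that procedure, not the author's implementation: `select_morphemes`, `toy_corpus`, and the tie-breaking rule are hypothetical, and only the POS whitelist, `min_freq=10`, the count of 100 added tokens, and the base size of 128256 come from the commit.

```python
from collections import Counter

# POS whitelist and base vocabulary size are taken from the commit;
# everything else in this sketch is an illustrative assumption.
POS_WHITELIST = {"NNG", "NNP", "VV", "VA", "MAG"}
BASE_VOCAB_SIZE = 128_256  # Llama-3.2-1B, per the commit message


def select_morphemes(tagged, k=100, min_freq=10):
    """Return up to k whitelisted surface forms, most frequent first."""
    # Count surface forms only for content-word POS tags.
    counts = Counter(surface for surface, pos in tagged if pos in POS_WHITELIST)
    # Drop anything seen fewer than min_freq times.
    frequent = [(s, c) for s, c in counts.items() if c >= min_freq]
    # Sort by descending count, then by surface form so ties are deterministic
    # (the tie-break is an assumption; the README does not specify one).
    frequent.sort(key=lambda item: (-item[1], item[0]))
    return [s for s, _ in frequent[:k]]


# Toy (surface, POS) pairs standing in for the output of a morphological analyzer.
toy_corpus = (
    [("학교", "NNG")] * 12   # common noun, frequent enough to keep
    + [("먹", "VV")] * 11    # verb stem, frequent enough to keep
    + [("은", "JX")] * 50    # particle: frequent, but excluded by the whitelist
    + [("서울", "NNP")] * 3  # proper noun: whitelisted, but below min_freq
)

selected = select_morphemes(toy_corpus, k=100, min_freq=10)
print(selected)

# With the full 100 selected tokens, and assuming none of them already exists
# in the base vocabulary, the extended size matches the corrected README value.
print(BASE_VOCAB_SIZE + 100)  # 128356
```

On the toy data only `학교` and `먹` survive both filters. In the real loading code, `len(tok)` on the extended tokenizer is what should equal 128356 before `resize_token_embeddings` is called.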