# llama-3.2-1b-ko-cpt — morph100_content variant
Continued pretraining of unsloth/Llama-3.2-1B-unsloth-bnb-4bit on Korean Wikipedia (ko_wiki_public) with a content-POS morpheme tokenizer extension (+100 Korean tokens) and rsLoRA r=256, α=256.
Submission for CAS4133 Assignment 1 (Yonsei).
## Final Eval (frozen 2,125-doc held-out test set)

| Metric | This model | Baseline (notebook reference) | Δ vs baseline |
|---|---|---|---|
| eval_loss | 1.9971 | 2.0516 | -0.0545 |
| perplexity | 7.368 | 7.780 | -0.412 |
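The perplexity column follows directly from the loss (PPL = exp(eval_loss)); a quick arithmetic check:

```python
import math

# PPL = exp(cross-entropy loss): matches the table above.
print(math.exp(1.9971))  # ≈ 7.367 (this model)
print(math.exp(2.0516))  # ≈ 7.780 (baseline)
```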
## Tokenizer Extension

- +100 Korean morpheme tokens added to the LLaMA tokenizer (extend mode; vocab 128,256 -> 128,356)
- POS whitelist: NNG, NNP, VV, VA, MAG (content words only: common/proper nouns, verbs, adjectives, adverbs)
- Functional morphemes (particles 조사 and verb endings 어미) deliberately excluded; they caused NaN/inf gradient explosions in the all-POS variant
- Selection: freq_natural (top-k by surface-form frequency, min_freq=10) over the filtered training corpus
- Embedding init: subword mean of base LLaMA tokenizer pieces (sketched below)
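A minimal sketch of the extension pipeline described above, assuming a plain `add_tokens` extension and input-embedding initialization; the morpheme list, corpus counts, and POS-tagging step are placeholders, not the actual training code:

```python
from collections import Counter
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_id = "unsloth/Llama-3.2-1B-unsloth-bnb-4bit"
base_tok = AutoTokenizer.from_pretrained(base_id)  # kept unmodified for subword-mean init
tok = AutoTokenizer.from_pretrained(base_id)       # this copy gets extended

# freq_natural selection: count content-POS surface forms (NNG/NNP/VV/VA/MAG) over the
# filtered corpus and keep the top-k with min_freq=10. A real run would use a Korean
# POS tagger; these counts are placeholders.
counts = Counter({"위키백과": 420, "역사": 310, "음악가": 12})
new_tokens = [m for m, c in counts.most_common(100) if c >= 10]

tok.add_tokens(new_tokens)  # vocab 128,256 -> 128,256 + len(new_tokens)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
model.resize_token_embeddings(len(tok))

# Subword-mean init: each new token starts as the mean of its old subword-piece embeddings.
# (If lm_head is untied, its new rows would be initialized the same way.)
emb = model.get_input_embeddings().weight
with torch.no_grad():
    for t in new_tokens:
        pieces = base_tok(t, add_special_tokens=False).input_ids
        emb[tok.convert_tokens_to_ids(t)] = emb[pieces].mean(dim=0)
```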
## Training Configuration
| Component | Value |
|---|---|
| Base model | unsloth/Llama-3.2-1B-unsloth-bnb-4bit |
| Adapter | rsLoRA, r=256, alpha=256, dropout=0.0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj + embed_tokens, lm_head |
| Optimizer | AdamW (8-bit), lr=2e-4 cosine, warmup_ratio=0.05 |
| Batch | bs=8 x grad_accum=4 (effective 32), seq_len=1024 |
| Steps | 2,500 |
| Precision | bf16, 4-bit base in NF4 |
| Hardware | 1x RTX 3090 (24 GB), ~5h31m wall-clock |
| Train / test split | 139,394 / 2,125 documents (super_strict filter) |
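In PEFT/transformers terms, the table above roughly corresponds to the sketch below. Treating embed_tokens/lm_head as fully trained `modules_to_save` (rather than LoRA targets) is an assumption; the card only lists the hyperparameter values:

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_cfg = LoraConfig(
    r=256,
    lora_alpha=256,          # rsLoRA scales updates by alpha / sqrt(r) instead of alpha / r
    lora_dropout=0.0,
    use_rslora=True,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # assumption: trained full-rank
    task_type="CAUSAL_LM",
)

train_args = TrainingArguments(
    output_dir="llama-3.2-1b-ko-cpt",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,  # effective batch size 32
    max_steps=2500,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    optim="adamw_bnb_8bit",         # 8-bit AdamW
    bf16=True,
)
```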
## Data Filtering (super_strict)

Triple filter applied to ko_wiki_public (a sketch follows the list):
- min/max chars
- Korean character ratio threshold
- content-density (drop list-heavy / link-stub pages)
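A hedged sketch of what such a triple filter can look like; the actual super_strict thresholds are not published in this card, so every constant below is illustrative:

```python
import re

HANGUL = re.compile(r"[가-힣]")

MIN_CHARS, MAX_CHARS = 200, 20_000  # illustrative, not the real super_strict values
MIN_KO_RATIO = 0.5                  # minimum fraction of Hangul characters
MAX_LIST_RATIO = 0.5                # proxy for content density

def keep(doc: str) -> bool:
    """Apply the three filters: length, Korean-character ratio, content density."""
    n = len(doc)
    if not (MIN_CHARS <= n <= MAX_CHARS):
        return False
    if len(HANGUL.findall(doc)) / n < MIN_KO_RATIO:
        return False
    lines = [l for l in doc.splitlines() if l.strip()]
    listy = sum(l.lstrip().startswith(("*", "-", "#")) for l in lines)
    return not lines or listy / len(lines) <= MAX_LIST_RATIO
```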
## Ablations
| Variant | Tokenizer extension | eval_loss | PPL |
|---|---|---|---|
| baseline (notebook) | none | 2.0516 | 7.780 |
| r256/a256 (no extension) | none | 1.9902 | 7.317 |
| morph100_content (this repo) | +100 content tokens | 1.9971 | 7.368 |
| morph200_content | +200 content tokens | 2.0041 | 7.420 |
| morph100 (all-POS) | +100 mixed tokens | NaN | inf |
| morph200 (all-POS) | +200 mixed tokens | NaN | inf |
Key finding: content-POS filtering is essential. Including particles (조사) and verb endings (어미) in the extension causes immediate gradient explosion under the rsLoRA r=256 + mixed-precision embed_tokens/lm_head training setup. Smaller extensions (+100 tokens) outperform larger ones (+200) under the fixed 2,500-step budget.
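For context, those failures surface as a non-finite training loss. A simple guard (not part of the original training code) would stop such a run at the first bad step:

```python
import math
from transformers import TrainerCallback

class NaNGuard(TrainerCallback):
    """Stop training as soon as the logged loss goes NaN/inf,
    which is how the all-POS gradient explosions above present."""
    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and not math.isfinite(loss):
            control.should_training_stop = True
        return control
```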
## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_id = "unsloth/Llama-3.2-1B-unsloth-bnb-4bit"
adapter_id = "gdvstd/llama-3.2-1b-ko-cpt"

tok = AutoTokenizer.from_pretrained(adapter_id)  # extended tokenizer (vocab 128,356)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
base.resize_token_embeddings(len(tok))  # grow embeddings to 128,356 before loading the adapter
model = PeftModel.from_pretrained(base, adapter_id)
```
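A quick generation check with the loaded adapter; the prompt is only an example:

```python
prompt = "한국의 수도는"  # "The capital of Korea is"
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```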
## Model tree

- Base model: meta-llama/Llama-3.2-1B
- Quantized base: unsloth/Llama-3.2-1B-unsloth-bnb-4bit
- This adapter: gdvstd/llama-3.2-1b-ko-cpt