Instructions for using brandonbaek/Bori-2-135M-Base with libraries and local apps.

Libraries

How to use brandonbaek/Bori-2-135M-Base with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="brandonbaek/Bori-2-135M-Base")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("brandonbaek/Bori-2-135M-Base")
model = AutoModelForCausalLM.from_pretrained("brandonbaek/Bori-2-135M-Base", device_map="auto")

Local Apps

vLLM

How to use brandonbaek/Bori-2-135M-Base with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "brandonbaek/Bori-2-135M-Base"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "brandonbaek/Bori-2-135M-Base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
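
The same request can be issued from Python. The sketch below uses only the standard library and mirrors the curl call above. The helper names `build_completion_request` and `complete` are illustrative, and the response parsing assumes the usual OpenAI-compatible shape (`choices[0].text`).

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=512, temperature=0.5):
    """Build a POST request for an OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, model, prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    request = build_completion_request(base_url, model, prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumption: the server answers with the OpenAI completions schema.
    return body["choices"][0]["text"]


# Example call, once the server from the previous step is running:
# text = complete("http://localhost:8000", "brandonbaek/Bori-2-135M-Base", "Once upon a time,")
```

Building the request is kept separate from sending it so the payload can be inspected without a live server.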

Use Docker

docker model run hf.co/brandonbaek/Bori-2-135M-Base

SGLang

How to use brandonbaek/Bori-2-135M-Base with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "brandonbaek/Bori-2-135M-Base" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "brandonbaek/Bori-2-135M-Base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "brandonbaek/Bori-2-135M-Base" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "brandonbaek/Bori-2-135M-Base",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use brandonbaek/Bori-2-135M-Base with Docker Model Runner:
```
docker model run hf.co/brandonbaek/Bori-2-135M-Base
```

🌾 Bori-2 135M Base

🚀 Newer Version Available: The Bori project is currently developing Bori-3, which utilizes the SmolLM2-360M base and an upgraded response-only SFT loss pipeline. Check the GitHub Repository for the latest code.

Bori-2 135M Base is an experimental bilingual (Korean-English) Small Language Model (SLM) adapted from SmolLM2-135M. It is the final base model of the Bori-2 lineage: the checkpoint at step 10,000, saved at the end of the full Continuous Pre-Training (CPT) pipeline.

This model serves as a proof of concept for adapting very small, highly capable English-centric models to new languages under extreme compute constraints, using EEVE-style embedding initialization and a custom Warmup-Stable-Decay learning rate schedule.

🤖 Model Details

Base Architecture: SmolLM2 (Llama-based)
Parameter Count: ~135M
Languages: Korean, English
Vocabulary Size: 49,152 (Base) + 8,981 (Korean tokens) = 58,133 tokens
Context Length: 2048 tokens
License: Apache 2.0
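
As a rough worked example of what the vocabulary expansion costs in parameters, the arithmetic below multiplies the vocabulary figures above by the embedding width. The hidden size of 576 is an assumption taken from the public SmolLM2-135M configuration and should be checked against this model's `config.json`.

```python
BASE_VOCAB = 49_152     # base SmolLM2 vocabulary, from the list above
ADDED_KOREAN = 8_981    # added Korean tokens, from the list above
HIDDEN_SIZE = 576       # assumption: embedding width of SmolLM2-135M

total_vocab = BASE_VOCAB + ADDED_KOREAN
added_embedding_params = ADDED_KOREAN * HIDDEN_SIZE
total_embedding_params = total_vocab * HIDDEN_SIZE

print(total_vocab)             # 58133
print(added_embedding_params)  # 5173056
print(total_embedding_params)  # 33484608
```

If the input embeddings and LM head are not tied, the output projection adds the same number of parameters again.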

💻 Hardware & Compute Constraints

A core goal of the Bori project is achieving meaningful language adaptation under strict free-tier compute limitations.

Hardware: Trained entirely on Kaggle Notebooks utilizing 2x NVIDIA T4 GPUs (16GB VRAM each).
Optimization: The training pipeline used PyTorch's native sdpa (Scaled Dot-Product Attention) for efficient attention on the Turing-architecture T4, FP16 mixed precision, and gradient checkpointing, which trades extra compute for lower activation memory so that the weights, gradients, optimizer states, and activations fit within the tight 16GB VRAM limit.

🛠️ Training Methodology (CPT)

The model was adapted via Continuous Pre-Training (CPT) in two phases (1A and 1B), preceded by a vocabulary expansion step. The approach is designed to inject deep Korean language understanding without catastrophic forgetting of the base model's world knowledge and English representations.

1. Vocabulary Expansion (EEVE Initialization)

Pre-trained English-centric SLMs represent Korean prose very inefficiently, often splitting a single Hangul syllable into several byte-level tokens. To solve this, we trained a custom standalone Korean Byte-Level BPE tokenizer and merged it with the base tokenizer, adding 8,981 Korean tokens.
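
To make the inefficiency concrete: a byte-level BPE tokenizer starts from UTF-8 bytes and only merges sequences it saw often during training. Each precomposed Hangul syllable takes three bytes in UTF-8, so a vocabulary with few Korean merges can fall back to as many as three tokens per syllable. The snippet below shows byte counts only; actual token counts depend on the tokenizer's merge table.

```python
text = "안녕하세요"  # "hello", five Hangul syllables

utf8_bytes = text.encode("utf-8")
bytes_per_syllable = [len(ch.encode("utf-8")) for ch in text]

print(len(text))           # 5 characters
print(len(utf8_bytes))     # 15 bytes
print(bytes_per_syllable)  # [3, 3, 3, 3, 3]
```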

Crucially, in src/model.py, the newly added Korean token embeddings are not initialized randomly. Instead, we used the initialization strategy from EEVE (Efficient and Effective Vocabulary Expansion), which initializes each new token's embedding as the mean of the embeddings of the subword tokens that the base tokenizer would split it into. This gives the model a good starting approximation and lowers the initial cross-entropy loss relative to random initialization.
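
The mean-of-subwords idea can be sketched in a few lines. This is a framework-free toy, not the contents of src/model.py: the function name `mean_init`, the tiny embedding table, and the subword ids are all made up for illustration.

```python
from statistics import fmean


def mean_init(piece_ids, base_embeddings):
    """Initial vector for a new token: the element-wise mean of the base-model
    embeddings of the subword ids the base tokenizer split it into."""
    rows = [base_embeddings[i] for i in piece_ids]
    return [fmean(column) for column in zip(*rows)]


# Toy base embedding table: 4 tokens, dimension 3 (hypothetical values).
base = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
    [5.0, 2.0, 4.0],
    [0.0, 0.0, 0.0],
]

# Suppose the base tokenizer split one new Korean token into base ids [0, 1, 2].
new_vector = mean_init([0, 1, 2], base)

# The new row is appended after the existing vocabulary.
extended = base + [new_vector]
print(new_vector)  # [3.0, 2.0, 2.0]
```

In a real model the same averaging would run over rows of the embedding matrix after it has been resized to the new vocabulary size.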

2. Phase 1A: Embedding Warmup

Objective: Stabilize the newly added Korean token embeddings without distorting the pre-trained weights.
Duration: 1,000 steps
Data: 100% Korean text (HuggingFaceFW/fineweb-2:kor_Hang)
Parameters: Backbone frozen; only the embedding layer and LM head were trained.
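
A minimal sketch of the freezing step, assuming parameter names follow the Hugging Face Llama layout (`model.embed_tokens.weight`, `lm_head.weight`). The helper `freeze_backbone` is hypothetical. It accepts anything shaped like the output of `model.named_parameters()`, and stand-in objects are used here so the example runs without PyTorch.

```python
from types import SimpleNamespace

# Assumption: embedding and LM-head parameters contain these substrings.
TRAINABLE_KEYS = ("embed_tokens", "lm_head")


def freeze_backbone(named_parameters):
    """Enable gradients only for embedding and LM-head parameters.

    Returns the names of the parameters left trainable.
    """
    trainable = []
    for name, param in named_parameters:
        param.requires_grad = any(key in name for key in TRAINABLE_KEYS)
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Stand-ins for (name, parameter) pairs from a Llama-style model.
fake_params = [
    ("model.embed_tokens.weight", SimpleNamespace(requires_grad=True)),
    ("model.layers.0.self_attn.q_proj.weight", SimpleNamespace(requires_grad=True)),
    ("model.norm.weight", SimpleNamespace(requires_grad=True)),
    ("lm_head.weight", SimpleNamespace(requires_grad=True)),
]

trainable_names = freeze_backbone(fake_params)
print(trainable_names)
```

With a real model, the call would be `freeze_backbone(model.named_parameters())`, and only parameters with `requires_grad=True` would be passed to the optimizer.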

3. Phase 1B: Full CPT (WSD Scheduler)

Objective: Deep language acquisition and alignment.
Duration: 10,000 steps (Final Checkpoint)
Data Mixture: 90% Korean (fineweb-2:kor_Hang) and 10% English replay (fineweb-edu-dedup) to prevent catastrophic forgetting.
Parameters: All model parameters unfrozen.
Scheduler: Used a custom PyTorch Warmup-Stable-Decay (WSD) scheduler, which holds the learning rate at its peak for most of training to maximize optimizer progress over high-entropy web text, then decays it to 10% of the peak to consolidate the weights.
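
The shape of a WSD schedule can be sketched as a plain function of the step. This is an illustration, not the project's scheduler: the warmup and decay lengths below are hypothetical (the card does not state them), and a linear decay to 10% of the peak is assumed.

```python
def wsd_lr(step, peak_lr, total_steps, warmup_steps, decay_steps, floor_ratio=0.1):
    """Warmup-Stable-Decay learning rate for a given optimizer step."""
    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable: hold the peak learning rate.
        return peak_lr
    # Decay: fall linearly from the peak to floor_ratio * peak.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr * (1.0 - (1.0 - floor_ratio) * progress)


# Hypothetical settings for a 10,000-step run.
for step in (0, 499, 5_000, 9_500, 10_000):
    print(step, wsd_lr(step, peak_lr=1.0, total_steps=10_000,
                       warmup_steps=500, decay_steps=1_000))
```

With `peak_lr=1.0` the function returns a multiplicative factor, which is the form expected by `torch.optim.lr_scheduler.LambdaLR`.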

⚠️ Limitations & Intended Use

Not an Instruct Model: This is a base completion model. It has not undergone Supervised Fine-Tuning (SFT) or RLHF and will not follow instructions out of the box. Please see the Bori-2 Instruct model for chat capabilities.
Reasoning Capacity: At only 135M parameters, the model's capacity for complex reasoning, logic, or deep factual recall is inherently limited.
Intended Use: This model is published for researchers and developers interested in SLM vocabulary expansion, training under extreme compute constraints, and bilingual adaptation methodologies. It can serve as a computationally cheap base for downstream Korean fine-tuning.