Instructions to use brandonbaek/Bori-2-135M-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use brandonbaek/Bori-2-135M-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="brandonbaek/Bori-2-135M-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
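# With chat-style input, the pipeline returns the conversation with the assistant reply appended
result = pipe(messages, max_new_tokens=40)
print(result[0]["generated_text"][-1]["content"])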

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("brandonbaek/Bori-2-135M-Instruct")
model = AutoModelForCausalLM.from_pretrained("brandonbaek/Bori-2-135M-Instruct", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use brandonbaek/Bori-2-135M-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "brandonbaek/Bori-2-135M-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "brandonbaek/Bori-2-135M-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
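
The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is listening on `localhost:8000`, and the helper names `build_chat_request` and `ask` are hypothetical, not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt, model="brandonbaek/Bori-2-135M-Instruct",
                       base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the reply text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("What is the capital of France?")` should return the model's reply as a string. Pass `base_url="http://localhost:30000"` to target a server on a different port.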

Use Docker

docker model run hf.co/brandonbaek/Bori-2-135M-Instruct

SGLang

How to use brandonbaek/Bori-2-135M-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "brandonbaek/Bori-2-135M-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "brandonbaek/Bori-2-135M-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "brandonbaek/Bori-2-135M-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "brandonbaek/Bori-2-135M-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use brandonbaek/Bori-2-135M-Instruct with Docker Model Runner:
```
docker model run hf.co/brandonbaek/Bori-2-135M-Instruct
```

🌾 Bori-2 135M Instruct (Checkpoint 1800)

🚀 Newer Version Available: The Bori project is currently developing Bori-3, which uses the SmolLM2-360M base and an upgraded response-only SFT loss pipeline to fix the instruction-following bugs present in this version. Check the GitHub Repository for the latest code.

Bori-2 135M Instruct is the Supervised Fine-Tuned (SFT) version of the Bori-2 Base model. It was designed to follow bilingual (Korean and English) instructions.

⚠️ Status: This training run was paused early at Checkpoint 1800 and is not fully converged. It is published for transparency, historical tracking, and research into SLM failure modes.

🤖 Model Details

Base Architecture: SmolLM2 (Llama-based)
Parameter Count: ~135M
Languages: Korean, English
Vocabulary Size: 49,152 (Base) + 8,981 (Korean tokens) = 58,133 tokens

💻 Hardware & Compute

Like the base model, SFT was performed under strict compute constraints:

Hardware: Kaggle Notebooks, 2x NVIDIA T4 GPUs (16GB VRAM each).
Optimization: Distributed multi-GPU training with Accelerate across both T4s, combined with gradient accumulation and FP16 mixed precision, to maximize the effective batch size.
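
As a rough illustration of how these settings combine, the effective batch size is the product of the per-device batch size, the number of GPUs, and the number of gradient accumulation steps. Only the GPU count (2) comes from this card; the per-device batch size and accumulation steps below are hypothetical placeholders, not the values used in the actual run.

```python
def effective_batch_size(per_device_batch, num_devices, grad_accum_steps):
    """Number of sequences that contribute to each optimizer step."""
    return per_device_batch * num_devices * grad_accum_steps


# num_devices=2 matches the 2x T4 setup; the other two values are assumptions.
print(effective_batch_size(per_device_batch=4, num_devices=2, grad_accum_steps=8))  # 64
```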

📚 Training Dataset

The SFT data mixture interleaved three datasets to balance English reasoning with Korean generation:

HuggingFaceH4/ultrachat_200k (50% - English conversational)
brandonbaek/konglish-synthetic-instruct (40% - Bilingual/Korean synthetic instructions)
jojo0217/korean_safe_conversation (10% - Korean safety & alignment)
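
One way to picture this mixture is as weighted sampling over the three sources. The sketch below uses only the standard library and is not the training code for this model; the helper `sample_sources` is hypothetical. With real data, a function such as `interleave_datasets` from the `datasets` library, called with a `probabilities` argument, could play a similar role.

```python
import random
from collections import Counter

# Sampling weights taken from the percentages listed above.
MIXTURE = {
    "HuggingFaceH4/ultrachat_200k": 0.5,
    "brandonbaek/konglish-synthetic-instruct": 0.4,
    "jojo0217/korean_safe_conversation": 0.1,
}


def sample_sources(n, mixture=MIXTURE, seed=0):
    """Draw n dataset names, each chosen with its mixture probability."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


# Empirical shares should land close to 50% / 40% / 10%.
counts = Counter(sample_sources(10_000))
for name, count in counts.most_common():
    print(f"{name}: {count / 10_000:.1%}")
```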

⚠️ Known Issues & Failure Modes

We are publishing this checkpoint specifically so the open-source community can study the dynamics of SFT on extreme SLMs when parameters and datasets are sub-optimal.

Collator Masking Bug (Response-Only Loss Failure): Standard instruction tuning uses "response-only loss": the loss is computed only on the assistant's response, and the user and system prompt tokens are excluded by setting their labels to -100. During this training run, a bug in the collation step stripped those masks: DataCollatorForLanguageModeling(mlm=False) rebuilds the labels from input_ids and masks only padding tokens, so the -100 ignore-index values on the user prompts were lost. As a result, the loss was computed over the entire sequence, which severely degraded instruction following and often caused the model to mimic user prompts rather than answer them.
Dataset Over-Complexity: The heavy reliance on ultrachat_200k (50% of the mixture) overwhelmed the model's 135M-parameter capacity. The multi-turn reasoning and long conversation histories in ultrachat_200k led to severe hallucinations, logical breakdowns, and basic arithmetic failures.
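
The masking failure described above can be sketched with plain Python lists. This is an illustration under stated assumptions, not the project's training code: the token ids are made up, and `rebuild_labels_from_inputs` is a hypothetical stand-in that mimics a causal-LM collator deriving labels from `input_ids` and masking only padding.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def response_only_labels(prompt_ids, response_ids):
    """Return (input_ids, labels) with every prompt position masked out."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def rebuild_labels_from_inputs(input_ids, pad_token_id):
    """Derive labels from input_ids, masking only padding (prompt mask is lost)."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


# Hypothetical token ids, for illustration only.
prompt_ids, response_ids = [11, 12, 13], [21, 22]
input_ids, labels = response_only_labels(prompt_ids, response_ids)

print(labels)                                     # [-100, -100, -100, 21, 22]
print(rebuild_labels_from_inputs(input_ids, 0))   # [11, 12, 13, 21, 22]
```

The second output shows the failure mode: once labels are rebuilt from the inputs, the prompt tokens are supervised again and the loss covers the whole sequence.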

🎯 Intended Use

This checkpoint is highly experimental and not recommended for application deployment. It serves as a case study in why models under 1B parameters need tailored, high-quality, and appropriately scaled SFT datasets, as well as rigorous testing of data collator masks.
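
A simple pre-training sanity check on collated batches might look like the sketch below. It is a hypothetical helper, not part of this project or of any library: it measures what fraction of non-padding tokens contribute to the loss and fails when that fraction is 1.0 (no prompt masking) or 0.0 (everything masked). It assumes each example contains at least one prompt token and one response token.

```python
IGNORE_INDEX = -100  # label value excluded from the loss


def supervised_fraction(labels, attention_mask):
    """Fraction of non-padding tokens whose label is not IGNORE_INDEX."""
    real = [lab for lab, keep in zip(labels, attention_mask) if keep == 1]
    if not real:
        raise ValueError("example contains no non-padding tokens")
    return sum(lab != IGNORE_INDEX for lab in real) / len(real)


def check_prompt_masking(labels, attention_mask):
    """Raise if the example is fully supervised or fully masked; else return the fraction."""
    frac = supervised_fraction(labels, attention_mask)
    if frac == 1.0:
        raise AssertionError("every real token is supervised; the prompt mask is missing")
    if frac == 0.0:
        raise AssertionError("no token is supervised; the response is fully masked")
    return frac


# Hypothetical example: 2 prompt tokens, 2 response tokens, 1 padding token.
print(check_prompt_masking([-100, -100, 21, 22, -100], [1, 1, 1, 1, 0]))  # 0.5
```

Running a check like this on a few collated batches before training starts is a cheap way to catch a stripped mask early.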