🌾 Bori-2 135M Base

🚀 Newer Version Available: The Bori project is currently developing Bori-3, which uses the SmolLM2-360M base and an upgraded response-only SFT loss pipeline. Check the GitHub Repository for the latest code.

Bori-2 135M Base is an experimental bilingual (Korean-English) Small Language Model (SLM) adapted from the highly efficient SmolLM2-135M architecture. It represents the final base model of the Bori-2 lineage, having successfully completed its full Continuous Pre-Training (CPT) pipeline (Checkpoint 10,000).

This model serves as a proof of concept for adapting very small, highly capable English-centric models to new languages under extreme compute constraints, using EEVE-style embedding initialization and a custom learning rate schedule.

🤖 Model Details

  • Base Architecture: SmolLM2 (Llama-based)
  • Parameter Count: ~135M
  • Languages: Korean, English
  • Vocabulary Size: 49,152 (Base) + 8,981 (Korean tokens) = 58,133 tokens
  • Context Length: 2048 tokens
  • License: Apache 2.0

💻 Hardware & Compute Constraints

A core goal of the Bori project is achieving meaningful language adaptation under strict free-tier compute limitations.

  • Hardware: Trained entirely on Kaggle Notebooks utilizing 2x NVIDIA T4 GPUs (16GB VRAM each).
  • Optimization: The training pipeline used PyTorch's native sdpa (Scaled Dot-Product Attention) for efficient attention on the Turing architecture, FP16 mixed precision, and gradient checkpointing (which trades extra recomputation for lower activation memory) so that training fit within the tight 16GB VRAM limit.

🛠️ Training Methodology (CPT)

The model was adapted via Continuous Pre-Training (CPT) using a two-phase approach designed to build Korean language ability without catastrophically degrading the base model's world knowledge and English representations.

1. Vocabulary Expansion (EEVE Initialization)

Pre-trained English-centric SLMs tokenize Korean prose very inefficiently, often splitting a single Hangul syllable into several byte-level tokens. To solve this, we trained a standalone Korean Byte-Level BPE tokenizer and merged it with the base tokenizer, adding 8,981 Korean tokens.
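To see why a byte-level tokenizer without Korean merges is so costly, note that every Hangul syllable occupies three bytes in UTF-8, so a tokenizer that falls back to bytes can spend up to three tokens per syllable. The short sketch below only counts UTF-8 bytes; it does not run the Bori tokenizer, and the helper name `utf8_bytes_per_char` is hypothetical.

```python
from typing import List


def utf8_bytes_per_char(text: str) -> List[int]:
    """Return the number of UTF-8 bytes used by each character in `text`."""
    return [len(ch.encode("utf-8")) for ch in text]


korean = "안녕하세요"   # five Hangul syllables
english = "hello"      # five ASCII letters

# Each Hangul syllable takes 3 bytes, each ASCII letter takes 1 byte.
print(utf8_bytes_per_char(korean))   # [3, 3, 3, 3, 3]
print(utf8_bytes_per_char(english))  # [1, 1, 1, 1, 1]
```

In the worst case, then, the five-syllable greeting costs 15 byte-level tokens, which is the overhead that dedicated Korean tokens are meant to remove.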

Crucially, in src/model.py, the newly added Korean token embeddings are not initialized randomly. Instead, we used the EEVE (Efficient and Effective Vocabulary Expansion) strategy, which initializes each new token's embedding as the mean of the embeddings of the subwords that the base tokenizer splits it into. This gives the model a good starting approximation and lowers the initial cross-entropy loss relative to random initialization.
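The mean-of-subwords initialization can be illustrated without any deep learning framework. The following is a minimal sketch, not the code in src/model.py: the function name `mean_init_embeddings`, the toy embedding table, and the toy tokenizer are all assumptions made for illustration. In a real pipeline the rows would be tensor slices of the model's embedding matrix.

```python
from typing import Callable, Dict, List, Sequence


def mean_init_embeddings(
    base_embeddings: Sequence[Sequence[float]],
    new_tokens: Sequence[str],
    base_tokenize: Callable[[str], List[int]],
) -> Dict[str, List[float]]:
    """For each new token, average the base embedding rows of the
    subword ids that the base tokenizer splits that token into."""
    dim = len(base_embeddings[0])
    init: Dict[str, List[float]] = {}
    for token in new_tokens:
        ids = base_tokenize(token)
        if not ids:
            raise ValueError(f"base tokenizer produced no ids for {token!r}")
        init[token] = [
            sum(base_embeddings[i][d] for i in ids) / len(ids)
            for d in range(dim)
        ]
    return init


# Toy example (hypothetical values): a 3-row, 2-dimensional embedding table
# and a stand-in tokenizer that splits the new token into base ids 0 and 1.
toy_table = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
toy_splits = {"한국": [0, 1]}
result = mean_init_embeddings(toy_table, ["한국"], lambda t: toy_splits[t])
print(result["한국"])  # [2.0, 1.0]
```

Because the new row starts inside the region of embedding space the model already uses, the first forward passes see plausible inputs rather than noise.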

2. Phase 1A: Embedding Warmup

  • Objective: Stabilize the newly added Korean token embeddings without distorting the pre-trained weights.
  • Duration: 1,000 steps
  • Data: 100% Korean text (HuggingFaceFW/fineweb-2:kor_Hang)
  • Parameters: Backbone frozen; only the embedding layer and LM head were trained.
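The freezing step for this phase might look like the sketch below. It assumes the model exposes a PyTorch-style `named_parameters()` method and that the embedding and LM head parameters contain `embed_tokens` and `lm_head` in their names, as in the Hugging Face Llama implementation; neither detail is confirmed by this card, and `freeze_backbone` is a hypothetical helper.

```python
from typing import Iterable, List


def freeze_backbone(
    model,
    trainable_keywords: Iterable[str] = ("embed_tokens", "lm_head"),
) -> List[str]:
    """Disable gradients for every parameter except those whose name
    contains one of `trainable_keywords`. Returns the trainable names."""
    keywords = tuple(trainable_keywords)
    trainable: List[str] = []
    for name, param in model.named_parameters():
        # Assumption: parameter names follow the Llama naming convention.
        param.requires_grad = any(key in name for key in keywords)
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

If the input and output embeddings are tied, only the shared embedding parameter appears in `named_parameters()`, so the returned list may contain a single name. Calling the same helper with keywords that match every parameter, or simply setting `requires_grad = True` everywhere, restores full training for the next phase.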

3. Phase 1B: Full CPT (WSD Scheduler)

  • Objective: Deep language acquisition and alignment.
  • Duration: 10,000 steps (Final Checkpoint)
  • Data Mixture: 90% Korean (fineweb-2:kor_Hang) and 10% English replay (fineweb-edu-dedup) to prevent catastrophic forgetting.
  • Parameters: All model parameters unfrozen.
  • Scheduler: A custom PyTorch Warmup-Stable-Decay (WSD) scheduler held the learning rate at its peak for most of training to maximize optimizer progress over high-entropy web text, then decayed it down to 10% to consolidate the weights.
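A Warmup-Stable-Decay schedule can be written as a multiplier on the peak learning rate. The sketch below is an illustration rather than the project's scheduler: only the 10,000 total steps and the 10% figure come from this card, the 10% is read here as the final fraction of the peak learning rate, and the warmup length, decay length, and linear shapes are assumptions.

```python
def wsd_lr_factor(
    step: int,
    total_steps: int = 10_000,   # from the card
    warmup_steps: int = 500,     # assumption
    decay_steps: int = 1_000,    # assumption
    min_factor: float = 0.1,     # "10%", read as a fraction of peak LR
) -> float:
    """Return the learning-rate multiplier in [0, 1] for a given step."""
    # Warmup: ramp linearly from 0 to the peak.
    if warmup_steps > 0 and step < warmup_steps:
        return step / warmup_steps
    # Stable: hold the peak until the decay window begins.
    decay_start = total_steps - decay_steps
    if decay_steps <= 0 or step < decay_start:
        return 1.0
    # Decay: fall linearly from the peak to min_factor.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return 1.0 - (1.0 - min_factor) * progress
```

A function of this shape can be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, which multiplies the optimizer's base learning rate by the returned factor at each step.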

⚠️ Limitations & Intended Use

  • Not an Instruct Model: This is a base completion model. It has not undergone Supervised Fine-Tuning (SFT) or RLHF and will not follow instructions out of the box. Please see the Bori-2 Instruct model for chat capabilities.
  • Reasoning Capacity: At only 135M parameters, the model's capacity for complex reasoning, logic, or deep factual recall is inherently limited.
  • Intended Use: This model is published for researchers and developers interested in SLM vocabulary expansion, extreme compute-constrained training, and bilingual adaptation methodologies. It serves as an excellent, computationally cheap base for downstream Korean fine-tuning.
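Since this is a base completion model, the natural way to try it is plain text continuation. The following is a minimal sketch that assumes the repository brandonbaek/Bori-2-135M-Base loads through the standard transformers Auto classes; the helper name `complete` and the generation settings are illustrative choices, not recommendations from the authors.

```python
MODEL_ID = "brandonbaek/Bori-2-135M-Base"


def complete(prompt: str, max_new_tokens: int = 64) -> str:
    """Greedily continue `prompt` with the base model and return the text."""
    # Imported lazily so that defining this helper does not itself
    # require transformers and torch to be installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding for reproducible output
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the weights on first use):
# print(complete("대한민국의 수도는"))
```

Prompts should be written as the beginning of a document to continue, not as an instruction, because the model has not been tuned to follow requests.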