🐻 Gumini-1.5B (구미니)

Built with Qwen

About 5,700× fewer training tokens, lower Korean perplexity.
Gumini-1.5B reaches a Korean perplexity (PPL) of 8.49 after only 3.14B tokens of continued pretraining, ahead of Qwen-2.5-1.5B (18T tokens, PPL 8.84).

🔥 Key Results

Model	Params	Training Tokens	Korean PPL ↓	Rank
Qwen-2.5-7B	7.62B	18T	6.39	#1
Gemma-2B	2.0B	2T	8.15	#2
Gumini-1.5B (Ours)	1.54B	3.14B	8.49	#3
Qwen-2.5-1.5B	1.5B	18T	8.84	#4
Llama-3.2-3B	3.21B	9T	9.47	#5
EXAONE-3.5-2.4B	2.4B	~6.5T	9.80	#6
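
The card does not say which evaluation corpus or tokenization the PPL column uses. As a reminder of what the metric measures, here is a minimal sketch that assumes the standard token-level definition: the exponential of the mean negative log-likelihood. The log-probabilities are made-up illustration values, not model outputs.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood) over the scored tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical natural-log probabilities for four tokens (illustration only).
example_log_probs = [math.log(0.25), math.log(0.5), math.log(0.125), math.log(0.25)]
example_ppl = perplexity(example_log_probs)
print(f"{example_ppl:.3f}")
```

Lower is better: a PPL of 8.49 means the model is, on average, about as uncertain as a uniform choice among roughly 8.5 tokens.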

📊 Data Efficiency

Compared Model	Their Tokens	Gumini Tokens	Token Ratio
Qwen-2.5	18T	3.14B	5,732× fewer
Llama-3.2	9T	3.14B	2,866× fewer
EXAONE-3.5	~6.5T	3.14B	~2,070× fewer
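
The ratios in this table are plain divisions of the token counts listed above, and the short check below reproduces them. Note that the Gumini figure counts only the continued-pretraining tokens; the model also inherits weights from Qwen 2.5 3B.

```python
GUMINI_TOKENS = 3.14e9  # continued-pretraining tokens reported in this card

reference_tokens = {
    "Qwen-2.5": 18e12,
    "Llama-3.2": 9e12,
    "EXAONE-3.5": 6.5e12,  # listed as approximate (~6.5T) in the table
}

# Ratio of each reference model's token count to Gumini's.
ratios = {name: tokens / GUMINI_TOKENS for name, tokens in reference_tokens.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:,.0f}x fewer tokens")
```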

Model Description

Gumini-1.5B (구미니) is a bilingual Korean-English base language model trained using the Inheritune methodology. Starting from Qwen 2.5 3B, the model progressively grew from 10 to 16 layers through 7 training stages, with ~3.14B tokens of continued pretraining on a Korean–English mixed corpus.

This is a BASE model, not instruction-tuned.
It produces text continuations rather than conversational responses.

Training Highlights

Inheritune Progressive Layer Growing

Stage 0: 10 layers (1.08B) → 393M tokens
Stage 1: 11 layers (1.15B) → 393M tokens
Stage 2: 12 layers (1.23B) → 393M tokens
Stage 3: 13 layers (1.31B) → 393M tokens
Stage 4: 14 layers (1.39B) → 393M tokens
Stage 5: 15 layers (1.47B) → 393M tokens
Stage 6: 16 layers (1.54B) → 786M tokens ⭐
────────────────────────────────────────────
Total: 16 layers, 1.54B params, ~3.14B tokens
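
As a quick sanity check on the schedule above, the per-stage token counts can be summed directly. The variable names are illustrative and not taken from the training code.

```python
# Tokens per stage in millions, copied from the schedule above.
stage_tokens_m = {0: 393, 1: 393, 2: 393, 3: 393, 4: 393, 5: 393, 6: 786}

total_tokens_m = sum(stage_tokens_m.values())
final_stage_share = stage_tokens_m[6] / total_tokens_m

print(f"total: {total_tokens_m / 1000:.2f}B tokens")
print(f"final stage share: {final_stage_share:.0%}")
```

The final 16-layer stage receives twice the budget of each earlier stage, which works out to a quarter of all training tokens.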

Model Details

Attribute	Value
Researcher	Gumin Kwon (권구민)
Base Model	Qwen/Qwen2.5-3B
Training Method	Inheritune + Pretraining
Parameters	1.54B
Layers	16
Hidden Size	2048
Attention Heads	16
KV Heads	2 (GQA)
Vocab Size	151,936
Total Tokens Trained	~3.14B
Precision	BF16
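
To illustrate what the GQA setting means for inference memory, the sketch below estimates the KV-cache size per token from the values in this table. It assumes the head dimension equals hidden size divided by attention heads (128 here), which the card does not state explicitly, and it ignores any framework overhead.

```python
HIDDEN_SIZE = 2048
NUM_LAYERS = 16
NUM_ATTENTION_HEADS = 16
NUM_KV_HEADS = 2
BYTES_PER_VALUE = 2  # BF16

# Assumption: head_dim = hidden_size / num_attention_heads.
head_dim = HIDDEN_SIZE // NUM_ATTENTION_HEADS
queries_per_kv_head = NUM_ATTENTION_HEADS // NUM_KV_HEADS

def kv_cache_bytes_per_token(num_kv_heads):
    # Factor 2 covers one key and one value vector per head, per layer.
    return 2 * NUM_LAYERS * num_kv_heads * head_dim * BYTES_PER_VALUE

gqa_bytes = kv_cache_bytes_per_token(NUM_KV_HEADS)
mha_bytes = kv_cache_bytes_per_token(NUM_ATTENTION_HEADS)  # hypothetical non-GQA variant

print(f"GQA: {gqa_bytes / 1024:.0f} KiB per token")
print(f"Full multi-head: {mha_bytes / 1024:.0f} KiB per token")
```

With 2 KV heads shared by 16 query heads, the cache is 8× smaller than it would be with one KV head per query head.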

Training Data

Dataset	Language	Weight
FineWeb-Edu (sample-10BT)	English	20%
CulturaX-ko	Korean	50%
Wikipedia-ko	Korean	30%

Total: 80% Korean, 20% English
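
If the weights above are read as token-level mixing proportions (an assumption; the card does not say whether they apply to tokens, documents, or sampling probability), the approximate token budget per dataset follows from the 3.14B total:

```python
TOTAL_TOKENS = 3.14e9

# (language, weight in percent), copied from the table above.
mixture = {
    "FineWeb-Edu (sample-10BT)": ("English", 20),
    "CulturaX-ko": ("Korean", 50),
    "Wikipedia-ko": ("Korean", 30),
}

token_budget = {name: TOTAL_TOKENS * pct / 100 for name, (_, pct) in mixture.items()}
korean_pct = sum(pct for lang, pct in mixture.values() if lang == "Korean")

for name, tokens in token_budget.items():
    print(f"{name}: ~{tokens / 1e9:.2f}B tokens")
print(f"Korean share: {korean_pct}%")
```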

Optimization

learning_rate: 2.0e-4
weight_decay: 0.1
lr_scheduler: cosine
warmup_ratio: 0.01
max_grad_norm: 1.0
precision: bf16
gradient_checkpointing: true
attention: PyTorch SDPA (Flash Attention)
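
For intuition about the learning-rate settings, here is a minimal sketch of a cosine schedule with warmup. It assumes linear warmup from 0 and decay to 0 at the final step, and uses a hypothetical step count; the card does not give the step count or say whether the schedule restarts at each stage.

```python
import math

PEAK_LR = 2.0e-4
WARMUP_RATIO = 0.01

def lr_at(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 10_000  # hypothetical, for illustration only
for step in (0, 50, 100, 5_050, 10_000):
    print(step, f"{lr_at(step, TOTAL_STEPS):.2e}")
```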

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "GuminiResearch/Gumini-1.5B-Base",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("GuminiResearch/Gumini-1.5B-Base")

prompt = "저는 구미니입니다."  # Korean for "I am Gumini."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    repetition_penalty=1.2,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Using Pipeline

from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="GuminiResearch/Gumini-1.5B-Base",
    torch_dtype="bfloat16",
    device_map="auto",
)

output = generator(
    "저는 구미니입니다.",  # Korean for "I am Gumini."
    max_new_tokens=100,
    do_sample=True,  # enable sampling so that temperature is applied
    temperature=0.7,
    repetition_penalty=1.2,
)
print(output[0]["generated_text"])

Evaluation

Stage	Layers	Parameters
0	10	1.08B
5	15	1.47B
6	16	1.54B

Model Family

Model	Layers	Params	Tokens	Status
Gumini-1B	10	1.08B	393M	✅ Released
Gumini-1.5B	16	1.54B	3.14B	✅ This Model

Limitations

Base model: No instruction-tuning or safety alignment
High repetition risk: Use repetition_penalty >= 1.2
May generate incorrect or outdated information
Should not be used in sensitive or safety-critical contexts
Knowledge is limited to what appears in the training data
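
To show what the recommended `repetition_penalty` does, the sketch below applies the rule from the CTRL paper, which is, to my understanding, the rule Transformers uses for this parameter: logits of tokens already present in the sequence are divided by the penalty when positive and multiplied by it when negative. The function name and logit values are illustrative only.

```python
def apply_repetition_penalty(logits, previous_token_ids, penalty=1.2):
    """Return a copy of `logits` with already-seen tokens made less likely."""
    penalized = list(logits)
    for token_id in set(previous_token_ids):
        score = penalized[token_id]
        # Both branches push the score down, whatever its sign.
        penalized[token_id] = score * penalty if score < 0 else score / penalty
    return penalized

logits = [2.4, -1.0, 0.6, 3.0]  # hypothetical scores for a 4-token vocabulary
adjusted = apply_repetition_penalty(logits, previous_token_ids=[0, 1, 0])
print(adjusted)
```

Tokens 0 and 1 have already been generated, so their scores drop, while tokens 2 and 3 are left untouched. A penalty of 1.0 disables the effect.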

License

Qwen Research License (Non-Commercial)

This model is Built with Qwen and derived from Qwen 2.5 3B.

Qwen is licensed under the Qwen RESEARCH LICENSE AGREEMENT.
Copyright (c) Alibaba Cloud. All Rights Reserved.

This model is for NON-COMMERCIAL / RESEARCH use only.
For commercial use, contact Alibaba Cloud.

References

Inheritune Paper

@inproceedings{Sanyal2024inheritune,
  title={Inheritune: Training Smaller Yet More Attentive Language Models},
  author={Sunny Sanyal and Ravid Shwartz-Ziv and Alexandros G. Dimakis and Sujay Sanghavi},
  year={2024},
  url={https://arxiv.org/abs/2404.08634}
}

Qwen 2.5

@misc{qwen2.5,
  title={Qwen2.5: A Party of Foundation Models},
  author={Qwen Team},
  year={2024},
  url={https://qwenlm.github.io/blog/qwen2.5/}
}

Citation

@misc{gumini2025,
  title={Gumini-1.5B: Bilingual Korean-English Language Model via Inheritune},
  author={Gumin Kwon},
  year={2025},
  note={Built with Qwen. Trained with Inheritune progressive layer growing.},
  url={https://huggingface.co/GuminiResearch/Gumini-1.5B-Base}
}