Instructions for using tencent/Youtu-LLM-2B with libraries and local apps.

Libraries

How to use tencent/Youtu-LLM-2B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="tencent/Youtu-LLM-2B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tencent/Youtu-LLM-2B")
model = AutoModelForCausalLM.from_pretrained("tencent/Youtu-LLM-2B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use tencent/Youtu-LLM-2B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "tencent/Youtu-LLM-2B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tencent/Youtu-LLM-2B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
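
For readers who prefer Python to curl, the same request can be sketched with only the standard library. This is a minimal sketch, assuming a server is already listening on `http://localhost:8000` as started above; the helper names `build_chat_request` and `chat` are hypothetical, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build the same POST request that the curl example sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body; adjust the parsing if the
    server returns a different shape.
    """
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("http://localhost:8000", "tencent/Youtu-LLM-2B", "What is the capital of France?")` should return the reply as a string. Changing the base URL to port 30000 targets an SGLang server in the same way.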

Use Docker

docker model run hf.co/tencent/Youtu-LLM-2B

SGLang

How to use tencent/Youtu-LLM-2B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "tencent/Youtu-LLM-2B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tencent/Youtu-LLM-2B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "tencent/Youtu-LLM-2B" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner

How to use tencent/Youtu-LLM-2B with Docker Model Runner:

```
docker model run hf.co/tencent/Youtu-LLM-2B
```

Small Models, Real Intelligence: Edge AI Moves From Phones to Robots

by Javedalam - opened Jan 5 (edited Jan 5)

Edge AI is finally becoming practical, not as a buzzword, but as something you can actually run, test, and depend on. Instead of shipping data to the cloud and waiting for answers, intelligence is moving to the edge: phones, embedded systems, and eventually machines that need to think for themselves in real time.

I recently got Tencent’s Youtu-LLM-2B running locally on a OnePlus 8 phone using a quantized GGUF build (Q5_K_M) with a custom-compiled llama.cpp. At roughly 2 billion parameters, the model is small enough to run on a pocket device, yet structured enough to feel like “real” intelligence rather than a toy demo. The 5-bit quantization keeps the footprint reasonable while preserving most of the model’s accuracy, which matters when you’re running everything on-device.
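
To put a rough number on "reasonable footprint", here is a back-of-the-envelope sketch of the weight memory at 16-bit versus Q5. It assumes exactly 2 billion parameters and about 5.5 bits per weight for the Q5 build; both figures are approximations of mine, not measurements from the phone, and the estimate ignores the KV cache, activations, and any tensors kept at higher precision.

```python
def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


N_PARAMS = 2e9   # assumption: "roughly 2 billion" taken as exactly 2e9
FP16_BITS = 16   # unquantized half-precision baseline
Q5_BITS = 5.5    # assumption: approximate effective bits per weight for Q5

fp16_gib = weight_footprint_gib(N_PARAMS, FP16_BITS)
q5_gib = weight_footprint_gib(N_PARAMS, Q5_BITS)

print(f"FP16 weights: {fp16_gib:.2f} GiB")
print(f"Q5 weights:   {q5_gib:.2f} GiB")
print(f"Reduction:    {fp16_gib / q5_gib:.1f}x")
```

Under these assumptions the weights shrink from about 3.7 GiB to about 1.3 GiB, which is the difference between crowding a phone's RAM and fitting comfortably alongside the OS.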

What makes this model interesting is not chatty general intelligence but the strong STEM orientation of its training. It’s not a hardcore symbolic math engine, but it performs very well on text-based STEM reasoning problems, multi-step explanations, and structured technical prompts. In quick tests, it stays coherent, focused, and surprisingly disciplined for its size.

This is where Edge AI gets bigger than phones. Pocket devices are just the entry point. The real payoff is onboard intelligence: models running directly on robots, drones, and autonomous systems without a permanent cloud connection. If a robot has to wait for the internet to think, it’s not really autonomous. Models like Youtu-LLM-2B show that you can embed useful reasoning directly on the machine, close to sensors and actuators, with predictable latency and no external dependency.
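
"Predictable latency" is something you can check on the device itself. Below is a minimal sketch of a timing harness; `measure_latency` and its `run_once` argument are hypothetical names, and `run_once` stands in for whatever call triggers one local generation (for example, a single request to an on-device server). The 95th percentile uses a simple nearest-index estimate rather than interpolation.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(
    run_once: Callable[[], object],
    repeats: int = 20,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Time repeated calls and summarize how consistent they are."""
    samples: List[float] = []
    for _ in range(repeats):
        start = clock()
        run_once()
        samples.append(clock() - start)

    samples.sort()
    # Nearest-index estimate of the 95th percentile.
    p95_index = min(len(samples) - 1, int(round(0.95 * (len(samples) - 1))))
    return {
        "median_s": statistics.median(samples),
        "p95_s": samples[p95_index],
        "max_s": samples[-1],
    }
```

A small gap between the median and the maximum is what "predictable" looks like in practice; a cloud round trip tends to widen that gap because network conditions vary from call to call.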

The takeaway is simple. The future is not dominated by massive, general models trying to do everything poorly. It belongs to focused, efficient models that do a smaller set of things well, running locally where decisions actually happen. Edge AI isn’t coming; it’s already running in your pocket, and soon it will be thinking on board robots as well.

The prompts and model answers are here:

https://fate-stingray-0b3.notion.site/Youtu-LLM-2B-2B-Parameters-GGUF-Q5_K_M-Edge-IS-STEM-Reasoning-Evaluation-2df3b975deec80868fb7fd2048336f54
