Small Models, Real Intelligence: Edge AI Moves From Phones to Robots
Edge AI is finally becoming practical, not as a buzzword, but as something you can actually run, test, and depend on. Instead of shipping data to the cloud and waiting for answers, intelligence is moving to the edge: phones, embedded systems, and eventually machines that need to think for themselves in real time.
I recently got Tencent’s Youtu-LLM-2B running locally on a OnePlus 8 phone, using a quantized GGUF build (Q5) and a custom-compiled build of llama.cpp. This is a roughly 2-billion-parameter model, small enough to run on a pocket device, yet structured enough to feel like “real” intelligence rather than a toy demo. The Q5 quantization keeps the memory footprint reasonable while, in my experience, preserving most of the model’s accuracy, which matters when everything runs on-device.
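To put the footprint claim in rough numbers, here is a back-of-the-envelope sketch. It assumes exactly 2 billion weights and about 5.5 bits per weight, which is the block layout of llama.cpp's Q5_0 format (32 weights stored in 22 bytes). I have not pinned down which Q5 variant applies here, and a real GGUF file also carries metadata and may keep some tensors at higher precision, so treat the result as an estimate, not a measurement.

```python
def weights_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# Assumption: "roughly 2 billion parameters" taken as exactly 2e9.
N_PARAMS = 2e9

# Assumption: 5.5 bits/weight (Q5_0 block: 2-byte scale + 4 bytes of
# high bits + 16 bytes of low nibbles = 22 bytes per 32 weights).
Q5_BITS = 22 * 8 / 32

fp16_gib = weights_size_gib(N_PARAMS, 16.0)
q5_gib = weights_size_gib(N_PARAMS, Q5_BITS)

print(f"FP16 weights: {fp16_gib:.2f} GiB")
print(f"Q5 weights:   {q5_gib:.2f} GiB")
print(f"Reduction:    {fp16_gib / q5_gib:.2f}x")
```

Under these assumptions the quantized weights come to a bit under a third of the FP16 size. The KV cache and runtime buffers add to this at inference time, so the actual memory use on the phone will be higher.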
What makes this model interesting is not chatty general intelligence but its strong STEM orientation. It’s not a hardcore symbolic math engine, but it performs very well on text-based STEM reasoning problems, multi-step explanations, and structured technical prompts. In quick tests, it stays coherent, focused, and surprisingly disciplined for its size.
This is where Edge AI gets bigger than phones. Pocket devices are just the entry point. The real payoff is onboard intelligence: models running directly on robots, drones, and autonomous systems without a permanent cloud connection. If a robot has to wait for the internet to think, it’s not really autonomous. Models like Youtu-LLM-2B show that you can embed useful reasoning directly on the machine, close to sensors and actuators, with predictable latency and no external dependency.
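One way to picture "predictable latency" on a robot is a decision step with a hard deadline: ask the on-board model, and if it does not answer in time, fall back to a safe default. The sketch below is purely illustrative. `decide_with_deadline`, the stub models, and the `"stop"` fallback are hypothetical names of mine, not part of llama.cpp or Youtu-LLM; `infer` stands in for whatever local inference call you actually use.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def decide_with_deadline(infer, prompt, deadline_s, fallback):
    """Call a local inference function; return `fallback` if it misses the deadline."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(infer, prompt)
    try:
        return future.result(timeout=deadline_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block on a late answer; the worker thread finishes in the background.
        pool.shutdown(wait=False)


# Hypothetical stand-ins for an on-device model call.
def fast_model(prompt):
    return "turn left"


def slow_model(prompt):
    time.sleep(0.5)  # simulates a model that misses the deadline
    return "turn left"


on_time = decide_with_deadline(fast_model, "obstacle ahead", 1.0, "stop")
too_late = decide_with_deadline(slow_model, "obstacle ahead", 0.05, "stop")
print(on_time, "|", too_late)
```

The point of the pattern is that the deadline depends only on local compute, so it can be tuned and tested on the device itself, with no network variance in the loop.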
The takeaway is simple. The future is not dominated by massive, general models trying to do everything poorly. It belongs to focused, efficient models that do a smaller set of things well, running locally where decisions actually happen. Edge AI isn’t coming; it’s already running in your pocket, and soon it’ll be thinking on board robots as well.
The prompt and model answer are here


