Instructions to use Tiiny/SmallThinker-3B-Preview with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Tiiny/SmallThinker-3B-Preview with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Tiiny/SmallThinker-3B-Preview")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Tiiny/SmallThinker-3B-Preview")
model = AutoModelForCausalLM.from_pretrained("Tiiny/SmallThinker-3B-Preview")

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
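Both paths accept the usual generation keyword arguments if you want more control over decoding. A minimal sketch via the pipeline, where the sampling values are illustrative assumptions rather than recommended settings for this model:

```python
# Sketch: pass decoding parameters through the pipeline call
# (values below are illustrative, not tuned for this model)
from transformers import pipeline

pipe = pipeline("text-generation", model="Tiiny/SmallThinker-3B-Preview")
messages = [
    {"role": "user", "content": "Explain in two sentences why the sky is blue."},
]
out = pipe(messages, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
# With chat-style input, recent transformers versions return the whole
# conversation; the last entry is the assistant's reply.
print(out[0]["generated_text"][-1])
```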
- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Tiiny/SmallThinker-3B-Preview with vLLM:
Install from pip and serve the model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Tiiny/SmallThinker-3B-Preview"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Tiiny/SmallThinker-3B-Preview",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
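Since the server exposes an OpenAI-compatible API, you can also call it from Python. A minimal sketch, assuming the default port 8000 and the `openai` client package (`pip install openai`):

```python
# Query the locally running vLLM server through its OpenAI-compatible API
from openai import OpenAI

# The API key is not checked by default; any placeholder string works
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Tiiny/SmallThinker-3B-Preview",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```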
Use Docker

```bash
docker model run hf.co/Tiiny/SmallThinker-3B-Preview
```
- SGLang
How to use Tiiny/SmallThinker-3B-Preview with SGLang:
Install from pip and serve the model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Tiiny/SmallThinker-3B-Preview" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Tiiny/SmallThinker-3B-Preview",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
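The same OpenAI-compatible endpoint can be called from Python without any special client. A minimal sketch using the `requests` library, assuming the server is listening on the default port 30000:

```python
# Query the locally running SGLang server over its OpenAI-compatible API
import requests

payload = {
    "model": "Tiiny/SmallThinker-3B-Preview",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}
resp = requests.post("http://localhost:30000/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```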
Use Docker images

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Tiiny/SmallThinker-3B-Preview" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Tiiny/SmallThinker-3B-Preview",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

- Docker Model Runner
How to use Tiiny/SmallThinker-3B-Preview with Docker Model Runner:
```bash
docker model run hf.co/Tiiny/SmallThinker-3B-Preview
```
Training: Second Phase
Hi, what is the difference between PowerInfer/LONGCOT-Refine-500K and PowerInfer/QWQ-LONGCOT-500K? Why was PowerInfer/LONGCOT-Refine-500K added in the second phase? Was PowerInfer/QWQ-LONGCOT-500K alone not enough?
If we want to replicate the result with a 7B model, do we need to train on both datasets in a single run?
Greetings
good questions
Related to mine: is training on CPU only possible?
I want to know more details about training. Is there any difference between training an inference model and fine-tuning a general model? Or can it be achieved by simply following the steps for fine-tuning a model but using different training datasets?
For more challenging questions, QWQ tends to use longer chains of thought. For example, in QWQ-LONGCOT-500K most of the answers exceed 8K, and most of the questions are related to mathematics and code. To add other domains and to construct some shorter responses, we built LONGCOT-Refine-500K and then used the two datasets together for the second stage of SFT.
How was LONGCOT-Refine-500K constructed? First with QWQ and then refined with Qwen2.5-72B into shorter responses?
The LONGCOT-Refine-500K dataset was constructed using two approaches:
For math and logical reasoning problems, we first used QWQ to generate initial responses, then refined them using Qwen2.5-72B-Instruct.
For open-ended tasks (like report writing), we used an example-guided approach: we provided a QWQ-generated response (to a different problem) as a format reference, then had Qwen2.5-72B directly generate new responses following that format.
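A rough sketch of what these two construction paths could look like in code. Here `generate_with` is a hypothetical helper standing in for whatever inference client was actually used, and the prompt wording is illustrative, not the real prompts:

```python
# Illustrative sketch of the two construction paths described above.

def generate_with(model_name: str, prompt: str) -> str:
    """Hypothetical stand-in for whatever inference client you use."""
    raise NotImplementedError

def build_refined_example(problem: str, domain: str, qwq_reference: str) -> dict:
    if domain in {"math", "logic"}:
        # Path 1: QWQ drafts a long chain-of-thought answer,
        # then Qwen2.5-72B-Instruct refines it into a shorter response.
        draft = generate_with("QWQ", problem)
        refined = generate_with(
            "Qwen2.5-72B-Instruct",
            f"Rewrite the following reasoning more concisely while keeping it correct:\n{draft}",
        )
    else:
        # Path 2 (open-ended tasks): show a QWQ response to a *different*
        # problem as a format reference and let Qwen2.5-72B generate directly.
        refined = generate_with(
            "Qwen2.5-72B-Instruct",
            f"Follow the style and structure of this example:\n{qwq_reference}\n\nNow answer:\n{problem}",
        )
    return {"instruction": problem, "output": refined}
```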