How to use Cylingo/Xinyuan-LLM-14B-0428 with libraries and local apps.

Libraries

How to use Cylingo/Xinyuan-LLM-14B-0428 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Cylingo/Xinyuan-LLM-14B-0428")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Print the result so the output is visible when run as a script
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Cylingo/Xinyuan-LLM-14B-0428")
model = AutoModelForCausalLM.from_pretrained("Cylingo/Xinyuan-LLM-14B-0428")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Cylingo/Xinyuan-LLM-14B-0428 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Cylingo/Xinyuan-LLM-14B-0428"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Cylingo/Xinyuan-LLM-14B-0428",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
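
The same request can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request`, `extract_reply`, and `ask` are hypothetical, and the response shape assumes the standard OpenAI chat-completions format (`choices[0].message.content`).

```python
import json
import urllib.request

MODEL = "Cylingo/Xinyuan-LLM-14B-0428"


def build_chat_request(prompt, base_url="http://localhost:8000/v1"):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text from a parsed chat-completions response."""
    return response_body["choices"][0]["message"]["content"]


def ask(prompt, base_url="http://localhost:8000/v1"):
    """Send the prompt to a running server and return the reply text."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request) as response:
        return extract_reply(json.load(response))
```

With the server running, `ask("What is the capital of France?")` returns the model's reply as a string. For a server listening on another port, pass a different `base_url`, for example `http://localhost:30000/v1`.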


SGLang

How to use Cylingo/Xinyuan-LLM-14B-0428 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Cylingo/Xinyuan-LLM-14B-0428" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Cylingo/Xinyuan-LLM-14B-0428",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Cylingo/Xinyuan-LLM-14B-0428" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Cylingo/Xinyuan-LLM-14B-0428",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use Cylingo/Xinyuan-LLM-14B-0428 with Docker Model Runner:
```
docker model run hf.co/Cylingo/Xinyuan-LLM-14B-0428
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Xinyuan-LLM-14B-0428

🤗 Hugging Face | 🤖 ModelScope

Xinyuan-LLM-14B-0428 Highlights

Xinyuan-LLM-14B-0428 is the first foundation model for the mental health industry, launched by Cylingo Group. Built on Qwen3-14B, it has been fine-tuned on millions of data points across diverse scenarios in the field.

The First All-Scenario Mental Health Support Foundation Model with 24/7 Intelligent Capabilities
Covering Diverse Mental Health Scenarios and Building Personalized Psychological Profiles
Resolving Multiple Parenting Challenges with Customized Family Companion Solutions

Quickstart

For deployment, you can use sglang>=0.4.6.post1 or vllm>=0.8.5 to create an OpenAI-compatible API endpoint:

SGLang:

python -m sglang.launch_server --model-path Cylingo/Xinyuan-LLM-14B-0428

vLLM:

vllm serve Cylingo/Xinyuan-LLM-14B-0428

For local use, applications such as Ollama, LM Studio, MLX-LM, llama.cpp, and KTransformers also support Qwen3, the model family this model is built on.

For non-thinking mode, we suggest using Temperature=0.8, TopP=0.8, TopK=20, and MinP=0. For more detailed guidance, please refer to the Best Practices section.
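
As a sketch of how these settings might be attached to a request for an OpenAI-compatible server: `temperature` and `top_p` are standard chat-completions fields, while `top_k` and `min_p` are treated here as server-side extensions that the serving framework is assumed to accept in the request body, so check the documentation for your server version. The names `SUGGESTED_SAMPLING` and `chat_payload` are hypothetical.

```python
import json

# Suggested non-thinking-mode settings from the paragraph above.
SUGGESTED_SAMPLING = {
    "temperature": 0.8,
    "top_p": 0.8,
    "top_k": 20,   # assumption: non-standard field accepted as an extension
    "min_p": 0.0,  # assumption: non-standard field accepted as an extension
}


def chat_payload(prompt, model="Cylingo/Xinyuan-LLM-14B-0428", **overrides):
    """Build a chat-completions body with the suggested sampling settings."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    payload.update(SUGGESTED_SAMPLING)
    payload.update(overrides)  # caller-supplied values take precedence
    return payload


print(json.dumps(chat_payload("Who are you?"), indent=2))
```

Keeping the defaults in one dictionary makes it easy to override a single value per request, for example `chat_payload("Who are you?", temperature=0.6)`.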

All notable open-source frameworks implement static YaRN: the scaling factor stays constant regardless of input length, which can degrade performance on shorter texts. We advise adding the rope_scaling configuration only when long-context processing is required, and adjusting the factor to your workload. For example, if the typical context length for your application is 65,536 tokens, set factor to 2.0 (65,536 divided by the 32,768-token native context of the Qwen3-14B base).
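
The factor can be derived from the target context length. The sketch below assumes a native context of 32,768 tokens (the value for the Qwen3-14B base) and the `rope_scaling` key layout used in Qwen3's documentation; neither is stated in this model card, and the function name is hypothetical.

```python
NATIVE_CONTEXT = 32768  # assumption: native context length of the Qwen3-14B base


def yarn_rope_scaling(target_context, native_context=NATIVE_CONTEXT):
    """Return a rope_scaling entry for the model config, or None if not needed."""
    if target_context <= native_context:
        # Static YaRN can hurt short-text quality, so leave scaling off.
        return None
    return {
        "rope_type": "yarn",
        "factor": target_context / native_context,
        "original_max_position_embeddings": native_context,
    }


print(yarn_rope_scaling(65536))
# {'rope_type': 'yarn', 'factor': 2.0, 'original_max_position_embeddings': 32768}
```

Returning `None` for targets within the native window reflects the advice above to add the configuration only when long contexts are actually required.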

Unlike Qwen3, Xinyuan-LLM-14B-0428 does not include a hybrid thinking mode. For now, we recommend that users stick to the standard mode. We plan to gradually introduce related features to the community in the future.