Instructions for using ThingAI/Quark-270m-Instruct with libraries and local apps.

Libraries

How to use ThingAI/Quark-270m-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ThingAI/Quark-270m-Instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("ThingAI/Quark-270m-Instruct", trust_remote_code=True, dtype="auto")

vLLM

How to use ThingAI/Quark-270m-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ThingAI/Quark-270m-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ThingAI/Quark-270m-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
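
The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes a server started as above is listening on `http://localhost:8000` and that it returns the usual OpenAI-compatible response shape. The helper names `build_chat_request` and `chat` are hypothetical, chosen for illustration.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_message: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, user_message: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible servers place the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("http://localhost:8000", "ThingAI/Quark-270m-Instruct", "What is the capital of France?")` should return the generated reply as a string. For the SGLang server, the base URL would be `http://localhost:30000` instead.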

Use Docker

docker model run hf.co/ThingAI/Quark-270m-Instruct

SGLang

How to use ThingAI/Quark-270m-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ThingAI/Quark-270m-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ThingAI/Quark-270m-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ThingAI/Quark-270m-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ThingAI/Quark-270m-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use ThingAI/Quark-270m-Instruct with Docker Model Runner:
```
docker model run hf.co/ThingAI/Quark-270m-Instruct
```

	---
	language:
	- it
	- en
	license: apache-2.0
	tags:
	- text-generation
	- causal-lm
	- bilingual
	- italian
	- english
	- small-language-model
	- trained-from-scratch
	- quark
	- instruct
	- sft
	- chat
	library_name: transformers
	pipeline_tag: text-generation
	---

	# Quark-270M-Instruct — Bilingual Chat Model
	Quark-270M-Instruct is the instruction-tuned version of [Quark-270M Base](https://huggingface.co/ThingAI/Quark-270m-Base), fine-tuned for conversational use in Italian and English. It was built entirely from scratch by [ThingsAI](https://things-ai.org).

	## Quick Start

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	model = AutoModelForCausalLM.from_pretrained(
	    "ThingAI/Quark-270m-Instruct",
	    trust_remote_code=True,
	    torch_dtype=torch.bfloat16,
	).cuda()
	model.lm_head.weight = model.embed_tokens.weight # ensure weight tying

	tokenizer = AutoTokenizer.from_pretrained("ThingAI/Quark-270m-Instruct")

	prompt = "<|user|>\nCiao, come stai?\n<|end|>\n<|assistant|>\n"
	inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
	out = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7, top_k=40)
	print(tokenizer.decode(out[0], skip_special_tokens=False))
	```

	## Chat Format

	```
	<|user|>
	{user message}
	<|end|>
	<|assistant|>
	{model response}
	<|end|>
	```

	Multi-turn:

	```
	<|user|>
	Ciao!
	<|end|>
	<|assistant|>
	Ciao! Come posso aiutarti?
	<|end|>
	<|user|>
	Cos'è l'intelligenza artificiale?
	<|end|>
	<|assistant|>
	```
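
The format above can be assembled programmatically. The following is a minimal sketch, not the authors' own tooling: `build_prompt` is a hypothetical helper that takes OpenAI-style message dictionaries and appends an open `<|assistant|>` turn for the model to complete. It rejects any role other than `user` and `assistant`, since the model was not trained with a system tag.

```python
from typing import Dict, List

USER_TAG = "<|user|>"
ASSISTANT_TAG = "<|assistant|>"
END_TAG = "<|end|>"


def build_prompt(messages: List[Dict[str, str]]) -> str:
    """Render a conversation in the Quark chat format, ending with an open assistant turn."""
    tags = {"user": USER_TAG, "assistant": ASSISTANT_TAG}
    parts = []
    for message in messages:
        role = message["role"]
        if role not in tags:
            # Only user and assistant turns appear in the documented format.
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"{tags[role]}\n{message['content']}\n{END_TAG}\n")
    # Leave the final assistant turn open so generation continues from here.
    parts.append(f"{ASSISTANT_TAG}\n")
    return "".join(parts)


history = [
    {"role": "user", "content": "Ciao!"},
    {"role": "assistant", "content": "Ciao! Come posso aiutarti?"},
    {"role": "user", "content": "Cos'è l'intelligenza artificiale?"},
]
print(build_prompt(history))
```

For a single user message, this produces the same string as the `prompt` in the Quick Start example.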

	## Model Details

	| | |
	|---|---|
	| Base Model | [Quark-270M Base](https://huggingface.co/ThingAI/Quark-270m-Base) |
	| Parameters | 252M (with weight tying) |
	| Architecture | Decoder-only Transformer (GQA, SwiGLU, RMSNorm, RoPE) |
	| Vocabulary | 65,537 tokens |
	| Context Length | 2,048 tokens |
	| Precision | BF16 |
	| Languages | Italian, English |

	### Architecture

	| | |
	|---|---|
	| d_model | 768 |
	| Layers | 32 |
	| Query Heads | 12 |
	| KV Heads | 4 |
	| Head Dim | 64 |
	| FFN Dim | 2,048 |
	| Activation | SwiGLU |
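
As a sanity check, the 252M figure can be roughly reproduced from the numbers in the two tables above. The sketch below is a back-of-the-envelope count, not taken from the model's source code. It assumes no bias terms, a single embedding matrix shared with the output head (weight tying), three FFN projections per layer for SwiGLU, and two RMSNorm weight vectors per layer plus one final norm.

```python
# Values taken from the Model Details and Architecture tables.
vocab_size = 65_537
d_model = 768
n_layers = 32
n_query_heads = 12
n_kv_heads = 4
head_dim = 64
ffn_dim = 2_048

# Token embedding; with weight tying the output head adds no extra parameters.
embedding = vocab_size * d_model

# Grouped-query attention: full-width Q and output projections, narrower K and V.
q_proj = d_model * n_query_heads * head_dim
k_proj = d_model * n_kv_heads * head_dim
v_proj = d_model * n_kv_heads * head_dim
o_proj = n_query_heads * head_dim * d_model
attention = q_proj + k_proj + v_proj + o_proj

# SwiGLU feed-forward: gate, up, and down projections (assumed, no biases).
ffn = 3 * d_model * ffn_dim

# Two RMSNorm weight vectors per layer (assumed pre-attention and pre-FFN).
norms = 2 * d_model

per_layer = attention + ffn + norms
total = embedding + n_layers * per_layer + d_model  # plus one final RMSNorm

print(f"{total:,}")  # 251,708,928 under these assumptions
```

Under these assumptions the total rounds to 252M, which matches the stated parameter count. The attention block is small relative to the FFN because the 4 KV heads use only a third of the width of the 12 query heads.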

	## Training

	### Base Pretraining

	Pretrained on ~10B tokens of a bilingual mix (Italian 50%, English 43%, Code 7%) on NVIDIA B200. See [Quark-270M Base](https://huggingface.co/ThingAI/Quark-270m-Base) for details.

	### SFT (Instruction Tuning)

	Fine-tuned on a diverse mix of conversational and instructional data:

	| Dataset | Examples | Type |
	|---|---|---|
	| FreedomIntelligence/alpaca-gpt4-italian | ~52,000 | Italian instructions |
	| HuggingFaceH4/no_robots | ~9,500 | English conversations |
	| m-a-p/CodeFeedback-Filtered-Instruction | 5,000 | Code instructions |
	| yogeshm/text_to_bash (×80) | ~9,900 | Terminal commands |
	| Custom chitchat (×100) | ~3,000 | Identity, greetings, basic Q&A |
	| Total | ~80,000 | |

	| | |
	|---|---|
	| Hardware | NVIDIA B200 |
	| Epochs | 3 |
	| Learning Rate | 2e-5 (cosine decay) |
	| Batch Size | 16 × 4 = 64 effective |
	| Sequence Length | 512 |
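
To give a feel for what these settings imply, the sketch below estimates the number of optimizer steps and evaluates a plain cosine decay from the stated peak learning rate. Several details are assumptions rather than facts from this card: the example count is the approximate total of 80,000, there is no warmup phase, and the rate decays to zero. The function name `cosine_lr` is hypothetical.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 2e-5, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


examples = 80_000          # approximate total from the dataset table
effective_batch = 16 * 4   # 64 examples per optimizer step
epochs = 3

steps_per_epoch = math.ceil(examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
for step in (0, total_steps // 2, total_steps):
    print(step, cosine_lr(step, total_steps))
```

With these numbers, one epoch is about 1,250 optimizer steps and the full run about 3,750. The learning rate is at its peak of 2e-5 at the start, roughly half that at the midpoint, and near zero at the end.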

	## Inference Server

	Quark-270M-Instruct powers [Things Chat](https://chat.things-ai.org) via a self-hosted FastAPI server with SSE streaming, conversation memory, web search, and content moderation.


	## Limitations

	- 252M is small: Limited factual knowledge, prone to hallucination
	- Mathematics: Unreliable beyond basic arithmetic
	- Code: Generates plausible but often non-functional code
	- Context: 2,048 token window
	- No system prompt: The model was not trained with `<|system|>` tags

	### Good for

	- Self-hosted bilingual chatbot
	- Learning about LLM training from scratch
	- Terminal command assistance
	- Light conversational AI

	### Not suited for

	- Factual Q&A requiring accuracy
	- Complex reasoning or math
	- Production-grade code generation
	- Safety-critical applications

	## The Quark Family

	| Model | Parameters | Type |
	|---|---|---|
	| [Quark-50M](https://huggingface.co/ThingAI/Quark-50m) | 51M | Base |
	| [Quark-135M](https://huggingface.co/ThingAI/Quark-135m) | 135M | Base |
	| [Quark-270M Base](https://huggingface.co/ThingAI/Quark-270m-Base) | 252M | Base |
	| Quark-270M-Instruct | 252M | Chat |

	## Links

	- 🌐 [ThingsAI](https://things-ai.org)
	- 💬 [Things Chat](https://chat.things-ai.org)
	- 🔤 [QuarkTokenizer](https://huggingface.co/ThingAI/QuarkTokenizer)
	- 📊 [Open SLM Leaderboard](https://huggingface.co/spaces/AxiomicLabs/Open_SLM_Leaderboard)

	---

	Built from scratch by ThingsAI 🇮🇹