---
license: apache-2.0
language:
- ru
base_model:
- Qwen/Qwen3-8B
pipeline_tag: text-generation
library_name: transformers
---

# T-lite-it-2.1

**🚨 Users are advised to exercise caution and are responsible for any additional training and oversight required to ensure the model's responses meet acceptable ethical and safety standards. The responsibility for incorporating this model into industrial or commercial solutions lies entirely with those who choose to deploy it.**


## Description
T-lite-it-2.1 is an efficient Russian-language model built upon the Qwen 3 architecture. It features significant improvements in instruction following and **adds support for tool calling**, a key advancement over [T-lite-it-1.0](https://huggingface.co/t-tech/T-lite-it-1.0), which lacks tool-use support.
The model outperforms Qwen3-8B in tool-calling scenarios, which is essential for agentic applications. It is built for both general tasks and complex workflows, and an optimized tokenizer gives it higher throughput on Russian text generation.

More training details are available in our Habr post: https://habr.com/ru/companies/tbank/articles/979650/

**NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Specifying `enable_thinking=False` is therefore no longer required.**

### 📚 Dataset

**Instruction midtraining:**
40B tokens of instruction data.

**Supervised fine-tuning (SFT):**
~670K high-quality, diverse instructions of balanced complexity, combining general data with synthetic, verifiable instruction-following and tool-calling scenarios.

**Online RL alignment (GRPO):**
Synthetic data generated for instruction-following (IF) and tool-calling optimization.
- *General stream:* general and chat tasks;
- *IF stream:* diverse, verifiable synthetic tasks targeting strict instruction following;
- *Tool-calling stream:* complex workflows with multi-step tool use, yielding strong gains on tool-calling benchmarks.


## Merge Strategy

In this release, we leveraged an expert-merging approach. After a shared SFT stage that covers the core capabilities (instruction following, general tasks, and tool calling), we train three specialized experts via GRPO:
- **IF Expert**: Optimized for strict instruction following.
- **General Expert**: Focused on general and chat tasks.
- **Tool-Call Expert**: Trained on complex tool-calling workflows.

Each expert is trained with domain-specific data, hyperparameters, and reward functions for optimal performance. The final model is obtained by merging the three experts using **SLERP** (Spherical Linear Interpolation), which preserves individual capabilities better than single-model training. To prevent artifacts after merging, we apply a polishing stage on general-domain data that slightly adjusts the model weights.

This approach allows fine-grained control over each skill domain and results in a more balanced and capable unified model.
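
For intuition, below is a minimal sketch of SLERP applied to model weights. It is illustrative only: the tensor shapes, the pairwise merge order, and the interpolation weights are assumptions, not the production recipe.

```python
import torch

def slerp(theta0: torch.Tensor, theta1: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors."""
    v0, v1 = theta0.flatten().float(), theta1.flatten().float()
    # Angle between the two weight vectors
    cos_omega = torch.dot(v0, v1) / (v0.norm() * v1.norm() + eps)
    omega = torch.acos(cos_omega.clamp(-1.0, 1.0))
    if omega < eps:  # nearly collinear weights: fall back to plain LERP
        merged = (1 - t) * v0 + t * v1
    else:
        merged = (torch.sin((1 - t) * omega) * v0 + torch.sin(t * omega) * v1) / torch.sin(omega)
    return merged.reshape(theta0.shape).to(theta0.dtype)

# Stand-ins for the three expert checkpoints (hypothetical state dicts)
if_expert = {"w": torch.randn(4, 4)}
general_expert = {"w": torch.randn(4, 4)}
tool_expert = {"w": torch.randn(4, 4)}

# One possible pairwise merge: blend IF and General equally, then fold in the tool expert
merged = {
    name: slerp(slerp(if_expert[name], general_expert[name], 0.5), tool_expert[name], 1 / 3)
    for name in if_expert
}
```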


## 📊 Benchmarks


| Model                              | Ru Arena Hard | ruIFeval*   | enIFeval*   | ruBFCL      | enBFCL      | Tau2        | ACEBench    |
|------------------------------------|:-------------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|:-----------:|
| **T-lite-it-2.1**                  | **83.9**      | **75.9**    | **75.1**    | **56.5**    | <u>62.2</u> | **26.8**    | **61.0**    |
| **T-lite-it-1.0**\*\*              | 24.4          | 58.9        | 60.1        | -           | -           | -           | -           |
| Qwen3-8B (no_think)                | 57.2          | <u>74.0</u> | <u>75.4</u> | 52.6        | 59.4        | <u>22.7</u> | 48.1        |
| Ministral-3-8B-Instruct-2512       | <u>72.6</u>   | 63.8        | 64.3        | <u>55.3</u> | 59.8        | -           | <u>59.0</u> |
| RuadaptQwen3-8B-Hybrid (no_think)  | 56.9          | 68.7        | 73.1        | -           | -           | 18.2        | 52.1        |
| A-vibe                             | 50.1          | 60.4        | 53.2        | 52.6        | **63.0**    | 11.4        | 54.0        |

\* The IFeval metric is the mean of four values: prompt-level and instruction-level accuracy, each under strict and loose evaluation.

\*\* T-lite-it-1.0 does not support tool calling, so tool-calling benchmark metrics are not available.

More benchmarks can be found in our [Habr post](https://habr.com/ru/companies/tbank/articles/979650/).

## Recommended Generation Parameters

```
temperature: 0.7
top_p: 0.8
top_k: 20
presence_penalty: 1.0
```

- Use a lower temperature for straightforward queries and a higher temperature for complex or creative tasks.
- A `presence_penalty` between 0 and 2 helps avoid repetitive outputs.
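
For local `transformers` inference, these settings map approximately onto a `GenerationConfig` as sketched below. Note that `transformers` has no `presence_penalty` option, so `repetition_penalty` is used here as an illustrative stand-in; the value is an assumption, not a card recommendation.

```python
from transformers import GenerationConfig

gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    repetition_penalty=1.05,  # illustrative stand-in: transformers has no presence_penalty
    max_new_tokens=512,
)
# Later: model.generate(**model_inputs, generation_config=gen_config)
```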


## 👨‍💻 Examples of usage

### SGLang Usage
For better quality and stable performance, we recommend SGLang as your inference framework.

To run an inference server for **T-lite-it-2.1**, start by launching the SGLang server:

```bash
python -m sglang.launch_server \
    --model-path t-tech/T-lite-it-2.1 \
    --tool-call-parser qwen25
```

### vLLM Usage

```bash
vllm serve t-tech/T-lite-it-2.1 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes
```

Once the server is up and listening, you can send chat-based requests via the OpenAI Python client:

```python
from openai import OpenAI

# Point the client at your local server: by default vLLM listens on
# http://localhost:8000/v1 and SGLang on http://localhost:30000/v1
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Tool definition for fetching the weather
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Получить краткое описание текущей погоды в указанном городе.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "Город, например 'Москва'."
                    },
                    "date": {
                        "type": "string",
                        "description": "Дата в формате YYYY-MM-DD (опционально)."
                    },
                },
                "required": ["city"],
            },
        },
    }
]

prompt = (
    "Мне нужно спланировать прогулку по Москве сегодня вечером. "
    "Если тебе нужно, обратись к инструменту погоды, чтобы узнать текущие условия, "
    "а затем предложи, что можно делать на улице и какие есть альтернативы, если будет дождь."
)

completion = client.chat.completions.create(
    model="ANY",
    messages=[
        {
            "role": "system",
            "content": "Ты T-lite, виртуальный ассистент в Т-Технологиях. Твоя задача — быть полезным диалоговым ассистентом."
        },
        {"role": "user", "content": prompt},
    ],
    tools=tools,
    tool_choice="auto",
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.0,
    extra_body={"top_k": 20},  # top_k is not a standard OpenAI parameter; pass it via extra_body
)

# In its first reply, the model either returns final text
# or requests a tool invocation (tool_calls)
message = completion.choices[0].message
print(message)
```

**Note:** it is **required** to include both `temperature` and `presence_penalty` in every completion call.
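
If the model does return `tool_calls`, the usual continuation is to execute the tool locally and send the result back in a `tool` message. The sketch below continues the example above (`message`, `client`, `prompt`, and `tools` are reused; `get_weather` is a hypothetical stub, and the system message is omitted for brevity):

```python
import json

# Hypothetical local implementation of the declared tool
def get_weather(city: str, date: str = "") -> str:
    return f"{city}: +3°C, облачно, вечером возможен небольшой дождь."

if message.tool_calls:
    call = message.tool_calls[0]
    result = get_weather(**json.loads(call.function.arguments))

    follow_up = client.chat.completions.create(
        model="ANY",
        messages=[
            {"role": "user", "content": prompt},
            message,  # the assistant turn that requested the tool
            {"role": "tool", "tool_call_id": call.id, "content": result},
        ],
        tools=tools,
        temperature=0.7,
        top_p=0.8,
        presence_penalty=1.0,
    )
    print(follow_up.choices[0].message.content)
```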


### HF Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

torch.manual_seed(42)

model_name = "t-tech/T-lite-it-2.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

prompt = (
    "Мне нужно спланировать прогулку по Москве сегодня вечером. "
    "Предложи варианты занятий на улице и в помещении, "
    "предполагая типичную погоду для этого времени года."
)

messages = [
    {
        "role": "system",
        "content": "Ты T-lite, виртуальный ассистент в Т-Технологиях. Твоя задача — быть полезным диалоговым ассистентом."
    },
    {"role": "user", "content": prompt},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)

# Drop the prompt tokens
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

```
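
The same `tools` schema can also be rendered for local inference: recent `transformers` versions accept a `tools` argument in `apply_chat_template`. A minimal sketch, reusing `tokenizer`, `messages`, and `tools` from the examples above; the model then typically emits a tool call in the format that server-side parsers such as `qwen25`/`hermes` are designed to extract:

```python
# Render a tool-calling prompt with the HF chat template
text_with_tools = tokenizer.apply_chat_template(
    messages,
    tools=tools,              # same JSON schema as in the OpenAI-client example
    tokenize=False,
    add_generation_prompt=True,
)
```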


## Long Context Usage
T-lite-it-2.1 natively supports a context length of 32,768 tokens.  
For conversations where the input significantly exceeds this limit, follow the recommendations from the [Qwen3 model card](https://huggingface.co/Qwen/Qwen3-235B-A22B#processing-long-texts) on processing long texts.

- Modify the model files:
  In the `config.json` file, add the `rope_scaling` fields:
    ```json
    {
        ...,
        "rope_scaling": {
            "rope_type": "yarn",
            "factor": 4.0,
            "original_max_position_embeddings": 32768
        }
    }
    ```
  For `llama.cpp`, you need to regenerate the GGUF file after the modification.
- Passing command line arguments:

  For `vllm`, you can use
    ```shell
    vllm serve ... --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072  
    ```
  For `sglang`, you can use
    ```shell
    python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
    ```
  For `llama-server` from `llama.cpp`, you can use
    ```shell
    llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
    ```

## Citation
If you find our work helpful, feel free to cite us.

```bibtex
@misc{stoianov2025tpro20efficientrussian,
      title={T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground}, 
      author={Dmitrii Stoianov and Danil Taranets and Olga Tsymboi and Ramil Latypov and Almaz Dautov and Vladislav Kruglikov and Nikita Surkov and German Abramov and Pavel Gein and Dmitry Abulkhanov and Mikhail Gashkov and Viktor Zelenkovskiy and Artem Batalov and Aleksandr Medvedev and Anatolii Potapov},
      year={2025},
      eprint={2512.10430},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.10430}, 
}
```