# T-lite-it-2.1-FP8
Main BF16 model: t-tech/T-lite-it-2.1
🚨 Users are advised to exercise caution and are responsible for any additional training and oversight required to ensure the model's responses meet acceptable ethical and safety standards. The responsibility for incorporating this model into industrial or commercial solutions lies entirely with those who choose to deploy it.
T-lite-it-2.1-FP8 is a fine-grained FP8-quantized version of T-lite-it-2.1 (built on the Qwen 3 family). It delivers near-identical capabilities with roughly half the memory footprint and higher inference speed.
## Description
T-lite-it-2.1 is an efficient Russian-language model built on the Qwen 3 architecture. It features significant improvements in instruction following and adds tool-calling support, a key advancement over T-lite-it-1.0, which lacked tool use. It outperforms Qwen3-8B in tool-calling scenarios, which is essential for agentic applications. The model is built for both general tasks and complex workflows, and an optimized tokenizer gives it higher throughput on Russian text generation.
NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Specifying `enable_thinking=False` is no longer required.
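As an illustration, here is a minimal generation sketch with `transformers` using the standard chat-template API; the prompt and generation settings are our own assumptions, not part of this card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "t-tech/T-lite-it-2.1-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [{"role": "user", "content": "Кратко расскажи о преимуществах FP8-квантизации."}]
# No enable_thinking=False is needed: the model runs only in non-thinking mode.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```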
## 📊 Benchmarks
| Model | Ru Arena Hard | ruIFeval* | enIFeval* | enBFCL | ruBFCL | Tau2 | ACEBench |
|---|---|---|---|---|---|---|---|
| T-lite-it-2.1 | 83.9 | 75.9 | 75.1 | 62.2 | 56.5 | 26.8 | 61.0 |
| T-lite-it-2.1 FP8 | 81.7 | 75.7 | 75.1 | 62.0 | 57.0 | 25.0 | 59.9 |
* The IFeval metric is the mean of four values: prompt-level and instruction-level accuracy, each computed in strict and loose modes.
## Note on FP8
For convenience and performance, we provide an FP8-quantized checkpoint for T-lite-it-2.1, whose name ends with `-FP8`. The quantization method is fine-grained FP8 quantization with a block size of 128. You can find more details in the `quantization_config` field of `config.json`.
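For example, you can inspect these settings without downloading the weights; a sketch assuming the checkpoint's config exposes the usual `quantization_config` dictionary (the exact keys depend on the checkpoint):

```python
from transformers import AutoConfig

# Loads only config.json; expect a fine-grained FP8 scheme with block size 128.
config = AutoConfig.from_pretrained("t-tech/T-lite-it-2.1-FP8")
print(config.quantization_config)
```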
You can use the T-lite-it-2.1-FP8 model with several inference frameworks, including `transformers`, `sglang`, and `vllm`, just like the original BF16 model. A serving sketch follows the list below.
However, please pay attention to the following known issue:
- `transformers`: there are currently issues with the "fine-grained fp8" method in `transformers` for distributed inference. You may need to set the environment variable `CUDA_LAUNCH_BLOCKING=1` if multiple devices are used for inference.
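For serving, vLLM picks up the FP8 scheme from `config.json` automatically. A minimal sketch, assuming a recent vLLM release with the `LLM.chat` API; the prompt and sampling values are illustrative:

```python
from vllm import LLM, SamplingParams

# vLLM reads the quantization settings from the checkpoint's config.json.
llm = LLM(model="t-tech/T-lite-it-2.1-FP8")
params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

outputs = llm.chat(
    [{"role": "user", "content": "Объясни, что такое tool calling."}],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)
```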