t-tech
/

T-pro-it-2.1-FP8

compressed-tensors

Model card Files Files and versions

T-pro-it-2.1-FP8 / README.md

alexfida's picture

Initial commit

32549e1 19 days ago

|

history blame contribute delete

2.6 kB

	---
	license: apache-2.0
	language:
	- ru
	base_model:
	- t-tech/T-pro-it-2.1
	---

	# T-pro-it-2.1-FP8

	> Main BF16 model: [t-tech/T-pro-it-2.1](https://huggingface.co/t-tech/T-pro-it-2.1)

	🚨 Users are advised to exercise caution and are responsible for any additional training and oversight required to ensure the model's responses meet acceptable ethical and safety standards. The responsibility for incorporating this model into industrial or commercial solutions lies entirely with those who choose to deploy it.

	T‑pro‑it‑2.1‑FP8 is a fine‑grained FP8‑quantised version of T‑pro‑it‑2.1 (built on the Qwen‑3 family). It delivers identical capabilities with roughly half the memory footprint and higher inference speed.


	## Description
	T-pro-it-2.1 — is an efficient russian model built upon the Qwen 3 model family with improved instruction following and tool-calling capabilities compared to [T-pro-it-2.0](https://huggingface.co/t-tech/T-pro-it-2.0).
	Outperforms Qwen3-32B in tool calling scenarios, which is essential for agentic applications. Built for both general tasks and complex workflows.

	NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.


	## 📊 Benchmarks

	\| Model \| Ru Arena Hard \| ruIFeval* \| enIFeval* \| enBFCL \| ruBFCL \| Tau2 \| ACEBench \|
	\|------------------\|:-------------:\|:--------:\|:--------:\|:------:\|:------:\|:-----:\|:--------:\|
	\| T-pro-it-2.1 \| 93.8 \| 80.7 \| 78.4 \| 72.3 \| 66.0 \| 37.6 \| 73.6 \|
	\| T-pro-it-2.1 FP8 \| 93.4 \| 80.7 \| 78.0 \| 72.3 \| 65.7 \| 35.2 \| 72.7 \|

	\* IFeval metric is mean of 4 values: prompt and instruct levels for strict and loose accuracy.

	## Note on FP8

	For convenience and performance, we have provided `fp8`-quantized model checkpoint for T-pro-it-2.1, whose name ends with `-FP8`. The quantization method is fine-grained `fp8` quantization with block size of 128. You can find more details in the `quantization_config` field in `config.json`.

	You can use the T-pro-it-2.1-FP8 model with serveral inference frameworks, including `transformers`, `sglang`, and `vllm`, as the original bfloat16 model.
	However, please pay attention to the following known issues:
	- `transformers`:
	- there are currently issues with the "fine-grained fp8" method in `transformers` for distributed inference. You may need to set the environment variable `CUDA_LAUNCH_BLOCKING=1` if multiple devices are used in inference.