Multilingual Driver Command Models
Model Summary
This repository contains four fine-tuned models for multilingual driver command intent classification.
The models were trained to classify short driver phrases in Russian and English into intent classes for an in-car voice assistant.
The repository is linked to the Multilingual Driver Commands Dataset.
Models
| Model | Architecture Type | Description |
|---|---|---|
| bge-m3 | Encoder-only | Multilingual encoder model |
| e5-multilingual | Encoder-only | Semantic multilingual encoder |
| mmBERT-base | Encoder-only | Compact multilingual BERT-style baseline |
| gte-Qwen2-7B-instruct | Decoder-only | Instruction-tuned decoder model adapted for classification |
Task
The models solve a multiclass intent classification task:
Given a short driver phrase, predict the corresponding intent class.
Example inputs:
- Set the temperature to twenty two
- Turn on Bluetooth audio
- Позвони маме (Call mom)
- Включи обогрев сиденья (Turn on the seat heating)
- Построй маршрут до дома (Plot a route home)
Possible intent categories include climate control, navigation, media, calls, phone connection, lighting, seat control, cruise control, and other vehicle assistant actions.
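As a minimal sketch of the prediction step, assuming a classifier that emits one raw score (logit) per intent: softmax turns the scores into probabilities, and the most probable class is returned. The label names and logit values below are hypothetical and are not taken from the dataset's taxonomy.

```python
import math

# Hypothetical label set for illustration; the released models use 64 intents.
INTENT_LABELS = ["climate_set_temperature", "media_bluetooth_on", "call_contact"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_intent(logits, labels=INTENT_LABELS):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up logits standing in for a model's output on "Позвони маме".
label, confidence = predict_intent([0.3, -1.2, 2.9])
```

The returned probability can serve as a rough confidence signal, for example to fall back to a clarification prompt when it is low.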
Training Dataset
The models were trained on the Multilingual Driver Commands Dataset.
Dataset characteristics:
| Property | Value |
|---|---|
| Dataset size | 153,062 examples |
| Languages | Russian + English |
| Language distribution | 50% RU / 50% EN |
| Final number of intents | 64 |
| Task | Intent classification |
The dataset was synthetically generated, manually validated, balanced across classes, and enriched with rare driving-related scenarios.
Experimental Results
The following results were obtained on the test set after class balancing and merging semantically overlapping intents into 64 final classes.
| Model | Accuracy | Macro F1 | Macro Precision | Macro Recall |
|---|---|---|---|---|
| e5-multilingual-base | 0.864 | 0.862 | 0.868 | 0.859 |
| mmBERT-base | 0.857 | 0.854 | 0.859 | 0.853 |
| bge-m3 | 0.868 | 0.863 | 0.868 | 0.864 |
| gte-Qwen2-7B-instruct | 0.872 | 0.870 | 0.878 | 0.865 |
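For readers unfamiliar with the macro-averaged columns, the sketch below shows how such metrics are commonly computed: precision, recall, and F1 are calculated per class and then averaged with equal weight per class. This assumes the standard definitions; the evaluation code actually used for the table is not part of this card, and the labels below are toy values.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        p = tp[label] / predicted if predicted else 0.0
        r = tp[label] / actual if actual else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
        precisions.append(p)
        recalls.append(r)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }


# Toy labels, not real test-set data.
y_true = ["nav", "nav", "call", "call", "media", "media"]
y_pred = ["nav", "call", "call", "call", "media", "nav"]
scores = macro_metrics(y_true, y_pred)
```

Because every class contributes equally, macro metrics penalize a model that does well on frequent intents but poorly on rare ones.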
A separate experiment with stronger intent merging into 45 classes showed that gte-Qwen2-7B-instruct reached 0.905 accuracy, but this reduced the functional granularity of the assistant.
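The trade-off between accuracy and granularity can be illustrated with a small sketch. It applies a merge map to both reference and predicted labels before scoring, so confusions inside a merged group no longer count as errors. This is a post-hoc relabeling for illustration only, not necessarily how the 45-class experiment was run, and all intent names are hypothetical.

```python
# Hypothetical merge map: fine-grained intent -> coarser merged intent.
MERGE_MAP = {
    "seat_heating_on": "seat_heating",
    "seat_heating_off": "seat_heating",
    "route_to_home": "navigation_route",
    "route_to_address": "navigation_route",
}


def merge_labels(labels, merge_map=MERGE_MAP):
    """Map each label to its merged class; unmapped labels stay unchanged."""
    return [merge_map.get(label, label) for label in labels]


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and reference agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels, not real test-set data.
y_true = ["seat_heating_on", "route_to_home", "call_contact", "route_to_address"]
y_pred = ["seat_heating_off", "route_to_home", "call_contact", "route_to_home"]

fine_accuracy = accuracy(y_true, y_pred)
merged_accuracy = accuracy(merge_labels(y_true), merge_labels(y_pred))
```

In this toy case the merged score is higher, but the assistant can no longer tell "on" from "off" within the merged seat-heating class, which is the loss of functional granularity described above.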
Main Findings
The experiments show that larger models do not always provide a proportional improvement for short command classification.
Although gte-Qwen2-7B-instruct is much larger than bge-m3, the gap between them was small: 0.872 vs. 0.868 accuracy and 0.870 vs. 0.863 macro F1. This suggests that, for this task, quality is limited not only by model size but also by:
- class taxonomy;
- semantic overlap between intents;
- synthetic data noise;
- incomplete or noisy parameter fields;
- dataset structure and balance.
For practical deployment, a smaller encoder-based model such as bge-m3 may be more efficient, since it provides competitive quality with lower computational cost.
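One way to investigate semantic overlap between intents is to count which (true, predicted) pairs are confused most often. The sketch below does this with toy labels; the intent names are hypothetical, and it is offered as a possible diagnostic rather than the analysis performed for this repository.

```python
from collections import Counter


def top_confusions(y_true, y_pred, k=3):
    """Return the k most frequent (true, predicted) error pairs with counts."""
    errors = Counter(
        (truth, pred) for truth, pred in zip(y_true, y_pred) if truth != pred
    )
    return errors.most_common(k)


# Toy labels with hypothetical intent names.
y_true = ["fan_speed_up", "fan_speed_up", "temp_up", "call_contact", "fan_speed_up"]
y_pred = ["temp_up", "temp_up", "temp_up", "call_contact", "fan_speed_up"]
pairs = top_confusions(y_true, y_pred)
```

Pairs that dominate this list are candidates for taxonomy review, additional training examples, or merging.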
Repository Structure
Recommended repository structure:
best_models/
├── bge-m3/
│   └── model.pt
├── e5-multilingual/
│   └── model.pt
├── mmBERT-base/
│   └── model.pt
└── qwen2/
    └── model.pt
If the checkpoints are saved as PyTorch state_dict files, the model architecture code is required to load them correctly.
Loading PyTorch Checkpoints
Example loading pattern:
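import torch

# Example only: replace MyModel with the corresponding architecture class.
from model import MyModel

model = MyModel(...)  # pass the same constructor arguments used during training
# weights_only=True restricts unpickling to tensors and plain containers.
state_dict = torch.load("best_models/bge-m3/model.pt", map_location="cpu", weights_only=True)
model.load_state_dict(state_dict)
model.eval()  # inference mode: disables dropout and similar training-only behavior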
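If a checkpoint was saved as a full PyTorch model object rather than a state_dict, the class definition must still be importable when loading. PyTorch 2.6 and later default to weights_only=True, so full-object loading has to opt out explicitly; do this only for checkpoints from a trusted source:
import torch

# weights_only=False allows arbitrary pickled objects; use only with trusted files.
model = torch.load("best_models/bge-m3/model.pt", map_location="cpu", weights_only=False)
model.eval()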
The exact loading method depends on how the checkpoint was saved during training.
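When it is unclear how a checkpoint was saved, inspecting the loaded object can help. The helper below is a hypothetical sketch: it takes whatever `torch.load` returned and guesses its kind. It deliberately avoids importing PyTorch so it runs anywhere, and the wrapper keys it checks (`state_dict`, `model_state_dict`) are common conventions assumed here, not something this repository is confirmed to use.

```python
from collections.abc import Mapping


def checkpoint_kind(obj):
    """Guess what kind of object torch.load() returned."""
    if isinstance(obj, Mapping):
        # Assumption: some training scripts wrap the weights in a larger dict.
        for key in ("state_dict", "model_state_dict"):
            if isinstance(obj.get(key), Mapping):
                return f"wrapped state_dict under '{key}'"
        return "state_dict"
    if callable(getattr(obj, "state_dict", None)):
        # Module-like objects expose a state_dict() method.
        return "full model object"
    return "unknown"


# Stand-ins for loaded checkpoints; real ones would come from torch.load().
plain = checkpoint_kind({"encoder.weight": [0.1, 0.2]})
wrapped = checkpoint_kind({"model_state_dict": {"encoder.weight": [0.1]}, "epoch": 3})
```

A plain or wrapped state_dict needs the architecture code plus `load_state_dict`, while a full model object can be used directly after `eval()`.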
Intended Use
These models are intended for:
- educational experiments;
- research on synthetic NLU datasets;
- multilingual intent classification;
- comparison of encoder-only and decoder-only architectures;
- prototyping voice assistant command recognition.
Limitations
The models were trained on a synthetic dataset. Therefore, real-world performance may differ when applied to natural user traffic.
Known limitations:
- possible sensitivity to synthetic generation style;
- errors on semantically close intents;
- dependence on data quality and intent taxonomy;
- limited robustness to real-world noise, slang, ASR errors, and incomplete phrases;
- potential confusion between intents with similar surface forms.
For production use, the models should be evaluated on real driver commands and monitored for data drift.
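Monitoring for drift could start with something as simple as comparing the distribution of predicted intents on live traffic against a reference window. The sketch below uses total variation distance for that comparison; the threshold and the sample predictions are hypothetical, and other drift measures would be equally reasonable.

```python
from collections import Counter


def intent_distribution(predictions):
    """Relative frequency of each predicted intent."""
    counts = Counter(predictions)
    total = len(predictions)
    return {label: n / total for label, n in counts.items()}


def total_variation(reference, live):
    """Half the L1 distance between two distributions: 0 = identical, 1 = disjoint."""
    labels = set(reference) | set(live)
    return 0.5 * sum(
        abs(reference.get(label, 0.0) - live.get(label, 0.0)) for label in labels
    )


# Toy prediction logs, not real traffic.
reference = intent_distribution(["nav", "nav", "call", "media"])
live = intent_distribution(["nav", "call", "call", "call"])

drift = total_variation(reference, live)
DRIFT_THRESHOLD = 0.2  # hypothetical alert level; tune on real traffic
needs_review = drift > DRIFT_THRESHOLD
```

A shift in predicted intents does not prove that accuracy has dropped, but it is a cheap signal that a labeled sample of recent commands should be re-evaluated.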
Citation
If you use these checkpoints, please cite or reference this repository:
@misc{multilingual-driver-command-models,
  title = {Multilingual Driver Command Models},
  author = {Nizhankovskiy, Ilya},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/INFINITY1023/multilingual-driver-command-models}}
}