trishtan committed
Commit a9d84c0 · 0 parents

Duplicate from trishtan/voxtral-sentinel-4b
Files changed (7)
  1. .gitattributes +36 -0
  2. README.md +263 -0
  3. config.json +66 -0
  4. generation_config.json +12 -0
  5. model.safetensors +3 -0
  6. tekken.json +3 -0
  7. training_args.bin +3 -0
.gitattributes ADDED
@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tekken.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,263 @@
---
base_model: mistralai/Voxtral-Mini-4B-Realtime-2602
library_name: transformers
model_name: voxtral-sentinel-4b
datasets:
- trishtan/voxtral-forensic-ds
tags:
- audio
- multimodal
- emotion-recognition
- customer-support
- emergency-services
- sft
- trl
- hf_jobs
language:
- en
license: apache-2.0
---

# Model Card for voxtral-sentinel-4b

**voxtral-sentinel-4b** is a fine-tuned version of [mistralai/Voxtral-Mini-4B-Realtime-2602](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602), trained with [TRL](https://github.com/huggingface/trl) and specialised for real-time audio understanding in high-stakes operational environments. Given a raw audio recording, the model produces a structured output containing a verbatim transcript, a contextual analysis of speaker emotion and situation, and a recommended action — enabling autonomous routing and triage without human-in-the-loop intervention.

Built for two primary verticals:

- **Automated customer support** — classify caller intent and emotional state to route calls, trigger escalations, or generate automated responses in real time
- **Emergency services & safety** — identify distress, urgency, and situational context from audio to assist dispatchers or fully autonomous response systems

---

## Model Details

| Property | Value |
|---|---|
| **Base model** | mistralai/Voxtral-Mini-4B-Realtime-2602 |
| **Model type** | Audio-to-text (multimodal) |
| **Parameters** | ~4B |
| **Fine-tune method** | Full fine-tune (no LoRA) |
| **Precision** | bfloat16 |
| **Training hardware** | NVIDIA A100 |
| **Framework** | Transformers + TRL SFTTrainer |
| **Language** | English |
| **License** | Apache 2.0 (see base model license) |

---

## Training

[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/s222458666/voxtral-sentinel/runs/uouz4iq1)

[<img src="https://img.shields.io/badge/GitHub-Forensic--Audio-181717?logo=github&style=flat" alt="View on GitHub" width="150" height="24"/>](https://github.com/SageRish/Forensic-Audio)

### Dataset

Fine-tuned on [voxtral-forensic-ds](https://huggingface.co/datasets/trishtan/voxtral-forensic-ds), a curated dataset of ~9,984 audio samples with structured annotations. Each sample pairs a raw audio clip with a ground-truth output in the following canonical format:

```
### TRANSCRIPT:
<verbatim transcription of the audio>

### ANALYSIS:
<contextual analysis of speaker emotion, tone, and situation>

### CONCLUSION:
<recommended action or classification>
```

The dataset was derived from [MELD (Multimodal EmotionLines Dataset)](https://huggingface.co/datasets/ajyy/MELD_audio), which contains emotionally rich conversational audio from multi-speaker dialogue scenarios, and [DCASE 2025 Task 1](https://dcase.community/challenge2025/task-low-complexity-acoustic-scene-classification-with-device-information) (Acoustic Scene Classification). Annotations were generated and standardised using automated pipelines with LLM-assisted formatting normalisation.

A 90/10 train/eval split was used with a fixed seed (42) for reproducibility. The final training dataset and held-out eval split are available at [trishtan/voxtral-forensic-ds](https://huggingface.co/datasets/trishtan/voxtral-forensic-ds).
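
The split can be reproduced deterministically. The sketch below is an illustrative pure-Python version (the card does not publish the actual pipeline, which may instead use `datasets.Dataset.train_test_split`; `train_eval_split` is a hypothetical helper, and the sample count is taken from the figure above):

```python
import random

def train_eval_split(samples, eval_frac=0.1, seed=42):
    """Deterministic 90/10 split; illustrative only, not the released pipeline."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)          # fixed seed => reproducible order
    eval_idx = set(idx[:int(len(samples) * eval_frac)])
    train = [s for i, s in enumerate(samples) if i not in eval_idx]
    evals = [s for i, s in enumerate(samples) if i in eval_idx]
    return train, evals

train, evals = train_eval_split(list(range(9984)))
print(len(train), len(evals))  # 8986 998
```

Re-running with the same seed yields the identical partition, which is what makes the held-out eval split stable across runs.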

### Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 5 (early stopping at eval loss < 1.15) |
| Learning rate | 5e-6 |
| LR scheduler | Cosine |
| Warmup ratio | 0.05 |
| Batch size (per device) | 2 |
| Gradient accumulation steps | 4 |
| Effective batch size | 8 |
| Max grad norm | 1.0 |
| Precision | bf16 |
| Eval strategy | Every 100 steps |
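
The effective batch size follows from the other rows, assuming a single A100 as stated in the hardware row (the device count is an assumption, not stated explicitly in the card):

```python
per_device_batch = 2   # "Batch size (per device)"
grad_accum_steps = 4   # "Gradient accumulation steps"
num_devices = 1        # assumption: one A100

effective_batch = per_device_batch * grad_accum_steps * num_devices
print(effective_batch)  # 8
```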

### Training Results

| Metric | Value |
|---|---|
| Final eval loss | 1.148 |
| Final eval mean token accuracy | 74.35% |
| Train/eval accuracy gap | ~0% |
| Stopped at epoch | 2.75 (early stopping) |

The near-zero gap between train and eval accuracy across all runs indicates the model generalises well to unseen audio with no measurable overfitting.

---

## Usage

```python
import torch
import soundfile as sf
import numpy as np
from transformers import AutoProcessor, VoxtralRealtimeForConditionalGeneration

model_id = "trishtan/voxtral-sentinel-4b"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = VoxtralRealtimeForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load your audio (16 kHz mono; resample first if your file differs)
audio, sr = sf.read("your_audio.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix to mono
audio = audio.astype(np.float32)

PROMPT = "[INST] Analyze this recording for forensic indicators. [/INST]"

audio_inputs = processor.feature_extractor(
    [audio], sampling_rate=16000, return_tensors="pt", padding=True,
)
text_inputs = processor.tokenizer(
    [PROMPT], return_tensors="pt", padding=True,
)
inputs = {**audio_inputs, **text_inputs}
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)

response = processor.tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(response)
```

### Expected Output Format

```
### TRANSCRIPT:
I need help immediately, my neighbour hasn't responded in hours and I can hear something...

### ANALYSIS:
The speaker exhibits elevated vocal stress indicators including increased speech rate and
pitch variance. Tone suggests genuine distress rather than rehearsed or non-urgent
communication. Situational context implies potential welfare concern for a third party.

### CONCLUSION:
Escalate to emergency services. Flag as high-priority welfare check. Do not route to
standard support queue.
```
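
Downstream systems typically need to split this output into its three sections. A minimal parser for the canonical format might look like the following (illustrative only; `parse_sentinel_output` is a hypothetical helper, not part of the released tooling):

```python
import re

def parse_sentinel_output(text):
    """Split the model's canonical output into transcript/analysis/conclusion."""
    sections = {}
    # Each section runs from its header to the next header (or end of text).
    pattern = r"### (TRANSCRIPT|ANALYSIS|CONCLUSION):\n(.*?)(?=\n### |\Z)"
    for name, body in re.findall(pattern, text, flags=re.DOTALL):
        sections[name.lower()] = body.strip()
    return sections

example = """### TRANSCRIPT:
I need help immediately...

### ANALYSIS:
The speaker exhibits elevated vocal stress indicators.

### CONCLUSION:
Escalate to emergency services."""

parsed = parse_sentinel_output(example)
print(parsed["conclusion"])  # Escalate to emergency services.
```

A robust deployment should also handle malformed outputs (missing sections, repeated headers) rather than assuming the model always emits the canonical format.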

---

## Intended Use

### In Scope
- Real-time audio triage in customer service pipelines
- Emergency call classification and dispatcher assistance
- Automated sentiment and intent detection from voice
- Proof-of-concept and research into multimodal audio understanding

### Out of Scope
- Medical diagnosis or clinical decision-making
- Surveillance or non-consensual audio analysis
- Languages other than English
- Audio clips under 3 seconds (insufficient signal for reliable analysis)

---

## Limitations

- **Short audio clips** — clips under 3 seconds are padded with silence to the model's required 15-second input window. Analysis quality degrades significantly for very short recordings.
- **Single-language** — trained exclusively on English-language audio. Performance on accented, non-native, or non-English speech is untested.
- **Emotional diversity** — training data skews toward conversational emotional registers. Performance on domain-specific audio (medical, legal, industrial) may vary.
- **Not a safety-critical system** — outputs should be reviewed by human operators in any deployment where errors have real-world consequences.
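
The silence-padding behaviour for short clips can be sketched with NumPy. This is an assumption-laden illustration: the 15-second window and 16 kHz rate come from this card, but the actual preprocessing code is not published and `pad_to_window` is a hypothetical helper:

```python
import numpy as np

TARGET_SECONDS = 15   # model's input window, per the limitation above
SAMPLE_RATE = 16_000  # expected input sample rate

def pad_to_window(audio: np.ndarray) -> np.ndarray:
    """Right-pad a mono float32 clip with silence to the 15 s window."""
    target_len = TARGET_SECONDS * SAMPLE_RATE
    if len(audio) >= target_len:
        return audio[:target_len]  # truncate clips longer than the window
    return np.pad(audio, (0, target_len - len(audio)))

clip = np.zeros(2 * SAMPLE_RATE, dtype=np.float32)  # a 2-second clip
padded = pad_to_window(clip)
print(len(padded) / SAMPLE_RATE)  # 15.0
```

A 2-second clip therefore arrives at the model as 13 seconds of silence plus 2 seconds of signal, which is why very short recordings carry so little usable information.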

---

## Data Attribution

This model was fine-tuned using audio data derived from:

**MELD — Multimodal EmotionLines Dataset**
Poria, S., Hazarika, D., Majumder, N., Naik, G., Cambria, E., & Mihalcea, R. (2019).
MELD: A multimodal multi-party dataset for emotion recognition in conversations.
In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 527–536).
Hugging Face: [ajyy/MELD_audio](https://huggingface.co/datasets/ajyy/MELD_audio)

**DCASE 2025 Challenge — Task 1: Acoustic Scene Classification**
Mesaros, A., Heittola, T., & Virtanen, T. (2018).
A multi-device dataset for urban acoustic scene classification.
In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018) (pp. 9–13).
URL: https://dcase.community/documents/workshop2018/proceedings/DCASE2018Workshop_Mesaros_8.pdf

---

## Framework Versions

- TRL: 0.29.0
- Transformers: 5.2.0
- PyTorch: 2.10.0
- Datasets: 4.6.1
- Tokenizers: 0.22.2

---

## Citation

```bibtex
@misc{voxtral-sentinel-4b,
  author = {trishtan},
  title = {voxtral-sentinel-4b: Fine-tuned Voxtral for Audio Triage},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/trishtan/voxtral-sentinel-4b}
}

@inproceedings{poria2019meld,
  title = {{MELD}: A multimodal multi-party dataset for emotion recognition in conversations},
  author = {Poria, Soujanya and Hazarika, Devamanyu and Majumder, Navonil and Naik, Gautam and Cambria, Erik and Mihalcea, Rada},
  booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
  pages = {527--536},
  year = {2019}
}

@inproceedings{Mesaros2018_DCASE,
  author = {Mesaros, Annamaria and Heittola, Toni and Virtanen, Tuomas},
  title = {A multi-device dataset for urban acoustic scene classification},
  booktitle = {Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018)},
  month = {November},
  year = {2018},
  pages = {9--13},
  keywords = {Acoustic scene classification, DCASE challenge, public datasets, multi-device data},
  url = {https://dcase.community/documents/workshop2018/proceedings/DCASE2018Workshop_Mesaros_8.pdf}
}

@software{vonwerra2020trl,
  title = {{TRL: Transformers Reinforcement Learning}},
  author = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
  license = {Apache-2.0},
  url = {https://github.com/huggingface/trl},
  year = {2020}
}
```

---

## Acknowledgements

Built on [Voxtral-Mini-4B-Realtime](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) by Mistral AI.
Fine-tuning infrastructure: Hugging Face Transformers, TRL, and Accelerate.
config.json ADDED
@@ -0,0 +1,66 @@
{
  "architectures": [
    "VoxtralRealtimeForConditionalGeneration"
  ],
  "audio_config": {
    "activation_function": "gelu",
    "attention_dropout": 0.0,
    "dtype": "bfloat16",
    "head_dim": 64,
    "hidden_act": "silu",
    "hidden_size": 1280,
    "initializer_range": 0.02,
    "intermediate_size": 5120,
    "max_position_embeddings": 1500,
    "model_type": "voxtral_realtime_encoder",
    "num_attention_heads": 32,
    "num_hidden_layers": 32,
    "num_key_value_heads": 32,
    "num_mel_bins": 128,
    "rms_norm_eps": 1e-05,
    "rope_parameters": {
      "rope_theta": 1000000.0,
      "rope_type": "default"
    },
    "sliding_window": 750,
    "vocab_size": 131072
  },
  "audio_length_per_tok": 8,
  "bos_token_id": 1,
  "default_num_delay_tokens": 6,
  "downsample_factor": 4,
  "dtype": "bfloat16",
  "eos_token_id": 2,
  "hidden_size": 3072,
  "model_type": "voxtral_realtime",
  "pad_token_id": 11,
  "projector_hidden_act": "gelu",
  "text_config": {
    "attention_dropout": 0.0,
    "bos_token_id": 1,
    "dtype": "bfloat16",
    "eos_token_id": 2,
    "head_dim": 128,
    "hidden_act": "silu",
    "hidden_size": 3072,
    "initializer_range": 0.02,
    "intermediate_size": 9216,
    "max_position_embeddings": 131072,
    "model_type": "voxtral_realtime_text",
    "num_attention_heads": 32,
    "num_hidden_layers": 26,
    "num_key_value_heads": 8,
    "pad_token_id": null,
    "rms_norm_eps": 1e-05,
    "rope_parameters": {
      "rope_theta": 1000000.0,
      "rope_type": "default"
    },
    "sliding_window": 8192,
    "tie_word_embeddings": true,
    "use_cache": true,
    "vocab_size": 131072
  },
  "transformers_version": "5.2.0",
  "use_cache": false
}
generation_config.json ADDED
@@ -0,0 +1,12 @@
{
  "bos_token_id": 1,
  "eos_token_id": [
    2,
    2
  ],
  "output_attentions": false,
  "output_hidden_states": false,
  "pad_token_id": 11,
  "transformers_version": "5.2.0",
  "use_cache": true
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6ee63e48011841bbf002d90f2baacd1c0474c78dddd288093cf55e645e6f363a
size 8859446848
tekken.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8434af1d39eba99f0ef46cf1450bf1a63fa941a26933a1ef5dbbf4adf0d00e44
size 14910348
training_args.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c1b4951fcbeb59169efa164df8cea10ef700d2574b2aa3da47ab8b6d0e914d01
size 5713