# Streaming-STT-1.5B
Continued pretraining of Qwen/Qwen2.5-1.5B on malaysia-ai/Malaysian-STT, 3B audio tokens (equivalent to roughly 72k hours), with native support for:

- Streaming mode via the `<|streaming|>` prefix.
- Semantic VAD by predicting the `<|endofspeech|>` token probability in streaming mode.
- Whole mode via the `<|whole|>` prefix.
- Segment-level timestamps via the `<|segment|>` prefix.
- Word-level timestamps via the `<|word|>` prefix.
- Prediction beyond 30 seconds of audio.
- Plug and play in any continuous batching serving framework such as vLLM; it is just another Qwen2.5 model.
- Uses the GLM4 speech tokenizer at 12.5 tokens per second. Discrete tokens work like a charm with prefix caching, especially for streaming.
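All of these prefixes compose into a plain text prompt: discrete speech token IDs are rendered as `<|s{id}|>` text tokens, sandwiched between a mode prefix and `<|endofspeech|>`. A minimal sketch of the prompt layout (the token IDs below are made up; real ones come from the GLM4 tokenizer, as in the examples further down):

```python
# Hypothetical token IDs; in practice these come from glm4.tokenize(...).
speech_token_ids = [101, 7, 4042]

def build_prompt(ids, mode='<|whole|>', timestamp='<|segment|>', end=True):
    """Render discrete speech token IDs as <|s{id}|> text tokens,
    prefixed by the mode and timestamp-granularity control tokens."""
    body = ''.join(f'<|s{t}|>' for t in ids)
    suffix = '<|endofspeech|>' if end else ''
    return mode + timestamp + body + suffix

print(build_prompt(speech_token_ids))
# → <|whole|><|segment|><|s101|><|s7|><|s4042|><|endofspeech|>
```

Omitting `<|endofspeech|>` (`end=False`) matches the Semantic VAD example below, where the model itself scores how likely that token is next.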
## Benchmark
## How do we train

- Multipacking with proper document masking at 10240 context length.
- FP32-BF16 mixed precision training.
- Full parameter finetuning.
- WandB at https://wandb.ai/huseinzol05/Qwen-Qwen2.5-1.5B-STT-10k
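Document masking in a multipacked batch means each token attends only to earlier tokens of its own document, never across packed boundaries. A pure-Python sketch with made-up document lengths (the real training code is in the repository linked below):

```python
def packed_causal_mask(lengths):
    """Block-diagonal causal mask for a packed sequence: position i may
    attend to position j only if j <= i AND both fall in the same document."""
    n = sum(lengths)
    doc_id = []
    for d, length in enumerate(lengths):
        doc_id += [d] * length
    return [
        [j <= i and doc_id[i] == doc_id[j] for j in range(n)]
        for i in range(n)
    ]

# Three documents of lengths 3, 2 and 4 packed into one 9-token sequence.
mask = packed_causal_mask([3, 2, 4])
```

Without the `doc_id` check this would be an ordinary causal mask, and tokens of one audio clip would leak attention into the next clip packed behind it.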
## How to

First you need to install the speech tokenizer:

```bash
pip3 install git+https://github.com/malaysia-ai/glm4-audio-tokenizer
```

And load the model:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
from glm4_audio_tokenizer import Glm4Tokenizer
import torch

glm4 = Glm4Tokenizer().to(torch.float16).cuda()
model = AutoModelForCausalLM.from_pretrained('malaysia-ai/Streaming-STT-1.5B').cuda()
tokenizer = AutoTokenizer.from_pretrained('malaysia-ai/Streaming-STT-1.5B')
```
### Whole segment timestamp mode

```python
# !wget https://github.com/mesolitica/malaya-speech/raw/refs/heads/master/speech/record/husein-chat.mp3
speech_tokens = glm4.tokenize(['husein-chat.mp3'])
token = ''.join([f'<|s{t}|>' for t in speech_tokens[0]]) + '<|endofspeech|>'
prompt = '<|whole|><|segment|>' + token
input_ids = tokenizer(prompt, return_tensors = 'pt').to('cuda')
generate_kwargs = dict(
    **input_ids,
    max_new_tokens=1024,
)
generation_output = model.generate(**generate_kwargs)
tokenizer.decode(generation_output[0, input_ids['input_ids'].shape[1]:])
```

Output:

```
<|0.30|> Hai,<|0.56|><|1.16|> saya adalah pembantu AI anda.<|3.02|><|3.62|> Selamat berkenalan!<|4.50|><|5.06|> Apa yang saya boleh tolong<|6.20|><|6.52|> untuk buatkan hari anda lebih ceria?<|8.62|><|endoftext|>
```
### Whole word timestamp mode

```python
# !wget https://github.com/mesolitica/malaya-speech/raw/refs/heads/master/speech/record/husein-chat.mp3
speech_tokens = glm4.tokenize(['husein-chat.mp3'])
token = ''.join([f'<|s{t}|>' for t in speech_tokens[0]]) + '<|endofspeech|>'
prompt = '<|whole|><|word|>' + token
input_ids = tokenizer(prompt, return_tensors = 'pt').to('cuda')
generate_kwargs = dict(
    **input_ids,
    max_new_tokens=1024,
)
generation_output = model.generate(**generate_kwargs)
tokenizer.decode(generation_output[0, input_ids['input_ids'].shape[1]:])
```

Output:

```
<|0.30|> Hai,<|0.56|><|1.16|> saya<|1.38|><|1.50|> adalah<|1.78|><|1.84|> pembantu<|2.22|><|2.42|> AI<|2.68|><|2.86|> anda.<|3.06|><|3.66|> Selamat<|3.98|><|4.04|> berkenalan!<|4.52|><|5.08|> Apa<|5.22|><|5.30|> yang<|5.42|><|5.48|> saya<|5.62|><|5.68|> boleh<|5.84|><|5.88|> tolong<|6.20|><|6.54|> untuk<|6.74|><|6.80|> buatkan<|7.12|><|7.22|> hari<|7.42|><|7.54|> anda<|7.72|><|7.84|> lebih<|8.02|><|8.08|> ceria?<|8.62|><|endoftext|>
```
### Streaming segment timestamp mode

```python
# !wget https://github.com/mesolitica/malaya-speech/raw/refs/heads/master/speech/record/husein-chat-part1.mp3
# !wget https://github.com/mesolitica/malaya-speech/raw/refs/heads/master/speech/record/husein-chat-part2.mp3
# !wget https://github.com/mesolitica/malaya-speech/raw/refs/heads/master/speech/record/husein-chat-part3.mp3
speech_tokens = glm4.tokenize(['husein-chat-part1.mp3', 'husein-chat-part2.mp3', 'husein-chat-part3.mp3'])
prompt = '<|streaming|><|segment|>'
for i in range(len(speech_tokens)):
    token = ''.join([f'<|s{t}|>' for t in speech_tokens[i]]) + '<|endofspeech|>'
    input_ids = tokenizer(prompt + token, return_tensors = 'pt').to('cuda')
    generate_kwargs = dict(
        **input_ids,
        max_new_tokens=1024,
    )
    generation_output = model.generate(**generate_kwargs)
    # carry the full decoded context forward into the next chunk's prompt
    new_prompt = tokenizer.decode(generation_output[0])
    prompt = new_prompt
    t = tokenizer.decode(generation_output[0, input_ids['input_ids'].shape[1]:])
    print(f'index {i + 1}: {t}')
    print()
```

Output:

```
index 1: <|0.02|> Hai. Saya adalah pembantu AI anda.<|3.24|><|endoftext|>
index 2: <|3.62|> Selamat berkenalan! Apa yang saya boleh tolong?<|6.84|><|endoftext|>
index 3: <|7.24|> Untuk buatkan hari anda lebih ceria.<|9.46|><|endoftext|>
```
### Streaming word timestamp mode

```python
# !wget https://github.com/mesolitica/malaya-speech/raw/refs/heads/master/speech/record/husein-chat-part1.mp3
# !wget https://github.com/mesolitica/malaya-speech/raw/refs/heads/master/speech/record/husein-chat-part2.mp3
# !wget https://github.com/mesolitica/malaya-speech/raw/refs/heads/master/speech/record/husein-chat-part3.mp3
speech_tokens = glm4.tokenize(['husein-chat-part1.mp3', 'husein-chat-part2.mp3', 'husein-chat-part3.mp3'])
# we found that whole mode works better than streaming mode for word timestamps in this example
prompt = '<|whole|><|word|>'
for i in range(len(speech_tokens)):
    token = ''.join([f'<|s{t}|>' for t in speech_tokens[i]]) + '<|endofspeech|>'
    input_ids = tokenizer(prompt + token, return_tensors = 'pt').to('cuda')
    generate_kwargs = dict(
        **input_ids,
        max_new_tokens=1024,
    )
    generation_output = model.generate(**generate_kwargs)
    # carry the full decoded context forward into the next chunk's prompt
    new_prompt = tokenizer.decode(generation_output[0])
    prompt = new_prompt
    t = tokenizer.decode(generation_output[0, input_ids['input_ids'].shape[1]:])
    print(f'index {i + 1}: {t}')
    print()
```

Output:

```
index 1: <|0.28|> Hai.<|0.54|><|1.14|> Saya<|1.36|><|1.48|> adalah<|1.76|><|1.82|> pembantu<|2.22|><|2.40|> AI<|2.68|><|2.84|> anda.<|3.20|><|endoftext|>
index 2: <|3.72|> Selamat<|4.44|><|4.50|> berkenalan!<|5.04|><|5.58|> Apa<|5.72|><|5.82|> yang<|5.92|><|5.98|> saya<|6.12|><|6.18|> boleh<|6.34|><|6.38|> tolong?<|6.96|><|endoftext|>
index 3: <|7.38|> Untuk<|7.64|><|7.72|> buatkan<|8.06|><|8.14|> hari<|8.36|><|8.48|> anda<|8.64|><|8.78|> lebih<|8.96|><|9.00|> ceria.<|9.60|><|endoftext|>
```
### Semantic VAD

```python
# !wget https://github.com/mesolitica/malaya-speech/raw/refs/heads/master/speech/record/husein-chat-not-proper-cut.mp3
# !wget https://github.com/mesolitica/malaya-speech/raw/refs/heads/master/speech/record/husein-chat-proper-cut.mp3
speech_tokens = glm4.tokenize(['husein-chat-not-proper-cut.mp3', 'husein-chat-proper-cut.mp3'])
for i in range(len(speech_tokens)):
    prompt = '<|streaming|><|word|>'
    token = ''.join([f'<|s{t}|>' for t in speech_tokens[i]])
    input_ids = tokenizer(prompt + token, return_tensors = 'pt').to('cuda')
    logits = model(**input_ids).logits
    print(i, logits[0, -1, 151665])  # 151665 is the <|endofspeech|> token
```

Output:

```
0 tensor(51.9292, device='cuda:0') # not proper cut
1 tensor(54.5208, device='cuda:0') # proper cut
```
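The raw logit above is enough for a relative comparison, but for an endpointing decision it is easier to normalize it into a probability with a softmax over the vocabulary. A minimal sketch (the 0.5 threshold is an assumption for illustration, not a recommended value):

```python
import torch

ENDOFSPEECH_ID = 151665  # <|endofspeech|> token id, as in the example above

def endofspeech_prob(logits: torch.Tensor) -> float:
    """Probability mass on <|endofspeech|> at the last position, i.e. how
    confident the model is that the speaker has semantically finished."""
    return torch.softmax(logits[0, -1], dim=-1)[ENDOFSPEECH_ID].item()

# Usage against the model from the example above:
# p = endofspeech_prob(model(**input_ids).logits)
# if p > 0.5:  # threshold is a deployment choice; tune on your own audio
#     ...      # cut the segment and start transcribing the next one
```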
## Source code

Source code at https://github.com/malaysia-ai/cooking/tree/main/qwen-stt
## Acknowledgement

Special thanks to the Lambda Research Grant program for the Lambda cloud credits!