AI & ML interests

Speech, Language

Recent Activity

lhoestq posted an update about 1 year ago
Made an HF Dataset editor à la Google Sheets here: lhoestq/dataset-spreadsheets

With Dataset Spreadsheets:
✏️ Edit datasets in the UI
🔗 Share link with collaborators
🐍 Use locally in DuckDB or Python

Available for the 100,000+ parquet datasets on HF :)
lhoestq posted an update over 1 year ago
Hey! I'm working on a 100% synthetic Dataset Hub here (you can search for any kind of dataset and the app invents it). The link is here: infinite-dataset-hub/infinite-dataset-hub

Question for the Community:

Which models should I use to generate images and audio samples for those datasets? 🤗
lhoestq posted an update almost 2 years ago
✨ Easy Synthetic Dataset File Generation using LLM DataGen! Link: https://huggingface.co/spaces/lhoestq/LLM_DataGen

features + how it works:

✍️ Generate the dataset content you want just by entering a file name
💡 Optionally specify the column names you need
💨 The dataset is streamed and generated on-the-fly in JSON Lines format
✅ Generation is constrained to always output valid JSON
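For a concrete (if trivial) picture of the JSON Lines format the demo streams, here is a standard-library sketch with made-up rows (`stream_jsonl` and the example records are mine, not from the Space):

```python
import json

def stream_jsonl(rows):
    """Yield each record as one line of JSON -- the JSON Lines format."""
    for row in rows:
        yield json.dumps(row)

# Made-up rows for an imaginary movie dataset
rows = [
    {"title": "Example Film", "year": 2021},
    {"title": "Another One", "year": 1999},
]

lines = list(stream_jsonl(rows))
# Each streamed line is independently valid JSON
parsed = [json.loads(line) for line in lines]
```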

How does this work?
1/ Enter a file name
2/ The model generates column names for such a file. Using structured generation, it produces 2 to 5 column names made of lowercase characters and underscores. I use a prompt that asks for column names for a realistic dataset, and a low temperature.
3/ The columns are used to update the Finite State Machine for structured generation of the dataset content, so that generation produces JSON objects with exactly those columns
4/ The model generates JSON objects using structured generation again, with the updated Finite State Machine. I use a prompt that asks for realistic data, and a temperature of 1.
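A purely illustrative, regex-based sketch of steps 2–4 (the real demo constrains generation at the token level; the string-only values and the helper names here are my simplifications):

```python
import re

COLUMN_NAME = re.compile(r"[a-z_]+")

def valid_columns(columns):
    """Step 2's constraint: 2 to 5 names, lowercase letters and underscores."""
    return 2 <= len(columns) <= 5 and all(COLUMN_NAME.fullmatch(c) for c in columns)

def json_object_pattern(columns):
    """Steps 3-4 as a toy: derive the structure every generated JSON object
    must follow from the column names (string values only, for simplicity)."""
    field = r'"{}":\s*"[^"]*"'
    fields = r",\s*".join(field.format(re.escape(c)) for c in columns)
    return re.compile(r"\{\s*" + fields + r"\s*\}")

cols = ["title", "genre"]
assert valid_columns(cols)

pattern = json_object_pattern(cols)
ok = bool(pattern.fullmatch('{"title": "Dune", "genre": "sci-fi"}'))
bad = bool(pattern.fullmatch('{"title": "Dune"}'))  # missing a column
```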

> Why update a Finite State Machine instead of re-creating one?

Creating one can take up to 30 seconds, while updating one takes about 0.1 s (though updating requires manipulating a graph, which is not easy to implement)
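outlines' actual machine works over token ids and its update step is more involved; as a character-level toy of the idea (same-length keys only, and every name below is mine), "updating" means overwriting just the states that spell the key while the quote/colon skeleton stays untouched:

```python
def literal_chain(word, start):
    """Character-per-state transitions that spell out `word`."""
    return {start + i: {ch: start + i + 1} for i, ch in enumerate(word)}

def build_key_dfa(key):
    """Build a DFA for '"<key>":' from scratch (the expensive path)."""
    dfa = {0: {'"': 1}}                    # opening quote
    dfa.update(literal_chain(key, 1))      # states spelling the key
    end = 1 + len(key)
    dfa[end] = {'"': end + 1}              # closing quote
    dfa[end + 1] = {':': end + 2}          # colon
    return dfa, end + 2                    # (transitions, accepting state)

def update_key(machine, new_key):
    """The cheap path: splice in a new key chain, reuse the skeleton."""
    dfa, accept = machine
    assert len(new_key) + 3 == accept, "toy limitation: same-length keys only"
    dfa.update(literal_chain(new_key, 1))  # overwrite only the key states
    return dfa, accept

def accepts(machine, text):
    dfa, accept = machine
    state = 0
    for ch in text:
        state = dfa.get(state, {}).get(ch)
        if state is None:
            return False
    return state == accept

machine = build_key_dfa("title")
machine = update_key(machine, "genre")  # no full rebuild needed
```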

> Batched generation is faster, why not use it?

Generating in batches is faster but tends to produce duplicates for this demo.
Further work could be to provide different prompts (one per sequence in the batch) so that each batch ends up with a different distribution of sequences, or to implement a custom sampler that forbids generating the same data across sequences of the same batch.
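A possible post-hoc band-aid (my sketch, not part of the demo): drop duplicate objects from a generated batch, treating key order as irrelevant:

```python
import json

def drop_duplicate_rows(batch):
    """Keep only the first occurrence of each generated JSON object."""
    seen = set()
    unique = []
    for line in batch:
        key = json.dumps(json.loads(line), sort_keys=True)  # canonical form
        if key not in seen:
            seen.add(key)
            unique.append(line)
    return unique

batch = [
    '{"title": "Dune", "genre": "sci-fi"}',
    '{"genre": "sci-fi", "title": "Dune"}',  # same object, different key order
    '{"title": "Heat", "genre": "crime"}',
]
unique = drop_duplicate_rows(batch)  # two distinct rows remain
```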

> How does structured generation work?

I used the outlines library with transformers to define a JSON schema that the generation has to follow. It uses a Finite State Machine with token ids as transitions.

Let me know what you think! And feel free to duplicate/modify it to try other models/prompts or sampling methods :)
sanchit-gandhi posted an update almost 2 years ago
Why does returning timestamps help Whisper reduce hallucinations? 🧐

Empirically, most practitioners have found that setting return_timestamps=True helps reduce hallucinations, particularly when doing long-form evaluation with Transformers’ “chunked” algorithm.

But why does this work?

My interpretation is that forcing the model to predict timestamps is contradictory to hallucinations. Suppose you have the transcription:
The cat sat on the on the on the mat.

Where we have a repeated hallucination for “on the”. If we ask the model to predict timestamps, then the “on the” has to contribute to the overall segment-level timing, e.g.:
<|0.00|> The cat sat on the on the on the mat.<|5.02|>

However, it’s impossible to fit 3 copies of “on the” within the time allocation given to the segment, so the probability for this hallucinatory sequence becomes lower, and the model actually predicts the correct transcription with highest probability:
<|0.00|> The cat sat on the mat.<|5.02|>
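Sketching that argument with toy numbers (the per-word speaking rate below is an invented illustration, not anything Whisper models explicitly):

```python
# Toy sanity check of the argument above. The 0.55 s/word rate is purely
# illustrative (my assumption, not from the model or the paper).
SECONDS_PER_WORD = 0.55
segment = 5.02  # the <|0.00|> ... <|5.02|> window

clean = "The cat sat on the mat".split()                       # 6 words
hallucinated = "The cat sat on the on the on the mat".split()  # 10 words

clean_time = len(clean) * SECONDS_PER_WORD               # ~3.3 s, fits
hallucinated_time = len(hallucinated) * SECONDS_PER_WORD  # ~5.5 s, does not
```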

In this sense, the end timestamp is the opposite of the initial timestamp constraint described in Section 4.5 of the paper Robust Speech Recognition via Large-Scale Weak Supervision (2212.04356): it helps the model remove extra words at the end of the sequence (whereas the initial timestamp helps when the model ignores words at the start), but the overall principle is the same: using timestamps to improve the probability of more realistic sequences.

Leaving it open to you: why do you think timestamps reduce Whisper hallucinations?