AI & ML interests

Speech, Language

Recent Activity

lhoestq posted an update about 1 year ago
Made an HF Dataset editor à la Google Sheets here: lhoestq/dataset-spreadsheets

With Dataset Spreadsheets:
✏️ Edit datasets in the UI
🔗 Share link with collaborators
🐍 Use locally in DuckDB or Python

Available for the 100,000+ parquet datasets on HF :)
lhoestq posted an update over 1 year ago
Hey! I'm working on a 100% synthetic Dataset Hub here (you can search for any kind of dataset and the app invents it). The link is here: infinite-dataset-hub/infinite-dataset-hub

Question for the Community:

Which models should I use to generate images and audio samples for those datasets? 🤗
lhoestq posted an update almost 2 years ago
✨ Easy Synthetic Dataset File Generation using LLM DataGen! Link: https://huggingface.co/spaces/lhoestq/LLM_DataGen

features + how it works:

✍️ Generate the dataset content you want just by entering a file name
💡 Optionally specify the column names you need
💨 The dataset is streamed and generated on-the-fly in JSON Lines format
✅ Generation is constrained to always output valid JSON
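For a concrete (if trivial) picture of the JSON Lines format the demo streams, here is a standard-library sketch with made-up rows (`stream_jsonl` and the example records are mine, not from the Space):

```python
import json

def stream_jsonl(rows):
    """Yield each record as one line of JSON -- the JSON Lines format."""
    for row in rows:
        yield json.dumps(row)

# Made-up rows for an imaginary movie dataset
rows = [
    {"title": "Example Film", "year": 2021},
    {"title": "Another One", "year": 1999},
]

lines = list(stream_jsonl(rows))
# Each streamed line is independently valid JSON
parsed = [json.loads(line) for line in lines]
```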

How does this work?
1/ Enter a file name
2/ The model generates column names for such a file. Using structured generation, it produces 2 to 5 column names made of lowercase characters and underscores. I use a prompt that asks for column names for a realistic dataset, and a low temperature.
3/ The columns are used to update the Finite State Machine for structured generation of the dataset content, so that generation produces JSON objects with exactly those columns
4/ The model generates JSON objects using structured generation again, with the updated Finite State Machine. I use a prompt that asks for realistic data, and a temperature of 1.
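A purely illustrative, regex-based sketch of steps 2–4 (the real demo constrains generation at the token level; the string-only values and the helper names here are my simplifications):

```python
import re

COLUMN_NAME = re.compile(r"[a-z_]+")

def valid_columns(columns):
    """Step 2's constraint: 2 to 5 names, lowercase letters and underscores."""
    return 2 <= len(columns) <= 5 and all(COLUMN_NAME.fullmatch(c) for c in columns)

def json_object_pattern(columns):
    """Steps 3-4 as a toy: derive the structure every generated JSON object
    must follow from the column names (string values only, for simplicity)."""
    field = r'"{}":\s*"[^"]*"'
    fields = r",\s*".join(field.format(re.escape(c)) for c in columns)
    return re.compile(r"\{\s*" + fields + r"\s*\}")

cols = ["title", "genre"]
assert valid_columns(cols)

pattern = json_object_pattern(cols)
ok = bool(pattern.fullmatch('{"title": "Dune", "genre": "sci-fi"}'))
bad = bool(pattern.fullmatch('{"title": "Dune"}'))  # missing a column
```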

> Why update a Finite State Machine instead of re-creating one?

Creating one can take up to 30 seconds, while updating one takes about 0.1 s (though updating requires manipulating a graph, which is not easy to implement)
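outlines' actual machine works over token ids and its update step is more involved; as a character-level toy of the idea (same-length keys only, and every name below is mine), "updating" means overwriting just the states that spell the key while the quote/colon skeleton stays untouched:

```python
def literal_chain(word, start):
    """Character-per-state transitions that spell out `word`."""
    return {start + i: {ch: start + i + 1} for i, ch in enumerate(word)}

def build_key_dfa(key):
    """Build a DFA for '"<key>":' from scratch (the expensive path)."""
    dfa = {0: {'"': 1}}                    # opening quote
    dfa.update(literal_chain(key, 1))      # states spelling the key
    end = 1 + len(key)
    dfa[end] = {'"': end + 1}              # closing quote
    dfa[end + 1] = {':': end + 2}          # colon
    return dfa, end + 2                    # (transitions, accepting state)

def update_key(machine, new_key):
    """The cheap path: splice in a new key chain, reuse the skeleton."""
    dfa, accept = machine
    assert len(new_key) + 3 == accept, "toy limitation: same-length keys only"
    dfa.update(literal_chain(new_key, 1))  # overwrite only the key states
    return dfa, accept

def accepts(machine, text):
    dfa, accept = machine
    state = 0
    for ch in text:
        state = dfa.get(state, {}).get(ch)
        if state is None:
            return False
    return state == accept

machine = build_key_dfa("title")
machine = update_key(machine, "genre")  # no full rebuild needed
```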

> Batched generation is faster, why not use it?

Generating in batches is faster but tends to produce duplicates for this demo.
Further work could be to provide different prompts (one per sequence in the batch) so that each batch ends up with a different distribution of sequences, or to implement a custom sampler that forbids generating the same data across sequences of the same batch.
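A possible post-hoc band-aid (my sketch, not part of the demo): drop duplicate objects from a generated batch, treating key order as irrelevant:

```python
import json

def drop_duplicate_rows(batch):
    """Keep only the first occurrence of each generated JSON object."""
    seen = set()
    unique = []
    for line in batch:
        key = json.dumps(json.loads(line), sort_keys=True)  # canonical form
        if key not in seen:
            seen.add(key)
            unique.append(line)
    return unique

batch = [
    '{"title": "Dune", "genre": "sci-fi"}',
    '{"genre": "sci-fi", "title": "Dune"}',  # same object, different key order
    '{"title": "Heat", "genre": "crime"}',
]
unique = drop_duplicate_rows(batch)  # two distinct rows remain
```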

> How does structured generation work?

I used the outlines library with transformers to define a JSON schema that the generation has to follow. It uses a Finite State Machine with token ids as transitions.

Let me know what you think! And feel free to duplicate/modify it to try other models/prompts or sampling methods :)
sanchit-gandhi posted an update almost 2 years ago
Why does returning timestamps help Whisper reduce hallucinations? 🧐

Empirically, most practitioners have found that setting return_timestamps=True helps reduce hallucinations, particularly when doing long-form evaluation with Transformers’ “chunked” algorithm.

But why does this work?

My interpretation is that forcing the model to predict timestamps is contradictory to hallucinations. Suppose you have the transcription:
The cat sat on the on the on the mat.

Where we have a repeated hallucination for “on the”. If we ask the model to predict timestamps, then the “on the” has to contribute to the overall segment-level timing, e.g.:
<|0.00|> The cat sat on the on the on the mat.<|5.02|>

However, it’s impossible to fit 3 copies of “on the” within the time allocation given to the segment, so the probability for this hallucinatory sequence becomes lower, and the model actually predicts the correct transcription with highest probability:
<|0.00|> The cat sat on the mat.<|5.02|>
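Sketching that argument with toy numbers (the per-word speaking rate below is an invented illustration, not anything Whisper models explicitly):

```python
# Toy sanity check of the argument above. The 0.55 s/word rate is purely
# illustrative (my assumption, not from the model or the paper).
SECONDS_PER_WORD = 0.55
segment = 5.02  # the <|0.00|> ... <|5.02|> window

clean = "The cat sat on the mat".split()                       # 6 words
hallucinated = "The cat sat on the on the on the mat".split()  # 10 words

clean_time = len(clean) * SECONDS_PER_WORD               # ~3.3 s, fits
hallucinated_time = len(hallucinated) * SECONDS_PER_WORD  # ~5.5 s, does not
```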

In this sense, the end timestamp is the opposite of the initial timestamp constraint described in Section 4.5 of the paper Robust Speech Recognition via Large-Scale Weak Supervision (2212.04356): it helps the model remove extra words at the end of the sequence (whereas the initial timestamp helps when the model ignores words at the start), but the overall principle is the same: using timestamps to improve the probability of more realistic sequences.

Leaving it open to you: why do you think timestamps reduce Whisper hallucinations?