Dataset Guidelines
Minimum metadata
- Speaker ID (anonymized)
- Approximate age band
- Gender (optional/self-declared)
- Dialect/region
- Recording environment and device class
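
One way to hold these fields is a small record type with a completeness check. This is a minimal sketch, not a required schema: the names `UtteranceMetadata`, `anonymize_speaker`, and `missing_fields`, the example values, and the use of a salted hash as the anonymized ID are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class UtteranceMetadata:
    """Hypothetical record covering the minimum metadata fields."""
    speaker_id: str        # anonymized ID, never a real name
    age_band: str          # approximate band, e.g. "25-34"
    dialect_region: str
    environment: str       # recording environment, e.g. "quiet room"
    device_class: str      # e.g. "headset", "phone"
    gender: Optional[str] = None  # optional and self-declared


def anonymize_speaker(raw_id: str, salt: str) -> str:
    """Derive a stable pseudonymous ID (assumed scheme: salted SHA-256, truncated)."""
    digest = hashlib.sha256((salt + raw_id).encode("utf-8")).hexdigest()
    return digest[:12]


def missing_fields(meta: UtteranceMetadata) -> List[str]:
    """Return the names of required fields that are empty or whitespace only."""
    required = ("speaker_id", "age_band", "dialect_region",
                "environment", "device_class")
    return [name for name in required if not getattr(meta, name).strip()]
```

A record passes when `missing_fields(meta)` returns an empty list. Gender is deliberately left out of the required set because it is optional.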
Audio quality basics
- Prefer clean speech sampled at 16 kHz or higher
- Avoid clipping and heavy background noise
- Keep each transcript aligned with the spoken content
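
The sample-rate and clipping points can be checked automatically. The sketch below uses only the Python standard library and assumes 16-bit PCM WAV input. The function name `check_wav` and the 0.1% clipped-sample threshold are illustrative assumptions, not agreed limits. It does not measure background noise or transcript alignment, which need separate tooling or manual review.

```python
import struct
import wave


def check_wav(source, min_rate=16000, max_clipped_fraction=0.001):
    """Report sample rate and clipping for a 16-bit PCM WAV file.

    `source` is a file path or a binary file-like object.
    The thresholds are assumed defaults, not project-mandated values.
    """
    with wave.open(source, "rb") as wf:
        rate = wf.getframerate()
        width = wf.getsampwidth()
        frames = wf.readframes(wf.getnframes())

    if width != 2:
        raise ValueError("this sketch only handles 16-bit PCM audio")

    # WAV samples are little-endian signed 16-bit integers.
    count = len(frames) // 2
    samples = struct.unpack("<%dh" % count, frames[: count * 2])

    # Treat a sample at (or beyond) full scale as clipped.
    clipped = sum(1 for s in samples if abs(s) >= 32767)
    fraction = clipped / count if count else 0.0

    return {
        "sample_rate": rate,
        "sample_rate_ok": rate >= min_rate,
        "clipped_fraction": fraction,
        "clipping_ok": fraction <= max_clipped_fraction,
    }
```

Samples from all channels are counted together, so a stereo file with one clipped channel is still flagged.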
Text policy
- Use the agreed normalization rules
- Keep punctuation consistent
- Track alternate spellings in a glossary
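
As a rough illustration of how normalization rules and a glossary might fit together, the sketch below applies Unicode NFKC normalization, collapses whitespace, and maps alternate spellings to a canonical form. The rules chosen here, the `GLOSSARY` entries, and the name `normalize_transcript` are assumptions for the example. The actual rules are whatever the project has agreed on.

```python
import re
import unicodedata

# Hypothetical glossary: alternate spelling (lowercase) -> canonical spelling.
GLOSSARY = {
    "colour": "color",
    "e-mail": "email",
}


def normalize_transcript(text, glossary=GLOSSARY):
    """Apply assumed normalization rules to one transcript line."""
    # Fold compatibility characters (ligatures, full-width forms) to a standard form.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()

    def swap(match):
        word = match.group(0)
        # Lookup is case-insensitive; unknown words pass through unchanged.
        return glossary.get(word.lower(), word)

    # Words may contain apostrophes and hyphens; punctuation is left untouched.
    return re.sub(r"[\w'-]+", swap, text)
```

A replaced word takes the glossary's casing, so a capitalized alternate spelling comes out in the canonical lowercase form. Punctuation is left as written.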