---
title: README
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
short_description: Organization card for podscripter-project
---
# podscripter-project
This is the Hugging Face organization for [**podscripter**](https://github.com/algernon725/podscripter),
a Dockerized local-first transcription tool built on OpenAI Whisper, pyannote.audio
speaker diarization, and sentence-transformers punctuation restoration. Primary language
focus: **English**, **Spanish**, **French**.
This org doesn't publish models; Whisper and pyannote live in their own upstream orgs.
What lives here is the **supporting data** that the podscripter project owns and republishes
under permissive licenses, primarily for testing and reproducibility.
## What's published here
### Datasets
- [`podscripter-project/test-fixtures`](https://huggingface.co/datasets/podscripter-project/test-fixtures):
small, curated EN/ES/FR audio clips (CC-BY 4.0) used by podscripter's Tier 1 regression
tests. Audio is sourced from permissively licensed public corpora (LibriSpeech, FLEURS, MLS, AMI)
and trimmed/concatenated to exercise specific pipeline code paths (single-speaker ASR,
multi-speaker diarization, chunked-mode transcription). Each clip ships with verbatim transcripts, speaker turns, source
attribution, and per-fixture WER/DER thresholds.
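
As a rough illustration of what a per-fixture WER threshold check involves, here is a minimal sketch in plain Python. It is not podscripter's actual test code: the function names, the `max_wer` parameter, and the normalization (lowercasing and whitespace tokenization) are all assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: text is normalized by lowercasing and splitting on
    whitespace. A real test suite may normalize differently.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def passes_wer(reference: str, hypothesis: str, max_wer: float) -> bool:
    """Return True when the hypothesis stays within the threshold.

    `max_wer` is a hypothetical name for a fixture's WER threshold.
    """
    return word_error_rate(reference, hypothesis) <= max_wer
```

A regression test built this way would compare the pipeline's transcript against the fixture's verbatim transcript and fail when `passes_wer` returns `False`.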
## License posture
Everything published here is **permissively licensed** (CC-BY 4.0 or CC0 1.0). Aggregate
licenses match the most restrictive component, typically CC-BY 4.0, which requires
attribution and indication of changes when redistributed. Per-source attribution lives in
each artifact's dataset card and (for `test-fixtures`) in
[`tests/fixtures/audio/LICENSES.md`](https://github.com/algernon725/podscripter/blob/main/tests/fixtures/audio/LICENSES.md)
in the podscripter repo.
NC/ND-licensed sources are deliberately excluded so artifacts here can be freely
redistributed.
## Contributing
Issues, fixture proposals, and bug-reproduction clips all go through the
[podscripter GitHub repo](https://github.com/algernon725/podscripter). The
[contribution workflow for new audio fixtures](https://github.com/algernon725/podscripter/blob/main/tests/fixtures/audio/README.md#adding-a-new-fixture)
covers trimming, licensing requirements, the `.expected.json` schema, and bumping
`HF_REVISION` so the dataset and tests stay in lockstep.
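
To show why a pinned revision keeps the dataset and tests in lockstep, here is a hedged sketch of fetching the fixtures at a fixed revision with `huggingface_hub.snapshot_download`. It assumes `HF_REVISION` holds a commit hash or tag string; the helper names `download_kwargs` and `fetch_fixtures` are hypothetical and are not taken from the podscripter repo.

```python
REPO_ID = "podscripter-project/test-fixtures"


def download_kwargs(revision: str) -> dict:
    """Build the arguments for a pinned dataset download.

    Hypothetical helper: refuses an empty revision so a download never
    silently falls back to the repository's default branch.
    """
    if not revision:
        raise ValueError("a pinned revision is required")
    return {"repo_id": REPO_ID, "repo_type": "dataset", "revision": revision}


def fetch_fixtures(revision: str) -> str:
    """Download the dataset snapshot and return its local directory."""
    # Imported lazily so the helper above works without the library installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(revision))
```

With a pinned revision, a change to the published dataset cannot alter test inputs until the pin is bumped deliberately alongside the updated expectations.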