---
title: README
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: static
pinned: false
short_description: Organization card for podscripter-project
---

# podscripter-project

This is the Hugging Face organization for [**podscripter**](https://github.com/algernon725/podscripter),
a Dockerized local-first transcription tool built on OpenAI Whisper, pyannote.audio
speaker diarization, and sentence-transformers punctuation restoration. Primary language
focus: **English**, **Spanish**, **French**.

This org doesn't publish models; Whisper and pyannote live in their own upstream orgs.
What lives here is the **supporting data** that the podscripter project owns and republishes
under permissive licenses, primarily for testing and reproducibility.

## What's published here

### Datasets

- [`podscripter-project/test-fixtures`](https://huggingface.co/datasets/podscripter-project/test-fixtures)
  — small, curated EN/ES/FR audio clips (CC-BY 4.0) used by podscripter's Tier 1 regression
  tests. Audio is sourced from permissively licensed public corpora (LibriSpeech, FLEURS,
  Common Voice, VoxPopuli, AMI, MLS) and trimmed/concatenated to exercise specific
  pipeline code paths (single-speaker ASR, multi-speaker diarization, chunked-mode
  transcription). Each clip ships with a verbatim transcript, speaker turns, source
  attribution, and per-fixture word error rate (WER) and diarization error rate (DER)
  thresholds.
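
As a rough illustration of how a per-fixture WER threshold can gate a regression test, the sketch below computes word error rate with a word-level edit distance and compares it to a limit. The field names `transcript` and `max_wer` are hypothetical and are not taken from the actual `.expected.json` schema, and the normalization (lowercasing and whitespace splitting) is an assumption; podscripter's own tests may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def passes_wer_threshold(expected: dict, hypothesis: str) -> bool:
    """Check a transcript against a fixture's threshold.

    `expected` uses hypothetical keys: "transcript" and "max_wer".
    """
    return word_error_rate(expected["transcript"], hypothesis) <= expected["max_wer"]


# Hypothetical fixture metadata, for illustration only.
fixture = {"transcript": "the cat sat on the mat", "max_wer": 0.2}
print(passes_wer_threshold(fixture, "the cat sat on a mat"))
```

One substituted word out of six reference words gives a WER of 1/6, which sits under the 0.2 limit in this made-up fixture.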

## License posture

Everything published here is **permissively licensed** (CC-BY 4.0 or CC0 1.0). An artifact
that combines several sources takes the license of its most restrictive component, typically
CC-BY 4.0, which requires attribution and indication of changes when redistributed.
Per-source attribution lives in each artifact's dataset card and, for the test fixtures, in
[`tests/fixtures/audio/LICENSES.md`](https://github.com/algernon725/podscripter/blob/main/tests/fixtures/audio/LICENSES.md)
in the podscripter repo.

NC/ND-licensed sources are deliberately excluded so artifacts here can be freely
redistributed.
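
The licensing rule described in this section can be sketched as a small helper: accept only the two permitted licenses, and report the most restrictive one among a set of components. The SPDX-style identifiers, the ranking table, and the function name are assumptions made for illustration; this is not code from the podscripter repo.

```python
from collections.abc import Iterable

# Assumed ranking: a higher number means more obligations on redistribution.
# Only the two licenses named in this card are accepted.
RESTRICTIVENESS = {"CC0-1.0": 0, "CC-BY-4.0": 1}


def aggregate_license(component_licenses: Iterable[str]) -> str:
    """Return the most restrictive license among the components.

    Raises ValueError for an empty input or for any license outside the
    allowed set (for example NC or ND variants).
    """
    licenses = list(component_licenses)
    if not licenses:
        raise ValueError("at least one component license is required")
    disallowed = sorted({lic for lic in licenses if lic not in RESTRICTIVENESS})
    if disallowed:
        raise ValueError(f"licenses not allowed here: {disallowed}")
    return max(licenses, key=lambda lic: RESTRICTIVENESS[lic])


print(aggregate_license(["CC0-1.0", "CC-BY-4.0", "CC0-1.0"]))
```

A single CC-BY 4.0 component is enough to make the whole artifact CC-BY 4.0, which is why that is the typical aggregate license.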

## Contributing

Issues, fixture proposals, and bug-reproduction clips all go through the
[podscripter GitHub repo](https://github.com/algernon725/podscripter). The
[contribution workflow for new audio fixtures](https://github.com/algernon725/podscripter/blob/main/tests/fixtures/audio/README.md#adding-a-new-fixture)
covers trimming, licensing requirements, the `.expected.json` schema, and bumping
`HF_REVISION` so the dataset and tests stay in lockstep.
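
To show why pinning a revision keeps the dataset and tests in lockstep, here is a minimal sketch that builds a download URL for one fixture file at a fixed revision, using the Hub's `resolve/<revision>/<path>` URL layout. The `HF_REVISION` value and the file path are placeholders, not real values from the repo, and how podscripter actually fetches fixtures is not specified here.

```python
from urllib.parse import quote

DATASET_REPO = "podscripter-project/test-fixtures"

# Placeholder only: the real value is whatever the podscripter repo pins.
HF_REVISION = "0123abc"


def fixture_url(path_in_repo: str, revision: str = HF_REVISION) -> str:
    """Build a URL for one file in the dataset at a pinned revision.

    Pinning to a commit hash or tag (rather than "main") means the tests
    keep seeing the same audio and thresholds until the pin is bumped.
    """
    return (
        f"https://huggingface.co/datasets/{DATASET_REPO}"
        f"/resolve/{quote(revision, safe='')}/{quote(path_in_repo)}"
    )


# Hypothetical file path, for illustration only.
print(fixture_url("en/single_speaker.wav"))
```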