	---
	title: README
	emoji: 📚
	colorFrom: blue
	colorTo: indigo
	sdk: static
	pinned: false
	short_description: Organization card for podscripter-project
	---

	# podscripter-project

	This is the Hugging Face organization for [podscripter](https://github.com/algernon725/podscripter),
	a Dockerized local-first transcription tool built on OpenAI Whisper, pyannote.audio
	speaker diarization, and sentence-transformers punctuation restoration. Primary language
	focus: English, Spanish, French.

	This org doesn't publish models — Whisper and pyannote live in their own upstream orgs.
	What lives here is the supporting data that the podscripter project owns and republishes
	under permissive licenses, primarily for testing and reproducibility.

	## What's published here

	### Datasets

	- [`podscripter-project/test-fixtures`](https://huggingface.co/datasets/podscripter-project/test-fixtures)
	— small, curated EN/ES/FR audio clips (CC-BY 4.0) used by podscripter's Tier 1 regression
	tests. Audio is sourced from permissively licensed public corpora (LibriSpeech, FLEURS, MLS, AMI)
	and trimmed/concatenated to exercise specific pipeline code paths (single-speaker ASR,
	multi-speaker diarization, chunked-mode transcription). Each clip ships with verbatim transcripts, speaker turns, source
	attribution, and per-fixture WER/DER thresholds.
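
As a rough illustration of how a per-fixture WER threshold might be checked, here is a minimal sketch. The field names `transcript` and `max_wer`, the sample sentences, and the lowercase/whitespace normalization are assumptions made for this example; the real `.expected.json` schema and podscripter's own scoring may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical fixture record (illustrative field names and values only).
fixture = {
    "transcript": "the quick brown fox jumps over the lazy dog",
    "max_wer": 0.15,
}
hypothesis = "the quick brown fox jumped over the lazy dog"

wer = word_error_rate(fixture["transcript"], hypothesis)
passed = wer <= fixture["max_wer"]
```

One substituted word out of nine reference words gives a WER of 1/9, which sits under the illustrative 0.15 threshold, so `passed` is true here.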

	## License posture

	Everything published here is permissively licensed (CC-BY 4.0 or CC0 1.0). Aggregate
	licenses match the most restrictive component — typically CC-BY 4.0, which requires
	attribution and indication of changes when redistributed. Per-source attribution lives in
	each artifact's dataset card and (for the `test-fixtures` dataset) in
	[`tests/fixtures/audio/LICENSES.md`](https://github.com/algernon725/podscripter/blob/main/tests/fixtures/audio/LICENSES.md)
	in the podscripter repo.

	NC/ND-licensed sources are deliberately excluded so artifacts here can be freely
	redistributed.
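
The "most restrictive component" rule can be sketched in a few lines. This is an illustration only, not code from the podscripter repo: the SPDX-style identifiers, the `RESTRICTIVENESS` table, and the `aggregate_license` helper are assumptions introduced for the example.

```python
# Ordered from least to most restrictive. Only the two licenses this card
# names are accepted; anything else (for example NC or ND variants) is rejected.
RESTRICTIVENESS = {"CC0-1.0": 0, "CC-BY-4.0": 1}


def aggregate_license(component_licenses: list[str]) -> str:
    """Return the most restrictive license among an artifact's components."""
    if not component_licenses:
        raise ValueError("an artifact needs at least one component license")
    rejected = [lic for lic in component_licenses if lic not in RESTRICTIVENESS]
    if rejected:
        raise ValueError(f"license(s) not accepted: {rejected}")
    return max(component_licenses, key=RESTRICTIVENESS.__getitem__)
```

For instance, `aggregate_license(["CC0-1.0", "CC-BY-4.0"])` evaluates to `"CC-BY-4.0"`, while a list containing `"CC-BY-NC-4.0"` raises `ValueError`.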

	## Contributing

	Issues, fixture proposals, and bug-reproduction clips all go through the
	[podscripter GitHub repo](https://github.com/algernon725/podscripter). The
	[contribution workflow for new audio fixtures](https://github.com/algernon725/podscripter/blob/main/tests/fixtures/audio/README.md#adding-a-new-fixture)
	covers trimming, licensing requirements, the `.expected.json` schema, and bumping
	`HF_REVISION` so the dataset and tests stay in lockstep.
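
For orientation, pinning a dataset revision with the `huggingface_hub` client might look like the sketch below. Where `HF_REVISION` is actually defined, what form its value takes, and how the tests fetch fixtures are not stated in this card; the placeholder value and both helper functions are hypothetical.

```python
HF_REPO_ID = "podscripter-project/test-fixtures"
HF_REVISION = "<commit-sha>"  # placeholder, not a real revision


def fixture_download_kwargs(revision: str = HF_REVISION) -> dict:
    """Build the arguments that pin the dataset download to one revision."""
    return {
        "repo_id": HF_REPO_ID,
        "repo_type": "dataset",
        "revision": revision,
    }


def download_fixtures() -> str:
    """Download the pinned snapshot and return its local directory path."""
    # Imported lazily so the module loads without the third-party package.
    from huggingface_hub import snapshot_download

    return snapshot_download(**fixture_download_kwargs())
```

Pinning to a specific revision rather than the default branch means a later upload to the dataset does not change what the tests see until `HF_REVISION` is bumped alongside the updated expectations.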