---
title: README
emoji: 🌖
colorFrom: pink
colorTo: green
sdk: static
pinned: false
---
# FormosanBank

**Short description:**
FormosanBank is a large-scale, machine-readable corpus and tooling ecosystem for Taiwan’s Indigenous Formosan languages. It supports research, education, and revitalization across 16 official languages with multimodal text–audio resources.

---
## Overview

FormosanBank curates standardized, machine-actionable corpora for the Indigenous **Formosan** languages of Taiwan, which belong to the Austronesian family. The project aggregates, cleans, and structures multilingual text and audio into a consistent XML schema, enabling downstream tasks such as automatic speech recognition (ASR), forced alignment, translation, lexicon building, and pedagogical content creation.

- **Scale:** 8M+ tokens and 730+ hours of audio across all languages and corpora.
- **Structure:** Language-specific corpora delivered in a unified **FormosanBank XML** format with metadata, speaker/source information, and licensing notes where applicable.
- **Tooling:** Quality-control (QC) utilities for XML validation, orthography checks and extraction, token counting, and cleaning pipelines.
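
To give a feel for the kind of checks listed above, here is a minimal Python sketch of a well-formedness check and a token count over a single corpus file. It is not the project's QC tooling. The element names (`TEXT`, `S`, `FORM`, `TRANSL`), the attributes, and the placeholder sentence content are all assumptions made for illustration, and whitespace splitting stands in for whatever tokenization the real utilities apply. Consult the guidebook for the actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The element and attribute names are
# assumptions for illustration, not the confirmed FormosanBank schema,
# and the sentence content is placeholder text.
SAMPLE_XML = """<TEXT xml:lang="ami" source="example">
  <S id="s1">
    <FORM>aaa bbb</FORM>
    <TRANSL xml:lang="en">placeholder one</TRANSL>
  </S>
  <S id="s2">
    <FORM>ccc ddd eee</FORM>
    <TRANSL xml:lang="en">placeholder two</TRANSL>
  </S>
</TEXT>"""


def is_well_formed(xml_text: str) -> bool:
    """Return True if the string parses as XML (well-formedness only)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def count_tokens(xml_text: str, form_tag: str = "FORM") -> int:
    """Count whitespace-separated tokens inside every `form_tag` element.

    Translations are ignored because only `form_tag` elements are visited.
    """
    root = ET.fromstring(xml_text)
    total = 0
    for form in root.iter(form_tag):
        # itertext() also collects text nested in child elements, if any.
        text = "".join(form.itertext())
        total += len(text.split())
    return total


if __name__ == "__main__":
    print("well-formed:", is_well_formed(SAMPLE_XML))
    print("tokens:", count_tokens(SAMPLE_XML))
```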

---

## Quick links

- 📖 **Documentation / Guidebook**: https://ai4commsci.gitbook.io/formosanbank
- 🗂️ **Repository (code, corpora, QC tools)**: https://github.com/FormosanBank/FormosanBank
- 🏛️ **Hugging Face (org home)**: search for the “FormosanBank” organization on Hugging Face to browse datasets and Spaces.