	---
	title: README
	emoji: 🌖
	colorFrom: pink
	colorTo: green
	sdk: static
	pinned: false
	---

	# FormosanBank

	Short description:
	FormosanBank is a large-scale, machine-readable corpus and tooling ecosystem for Taiwan’s Indigenous Formosan languages. It supports research, education, and revitalization across 16 official languages with multimodal text–audio resources.

	---

	## Overview

	FormosanBank curates standardized, machine-actionable corpora for the Indigenous Formosan languages of Taiwan (part of the Austronesian family). The project aggregates, cleans, and structures multilingual text and audio into a consistent XML schema, enabling downstream tasks such as ASR/forced alignment, translation, lexicon building, and pedagogical content creation.
	- Scale: 8M+ tokens and 730+ hours of audio in total, summed across all languages and corpora.
	- Structure: Language-specific corpora delivered in a unified FormosanBank XML format with metadata, speaker/source info, and licensing notes where applicable.
	- Tooling: Quality-control (QC) utilities for XML validation, orthography checks/extraction, token counting, and cleaning pipelines.
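To give a feel for the kind of check the QC utilities perform, here is a minimal sketch of a well-formedness check and a token count over an XML document, using only the Python standard library. The element names (`TEXT`, `S`, `FORM`, `TRANSL`), the attributes, and the sample content are assumptions made for illustration; they are not taken from the FormosanBank schema or from the project's own QC code, so consult the documentation for the real format.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. Tag names, attributes, and the placeholder
# text are illustrative assumptions, not the documented FormosanBank schema.
SAMPLE = """<TEXT xml:lang="ami" source="example">
  <S id="1"><FORM>first example sentence</FORM><TRANSL xml:lang="en">translation one</TRANSL></S>
  <S id="2"><FORM>second one</FORM></S>
</TEXT>"""


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML (well-formedness only, no schema check)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def count_tokens(xml_text: str, form_tag: str = "FORM") -> int:
    """Count whitespace-separated tokens inside every `form_tag` element."""
    root = ET.fromstring(xml_text)
    return sum(len((el.text or "").split()) for el in root.iter(form_tag))


if __name__ == "__main__":
    print(is_well_formed(SAMPLE))
    print(count_tokens(SAMPLE))
```

Whitespace splitting is the simplest possible tokenizer; a real pipeline would likely need orthography-aware rules for each language.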

	---

	## Quick links

	- 📖 Documentation / Guidebook: https://ai4commsci.gitbook.io/formosanbank
	- 🗂️ Repository (code, corpora, QC tools): https://github.com/FormosanBank/FormosanBank
	- 🏛️ Hugging Face (org home): search for the “FormosanBank” organization on Hugging Face to browse datasets & Spaces.