---
title: README
emoji: 🌖
colorFrom: pink
colorTo: green
sdk: static
pinned: false
---

# FormosanBank

**Short description:**  
FormosanBank is a large-scale, machine-readable corpus and tooling ecosystem for Taiwan’s Indigenous Formosan languages. It supports research, education, and revitalization across 16 official languages with multimodal text and audio resources.

---

## Overview

FormosanBank curates standardized, machine-actionable corpora for the Indigenous **Formosan** languages of Taiwan (part of the Austronesian family). The project aggregates, cleans, and structures multilingual text and audio into a consistent XML schema, enabling downstream tasks such as ASR/forced alignment, translation, lexicon building, and pedagogical content creation.

- **Scale:** 8M+ tokens, 730+ hours of audio (across languages and corpora).  
- **Structure:** Language-specific corpora delivered in a unified **FormosanBank XML** format with metadata, speaker/source info, and licensing notes where applicable.  
- **Tooling:** Quality-control (QC) utilities for XML validation, orthography checks/extraction, token counting, and cleaning pipelines.
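
To give a feel for what a token-counting QC check might involve, here is a minimal Python sketch using only the standard library. It is not the project's actual tooling: the element names (`TEXT`, `S`, `FORM`, `TRANSL`), the placeholder tokens, and the `count_tokens` helper are assumptions made for illustration, so consult the documentation for the real schema and utilities.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The element names and the placeholder
# tokens (w1, w2, ...) are assumptions, not the real FormosanBank schema.
SAMPLE = """<TEXT>
  <S id="s1"><FORM>w1 w2 w3</FORM><TRANSL>hello there</TRANSL></S>
  <S id="s2"><FORM>w4 w5</FORM></S>
</TEXT>"""


def count_tokens(xml_text, tag="FORM"):
    """Count whitespace-separated tokens in every <tag> element.

    ET.fromstring raises xml.etree.ElementTree.ParseError if the input
    is not well-formed XML, which doubles as a basic validity check.
    """
    root = ET.fromstring(xml_text)
    return sum(len((el.text or "").split()) for el in root.iter(tag))


if __name__ == "__main__":
    print(count_tokens(SAMPLE))            # tokens in FORM elements
    print(count_tokens(SAMPLE, "TRANSL"))  # tokens in TRANSL elements
```

Whitespace splitting is only a rough stand-in for tokenization; a real pipeline would likely apply language-specific orthography rules first.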

---

## Quick links

- 📖 **Documentation / Guidebook**: https://ai4commsci.gitbook.io/formosanbank  
- 🗂️ **Repository (code, corpora, QC tools)**: https://github.com/FormosanBank/FormosanBank  
- 🏛️ **Hugging Face (org home)**: search for the “FormosanBank” organization on Hugging Face to browse datasets & Spaces.