hunterschep committed
Commit a1b240a · verified · 1 Parent(s): 07eb7de

Update README.md

  1. README.md +20 -26
README.md CHANGED
@@ -1,36 +1,30 @@
  ---
- title: README
- emoji: 🌖
- colorFrom: pink
- colorTo: green
- sdk: static
- pinned: false
  ---

- ## FormosanBank

- **What is FormosanBank?**
- FormosanBank is an open-source repository of corpora and quality-control tools supporting the documentation, processing, and machine-learning use of Taiwan’s indigenous Formosan languages. :contentReference[oaicite:1]{index=1}

- **Key Features:**
- - A large collection of corpora across multiple Formosan languages (e.g., Amis, Paiwan, Atayal) with scripts for cleaning, orthography extraction, validation etc. :contentReference[oaicite:2]{index=2}
- - Quality control modules: punctuation checks, non-ASCII filtering, XML-template verification, orthography extraction. :contentReference[oaicite:3]{index=3}
- - Designed to support downstream NLP tasks (translation, ASR, summarization) for low-resource languages. :contentReference[oaicite:4]{index=4}
- - License: *[you should insert your specific license here]*
- - Maintained in a GitHub repository: [https://github.com/FormosanBank/FormosanBank](https://github.com/FormosanBank/FormosanBank) :contentReference[oaicite:5]{index=5}
- - Linked documentation: [https://ai4commsci.gitbook.io/formosanbank](https://ai4commsci.gitbook.io/formosanbank)

- **Usage / SDK:**
- Since the card says `sdk: static` this suggests you are using static hosting of docs or a simple web UI. You can embed links to the repo, the docs, usage instructions etc.

- **Short description:**
- Building open-source infrastructure and corpora for Taiwan’s indigenous Formosan languages, enabling machine translation, ASR and summarization efforts in extremely low-resource settings.

  ---

- ### Getting started
- 1. Clone the repository:
- ```bash
- git clone https://github.com/FormosanBank/FormosanBank
- cd FormosanBank
- pip install -r requirements.txt
  ---
+ title: README
+ emoji: 🌖
+ colorFrom: pink
+ colorTo: green
+ sdk: static
+ pinned: false
  ---

+ # FormosanBank

+ **Short description:**
+ FormosanBank is a large-scale, machine-readable corpus and tooling ecosystem for Taiwan’s Indigenous Formosan languages—supporting research, education, and revitalization across 16 official languages with multimodal text–audio resources.

+ ---

+ ## Overview

+ FormosanBank curates standardized, machine-actionable corpora for the Indigenous **Formosan** languages of Taiwan (part of the Austronesian family). The project aggregates, cleans, and structures multilingual text and audio into a consistent XML schema, enabling downstream tasks such as ASR/forced alignment, translation, lexicon building, and pedagogical content creation.
+ - **Scale:** 8M+ tokens, 730+ hours of audio (across languages and corpora).
+ - **Structure:** Language-specific corpora delivered in a unified **FormosanBank XML** format with metadata, speaker/source info, and licensing notes where applicable.
+ - **Tooling:** Quality-control (QC) utilities for XML validation, orthography checks/extraction, token counting, and cleaning pipelines.

  ---

+ ## Quick links
+
+ - 📖 **Documentation / Guidebook**: https://ai4commsci.gitbook.io/formosanbank
+ - 🗂️ **Repository (code, corpora, QC tools)**: https://github.com/FormosanBank/FormosanBank
+ - 🏛️ **Hugging Face (org home)**: search for the “FormosanBank” organization on Hugging Face to browse datasets & Spaces.
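
The new README's "Tooling" bullet mentions XML validation and token counting without showing what such a check looks like. The following is a minimal sketch of both ideas, not the project's actual QC code: the element names `TEXT`, `S`, and `FORM`, the sample content, and the helper names `is_well_formed` and `count_tokens` are all assumptions made for illustration. It uses only Python's standard `xml.etree.ElementTree` module, and "validation" here means a well-formedness check only, not validation against a schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the TEXT / S / FORM element names and the placeholder
# tokens are assumptions for illustration, not the documented schema.
SAMPLE_XML = """<TEXT>
  <S id="s1"><FORM>aaa bbb</FORM></S>
  <S id="s2"><FORM>ccc ddd eee</FORM></S>
</TEXT>"""


def is_well_formed(xml_text: str) -> bool:
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def count_tokens(xml_text: str, form_tag: str = "FORM") -> int:
    """Count whitespace-separated tokens in the text of every form_tag element."""
    root = ET.fromstring(xml_text)
    total = 0
    for form in root.iter(form_tag):
        # form.text is None for an empty element, so fall back to "".
        total += len((form.text or "").split())
    return total


if __name__ == "__main__":
    print(is_well_formed(SAMPLE_XML))  # True
    print(count_tokens(SAMPLE_XML))    # 5
```

Whitespace splitting is a deliberately naive tokenizer; a real pipeline would likely need orthography-aware handling of punctuation and apostrophes, which this sketch does not attempt.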