Update README.md
README.md
CHANGED
@@ -7,4 +7,85 @@ sdk: static
 pinned: false
 ---
 
-
+# FormosanBank
+
+FormosanBank is a large-scale, machine-readable corpus and tooling ecosystem for Taiwan’s Indigenous Formosan languages.
+
+We build and share open resources that support:
+
+- language documentation
+- linguistic research
+- education
+- language revitalization
+- speech and language technology development
+
+This Hugging Face organization is where we publish FormosanBank datasets and related resources for easier access, download, and reuse.
+
+## What you’ll find here
+
+- **Datasets** containing text, annotations, and audio-linked resources
+- **Corpus releases** organized for practical use on the Hugging Face Hub
+- **Resources for computational work**, including materials useful for ASR, MT, and other NLP workflows
+- **Documentation and usage guidance** connected to the broader FormosanBank project
+
+## About the project
+
+FormosanBank is designed as a centralized repository for data across the extant Formosan languages, with an emphasis on accessibility for researchers, educators, students, and community collaborators.
+
+The broader project includes:
+
+- digitized texts and transcriptions
+- dictionaries and reference materials
+- audio recordings
+- annotated corpora
+- structured metadata for search, retrieval, and downstream analysis
+
+## Start here
+
+- **Documentation / GitBook:** [FormosanBank GitBook](https://ai4commsci.gitbook.io/formosanbank)
+- **GitHub:** [FormosanBank GitHub repository](https://github.com/FormosanBank/FormosanBank)
+- **Hugging Face organization:** [FormosanBank on Hugging Face](https://huggingface.co/FormosanBank)
+
+## Using these resources
+
+Some corpora are distributed on Hugging Face in ways that make large audio collections easier to host and retrieve. The FormosanBank documentation includes guidance for downloading data by corpus or language, including workflows for larger audio collections.
+
+For technical usage details, see the Hugging Face section of the documentation:
+- [Hugging Face usage guide](https://ai4commsci.gitbook.io/formosanbank/the-bank-architecture/developers/huggingface)
+
+## Licensing and responsible use
+
+Licensing may vary by corpus or source material, so please check the license and citation requirements on each dataset and in the documentation before reuse.
+
+Important notes from the project documentation include:
+
+- some source materials may have their own citation or usage requirements
+- FormosanBank corpora include restrictions on **commercial AI use**
+- FormosanBank annotations and metadata are released under **CC-BY-4.0**
+
+Please review the full terms here:
+- [Terms of Use](https://ai4commsci.gitbook.io/formosanbank/additional-resources/terms-of-use)
+
+## Contributing
+
+We welcome collaboration with researchers, educators, and community members.
+
+If you would like to contribute data, discuss licensing, share corrections, or explore collaboration, please see:
+- [Contributing to FormosanBank](https://ai4commsci.gitbook.io/formosanbank/additional-resources/contributing-to-formosanbank)
+
+## Publications
+
+FormosanBank supports research on endangered and Indigenous language technology, including work in machine translation, ASR, OCR, and corpus development.
+
+A list of related publications is available here:
+- [Publications](https://ai4commsci.gitbook.io/formosanbank/additional-resources/publications)
+
+## Citation
+
+If you use FormosanBank in academic work, please cite:
+
+> Mohamed, W., Le Ferrand, É., Sung, L.-M., Prud'hommeaux, E., & Hartshorne, J. K. (2024). *FormosanBank*. Electronic Resource.
+
+## Acknowledgment
+
+FormosanBank is made possible through collaboration among researchers, contributors, and community partners working to support the documentation and revitalization of Taiwan’s Indigenous Formosan languages.
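
Reviewer note on the "Using these resources" section: the README mentions downloading data by corpus or language but defers the details to the usage guide. As a rough illustration only, here is a minimal sketch of a per-language download using `snapshot_download` from the `huggingface_hub` package. The helper names, the repository id passed in, and the idea that each language lives in its own top-level folder are all assumptions made for this example, not something the README or the diff confirms; the linked usage guide remains the authoritative reference.

```python
from typing import List, Sequence


def language_patterns(language: str, extensions: Sequence[str] = ()) -> List[str]:
    """Build glob patterns that select one language's files.

    ASSUMPTION (hypothetical layout): files sit under a top-level folder
    named after the language, e.g. "Amis/...". Check the real dataset
    layout before relying on this.
    """
    if not extensions:
        # No extension filter: take everything under the language folder.
        return [f"{language}/*"]
    # One pattern per requested extension, e.g. "Amis/*.xml".
    return [f"{language}/*{ext}" for ext in extensions]


def download_language(repo_id: str, language: str,
                      extensions: Sequence[str] = ()) -> str:
    """Download only the matching files and return the local snapshot path.

    `repo_id` is supplied by the caller; no real FormosanBank dataset id
    is assumed here.
    """
    # Imported lazily so the pattern helper works without the package.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns=language_patterns(language, extensions),
    )
```

A call such as `download_language("<org>/<dataset>", "Amis", [".xml"])` would then fetch only the matching files, which avoids pulling a large audio collection when only text is needed. Filtering with `allow_patterns` is one option among several; the project's own guide may recommend a different workflow.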