| --- |
| license: apache-2.0 |
| language: |
| - bn |
| metrics: |
| - wer |
| - cer |
| tags: |
| - seq2seq |
| - ipa |
| - bengali |
| - byt5 |
| widget: |
| - text: <Narail> আমি সে বাবুর মামু বাড়ি গিছিলাম। |
| example_title: Narail Text |
| - text: <Rangpur> এখন এই কুলো তার শেষ অই কুলো তার শেষ। |
| example_title: Rangpur Text |
| - text: <Chittagong> খয়দে সিআরের এইল্লা কি অবস্থা! |
| example_title: Chittagong Text |
| - text: <Kishoreganj> আটাইশ করছিলাম দের কানি ক্ষেত, ইবার মাইর কাইছি। |
| example_title: Kishoreganj Text |
| - text: <Narsingdi> তারা তো ওই খারাপ খেইলাই আসে না। |
| example_title: Narsingdi Text |
| - text: <Tangail> আর সব থেকে ফানি কথা হইতেছে দেখ? |
| example_title: Tangail Text |
| --- |
| |
|
|
| # Regional Bengali text to IPA transcription - ByT5-small |
|
|
|
|
| This is a fine-tuned version of [google/byt5-small](https://huggingface.co/google/byt5-small) for generating IPA transcriptions from regional Bengali text. |
| It was trained on the dataset of the Bengali.AI competition [“ভাষামূল: মুখের ভাষার খোঁজে“](https://www.kaggle.com/competitions/regipa/overview). |
|
|
| Model performance: |
| - **Word error rate (WER)**: 0.0124279344454407 |
| - **Character error rate (CER)**: 0.00427635805681347 |
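
For reference, both metrics are edit-distance ratios. The following is a minimal sketch, assuming the standard definitions: Levenshtein distance over words (WER) or characters (CER), divided by the length of the reference. The competition's exact scoring script is not reproduced here and may differ, and the function names `edit_distance`, `wer`, and `cer` are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under these definitions, a WER of about 0.012 corresponds to roughly one word-level edit per 80 reference words.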
|
|
|
|
| Supported district tokens (placed in angle brackets at the start of the input, e.g. `<Narail>`): |
| - Kishoreganj |
| - Narail |
| - Narsingdi |
| - Chittagong |
| - Rangpur |
| - Tangail |
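
Since the model expects the district token as a prefix, a small helper can build and validate inputs before they reach the tokenizer. This is a sketch only: `SUPPORTED_DISTRICTS` and `build_input` are hypothetical names, and it assumes the token must match one of the names above exactly (including capitalization), as in the widget examples.

```python
# District names copied from the list above; exact-match is an assumption.
SUPPORTED_DISTRICTS = (
    "Kishoreganj",
    "Narail",
    "Narsingdi",
    "Chittagong",
    "Rangpur",
    "Tangail",
)


def build_input(district, text):
    """Return the model input in the form '<district> <bengali_text>'."""
    if district not in SUPPORTED_DISTRICTS:
        raise ValueError(
            f"Unsupported district {district!r}; expected one of {SUPPORTED_DISTRICTS}"
        )
    return f"<{district}> {text.strip()}"


example = build_input("Narail", "আমি সে বাবুর মামু বাড়ি গিছিলাম।")
```

The resulting string can be passed to the tokenizer or pipeline in place of a hand-written input.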
|
|
| --- |
|
|
| ## Loading & using the model |
| ```python |
| # Load the tokenizer and model directly |
| from transformers import AutoTokenizer, AutoModelForSeq2SeqLM |
| |
| tokenizer = AutoTokenizer.from_pretrained("teamapocalypseml/ben2ipa-byt5small") |
| model = AutoModelForSeq2SeqLM.from_pretrained("teamapocalypseml/ben2ipa-byt5small") |
| |
| # The input text MUST have the format: <district> <bengali_text> |
| text = "<district> bengali_text_here" |
| text_ids = tokenizer(text, return_tensors="pt").input_ids |
| |
| # Generate the output token ids and decode them into the IPA string |
| output_ids = model.generate(text_ids, max_length=1024) |
| ipa = tokenizer.decode(output_ids[0], skip_special_tokens=True) |
| print(ipa) |
| ``` |
|
|
|
|
| ## Using the pipeline |
| ```python |
| # Use a pipeline as a high-level helper |
| import torch |
| from transformers import pipeline |
| |
| device = "cuda" if torch.cuda.is_available() else "cpu" |
| |
| pipe = pipeline("text2text-generation", model="teamapocalypseml/ben2ipa-byt5small", device=device) |
| |
| # Each entry of `texts` must have the format: <district> <contents> |
| texts = ["<district> bengali_text_here"] |
| batch_size = 8  # assumed value; tune to your hardware |
| |
| # Each result is a dict with a "generated_text" key |
| outputs = pipe(texts, max_length=1024, batch_size=batch_size) |
| ``` |
|
|
| ## Credits |
| Developed by [S M Jishanul Islam](https://github.com/S-M-J-I), [Sadia Ahmmed](https://huggingface.co/sadiaahmmed), and [Sahid Hossain Mustakim](https://huggingface.co/rhsm15). |