KalanaPabasara committed
Commit f28d091 · Parent(s): 86328d1
Fix word_candidates NameError in mBart mode; remove IndoNLP dataset and mappings.py; update README
- IndoNLP-2025-Shared-Task/Images/sample.png +0 -0
- IndoNLP-2025-Shared-Task/Readme.md +0 -53
- IndoNLP-2025-Shared-Task/Test Dataset/Bengali/Bengali Test Set 1.txt +0 -0
- IndoNLP-2025-Shared-Task/Test Dataset/Bengali/Bengali Test Set 2.txt +0 -0
- IndoNLP-2025-Shared-Task/Test Dataset/Gujarati/Gujarati Test 1.txt +0 -0
- IndoNLP-2025-Shared-Task/Test Dataset/Gujarati/Gujarati Test 2.txt +0 -0
- IndoNLP-2025-Shared-Task/Test Dataset/Hindi/Hindi Test Set 1.txt +0 -0
- IndoNLP-2025-Shared-Task/Test Dataset/Hindi/Hindi Test Set 2.txt +0 -0
- IndoNLP-2025-Shared-Task/Test Dataset/Malayalam/Malayalam Test Set 1.txt +0 -0
- IndoNLP-2025-Shared-Task/Test Dataset/Malayalam/Malayalam Test Set 2.txt +0 -0
- IndoNLP-2025-Shared-Task/Test Dataset/Sinhala/Sinhala Test set 1.txt +0 -0
- IndoNLP-2025-Shared-Task/Test Dataset/Sinhala/Sinhala Test set 2.txt +0 -0
- app.py +1 -0
- core/mappings.py +0 -8
IndoNLP-2025-Shared-Task/Images/sample.png
DELETED
Binary file (18.2 kB)
IndoNLP-2025-Shared-Task/Readme.md
DELETED
@@ -1,53 +0,0 @@
-# Romanized Indo-Aryan Language Reverse Transliterator
-
-## Overview
-
-**Typing Romanized Indo-Aryan languages** (native languages written using English alphabets) using adhoc transliterations, with or without vowels, and achieving accurate native script output is often challenging. Existing keyboard systems frequently fail to provide accurate transliteration, resulting in a subpar user experience.
-
-This project introduces a **real-time reverse transliterator**, designed to convert Romanized Indo-Aryan language input into its corresponding native script. By improving the accuracy of transliteration, we aim to enhance the typing experience for users who prefer using Romanized alphabets for Indo-Aryan languages.
-
-
-
-
-## Task Objective
-
-The primary goal of this project is to develop and evaluate a tool that:
-- Accurately converts Romanized Indo-Aryan language text to native script in real-time.
-- Handles transliterations with or without vowels, and resolves the ambiguities that arise from such variations.
-- Provides a seamless typing experience for users by addressing the inaccuracies present in current keyboard systems.
-
-
-
-## Important Dates
-
-- 1st Call for Registration :July 20, 2024
-- 2nd Call for Registration : July 31, 2024
-- Registration Closing : August 31, 2024
-- Intial Briefing :September 2, 2024
-- Test Data released: October 11, 2024 New
-- Shared Task Completion :November 15, 2024
-- Paper Submission Deadline :November 30, 2024
-
-## Language Test Sets
-
-This Test dataset has been created and augmented specifically for the INdoNLP Shared Task 2025. Please note that some data records are a combination of existing datasets that are publicly available for the respective languages. The augmentation process involved generating new data samples based on these existing resources while ensuring data diversity and relevance to the task.
-
-We want to give full credit to the authors and original creators of the datasets from which the data has been derived. Their contributions have been invaluable in the development of this dataset.
-
-
-| Language | Test Set 1: General Typing Patterns | Test Set 2: Adhoc Typing Patterns |
-|------------|--------------------------------------|------------------------------------|
-| Sinhala | 10000 | 5000 |
-| Bengali | 10000 | 5000 |
-| Gujarati | 5000 | 5000 |
-| Hindi | 5000 | 5000 |
-| Malayalam | 10000 | 5000 |
-
-## Submission
-
-- The developed transliteration model/code.
-- A report detailing the approach, challenges faced, and solutions implemented.
-- A detailed discussion about the evaluation techniques/frameworks used.
-- A short paper (4 pages)
-
-For any inquiries or concerns regarding this dataset, please contact us at: indonlp2025@gmail.com
IndoNLP-2025-Shared-Task/Test Dataset/Bengali/Bengali Test Set 1.txt
DELETED
The diff for this file is too large to render.
IndoNLP-2025-Shared-Task/Test Dataset/Bengali/Bengali Test Set 2.txt
DELETED
The diff for this file is too large to render.
IndoNLP-2025-Shared-Task/Test Dataset/Gujarati/Gujarati Test 1.txt
DELETED
The diff for this file is too large to render.
IndoNLP-2025-Shared-Task/Test Dataset/Gujarati/Gujarati Test 2.txt
DELETED
The diff for this file is too large to render.
IndoNLP-2025-Shared-Task/Test Dataset/Hindi/Hindi Test Set 1.txt
DELETED
The diff for this file is too large to render.
IndoNLP-2025-Shared-Task/Test Dataset/Hindi/Hindi Test Set 2.txt
DELETED
The diff for this file is too large to render.
IndoNLP-2025-Shared-Task/Test Dataset/Malayalam/Malayalam Test Set 1.txt
DELETED
The diff for this file is too large to render.
IndoNLP-2025-Shared-Task/Test Dataset/Malayalam/Malayalam Test Set 2.txt
DELETED
The diff for this file is too large to render.
IndoNLP-2025-Shared-Task/Test Dataset/Sinhala/Sinhala Test set 1.txt
DELETED
The diff for this file is too large to render.
IndoNLP-2025-Shared-Task/Test Dataset/Sinhala/Sinhala Test set 2.txt
DELETED
The diff for this file is too large to render.
app.py
CHANGED
@@ -101,6 +101,7 @@ if st.button("Transliterate", type="primary") and sentence.strip():
         transliterator = load_transliterator()
         result = transliterator.transliterate(sentence.strip())
         trace_logs: list[str] = []
+        word_candidates: list[tuple] = []
     else:
         with st.spinner("Transliterating…"):
             decoder = load_decoder()
core/mappings.py
DELETED
@@ -1,8 +0,0 @@
-"""
-core/mappings.py — deprecated.
-
-All manual Singlish→Sinhala mappings have been removed.
-Correction pairs are in seq2seq/finetune_corrections.py and baked into
-the ByT5 model weights via targeted correction fine-tuning.
-Candidate generation is handled end-to-end by the ByT5 seq2seq model.
-"""