KalanaPabasara committed
Commit f28d091 · 1 Parent(s): 86328d1

Fix word_candidates NameError in mBart mode; remove IndoNLP dataset and mappings.py; update README

IndoNLP-2025-Shared-Task/Images/sample.png DELETED
Binary file (18.2 kB)
 
IndoNLP-2025-Shared-Task/Readme.md DELETED
@@ -1,53 +0,0 @@
- # Romanized Indo-Aryan Language Reverse Transliterator
-
- ## Overview
-
- **Typing Romanized Indo-Aryan languages** (native languages written in the English alphabet) with ad hoc transliterations, with or without vowels, and getting accurate native-script output is often challenging. Existing keyboard systems frequently fail to transliterate accurately, resulting in a subpar user experience.
-
- This project introduces a **real-time reverse transliterator**, designed to convert Romanized Indo-Aryan language input into its corresponding native script. By improving the accuracy of transliteration, we aim to enhance the typing experience for users who prefer using Romanized alphabets for Indo-Aryan languages.
-
- ![Example Task](./Images/sample.png)
-
-
- ## Task Objective
-
- The primary goal of this project is to develop and evaluate a tool that:
- - Accurately converts Romanized Indo-Aryan language text to native script in real time.
- - Handles transliterations with or without vowels, and resolves the ambiguities that arise from such variations.
- - Provides a seamless typing experience for users by addressing the inaccuracies present in current keyboard systems.
-
-
-
- ## Important Dates
-
- - 1st Call for Registration: July 20, 2024
- - 2nd Call for Registration: July 31, 2024
- - Registration Closing: August 31, 2024
- - Initial Briefing: September 2, 2024
- - Test Data Released: October 11, 2024
- - Shared Task Completion: November 15, 2024
- - Paper Submission Deadline: November 30, 2024
-
- ## Language Test Sets
-
- This test dataset has been created and augmented specifically for the IndoNLP Shared Task 2025. Please note that some data records are a combination of existing datasets that are publicly available for the respective languages. The augmentation process involved generating new data samples based on these existing resources while ensuring data diversity and relevance to the task.
-
- We want to give full credit to the authors and original creators of the datasets from which the data has been derived. Their contributions have been invaluable in the development of this dataset.
-
-
- | Language | Test Set 1: General Typing Patterns | Test Set 2: Ad Hoc Typing Patterns |
- |------------|--------------------------------------|------------------------------------|
- | Sinhala | 10000 | 5000 |
- | Bengali | 10000 | 5000 |
- | Gujarati | 5000 | 5000 |
- | Hindi | 5000 | 5000 |
- | Malayalam | 10000 | 5000 |
-
- ## Submission
-
- - The developed transliteration model/code.
- - A report detailing the approach, challenges faced, and solutions implemented.
- - A detailed discussion about the evaluation techniques/frameworks used.
- - A short paper (4 pages).
-
- For any inquiries or concerns regarding this dataset, please contact us at: indonlp2025@gmail.com
 
IndoNLP-2025-Shared-Task/Test Dataset/Bengali/Bengali Test Set 1.txt DELETED
The diff for this file is too large to render. See raw diff
 
IndoNLP-2025-Shared-Task/Test Dataset/Bengali/Bengali Test Set 2.txt DELETED
The diff for this file is too large to render. See raw diff
 
IndoNLP-2025-Shared-Task/Test Dataset/Gujarati/Gujarati Test 1.txt DELETED
The diff for this file is too large to render. See raw diff
 
IndoNLP-2025-Shared-Task/Test Dataset/Gujarati/Gujarati Test 2.txt DELETED
The diff for this file is too large to render. See raw diff
 
IndoNLP-2025-Shared-Task/Test Dataset/Hindi/Hindi Test Set 1.txt DELETED
The diff for this file is too large to render. See raw diff
 
IndoNLP-2025-Shared-Task/Test Dataset/Hindi/Hindi Test Set 2.txt DELETED
The diff for this file is too large to render. See raw diff
 
IndoNLP-2025-Shared-Task/Test Dataset/Malayalam/Malayalam Test Set 1.txt DELETED
The diff for this file is too large to render. See raw diff
 
IndoNLP-2025-Shared-Task/Test Dataset/Malayalam/Malayalam Test Set 2.txt DELETED
The diff for this file is too large to render. See raw diff
 
IndoNLP-2025-Shared-Task/Test Dataset/Sinhala/Sinhala Test set 1.txt DELETED
The diff for this file is too large to render. See raw diff
 
IndoNLP-2025-Shared-Task/Test Dataset/Sinhala/Sinhala Test set 2.txt DELETED
The diff for this file is too large to render. See raw diff
 
app.py CHANGED
@@ -101,6 +101,7 @@ if st.button("Transliterate", type="primary") and sentence.strip():
  transliterator = load_transliterator()
  result = transliterator.transliterate(sentence.strip())
  trace_logs: list[str] = []
+ word_candidates: list[tuple] = []
  else:
  with st.spinner("Transliterating…"):
  decoder = load_decoder()
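
The hunk above fixes a common branch-initialization bug: a name bound in only one branch of an `if`/`else` is read later regardless of which branch ran. The sketch below reproduces that pattern in isolation. The function names, the `mode` strings, and the sample candidate pair are hypothetical and are not taken from `app.py`; the sketch only assumes, as the commit message suggests, that the transliterator branch is the one that previously left `word_candidates` unbound.

```python
def render_buggy(mode: str) -> int:
    """Pre-fix shape: only the non-mBart branch binds word_candidates."""
    if mode == "mbart":  # hypothetical mode name
        trace_logs: list[str] = []
    else:
        trace_logs = ["decoded"]
        word_candidates = [("mama", "මම")]  # illustrative pair only
    # Inside a function the failure is UnboundLocalError, a subclass of
    # NameError; in a top-level script it is a plain NameError.
    return len(word_candidates)


def render_fixed(mode: str) -> int:
    """Post-fix shape: both branches bind word_candidates."""
    if mode == "mbart":
        trace_logs: list[str] = []
        word_candidates: list[tuple] = []  # the one-line fix
    else:
        trace_logs = ["decoded"]
        word_candidates = [("mama", "මම")]
    return len(word_candidates)


if __name__ == "__main__":
    try:
        render_buggy("mbart")
    except NameError as exc:
        print(f"buggy path failed: {type(exc).__name__}")
    print(render_fixed("mbart"), render_fixed("decoder"))
```

Binding an empty list in the branch that produces no candidates keeps any later rendering code free of per-mode checks; an alternative would be to initialize the name once before the `if`.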
core/mappings.py DELETED
@@ -1,8 +0,0 @@
- """
- core/mappings.py — deprecated.
-
- All manual Singlish→Sinhala mappings have been removed.
- Correction pairs are in seq2seq/finetune_corrections.py and baked into
- the ByT5 model weights via targeted correction fine-tuning.
- Candidate generation is handled end-to-end by the ByT5 seq2seq model.
- """
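
One reason a ByT5 model can replace hand-written Singlish→Sinhala tables is that it operates on raw UTF-8 bytes, so Latin and Sinhala text share one small, fixed vocabulary. The sketch below imitates that id scheme with the standard library only. It assumes the usual ByT5 layout (ids 0 to 2 reserved for padding, end-of-sequence, and unknown, with each byte shifted by 3); the helper names are hypothetical, and this is not the project's tokenizer or model code.

```python
BYTE_OFFSET = 3  # assumption: ids 0, 1, 2 are <pad>, </s>, <unk>
EOS_ID = 1


def byt5_style_encode(text: str) -> list[int]:
    """Map text to byte-level ids and append the end-of-sequence id."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def byt5_style_decode(ids: list[int]) -> str:
    """Invert the mapping, skipping special ids outside the byte range."""
    raw = bytes(
        i - BYTE_OFFSET for i in ids if BYTE_OFFSET <= i < BYTE_OFFSET + 256
    )
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    for sample in ("mama", "මම"):
        ids = byt5_style_encode(sample)
        print(sample, len(ids), byt5_style_decode(ids) == sample)
```

Each Sinhala character occupies three UTF-8 bytes, so native-script sequences are roughly three times longer than their character count. That is the cost of needing no script-specific vocabulary.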