Jakir057 committed · verified
Commit 3474e4d · Parent(s): 59d57f0

Update README.md

Files changed (1): README.md +73 -1
README.md CHANGED
@@ -8,4 +8,76 @@ metrics:
  base_model:
  - ai4bharat/indicwav2vec_v1_bengali
  pipeline_tag: automatic-speech-recognition
- ---
+ ---
+
+ <div align="center">
+ <h1>🚨 BRDialect 🚨
+
+ BanglaTalk: Towards Real-Time Speech Assistance for Bengali Regional Dialects</h1>
+ 📝 <a href="https://arxiv.org/abs/2510.06188"><b>Paper</b></a>, 🖥️ <a href="https://github.com/Jak57/BanglaTalk"><b>GitHub</b></a>
+ </div>
+
+ ## Citation
+
+ ```
+ @article{hasan2025banglatalk,
+   title={BanglaTalk: Towards Real-Time Speech Assistance for Bengali Regional Dialects},
+   author={Hasan, Jakir and Dipta, Shubhashis Roy},
+   journal={arXiv preprint arXiv:2510.06188},
+   year={2025}
+ }
+ ```
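
The card's metadata declares an `automatic-speech-recognition` pipeline with `metrics` in its front matter. ASR systems of this kind are conventionally scored by word error rate (WER), i.e. the word-level Levenshtein distance divided by the reference length. The following is a minimal, dependency-free sketch of that computation, not code from the BanglaTalk repository:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Rolling-array Levenshtein distance over word sequences.
    # d[j] holds the distance between ref[:i] and hyp[:j].
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i  # prev = old d[j-1] (diagonal cell)
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,         # deletion
                      d[j - 1] + 1,     # insertion
                      prev + (r != h))  # substitution (free if words match)
            prev, d[j] = d[j], cur
    return d[len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat"))  # → 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution out of three words
```

The same routine applies unchanged to Bengali text, since it operates on whitespace-separated tokens rather than characters.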