Jakir057 committed on
Commit 32b485a · verified · 1 Parent(s): 3474e4d

Update README.md

Files changed (1): README.md +74 -2
README.md CHANGED
@@ -18,12 +18,13 @@ BanglaTalk: Towards Real-Time Speech Assistance for Bengali Regional Dialects </
 📝 <a href="https://arxiv.org/abs/2510.06188"><b>Paper</b></a>, 🖥️ <a href="https://github.com/Jak57/BanglaTalk"><b>Github</b></a>
 </div>
 
+ **BRDialect** is an ASR system trained on ten regional dialects of Bangladesh using the <a href="https://www.kaggle.com/competitions/ben10">Ben10</a> dataset from Bengali.AI.
 <!-- APT-Eval is the first and largest dataset to evaluate the AI-text detectors behavior for AI-polished texts.
 It contains almost **15K** text samples, polished by 5 different LLMs, for 6 different domains, with 2 major polishing types. All of these samples initially came from purely human written texts.
 It not only includes AI-polished texts, but also includes fine-grained involvement of AI/LLM.
- It is designed to push the boundary of AI-text detectors, for the scenarios where human uses LLM to minimally polish their own written texts.
+ It is designed to push the boundary of AI-text detectors, for the scenarios where human uses LLM to minimally polish their own written texts. -->
 
- The overview of our dataset is given below --
+ <!-- The overview of our dataset is given below --
 
 | **Polish Type** | **GPT-4o** | **Llama3.1-70B** | **Llama3-8B** | **Llama2-7B** | **DeepSeek-V3** | **Total** |
 |-----------------|------------|------------------|---------------|---------------|-----------------|-----------|
@@ -32,6 +33,77 @@ The overview of our dataset is given below --
 | **Percentage-based** | 2072 | 2048 | 1977 | 1282 | 2078 | 7379 |
 | **Total** | 3224 | 3133 | 3102 | 2026 | 3219 | **15004** | -->
 
+ ## Load the model
+ 
+ **Prerequisite**<br>
+ ```
+ !pip install -U transformers
+ !pip install https://github.com/kpu/kenlm/archive/master.zip
+ !pip install pyctcdecode
+ ```
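+ 
+ Installing `kenlm` from the GitHub archive builds it from source, so a quick import check can optionally confirm the environment is ready:
+ ```
+ # Verify that the three dependencies import cleanly.
+ import kenlm
+ import pyctcdecode
+ import transformers
+ print(transformers.__version__)
+ ```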
+ 
+ **Log in to HuggingFace**<br>
+ ```
+ from huggingface_hub import login
+ login("TOKEN")
+ ```
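+ 
+ In a notebook, `notebook_login()` from `huggingface_hub` can be used instead, which prompts for the token rather than hard-coding it:
+ ```
+ # Alternative: interactive login prompt (avoids putting the token in code).
+ from huggingface_hub import notebook_login
+ notebook_login()
+ ```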
+ 
+ **Load base model and BRDialect**<br>
+ ```
+ ## BRDialect
+ from huggingface_hub import hf_hub_download
+ 
+ # Download the 5-gram KenLM language model and the fine-tuned wav2vec2 weights.
+ kenlm_model_path = hf_hub_download(repo_id="Jakir057/BRDialect", filename="BRDialect/5gram_kenlm.arpa")
+ state_dict_path = hf_hub_download(repo_id="Jakir057/BRDialect", filename="BRDialect/wav2vec2_bangla_regional_dialect.pth")
+ ```
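+ 
+ Both calls download the files into the local Hugging Face cache and return their filesystem paths, which can be printed to confirm the download:
+ ```
+ # The returned values are plain local paths into the HF cache.
+ print(kenlm_model_path)
+ print(state_dict_path)
+ ```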
+ ```
+ from transformers import AutoProcessor, AutoModelForCTC, Wav2Vec2ProcessorWithLM
+ import torch
+ import numpy as np
+ import pyctcdecode
+ import librosa
+ 
+ # Load the Bengali wav2vec2 base model, then overwrite its weights with the
+ # fine-tuned BRDialect checkpoint (map_location keeps the load CPU-safe).
+ base_model_id = "ai4bharat/indicwav2vec_v1_bengali"
+ processor = AutoProcessor.from_pretrained(base_model_id)
+ model = AutoModelForCTC.from_pretrained(base_model_id)
+ model.load_state_dict(torch.load(state_dict_path, map_location="cpu")["model"])
+ 
+ # Build a CTC beam-search decoder over the tokenizer vocabulary (sorted by
+ # token id), rescored by the 5-gram KenLM language model.
+ vocab_dict = processor.tokenizer.get_vocab()
+ sorted_vocab_dict = {k: v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}
+ decoder = pyctcdecode.build_ctcdecoder(
+     list(sorted_vocab_dict.keys()),
+     str(kenlm_model_path)
+ )
+ processor_with_lm = Wav2Vec2ProcessorWithLM(
+     feature_extractor=processor.feature_extractor,
+     tokenizer=processor.tokenizer,
+     decoder=decoder
+ )
+ model.freeze_feature_encoder()
+ model.eval()
+ ```
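+ 
+ The snippet above keeps everything on the CPU. If a CUDA device is available, the model can optionally be moved to it (with the `.to("cpu")` call in the transcription step changed accordingly), e.g.:
+ ```
+ # Optional: run inference on GPU when available.
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model.to(device)
+ ```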
+ 
+ ## Transcription Generation
+ ```
+ # Load the audio as a 16 kHz mono waveform (the model's expected input format).
+ sampling_rate = 16000
+ path = "AUDIO_PATH"
+ frame, sr = librosa.load(path, sr=sampling_rate, mono=True)
+ 
+ inputs = processor(
+     frame,
+     sampling_rate=sampling_rate,
+     return_tensors="pt",
+     padding=False
+ )
+ 
+ # Forward pass: CTC logits of shape (batch, time, vocab).
+ with torch.no_grad():
+     logits = model(inputs.input_values.to("cpu")).logits
+ 
+ # Beam-search decode the logits with KenLM rescoring.
+ np_logits = logits.squeeze(0).cpu().numpy()
+ result = processor_with_lm.decode(np_logits, beam_width=256)
+ text = result.text
+ print(f"Transcription={text}")
+ ```
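+ 
+ As a quick sanity check, the same logits can also be decoded greedily without the KenLM language model; the LM-rescored output above is normally more accurate on dialectal speech:
+ ```
+ # Greedy (argmax) CTC decoding, bypassing the language model.
+ pred_ids = torch.argmax(logits, dim=-1)
+ greedy_text = processor.batch_decode(pred_ids)[0]
+ print(f"Greedy transcription={greedy_text}")
+ ```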
 
 <!-- ## Load the dataset