Professor committed on

Commit 4ce06d7 · verified · 1 Parent(s): bc7a10f

Update README.md

Files changed (1): README.md (+65 −6)

README.md CHANGED
@@ -1,7 +1,9 @@
 ---
 base_model: YatharthS/MiraTTS
 tags:
-- text-generation-inference
 - transformers
 - unsloth
 - qwen2
@@ -9,15 +11,72 @@ tags:
 - sft
 license: apache-2.0
 language:
 - en
 ---

-# Uploaded model

 - **Developed by:** Professor
-- **License:** apache-2.0
-- **Finetuned from model :** YatharthS/MiraTTS

-This qwen2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth)

-[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

---
base_model: YatharthS/MiraTTS
tags:
- text-to-speech
- tts
- audio-generation
- transformers
- unsloth
- qwen2
- sft
license: apache-2.0
language:
- rw
- en
---

# MiraTTS Kinyarwanda (Phase 1 - Language Acquisition)

- **Developed by:** Professor
- **License:** Apache 2.0
- **Finetuned from model:** YatharthS/MiraTTS
- **Language:** Kinyarwanda (`rw`), English (`en`)

## Model Overview

This is a foundational Text-to-Speech (TTS) model for the Kinyarwanda language. It is built on the MiraTTS architecture, which uses a 0.5B-parameter Qwen2.5 LLM backbone, and was fine-tuned to map Kinyarwanda text to its correct phonetic and acoustic representations.

**Note:** This is a "Phase 1" checkpoint. It was trained on a combined dataset of high-fidelity human speech and synthetic speech to teach the model the core phonetic rules, prefixes, and rhythm of Kinyarwanda. It can generate intelligible Kinyarwanda speech but may exhibit occasional synthetic artifacts or hallucinated padding. A Phase 2 model, refined strictly on human data, is recommended for production use.

## Training Details

The model was trained using the `Unsloth` framework for optimized hardware utilization. Training was intentionally halted early (around epoch 10) to prevent the LLM backbone from memorizing the dataset and losing natural prosody.

* **Dataset Size:** 28,629 audio-text pairs
* **Effective Batch Size:** 256 (64 per device × 4 gradient accumulation steps)
* **Total Steps Trained:** 1,189
* **Starting Loss:** 10.84
* **Final Loss:** 5.76
* **Hardware:** Trained on a single NVIDIA GPU in `bfloat16` precision (where supported)
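
The reported schedule is internally consistent, which is worth sanity-checking when reproducing the run. A minimal sketch using only the numbers listed above:

```python
# Sanity-check the training schedule from the reported hyper-parameters.
dataset_size = 28_629          # audio-text pairs
per_device_batch = 64
grad_accum_steps = 4
total_steps = 1_189

effective_batch = per_device_batch * grad_accum_steps   # 256
steps_per_epoch = dataset_size / effective_batch        # ~111.8
epochs_trained = total_steps / steps_per_epoch          # ~10.6

print(effective_batch, round(steps_per_epoch, 1), round(epochs_trained, 1))
```

Roughly 10.6 epochs at 1,189 steps, matching the "halted around epoch 10" note.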
## How to Use (Inference)

Because this model uses the highly optimized LMDeploy backend for rapid audio generation, it requires a modern NVIDIA GPU (such as an L4 or A100) to run at full speed.

Below is the standard inference script for generating Kinyarwanda audio from a reference voice clip.

### 1. Installation

Ensure you install the optimized `MiraTTS` library and align your PyTorch audio dependencies:

```bash
pip install git+https://github.com/ysharma3501/MiraTTS.git
# Ensure torchaudio and torchvision match your active PyTorch version
```

### 2. Python Inference Code

```python
from mira.model import MiraTTS
from IPython.display import Audio, display

print("Loading Kinyarwanda Phase 1 Model...")
# Initialize the model directly from the Hub
mira_tts = MiraTTS("Professor/MiraTTS-Kinyarwanda-Phase1")

# Provide a path to a real, high-quality audio file to use as the voice print
reference_audio_path = "/path/to/your/reference_audio.wav"

test_text = "Muraho neza! Uyu munsi turimo kugerageza porogaramu nshya y'ikinyarwanda."

# Extract voice context and synthesize
print("Synthesizing audio...")
context_tokens = mira_tts.encode_audio(reference_audio_path)
audio = mira_tts.generate(test_text, context_tokens)

# Play the audio (if running in a Jupyter/Colab notebook)
display(Audio(audio, rate=48000))
```
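
### 3. Saving the Audio

Outside a notebook, you can write the result to disk with Python's built-in `wave` module instead of playing it. This is a minimal sketch assuming the generated audio is mono float samples in [-1, 1] at 48 kHz (as the playback call above implies); `save_wav` is a hypothetical helper, not part of the MiraTTS API:

```python
import math
import wave

def save_wav(samples, path, rate=48_000):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file.

    Hypothetical helper for persisting generated audio; not part of MiraTTS.
    """
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 16-bit PCM
        f.setframerate(rate)
        pcm = b"".join(
            int(max(-1.0, min(1.0, s)) * 32767).to_bytes(2, "little", signed=True)
            for s in samples
        )
        f.writeframes(pcm)

# Demo with a synthetic 440 Hz tone standing in for real model output
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 48_000) for t in range(48_000)]
save_wav(tone, "output.wav")
```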
## Limitations

* **Hardware Constraints:** Requires a CUDA-enabled NVIDIA GPU. Running on older architectures (like the T4) requires bypassing the optimized pipeline and forcing float32 precision, which is significantly slower.
* **End-of-Sequence Hallucinations:** Because this is an LLM-based generative model, it may occasionally continue generating extra Kinyarwanda syllables after the input text is finished.
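
For the hallucinated-padding case specifically, one pragmatic workaround is to trim trailing low-energy samples from the waveform before saving it (this does not help with extra voiced syllables). A minimal pure-Python sketch; the threshold and window size are arbitrary choices, not values from the model:

```python
def trim_trailing_silence(samples, threshold=0.01, window=480):
    """Trim trailing near-silence (e.g. hallucinated padding) from a waveform.

    Scans backwards in `window`-sample chunks (480 samples = 10 ms at 48 kHz)
    and cuts everything after the last chunk whose peak exceeds `threshold`.
    """
    end = len(samples)
    while end > 0:
        chunk = samples[max(0, end - window):end]
        if max(abs(s) for s in chunk) > threshold:
            break
        end -= window
    return samples[:max(0, end)]

# Example: one second of "speech" followed by half a second of near-silence
speech = [0.3] * 48_000 + [0.001] * 24_000
trimmed = trim_trailing_silence(speech)
print(len(trimmed))  # 48000
```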
---

*This model was trained 2x faster with Unsloth.*

<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200" alt="Unsloth Made With Love"/>