Space: Inferless/Open-Source-TTS-Gallary

rbgo committed on May 31

Commit 21a3273 (verified) · 1 parent: 7743176

Update app.py

Files changed (1)

app.py +33 -35

@@ -208,10 +208,9 @@ def create_interface():
         gr.HTML("""
         <div id="intro-section">
             <h3>🔬 Our Exciting Quest</h3>
-            <p>We're on a thrilling journey to help developers discover the perfect TTS models for their innovative audio projects!
-            We've put these 12 cutting-edge models using the test prompts.</p>
-            <p><strong>Featured TTS Engines:</strong></p>
             <ul>
                 <li>🎭 <strong>Dia-1.6B</strong> - Expressive conversational voice</li>
                 <li>🎪 <strong>Kokoro-82M</strong> - Lightweight powerhouse</li>
@@ -226,20 +225,19 @@ def create_interface():
             <ol>
                 <li><strong>Outstanding Speech Quality</strong><br>
                     Several models—namely <strong>Kokoro-82M</strong>, <strong>csm-1b</strong>, <strong>Spark-TTS-0.5B</strong>,
-                    <strong>Orpheus-3b-0.1-ft</strong>, <strong>F5-TTS</strong>, and <strong>Llasa-3B</strong>—delivered exceptionally
                     natural, clear, and realistic synthesized speech. Among these, <strong>csm-1b</strong> and <strong>F5-TTS</strong>
-                    stood out as the most well-rounded: they combined top-tier naturalness and intelligibility with solid controllability.
                 </li>
                 <li><strong>Superior Controllability</strong><br>
-                    <strong>Zonos-v0.1-transformer</strong> emerged as the leader in fine-grained control: it offers detailed
                     adjustments for prosody, emotion, and audio quality, making it ideal for use cases that demand precise
                     voice modulation.
                 </li>
                 <li><strong>Performance vs. Footprint Trade-off</strong><br>
-                    Smaller models (e.g., <strong>Kokoro-82M</strong> at 82 million parameters) can still achieve “Good” or
-                    “Excellent” ratings in many scenarios, especially when efficient inference or low VRAM usage is critical.
                     Larger models (1 billion–3 billion+ parameters) generally offer more versatility—handling multilingual
-                    synthesis, zero-shot voice cloning, and multi-speaker generation—but require heavier compute resources.
                 </li>
                 <li><strong>Special Notes on Multilingual & Cloning Capabilities</strong><br>
                     <strong>Spark-TTS-0.5B</strong> and <strong>XTTS-v2</strong> excel at cross-lingual and zero-shot voice
@@ -368,39 +366,39 @@ def create_interface():
         )
         # Methodology Section
-        with gr.Accordion("📋 Detailed Evaluation Methodology", open=False):
-            gr.Markdown("""
-            ### Test Prompt
-            `Hello, this is a universal test sentence. Can the advanced Zylophonic system clearly articulate this and express a hint of excitement? The quick brown fox certainly hopes so!`
-            ### Model Evaluation Criteria:
-            🎭 **Naturalness (Human-like Quality)**
-            - Prosody and rhythm patterns
-            - Emotional expression capability
-            - Voice texture and warmth
-            - Natural breathing and pauses
-            🗣️ **Intelligibility (Clarity & Accuracy)**
-            - Word pronunciation precision
-            - Consonant and vowel clarity
-            - Sentence comprehensibility
-            - Technical term handling
-            🎛️ **Controllability (Flexibility)**
-            - Parameter responsiveness
-            - Tone modification capability
-            - Speed and pitch control
-            - Customization potential
-            ### Key Insights:
-            - Smaller models (82M-500M) can excel in specific scenarios
-            - Larger models (1B-3B+) offer more versatility but require more resources
-            - Architecture matters as much as parameter count
-            - Training data quality significantly impacts output quality
-            """)
         # Footer
         # gr.HTML("""

         gr.HTML("""
         <div id="intro-section">
             <h3>🔬 Our Exciting Quest</h3>
+            <p>We’re on a mission to help developers quickly find and compare the best open-source TTS models for their audio projects. In this gallery, you’ll find 12 state-of-the-art TTS models, each evaluated using a consistent test prompt to assess their synthesized speech.</p>
+            <p><strong>Featured TTS Models:</strong></p>
             <ul>
                 <li>🎭 <strong>Dia-1.6B</strong> - Expressive conversational voice</li>
                 <li>🎪 <strong>Kokoro-82M</strong> - Lightweight powerhouse</li>
             <ol>
                 <li><strong>Outstanding Speech Quality</strong><br>
                     Several models—namely <strong>Kokoro-82M</strong>, <strong>csm-1b</strong>, <strong>Spark-TTS-0.5B</strong>,
+                    <strong>Orpheus-3b-0.1-ft</strong>, <strong>F5-TTS</strong>, and <strong>Llasa-3B</strong> delivered exceptionally
                     natural, clear, and realistic synthesized speech. Among these, <strong>csm-1b</strong> and <strong>F5-TTS</strong>
+                    stood out as the most well-rounded model as they combined good synthesized speech with solid controllability.
                 </li>
                 <li><strong>Superior Controllability</strong><br>
+                    <strong>Zonos-v0.1-transformer</strong> emerged as the best in fine-grained control: it offers detailed
                     adjustments for prosody, emotion, and audio quality, making it ideal for use cases that demand precise
                     voice modulation.
                 </li>
                 <li><strong>Performance vs. Footprint Trade-off</strong><br>
+                    Smaller models (e.g., <strong>Kokoro-82M</strong> at 82 million parameters) can still excel in many scenarios, especially when efficient inference or low VRAM usage is critical.
                     Larger models (1 billion–3 billion+ parameters) generally offer more versatility—handling multilingual
+                    synthesis, zero-shot voice cloning, and multi-speaker generation but require heavier compute resources.
                 </li>
                 <li><strong>Special Notes on Multilingual & Cloning Capabilities</strong><br>
                     <strong>Spark-TTS-0.5B</strong> and <strong>XTTS-v2</strong> excel at cross-lingual and zero-shot voice
         )
         # Methodology Section
+        # with gr.Accordion("📋 Detailed Evaluation Methodology", open=False):
+        #     gr.Markdown("""
+        #     ### Test Prompt
+        #     `Hello, this is a universal test sentence. Can the advanced Zylophonic system clearly articulate this and express a hint of excitement? The quick brown fox certainly hopes so!`
+        #     ### Model Evaluation Criteria:
+        #     🎭 **Naturalness (Human-like Quality)**
+        #     - Prosody and rhythm patterns
+        #     - Emotional expression capability
+        #     - Voice texture and warmth
+        #     - Natural breathing and pauses
+        #     🗣️ **Intelligibility (Clarity & Accuracy)**
+        #     - Word pronunciation precision
+        #     - Consonant and vowel clarity
+        #     - Sentence comprehensibility
+        #     - Technical term handling
+        #     🎛️ **Controllability (Flexibility)**
+        #     - Parameter responsiveness
+        #     - Tone modification capability
+        #     - Speed and pitch control
+        #     - Customization potential
+        #     ### Key Insights:
+        #     - Smaller models (82M-500M) can excel in specific scenarios
+        #     - Larger models (1B-3B+) offer more versatility but require more resources
+        #     - Architecture matters as much as parameter count
+        #     - Training data quality significantly impacts output quality
+        #     """)
         # Footer
         # gr.HTML("""