Spaces: jonloporto/ImageToSpeechTest


jonloporto committed on Jan 5

Commit 9f14f1c · verified · 1 Parent(s): ed8dbf8

Upload 3 files

Files changed (3)

README.md +28 -12
app.py +69 -0
requirements.txt +7 -0

README.md CHANGED

@@ -1,12 +1,28 @@
----
-title: ImageToSpeechTest
-emoji: 🦀
-colorFrom: blue
-colorTo: gray
-sdk: gradio
-sdk_version: 6.2.0
-app_file: app.py
-pinned: false
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+---
+title: Image to Voice
+emoji: 🎤
+colorFrom: blue
+colorTo: purple
+sdk: gradio
+sdk_version: 4.0.0
+app_file: app.py
+pinned: false
+---
+# Image to Voice Converter
+This Space converts images to text using Hugging Face's image-to-text pipeline, then converts the text to speech using Supertonic TTS.
+## How it works
+1. Upload an image
+2. An image-to-text model generates a text description of the image
+3. The text is converted to speech using a text-to-speech model
+4. Listen to the generated audio!
+## Technologies Used
+- **Hugging Face Transformers**: For image-to-text conversion
+- **Supertonic TTS**: For text-to-speech synthesis
+- **Gradio**: For the web interface
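
Review note: the "How it works" steps in the README describe a two-stage flow (image to text, then text to speech). The sketch below shows that flow in isolation, assuming each stage can be passed in as a plain callable. `fake_describe` and `fake_synthesize` are hypothetical stand-ins, not part of this commit or of any library, so the example runs without downloading models.

```python
from typing import Any, Callable, Optional, Tuple


def image_to_voice(
    image: Any,
    describe: Callable[[Any], str],
    synthesize: Callable[[str], Any],
    sample_rate: int,
) -> Tuple[Optional[Tuple[int, Any]], str]:
    """Run the two stages in order: image -> text -> audio."""
    if image is None:
        return None, "Please upload an image."
    text = describe(image).strip()
    if not text:
        # Nothing to speak, so skip the synthesis stage entirely.
        return None, "No text was produced for this image."
    return (sample_rate, synthesize(text)), text


# Hypothetical stand-ins for the real models (assumptions for illustration only).
def fake_describe(image: Any) -> str:
    return "a cat sitting on a couch"


def fake_synthesize(text: str) -> list:
    # One silent sample per character, just to produce some "audio".
    return [0.0] * len(text)


audio, text = image_to_voice(object(), fake_describe, fake_synthesize, 22050)
print(text, audio[0], len(audio[1]))
```

Passing the stages in as arguments keeps the control flow testable on its own. In the Space, the two callables would wrap the Transformers pipeline and the TTS object.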

app.py ADDED

	@@ -0,0 +1,69 @@

+# -*- coding: utf-8 -*-
+"""ImageToVoice Hugging Face Space
+Converts images to text using Hugging Face's image-to-text pipeline,
+then converts the text to speech using Supertonic TTS.
+"""
+import gradio as gr
+from supertonic import TTS
+from transformers import pipeline
+from PIL import Image
+import io
+# Initialize models (load once at startup)
+image_to_text = pipeline("image-to-text")
+tts = TTS(auto_download=True)
+style = tts.get_voice_style(voice_name="M5")
+def image_to_voice(image):
+    """Convert image to text, then text to speech."""
+    if image is None:
+        return None, "Please upload an image."
+    try:
+        # Convert image to text
+        result = image_to_text(image)
+        generated_text = result[0]['generated_text']
+        # Convert text to speech
+        wav, duration = tts.synthesize(generated_text, voice_style=style)
+        # Gradio's Audio component (type="numpy") expects a (sample_rate, audio_data) tuple.
+        # NOTE: 22050 Hz is an assumed value; confirm it against the sample rate
+        # the Supertonic model actually produces, or playback speed and pitch will be off.
+        sample_rate = 22050
+        return (sample_rate, wav), generated_text
+    except Exception as e:
+        return None, f"Error: {str(e)}"
+# Create Gradio interface
+with gr.Blocks(title="Image to Voice") as demo:
+    gr.Markdown("# Image to Voice Converter")
+    gr.Markdown("Upload an image to convert it to text, then hear it as speech!")
+    with gr.Row():
+        with gr.Column():
+            image_input = gr.Image(type="pil", label="Upload Image")
+            generate_btn = gr.Button("Generate Speech", variant="primary")
+        with gr.Column():
+            audio_output = gr.Audio(label="Generated Speech", type="numpy")
+            text_output = gr.Textbox(label="Extracted Text", lines=5)
+    generate_btn.click(
+        fn=image_to_voice,
+        inputs=image_input,
+        outputs=[audio_output, text_output]
+    )
+    gr.Examples(
+        examples=[],
+        inputs=image_input
+    )
+if __name__ == "__main__":
+    demo.launch()
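
Review note: `app.py` passes the synthesized array straight to `gr.Audio`. The diff does not show what shape or dtype `tts.synthesize` returns, so the helper below is a sketch under the assumption that the output may be a float array in [-1, 1], possibly with a leading batch axis such as `(1, N)`. `to_gradio_audio` is a hypothetical name, not part of Gradio or Supertonic.

```python
import numpy as np


def to_gradio_audio(wav, sample_rate):
    """Return a (sample_rate, int16 mono array) tuple for gr.Audio(type="numpy")."""
    audio = np.asarray(wav, dtype=np.float32)
    # Drop singleton axes such as a leading batch dimension, e.g. (1, N) -> (N,).
    audio = np.squeeze(audio)
    if audio.ndim != 1:
        raise ValueError(f"expected mono audio, got shape {audio.shape}")
    # Clip to [-1, 1], then scale to 16-bit PCM.
    audio = np.clip(audio, -1.0, 1.0)
    pcm = (audio * 32767.0).astype(np.int16)
    return int(sample_rate), pcm


# Example with a fake (1, 4) batch-shaped signal; 44100 is just an example rate.
sr, pcm = to_gradio_audio(np.array([[0.0, 0.5, -1.0, 2.0]]), 44100)
print(sr, pcm.shape, pcm.dtype)
```

In `image_to_voice`, the return statement could then become `return to_gradio_audio(wav, sample_rate), generated_text`, keeping the shape handling in one place.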

requirements.txt ADDED

	@@ -0,0 +1,7 @@

+gradio>=4.0.0
+transformers>=4.30.0
+supertonic
+pillow>=9.0.0
+torch
+torchaudio