update readme
README.md (CHANGED)

@@ -4,106 +4,109 @@ emoji: 🚀
colorFrom: green
colorTo: yellow
sdk: docker
app_port: 7860
tags:
- fastapi
- web-app
pinned: false
short_description: 'VoxSum Studio: Transform Audio into Insightful Summaries'
license: apache-2.0
---

# VoxSum Studio

**VoxSum Studio** is a powerful web application built for Hugging Face Spaces, designed to transform audio into insightful summaries. This tool leverages advanced Automatic Speech Recognition (ASR) and Large Language Models (LLMs) to transcribe and summarize audio from podcasts, YouTube videos, or uploaded files. With an interactive transcript player and customizable settings, VoxSum Studio makes it easy to extract key insights from audio content in real time.

The application features a modern web interface built with HTML, CSS, and JavaScript, powered by a FastAPI backend for robust API handling.
## Features

- **Podcast Search & Download**: Search for podcast series, browse episodes, and download audio directly from the app.
- **YouTube Audio Fetching**: Extract audio from YouTube videos by providing a URL.
- **Audio Upload**: Upload your own audio files (MP3, WAV) for transcription and summarization.
- **Interactive Transcript Player**: View real-time transcripts synced with audio playback, with clickable timestamps for easy navigation and auto-scrolling highlights.
- **Customizable Summarization**: Choose from multiple LLMs and provide custom prompts to generate tailored summaries.
- **Voice Activity Detection (VAD)**: Adjust the VAD threshold to optimize transcription accuracy.
- **Web Interface**: A user-friendly interface with settings for model selection and real-time status updates.
## Getting Started

### Usage

1. **Launch the application**: Open the application via the URL provided by the Hugging Face Space. The interface is designed to be intuitive and easy to operate.
2. **Select an audio source**: Search for podcast series, browse episodes, and download audio; upload an MP3 or WAV file; or enter a YouTube video URL to extract its audio.
3. **Transcribe**: Click the "Transcribe Audio" button to start generating the transcript. Text appears in real time, and once transcription completes, an interactive player is generated automatically for viewing the transcript in sync with the audio.
4. **Generate a summary**: Click the "Generate Summary" button to create a summary from the transcript. Choose an appropriate LLM model and enter a custom prompt to tailor the summary to your needs.
5. **Interact with the results**: In the interactive player, click any segment of the transcript to jump to the corresponding point in the audio and quickly locate key content. The final summary appears below the transcript, clearly presenting the core information from the audio.
## Configuration

- **Settings**:
  - **VAD Threshold**: Adjust the threshold (0.1 to 0.9) to fine-tune voice activity detection.
  - **ASR Model**: Select from the available models for transcription.
  - **LLM Model**: Choose an LLM for summarization from the available options.
  - **Custom Prompt**: Input a custom prompt to guide the summarization process.
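To make the VAD threshold concrete, the sketch below (illustrative only, not VoxSum Studio's actual implementation) shows how a threshold applied to per-frame speech probabilities decides which frames count as speech: raising the threshold keeps only high-confidence frames, which can suppress noise at the risk of dropping quiet speech.

```python
def filter_speech_frames(probs, threshold=0.5):
    """Return indices of frames whose speech probability meets the threshold."""
    return [i for i, p in enumerate(probs) if p >= threshold]

# Hypothetical per-frame speech probabilities from a VAD model
probs = [0.05, 0.2, 0.75, 0.9, 0.4, 0.85]
print(filter_speech_frames(probs, threshold=0.5))  # [2, 3, 5]
print(filter_speech_frames(probs, threshold=0.8))  # stricter: [3, 5]
```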
## Project Structure

```
voxsum-studio/
├── Dockerfile                    # Docker configuration for building and running the app
├── README.md                     # Project documentation and setup instructions
├── requirements.txt              # Python dependencies for the project
├── src/                          # Source code directory
│   ├── __init__.py               # Makes src a Python package
│   ├── asr.py                    # Logic for Automatic Speech Recognition (ASR) transcription
│   ├── diarization.py            # Speaker diarization functionality
│   ├── editing_sync.py           # Audio editing and synchronization
│   ├── export_utils.py           # Utilities for exporting transcripts and summaries
│   ├── improved_diarization.py   # Enhanced diarization features
│   ├── podcast.py                # Podcast search, episode fetching, and audio downloading
│   ├── streamlit_app.py          # Legacy Streamlit application (for reference)
│   ├── summarization.py          # Logic for generating summaries using LLMs
│   ├── utils.py                  # Utility functions and model configurations
│   ├── server/                   # FastAPI backend
│   │   ├── __init__.py
│   │   ├── main.py               # Main FastAPI application
│   │   ├── core/                 # Core configuration
│   │   ├── models/               # Pydantic models for the API
│   │   ├── routers/              # API routes
│   │   └── services/             # Business logic services
│   ├── frontend/                 # Static frontend files
│   └── static/                   # Static assets
├── frontend/                     # Frontend source files
│   ├── app.js                    # Main JavaScript application
│   ├── index.html                # Main HTML page
│   └── styles.css                # CSS styles
└── static/                       # Static assets directory
    └── audio/                    # Temporary storage for audio files (not tracked in git)
```
## Notes

- **Architecture**: The application uses a FastAPI backend for API endpoints and a vanilla JavaScript frontend for the user interface.
- **Temporary Storage**: Uploaded and downloaded audio files are stored in the `/tmp` directory (mapped to `static/audio/`) for Hugging Face Spaces compatibility.
- **Audio Formats**: Supports MP3 and WAV files for uploads and downloads.
- **Error Handling**: The app provides real-time status updates and error messages for transcription or summarization failures.
- **Interactive Player**: The player is implemented as a single HTML component with JavaScript for seamless audio-transcript synchronization.
- **Docker Support**: The `Dockerfile` ensures consistent environments on Hugging Face Spaces.
- The `__pycache__` directory (auto-generated) is excluded from version control.
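The temporary-storage note above can be sketched as follows; the directory layout and `save_upload` helper are hypothetical illustrations of the `/tmp` → `static/audio/` mapping, not the app's actual code.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical location mirroring the `/tmp` -> `static/audio/` mapping
# described in the Notes; the real app's paths may differ.
AUDIO_DIR = Path(tempfile.gettempdir()) / "static" / "audio"

def save_upload(src: Path, name: str) -> Path:
    """Copy an uploaded audio file into the temporary audio directory."""
    AUDIO_DIR.mkdir(parents=True, exist_ok=True)
    dest = AUDIO_DIR / name
    shutil.copyfile(src, dest)
    return dest
```

Because Spaces containers have an ephemeral filesystem, anything written here is lost on restart, which is why the directory is not tracked in git.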
## Limitations

- Transcription and summarization quality depend on the selected models and audio clarity.
- Large audio files may take longer to process, especially in a resource-constrained environment like Hugging Face Spaces.
- YouTube audio fetching requires a valid URL and may be subject to rate limits or availability.

## Contributing

Contributions are welcome! To contribute:

1. Fork the repository on Hugging Face.
2. Create a new branch for your feature or bug fix.
3. Submit a pull request with a clear description of your changes.

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Acknowledgments

- Built with [FastAPI](https://fastapi.tiangolo.com/) for the backend API and vanilla JavaScript for the frontend.
- Powered by Hugging Face Spaces for hosting and deployment.
- Inspired by advancements in ASR and LLM technologies for audio processing.