update readme
README.md (CHANGED)

@@ -4,106 +4,109 @@ emoji: 🚀
colorFrom: green
colorTo: yellow
sdk: docker
app_port: 7860
tags:
- fastapi
- web-app
pinned: false
short_description: 'VoxSum Studio: Transform Audio into Insightful Summaries'
license: apache-2.0
---

# VoxSum Studio

**VoxSum Studio** is a powerful web application built for Hugging Face Spaces, designed to transform audio into insightful summaries. This tool leverages advanced Automatic Speech Recognition (ASR) and Large Language Models (LLMs) to transcribe and summarize audio from podcasts, YouTube videos, or uploaded files. With an interactive transcript player and customizable settings, VoxSum Studio makes it easy to extract key insights from audio content in real time.

The application features a modern web interface built with HTML, CSS, and JavaScript, powered by a FastAPI backend for robust API handling.
## Features

- **Podcast Search & Download**: Search for podcast series, browse episodes, and download audio directly from the app.
- **YouTube Audio Fetching**: Extract audio from YouTube videos by providing a URL.
- **Audio Upload**: Upload your own audio files (MP3, WAV) for transcription and summarization.
- **Interactive Transcript Player**: View real-time transcripts synced with audio playback, with clickable timestamps for easy navigation and auto-scrolling highlights.
- **Customizable Summarization**: Choose from multiple LLMs and provide custom prompts to generate tailored summaries.
- **Voice Activity Detection (VAD)**: Adjust the VAD threshold to optimize transcription accuracy.
- **Web Interface**: A user-friendly interface with settings for model selection and real-time status updates.
## Getting Started

### Usage

1. **Launch the application**: Open the application via the URL provided by the Hugging Face Space. The interface is designed to be intuitive and easy to operate.
2. **Select an audio source**: Search for podcast series, browse episodes, and download audio; upload an MP3 or WAV file; or enter a YouTube video URL to extract its audio.
3. **Transcribe**: Click the "Transcribe Audio" button to start generating the transcript. Text appears in real time, and once transcription completes, an interactive player is generated automatically for viewing the transcript in sync with the audio.
4. **Generate a summary**: Click the "Generate Summary" button to create a summary from the transcript. Choose an appropriate LLM model and enter a custom prompt to tailor the summary to your needs.
5. **Interact with the results**: In the interactive player, click any segment of the transcript to jump to the corresponding point in the audio and quickly locate key content. The final summary appears below the transcript, clearly presenting the core information from the audio.
## Configuration

- **Settings**:
  - **VAD Threshold**: Adjust the threshold (0.1 to 0.9) to fine-tune voice activity detection.
  - **ASR Model**: Select from the available models for transcription.
  - **LLM Model**: Choose an LLM for summarization from the available options.
  - **Custom Prompt**: Input a custom prompt to guide the summarization process.
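To make the VAD threshold concrete, the sketch below (illustrative only, not VoxSum Studio's actual implementation) shows how a threshold applied to per-frame speech probabilities decides which frames count as speech: raising the threshold keeps only high-confidence frames, which can suppress noise at the risk of dropping quiet speech.

```python
def filter_speech_frames(probs, threshold=0.5):
    """Return indices of frames whose speech probability meets the threshold."""
    return [i for i, p in enumerate(probs) if p >= threshold]

# Hypothetical per-frame speech probabilities from a VAD model
probs = [0.05, 0.2, 0.75, 0.9, 0.4, 0.85]
print(filter_speech_frames(probs, threshold=0.5))  # [2, 3, 5]
print(filter_speech_frames(probs, threshold=0.8))  # stricter: [3, 5]
```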
## Project Structure

```
voxsum-studio/
├── Dockerfile                    # Docker configuration for building and running the app
├── README.md                     # Project documentation and setup instructions
├── requirements.txt              # Python dependencies for the project
├── src/                          # Source code directory
│   ├── __init__.py               # Makes src a Python package
│   ├── asr.py                    # Logic for Automatic Speech Recognition (ASR) transcription
│   ├── diarization.py            # Speaker diarization functionality
│   ├── editing_sync.py           # Audio editing and synchronization
│   ├── export_utils.py           # Utilities for exporting transcripts and summaries
│   ├── improved_diarization.py   # Enhanced diarization features
│   ├── podcast.py                # Podcast search, episode fetching, and audio downloading
│   ├── streamlit_app.py          # Legacy Streamlit application (for reference)
│   ├── summarization.py          # Logic for generating summaries using LLMs
│   ├── utils.py                  # Utility functions and model configurations
│   ├── server/                   # FastAPI backend
│   │   ├── __init__.py
│   │   ├── main.py               # Main FastAPI application
│   │   ├── core/                 # Core configuration
│   │   ├── models/               # Pydantic models for the API
│   │   ├── routers/              # API routes
│   │   └── services/             # Business logic services
│   ├── frontend/                 # Static frontend files
│   └── static/                   # Static assets
├── frontend/                     # Frontend source files
│   ├── app.js                    # Main JavaScript application
│   ├── index.html                # Main HTML page
│   └── styles.css                # CSS styles
└── static/                       # Static assets directory
    └── audio/                    # Temporary storage for audio files (not tracked in git)
```
## Notes

- **Architecture**: The application uses a FastAPI backend for API endpoints and a vanilla JavaScript frontend for the user interface.
- **Temporary Storage**: Uploaded and downloaded audio files are stored in the `/tmp` directory (mapped to `static/audio/`) for Hugging Face Spaces compatibility.
- **Audio Formats**: Supports MP3 and WAV files for uploads and downloads.
- **Error Handling**: The app provides real-time status updates and error messages for transcription or summarization failures.
- **Interactive Player**: The player is implemented as a single HTML component with JavaScript for seamless audio-transcript synchronization.
- **Docker Support**: The `Dockerfile` ensures consistent environments on Hugging Face Spaces.
- The `__pycache__` directory (auto-generated) is excluded from version control.
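The temporary-storage note above can be sketched as follows; the directory layout and `save_upload` helper are hypothetical illustrations of the `/tmp` → `static/audio/` mapping, not the app's actual code.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical location mirroring the `/tmp` -> `static/audio/` mapping
# described in the Notes; the real app's paths may differ.
AUDIO_DIR = Path(tempfile.gettempdir()) / "static" / "audio"

def save_upload(src: Path, name: str) -> Path:
    """Copy an uploaded audio file into the temporary audio directory."""
    AUDIO_DIR.mkdir(parents=True, exist_ok=True)
    dest = AUDIO_DIR / name
    shutil.copyfile(src, dest)
    return dest
```

Because Spaces containers have an ephemeral filesystem, anything written here is lost on restart, which is why the directory is not tracked in git.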
## Limitations

- Transcription and summarization quality depend on the selected models and audio clarity.
- Large audio files may take longer to process, especially in a resource-constrained environment like Hugging Face Spaces.
- YouTube audio fetching requires a valid URL and may be subject to rate limits or availability.

## Contributing

Contributions are welcome! To contribute:

1. Fork the repository on Hugging Face.
2. Create a new branch for your feature or bug fix.
3. Submit a pull request with a clear description of your changes.

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Acknowledgments

- Built with [FastAPI](https://fastapi.tiangolo.com/) for the backend API and vanilla JavaScript for the frontend.
- Powered by Hugging Face Spaces for hosting and deployment.
- Inspired by advancements in ASR and LLM technologies for audio processing.