# HFStudio Technical Specifications

## Project Overview

HFStudio is a web-based text-to-speech (TTS) application that supports both local and API-based speech generation. It is inspired by ElevenLabs Studio but adds support for running models locally.

## Core Features

### 1. Text-to-Speech Engine

- **Input**: Multi-line text area for user input
- **Output**: Generated audio playback with download capability
- **Models**: Support for multiple TTS models (local and API-based)
- **Voice Selection**: Dropdown/list for available voices
- **Audio Controls**: Play, pause, download generated audio

### 2. Execution Modes

- **API Mode**: Connect to remote TTS services (HuggingFace, OpenAI, etc.)
- **Local Mode**: Run TTS models locally using downloaded models
- **Mode Toggle**: Clear UI toggle between API and Local execution
- **Local Setup Instructions**: Display the installation command when local mode is selected

### 3. Voice Configuration

- **Speed Control**: Slider (0.5x to 2.0x speed)
- **Stability**: Slider for voice consistency (when applicable)
- **Similarity**: Slider for voice matching (when applicable)
- **Style/Emotion**: Optional controls for voice style

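
The slider values above eventually reach the backend as the `parameters` object of a generation request. A minimal sketch of how they might be normalized server-side follows. The `VoiceParameters` class and the 0.0 to 1.0 ranges for stability and similarity are assumptions for illustration; only the 0.5x to 2.0x speed range comes from this spec.

```python
from dataclasses import dataclass

SPEED_MIN, SPEED_MAX = 0.5, 2.0  # slider range stated in the spec


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


@dataclass(frozen=True)
class VoiceParameters:
    """Hypothetical container for the voice sliders."""
    speed: float = 1.0
    stability: float = 0.5   # assumed 0.0-1.0 range, not stated in the spec
    similarity: float = 0.5  # assumed 0.0-1.0 range, not stated in the spec
    style: str = ""

    def normalized(self) -> "VoiceParameters":
        """Return a copy with every numeric field clamped to its range."""
        return VoiceParameters(
            speed=clamp(self.speed, SPEED_MIN, SPEED_MAX),
            stability=clamp(self.stability, 0.0, 1.0),
            similarity=clamp(self.similarity, 0.0, 1.0),
            style=self.style.strip(),
        )
```

Clamping rather than rejecting out-of-range values keeps a misbehaving client from failing a request outright; a stricter server could raise a validation error instead.
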
### 4. User Interface Layout

- **Left Sidebar**: Navigation and feature selection
  - Home/Text-to-Speech (default)
  - Settings
  - History (future feature)
- **Main Content Area**: Text input and controls
- **Right Panel**: Voice/model selection and parameters

## Technology Stack

### Frontend

- **Framework**: SvelteKit
- **Styling**: TailwindCSS
- **Components**:
  - Shadcn-svelte for UI components
  - Audio player: Native HTML5 or Wavesurfer.js
- **State Management**: Svelte stores
- **Build Tool**: Vite

### Backend (Python Package)

- **Framework**: FastAPI for API server
- **TTS Libraries**:
  - Transformers (HuggingFace models)
  - Coqui TTS
  - Optional: Piper, Bark
- **Audio Processing**: librosa, soundfile
- **CLI**: Click or Typer for command-line interface

### API Integration

- **HuggingFace Inference API**
- **OpenAI TTS API** (optional)
- **Custom model endpoints**

## Project Structure

```
hfstudio/
├── frontend/                 # Svelte frontend
│   ├── src/
│   │   ├── routes/
│   │   │   ├── +layout.svelte
│   │   │   ├── +page.svelte
│   │   │   └── api/
│   │   ├── lib/
│   │   │   ├── components/
│   │   │   │   ├── Sidebar.svelte
│   │   │   │   ├── TextInput.svelte
│   │   │   │   ├── VoiceSelector.svelte
│   │   │   │   ├── AudioPlayer.svelte
│   │   │   │   ├── ModeToggle.svelte
│   │   │   │   └── ParameterControls.svelte
│   │   │   ├── stores/
│   │   │   │   ├── app.js
│   │   │   │   └── audio.js
│   │   │   └── api/
│   │   │       └── client.js
│   │   └── app.html
│   ├── package.json
│   ├── vite.config.js
│   └── tailwind.config.js
│
├── backend/                  # Python backend
│   ├── hfstudio/
│   │   ├── __init__.py
│   │   ├── __main__.py
│   │   ├── server.py         # FastAPI app
│   │   ├── cli.py            # CLI interface
│   │   ├── models/
│   │   │   ├── __init__.py
│   │   │   ├── base.py
│   │   │   ├── local.py
│   │   │   └── api.py
│   │   ├── voices/
│   │   │   ├── __init__.py
│   │   │   └── manager.py
│   │   └── utils/
│   │       ├── __init__.py
│   │       └── audio.py
│   ├── requirements.txt
│   └── setup.py
│
├── README.md
└── docker-compose.yml        # Optional containerization
```

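
The `models/` package splits into `base.py`, `local.py`, and `api.py`, which suggests a shared interface with one implementation per execution mode. The sketch below shows one way that split could look. All class and function names (`TTSBackend`, `LocalBackend`, `APIBackend`, `get_backend`) are hypothetical, and the `synthesize` bodies are placeholders rather than real model calls.

```python
from abc import ABC, abstractmethod


class TTSBackend(ABC):
    """Hypothetical common interface, as models/base.py might define it."""

    mode: str  # "local" or "api", matching the `mode` field of the REST API

    @abstractmethod
    def synthesize(self, text: str, voice_id: str) -> bytes:
        """Return encoded audio for `text` spoken with `voice_id`."""


class LocalBackend(TTSBackend):
    mode = "local"

    def synthesize(self, text: str, voice_id: str) -> bytes:
        # Placeholder: a real implementation would run a downloaded model.
        return b""


class APIBackend(TTSBackend):
    mode = "api"

    def synthesize(self, text: str, voice_id: str) -> bytes:
        # Placeholder: a real implementation would call a remote service.
        return b""


# Registry keyed by mode, so the server can dispatch on the request's `mode`.
BACKENDS = {cls.mode: cls for cls in (LocalBackend, APIBackend)}


def get_backend(mode: str) -> TTSBackend:
    """Instantiate the backend for `mode`, rejecting unknown values."""
    try:
        return BACKENDS[mode]()
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None
```

Keeping the dispatch in a single registry means adding a third mode later only requires a new subclass and one registry entry.
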
## API Endpoints

### REST API

```
POST /api/tts/generate
Body: {
  text: string,
  voice_id: string,
  model_id: string,
  parameters: {
    speed: float,
    stability: float,
    similarity: float,
    style: string
  },
  mode: "api" | "local"
}
Response: {
  audio_url: string,
  duration: float,
  format: string
}

GET /api/voices
Response: {
  voices: [{
    id: string,
    name: string,
    preview_url: string,
    supported_models: string[]
  }]
}

GET /api/models
Response: {
  models: [{
    id: string,
    name: string,
    type: "local" | "api",
    status: "available" | "downloadable" | "api-only"
  }]
}

GET /api/status
Response: {
  mode: "api" | "local",
  local_available: boolean,
  api_configured: boolean
}
```

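
To make the `POST /api/tts/generate` body concrete, here is a dependency-free sketch of validating a decoded JSON payload. In the FastAPI server this would more likely be a Pydantic model; the stdlib dataclass is used here only to keep the example self-contained. `GenerateRequest`, `parse_generate_request`, and the choice to reuse the TextInput default of 5000 characters as a server-side limit are assumptions, not part of the spec.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

VALID_MODES = {"api", "local"}
MAX_TEXT_LENGTH = 5000  # assumption: mirrors the TextInput default maxLength
REQUIRED_FIELDS = ("text", "voice_id", "model_id", "mode")


@dataclass
class GenerateRequest:
    """Typed view of the POST /api/tts/generate body (hypothetical)."""
    text: str
    voice_id: str
    model_id: str
    mode: str
    parameters: Dict[str, Any] = field(default_factory=dict)


def parse_generate_request(body: Dict[str, Any]) -> GenerateRequest:
    """Validate a decoded JSON body and return a typed request."""
    missing = [key for key in REQUIRED_FIELDS if key not in body]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")

    text = body["text"]
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text must be a non-empty string")
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError(f"text exceeds {MAX_TEXT_LENGTH} characters")

    mode = body["mode"]
    if not isinstance(mode, str) or mode not in VALID_MODES:
        raise ValueError('mode must be "api" or "local"')

    return GenerateRequest(
        text=text,
        voice_id=str(body["voice_id"]),
        model_id=str(body["model_id"]),
        mode=mode,
        parameters=dict(body.get("parameters") or {}),
    )
```
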
## Component Specifications

### 1. ModeToggle Component

```svelte
Props:
- mode: "api" | "local"
- onModeChange: function

Features:
- Visual toggle switch
- Installation hint for local mode
- Status indicator (green/yellow/red)
```

### 2. TextInput Component

```svelte
Props:
- value: string
- maxLength: number (default: 5000)
- placeholder: string

Features:
- Character counter
- Auto-resize
- Clear button
```

### 3. VoiceSelector Component

```svelte
Props:
- voices: Voice[]
- selectedVoice: string
- onSelect: function

Features:
- Search/filter
- Voice preview
- Favorite voices
```

### 4. AudioPlayer Component

```svelte
Props:
- audioUrl: string
- duration: number

Features:
- Play/pause
- Progress bar
- Volume control
- Download button
- Waveform visualization (optional)
```

## Local Package (hfstudio)

### Installation

```bash
pip install hfstudio
```

### CLI Usage

```bash
# Start the server
hfstudio

# Start with custom port
hfstudio --port 8080

# Download models for offline use
hfstudio download-models

# List available models
hfstudio list-models
```

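
The command surface above (a default "serve" action, a `--port` option, and two subcommands) can be sketched with the standard library's `argparse`. The spec names Click or Typer for the real CLI, so this is only an illustration of the argument layout; `build_parser` and `describe` are hypothetical helpers, and the default port of 8000 is taken from the backend configuration in this document.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Argument layout matching the commands listed in the spec."""
    parser = argparse.ArgumentParser(prog="hfstudio", description="HFStudio TTS server")
    parser.add_argument("--port", type=int, default=8000, help="port to listen on")
    subparsers = parser.add_subparsers(dest="command")
    subparsers.add_parser("download-models", help="download models for offline use")
    subparsers.add_parser("list-models", help="list available models")
    return parser


def describe(argv: List[str]) -> str:
    """Parse argv and report which action would run (no side effects)."""
    args = build_parser().parse_args(argv)
    if args.command is None:
        # No subcommand given: the bare `hfstudio` invocation starts the server.
        return f"serve on port {args.port}"
    return args.command
```

For example, `describe(["--port", "8080"])` reports that the server would start on port 8080, while `describe(["list-models"])` selects the model listing.
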
### Python API

```python
from hfstudio import TTSEngine

# Initialize engine
engine = TTSEngine(mode="local")

# Generate speech
audio = engine.generate(
    text="Hello, world!",
    voice="default",
    model="coqui/tts-vits"
)

# Save audio
audio.save("output.wav")
```

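
The object returned by `engine.generate(...)` is not specified beyond having a `save` method. As a rough sketch of what such an object could look like, the hypothetical `Audio` class below holds mono 16-bit PCM samples and writes them with the standard library's `wave` module. The `tone` helper merely stands in for model output and is not part of the package.

```python
import math
import struct
import wave
from dataclasses import dataclass
from typing import List


@dataclass
class Audio:
    """Mono 16-bit PCM samples plus their sample rate (hypothetical type)."""
    samples: List[int]
    sample_rate: int = 22050

    @property
    def duration(self) -> float:
        """Length of the clip in seconds."""
        return len(self.samples) / self.sample_rate

    def save(self, path: str) -> None:
        """Write the samples to `path` as an uncompressed WAV file."""
        with wave.open(path, "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
            wav.setframerate(self.sample_rate)
            # Little-endian signed 16-bit integers, one per sample.
            wav.writeframes(struct.pack(f"<{len(self.samples)}h", *self.samples))


def tone(seconds: float, freq: float = 440.0, sample_rate: int = 22050) -> Audio:
    """Stand-in for model output: a sine tone at half amplitude."""
    count = int(seconds * sample_rate)
    samples = [
        int(16383 * math.sin(2 * math.pi * freq * i / sample_rate))
        for i in range(count)
    ]
    return Audio(samples, sample_rate)
```

The `duration` property corresponds to the `duration` field in the generate response, which the frontend's AudioPlayer receives as a prop.
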
## Configuration

### Frontend (.env)

```env
PUBLIC_API_URL=http://localhost:8000
PUBLIC_DEFAULT_MODE=api
```

### Backend (config.yaml)

```yaml
server:
  host: 0.0.0.0
  port: 8000
  cors_origins:
    - http://localhost:5173
    - http://localhost:3000

models:
  local:
    cache_dir: ~/.hfstudio/models
    default: "coqui/tts-vits"
  api:
    huggingface_token: ${HF_TOKEN}
    openai_key: ${OPENAI_API_KEY}

audio:
  output_format: "wav"
  sample_rate: 22050
  bitrate: 128
```

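
YAML parsers do not expand `${HF_TOKEN}`-style placeholders on their own, so the backend would need to substitute them after loading the file. The sketch below shows one possible approach on an already-parsed configuration dictionary. The `expand_env` helper is hypothetical, and raising on an unset variable (rather than substituting an empty string) is a design assumption.

```python
import os
import re
from typing import Any, Mapping, Optional

# Matches ${NAME} where NAME is a conventional environment variable name.
_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: Any, env: Optional[Mapping[str, str]] = None) -> Any:
    """Recursively replace ${VAR} in strings; an unset variable raises KeyError."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        return _ENV_PATTERN.sub(lambda match: env[match.group(1)], value)
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    return value  # numbers, booleans, and None pass through unchanged
```

Failing loudly on a missing variable surfaces a misconfigured API key at startup instead of at the first remote request.
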
## Development Workflow

### Phase 1: MVP

1. Basic Svelte frontend with text input and generate button
2. FastAPI backend with single TTS model support
3. Mode toggle (UI only, local mode shows installation message)
4. Basic audio playback

### Phase 2: Core Features

1. Multiple voice support
2. Parameter controls (speed, stability, similarity)
3. Local model execution
4. Audio download functionality

### Phase 3: Enhanced Features

1. History/saved generations
2. Voice cloning (if supported by models)
3. Batch processing
4. Audio format options

### Phase 4: Polish

1. Waveform visualization
2. Real-time generation (streaming)
3. Voice preview
4. Keyboard shortcuts

## Performance Requirements

- **API Response Time**: < 2s for typical requests
- **Local Generation**: < 5s for 100 words
- **Frontend Load Time**: < 1s
- **Audio Streaming**: Start playback within 500ms

## Security Considerations

- API key management (environment variables)
- CORS configuration
- Rate limiting
- Input sanitization
- File size limits for audio generation

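
Of the items above, rate limiting is the one most easily illustrated in isolation. The spec does not name an algorithm, so the token bucket below is just one common choice, shown as a sketch. The `TokenBucket` class and its injectable `clock` parameter are hypothetical; a production server would keep one bucket per client (for example, keyed by IP address or API key).

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(
        self,
        capacity: int,
        rate: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.rate = rate
        self.clock = clock  # injectable so tests can control time
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available and report whether the request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With `TokenBucket(capacity=2, rate=1.0)`, two requests pass immediately, a third is rejected, and one more is admitted after a second has elapsed.
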
## Testing Strategy

- Frontend: Vitest for unit tests, Playwright for E2E
- Backend: pytest for unit and integration tests
- Load testing: Locust or k6
- Audio quality: Manual testing with various inputs

## Deployment Options

1. **Standalone**: User runs both frontend and backend locally
2. **Docker**: Containerized deployment
3. **Cloud**: Separate frontend (Vercel/Netlify) and backend (Railway/Fly.io)
4. **Desktop**: Electron wrapper (future consideration)