HFStudio Technical Specifications

Project Overview

HFStudio is a web-based text-to-speech application that provides both local and API-based TTS capabilities, inspired by ElevenLabs Studio but with support for local model execution.

Core Features

1. Text-to-Speech Engine

Input: Multi-line text area for user input
Output: Generated audio playback with download capability
Models: Support for multiple TTS models (local and API-based)
Voice Selection: Dropdown/list for available voices
Audio Controls: Play, pause, download generated audio

2. Execution Modes

API Mode: Connect to remote TTS services (HuggingFace, OpenAI, etc.)
Local Mode: Run TTS models locally using downloaded models
Mode Toggle: Clear UI toggle between API and Local execution
Local Setup Instructions: Display the installation command when local mode is selected
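The mode toggle implies that both execution paths sit behind one interface. The following is a minimal sketch of that dispatch, not the project's actual code. The class names (`TTSBackend`, `LocalBackend`, `ApiBackend`) and the `get_backend` helper are hypothetical, and the synthesis bodies are placeholders that return tagged bytes instead of audio.

```python
from abc import ABC, abstractmethod


class TTSBackend(ABC):
    """Hypothetical common interface shared by both execution modes."""

    @abstractmethod
    def synthesize(self, text: str, voice_id: str) -> bytes:
        """Return encoded audio for the given text."""


class LocalBackend(TTSBackend):
    def synthesize(self, text: str, voice_id: str) -> bytes:
        # Placeholder: a real implementation would run a downloaded model.
        return b"local:" + text.encode("utf-8")


class ApiBackend(TTSBackend):
    def synthesize(self, text: str, voice_id: str) -> bytes:
        # Placeholder: a real implementation would call a remote TTS service.
        return b"api:" + text.encode("utf-8")


# Maps the mode string used by the UI toggle to a backend class.
BACKENDS = {"local": LocalBackend, "api": ApiBackend}


def get_backend(mode: str) -> TTSBackend:
    """Select a backend from the mode string; reject unknown modes."""
    try:
        return BACKENDS[mode]()
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None
```

Keeping the mode lookup in one place means the rest of the server can stay unaware of which path is active.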

3. Voice Configuration

Speed Control: Slider (0.5x to 2.0x)
Stability: Slider for voice consistency (when applicable)
Similarity: Slider for voice matching (when applicable)
Style/Emotion: Optional controls for voice style
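These controls map naturally onto a small validated value object on the backend. The sketch below is one possible shape. Only the 0.5 to 2.0 speed range comes from this spec; the class name, the default values, and the 0.0 to 1.0 range for stability and similarity are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VoiceParameters:
    """Hypothetical container for the voice sliders."""

    speed: float = 1.0        # assumed default; range 0.5 to 2.0 is from the spec
    stability: float = 0.5    # assumed default and assumed range 0.0 to 1.0
    similarity: float = 0.5   # assumed default and assumed range 0.0 to 1.0
    style: Optional[str] = None

    def __post_init__(self) -> None:
        if not 0.5 <= self.speed <= 2.0:
            raise ValueError(f"speed must be within 0.5 and 2.0, got {self.speed}")
        for name in ("stability", "similarity"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be within 0.0 and 1.0, got {value}")
```

Validating on construction keeps out-of-range slider values from reaching a model, whichever mode is active.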

4. User Interface Layout

Left Sidebar: Navigation and feature selection
- Home/Text-to-Speech (default)
- Settings
- History (future feature)
Main Content Area: Text input and controls
Right Panel: Voice/model selection and parameters

Technology Stack

Frontend

Framework: SvelteKit
Styling: TailwindCSS
Components:
- Shadcn-svelte for UI components
- Audio player: Native HTML5 or Wavesurfer.js
State Management: Svelte stores
Build Tool: Vite

Backend (Python Package)

Framework: FastAPI for API server
TTS Libraries:
- Transformers (HuggingFace models)
- Coqui TTS
- Optional: Piper, Bark
Audio Processing: librosa, soundfile
CLI: Click or Typer for command-line interface

API Integration

HuggingFace Inference API
OpenAI TTS API (optional)
Custom model endpoints

Project Structure

hfstudio/
├── frontend/                 # Svelte frontend
│   ├── src/
│   │   ├── routes/
│   │   │   ├── +layout.svelte
│   │   │   ├── +page.svelte
│   │   │   └── api/
│   │   ├── lib/
│   │   │   ├── components/
│   │   │   │   ├── Sidebar.svelte
│   │   │   │   ├── TextInput.svelte
│   │   │   │   ├── VoiceSelector.svelte
│   │   │   │   ├── AudioPlayer.svelte
│   │   │   │   ├── ModeToggle.svelte
│   │   │   │   └── ParameterControls.svelte
│   │   │   ├── stores/
│   │   │   │   ├── app.js
│   │   │   │   └── audio.js
│   │   │   └── api/
│   │   │       └── client.js
│   │   └── app.html
│   ├── package.json
│   ├── vite.config.js
│   └── tailwind.config.js
│
├── backend/                  # Python backend
│   ├── hfstudio/
│   │   ├── __init__.py
│   │   ├── __main__.py
│   │   ├── server.py        # FastAPI app
│   │   ├── cli.py           # Command-line interface
│   │   ├── models/
│   │   │   ├── __init__.py
│   │   │   ├── base.py
│   │   │   ├── local.py
│   │   │   └── api.py
│   │   ├── voices/
│   │   │   ├── __init__.py
│   │   │   └── manager.py
│   │   └── utils/
│   │       ├── __init__.py
│   │       └── audio.py
│   ├── requirements.txt
│   └── setup.py
│
├── README.md
└── docker-compose.yml       # Optional containerization

API Endpoints

REST API

POST /api/tts/generate
  Body: {
    text: string,
    voice_id: string,
    model_id: string,
    parameters: {
      speed: float,
      stability: float,
      similarity: float,
      style: string
    },
    mode: "api" | "local"
  }
  Response: {
    audio_url: string,
    duration: float,
    format: string
  }

GET /api/voices
  Response: {
    voices: [{
      id: string,
      name: string,
      preview_url: string,
      supported_models: string[]
    }]
  }

GET /api/models
  Response: {
    models: [{
      id: string,
      name: string,
      type: "local" | "api",
      status: "available" | "downloadable" | "api-only"
    }]
  }

GET /api/status
  Response: {
    mode: "api" | "local",
    local_available: boolean,
    api_configured: boolean
  }
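The request body for POST /api/tts/generate can be checked before any model is invoked. The sketch below uses only the standard library and is not the server's actual schema code. The `GenerateRequest` class and its `from_dict` method are hypothetical, and treating `parameters` as optional is an assumption the spec does not state.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

VALID_MODES = ("api", "local")


@dataclass
class GenerateRequest:
    """Hypothetical parsed form of the /api/tts/generate body."""

    text: str
    voice_id: str
    model_id: str
    mode: str
    parameters: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, body: Dict[str, Any]) -> "GenerateRequest":
        # The four string fields are required.
        for key in ("text", "voice_id", "model_id", "mode"):
            if not isinstance(body.get(key), str):
                raise ValueError(f"'{key}' is required and must be a string")
        if body["mode"] not in VALID_MODES:
            raise ValueError(f"mode must be one of {VALID_MODES}")
        # Assumption: a missing 'parameters' object means "use defaults".
        parameters = body.get("parameters", {})
        if not isinstance(parameters, dict):
            raise ValueError("'parameters' must be an object")
        return cls(
            text=body["text"],
            voice_id=body["voice_id"],
            model_id=body["model_id"],
            mode=body["mode"],
            parameters=dict(parameters),
        )
```

In a FastAPI server the same shape would more typically be declared as a Pydantic model, which performs this validation automatically.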

Component Specifications

1. ModeToggle Component

Props:
- mode: "api" | "local"
- onModeChange: function

Features:
- Visual toggle switch
- Installation hint for local mode
- Status indicator (green/yellow/red)

2. TextInput Component

Props:
- value: string
- maxLength: number (default: 5000)
- placeholder: string

Features:
- Character counter
- Auto-resize
- Clear button

3. VoiceSelector Component

Props:
- voices: Voice[]
- selectedVoice: string
- onSelect: function

Features:
- Search/filter
- Voice preview
- Favorite voices

4. AudioPlayer Component

Props:
- audioUrl: string
- duration: number

Features:
- Play/pause
- Progress bar
- Volume control
- Download button
- Waveform visualization (optional)

Local Package (hfstudio)

Installation

pip install hfstudio

CLI Usage

# Start the server
hfstudio

# Start with custom port
hfstudio --port 8080

# Download models for offline use
hfstudio download-models

# List available models
hfstudio list-models
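The commands above could be parsed as follows. This sketch uses the standard library's argparse in place of Click or Typer so that it runs without extra dependencies. The function names and the "serve" label for the no-subcommand case are hypothetical. The default port of 8000 mirrors the server port in the backend configuration.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the commands listed in the spec."""
    parser = argparse.ArgumentParser(prog="hfstudio")
    parser.add_argument("--port", type=int, default=8000, help="port for the server")
    subparsers = parser.add_subparsers(dest="command")
    subparsers.add_parser("download-models", help="download models for offline use")
    subparsers.add_parser("list-models", help="list available models")
    return parser


def parse_cli(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse arguments; running with no subcommand means 'start the server'."""
    args = build_parser().parse_args(argv)
    if args.command is None:
        args.command = "serve"  # hypothetical label for the default action
    return args
```

For example, `parse_cli(["--port", "8080"])` yields a namespace with `port` set to 8080 and `command` set to "serve".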

Python API

from hfstudio import TTSEngine

# Initialize engine
engine = TTSEngine(mode="local")

# Generate speech
audio = engine.generate(
    text="Hello, world!",
    voice="default",
    model="coqui/tts-vits"
)

# Save audio
audio.save("output.wav")
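The object returned by `engine.generate` is not specified beyond its `save` method. As an illustration only, a result type could hold mono samples and write 16-bit PCM WAV with the standard library's `wave` module. The `Audio` class, its fields, and the `sine_tone` stand-in for model output are all hypothetical. The 22050 Hz default matches the sample rate in the backend configuration.

```python
import math
import struct
import wave
from dataclasses import dataclass
from typing import List


@dataclass
class Audio:
    """Hypothetical result object: mono float samples in [-1.0, 1.0]."""

    samples: List[float]
    sample_rate: int = 22050

    @property
    def duration(self) -> float:
        """Length of the clip in seconds."""
        return len(self.samples) / self.sample_rate

    def save(self, path: str) -> None:
        """Write the samples as 16-bit mono PCM WAV."""
        frames = b"".join(
            # Clamp to [-1.0, 1.0], then scale to the signed 16-bit range.
            struct.pack("<h", int(max(-1.0, min(1.0, sample)) * 32767))
            for sample in self.samples
        )
        with wave.open(path, "wb") as wav_file:
            wav_file.setnchannels(1)
            wav_file.setsampwidth(2)
            wav_file.setframerate(self.sample_rate)
            wav_file.writeframes(frames)


def sine_tone(frequency: float, seconds: float, sample_rate: int = 22050) -> Audio:
    """Stand-in for model output: a pure tone of the given length."""
    count = int(seconds * sample_rate)
    samples = [math.sin(2 * math.pi * frequency * i / sample_rate) for i in range(count)]
    return Audio(samples, sample_rate)
```

Calling `sine_tone(440.0, 0.5).save("output.wav")` writes half a second of audio, and the `duration` property supplies the value the REST response reports.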

Configuration

Frontend (.env)

PUBLIC_API_URL=http://localhost:8000
PUBLIC_DEFAULT_MODE=api

Backend (config.yaml)

server:
  host: 0.0.0.0
  port: 8000
  cors_origins:
    - http://localhost:5173
    - http://localhost:3000

models:
  local:
    cache_dir: ~/.hfstudio/models
    default: "coqui/tts-vits"
  api:
    huggingface_token: ${HF_TOKEN}
    openai_key: ${OPENAI_API_KEY}

audio:
  output_format: "wav"
  sample_rate: 22050  # Hz
  bitrate: 128        # presumably kbps; relevant only to compressed formats, not WAV
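Standard YAML parsers do not expand `${HF_TOKEN}` or `${OPENAI_API_KEY}` on their own, so the backend has to substitute those placeholders itself. Below is a minimal sketch of that step, applied to the raw file text before parsing. The `expand_env` helper is hypothetical, and failing loudly on a missing variable is a design assumption, not something the spec states.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace ${NAME} placeholders with values from the environment."""
    values = os.environ if env is None else env

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in values:
            # Fail early so a missing key does not become a literal "${...}".
            raise KeyError(f"environment variable {name} is not set")
        return values[name]

    return _PLACEHOLDER.sub(substitute, text)
```

The expanded text can then be passed to whichever YAML loader the backend uses. Passing an explicit `env` mapping keeps the function easy to test.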

Development Workflow

Phase 1: MVP

Basic Svelte frontend with text input and generate button
FastAPI backend with single TTS model support
Mode toggle (UI only; local mode shows an installation message)
Basic audio playback

Phase 2: Core Features

Multiple voice support
Parameter controls (speed, stability, similarity)
Local model execution
Audio download functionality

Phase 3: Enhanced Features

History/saved generations
Voice cloning (if supported by models)
Batch processing
Audio format options

Phase 4: Polish

Waveform visualization
Real-time generation (streaming)
Voice preview
Keyboard shortcuts

Performance Requirements

API Response Time: < 2s for typical requests
Local Generation: < 5s for 100 words
Frontend Load Time: < 1s
Audio Streaming: Start playback within 500ms
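These targets can be recorded as data and checked in automated tests. The sketch below is one way to do that. The budget keys, `within_budget`, and `timed` are hypothetical names; the numeric limits are the ones listed above, expressed in seconds.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator

# Budgets in seconds, taken from the targets listed above.
BUDGETS: Dict[str, float] = {
    "api_response": 2.0,
    "local_generation_100_words": 5.0,
    "frontend_load": 1.0,
    "audio_stream_start": 0.5,
}


def within_budget(name: str, elapsed_seconds: float) -> bool:
    """True when the measured time is strictly under the named budget."""
    return elapsed_seconds < BUDGETS[name]


@contextmanager
def timed(results: Dict[str, float], name: str) -> Iterator[None]:
    """Measure the wall-clock time of a block and store it under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[name] = time.perf_counter() - start
```

A test could wrap a generation call in `with timed(results, "api_response"):` and then assert `within_budget("api_response", results["api_response"])`.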

Security Considerations

API key management (environment variables)
CORS configuration
Rate limiting
Input sanitization
File size limits for audio generation
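Two of these items, rate limiting and input checks, lend themselves to a short illustration. The sketch below uses a token bucket, which is one common rate-limiting technique and not necessarily the one the project will adopt. The `TokenBucket` class and `check_text` helper are hypothetical. The 5000-character limit mirrors the TextInput component's default maxLength.

```python
import time
from typing import Callable


class TokenBucket:
    """Token-bucket limiter: `rate` tokens refill per second up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; return whether the call may proceed."""
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


MAX_TEXT_LENGTH = 5000  # mirrors the TextInput default maxLength


def check_text(text: str) -> str:
    """Reject empty or oversized input before it reaches a model."""
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("text must not be empty")
    if len(cleaned) > MAX_TEXT_LENGTH:
        raise ValueError(f"text exceeds {MAX_TEXT_LENGTH} characters")
    return cleaned
```

A server would typically keep one bucket per client or API key. The injectable `clock` exists so the limiter can be tested without sleeping.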

Testing Strategy

Frontend: Vitest for unit tests, Playwright for E2E
Backend: Pytest for unit and integration tests
Load testing: Locust or k6
Audio quality: Manual testing with various inputs

Deployment Options

Standalone: User runs both frontend and backend locally
Docker: Containerized deployment
Cloud: Separate frontend (Vercel/Netlify) and backend (Railway/Fly.io)
Desktop: Electron wrapper (future consideration)