---
title: Avatar-based AI Tutor Engine
emoji: π
colorFrom: purple
colorTo: red
sdk: docker
app_port: 7860
---
# Avatar-based AI Tutor Engine

## Overview
The Avatar-based AI Tutor Engine is a standalone intelligence service that creates personalized AI tutor clones from user-provided images and voice samples. The engine generates dynamic teaching videos with realistic facial movements, lip-sync, and natural expressions.
## What This Engine Does

**Input:** Image + Voice Sample + Course Content

**Output:** Animated avatar video teaching the course content
## Key Features

- ✅ **Dynamic Avatar Generation**: Creates talking head videos with facial animations
- ✅ **Lip-Sync Accuracy**: Precise synchronization between audio and mouth movements
- ✅ **Natural Expressions**: Includes blinking, head movements, and micro-expressions
- ✅ **Personalized Teaching**: Generates engaging teaching scripts from course content
- ✅ **Voice Matching**: Analyzes and replicates voice characteristics
## Architecture

```
avatar-tutor-engine/
├── app/
│   ├── __init__.py      # Package initialization
│   ├── main.py          # FastAPI app + routing
│   ├── contracts.py     # EngineRequest / EngineResponse
│   ├── config.py        # Environment variables
│   ├── hf_client.py     # Hugging Face API client
│   └── engine.py        # Core intelligence logic
├── requirements.txt     # Python dependencies
└── .env.example         # Environment template
```
## Setup

### 1. Install Dependencies

```bash
cd avatar-tutor-engine
pip install -r requirements.txt
```
### 2. Configure Environment

Copy `.env.example` to `.env` and add your Hugging Face token:

```bash
cp .env.example .env
```

Edit `.env`:

```
HF_TOKEN=your_actual_token_here
```
### 3. Start the Engine

```bash
python -m app.main
```

The engine will start on `http://127.0.0.1:7860`.
## API Usage

### Endpoint

```
POST /run
```

### Request Format
```json
{
  "request_id": "unique-id",
  "engine": "avatar-tutor-engine",
  "action": "create_avatar_tutor",
  "actor": {
    "user_id": "user123",
    "session_id": "session456"
  },
  "input": {
    "text": "Course content to teach...",
    "items": [
      {
        "type": "image",
        "ref": "https://example.com/user-photo.jpg"
      },
      {
        "type": "audio",
        "ref": "https://example.com/voice-sample.mp3"
      }
    ]
  },
  "context": {},
  "options": {
    "lesson_duration": "5 minutes",
    "temperature": 0.7
  }
}
```
### Response Format
```json
{
  "request_id": "unique-id",
  "ok": true,
  "status": "success",
  "engine": "avatar-tutor-engine",
  "action": "create_avatar_tutor",
  "result": {
    "avatar_video_url": "https://...",
    "teaching_script": "Generated teaching content...",
    "voice_sample_transcription": "Transcribed voice...",
    "lesson_duration": "5 minutes",
    "avatar_features": {
      "facial_animations": true,
      "lip_sync": true,
      "natural_expressions": true,
      "head_movements": true
    }
  },
  "messages": [
    "Avatar tutor created successfully",
    "Generated 5 minutes teaching session"
  ],
  "suggested_actions": [
    "download_video",
    "generate_another_lesson"
  ]
}
```
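A request like the one above can be sent from Python. This is a hypothetical client sketch: the payload fields are copied from the request format, while the function names, example URLs, and actor values are illustrative placeholders:

```python
# Hypothetical client for the /run endpoint; payload fields mirror the
# request format documented above.
import requests

def build_request(course_text: str, image_url: str, audio_url: str,
                  request_id: str = "demo-001") -> dict:
    """Assemble a create_avatar_tutor request payload."""
    return {
        "request_id": request_id,
        "engine": "avatar-tutor-engine",
        "action": "create_avatar_tutor",
        "actor": {"user_id": "user123", "session_id": "session456"},
        "input": {
            "text": course_text,
            "items": [
                {"type": "image", "ref": image_url},
                {"type": "audio", "ref": audio_url},
            ],
        },
        "context": {},
        "options": {"lesson_duration": "5 minutes", "temperature": 0.7},
    }

def create_avatar_tutor(base_url: str, payload: dict) -> dict:
    """POST the payload and return the response envelope as a dict."""
    # Generous timeout: video generation can take 30-120 seconds.
    resp = requests.post(f"{base_url}/run", json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()
```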
## Supported Action

- `create_avatar_tutor`: Creates an animated avatar teaching video
## Models Used

| Purpose | Model | Description |
|---|---|---|
| Teaching Script | meta-llama/Meta-Llama-3-8B-Instruct | Generates engaging teaching content |
| Voice Transcription | openai/whisper-base | Analyzes voice characteristics |
| Text-to-Speech | facebook/fastspeech2-en-ljspeech | Synthesizes teaching audio |
| Avatar Animation | vinthony/SadTalker | Creates talking head with lip-sync |
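A sketch of how `app/hf_client.py` might address these models through the serverless Inference API. The helper and key names are assumptions; only the `api-inference.huggingface.co/models/<model-id>` base URL pattern and the bearer-token header are standard:

```python
# Sketch of app/hf_client.py -- model IDs taken from the table above;
# function names are illustrative assumptions.
import os
import requests

HF_API_BASE = "https://api-inference.huggingface.co/models"

MODELS = {
    "script": "meta-llama/Meta-Llama-3-8B-Instruct",
    "transcription": "openai/whisper-base",
    "tts": "facebook/fastspeech2-en-ljspeech",
    "avatar": "vinthony/SadTalker",
}

def model_url(purpose: str) -> str:
    """Build the Inference API URL for one of the engine's models."""
    return f"{HF_API_BASE}/{MODELS[purpose]}"

def auth_headers() -> dict:
    # HF_TOKEN comes from .env (see Setup above)
    return {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

def query(purpose: str, payload: dict) -> dict:
    """POST a payload to the chosen model and return the JSON result."""
    resp = requests.post(model_url(purpose), headers=auth_headers(),
                         json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()
```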
## Options

- `lesson_duration`: Target length (e.g., "5 minutes", "10 minutes")
- `temperature`: LLM creativity (0.0-1.0, default: 0.7)
- `max_tokens`: Maximum script length (default: 2048)
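Since `lesson_duration` is a free-form string, the engine presumably normalizes it before use. A hypothetical helper (not part of the documented API) that turns the documented formats into seconds:

```python
# Hypothetical parser for lesson_duration strings like "5 minutes".
import re

def parse_duration_seconds(duration: str, default: int = 300) -> int:
    """Parse "5 minutes" or "90 seconds" into a second count."""
    m = re.match(r"\s*(\d+)\s*(minute|second)s?\s*$",
                 duration, re.IGNORECASE)
    if not m:
        # Fall back rather than crash, in line with "Fail Gracefully"
        return default
    value, unit = int(m.group(1)), m.group(2).lower()
    return value * 60 if unit == "minute" else value
```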
## Error Handling

All errors return structured responses:

```json
{
  "ok": false,
  "status": "error",
  "error": {
    "code": "ERROR_CODE",
    "detail": "Human-readable explanation"
  }
}
```
### Common Error Codes

- `INVALID_ACTION`: Unsupported action requested
- `MISSING_IMAGE`: No image provided
- `MISSING_AUDIO`: No voice sample provided
- `MISSING_CONTENT`: No course content provided
- `ENGINE_ERROR`: Processing failure
## Limitations

- **Video Generation**: The current implementation uses a placeholder for talking head generation; full integration requires HF Inference API support for video generation models.
- **Voice Cloning**: TTS uses a default voice; advanced voice cloning requires additional models not yet available via the free-tier HF Inference API.
- **Processing Time**: Video generation can take 30-120 seconds depending on lesson length and model availability.
- **Free Tier Limits**: The Hugging Face free tier has rate limits; for production use, consider upgrading to the Pro tier.
- **Model Availability**: Some models (especially talking head generation) may require direct API access or specialized endpoints not available in the standard HF Inference API.
## Testing

Access the Swagger UI at `http://127.0.0.1:7860/docs`.

See `SWAGGER_TESTS.md` for example payloads.
## Health Check

```bash
curl http://127.0.0.1:7860/health
```
## Design Principles

- **Stateless**: No session storage; each request is independent
- **Standalone**: No dependencies on other engines
- **Fail Gracefully**: Structured errors, no crashes
- **Single Responsibility**: Only creates avatar tutors, nothing else
## Future Enhancements
- Real-time voice cloning
- Multiple avatar styles
- Gesture and body language
- Multi-language support
- Custom teaching styles
- Interactive avatar responses