---
title: Avatar-based AI Tutor Engine
emoji: 🎭
colorFrom: purple
colorTo: red
sdk: docker
app_port: 7860
---

# Avatar-based AI Tutor Engine

## Overview

The Avatar-based AI Tutor Engine is a standalone intelligence service that creates personalized AI tutor clones from user-provided images and voice samples. The engine generates dynamic teaching videos with realistic facial movements, lip-sync, and natural expressions.

## What This Engine Does

**Input:** Image + Voice Sample + Course Content
**Output:** Animated avatar video teaching the course content

## Key Features

- ✅ **Dynamic Avatar Generation:** Creates talking-head videos with facial animations
- ✅ **Lip-Sync Accuracy:** Precise synchronization between audio and mouth movements
- ✅ **Natural Expressions:** Includes blinking, head movements, and micro-expressions
- ✅ **Personalized Teaching:** Generates engaging teaching scripts from course content
- ✅ **Voice Matching:** Analyzes and replicates voice characteristics

## Architecture

```
avatar-tutor-engine/
├── app/
│   ├── __init__.py       # Package initialization
│   ├── main.py           # FastAPI app + routing
│   ├── contracts.py      # EngineRequest / EngineResponse
│   ├── config.py         # Environment variables
│   ├── hf_client.py      # Hugging Face API client
│   └── engine.py         # Core intelligence logic
├── requirements.txt      # Python dependencies
└── .env.example          # Environment template
```

## Setup

### 1. Install Dependencies

```bash
cd avatar-tutor-engine
pip install -r requirements.txt
```

### 2. Configure Environment

Copy `.env.example` to `.env` and add your Hugging Face token:

```bash
cp .env.example .env
```

Edit `.env`:

```
HF_TOKEN=your_actual_token_here
```
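As a rough sketch of how `config.py` might pick this up (the variable name follows `.env.example`; the loader itself is illustrative, not the engine's actual code):

```python
import os

def load_settings() -> dict:
    """Read engine settings from the environment; fail fast if the token is missing."""
    token = os.getenv("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; copy .env.example to .env first")
    return {"hf_token": token}
```

Failing fast at startup surfaces a missing token immediately, instead of as an opaque 401 from the Hugging Face API later.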

### 3. Start the Engine

```bash
python -m app.main
```

The engine starts on http://127.0.0.1:7860

## API Usage

### Endpoint

`POST /run`

### Request Format

```json
{
  "request_id": "unique-id",
  "engine": "avatar-tutor-engine",
  "action": "create_avatar_tutor",
  "actor": {
    "user_id": "user123",
    "session_id": "session456"
  },
  "input": {
    "text": "Course content to teach...",
    "items": [
      {
        "type": "image",
        "ref": "https://example.com/user-photo.jpg"
      },
      {
        "type": "audio",
        "ref": "https://example.com/voice-sample.mp3"
      }
    ]
  },
  "context": {},
  "options": {
    "lesson_duration": "5 minutes",
    "temperature": 0.7
  }
}
```
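A minimal client-side sketch that assembles a payload in this shape (`build_avatar_request` is a hypothetical helper, not part of the engine; sending it requires the third-party `requests` package, shown commented out):

```python
import uuid

def build_avatar_request(course_text: str, image_url: str, audio_url: str,
                         user_id: str, session_id: str) -> dict:
    """Assemble an EngineRequest payload matching the documented format."""
    return {
        "request_id": str(uuid.uuid4()),
        "engine": "avatar-tutor-engine",
        "action": "create_avatar_tutor",
        "actor": {"user_id": user_id, "session_id": session_id},
        "input": {
            "text": course_text,
            "items": [
                {"type": "image", "ref": image_url},
                {"type": "audio", "ref": audio_url},
            ],
        },
        "context": {},
        "options": {"lesson_duration": "5 minutes", "temperature": 0.7},
    }

# To send it against a locally running engine:
# import requests
# resp = requests.post("http://127.0.0.1:7860/run",
#                      json=build_avatar_request("Intro to Python",
#                                                "https://example.com/user-photo.jpg",
#                                                "https://example.com/voice-sample.mp3",
#                                                "user123", "session456"))
```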

### Response Format

```json
{
  "request_id": "unique-id",
  "ok": true,
  "status": "success",
  "engine": "avatar-tutor-engine",
  "action": "create_avatar_tutor",
  "result": {
    "avatar_video_url": "https://...",
    "teaching_script": "Generated teaching content...",
    "voice_sample_transcription": "Transcribed voice...",
    "lesson_duration": "5 minutes",
    "avatar_features": {
      "facial_animations": true,
      "lip_sync": true,
      "natural_expressions": true,
      "head_movements": true
    }
  },
  "messages": [
    "Avatar tutor created successfully",
    "Generated 5 minutes teaching session"
  ],
  "suggested_actions": [
    "download_video",
    "generate_another_lesson"
  ]
}
```
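On the consuming side, a caller should branch on the `ok` flag before reaching into `result`. A small sketch (the helper name is ours, not part of the engine):

```python
from typing import Optional

def extract_video_url(response: dict) -> Optional[str]:
    """Return the avatar video URL from a successful EngineResponse,
    or None when the request failed or the field is absent."""
    if not response.get("ok"):
        return None
    return response.get("result", {}).get("avatar_video_url")
```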

## Supported Action

- `create_avatar_tutor`: Creates an animated avatar teaching video

## Models Used

| Purpose | Model | Description |
|---|---|---|
| Teaching script | `meta-llama/Meta-Llama-3-8B-Instruct` | Generates engaging teaching content |
| Voice transcription | `openai/whisper-base` | Analyzes voice characteristics |
| Text-to-speech | `facebook/fastspeech2-en-ljspeech` | Synthesizes teaching audio |
| Avatar animation | `vinthony/SadTalker` | Creates talking head with lip-sync |
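One way to capture this table in code is a simple registry, e.g. in `config.py` (a sketch of one possible layout; the stage names are ours):

```python
# Pipeline stage -> Hugging Face model ID, mirroring the table above.
MODELS = {
    "teaching_script": "meta-llama/Meta-Llama-3-8B-Instruct",
    "voice_transcription": "openai/whisper-base",
    "text_to_speech": "facebook/fastspeech2-en-ljspeech",
    "avatar_animation": "vinthony/SadTalker",
}

def model_for(stage: str) -> str:
    """Look up the model ID for a pipeline stage, failing loudly on typos."""
    try:
        return MODELS[stage]
    except KeyError:
        raise ValueError(f"Unknown pipeline stage: {stage!r}")
```

Keeping the IDs in one place makes it easy to swap a model (say, a larger Whisper checkpoint) without touching the engine logic.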

## Options

- `lesson_duration`: Target length (e.g., "5 minutes", "10 minutes")
- `temperature`: LLM creativity (0.0-1.0, default: 0.7)
- `max_tokens`: Maximum script length (default: 2048)

## Error Handling

All errors return structured responses:

```json
{
  "ok": false,
  "status": "error",
  "error": {
    "code": "ERROR_CODE",
    "detail": "Human-readable explanation"
  }
}
```

### Common Error Codes

- `INVALID_ACTION`: Unsupported action requested
- `MISSING_IMAGE`: No image provided
- `MISSING_AUDIO`: No voice sample provided
- `MISSING_CONTENT`: No course content provided
- `ENGINE_ERROR`: Processing failure
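The input-related codes map directly onto checks a validator could run before any model call. A sketch of that mapping (illustrative only, not the engine's actual `contracts.py` logic):

```python
def validate_request(payload: dict):
    """Return the first applicable error code for a payload, or None if valid."""
    if payload.get("action") != "create_avatar_tutor":
        return "INVALID_ACTION"
    items = payload.get("input", {}).get("items", [])
    types = {item.get("type") for item in items}
    if "image" not in types:
        return "MISSING_IMAGE"
    if "audio" not in types:
        return "MISSING_AUDIO"
    if not payload.get("input", {}).get("text"):
        return "MISSING_CONTENT"
    return None  # ENGINE_ERROR covers later processing failures, not validation
```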

## Limitations

1. **Video Generation:** The current implementation uses a placeholder for talking-head generation. Full integration requires HF Inference API support for video generation models.
2. **Voice Cloning:** TTS uses a default voice. Advanced voice cloning requires additional models not yet available via the free-tier HF Inference API.
3. **Processing Time:** Video generation can take 30-120 seconds, depending on lesson length and model availability.
4. **Free Tier Limits:** The Hugging Face free tier has rate limits. For production use, consider upgrading to the Pro tier.
5. **Model Availability:** Some models (especially talking-head generation) may require direct API access or specialized endpoints not available in the standard HF Inference API.

## Testing

Access the Swagger UI at http://127.0.0.1:7860/docs

See `SWAGGER_TESTS.md` for example payloads.

## Health Check

```bash
curl http://127.0.0.1:7860/health
```

## Design Principles

- **Stateless:** No session storage; each request is independent
- **Standalone:** No dependencies on other engines
- **Fail Gracefully:** Structured errors, no crashes
- **Single Responsibility:** Only creates avatar tutors, nothing else

## Future Enhancements

- Real-time voice cloning
- Multiple avatar styles
- Gesture and body language
- Multi-language support
- Custom teaching styles
- Interactive avatar responses