---
title: General AI Engine
emoji: 🧠
colorFrom: indigo
colorTo: blue
sdk: docker
app_port: 7860
---

# General AI Engine

## Overview

The General AI Engine is a pure intelligence service designed for open-ended question answering and multi-modal interaction. It uses various Hugging Face models to process text, images, and audio, providing a unified "ask anything" interface.

## What This Engine Does

- **Input:** Text, Image, Audio, or Video
- **Output:** Intelligent natural language responses

## Key Features

- ✅ **Multi-modal Chat:** Unified interface for text, image, and audio interaction.
- ✅ **Dynamic Model Routing:** Automatically selects appropriate models based on input modality.
- ✅ **Conversation History:** Supports multi-turn dialogue when provided in context.
- ✅ **Audio Support:** Transcribes spoken questions automatically.
- ✅ **Vision Support:** Understands and describes image/video content.

## Architecture

This is a standalone intelligence engine: not a chatbot, not a UI, and not an orchestrator. Like the other engine services, it is designed to be called by an AI Mentor.

```
general-ai-engine/
├── app/
│   ├── __init__.py       # Package initialization
│   ├── main.py           # FastAPI app + routing
│   ├── contracts.py      # EngineRequest / EngineResponse
│   ├── config.py         # Environment variables
│   ├── hf_client.py      # Hugging Face API client
│   └── engine.py         # Core intelligence logic
├── requirements.txt      # Python dependencies
└── .env.example          # Environment template
```
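
The request/response contracts in `contracts.py` are not shown here; a minimal sketch using dataclasses, with field names inferred from the API examples below (the real module may well use pydantic models instead):

```python
from dataclasses import dataclass, field, asdict
from typing import Any

# Hypothetical sketch of app/contracts.py. Field names are taken from
# the JSON examples in this README; everything else is an assumption.

@dataclass
class EngineRequest:
    request_id: str
    engine: str
    action: str
    actor: dict = field(default_factory=dict)
    input: dict = field(default_factory=dict)
    context: dict = field(default_factory=dict)
    options: dict = field(default_factory=dict)

@dataclass
class EngineResponse:
    request_id: str
    ok: bool
    status: str
    engine: str
    action: str
    result: dict = field(default_factory=dict)
    messages: list = field(default_factory=list)
    suggested_actions: list = field(default_factory=list)
    citations: list = field(default_factory=list)

# Build a text-only request matching the first API example.
req = EngineRequest(
    request_id="req_123",
    engine="general-ai-engine",
    action="ask_question",
    input={"text": "What is quantum computing?"},
)
print(asdict(req)["input"]["text"])
```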

## Setup

### 1. Install Dependencies

```bash
cd general-ai-engine
pip install -r requirements.txt
```

### 2. Configure Environment

```bash
cp .env.example .env
# Edit .env with your HF_TOKEN
```

### 3. Start the Engine

```bash
python -m app.main
```

The engine listens on the configured `HOST` and `PORT` (default: http://127.0.0.1:8002; the Docker Space serves on port 7860).

## API

**Single Entrypoint:** `POST /run`

**Text-Only Request:**

```json
{
  "request_id": "req_123",
  "engine": "general-ai-engine",
  "action": "ask_question",
  "actor": {
    "user_id": "user_456",
    "session_id": "session_789"
  },
  "input": {
    "text": "What is quantum computing?"
  },
  "context": {},
  "options": {
    "temperature": 0.7,
    "max_tokens": 2048
  }
}
```

**Response:**

```json
{
  "request_id": "req_123",
  "ok": true,
  "status": "success",
  "engine": "general-ai-engine",
  "action": "ask_question",
  "result": {
    "answer": "Quantum computing is...",
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "question": "What is quantum computing?",
    "modalities": ["text"]
  },
  "messages": ["Generated response using meta-llama/Llama-3.3-70B-Instruct"],
  "suggested_actions": ["ask_followup", "clarify", "explore_topic"],
  "citations": []
}
```
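
The text-only request above can be sent from Python with the standard library alone; a minimal client sketch, assuming a local run on the default port 8002:

```python
import json
import urllib.request

def build_request(question: str, request_id: str = "req_123") -> dict:
    """Build an ask_question payload matching the contract above."""
    return {
        "request_id": request_id,
        "engine": "general-ai-engine",
        "action": "ask_question",
        "actor": {"user_id": "user_456", "session_id": "session_789"},
        "input": {"text": question},
        "context": {},
        "options": {"temperature": 0.7, "max_tokens": 2048},
    }

def ask(question: str, base_url: str = "http://127.0.0.1:8002") -> dict:
    """POST the payload to the engine's /run endpoint and parse the reply."""
    req = urllib.request.Request(
        f"{base_url}/run",
        data=json.dumps(build_request(question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())

payload = build_request("What is quantum computing?")
print(payload["input"]["text"])
```

`build_request` is side-effect free; calling `ask(...)` requires the engine to actually be running at the given base URL.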

**Image Understanding Request:**

```json
{
  "request_id": "req_124",
  "engine": "general-ai-engine",
  "action": "ask_question",
  "actor": {
    "user_id": "user_456",
    "session_id": "session_789"
  },
  "input": {
    "text": "What's in this image?",
    "items": [
      {
        "type": "image",
        "text": "",
        "ref": "https://example.com/image.jpg"
      }
    ]
  },
  "context": {},
  "options": {}
}
```

**Response:**

```json
{
  "request_id": "req_124",
  "ok": true,
  "status": "success",
  "engine": "general-ai-engine",
  "action": "ask_question",
  "result": {
    "answer": "The image shows...",
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "question": "What's in this image?",
    "modalities": ["image"]
  },
  "messages": ["Generated response using meta-llama/Llama-3.2-11B-Vision-Instruct"],
  "suggested_actions": ["ask_followup", "clarify", "explore_topic"]
}
```
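
Attaching media is just a matter of adding entries to `input.items`; a small helper sketch (the base64 variant is an assumption, since this README only documents base64 delivery for audio):

```python
import base64

def image_item(ref: str) -> dict:
    """Build an input item referencing an image by URL, as in the example above."""
    return {"type": "image", "text": "", "ref": ref}

def image_item_from_bytes(data: bytes, mime: str = "image/jpeg") -> dict:
    """Hypothetical variant embedding the image as a base64 data URI.

    Whether the engine accepts data URIs for images (rather than only
    for audio) is an assumption, not documented behavior.
    """
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image", "text": "", "ref": f"data:{mime};base64,{encoded}"}

item = image_item("https://example.com/image.jpg")
print(item["type"])
```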

**Audio Transcription + Question:**

```json
{
  "request_id": "req_125",
  "engine": "general-ai-engine",
  "action": "ask_question",
  "actor": {
    "user_id": "user_456",
    "session_id": "session_789"
  },
  "input": {
    "text": "Summarize what was said",
    "items": [
      {
        "type": "audio",
        "text": "",
        "ref": "https://example.com/audio.mp3"
      }
    ]
  },
  "context": {},
  "options": {}
}
```

**Response:**

```json
{
  "request_id": "req_125",
  "ok": true,
  "status": "success",
  "engine": "general-ai-engine",
  "action": "ask_question",
  "result": {
    "answer": "The audio discusses...",
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "question": "Summarize what was said\n\n[Audio transcription]: Hello, this is a test...",
    "modalities": ["audio"],
    "audio_transcription": "Hello, this is a test..."
  },
  "messages": ["Generated response using meta-llama/Llama-3.3-70B-Instruct"],
  "suggested_actions": ["ask_followup", "clarify", "explore_topic"]
}
```

## Supported Actions

- `ask_question` - Answer a single question
- `chat` - Conversational interaction (supports `context.conversation_history`)
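
Since the engine is stateless, a `chat` caller must carry prior turns in `context.conversation_history` itself. The exact schema of that field is not documented here; the sketch below assumes a role/content message list:

```python
def build_chat_request(message: str, history: list) -> dict:
    """Build a chat request carrying prior turns in the context.

    The shape of conversation_history (role/content dicts) is an
    assumption; adjust to whatever the engine actually expects.
    """
    return {
        "request_id": "req_126",
        "engine": "general-ai-engine",
        "action": "chat",
        "actor": {"user_id": "user_456", "session_id": "session_789"},
        "input": {"text": message},
        "context": {"conversation_history": history},
        "options": {},
    }

history = [
    {"role": "user", "content": "What is quantum computing?"},
    {"role": "assistant", "content": "Quantum computing is..."},
]
req = build_chat_request("How does it differ from classical computing?", history)
print(req["action"])
```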

## Configuration

All configuration is via environment variables:

- `HF_TOKEN` - Hugging Face API token (required; get a free token at hf.co/settings/tokens)
- `HF_TEXT_MODEL` - Text model (default: `google/flan-t5-base`, 250M params, stable on the free tier)
- `HF_VISION_MODEL` - Vision model (default: `nlpconnect/vit-gpt2-image-captioning`)
- `HF_ASR_MODEL` - Audio model (default: `openai/whisper-base`)
- `HOST` - Server host (default: `127.0.0.1`)
- `PORT` - Server port (default: `8002`)

Note that the example responses above report larger Llama models; the model named in a response depends on what these variables are set to.
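
A plausible sketch of how `app/config.py` might read these variables, falling back to the documented defaults (the actual module may differ):

```python
import os

# Hypothetical config.py: each documented variable is read from the
# environment with the defaults listed above.
HF_TOKEN = os.environ.get("HF_TOKEN", "")  # required in practice
HF_TEXT_MODEL = os.environ.get("HF_TEXT_MODEL", "google/flan-t5-base")
HF_VISION_MODEL = os.environ.get(
    "HF_VISION_MODEL", "nlpconnect/vit-gpt2-image-captioning"
)
HF_ASR_MODEL = os.environ.get("HF_ASR_MODEL", "openai/whisper-base")
HOST = os.environ.get("HOST", "127.0.0.1")
PORT = int(os.environ.get("PORT", "8002"))
```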

## Error Handling

All errors return structured responses:

```json
{
  "ok": false,
  "status": "error",
  "error": {
    "code": "ENGINE_ERROR",
    "detail": "Human-readable explanation"
  }
}
```

No stack traces are exposed to clients.
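
Because success and failure share one envelope (`ok` plus either `result` or `error`), a caller can unwrap responses uniformly; a small sketch (`extract_answer` is a hypothetical helper, not part of the engine):

```python
def extract_answer(response: dict) -> str:
    """Return the answer, or raise using the engine's structured error."""
    if not response.get("ok"):
        err = response.get("error", {})
        raise RuntimeError(
            f"{err.get('code', 'UNKNOWN')}: {err.get('detail', '')}"
        )
    return response["result"]["answer"]

ok_resp = {"ok": True, "result": {"answer": "Quantum computing is..."}}
print(extract_answer(ok_resp))
```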

## Testing

Access the Swagger UI at http://localhost:8002/docs (adjust the port if you changed `PORT`).

## Known Limitations

1. **Free Tier Limits** - Uses the HF Serverless Inference API, which is rate-limited (roughly 1,000 requests/day for free accounts)
2. **Stateless** - No conversation memory; context must be provided in each request
3. **Model per modality** - Uses different models for text/vision/audio (not a unified multimodal model)
4. **No streaming** - Returns complete responses only
5. **Cold starts** - The first request to a model may take 10-30 seconds while the model loads
6. **Timeout** - 60-second timeout on HF API calls
7. **Audio format** - Audio must be accessible via URL or base64-encoded
8. **Video processing** - Videos are treated as images (single-frame analysis, not full video understanding)
9. **No retry logic** - Single API call attempt; failures return immediately
10. **No caching** - Every request hits the HF API (no response caching)
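
Given limitations 9 and 10, retries (and any caching) are the caller's responsibility; a minimal client-side retry sketch with exponential backoff:

```python
import time

def with_retries(call, attempts: int = 3, backoff: float = 0.5):
    """Retry a flaky engine call, backing off between attempts.

    The engine itself makes a single attempt per request, so transient
    failures (cold starts, rate limits) are best absorbed client-side.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(backoff * (2 ** attempt))

# Demo with a stand-in for an engine call that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(with_retries(flaky, backoff=0.1))  # prints "ok" after two failures
```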