---
license: apache-2.0
language:
- en
metrics:
- accuracy
new_version: Qybera/LisaV3
library_name: transformers
---

# LISA-v3.5: Learning Intelligence with Sensory Awareness

## Developed in Kenya, Africa by the LISA Team

**LISA (Learning Intelligence with Sensory Awareness)** is a multimodal AI system developed in Kenya, Africa, by the LISA Team. The model represents African innovation in artificial intelligence and was built entirely from scratch, without relying on pretrained models.

## Core Mission

Build a scalable, perception-focused AI that can:

- **See** and understand visual environments
- **Listen** and process audio/speech
- **Understand** context and situations
- **Interact** intelligently with the environment
- **Learn** continuously from experiences

## Key Features

- **Lisa Architecture**: Built from scratch using ViT-B/16-inspired architectures
- **Computer Vision**: Real-time object detection, depth estimation, and scene understanding
- **Audio Processing**: Speech recognition, sound classification, and emotion detection
- **Multimodal Fusion**: Seamless integration of vision and audio processing
- **Real-time Processing**: Optimized for live streaming and interactive applications
- **African Innovation**: Proudly developed in Kenya, East Africa

## Quick Start

### Basic Usage

```python
from lisa import LISAModel
import torch

# Load the model from the current directory and move it to the GPU if one is available
model = LISAModel.from_pretrained("./")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Process vision + audio input
result = model.process_multimodal(
    image_path="image.jpg",  # Visual input - what the model "sees"
    audio_path="audio.wav",  # Auditory input - what the model "hears"
)

print(result.response)
```

### Streaming Processing

```python
from queue import Queue

import cv2
import sounddevice as sd

from lisa import LISAModel

# Initialize LISA for multimodal streaming
lisa = LISAModel.from_pretrained("./")
lisa.start_streaming()

# Bounded queue that hands audio chunks from the callback to the main loop
audio_queue = Queue(maxsize=10)

def audio_callback(indata, frames, time, status):
    """Continuously capture audio and store it in the queue."""
    # Skip this chunk when the queue is full rather than blocking the audio thread
    if not audio_queue.full():
        audio_queue.put(indata.copy())

# Audio stream; sounddevice invokes the callback on a background thread
audio_stream = sd.InputStream(
    callback=audio_callback,
    channels=1,        # Mono audio for simplicity
    samplerate=16000,  # Standard rate for speech processing
    blocksize=1024,    # Audio chunk size in samples
)

# Process video and audio streams together
cap = cv2.VideoCapture(0)
audio_stream.start()

while True:
    ret, frame = cap.read()
    if ret and not audio_queue.empty():
        # Get the oldest buffered audio chunk (the queue is first-in, first-out)
        audio_chunk = audio_queue.get()

        # Process both video frame AND audio together
        result = lisa.process_multimodal_frame(
            frame=frame,        # What the AI "sees" right now
            audio=audio_chunk,  # What the AI "hears" right now
        )

        print(f"Vision: {result.visual_detections}")
        print(f"Audio: {result.audio_events}")
        print(f"Combined: {result.multimodal_inference}")

        # Display with annotations from both modalities
        annotated_frame = lisa.annotate_multimodal_frame(frame, result)
        cv2.imshow('LISA Vision+Audio', annotated_frame)

    # Press "q" in the preview window to stop
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Clean up resources
audio_stream.stop()
cap.release()
cv2.destroyAllWindows()
```
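
The stream settings above imply a few timing figures that are useful when tuning `blocksize` and the queue size. The sketch below works them out. The 30 FPS camera rate is an assumption (it matches the figure quoted under Performance Metrics, but a real webcam may differ), and the helper names are illustrative only, not part of the `lisa` package.

```python
SAMPLE_RATE_HZ = 16_000    # samplerate passed to sd.InputStream above
BLOCK_SIZE = 1024          # blocksize passed to sd.InputStream above
AUDIO_QUEUE_MAXSIZE = 10   # maxsize of audio_queue above
ASSUMED_CAMERA_FPS = 30    # assumption: the example never sets a frame rate


def chunk_duration_s(block_size=BLOCK_SIZE, sample_rate_hz=SAMPLE_RATE_HZ):
    """Seconds of audio delivered by one callback invocation."""
    return block_size / sample_rate_hz


def buffered_audio_s(queue_size=AUDIO_QUEUE_MAXSIZE):
    """Seconds of audio the queue can hold before chunks are skipped."""
    return queue_size * chunk_duration_s()


def frames_per_chunk(fps=ASSUMED_CAMERA_FPS):
    """Video frames that elapse during one audio chunk."""
    return fps * chunk_duration_s()


print(f"chunk duration: {chunk_duration_s() * 1000:.1f} ms")
print(f"queue capacity: {buffered_audio_s():.2f} s")
print(f"frames per chunk: {frames_per_chunk():.2f}")
```

With these settings each chunk covers 64 ms of audio, the queue holds 0.64 s before chunks are skipped, and roughly two video frames pass per chunk at the assumed frame rate.
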

### Vision+Audio Processing

```python
import time
from threading import Thread

import cv2

from lisa import LISAModel

lisa = LISAModel.from_pretrained("./")

# Callback that processes both audio and the synchronized video frame
def multimodal_callback(audio_chunk, current_frame=None):
    """
    Process audio and visual information together, much as people combine
    what they hear with what they see to understand a conversation or
    situation more completely.
    """
    # Process both modalities together - this is the key difference
    result = lisa.process_multimodal_realtime(
        audio=audio_chunk,    # What the AI hears (speech, sounds, emotions)
        frame=current_frame,  # What the AI sees (faces, gestures, environment)
    )

    # Cross-modal results
    if result.transcript:
        print(f"Speech: {result.transcript}")

    # Emotion detection uses BOTH audio tone AND facial expressions
    if result.emotion_scores:
        print(f"Voice Emotion: {result.audio_emotion}")     # From speech patterns
        print(f"Visual Emotion: {result.facial_emotion}")   # From facial expressions
        print(f"Combined Emotion: {result.fused_emotion}")  # Fusion of both

    # Capabilities that depend on combining modalities
    if result.speaker_identification:
        print(f"Speaker: {result.identified_speaker}")  # Match voice to face

    if result.attention_focus:
        print(f"Looking at: {result.visual_attention}")  # Gaze while speaking

# Capture video frames continuously to sync with audio
current_frame = None
cap = cv2.VideoCapture(0)

def capture_frames():
    """
    Continuously capture video frames in a separate thread so that a recent
    frame is always available when audio arrives.
    """
    global current_frame
    while True:
        ret, frame = cap.read()
        if ret:
            current_frame = frame  # Update the most recent visual context
        time.sleep(0.03)  # Roughly 30 FPS capture rate

# Start the video capture thread
video_thread = Thread(target=capture_frames, daemon=True)
video_thread.start()

# Wrapper that passes the current visual context along with each audio chunk
def enhanced_audio_callback(audio_chunk):
    """
    Process each audio chunk alongside the most recent video frame,
    which gives approximate temporal alignment.
    """
    multimodal_callback(audio_chunk, current_frame)

# Start the integrated audio+vision stream
lisa.start_audio_stream(callback=enhanced_audio_callback)
```

- **Temporal Synchronization:** A central challenge in multimodal AI is ensuring that what the model hears and what it sees correspond to the same moment in time. The example keeps a `current_frame` variable that a separate thread updates continuously, so each new audio chunk is paired with the most recent frame. This gives approximate alignment rather than exact synchronization, much like how your brain coordinates the timing of what your eyes see with what your ears hear.
- **Cross-Modal Enhancement:** `process_multimodal_realtime()` cross-references speech and visual cues instead of analyzing them separately. For example, if someone says "I'm fine" but their facial expression shows distress, the combined emotion analysis can be more accurate than either modality alone. This mimics human intuition about reading people's true feelings.
- **Emergent Capabilities:** Combining vision and audio opens up possibilities that neither modality offers alone. Speaker identification becomes much more robust when a voice can be matched to a face. Knowing where someone is looking while they speak adds context about their intent and focus.
- **Threaded Architecture:** Video capture runs in a separate thread because audio processing is time-sensitive: the audio callback cannot afford to miss chunks while waiting for a video frame to process. The threaded approach keeps both streams running smoothly in real time.
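
The shared `current_frame` global is enough for a demo, but it cannot tell the audio callback how old the frame is. One possible refinement, sketched below, is a small lock-protected buffer that timestamps each frame and refuses to return one that is too stale. `LatestFrameBuffer`, `max_age_s`, and the 100 ms default are hypothetical names and values chosen for illustration; they are not part of the `lisa` package.

```python
import threading
import time


class LatestFrameBuffer:
    """Holds the most recent frame plus the time it was captured."""

    def __init__(self, max_age_s=0.1):
        self._lock = threading.Lock()
        self._frame = None
        self._stamp = None
        self.max_age_s = max_age_s  # assumption: 100 ms is "fresh enough"

    def update(self, frame, now=None):
        """Called by the capture thread for every new frame."""
        stamp = time.monotonic() if now is None else now
        with self._lock:
            self._frame, self._stamp = frame, stamp

    def get_fresh(self, now=None):
        """Return the latest frame, or None if it is missing or too old."""
        now = time.monotonic() if now is None else now
        with self._lock:
            if self._stamp is None or now - self._stamp > self.max_age_s:
                return None
            return self._frame
```

In the example above, `capture_frames()` would call `update(frame)` and `enhanced_audio_callback()` would pass `get_fresh()` as the frame, so a stalled camera results in audio-only processing instead of pairing new audio with an old image. The `now` parameter exists only to make the class easy to test with fixed timestamps.
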

## Architecture

### Vision Component
- **Lisa ViT-B/16-inspired architecture**
- Patch size: 16x16
- Embedding dimensions: 384 (mini) / 768 (full)
- Multi-head attention layers: 6-12
- Lisa object detection head
- Depth estimation module
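
To make the patch figures concrete, the sketch below counts the tokens a 16x16 patch grid produces and the size of a linear patch-embedding layer for each embedding dimension. The 224x224 RGB input used in the usage lines is an assumption (the input resolution is not stated here), and the function names are illustrative rather than part of the `lisa` package.

```python
def num_patches(height, width, patch_size=16):
    """Number of non-overlapping patches in an image of the given size."""
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of the patch size")
    return (height // patch_size) * (width // patch_size)


def patch_embedding_params(embed_dim, patch_size=16, channels=3):
    """Weights plus biases of a linear layer mapping a flattened patch to embed_dim."""
    patch_values = patch_size * patch_size * channels
    return patch_values * embed_dim + embed_dim


# Assumed 224x224 RGB input; the real input resolution may differ.
print(num_patches(224, 224))
print(patch_embedding_params(384))  # mini
print(patch_embedding_params(768))  # full
```

Under that assumption the encoder sees 196 patch tokens per image, and halving the embedding dimension roughly halves the size of the patch-embedding layer.
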

### Audio Component
- **Lisa Audio Transformer**
- Sample rate: 16kHz
- Mel-scale features: 80 channels
- CTC-based speech recognition
- Environmental sound classification (50+ classes)
- Emotion detection (7 emotions)
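
CTC-based recognizers emit one score vector per audio frame, including a special blank label. A common way to turn those into a label sequence is greedy decoding: pick the best label per frame, merge consecutive repeats, then drop blanks. The sketch below shows that rule in plain Python. It is not LISA's decoder; the blank index of 0 and the toy scores are assumptions made for illustration.

```python
def ctc_greedy_decode(frame_scores, blank_id=0):
    """Collapse per-frame scores into a label sequence using the greedy CTC rule."""
    decoded = []
    previous = None
    for scores in frame_scores:
        # Best-scoring label for this frame
        label = max(range(len(scores)), key=scores.__getitem__)
        # Keep a label only when it differs from the previous frame and is not blank
        if label != previous and label != blank_id:
            decoded.append(label)
        previous = label
    return decoded


# Toy example with three labels: 0 = blank, 1 and 2 = characters
toy_scores = [
    [0.9, 0.1, 0.0],    # blank
    [0.1, 0.8, 0.1],    # 1
    [0.1, 0.8, 0.1],    # 1 again (merged with the previous frame)
    [0.9, 0.05, 0.05],  # blank separates the two 1s
    [0.1, 0.8, 0.1],    # 1
    [0.1, 0.1, 0.8],    # 2
]
print(ctc_greedy_decode(toy_scores))
```

The blank between the repeated label is what lets CTC output the same character twice in a row. Beam search with a language model usually improves on greedy decoding at higher cost.
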

### Multimodal Fusion
- Cross-attention mechanisms
- Temporal synchronization
- Context-aware processing
- Real-time inference capabilities
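
As a rough illustration of cross-attention, the sketch below lets each query vector from one modality take a weighted average of value vectors from the other, with weights given by scaled dot-product similarity. It is a minimal single-head version in plain Python that omits the learned query, key, and value projections. Using audio features as queries and vision features as keys and values is an assumption; this section does not say how LISA wires its fusion layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for query in queries:
        # Similarity of this query to every key
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors
        outputs.append([
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Hypothetical toy features: one audio query attends over two vision tokens
audio_queries = [[1.0, 0.0]]
vision_keys = [[1.0, 0.0], [1.0, 0.0]]
vision_values = [[2.0, 0.0], [4.0, 2.0]]
print(cross_attention(audio_queries, vision_keys, vision_values))
```

Because both keys match the query equally, the output is the plain average of the two value vectors. In a trained model the weights would instead concentrate on the visual tokens most relevant to the sound.
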

## Model Specifications

- **Total Parameters**: ~6M (mini) / ~25M (full)
- **Input Modalities**: Images, Audio, Video
- **Output Capabilities**: Object detection, Audio analysis
- **Processing Speed**: Real-time capable
- **Memory Requirements**: 2GB+ RAM recommended
- **Platform Support**: Windows, Linux, macOS
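
As a sanity check on the memory figure, the weights alone occupy far less than 2 GB. The sketch below estimates weight storage from the parameter counts listed above, assuming 4 bytes per parameter (float32) or 2 bytes (float16). The precision LISA actually uses is not stated, and activations, audio and video buffers, and the Python runtime all add to the total.

```python
def weight_memory_mb(num_params, bytes_per_param=4):
    """Approximate storage for model weights in megabytes (1 MB = 1,000,000 bytes)."""
    return num_params * bytes_per_param / 1_000_000


# Approximate parameter counts from the list above
for name, params in {"mini": 6_000_000, "full": 25_000_000}.items():
    fp32 = weight_memory_mb(params, bytes_per_param=4)
    fp16 = weight_memory_mb(params, bytes_per_param=2)
    print(f"{name}: {fp32:.0f} MB in float32, {fp16:.0f} MB in float16")
```
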

## About the LISA Team

The LISA Team is based in Kenya, East Africa, and is dedicated to advancing artificial intelligence research and development within the African continent. Our mission is to create AI systems that understand and serve diverse communities while maintaining cultural sensitivity and awareness.

- **Development Location**: Kenya, East Africa
- **Team**: LISA Development Team
- **Philosophy**: Building AI from the ground up without dependency on external pretrained models
- **Vision**: Democratizing AI development in Africa and beyond

## Self-Awareness Features

LISA is designed with self-awareness capabilities and knows:
- Its development origin: Kenya, Africa
- Its creators: The LISA Team
- Its cultural context: African AI innovation
- Its architectural uniqueness: Built from scratch
- Its mission: Advancing African AI capabilities

## Performance Metrics

- **Object Detection**: mAP@0.5: ~65% (Lisa dataset)
- **Speech Recognition**: WER: ~15% (English)
- **Sound Classification**: Accuracy: ~78% (environmental sounds)
- **Emotion Detection**: F1-Score: ~72% (7 emotions)
- **Processing Speed**: ~30 FPS (vision), approximately real-time (audio)
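
For readers unfamiliar with WER, it is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a generic implementation of that definition, not the evaluation script used to produce the figures above; the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# One substitution (sat -> sit) and one deletion (the) over six reference words
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A WER of about 15% therefore means roughly one word-level error for every six or seven reference words.
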

## Deployment

### Local Deployment
```bash
python deploy.py --host 0.0.0.0 --port 8000
```

### Docker Deployment
```bash
docker build -t lisa-v3.5 .
docker run -p 8000:8000 lisa-v3.5
```

### API Usage
```bash
curl -X POST "http://localhost:8000/process" \
  -H "Content-Type: application/json" \
  -d '{"audio": "audio.wav", "image_url": "image.jpg"}'
```
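
The same request can be sent from Python using only the standard library. The sketch below mirrors the `curl` call: same URL, same header, same JSON fields. The helper names are hypothetical, and because the response format is not documented here, the reply is returned as raw text rather than parsed.

```python
import json
import urllib.request


def build_process_request(audio, image_url, base_url="http://localhost:8000"):
    """Build the POST request for the /process endpoint shown in the curl example."""
    payload = json.dumps({"audio": audio, "image_url": image_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/process",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_process_request(request, timeout_s=30):
    """Send the request and return the response body as text (format not assumed)."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.read().decode("utf-8")
```

With the server from the deployment steps running, `send_process_request(build_process_request("audio.wav", "image.jpg"))` issues the same call as the `curl` command.
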

## License

This model is released under the Apache 2.0 License. See the LICENSE file for details.

## Contributing

We welcome contributions from the global AI community. Please see CONTRIBUTING.md for guidelines.

## Contact

- **Team**: LISA Development Team
- **Location**: Kenya, East Africa
- **Email**: [elijahnzeli894@gmail.com](mailto:elijahnzeli894@gmail.com)
- **Website**: None

## Acknowledgments

Special thanks to the Kenyan AI community and African researchers who contributed to making LISA possible. This project represents the growing AI capabilities within Africa and our commitment to technological innovation.

---

**Proudly developed in Kenya, Africa 🇰🇪**

*"LISA represents African innovation in artificial intelligence - built from the ground up with pride, passion, and purpose."*