|
|
--- |
|
|
language: |
|
|
- en |
|
|
tags: |
|
|
- multimodal |
|
|
- vision |
|
|
- audio |
|
|
- multispectral |
|
|
- emotion-recognition |
|
|
- scene-understanding |
|
|
- object-detection |
|
|
- spatial-reasoning |
|
|
- conversational-ai |
|
|
widget: |
|
|
- example_title: Vision+Audio Analysis Sample 1 |
|
|
src: https://cdn-media.huggingface.co/speech_samples/sample1.flac |
|
|
- example_title: Vision+Audio Analysis Sample 2 |
|
|
src: https://cdn-media.huggingface.co/speech_samples/sample2.flac |
|
|
pipeline_tag: reinforcement-learning |
|
|
license: apache-2.0 |
|
|
base_model: |
|
|
- Qybera/LisaV3 |
|
|
datasets: |
|
|
- Qybera/pkl-video-audio |
|
|
metrics: |
|
|
- accuracy |
|
|
--- |
|
|
|
|
|
# AdvancedLISA - Multimodal Vision+Audio AI |
|
|
|
|
|
## Model Description |
|
|
|
|
|
AdvancedLISA is a multimodal AI model that combines vision and audio processing with transformer-based reasoning. It provides comprehensive scene understanding, emotion recognition from audio, and cross-modal analysis.
|
|
|
|
|
### Key Capabilities |
|
|
|
|
|
- **Multispectral Vision Processing**: Processes 5-channel vision input (RGB + multispectral) with spatial reasoning |
|
|
- **Advanced Audio Analysis**: Comprehensive audio understanding including emotion, speaker, and content analysis |
|
|
- **Multimodal Fusion**: Cross-modal attention between vision and audio modalities |
|
|
- **Reasoning Module**: Transformer-based reasoning with sequence-to-sequence understanding |
|
|
- **Emotion Recognition**: Real-time emotion detection from audio input |
|
|
- **Spatial Understanding**: 3D spatial reasoning and object detection |
|
|
- **Conversation Memory**: Persistent memory across interaction sequences |
|
|
- **Voice Synthesis**: Independent voice generation capabilities |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Model Type**: AdvancedLISA |
|
|
- **Architecture**: Vision+Audio Fusion with Reasoning |
|
|
- **Parameters**: 190,809,376 (191M) |
|
|
- **Trainable Parameters**: 190,809,376 |
|
|
- **Input Modalities**: |
|
|
- Vision: 5-channel multispectral images (224×224) |
|
|
- Audio: Mel spectrograms (80 bins × 200 time steps) |
|
|
- **Sequence Length**: 30 frames/steps |
|
|
- **Device**: CPU/GPU compatible |
|
|
- **Framework**: PyTorch |
|
|
|
|
|
## Architecture Components |
|
|
|
|
|
| Component | Type | Parameters | Function | |
|
|
|-----------|------|------------|----------| |
|
|
| **Vision Encoder** | MultispectralVisionEncoder | 15,544,195 | Multispectral image processing + 3D spatial reasoning | |
|
|
| **Audio Encoder** | AdvancedAudioEncoder | 29,479,243 | Audio analysis + emotion/speaker detection | |
|
|
| **Fusion Module** | AdvancedFusionModule | 16,803,334 | Cross-modal attention and feature fusion | |
|
|
| **Reasoning Module** | ReasoningModule | 68,231,168 | Transformer-based sequence reasoning | |
|
|
| **Voice Synthesis** | IndependentVoiceSynthesis | 8,061,965 | Voice generation capabilities | |
|
|
| **Self Awareness** | SelfAwarenessModule | 22,579,201 | Identity and context awareness | |
|
|
| **Conversation Memory** | ConversationMemory | 6,823,937 | Persistent dialogue memory | |
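
As a rough sanity check, the per-component counts above can be reproduced by summing parameter sizes over the model's top-level submodules. The snippet below is a minimal sketch; it assumes the `model` object created in the Usage section further down.

```python
# Sketch: per-submodule parameter counts (uses the `model` object from the Usage section).
def count_parameters(module):
    return sum(p.numel() for p in module.parameters())

print(f"Total parameters: {count_parameters(model):,}")  # expected ~190,809,376

for name, submodule in model.named_children():
    print(f"{name}: {count_parameters(submodule):,} parameters")
```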
|
|
|
|
|
## Model Outputs |
|
|
|
|
|
The model returns a comprehensive output dictionary: |
|
|
|
|
|
```python |
|
|
{ |
|
|
'vision_analysis': { |
|
|
'features': [batch, 30, 512], # Core vision features |
|
|
'spatial_3d': [batch, 30, 6], # 3D spatial understanding |
|
|
'scene': [batch, 30, 1000], # Scene classification |
|
|
'objects': [batch, 30, 80], # Object detection |
|
|
'motion': [batch, 30, 4] # Motion analysis |
|
|
}, |
|
|
'audio_analysis': { |
|
|
'features': [batch, 30, 1024], # Core audio features |
|
|
'spatial': [batch, 30, 4], # Spatial audio |
|
|
'emotion': [batch, 30, 7], # Emotion classification |
|
|
'speaker': [batch, 30, 256], # Speaker characteristics |
|
|
'content': [batch, 30, 128] # Content analysis |
|
|
}, |
|
|
'reasoning': [batch, 30, 1024], # Fused reasoning output |
|
|
'timestamp': float, # Processing timestamp |
|
|
'rl_action': dict # Reinforcement learning actions |
|
|
} |
|
|
``` |
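
For example, the per-frame emotion logits can be reduced to one predicted class per frame with an argmax, using the `output` dictionary produced in the Usage section below. The mapping from the 7 emotion indices to labels is not documented in this card, so the label list in the sketch is purely an illustrative assumption.

```python
# Hypothetical label order -- the actual mapping of the 7 emotion classes is not documented.
EMOTION_LABELS = ["neutral", "happy", "sad", "angry", "fearful", "disgusted", "surprised"]

emotion_logits = output['audio_analysis']['emotion']   # [batch, 30, 7]
emotion_ids = emotion_logits.argmax(dim=-1)            # [batch, 30]
per_frame = [EMOTION_LABELS[i] for i in emotion_ids[0].tolist()]
print(per_frame[:5])  # predicted emotion for the first five steps of sequence 0
```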
|
|
|
|
|
## Performance |
|
|
|
|
|
- **Inference Time**: ~17.4s per sequence (CPU) |
|
|
- **Throughput**: ~0.06 sequences/second (CPU) |
|
|
- **Model Size**: ~191M parameters (≈0.76 GB of FP32 weights)
|
|
- **Input Resolution**: 224×224 images, 80-bin mel spectrograms |
|
|
- **Sequence Length**: Fixed at 30 frames |
|
|
|
|
|
*Note: GPU inference will be significantly faster* |
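
The CPU numbers above can be reproduced with a simple wall-clock measurement. The sketch below assumes the model and dummy inputs from the Usage section and includes a warm-up pass so one-time initialization does not skew the timing.

```python
import time
import torch

with torch.no_grad():
    model(vision_input, audio_input)        # warm-up pass

start = time.perf_counter()
with torch.no_grad():
    model(vision_input, audio_input)        # timed pass over one 30-step sequence
elapsed = time.perf_counter() - start
print(f"Inference time: {elapsed:.1f}s (~{1 / elapsed:.2f} sequences/second)")
```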
|
|
|
|
|
## Usage |
|
|
|
|
|
### Basic Inference |
|
|
|
|
|
```python |
|
|
import torch |
|
|
import json |
|
|
|
|
|
|
|
# Load model configuration |
|
|
config_path = "Qybera/LisaV3.0/config.json" |
|
|
with open(config_path, 'r') as f: |
|
|
config = json.load(f) |
|
|
|
|
|
# Import and create model (requires lisa_model.py) |
|
|
from lisa_model import create_lisa_model |
|
|
|
|
|
model_config = { |
|
|
'model_config': { |
|
|
'vision_channels': 5, # Multispectral input |
|
|
'audio_channels': 1, |
|
|
'vision_hidden': 512, |
|
|
'audio_hidden': 512, |
|
|
'fused_dim': 1024, |
|
|
'voice_hidden': 512, |
|
|
'vision_layers': 4, |
|
|
'audio_layers': 4, |
|
|
'reasoning_layers': 8, |
|
|
'mel_bins': 80, |
|
|
'max_memory': 50 |
|
|
}, |
|
|
'data_config': { |
|
|
'frame_size': [224, 224], |
|
|
'seq_len': 30, |
|
|
'n_mels': 80 |
|
|
} |
|
|
} |
|
|
|
|
|
# Create and load model |
|
|
model, device = create_lisa_model(model_config) |
|
|
|
|
|
# Load trained weights |
|
|
state_dict = torch.load("Qybera/LisaV3.0/pytorch_model.bin", map_location=device) |
|
|
model.load_state_dict(state_dict) |
|
|
model.eval() |
|
|
|
|
|
# Prepare inputs (must be exactly sequence length 30) |
|
|
vision_input = torch.randn(1, 30, 5, 224, 224).to(device) # 5-channel multispectral |
|
|
audio_input = torch.randn(1, 30, 1, 80, 200).to(device) # Mel spectrograms |
|
|
|
|
|
# Generate comprehensive analysis |
|
|
with torch.no_grad(): |
|
|
output = model(vision_input, audio_input) |
|
|
|
|
|
# Access different analysis components |
|
|
vision_features = output['vision_analysis']['features'] # [1, 30, 512] |
|
|
audio_emotions = output['audio_analysis']['emotion'] # [1, 30, 7] |
|
|
reasoning_output = output['reasoning'] # [1, 30, 1024] |
|
|
|
|
|
print(f"Vision features: {vision_features.shape}") |
|
|
print(f"Detected emotions: {audio_emotions.shape}") |
|
|
print(f"Reasoning output: {reasoning_output.shape}") |
|
|
``` |
|
|
|
|
|
### Batch Processing |
|
|
|
|
|
```python |
|
|
# Process multiple sequences |
|
|
batch_size = 2 |
|
|
vision_batch = torch.randn(batch_size, 30, 5, 224, 224).to(device) |
|
|
audio_batch = torch.randn(batch_size, 30, 1, 80, 200).to(device) |
|
|
|
|
|
with torch.no_grad(): |
|
|
batch_output = model(vision_batch, audio_batch) |
|
|
|
|
|
print(f"Batch processing: {batch_size} sequences") |
|
|
print(f"Batch reasoning output: {batch_output['reasoning'].shape}") |
|
|
``` |
|
|
|
|
|
### Individual Component Access |
|
|
|
|
|
```python |
|
|
# Access individual model components |
|
|
vision_encoder = model.vision_encoder |
|
|
audio_encoder = model.audio_encoder |
|
|
reasoning_module = model.reasoning_module |
|
|
|
|
|
# Use vision encoder separately |
|
|
vision_analysis = vision_encoder(vision_input) |
|
|
print("Vision analysis keys:", list(vision_analysis.keys())) |
|
|
|
|
|
# Use audio encoder separately |
|
|
audio_analysis = audio_encoder(audio_input) |
|
|
print("Audio analysis keys:", list(audio_analysis.keys())) |
|
|
``` |
|
|
|
|
|
## Input Requirements |
|
|
|
|
|
⚠️ **Important**: The model expects **exactly 30 frames/steps** per sequence due to memory constraints. |
|
|
|
|
|
- **Vision Input**: `[batch_size, 30, 5, 224, 224]` - 5-channel multispectral images |
|
|
- **Audio Input**: `[batch_size, 30, 1, 80, 200]` - Mel spectrograms with 80 frequency bins (see the preprocessing sketch after this list)
|
|
- **Batch Size**: Flexible (tested up to batch_size=2) |
|
|
- **Sequence Length**: **Fixed at 30** (longer sequences will cause errors) |
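
To illustrate how raw audio might be brought into the expected `[batch_size, 30, 1, 80, 200]` shape, the sketch below uses `torchaudio`. The sample rate, FFT/hop settings, and the strategy of padding the spectrogram and splitting it into 30 fixed-width chunks are assumptions; the model's actual preprocessing pipeline is not documented here.

```python
import torch
import torchaudio

# Assumed preprocessing parameters -- not documented for this model.
SAMPLE_RATE = 16000
N_MELS = 80
FRAMES_PER_STEP = 200
SEQ_LEN = 30

waveform, sr = torchaudio.load("example.wav")                      # placeholder path, [channels, samples]
waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)
waveform = waveform.mean(dim=0, keepdim=True)                      # mix down to mono

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_mels=N_MELS, n_fft=400, hop_length=160
)(waveform)                                                        # [1, 80, time]

# Pad (or trim) the time axis to exactly 30 * 200 frames, then split into 30 steps.
needed = SEQ_LEN * FRAMES_PER_STEP
mel = torch.nn.functional.pad(mel, (0, max(0, needed - mel.shape[-1])))[..., :needed]
audio_input = (
    mel.reshape(1, N_MELS, SEQ_LEN, FRAMES_PER_STEP)               # [1, 80, 30, 200]
    .permute(2, 0, 1, 3)                                           # [30, 1, 80, 200]
    .unsqueeze(0)                                                  # [1, 30, 1, 80, 200]
)
print(audio_input.shape)  # torch.Size([1, 30, 1, 80, 200])
```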
|
|
|
|
|
## Training Information |
|
|
|
|
|
- **Framework**: PyTorch |
|
|
- **Final Training Loss**: 0.611 |
|
|
- **Final Validation Loss**: 0.639 |
|
|
- **Training Epochs**: 50 |
|
|
- **Learning Rate**: 2.14e-05 (with scheduling) |
|
|
- **Optimizer**: AdamW |
|
|
- **Dataset**: YouTube videos with multimodal (video + audio) preprocessing (`Qybera/pkl-video-audio`)
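
A minimal optimizer setup matching the reported hyperparameters might look like the sketch below. Only AdamW, the learning rate, and the 50-epoch schedule are reported; the cosine-annealing scheduler and the weight decay value are assumptions, and the reported rate may be the scheduled value rather than the initial one.

```python
import torch

# AdamW with the reported learning rate; scheduler type and weight decay are assumptions.
optimizer = torch.optim.AdamW(model.parameters(), lr=2.14e-5, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

for epoch in range(50):
    # ... one training epoch over the multimodal dataset ...
    scheduler.step()
```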
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Fixed Sequence Length**: Must use exactly 30 frames per sequence |
|
|
- **Memory Constraints**: Cannot handle variable sequence lengths due to conversation memory implementation |
|
|
- **CPU Performance**: ~17s per sequence (GPU recommended for real-time use)
|
|
- **Input Format**: Requires specific multispectral (5-channel) vision input |
|
|
|
|
|
## Applications |
|
|
|
|
|
- **Multimodal Scene Analysis**: Comprehensive understanding of visual scenes with audio context |
|
|
- **Emotion Recognition**: Real-time emotion detection from audio input |
|
|
- **Content Analysis**: Understanding of both visual and audio content |
|
|
- **Spatial Reasoning**: 3D spatial understanding and object detection |
|
|
- **Interactive AI**: Conversation memory enables contextual interactions |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{advancedlisa2025,
|
|
title={AdvancedLISA: Multimodal Vision+Audio AI with Advanced Reasoning}, |
|
|
author={LISA Development Team}, |
|
|
year={2025}, |
|
|
url={https://github.com/elijahnzeli1/LISA3D},
note={Private repository}
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
Apache-2.0 License - see LICENSE file for details |
|
|
|
|
|
--- |
|
|
*Model card updated based on comprehensive testing - September 2025* |