# FastVLM-7B Screen Observer
A local web application for real-time screen observation and analysis using Apple's FastVLM-7B model via HuggingFace.
## Features
- Real-time Screen Capture: Capture and analyze screen content on-demand or automatically
- FastVLM-7B Integration: Uses Apple's vision-language model for intelligent screen analysis
- UI Element Detection: Identifies buttons, links, forms, and other interface elements
- Text Extraction: Captures text snippets from the screen
- Risk Detection: Flags potential security or privacy concerns
- Automation Demo: Demonstrates browser automation capabilities
- NDJSON Logging: Comprehensive logging in NDJSON format with timestamps
- Export Functionality: Download logs and captured frames as ZIP archive
## Specifications

- Frontend: React + Vite on `http://localhost:5173`
- Backend: FastAPI on `http://localhost:8000`
- Model: Apple FastVLM-7B with `trust_remote_code=True`
- Image Token: `IMAGE_TOKEN_INDEX = -200`
- Output Format: JSON with `summary`, `ui_elements`, `text_snippets`, and `risk_flags` (example below)
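For reference, an analysis result has roughly the shape sketched below. All values, and the nested structure of the `ui_elements` entries, are illustrative placeholders rather than real model output.

```python
# Illustrative shape of an analysis result; values are made up, not real output.
example_analysis = {
    "summary": "A code editor is open with a Python file and a terminal below it.",
    "ui_elements": [
        {"type": "button", "label": "Run"},          # structure of entries is an assumption
        {"type": "link", "label": "Documentation"},
    ],
    "text_snippets": ["def main():", "npm run dev"],
    "risk_flags": [],  # e.g. a flag when sensitive content is visible on screen
}
```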
## Prerequisites
- Python 3.8+
- Node.js 16+
- Chrome/Chromium browser (for automation demo)
- 14GB+ RAM (required for FastVLM-7B model weights)
- CUDA-capable GPU or Apple Silicon (recommended for FastVLM-7B)
## Installation

- Clone this repository:

```bash
cd fastvlm-screen-observer
```

- Install Python dependencies:

```bash
cd backend
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

- Install Node.js dependencies:

```bash
cd ../frontend
npm install
```
## Running the Application

### Option 1: Using the start script (Recommended)

```bash
./start.sh
```

### Option 2: Manual start

Terminal 1 - Backend:

```bash
cd backend
source venv/bin/activate
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```

Terminal 2 - Frontend:

```bash
cd frontend
npm run dev
```
## Usage

- Open your browser and navigate to `http://localhost:5173`
- Click "Capture Screen" to analyze the current screen
- Enable "Auto Capture" for continuous monitoring
- Use "Run Demo" to see browser automation in action
- Click "Export Logs" to download analysis data
## API Endpoints

- `GET /` - API status check
- `POST /analyze` - Capture and analyze screen
- `POST /demo` - Run automation demo
- `GET /export` - Export logs as ZIP
- `GET /logs/stream` - Stream logs via SSE
- `GET /docs` - Interactive API documentation
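The endpoints can also be exercised directly once the backend is running. The sketch below uses the `requests` library and calls `/analyze` with an empty body; whether a request body is required, and the exact response fields, depend on the backend implementation and follow the JSON shape assumed under Specifications.

```python
# Minimal sketch: call the backend API directly (assumes it is running on port 8000).
import requests

BASE_URL = "http://localhost:8000"

print(requests.get(f"{BASE_URL}/").json())  # API status check

analysis = requests.post(f"{BASE_URL}/analyze").json()
print(analysis.get("summary"))
print(analysis.get("risk_flags"))

# Save the exported logs and frames as a ZIP archive.
with open("export.zip", "wb") as f:
    f.write(requests.get(f"{BASE_URL}/export").content)
```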
## Project Structure

```
fastvlm-screen-observer/
├── backend/
│   ├── app/
│   │   └── main.py                  # FastAPI application
│   ├── models/
│   │   ├── fastvlm_model.py         # FastVLM-7B main integration
│   │   ├── fastvlm_optimized.py     # Memory optimization strategies
│   │   ├── fastvlm_extreme.py       # Extreme optimization (4-bit)
│   │   └── use_fastvlm_small.py     # Alternative 1.5B model
│   ├── utils/
│   │   ├── screen_capture.py        # Screen capture utilities
│   │   ├── automation.py            # Browser automation
│   │   └── logger.py                # NDJSON logging
│   └── requirements.txt
├── frontend/
│   ├── src/
│   │   ├── App.jsx                  # React main component (with error handling)
│   │   ├── ScreenCapture.jsx        # WebRTC screen capture
│   │   └── App.css                  # Styling
│   ├── package.json
│   └── vite.config.js
├── logs/                            # Generated logs and frames
├── start.sh                         # Startup script
└── README.md
```
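The NDJSON log referenced above (`utils/logger.py`) is plain newline-delimited JSON: one object per line, each carrying a timestamp. The sketch below shows the general idea; the field names (`timestamp`, `event`, `data`) and the log file name are illustrative choices, not necessarily what the real logger uses.

```python
# Minimal NDJSON logging sketch: append one JSON object per line with a timestamp.
# Field names and file path are illustrative, not taken from logger.py.
import json
from datetime import datetime, timezone
from pathlib import Path

LOG_FILE = Path("logs/observations.ndjson")

def log_event(event: str, data: dict) -> None:
    LOG_FILE.parent.mkdir(parents=True, exist_ok=True)
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": event,
        "data": data,
    }
    with LOG_FILE.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

log_event("analysis", {"summary": "example", "risk_flags": []})
```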
## Model Notes

The application uses Apple's FastVLM-7B model with the following specifications:

- Model ID: `apple/FastVLM-7B` from HuggingFace
- Tokenizer: Qwen2Tokenizer (requires `transformers>=4.40.0`)
- IMAGE_TOKEN_INDEX: -200 (special token for image placeholders)
- trust_remote_code: True (required for model loading)
Memory Requirements:
- Minimum: 14GB RAM for model weights
- Recommended: 16GB+ RAM for smooth operation
- The model will download automatically on first run (~14GB)
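If you want to avoid the download on first launch, the weights can be fetched ahead of time into the local HuggingFace cache using the standard `huggingface_hub` API:

```python
# Optional: pre-download the ~14GB of model weights into the local HuggingFace cache
# so the application's first run does not have to wait for the download.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="apple/FastVLM-7B")
```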
Current Implementation:
The system includes multiple optimization strategies:
- Standard Mode: Full precision (float16) - requires 14GB+ RAM
- Optimized Mode: 8-bit quantization - requires 8-10GB RAM
- Extreme Mode: 4-bit quantization with disk offloading - requires 6-8GB RAM
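The optimized and extreme modes rely on quantized loading. The sketch below shows the general approach with `transformers` and `bitsandbytes` (which requires a CUDA GPU); the actual integration lives in `backend/models/`, and the use of `AutoModelForCausalLM` here is an assumption, since FastVLM's remote code defines the real model class.

```python
# Sketch of quantized loading (assumes bitsandbytes is installed and a CUDA GPU is present).
# This is illustrative only; the project's real loading code is in backend/models/.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "apple/FastVLM-7B"
IMAGE_TOKEN_INDEX = -200  # special placeholder token id used when building prompts

quant_config = BitsAndBytesConfig(load_in_8bit=True)    # "Optimized Mode"
# quant_config = BitsAndBytesConfig(load_in_4bit=True)  # "Extreme Mode"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    quantization_config=quant_config,
    device_map="auto",  # lets accelerate place layers and offload if needed
    trust_remote_code=True,
)
```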
If the model fails to load due to memory constraints, the application will:
- Display a user-friendly error message
- Continue operating with graceful error handling
- NOT show "ANALYSIS_ERROR" in risk flags
## Acceptance Criteria

- ✅ Local web app running on localhost:5173
- ✅ FastAPI backend on localhost:8000
- ✅ FastVLM-7B integration with trust_remote_code=True
- ✅ IMAGE_TOKEN_INDEX = -200 configured
- ✅ JSON output format with required fields
- ✅ Demo automation functionality
- ✅ NDJSON logging with timestamps
- ✅ ZIP export with logs and frames
- ✅ Project structure matches specifications
## Troubleshooting
- Model Loading Issues: Check GPU memory and CUDA installation
- Screen Capture Errors: Ensure proper display permissions
- Browser Automation: Install Chrome/Chromium and check WebDriver
- Port Conflicts: Ensure ports 5173 and 8000 are available
## License
MIT