# OMIRL Website Discovery and Web Scraping Implementation
## Overview
This document summarizes our discovery process for the OMIRL (Osservatorio Meteorologico Idro-Radar Liguria) website and how we adapted our web scraping service to extract weather station data for emergency management operations.
## Discovery Process
### Target Website
- **Base URL**: `https://omirl.regione.liguria.it`
- **Target Page**: `https://omirl.regione.liguria.it/#/sensorstable`
- **Purpose**: Extract weather station sensor data for Liguria region emergency management
### Discovery Methodology
Our discovery process involved:
1. **Direct Navigation Testing** - Systematically testing different URL patterns
2. **Table Structure Analysis** - Identifying data tables and their structure
3. **Filter Control Discovery** - Understanding available filtering mechanisms
4. **Content Validation** - Verifying data relevance for emergency operations
## Key Discoveries
### 1. Correct Navigation Path
After testing multiple URL patterns, we identified the correct endpoint:
```
✅ CORRECT: /#/sensorstable
❌ TRIED: /#/summarytable, /#/valori_stazioni, /#/tabelle/valori, etc.
```
### 2. Table Structure Discovery
#### Primary Data Table Headers
```json
{
  "actual_headers": [
    "Nome",         // Station Name
    "Codice",       // Station Code
    "Comune",       // Municipality
    "Provincia",    // Province
    "Area",         // Area Classification
    "Bacino",       // River Basin
    "Sottobacino",  // Sub-basin
    "ultimo",       // Latest Reading
    "Max",          // Maximum Value
    "Min",          // Minimum Value
    "UM"            // Unit of Measurement
  ]
}
```
#### Data Characteristics
- **Expected Station Count**: ~206 weather stations
- **Geographic Coverage**: Liguria region (GE, SV, IM, SP provinces)
- **Numeric Data Columns**: `ultimo`, `Max`, `Min`
- **Units Column**: `UM` (typically "mm" for precipitation)
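The column roles above suggest a simple normalization step before the data is used. The following is a minimal sketch, assuming each scraped row arrives as a dict of strings keyed by the table headers. The names `parse_reading` and `normalize_row`, and the tolerance for decimal commas, are illustrative assumptions and not part of the documented service.

```python
from typing import Any, Dict, Optional

# Columns the document identifies as numeric.
NUMERIC_COLUMNS = ("ultimo", "Max", "Min")


def parse_reading(text: str) -> Optional[float]:
    """Convert a cell to float; return None for blank or non-numeric cells."""
    # Assumption: some cells may use a decimal comma, so accept both forms.
    cleaned = text.strip().replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


def normalize_row(row: Dict[str, str]) -> Dict[str, Any]:
    """Strip whitespace from every cell and cast the numeric columns."""
    normalized: Dict[str, Any] = {key: value.strip() for key, value in row.items()}
    for column in NUMERIC_COLUMNS:
        if column in row:
            normalized[column] = parse_reading(row[column])
    return normalized
```

Returning `None` for unparseable cells keeps a missing reading distinguishable from a genuine `0.0`, which matters for precipitation data.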
### 3. Sensor Type Filtering
#### Available Sensor Types
```json
{
  "actual_sensor_types": [
    {"index": 0, "name": "Precipitazione", "value": "0"},
    {"index": 1, "name": "Temperatura", "value": "1"},
    {"index": 2, "name": "Livelli Idrometrici", "value": "2"},
    {"index": 3, "name": "Vento", "value": "3"},
    {"index": 4, "name": "Umidità dell'aria", "value": "4"},
    {"index": 5, "name": "Eliofanie", "value": "5"},
    {"index": 6, "name": "Radiazione Solare", "value": "6"},
    {"index": 7, "name": "Bagnatura Fogliare", "value": "7"},
    {"index": 8, "name": "Pressione Atmosferica", "value": "8"},
    {"index": 9, "name": "Tensione Batteria", "value": "9"}
  ]
}
```
#### Filter Implementation
- **Selector**: `select#stationType`
- **Filter Method**: JavaScript-based dropdown selection
- **Dynamic Loading**: Table updates via AJAX after filter selection
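The filter flow above (select an option, then wait for the AJAX refresh) can be sketched as follows. This is an illustration under assumptions: `apply_sensor_filter` is a hypothetical helper, `page` is expected to behave like a Playwright async `Page`, and the name-to-value mapping is copied from the sensor type list in this document rather than read from the live site.

```python
import asyncio
from typing import Any

SENSOR_SELECTOR = "select#stationType"

# Dropdown values as listed in this document (not re-verified against the site).
SENSOR_VALUES = {
    "Precipitazione": "0",
    "Temperatura": "1",
    "Livelli Idrometrici": "2",
    "Vento": "3",
    "Umidità dell'aria": "4",
    "Eliofanie": "5",
    "Radiazione Solare": "6",
    "Bagnatura Fogliare": "7",
    "Pressione Atmosferica": "8",
    "Tensione Batteria": "9",
}


async def apply_sensor_filter(page: Any, sensor_name: str) -> None:
    """Select a sensor type, then wait for network activity to settle."""
    if sensor_name not in SENSOR_VALUES:
        raise ValueError(f"Unknown sensor type: {sensor_name!r}")
    # Playwright's Page.select_option accepts the <option> value to choose.
    await page.select_option(SENSOR_SELECTOR, value=SENSOR_VALUES[sensor_name])
    # The table reloads via AJAX, so wait until the network is idle.
    await page.wait_for_load_state("networkidle")
```

Validating the name before touching the page means a typo fails fast instead of leaving the dropdown in an unexpected state.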
### 4. Geographic Filtering
#### Provincial Coverage
- **GE** (Genova) - Primary metropolitan area
- **SV** (Savona) - Western coastal region
- **IM** (Imperia) - Northwestern region
- **SP** (La Spezia) - Eastern region
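Filtering by the four province codes above could look like the sketch below. It assumes rows are dicts carrying a `Provincia` key as in the table headers. `filter_by_provincia` is a hypothetical name, and rejecting unknown codes is a design assumption.

```python
from typing import Dict, Iterable, List, Optional

# Province codes listed in this document.
VALID_PROVINCES = {"GE", "SV", "IM", "SP"}


def filter_by_provincia(
    rows: Iterable[Dict[str, str]],
    provincia: Optional[str],
) -> List[Dict[str, str]]:
    """Keep only rows from the given province; None means no filtering."""
    if provincia is None:
        return list(rows)
    code = provincia.strip().upper()
    if code not in VALID_PROVINCES:
        raise ValueError(f"Unknown provincia code: {provincia!r}")
    return [row for row in rows if row.get("Provincia", "").strip().upper() == code]
```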
## Implementation Adaptation
### Architecture Overview
Our implementation follows a clean, layered architecture:
```
tools/omirl/adapter.py               # LangGraph tool interface
├── tools/omirl/services_tables.py   # Business logic
├── services/web/table_scraper.py    # HTML parsing
└── services/web/browser.py          # Browser automation
```
### Core Functions Implemented
#### 1. Primary Data Extraction
```python
async def fetch_station_data(
    sensor_type: Optional[str] = None,
    provincia: Optional[str] = None,
) -> OMIRLResult:
    ...  # signature only; body omitted
```
#### 2. Sensor Type Discovery
```python
async def get_available_sensor_types() -> OMIRLResult:
    ...  # signature only; body omitted
```
#### 3. Convenience Functions
```python
async def get_precipitation_stations(provincia: Optional[str] = None) -> List[Dict]:
    ...  # signature only; body omitted

def validate_sensor_type(sensor_type: str) -> bool:
    ...  # signature only; body omitted
```
### Technical Implementation Details
#### Browser Automation
- **Technology**: Playwright with Chromium
- **Mode**: Headless for production, visible for debugging
- **Wait Strategy**: Network idle detection for AngularJS app
- **Rate Limiting**: 500ms delays between operations
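One way to enforce the 500 ms spacing between operations is a small limiter that sleeps only for the time still owed. This is a sketch under assumptions: `MinIntervalLimiter` is a hypothetical class, and the actual service may simply sleep for a fixed delay instead.

```python
import asyncio
import time
from typing import Optional


class MinIntervalLimiter:
    """Ensure at least `min_interval` seconds pass between operations."""

    def __init__(self, min_interval: float = 0.5) -> None:
        self.min_interval = min_interval
        self._last: Optional[float] = None  # time of the previous operation

    async def wait(self) -> None:
        """Sleep for whatever remains of the interval, then record the time."""
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                await asyncio.sleep(remaining)
        self._last = time.monotonic()
```

Calling `await limiter.wait()` before each browser action keeps requests spaced out without adding delay when the previous action was already slow.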
#### Data Processing
- **HTML Parsing**: BeautifulSoup4 for table extraction
- **Data Validation**: Type checking and required field validation
- **Error Handling**: Graceful failure with structured error messages
- **Filtering**: Post-extraction filtering for geographic constraints
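The table extraction step can be illustrated without third-party dependencies. Note the substitution: the service uses BeautifulSoup4, while this sketch uses the standard library's `html.parser` so that it runs on its own. `TableParser` and `table_to_dicts` are hypothetical names, and the sketch assumes the first `<tr>` holds the headers.

```python
from html.parser import HTMLParser
from typing import Dict, List


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> found in the fed HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: List[str] = []
        self._cell: List[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._cell = []

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
            self._row.append("".join(self._cell).strip())
        elif tag == "tr" and self._row:
            self.rows.append(self._row)


def table_to_dicts(html: str) -> List[Dict[str, str]]:
    """Treat the first row as headers and map each later row onto them."""
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    headers, *body = parser.rows
    # Skip rows whose cell count does not match the header count.
    return [dict(zip(headers, row)) for row in body if len(row) == len(headers)]
```

Dropping rows with a mismatched cell count is one possible form of the required-field validation mentioned above; logging them as warnings would be an equally reasonable choice.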
#### Result Structure
```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class OMIRLResult:
    success: bool
    data: List[Dict[str, Any]]
    message: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    warnings: List[str] = field(default_factory=list)
```
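To show how callers might build and consume this structure, the sketch below repeats the dataclass so it runs on its own and adds two hypothetical helpers, `success_result` and `failure_result`. Neither helper is part of the documented code, and the `count` metadata key is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class OMIRLResult:
    success: bool
    data: List[Dict[str, Any]]
    message: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    warnings: List[str] = field(default_factory=list)


def success_result(rows: List[Dict[str, Any]], **metadata: Any) -> OMIRLResult:
    """Wrap extracted rows, recording the row count in the metadata."""
    return OMIRLResult(
        success=True,
        data=rows,
        message=f"Extracted {len(rows)} stations",
        metadata={**metadata, "count": len(rows)},
    )


def failure_result(message: str) -> OMIRLResult:
    """Report a failure with no data instead of raising to the caller."""
    return OMIRLResult(success=False, data=[], message=message)
```

Using `field(default_factory=...)` gives each result its own `metadata` dict and `warnings` list, so mutating one result never affects another.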
## Testing Strategy
### Comprehensive Test Suite
Our testing strategy covers:
1. **Basic Extraction** - Verify table scraping without filters
2. **Sensor Filtering** - Test precipitation sensor filtering
3. **Geographic Filtering** - Test provincia-based filtering
4. **Sensor Discovery** - Validate available sensor types
5. **Input Validation** - Test parameter validation
6. **Convenience Functions** - Test helper functions
### Test Execution
```bash
# Full test suite
pytest tests/test_omirl_implementation.py -v
# Specific test
pytest tests/test_omirl_implementation.py::test_basic_extraction -v
# With async support
pytest tests/test_omirl_implementation.py --asyncio-mode=auto -v
```
### Test Results Validation
Each test validates:
- **Data Structure**: Required fields present
- **Data Quality**: Non-empty critical fields
- **Filter Behavior**: Correct filtering application
- **Performance**: Response times under acceptable limits
- **Error Handling**: Graceful failure scenarios
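The data structure and data quality checks above can be expressed as a small reusable validator. This is a sketch: `find_row_problems` is a hypothetical helper, and the choice of which fields count as required or critical is an assumption, not taken from the test suite.

```python
from typing import Any, Dict, Iterable, List

# Assumed field sets; adjust to whatever the real tests require.
REQUIRED_FIELDS = ("Nome", "Codice", "Provincia", "ultimo", "UM")
CRITICAL_FIELDS = ("Nome", "Codice")


def find_row_problems(rows: Iterable[Dict[str, Any]]) -> List[str]:
    """Return one message per structural or quality problem found."""
    problems: List[str] = []
    for index, row in enumerate(rows):
        missing = [name for name in REQUIRED_FIELDS if name not in row]
        if missing:
            problems.append(f"row {index}: missing fields {missing}")
        empty = [
            name
            for name in CRITICAL_FIELDS
            if name in row and not str(row[name]).strip()
        ]
        if empty:
            problems.append(f"row {index}: empty critical fields {empty}")
    return problems
```

A test can then assert `find_row_problems(result.data) == []`, which produces a readable list of failures instead of stopping at the first bad row.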
## Production Considerations
### Performance Optimization
- **Selective Browser Installation**: Chromium only (smaller Docker image)
- **Table Targeting**: Direct table extraction (avoid full page parsing)
- **Connection Reuse**: Browser session persistence
- **Timeout Management**: Configurable wait times
### Reliability Features
- **Retry Logic**: Automatic retry on transient failures
- **Error Recovery**: Structured error reporting
- **Data Validation**: Field presence and type checking
- **Rate Limiting**: Respectful scraping practices
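Retry on transient failures is commonly implemented with exponential backoff. The sketch below shows that pattern under assumptions: `retry_async`, its parameters, and the doubling delay are illustrative, and the service's actual retry policy is not specified in this document.

```python
import asyncio
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def retry_async(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Run `operation`, retrying on the given exceptions with doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Narrowing `retry_on` to timeout and connection errors avoids retrying failures that will never succeed, such as an invalid sensor type.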
### Security & Compliance
- **User Agent**: Standard browser identification
- **Request Timing**: Human-like interaction patterns
- **Data Handling**: No sensitive data storage
- **Regional Compliance**: Public data access only
## Emergency Management Integration
### Use Cases for Operations
1. **Precipitation Monitoring**: Real-time rainfall data for flood risk assessment
2. **Temperature Tracking**: Heat wave and cold snap monitoring
3. **Wind Conditions**: Storm and high wind alerts
4. **Multi-sensor Analysis**: Comprehensive weather situation assessment
### Data Applications
- **Risk Assessment**: Station data for regional risk evaluation
- **Resource Allocation**: Targeted response based on geographic data
- **Trend Analysis**: Historical pattern recognition
- **Alert Systems**: Threshold-based warning systems
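A threshold-based check over extracted rows might look like the sketch below. It assumes rows have already been normalized so that `ultimo` is a float or `None`. `stations_over_threshold` is a hypothetical helper, and the threshold is a caller-supplied parameter because this document defines no alert levels.

```python
from typing import Any, Dict, Iterable, List


def stations_over_threshold(
    rows: Iterable[Dict[str, Any]],
    threshold: float,
    column: str = "ultimo",
) -> List[str]:
    """Return names of stations whose reading meets or exceeds the threshold."""
    flagged: List[str] = []
    for row in rows:
        value = row.get(column)
        # Skip stations with missing or non-numeric readings.
        if isinstance(value, (int, float)) and value >= threshold:
            flagged.append(str(row.get("Nome", "")))
    return flagged
```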
## Future Enhancements
### Potential Improvements
1. **Historical Data**: Extend to historical weather patterns
2. **Real-time Updates**: WebSocket or polling for live data
3. **Data Caching**: Local storage for performance optimization
4. **Alert Integration**: Direct integration with emergency alert systems
### Monitoring Requirements
- **Service Health**: Regular connectivity testing
- **Data Quality**: Validation of extracted data integrity
- **Performance Metrics**: Response time and success rate tracking
- **Error Alerting**: Notification system for service failures
## Conclusion
Our OMIRL discovery and implementation successfully created a robust web scraping service that:
- ✅ **Accurately extracts** weather station data from approximately 206 stations
- ✅ **Supports filtering** by sensor type and geographic region
- ✅ **Handles dynamic content** with proper AngularJS interaction
- ✅ **Provides reliable service** with comprehensive error handling
- ✅ **Integrates seamlessly** with LangGraph agents for emergency operations
The implementation is now ready for production deployment and integration into emergency management workflows for the Liguria region.