| # OMIRL Website Discovery and Web Scraping Implementation | |
| ## Overview | |
| This document summarizes how we explored the OMIRL (Osservatorio Meteo-Idrologico della Regione Liguria) website and adapted our web scraping service to extract weather station data for emergency management operations. | |
| ## Discovery Process | |
| ### Target Website | |
| - **Base URL**: `https://omirl.regione.liguria.it` | |
| - **Target Page**: `https://omirl.regione.liguria.it/#/sensorstable` | |
| - **Purpose**: Extract weather station sensor data for Liguria region emergency management | |
| ### Discovery Methodology | |
| Our discovery process involved: | |
| 1. **Direct Navigation Testing** - Systematically testing different URL patterns | |
| 2. **Table Structure Analysis** - Identifying data tables and their structure | |
| 3. **Filter Control Discovery** - Understanding available filtering mechanisms | |
| 4. **Content Validation** - Verifying data relevance for emergency operations | |
| ## Key Discoveries | |
| ### 1. Correct Navigation Path | |
| After testing multiple URL patterns, we identified the correct endpoint: | |
| ``` | |
| ✅ CORRECT: /#/sensorstable | |
| ❌ TRIED: /#/summarytable, /#/valori_stazioni, /#/tabelle/valori, etc. | |
| ``` | |
| ### 2. Table Structure Discovery | |
| #### Primary Data Table Headers | |
| ```json | |
| { | |
| "actual_headers": [ | |
| "Nome", // Station Name | |
| "Codice", // Station Code | |
| "Comune", // Municipality | |
| "Provincia", // Province | |
| "Area", // Area Classification | |
| "Bacino", // River Basin | |
| "Sottobacino", // Sub-basin | |
| "ultimo", // Latest Reading | |
| "Max", // Maximum Value | |
| "Min", // Minimum Value | |
| "UM" // Unit of Measurement | |
| ] | |
| } | |
| ``` | |
| #### Data Characteristics | |
| - **Expected Station Count**: ~206 weather stations | |
| - **Geographic Coverage**: Liguria region (GE, SV, IM, SP provinces) | |
| - **Numeric Data Columns**: `ultimo`, `Max`, `Min` | |
| - **Units Column**: `UM` (typically "mm" for precipitation) | |
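As a sketch of how the numeric columns might be normalized after extraction, the snippet below converts `ultimo`, `Max` and `Min` to floats. The names `parse_number` and `normalize_row`, the handling of decimal commas, and the choice to map non-numeric cells to `None` are assumptions for illustration, not the confirmed behaviour of the service.

```python
from typing import Any, Dict, Optional

# Numeric columns named in this document.
NUMERIC_COLUMNS = ("ultimo", "Max", "Min")


def parse_number(text: str) -> Optional[float]:
    """Convert a table cell to float; return None when the cell is not numeric."""
    cleaned = text.strip().replace(",", ".")  # assumption: a decimal comma may appear
    try:
        return float(cleaned)
    except ValueError:
        return None


def normalize_row(row: Dict[str, str]) -> Dict[str, Any]:
    """Return a copy of the row with the numeric columns converted to floats."""
    normalized: Dict[str, Any] = dict(row)
    for column in NUMERIC_COLUMNS:
        normalized[column] = parse_number(row.get(column, ""))
    return normalized
```

Keeping unparseable cells as `None` rather than `0.0` avoids mistaking a missing reading for a real zero measurement.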
| ### 3. Sensor Type Filtering | |
| #### Available Sensor Types | |
| ```json | |
| { | |
| "actual_sensor_types": [ | |
| {"index": 0, "name": "Precipitazione", "value": "0"}, | |
| {"index": 1, "name": "Temperatura", "value": "1"}, | |
| {"index": 2, "name": "Livelli Idrometrici", "value": "2"}, | |
| {"index": 3, "name": "Vento", "value": "3"}, | |
| {"index": 4, "name": "Umidità dell'aria", "value": "4"}, | |
| {"index": 5, "name": "Eliofanie", "value": "5"}, | |
| {"index": 6, "name": "Radiazione Solare", "value": "6"}, | |
| {"index": 7, "name": "Bagnatura Fogliare", "value": "7"}, | |
| {"index": 8, "name": "Pressione Atmosferica", "value": "8"}, | |
| {"index": 9, "name": "Tensione Batteria", "value": "9"} | |
| ] | |
| } | |
| ``` | |
| #### Filter Implementation | |
| - **Selector**: `select#stationType` | |
| - **Filter Method**: JavaScript-based dropdown selection | |
| - **Dynamic Loading**: Table updates via AJAX after filter selection | |
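The dropdown interaction described above could look roughly like the following. This is a minimal sketch, assuming `page` behaves like a Playwright async `Page`; the helper name `apply_sensor_filter` and the `settle_ms` pause are hypothetical and not taken from the project code.

```python
import asyncio
from typing import Any

SENSOR_SELECT = "select#stationType"  # selector reported in this document


async def apply_sensor_filter(page: Any, sensor_value: str, settle_ms: int = 500) -> None:
    """Pick a sensor type in the dropdown and wait for the table to refresh.

    `page` is expected to behave like a Playwright async Page.
    """
    await page.select_option(SENSOR_SELECT, value=sensor_value)
    # The table reloads via AJAX, so wait until network traffic goes quiet.
    await page.wait_for_load_state("networkidle")
    # Extra pause (assumed value) to give the framework time to re-render rows.
    await asyncio.sleep(settle_ms / 1000)
```

For example, `await apply_sensor_filter(page, "0")` would select "Precipitazione" according to the sensor type list above.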
| ### 4. Geographic Filtering | |
| #### Provincial Coverage | |
| - **GE** (Genova) - Primary metropolitan area | |
| - **SV** (Savona) - Western coastal region | |
| - **IM** (Imperia) - Northwestern region | |
| - **SP** (La Spezia) - Eastern region | |
| ## Implementation Adaptation | |
| ### Architecture Overview | |
| Our implementation follows a clean, layered architecture: | |
| ``` | |
| tools/omirl/adapter.py # LangGraph tool interface | |
| ├── tools/omirl/services_tables.py # Business logic | |
| ├── services/web/table_scraper.py # HTML parsing | |
| └── services/web/browser.py # Browser automation | |
| ``` | |
| ### Core Functions Implemented | |
| #### 1. Primary Data Extraction | |
| ```python | |
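| from typing import Optional | |
| # Signature only; OMIRLResult is defined under "Result Structure". | |
| async def fetch_station_data( | |
|     sensor_type: Optional[str] = None, | |
|     provincia: Optional[str] = None, | |
| ) -> OMIRLResult: | |
|     ... | |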
| ``` | |
| #### 2. Sensor Type Discovery | |
| ```python | |
| async def get_available_sensor_types() -> OMIRLResult: ...  # signature only | |
| ``` | |
| #### 3. Convenience Functions | |
| ```python | |
| # Signatures only; bodies omitted. | |
| async def get_precipitation_stations(provincia: Optional[str] = None) -> List[Dict]: ... | |
| def validate_sensor_type(sensor_type: str) -> bool: ... | |
| ``` | |
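One way `validate_sensor_type` could work is a lookup against the sensor names listed earlier. The sketch below is an assumption about the approach, not the project's actual code: the case-insensitive comparison and the companion helper `sensor_value` are hypothetical.

```python
from typing import Optional

# Sensor names and dropdown values as listed in this document.
SENSOR_TYPES = {
    "Precipitazione": "0",
    "Temperatura": "1",
    "Livelli Idrometrici": "2",
    "Vento": "3",
    "Umidità dell'aria": "4",
    "Eliofanie": "5",
    "Radiazione Solare": "6",
    "Bagnatura Fogliare": "7",
    "Pressione Atmosferica": "8",
    "Tensione Batteria": "9",
}


def sensor_value(sensor_type: str) -> Optional[str]:
    """Return the dropdown value for a sensor name, or None if unknown."""
    wanted = sensor_type.strip().casefold()  # assumption: ignore case and padding
    for name, value in SENSOR_TYPES.items():
        if name.casefold() == wanted:
            return value
    return None


def validate_sensor_type(sensor_type: str) -> bool:
    """Report whether the name matches one of the known sensor types."""
    return sensor_value(sensor_type) is not None
```

Validating the name before launching the browser lets the service reject bad input without paying for a page load.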
| ### Technical Implementation Details | |
| #### Browser Automation | |
| - **Technology**: Playwright with Chromium | |
| - **Mode**: Headless for production, visible for debugging | |
| - **Wait Strategy**: Network idle detection for AngularJS app | |
| - **Rate Limiting**: 500ms delays between operations | |
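The 500 ms spacing between operations could be enforced with a small helper like the one below. This is an illustrative sketch; the `Throttle` class is hypothetical and the project may implement its delays differently (for example with fixed sleeps).

```python
import asyncio
import time
from typing import Optional


class Throttle:
    """Enforce a minimum interval between consecutive operations."""

    def __init__(self, min_interval: float = 0.5) -> None:
        self.min_interval = min_interval
        self._last: Optional[float] = None  # time of the previous operation

    async def wait(self) -> None:
        """Sleep just long enough to respect the minimum interval."""
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                await asyncio.sleep(remaining)
        self._last = time.monotonic()
```

Calling `await throttle.wait()` before each browser action means the first action runs immediately and later ones are spaced at least `min_interval` seconds apart.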
| #### Data Processing | |
| - **HTML Parsing**: BeautifulSoup4 for table extraction | |
| - **Data Validation**: Type checking and required field validation | |
| - **Error Handling**: Graceful failure with structured error messages | |
| - **Filtering**: Post-extraction filtering for geographic constraints | |
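Post-extraction geographic filtering can be sketched as a plain list filter over the parsed rows. The function name `filter_by_provincia` and the decision to raise `ValueError` for unknown codes are assumptions made for this example.

```python
from typing import Dict, List, Optional

# Province codes covered by the site, as listed in this document.
VALID_PROVINCES = {"GE", "SV", "IM", "SP"}


def filter_by_provincia(
    rows: List[Dict[str, str]], provincia: Optional[str]
) -> List[Dict[str, str]]:
    """Keep only rows whose Provincia column matches the given code."""
    if provincia is None:
        return list(rows)  # no filter requested
    code = provincia.strip().upper()
    if code not in VALID_PROVINCES:
        raise ValueError(f"Unknown provincia code: {provincia!r}")
    return [
        row for row in rows
        if row.get("Provincia", "").strip().upper() == code
    ]
```

Filtering after extraction keeps the browser interaction simple: only the sensor dropdown has to be driven on the page.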
| #### Result Structure | |
| ```python | |
| from dataclasses import dataclass, field | |
| from typing import Any, Dict, List | |
| @dataclass | |
| class OMIRLResult: | |
|     success: bool | |
|     data: List[Dict[str, Any]] | |
|     message: str | |
|     metadata: Dict[str, Any] = field(default_factory=dict) | |
|     warnings: List[str] = field(default_factory=list) | |
| ``` | |
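To illustrate how this structure supports graceful failure, the sketch below repeats the dataclass and adds two small constructors. The helpers `success_result` and `failure_result`, and the metadata keys they set, are hypothetical and only show one possible convention.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class OMIRLResult:
    success: bool
    data: List[Dict[str, Any]]
    message: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    warnings: List[str] = field(default_factory=list)


def success_result(rows: List[Dict[str, Any]], **metadata: Any) -> OMIRLResult:
    """Wrap extracted rows, recording the row count in the metadata."""
    info = {"row_count": len(rows), **metadata}
    return OMIRLResult(True, rows, f"Extracted {len(rows)} stations", info)


def failure_result(error: Exception, **metadata: Any) -> OMIRLResult:
    """Turn an exception into a structured failure instead of re-raising it."""
    message = f"{type(error).__name__}: {error}"
    return OMIRLResult(False, [], message, dict(metadata))
```

A caller can then branch on `result.success` and log `result.message` without wrapping every call in its own `try`/`except`.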
| ## Testing Strategy | |
| ### Comprehensive Test Suite | |
| Our testing strategy covers: | |
| 1. **Basic Extraction** - Verify table scraping without filters | |
| 2. **Sensor Filtering** - Test precipitation sensor filtering | |
| 3. **Geographic Filtering** - Test provincia-based filtering | |
| 4. **Sensor Discovery** - Validate available sensor types | |
| 5. **Input Validation** - Test parameter validation | |
| 6. **Convenience Functions** - Test helper functions | |
| ### Test Execution | |
| ```bash | |
| # Full test suite | |
| pytest tests/test_omirl_implementation.py -v | |
| # Specific test | |
| pytest tests/test_omirl_implementation.py::test_basic_extraction -v | |
| # With async support | |
| pytest tests/test_omirl_implementation.py --asyncio-mode=auto -v | |
| ``` | |
| ### Test Results Validation | |
| Each test validates: | |
| - **Data Structure**: Required fields present | |
| - **Data Quality**: Non-empty critical fields | |
| - **Filter Behavior**: Correct filtering application | |
| - **Performance**: Response times under acceptable limits | |
| - **Error Handling**: Graceful failure scenarios | |
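The structure and quality checks listed above could be expressed as a reusable helper that tests then assert on. This is a sketch under assumptions: the name `find_row_problems` is hypothetical, and treating `Nome` and `Codice` as the critical fields is a guess rather than something the test suite is known to do.

```python
from typing import Dict, List

# Column headers discovered on the sensors table.
REQUIRED_FIELDS = (
    "Nome", "Codice", "Comune", "Provincia", "Area",
    "Bacino", "Sottobacino", "ultimo", "Max", "Min", "UM",
)
CRITICAL_FIELDS = ("Nome", "Codice")  # assumption: these must never be blank


def find_row_problems(rows: List[Dict[str, str]]) -> List[str]:
    """Return human-readable problems; an empty list means the rows look valid."""
    problems: List[str] = []
    for index, row in enumerate(rows):
        missing = [name for name in REQUIRED_FIELDS if name not in row]
        if missing:
            problems.append(f"row {index}: missing fields {missing}")
        empty = [
            name for name in CRITICAL_FIELDS
            if name in row and not str(row[name]).strip()
        ]
        if empty:
            problems.append(f"row {index}: empty critical fields {empty}")
    return problems
```

In a test, `assert find_row_problems(result.data) == []` gives a failure message that names the offending row and field.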
| ## Production Considerations | |
| ### Performance Optimization | |
| - **Selective Browser Installation**: Chromium only (smaller Docker image) | |
| - **Table Targeting**: Direct table extraction (avoid full page parsing) | |
| - **Connection Reuse**: Browser session persistence | |
| - **Timeout Management**: Configurable wait times | |
| ### Reliability Features | |
| - **Retry Logic**: Automatic retry on transient failures | |
| - **Error Recovery**: Structured error reporting | |
| - **Data Validation**: Field presence and type checking | |
| - **Rate Limiting**: Respectful scraping practices | |
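Automatic retry on transient failures is often implemented with exponential backoff. The sketch below shows that pattern; the helper name `retry_async`, the default of three attempts, and the choice of `TimeoutError` and `ConnectionError` as retryable errors are assumptions, not details taken from the service.

```python
import asyncio
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def retry_async(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
) -> T:
    """Run `operation`, retrying on the given errors with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each time: base_delay, 2 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

Errors outside `retry_on` propagate immediately, so programming mistakes are not hidden behind repeated attempts.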
| ### Security & Compliance | |
| - **User Agent**: Standard browser identification | |
| - **Request Timing**: Human-like interaction patterns | |
| - **Data Handling**: No sensitive data storage | |
| - **Regional Compliance**: Public data access only | |
| ## Emergency Management Integration | |
| ### Use Cases for Operations | |
| 1. **Precipitation Monitoring**: Real-time rainfall data for flood risk assessment | |
| 2. **Temperature Tracking**: Heat wave and cold snap monitoring | |
| 3. **Wind Conditions**: Storm and high wind alerts | |
| 4. **Multi-sensor Analysis**: Comprehensive weather situation assessment | |
| ### Data Applications | |
| - **Risk Assessment**: Station data for regional risk evaluation | |
| - **Resource Allocation**: Targeted response based on geographic data | |
| - **Trend Analysis**: Historical pattern recognition | |
| - **Alert Systems**: Threshold-based warning systems | |
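A threshold-based check over the latest readings might be sketched as follows. It assumes the rows have already had `ultimo` converted to a number; the function name `stations_over_threshold` is hypothetical, and any threshold passed in is a placeholder with no operational meaning.

```python
from typing import Any, Dict, List


def stations_over_threshold(
    rows: List[Dict[str, Any]], threshold: float
) -> List[Dict[str, Any]]:
    """Return stations whose latest reading exceeds the threshold, highest first."""
    exceeding = [
        row for row in rows
        # Skip rows where the latest reading is missing or not numeric.
        if isinstance(row.get("ultimo"), (int, float)) and row["ultimo"] > threshold
    ]
    return sorted(exceeding, key=lambda row: row["ultimo"], reverse=True)
```

Sorting by the latest reading puts the most affected stations at the top of any alert that is built from the result.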
| ## Future Enhancements | |
| ### Potential Improvements | |
| 1. **Historical Data**: Extend to historical weather patterns | |
| 2. **Real-time Updates**: WebSocket or polling for live data | |
| 3. **Data Caching**: Local storage for performance optimization | |
| 4. **Alert Integration**: Direct integration with emergency alert systems | |
| ### Monitoring Requirements | |
| - **Service Health**: Regular connectivity testing | |
| - **Data Quality**: Validation of extracted data integrity | |
| - **Performance Metrics**: Response time and success rate tracking | |
| - **Error Alerting**: Notification system for service failures | |
| ## Conclusion | |
| Our OMIRL discovery and implementation successfully created a robust web scraping service that: | |
| - ✅ **Accurately extracts** weather station data from approximately 206 stations | |
| - ✅ **Supports filtering** by sensor type and geographic region | |
| - ✅ **Handles dynamic content** with proper AngularJS interaction | |
| - ✅ **Provides reliable service** with comprehensive error handling | |
| - ✅ **Integrates seamlessly** with LangGraph agents for emergency operations | |
| The implementation is now ready for production deployment and integration into emergency management workflows for the Liguria region. |