arash7920 committed on
Commit
e869d90
·
verified ·
1 Parent(s): 5ff5a4f

Upload 38 files

src/.DS_Store ADDED
Binary file (6.15 kB).
 
src/README.md ADDED
@@ -0,0 +1,439 @@
# CANedge Data Lake Python SDK

Production-ready Python package for querying and analyzing CAN/LIN data lakes created from CSS Electronics CANedge MDF4 logs using AWS Athena.

## Features

- **AWS Athena Integration**: Query Parquet data using SQL via Athena
- **CloudFormation Configuration**: Automatic configuration from CloudFormation stack outputs
- **Scalable**: Leverages Athena's distributed query engine for large datasets
- **Type-safe**: Full type hints and docstrings
- **Well-architected**: Clean module design with logging and error handling

## Installation

```bash
# Clone or download the project
cd CSS

# Install in development mode
pip install -e .

# Or install from requirements
pip install -r requirements.txt
```

## Prerequisites

1. **AWS Account** with:
   - CloudFormation stack named `datalake-stack` (or specify a custom name)
   - Athena database configured
   - S3 bucket with Parquet data
   - AWS Glue catalog with table definitions

2. **CloudFormation Stack Outputs**:
   Your `datalake-stack` must define the following outputs:
   - `DatabaseName`: Athena database name
   - `S3OutputLocation`: S3 location for Athena query results (e.g., `s3://bucket/athena-results/`)
   - `WorkGroup`: (Optional) Athena workgroup name
   - `Region`: (Optional) AWS region

3. **AWS Credentials**, via one of:
   - AWS CLI configuration: `aws configure`
   - An IAM role (for EC2/ECS/Lambda)
   - Environment variables

## Quick Start

### Option 1: Using Explicit Credentials (Recommended for Testing)

```python
from datalake.config import DataLakeConfig
from datalake.athena import AthenaQuery
from datalake.catalog import DataLakeCatalog
from datalake.query import DataLakeQuery

# Load config with explicit credentials
config = DataLakeConfig.from_credentials(
    database_name="dbparquetdatalake05",
    workgroup="athenaworkgroup-datalake05",
    s3_output_location="s3://canedge-raw-data-parquet/athena-results/",
    region="eu-north-1",
    access_key_id="YOUR_ACCESS_KEY_ID",
    secret_access_key="YOUR_SECRET_ACCESS_KEY",
)

# Initialize Athena and catalog
athena = AthenaQuery(config)
catalog = DataLakeCatalog(athena, config)
query = DataLakeQuery(athena, catalog)

# List devices
devices = catalog.list_devices()
print(f"Devices: {devices}")

# Query data
df = query.read_device_message(
    device_id="device_001",
    message="EngineData",
    date_range=("2024-01-15", "2024-01-20"),
    limit=1000,
)
print(f"Loaded {len(df)} records")
```

### Option 2: Using CloudFormation Stack

```python
from datalake.config import DataLakeConfig
from datalake.athena import AthenaQuery
from datalake.catalog import DataLakeCatalog
from datalake.query import DataLakeQuery

# Load config from CloudFormation stack
config = DataLakeConfig.from_cloudformation(
    stack_name="datalake-stack",
    region=None,   # Auto-detect from stack or use default
    profile=None,  # Use default profile or IAM role
)

# Initialize Athena and catalog
athena = AthenaQuery(config)
catalog = DataLakeCatalog(athena, config)
query = DataLakeQuery(athena, catalog)
```

## Configuration

### Option 1: Using Explicit Credentials

For direct access with AWS credentials:

```python
config = DataLakeConfig.from_credentials(
    database_name="dbparquetdatalake05",
    workgroup="athenaworkgroup-datalake05",
    s3_output_location="s3://canedge-raw-data-parquet/athena-results/",
    region="eu-north-1",
    access_key_id="YOUR_ACCESS_KEY_ID",
    secret_access_key="YOUR_SECRET_ACCESS_KEY",
)
```

**Parameters:**
- `database_name`: Athena database name
- `workgroup`: Athena workgroup name
- `s3_output_location`: S3 path for query results (must end with `/`)
- `region`: AWS region
- `access_key_id`: AWS access key ID
- `secret_access_key`: AWS secret access key

### Option 2: Using CloudFormation Stack

Your CloudFormation stack (`datalake-stack`) should output:

```yaml
Outputs:
  DatabaseName:
    Description: Athena database name
    Value: canedge_datalake

  S3OutputLocation:
    Description: S3 location for Athena query results
    Value: s3://my-bucket/athena-results/

  WorkGroup:
    Description: Athena workgroup name (optional)
    Value: primary

  Region:
    Description: AWS region
    Value: us-east-1
```

### Loading Configuration

```python
from datalake.config import DataLakeConfig

# Load from CloudFormation stack (default: 'datalake-stack')
config = DataLakeConfig.from_cloudformation()

# Or specify a custom stack name
config = DataLakeConfig.from_cloudformation(
    stack_name="my-custom-stack",
    region="us-east-1",   # Optional: override region
    profile="myprofile",  # Optional: use named AWS profile
)
```

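Under the hood, `from_cloudformation` has to turn the stack's `Outputs` list into key/value pairs. A minimal stdlib-only sketch of that step, assuming the list shape returned by boto3's `describe_stacks` (`Stacks[0]["Outputs"]`); the helper name is hypothetical, not part of the SDK:

```python
def outputs_to_dict(stack_outputs):
    """Flatten a CloudFormation Outputs list into a {OutputKey: OutputValue} dict."""
    return {o["OutputKey"]: o["OutputValue"] for o in stack_outputs}

# Shape matches boto3's describe_stacks response: Stacks[0]["Outputs"]
outputs = [
    {"OutputKey": "DatabaseName", "OutputValue": "canedge_datalake"},
    {"OutputKey": "S3OutputLocation", "OutputValue": "s3://my-bucket/athena-results/"},
]
print(outputs_to_dict(outputs)["DatabaseName"])  # canedge_datalake
```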
## Data Lake Structure

### Athena Database Organization

The data lake is organized in Athena as follows:
- **Database**: Contains all tables (from CloudFormation output `DatabaseName`)
- **Tables**: Named by device and message (e.g., `device_001_EngineData`)
- **Partitions**: Date-based partitioning for efficient queries
- **Schema**: Each table has a `t` (timestamp) column plus signal columns derived from DBC files

### Table Naming Convention

Tables are typically named:
- `{device_id}_{message_rule}` (e.g., `device_001_EngineData`)
- Or `{device_id}__{message_rule}` (double underscore)
- The catalog automatically detects the pattern

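A sketch of how such names could be split into device and message parts. This is a hypothetical helper for illustration (the catalog's actual detection logic may differ): the double-underscore form is unambiguous, while the single-underscore form is split at the last underscore, which assumes the message name itself contains no underscores.

```python
def split_table_name(table_name: str) -> tuple[str, str]:
    """Split a table name into (device_id, message).

    Prefers the unambiguous double-underscore separator; otherwise falls
    back to splitting at the last single underscore.
    """
    if "__" in table_name:
        device, message = table_name.split("__", 1)
    else:
        device, message = table_name.rsplit("_", 1)
    return device, message

print(split_table_name("device_001__EngineData"))  # ('device_001', 'EngineData')
print(split_table_name("device_001_EngineData"))   # ('device_001', 'EngineData')
```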
## Usage Patterns

### 1. Explore Data Lake

```python
from datalake.config import DataLakeConfig
from datalake.athena import AthenaQuery
from datalake.catalog import DataLakeCatalog

config = DataLakeConfig.from_cloudformation()
athena = AthenaQuery(config)
catalog = DataLakeCatalog(athena, config)

# List all tables
tables = catalog.list_tables()
print(f"Tables: {tables}")

# List devices
devices = catalog.list_devices()
print(f"Devices: {devices}")

# List messages for a device
messages = catalog.list_messages("device_001")
print(f"Messages: {messages}")

# Get schema
schema = catalog.get_schema("device_001", "EngineData")
print(f"Columns: {list(schema.keys())}")

# List partitions (dates)
partitions = catalog.list_partitions("device_001", "EngineData")
print(f"Dates: {partitions}")
```

### 2. Query Data

```python
from datalake.query import DataLakeQuery

query = DataLakeQuery(athena, catalog)

# Read data for a device/message
df = query.read_device_message(
    device_id="device_001",
    message="EngineData",
    date_range=("2024-01-15", "2024-01-20"),
    columns=["t", "RPM", "Temperature"],
    limit=10000,
)
print(f"Loaded {len(df)} records")
```

### 3. Time Series Query

```python
import pandas as pd

# Query a single signal over a time window
df_ts = query.time_series_query(
    device_id="device_001",
    message="EngineData",
    signal_name="RPM",
    start_time=1000000000000000,  # microseconds since epoch
    end_time=2000000000000000,
    limit=10000,
)

# Convert the timestamp and inspect
df_ts['timestamp'] = pd.to_datetime(df_ts['t'], unit='us')
print(df_ts[['timestamp', 'RPM']].head())
```

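The `t` column holds microseconds since the Unix epoch, which is why `start_time` and `end_time` above are 16-digit integers. A stdlib-only sketch of the same conversion the pandas call performs (assuming that microsecond interpretation):

```python
from datetime import datetime, timezone

def us_to_datetime(t_us: int) -> datetime:
    """Convert a microsecond epoch timestamp (the `t` column) to a UTC datetime."""
    return datetime.fromtimestamp(t_us / 1_000_000, tz=timezone.utc)

print(us_to_datetime(1_700_000_000_000_000))  # 2023-11-14 22:13:20+00:00
```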
### 4. Custom SQL Queries

```python
# Execute custom SQL
# Note: Use path-based filtering for date ranges
# Data structure: {device_id}/{message}/{year}/{month}/{day}/file.parquet
sql = """
SELECT
    COUNT(*) as record_count,
    AVG(RPM) as avg_rpm,
    MAX(Temperature) as max_temp
FROM canedge_datalake.device_001_EngineData
WHERE try_cast(element_at(split("$path", '/'), -4) AS INTEGER) = 2024
  AND try_cast(element_at(split("$path", '/'), -3) AS INTEGER) >= 1
  AND try_cast(element_at(split("$path", '/'), -2) AS INTEGER) >= 15
"""

df = query.execute_sql(sql)
print(df)
```

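Because the year/month/day indices are easy to get wrong, it can help to generate the `$path` filter rather than hand-write it. A hypothetical helper (not part of the SDK) that builds the clause for one exact date; note the snippet above uses `>=` comparisons to cover a range instead:

```python
def path_date_filter(year: int, month: int, day: int) -> str:
    """Build a $path-based WHERE clause for a single exact date."""
    part = 'try_cast(element_at(split("$path", \'/\'), {i}) AS INTEGER)'
    return (
        f"{part.format(i=-4)} = {year}"
        f" AND {part.format(i=-3)} = {month}"
        f" AND {part.format(i=-2)} = {day}"
    )

print(path_date_filter(2024, 1, 15))
```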
### 5. Aggregation Queries

```python
# Use the built-in aggregation method
# For date filtering, use path-based extraction
path_year = "try_cast(element_at(split(\"$path\", '/'), -4) AS INTEGER)"
path_month = "try_cast(element_at(split(\"$path\", '/'), -3) AS INTEGER)"
path_day = "try_cast(element_at(split(\"$path\", '/'), -2) AS INTEGER)"
where_clause = f"{path_year} = 2024 AND {path_month} >= 1 AND {path_day} >= 15"

df_agg = query.aggregate(
    device_id="device_001",
    message="EngineData",
    aggregation="COUNT(*) as count, AVG(RPM) as avg_rpm, MIN(RPM) as min_rpm",
    where_clause=where_clause,
)
print(df_agg)
```

### 6. Batch Processing

```python
from datalake.batch import BatchProcessor

processor = BatchProcessor(query)

# Compute statistics across all data
stats = processor.aggregate_by_device_message(
    aggregation_func=processor.compute_statistics,
    message_filter="Engine.*",
)

for device, messages in stats.items():
    for message, metrics in messages.items():
        print(f"{device}/{message}: {metrics['count']} records")

# Export to CSV
processor.export_to_csv(
    device_id="device_001",
    message="EngineData",
    output_path="engine_export.csv",
    limit=100000,
)
```

## Running Examples

```bash
# Test the connection first
python test_connection.py

# Explore the data lake structure
python examples/explore_example.py

# Query and analyze data
python examples/query_example.py

# Batch processing
python examples/batch_example.py
```

**Note:** All examples use explicit credentials. Update them with your actual credentials, or modify them to load configuration from the CloudFormation stack.

## CloudFormation Stack Requirements

### Required Stack Outputs

1. **DatabaseName** (required)
   - Athena database name containing your tables
   - Example: `canedge_datalake`

2. **S3OutputLocation** (required)
   - S3 bucket/path for Athena query results
   - Must end with `/`
   - Example: `s3://my-bucket/athena-results/`
   - Athena must have write permissions to this location

3. **WorkGroup** (optional)
   - Athena workgroup name
   - If not provided, the default workgroup is used

4. **Region** (optional)
   - AWS region
   - If not provided, the default region or stack region is used

### Example CloudFormation Template

```yaml
Resources:
  AthenaDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: canedge_datalake

Outputs:
  DatabaseName:
    Description: Athena database name
    Value: canedge_datalake
    Export:
      Name: !Sub "${AWS::StackName}-DatabaseName"

  S3OutputLocation:
    Description: S3 location for Athena query results
    Value: !Sub "s3://${ResultsBucket}/athena-results/"
    Export:
      Name: !Sub "${AWS::StackName}-S3OutputLocation"

  WorkGroup:
    Description: Athena workgroup name
    Value: primary
    Export:
      Name: !Sub "${AWS::StackName}-WorkGroup"

  Region:
    Description: AWS region
    Value: !Ref AWS::Region
    Export:
      Name: !Sub "${AWS::StackName}-Region"
```

## Performance Notes

- **Athena Query Limits**: Use the `limit` parameter to control result size
- **Partition Pruning**: Date-based queries automatically use partition pruning
- **Query Costs**: Athena charges per TB of data scanned, so use column selection and filters
- **Result Reuse**: Athena can reuse cached query results when result reuse is enabled on the workgroup
- **Concurrent Queries**: Athena supports multiple concurrent queries

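To make the cost point concrete: Athena's on-demand pricing in most regions is around $5 per TB scanned (check current pricing for your region), and the bytes scanned by a query are reported in `QueryExecution.Statistics.DataScannedInBytes` from `get_query_execution`. A rough back-of-the-envelope estimator:

```python
# Approximate on-demand price; varies by region and over time.
PRICE_PER_TB_USD = 5.0

def scan_cost_usd(bytes_scanned: int) -> float:
    """Estimate query cost from Athena's reported DataScannedInBytes."""
    tb = bytes_scanned / 1024**4
    return tb * PRICE_PER_TB_USD

print(round(scan_cost_usd(250 * 1024**3), 4))  # 250 GiB scanned -> 1.2207
```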
## Troubleshooting

**"Stack not found"**
- Verify the stack name: `aws cloudformation describe-stacks --stack-name datalake-stack`
- Check AWS credentials and region
- Ensure you have CloudFormation read permissions

**"Required output not found"**
- Verify the stack outputs: `aws cloudformation describe-stacks --stack-name datalake-stack --query 'Stacks[0].Outputs'`
- Ensure the `DatabaseName` and `S3OutputLocation` outputs exist

**"Query execution failed"**
- Check Athena permissions (Glue catalog access, S3 read permissions)
- Verify the table names exist in the database
- Check that the S3 output location has write permissions

**"Table not found"**
- List available tables with `catalog.list_tables()`
- Verify the table naming convention matches the expected pattern
- Check the Glue catalog for table definitions

## License

MIT

## References

- [CSS Electronics CANedge Documentation](https://www.csselectronics.com/pages/can-bus-logger-canedge)
- [AWS Athena Documentation](https://docs.aws.amazon.com/athena/)
- [AWS Glue Catalog](https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html)
src/config.yaml ADDED
@@ -0,0 +1,137 @@
# AWS Configuration
aws:
  database_name: "dbparquetdatalake05"
  workgroup: "athenaworkgroup-datalake05"
  s3_output_location: "s3://canedge-raw-data-parquet/athena-results/"
  region: "eu-north-1"
  access_key_id: "YOUR_ACCESS_KEY_ID"          # Do not commit real credentials
  secret_access_key: "YOUR_SECRET_ACCESS_KEY"  # Do not commit real credentials

# Message Name Mapping
message_mapping:
  "010C":
    name: "Engine RPM"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "010D":
    name: "Vehicle speed"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "0105":
    name: "Engine coolant temperature"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "010F":
    name: "Intake air temperature (IAT)"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "012F":
    name: "Fuel level input"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "0106":
    name: "Short-term fuel trim (Bank 1)"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "0107":
    name: "Long-term fuel trim (Bank 1)"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "0144":
    name: "Commanded equivalence ratio (λ/EQR)"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "0134":
    name: "O₂ wideband B1S1 (equivalence/voltage)"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "0132":
    name: "Evaporative system vapor pressure"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "0103":
    name: "Fuel system status (open/closed loop)"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "0104":
    name: "Calculated engine load"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "0143":
    name: "Absolute engine load"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "0110":
    name: "Mass air flow (MAF)"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "012E":
    name: "Commanded evap purge"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "010E":
    name: "Ignition timing advance"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "011F":
    name: "Engine runtime (since start)"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "015C":
    name: "Engine oil temperature"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "0135":
    name: "O₂ wideband B1S2 (equivalence/voltage)"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "013C":
    name: "Catalyst temperature Bank1-Sensor1"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "013D":
    name: "Catalyst temperature Bank1-Sensor2"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "0162":
    name: "Engine commanded torque"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "0163":
    name: "Engine actual torque (percent)"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "0164":
    name: "Engine reference torque (N·m)"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "0149":
    name: "Accelerator pedal position D"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "010B":
    name: "Manifold absolute pressure (MAP)"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "0133":
    name: "Barometric pressure"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "012C":
    name: "Commanded EGR"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]
  "012D":
    name: "EGR error"
    tx_id: "0x7DF"
    expected_rx_ids: ["0x7E8", "0x7E9", "0x7EA", "0x7EB", "0x7EC", "0x7ED", "0x7EE", "0x7EF"]

# Dashboard Configuration
dashboard:
  page_title: "OXON Technologies"
  page_icon: ":mag:"
  layout: "wide"
  sidebar_background_color: "#74b9ff"
  logo_path: "images/logo.png"
  header_logo_path: "images/analysis.png"
  dosing_stage_date: "2025-12-16"
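The `message_mapping` section maps OBD-II PID hex codes to human-readable labels and request/response CAN IDs. A sketch of how consuming code might look it up, using a small inline subset of the mapping as a plain dict (illustrative only; in practice the full mapping would be loaded from `config.yaml`, e.g. with PyYAML):

```python
# Subset of the config.yaml message_mapping, inlined for illustration.
MESSAGE_MAPPING = {
    "010C": {"name": "Engine RPM", "tx_id": "0x7DF"},
    "010D": {"name": "Vehicle speed", "tx_id": "0x7DF"},
    "0105": {"name": "Engine coolant temperature", "tx_id": "0x7DF"},
}

def pid_label(pid: str) -> str:
    """Human-readable label for an OBD-II PID, falling back to the raw hex code."""
    entry = MESSAGE_MAPPING.get(pid.upper())
    return entry["name"] if entry else f"PID {pid}"

print(pid_label("010c"))  # Engine RPM
print(pid_label("01FF"))  # PID 01FF
```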
src/datalake/__init__.py ADDED
@@ -0,0 +1,22 @@
"""
CANedge Data Lake Python SDK

Production-ready Python package for querying and analyzing CAN/LIN data lakes
created from CSS Electronics CANedge MDF4 logs using AWS Athena.
"""

__version__ = "0.1.0"

from .config import DataLakeConfig
from .athena import AthenaQuery
from .catalog import DataLakeCatalog
from .query import DataLakeQuery
from .batch import BatchProcessor

__all__ = [
    "DataLakeConfig",
    "AthenaQuery",
    "DataLakeCatalog",
    "DataLakeQuery",
    "BatchProcessor",
]
src/datalake/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (624 Bytes).
 
src/datalake/__pycache__/athena.cpython-310.pyc ADDED
Binary file (10.4 kB).
 
src/datalake/__pycache__/batch.cpython-310.pyc ADDED
Binary file (6.97 kB).
 
src/datalake/__pycache__/catalog.cpython-310.pyc ADDED
Binary file (8.08 kB).
 
src/datalake/__pycache__/config.cpython-310.pyc ADDED
Binary file (5.68 kB).
 
src/datalake/__pycache__/logger.cpython-310.pyc ADDED
Binary file (950 Bytes).
 
src/datalake/__pycache__/query.cpython-310.pyc ADDED
Binary file (7.51 kB).
 
src/datalake/athena.py ADDED
@@ -0,0 +1,356 @@
"""
AWS Athena query interface for data lake access.

Provides methods to execute SQL queries against Athena and retrieve results
as pandas DataFrames.
"""

from typing import Optional, List, Any
import time
import pandas as pd
from botocore.exceptions import ClientError
from urllib.parse import urlparse
import io
from .config import DataLakeConfig
from .logger import setup_logger

logger = setup_logger(__name__)


class AthenaQuery:
    """
    AWS Athena query interface.

    Executes SQL queries against Athena and retrieves results as pandas DataFrames.
    Handles query execution, polling, and result retrieval.
    """

    def __init__(self, config: DataLakeConfig):
        """
        Initialize the Athena query interface.

        Args:
            config: DataLakeConfig instance with Athena configuration
        """
        self.config = config
        session = config.get_boto3_session()
        self.athena_client = session.client('athena', region_name=config.region)
        self.s3_client = session.client('s3', region_name=config.region)
        logger.info(f"Initialized Athena client for database: {config.database_name}")

    def execute_query(
        self,
        query: str,
        wait: bool = True,
        timeout: int = 300,
    ) -> Optional[str]:
        """
        Execute a SQL query in Athena.

        Args:
            query: SQL query string
            wait: If True, block until the query completes
            timeout: Maximum time to wait for query completion (seconds)

        Returns:
            Query execution ID (immediately if wait=False, after completion if wait=True)

        Raises:
            ClientError: If query execution fails
            TimeoutError: If the query exceeds the timeout
        """
        query_execution_config = {
            'Database': self.config.database_name,
        }

        # OutputLocation belongs in ResultConfiguration
        result_configuration = {
            'OutputLocation': self.config.s3_output_location,
        }

        logger.debug(f"Executing query: {query[:100]}...")

        try:
            start_params = {
                'QueryString': query,
                'QueryExecutionContext': query_execution_config,
                'ResultConfiguration': result_configuration,
            }

            # WorkGroup is a separate parameter, not part of QueryExecutionContext
            if self.config.workgroup:
                start_params['WorkGroup'] = self.config.workgroup

            response = self.athena_client.start_query_execution(**start_params)
            execution_id = response['QueryExecutionId']
            logger.info(f"Query started with execution ID: {execution_id}")

            if not wait:
                return execution_id

            # Wait for the query to complete
            return self._wait_for_completion(execution_id, timeout)
        except ClientError as e:
            logger.error(f"Query execution failed: {e}")
            raise

    def _wait_for_completion(self, execution_id: str, timeout: int = 300) -> str:
        """
        Wait for a query execution to complete.

        Args:
            execution_id: Query execution ID
            timeout: Maximum time to wait (seconds)

        Returns:
            Execution ID

        Raises:
            TimeoutError: If the query exceeds the timeout
            RuntimeError: If the query fails or is cancelled
        """
        start_time = time.time()

        while True:
            response = self.athena_client.get_query_execution(QueryExecutionId=execution_id)
            status = response['QueryExecution']['Status']['State']

            if status == 'SUCCEEDED':
                logger.info(f"Query {execution_id} completed successfully")
                return execution_id
            elif status == 'FAILED':
                reason = response['QueryExecution']['Status'].get('StateChangeReason', 'Unknown error')
                logger.error(f"Query {execution_id} failed: {reason}")
                raise RuntimeError(f"Query failed: {reason}")
            elif status == 'CANCELLED':
                logger.warning(f"Query {execution_id} was cancelled")
                raise RuntimeError("Query was cancelled")

            elapsed = time.time() - start_time
            if elapsed > timeout:
                raise TimeoutError(f"Query {execution_id} exceeded timeout of {timeout} seconds")

            time.sleep(1)  # Poll every second

    def get_query_results(self, execution_id: str) -> pd.DataFrame:
        """
        Get query results as a pandas DataFrame.

        Reads directly from S3 for large result sets, which is orders of
        magnitude faster than paginated API calls.

        Args:
            execution_id: Query execution ID

        Returns:
            DataFrame with query results

        Raises:
            ClientError: If results cannot be retrieved
        """
        logger.debug(f"Retrieving results for execution {execution_id}")

        # Try to read from S3 first (much faster for large result sets)
        try:
            return self._get_results_from_s3(execution_id)
        except Exception as e:
            logger.debug(f"Failed to read from S3, falling back to API: {e}")
            # Fall back to the API method for backward compatibility
            return self._get_results_from_api(execution_id)

    def _get_results_from_s3(self, execution_id: str) -> pd.DataFrame:
        """
        Get query results directly from the S3 CSV file.

        This is orders of magnitude faster than paginated API calls because:
        - A single file read replaces hundreds or thousands of API calls
        - Pandas reads CSV in optimized C code
        - There is no row-by-row Python processing overhead

        Args:
            execution_id: Query execution ID

        Returns:
            DataFrame with query results

        Raises:
            Exception: If the S3 read fails
        """
        # Get query execution details to find the S3 result location
        response = self.athena_client.get_query_execution(QueryExecutionId=execution_id)
        result_location = response['QueryExecution']['ResultConfiguration']['OutputLocation']

        # Parse S3 URI: s3://bucket/path/to/file.csv
        parsed = urlparse(result_location)
        bucket = parsed.netloc
        key = parsed.path.lstrip('/')

        logger.debug(f"Reading results from s3://{bucket}/{key}")

        # Read the CSV directly from S3
        obj = self.s3_client.get_object(Bucket=bucket, Key=key)
        csv_content = obj['Body'].read()

        # Parse CSV with pandas (much faster than row-by-row processing)
        # Read as strings first to match the original API behavior, then parse types
        df = pd.read_csv(io.BytesIO(csv_content), dtype=str, keep_default_na=False)

        # Apply type parsing to match the original behavior
        for col in df.columns:
            df[col] = df[col].astype(str).apply(self._parse_value)

        logger.info(f"Retrieved {len(df)} rows from S3 for query {execution_id}")
        return df

    def _get_results_from_api(self, execution_id: str) -> pd.DataFrame:
        """
        Get query results using paginated API calls (fallback method).

        This is the original implementation, kept for backward compatibility
        when the S3 read fails.

        Args:
            execution_id: Query execution ID

        Returns:
            DataFrame with query results

        Raises:
            ClientError: If results cannot be retrieved
        """
        logger.debug(f"Using API method for execution {execution_id}")

        # Get the result set
        paginator = self.athena_client.get_paginator('get_query_results')
        pages = paginator.paginate(QueryExecutionId=execution_id)

        rows = []
        column_names = None
        first_page = True

        for page in pages:
            result_set = page['ResultSet']

            # Get column names from the first page
            if column_names is None:
                column_names = [col['Name'] for col in result_set['ResultSetMetadata']['ColumnInfo']]

            # The header row appears only in the first page of results
            data_rows = result_set['Rows'][1:] if first_page else result_set['Rows']
            first_page = False

            for row in data_rows:
                values = [self._parse_value(cell.get('VarCharValue', ''))
                          for cell in row['Data']]
                rows.append(values)

        if not rows:
            logger.warning(f"No results returned for execution {execution_id}")
            return pd.DataFrame(columns=column_names or [])

        df = pd.DataFrame(rows, columns=column_names)
        logger.info(f"Retrieved {len(df)} rows from query {execution_id}")
        return df

    def _parse_value(self, value: str) -> Any:
        """
        Parse a string value into an appropriate Python type.

        Args:
            value: String value from an Athena result

        Returns:
            Parsed value (int, float, bool, None, or str)
        """
        if value == '' or value is None:
            return None

        # Try to parse as a number
        try:
            if '.' in value:
                return float(value)
            return int(value)
        except ValueError:
            pass

        # Try to parse as a boolean
        if value.lower() in ('true', 'false'):
            return value.lower() == 'true'

        return value
279
+
280
+ def query_to_dataframe(
281
+ self,
282
+ query: str,
283
+ timeout: int = 300,
284
+ ) -> pd.DataFrame:
285
+ """
286
+ Execute query and return results as DataFrame.
287
+
288
+ Convenience method that combines execute_query and get_query_results.
289
+
290
+ Args:
291
+ query: SQL query string
292
+ timeout: Maximum time to wait for query completion (seconds)
293
+
294
+ Returns:
295
+ DataFrame with query results
296
+ """
297
+ execution_id = self.execute_query(query, wait=True, timeout=timeout)
298
+ return self.get_query_results(execution_id)
299
+
300
+ def list_tables(self, schema: Optional[str] = None) -> List[str]:
301
+ """
302
+ List tables in the database.
303
+
304
+ Args:
305
+ schema: Optional schema name (defaults to database)
306
+
307
+ Returns:
308
+ List of table names
309
+ """
310
+ if schema is None:
311
+ schema = self.config.database_name
312
+
313
+ query = f"""
314
+ SELECT table_name
315
+ FROM information_schema.tables
316
+ WHERE table_schema = '{schema}'
317
+ ORDER BY table_name
318
+ """
319
+
320
+ try:
321
+ df = self.query_to_dataframe(query)
322
+ return df['table_name'].tolist() if not df.empty else []
323
+ except Exception as e:
324
+ logger.error(f"Failed to list tables: {e}")
325
+ return []
326
+
327
+ def describe_table(self, table_name: str, schema: Optional[str] = None) -> pd.DataFrame:
328
+ """
329
+ Get table schema/columns.
330
+
331
+ Args:
332
+ table_name: Table name
333
+ schema: Optional schema name (defaults to database)
334
+
335
+ Returns:
336
+ DataFrame with column information (column_name, data_type, etc.)
337
+ """
338
+ if schema is None:
339
+ schema = self.config.database_name
340
+
341
+ query = f"""
342
+ SELECT
343
+ column_name,
344
+ data_type,
345
+ is_nullable
346
+ FROM information_schema.columns
347
+ WHERE table_schema = '{schema}'
348
+ AND table_name = '{table_name}'
349
+ ORDER BY ordinal_position
350
+ """
351
+
352
+ try:
353
+ return self.query_to_dataframe(query)
354
+ except Exception as e:
355
+ logger.error(f"Failed to describe table {table_name}: {e}")
356
+ return pd.DataFrame()
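The `_parse_value` coercion rules above (numbers first, then booleans, else the raw string) are easy to sanity-check in isolation. A standalone sketch of the same logic — a hypothetical helper for illustration, not part of the package:

```python
def parse_value(value):
    """Coerce an Athena VarCharValue string to int, float, bool, None, or str."""
    if value == '' or value is None:
        return None
    # Numbers take priority: '.' selects float, otherwise try int
    try:
        return float(value) if '.' in value else int(value)
    except ValueError:
        pass
    # Booleans come back from Athena as 'true'/'false' strings
    if value.lower() in ('true', 'false'):
        return value.lower() == 'true'
    # Anything else stays a string
    return value

print(parse_value("42"), parse_value("3.5"), parse_value("true"), parse_value("N/A"))
```

Note that a value like `"1.2.3"` falls through the numeric branch (the `float` call raises `ValueError`) and is returned unchanged as a string.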
src/datalake/batch.py ADDED
@@ -0,0 +1,231 @@
1
+ """
2
+ Batch processing utilities for scalable data lake analysis.
3
+
4
+ Provides patterns for aggregating, analyzing, and exporting data across
5
+ the entire data lake or subsets thereof.
6
+ """
7
+
8
+ from typing import Callable, Dict, Any, Optional
9
+ import pandas as pd
10
+ from .query import DataLakeQuery
11
+ from .logger import setup_logger
12
+
13
+ logger = setup_logger(__name__)
14
+
15
+
16
+ class BatchProcessor:
17
+ """
18
+ Batch processing utilities for scalable data lake analysis.
19
+
20
+ Provides high-level patterns for common analysis tasks across
21
+ multiple devices and messages.
22
+ """
23
+
24
+ def __init__(self, query: DataLakeQuery):
25
+ """
26
+ Initialize batch processor.
27
+
28
+ Args:
29
+ query: DataLakeQuery instance
30
+ """
31
+ self.query = query
32
+ logger.info("Initialized BatchProcessor")
33
+
34
+ def aggregate_by_device_message(
35
+ self,
36
+ aggregation_func: Callable[[pd.DataFrame], Dict[str, Any]],
37
+ device_filter: Optional[str] = None,
38
+ message_filter: Optional[str] = None,
39
+ ) -> Dict[str, Dict[str, Any]]:
40
+ """
41
+ Apply aggregation function to each device/message combination.
42
+
43
+ Pattern for scalable analysis across entire data lake. Processes
44
+ each device/message combination separately to manage memory.
45
+
46
+ Args:
47
+ aggregation_func: Function (df) -> dict of metrics/statistics
48
+ device_filter: Device regex filter (applied via catalog)
49
+ message_filter: Message regex filter
50
+
51
+ Returns:
52
+ Nested dict: {device_id: {message: aggregation_result}}
53
+
54
+ Example:
55
+ >>> def compute_stats(df):
56
+ ... return {
57
+ ... 'count': len(df),
58
+ ... 'rpm_mean': df['RPM'].mean() if 'RPM' in df else None
59
+ ... }
60
+ >>> results = processor.aggregate_by_device_message(compute_stats)
61
+ """
62
+ results: Dict[str, Dict[str, Any]] = {}
63
+
64
+ # Get filtered device list
65
+ devices = self.query.catalog.list_devices(device_filter)
66
+
67
+ for device in devices:
68
+ messages = self.query.catalog.list_messages(device, message_filter)
69
+ for message in messages:
70
+ try:
71
+ # Read data for this device/message
72
+ df = self.query.read_device_message(device, message)
73
+ if device not in results:
74
+ results[device] = {}
75
+ results[device][message] = aggregation_func(df)
76
+ except Exception as e:
77
+ logger.error(f"Aggregation failed for {device}/{message}: {e}")
78
+ if device not in results:
79
+ results[device] = {}
80
+ results[device][message] = {"error": str(e)}
81
+
82
+ logger.info(f"Aggregation completed for {len(results)} devices")
83
+ return results
84
+
85
+ def export_to_csv(
86
+ self,
87
+ device_id: str,
88
+ message: str,
89
+ output_path: str,
90
+ date_range: Optional[tuple[str, str]] = None,
91
+ limit: Optional[int] = None,
92
+ ) -> None:
93
+ """
94
+ Export device/message data to CSV.
95
+
96
+ Args:
97
+ device_id: Device identifier
98
+ message: Message name
99
+ output_path: Output CSV file path
100
+ date_range: Optional (start_date, end_date) tuple
101
+ limit: Optional row limit
102
+
103
+ Raises:
104
+ Exception: If export fails
105
+ """
106
+ logger.info(f"Exporting {device_id}/{message} to {output_path}")
107
+ df = self.query.read_device_message(
108
+ device_id=device_id,
109
+ message=message,
110
+ date_range=date_range,
111
+ limit=limit,
112
+ )
113
+
114
+ if df.empty:
115
+ logger.warning(f"No data to export for {device_id}/{message}")
116
+ return
117
+
118
+ df.to_csv(output_path, index=False)
119
+ logger.info(f"Exported {len(df)} rows to {output_path}")
120
+
121
+ def export_to_parquet(
122
+ self,
123
+ device_id: str,
124
+ message: str,
125
+ output_path: str,
126
+ date_range: Optional[tuple[str, str]] = None,
127
+ ) -> None:
128
+ """
129
+ Export device/message data to Parquet file.
130
+
131
+ Args:
132
+ device_id: Device identifier
133
+ message: Message name
134
+ output_path: Output Parquet file path
135
+ date_range: Optional (start_date, end_date) tuple
136
+
137
+ Raises:
138
+ Exception: If export fails
139
+ """
140
+ logger.info(f"Exporting {device_id}/{message} to {output_path}")
141
+ df = self.query.read_device_message(
142
+ device_id=device_id,
143
+ message=message,
144
+ date_range=date_range,
145
+ )
146
+
147
+ if df.empty:
148
+ logger.warning(f"No data to export for {device_id}/{message}")
149
+ return
150
+
151
+ df.to_parquet(output_path, index=False, compression='snappy')
152
+ logger.info(f"Exported {len(df)} rows to {output_path}")
153
+
154
+ def compute_statistics(self, df: pd.DataFrame) -> Dict[str, Any]:
155
+ """
156
+ Compute basic statistics for aggregation.
157
+
158
+ Args:
159
+ df: Input DataFrame
160
+
161
+ Returns:
162
+ Dict with count, mean, min, max, std for numeric columns
163
+
164
+ Note:
165
+ Skips timestamp column 't' in statistics computation.
166
+ """
167
+ stats: Dict[str, Any] = {"count": len(df)}
168
+
169
+ if df.empty:
170
+ return stats
171
+
172
+ # Compute statistics for numeric columns (excluding timestamp)
173
+ numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
174
+ numeric_cols = [c for c in numeric_cols if c != 't']
175
+
176
+ for col in numeric_cols:
177
+ try:
178
+ stats[f"{col}_mean"] = float(df[col].mean())
179
+ stats[f"{col}_min"] = float(df[col].min())
180
+ stats[f"{col}_max"] = float(df[col].max())
181
+ stats[f"{col}_std"] = float(df[col].std())
182
+ stats[f"{col}_null_count"] = int(df[col].isna().sum())
183
+ except Exception as e:
184
+ logger.warning(f"Failed to compute stats for {col}: {e}")
185
+
186
+ return stats
187
+
188
+ def find_anomalies(
189
+ self,
190
+ device_id: str,
191
+ message: str,
192
+ signal_name: str,
193
+ threshold_std: float = 3.0,
194
+ ) -> pd.DataFrame:
195
+ """
196
+ Find anomalous values in a signal using z-score method.
197
+
198
+ Args:
199
+ device_id: Device identifier
200
+ message: Message name
201
+ signal_name: Signal column name
202
+ threshold_std: Number of standard deviations for anomaly threshold
203
+
204
+ Returns:
205
+ DataFrame with anomalous records
206
+ """
207
+ df = self.query.read_device_message(
208
+ device_id=device_id,
209
+ message=message,
210
+ columns=['t', signal_name],
211
+ )
212
+
213
+ if df.empty or signal_name not in df.columns:
214
+ logger.warning(f"No data or signal not found: {signal_name}")
215
+ return pd.DataFrame()
216
+
217
+ # Compute z-scores
218
+ mean = df[signal_name].mean()
219
+ std = df[signal_name].std()
220
+
221
+ if std == 0 or pd.isna(std):
222
+ logger.warning(f"Zero standard deviation for {signal_name}")
223
+ return pd.DataFrame()
224
+
225
+ z_scores = (df[signal_name] - mean) / std
226
+ anomalies = df[abs(z_scores) > threshold_std].copy()
227
+
228
+ logger.info(f"Found {len(anomalies)} anomalies in {signal_name} "
229
+ f"(threshold: {threshold_std} std)")
230
+
231
+ return anomalies
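The z-score method used by `find_anomalies` can be sketched on a toy DataFrame instead of an Athena query result (column names here are assumptions for illustration):

```python
import pandas as pd

# Toy signal with one obvious outlier at index 4
df = pd.DataFrame({"t": range(6), "RPM": [800, 810, 805, 795, 5000, 802]})

# Same z-score computation as find_anomalies, with a lower threshold
# so the single outlier in this tiny sample is flagged
mean = df["RPM"].mean()
std = df["RPM"].std()
z_scores = (df["RPM"] - mean) / std
anomalies = df[abs(z_scores) > 1.5].copy()

print(anomalies)  # the 5000 RPM row
```

With only six samples the outlier itself inflates the standard deviation, which is why the threshold here is 1.5 rather than the 3.0 default — on real data-lake volumes the default is the sensible starting point.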
src/datalake/catalog.py ADDED
@@ -0,0 +1,269 @@
1
+ """
2
+ Data lake catalog for discovering structure and metadata using AWS Athena/Glue.
3
+
4
+ Provides methods to explore the data lake organization using Athena metadata:
5
+ - List devices, messages, and dates from table structure
6
+ - Get schemas for message/rule tables
7
+ - Understand data availability
8
+ """
9
+
10
+ from typing import List, Dict, Optional
11
+ import re
12
+ from .athena import AthenaQuery
13
+ from .config import DataLakeConfig
14
+ from .logger import setup_logger
15
+
16
+ logger = setup_logger(__name__)
17
+
18
+
19
+ class DataLakeCatalog:
20
+ """
21
+ Catalog for exploring data lake structure using Athena/Glue.
22
+
23
+ Assumes Athena database contains tables organized by device and message.
24
+ Table naming convention: {device_id}_{message_rule} or similar
25
+ """
26
+
27
+ def __init__(self, athena_query: AthenaQuery, config: DataLakeConfig):
28
+ """
29
+ Initialize catalog.
30
+
31
+ Args:
32
+ athena_query: AthenaQuery instance
33
+ config: DataLakeConfig instance
34
+ """
35
+ self.athena = athena_query
36
+ self.config = config
37
+ self._cache: Dict[str, Dict] = {}
38
+ logger.info(f"Initialized catalog for database: {config.database_name}")
39
+
40
+ def list_tables(self) -> List[str]:
41
+ """
42
+ List all tables in the database.
43
+
44
+ Returns:
45
+ Sorted list of table names
46
+ """
47
+ tables = self.athena.list_tables()
48
+ logger.info(f"Found {len(tables)} tables in database")
49
+ return sorted(tables)
50
+
51
+ def list_devices(self, device_filter: Optional[str] = None) -> List[str]:
52
+ """
53
+ List all device IDs by extracting from table names.
54
+
55
+ Args:
56
+ device_filter: Optional regex pattern to filter devices
57
+
58
+ Returns:
59
+ Sorted list of device IDs
60
+
61
+ Note:
62
+ Extracts device IDs from table names. Assumes table naming like:
63
+ - {device_id}_{message_rule}
64
+ - {device_id}__{message_rule}
65
+ - Or similar pattern
66
+ """
67
+ tables = self.list_tables()
68
+ devices = set()
69
+
70
+ for table in tables:
71
+ # Assume pattern: prefix_{device_id}_{message}; device is the second '_'-separated part
72
+ parts = re.split(r'_', table, maxsplit=2)
73
+ if len(parts) >= 2:
74
+ device = parts[1]
75
+ if device == 'aggregations': # skip aggregations table
76
+ continue
77
+ if device_filter is None or re.search(device_filter, device):
78
+ devices.add(device)
79
+
80
+ result = sorted(devices)
81
+ logger.info(f"Found {len(result)} device(s)")
82
+ return result
83
+
84
+ def list_messages(self, device_id: str, message_filter: Optional[str] = None) -> List[str]:
85
+ """
86
+ List all message/rule names for a device.
87
+
88
+ Args:
89
+ device_id: Device identifier
90
+ message_filter: Optional regex pattern to filter messages
91
+
92
+ Returns:
93
+ Sorted list of message/rule names
94
+
95
+ Note:
96
+ Extracts message names from table names. Assumes table naming like:
97
+ - prefix_{device_id}_{message_rule}
98
+ - Or {device_id}_{message_rule}
99
+ """
100
+ tables = self.list_tables()
101
+ messages = set()
102
+
103
+ for table in tables:
104
+ # Split table name by underscore (consistent with list_devices)
105
+ parts = re.split(r'_', table, maxsplit=2)
106
+
107
+ # Try pattern: prefix_device_message
108
+ if len(parts) >= 3:
109
+ table_device = parts[1]
110
+ if table_device == device_id:
111
+ message = parts[2]
112
+ if message_filter is None or re.search(message_filter, message):
113
+ messages.add(message)
114
+ # Try pattern: device_message (no prefix)
115
+ elif len(parts) >= 2:
116
+ table_device = parts[0]
117
+ if table_device == device_id:
118
+ message = parts[1]
119
+ if message_filter is None or re.search(message_filter, message):
120
+ messages.add(message)
121
+
122
+ result = sorted(messages)
123
+ logger.info(f"Found {len(result)} messages for device {device_id}")
124
+ return result
125
+
126
+ def get_table_name(self, device_id: str, message: str) -> str:
127
+ """
128
+ Get table name for device/message combination.
129
+
130
+ Args:
131
+ device_id: Device identifier
132
+ message: Message/rule name
133
+
134
+ Returns:
135
+ Table name (tries common patterns)
136
+
137
+ Raises:
138
+ ValueError: If table not found
139
+ """
140
+ tables = self.list_tables()
141
+
142
+ # Try patterns consistent with list_devices/list_messages
143
+ # Pattern 1: prefix_device_message
144
+ for table in tables:
145
+ parts = re.split(r'_', table, maxsplit=2)
146
+ if len(parts) >= 3:
147
+ if parts[1] == device_id and parts[2] == message:
148
+ return table
149
+
150
+ # Pattern 2: device_message (no prefix)
151
+ for table in tables:
152
+ parts = re.split(r'_', table, maxsplit=1)
153
+ if len(parts) >= 2:
154
+ if parts[0] == device_id and parts[1] == message:
155
+ return table
156
+
157
+ # Fallback: try exact matches
158
+ patterns = [
159
+ f"{device_id}_{message}",
160
+ f"{device_id}__{message}",
161
+ f"{device_id}_{message}".lower(),
162
+ f"{device_id}__{message}".lower(),
163
+ ]
164
+
165
+ for pattern in patterns:
166
+ if pattern in tables:
167
+ return pattern
168
+
169
+ raise ValueError(
170
+ f"Table not found for {device_id}/{message}. "
171
+ f"Available tables: {tables[:10]}..."
172
+ )
173
+
174
+ def get_schema(self, device_id: str, message: str) -> Optional[Dict[str, str]]:
175
+ """
176
+ Get schema for a message table.
177
+
178
+ Args:
179
+ device_id: Device identifier
180
+ message: Message/rule name
181
+
182
+ Returns:
183
+ Dict mapping column names to data types, or None if not found
184
+ """
185
+ cache_key = f"{device_id}/{message}"
186
+ if cache_key in self._cache:
187
+ logger.debug(f"Using cached schema for {cache_key}")
188
+ return self._cache[cache_key]
189
+
190
+ try:
191
+ table_name = self.get_table_name(device_id, message)
192
+ schema_df = self.athena.describe_table(table_name)
193
+
194
+ if schema_df.empty:
195
+ logger.warning(f"No schema found for {device_id}/{message}")
196
+ return None
197
+
198
+ schema_dict = {
199
+ row['column_name']: row['data_type']
200
+ for _, row in schema_df.iterrows()
201
+ }
202
+
203
+ self._cache[cache_key] = schema_dict
204
+ logger.info(f"Schema for {cache_key}: {len(schema_dict)} columns")
205
+ return schema_dict
206
+ except Exception as e:
207
+ logger.error(f"Failed to get schema for {device_id}/{message}: {e}")
208
+ return None
209
+
210
+ def list_partitions(self, device_id: str, message: str) -> List[str]:
211
+ """
212
+ List partition values (dates) for a table.
213
+
214
+ Args:
215
+ device_id: Device identifier
216
+ message: Message/rule name
217
+
218
+ Returns:
219
+ List of partition values (dates) in YYYY-MM-DD format
220
+
221
+ Note:
222
+ Handles hierarchical partitioning format: year=YYYY/month=MM/day=DD
223
+ Data structure: {device_id}/{message}/{year}/{month}/{day}/file.parquet
224
+ """
225
+ try:
226
+ table_name = self.get_table_name(device_id, message)
227
+
228
+ # Query partition information
229
+ # query = f"SHOW PARTITIONS {self.config.database_name}.{table_name}"
230
+ query = f"""
231
+ WITH files AS (
232
+ SELECT DISTINCT "$path" AS p
233
+ FROM {self.config.database_name}.{table_name}
234
+ WHERE "$path" LIKE '%.parquet'
235
+ ),
236
+ parts AS (
237
+ SELECT
238
+ try_cast(element_at(split(p, '/'), -4) AS INTEGER) AS year,
239
+ try_cast(element_at(split(p, '/'), -3) AS INTEGER) AS month,
240
+ try_cast(element_at(split(p, '/'), -2) AS INTEGER) AS day
241
+ FROM files
242
+ )
243
+ SELECT DISTINCT year, month, day
244
+ FROM parts
245
+ WHERE year IS NOT NULL AND month IS NOT NULL AND day IS NOT NULL
246
+ ORDER BY year DESC, month DESC, day DESC
247
+ """
248
+ df = self.athena.query_to_dataframe(query)
249
+
250
+ if df.empty:
251
+ logger.warning(f"No partitions found for {table_name}")
252
+ return []
253
+
254
+ # Extract date from partition string
255
+ # Format: YYYY-MM-DD
256
+ dates = []
257
+ for _, row in df.iterrows():
258
+ dates.append(f'{row.iloc[0]}-{row.iloc[1]:02d}-{row.iloc[2]:02d}')
259
+ logger.info(f"Found {len(dates)} partitions for {table_name}")
260
+ return sorted(set(dates))
261
+ except Exception as e:
262
+ logger.warning(f"Could not list partitions for {device_id}/{message}: {e}")
263
+ # Table might not be partitioned or query might have failed
264
+ return []
265
+
266
+ def clear_cache(self) -> None:
267
+ """Clear schema cache."""
268
+ self._cache.clear()
269
+ logger.debug("Schema cache cleared")
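The table-name convention the catalog assumes (`prefix_{device_id}_{message}`, split on `_` with `maxsplit=2`) can be illustrated with hypothetical table names:

```python
import re

# Hypothetical Glue table names following the prefix_device_message pattern
tables = [
    "tbl_device001_CAN_Message_001",
    "tbl_device001_CAN_Message_002",
    "tbl_device002_LIN_Frame_010",
]

devices = set()
messages_for_device001 = set()
for table in tables:
    # maxsplit=2 keeps any underscores inside the message name intact
    parts = re.split(r"_", table, maxsplit=2)
    if len(parts) >= 3:
        devices.add(parts[1])
        if parts[1] == "device001":
            messages_for_device001.add(parts[2])

print(sorted(devices))                 # ['device001', 'device002']
print(sorted(messages_for_device001))  # ['CAN_Message_001', 'CAN_Message_002']
```

The `maxsplit=2` is the load-bearing detail: message names like `CAN_Message_001` contain underscores themselves, so a plain `split('_')` would fragment them.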
src/datalake/config.py ADDED
@@ -0,0 +1,192 @@
1
+ """
2
+ Configuration management for data lake access.
3
+
4
+ Supports AWS Athena-based data lakes with configuration from
5
+ CloudFormation stack outputs.
6
+ """
7
+
8
+ from dataclasses import dataclass
9
+ from pathlib import Path
10
+ from typing import Optional
11
+ import boto3
12
+ from botocore.exceptions import ClientError
13
+
14
+
15
+ @dataclass
16
+ class DataLakeConfig:
17
+ """
18
+ Data lake configuration for AWS Athena-based data lakes.
19
+
20
+ Configuration can be loaded from CloudFormation stack outputs or
21
+ created directly with credentials.
22
+
23
+ Attributes:
24
+ stack_name: CloudFormation stack name (default: 'datalake-stack')
25
+ database_name: Athena database name
26
+ workgroup: Athena workgroup name (optional)
27
+ s3_output_location: S3 location for query results (must end with /)
28
+ region: AWS region
29
+ profile: AWS profile name for credentials (optional)
30
+ access_key_id: AWS access key ID (optional, for explicit credentials)
31
+ secret_access_key: AWS secret access key (optional, for explicit credentials)
32
+ device_filter: Optional device ID filter (e.g., 'device_001')
33
+ message_filter: Optional message/rule filter (e.g., 'CAN_Message_001')
34
+ cache_enabled: Enable schema caching
35
+ """
36
+ stack_name: str = "datalake-stack"
37
+ database_name: Optional[str] = None
38
+ workgroup: Optional[str] = None
39
+ s3_output_location: Optional[str] = None
40
+ region: str = "us-east-1"
41
+ profile: Optional[str] = None
42
+ access_key_id: Optional[str] = None
43
+ secret_access_key: Optional[str] = None
44
+ device_filter: Optional[str] = None
45
+ message_filter: Optional[str] = None
46
+ cache_enabled: bool = True
47
+
48
+ @classmethod
49
+ def from_cloudformation(
50
+ cls,
51
+ stack_name: str = "datalake-stack",
52
+ region: Optional[str] = None,
53
+ profile: Optional[str] = None,
54
+ ) -> "DataLakeConfig":
55
+ """
56
+ Load config from CloudFormation stack outputs.
57
+
58
+ Args:
59
+ stack_name: CloudFormation stack name (default: 'datalake-stack')
60
+ region: AWS region (if None, will try to get from stack or use default)
61
+ profile: AWS profile name for credentials (optional)
62
+
63
+ Returns:
64
+ DataLakeConfig instance with values from stack outputs
65
+
66
+ Raises:
67
+ ClientError: If stack doesn't exist or can't be accessed
68
+ KeyError: If required stack outputs are missing
69
+
70
+ Expected CloudFormation stack outputs:
71
+ - DatabaseName: Athena database name (required)
72
+ - WorkGroup: Athena workgroup name (optional)
73
+ - S3OutputLocation: S3 location for Athena query results (required)
74
+ - Region: AWS region (optional, will use provided region or default)
75
+ """
76
+ session = boto3.Session(profile_name=profile)
77
+ if region:
78
+ cf_client = session.client('cloudformation', region_name=region)
79
+ else:
80
+ # Try to get region from default config
81
+ try:
82
+ region = session.region_name or "us-east-1"
83
+ except Exception:
84
+ region = "us-east-1"
85
+ cf_client = session.client('cloudformation', region_name=region)
86
+
87
+ try:
88
+ response = cf_client.describe_stacks(StackName=stack_name)
89
+ except ClientError as e:
90
+ raise ClientError(
91
+ {
92
+ 'Error': {
93
+ 'Code': 'StackNotFound',
94
+ 'Message': f"CloudFormation stack '{stack_name}' not found. "
95
+ f"Make sure the stack exists and you have permissions to access it."
96
+ }
97
+ },
98
+ 'DescribeStacks'
99
+ ) from e
100
+
101
+ if not response['Stacks']:
102
+ raise ValueError(f"Stack '{stack_name}' not found")
103
+
104
+ stack = response['Stacks'][0]
105
+ outputs = {output['OutputKey']: output['OutputValue']
106
+ for output in stack.get('Outputs', [])}
107
+
108
+ # Get region from stack or use provided/default
109
+ if not region:
110
+ region = outputs.get('Region', session.region_name or "us-east-1")
111
+
112
+ # Required outputs
113
+ database_name = outputs.get('DatabaseName')
114
+ if not database_name:
115
+ raise KeyError(
116
+ f"Required output 'DatabaseName' not found in stack '{stack_name}'. "
117
+ f"Available outputs: {list(outputs.keys())}"
118
+ )
119
+
120
+ s3_output_location = outputs.get('S3OutputLocation')
121
+ if not s3_output_location:
122
+ raise KeyError(
123
+ f"Required output 'S3OutputLocation' not found in stack '{stack_name}'. "
124
+ f"Available outputs: {list(outputs.keys())}"
125
+ )
126
+
127
+ # Optional outputs
128
+ workgroup = outputs.get('WorkGroup')
129
+
130
+ return cls(
131
+ stack_name=stack_name,
132
+ database_name=database_name,
133
+ workgroup=workgroup,
134
+ s3_output_location=s3_output_location,
135
+ region=region,
136
+ profile=profile,
137
+ )
138
+
139
+ @classmethod
140
+ def from_credentials(
141
+ cls,
142
+ database_name: str,
143
+ workgroup: str,
144
+ s3_output_location: str,
145
+ region: str,
146
+ access_key_id: str,
147
+ secret_access_key: str,
148
+ ) -> "DataLakeConfig":
149
+ """
150
+ Create config directly with AWS credentials.
151
+
152
+ Args:
153
+ database_name: Athena database name
154
+ workgroup: Athena workgroup name
155
+ s3_output_location: S3 location for query results (must end with /)
156
+ region: AWS region
157
+ access_key_id: AWS access key ID
158
+ secret_access_key: AWS secret access key
159
+
160
+ Returns:
161
+ DataLakeConfig instance
162
+ """
163
+ # Ensure S3 output location ends with /
164
+ if s3_output_location and not s3_output_location.endswith('/'):
165
+ s3_output_location = s3_output_location + '/'
166
+
167
+ return cls(
168
+ database_name=database_name,
169
+ workgroup=workgroup,
170
+ s3_output_location=s3_output_location,
171
+ region=region,
172
+ access_key_id=access_key_id,
173
+ secret_access_key=secret_access_key,
174
+ )
175
+
176
+ def get_boto3_session(self) -> boto3.Session:
177
+ """
178
+ Get boto3 session with configured credentials, profile, and region.
179
+
180
+ Returns:
181
+ boto3.Session instance
182
+ """
183
+ if self.access_key_id and self.secret_access_key:
184
+ # Use explicit credentials
185
+ return boto3.Session(
186
+ aws_access_key_id=self.access_key_id,
187
+ aws_secret_access_key=self.secret_access_key,
188
+ region_name=self.region,
189
+ )
190
+ else:
191
+ # Use profile or IAM role
192
+ return boto3.Session(profile_name=self.profile, region_name=self.region)
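The trailing-slash normalization that `from_credentials` applies to `s3_output_location` (Athena requires the result location to end with `/`) can be shown as a standalone helper — an illustrative sketch, not part of the module:

```python
def normalize_s3_output(location):
    """Ensure an S3 result location ends with '/', as Athena expects."""
    if location and not location.endswith('/'):
        return location + '/'
    return location

print(normalize_s3_output("s3://my-bucket/athena-results"))   # slash appended
print(normalize_s3_output("s3://my-bucket/athena-results/"))  # left unchanged
```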
src/datalake/logger.py ADDED
@@ -0,0 +1,33 @@
1
+ """
2
+ Structured logging utilities for the datalake package.
3
+ """
4
+
5
+ import logging
6
+ from typing import Optional
7
+
8
+
9
+ def setup_logger(name: str, level: str = "INFO") -> logging.Logger:
10
+ """
11
+ Initialize logger with structured output.
12
+
13
+ Args:
14
+ name: Logger module name
15
+ level: Logging level (INFO, DEBUG, WARNING, ERROR)
16
+
17
+ Returns:
18
+ Configured logger instance
19
+ """
20
+ logger = logging.getLogger(name)
21
+
22
+ # Avoid adding duplicate handlers
23
+ if logger.handlers:
24
+ return logger
25
+
26
+ handler = logging.StreamHandler()
27
+ formatter = logging.Formatter(
28
+ '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
29
+ )
30
+ handler.setFormatter(formatter)
31
+ logger.addHandler(handler)
32
+ logger.setLevel(getattr(logging, level.upper(), logging.INFO))
33
+ return logger
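The duplicate-handler guard in `setup_logger` matters because `logging.getLogger(name)` returns the same logger object on every call — without the early return, each call would attach another handler and messages would print multiple times. A self-contained sketch of the same pattern:

```python
import logging

def setup_logger(name, level="INFO"):
    logger = logging.getLogger(name)
    # Early return: a second call for the same name must not add a second handler
    if logger.handlers:
        return logger
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    ))
    logger.addHandler(handler)
    logger.setLevel(getattr(logging, level.upper(), logging.INFO))
    return logger

a = setup_logger("demo")
b = setup_logger("demo")
print(len(a.handlers))  # 1 — the second call reused the configured logger
```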
src/datalake/query.py ADDED
@@ -0,0 +1,277 @@
1
+ """
2
+ Query interface for data lake using AWS Athena SQL queries.
3
+
4
+ Provides methods to read and filter data from the Athena data lake
5
+ using SQL queries with support for device/message filtering and date ranges.
6
+ """
7
+
8
+ from typing import List, Optional, Tuple
9
+ import pandas as pd
10
+ from .athena import AthenaQuery
11
+ from .catalog import DataLakeCatalog
12
+ from .config import DataLakeConfig
13
+ from .logger import setup_logger
14
+
15
+ logger = setup_logger(__name__)
16
+
17
+
18
+ class DataLakeQuery:
19
+ """
20
+ Query interface for Athena-based data lake.
21
+
22
+ Provides efficient methods to read data using SQL queries,
23
+ with support for filtering by device, message, date range, and time windows.
24
+ """
25
+
26
+ def __init__(self, athena_query: AthenaQuery, catalog: DataLakeCatalog):
27
+ """
28
+ Initialize query engine.
29
+
30
+ Args:
31
+ athena_query: AthenaQuery instance
32
+ catalog: Data lake catalog
33
+ """
34
+ self.athena = athena_query
35
+ self.catalog = catalog
36
+ logger.info("Initialized DataLakeQuery")
37
+
38
+ def read_device_message(
39
+ self,
40
+ device_id: str,
41
+ message: str,
42
+ date_range: Optional[Tuple[str, str]] = None,
43
+ columns: Optional[List[str]] = None,
44
+ limit: Optional[int] = None,
45
+ ) -> pd.DataFrame:
46
+ """
47
+ Read all data for a device/message combination using SQL.
48
+
49
+ Args:
50
+ device_id: Device identifier
51
+ message: Message/rule name
52
+ date_range: Optional (start_date, end_date) tuple (YYYY-MM-DD format)
53
+ columns: Optional column subset to read (improves performance)
54
+ limit: Optional row limit
55
+
56
+ Returns:
57
+ DataFrame with query results
58
+ """
59
+ table_name = self.catalog.get_table_name(device_id, message)
60
+
61
+ # Build SELECT clause
62
+ if columns:
63
+ # Validate columns exist
64
+ schema = self.catalog.get_schema(device_id, message)
65
+ if schema:
66
+ valid_columns = [c for c in columns if c in schema]
67
+ if not valid_columns:
68
+ logger.warning(f"None of requested columns found for {device_id}/{message}; using all columns")
69
+ select_clause = "*"
70
+ else:
71
+ select_clause = ", ".join(valid_columns)
72
+ else:
73
+ select_clause = "*"
74
+ else:
75
+ select_clause = "*"
76
+
77
+ # Build WHERE clause
78
+ where_conditions = []
79
+
80
+ if date_range:
81
+ start_date, end_date = date_range
82
+ # Parse dates and filter using $path column
83
+ # Format: YYYY-MM-DD
84
+ # Data structure: {device_id}/{message}/{year}/{month}/{day}/file.parquet
85
+ start_parts = start_date.split('-')
86
+ end_parts = end_date.split('-')
87
+
88
+ if len(start_parts) == 3 and len(end_parts) == 3:
89
+ start_year, start_month, start_day = (int(p) for p in start_parts)
90
+ end_year, end_month, end_day = (int(p) for p in end_parts)
91
+
92
+ # Extract year, month, day from path and filter
93
+ # Path structure: .../year/month/day/file.parquet
94
+ # Use element_at(split($path, '/'), -4) for year, -3 for month, -2 for day
95
+ path_year = "try_cast(element_at(split(\"$path\", '/'), -4) AS INTEGER)"
96
+ path_month = "try_cast(element_at(split(\"$path\", '/'), -3) AS INTEGER)"
97
+ path_day = "try_cast(element_at(split(\"$path\", '/'), -2) AS INTEGER)"
98
+
99
+ # Build partition filter using path-based extraction
100
+ # This handles hierarchical partitioning: {device_id}/{message}/{year}/{month}/{day}/file.parquet
101
+ where_conditions.append(
102
+ f"({path_year} > {start_year} OR "
103
+ f"({path_year} = {start_year} AND "
104
+ f"({path_month} > {start_month} OR "
105
+ f"({path_month} = {start_month} AND {path_day} >= {start_day}))))"
106
+ )
107
+ where_conditions.append(
108
+ f"({path_year} < {end_year} OR "
109
+ f"({path_year} = {end_year} AND "
110
+ f"({path_month} < {end_month} OR "
111
+ f"({path_month} = {end_month} AND {path_day} <= {end_day}))))"
112
+ )
113
+ else:
114
+ # Fallback: try date column if it exists
115
+ where_conditions.append(f"date >= '{start_date}' AND date <= '{end_date}'")
116
+
117
+ where_clause = ""
118
+ if where_conditions:
119
+ where_clause = "WHERE " + " AND ".join(where_conditions)
120
+
121
+ # Build LIMIT clause
122
+ limit_clause = f"LIMIT {limit}" if limit else ""
123
+
124
+ query = f"""
125
+ SELECT {select_clause}
126
+ FROM {self.catalog.config.database_name}.{table_name}
127
+ {where_clause}
128
+ {limit_clause}
129
+ """
130
+
131
+ logger.info(f"Executing query for {device_id}/{message}")
132
+ return self.athena.query_to_dataframe(query)
133
+
134
+ def read_date_range(
135
+ self,
136
+ device_id: str,
137
+ message: str,
138
+ start_date: str,
139
+ end_date: str,
140
+ columns: Optional[List[str]] = None,
141
+ ) -> pd.DataFrame:
142
+ """
143
+ Read data for a specific date range.
144
+
145
+ Convenience method wrapping read_device_message with date range.
146
+
147
+ Args:
148
+ device_id: Device identifier
149
+ message: Message name
150
+ start_date: Start date (YYYY-MM-DD format)
151
+ end_date: End date (YYYY-MM-DD format)
152
+ columns: Optional column subset
153
+
154
+ Returns:
155
+ DataFrame with data for the date range
156
+ """
157
+ return self.read_device_message(
158
+ device_id=device_id,
159
+ message=message,
160
+ date_range=(start_date, end_date),
161
+ columns=columns,
162
+ )
163
+
164
+ def time_series_query(
165
+ self,
166
+ device_id: str,
167
+ message: str,
168
+ signal_name: str,
169
+ start_time: Optional[int] = None,
170
+ end_time: Optional[int] = None,
171
+ limit: Optional[int] = None,
172
+ ) -> pd.DataFrame:
173
+ """
174
+ Query single signal as time series.
175
+
176
+ Args:
177
+ device_id: Device identifier
178
+ message: Message name
179
+ signal_name: Signal column name
180
+ start_time: Min timestamp (microseconds since epoch)
181
+ end_time: Max timestamp (microseconds since epoch)
182
+ limit: Optional row limit
183
+
184
+ Returns:
185
+ DataFrame with 't' (timestamp) and signal columns, sorted by time
186
+ """
187
+ table_name = self.catalog.get_table_name(device_id, message)
188
+
189
+ # Build WHERE clause
190
+ where_conditions = []
191
+
192
+ if start_time is not None:
193
+ where_conditions.append(f"t >= {start_time}")
194
+ if end_time is not None:
195
+ where_conditions.append(f"t <= {end_time}")
196
+
197
+ where_clause = ""
198
+ if where_conditions:
199
+ where_clause = "WHERE " + " AND ".join(where_conditions)
200
+
201
+ limit_clause = f"LIMIT {limit}" if limit else ""
202
+
203
+ query = f"""
204
+ SELECT t, {signal_name}
205
+ FROM {self.catalog.config.database_name}.{table_name}
206
+ {where_clause}
207
+ ORDER BY t
208
+ {limit_clause}
209
+ """
210
+
211
+ logger.info(f"Time series query for {device_id}/{message}/{signal_name}")
212
+ return self.athena.query_to_dataframe(query)
213
+
214
+ def execute_sql(self, sql: str) -> pd.DataFrame:
215
+ """
216
+ Execute custom SQL query.
217
+
218
+ Args:
219
+ sql: SQL query string
220
+
221
+ Returns:
222
+ DataFrame with query results
223
+
224
+ Note:
225
+ Query should reference tables in the format:
226
+ {database_name}.{table_name}
227
+ """
228
+ logger.info("Executing custom SQL query")
229
+ return self.athena.query_to_dataframe(sql)
230
+
231
+ def aggregate(
232
+ self,
233
+ device_id: str,
234
+ message: str,
235
+ aggregation: str,
236
+ group_by: Optional[List[str]] = None,
237
+ where_clause: Optional[str] = None,
238
+ ) -> pd.DataFrame:
239
+ """
240
+ Execute aggregation query.
241
+
242
+ Args:
243
+ device_id: Device identifier
244
+ message: Message name
245
+ aggregation: Aggregation expression (e.g., "COUNT(*), AVG(RPM)")
246
+ group_by: Optional list of columns to group by
247
+ where_clause: Optional WHERE clause (without WHERE keyword)
248
+
249
+ Returns:
250
+ DataFrame with aggregation results
251
+
252
+ Example:
253
+ df = query.aggregate(
254
+ "device_001", "EngineData",
255
+ "COUNT(*) as count, AVG(RPM) as avg_rpm, MIN(RPM) as min_rpm",
256
+ group_by=["date"]
257
+ )
258
+ """
259
+ table_name = self.catalog.get_table_name(device_id, message)
260
+
261
+ group_by_clause = ""
262
+ if group_by:
263
+ group_by_clause = f"GROUP BY {', '.join(group_by)}"
264
+
265
+ where_clause_sql = ""
266
+ if where_clause:
267
+ where_clause_sql = f"WHERE {where_clause}"
268
+
269
+ query = f"""
270
+ SELECT {aggregation}
271
+ FROM {self.catalog.config.database_name}.{table_name}
272
+ {where_clause_sql}
273
+ {group_by_clause}
274
+ """
275
+
276
+ logger.info(f"Aggregation query for {device_id}/{message}")
277
+ return self.athena.query_to_dataframe(query)
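The nested OR/AND conditions built in `read_device_message` implement a lexicographic comparison on the `(year, month, day)` components extracted from the S3 path `{device_id}/{message}/{year}/{month}/{day}/file.parquet`. A minimal pure-Python sketch of the same boundary logic (the helper name `in_range` is ours, for illustration):

```python
def in_range(d, start, end):
    """True when tuple d = (year, month, day) lies in [start, end]."""
    y, m, day = d
    sy, sm, sd = start
    ey, em, ed = end
    # Mirrors the first WHERE condition: d >= start
    lower = y > sy or (y == sy and (m > sm or (m == sm and day >= sd)))
    # Mirrors the second WHERE condition: d <= end
    upper = y < ey or (y == ey and (m < em or (m == em and day <= ed)))
    return lower and upper

# The nested form is equivalent to lexicographic tuple comparison:
assert in_range((2024, 3, 15), (2024, 3, 1), (2024, 3, 31)) == \
    ((2024, 3, 1) <= (2024, 3, 15) <= (2024, 3, 31))
```

This is why the generated clause stays correct across month and year boundaries, e.g. a range from 2023-12-31 to 2024-01-02.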
src/examples/__init__.py ADDED
@@ -0,0 +1 @@
1
+ # Examples package
src/examples/batch_example.py ADDED
@@ -0,0 +1,169 @@
1
+ """
2
+ Example: Batch processing patterns for large-scale analysis using Athena.
3
+
4
+ This script demonstrates memory-efficient batch processing across
5
+ the entire data lake using SQL queries.
6
+ """
7
+
8
+ from datalake.config import DataLakeConfig
9
+ from datalake.athena import AthenaQuery
10
+ from datalake.catalog import DataLakeCatalog
11
+ from datalake.query import DataLakeQuery
12
+ from datalake.batch import BatchProcessor
13
+ import pandas as pd
14
+
15
+
16
+ def main():
17
+ """Batch process data lake."""
18
+ # Setup
19
+ # Load config with explicit credentials
20
+ config = DataLakeConfig.from_credentials(
21
+ database_name="dbparquetdatalake05",
22
+ workgroup="athenaworkgroup-datalake05",
23
+ s3_output_location="s3://canedge-raw-data-parquet/athena-results/",
24
+ region="eu-north-1",
25
+ access_key_id="YOUR_ACCESS_KEY_ID",  # placeholder - never commit real credentials
26
+ secret_access_key="YOUR_SECRET_ACCESS_KEY",  # placeholder
27
+ )
28
+
29
+ athena = AthenaQuery(config)
30
+ catalog = DataLakeCatalog(athena, config)
31
+ query = DataLakeQuery(athena, catalog)
32
+ processor = BatchProcessor(query)
33
+
34
+ print("=" * 60)
35
+ print("Batch Processing Examples (Athena)")
36
+ print("=" * 60)
37
+ print()
38
+
39
+ # Example 1: Compute statistics across all data
40
+ print("Example 1: Compute statistics across all devices/messages")
41
+ print("-" * 60)
42
+
43
+ try:
44
+ stats = processor.aggregate_by_device_message(
45
+ aggregation_func=processor.compute_statistics,
46
+ message_filter=config.message_filter, # Optional filter
47
+ )
48
+
49
+ print(f"Processed {len(stats)} device(s):")
50
+ for device, messages in stats.items():
51
+ print(f"\n Device: {device}")
52
+ for message, metrics in messages.items():
53
+ print(f" Message: {message}")
54
+ print(f" Record count: {metrics.get('count', 0):,}")
55
+
56
+ # Show statistics for first numeric column found
57
+ for key, value in metrics.items():
58
+ if key != 'count' and '_mean' in key:
59
+ signal = key.replace('_mean', '')
60
+ print(f" {signal}:")
61
+ print(f" Mean: {value:.2f}")
62
+ print(f" Min: {metrics.get(f'{signal}_min', 'N/A')}")
63
+ print(f" Max: {metrics.get(f'{signal}_max', 'N/A')}")
64
+ break
65
+ except Exception as e:
66
+ print(f"Error in batch aggregation: {e}")
67
+ print()
68
+
69
+ # Example 2: Custom aggregation using SQL
70
+ print("Example 2: Custom SQL aggregation")
71
+ print("-" * 60)
72
+
73
+ try:
74
+ devices = catalog.list_devices()
75
+ if devices:
76
+ device_id = devices[0]
77
+ messages = catalog.list_messages(device_id)
78
+ if messages:
79
+ message = messages[0]
80
+ table_name = catalog.get_table_name(device_id, message)
81
+
82
+ # Use SQL for aggregation
83
+ sql = f"""
84
+ SELECT
85
+ COUNT(*) as record_count,
86
+ MIN(t) as min_timestamp,
87
+ MAX(t) as max_timestamp
88
+ FROM {config.database_name}.{table_name}
89
+ """
90
+
91
+ df_agg = query.execute_sql(sql)
92
+ print(f"Aggregation for {device_id}/{message}:")
93
+ print(df_agg)
94
+ except Exception as e:
95
+ print(f"Error in SQL aggregation: {e}")
96
+ print()
97
+
98
+ # Example 3: Export specific data
99
+ print("Example 3: Export data to CSV")
100
+ print("-" * 60)
101
+
102
+ try:
103
+ devices = catalog.list_devices()
104
+ if devices:
105
+ device_id = devices[0]
106
+ messages = catalog.list_messages(device_id)
107
+ if messages:
108
+ message = messages[0]
109
+ output_path = f"{device_id}_{message}_export.csv"
110
+
111
+ processor.export_to_csv(
112
+ device_id=device_id,
113
+ message=message,
114
+ output_path=output_path,
115
+ limit=10000, # Limit for example
116
+ )
117
+ print(f"Exported to: {output_path}")
118
+ except Exception as e:
119
+ print(f"Error exporting data: {e}")
120
+ print()
121
+
122
+ # Example 4: Find anomalies using SQL
123
+ print("Example 4: Find anomalies using SQL")
124
+ print("-" * 60)
125
+
126
+ try:
127
+ devices = catalog.list_devices()
128
+ if devices:
129
+ device_id = devices[0]
130
+ messages = catalog.list_messages(device_id)
131
+ if messages:
132
+ message = messages[0]
133
+ schema = catalog.get_schema(device_id, message)
134
+
135
+ if schema:
136
+ signal_cols = [c for c in schema.keys() if c != 't' and c.lower() != 'date']
137
+ if signal_cols:
138
+ signal_name = signal_cols[0]
139
+ table_name = catalog.get_table_name(device_id, message)
140
+
141
+ # Use SQL to find outliers (3 standard deviations)
142
+ sql = f"""
143
+ WITH stats AS (
144
+ SELECT
145
+ AVG({signal_name}) as mean_val,
146
+ STDDEV({signal_name}) as std_val
147
+ FROM {config.database_name}.{table_name}
148
+ WHERE {signal_name} IS NOT NULL
149
+ )
150
+ SELECT t, {signal_name}
151
+ FROM {config.database_name}.{table_name}, stats
152
+ WHERE {signal_name} IS NOT NULL
153
+ AND ABS({signal_name} - mean_val) > 3 * std_val
154
+ ORDER BY ABS({signal_name} - mean_val) DESC
155
+ LIMIT 10
156
+ """
157
+
158
+ anomalies = query.execute_sql(sql)
159
+ if not anomalies.empty:
160
+ print(f"Found {len(anomalies)} anomalies in {signal_name}")
161
+ print(anomalies.head())
162
+ else:
163
+ print("No anomalies found")
164
+ except Exception as e:
165
+ print(f"Error finding anomalies: {e}")
166
+
167
+
168
+ if __name__ == "__main__":
169
+ main()
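Example 4 above pushes the 3-standard-deviation rule into Athena SQL. For reference, the same predicate in plain Python (the signal values here are made up for illustration):

```python
import statistics

# Synthetic signal: stable readings around 10.0 plus one spike.
values = [10.0] * 30 + [50.0]

mean = statistics.mean(values)
std = statistics.stdev(values)  # sample stddev, matching SQL STDDEV

# Same predicate as the SQL: ABS(x - mean) > 3 * std
outliers = [v for v in values if abs(v - mean) > 3 * std]
```

Running this flags only the 50.0 spike; the SQL version simply lets Athena compute `mean_val` and `std_val` in a CTE instead of in memory.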
src/examples/explore_example.py ADDED
@@ -0,0 +1,96 @@
1
+ """
2
+ Example: Explore data lake structure using Athena.
3
+
4
+ This script demonstrates how to discover devices, messages, dates,
5
+ and schemas in the CANedge Athena data lake.
6
+ """
7
+
8
+ from datalake.config import DataLakeConfig
9
+ from datalake.athena import AthenaQuery
10
+ from datalake.catalog import DataLakeCatalog
11
+
12
+
13
+ def main():
14
+ """Explore data lake structure."""
15
+ # Load config with explicit credentials
16
+ config = DataLakeConfig.from_credentials(
17
+ database_name="dbparquetdatalake05",
18
+ workgroup="athenaworkgroup-datalake05",
19
+ s3_output_location="s3://canedge-raw-data-parquet/athena-results/",
20
+ region="eu-north-1",
21
+ access_key_id="AKIARJQJFFVASPMSGNNY",
22
+ secret_access_key="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",
23
+ )
24
+
25
+ # Initialize Athena and catalog
26
+ athena = AthenaQuery(config)
27
+ catalog = DataLakeCatalog(athena, config)
28
+
29
+ # List available devices
30
+ print("=" * 60)
31
+ print("Exploring Data Lake (Athena)")
32
+ print("=" * 60)
33
+ print(f"Database: {config.database_name}")
34
+ print(f"Region: {config.region}")
35
+ print(f"Workgroup: {config.workgroup}")
36
+ print()
37
+
38
+ # List all tables
39
+ try:
40
+ tables = catalog.list_tables()
41
+ print(f"Found {len(tables)} table(s) in database")
42
+ if tables:
43
+ print(f"Sample tables: {tables[:10]}")
44
+ print()
45
+ except Exception as e:
46
+ print(f"Error listing tables: {e}")
47
+ return
48
+
49
+ # List devices
50
+ try:
51
+ devices = catalog.list_devices(device_filter=config.device_filter)
52
+ print(f"Found {len(devices)} device(s):")
53
+ for device in devices:
54
+ print(f" - {device}")
55
+ except Exception as e:
56
+ print(f"Error listing devices: {e}")
57
+ return
58
+
59
+ # List messages for first device
60
+ if devices:
61
+ device_id = devices[0]
62
+ print(f"\nMessages for device '{device_id}':")
63
+ try:
64
+ messages = catalog.list_messages(device_id, message_filter=config.message_filter)
65
+
66
+ for message in messages:
67
+ print(f" - {message}")
68
+
69
+ # Get schema
70
+ try:
71
+ schema = catalog.get_schema(device_id, message)
72
+
73
+ if schema:
74
+ print(f" Schema: {len(schema)} column(s)")
75
+ print(f" Columns: {', '.join(list(schema.keys())[:5])}")
76
+ if len(schema) > 5:
77
+ print(f" ... and {len(schema) - 5} more")
78
+ except Exception as e:
79
+ print(f" Error getting schema: {e}")
80
+
81
+ # Try to list partitions (dates)
82
+ try:
83
+ partitions = catalog.list_partitions(device_id, message)
84
+ if partitions:
85
+ print(f" Partitions: {len(partitions)} date(s)")
86
+ if partitions:
87
+ print(f" Date range: {partitions[0]} to {partitions[-1]}")
88
+ except Exception as e:
89
+ print(f" Could not list partitions: {e}")
90
+ print()
91
+ except Exception as e:
92
+ print(f"Error listing messages: {e}")
93
+
94
+
95
+ if __name__ == "__main__":
96
+ main()
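Based on the table names shown later in this commit (e.g. `tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m00` or `b8280fd1_can9_gnssspeed`), a device/message split can be sketched as below. This parser is an assumption, not the catalog's actual implementation; the real convention lives in `DataLakeCatalog.get_table_name`.

```python
def split_table_name(table_name):
    """Split a table name like 'tbl_97a4aaf4_can1_obd2_...' into
    (device_id, message). Assumes an optional 'tbl_' prefix followed by
    the device id - an assumption based on the table names in this repo."""
    name = table_name.removeprefix("tbl_")
    device_id, _, message = name.partition("_")
    return device_id, message
```

Note that the notebook output later in this commit reports `Found 1 device(s): - tbl`, which suggests the shipped catalog does not strip the `tbl_` prefix before splitting.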
src/examples/query_example.py ADDED
@@ -0,0 +1,188 @@
1
+ """
2
+ Example: Query and analyze data from the Athena data lake.
3
+
4
+ This script demonstrates how to read data for specific devices/messages,
5
+ perform time series queries, and filter by date ranges using SQL.
6
+ """
7
+
8
+ from datalake.config import DataLakeConfig
9
+ from datalake.athena import AthenaQuery
10
+ from datalake.catalog import DataLakeCatalog
11
+ from datalake.query import DataLakeQuery
12
+ import pandas as pd
13
+
14
+
15
+ def main():
16
+ """Query and analyze data."""
17
+ # Setup
18
+ # Load config with explicit credentials
19
+ config = DataLakeConfig.from_credentials(
20
+ database_name="dbparquetdatalake05",
21
+ workgroup="athenaworkgroup-datalake05",
22
+ s3_output_location="s3://canedge-raw-data-parquet/athena-results/",
23
+ region="eu-north-1",
24
+ access_key_id="YOUR_ACCESS_KEY_ID",  # placeholder - never commit real credentials
25
+ secret_access_key="YOUR_SECRET_ACCESS_KEY",  # placeholder
26
+ )
27
+
28
+ athena = AthenaQuery(config)
29
+ catalog = DataLakeCatalog(athena, config)
30
+ query = DataLakeQuery(athena, catalog)
31
+
32
+ # Get first available device and message
33
+ try:
34
+ devices = catalog.list_devices()
35
+ if not devices:
36
+ print("No devices found in data lake")
37
+ return
38
+
39
+ device_id = devices[0]
40
+ messages = catalog.list_messages(device_id)
41
+ if not messages:
42
+ print(f"No messages found for device {device_id}")
43
+ return
44
+
45
+ message = messages[0]
46
+ except Exception as e:
47
+ print(f"Error discovering devices/messages: {e}")
48
+ return
49
+
50
+ print("=" * 60)
51
+ print("Querying Data Lake (Athena)")
52
+ print("=" * 60)
53
+ print(f"Device: {device_id}")
54
+ print(f"Message: {message}")
55
+ print()
56
+
57
+ # Example 1: Read all data for device/message
58
+ print("Example 1: Read all data")
59
+ print("-" * 60)
60
+ try:
61
+ df = query.read_device_message(
62
+ device_id=device_id,
63
+ message=message,
64
+ columns=["t"], # Only read timestamp initially to check structure
65
+ limit=100, # Limit for example
66
+ )
67
+ print(f"Loaded {len(df)} records")
68
+ if not df.empty:
69
+ print(f"Columns: {list(df.columns)}")
70
+ if 't' in df.columns:
71
+ print(f"Time range: {df['t'].min()} to {df['t'].max()} microseconds")
72
+ print(f"Sample data:")
73
+ print(df.head())
74
+ except Exception as e:
75
+ print(f"Error reading data: {e}")
76
+ print()
77
+
78
+ # Example 2: Read with date range
79
+ print("Example 2: Read with date range")
80
+ print("-" * 60)
81
+ try:
82
+ partitions = catalog.list_partitions(device_id, message)
83
+ if partitions:
84
+ start_date = partitions[0]
85
+ end_date = partitions[-1] if len(partitions) > 1 else partitions[0]
86
+ print(f"Date range: {start_date} to {end_date}")
87
+
88
+ df_date = query.read_date_range(
89
+ device_id=device_id,
90
+ message=message,
91
+ start_date=start_date,
92
+ end_date=end_date,
93
+ limit=100,
94
+ )
95
+ print(f"Loaded {len(df_date)} records for date range")
96
+ except Exception as e:
97
+ print(f"Error reading date range: {e}")
98
+ print()
99
+
100
+ # Example 3: Time series query (if signal columns exist)
101
+ print("Example 3: Time series query")
102
+ print("-" * 60)
103
+ try:
104
+ schema = catalog.get_schema(device_id, message)
105
+ if schema:
106
+ # Find first signal column (not 't')
107
+ signal_cols = [c for c in schema.keys() if c != 't' and c.lower() != 'date']
108
+ if signal_cols:
109
+ signal_name = signal_cols[0]
110
+ print(f"Querying signal: {signal_name}")
111
+
112
+ df_ts = query.time_series_query(
113
+ device_id=device_id,
114
+ message=message,
115
+ signal_name=signal_name,
116
+ limit=100,
117
+ )
118
+
119
+ if not df_ts.empty:
120
+ print(f"Time series: {len(df_ts)} records")
121
+ # Convert timestamp to datetime for display
122
+ if 't' in df_ts.columns:
123
+ df_ts['timestamp'] = pd.to_datetime(df_ts['t'], unit='us')
124
+ print(df_ts[['timestamp', signal_name]].head())
125
+
126
+ # Basic statistics
127
+ print(f"\nStatistics for {signal_name}:")
128
+ print(f" Mean: {df_ts[signal_name].mean():.2f}")
129
+ print(f" Min: {df_ts[signal_name].min():.2f}")
130
+ print(f" Max: {df_ts[signal_name].max():.2f}")
131
+ except Exception as e:
132
+ print(f"Error in time series query: {e}")
133
+ print()
134
+
135
+ # Example 4: Custom SQL query
136
+ print("Example 4: Custom SQL query")
137
+ print("-" * 60)
138
+ try:
139
+ table_name = catalog.get_table_name(device_id, message)
140
+ custom_sql = f"""
141
+ SELECT COUNT(*) as record_count,
142
+ MIN(t) as min_time,
143
+ MAX(t) as max_time
144
+ FROM {config.database_name}.{table_name}
145
+ LIMIT 1
146
+ """
147
+
148
+ df_custom = query.execute_sql(custom_sql)
149
+ print("Custom query results:")
150
+ print(df_custom)
151
+ except Exception as e:
152
+ print(f"Error in custom SQL query: {e}")
153
+ print()
154
+
155
+ # Example 5: Aggregation query
156
+ print("Example 5: Aggregation query")
157
+ print("-" * 60)
158
+ try:
159
+ partitions = catalog.list_partitions(device_id, message)
160
+ if partitions:
161
+ # Filter by date using path-based extraction
162
+ # Data structure: {device_id}/{message}/{year}/{month}/{day}/file.parquet
163
+ target_date = partitions[0]
164
+ date_parts = target_date.split('-')
165
+ if len(date_parts) == 3:
166
+ year, month, day = date_parts
167
+ # Use path-based filtering consistent with data architecture
168
+ path_year = "try_cast(element_at(split(\"$path\", '/'), -4) AS INTEGER)"
169
+ path_month = "try_cast(element_at(split(\"$path\", '/'), -3) AS INTEGER)"
170
+ path_day = "try_cast(element_at(split(\"$path\", '/'), -2) AS INTEGER)"
171
+ where_clause = f"{path_year} = {year} AND {path_month} = {month} AND {path_day} = {day}"
172
+ else:
173
+ where_clause = None
174
+
175
+ df_agg = query.aggregate(
176
+ device_id=device_id,
177
+ message=message,
178
+ aggregation="COUNT(*) as count, AVG(t) as avg_time",
179
+ where_clause=where_clause,
180
+ )
181
+ print("Aggregation results:")
182
+ print(df_agg)
183
+ except Exception as e:
184
+ print(f"Error in aggregation query: {e}")
185
+
186
+
187
+ if __name__ == "__main__":
188
+ main()
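The examples treat the `t` column as microseconds since the Unix epoch, converting it with `pd.to_datetime(df['t'], unit='us')`. The scalar equivalent without pandas, using an illustrative timestamp value:

```python
from datetime import datetime, timezone

t_us = 1_700_000_000_000_000  # illustrative value: microseconds since epoch

# Divide by 1e6 to get seconds, then build a timezone-aware datetime.
dt = datetime.fromtimestamp(t_us / 1_000_000, tz=timezone.utc)
```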
src/explore_datalake.ipynb ADDED
@@ -0,0 +1,1165 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# CANedge Data Lake Explorer\n",
8
+ "\n",
9
+ "This notebook helps you explore and analyze your CANedge data lake using AWS Athena.\n",
10
+ "\n",
11
+ "## Setup\n",
12
+ "\n",
13
+ "First, let's configure the connection and test it."
14
+ ]
15
+ },
16
+ {
17
+ "cell_type": "code",
18
+ "execution_count": 1,
19
+ "metadata": {},
20
+ "outputs": [
21
+ {
22
+ "name": "stdout",
23
+ "output_type": "stream",
24
+ "text": [
25
+ "✓ Libraries imported successfully\n"
26
+ ]
27
+ }
28
+ ],
29
+ "source": [
30
+ "# Import required libraries\n",
31
+ "import pandas as pd\n",
32
+ "import matplotlib.pyplot as plt\n",
33
+ "import seaborn as sns\n",
34
+ "from datalake.config import DataLakeConfig\n",
35
+ "from datalake.athena import AthenaQuery\n",
36
+ "from datalake.catalog import DataLakeCatalog\n",
37
+ "from datalake.query import DataLakeQuery\n",
38
+ "from datalake.batch import BatchProcessor\n",
39
+ "\n",
40
+ "# Set up plotting\n",
41
+ "%matplotlib inline\n",
42
+ "plt.style.use('seaborn-v0_8')\n",
43
+ "sns.set_palette(\"husl\")\n",
44
+ "\n",
45
+ "print(\"✓ Libraries imported successfully\")"
46
+ ]
47
+ },
48
+ {
49
+ "cell_type": "code",
50
+ "execution_count": 2,
51
+ "metadata": {},
52
+ "outputs": [
53
+ {
54
+ "name": "stdout",
55
+ "output_type": "stream",
56
+ "text": [
57
+ "✓ Configuration loaded\n",
58
+ " Database: dbparquetdatalake05\n",
59
+ " Workgroup: athenaworkgroup-datalake05\n",
60
+ " Region: eu-north-1\n"
61
+ ]
62
+ }
63
+ ],
64
+ "source": [
65
+ "# Configure connection with your credentials\n",
66
+ "config = DataLakeConfig.from_credentials(\n",
67
+ " database_name=\"dbparquetdatalake05\",\n",
68
+ " workgroup=\"athenaworkgroup-datalake05\",\n",
69
+ " s3_output_location=\"s3://canedge-raw-data-parquet/athena-results/\",\n",
70
+ " region=\"eu-north-1\",\n",
71
+ " access_key_id=\"YOUR_ACCESS_KEY_ID\",  # placeholder - never commit real credentials\n",
72
+ " secret_access_key=\"YOUR_SECRET_ACCESS_KEY\",  # placeholder\n",
73
+ ")\n",
74
+ "\n",
75
+ "print(f\"✓ Configuration loaded\")\n",
76
+ "print(f\" Database: {config.database_name}\")\n",
77
+ "print(f\" Workgroup: {config.workgroup}\")\n",
78
+ "print(f\" Region: {config.region}\")"
79
+ ]
80
+ },
81
+ {
82
+ "cell_type": "code",
83
+ "execution_count": 3,
84
+ "metadata": {},
85
+ "outputs": [
86
+ {
87
+ "name": "stderr",
88
+ "output_type": "stream",
89
+ "text": [
90
+ "2026-01-25 16:42:53,113 - datalake.athena - INFO - Initialized Athena client for database: dbparquetdatalake05\n",
91
+ "2026-01-25 16:42:53,113 - datalake.catalog - INFO - Initialized catalog for database: dbparquetdatalake05\n",
92
+ "2026-01-25 16:42:53,114 - datalake.query - INFO - Initialized DataLakeQuery\n",
93
+ "2026-01-25 16:42:53,114 - datalake.batch - INFO - Initialized BatchProcessor\n"
94
+ ]
95
+ },
96
+ {
97
+ "name": "stdout",
98
+ "output_type": "stream",
99
+ "text": [
100
+ "✓ Athena client and catalog initialized\n"
101
+ ]
102
+ }
103
+ ],
104
+ "source": [
105
+ "# Initialize Athena and catalog\n",
106
+ "athena = AthenaQuery(config)\n",
107
+ "catalog = DataLakeCatalog(athena, config)\n",
108
+ "query = DataLakeQuery(athena, catalog)\n",
109
+ "processor = BatchProcessor(query)\n",
110
+ "\n",
111
+ "print(\"✓ Athena client and catalog initialized\")"
112
+ ]
113
+ },
114
+ {
115
+ "cell_type": "markdown",
116
+ "metadata": {},
117
+ "source": [
118
+ "## Test Connection\n",
119
+ "\n",
120
+ "Let's verify the connection works by listing tables."
121
+ ]
122
+ },
123
+ {
124
+ "cell_type": "code",
125
+ "execution_count": 4,
126
+ "metadata": {},
127
+ "outputs": [
128
+ {
129
+ "name": "stderr",
130
+ "output_type": "stream",
131
+ "text": [
132
+ "2026-01-25 16:43:00,494 - datalake.athena - INFO - Query started with execution ID: fb177297-ccc0-4c3d-b0ee-44078f0d3fa8\n",
133
+ "2026-01-25 16:43:01,953 - datalake.athena - INFO - Query fb177297-ccc0-4c3d-b0ee-44078f0d3fa8 completed successfully\n",
134
+ "2026-01-25 16:43:02,379 - datalake.athena - INFO - Retrieved 77 rows from query fb177297-ccc0-4c3d-b0ee-44078f0d3fa8\n"
135
+ ]
136
+ },
137
+ {
138
+ "name": "stdout",
139
+ "output_type": "stream",
140
+ "text": [
141
+ "✓ Connection successful!\n",
142
+ " Found 77 tables in database\n",
143
+ "\n",
144
+ " First 10 tables:\n",
145
+ " tab_name\n",
146
+ "0 tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m03\n",
147
+ "1 tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m04\n",
148
+ "2 tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m05\n",
149
+ "3 tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m06\n",
150
+ "4 tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m07\n",
151
+ "5 tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m0c\n",
152
+ "6 tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m0d\n",
153
+ "7 tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m0e\n",
154
+ "8 tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m0f\n",
155
+ "9 tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m10\n"
156
+ ]
157
+ }
158
+ ],
159
+ "source": [
160
+ "# Test connection with a simple query\n",
161
+ "try:\n",
162
+ " test_query = f\"SHOW TABLES IN {config.database_name}\"\n",
163
+ " df_tables = athena.query_to_dataframe(test_query, timeout=60)\n",
164
+ " print(f\"✓ Connection successful!\")\n",
165
+ " print(f\" Found {len(df_tables)} tables in database\")\n",
166
+ " if not df_tables.empty:\n",
167
+ " print(f\"\\n First 10 tables:\")\n",
168
+ " print(df_tables.head(10))\n",
169
+ "except Exception as e:\n",
170
+ " print(f\"✗ Connection failed: {e}\")\n",
171
+ " import traceback\n",
172
+ " traceback.print_exc()"
173
+ ]
174
+ },
175
+ {
176
+ "cell_type": "markdown",
177
+ "metadata": {},
178
+ "source": [
179
+ "## Explore Data Lake Structure\n",
180
+ "\n",
181
+ "Discover devices, messages, and available data."
182
+ ]
183
+ },
184
+ {
185
+ "cell_type": "code",
186
+ "execution_count": null,
187
+ "metadata": {},
188
+ "outputs": [
189
+ {
190
+ "name": "stderr",
191
+ "output_type": "stream",
192
+ "text": [
193
+ "2026-01-25 16:45:30,372 - datalake.athena - INFO - Query started with execution ID: f341e52d-c3ea-4baf-b805-eb1f327b1d1c\n",
194
+ "2026-01-25 16:45:31,482 - datalake.athena - INFO - Query f341e52d-c3ea-4baf-b805-eb1f327b1d1c completed successfully\n",
195
+ "2026-01-25 16:45:31,613 - datalake.athena - INFO - Retrieved 78 rows from query f341e52d-c3ea-4baf-b805-eb1f327b1d1c\n",
196
+ "2026-01-25 16:45:31,614 - datalake.catalog - INFO - Found 78 tables in database\n"
197
+ ]
198
+ },
199
+ {
200
+ "name": "stdout",
201
+ "output_type": "stream",
202
+ "text": [
203
+ "Total tables: 78\n",
204
+ "\n",
205
+ "Sample tables:\n",
206
+ " - tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m00\n",
207
+ " - tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m03\n",
208
+ " - tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m04\n",
209
+ " - tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m05\n",
210
+ " - tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m06\n",
211
+ " - tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m07\n",
212
+ " - tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m0c\n",
213
+ " - tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m0d\n",
214
+ " - tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m0e\n",
215
+ " - tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m0f\n"
216
+ ]
217
+ }
218
+ ],
219
+ "source": [
220
+ "# List all tables\n",
221
+ "tables = catalog.list_tables()\n",
222
+ "print(f\"Total tables: {len(tables)}\")\n",
223
+ "print(f\"\\nSample tables:\")\n",
224
+ "for table in tables[:10]:\n",
225
+ " print(f\" - {table}\")"
226
+ ]
227
+ },
228
+ {
229
+ "cell_type": "code",
230
+ "execution_count": 14,
231
+ "metadata": {},
232
+ "outputs": [
233
+ {
234
+ "name": "stderr",
235
+ "output_type": "stream",
236
+ "text": [
237
+ "2026-01-25 21:21:04,434 - datalake.athena - INFO - Query started with execution ID: 4f1cfb71-2b52-4226-bd01-412f44cf23e3\n",
238
+ "2026-01-25 21:21:05,589 - datalake.athena - INFO - Query 4f1cfb71-2b52-4226-bd01-412f44cf23e3 completed successfully\n",
239
+ "2026-01-25 21:21:05,720 - datalake.athena - INFO - Retrieved 78 rows from query 4f1cfb71-2b52-4226-bd01-412f44cf23e3\n",
240
+ "2026-01-25 21:21:05,721 - datalake.catalog - INFO - Found 78 tables in database\n",
241
+ "2026-01-25 21:21:05,721 - datalake.catalog - INFO - Found 1 device(s)\n"
242
+ ]
243
+ },
244
+ {
245
+ "name": "stdout",
246
+ "output_type": "stream",
247
+ "text": [
248
+ "Found 1 device(s):\n",
249
+ " - tbl\n"
250
+ ]
251
+ }
252
+ ],
253
+ "source": [
254
+ "# Discover devices\n",
255
+ "devices = catalog.list_devices()\n",
256
+ "print(f\"Found {len(devices)} device(s):\")\n",
257
+ "for device in devices:\n",
258
+ " print(f\" - {device}\")"
259
+ ]
260
+ },
261
+ {
262
+ "cell_type": "code",
263
+ "execution_count": 15,
264
+ "metadata": {},
265
+ "outputs": [
266
+ {
267
+ "name": "stderr",
268
+ "output_type": "stream",
269
+ "text": [
270
+ "2026-01-25 21:21:12,744 - datalake.athena - INFO - Query started with execution ID: 3e5558cf-432c-4beb-8217-97bcbbf71694\n"
271
+ ]
272
+ },
273
+ {
274
+ "name": "stdout",
275
+ "output_type": "stream",
276
+ "text": [
277
+ "\n",
278
+ "Exploring device: tbl\n",
279
+ "============================================================\n"
280
+ ]
281
+ },
282
+ {
283
+ "name": "stderr",
284
+ "output_type": "stream",
285
+ "text": [
286
+ "2026-01-25 21:21:13,885 - datalake.athena - INFO - Query 3e5558cf-432c-4beb-8217-97bcbbf71694 completed successfully\n",
287
+ "2026-01-25 21:21:14,016 - datalake.athena - INFO - Retrieved 78 rows from query 3e5558cf-432c-4beb-8217-97bcbbf71694\n",
288
+ "2026-01-25 21:21:14,017 - datalake.catalog - INFO - Found 78 tables in database\n",
289
+ "2026-01-25 21:21:14,017 - datalake.catalog - INFO - Found 78 messages for device tbl\n"
290
+ ]
291
+ },
292
+ {
293
+ "name": "stdout",
294
+ "output_type": "stream",
295
+ "text": [
296
+ "Found 78 message(s):\n",
297
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m00\n",
298
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m03\n",
299
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m04\n",
300
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m05\n",
301
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m06\n",
302
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m07\n",
303
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m0c\n",
304
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m0d\n",
305
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m0e\n",
306
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m0f\n",
307
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m10\n",
308
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m11\n",
309
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m1f\n",
310
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m2e\n",
311
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m2f\n",
312
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m33\n",
313
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m34\n",
314
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m35\n",
315
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m43\n",
316
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m44\n",
317
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m49\n",
318
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m55\n",
319
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m56\n",
320
+ " - 97a4aaf4_can1_obd2_s_m41_s01pid_m5c\n",
321
+ " - 97a4aaf4_can9_gnssaltitude\n",
322
+ " - 97a4aaf4_can9_gnssdistance\n",
323
+ " - 97a4aaf4_can9_gnsspos\n",
324
+ " - 97a4aaf4_can9_gnssspeed\n",
325
+ " - 97a4aaf4_can9_gnssstatus\n",
326
+ " - 97a4aaf4_can9_gnsstime\n",
327
+ " - 97a4aaf4_can9_heartbeat\n",
328
+ " - 97a4aaf4_can9_imudata\n",
329
+ " - 97a4aaf4_can9_timecalendar\n",
330
+ " - 97a4aaf4_can9_timeexternal\n",
331
+ " - 97a4aaf4_messages\n",
332
+ " - aggregations_devicemeta\n",
333
+ " - aggregations_tripsummary\n",
334
+ " - b8280fd1_can9_gnssaltitude\n",
335
+ " - b8280fd1_can9_gnssdistance\n",
336
+ " - b8280fd1_can9_gnsspos\n",
337
+ " - b8280fd1_can9_gnssspeed\n",
338
+ " - b8280fd1_can9_gnssstatus\n",
339
+ " - b8280fd1_can9_gnsstime\n",
340
+ " - b8280fd1_can9_heartbeat\n",
341
+ " - b8280fd1_can9_imudata\n",
342
+ " - b8280fd1_can9_timecalendar\n",
343
+ " - b8280fd1_can9_timeexternal\n",
344
+ " - b8280fd1_messages\n",
345
+ " - f1da612a_can1_obd2_s_m41_s01pid_m03\n",
346
+ " - f1da612a_can1_obd2_s_m41_s01pid_m04\n",
347
+ " - f1da612a_can1_obd2_s_m41_s01pid_m05\n",
348
+ " - f1da612a_can1_obd2_s_m41_s01pid_m06\n",
349
+ " - f1da612a_can1_obd2_s_m41_s01pid_m07\n",
350
+ " - f1da612a_can1_obd2_s_m41_s01pid_m0c\n",
351
+ " - f1da612a_can1_obd2_s_m41_s01pid_m0d\n",
352
+ " - f1da612a_can1_obd2_s_m41_s01pid_m0e\n",
353
+ " - f1da612a_can1_obd2_s_m41_s01pid_m0f\n",
354
+ " - f1da612a_can1_obd2_s_m41_s01pid_m10\n",
355
+ " - f1da612a_can1_obd2_s_m41_s01pid_m1f\n",
356
+ " - f1da612a_can1_obd2_s_m41_s01pid_m2e\n",
357
+ " - f1da612a_can1_obd2_s_m41_s01pid_m33\n",
358
+ " - f1da612a_can1_obd2_s_m41_s01pid_m34\n",
359
+ " - f1da612a_can1_obd2_s_m41_s01pid_m35\n",
360
+ " - f1da612a_can1_obd2_s_m41_s01pid_m43\n",
361
+ " - f1da612a_can1_obd2_s_m41_s01pid_m44\n",
362
+ " - f1da612a_can1_obd2_s_m41_s01pid_m49\n",
363
+ " - f1da612a_can1_obd2_s_m41_s01pid_m5c\n",
364
+ " - f1da612a_can9_gnssaltitude\n",
365
+ " - f1da612a_can9_gnssdistance\n",
366
+ " - f1da612a_can9_gnsspos\n",
367
+ " - f1da612a_can9_gnssspeed\n",
368
+ " - f1da612a_can9_gnssstatus\n",
369
+ " - f1da612a_can9_gnsstime\n",
370
+ " - f1da612a_can9_heartbeat\n",
371
+ " - f1da612a_can9_imudata\n",
372
+ " - f1da612a_can9_timecalendar\n",
373
+ " - f1da612a_can9_timeexternal\n",
374
+ " - f1da612a_messages\n"
375
+ ]
376
+ }
377
+ ],
378
+ "source": [
379
+ "# Explore messages for the first device\n",
380
+ "if devices:\n",
381
+ " device_id = devices[0]\n",
382
+ " print(f\"\\nExploring device: {device_id}\")\n",
383
+ " print(\"=\" * 60)\n",
384
+ " \n",
385
+ " messages = catalog.list_messages(device_id)\n",
386
+ " print(f\"Found {len(messages)} message(s):\")\n",
387
+ " for message in messages:\n",
388
+ " print(f\" - {message}\")"
389
+ ]
390
+ },
391
+ {
392
+ "cell_type": "code",
393
+ "execution_count": null,
394
+ "metadata": {},
395
+ "outputs": [
396
+ {
397
+ "name": "stdout",
398
+ "output_type": "stream",
399
+ "text": [
400
+ "\n",
401
+ "Schema for tbl/97a4aaf4_can1_obd2_s_m41_s01pid_m00:\n",
402
+ "============================================================\n"
403
+ ]
404
+ },
405
+ {
406
+ "name": "stderr",
407
+ "output_type": "stream",
408
+ "text": [
409
+ "2026-01-25 17:31:48,556 - datalake.athena - INFO - Query started with execution ID: 6916c876-c526-4474-bbe0-ad626b4786e7\n",
410
+ "2026-01-25 17:31:49,686 - datalake.athena - INFO - Query 6916c876-c526-4474-bbe0-ad626b4786e7 completed successfully\n",
411
+ "2026-01-25 17:31:49,816 - datalake.athena - INFO - Retrieved 78 rows from query 6916c876-c526-4474-bbe0-ad626b4786e7\n",
412
+ "2026-01-25 17:31:49,817 - datalake.catalog - INFO - Found 78 tables in database\n",
413
+ "2026-01-25 17:31:49,993 - datalake.athena - INFO - Query started with execution ID: b71d0e14-e0e8-4e3f-8c8c-ceffcd9984f2\n",
414
+ "2026-01-25 17:31:51,132 - datalake.athena - INFO - Query b71d0e14-e0e8-4e3f-8c8c-ceffcd9984f2 completed successfully\n",
415
+ "2026-01-25 17:31:51,371 - datalake.athena - INFO - Retrieved 3 rows from query b71d0e14-e0e8-4e3f-8c8c-ceffcd9984f2\n",
416
+ "2026-01-25 17:31:51,372 - datalake.catalog - INFO - Schema for tbl/97a4aaf4_can1_obd2_s_m41_s01pid_m00: 3 columns\n"
417
+ ]
418
+ },
419
+ {
420
+ "name": "stdout",
421
+ "output_type": "stream",
422
+ "text": [
423
+ " Column Type\n",
424
+ " t timestamp(3)\n",
425
+ "s01pid00_pidssupported_01_20 double\n",
426
+ " date_created varchar\n",
427
+ "\n",
428
+ "Total columns: 3\n"
429
+ ]
430
+ }
431
+ ],
432
+ "source": [
433
+ "# Get schema for first device/message combination\n",
434
+ "if devices and messages:\n",
435
+ " device_id = devices[0]\n",
436
+ " message = messages[0]\n",
437
+ " \n",
438
+ " print(f\"\\nSchema for {device_id}/{message}:\")\n",
439
+ " print(\"=\" * 60)\n",
440
+ " \n",
441
+ " schema = catalog.get_schema(device_id, message)\n",
442
+ " if schema:\n",
443
+ " schema_df = pd.DataFrame([\n",
444
+ " {\"Column\": col, \"Type\": dtype}\n",
445
+ " for col, dtype in schema.items()\n",
446
+ " ])\n",
447
+ " print(schema_df.to_string(index=False))\n",
448
+ " print(f\"\\nTotal columns: {len(schema)}\")"
449
+ ]
450
+ },
451
+ {
452
+ "cell_type": "code",
453
+ "execution_count": 9,
454
+ "metadata": {},
455
+ "outputs": [
456
+ {
457
+ "name": "stderr",
458
+ "output_type": "stream",
459
+ "text": [
460
+ "2026-01-25 17:31:58,489 - datalake.athena - INFO - Query started with execution ID: 844cf5ba-7756-46cf-a0e4-d6bfe8c98f74\n"
461
+ ]
462
+ },
463
+ {
464
+ "name": "stdout",
465
+ "output_type": "stream",
466
+ "text": [
467
+ "\n",
468
+ "Partitions (dates) for tbl/97a4aaf4_can1_obd2_s_m41_s01pid_m00:\n",
469
+ "============================================================\n"
470
+ ]
471
+ },
472
+ {
473
+ "name": "stderr",
474
+ "output_type": "stream",
475
+ "text": [
476
+ "2026-01-25 17:31:59,938 - datalake.athena - INFO - Query 844cf5ba-7756-46cf-a0e4-d6bfe8c98f74 completed successfully\n",
477
+ "2026-01-25 17:32:00,137 - datalake.athena - INFO - Retrieved 78 rows from query 844cf5ba-7756-46cf-a0e4-d6bfe8c98f74\n",
478
+ "2026-01-25 17:32:00,137 - datalake.catalog - INFO - Found 78 tables in database\n",
479
+ "2026-01-25 17:32:00,265 - datalake.athena - INFO - Query started with execution ID: c4a13ea7-d58e-4658-90aa-9413d04b9417\n",
480
+ "2026-01-25 17:32:02,108 - datalake.athena - INFO - Query c4a13ea7-d58e-4658-90aa-9413d04b9417 completed successfully\n",
481
+ "2026-01-25 17:32:02,219 - datalake.athena - WARNING - No results returned for execution c4a13ea7-d58e-4658-90aa-9413d04b9417\n",
482
+ "2026-01-25 17:32:02,222 - datalake.catalog - WARNING - No partitions found for tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m00\n"
483
+ ]
484
+ },
485
+ {
486
+ "name": "stdout",
487
+ "output_type": "stream",
488
+ "text": [
489
+ "No partitions found (table may not be partitioned)\n"
490
+ ]
491
+ }
492
+ ],
493
+ "source": [
494
+ "# Check available partitions (dates)\n",
495
+ "if devices and messages:\n",
496
+ " device_id = devices[0]\n",
497
+ " message = messages[0]\n",
498
+ " \n",
499
+ " print(f\"\\nPartitions (dates) for {device_id}/{message}:\")\n",
500
+ " print(\"=\" * 60)\n",
501
+ " \n",
502
+ " try:\n",
503
+ " partitions = catalog.list_partitions(device_id, message)\n",
504
+ " if partitions:\n",
505
+ " print(f\"Found {len(partitions)} partition(s):\")\n",
506
+ " print(f\" Date range: {partitions[0]} to {partitions[-1]}\")\n",
507
+ " print(f\"\\n All dates:\")\n",
508
+ " for date in partitions[:20]: # Show first 20\n",
509
+ " print(f\" - {date}\")\n",
510
+ " if len(partitions) > 20:\n",
511
+ " print(f\" ... and {len(partitions) - 20} more\")\n",
512
+ " else:\n",
513
+ " print(\"No partitions found (table may not be partitioned)\")\n",
514
+ " except Exception as e:\n",
515
+ " print(f\"Could not list partitions: {e}\")"
516
+ ]
517
+ },
518
+ {
519
+ "cell_type": "markdown",
520
+ "metadata": {},
521
+ "source": [
522
+ "## Query Data\n",
523
+ "\n",
524
+ "Now let's query some actual data."
525
+ ]
526
+ },
527
+ {
528
+ "cell_type": "code",
529
+ "execution_count": null,
530
+ "metadata": {},
531
+ "outputs": [
532
+ {
533
+ "name": "stdout",
534
+ "output_type": "stream",
535
+ "text": [
536
+ "Reading sample data from tbl/97a4aaf4_can1_obd2_s_m41_s01pid_m00...\n",
537
+ "============================================================\n"
538
+ ]
539
+ },
540
+ {
541
+ "name": "stderr",
542
+ "output_type": "stream",
543
+ "text": [
544
+ "2026-01-25 17:32:42,022 - datalake.athena - INFO - Query started with execution ID: 2a7e2ed0-8c44-46e7-a5b4-1a57fab4938b\n",
545
+ "2026-01-25 17:32:43,601 - datalake.athena - INFO - Query 2a7e2ed0-8c44-46e7-a5b4-1a57fab4938b completed successfully\n",
546
+ "2026-01-25 17:32:43,731 - datalake.athena - INFO - Retrieved 78 rows from query 2a7e2ed0-8c44-46e7-a5b4-1a57fab4938b\n",
547
+ "2026-01-25 17:32:43,732 - datalake.catalog - INFO - Found 78 tables in database\n",
548
+ "2026-01-25 17:32:43,732 - datalake.query - INFO - Executing query for tbl/97a4aaf4_can1_obd2_s_m41_s01pid_m00\n",
549
+ "2026-01-25 17:32:43,859 - datalake.athena - INFO - Query started with execution ID: 02fe08ed-4f1c-4363-b167-c1a4a0196094\n",
550
+ "2026-01-25 17:32:48,300 - datalake.athena - INFO - Query 02fe08ed-4f1c-4363-b167-c1a4a0196094 completed successfully\n",
551
+ "2026-01-25 17:32:48,430 - datalake.athena - INFO - Retrieved 100 rows from query 02fe08ed-4f1c-4363-b167-c1a4a0196094\n"
552
+ ]
553
+ },
554
+ {
555
+ "name": "stdout",
556
+ "output_type": "stream",
557
+ "text": [
558
+ "✓ Loaded 100 records\n",
559
+ "\n",
560
+ "Data shape: (100, 3)\n",
561
+ "\n",
562
+ "Columns: ['t', 's01pid00_pidssupported_01_20', 'date_created']\n",
563
+ "\n",
564
+ "First few rows:\n"
565
+ ]
566
+ },
567
+ {
568
+ "data": {
569
+ "text/html": [
570
+ "<div>\n",
571
+ "<style scoped>\n",
572
+ " .dataframe tbody tr th:only-of-type {\n",
573
+ " vertical-align: middle;\n",
574
+ " }\n",
575
+ "\n",
576
+ " .dataframe tbody tr th {\n",
577
+ " vertical-align: top;\n",
578
+ " }\n",
579
+ "\n",
580
+ " .dataframe thead th {\n",
581
+ " text-align: right;\n",
582
+ " }\n",
583
+ "</style>\n",
584
+ "<table border=\"1\" class=\"dataframe\">\n",
585
+ " <thead>\n",
586
+ " <tr style=\"text-align: right;\">\n",
587
+ " <th></th>\n",
588
+ " <th>t</th>\n",
589
+ " <th>s01pid00_pidssupported_01_20</th>\n",
590
+ " <th>date_created</th>\n",
591
+ " </tr>\n",
592
+ " </thead>\n",
593
+ " <tbody>\n",
594
+ " <tr>\n",
595
+ " <th>0</th>\n",
596
+ " <td>2025-10-29 05:45:53.063</td>\n",
597
+ " <td>3.189744e+09</td>\n",
598
+ " <td>2025/10/29</td>\n",
599
+ " </tr>\n",
600
+ " <tr>\n",
601
+ " <th>1</th>\n",
602
+ " <td>2025-10-29 05:46:18.062</td>\n",
603
+ " <td>3.189744e+09</td>\n",
604
+ " <td>2025/10/29</td>\n",
605
+ " </tr>\n",
606
+ " <tr>\n",
607
+ " <th>2</th>\n",
608
+ " <td>2025-10-29 05:46:48.063</td>\n",
609
+ " <td>3.189744e+09</td>\n",
610
+ " <td>2025/10/29</td>\n",
611
+ " </tr>\n",
612
+ " <tr>\n",
613
+ " <th>3</th>\n",
614
+ " <td>2025-10-29 05:47:43.062</td>\n",
615
+ " <td>3.189744e+09</td>\n",
616
+ " <td>2025/10/29</td>\n",
617
+ " </tr>\n",
618
+ " <tr>\n",
619
+ " <th>4</th>\n",
620
+ " <td>2025-10-29 05:48:08.062</td>\n",
621
+ " <td>3.189744e+09</td>\n",
622
+ " <td>2025/10/29</td>\n",
623
+ " </tr>\n",
624
+ " <tr>\n",
625
+ " <th>5</th>\n",
626
+ " <td>2025-10-29 05:49:33.063</td>\n",
627
+ " <td>3.189744e+09</td>\n",
628
+ " <td>2025/10/29</td>\n",
629
+ " </tr>\n",
630
+ " <tr>\n",
631
+ " <th>6</th>\n",
632
+ " <td>2025-10-29 05:49:48.063</td>\n",
633
+ " <td>3.189744e+09</td>\n",
634
+ " <td>2025/10/29</td>\n",
635
+ " </tr>\n",
636
+ " <tr>\n",
637
+ " <th>7</th>\n",
638
+ " <td>2025-10-29 05:50:03.063</td>\n",
639
+ " <td>3.189744e+09</td>\n",
640
+ " <td>2025/10/29</td>\n",
641
+ " </tr>\n",
642
+ " <tr>\n",
643
+ " <th>8</th>\n",
644
+ " <td>2025-10-29 05:50:33.064</td>\n",
645
+ " <td>3.189744e+09</td>\n",
646
+ " <td>2025/10/29</td>\n",
647
+ " </tr>\n",
648
+ " <tr>\n",
649
+ " <th>9</th>\n",
650
+ " <td>2025-10-29 05:50:58.064</td>\n",
651
+ " <td>3.189744e+09</td>\n",
652
+ " <td>2025/10/29</td>\n",
653
+ " </tr>\n",
654
+ " </tbody>\n",
655
+ "</table>\n",
656
+ "</div>"
657
+ ],
658
+ "text/plain": [
659
+ " t s01pid00_pidssupported_01_20 date_created\n",
660
+ "0 2025-10-29 05:45:53.063 3.189744e+09 2025/10/29\n",
661
+ "1 2025-10-29 05:46:18.062 3.189744e+09 2025/10/29\n",
662
+ "2 2025-10-29 05:46:48.063 3.189744e+09 2025/10/29\n",
663
+ "3 2025-10-29 05:47:43.062 3.189744e+09 2025/10/29\n",
664
+ "4 2025-10-29 05:48:08.062 3.189744e+09 2025/10/29\n",
665
+ "5 2025-10-29 05:49:33.063 3.189744e+09 2025/10/29\n",
666
+ "6 2025-10-29 05:49:48.063 3.189744e+09 2025/10/29\n",
667
+ "7 2025-10-29 05:50:03.063 3.189744e+09 2025/10/29\n",
668
+ "8 2025-10-29 05:50:33.064 3.189744e+09 2025/10/29\n",
669
+ "9 2025-10-29 05:50:58.064 3.189744e+09 2025/10/29"
670
+ ]
671
+ },
672
+ "metadata": {},
673
+ "output_type": "display_data"
674
+ },
675
+ {
676
+ "name": "stdout",
677
+ "output_type": "stream",
678
+ "text": [
679
+ "\n",
680
+ "Data types:\n",
681
+ "t object\n",
682
+ "s01pid00_pidssupported_01_20 float64\n",
683
+ "date_created object\n",
684
+ "dtype: object\n",
685
+ "\n",
686
+ "Basic statistics:\n"
687
+ ]
688
+ },
689
+ {
690
+ "data": {
691
+ "text/html": [
692
+ "<div>\n",
693
+ "<style scoped>\n",
694
+ " .dataframe tbody tr th:only-of-type {\n",
695
+ " vertical-align: middle;\n",
696
+ " }\n",
697
+ "\n",
698
+ " .dataframe tbody tr th {\n",
699
+ " vertical-align: top;\n",
700
+ " }\n",
701
+ "\n",
702
+ " .dataframe thead th {\n",
703
+ " text-align: right;\n",
704
+ " }\n",
705
+ "</style>\n",
706
+ "<table border=\"1\" class=\"dataframe\">\n",
707
+ " <thead>\n",
708
+ " <tr style=\"text-align: right;\">\n",
709
+ " <th></th>\n",
710
+ " <th>s01pid00_pidssupported_01_20</th>\n",
711
+ " </tr>\n",
712
+ " </thead>\n",
713
+ " <tbody>\n",
714
+ " <tr>\n",
715
+ " <th>count</th>\n",
716
+ " <td>1.000000e+02</td>\n",
717
+ " </tr>\n",
718
+ " <tr>\n",
719
+ " <th>mean</th>\n",
720
+ " <td>3.189744e+09</td>\n",
721
+ " </tr>\n",
722
+ " <tr>\n",
723
+ " <th>std</th>\n",
724
+ " <td>0.000000e+00</td>\n",
725
+ " </tr>\n",
726
+ " <tr>\n",
727
+ " <th>min</th>\n",
728
+ " <td>3.189744e+09</td>\n",
729
+ " </tr>\n",
730
+ " <tr>\n",
731
+ " <th>25%</th>\n",
732
+ " <td>3.189744e+09</td>\n",
733
+ " </tr>\n",
734
+ " <tr>\n",
735
+ " <th>50%</th>\n",
736
+ " <td>3.189744e+09</td>\n",
737
+ " </tr>\n",
738
+ " <tr>\n",
739
+ " <th>75%</th>\n",
740
+ " <td>3.189744e+09</td>\n",
741
+ " </tr>\n",
742
+ " <tr>\n",
743
+ " <th>max</th>\n",
744
+ " <td>3.189744e+09</td>\n",
745
+ " </tr>\n",
746
+ " </tbody>\n",
747
+ "</table>\n",
748
+ "</div>"
749
+ ],
750
+ "text/plain": [
751
+ " s01pid00_pidssupported_01_20\n",
752
+ "count 1.000000e+02\n",
753
+ "mean 3.189744e+09\n",
754
+ "std 0.000000e+00\n",
755
+ "min 3.189744e+09\n",
756
+ "25% 3.189744e+09\n",
757
+ "50% 3.189744e+09\n",
758
+ "75% 3.189744e+09\n",
759
+ "max 3.189744e+09"
760
+ ]
761
+ },
762
+ "metadata": {},
763
+ "output_type": "display_data"
764
+ }
765
+ ],
766
+ "source": [
767
+ "# Read a sample of data\n",
768
+ "if devices and messages:\n",
769
+ " device_id = devices[0]\n",
770
+ " message = messages[0]\n",
771
+ " \n",
772
+ " print(f\"Reading sample data from {device_id}/{message}...\")\n",
773
+ " print(\"=\" * 60)\n",
774
+ " \n",
775
+ " try:\n",
776
+ " df = query.read_device_message(\n",
777
+ " device_id=device_id,\n",
778
+ " message=message,\n",
779
+ " limit=100 # Limit for quick preview\n",
780
+ " )\n",
781
+ " \n",
782
+ " print(f\"✓ Loaded {len(df)} records\")\n",
783
+ " print(f\"\\nData shape: {df.shape}\")\n",
784
+ " print(f\"\\nColumns: {list(df.columns)}\")\n",
785
+ " print(f\"\\nFirst few rows:\")\n",
786
+ " display(df.head(10))\n",
787
+ " \n",
788
+ " print(f\"\\nData types:\")\n",
789
+ " print(df.dtypes)\n",
790
+ " \n",
791
+ " print(f\"\\nBasic statistics:\")\n",
792
+ " display(df.describe())\n",
793
+ " \n",
794
+ " except Exception as e:\n",
795
+ " print(f\"✗ Error reading data: {e}\")\n",
796
+ " import traceback\n",
797
+ " traceback.print_exc()"
798
+ ]
799
+ },
800
+ {
801
+ "cell_type": "markdown",
802
+ "metadata": {},
803
+ "source": [
804
+ "## Query ALL Data (No Limits)\n",
805
+ "\n",
806
+ "To see all your data, remove the `limit` parameter or set it to `None`. \n",
807
+ "**Note:** This may take longer and use more memory for large datasets."
808
+ ]
809
+ },
810
+ {
811
+ "cell_type": "code",
812
+ "execution_count": null,
813
+ "metadata": {},
814
+ "outputs": [],
815
+ "source": [
816
+ "# Query ALL data (no limit) - use with caution for large datasets\n",
817
+    "if devices and messages:\n",
819
+    "    device_id = devices[0]\n",
820
+    "    message = messages[0]\n",
820
+ " \n",
821
+ " print(f\"Querying ALL data from {device_id}/{message}...\")\n",
822
+ " print(\"=\" * 60)\n",
823
+ " print(\"⚠️ This may take a while for large datasets!\")\n",
824
+ " print()\n",
825
+ " \n",
826
+ " # Uncomment the lines below to query all data (remove limit)\n",
827
+ " # try:\n",
828
+ " # df_all = query.read_device_message(\n",
829
+ " # device_id=device_id,\n",
830
+ " # message=message,\n",
831
+ " # limit=None # No limit - gets ALL data\n",
832
+ " # )\n",
833
+ " # \n",
834
+ " # print(f\"✓ Loaded ALL {len(df_all)} records\")\n",
835
+ " # print(f\"\\nData shape: {df_all.shape}\")\n",
836
+ " # display(df_all.head(20))\n",
837
+ " # \n",
838
+ " # except Exception as e:\n",
839
+ " # print(f\"✗ Error reading all data: {e}\")\n",
840
+ " # import traceback\n",
841
+ " # traceback.print_exc()\n",
842
+ " \n",
843
+ " print(\"(Uncomment the code above to query all data)\")"
844
+ ]
845
+ },
846
+ {
847
+ "cell_type": "code",
848
+ "execution_count": 11,
849
+ "metadata": {},
850
+ "outputs": [
851
+ {
852
+ "name": "stderr",
853
+ "output_type": "stream",
854
+ "text": [
855
+ "2026-01-25 17:32:55,234 - datalake.athena - INFO - Query started with execution ID: ee01954e-ad1e-4044-aba1-d5b9695cbaef\n",
856
+ "2026-01-25 17:32:56,362 - datalake.athena - INFO - Query ee01954e-ad1e-4044-aba1-d5b9695cbaef completed successfully\n",
857
+ "2026-01-25 17:32:56,510 - datalake.athena - INFO - Retrieved 78 rows from query ee01954e-ad1e-4044-aba1-d5b9695cbaef\n",
858
+ "2026-01-25 17:32:56,511 - datalake.catalog - INFO - Found 78 tables in database\n",
859
+ "2026-01-25 17:32:56,630 - datalake.athena - INFO - Query started with execution ID: 6b5ca88b-bfaa-4cb2-9aa2-95e7e8d0facc\n",
860
+ "2026-01-25 17:32:57,977 - datalake.athena - INFO - Query 6b5ca88b-bfaa-4cb2-9aa2-95e7e8d0facc completed successfully\n",
861
+ "2026-01-25 17:32:58,133 - datalake.athena - WARNING - No results returned for execution 6b5ca88b-bfaa-4cb2-9aa2-95e7e8d0facc\n",
862
+ "2026-01-25 17:32:58,134 - datalake.catalog - WARNING - No partitions found for tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m00\n"
863
+ ]
864
+ },
865
+ {
866
+ "name": "stdout",
867
+ "output_type": "stream",
868
+ "text": [
869
+ "No partitions available for date filtering\n"
870
+ ]
871
+ }
872
+ ],
873
+ "source": [
874
+ "# Query with date range (if partitions available)\n",
875
+ "if devices and messages:\n",
876
+ " device_id = devices[0]\n",
877
+ " message = messages[0]\n",
878
+ " \n",
879
+ " try:\n",
880
+ " partitions = catalog.list_partitions(device_id, message)\n",
881
+ " if partitions:\n",
882
+ " start_date = partitions[0]\n",
883
+ " end_date = partitions[-1] if len(partitions) > 1 else partitions[0]\n",
884
+ " \n",
885
+ " print(f\"Querying data from {start_date} to {end_date}...\")\n",
886
+ " \n",
887
+ " df_date = query.read_date_range(\n",
888
+ " device_id=device_id,\n",
889
+ " message=message,\n",
890
+ " start_date=start_date,\n",
891
+ " end_date=end_date,\n",
892
+ " limit=1000\n",
893
+ " )\n",
894
+ " \n",
895
+ " print(f\"✓ Loaded {len(df_date)} records\")\n",
896
+ " display(df_date.head())\n",
897
+ " else:\n",
898
+ " print(\"No partitions available for date filtering\")\n",
899
+ " except Exception as e:\n",
900
+ " print(f\"Error querying date range: {e}\")"
901
+ ]
902
+ },
903
+ {
904
+ "cell_type": "markdown",
905
+ "metadata": {},
906
+ "source": [
907
+ "## Time Series Analysis\n",
908
+ "\n",
909
+ "Analyze signals over time."
910
+ ]
911
+ },
912
+ {
913
+ "cell_type": "code",
914
+ "execution_count": 12,
915
+ "metadata": {},
916
+ "outputs": [
917
+ {
918
+ "name": "stdout",
919
+ "output_type": "stream",
920
+ "text": [
921
+ "Available signal columns (2):\n",
922
+ " - s01pid00_pidssupported_01_20\n",
923
+ " - date_created\n"
924
+ ]
925
+ }
926
+ ],
927
+ "source": [
928
+ "# Get available signal columns\n",
929
+ "if devices and messages:\n",
930
+ " device_id = devices[0]\n",
931
+ " message = messages[0]\n",
932
+ " \n",
933
+ " schema = catalog.get_schema(device_id, message)\n",
934
+ " if schema:\n",
935
+ " # Find signal columns (exclude timestamp and date)\n",
936
+ " signal_cols = [\n",
937
+ " col for col in schema.keys() \n",
938
+    "        if col not in ['t', 'date', 'date_created', 'timestamp']\n",
939
+ " ]\n",
940
+ " \n",
941
+ " print(f\"Available signal columns ({len(signal_cols)}):\")\n",
942
+ " for col in signal_cols[:10]:\n",
943
+ " print(f\" - {col}\")\n",
944
+ " if len(signal_cols) > 10:\n",
945
+ " print(f\" ... and {len(signal_cols) - 10} more\")"
946
+ ]
947
+ },
948
+ {
949
+ "cell_type": "code",
950
+ "execution_count": null,
951
+ "metadata": {},
952
+ "outputs": [],
953
+ "source": [
954
+ "# Query time series for a specific signal\n",
955
+ "if devices and messages and 'signal_cols' in locals() and signal_cols:\n",
956
+ " device_id = devices[0]\n",
957
+ " message = messages[0]\n",
958
+ " signal_name = signal_cols[0] # Use first signal\n",
959
+ " \n",
960
+ " print(f\"Querying time series for {signal_name}...\")\n",
961
+ " print(\"=\" * 60)\n",
962
+ " \n",
963
+ " try:\n",
964
+ " df_ts = query.time_series_query(\n",
965
+ " device_id=device_id,\n",
966
+ " message=message,\n",
967
+ " signal_name=signal_name,\n",
968
+ " limit=10000 # Adjust based on your needs\n",
969
+ " )\n",
970
+ " \n",
971
+ " if not df_ts.empty:\n",
972
+ " # Convert timestamp to datetime\n",
973
+ " if 't' in df_ts.columns:\n",
974
+    "            df_ts['timestamp'] = pd.to_datetime(df_ts['t'])\n",
975
+ " \n",
976
+ " print(f\"✓ Loaded {len(df_ts)} records\")\n",
977
+ " print(f\"\\nTime range: {df_ts['timestamp'].min()} to {df_ts['timestamp'].max()}\")\n",
978
+ " \n",
979
+ " # Display sample\n",
980
+ " display(df_ts[['timestamp', signal_name]].head(10))\n",
981
+ " \n",
982
+ " # Statistics\n",
983
+ " print(f\"\\nStatistics for {signal_name}:\")\n",
984
+ " print(f\" Mean: {df_ts[signal_name].mean():.2f}\")\n",
985
+ " print(f\" Min: {df_ts[signal_name].min():.2f}\")\n",
986
+ " print(f\" Max: {df_ts[signal_name].max():.2f}\")\n",
987
+ " print(f\" Std: {df_ts[signal_name].std():.2f}\")\n",
988
+ " else:\n",
989
+ " print(\"No data returned\")\n",
990
+ " \n",
991
+ " except Exception as e:\n",
992
+ " print(f\"✗ Error querying time series: {e}\")\n",
993
+ " import traceback\n",
994
+ " traceback.print_exc()"
995
+ ]
996
+ },
997
+ {
998
+ "cell_type": "code",
999
+ "execution_count": null,
1000
+ "metadata": {},
1001
+ "outputs": [],
1002
+ "source": [
1003
+ "# Plot time series (if data available)\n",
1004
+ "if 'df_ts' in locals() and not df_ts.empty and 'timestamp' in df_ts.columns:\n",
1005
+ " try:\n",
1006
+ " plt.figure(figsize=(14, 6))\n",
1007
+ " plt.plot(df_ts['timestamp'], df_ts[signal_name], linewidth=0.5, alpha=0.7)\n",
1008
+ " plt.title(f\"Time Series: {signal_name}\", fontsize=14, fontweight='bold')\n",
1009
+ " plt.xlabel('Time', fontsize=12)\n",
1010
+ " plt.ylabel(signal_name, fontsize=12)\n",
1011
+ " plt.grid(True, alpha=0.3)\n",
1012
+ " plt.xticks(rotation=45)\n",
1013
+ " plt.tight_layout()\n",
1014
+ " plt.show()\n",
1015
+ " \n",
1016
+ " # Histogram\n",
1017
+ " plt.figure(figsize=(10, 6))\n",
1018
+ " plt.hist(df_ts[signal_name], bins=50, edgecolor='black', alpha=0.7)\n",
1019
+ " plt.title(f\"Distribution: {signal_name}\", fontsize=14, fontweight='bold')\n",
1020
+ " plt.xlabel(signal_name, fontsize=12)\n",
1021
+ " plt.ylabel('Frequency', fontsize=12)\n",
1022
+ " plt.grid(True, alpha=0.3)\n",
1023
+ " plt.tight_layout()\n",
1024
+ " plt.show()\n",
1025
+ " \n",
1026
+ " except Exception as e:\n",
1027
+ " print(f\"Error plotting: {e}\")"
1028
+ ]
1029
+ },
1030
+ {
1031
+ "cell_type": "markdown",
1032
+ "metadata": {},
1033
+ "source": [
1034
+ "## Custom SQL Queries\n",
1035
+ "\n",
1036
+ "Execute custom SQL queries for advanced analysis."
1037
+ ]
1038
+ },
1039
+ {
1040
+ "cell_type": "code",
1041
+ "execution_count": null,
1042
+ "metadata": {},
1043
+ "outputs": [],
1044
+ "source": [
1045
+ "# Example: Get record counts per device/message\n",
1046
+ "if devices and messages:\n",
1047
+ " device_id = devices[0]\n",
1048
+ " message = messages[0]\n",
1049
+ " table_name = catalog.get_table_name(device_id, message)\n",
1050
+ " \n",
1051
+ " custom_sql = f\"\"\"\n",
1052
+ " SELECT \n",
1053
+ " COUNT(*) as record_count,\n",
1054
+ " MIN(t) as min_timestamp,\n",
1055
+ " MAX(t) as max_timestamp\n",
1056
+ " FROM {config.database_name}.{table_name}\n",
1057
+ " \"\"\"\n",
1058
+ " \n",
1059
+ " try:\n",
1060
+ " df_stats = query.execute_sql(custom_sql)\n",
1061
+ " print(f\"Statistics for {device_id}/{message}:\")\n",
1062
+ " display(df_stats)\n",
1063
+ " except Exception as e:\n",
1064
+ " print(f\"Error executing custom SQL: {e}\")\n",
1065
+ " import traceback\n",
1066
+ " traceback.print_exc()"
1067
+ ]
1068
+ },
1069
+ {
1070
+ "cell_type": "code",
1071
+ "execution_count": null,
1072
+ "metadata": {},
1073
+ "outputs": [],
1074
+ "source": [
1075
+ "# Example: Aggregation query\n",
1076
+ "if devices and messages and 'signal_cols' in locals() and signal_cols:\n",
1077
+ " device_id = devices[0]\n",
1078
+ " message = messages[0]\n",
1079
+ " signal_name = signal_cols[0]\n",
1080
+ " \n",
1081
+ " try:\n",
1082
+ " df_agg = query.aggregate(\n",
1083
+ " device_id=device_id,\n",
1084
+ " message=message,\n",
1085
+ " aggregation=f\"\"\"\n",
1086
+ " COUNT(*) as count,\n",
1087
+ " AVG({signal_name}) as avg_{signal_name},\n",
1088
+ " MIN({signal_name}) as min_{signal_name},\n",
1089
+ " MAX({signal_name}) as max_{signal_name},\n",
1090
+ " STDDEV({signal_name}) as std_{signal_name}\n",
1091
+ " \"\"\",\n",
1092
+ " )\n",
1093
+ " \n",
1094
+ " print(f\"Aggregation for {signal_name}:\")\n",
1095
+ " display(df_agg)\n",
1096
+ " \n",
1097
+ " except Exception as e:\n",
1098
+ " print(f\"Error in aggregation: {e}\")\n",
1099
+ " import traceback\n",
1100
+ " traceback.print_exc()"
1101
+ ]
1102
+ },
1103
+ {
1104
+ "cell_type": "markdown",
1105
+ "metadata": {},
1106
+ "source": [
1107
+ "## Summary\n",
1108
+ "\n",
1109
+ "You've successfully:\n",
1110
+ "1. ✓ Connected to Athena\n",
1111
+ "2. ✓ Explored the data lake structure\n",
1112
+ "3. ✓ Queried sample data\n",
1113
+ "4. ✓ Analyzed time series\n",
1114
+ "5. ✓ Executed custom SQL queries\n",
1115
+ "\n",
1116
+ "### Next Steps\n",
1117
+ "\n",
1118
+ "- Modify the queries to explore your specific data\n",
1119
+ "- Add more visualizations\n",
1120
+ "- Perform statistical analysis\n",
1121
+ "- Export data for further analysis\n",
1122
+ "\n",
1123
+ "### Useful Commands\n",
1124
+ "\n",
1125
+ "```python\n",
1126
+ "# List all devices\n",
1127
+ "devices = catalog.list_devices()\n",
1128
+ "\n",
1129
+ "# List messages for a device\n",
1130
+ "messages = catalog.list_messages('device_id')\n",
1131
+ "\n",
1132
+ "# Get schema\n",
1133
+ "schema = catalog.get_schema('device_id', 'message_name')\n",
1134
+ "\n",
1135
+ "# Query data\n",
1136
+ "df = query.read_device_message('device_id', 'message_name', limit=1000)\n",
1137
+ "\n",
1138
+ "# Custom SQL\n",
1139
+ "df = query.execute_sql('SELECT * FROM database.table LIMIT 100')\n",
1140
+ "```"
1141
+ ]
1142
+ }
1143
+ ],
1144
+ "metadata": {
1145
+ "kernelspec": {
1146
+ "display_name": "venv",
1147
+ "language": "python",
1148
+ "name": "python3"
1149
+ },
1150
+ "language_info": {
1151
+ "codemirror_mode": {
1152
+ "name": "ipython",
1153
+ "version": 3
1154
+ },
1155
+ "file_extension": ".py",
1156
+ "mimetype": "text/x-python",
1157
+ "name": "python",
1158
+ "nbconvert_exporter": "python",
1159
+ "pygments_lexer": "ipython3",
1160
+ "version": "3.10.18"
1161
+ }
1162
+ },
1163
+ "nbformat": 4,
1164
+ "nbformat_minor": 2
1165
+ }
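The table listing in the notebook output above mixes per-device tables (an 8-hex-digit device prefix such as `97a4aaf4`) with shared tables like `aggregations_tripsummary`. A minimal sketch of splitting that naming convention client-side, assuming the `<device_id>_<message>` pattern shown in the output (`group_tables_by_device` is a hypothetical helper, not part of the SDK):

```python
from collections import defaultdict

def group_tables_by_device(tables):
    """Group Athena table names like '97a4aaf4_can9_gnssspeed' by device ID.

    Assumes the '<device_id>_<message>' convention seen in the notebook
    output, with an 8-hex-digit device prefix; tables without such a
    prefix (e.g. 'aggregations_tripsummary') are grouped under 'other'.
    """
    grouped = defaultdict(list)
    for table in tables:
        prefix, _, rest = table.partition("_")
        if len(prefix) == 8 and all(c in "0123456789abcdef" for c in prefix):
            grouped[prefix].append(rest)
        else:
            grouped["other"].append(table)
    return dict(grouped)

tables = [
    "97a4aaf4_can9_gnssspeed",
    "97a4aaf4_messages",
    "b8280fd1_can9_gnsspos",
    "aggregations_tripsummary",
]
print(group_tables_by_device(tables))
```

This is purely string-based grouping; the actual `catalog.list_devices()` / `catalog.list_messages()` calls in the SDK query Athena's information schema instead.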
src/images/analysis.png ADDED
src/images/logo.png ADDED
src/images/oxon.jpeg ADDED
src/requirements.txt ADDED
@@ -0,0 +1,10 @@
1
+ pandas>=1.3.0
2
+ boto3>=1.20.0
3
+ matplotlib>=3.5.0
4
+ seaborn>=0.12.0
5
+ jupyter>=1.0.0
6
+ streamlit>=1.28.0
7
+ pyyaml>=6.0
8
+ plotly>=5.0.0
9
+ pillow>=9.0.0
10
+ ydata-profiling>=4.0.0
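The requirements above use only lower-bound pins (`>=`). A small stdlib-only sketch of splitting such specifier lines for a custom environment check (`parse_requirement` is a hypothetical helper, not part of the project):

```python
def parse_requirement(line):
    """Split a 'pkg>=x.y.z' requirement line (the only specifier style
    used in this project's requirements.txt) into (name, minimum_version)."""
    name, _, minimum = line.strip().partition(">=")
    return name, minimum

requirements = ["pandas>=1.3.0", "boto3>=1.20.0", "streamlit>=1.28.0"]
for name, minimum in map(parse_requirement, requirements):
    print(f"{name}: >= {minimum}")
```

For anything beyond this single operator (extras, markers, ranges), `packaging.requirements.Requirement` is the robust choice.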
src/setup.py ADDED
@@ -0,0 +1,43 @@
1
+ """
2
+ Setup script for CANedge Data Lake Python SDK.
3
+ """
4
+
5
+ from setuptools import setup, find_packages
6
+
7
+ with open("README.md", "r", encoding="utf-8") as fh:
8
+ long_description = fh.read()
9
+
10
+ setup(
11
+ name="canedge-datalake",
12
+ version="0.1.0",
13
+ author="CSS Electronics",
14
+ description="Production-ready Python package for querying and analyzing CAN/LIN data lakes",
15
+ long_description=long_description,
16
+ long_description_content_type="text/markdown",
17
+ url="https://github.com/CSS-Electronics/canedge-datalake",
18
+ packages=find_packages(),
19
+ classifiers=[
20
+ "Development Status :: 4 - Beta",
21
+ "Intended Audience :: Developers",
22
+ "Topic :: Scientific/Engineering",
23
+ "License :: OSI Approved :: MIT License",
24
+ "Programming Language :: Python :: 3",
25
+ "Programming Language :: Python :: 3.10",
26
+ "Programming Language :: Python :: 3.11",
27
+ "Programming Language :: Python :: 3.12",
28
+ ],
29
+ python_requires=">=3.10",
30
+ install_requires=[
31
+ "pandas>=1.3.0",
32
+ "pyarrow>=8.0.0",
33
+ "boto3>=1.20.0",
34
+ ],
35
+ extras_require={
36
+ "dev": [
37
+ "pytest>=7.0.0",
38
+ "black>=22.0.0",
39
+ "mypy>=0.950",
40
+ "ruff>=0.1.0",
41
+ ],
42
+ },
43
+ )
src/streamlit_app.py CHANGED
@@ -1,40 +1,1115 @@
1
- import altair as alt
2
- import numpy as np
3
- import pandas as pd
4
- import streamlit as st
5
 
 
6
  """
7
- # Welcome to Streamlit!
8
 
9
- Edit `/streamlit_app.py` to customize this app to your heart's desire :heart:.
10
- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
11
- forums](https://discuss.streamlit.io).
 
12
 
13
- In the meantime, below is an example of what you can do with just a few lines of code:
14
- """
15
 
16
- num_points = st.slider("Number of points in spiral", 1, 10000, 1100)
17
- num_turns = st.slider("Number of turns in spiral", 1, 300, 31)
18
-
19
- indices = np.linspace(0, 1, num_points)
20
- theta = 2 * np.pi * num_turns * indices
21
- radius = indices
22
-
23
- x = radius * np.cos(theta)
24
- y = radius * np.sin(theta)
25
-
26
- df = pd.DataFrame({
27
- "x": x,
28
- "y": y,
29
- "idx": indices,
30
- "rand": np.random.randn(num_points),
31
- })
32
-
33
- st.altair_chart(alt.Chart(df, height=700, width=700)
34
- .mark_point(filled=True)
35
- .encode(
36
- x=alt.X("x", axis=None),
37
- y=alt.Y("y", axis=None),
38
- color=alt.Color("idx", legend=None, scale=alt.Scale()),
39
- size=alt.Size("rand", legend=None, scale=alt.Scale(range=[1, 150])),
40
- ))
 
+ """
+ OXON Technologies - Professional Streamlit Dashboard
+
+ A comprehensive dashboard for analyzing device data from AWS Athena data lake.
  """
+
+ import streamlit as st
+ from warnings import filterwarnings
+ import base64
+ from pathlib import Path
+ from PIL import Image
+ import pandas as pd
+ import numpy as np
+ import yaml
+ import re
+ import plotly.graph_objects as go
+ from typing import Dict, Optional, List, Tuple
+
+ from ydata_profiling import ProfileReport
+ import plotly.express as px
+
+ from datalake.config import DataLakeConfig
+ from datalake.athena import AthenaQuery
+ from datalake.catalog import DataLakeCatalog
+ from datalake.query import DataLakeQuery
+ from datalake.batch import BatchProcessor
+
+ from utils.correlation import CorrelationMatrixGenerator
+ from utils.dimension_reduction import DimensionReduction
+ from utils.feature_class import DetectFeatureClasses
+
+ # Ignore warnings
+ filterwarnings("ignore")
+
+ # ============================================================================
+ # Configuration Management
+ # ============================================================================
+
+ def load_config(config_path: str = "config.yaml") -> Dict:
+     """
+     Load configuration from YAML file.
+
+     Args:
+         config_path: Path to the configuration YAML file
+
+     Returns:
+         Dictionary containing configuration settings
+
+     Raises:
+         FileNotFoundError: If config file doesn't exist
+         yaml.YAMLError: If config file is invalid YAML
+     """
+     config_file = Path(config_path)
+     if not config_file.exists():
+         raise FileNotFoundError(f"Configuration file not found: {config_path}")
+
+     with open(config_file, 'r') as f:
+         config = yaml.safe_load(f)
+
+     return config
+
+
+ def initialize_aws_services(config: Dict) -> Tuple[DataLakeConfig, AthenaQuery, DataLakeCatalog, DataLakeQuery, BatchProcessor]:
+     """
+     Initialize AWS services using configuration.
+
+     Args:
+         config: Configuration dictionary with AWS credentials
+
+     Returns:
+         Tuple of (config, athena, catalog, query, processor)
+
+     Raises:
+         KeyError: If required configuration keys are missing
+         Exception: If AWS service initialization fails
+     """
+     aws_config = config.get('aws', {})
+
+     required_keys = ['database_name', 'workgroup', 's3_output_location', 'region',
+                      'access_key_id', 'secret_access_key']
+     missing_keys = [key for key in required_keys if key not in aws_config]
+
+     if missing_keys:
+         raise KeyError(f"Missing required AWS configuration keys: {missing_keys}")
+
+     data_lake_config = DataLakeConfig.from_credentials(
+         database_name=aws_config['database_name'],
+         workgroup=aws_config['workgroup'],
+         s3_output_location=aws_config['s3_output_location'],
+         region=aws_config['region'],
+         access_key_id=aws_config['access_key_id'],
+         secret_access_key=aws_config['secret_access_key'],
+     )
+
+     athena = AthenaQuery(data_lake_config)
+     catalog = DataLakeCatalog(athena, data_lake_config)
+     query = DataLakeQuery(athena, catalog)
+     processor = BatchProcessor(query)
+
+     return data_lake_config, athena, catalog, query, processor
+
+
+ # ============================================================================
+ # Session State Management
+ # ============================================================================
+
+ def initialize_session_state():
+     """Initialize all session state variables with proper defaults."""
+     # Configuration
+     if 'app_config' not in st.session_state:
+         try:
+             st.session_state['app_config'] = load_config()
+         except Exception as e:
+             st.session_state['app_config'] = None
+             st.session_state['config_error'] = str(e)
+
+     # AWS Services (only initialize when needed)
+     if 'aws_initialized' not in st.session_state:
+         st.session_state['aws_initialized'] = False
+
+     if 'aws_error' not in st.session_state:
+         st.session_state['aws_error'] = None
+
+     # User selections
+     if 'selected_device' not in st.session_state:
+         st.session_state['selected_device'] = None
+
+     if 'selected_message' not in st.session_state:
+         st.session_state['selected_message'] = None
+
+     if 'message_mapping' not in st.session_state:
+         st.session_state['message_mapping'] = None
+
+     # Date range filter
+     if 'date_range_enabled' not in st.session_state:
+         st.session_state['date_range_enabled'] = False
+
+     # Selected dates (what user picks in the UI)
+     if 'date_range_start' not in st.session_state:
+         st.session_state['date_range_start'] = None
+
+     if 'date_range_end' not in st.session_state:
+         st.session_state['date_range_end'] = None
+
+     # Applied dates (what's actually being used for filtering)
+     if 'applied_date_range_start' not in st.session_state:
+         st.session_state['applied_date_range_start'] = None
+
+     if 'applied_date_range_end' not in st.session_state:
+         st.session_state['applied_date_range_end'] = None
+
+     # Data cache
+     if 'device_list' not in st.session_state:
+         st.session_state['device_list'] = None
+
+     if 'message_list' not in st.session_state:
+         st.session_state['message_list'] = None
+
+     if 'current_data' not in st.session_state:
+         st.session_state['current_data'] = None
+
+     # Correlations tab
+     if 'correlations_run_clicked' not in st.session_state:
+         st.session_state['correlations_run_clicked'] = False
+
+     if 'correlations_data' not in st.session_state:
+         st.session_state['correlations_data'] = None
+
+     if 'correlation_matrix' not in st.session_state:
+         st.session_state['correlation_matrix'] = None
+
+     if 'feature_clusters' not in st.session_state:
+         st.session_state['feature_clusters'] = None
+
+
+ def initialize_aws_if_needed():
+     """
+     Initialize AWS services if not already initialized.
+     Returns True if successful, False otherwise.
+     """
+     if st.session_state['aws_initialized']:
+         return True
+
+     if st.session_state['app_config'] is None:
+         return False
+
+     try:
+         config, athena, catalog, query, processor = initialize_aws_services(
+             st.session_state['app_config']
+         )
+
+         st.session_state['config'] = config
+         st.session_state['athena'] = athena
+         st.session_state['catalog'] = catalog
+         st.session_state['query'] = query
+         st.session_state['processor'] = processor
+         st.session_state['aws_initialized'] = True
+         st.session_state['aws_error'] = None
+
+         return True
+     except Exception as e:
+         st.session_state['aws_error'] = str(e)
+         st.session_state['aws_initialized'] = False
+         return False
+
+
+ # ============================================================================
+ # UI Components
+ # ============================================================================
+
+ def get_base64_image(image_path: str) -> Optional[str]:
+     """
+     Convert image to base64 string.
+
+     Args:
+         image_path: Path to the image file
+
+     Returns:
+         Base64 encoded string or None if file not found
+     """
+     try:
+         image_file = Path(image_path)
+         if not image_file.exists():
+             return None
+
+         with open(image_file, "rb") as f:
+             return base64.b64encode(f.read()).decode()
+     except Exception:
+         return None
+
+
+ def display_header(logo_path: str, title: str):
+     """
+     Display header with logo and title.
+
+     Args:
+         logo_path: Path to logo image
+         title: Header title text
+     """
+     logo_base64 = get_base64_image(logo_path)
+
+     if logo_base64:
+         st.markdown(
+             f"""
+             <div style="display: flex; align-items: center;">
+                 <img src="data:image/png;base64,{logo_base64}" alt="Logo"
+                      style="height: 200px; margin-right: 10px;">
+                 <h1 style="display: inline; margin: 0;">{title} 🔍</h1>
+             </div>
+             """,
+             unsafe_allow_html=True,
+         )
+     else:
+         st.title(f"{title} 🔍")
+
+
+ def display_sidebar():
+     """Display sidebar with device selection."""
+     with st.sidebar:
+         # Logo
+         logo_path = st.session_state['app_config'].get('dashboard', {}).get('logo_path', 'images/logo.png')
+         try:
+             st.image(Image.open(logo_path), width='stretch')
+         except Exception:
+             st.write("OXON Technologies")
+
+         st.title("OXON Technologies")
+         st.write("Welcome to the OXON Technologies dashboard. "
+                  "Select a device ID and click **Go!** to begin analysis.")
+
+         # Check if AWS services are initialized
+         if not st.session_state['aws_initialized']:
+             st.warning("⚠️ AWS services not initialized. Please check configuration.")
+             return
+
+         # Load device list if not cached
+         if st.session_state['device_list'] is None:
+             try:
+                 with st.spinner("Loading devices..."):
+                     st.session_state['device_list'] = st.session_state['catalog'].list_devices()
+             except Exception as e:
+                 st.error(f"Error loading devices: {str(e)}")
+                 return
+
+         devices_list = st.session_state['device_list']
+
+         if not devices_list:
+             st.warning("No devices found in the data lake.")
+             return
+
+         # Device selection
+         current_index = 0
+         if st.session_state['selected_device'] in devices_list:
+             current_index = devices_list.index(st.session_state['selected_device'])
+
+         selected_device = st.selectbox(
+             "Device ID",
+             devices_list,
+             index=current_index,
+             key="sidebar_device_select"
+         )
+
+         # Apply device selection only when user clicks the button
+         if st.button("Go!", key="device_go_btn", width='stretch'):
+             st.session_state['selected_device'] = selected_device
+             st.session_state['selected_message'] = None
+             st.session_state['message_list'] = None
+             st.session_state['message_mapping'] = None
+             st.session_state['current_data'] = None
+             st.session_state['date_range_enabled'] = False
+             st.session_state['date_range_start'] = None
+             st.session_state['date_range_end'] = None
+             st.session_state['applied_date_range_start'] = None
+             st.session_state['applied_date_range_end'] = None
+             st.session_state['correlations_run_clicked'] = False
+             st.session_state['correlations_data'] = None
+             st.session_state['correlation_matrix'] = None
+             st.session_state['feature_clusters'] = None
+             st.rerun()
+
+         # Show selected device info only after user has confirmed
+         if st.session_state['selected_device']:
+             st.success(f"✓ Selected: {st.session_state['selected_device']}")
+
+
+ # ============================================================================
+ # Message Processing
+ # ============================================================================
+
+ def build_message_mapping(messages_list: List[str], mapping_config: Dict) -> Tuple[Dict[str, str], List[str]]:
+     """
+     Build message mapping dictionary from raw messages.
+
+     Args:
+         messages_list: List of raw message names
+         mapping_config: Configuration dictionary with message mappings
+
+     Returns:
+         Tuple of (messages_mapping_dict, lost_messages_list)
+     """
+     pattern = re.compile(r"s(?P<s>\d{2})pid.*m(?P<m>[0-9a-fA-F]{2})$")
+
+     messages_mapping_dict = {}
+     lost_messages_list = []
+
+     for message in messages_list:
+
+         # Do not change name for messages that are not can1
+         if not message.startswith('can1'):
+             messages_mapping_dict[message] = message
+             continue
+
+         message_id_parts = pattern.search(message)
+         if not message_id_parts:
+             continue
+
+         message_id = (message_id_parts.group("s") + message_id_parts.group("m")).upper()
+
+         if message_id in mapping_config:
+             message_name = mapping_config[message_id]['name']
+             messages_mapping_dict[message_name] = message
+         else:
+             lost_messages_list.append(message)
+
+     return messages_mapping_dict, lost_messages_list
+
+
+ def load_message_list(device_id: str) -> Optional[List[str]]:
+     """
+     Load message list for a device.
+
+     Args:
+         device_id: Device ID to load messages for
+
+     Returns:
+         List of message names or None if error
+     """
+     try:
+         return st.session_state['catalog'].list_messages(device_id)
+     except Exception as e:
+         st.error(f"Error loading messages: {str(e)}")
+         return None
+
+
+ # ============================================================================
+ # Tab Components
+ # ============================================================================
+
+ def render_message_viewer_tab():
+     """Render the Message Viewer tab."""
+     # Check prerequisites
+     if not st.session_state['aws_initialized']:
+         st.error("AWS services not initialized. Please check configuration.")
+         return
+
+     if not st.session_state['selected_device']:
+         st.info("👈 Please select a device from the sidebar and click **Go!** to begin.")
+         return
+
+     device_id = st.session_state['selected_device']
+
+     # Load message list if not cached
+     if st.session_state['message_list'] is None:
+         with st.spinner(f"Loading messages for device {device_id}..."):
+             st.session_state['message_list'] = load_message_list(device_id)
+
+         if st.session_state['message_list'] is None:
+             return
+
+     messages_list = st.session_state['message_list']
+
+     if not messages_list:
+         st.warning(f"No messages found for device {device_id}.")
+         return
+
+     # Get message mapping configuration
+     mapping_config = st.session_state['app_config'].get('message_mapping', {})
+
+     # Build message mapping
+     if st.session_state['message_mapping'] is None:
+         messages_mapping_dict, lost_messages_list = build_message_mapping(
+             messages_list, mapping_config
+         )
+         st.session_state['message_mapping'] = messages_mapping_dict
+
+         if lost_messages_list:
+             st.warning(
+                 f"The following messages were not found in the mapping: "
+                 f"{', '.join(lost_messages_list[:10])}"
+                 f"{'...' if len(lost_messages_list) > 10 else ''}"
+             )
+     else:
+         messages_mapping_dict = st.session_state['message_mapping']
+
+     if not messages_mapping_dict:
+         st.warning("No valid messages found after mapping.")
+         return
+
+     # Message selection
+     current_index = 0
+     if st.session_state['selected_message']:
+         # Find the message name that corresponds to selected_message
+         for name, msg in messages_mapping_dict.items():
+             if msg == st.session_state['selected_message']:
+                 if name in list(messages_mapping_dict.keys()):
+                     current_index = list(messages_mapping_dict.keys()).index(name)
+                 break
+
+     st.markdown('<div style="text-align: center;"><h2>Message Viewer</h2></div>', unsafe_allow_html=True)
+     st.divider()
+
+     selected_message_name = st.selectbox(
+         "Select Message",
+         list(messages_mapping_dict.keys()),
+         index=current_index,
+         key="message_selectbox"
+     )
+
+     message_clicked = st.button("Show!", key="message_show_btn", width='stretch')
+
+     selected_message = messages_mapping_dict[selected_message_name]
+
+     # Apply message selection only when user clicks the button
+     if message_clicked:
+         st.session_state['selected_message'] = selected_message
+         st.session_state['current_data'] = None
+         st.rerun()
+
+     if st.session_state['selected_message']:
+         st.info(f"📊 Selected message: `{st.session_state['selected_message']}` ({selected_message_name})")
+
+         # Date range selection (optional filter)
+         st.divider()
+         date_range_enabled = st.checkbox(
+             "Filter by Date Range",
+             value=st.session_state.get('date_range_enabled', False),
+             key="date_range_checkbox",
+             help="Enable to filter data by date range"
+         )
+
+         if date_range_enabled:
+             # Get min/max dates from cached data if available
+             min_date = None
+             max_date = None
+             if st.session_state.get('current_data') is not None:
+                 try:
+                     df_temp = st.session_state['current_data']
+                     if 'timestamp' in df_temp.columns:
+                         min_date = df_temp['timestamp'].min().date()
+                         max_date = df_temp['timestamp'].max().date()
+                 except Exception:
+                     pass
+
+             col_start, col_end = st.columns([1, 1])
+
+             with col_start:
+                 date_start = st.date_input(
+                     "Start Date",
+                     value=st.session_state.get('date_range_start') or min_date,
+                     min_value=min_date,
+                     max_value=max_date,
+                     key="date_range_start_input",
+                     help="Select start date for filtering"
+                 )
+
+             with col_end:
+                 date_end = st.date_input(
+                     "End Date",
+                     value=st.session_state.get('date_range_end') or max_date,
+                     min_value=min_date,
+                     max_value=max_date,
+                     key="date_range_end_input",
+                     help="Select end date for filtering"
+                 )
+
+             apply_filter_clicked = st.button(
+                 "Apply Filter",
+                 key="apply_date_filter_btn",
+                 use_container_width=True
+             )
+
+             # Update selected dates in session state
+             st.session_state['date_range_start'] = date_start
+             st.session_state['date_range_end'] = date_end
+
+             # Apply filter only when button is clicked
+             if apply_filter_clicked:
+                 # Validate date range before applying
+                 if date_start > date_end:
+                     st.error("⚠️ Start date must be before or equal to end date.")
+                 else:
+                     st.session_state['applied_date_range_start'] = date_start
+                     st.session_state['applied_date_range_end'] = date_end
+                     st.rerun()
+
+             # Show current applied filter status
+             if st.session_state.get('applied_date_range_start') and st.session_state.get('applied_date_range_end'):
+                 st.success(
+                     f"📅 **Applied filter:** {st.session_state['applied_date_range_start']} to "
+                     f"{st.session_state['applied_date_range_end']}"
+                 )
+             elif date_start and date_end:
+                 if date_start <= date_end:
+                     st.info("ℹ️ Select dates and click **Apply Filter** to filter the data.")
+                 else:
+                     st.error("⚠️ Start date must be before or equal to end date.")
+         else:
+             # Clear applied date range when disabled
+             if st.session_state.get('date_range_enabled'):
+                 st.session_state['applied_date_range_start'] = None
+                 st.session_state['applied_date_range_end'] = None
+                 st.session_state['date_range_start'] = None
+                 st.session_state['date_range_end'] = None
+
+         # Update enabled state
+         st.session_state['date_range_enabled'] = date_range_enabled
+
+         render_message_data(device_id, st.session_state['selected_message'])
+     else:
+         st.info("Select a message and click **Show!** to load data.")
+
+
+ def render_message_data(device_id: str, message: str):
+     """
+     Render data and plot for a selected message.
+
+     Args:
+         device_id: Device ID
+         message: Message name
+     """
+     # Load data if not cached
+     if st.session_state['current_data'] is None:
+         with st.spinner("Loading data..."):
+             try:
+                 df = st.session_state['query'].read_device_message(
+                     device_id=device_id,
+                     message=message,
+                 )
+
+                 if df is None or df.empty:
+                     st.warning("No data found for the selected message.")
+                     return
+
+                 # Process data
+                 df['t'] = pd.to_datetime(df['t'])
+                 df = df.sort_values(by='t').reset_index(drop=True)
+                 df = df.rename(columns={'t': 'timestamp'})
+
+                 st.session_state['current_data'] = df
+             except Exception as e:
+                 st.error(f"Error loading data: {str(e)}")
+                 return
+
+     df = st.session_state['current_data'].copy()
+     df = df.drop(columns=['date_created'], errors='ignore')
+
+     if df is None or df.empty:
+         return
+
+     # Apply date range filter if enabled and applied dates are set
+     original_row_count = len(df)
+     if (st.session_state.get('date_range_enabled') and
+             st.session_state.get('applied_date_range_start') and
+             st.session_state.get('applied_date_range_end')):
+
+         start_date = pd.to_datetime(st.session_state['applied_date_range_start'])
+         end_date = pd.to_datetime(st.session_state['applied_date_range_end'])
+         # Include the entire end date (set to end of day)
+         end_date = end_date.replace(hour=23, minute=59, second=59)
+
+         df = df[(df['timestamp'] >= start_date) & (df['timestamp'] <= end_date)].copy()
+
+         if len(df) == 0:
+             st.warning(
+                 f"⚠️ No data found in the selected date range "
+                 f"({st.session_state['applied_date_range_start']} to {st.session_state['applied_date_range_end']})."
+             )
+             st.info("Try selecting a different date range or disable the filter to see all data.")
+             return
+         elif len(df) < original_row_count:
+             st.info(f"📊 Showing {len(df):,} of {original_row_count:,} records (filtered by date range).")
+
+     # Display statistics
+     # st.subheader("Statistics")
+     st.divider()
+     st.markdown('<div style="text-align: center;"><h2>Overview</h2></div>', unsafe_allow_html=True)
+     st.divider()
+     col1, col2, col3, col4 = st.columns([1, 2, 1, 1])
+
+     with col1:
+         st.metric("Total Records", len(df))
+     with col2:
+         st.metric("Date Range", f"{df['timestamp'].min().date()} to {df['timestamp'].max().date()}")
+     with col3:
+         st.metric("Data Columns", len(df.columns) - 1)  # Exclude timestamp
+     with col4:
+         st.metric("Time Span", f"{(df['timestamp'].max() - df['timestamp'].min()).days} days")
+
+     # Display data section
+     st.divider()
+     st.markdown('<div style="text-align: center;"><h2>Data & Profile Report</h2></div>', unsafe_allow_html=True)
+     st.divider()
+
+     col1, col2 = st.columns([1, 2])
+     with col1:
+         try:
+             st.dataframe(df.set_index('timestamp'), width='stretch', height=700)
+         except Exception as e:  # dataframe was too large
+             st.warning(f"Dataframe was too large to display: {str(e)}")
+             st.info("Dataframe was too large to display. Please use the profile report to analyze the data.")
+     with col2:
+         try:
+             pr = ProfileReport(df, title="Data Profile", explorative=False, vars={"num": {"low_categorical_threshold": 0}})
+             st.components.v1.html(pr.to_html(), scrolling=True, height=700)
+         except Exception as e:
+             st.warning(f"Profile report could not be generated: {e}")
+
+     # Display plot section
+     st.divider()
+     st.markdown('<div style="text-align: center;"><h2>Visualization</h2></div>', unsafe_allow_html=True)
+     st.divider()
+
+     try:
+         # Prepare aggregated data
+         daily_aggregated_df = df.groupby(
+             pd.Grouper(key='timestamp', freq='D')
+         ).mean().reset_index().fillna(0)
+
+         # Create plot
+         fig = go.Figure()
+
+         data_columns = [col for col in daily_aggregated_df.columns
+                         if col not in ['timestamp']]
+
+         for column in data_columns:
+             fig.add_trace(
+                 go.Scatter(
+                     x=daily_aggregated_df['timestamp'],
+                     y=daily_aggregated_df[column],
+                     name=column,
+                     mode='lines+markers'
+                 )
+             )
+
+         # Red vertical line at 16 December 2025 with legend entry "Dosing Stage"
+         dosing_date = st.session_state['app_config'].get('dashboard', {}).get('dosing_stage_date', '2025-12-16')
+         try:
+             dosing_datetime = pd.to_datetime(dosing_date)
+             if data_columns:
+                 y_min = daily_aggregated_df[data_columns].min().min()
+                 y_max = daily_aggregated_df[data_columns].max().max()
+                 if y_min == y_max:
+                     y_min, y_max = y_min - 0.1, y_max + 0.1
+             else:
+                 y_min, y_max = 0, 1
+             # Add vertical line as a trace so it appears in the legend as "Dosing Stage"
+             fig.add_trace(
+                 go.Scatter(
+                     x=[dosing_datetime, dosing_datetime],
+                     y=[y_min, y_max],
+                     mode='lines',
+                     name='Dosing Stage',
+                     line=dict(color='red', width=2)
+                 )
+             )
+         except Exception:
+             pass
+
+         # Update layout with legend
+         fig.update_layout(
+             title="Daily Aggregated Data",
+             xaxis_title="Date",
+             yaxis_title="Value",
+             hovermode='x unified',
+             width=800,
+             height=700,
+             showlegend=True,
+             legend=dict(
+                 orientation="h",
+                 yanchor="bottom",
+                 y=1.02,
+                 xanchor="right",
+                 x=1,
+                 title_text=""
+             )
+         )
+         st.plotly_chart(fig, width='stretch')
+
+     except Exception as e:
+         st.error(f"Error creating visualization: {str(e)}")
+
+
+ def load_all_device_messages(device_id: str) -> Optional[pd.DataFrame]:
+     """
+     Load all messages for a device, aggregate daily, and merge on timestamp.
+
+     Args:
+         device_id: Device ID to load messages for
+
+     Returns:
+         Merged DataFrame with all messages aggregated daily, or None if error
+     """
+     try:
+         messages_list = st.session_state['catalog'].list_messages(device_id)
+         if not messages_list:
+             return None
+
+         aggregated_dfs = []
+         failed_messages = []
+
+         progress_bar = st.progress(0)
+         status_text = st.empty()
+
+         total_messages = len(messages_list)
+
+         for idx, message in enumerate(messages_list):
+
+             if message.startswith('can9'):
+                 continue
+
+             status_text.text(f"Loading message {idx + 1}/{total_messages}: {message}")
+             progress_bar.progress((idx + 1) / total_messages)
+
+             try:
+                 # Load message data
+                 df = st.session_state['query'].read_device_message(
+                     device_id=device_id,
+                     message=message,
+                 )
+
+                 if df is None or df.empty:
+                     failed_messages.append(message)
+                     continue
+
+                 # Process data
+                 df['t'] = pd.to_datetime(df['t'])
+                 df = df.sort_values(by='t').reset_index(drop=True)
+                 df = df.rename(columns={'t': 'timestamp'})
+
+                 # Drop date_created column
+                 df = df.drop(columns=['date_created'], errors='ignore')
+
+                 # Aggregate daily by mean
+                 daily_df = df.groupby(
+                     pd.Grouper(key='timestamp', freq='D')
+                 ).mean().reset_index()
+
+                 # Remove rows with all NaN (days with no data)
+                 daily_df = daily_df.dropna(how='all', subset=[col for col in daily_df.columns if col != 'timestamp'])
+
+                 if daily_df.empty:
+                     failed_messages.append(message)
+                     continue
+
+                 # Rename columns to include message name (except timestamp)
+                 # Handle multiple data columns for non-can1 messages
+                 rename_dict = {}
+                 for col in daily_df.columns:
+                     if col != 'timestamp':
+                         # Create unique column name: message_name__column_name
+                         rename_dict[col] = f"{message}__{col}"
+
+                 daily_df = daily_df.rename(columns=rename_dict)
+
+                 aggregated_dfs.append(daily_df)
+
+             except Exception as e:
+                 failed_messages.append(f"{message} ({str(e)})")
+                 continue
+
+         progress_bar.empty()
+         status_text.empty()
+
+         if not aggregated_dfs:
+             if failed_messages:
+                 st.warning(f"Failed to load all messages. Errors: {', '.join(failed_messages[:5])}")
+             return None
+
+         if failed_messages:
+             st.warning(f"Failed to load {len(failed_messages)} message(s). Continuing with {len(aggregated_dfs)} messages.")
+
+         # Merge all dataframes on timestamp
+         merged_df = aggregated_dfs[0]
+         for df in aggregated_dfs[1:]:
+             merged_df = pd.merge(
+                 merged_df,
+                 df,
+                 on='timestamp',
+                 how='outer'  # Keep all days from all messages
+             )
+
+         # Sort by timestamp
+         merged_df = merged_df.sort_values(by='timestamp').reset_index(drop=True)
+
+         # Fill NaN with 0 for numeric columns (or forward fill)
+         numeric_cols = merged_df.select_dtypes(include=[np.number]).columns
+         merged_df[numeric_cols] = merged_df[numeric_cols].fillna(0)
+
+         return merged_df
+
+     except Exception as e:
+         st.error(f"Error loading device messages: {str(e)}")
+         return None
+
+
846
+ def _reset_correlations():
847
+ """Clear correlations run state and caches (used by Start over button)."""
848
+ st.session_state['correlations_run_clicked'] = False
849
+ st.session_state['correlations_data'] = None
850
+ st.session_state['correlation_matrix'] = None
851
+ st.session_state['feature_clusters'] = None
852
+
853
+
854
+ def render_correlations_tab():
855
+ """Render the Correlations tab with correlation matrix and feature clusters."""
856
+ # Check prerequisites
857
+ if not st.session_state['aws_initialized']:
858
+ st.error("AWS services not initialized. Please check configuration.")
859
+ return
860
+
861
+ if not st.session_state['selected_device']:
862
+ st.info("👈 Please select a device from the sidebar and click **Go!** to begin.")
863
+ return
864
+
865
+ device_id = st.session_state['selected_device']
866
+
867
+ st.markdown('<div style="text-align: center;"><h2>Correlation Analysis</h2></div>', unsafe_allow_html=True)
868
+ st.divider()
869
+
870
+ # Run button: calculations start only after user presses it
871
+ if not st.session_state.get('correlations_run_clicked'):
872
+ st.info(
873
+ "This analysis loads **all messages** for the selected device, aggregates them daily, "
874
+ "and computes correlations and feature cohorts. Click the button below to start."
875
+ )
876
+ if st.button("Run Correlation Analysis", key="run_correlations_btn", type="primary", use_container_width=True):
877
+ st.session_state['correlations_run_clicked'] = True
878
+ st.rerun()
879
+ return
880
+
881
+ # Load all device messages if not cached
882
+ if st.session_state['correlations_data'] is None:
883
+ with st.spinner(f"Loading all messages for device {device_id}..."):
884
+ st.session_state['correlations_data'] = load_all_device_messages(device_id)
885
+
886
+ if st.session_state['correlations_data'] is None or st.session_state['correlations_data'].empty:
887
+ st.error("No data available for correlation analysis.")
888
+ if st.button("Start over", key="correlations_start_over_btn"):
889
+ _reset_correlations()
890
+ st.rerun()
891
+ return
892
+
893
+ df = st.session_state['correlations_data'].copy()
894
+
895
+ # Remove timestamp column for correlation analysis
896
+ df_features = df.drop(columns=['timestamp'], errors='ignore')
897
+
898
+ if df_features.empty:
899
+ st.error("No features available for correlation analysis.")
900
+ return
901
+
902
+ st.info(f"📊 Analyzing {len(df_features.columns)} features from {len(df)} days of data.")
903
+
904
+ # Detect feature classes
905
+ st.subheader("1. Feature Classification")
906
+ with st.spinner("Classifying features..."):
907
+ try:
908
+ detector = DetectFeatureClasses(df_features, categorical_threshold=0.5, string_data_policy='drop')
909
+ feature_classes, dropped_features = detector.feature_classes()
910
+
911
+ if dropped_features:
912
+ st.warning(f"Dropped {len(dropped_features)} non-numeric features (first 5 shown): {', '.join(dropped_features[:5])}")
913
+ df_features = df_features.drop(columns=dropped_features)
914
+
915
+ # Display feature class summary
916
+ class_counts = {}
917
+ for cls in feature_classes.values():
918
+ class_counts[cls] = class_counts.get(cls, 0) + 1
919
+
920
+ col1, col2, col3 = st.columns(3)
921
+ with col1:
922
+ st.metric("Continuous", class_counts.get('Continuous', 0))
923
+ with col2:
924
+ st.metric("Binary", class_counts.get('Binary', 0))
925
+ with col3:
926
+ st.metric("Categorical", class_counts.get('Categorical', 0))
927
+
928
+ except Exception as e:
929
+ st.error(f"Error classifying features: {str(e)}")
930
+ return
931
+
932
+ # Generate correlation matrix
933
+ st.subheader("2. Correlation Matrix")
934
+ if st.session_state['correlation_matrix'] is None:
935
+ with st.spinner("Generating correlation matrix (this may take a while)..."):
936
+ try:
937
+ corr_generator = CorrelationMatrixGenerator(
938
+ df=df_features,
939
+ feature_classes=feature_classes,
940
+ continuous_vs_continuous_method='pearson'
941
+ )
942
+ st.session_state['correlation_matrix'] = corr_generator.generate_matrix()
943
+ except Exception as e:
944
+ st.error(f"Error generating correlation matrix: {str(e)}")
945
+ return
946
+
947
+ corr_matrix = st.session_state['correlation_matrix']
948
+
949
+ # Display interactive heatmap
950
+ st.markdown("**Interactive Correlation Heatmap**")
951
+ try:
952
+ # Create heatmap using plotly
953
+ fig = px.imshow(
954
+ corr_matrix,
955
+ color_continuous_scale='RdBu',
956
+ aspect='auto',
957
+ labels=dict(x="Feature", y="Feature", color="Correlation"),
958
+ title="Feature Correlation Matrix"
959
+ )
960
+ fig.update_layout(
961
+ height=max(800, len(corr_matrix) * 40),
962
+ width=max(800, len(corr_matrix) * 40)
963
+ )
964
+ st.plotly_chart(fig, use_container_width=True)
965
+ except Exception as e:
966
+ st.error(f"Error displaying heatmap: {str(e)}")
967
+
968
+ # Find feature clusters using dimension reduction
969
+ st.subheader("3. Feature Clusters (Cohorts)")
970
+ if st.session_state['feature_clusters'] is None:
971
+ with st.spinner("Finding feature clusters..."):
972
+ try:
973
+ dim_reduction = DimensionReduction(
974
+ dataframe=df_features,
975
+ feature_classes=feature_classes,
976
+ method='pearson',
977
+ projection_dimension=1
978
+ )
979
+
980
+ # Find clusters at different correlation thresholds; store (lower, upper) with each band for correct labeling
981
+ st.session_state['feature_clusters'] = [
982
+ ((0.95, 1.0), dim_reduction.find_clusters(lower_bound=0.95, upper_bound=1.0)),
983
+ ((0.90, 0.95), dim_reduction.find_clusters(lower_bound=0.90, upper_bound=0.95)),
984
+ ((0.85, 0.90), dim_reduction.find_clusters(lower_bound=0.85, upper_bound=0.90)),
985
+ ((0.80, 0.85), dim_reduction.find_clusters(lower_bound=0.80, upper_bound=0.85)),
986
+ ((0.75, 0.80), dim_reduction.find_clusters(lower_bound=0.75, upper_bound=0.80)),
987
+ ((0.70, 0.75), dim_reduction.find_clusters(lower_bound=0.70, upper_bound=0.75)),
988
+ ]
989
+ except Exception as e:
990
+ st.error(f"Error finding clusters: {str(e)}")
991
+ return
992
+
993
+ cluster_bands = st.session_state['feature_clusters']
994
+
995
+ # Display clusters with band-bound labels so captions match the shown matrices
996
+ for (lower, upper), cluster_list in cluster_bands:
997
+ band_label = f"[{lower}, {upper}]"
998
+ if cluster_list:
999
+ st.markdown(f"**Cohorts with pairwise correlation in {band_label}**")
1000
+ for idx, cluster in enumerate(cluster_list):
1001
+ with st.expander(f"Cohort {idx + 1}: {len(cluster)} features (all pairs in {band_label})"):
1002
+ for feature in cluster:
1003
+ st.write(f" • {feature}")
1004
+ if len(cluster) > 1:
1005
+ st.markdown("**Pairwise correlations (values lie in " + band_label + "):**")
1006
+ cluster_corr = corr_matrix.loc[cluster, cluster]
1007
+ st.dataframe(cluster_corr, use_container_width=True)
1008
+ # Sanity check: ensure displayed matrix matches the band
1009
+ vals = cluster_corr.values
1010
+ off_diag = vals[~np.eye(len(cluster), dtype=bool)]
1011
+ if off_diag.size > 0:
1012
+ in_range = np.sum((off_diag >= lower) & (off_diag <= upper)) == off_diag.size
1013
+ if in_range:
1014
+ st.caption(f"All off-diagonal values in {band_label}.")
1015
+ else:
1016
+ st.caption(f"Note: some values fall outside {band_label} (may include NaNs or rounding).")
1017
+ else:
1018
+ st.info(f"No cohorts found with pairwise correlation in {band_label}.")
1019
+
1020
+ # Summary statistics
1021
+ st.subheader("4. Summary")
1022
+ total_clusters = sum(len(cluster_list) for (_, cluster_list) in cluster_bands)
1023
+ total_features_in_clusters = sum(
1024
+ len(cluster) for (_, cluster_list) in cluster_bands for cluster in cluster_list
1025
+ )
1026
+
1027
+ col1, col2 = st.columns(2)
1028
+ with col1:
1029
+ st.metric("Total Cohorts Found", total_clusters)
1030
+ with col2:
1031
+ st.metric("Features in Cohorts", total_features_in_clusters)
1032
+
1033
+ st.divider()
1034
+ if st.button("Start over", key="correlations_start_over_bottom", use_container_width=True):
1035
+ _reset_correlations()
1036
+ st.rerun()
1037
+
1038
+
1039
+ def render_placeholder_tab():
1040
+ """Render placeholder tab."""
1041
+ st.info("🚧 This feature is under development.")
1042
+
1043
+
1044
+ # ============================================================================
1045
+ # Main Application
1046
+ # ============================================================================
1047
+
1048
+ def main():
1049
+ """Main application entry point."""
1050
+ # Initialize session state
1051
+ initialize_session_state()
1052
+
1053
+ # Set page config first: Streamlit requires set_page_config to be
1054
+ # the first Streamlit call, before any st.error()/st.stop() below.
1055
+ dashboard_config = (st.session_state.get('app_config') or {}).get('dashboard', {})
1056
+ st.set_page_config(
1057
+ page_title=dashboard_config.get('page_title', 'OXON Technologies'),
1058
+ page_icon=dashboard_config.get('page_icon', ':mag:'),
1059
+ layout=dashboard_config.get('layout', 'wide')
1060
+ )
1061
+
1062
+ # Load configuration
1063
+ if st.session_state['app_config'] is None:
1064
+ st.error(
1065
+ f"❌ Configuration Error: {st.session_state.get('config_error', 'Unknown error')}\n\n"
1066
+ "Please ensure `config.yaml` exists and is properly formatted."
1067
+ )
1068
+ st.stop()
1069
+
1070
+ # Initialize AWS services
1071
+ if not initialize_aws_if_needed():
1072
+ if st.session_state['aws_error']:
1073
+ st.error(
1074
+ f"❌ AWS Initialization Error: {st.session_state['aws_error']}\n\n"
1075
+ "Please check your AWS credentials in `config.yaml`."
1076
+ )
1077
+ st.stop()
1078
+
1079
+
1080
+ # Custom sidebar styling
1081
+ sidebar_color = dashboard_config.get('sidebar_background_color', '#74b9ff')
1082
+ st.markdown(
1083
+ f"""
1084
+ <style>
1085
+ section[data-testid="stSidebar"] {{
1086
+ background-color: {sidebar_color};
1087
+ }}
1088
+ </style>
1089
+ """,
1090
+ unsafe_allow_html=True,
1091
+ )
1092
+
1093
+ # Display header
1094
+ header_logo = dashboard_config.get('header_logo_path', 'images/analysis.png')
1095
+ header_title = dashboard_config.get('page_title', 'Analytical Dashboard')
1096
+ display_header(header_logo, header_title)
1097
+
1098
+ # Display sidebar
1099
+ display_sidebar()
1100
+
1101
+ # Main content tabs
1102
+ tabs = st.tabs(['Message Viewer', 'Correlations', 'To be Implemented'])
1103
+
1104
+ with tabs[0]:
1105
+ render_message_viewer_tab()
1106
+
1107
+ with tabs[1]:
1108
+ render_correlations_tab()
1109
+
1110
+ with tabs[2]:
1111
+ render_placeholder_tab()
1112
 
 
 
1113
 
1114
+ if __name__ == "__main__":
1115
+ main()
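The Correlations tab above loads all device messages and aggregates them daily before computing correlations. A minimal standalone sketch of that aggregation step (the column names here are illustrative assumptions, not the app's actual schema):

```python
import numpy as np
import pandas as pd

# Hypothetical raw per-message data: hourly samples of two CAN signals.
rng = np.random.default_rng(42)
raw = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=96, freq='h'),
    'EngineSpeed': rng.normal(1500, 100, 96),
    'CoolantTemp': rng.normal(80, 5, 96),
})

# Aggregate to one row per day (mean of each signal), mirroring the
# "daily aggregation" step the tab performs before correlation analysis.
daily = raw.set_index('timestamp').resample('1D').mean().reset_index()
# daily now has one row per calendar day (4 rows for 96 hourly samples)
```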
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
src/test_connection.py ADDED
@@ -0,0 +1,78 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Quick test script to verify Athena connection and basic functionality.
3
+ """
4
+
5
+ from datalake.config import DataLakeConfig
6
+ from datalake.athena import AthenaQuery
7
+ from datalake.catalog import DataLakeCatalog
8
+
9
+ def main():
10
+ """Test basic connection and functionality."""
11
+ print("Testing Athena Connection...")
12
+ print("=" * 60)
13
+
14
+ # Load config with explicit credentials
15
+ config = DataLakeConfig.from_credentials(
16
+ database_name="dbparquetdatalake05",
17
+ workgroup="athenaworkgroup-datalake05",
18
+ s3_output_location="s3://canedge-raw-data-parquet/athena-results/",
19
+ region="eu-north-1",
20
+ access_key_id="YOUR_ACCESS_KEY_ID",  # placeholder: never commit real keys
21
+ secret_access_key="YOUR_SECRET_ACCESS_KEY",
22
+ )
23
+
24
+ print("✓ Configuration loaded")
25
+ print(f" Database: {config.database_name}")
26
+ print(f" Workgroup: {config.workgroup}")
27
+ print(f" Region: {config.region}")
28
+ print(f" S3 Output: {config.s3_output_location}")
29
+ print()
30
+
31
+ # Initialize Athena
32
+ try:
33
+ athena = AthenaQuery(config)
34
+ print("✓ Athena client initialized")
35
+ except Exception as e:
36
+ print(f"✗ Failed to initialize Athena client: {e}")
37
+ return
38
+
39
+ # Test simple query
40
+ try:
41
+ print("Testing simple query...")
42
+ test_query = f"SHOW TABLES IN {config.database_name}"
43
+ df = athena.query_to_dataframe(test_query, timeout=60)
44
+ print("✓ Query executed successfully")
45
+ print(f" Found {len(df)} tables")
46
+ if not df.empty:
47
+ print(f" Sample tables: {list(df.iloc[:, 0])[:5]}")
48
+ except Exception as e:
49
+ print(f"✗ Query failed: {e}")
50
+ import traceback
51
+ traceback.print_exc()
52
+ return
53
+
54
+ # Test catalog
55
+ try:
56
+ print("\nTesting catalog...")
57
+ catalog = DataLakeCatalog(athena, config)
58
+ tables = catalog.list_tables()
59
+ print("✓ Catalog initialized")
60
+ print(f" Total tables: {len(tables)}")
61
+
62
+ if tables:
63
+ devices = catalog.list_devices()
64
+ print(f" Devices found: {len(devices)}")
65
+ if devices:
66
+ print(f" Sample devices: {devices[:3]}")
67
+ except Exception as e:
68
+ print(f"✗ Catalog test failed: {e}")
69
+ import traceback
70
+ traceback.print_exc()
71
+ return
72
+
73
+ print("\n" + "=" * 60)
74
+ print("✓ All tests passed! Connection is working.")
75
+ print("=" * 60)
76
+
77
+ if __name__ == "__main__":
78
+ main()
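Note that the test script above embeds AWS credentials directly in source. A safer pattern (a sketch, assuming the standard AWS environment-variable names) reads them from the environment instead:

```python
import os

def credentials_from_env():
    """Read AWS credentials from environment variables rather than source.

    Uses the conventional AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY names;
    raises if either is missing so misconfiguration fails fast.
    """
    access_key = os.environ.get('AWS_ACCESS_KEY_ID')
    secret_key = os.environ.get('AWS_SECRET_ACCESS_KEY')
    if not access_key or not secret_key:
        raise RuntimeError("Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY")
    return {'access_key_id': access_key, 'secret_access_key': secret_key}
```

The returned dict can then be unpacked into `DataLakeConfig.from_credentials(**creds, ...)` alongside the non-secret settings.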
src/utils/__init__.py ADDED
@@ -0,0 +1,9 @@
 
 
 
 
 
 
 
 
 
 
1
+ from .correlation import CorrelationMatrixGenerator
2
+ from .dimension_reduction import DimensionReduction
3
+ from .feature_class import DetectFeatureClasses
4
+
5
+ __all__ = [
6
+ 'CorrelationMatrixGenerator',
7
+ 'DimensionReduction',
8
+ 'DetectFeatureClasses'
9
+ ]
src/utils/__pycache__/__init__.cpython-310.pyc ADDED
Binary file (355 Bytes). View file
 
src/utils/__pycache__/correlation.cpython-310.pyc ADDED
Binary file (7.66 kB). View file
 
src/utils/__pycache__/dimension_reduction.cpython-310.pyc ADDED
Binary file (7.96 kB). View file
 
src/utils/__pycache__/feature_class.cpython-310.pyc ADDED
Binary file (4.57 kB). View file
 
src/utils/correlation.py ADDED
@@ -0,0 +1,248 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Correlation matrix generation module for mixed data types.
3
+
4
+ This module provides the CorrelationMatrixGenerator class which computes
5
+ correlation/association matrices for DataFrames containing mixed data types
6
+ (Continuous, Binary, Categorical). It automatically selects appropriate
7
+ correlation measures based on feature type pairs.
8
+ """
9
+
10
+ import numpy as np
11
+ import pandas as pd
12
+ from scipy.stats import chi2_contingency, pointbiserialr
13
+ from tqdm import tqdm
14
+
15
+
16
+ class CorrelationMatrixGenerator:
17
+ """
18
+ A class to generate a correlation/association matrix for a pandas DataFrame,
19
+ handling different data types appropriately. It supports Continuous, Binary, and Categorical data types.
20
+ Parameters:
21
+ ----------
22
+ df : pd.DataFrame
23
+ The input DataFrame containing features for correlation analysis.
24
+ feature_classes : dict
25
+ A dictionary mapping column names to their data types ('Continuous', 'Binary', 'Categorical').
26
+ continuous_vs_continuous_method : str, optional
27
+ Method to use for estimating the correlation coefficient of two continuous data types. Default is 'pearson'.
28
+ Methods:
29
+ -------
30
+ generate_matrix() -> pd.DataFrame
31
+ Generates and returns a symmetric correlation/association matrix for the DataFrame.
32
+ """
33
+
34
+ def __init__(self, df, feature_classes, continuous_vs_continuous_method='pearson'):
35
+
36
+ """
37
+ Initialize with a DataFrame and a dictionary mapping column names to data types.
38
+
39
+ Parameters:
40
+ df : pandas.DataFrame
41
+ The DataFrame containing your data.
42
+ feature_classes : dict
43
+ A dictionary where keys are column names in df and values are their data types.
44
+ Valid types are 'Continuous', 'Binary', or 'Categorical'.
45
+ continuous_vs_continuous_method : str
46
+ Method to use for estimating the correlation coefficient of two continuous data
47
+ """
48
+
49
+ self.df = df
50
+ self.feature_classes = feature_classes
51
+ self.continuous_vs_continuous_method = continuous_vs_continuous_method
52
+
53
+
54
+ @staticmethod
55
+ def recode_binary(series):
56
+ """
57
+ Ensure a binary series is coded as 0 and 1.
58
+
59
+ If the series is already numeric with values {0,1}, it is returned as is.
60
+ Otherwise, it maps the two unique values to 0 and 1.
61
+
62
+ Parameters
63
+ ----------
64
+ series : pd.Series
65
+ A binary series to recode.
66
+
67
+ Returns
68
+ -------
69
+ pd.Series
70
+ Binary series with values {0, 1}.
71
+
72
+ Raises
73
+ ------
74
+ ValueError
75
+ If the series does not appear to be binary (has more than 2 unique values).
76
+ """
77
+ # Check if already numeric and in {0, 1}
78
+ if pd.api.types.is_numeric_dtype(series):
79
+ unique_vals = series.dropna().unique()
80
+ if set(unique_vals) <= {0, 1}:
81
+ return series
82
+ # Map two unique values to {0, 1}
83
+ unique_vals = series.dropna().unique()
84
+ if len(unique_vals) == 2:
85
+ mapping = {unique_vals[0]: 0, unique_vals[1]: 1}
86
+ return series.map(mapping)
87
+ else:
88
+ raise ValueError("Series does not appear to be binary")
89
+
90
+ @staticmethod
91
+ def cramers_v(x, y):
92
+ """
93
+ Calculate Cramér's V statistic for a categorical-categorical association.
94
+
95
+ Cramér's V is a measure of association between two nominal variables,
96
+ ranging from 0 (no association) to 1 (perfect association).
97
+
98
+ Parameters
99
+ ----------
100
+ x, y : array-like
101
+ Two categorical variables.
102
+
103
+ Returns
104
+ -------
105
+ float
106
+ Cramér's V statistic, or np.nan if computation is not possible.
107
+ """
108
+ contingency_table = pd.crosstab(x, y)
109
+ chi2 = chi2_contingency(contingency_table)[0]
110
+ n = contingency_table.values.sum()
111
+ min_dim = min(contingency_table.shape) - 1
112
+ if n == 0 or min_dim == 0:
113
+ return np.nan
114
+ return np.sqrt(chi2 / (n * min_dim))
115
+
116
+ @staticmethod
117
+ def anova_eta(categories, measurements):
118
+ """
119
+ Compute the eta (η) as an effect size measure derived from one-way ANOVA.
120
+ It indicates the proportion of variance in the continuous variable (measurements)
121
+ explained by the categorical grouping (categories). Higher values indicate a stronger effect.
122
+
123
+ Parameters:
124
+ categories : array-like (categorical grouping)
125
+ measurements : array-like (continuous values)
126
+
127
+ Returns:
128
+ eta : float, between 0 and 1 representing the effect size.
129
+ """
130
+
131
+ # Factorize the categorical variable
132
+ factors, _ = pd.factorize(categories)
133
+ categories_count = np.max(factors) + 1
134
+ overall_mean = np.mean(measurements)
135
+ ss_between = 0.0 # Sum of Squares
136
+
137
+ for i in range(categories_count):
138
+ group = measurements[factors == i]
139
+ n_i = len(group)
140
+ if n_i == 0:
141
+ continue
142
+ group_mean = np.mean(group)
143
+ ss_between += n_i * ((group_mean - overall_mean) ** 2)
144
+
145
+ ss_total = np.sum((measurements - overall_mean) ** 2)
146
+
147
+ if ss_total == 0:
148
+ return np.nan
149
+
150
+ eta = np.sqrt(ss_between / ss_total)
151
+
152
+ return eta
153
+
154
+ def compute_pairwise_correlation(self, series1, type1, series2, type2):
155
+ """
156
+ Compute the correlation/association between two series based on their data types.
157
+
158
+ Parameters:
159
+ series1, series2 : pandas.Series
160
+ type1, type2 : str, one of 'Continuous', 'Binary', 'Categorical'
161
+
162
+ Returns:
163
+ A correlation/association measure (float) or np.nan if not defined.
164
+ """
165
+
166
+ # ------------- Homogeneous Data types -------------
167
+
168
+ # Continuous vs. Continuous: Pearson correlation
169
+ if {type1, type2} == {'Continuous', 'Continuous'}:
170
+ return series1.corr(series2, method=self.continuous_vs_continuous_method)
171
+
172
+ # Binary vs. Binary: Phi coefficient (using Pearson on recoded binaries)
173
+ elif {type1, type2} == {'Binary', 'Binary'}:
174
+ try:
175
+ s1 = self.recode_binary(series1)
176
+ s2 = self.recode_binary(series2)
177
+ except Exception as e:
178
+ return np.nan
179
+ return s1.corr(s2, method='pearson')
180
+
181
+ # Categorical vs. Categorical: Use Cramér's V
182
+ elif {type1, type2} == {'Categorical', 'Categorical'}:
183
+ return self.cramers_v(series1, series2)
184
+
185
+ # ------------- Heterogeneous Data Types -------------
186
+
187
+ # Binary & Continuous: Point-biserial correlation coefficient
188
+ elif {type1, type2} == {'Continuous', 'Binary'}:
189
+
190
+ binary_series = series1 if type1 == 'Binary' else series2
191
+ continuous_series = series2 if type2 == 'Continuous' else series1
192
+
193
+ try:
194
+ binary_series = self.recode_binary(binary_series)
195
+ except Exception as e:
196
+ return np.nan
197
+
198
+ corr, _ = pointbiserialr(binary_series, continuous_series)
199
+
200
+ return corr
201
+
202
+ # Categorical vs. Continuous: Use ANOVA-based effect size (η)
203
+ elif {type1, type2} == {'Continuous', 'Categorical'}:
204
+ return self.anova_eta(series1, series2) if type1 == 'Categorical' else self.anova_eta(series2, series1)
205
+
206
+ # Binary vs. Categorical: Treat as nominal and use Cramér's V
207
+ elif {type1, type2} == {'Binary', 'Categorical'}:
208
+ return self.cramers_v(series1, series2)
209
+
210
+ else:
211
+ return np.nan
212
+
213
+ def generate_matrix(self):
214
+ """
215
+ Generate a symmetric correlation/association matrix for the specified columns,
216
+ using the appropriate method based on their data types.
217
+
218
+ The matrix is computed by iterating over all feature pairs and selecting
219
+ the appropriate correlation measure based on their types. The matrix
220
+ is symmetric (corr(A, B) = corr(B, A)).
221
+
222
+ Returns
223
+ -------
224
+ pd.DataFrame
225
+ A symmetric correlation/association matrix with feature names as
226
+ both index and columns. Values are rounded to 4 decimal places.
227
+ """
228
+ factors = list(self.feature_classes.keys())
229
+ corr_matrix = pd.DataFrame(index=factors, columns=factors, dtype=float)
230
+
231
+ # Compute pairwise correlations
232
+ for i, var1 in tqdm(list(enumerate(factors))):
233
+ for j, var2 in enumerate(factors):
234
+ if i == j:
235
+ # Diagonal: perfect correlation with itself
236
+ corr_matrix.loc[var1, var2] = 1.0
237
+ elif pd.isna(corr_matrix.loc[var1, var2]):
238
+ # Compute correlation only if not already computed (upper triangle)
239
+ series1 = self.df[var1]
240
+ series2 = self.df[var2]
241
+ type1 = self.feature_classes[var1]
242
+ type2 = self.feature_classes[var2]
243
+ corr_value = self.compute_pairwise_correlation(series1, type1, series2, type2)
244
+ # Fill both upper and lower triangle for symmetry
245
+ corr_matrix.loc[var1, var2] = corr_value
246
+ corr_matrix.loc[var2, var1] = corr_value # ensure symmetry
247
+
248
+ return corr_matrix.round(4)
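As a quick sanity check of the measures chosen above, here is a standalone sketch (toy data, not from the data lake) showing Cramér's V for two perfectly associated categoricals and the point-biserial correlation for a binary/continuous pair:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, pointbiserialr

# Cramér's V, as used for Categorical vs. Categorical pairs above
def cramers_v(x, y):
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.values.sum()
    min_dim = min(table.shape) - 1
    return np.sqrt(chi2 / (n * min_dim)) if n and min_dim else np.nan

cat_a = pd.Series(['x', 'x', 'y', 'y', 'x', 'y'] * 10)
cat_b = pd.Series(['u', 'u', 'v', 'v', 'u', 'v'] * 10)  # tracks cat_a exactly
v = cramers_v(cat_a, cat_b)  # close to 1 (Yates correction keeps it below 1)

# Point-biserial, as used for Binary vs. Continuous pairs above
binary = np.array([0, 0, 0, 1, 1, 1] * 10)
continuous = np.array([1.0, 1.1, 0.9, 5.0, 5.1, 4.9] * 10)
r, _ = pointbiserialr(binary, continuous)  # strongly positive
```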
src/utils/dimension_reduction.py ADDED
@@ -0,0 +1,222 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from itertools import chain
2
+
3
+ import numpy as np
4
+ import pandas as pd
5
+ from sklearn.decomposition import PCA
6
+ from sklearn.preprocessing import MinMaxScaler
7
+
8
+ from utils.correlation import CorrelationMatrixGenerator
9
+
10
+
11
+ class DimensionReduction:
12
+ """
13
+ Correlation-driven clustering of features with a STRICT pairwise constraint:
14
+ every pair of features in a cluster must have correlation within [lower_bound, upper_bound].
15
+
16
+ Clusters are found as (maximal) cliques in the graph where an edge connects two features
17
+ iff their correlation lies in the requested band.
18
+ """
19
+
20
+ def __init__(self, dataframe, feature_classes, method="pearson", projection_dimension=1):
21
+ self.dataframe = dataframe.copy()
22
+
23
+ self.correlation_matrix = CorrelationMatrixGenerator(
24
+ df=self.dataframe,
25
+ feature_classes=feature_classes,
26
+ continuous_vs_continuous_method=method
27
+ ).generate_matrix()
28
+
29
+ if not isinstance(self.correlation_matrix, pd.DataFrame):
30
+ raise TypeError("CorrelationMatrixGenerator.generate_matrix() must return a pandas.DataFrame")
31
+
32
+ if projection_dimension < 1:
33
+ raise ValueError("projection_dimension must be >= 1")
34
+ self.k = int(projection_dimension)
35
+
36
+ # ---------------------------
37
+ # Strict clique-based clustering
38
+ # ---------------------------
39
+ def _cluster_features(self, lower_bound, upper_bound):
40
+ """
41
+ Return DISJOINT clusters where every pair is within [lower_bound, upper_bound].
42
+ Implemented via maximal cliques + greedy disjoint selection.
43
+ """
44
+ if not (0 <= lower_bound <= upper_bound <= 1):
45
+ raise ValueError("Bounds must satisfy 0 <= lower_bound <= upper_bound <= 1")
46
+
47
+ cm = self.correlation_matrix
48
+
49
+ # Use only features present in the correlation matrix (and ideally in dataframe)
50
+ features = [c for c in cm.columns if c in cm.index]
51
+ if not features:
52
+ return []
53
+
54
+ def in_band(x):
55
+ return pd.notna(x) and (lower_bound <= x <= upper_bound)
56
+
57
+ # Build adjacency sets
58
+ adj = {f: set() for f in features}
59
+ for f in features:
60
+ row = cm.loc[f, features]
61
+ for g, val in row.items():
62
+ if g == f:
63
+ continue
64
+ if in_band(val):
65
+ adj[f].add(g)
66
+
67
+ # Bron–Kerbosch with pivot to enumerate maximal cliques
68
+ cliques = []
69
+
70
+ def bron_kerbosch(R, P, X):
71
+ if not P and not X:
72
+ if len(R) >= 2:
73
+ cliques.append(set(R))
74
+ return
75
+
76
+ # Choose a pivot to reduce branching
77
+ # P | X is non-empty here (the all-empty case returned above)
78
+ u = max(P | X, key=lambda v: len(adj[v] & P))
79
+ candidates = P - adj[u]
82
+
83
+ for v in list(candidates):
84
+ bron_kerbosch(R | {v}, P & adj[v], X & adj[v])
85
+ P.remove(v)
86
+ X.add(v)
87
+
88
+ bron_kerbosch(set(), set(features), set())
89
+
90
+ if not cliques:
91
+ return []
92
+
93
+ # Score cliques: prefer larger, then higher average correlation (tie-break deterministic)
94
+ def avg_corr(clique_set):
95
+ cols = sorted(clique_set)
96
+ sub = cm.loc[cols, cols].to_numpy(dtype=float)
97
+ tri = sub[np.triu_indices_from(sub, k=1)]
98
+ tri = tri[~np.isnan(tri)]
99
+ return float(tri.mean()) if tri.size else -np.inf
100
+
101
+ cliques_sorted = sorted(
102
+ cliques,
103
+ key=lambda c: (-len(c), -avg_corr(c), tuple(sorted(c)))
104
+ )
105
+
106
+ # Greedily produce DISJOINT clusters (otherwise PCA/drop will conflict)
107
+ used = set()
108
+ final_clusters = []
109
+ for c in cliques_sorted:
110
+ # Subset of a clique is still a clique -> pairwise constraint remains valid
111
+ remaining = sorted(list(set(c) - used))
112
+ if len(remaining) >= 2:
113
+ final_clusters.append(remaining)
114
+ used.update(remaining)
115
+
116
+ return final_clusters
117
+
118
+ @staticmethod
119
+ def _solve_conflict(clusters_dictionary):
120
+ """
121
+ Safe conflict resolver across correlation bands:
122
+ later keys win, features removed from earlier clusters.
123
+ Removing elements from a clique keeps it a clique, so pairwise constraint is preserved.
124
+ """
125
+ keys = list(clusters_dictionary.keys())
126
+ used = set()
127
+
128
+ for key in reversed(keys): # later bands win
129
+ cleaned = []
130
+ for cluster in clusters_dictionary[key]:
131
+ remaining = [f for f in cluster if f not in used]
132
+ if len(remaining) >= 2:
133
+ cleaned.append(remaining)
134
+ used.update(remaining)
135
+ clusters_dictionary[key] = cleaned
136
+
137
+ return clusters_dictionary
138
+
139
+ def find_clusters(self, lower_bound, upper_bound):
140
+ return self._cluster_features(lower_bound=lower_bound, upper_bound=upper_bound)
141
+
142
+ # ---------------------------
143
+ # PCA projection / replacement
144
+ # ---------------------------
145
+ def _assign_pca_components(self, cluster_index, comps, index):
146
+ """
147
+ Assign PCA components into group.* columns, supporting k==1 and k>1.
148
+ """
149
+ if comps.ndim != 2:
150
+ raise ValueError("PCA output must be 2D")
151
+
152
+ k_eff = comps.shape[1]
153
+ if k_eff == 1:
154
+ self.dataframe[f"group.{cluster_index}"] = pd.Series(comps[:, 0], index=index)
155
+ else:
156
+ for c in range(k_eff):
157
+ self.dataframe[f"group.{cluster_index}.{c}"] = pd.Series(comps[:, c], index=index)
158
+
159
+ def reduce_dimension(self, lower_bound=0.95, upper_bound=1.0, scale=True):
160
+ clusters = self._cluster_features(lower_bound=lower_bound, upper_bound=upper_bound)
161
+
162
+ for cluster_index, cols in enumerate(clusters):
163
+ # Guard: only keep columns still present
164
+ cols = [c for c in cols if c in self.dataframe.columns]
165
+ if len(cols) < 2:
166
+ continue
167
+
168
+ subset = self.dataframe[cols]
169
+
170
+ # PCA needs numeric matrix; if you truly have non-numerics here, you must encode upstream.
171
+ if not all(pd.api.types.is_numeric_dtype(subset[c]) for c in subset.columns):
172
+ raise TypeError(
173
+ f"Non-numeric columns found in cluster {cluster_index}: {cols}. "
174
+ "Encode them before PCA or restrict clustering to numeric features."
175
+ )
176
+
177
+ X = subset.to_numpy()
178
+ if scale:
179
+ X = MinMaxScaler().fit_transform(X)
180
+
181
+ pca = PCA(n_components=min(self.k, X.shape[1]))
182
+ comps = pca.fit_transform(X)
183
+
184
+ self._assign_pca_components(cluster_index, comps, index=subset.index)
185
+ self.dataframe.drop(columns=cols, inplace=True)
186
+
187
+ return self.dataframe
188
+
189
+ def reduce_dimension_by_grouping(self, threshold=0.8, group_count=4, scale=True):
190
+ clusters = {}
191
+ steps = np.round(np.linspace(threshold, 1.0, group_count + 1), 4)
192
+
193
+ for i in range(len(steps) - 1):
194
+ lb, ub = float(steps[i]), float(steps[i + 1])
195
+ clusters[(lb, ub)] = self._cluster_features(lower_bound=lb, upper_bound=ub)
196
+
197
+ clusters = self._solve_conflict(clusters_dictionary=clusters)
198
+ final_clusters = list(chain(*clusters.values()))
199
+
200
+ for cluster_index, cols in enumerate(final_clusters):
201
+ cols = [c for c in cols if c in self.dataframe.columns]
202
+ if len(cols) < 2:
203
+ continue
204
+
205
+ subset = self.dataframe[cols]
206
+ if not all(pd.api.types.is_numeric_dtype(subset[c]) for c in subset.columns):
207
+ raise TypeError(
208
+ f"Non-numeric columns found in cluster {cluster_index}: {cols}. "
209
+ "Encode them before PCA or restrict clustering to numeric features."
210
+ )
211
+
212
+ X = subset.to_numpy()
213
+ if scale:
214
+ X = MinMaxScaler().fit_transform(X)
215
+
216
+ pca = PCA(n_components=min(self.k, X.shape[1]))
217
+ comps = pca.fit_transform(X)
218
+
219
+ self._assign_pca_components(cluster_index, comps, index=subset.index)
220
+ self.dataframe.drop(columns=cols, inplace=True)
221
+
222
+ return self.dataframe, final_clusters
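The strict band clustering above hinges on maximal-clique enumeration. A minimal standalone illustration of the pivoted Bron–Kerbosch recursion on a toy adjacency with two obvious cliques:

```python
# Toy graph: {a, b, c} form a triangle, {d, e} a separate edge.
adj = {
    'a': {'b', 'c'},
    'b': {'a', 'c'},
    'c': {'a', 'b'},
    'd': {'e'},
    'e': {'d'},
}

cliques = []

def bron_kerbosch(R, P, X):
    # Record maximal cliques of size >= 2, mirroring the method above
    if not P and not X:
        if len(R) >= 2:
            cliques.append(frozenset(R))
        return
    # Pivot on the vertex covering the most of P to prune branching
    u = max(P | X, key=lambda v: len(adj[v] & P))
    for v in list(P - adj[u]):
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v])
        P.remove(v)
        X.add(v)

bron_kerbosch(set(), set(adj), set())
# cliques holds exactly the triangle {a, b, c} and the edge {d, e}
```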
src/utils/feature_class.py ADDED
@@ -0,0 +1,119 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Feature classification module for detecting data types in DataFrames.
3
+
4
+ This module provides the DetectFeatureClasses class which automatically
5
+ classifies features as Binary, Categorical, or Continuous based on their
6
+ statistical properties.
7
+ """
8
+
9
+ import numpy as np
10
+
11
+
12
+ class DetectFeatureClasses:
13
+ """
14
+ A class to detect feature classes in a pandas DataFrame.
15
+ Parameters:
16
+ ----------
17
+ dataframe : pd.DataFrame
18
+ The input DataFrame containing features to be classified.
19
+ categorical_threshold : float, optional
20
+ The relative threshold to determine if a feature is categorical based on the ratio of unique values to total rows. Default is 0.5.
21
+ string_data_policy : str, optional
22
+ Policy for handling string data that cannot be converted to float. Options are 'drop' to drop such features or 'ignore' to leave them as is. Default is 'drop'.
23
+ Methods:
24
+ -------
25
+ feature_classes() -> tuple
26
+ Classifies features into 'Binary', 'Categorical', or 'Continuous'; returns a dict mapping feature names to their classes, plus a list of features dropped under the string-data policy.
27
+ """
28
+
29
+ def __init__(self, dataframe, categorical_threshold=0.5, string_data_policy='drop'):
30
+
31
+ """
32
+ Initializes the DetectFeatureClasses with the provided DataFrame and parameters.
33
+ Parameters:
34
+ ----------
35
+ dataframe : pd.DataFrame
36
+ The input DataFrame containing features to be classified.
37
+ categorical_threshold : float, optional
38
+ The relative threshold to determine if a feature is categorical based on the ratio of unique values to total rows. Default is 0.5.
39
+ string_data_policy : str, optional
40
+ Policy for handling string data that cannot be converted to float. Options are 'drop' to drop such features or 'ignore' to leave them as is. Default is 'drop'.
41
+ """
42
+
43
+ self.dataframe = dataframe
44
+ self.categorical_threshold = categorical_threshold
45
+ self.string_data_policy = string_data_policy
46
+
47
+ def _binaries(self):
48
+ """
49
+ Identifies binary features in the DataFrame.
50
+
51
+ A feature is considered binary if it has at most 2 unique values.
52
+
53
+ Returns
54
+ -------
55
+ list
56
+ A list of column names that are classified as binary features.
57
+ """
58
+ binary_columns = [column for column in self.dataframe.columns if len(self.dataframe[column].unique()) <= 2]
59
+ return binary_columns
60
+
61
+ def _categorical(self):
62
+ """
63
+ Identifies categorical features in the DataFrame based on the categorical threshold.
64
+
65
+ A feature is considered categorical if the number of unique values
66
+ is significantly less than the total number of rows (using the
67
+ categorical_threshold as a relative tolerance).
68
+
69
+ Returns
70
+ -------
71
+ list
72
+ A list of column names that are classified as categorical features.
73
+ """
74
+ categorical_columns = []
75
+ for column in self.dataframe.columns:
76
+ # Check if unique count is not close to total rows (within threshold)
77
+ if not np.isclose(
78
+ len(self.dataframe[column].unique()), len(self.dataframe),
79
+ rtol=self.categorical_threshold
80
+ ):
81
+ categorical_columns.append(column)
82
+ return categorical_columns
83
+
84
+ def feature_classes(self):
85
+ """
86
+ Classifies features in the DataFrame into 'Binary', 'Categorical', or 'Continuous'.
87
+ Returns
88
+ -------
89
+ dict
90
+ A dictionary with feature names as keys and their classes ('Binary', 'Categorical', 'Continuous') as values.
91
+ list
92
+ A list of features that were dropped due to string data policy.
93
+ """
94
+ binary_columns = self._binaries()
95
+ categorical_columns = self._categorical()
96
+ features_class_types = {}
97
+ excess_columns = []
98
+
99
+ # Classify each feature
100
+ for feature in self.dataframe.columns:
101
+ if feature in binary_columns:
102
+ features_class_types[feature] = 'Binary'
103
+ elif feature in categorical_columns:
104
+ features_class_types[feature] = 'Categorical'
105
+ else:
106
+ # Try to convert to float to determine if continuous
107
+ try:
108
+ self.dataframe[feature] = self.dataframe[feature].astype(float)
109
+ features_class_types[feature] = 'Continuous'
110
+ except (ValueError, TypeError):
111
+ # Cannot convert to float - handle based on policy
112
+ if self.string_data_policy == 'drop':
113
+ excess_columns.append(feature)
114
+ else:
115
+ # 'ignore' policy: leave as-is (not recommended)
116
+ pass
117
+
118
+ return features_class_types, excess_columns
119
+
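The classification rule above can be sketched as a standalone snippet. This is a minimal, hypothetical reimplementation for illustration only (the `classify` helper and the sample DataFrame are not part of the module): at most 2 unique values means Binary, a unique count far below the row count (relative tolerance) means Categorical, otherwise Continuous.

```python
import numpy as np
import pandas as pd


def classify(df, categorical_threshold=0.5):
    """Mimic the DetectFeatureClasses logic on a per-column basis."""
    classes = {}
    for col in df.columns:
        n_unique = df[col].nunique()
        if n_unique <= 2:
            # Two or fewer unique values -> Binary
            classes[col] = 'Binary'
        elif not np.isclose(n_unique, len(df), rtol=categorical_threshold):
            # Unique count far from total row count -> Categorical
            classes[col] = 'Categorical'
        else:
            # Unique count close to row count -> Continuous
            classes[col] = 'Continuous'
    return classes


# Hypothetical sample data: 2 unique values, 3 unique values, ~all unique
df = pd.DataFrame({
    'flag':  [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    'gear':  [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
    'speed': [12.1, 33.4, 47.9, 51.0, 8.2,
              60.3, 25.5, 41.7, 19.8, 55.1],
})
print(classify(df))
```

Note that the categorical check relies on `not np.isclose(...)` rather than `... is False`, since `np.isclose` returns a NumPy `np.bool_`, which is never identical to Python's `False` singleton.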
src/workshop.ipynb ADDED
@@ -0,0 +1,1448 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 2,
6
+ "id": "3ebfe4e7",
7
+ "metadata": {},
8
+ "outputs": [
9
+ {
10
+ "name": "stdout",
11
+ "output_type": "stream",
12
+ "text": [
13
+ "✓ Libraries imported successfully\n"
14
+ ]
15
+ }
16
+ ],
17
+ "source": [
18
+ "# Import required libraries\n",
19
+ "import pandas as pd\n",
20
+ "import matplotlib.pyplot as plt\n",
21
+ "import seaborn as sns\n",
22
+ "from datalake.config import DataLakeConfig\n",
23
+ "from datalake.athena import AthenaQuery\n",
24
+ "from datalake.catalog import DataLakeCatalog\n",
25
+ "from datalake.query import DataLakeQuery\n",
26
+ "from datalake.batch import BatchProcessor\n",
27
+ "\n",
28
+ "# Set up plotting\n",
29
+ "%matplotlib inline\n",
30
+ "plt.style.use('seaborn-v0_8')\n",
31
+ "sns.set_palette(\"husl\")\n",
32
+ "\n",
33
+ "print(\"✓ Libraries imported successfully\")"
34
+ ]
35
+ },
36
+ {
37
+ "cell_type": "code",
38
+ "execution_count": 3,
39
+ "id": "f03eaae2",
40
+ "metadata": {},
41
+ "outputs": [
42
+ {
43
+ "name": "stdout",
44
+ "output_type": "stream",
45
+ "text": [
46
+ "✓ Configuration loaded\n",
47
+ " Database: dbparquetdatalake05\n",
48
+ " Workgroup: athenaworkgroup-datalake05\n",
49
+ " Region: eu-north-1\n"
50
+ ]
51
+ }
52
+ ],
53
+ "source": [
54
+ "# Configure connection with your credentials\n",
55
+ "config = DataLakeConfig.from_credentials(\n",
56
+ " database_name=\"dbparquetdatalake05\",\n",
57
+ " workgroup=\"athenaworkgroup-datalake05\",\n",
58
+ " s3_output_location=\"s3://canedge-raw-data-parquet/athena-results/\",\n",
59
+ " region=\"eu-north-1\",\n",
60
+ " access_key_id=\"<YOUR_ACCESS_KEY_ID>\", # credentials redacted - never commit real keys\n",
61
+ " secret_access_key=\"<YOUR_SECRET_ACCESS_KEY>\",\n",
62
+ ")\n",
63
+ "\n",
64
+ "print(f\"✓ Configuration loaded\")\n",
65
+ "print(f\" Database: {config.database_name}\")\n",
66
+ "print(f\" Workgroup: {config.workgroup}\")\n",
67
+ "print(f\" Region: {config.region}\")"
68
+ ]
69
+ },
70
+ {
71
+ "cell_type": "code",
72
+ "execution_count": 4,
73
+ "id": "9e8ceaf6",
74
+ "metadata": {},
75
+ "outputs": [
76
+ {
77
+ "name": "stderr",
78
+ "output_type": "stream",
79
+ "text": [
80
+ "2026-01-26 23:23:13,728 - datalake.athena - INFO - Initialized Athena client for database: dbparquetdatalake05\n",
81
+ "2026-01-26 23:23:13,729 - datalake.catalog - INFO - Initialized catalog for database: dbparquetdatalake05\n",
82
+ "2026-01-26 23:23:13,729 - datalake.query - INFO - Initialized DataLakeQuery\n",
83
+ "2026-01-26 23:23:13,730 - datalake.batch - INFO - Initialized BatchProcessor\n"
84
+ ]
85
+ },
86
+ {
87
+ "name": "stdout",
88
+ "output_type": "stream",
89
+ "text": [
90
+ "✓ Athena client and catalog initialized\n"
91
+ ]
92
+ }
93
+ ],
94
+ "source": [
95
+ "# Initialize Athena and catalog\n",
96
+ "athena = AthenaQuery(config)\n",
97
+ "catalog = DataLakeCatalog(athena, config)\n",
98
+ "query = DataLakeQuery(athena, catalog)\n",
99
+ "processor = BatchProcessor(query)\n",
100
+ "\n",
101
+ "print(\"✓ Athena client and catalog initialized\")"
102
+ ]
103
+ },
104
+ {
105
+ "cell_type": "code",
106
+ "execution_count": 5,
107
+ "id": "0e3d813f",
108
+ "metadata": {},
109
+ "outputs": [
110
+ {
111
+ "name": "stderr",
112
+ "output_type": "stream",
113
+ "text": [
114
+ "2026-01-26 23:23:14,057 - datalake.athena - INFO - Query started with execution ID: beffdb49-e31a-48bf-8dbf-8c06ae7960cc\n",
115
+ "2026-01-26 23:23:15,190 - datalake.athena - INFO - Query beffdb49-e31a-48bf-8dbf-8c06ae7960cc completed successfully\n",
116
+ "2026-01-26 23:23:15,490 - datalake.athena - INFO - Retrieved 77 rows from S3 for query beffdb49-e31a-48bf-8dbf-8c06ae7960cc\n"
117
+ ]
118
+ }
119
+ ],
120
+ "source": [
121
+ "test_query = f\"SHOW TABLES IN {config.database_name}\"\n",
122
+ "df_tables = athena.query_to_dataframe(test_query, timeout=60)"
123
+ ]
124
+ },
125
+ {
126
+ "cell_type": "code",
127
+ "execution_count": 6,
128
+ "id": "fca55b3b",
129
+ "metadata": {},
130
+ "outputs": [
131
+ {
132
+ "name": "stderr",
133
+ "output_type": "stream",
134
+ "text": [
135
+ "2026-01-26 23:23:15,601 - datalake.athena - INFO - Query started with execution ID: bd81d8c7-2371-431b-a6ed-0208bc4b4f1c\n",
136
+ "2026-01-26 23:23:16,798 - datalake.athena - INFO - Query bd81d8c7-2371-431b-a6ed-0208bc4b4f1c completed successfully\n",
137
+ "2026-01-26 23:23:16,920 - datalake.athena - INFO - Retrieved 78 rows from S3 for query bd81d8c7-2371-431b-a6ed-0208bc4b4f1c\n",
138
+ "2026-01-26 23:23:16,921 - datalake.catalog - INFO - Found 78 tables in database\n",
139
+ "2026-01-26 23:23:16,922 - datalake.catalog - INFO - Found 3 device(s)\n"
140
+ ]
141
+ },
142
+ {
143
+ "name": "stdout",
144
+ "output_type": "stream",
145
+ "text": [
146
+ "Found 3 device(s):\n",
147
+ " - 97a4aaf4\n",
148
+ " - b8280fd1\n",
149
+ " - f1da612a\n"
150
+ ]
151
+ }
152
+ ],
153
+ "source": [
154
+ "# Discover devices\n",
155
+ "devices = catalog.list_devices()\n",
156
+ "print(f\"Found {len(devices)} device(s):\")\n",
157
+ "for device in devices:\n",
158
+ " print(f\" - {device}\")"
159
+ ]
160
+ },
161
+ {
162
+ "cell_type": "code",
163
+ "execution_count": 13,
164
+ "id": "103ddb07",
165
+ "metadata": {},
166
+ "outputs": [
167
+ {
168
+ "data": {
169
+ "text/plain": [
170
+ "{1: 'out of memory',\n",
171
+ " 2: 'syntax error',\n",
172
+ " 3: 'no element found',\n",
173
+ " 4: 'not well-formed (invalid token)',\n",
174
+ " 5: 'unclosed token',\n",
175
+ " 6: 'partial character',\n",
176
+ " 7: 'mismatched tag',\n",
177
+ " 8: 'duplicate attribute',\n",
178
+ " 9: 'junk after document element',\n",
179
+ " 10: 'illegal parameter entity reference',\n",
180
+ " 11: 'undefined entity',\n",
181
+ " 12: 'recursive entity reference',\n",
182
+ " 13: 'asynchronous entity',\n",
183
+ " 14: 'reference to invalid character number',\n",
184
+ " 15: 'reference to binary entity',\n",
185
+ " 16: 'reference to external entity in attribute',\n",
186
+ " 17: 'XML or text declaration not at start of entity',\n",
187
+ " 18: 'unknown encoding',\n",
188
+ " 19: 'encoding specified in XML declaration is incorrect',\n",
189
+ " 20: 'unclosed CDATA section',\n",
190
+ " 21: 'error in processing external entity reference',\n",
191
+ " 22: 'document is not standalone',\n",
192
+ " 23: 'unexpected parser state - please send a bug report',\n",
193
+ " 24: 'entity declared in parameter entity',\n",
194
+ " 25: 'requested feature requires XML_DTD support in Expat',\n",
195
+ " 26: 'cannot change setting once parsing has begun',\n",
196
+ " 27: 'unbound prefix',\n",
197
+ " 28: 'must not undeclare prefix',\n",
198
+ " 29: 'incomplete markup in parameter entity',\n",
199
+ " 30: 'XML declaration not well-formed',\n",
200
+ " 31: 'text declaration not well-formed',\n",
201
+ " 32: 'illegal character(s) in public id',\n",
202
+ " 33: 'parser suspended',\n",
203
+ " 34: 'parser not suspended',\n",
204
+ " 35: 'parsing aborted',\n",
205
+ " 36: 'parsing finished',\n",
206
+ " 37: 'cannot suspend in external parameter entity'}"
207
+ ]
208
+ },
209
+ "execution_count": 13,
210
+ "metadata": {},
211
+ "output_type": "execute_result"
212
+ }
213
+ ],
214
+ "source": [
215
+ "messages"
216
+ ]
217
+ },
218
+ {
219
+ "cell_type": "code",
220
+ "execution_count": 19,
221
+ "id": "fbc4938b",
222
+ "metadata": {},
223
+ "outputs": [
224
+ {
225
+ "name": "stdout",
226
+ "output_type": "stream",
227
+ "text": [
228
+ "['0100', '0103', '0104', '0105', '0106', '0107', '010c', '010d', '010e', '010f', '0110', '0111', '011f', '012e', '012f', '0133', '0134', '0135', '0143', '0144', '0149', '0155', '0156', '015c']\n"
229
+ ]
230
+ }
231
+ ],
232
+ "source": [
233
+ "import re\n",
234
+ "\n",
235
+ "pattern = re.compile(r\"s(?P<s>\\d{2})pid.*m(?P<m>[0-9a-fA-F]{2})$\")\n",
236
+ "\n",
237
+ "strings = [\n",
238
+ " \"can1_obd2_s_m41_s01pid_m00\",\n",
239
+ " \"can1_obd2_s_m41_s01pid_m03\",\n",
240
+ " \"can1_obd2_s_m41_s01pid_m04\",\n",
241
+ " \"can1_obd2_s_m41_s01pid_m05\",\n",
242
+ " \"can1_obd2_s_m41_s01pid_m06\",\n",
243
+ " \"can1_obd2_s_m41_s01pid_m07\",\n",
244
+ " \"can1_obd2_s_m41_s01pid_m0c\",\n",
245
+ "]\n",
246
+ "\n",
247
+ "out = []\n",
248
+ "for x in messages:\n",
249
+ " if not x.startswith('can1'):\n",
250
+ " continue\n",
251
+ " m = pattern.search(x)\n",
252
+ " out.append((m.group(\"s\") + m.group(\"m\")))\n",
253
+ "\n",
254
+ "print(out)\n",
255
+ "# ['0100', '0103', '0104', '0105', '0106', '0107', '010c', ...]\n"
256
+ ]
257
+ },
258
+ {
259
+ "cell_type": "code",
260
+ "execution_count": 7,
261
+ "id": "41a79e1e",
262
+ "metadata": {},
263
+ "outputs": [
264
+ {
265
+ "data": {
266
+ "text/plain": [
267
+ "['97a4aaf4', 'b8280fd1', 'f1da612a']"
268
+ ]
269
+ },
270
+ "execution_count": 7,
271
+ "metadata": {},
272
+ "output_type": "execute_result"
273
+ }
274
+ ],
275
+ "source": [
276
+ "devices"
277
+ ]
278
+ },
279
+ {
280
+ "cell_type": "code",
281
+ "execution_count": 16,
282
+ "id": "ffe04714",
283
+ "metadata": {},
284
+ "outputs": [
285
+ {
286
+ "name": "stdout",
287
+ "output_type": "stream",
288
+ "text": [
289
+ "\n",
290
+ "Exploring device: 97a4aaf4\n",
291
+ "============================================================\n"
292
+ ]
293
+ },
294
+ {
295
+ "name": "stderr",
296
+ "output_type": "stream",
297
+ "text": [
298
+ "2026-01-26 23:34:44,692 - datalake.athena - INFO - Query started with execution ID: 442a7cfd-68ed-46eb-98ee-b964a3e3cb6d\n",
299
+ "2026-01-26 23:34:45,808 - datalake.athena - INFO - Query 442a7cfd-68ed-46eb-98ee-b964a3e3cb6d completed successfully\n",
300
+ "2026-01-26 23:34:46,146 - datalake.athena - INFO - Retrieved 78 rows from S3 for query 442a7cfd-68ed-46eb-98ee-b964a3e3cb6d\n",
301
+ "2026-01-26 23:34:46,146 - datalake.catalog - INFO - Found 78 tables in database\n",
302
+ "2026-01-26 23:34:46,146 - datalake.catalog - INFO - Found 35 messages for device 97a4aaf4\n"
303
+ ]
304
+ },
305
+ {
306
+ "name": "stdout",
307
+ "output_type": "stream",
308
+ "text": [
309
+ "Found 35 message(s):\n",
310
+ " - can1_obd2_s_m41_s01pid_m00\n",
311
+ " - can1_obd2_s_m41_s01pid_m03\n",
312
+ " - can1_obd2_s_m41_s01pid_m04\n",
313
+ " - can1_obd2_s_m41_s01pid_m05\n",
314
+ " - can1_obd2_s_m41_s01pid_m06\n",
315
+ " - can1_obd2_s_m41_s01pid_m07\n",
316
+ " - can1_obd2_s_m41_s01pid_m0c\n",
317
+ " - can1_obd2_s_m41_s01pid_m0d\n",
318
+ " - can1_obd2_s_m41_s01pid_m0e\n",
319
+ " - can1_obd2_s_m41_s01pid_m0f\n",
320
+ " - can1_obd2_s_m41_s01pid_m10\n",
321
+ " - can1_obd2_s_m41_s01pid_m11\n",
322
+ " - can1_obd2_s_m41_s01pid_m1f\n",
323
+ " - can1_obd2_s_m41_s01pid_m2e\n",
324
+ " - can1_obd2_s_m41_s01pid_m2f\n",
325
+ " - can1_obd2_s_m41_s01pid_m33\n",
326
+ " - can1_obd2_s_m41_s01pid_m34\n",
327
+ " - can1_obd2_s_m41_s01pid_m35\n",
328
+ " - can1_obd2_s_m41_s01pid_m43\n",
329
+ " - can1_obd2_s_m41_s01pid_m44\n",
330
+ " - can1_obd2_s_m41_s01pid_m49\n",
331
+ " - can1_obd2_s_m41_s01pid_m55\n",
332
+ " - can1_obd2_s_m41_s01pid_m56\n",
333
+ " - can1_obd2_s_m41_s01pid_m5c\n",
334
+ " - can9_gnssaltitude\n",
335
+ " - can9_gnssdistance\n",
336
+ " - can9_gnsspos\n",
337
+ " - can9_gnssspeed\n",
338
+ " - can9_gnssstatus\n",
339
+ " - can9_gnsstime\n",
340
+ " - can9_heartbeat\n",
341
+ " - can9_imudata\n",
342
+ " - can9_timecalendar\n",
343
+ " - can9_timeexternal\n",
344
+ " - messages\n"
345
+ ]
346
+ }
347
+ ],
348
+ "source": [
349
+ "if devices:\n",
350
+ " device_id = devices[0]\n",
351
+ " print(f\"\\nExploring device: {device_id}\")\n",
352
+ " print(\"=\" * 60)\n",
353
+ " \n",
354
+ " messages = catalog.list_messages(device_id)\n",
355
+ " print(f\"Found {len(messages)} message(s):\")\n",
356
+ " for message in messages:\n",
357
+ " print(f\" - {message}\")"
358
+ ]
359
+ },
360
+ {
361
+ "cell_type": "code",
362
+ "execution_count": 7,
363
+ "id": "a7bae557",
364
+ "metadata": {},
365
+ "outputs": [
366
+ {
367
+ "name": "stderr",
368
+ "output_type": "stream",
369
+ "text": [
370
+ "2026-01-26 15:14:27,675 - datalake.athena - INFO - Query started with execution ID: 62096cc6-be14-49f4-ae61-80efc006dbc2\n"
371
+ ]
372
+ },
373
+ {
374
+ "name": "stdout",
375
+ "output_type": "stream",
376
+ "text": [
377
+ "\n",
378
+ "Schema for 97a4aaf4/can1_obd2_s_m41_s01pid_m00:\n",
379
+ "============================================================\n"
380
+ ]
381
+ },
382
+ {
383
+ "name": "stderr",
384
+ "output_type": "stream",
385
+ "text": [
386
+ "2026-01-26 15:14:28,793 - datalake.athena - INFO - Query 62096cc6-be14-49f4-ae61-80efc006dbc2 completed successfully\n",
387
+ "2026-01-26 15:14:28,947 - datalake.athena - INFO - Retrieved 78 rows from S3 for query 62096cc6-be14-49f4-ae61-80efc006dbc2\n",
388
+ "2026-01-26 15:14:28,948 - datalake.catalog - INFO - Found 78 tables in database\n",
389
+ "2026-01-26 15:14:29,052 - datalake.athena - INFO - Query started with execution ID: 93497778-7bd4-4976-bf95-7009ab18b6df\n",
390
+ "2026-01-26 15:14:30,176 - datalake.athena - INFO - Query 93497778-7bd4-4976-bf95-7009ab18b6df completed successfully\n",
391
+ "2026-01-26 15:14:30,296 - datalake.athena - INFO - Retrieved 3 rows from S3 for query 93497778-7bd4-4976-bf95-7009ab18b6df\n",
392
+ "2026-01-26 15:14:30,297 - datalake.catalog - INFO - Schema for 97a4aaf4/can1_obd2_s_m41_s01pid_m00: 3 columns\n"
393
+ ]
394
+ },
395
+ {
396
+ "name": "stdout",
397
+ "output_type": "stream",
398
+ "text": [
399
+ " Column Type\n",
400
+ " t timestamp(3)\n",
401
+ "s01pid00_pidssupported_01_20 double\n",
402
+ " date_created varchar\n",
403
+ "\n",
404
+ "Total columns: 3\n"
405
+ ]
406
+ }
407
+ ],
408
+ "source": [
409
+ "# Get schema for first device/message combination\n",
410
+ "if devices and messages:\n",
411
+ " device_id = devices[0]\n",
412
+ " message = messages[0]\n",
413
+ " \n",
414
+ " print(f\"\\nSchema for {device_id}/{message}:\")\n",
415
+ " print(\"=\" * 60)\n",
416
+ " \n",
417
+ " schema = catalog.get_schema(device_id, message)\n",
418
+ " if schema:\n",
419
+ " schema_df = pd.DataFrame([\n",
420
+ " {\"Column\": col, \"Type\": dtype}\n",
421
+ " for col, dtype in schema.items()\n",
422
+ " ])\n",
423
+ " print(schema_df.to_string(index=False))\n",
424
+ " print(f\"\\nTotal columns: {len(schema)}\")"
425
+ ]
426
+ },
427
+ {
428
+ "cell_type": "code",
429
+ "execution_count": 8,
430
+ "id": "f3b16b2d",
431
+ "metadata": {},
432
+ "outputs": [
433
+ {
434
+ "name": "stderr",
435
+ "output_type": "stream",
436
+ "text": [
437
+ "2026-01-26 15:14:30,406 - datalake.athena - INFO - Query started with execution ID: 4fb3b506-612a-458e-a0e7-709b60a9f91e\n"
438
+ ]
439
+ },
440
+ {
441
+ "name": "stdout",
442
+ "output_type": "stream",
443
+ "text": [
444
+ "\n",
445
+ "Partitions (dates) for 97a4aaf4/can1_obd2_s_m41_s01pid_m00:\n",
446
+ "============================================================\n"
447
+ ]
448
+ },
449
+ {
450
+ "name": "stderr",
451
+ "output_type": "stream",
452
+ "text": [
453
+ "2026-01-26 15:14:31,530 - datalake.athena - INFO - Query 4fb3b506-612a-458e-a0e7-709b60a9f91e completed successfully\n",
454
+ "2026-01-26 15:14:31,648 - datalake.athena - INFO - Retrieved 78 rows from S3 for query 4fb3b506-612a-458e-a0e7-709b60a9f91e\n",
455
+ "2026-01-26 15:14:31,649 - datalake.catalog - INFO - Found 78 tables in database\n",
456
+ "2026-01-26 15:14:31,755 - datalake.athena - INFO - Query started with execution ID: 4ce87717-db17-4fa5-b33d-ef63fb4a89fe\n",
457
+ "2026-01-26 15:14:36,039 - datalake.athena - INFO - Query 4ce87717-db17-4fa5-b33d-ef63fb4a89fe completed successfully\n",
458
+ "2026-01-26 15:14:36,162 - datalake.athena - INFO - Retrieved 13 rows from S3 for query 4ce87717-db17-4fa5-b33d-ef63fb4a89fe\n",
459
+ "2026-01-26 15:14:36,164 - datalake.catalog - INFO - Found 13 partitions for tbl_97a4aaf4_can1_obd2_s_m41_s01pid_m00\n"
460
+ ]
461
+ },
462
+ {
463
+ "name": "stdout",
464
+ "output_type": "stream",
465
+ "text": [
466
+ "Found 13 partition(s):\n",
467
+ " Date range: 2025-10-21 to 2025-11-11\n",
468
+ "\n",
469
+ " All dates:\n",
470
+ " - 2025-10-21\n",
471
+ " - 2025-10-27\n",
472
+ " - 2025-10-28\n",
473
+ " - 2025-10-29\n",
474
+ " - 2025-10-30\n",
475
+ " - 2025-10-31\n",
476
+ " - 2025-11-03\n",
477
+ " - 2025-11-04\n",
478
+ " - 2025-11-05\n",
479
+ " - 2025-11-06\n",
480
+ " - 2025-11-07\n",
481
+ " - 2025-11-10\n",
482
+ " - 2025-11-11\n"
483
+ ]
484
+ }
485
+ ],
486
+ "source": [
487
+ "# Check available partitions (dates)\n",
488
+ "if devices and messages:\n",
489
+ " device_id = devices[0]\n",
490
+ " message = messages[0]\n",
491
+ " \n",
492
+ " print(f\"\\nPartitions (dates) for {device_id}/{message}:\")\n",
493
+ " print(\"=\" * 60)\n",
494
+ " \n",
495
+ " try:\n",
496
+ " partitions = catalog.list_partitions(device_id, message)\n",
497
+ " if partitions:\n",
498
+ " print(f\"Found {len(partitions)} partition(s):\")\n",
499
+ " print(f\" Date range: {partitions[0]} to {partitions[-1]}\")\n",
500
+ " print(f\"\\n All dates:\")\n",
501
+ " for date in partitions[:20]: # Show first 20\n",
502
+ " print(f\" - {date}\")\n",
503
+ " if len(partitions) > 20:\n",
504
+ " print(f\" ... and {len(partitions) - 20} more\")\n",
505
+ " else:\n",
506
+ " print(\"No partitions found (table may not be partitioned)\")\n",
507
+ " except Exception as e:\n",
508
+ " print(f\"Could not list partitions: {e}\")"
509
+ ]
510
+ },
511
+ {
512
+ "cell_type": "code",
513
+ "execution_count": 9,
514
+ "id": "66579956",
515
+ "metadata": {},
516
+ "outputs": [
517
+ {
518
+ "data": {
519
+ "text/plain": [
520
+ "'97a4aaf4'"
521
+ ]
522
+ },
523
+ "execution_count": 9,
524
+ "metadata": {},
525
+ "output_type": "execute_result"
526
+ }
527
+ ],
528
+ "source": [
529
+ "device_id"
530
+ ]
531
+ },
532
+ {
533
+ "cell_type": "code",
534
+ "execution_count": 10,
535
+ "id": "3411bcdb",
536
+ "metadata": {},
537
+ "outputs": [
538
+ {
539
+ "name": "stdout",
540
+ "output_type": "stream",
541
+ "text": [
542
+ "['can1_obd2_s_m41_s01pid_m00', 'can1_obd2_s_m41_s01pid_m03', 'can1_obd2_s_m41_s01pid_m04', 'can1_obd2_s_m41_s01pid_m05', 'can1_obd2_s_m41_s01pid_m06', 'can1_obd2_s_m41_s01pid_m07', 'can1_obd2_s_m41_s01pid_m0c', 'can1_obd2_s_m41_s01pid_m0d', 'can1_obd2_s_m41_s01pid_m0e', 'can1_obd2_s_m41_s01pid_m0f', 'can1_obd2_s_m41_s01pid_m10', 'can1_obd2_s_m41_s01pid_m11', 'can1_obd2_s_m41_s01pid_m1f', 'can1_obd2_s_m41_s01pid_m2e', 'can1_obd2_s_m41_s01pid_m2f', 'can1_obd2_s_m41_s01pid_m33', 'can1_obd2_s_m41_s01pid_m34', 'can1_obd2_s_m41_s01pid_m35', 'can1_obd2_s_m41_s01pid_m43', 'can1_obd2_s_m41_s01pid_m44', 'can1_obd2_s_m41_s01pid_m49', 'can1_obd2_s_m41_s01pid_m55', 'can1_obd2_s_m41_s01pid_m56', 'can1_obd2_s_m41_s01pid_m5c', 'can9_gnssaltitude', 'can9_gnssdistance', 'can9_gnsspos', 'can9_gnssspeed', 'can9_gnssstatus', 'can9_gnsstime', 'can9_heartbeat', 'can9_imudata', 'can9_timecalendar', 'can9_timeexternal', 'messages']\n"
543
+ ]
544
+ }
545
+ ],
546
+ "source": [
547
+ "print(messages)"
548
+ ]
549
+ },
550
+ {
551
+ "cell_type": "code",
552
+ "execution_count": 20,
553
+ "id": "b98df0e7",
554
+ "metadata": {},
555
+ "outputs": [
556
+ {
557
+ "name": "stdout",
558
+ "output_type": "stream",
559
+ "text": [
560
+ "Reading sample data from 97a4aaf4/can1_obd2_s_m41_s01pid_m49...\n",
561
+ "============================================================\n"
562
+ ]
563
+ },
564
+ {
565
+ "name": "stderr",
566
+ "output_type": "stream",
567
+ "text": [
568
+ "2026-01-26 23:48:28,200 - datalake.athena - INFO - Query started with execution ID: 2501a646-a908-47a8-95df-3870e5696e62\n",
569
+ "2026-01-26 23:48:29,314 - datalake.athena - INFO - Query 2501a646-a908-47a8-95df-3870e5696e62 completed successfully\n",
570
+ "2026-01-26 23:48:29,604 - datalake.athena - INFO - Retrieved 78 rows from S3 for query 2501a646-a908-47a8-95df-3870e5696e62\n",
571
+ "2026-01-26 23:48:29,604 - datalake.catalog - INFO - Found 78 tables in database\n",
572
+ "2026-01-26 23:48:29,604 - datalake.query - INFO - Executing query for 97a4aaf4/can1_obd2_s_m41_s01pid_m49\n",
573
+ "2026-01-26 23:48:29,718 - datalake.athena - INFO - Query started with execution ID: b9f04fc6-6408-4054-8d9a-bf77c0bcf28d\n",
574
+ "2026-01-26 23:48:35,706 - datalake.athena - INFO - Query b9f04fc6-6408-4054-8d9a-bf77c0bcf28d completed successfully\n",
575
+ "2026-01-26 23:48:41,916 - datalake.athena - INFO - Retrieved 652001 rows from S3 for query b9f04fc6-6408-4054-8d9a-bf77c0bcf28d\n"
576
+ ]
577
+ },
578
+ {
579
+ "name": "stdout",
580
+ "output_type": "stream",
581
+ "text": [
582
+ "✓ Loaded 652001 records\n",
583
+ "\n",
584
+ "Data shape: (652001, 3)\n",
585
+ "\n",
586
+ "Columns: ['t', 's01pid49_absthrottleposd', 'date_created']\n",
587
+ "\n",
588
+ "First few rows:\n"
589
+ ]
590
+ },
591
+ {
592
+ "data": {
593
+ "text/html": [
594
+ "<div>\n",
595
+ "<style scoped>\n",
596
+ " .dataframe tbody tr th:only-of-type {\n",
597
+ " vertical-align: middle;\n",
598
+ " }\n",
599
+ "\n",
600
+ " .dataframe tbody tr th {\n",
601
+ " vertical-align: top;\n",
602
+ " }\n",
603
+ "\n",
604
+ " .dataframe thead th {\n",
605
+ " text-align: right;\n",
606
+ " }\n",
607
+ "</style>\n",
608
+ "<table border=\"1\" class=\"dataframe\">\n",
609
+ " <thead>\n",
610
+ " <tr style=\"text-align: right;\">\n",
611
+ " <th></th>\n",
612
+ " <th>t</th>\n",
613
+ " <th>s01pid49_absthrottleposd</th>\n",
614
+ " <th>date_created</th>\n",
615
+ " </tr>\n",
616
+ " </thead>\n",
617
+ " <tbody>\n",
618
+ " <tr>\n",
619
+ " <th>0</th>\n",
620
+ " <td>2026-01-02 03:04:02.441</td>\n",
621
+ " <td>15.686275</td>\n",
622
+ " <td>2026/01/02</td>\n",
623
+ " </tr>\n",
624
+ " <tr>\n",
625
+ " <th>1</th>\n",
626
+ " <td>2025-12-03 03:10:02.217</td>\n",
627
+ " <td>15.686275</td>\n",
628
+ " <td>2025/12/03</td>\n",
629
+ " </tr>\n",
630
+ " <tr>\n",
631
+ " <th>2</th>\n",
632
+ " <td>2025-12-31 03:00:00.162</td>\n",
633
+ " <td>15.686275</td>\n",
634
+ " <td>2025/12/31</td>\n",
635
+ " </tr>\n",
636
+ " <tr>\n",
637
+ " <th>3</th>\n",
638
+ " <td>2025-12-19 04:00:00.157</td>\n",
639
+ " <td>30.980392</td>\n",
640
+ " <td>2025/12/19</td>\n",
641
+ " </tr>\n",
642
+ " <tr>\n",
643
+ " <th>4</th>\n",
644
+ " <td>2025-12-22 04:00:00.661</td>\n",
645
+ " <td>15.686275</td>\n",
646
+ " <td>2025/12/22</td>\n",
647
+ " </tr>\n",
648
+ " <tr>\n",
649
+ " <th>5</th>\n",
650
+ " <td>2026-01-13 06:00:00.339</td>\n",
651
+ " <td>15.686275</td>\n",
652
+ " <td>2026/01/13</td>\n",
653
+ " </tr>\n",
654
+ " <tr>\n",
655
+ " <th>6</th>\n",
656
+ " <td>2025-12-19 07:00:00.010</td>\n",
657
+ " <td>38.823529</td>\n",
658
+ " <td>2025/12/19</td>\n",
659
+ " </tr>\n",
660
+ " <tr>\n",
661
+ " <th>7</th>\n",
662
+ " <td>2025-12-19 04:00:01.156</td>\n",
663
+ " <td>33.333333</td>\n",
664
+ " <td>2025/12/19</td>\n",
665
+ " </tr>\n",
666
+ " <tr>\n",
667
+ " <th>8</th>\n",
668
+ " <td>2025-12-19 04:00:02.157</td>\n",
669
+ " <td>35.294118</td>\n",
670
+ " <td>2025/12/19</td>\n",
671
+ " </tr>\n",
672
+ " <tr>\n",
673
+ " <th>9</th>\n",
674
+ " <td>2025-12-19 07:00:01.009</td>\n",
675
+ " <td>34.901961</td>\n",
676
+ " <td>2025/12/19</td>\n",
677
+ " </tr>\n",
678
+ " </tbody>\n",
679
+ "</table>\n",
680
+ "</div>"
681
+ ],
682
+ "text/plain": [
683
+ " t s01pid49_absthrottleposd date_created\n",
684
+ "0 2026-01-02 03:04:02.441 15.686275 2026/01/02\n",
685
+ "1 2025-12-03 03:10:02.217 15.686275 2025/12/03\n",
686
+ "2 2025-12-31 03:00:00.162 15.686275 2025/12/31\n",
687
+ "3 2025-12-19 04:00:00.157 30.980392 2025/12/19\n",
688
+ "4 2025-12-22 04:00:00.661 15.686275 2025/12/22\n",
689
+ "5 2026-01-13 06:00:00.339 15.686275 2026/01/13\n",
690
+ "6 2025-12-19 07:00:00.010 38.823529 2025/12/19\n",
691
+ "7 2025-12-19 04:00:01.156 33.333333 2025/12/19\n",
692
+ "8 2025-12-19 04:00:02.157 35.294118 2025/12/19\n",
693
+ "9 2025-12-19 07:00:01.009 34.901961 2025/12/19"
694
+ ]
695
+ },
696
+ "metadata": {},
697
+ "output_type": "display_data"
698
+ },
699
+ {
700
+ "name": "stdout",
701
+ "output_type": "stream",
702
+ "text": [
703
+ "\n",
704
+ "Data types:\n",
705
+ "t object\n",
706
+ "s01pid49_absthrottleposd float64\n",
707
+ "date_created object\n",
708
+ "dtype: object\n",
709
+ "\n",
710
+ "Basic statistics:\n"
711
+ ]
712
+ },
713
+ {
714
+ "data": {
715
+ "text/html": [
716
+ "<div>\n",
717
+ "<style scoped>\n",
718
+ " .dataframe tbody tr th:only-of-type {\n",
719
+ " vertical-align: middle;\n",
720
+ " }\n",
721
+ "\n",
722
+ " .dataframe tbody tr th {\n",
723
+ " vertical-align: top;\n",
724
+ " }\n",
725
+ "\n",
726
+ " .dataframe thead th {\n",
727
+ " text-align: right;\n",
728
+ " }\n",
729
+ "</style>\n",
730
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>s01pid49_absthrottleposd</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>count</th>\n",
+ " <td>652001.000000</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>mean</th>\n",
+ " <td>21.921143</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>std</th>\n",
+ " <td>8.487119</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>min</th>\n",
+ " <td>15.686275</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>25%</th>\n",
+ " <td>15.686275</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>50%</th>\n",
+ " <td>15.686275</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>75%</th>\n",
+ " <td>29.019608</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>max</th>\n",
+ " <td>58.431373</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " s01pid49_absthrottleposd\n",
+ "count 652001.000000\n",
+ "mean 21.921143\n",
+ "std 8.487119\n",
+ "min 15.686275\n",
+ "25% 15.686275\n",
+ "50% 15.686275\n",
+ "75% 29.019608\n",
+ "max 58.431373"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Read a sample of data\n",
+ "if devices and messages:\n",
+ " device_id = devices[0]\n",
+ " # message = messages[0]\n",
+ " message = 'can1_obd2_s_m41_s01pid_m49'\n",
+ " \n",
+ " print(f\"Reading sample data from {device_id}/{message}...\")\n",
+ " print(\"=\" * 60)\n",
+ " \n",
+ " try:\n",
+ " df = query.read_device_message(\n",
+ " device_id=device_id,\n",
+ " message=message,\n",
+ " )\n",
+ " \n",
+ " print(f\"✓ Loaded {len(df)} records\")\n",
+ " print(f\"\\nData shape: {df.shape}\")\n",
+ " print(f\"\\nColumns: {list(df.columns)}\")\n",
+ " print(f\"\\nFirst few rows:\")\n",
+ " display(df.head(10))\n",
+ " \n",
+ " print(f\"\\nData types:\")\n",
+ " print(df.dtypes)\n",
+ " \n",
+ " print(f\"\\nBasic statistics:\")\n",
+ " display(df.describe())\n",
+ " \n",
+ " except Exception as e:\n",
+ " print(f\"✗ Error reading data: {e}\")\n",
+ " import traceback\n",
+ " traceback.print_exc()"
+ ]
823
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "id": "31396a98",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df['t'] = pd.to_datetime(df['t'])"
+ ]
833
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "id": "8fa88ee6",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>t</th>\n",
+ " <th>s01pid49_absthrottleposd</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>2025-11-22</td>\n",
+ " <td>15.709290</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>2025-11-23</td>\n",
+ " <td>NaN</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>2025-11-24</td>\n",
+ " <td>21.347658</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>2025-11-25</td>\n",
+ " <td>22.176305</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>2025-11-26</td>\n",
+ " <td>22.074130</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>2025-11-27</td>\n",
+ " <td>22.379063</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>6</th>\n",
+ " <td>2025-11-28</td>\n",
+ " <td>22.611687</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>7</th>\n",
+ " <td>2025-11-29</td>\n",
+ " <td>NaN</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>8</th>\n",
+ " <td>2025-11-30</td>\n",
+ " <td>NaN</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>9</th>\n",
+ " <td>2025-12-01</td>\n",
+ " <td>NaN</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>10</th>\n",
+ " <td>2025-12-02</td>\n",
+ " <td>NaN</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>11</th>\n",
+ " <td>2025-12-03</td>\n",
+ " <td>22.212069</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>12</th>\n",
+ " <td>2025-12-04</td>\n",
+ " <td>21.593356</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>13</th>\n",
+ " <td>2025-12-05</td>\n",
+ " <td>22.048014</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>14</th>\n",
+ " <td>2025-12-06</td>\n",
+ " <td>NaN</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>15</th>\n",
+ " <td>2025-12-07</td>\n",
+ " <td>NaN</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>16</th>\n",
+ " <td>2025-12-08</td>\n",
+ " <td>21.288014</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>17</th>\n",
+ " <td>2025-12-09</td>\n",
+ " <td>22.105263</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>18</th>\n",
+ " <td>2025-12-10</td>\n",
959
+ " <td>22.144666</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>19</th>\n",
+ " <td>2025-12-11</td>\n",
+ " <td>21.774071</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>20</th>\n",
+ " <td>2025-12-12</td>\n",
+ " <td>21.957367</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>21</th>\n",
+ " <td>2025-12-13</td>\n",
+ " <td>NaN</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>22</th>\n",
+ " <td>2025-12-14</td>\n",
+ " <td>NaN</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>23</th>\n",
+ " <td>2025-12-15</td>\n",
+ " <td>20.411036</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>24</th>\n",
+ " <td>2025-12-16</td>\n",
+ " <td>21.394285</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>25</th>\n",
+ " <td>2025-12-17</td>\n",
+ " <td>21.644342</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>26</th>\n",
+ " <td>2025-12-18</td>\n",
+ " <td>22.130631</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>27</th>\n",
+ " <td>2025-12-19</td>\n",
+ " <td>21.194253</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>28</th>\n",
+ " <td>2025-12-20</td>\n",
+ " <td>NaN</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>29</th>\n",
+ " <td>2025-12-21</td>\n",
+ " <td>NaN</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>30</th>\n",
+ " <td>2025-12-22</td>\n",
+ " <td>21.804700</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>31</th>\n",
+ " <td>2025-12-23</td>\n",
+ " <td>21.961360</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>32</th>\n",
+ " <td>2025-12-24</td>\n",
+ " <td>22.252882</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>33</th>\n",
+ " <td>2025-12-25</td>\n",
+ " <td>21.916508</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>34</th>\n",
+ " <td>2025-12-26</td>\n",
+ " <td>22.494252</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>35</th>\n",
+ " <td>2025-12-27</td>\n",
+ " <td>NaN</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>36</th>\n",
+ " <td>2025-12-28</td>\n",
+ " <td>NaN</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>37</th>\n",
+ " <td>2025-12-29</td>\n",
+ " <td>21.873543</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>38</th>\n",
+ " <td>2025-12-30</td>\n",
+ " <td>21.890226</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>39</th>\n",
+ " <td>2025-12-31</td>\n",
+ " <td>22.529185</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>40</th>\n",
+ " <td>2026-01-01</td>\n",
+ " <td>NaN</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>41</th>\n",
+ " <td>2026-01-02</td>\n",
+ " <td>22.761742</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>42</th>\n",
+ " <td>2026-01-03</td>\n",
+ " <td>NaN</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>43</th>\n",
+ " <td>2026-01-04</td>\n",
+ " <td>NaN</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>44</th>\n",
+ " <td>2026-01-05</td>\n",
1089
+ " <td>22.315963</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>45</th>\n",
+ " <td>2026-01-06</td>\n",
+ " <td>22.110849</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>46</th>\n",
+ " <td>2026-01-07</td>\n",
+ " <td>21.613014</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>47</th>\n",
+ " <td>2026-01-08</td>\n",
+ " <td>21.953064</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>48</th>\n",
+ " <td>2026-01-09</td>\n",
+ " <td>21.585424</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>49</th>\n",
+ " <td>2026-01-10</td>\n",
+ " <td>NaN</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>50</th>\n",
+ " <td>2026-01-11</td>\n",
+ " <td>NaN</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>51</th>\n",
+ " <td>2026-01-12</td>\n",
+ " <td>22.092380</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>52</th>\n",
+ " <td>2026-01-13</td>\n",
+ " <td>22.664499</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>53</th>\n",
+ " <td>2026-01-14</td>\n",
+ " <td>22.124919</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>54</th>\n",
+ " <td>2026-01-15</td>\n",
+ " <td>22.252390</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>55</th>\n",
+ " <td>2026-01-16</td>\n",
+ " <td>22.551813</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " t s01pid49_absthrottleposd\n",
+ "0 2025-11-22 15.709290\n",
+ "1 2025-11-23 NaN\n",
+ "2 2025-11-24 21.347658\n",
+ "3 2025-11-25 22.176305\n",
+ "4 2025-11-26 22.074130\n",
+ "5 2025-11-27 22.379063\n",
+ "6 2025-11-28 22.611687\n",
+ "7 2025-11-29 NaN\n",
+ "8 2025-11-30 NaN\n",
+ "9 2025-12-01 NaN\n",
+ "10 2025-12-02 NaN\n",
+ "11 2025-12-03 22.212069\n",
+ "12 2025-12-04 21.593356\n",
+ "13 2025-12-05 22.048014\n",
+ "14 2025-12-06 NaN\n",
+ "15 2025-12-07 NaN\n",
+ "16 2025-12-08 21.288014\n",
+ "17 2025-12-09 22.105263\n",
+ "18 2025-12-10 22.144666\n",
+ "19 2025-12-11 21.774071\n",
+ "20 2025-12-12 21.957367\n",
+ "21 2025-12-13 NaN\n",
+ "22 2025-12-14 NaN\n",
+ "23 2025-12-15 20.411036\n",
+ "24 2025-12-16 21.394285\n",
+ "25 2025-12-17 21.644342\n",
+ "26 2025-12-18 22.130631\n",
+ "27 2025-12-19 21.194253\n",
+ "28 2025-12-20 NaN\n",
+ "29 2025-12-21 NaN\n",
+ "30 2025-12-22 21.804700\n",
+ "31 2025-12-23 21.961360\n",
+ "32 2025-12-24 22.252882\n",
+ "33 2025-12-25 21.916508\n",
+ "34 2025-12-26 22.494252\n",
+ "35 2025-12-27 NaN\n",
+ "36 2025-12-28 NaN\n",
+ "37 2025-12-29 21.873543\n",
+ "38 2025-12-30 21.890226\n",
+ "39 2025-12-31 22.529185\n",
+ "40 2026-01-01 NaN\n",
+ "41 2026-01-02 22.761742\n",
+ "42 2026-01-03 NaN\n",
+ "43 2026-01-04 NaN\n",
+ "44 2026-01-05 22.315963\n",
+ "45 2026-01-06 22.110849\n",
+ "46 2026-01-07 21.613014\n",
+ "47 2026-01-08 21.953064\n",
+ "48 2026-01-09 21.585424\n",
+ "49 2026-01-10 NaN\n",
+ "50 2026-01-11 NaN\n",
+ "51 2026-01-12 22.092380\n",
+ "52 2026-01-13 22.664499\n",
+ "53 2026-01-14 22.124919\n",
+ "54 2026-01-15 22.252390\n",
+ "55 2026-01-16 22.551813"
+ ]
+ },
+ "execution_count": 25,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df.drop(columns=['date_created']).groupby(pd.Grouper(key='t', freq='D')).mean().reset_index()"
+ ]
1218
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "id": "64376fb5",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 15.686275\n",
+ "1 15.686275\n",
+ "2 15.686275\n",
+ "3 30.980392\n",
+ "4 15.686275\n",
+ " ... \n",
+ "651996 15.686275\n",
+ "651997 15.686275\n",
+ "651998 15.686275\n",
+ "651999 15.686275\n",
+ "652000 15.686275\n",
+ "Name: s01pid49_absthrottleposd, Length: 652001, dtype: float64"
+ ]
+ },
+ "execution_count": 22,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df['s01pid49_absthrottleposd']"
+ ]
1250
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "ec3d240b",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>t</th>\n",
+ " <th>altitudevalid</th>\n",
+ " <th>altitude</th>\n",
+ " <th>altitudeaccuracy</th>\n",
+ " <th>date_created</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>6923253</th>\n",
+ " <td>2025-10-21 15:09:49.454</td>\n",
+ " <td>1.0</td>\n",
+ " <td>-45.4</td>\n",
+ " <td>43.0</td>\n",
+ " <td>2025/10/21</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>6923254</th>\n",
+ " <td>2025-10-21 15:09:49.654</td>\n",
+ " <td>1.0</td>\n",
+ " <td>-44.4</td>\n",
+ " <td>37.0</td>\n",
+ " <td>2025/10/21</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>6923255</th>\n",
+ " <td>2025-10-21 15:09:49.854</td>\n",
+ " <td>1.0</td>\n",
+ " <td>-43.8</td>\n",
+ " <td>32.0</td>\n",
+ " <td>2025/10/21</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>6923256</th>\n",
+ " <td>2025-10-21 15:09:50.263</td>\n",
+ " <td>1.0</td>\n",
+ " <td>-43.3</td>\n",
+ " <td>29.0</td>\n",
+ " <td>2025/10/21</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>6923257</th>\n",
+ " <td>2025-10-21 15:09:50.463</td>\n",
+ " <td>1.0</td>\n",
+ " <td>-41.9</td>\n",
+ " <td>24.0</td>\n",
+ " <td>2025/10/21</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>...</th>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " <td>...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>12876766</th>\n",
+ " <td>2026-01-17 12:59:59.093</td>\n",
+ " <td>1.0</td>\n",
+ " <td>-12.4</td>\n",
+ " <td>1.0</td>\n",
+ " <td>2026/01/17</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>12876767</th>\n",
+ " <td>2026-01-17 12:59:59.293</td>\n",
+ " <td>1.0</td>\n",
+ " <td>-12.4</td>\n",
+ " <td>1.0</td>\n",
+ " <td>2026/01/17</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>12876768</th>\n",
+ " <td>2026-01-17 12:59:59.493</td>\n",
+ " <td>1.0</td>\n",
+ " <td>-12.4</td>\n",
+ " <td>1.0</td>\n",
+ " <td>2026/01/17</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>12876769</th>\n",
+ " <td>2026-01-17 12:59:59.693</td>\n",
+ " <td>1.0</td>\n",
+ " <td>-12.4</td>\n",
+ " <td>1.0</td>\n",
+ " <td>2026/01/17</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>12876770</th>\n",
+ " <td>2026-01-17 12:59:59.893</td>\n",
+ " <td>1.0</td>\n",
+ " <td>-12.4</td>\n",
+ " <td>1.0</td>\n",
+ " <td>2026/01/17</td>\n",
1373
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "<p>12981228 rows × 5 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " t altitudevalid altitude altitudeaccuracy \\\n",
+ "6923253 2025-10-21 15:09:49.454 1.0 -45.4 43.0 \n",
+ "6923254 2025-10-21 15:09:49.654 1.0 -44.4 37.0 \n",
+ "6923255 2025-10-21 15:09:49.854 1.0 -43.8 32.0 \n",
+ "6923256 2025-10-21 15:09:50.263 1.0 -43.3 29.0 \n",
+ "6923257 2025-10-21 15:09:50.463 1.0 -41.9 24.0 \n",
+ "... ... ... ... ... \n",
+ "12876766 2026-01-17 12:59:59.093 1.0 -12.4 1.0 \n",
+ "12876767 2026-01-17 12:59:59.293 1.0 -12.4 1.0 \n",
+ "12876768 2026-01-17 12:59:59.493 1.0 -12.4 1.0 \n",
+ "12876769 2026-01-17 12:59:59.693 1.0 -12.4 1.0 \n",
+ "12876770 2026-01-17 12:59:59.893 1.0 -12.4 1.0 \n",
+ "\n",
+ " date_created \n",
+ "6923253 2025/10/21 \n",
+ "6923254 2025/10/21 \n",
+ "6923255 2025/10/21 \n",
+ "6923256 2025/10/21 \n",
+ "6923257 2025/10/21 \n",
+ "... ... \n",
+ "12876766 2026/01/17 \n",
+ "12876767 2026/01/17 \n",
+ "12876768 2026/01/17 \n",
+ "12876769 2026/01/17 \n",
+ "12876770 2026/01/17 \n",
+ "\n",
+ "[12981228 rows x 5 columns]"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df.sort_values(by='t')"
+ ]
1417
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ed959149",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "venv",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.18"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }