pilotstuki Claude committed
Commit 002262c · 0 Parent(s)

Initial commit: IIS Log Performance Analyzer


Add complete Streamlit application for analyzing large IIS log files:
- High-performance log parsing with Polars
- Interactive web UI with Streamlit
- Comprehensive metrics and visualizations
- Support for multi-file analysis
- Smart filtering for monitoring requests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Files changed (7)
  1. .gitignore +44 -0
  2. README.md +208 -0
  3. app.py +499 -0
  4. log_parser.py +419 -0
  5. requirements.txt +8 -0
  6. run.sh +20 -0
  7. test_parser.py +118 -0
.gitignore ADDED
@@ -0,0 +1,44 @@
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ env/
+ venv/
+ ENV/
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+
+ # Streamlit
+ .streamlit/
+
+ # Log files (example/sample files - users will upload their own)
+ *.log
+
+ # PDF files (example reports/analysis)
+ *.pdf
+
+ # IDE
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+ *~
+
+ # OS
+ .DS_Store
+ Thumbs.db
README.md ADDED
@@ -0,0 +1,208 @@
+ # IIS Log Performance Analyzer
+
+ High-performance web application for analyzing large IIS log files (200MB-1GB+). Built with Streamlit and Polars for fast, efficient processing.
+
+ **GitHub Repository**: [https://github.com/pilot-stuk/odata_log_parser](https://github.com/pilot-stuk/odata_log_parser)
+
+ **Live Demo**: Deploy on [Streamlit Cloud](https://streamlit.io/cloud)
+
+ ## Features
+
+ - **Fast Processing**: Uses the Polars library for 10-100x faster parsing than pandas
+ - **Large File Support**: Efficiently handles files up to 1GB+
+ - **Comprehensive Metrics**:
+   - Total requests (before/after filtering)
+   - Error rates and breakdown by status code
+   - Response time statistics (min/max/avg)
+   - Slow request detection (configurable threshold)
+   - Peak RPS (Requests Per Second) with timestamp
+   - Top methods by request count and response time
+ - **Multi-File Analysis**: Upload and compare multiple log files side-by-side
+ - **Interactive Visualizations**: Charts and graphs using Plotly
+ - **Smart Filtering**: Automatically excludes monitoring requests (Zabbix HEAD) and 401 Unauthorized responses
+
+ ## Requirements
+
+ - Python 3.8+
+ - See `requirements.txt` for package dependencies
+
+ ## Installation
+
+ ### Local Installation
+
+ 1. Clone the repository:
+ ```bash
+ git clone https://github.com/pilot-stuk/odata_log_parser.git
+ cd odata_log_parser
+ ```
+
+ 2. Install dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ### Deploy to Streamlit Cloud
+
+ 1. Fork or clone this repository to your GitHub account
+ 2. Go to [share.streamlit.io](https://share.streamlit.io/)
+ 3. Sign in with your GitHub account
+ 4. Click "New app"
+ 5. Select your repository: `pilot-stuk/odata_log_parser`
+ 6. Set the main file path: `app.py`
+ 7. Click "Deploy"
+
+ The app will be live at: `https://share.streamlit.io/pilot-stuk/odata_log_parser/main/app.py`
+
+ ## Usage
+
+ ### Run the Streamlit App
+
+ ```bash
+ streamlit run app.py
+ ```
+
+ The application will open in your browser at `http://localhost:8501`.
+
+ ### Upload Log Files
+
+ 1. Click "Browse files" in the sidebar
+ 2. Select one or more IIS log files (.log or .txt)
+ 3. View the analysis results
+
+ ### Configuration Options
+
+ - **Upload Mode**: Single or Multiple files
+ - **Top N Methods**: Number of top methods to display (3-20)
+ - **Slow Request Threshold**: Configure what constitutes a "slow" request (default: 3000ms)
+
+ ## Log Format
+
+ This tool supports **IIS W3C Extended Log Format** with the following fields:
+
+ ```
+ date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip
+ cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken
+ ```
+
+ Example log line:
+ ```
+ 2025-09-22 00:00:46 10.21.31.42 GET /Service/Contact/Get sessionid='xxx' 443 - 212.233.92.232 - - 200 0 0 24
+ ```
+
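A minimal, hypothetical illustration of how one data line in this format splits into named fields (the field list and example line come from the README; this is not the project's actual parser, which lives in `log_parser.py`):

```python
# Hypothetical sketch: split one W3C Extended log line into named fields.
# Field names mirror the README's field list; all values stay as strings.
FIELDS = [
    "date", "time", "s-ip", "cs-method", "cs-uri-stem", "cs-uri-query",
    "s-port", "cs-username", "c-ip", "cs(User-Agent)", "cs(Referer)",
    "sc-status", "sc-substatus", "sc-win32-status", "time-taken",
]

line = ("2025-09-22 00:00:46 10.21.31.42 GET /Service/Contact/Get "
        "sessionid='xxx' 443 - 212.233.92.232 - - 200 0 0 24")

# Whitespace-separated fields map 1:1 onto the column names.
record = dict(zip(FIELDS, line.split()))
print(record["cs-method"], record["sc-status"], record["time-taken"])  # GET 200 24
```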
92
+ ## Filtering Rules
93
+
94
+ The analyzer applies the following filters automatically:
95
+
96
+ 1. **Monitoring Exclusion**: Lines containing both `HEAD` method and `Zabbix` are excluded
97
+ 2. **401 Handling**: 401 Unauthorized responses are excluded from error counts (considered authentication attempts, not system errors)
98
+ 3. **Error Definition**: Errors are HTTP status codes ≠ 200 and ≠ 401
99
+
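The three rules above can be sketched in plain Python over hypothetical row dicts (`method`, `user_agent`, `status` are illustrative names; the project implements the same rules with Polars filters):

```python
# Sketch of the filtering rules on hypothetical parsed rows.
def is_excluded(row: dict) -> bool:
    """Rule 1: drop Zabbix monitoring probes sent as HEAD requests."""
    return row["method"] == "HEAD" and "Zabbix" in row["user_agent"]

def is_error(row: dict) -> bool:
    """Rules 2-3: an error is any status other than 200, except 401."""
    return row["status"] not in (200, 401)

rows = [
    {"method": "HEAD", "user_agent": "Zabbix", "status": 200},  # excluded by rule 1
    {"method": "GET", "user_agent": "curl", "status": 401},     # kept, not an error
    {"method": "GET", "user_agent": "curl", "status": 500},     # kept, counted as error
]
kept = [r for r in rows if not is_excluded(r)]
errors = [r for r in kept if is_error(r)]
print(len(kept), len(errors))  # 2 1
```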
100
+ ## Metrics Explained
101
+
102
+ | Metric | Description |
103
+ |--------|-------------|
104
+ | **Total Requests (before filtering)** | Raw number of log entries |
105
+ | **Excluded Requests** | Lines filtered out (HEAD+Zabbix + 401) |
106
+ | **Processed Requests** | Valid requests included in analysis |
107
+ | **Errors** | Requests with status ≠ 200 and ≠ 401 |
108
+ | **Slow Requests** | Requests exceeding threshold (default: 3000ms) |
109
+ | **Peak RPS** | Maximum requests per second observed |
110
+ | **Avg/Max/Min Response Time** | Response time statistics in milliseconds |
111
+
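The Peak RPS metric amounts to bucketing requests by their one-second timestamp and taking the busiest bucket. A small sketch with hand-made timestamps (the app computes this with a Polars group-by):

```python
from collections import Counter

# Sketch of Peak RPS: count requests per one-second timestamp bucket
# and report the busiest second.
timestamps = [
    "2025-09-22 00:00:46", "2025-09-22 00:00:46",
    "2025-09-22 00:00:46", "2025-09-22 00:00:47",
]
counts = Counter(timestamps)
peak_ts, peak_rps = counts.most_common(1)[0]
print(peak_ts, peak_rps)  # 2025-09-22 00:00:46 3
```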
+ ## Performance
+
+ - **Small files** (<50MB): process in seconds
+ - **Medium files** (50-200MB): process in 10-30 seconds
+ - **Large files** (200MB-1GB): process in 1-3 minutes
+
+ Performance depends on:
+ - File size
+ - Number of log entries
+ - System CPU and RAM
+ - Disk I/O speed
+
+ ## Architecture
+
+ ```
+ app.py            # Streamlit UI application
+ log_parser.py     # Core parsing and analysis logic using Polars
+ requirements.txt  # Python dependencies
+ README.md         # This file
+ ```
+
+ ### Key Components
+
+ - **IISLogParser**: Parses the IIS W3C log format into a Polars DataFrame
+ - **LogAnalyzer**: Calculates metrics and statistics
+ - **Streamlit UI**: Interactive web interface with visualizations
+
+ ## Use Cases
+
+ - **Performance Analysis**: Identify slow endpoints and response time patterns
+ - **Error Investigation**: Track error rates and problematic methods
+ - **Capacity Planning**: Analyze peak load and RPS patterns
+ - **Service Comparison**: Compare performance across multiple services
+ - **Incident Review**: Analyze logs from specific time periods
+
+ ## Troubleshooting
+
+ ### Large File Upload Issues
+
+ If Streamlit has trouble with very large files (>500MB):
+
+ 1. Increase Streamlit's upload size limit:
+ ```bash
+ streamlit run app.py --server.maxUploadSize=1024
+ ```
+
+ 2. Or modify `.streamlit/config.toml`:
+ ```toml
+ [server]
+ maxUploadSize = 1024
+ ```
+
+ ### Memory Issues
+
+ For files >1GB, you may need to:
+ - Increase available system memory
+ - Process files in smaller chunks
+ - Use a CLI version (could be developed if needed)
+
+ ### Performance Tips
+
+ - Close other memory-intensive applications
+ - Process very large files one at a time
+ - Use an SSD for faster I/O
+ - Ensure adequate RAM (8GB+ recommended for 1GB files)
+
+ ## Future Enhancements
+
+ Potential features for future versions:
+ - CLI tool for batch processing
+ - Export results to PDF/Excel
+ - Real-time log monitoring
+ - Custom metric definitions
+ - Time range filtering
+ - IP address analysis
+ - Session tracking
+
+ ## Example Output
+
+ The application generates:
+
+ 1. **Summary Table**: Key metrics for each log file
+ 2. **Top Methods Chart**: Most frequently called endpoints
+ 3. **Response Time Distribution**: Histogram of response times
+ 4. **Error Breakdown**: Pie chart of error types
+ 5. **Service Comparison**: Side-by-side comparison for multiple files
+
+ ## License
+
+ This tool is provided as-is for log analysis purposes.
+
+ ## Support
+
+ For issues or questions:
+ 1. Check that the log file format matches the IIS W3C Extended format
+ 2. Verify all required fields are present
+ 3. Ensure Python and dependencies are correctly installed
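The first support step (checking the log format) can be approximated with a quick sanity check: in this W3C profile a data line has exactly 15 space-separated fields. This helper is hypothetical, not part of the repository:

```python
# Hypothetical sanity check: a data line in this W3C profile should have
# exactly 15 whitespace-separated fields; '#' lines are directives/comments.
def looks_like_w3c_line(line: str, expected_fields: int = 15) -> bool:
    return not line.startswith("#") and len(line.split()) == expected_fields

good = ("2025-09-22 00:00:46 10.21.31.42 GET /Service/Contact/Get "
        "sessionid='xxx' 443 - 212.233.92.232 - - 200 0 0 24")
print(looks_like_w3c_line(good))            # True
print(looks_like_w3c_line("#Fields: ..."))  # False
```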
app.py ADDED
@@ -0,0 +1,499 @@
+ """
+ IIS Log Analyzer - Streamlit Application
+ High-performance log analysis tool for large IIS log files (200MB-1GB+)
+ """
+
+ import streamlit as st
+ import plotly.graph_objects as go
+ import plotly.express as px
+ from plotly.subplots import make_subplots
+ import pandas as pd
+ from pathlib import Path
+ import tempfile
+ from typing import List
+ import time
+
+ from log_parser import IISLogParser, LogAnalyzer, analyze_multiple_logs
+
+
+ # Page configuration
+ st.set_page_config(
+     page_title="IIS Log Analyzer",
+     page_icon="📊",
+     layout="wide",
+     initial_sidebar_state="expanded"
+ )
+
+ # Custom CSS
+ st.markdown("""
+ <style>
+     .metric-card {
+         background-color: #f0f2f6;
+         padding: 20px;
+         border-radius: 10px;
+         margin: 10px 0;
+     }
+     .error-metric {
+         background-color: #ffebee;
+     }
+     .success-metric {
+         background-color: #e8f5e9;
+     }
+     .warning-metric {
+         background-color: #fff3e0;
+     }
+ </style>
+ """, unsafe_allow_html=True)
+
+
+ def format_number(num: int) -> str:
+     """Format large numbers with thousand separators."""
+     return f"{num:,}"
+
+
+ def create_summary_table(stats: dict) -> pd.DataFrame:
+     """Create summary statistics table."""
+     data = {
+         "Metric": [
+             "Total Requests (before filtering)",
+             "Excluded Requests (HEAD+Zabbix + 401)",
+             "Processed Requests",
+             "Errors (≠200, ≠401)",
+             "Slow Requests (>3s)",
+             "Peak RPS",
+             "Peak Timestamp",
+             "Avg Response Time (ms)",
+             "Max Response Time (ms)",
+             "Min Response Time (ms)",
+         ],
+         "Value": [
+             format_number(stats["total_requests_before"]),
+             format_number(stats["excluded_requests"]),
+             format_number(stats["total_requests_after"]),
+             format_number(stats["errors"]),
+             format_number(stats["slow_requests"]),
+             format_number(stats["peak_rps"]),
+             stats["peak_timestamp"] or "N/A",
+             format_number(stats["avg_time_ms"]),
+             format_number(stats["max_time_ms"]),
+             format_number(stats["min_time_ms"]),
+         ]
+     }
+     return pd.DataFrame(data)
+
+
+ def create_response_time_chart(dist: dict, title: str) -> go.Figure:
+     """Create response time distribution chart."""
+     labels = list(dist.keys())
+     values = list(dist.values())
+
+     fig = go.Figure(data=[
+         go.Bar(
+             x=labels,
+             y=values,
+             marker_color='lightblue',
+             text=values,
+             textposition='auto',
+         )
+     ])
+
+     fig.update_layout(
+         title=title,
+         xaxis_title="Response Time Range",
+         yaxis_title="Request Count",
+         height=400,
+         showlegend=False
+     )
+
+     return fig
+
+
+ def create_top_methods_chart(methods: List[dict], title: str) -> go.Figure:
+     """Create top methods bar chart."""
+     if not methods:
+         return go.Figure()
+
+     df = pd.DataFrame(methods)
+
+     fig = make_subplots(
+         rows=1, cols=2,
+         subplot_titles=("Request Count", "Avg Response Time (ms)")
+     )
+
+     # Request count
+     fig.add_trace(
+         go.Bar(
+             x=df["method_name"],
+             y=df["count"],
+             name="Count",
+             marker_color='steelblue',
+             text=df["count"],
+             textposition='auto',
+         ),
+         row=1, col=1
+     )
+
+     # Average time
+     fig.add_trace(
+         go.Bar(
+             x=df["method_name"],
+             y=df["avg_time"].round(1),
+             name="Avg Time",
+             marker_color='coral',
+             text=df["avg_time"].round(1),
+             textposition='auto',
+         ),
+         row=1, col=2
+     )
+
+     fig.update_layout(
+         title_text=title,
+         height=400,
+         showlegend=False
+     )
+
+     return fig
+
+
+ def create_metrics_comparison(individual_stats: List[dict]) -> go.Figure:
+     """Create comparison chart for multiple services."""
+     services = [s["summary"]["service_name"] for s in individual_stats]
+     requests = [s["summary"]["total_requests_after"] for s in individual_stats]
+     errors = [s["summary"]["errors"] for s in individual_stats]
+     avg_times = [s["summary"]["avg_time_ms"] for s in individual_stats]
+
+     fig = make_subplots(
+         rows=1, cols=3,
+         subplot_titles=("Processed Requests", "Errors", "Avg Response Time (ms)"),
+         specs=[[{"type": "bar"}, {"type": "bar"}, {"type": "bar"}]]
+     )
+
+     fig.add_trace(
+         go.Bar(x=services, y=requests, marker_color='lightblue', text=requests, textposition='auto'),
+         row=1, col=1
+     )
+
+     fig.add_trace(
+         go.Bar(x=services, y=errors, marker_color='salmon', text=errors, textposition='auto'),
+         row=1, col=2
+     )
+
+     fig.add_trace(
+         go.Bar(x=services, y=avg_times, marker_color='lightgreen', text=avg_times, textposition='auto'),
+         row=1, col=3
+     )
+
+     fig.update_layout(
+         title_text="Service Comparison",
+         height=400,
+         showlegend=False
+     )
+
+     return fig
+
+
+ def process_log_file(file_path: str, service_name: str = None) -> dict:
+     """Process a single log file and return statistics."""
+     parser = IISLogParser(file_path)
+     if service_name:
+         parser.service_name = service_name
+
+     with st.spinner(f"Parsing {Path(file_path).name}..."):
+         df = parser.parse()
+
+     if df.height == 0:
+         st.error(f"No valid log entries found in {Path(file_path).name}")
+         return None
+
+     with st.spinner(f"Analyzing {parser.service_name}..."):
+         analyzer = LogAnalyzer(df, parser.service_name)
+
+         stats = {
+             "summary": analyzer.get_summary_stats(),
+             "top_methods": analyzer.get_top_methods(),
+             "error_breakdown": analyzer.get_error_breakdown(),
+             "errors_by_method": analyzer.get_errors_by_method(n=10),
+             "response_time_dist": analyzer.get_response_time_distribution(),
+             "analyzer": analyzer,  # Keep reference for detailed error queries
+         }
+
+     return stats
+
+
+ def main():
+     st.title("📊 IIS Log Performance Analyzer")
+     st.markdown("High-performance analysis tool for large IIS log files (up to 1GB+)")
+
+     # Sidebar
+     st.sidebar.header("Configuration")
+
+     # File upload mode
+     upload_mode = st.sidebar.radio(
+         "Upload Mode",
+         ["Single File", "Multiple Files"],
+         help="Analyze one or multiple log files"
+     )
+
+     # File uploader
+     if upload_mode == "Single File":
+         uploaded_files = st.sidebar.file_uploader(
+             "Upload IIS Log File",
+             type=["log", "txt"],
+             help="Upload IIS W3C Extended format log file"
+         )
+         uploaded_files = [uploaded_files] if uploaded_files else []
+     else:
+         uploaded_files = st.sidebar.file_uploader(
+             "Upload IIS Log Files",
+             type=["log", "txt"],
+             accept_multiple_files=True,
+             help="Upload multiple IIS log files for comparison"
+         )
+
+     # Analysis options
+     st.sidebar.header("Analysis Options")
+     show_top_n = st.sidebar.slider("Top N Methods", 3, 20, 5)
+     slow_threshold = st.sidebar.number_input(
+         "Slow Request Threshold (ms)",
+         min_value=100,
+         max_value=10000,
+         value=3000,
+         step=100
+     )
+
+     # Process files
+     if uploaded_files:
+         st.info(f"Processing {len(uploaded_files)} file(s)...")
+
+         # Save uploaded files to temp directory
+         temp_files = []
+         for uploaded_file in uploaded_files:
+             with tempfile.NamedTemporaryFile(delete=False, suffix=".log") as tmp:
+                 tmp.write(uploaded_file.getvalue())
+                 temp_files.append(tmp.name)
+
+         start_time = time.time()
+
+         # Process each file
+         all_stats = []
+         for i, temp_file in enumerate(temp_files):
+             file_name = uploaded_files[i].name
+             st.subheader(f"📄 {file_name}")
+
+             stats = process_log_file(temp_file, None)
+             if stats:
+                 all_stats.append(stats)
+
+                 # Display summary metrics
+                 col1, col2, col3, col4 = st.columns(4)
+                 with col1:
+                     st.metric(
+                         "Total Requests",
+                         format_number(stats["summary"]["total_requests_after"])
+                     )
+                 with col2:
+                     st.metric(
+                         "Errors",
+                         format_number(stats["summary"]["errors"]),
+                         delta=None,
+                         delta_color="inverse"
+                     )
+                 with col3:
+                     st.metric(
+                         "Avg Time (ms)",
+                         format_number(stats["summary"]["avg_time_ms"])
+                     )
+                 with col4:
+                     st.metric(
+                         "Peak RPS",
+                         format_number(stats["summary"]["peak_rps"])
+                     )
+
+                 # Tabs for detailed analysis
+                 tab1, tab2, tab3, tab4, tab5 = st.tabs([
+                     "Summary", "Top Methods", "Response Time", "Error Breakdown", "Errors by Method"
+                 ])
+
+                 with tab1:
+                     st.dataframe(
+                         create_summary_table(stats["summary"]),
+                         hide_index=True,
+                         use_container_width=True
+                     )
+
+                 with tab2:
+                     if stats["top_methods"]:
+                         st.plotly_chart(
+                             create_top_methods_chart(
+                                 stats["top_methods"][:show_top_n],
+                                 f"Top {show_top_n} Methods - {stats['summary']['service_name']}"
+                             ),
+                             use_container_width=True
+                         )
+
+                         # Show table
+                         methods_df = pd.DataFrame(stats["top_methods"][:show_top_n])
+                         methods_df["avg_time"] = methods_df["avg_time"].round(1)
+                         st.dataframe(methods_df, hide_index=True, use_container_width=True)
+                     else:
+                         st.info("No method data available")
+
+                 with tab3:
+                     if stats["response_time_dist"]:
+                         st.plotly_chart(
+                             create_response_time_chart(
+                                 stats["response_time_dist"],
+                                 f"Response Time Distribution - {stats['summary']['service_name']}"
+                             ),
+                             use_container_width=True
+                         )
+                     else:
+                         st.info("No response time distribution data")
+
+                 with tab4:
+                     if stats["error_breakdown"]:
+                         error_df = pd.DataFrame(stats["error_breakdown"])
+                         error_df.columns = ["Status Code", "Count"]
+                         st.dataframe(error_df, hide_index=True, use_container_width=True)
+
+                         # Pie chart
+                         fig = px.pie(
+                             error_df,
+                             values="Count",
+                             names="Status Code",
+                             title=f"Error Distribution - {stats['summary']['service_name']}"
+                         )
+                         st.plotly_chart(fig, use_container_width=True)
+                     else:
+                         st.success("No errors found! ✓")
+
+                 with tab5:
+                     st.markdown("### 🔍 Errors by Method")
+                     st.markdown("This view shows which specific methods are causing errors, with full context for debugging.")
+
+                     if stats["errors_by_method"]:
+                         # Display summary table
+                         errors_method_df = pd.DataFrame(stats["errors_by_method"])
+                         errors_method_df["error_rate_percent"] = errors_method_df["error_rate_percent"].round(2)
+                         errors_method_df["avg_response_time_ms"] = errors_method_df["avg_response_time_ms"].round(1)
+
+                         # Rename columns for better display
+                         errors_method_df.columns = [
+                             "Method Path", "Total Calls", "Error Count",
+                             "Most Common Error", "Avg Response Time (ms)", "Error Rate (%)"
+                         ]
+
+                         st.dataframe(errors_method_df, hide_index=True, use_container_width=True)
+
+                         # Bar chart of top error-prone methods
+                         fig = go.Figure()
+                         fig.add_trace(go.Bar(
+                             x=errors_method_df["Method Path"],
+                             y=errors_method_df["Error Count"],
+                             marker_color='red',
+                             text=errors_method_df["Error Count"],
+                             textposition='auto',
+                             name="Error Count"
+                         ))
+
+                         fig.update_layout(
+                             title=f"Top Error-Prone Methods - {stats['summary']['service_name']}",
+                             xaxis_title="Method Path",
+                             yaxis_title="Error Count",
+                             height=400,
+                             showlegend=False
+                         )
+                         st.plotly_chart(fig, use_container_width=True)
+
+                         # Allow users to drill down into specific methods
+                         st.markdown("#### 🔎 Detailed Error Logs")
+                         selected_method = st.selectbox(
+                             "Select a method to view detailed error logs:",
+                             options=["All"] + errors_method_df["Method Path"].tolist(),
+                             key=f"method_select_{file_name}"
+                         )
+
+                         if selected_method and selected_method != "All":
+                             error_details = stats["analyzer"].get_error_details(
+                                 method_path=selected_method,
+                                 limit=50
+                             )
+                             if error_details:
+                                 details_df = pd.DataFrame(error_details)
+                                 st.dataframe(details_df, hide_index=True, use_container_width=True)
+                                 st.info(f"Showing up to 50 most recent errors for {selected_method}")
+                             else:
+                                 st.info(f"No error details found for {selected_method}")
+                         elif selected_method == "All":
+                             error_details = stats["analyzer"].get_error_details(limit=50)
+                             if error_details:
+                                 details_df = pd.DataFrame(error_details)
+                                 st.dataframe(details_df, hide_index=True, use_container_width=True)
+                                 st.info("Showing up to 50 most recent errors across all methods")
+                     else:
+                         st.success("No errors found in any methods! ✓")
+
+                 st.divider()
+
+         # Multi-file comparison
+         if len(all_stats) > 1:
+             st.header("📊 Service Comparison")
+             st.plotly_chart(
+                 create_metrics_comparison(all_stats),
+                 use_container_width=True
+             )
+
+             # Combined summary
+             st.subheader("Combined Statistics")
+             combined = {
+                 "total_requests_before": sum(s["summary"]["total_requests_before"] for s in all_stats),
+                 "excluded_requests": sum(s["summary"]["excluded_requests"] for s in all_stats),
+                 "total_requests_after": sum(s["summary"]["total_requests_after"] for s in all_stats),
+                 "errors": sum(s["summary"]["errors"] for s in all_stats),
+                 "slow_requests": sum(s["summary"]["slow_requests"] for s in all_stats),
+             }
+
+             col1, col2, col3 = st.columns(3)
+             with col1:
+                 st.metric("Total Requests (All Services)", format_number(combined["total_requests_after"]))
+             with col2:
+                 st.metric("Total Errors (All Services)", format_number(combined["errors"]))
+             with col3:
+                 st.metric("Total Slow Requests (All Services)", format_number(combined["slow_requests"]))
+
+         processing_time = time.time() - start_time
+         st.success(f"✓ Analysis completed in {processing_time:.2f} seconds")
+
+         # Clean up temp files
+         for temp_file in temp_files:
+             Path(temp_file).unlink(missing_ok=True)
+
+     else:
+         # Welcome screen
+         st.info("👆 Upload one or more IIS log files to begin analysis")
+
+         st.markdown("""
+         ### Features
+         - ⚡ **Fast processing** of large files (200MB-1GB+) using Polars
+         - 📊 **Comprehensive metrics**: RPS, response times, error rates
+         - 🔍 **Detailed analysis**: Top methods, error breakdown, time distribution
+         - 📈 **Visual reports**: Interactive charts with Plotly
+         - 🔄 **Multi-file support**: Compare multiple services side-by-side
+
+         ### Log Format
+         This tool supports **IIS W3C Extended Log Format** with the following fields:
+         ```
+         date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username
+         c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken
+         ```
+
+         ### Filtering Rules
+         - Excludes lines with both `HEAD` method and `Zabbix` in User-Agent
+         - 401 Unauthorized responses are excluded from error counts
+         - Errors are defined as status codes ≠ 200 and ≠ 401
+         - Slow requests are those with response time > 3000ms (configurable)
+         """)
+
+
+ if __name__ == "__main__":
+     main()
log_parser.py ADDED
@@ -0,0 +1,419 @@
+ """
+ IIS Log Parser using Polars for high-performance processing.
+ Handles large log files (200MB-1GB+) efficiently with streaming.
+ """
+
+ import polars as pl
+ from pathlib import Path
+ from typing import Dict, List, Tuple, Optional
+ from datetime import datetime
+ import re
+
+
+ class IISLogParser:
+     """Parser for IIS W3C Extended Log Format."""
+
+     # IIS log column names
+     COLUMNS = [
+         "date", "time", "s_ip", "cs_method", "cs_uri_stem", "cs_uri_query",
+         "s_port", "cs_username", "c_ip", "cs_user_agent", "cs_referer",
+         "sc_status", "sc_substatus", "sc_win32_status", "time_taken"
+     ]
+
+     def __init__(self, file_path: str):
+         self.file_path = Path(file_path)
+         self.service_name = None  # Will be determined from URI paths during parsing
+
+     def parse(self, chunk_size: Optional[int] = None) -> pl.DataFrame:
+         """
+         Parse IIS log file.
+
+         Args:
+             chunk_size: If provided, process in chunks (for very large files)
+
+         Returns:
+             Polars DataFrame with parsed log data
+         """
+         # Read file, skip comment lines
+         with open(self.file_path, 'r', encoding='utf-8', errors='ignore') as f:
+             lines = []
+             for line in f:
+                 # Skip header/comment lines starting with #
+                 if not line.startswith('#'):
+                     lines.append(line.strip())
+
+         # Create DataFrame from lines
+         if not lines:
+             return pl.DataFrame()
+
+         # Split each line by space and create DataFrame
+         data = [line.split() for line in lines if line]
+
+         # Filter out lines that don't have the correct number of columns
+         data = [row for row in data if len(row) == len(self.COLUMNS)]
+
+         if not data:
+             return pl.DataFrame()
+
+         df = pl.DataFrame(data, schema=self.COLUMNS, orient="row")
+
+         # Convert data types
+         df = df.with_columns([
+             pl.col("date").cast(pl.Utf8),
+             pl.col("time").cast(pl.Utf8),
+             pl.col("sc_status").cast(pl.Int32),
+             pl.col("sc_substatus").cast(pl.Int32),
+             pl.col("sc_win32_status").cast(pl.Int32),
+             pl.col("time_taken").cast(pl.Int32),
+         ])
+
+         # Create timestamp column
+         df = df.with_columns([
+             (pl.col("date") + " " + pl.col("time")).alias("timestamp")
+         ])
+
+         # Convert timestamp to datetime
+         df = df.with_columns([
+             pl.col("timestamp").str.strptime(pl.Datetime, format="%Y-%m-%d %H:%M:%S")
+         ])
+
+         # Extract service name and method name from URI
+         df = df.with_columns([
+             self._extract_service_name().alias("service_name"),
+             self._extract_method_name().alias("method_name"),
+             self._extract_full_method_path().alias("full_method_path")
+         ])
+
+         # Determine the primary service name for this log file
+         if df.height > 0:
+             # Get the most common service name
+             service_counts = df.group_by("service_name").agg([
+                 pl.count().alias("count")
+             ]).sort("count", descending=True)
+
+             if service_counts.height > 0:
+                 self.service_name = service_counts.row(0, named=True)["service_name"]
+             else:
+                 self.service_name = "Unknown"
+         else:
+             self.service_name = "Unknown"
+
+         return df
+
+     def _extract_service_name(self) -> pl.Expr:
+         """Extract service name from URI stem (e.g., AdministratorOfficeService, CustomerOfficeService)."""
+         # Extract the first meaningful part after the leading slash
+         # Example: /AdministratorOfficeService/Contact/Get -> AdministratorOfficeService
+         return (
+             pl.col("cs_uri_stem")
+             .str.split("/")
+             .list.get(1)  # Get first element after leading /
+             .fill_null("Unknown")
+         )
+
+     def _extract_full_method_path(self) -> pl.Expr:
+         """Extract full method path for better error tracking (e.g., Contact/Get, Order/Create)."""
+         # Extract everything after the service name
+         # Example: /AdministratorOfficeService/Contact/Get -> Contact/Get
+         return (
+             pl.col("cs_uri_stem")
+             .str.split("/")
+             .list.slice(2)  # Skip leading / and service name
+             .list.join("/")
+             .fill_null("Unknown")
+         )
+
+     def _extract_method_name(self) -> pl.Expr:
+         """Extract method name from URI stem."""
+         # Extract last part of URI path (e.g., /Service/Contact/Get -> Get)
+         return pl.col("cs_uri_stem").str.split("/").list.last().fill_null("Unknown")
+
+
+ class LogAnalyzer:
+     """Analyze parsed IIS logs and generate performance metrics."""
+
+     def __init__(self, df: pl.DataFrame, service_name: str = "Unknown"):
+         self.df = df
+         self.service_name = service_name
+         self._filtered_df = None
+
+     def filter_logs(self) -> pl.DataFrame:
+         """
+         Apply filtering rules:
+         1. Exclude lines with both HEAD and Zabbix
+         2. Exclude 401 status codes (for error counting)
+
+         Returns:
+             Filtered DataFrame
+         """
+         if self._filtered_df is not None:
+             return self._filtered_df
+
+         # Filter out HEAD + Zabbix
+         filtered = self.df.filter(
+             ~(
+                 (pl.col("cs_method") == "HEAD") &
+                 (
+                     pl.col("cs_user_agent").str.contains("Zabbix") |
+                     pl.col("cs_uri_stem").str.contains("Zabbix")
+                 )
+             )
+         )
+
+         self._filtered_df = filtered
+         return filtered
+
+     def get_summary_stats(self) -> Dict:
+         """Get overall summary statistics."""
+         df = self.filter_logs()
+
+         # Count requests
+         total_before = self.df.height
+         total_after = df.height
+         excluded = total_before - total_after
+
+         # Count 401s separately
+         count_401 = self.df.filter(pl.col("sc_status") == 401).height
+
+         # Count errors (status != 200 and != 401)
+         errors = df.filter(
+             (pl.col("sc_status") != 200) & (pl.col("sc_status") != 401)
+         ).height
+
+         # Count slow requests (>3000ms)
+         slow_requests = df.filter(pl.col("time_taken") > 3000).height
+
+         # Response time statistics
+         time_stats = df.select([
+             pl.col("time_taken").min().alias("min_time"),
+             pl.col("time_taken").max().alias("max_time"),
+             pl.col("time_taken").mean().alias("avg_time"),
+         ]).to_dicts()[0]
+
+         # Peak RPS
+         rps_data = self._calculate_peak_rps(df)
+
+         return {
+             "service_name": self.service_name,
+             "total_requests_before": total_before,
+             "excluded_requests": excluded,
+             "excluded_401": count_401,
+             "total_requests_after": total_after,
+             "errors": errors,
+             "slow_requests": slow_requests,
+             "min_time_ms": int(time_stats["min_time"]) if time_stats["min_time"] else 0,
+             "max_time_ms": int(time_stats["max_time"]) if time_stats["max_time"] else 0,
+             "avg_time_ms": int(time_stats["avg_time"]) if time_stats["avg_time"] else 0,
+             "peak_rps": rps_data["peak_rps"],
+             "peak_timestamp": rps_data["peak_timestamp"],
+         }
+
+     def _calculate_peak_rps(self, df: pl.DataFrame) -> Dict:
+         """Calculate peak requests per second."""
+         if df.height == 0:
+             return {"peak_rps": 0, "peak_timestamp": None}
+
+         # Group by second and count requests
+         rps = df.group_by("timestamp").agg([
+             pl.count().alias("count")
+         ]).sort("count", descending=True)
+
+         if rps.height == 0:
+             return {"peak_rps": 0, "peak_timestamp": None}
+
+         peak_row = rps.row(0, named=True)
+
+         return {
+             "peak_rps": peak_row["count"],
+             "peak_timestamp": str(peak_row["timestamp"])
+         }
+
+     def get_top_methods(self, n: int = 5) -> List[Dict]:
+         """Get top N methods by request count."""
233
+ df = self.filter_logs()
234
+
235
+ if df.height == 0:
236
+ return []
237
+
238
+ # Group by method name
239
+ method_stats = df.group_by("method_name").agg([
240
+ pl.count().alias("count"),
241
+ pl.col("time_taken").mean().alias("avg_time"),
242
+ pl.col("sc_status").filter(
243
+ (pl.col("sc_status") != 200) & (pl.col("sc_status") != 401)
244
+ ).count().alias("errors")
245
+ ]).sort("count", descending=True).limit(n)
246
+
247
+ return method_stats.to_dicts()
248
+
249
+ def get_error_breakdown(self) -> List[Dict]:
250
+ """Get breakdown of errors by status code."""
251
+ df = self.filter_logs()
252
+
253
+ errors = df.filter(
254
+ (pl.col("sc_status") != 200) & (pl.col("sc_status") != 401)
255
+ )
256
+
257
+ if errors.height == 0:
258
+ return []
259
+
260
+ error_stats = errors.group_by("sc_status").agg([
261
+ pl.count().alias("count")
262
+ ]).sort("count", descending=True)
263
+
264
+ return error_stats.to_dicts()
265
+
266
+ def get_errors_by_method(self, n: int = 10) -> List[Dict]:
267
+ """
268
+ Get detailed error breakdown by method with full context.
269
+ Shows which methods are causing the most errors.
270
+
271
+ Args:
272
+ n: Number of top error-prone methods to return
273
+
274
+ Returns:
275
+ List of dicts with method, error count, total calls, and error rate
276
+ """
277
+ df = self.filter_logs()
278
+
279
+ if df.height == 0:
280
+ return []
281
+
282
+ # Get error counts and total counts per full method path
283
+ method_errors = df.group_by("full_method_path").agg([
284
+ pl.count().alias("total_calls"),
285
+ pl.col("sc_status").filter(
286
+ (pl.col("sc_status") != 200) & (pl.col("sc_status") != 401)
287
+ ).count().alias("error_count"),
288
+ pl.col("sc_status").filter(
289
+ (pl.col("sc_status") != 200) & (pl.col("sc_status") != 401)
290
+ ).first().alias("most_common_error_status"),
291
+ pl.col("time_taken").mean().alias("avg_response_time_ms"),
292
+ ]).filter(
293
+ pl.col("error_count") > 0
294
+ ).with_columns([
295
+ (pl.col("error_count") * 100.0 / pl.col("total_calls")).alias("error_rate_percent")
296
+ ]).sort("error_count", descending=True).limit(n)
297
+
298
+ return method_errors.to_dicts()
299
+
300
+ def get_error_details(self, method_path: str = None, limit: int = 100) -> List[Dict]:
301
+ """
302
+ Get detailed error logs with full context for debugging.
303
+
304
+ Args:
305
+ method_path: Optional filter for specific method path
306
+ limit: Maximum number of error records to return
307
+
308
+ Returns:
309
+ List of error records with timestamp, method, status, response time, etc.
310
+ """
311
+ df = self.filter_logs()
312
+
313
+ # Filter for errors only
314
+ errors = df.filter(
315
+ (pl.col("sc_status") != 200) & (pl.col("sc_status") != 401)
316
+ )
317
+
318
+ # Apply method filter if specified
319
+ if method_path:
320
+ errors = errors.filter(pl.col("full_method_path") == method_path)
321
+
322
+ if errors.height == 0:
323
+ return []
324
+
325
+ # Select relevant columns for debugging
326
+ error_details = errors.select([
327
+ "timestamp",
328
+ "service_name",
329
+ "full_method_path",
330
+ "method_name",
331
+ "sc_status",
332
+ "sc_substatus",
333
+ "sc_win32_status",
334
+ "time_taken",
335
+ "c_ip",
336
+ "cs_uri_query"
337
+ ]).sort("timestamp", descending=True).limit(limit)
338
+
339
+ return error_details.to_dicts()
340
+
341
+ def get_response_time_distribution(self, buckets: List[int] = None) -> Dict:
342
+ """Get response time distribution by buckets."""
343
+ if buckets is None:
344
+ buckets = [0, 50, 100, 200, 500, 1000, 3000, 10000]
345
+
346
+ df = self.filter_logs()
347
+
348
+ if df.height == 0:
349
+ return {}
350
+
351
+ distribution = {}
352
+ for i in range(len(buckets) - 1):
353
+ lower = buckets[i]
354
+ upper = buckets[i + 1]
355
+ count = df.filter(
356
+ (pl.col("time_taken") >= lower) & (pl.col("time_taken") < upper)
357
+ ).height
358
+ distribution[f"{lower}-{upper}ms"] = count
359
+
360
+ # Add bucket for values above last threshold
361
+ count = df.filter(pl.col("time_taken") >= buckets[-1]).height
362
+ distribution[f">{buckets[-1]}ms"] = count
363
+
364
+ return distribution
365
+
366
+ def get_rps_timeline(self, interval: str = "1m") -> pl.DataFrame:
367
+ """Get RPS over time with specified interval."""
368
+ df = self.filter_logs()
369
+
370
+ if df.height == 0:
371
+ return pl.DataFrame()
372
+
373
+ # Group by time interval
374
+ timeline = df.group_by_dynamic("timestamp", every=interval).agg([
375
+ pl.count().alias("requests")
376
+ ]).sort("timestamp")
377
+
378
+ return timeline
379
+
380
+
381
+ def analyze_multiple_logs(log_files: List[str]) -> Tuple[Dict, List[Dict]]:
382
+ """
383
+ Analyze multiple log files and generate combined report.
384
+
385
+ Args:
386
+ log_files: List of log file paths
387
+
388
+ Returns:
389
+ Tuple of (combined_stats, individual_stats)
390
+ """
391
+ individual_stats = []
392
+
393
+ for log_file in log_files:
394
+ parser = IISLogParser(log_file)
395
+ df = parser.parse()
396
+ analyzer = LogAnalyzer(df, parser.service_name)
397
+
398
+ stats = {
399
+ "summary": analyzer.get_summary_stats(),
400
+ "top_methods": analyzer.get_top_methods(),
401
+ "error_breakdown": analyzer.get_error_breakdown(),
402
+ "errors_by_method": analyzer.get_errors_by_method(n=10),
403
+ "response_time_dist": analyzer.get_response_time_distribution(),
404
+ "analyzer": analyzer,
405
+ }
406
+
407
+ individual_stats.append(stats)
408
+
409
+ # Calculate combined statistics
410
+ combined = {
411
+ "total_requests_before": sum(s["summary"]["total_requests_before"] for s in individual_stats),
412
+ "excluded_requests": sum(s["summary"]["excluded_requests"] for s in individual_stats),
413
+ "excluded_401": sum(s["summary"]["excluded_401"] for s in individual_stats),
414
+ "total_requests_after": sum(s["summary"]["total_requests_after"] for s in individual_stats),
415
+ "errors": sum(s["summary"]["errors"] for s in individual_stats),
416
+ "slow_requests": sum(s["summary"]["slow_requests"] for s in individual_stats),
417
+ }
418
+
419
+ return combined, individual_stats
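A note on the exclusion rule in `LogAnalyzer.filter_logs`: only requests that are *both* HEAD and Zabbix-related are dropped, so GET traffic from a Zabbix agent still counts. A minimal stdlib sketch on synthetic dict rows (field values here are made up for illustration) makes the predicate easy to verify outside of Polars:

```python
def is_monitoring_probe(row: dict) -> bool:
    """Mirrors LogAnalyzer.filter_logs: a row is dropped only when it is a
    HEAD request AND mentions Zabbix in the user agent or the URI stem."""
    return row["cs_method"] == "HEAD" and (
        "Zabbix" in row["cs_user_agent"] or "Zabbix" in row["cs_uri_stem"]
    )

rows = [
    {"cs_method": "HEAD", "cs_user_agent": "Zabbix 6.0", "cs_uri_stem": "/Service/Ping"},
    {"cs_method": "GET", "cs_user_agent": "Zabbix 6.0", "cs_uri_stem": "/Service/Ping"},
    {"cs_method": "GET", "cs_user_agent": "Mozilla/5.0", "cs_uri_stem": "/Service/Contact/Get"},
]
kept = [r for r in rows if not is_monitoring_probe(r)]
# Only the HEAD+Zabbix row is excluded; both GET requests survive.
```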
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ # Core dependencies
+ streamlit>=1.28.0
+ polars>=0.20.5
+ plotly>=5.17.0
+ pandas>=2.0.0
+ 
+ # Optional performance improvements
+ pyarrow>=13.0.0
run.sh ADDED
@@ -0,0 +1,20 @@
+ #!/bin/bash
+ # Launch script for IIS Log Analyzer
+ 
+ echo "🚀 Starting IIS Log Analyzer..."
+ echo ""
+ 
+ # Check if dependencies are installed
+ if ! python -c "import streamlit" 2>/dev/null; then
+     echo "📦 Installing dependencies..."
+     pip install -r requirements.txt
+     echo ""
+ fi
+ 
+ # Launch Streamlit app
+ echo "✓ Launching web application..."
+ echo "  URL: http://localhost:8501"
+ echo "  Press Ctrl+C to stop"
+ echo ""
+ 
+ streamlit run app.py --server.maxUploadSize=1024
test_parser.py ADDED
@@ -0,0 +1,118 @@
+ """
+ Test script for the IIS log parser.
+ """
+ 
+ import time
+ 
+ from log_parser import IISLogParser, LogAnalyzer
+ 
+ 
+ def test_log_file(file_path: str):
+     """Test parsing a single log file."""
+     print(f"\n{'='*80}")
+     print(f"Testing: {file_path}")
+     print(f"{'='*80}")
+ 
+     start_time = time.time()
+ 
+     # Parse
+     parser = IISLogParser(file_path)
+     df = parser.parse()
+     parse_time = time.time() - start_time
+     print(f"Service Name: {parser.service_name}")
+     print(f"✓ Parsed {df.height:,} log entries in {parse_time:.2f}s")
+ 
+     # Analyze
+     analyzer = LogAnalyzer(df, parser.service_name)
+     stats = analyzer.get_summary_stats()
+ 
+     analyze_time = time.time() - start_time - parse_time
+     print(f"✓ Analyzed in {analyze_time:.2f}s")
+ 
+     # Display summary
+     print("\n📊 Summary Statistics:")
+     print(f"  Total Requests (before): {stats['total_requests_before']:,}")
+     print(f"  Excluded Requests: {stats['excluded_requests']:,}")
+     print(f"  Total Requests (after): {stats['total_requests_after']:,}")
+     print(f"  Errors (≠200, ≠401): {stats['errors']:,}")
+     print(f"  Slow Requests (>3s): {stats['slow_requests']:,}")
+     print(f"  Peak RPS: {stats['peak_rps']:,} @ {stats['peak_timestamp']}")
+     print(f"  Avg Response Time: {stats['avg_time_ms']:,}ms")
+     print(f"  Max Response Time: {stats['max_time_ms']:,}ms")
+     print(f"  Min Response Time: {stats['min_time_ms']:,}ms")
+ 
+     # Top methods
+     print("\n🔝 Top 5 Methods:")
+     top_methods = analyzer.get_top_methods(5)
+     for i, method in enumerate(top_methods, 1):
+         print(f"  {i}. {method['method_name']}")
+         print(f"     Count: {method['count']:,} | Avg Time: {method['avg_time']:.1f}ms | Errors: {method['errors']}")
+ 
+     # Error breakdown
+     errors = analyzer.get_error_breakdown()
+     if errors:
+         print("\n❌ Error Breakdown:")
+         for error in errors:
+             print(f"  Status {error['sc_status']}: {error['count']:,} occurrences")
+     else:
+         print("\n✓ No errors found!")
+ 
+     # Errors by method
+     errors_by_method = analyzer.get_errors_by_method(5)
+     if errors_by_method:
+         print("\n⚠️ Top 5 Error-Prone Methods:")
+         for i, method_error in enumerate(errors_by_method, 1):
+             print(f"  {i}. {method_error['full_method_path']}")
+             print(f"     Total Calls: {method_error['total_calls']:,} | Errors: {method_error['error_count']:,} | "
+                   f"Error Rate: {method_error['error_rate_percent']:.2f}% | "
+                   f"Most Common Error: {method_error.get('most_common_error_status', 'N/A')}")
+     else:
+         print("\n✓ No method errors found!")
+ 
+     # Response time distribution
+     dist = analyzer.get_response_time_distribution()
+     print("\n⏱️ Response Time Distribution:")
+     for bucket, count in dist.items():
+         print(f"  {bucket}: {count:,}")
+ 
+     total_time = time.time() - start_time
+     print(f"\n⏱️ Total processing time: {total_time:.2f}s")
+ 
+     return stats
+ 
+ 
+ if __name__ == "__main__":
+     # Test with both sample log files
+     files = [
+         "administrator_rhr_ex250922.log",
+         "customer_rhr_ex250922.log"
+     ]
+ 
+     all_stats = []
+     total_start = time.time()
+ 
+     for file_path in files:
+         try:
+             stats = test_log_file(file_path)
+             all_stats.append(stats)
+         except Exception as e:
+             print(f"\n❌ Error processing {file_path}: {e}")
+             import traceback
+             traceback.print_exc()
+ 
+     # Combined summary
+     if len(all_stats) > 1:
+         print(f"\n{'='*80}")
+         print("COMBINED STATISTICS")
+         print(f"{'='*80}")
+         total_requests = sum(s['total_requests_after'] for s in all_stats)
+         total_errors = sum(s['errors'] for s in all_stats)
+         total_slow = sum(s['slow_requests'] for s in all_stats)
+ 
+         print(f"Total Requests (all services): {total_requests:,}")
+         print(f"Total Errors (all services): {total_errors:,}")
+         print(f"Total Slow Requests (all services): {total_slow:,}")
+ 
+     total_elapsed = time.time() - total_start
+     print(f"\n⏱️ Total elapsed time: {total_elapsed:.2f}s")
+     print("\n✓ All tests completed!")
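For reference, the bucketing performed by `LogAnalyzer.get_response_time_distribution` can be reproduced with the standard library alone. This sketch uses the same default bucket edges as the analyzer (half-open `[lower, upper)` ranges plus an overflow bucket) and assumes non-negative times:

```python
import bisect
from collections import Counter

def bucket_distribution(times_ms, edges=(0, 50, 100, 200, 500, 1000, 3000, 10000)):
    """Count response times into [lower, upper) buckets plus an overflow
    bucket, matching LogAnalyzer.get_response_time_distribution."""
    labels = [f"{edges[i]}-{edges[i + 1]}ms" for i in range(len(edges) - 1)]
    labels.append(f">{edges[-1]}ms")
    counts = Counter()
    for t in times_ms:
        # bisect_right - 1 yields the index of the bucket's lower edge;
        # anything >= edges[-1] falls into the overflow label.
        counts[labels[bisect.bisect_right(edges, t) - 1]] += 1
    return {label: counts.get(label, 0) for label in labels}

dist = bucket_distribution([10, 75, 75, 4000, 12000])
# dist["0-50ms"] == 1, dist["50-100ms"] == 2, dist[">10000ms"] == 1
```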