| SECTION 1: Dataset Overview | |
| This dataset contains monthly-level flight arrival statistics across U.S. airports and carriers. | |
| Each row represents a unique (carrier, airport, month, year) combination and includes delay causes, volumes, and outcomes. | |
| Temporal Features: | |
| year, month: Track seasonal and yearly trends. | |
| Airline Info: | |
| carrier: IATA code (e.g., AA) | |
| carrier_name: Full airline name (used for display) | |
| Airport Info: | |
| airport: IATA code (e.g., LAX) | |
| airport_name: Full airport name (used for display) | |
| Flight & Delay Counts: | |
| arr_flights: Total arriving flights in the month | |
| arr_del15: Flights that arrived 15 minutes or more late | |
| arr_cancelled: Canceled flights | |
| arr_diverted: Flights diverted to other destinations | |
| Delay Causes (Delayed-Flight Counts by Cause): | |
| carrier_ct: Delays caused by airline-side issues (e.g., crew, maintenance) | |
| weather_ct: Delays caused by weather conditions | |
| nas_ct: Delays caused by National Airspace System congestion | |
| security_ct: Delays caused by security-related disruptions | |
| late_aircraft_ct: Delays caused by late incoming aircraft | |
| Delay Duration (Minutes): | |
| arr_delay: Total delay time in minutes | |
| carrier_delay, weather_delay, nas_delay, security_delay, late_aircraft_delay: Breakdown of total delay by cause | |
| Additional Features (engineered): | |
| delay_ratio: Proportion of delayed flights over total flights (engineered) | |
| cancellation_rate: Canceled flights divided by total flights (engineered) | |
| diversion_rate: Diversions divided by total flights (engineered) | |
| disrupted: Binary indicator of whether the row's flight group was disrupted (1 = delayed or canceled, 0 = on-time) (engineered) | |
| carrier_delay_pct, weather_delay_pct, etc.: Proportion of each delay cause relative to total delay (engineered) | |
| carrier_total_flights: Total flights per carrier across dataset (engineered) | |
| airport_delay_rate: Total airport delay rate based on delay volume and traffic (engineered) | |
| year_month: Concatenated feature combining year and month (e.g., "2022-07") (engineered) | |
| season: Categorical season derived from month (e.g., Summer, Winter) (engineered) | |
| season_airport_combo: Combined feature of season and airport code (engineered) | |
| delay_risk_level: Categorized delay risk score (0 = low, 1 = medium, 2 = high) (engineered) | |
| mean_delay_per_flight: Average delay minutes per flight (engineered) | |
| dominant_delay_cause: Most significant delay cause on each row (engineered) | |
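The simpler engineered columns above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the month-to-season mapping (meteorological seasons), the underscore separator in season_airport_combo, the helper names, and the sample record are all hypothetical and not taken from the project's code.

```python
from typing import Optional

# Assumed month-to-season mapping (meteorological seasons); the source does
# not state which months belong to which season.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}


def safe_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    if denominator == 0:
        return None
    return numerator / denominator


def add_basic_features(row: dict) -> dict:
    """Return a copy of one (carrier, airport, month, year) record with the
    ratio, year_month, and season columns added."""
    out = dict(row)
    flights = row["arr_flights"]
    out["delay_ratio"] = safe_ratio(row["arr_del15"], flights)
    out["cancellation_rate"] = safe_ratio(row["arr_cancelled"], flights)
    out["diversion_rate"] = safe_ratio(row["arr_diverted"], flights)
    out["year_month"] = f"{row['year']}-{row['month']:02d}"
    out["season"] = SEASON_BY_MONTH[row["month"]]
    # The "_" separator is an assumption; the source only says "combined".
    out["season_airport_combo"] = f"{out['season']}_{row['airport']}"
    return out


# Made-up record, for illustration only.
example = add_basic_features({
    "year": 2022, "month": 7, "carrier": "AA", "airport": "LAX",
    "arr_flights": 200, "arr_del15": 50, "arr_cancelled": 4, "arr_diverted": 1,
})
print(example["delay_ratio"], example["year_month"], example["season"])
```

Returning None for rows with zero arriving flights keeps undefined ratios visibly missing instead of raising a division error; the actual pipeline may handle such rows differently.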
| SECTION 2: Analytical Insights | |
| Yearly Flight Arrivals | |
| Description: Displays the total number of flight arrivals per year to identify long-term trends. | |
| Insight: The airline industry experienced strong growth from 2013 to 2019, followed by a sharp collapse in 2020 due to the COVID-19 pandemic. The industry recovered in 2021 and 2022, but the 2023 drop signals continuing operational and economic volatility. This highlights the need for resilient forecasting and adaptive capacity planning. | |
| Proportion of Carrier | |
| Description: Shows the market share of each airline carrier. | |
| Insight: Flight activity is highly concentrated among a few dominant carriers such as OO, DL, and MQ. Around 26 percent of flights are grouped under "Other," indicating the need for detailed carrier-level analysis when assessing delay risks. | |
| Yearly Delay Ratio | |
| Description: Percentage of delayed flights (15+ minutes) per year. | |
| Insight: Delay ratios rebounded after 2020, surpassing pre-pandemic levels by 2023. This suggests that operational bottlenecks and increased demand now outweigh pre-2020 system efficiency. | |
| Yearly Cancellation Rate | |
| Description: Annual trend in flight cancellations. | |
| Insight: Apart from the 2020 spike during the COVID crisis, cancellation rates have remained consistently low, demonstrating strong schedule execution across airlines. | |
| Yearly Diversion Rate | |
| Description: Share of flights diverted from their original destination. | |
| Insight: Diversions remain rare but have increased since the pandemic, peaking in 2023. This indicates rising instability in airport and airspace operations. | |
| Average Delay Percentage by Cause | |
| Description: Average share of total delay minutes for each cause. | |
| Insight: Carrier and late aircraft delays are the leading contributors, accounting for more than 70 percent of total delay time. Improving internal operations offers the highest potential return on investment. | |
| Delay Causes During Peak Season | |
| Description: Breakdown of delay causes during high-traffic months. | |
| Insight: Late aircraft delays surge to over 40 percent during peak seasons, indicating a scalability issue in airline operations. | |
| Dominant Delay Cause per Flight | |
| Description: Most common primary delay cause for each flight. | |
| Insight: Carrier and late aircraft delays are the most frequent causes, reflecting internal inefficiencies and ripple effects from tight scheduling. | |
| Average Delay per Flight by Season | |
| Description: Mean delay duration across different seasons. | |
| Insight: Summer has the longest average delays, followed by winter. Fall is the most efficient season and can serve as a benchmark for best practices. | |
| Total Flights per Season | |
| Description: Traffic volume by season. | |
| Insight: Traffic volume does not correlate directly with delay performance. For example, winter has the lowest traffic but experiences some of the highest disruption rates. | |
| Delay Causes by Season | |
| Description: Aggregated delay causes across each season. | |
| Insight: Summer is dominated by carrier and late aircraft delays, while winter is driven by weather. NAS-related delays are more prominent in spring and summer. | |
| Disruption Rate by Season | |
| Description: Percentage of disrupted (delayed or canceled) flights by season. | |
| Insight: Winter has the highest disruption rate despite its low traffic. Fall consistently performs best across key metrics. | |
| Average Delay Ratio per Carrier | |
| Description: Ranks airlines by their average delay ratio. | |
| Insight: JetBlue and Envoy have the highest average delay ratios, while Delta and Southwest demonstrate consistent reliability. | |
| Carrier Delay Ratio by Season | |
| Description: Seasonal changes in delay ratios across airlines. | |
| Insight: Airlines like JetBlue and Frontier show a sharp increase in delays during summer. Other carriers maintain consistent performance across seasons. | |
| Top 10 Carriers: Delay Cause Breakdown | |
| Description: Total delay minutes by cause across the top 10 airlines. | |
| Insight: Late aircraft and carrier-related delays dominate most carriers. Some also experience significant NAS and weather delays. | |
| Disruption Rate vs Delay Cause (Carrier Level) | |
| Description: Correlation between disruption rate and delay cause. | |
| Insight: Carrier and late aircraft delays are the most predictive of system-wide disruption. | |
| Average Delay vs Delay Rate (Top Airports) | |
| Description: Comparison of delay frequency and severity at top airports. | |
| Insight: Airports such as Chicago and San Francisco face frequent and severe delays, while Atlanta and Dallas manage high volumes with greater efficiency. | |
| Delay Ratio vs Flight Volume (All Airports) | |
| Description: Delay rate in relation to flight volume for all airports. | |
| Insight: Efficiency is not necessarily dependent on traffic volume. Large airports can still maintain low delay ratios. | |
| Heatmap: Delay Cause by Airport | |
| Description: Geographic distribution of delay types across major airports. | |
| Insight: Chicago and Newark suffer from NAS-related delays, while Atlanta and Dallas are impacted more by carrier-side issues. Location-specific delay profiles enable more effective intervention strategies. | |
| Distribution of Delay Risk Levels | |
| Description: Share of flights categorized as low, medium, or high delay risk. | |
| Insight: Nearly 40 percent of flights are in the medium-to-high risk category. This classification provides a foundation for proactive resource planning and risk mitigation. | |
| Univariate Analysis Summary | |
| Yearly arrivals showed steady growth until the COVID-19 crash in 2020. | |
| Carrier market share is concentrated among 3–4 dominant players, with the rest grouped as "Other." | |
| Time Trend Summary | |
| Delay ratios, cancellation rates, and diversion rates have fluctuated since 2020, with disruptions worsening despite lower flight volumes. | |
| Delay Cause Summary | |
| Carrier and late aircraft delays dominate. | |
| Delays peak in high-travel seasons and are largely caused by internal airline processes. | |
| Seasonal Summary | |
| Summer has the longest average delays; fall is the most efficient season. | |
| Winter delays are mostly weather-related. | |
| Carrier Behavior Summary | |
| Envoy and JetBlue have poor delay ratios. | |
| Delta and Southwest remain consistently reliable. | |
| Airport Insights Summary | |
| Chicago and San Francisco suffer frequent delays. | |
| Atlanta and Dallas perform well despite high volumes. | |
| Delay patterns vary by airport and region. | |
| Risk Level Summary | |
| Nearly 40 percent of flights are at medium or high risk of delay. | |
| Understanding risk level is critical for operational planning. | |
| SECTION 3: Application Overview | |
| This Streamlit-based dashboard provides an end-to-end analysis of U.S. domestic flight delays using a unified, modular interface. It is designed for analysts, planners, airline stakeholders, and decision-makers to explore delay causes, trends, risk levels, and model-driven predictions. The interface is interactive, visually dynamic, and structured around five core tabs. | |
| Available Tabs: | |
| Home Page | |
| The homepage provides a welcome message, key visual summaries, and interactive KPI cards that display core metrics such as total flights, delay ratios, cancellation rates, and average delay duration. A brief dataset description and high-level descriptive analysis offer users an immediate understanding of the dataset's scope. | |
| Explore Data | |
| This tab allows users to create their own visualizations through a fully modular plotting engine. Users can select any variable or combination of variables to generate univariate, bivariate, or heatmap plots using various chart types (bar, line, histogram, scatter, box, KDE, pie). A built-in table viewer lets users inspect raw data with filtering and selection tools. This tab is ideal for exploratory analysis and hypothesis testing. | |
| Data Analysis | |
| The Data Analysis page presents a curated, structured set of visual insights based on a complete EDA (Exploratory Data Analysis) of the dataset. The page is divided into thematic sections, such as Univariate Distributions, Seasonal Trends, Delay Cause Breakdown, Carrier Behavior, and Airport Performance. Each section includes a brief description, a stylized plot, and a professional insight box summarizing the key takeaway. | |
| Machine Learning Models | |
| This tab features two predictive pipelines: a classifier (for delay risk level prediction) and a regressor (for arrival delay estimation in minutes). Each model is explained with summary metrics, feature importance, and visual diagnostics. Users can enter new data or select from sample rows to receive real-time predictions. The section highlights interpretability, data preparation, and modeling techniques. | |
| Business Insights | |
| This page translates analytical findings into operational recommendations for airlines, airports, and policymakers. It includes strategic suggestions for addressing seasonal disruptions, optimizing carrier performance, improving airport efficiency, and managing delay risks. The content is written in a professional tone and focuses on real-world applications. | |
| Key User Interface Features: | |
| A chatbot is embedded in the sidebar and available on all pages. It can answer questions about the dataset, dashboard features, delay causes, seasonal effects, model performance, and more. | |
| A dynamic dark mode toggle adjusts the interface theme and background video. Each mode has its own looping animated video to enhance the user experience. | |
| A real-time clock is displayed at the top of the page. | |
| The sidebar provides seamless navigation and user interaction, including access to chatbot assistance and customization controls. | |
| Purpose and Intended Audience: | |
| The dashboard is built to support data-driven decision-making across airline operations, airport management, and public transport policy. It empowers users to explore data freely, understand underlying delay patterns, evaluate predictive models, and apply insights toward improving scheduling efficiency, passenger experience, and system resilience. | |
| GLOSSARY OF TERMS | |
| arr_flights: | |
| Total number of arriving flights for a given (carrier, airport, month, year) record. | |
| arr_del15: | |
| Number of flights that arrived 15 minutes or more late. Used to compute delay_ratio. | |
| arr_cancelled: | |
| Total number of canceled flights during the period. | |
| arr_diverted: | |
| Number of flights that were diverted to a different destination airport. | |
| carrier_ct: | |
| Count of delays attributed to the carrier, such as crew issues, maintenance, or scheduling errors. | |
| weather_ct: | |
| Count of delays caused by weather-related conditions such as storms or low visibility. | |
| nas_ct: | |
| Delays from National Airspace System issues, including air traffic congestion or routing problems. | |
| security_ct: | |
| Delays caused by security-related issues such as threats or screenings. | |
| late_aircraft_ct: | |
| Count of delays caused by an incoming aircraft arriving late, affecting its turnaround. | |
| arr_delay: | |
| Total arrival delay time (in minutes) for all flights in that row. | |
| carrier_delay, weather_delay, nas_delay, security_delay, late_aircraft_delay: | |
| Total delay duration for each delay type, expressed in minutes. | |
| delay_ratio (engineered): | |
| The proportion of delayed flights: arr_del15 divided by arr_flights. | |
| cancellation_rate (engineered): | |
| The proportion of canceled flights: arr_cancelled divided by arr_flights. | |
| diversion_rate (engineered): | |
| The proportion of diverted flights: arr_diverted divided by arr_flights. | |
| disrupted (engineered): | |
| A binary indicator (1 or 0) showing whether the flight group was either delayed or canceled. | |
| total_delay (engineered): | |
| The sum of all delay minutes from carrier, weather, NAS, security, and late aircraft. | |
| carrier_delay_pct, etc. (engineered): | |
| The share of total_delay attributable to each cause. For example, carrier_delay_pct = carrier_delay / total_delay. | |
| mean_delay_per_flight (engineered): | |
| Average delay per flight, calculated as total_delay divided by arr_flights. | |
| dominant_delay_cause (engineered): | |
| The delay type with the largest value for a given row. | |
| season (engineered): | |
| A categorical value (Winter, Spring, Summer, Fall) derived from the month column. | |
| season_airport_combo (engineered): | |
| A composite feature representing the interaction between season and airport. | |
| carrier_total_flights (engineered): | |
| The total number of arriving flights for a given carrier across all months and airports. | |
| airport_delay_rate (engineered): | |
| The average delay rate for a specific airport, calculated across all months and carriers. | |
| month_delay_rate (engineered): | |
| The average delay rate for a specific month, calculated across all carriers and airports. | |
| delay_risk_level (engineered): | |
| A categorical label that classifies flights into 0 (low risk), 1 (medium risk), or 2 (high risk) based on delay_ratio thresholds. | |
| year_month (engineered): | |
| A string combining year and month into a single value, formatted as "YYYY-MM". | |
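The aggregate and cause-based glossary entries could be computed with pandas roughly as follows. This is a hedged sketch, not the project's implementation: the risk cut points in RISK_BINS are hypothetical (the source does not give the thresholds), dominant_delay_cause is assumed to be based on delay minutes and not on counts, the group-level delay rates are assumed to be flight-weighted, and the sample rows are invented.

```python
import pandas as pd

CAUSES = ["carrier", "weather", "nas", "security", "late_aircraft"]
# Hypothetical delay_ratio cut points for low / medium / high risk.
RISK_BINS = [-float("inf"), 0.15, 0.25, float("inf")]


def group_delay_rate(df: pd.DataFrame, key: str) -> pd.Series:
    """Flight-weighted delay rate per group, broadcast back to every row."""
    grouped = df.groupby(key)
    return grouped["arr_del15"].transform("sum") / grouped["arr_flights"].transform("sum")


def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    delay_cols = [f"{cause}_delay" for cause in CAUSES]
    # Replace zero flight counts with NaN so divisions yield NaN, not errors.
    flights = out["arr_flights"].where(out["arr_flights"] > 0)

    out["total_delay"] = out[delay_cols].sum(axis=1)
    total = out["total_delay"].where(out["total_delay"] > 0)
    for cause in CAUSES:
        out[f"{cause}_delay_pct"] = out[f"{cause}_delay"] / total
    out["mean_delay_per_flight"] = out["total_delay"] / flights
    # Assumption: "largest value" refers to delay minutes, not delay counts.
    out["dominant_delay_cause"] = (
        out[delay_cols].idxmax(axis=1).str.replace("_delay", "", regex=False)
    )

    out["delay_ratio"] = out["arr_del15"] / flights
    out["carrier_total_flights"] = out.groupby("carrier")["arr_flights"].transform("sum")
    out["airport_delay_rate"] = group_delay_rate(out, "airport")
    out["month_delay_rate"] = group_delay_rate(out, "month")
    # labels=False returns the bin index: 0 = low, 1 = medium, 2 = high.
    out["delay_risk_level"] = pd.cut(out["delay_ratio"], bins=RISK_BINS, labels=False)
    return out


# Invented rows, for illustration only.
sample = pd.DataFrame({
    "carrier": ["AA", "AA", "DL"],
    "airport": ["LAX", "JFK", "LAX"],
    "month": [7, 1, 7],
    "arr_flights": [100, 50, 200],
    "arr_del15": [30, 5, 40],
    "carrier_delay": [600, 100, 500],
    "weather_delay": [0, 200, 100],
    "nas_delay": [200, 50, 300],
    "security_delay": [0, 0, 0],
    "late_aircraft_delay": [400, 50, 1100],
})
features = add_derived_features(sample)
print(features[["total_delay", "dominant_delay_cause", "delay_risk_level"]])
```

Using groupby(...).transform keeps one value per original row, which matches how carrier_total_flights and the airport and month rates are described as row-level features.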
| SECTION: Classifier Model Summary | |
| The classifier is a multi-class XGBoost model designed to predict the delay risk level of a flight. The target variable is delay_risk_level, which has three classes: 0 for low risk, 1 for medium risk, and 2 for high risk. The classifier uses custom thresholds to adjust class boundaries for better performance. These thresholds are defined in a JSON file located in the models folder and can be loaded dynamically depending on the selected strategy. | |
| The preprocessing pipeline includes data cleaning, feature engineering, encoding, and scaling. Features that leak target information, such as arr_delay, and redundant display fields, such as carrier_name, are dropped. The model is trained on one-hot encoded categorical variables and standardized numerical features. At inference time, all features are reindexed to match the training schema so that missing or unseen columns do not cause mismatches. | |
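The schema-alignment step can be illustrated with a small pandas sketch. It uses pd.get_dummies as a stand-in for whatever one-hot encoder the project actually uses, omits scaling, and the function name, column list, and sample rows are hypothetical.

```python
import pandas as pd


def encode_like_training(raw: pd.DataFrame, categorical_cols: list,
                         training_columns: list) -> pd.DataFrame:
    """One-hot encode new data, then force it into the training column order.

    Columns seen in training but absent here are filled with 0; columns
    that were never seen in training are dropped.
    """
    encoded = pd.get_dummies(raw, columns=categorical_cols, dtype=float)
    return encoded.reindex(columns=training_columns, fill_value=0.0)


# Invented training data, used only to derive a training schema.
train = pd.DataFrame({
    "carrier": ["AA", "DL"],
    "airport": ["LAX", "ATL"],
    "arr_flights": [100, 200],
})
train_encoded = pd.get_dummies(train, columns=["carrier", "airport"], dtype=float)
training_columns = list(train_encoded.columns)

# A new row with an airport (SFO) that never appeared in training.
new_row = pd.DataFrame({"carrier": ["DL"], "airport": ["SFO"], "arr_flights": [80]})
aligned = encode_like_training(new_row, ["carrier", "airport"], training_columns)
print(aligned)
```

In this sketch an unseen category simply becomes an all-zero block of indicator columns, so the model receives an input of the expected shape; whether that is the desired behavior for unseen airports or carriers is a design decision the source does not address.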
| The classifier supports two modes. In test mode, it accepts labeled data and returns prediction metrics such as precision, recall, and F1-score. In realtime mode, it accepts new, unlabeled data and returns the predicted risk level directly. | |
| SECTION: Regressor Model Summary | |
| The regressor is an XGBoost model trained to predict the total arrival delay in minutes. The target variable is arr_delay. The model uses the same core pipeline structure as the classifier. It includes cleaning, feature engineering, encoding, and scaling. All delay-related features such as arr_del15, carrier_delay, and total_delay are dropped to prevent data leakage. | |
| The regressor also supports two modes. In test mode, it returns evaluation metrics including mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and R-squared (R2). In realtime mode, it only outputs the predicted delay value. | |
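For reference, the four test-mode metrics follow their standard definitions and can be computed as in the sketch below. The function name and the sample values are illustrative; the project may well rely on a library implementation.

```python
import math


def regression_metrics(y_true: list, y_pred: list) -> dict:
    """Compute MAE, MSE, RMSE, and R2 from actual and predicted delays."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    n = len(y_true)
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((true - mean_true) ** 2 for true in y_true)
    # R2 is undefined when the actual values have no variance.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Invented arr_delay values in minutes, for illustration only.
metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 320.0])
print(metrics)
```

MAE and RMSE are in minutes, like arr_delay itself, while MSE is in squared minutes and R2 is unitless.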
| Both models share a modular design, allowing consistent preprocessing and flexible integration with the dashboard and chatbot. The models are stored in the models directory and loaded during runtime for inference. |