# Homework5.1 / app.py
import pandas as pd
import streamlit as st
import altair as alt
import streamlit.components.v1 as components
st.set_page_config(page_title="Building Inventory Analysis", layout="wide")
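# Try to stretch the embedding iframe to the full viewport height.
components.html(
    """
    <script>
    // The lookup can return null when no iframe exists in this document,
    // so guard it instead of raising a TypeError in the browser console.
    const frame = document.querySelector('iframe');
    if (frame) {
        frame.style.height = '100vh';
    }
    </script>
    """,
    height=0,
)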
st.markdown(
"""
<style>
html, body, [data-testid="stAppViewContainer"] {
height: 100vh;
overflow: hidden;
}
</style>
""",
unsafe_allow_html=True,
)
# Load and clean dataset
url = "https://raw.githubusercontent.com/UIUC-iSchool-DataViz/is445_data/main/building_inventory.csv"
df = pd.read_csv(url, na_values={'Year Acquired': 0, 'Year Constructed': 0, 'Square Footage': 0})
# Displaying the Dataset Overview
st.header("Building Inventory Dataset Analysis")
st.write("Below are the first 10 rows of the dataset:")
st.write(df.head(10))
st.write(f"The shape of the dataset before cleaning is: {df.shape}")
# Drop irrelevant columns
columns_to_drop = [
    'Rep Full Name', 'Senator Full Name', 'Usage Description 3',
    'Usage Description 2', 'Congressional Full Name', 'Address',
]
df = df.drop(columns=columns_to_drop)
# Check and handle missing values
missing_values = df.isnull().sum()
st.subheader("Missing Values Before Cleaning")
st.write(missing_values)
# Drop rows where 'Year Acquired' or 'Year Constructed' is NaN
df = df.dropna(subset=['Year Acquired', 'Year Constructed'])
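# Label missing counties explicitly instead of dropping those rows
df['County'] = df['County'].fillna('Unknown')
# Impute missing square footage with the mean of the remaining rows
df['Square Footage'] = df['Square Footage'].fillna(df['Square Footage'].mean())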
st.subheader("Missing Values After Cleaning")
st.write(df.isnull().sum())
st.write(f"The shape of the dataset after cleaning is: {df.shape}")
# Visualization 1: Number of Buildings by County and Agency
st.markdown("""
<h4>Visualization 1: Number of Buildings by County and Agency</h4>
""", unsafe_allow_html=True)
# Group the data by 'County' and 'Agency Name' to get the count of buildings
county_agency_building = df.groupby(['County', 'Agency Name']).size().reset_index(name='Number of Buildings')
# Create a stacked bar chart with a compact legend on the right
stacked_bar_chart_county = (
    alt.Chart(county_agency_building)
    .mark_bar()
    .encode(
        alt.X('County:N', title='County', sort='-y'),
        alt.Y(
            'Number of Buildings:Q',
            title='Number of Buildings',
            scale=alt.Scale(domain=[0, 550]),
        ),
        color=alt.Color(
            'Agency Name:N',
            title='Agency Name',
            legend=alt.Legend(
                orient='right',
                padding=0,
                symbolSize=50,
                labelFontSize=10,
                labelOverlap="greedy",
                columnPadding=0,
                rowPadding=0,
            ),
        ),
        tooltip=['County', 'Agency Name', 'Number of Buildings'],
    )
    .properties(
        width=800,
        height=500,
        title="Stacked Bar Graph for Number of Buildings by County and Agency",
    )
)
# Display the stacked bar chart in Streamlit
st.altair_chart(stacked_bar_chart_county, theme="streamlit", use_container_width=True)
# Write-up for Stacked Bar Chart
st.markdown("""
**Number of Buildings by County and Agency**

This visualization shows how buildings are distributed across counties and agencies, so that agency activity can be compared within each county.

I chose a **stacked bar chart** because it shows each county's total number of buildings and, within that total, the share that belongs to each agency. The **x-axis** lists the counties, which are the main categories being compared, and the **y-axis** shows the number of buildings. **Color** distinguishes the agencies, and the tooltips show the exact county, agency, and building count when hovering over a bar segment.

If I had more time, I would add **interactivity** through filters that let users select the counties or agencies they are interested in and explore individual trends. I would also **show proportions** in the tooltip, so that each agency's share of a county's buildings is visible alongside the raw count.
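
A minimal sketch of the proportion idea, using made-up counts rather than the real dataset. The `Share of County` column and the toy rows are assumptions for illustration; the other column names mirror the chart above.

```python
import pandas as pd

# Toy counts shaped like county_agency_building (values are invented).
counts = pd.DataFrame(
    {
        "County": ["Cook", "Cook", "Sangamon"],
        "Agency Name": ["Agency A", "Agency B", "Agency A"],
        "Number of Buildings": [30, 10, 5],
    }
)

# Total buildings per county, repeated on every row of that county.
county_totals = counts.groupby("County")["Number of Buildings"].transform("sum")

# Each agency's share of its county's buildings, as a fraction of 1.
counts["Share of County"] = counts["Number of Buildings"] / county_totals

print(counts)
```

The extra column could then be added to the chart's tooltip list, for example as `alt.Tooltip('Share of County:Q', format='.1%')`.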
""", unsafe_allow_html=True)
# Visualization 2: Bubble Chart for County, Total Floors, and Square Footage
st.markdown("""
<h4>Visualization 2: County, Total Floors, and Square Footage</h4>
""", unsafe_allow_html=True)
bubble_chart = alt.Chart(df).mark_circle().encode(
    x=alt.X('County:N', title='County'),
    y=alt.Y('sum(Square Footage):Q', title='Total Square Footage (sq ft)'),
    size=alt.Size('sum(Total Floors):Q', title='Total Floors'),
    color=alt.Color('County:N', scale=alt.Scale(scheme='category20'), title='County'),
    tooltip=['County', 'sum(Square Footage)', 'sum(Total Floors)'],
).properties(
    width=800,
    height=500,
    title="Relationship Between County, Square Footage, and Total Floors",
)
st.altair_chart(bubble_chart, theme="streamlit", use_container_width=True)
# Write-up for Bubble Chart
st.markdown("""
**County, Total Floors, and Square Footage Relationship**

This bubble chart relates three features: the counties where buildings are located, the total square footage of the buildings in each county, and the total number of floors in each county. Its purpose is to show how counties compare on both measures and whether counties with more square footage also have more floors.

For the **design choices**, I chose a **bubble chart** because it displays three variables at once. The **x-axis** represents the counties, the **y-axis** shows the summed square footage of each county's buildings, and the **size** of each bubble represents the summed number of floors in that county. I applied the **"category20" color scheme** to the counties; because this scheme has only 20 colors, colors repeat when there are more than 20 counties, so the x-axis label and tooltip remain the reliable way to identify a county. The **tooltips** show the county, total square footage, and total floors when hovering over a bubble.

If I had more time, I would add **interactive filters** that let users filter by specific counties or by a range of floors. Another improvement would be a **scroll bar** for the legend: there are many counties, and not all of them are visible in the legend at once.
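
A minimal sketch of the filtering step, assuming the selections arrive as a list of county names and a `(low, high)` floor range. `filter_buildings` is a hypothetical helper, and the rows below are invented rather than taken from the dataset.

```python
import pandas as pd


def filter_buildings(frame, counties=None, floor_range=None):
    # Start with every row selected, then narrow by each active filter.
    mask = pd.Series(True, index=frame.index)
    if counties:
        mask &= frame["County"].isin(counties)
    if floor_range is not None:
        low, high = floor_range
        # between() includes both endpoints by default.
        mask &= frame["Total Floors"].between(low, high)
    return frame[mask]


# Toy rows using the same column names as the chart above.
toy = pd.DataFrame(
    {
        "County": ["Cook", "Cook", "Sangamon", "Lake"],
        "Total Floors": [1, 12, 3, 2],
        "Square Footage": [900.0, 50000.0, 4000.0, 1500.0],
    }
)

subset = filter_buildings(toy, counties=["Cook", "Lake"], floor_range=(1, 5))
print(subset)
```

In the app, the two arguments could come from `st.multiselect` and `st.slider`, with the filtered frame passed to `alt.Chart` in place of `df`.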
""", unsafe_allow_html=True)
# Visualization 3: Heatmap for Building Count by County and Status
st.markdown("""
<h4>Visualization 3: Building Count by County and Status</h4>
""", unsafe_allow_html=True)
heatmap = alt.Chart(df).mark_rect().encode(
    x=alt.X('County:N', title='County'),
    y=alt.Y('Bldg Status:N', title='Building Status'),
    color=alt.Color('count():Q', scale=alt.Scale(scheme='blues'), title='Count of Buildings'),
    tooltip=['County', 'Bldg Status', 'count()'],
).properties(
    width=800,
    height=500,
    title="Heatmap of Building Count by County and Status",
)
st.altair_chart(heatmap, theme="streamlit", use_container_width=True)
# Write-up for Heatmap
st.markdown("""
**Building Count by County and Status**

This heatmap shows how building counts are distributed across counties and building statuses, such as in use, in progress, or abandoned. The goal is to let users see quickly which counties have the most buildings and how those buildings are split across statuses.

For the **design choices**, I selected a **heatmap** because it visualizes counts across two categorical variables, here counties and building statuses. The **x-axis** represents the counties, and the **y-axis** shows the building statuses (e.g., In use, Progress, Abandon). I chose the sequential **"blues" color scheme**, in which darker shades represent higher building counts, so dense county and status combinations stand out. **Tooltips** display the exact building count for each county and status combination when hovering over a cell.

If I had more time, I would enhance the **interactivity** of the heatmap by letting users zoom into specific counties or statuses. I would also add a filter for selecting specific counties, because the heatmap contains a large number of them and not all are immediately visible.
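
A minimal sketch of one way to narrow the heatmap to a few counties, assuming the busiest counties are the ones of interest. The status labels and rows are invented for illustration, and the cutoff of 2 is an arbitrary choice.

```python
import pandas as pd

# Toy rows using the same two columns as the heatmap (values are invented).
toy = pd.DataFrame(
    {
        "County": ["Cook", "Cook", "Cook", "Sangamon", "Sangamon", "Lake"],
        "Bldg Status": ["In Use", "In Use", "Abandon", "In Use", "In Progress", "In Use"],
    }
)

# County-by-status counts, the same numbers the heatmap colors encode.
status_counts = pd.crosstab(toy["County"], toy["Bldg Status"])

# Keep only the counties with the most buildings overall.
top_counties = status_counts.sum(axis=1).nlargest(2).index.tolist()
top_rows = toy[toy["County"].isin(top_counties)]

print(status_counts)
print(top_counties)
```

Passing `top_rows` to `alt.Chart` instead of the full frame would keep the x-axis short enough to read every county label.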
""", unsafe_allow_html=True)
st.markdown("""
**Thank you for exploring the Building Inventory Analysis with me!**
""", unsafe_allow_html=True)
st.markdown("""
**~By Janhavi Tushar Zarapkar**
""", unsafe_allow_html=True)