'''
Achmad Dhani

Objective: Create an EDA page that explains the insights from the exploratory data analysis
'''

import streamlit as st
import pandas as pd
from PIL import Image

def run():
    '''
    Function for EDA page
    '''
    st.title('Exploratory Data Analysis Section')

    df = pd.read_csv('water_potability.csv')  # read the raw dataset

#============================= Display Data ===============================

    col1, col2 = st.columns(2)
    
    with col1.expander("View the top 10 entries of the original dataset"):
        st.table(df.head(10))
    
    with col2.expander("View the bottom 10 entries of the original dataset"):
        st.table(df.tail(10))


#============================= Correlation =====================================
    st.subheader('Correlation Matrix Between The Chemicals')
    col3, col4 = st.columns(2)

    # 1st image
    col3.write('Pearson Correlation Matrix')
    image1 = Image.open('pearsons.png')
    col3.image(image1, caption='Figure 1 Pearson Correlation Matrix of All Chemicals')

    # 2nd image
    col4.write('Spearman Correlation Matrix')
    image2 = Image.open('spearman.png')
    col4.image(image2, caption='Figure 2 Spearman Correlation Matrix of All Chemicals')

    # explanation
    with st.expander('Explanation'):
        st.caption(
            '''
            Based on both visualizations, most of the variables have no relationship with one another, with a few exceptions.

            - `Hardness` and `ph` have a very low positive value in Spearman but a value close to 0 in Pearson. This suggests there might be a very weak
            positive non-linear relationship.
            - `Sulfate` and `Solids` have a very low negative value in both Spearman and Pearson. This suggests there might be a very weak
            negative linear relationship.
            '''
            )
        
#================================ ph ==========================================
    
    st.subheader('ph Values Distribution')
    image3 = Image.open('ph.png')
    st.image(image3, caption='Figure 3 ph values distribution histogram', width=600)

    # explanation
    with st.expander('Explanation'):
        st.caption(
            '''
            - Most of the water samples have a `ph` between `5` and `9`.
            - The visualization also suggests that many samples fall within the `ph` range of drinkable water, but that alone does not mean the water is drinkable.
            - This could mean that most of the water samples were taken from contaminated water bodies.
            '''
            )

#================================ Missing Values ===============================
    st.subheader('Missing Values Visualizations')

    # missing plot
    st.write('Missing Values Bar Plot')
    image4 = Image.open('missing_values.png')
    st.image(image4, caption='Figure 4 Bar plot of missing values of each column')

    # displaying explanation
    with st.expander('Explanation'):
        st.caption(
            '''
            **From Data Loading**
            
            - Total missing values in the dataset: 1434
            
            - Columns with missing values: 
            
            `['ph', 'Sulfate', 'Trihalomethanes']`
            
            Number of missing values per column:
            >ph                 `491`
            >
            >Sulfate            `781`
            >
            >Trihalomethanes    `162`
            >
            >dtype: int64
            
            Missing data percentage (%):
            >ph                 `15`
            >
            >Sulfate            `24`
            >
            >Trihalomethanes     `5`
            '''
            )
        
    # missing matrix
    st.write('Missing Values Correlation Matrix')
    image5 = Image.open('missing_corr.png')
    st.image(image5, caption='Figure 5 Correlation matrix of the missing values')
    
        
    # display explanation
    with st.expander('Explanation'):
        st.caption(
            '''
            - Based on the visualization above, the missing values are not correlated with one another, so the missingness can be considered `completely random`.
            - The missing values could be random because the person who took the water sample did not have the equipment to measure that chemical level.
            '''
            )

#================================== PCA =============================
    
    st.subheader('Feature Importance')
    image6 = Image.open('PCA.png')
    st.image(image6, caption='Figure 6 Line chart of cumulative explained variance ratio against number of components')

    # displaying explanation
    with st.expander('Explanation'):
        st.caption(
            '''
            - Based on the PCA visualization, there is a linear relationship between the number of components and the cumulative explained variance ratio (EVR).
            - This suggests that each feature is important and retains unique information about the dataset.
            '''
            )