
Data_analysis / app.py

Phani1008

Update app.py

9a56cc2 verified over 1 year ago

	import streamlit as st
	import math
	from functools import reduce

	st.title("INTRODUCTION TO STATISTICS :chart_with_upwards_trend:")

	st.write("In this field, we will work with data using the Python programming language. The term Data Analysis indicates that it focuses on handling data. This involves gathering, cleaning, and then analyzing the data to extract valuable insights. Now, let's explore what data means.")

	st.header("What is Data?")

	st.write("Put simply, data is a collection of information: facts or pieces of information that can be measured. It comes in various forms, such as:")
	st.markdown("- Images")
	st.markdown("- Text")
	st.markdown("- Video")
	st.markdown("- Audio")

	st.write("Not all data is created equal. Data can come in various forms, and knowing how to classify it is essential for choosing the right tools and methods to analyze it. Broadly, data can be classified into three main categories: ")
	st.markdown("- Structured")
	st.markdown("- Semi-structured")
	st.markdown("- Unstructured")
	st.write("Each type of data has its own characteristics, advantages, and challenges when it comes to processing and extracting insights.")

	st.subheader("Structured Data:")
	st.write("Structured data is the most organized and easily accessible form of data. It refers to information that is highly organized and formatted in a way that can be easily stored, accessed, and processed by machines. Think of structured data as data that fits neatly into rows and columns, like in a spreadsheet or a relational database.")
	st.write("Examples : ")
	st.markdown("- Databases: Tables in SQL databases where each column represents a different attribute (e.g., name, age, salary), and each row represents a record.")
	st.markdown("- Excel Sheets: Rows and columns filled with categorical and numerical data.")

	st.subheader("Semi-structured Data:")
	st.write("Semi-structured data doesn’t fit as neatly into the traditional table format as structured data, but it still follows a certain organizational framework. This type of data contains tags, markers, or attributes that make it somewhat organized, but it doesn’t strictly conform to a table format.")
	st.write("Examples :")
	st.markdown("- JSON and XML Files: Data encoded with a structure but flexible enough to allow for variations in how it’s formatted.")
	st.markdown("- NoSQL Databases: Collections of documents or key-value pairs that are structured but don’t require a fixed schema like relational databases.")

	st.subheader("Unstructured Data:")
	st.write("Unstructured data is the most complex and least organized form of data. It does not follow a specific format or structure, making it difficult to process and analyze using traditional methods. Unstructured data includes a wide range of formats, such as text, images, videos, and more.")
	st.write("Examples :")
	st.markdown("- Text Files: Documents, emails, and written reports.")
	st.markdown("- Multimedia: Photos, videos, and audio files.")
	st.markdown("- Social Media: Tweets, posts, and comments on platforms like Facebook, Twitter, or Instagram.")


	st.write("In today's world, we're constantly surrounded by data, especially with the internet being a significant part of our daily lives. However, data alone isn't always helpful. To gain value from it, we need to transform that data into meaningful information. This transformation is the essence of statistics, the science that turns data into actionable insights.")

	st.header("What is Statistics?")

	st.write("Statistics is a branch of mathematics and a huge field that involves collecting, analysing, interpreting, and organizing data. It allows us to take raw data, make sense of it, and turn it into useful information. In essence, statistics is the science of understanding data and using it to answer questions or solve problems.")
	st.write("Statistics is divided into two main branches:")
	st.markdown("- Descriptive Statistics")
	st.markdown("- Inferential Statistics")

	st.subheader("Descriptive Statistics")
	st.write("Descriptive statistics is a vital branch of statistics that focuses on summarizing and explaining the main features of data. It helps us get a clear overview of the dataset by analyzing sample data or population data. ")
	st.write("What are a sample and a population? Let’s define them.")
	st.markdown(''':blue-background[Population] The entire set of observations of interest in a particular study. For example, all the people living in a country.''')
	st.markdown("Parameters are numerical characteristics of a population, such as the population mean, that are used to describe and analyse the population")
	st.markdown(''':blue-background[Sample] A subset of population or subset of observations selected from a population, intended to represent the population in a study. ''')
	st.markdown("Statistics are numerical characteristics of a sample, such as the sample mean, that are used to describe and analyse the data.")

	st.write("The key concepts in descriptive statistics include :")
	st.markdown("- Measurement of Central Tendency which involves finding Mean, Median, and Mode.")
	st.markdown("- Measurement of Dispersion which involves finding Range, Variance and Standard Deviation.")
	st.markdown("- Distribution, which describes how frequently each value occurs. Examples include the Normal (Gaussian), Uniform, and Binomial distributions.")

	st.subheader("Measure Of Central Tendency")
	st.write("This involves finding a single value that represents the center of the dataset. It includes three primary measures:")
	st.markdown("- Mean: The average of all data points.")
	st.markdown("- Median: The middle value in a sorted dataset.")
	st.markdown("- Mode: The most frequently occurring value in the dataset.")

	st.subheader("MODE")
	st.write("Mode represents the most frequently occurring value in the dataset. Unlike the mean or median, mode focuses on the frequency of the data. However, mode has its limitations, primarily because it is frequency-based, which means it gives the most weight to the data that appears most often, possibly ignoring other important information.")
	st.write("Here are the different types of modes and the situations where they arise:")
	st.markdown(''':blue-background[No Mode] This situation occurs when there are no repeating values in the dataset. For example, in the list [1, 2, 3, 4, 5], no number appears more than once, resulting in No Mode.''')
	st.markdown(''':blue-background[Uni Mode] Uni-mode occurs when one value appears more frequently than others. For example, in the list [1, 1, 1, 2, 3, 4, 5], the number 1 appears most often, making it the Uni-Mode.''')
	st.markdown(''':blue-background[Bi Mode] Bi-mode arises when two different values appear with the same highest frequency. For example, in the list [1, 1, 2, 2, 3, 4, 5], both 1 and 2 appear twice, resulting in Bi-Mode.''')
	st.markdown(''':blue-background[Tri Mode] Tri-mode occurs when three different values have the same highest frequency. For example, in the list [1, 1, 2, 2, 3, 3, 4, 5], the numbers 1, 2, and 3 each appear twice, making it a Tri-Mode scenario.''')
	st.markdown(''':blue-background[Multi Mode] Multi-mode happens when more than three values share the same highest frequency. For example, in the list [1, 1, 2, 2, 3, 3, 4, 4, 5], the numbers 1, 2, 3, and 4 each appear twice, creating a Multi-Mode scenario.''')
	st.header("Calculate Mode")

	def mode(*args):
	    """Classify the mode by how many values share the highest frequency."""
	    list1 = list(args)
	    # Count how often each distinct value occurs.
	    dict1 = {value: list1.count(value) for value in set(list1)}
	    max_value = max(dict1.values())
	    # Values that occur with the highest frequency.
	    count = [key for key, value in dict1.items() if value == max_value]
	    if max_value == 1:
	        return 'no mode'  # nothing repeats
	    elif len(count) == 1:
	        # Checked before the all-equal test so that a list holding one
	        # repeated value, e.g. [2, 2], reports that value as its mode.
	        return {count[0]: max_value}
	    elif len(count) == len(dict1):
	        return 'no mode'  # every value is equally frequent
	    elif len(count) == 2:
	        return 'bi mode'
	    elif len(count) == 3:
	        return 'tri mode'
	    else:
	        return 'multimode'

	numbers_input = st.text_input("Enter a list of numbers separated by commas (e.g., 1, 2, 2, 3, 4):")

	if numbers_input:
	    try:
	        # int() tolerates surrounding whitespace, so "1, 2" parses correctly.
	        list1 = list(map(int, numbers_input.split(',')))
	        result = mode(*list1)
	        st.write("Mode result:", result)
	    except ValueError:
	        st.write("Please enter a valid list of numbers separated by commas.")

	st.subheader("MEDIAN")
	st.write("The median is another measure of central tendency that focuses on the middle value of an ordered dataset. The key advantage of the median is that it is far less affected by outliers than the mean.")
	st.subheader("Median Formula for Odd Number of Observations")
	st.latex(r'''\text{Median} = X_{\left(\frac{n+1}{2}\right)}''')
	st.subheader("Median Formula for Even Number of Observations")
	st.latex(r'''\text{Median} = \frac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}}{2}''')
	st.header("Calculate Median")

	def median(list1):
	    """Return the middle value, or the mean of the two middle values for an even count."""
	    ordered = sorted(list1)  # sort a copy so the caller's list is left unchanged
	    length = len(ordered)
	    if length % 2 == 0:
	        mid1 = length // 2 - 1
	        mid2 = length // 2
	        return (ordered[mid1] + ordered[mid2]) / 2
	    else:
	        mid = length // 2
	        return ordered[mid]

	numbers_input_1 = st.text_input("Enter a list of numbers separated by commas (e.g., 1, 2, 3, 4, 5):", key="numbers_input_1")
	if numbers_input_1:
	    parts = numbers_input_1.split(',')
	    list1 = []

	    for num in parts:
	        num = num.strip()
	        # isdigit() accepts only non-negative whole numbers; other entries are skipped.
	        if num.isdigit():
	            list1.append(int(num))

	    if list1:
	        result = median(list1)
	        st.write("Median result:", result)
	    else:
	        st.write("No valid numbers provided.")

	st.subheader("MEAN")
	st.write("The mean is the arithmetic average of all values in the dataset. It is widely used because it considers all the data points. However, it can be heavily affected by outliers.")
	st.write("Depending on the data, the mean can be computed in three ways: arithmetic, geometric, and harmonic.")

	st.subheader("Arithmetic Mean")
	st.write("The arithmetic mean is appropriate for:")
	st.markdown("- Interval and Ratio Data")
	st.markdown("- Symmetric Distributions")
	st.markdown("- Data Without Outliers")
	st.subheader("Population Mean Formula")
	st.latex(r'''\mu = \frac{1}{N} \sum_{i=1}^{N} x_i''')
	st.subheader("Sample Mean Formula")
	st.latex(r'''\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i''')
	st.header("Calculate Arithmetic Mean")

	def arithmetic_mean(list1):
	    """Return the sum of the values divided by their count."""
	    total = sum(list1)  # avoids shadowing the built-in sum()
	    return total / len(list1)

	numbers_input_2 = st.text_input("Enter a list of numbers separated by commas (e.g., 1, 2, 3, 4, 5):", key="numbers_input_2")
	if numbers_input_2:
	    parts = numbers_input_2.split(",")
	    list1 = []
	    for i in parts:
	        i = i.strip()
	        # isdigit() accepts only non-negative whole numbers; other entries are skipped.
	        if i.isdigit():
	            list1.append(int(i))
	    if list1:
	        result = arithmetic_mean(list1)
	        st.write("Arithmetic Mean:", result)
	    else:
	        st.write("No valid numbers provided.")


	st.subheader("Geometric Mean")
	st.write("The geometric mean is appropriate for:")
	st.markdown("- Multiplicative Data")
	st.markdown("- Percentages and Rates")
	st.markdown("- Normalized Data")
	st.subheader("Geometric Mean for Population")
	st.latex(r'''\text{GM}_{\text{population}} = \left( \prod_{i=1}^{N} x_i \right)^{\frac{1}{N}}''')
	st.subheader("Geometric Mean for Sample")
	st.latex(r'''\text{GM}_{\text{sample}} = \left( \prod_{i=1}^{n} x_i \right)^{\frac{1}{n}}''')
	st.header("Calculate Geometric Mean")

	def geometric_mean(list1):
	    """Return the n-th root of the product of the n values, rounded to 2 decimals."""
	    product = math.prod(list1)
	    return round(product ** (1 / len(list1)), 2)

	numbers_input_3 = st.text_input("Enter a list of numbers separated by commas (e.g., 1, 2, 3, 4, 5):", key="numbers_input_3")
	if numbers_input_3:
	    parts = numbers_input_3.split(",")
	    list1 = []
	    for i in parts:
	        i = i.strip()
	        # isdigit() accepts only non-negative whole numbers; other entries are skipped.
	        if i.isdigit():
	            list1.append(int(i))
	    if list1:
	        result = geometric_mean(list1)
	        st.write("Geometric Mean:", result)
	    else:
	        st.write("No valid numbers provided.")


	st.subheader("Harmonic Mean")
	st.write("The harmonic mean is appropriate for:")
	st.markdown("- Rates and Ratios")
	st.markdown("- Data with Reciprocal Relationships")
	st.subheader("Harmonic Mean for Population")
	st.latex(r'''\text{HM}_{\text{population}} = \frac{N}{\sum_{i=1}^{N} \frac{1}{x_i}}''')
	st.subheader("Harmonic Mean for Sample")
	st.latex(r'''\text{HM}_{\text{sample}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}''')
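	st.header("Calculate Harmonic Mean")

	def harmonic_mean(list1):
	    """Return n divided by the sum of reciprocals, rounded to 2 decimals."""
	    # Sum 1/x for every value. A reduce() call without an initial value
	    # would add the first element itself instead of its reciprocal.
	    reciprocal_sum = sum(1 / x for x in list1)
	    return round(len(list1) / reciprocal_sum, 2)

	numbers_input_4 = st.text_input("Enter a list of numbers separated by commas (e.g., 1, 2, 3, 4, 5):", key="numbers_input_4")
	if numbers_input_4:
	    parts = numbers_input_4.split(",")
	    list1 = []
	    for i in parts:
	        i = i.strip()
	        # isdigit() accepts only non-negative whole numbers; other entries are skipped.
	        if i.isdigit():
	            list1.append(int(i))
	    if not list1:
	        st.write("No valid numbers provided.")
	    elif 0 in list1:
	        # 1/0 is undefined, so the harmonic mean cannot be computed.
	        st.write("The harmonic mean is undefined when any value is 0.")
	    else:
	        result = harmonic_mean(list1)
	        st.write("Harmonic Mean:", result)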

	st.subheader("Measure Of Dispersion")
	st.write("Measures of dispersion describe the variability of the data. They indicate how much the data points differ from the central value. Common measures are the range, variance, and standard deviation.")
	st.markdown(''':blue-background[Absolute Measure] These give the spread in the same unit as the original data (e.g., cm, kg). For example, if the data is in cm, the result is also in cm.''')
	st.markdown(''':blue-background[Relative Measure] These are unit-free and give a ratio or percentage that indicates variability in relation to the central tendency.''')

	st.subheader("Absolute Measure")
	st.markdown("- Range")
	st.markdown("- Quartile Deviation")
	st.markdown("- Variance")
	st.markdown("- Standard Deviation")

	st.subheader("Relative Measure")
	st.markdown("- Coefficient Of Range")
	st.markdown("- Coefficient Of Quartile Deviation")
	st.markdown("- Coefficient Of Variance")
	st.markdown("- Coefficient Of Standard Deviation")

	st.markdown(''':blue-background[Range] is one measure of dispersion. It is rarely relied on by itself because it uses only the maximum and minimum values and ignores the rest of the data.''')
	st.subheader("Absolute Range")
	st.latex(r'''
	\text{Absolute Range} = \text{Maximum Value} - \text{Minimum Value}
	''')
	st.subheader("Relative Range")
	st.latex(r'''
	\text{Relative Range} = \frac{\text{Absolute Range}}{\text{Mean}} \times 100
	''')

	st.markdown(''':blue-background[Quartile Deviation] is one measure of dispersion. The sorted data is divided into 4 equal parts, and the measure focuses on the central half of the data, between the first quartile and the third quartile.''')
	st.subheader("Absolute Quartile Deviation")
	st.latex(r'''
	QD = \frac{Q_3 - Q_1}{2}
	''')
	st.subheader("Relative Quartile Deviation")
	st.latex(r'''
	\text{Relative QD} = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100
	''')


	st.markdown(''':blue-background[Variance] is one of the most widely used measures of dispersion. Its drawback is that the deviations are squared to avoid negative values, so the result is expressed in squared units rather than in the units of the data.''')
	st.subheader("Absolute Variance")
	st.latex(r'''
	\text{Var} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
	''')
	st.subheader("Relative Variance")
	st.latex(r'''
	\text{Relative Var} = \frac{\text{Var}}{\mu} \times 100
	''')


	st.markdown(''':blue-background[Standard Deviation] is one of the most widely used measures of dispersion. It is the square root of the variance, which returns the spread to the original units of the data.''')
	st.subheader("Absolute Standard Deviation")
	st.latex(r'''
	\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}
	''')
	st.subheader("Relative Standard Deviation")
	st.latex(r'''
	\text{Relative SD} = \frac{\sigma}{\mu}
	''')

	st.subheader("Distribution")
	st.write("Measures of distribution describe the pattern or shape of the data distribution. They help in understanding how data points are distributed across different values.")
	st.write("Common types of distribution include:")
	st.markdown("- Normal Distribution")
	st.markdown("- Uniform Distribution")
	st.markdown("- Binomial Distribution")
	st.markdown("- Poisson Distribution")
	st.markdown("- Exponential Distribution")
	st.markdown("- Chi-Square Distribution")
	st.markdown("- T-Distribution")

	st.subheader("Inferential Statistics")
	st.write("Inferential statistics is a branch of statistics that involves making predictions, inferences, or generalizations about a population based on a sample of data taken from that population. It allows us to draw conclusions beyond the immediate data available, using various methods such as hypothesis testing, estimation, and confidence intervals.")
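
The app above displays formulas for the range, quartile deviation, variance, and standard deviation but does not compute them. The following is a minimal sketch of how they could be evaluated with Python's standard `statistics` module. The function name `dispersion_summary` and the dictionary keys are hypothetical and not part of the original app. The quartiles use the default "exclusive" method of `statistics.quantiles`, which is only one of several quartile conventions, and the variance and standard deviation divide by N, matching the population formulas shown in the app.

```python
import statistics


def dispersion_summary(values):
    """Return absolute and relative dispersion measures for a list of numbers.

    Hypothetical helper: assumes at least two values and a non-zero mean.
    """
    data = sorted(values)
    mean = statistics.fmean(data)

    # Absolute range: maximum value minus minimum value.
    value_range = data[-1] - data[0]

    # quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _, q3 = statistics.quantiles(data, n=4)

    # Population variance and standard deviation (divide by N).
    variance = statistics.pvariance(data)
    std_dev = statistics.pstdev(data)

    return {
        "range": value_range,
        "relative_range": value_range / mean * 100,
        "quartile_deviation": (q3 - q1) / 2,
        "relative_quartile_deviation": (q3 - q1) / (q3 + q1) * 100,
        "variance": variance,
        "standard_deviation": std_dev,
        "relative_standard_deviation": std_dev / mean,
    }


summary = dispersion_summary([2, 4, 4, 4, 5, 5, 7, 9])
for name, value in summary.items():
    print(f"{name}: {value}")
```

For the sample list used here the mean is 5, the population variance is 4, and the standard deviation is 2. In a Streamlit page the returned dictionary could be passed to `st.write` in the same way the calculators above display their results.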