DOMMETI committed on
Commit 2f145ad · verified · 1 Parent(s): 209b944

Create 9_natural_language_processing.py

pages/9_natural_language_processing.py ADDED
@@ -0,0 +1,208 @@
+ import streamlit as st
+
+ # Page Configuration
+ st.set_page_config(page_title="NLP Guide", layout="wide")
+
+ # Custom CSS Styling
+ st.markdown("""
+ <style>
+ body {
+     background-color: #eef2f7;
+     font-family: 'Roboto', sans-serif;
+ }
+ h1 {
+     color: #00FFFF;
+     font-family: 'Roboto', sans-serif;
+     font-weight: bold;
+     text-align: center;
+     margin-bottom: 25px;
+ }
+ h2 {
+     color: #FFFACD;
+     font-family: 'Roboto', sans-serif;
+     font-weight: 700;
+     margin-top: 30px;
+ }
+ h3 {
+     color: #ba95b0;
+     font-family: 'Roboto', sans-serif;
+     font-weight: 600;
+     margin-top: 20px;
+ }
+ p, ul {
+     font-family: 'Georgia', serif;
+     line-height: 1.8;
+     color: #2b2b2b;
+     margin-bottom: 20px;
+ }
+ .icon-bullet {
+     list-style-type: none;
+     padding-left: 20px;
+ }
+ .icon-bullet li {
+     font-family: 'Georgia', serif;
+     font-size: 1.1em;
+     margin-bottom: 10px;
+     color: #2b2b2b;
+ }
+ .icon-bullet li::before {
+     content: "✔️";
+     padding-right: 10px;
+     color: #00FFFF;
+ }
+ .stImage img {
+     border-radius: 10px;
+ }
+ </style>
+ """, unsafe_allow_html=True)
+
+ # Function to display the Home Page
+ def show_home_page():
+     st.title("Natural Language Processing (NLP)")
+     st.markdown(
+         """
+         ### Welcome to the NLP Guide 🌟
+         Natural Language Processing (NLP) bridges the gap between computers and human language. It's the core technology behind:
+         - Chatbots and voice assistants (e.g., Alexa, Siri)
+         - Machine Translation (e.g., Google Translate)
+         - Sentiment Analysis
+         - Search Engines (e.g., Google, Bing)
+
+         Dive into **Tokenization**, **Vectorization**, and more to understand how machines process text!
+         """
+     )
+     st.image(
+         "https://cdn-uploads.huggingface.co/production/uploads/64c972774515835c4dadd754/wSlRj9jk4szr4yy3wTlfA.webp",
+         caption="Applications of NLP",
+         width=800,
+     )
+
+ # Function to display specific topic pages
+ def show_page(page):
+     if page == "Tokenization":
+         st.title("Tokenization")
+         st.markdown("""
+         ### Tokenization 🛠️
+
+         Tokenization breaks text into smaller units (tokens), such as words or sentences. It is typically one of the first steps in an NLP pipeline.
+
+         #### Types of Tokenization:
+         1. **Word Tokenization**:
+            - Splits text into individual words.
+            - Example: *"I love NLP"* → `["I", "love", "NLP"]`
+         2. **Sentence Tokenization**:
+            - Splits text into sentences.
+            - Example: *"NLP is exciting. Let's learn it."* → `["NLP is exciting.", "Let's learn it."]`
+
+         #### Libraries for Tokenization:
+         - **NLTK**: Popular for teaching and academic projects.
+         - **spaCy**: Fast and production-ready.
+         - **Transformers**: Subword tokenization (e.g., WordPiece) for models like BERT.
+
+         #### Challenges in Tokenization:
+         - Handling contractions (e.g., "I'm" → `["I", "'m"]`).
+         - Handling multilingual data (e.g., "Bonjour NLP").
+         """)
+
+     elif page == "NLP Terminologies":
+         st.title("NLP Terminologies")
+         st.markdown("""
+         ### NLP Terminologies 📚
+
+         - **Stop Words**: Very common words such as "the" or "is" that are often removed during preprocessing.
+         - **Stemming**: Reducing words to a root form by stripping affixes (e.g., "running" → "run").
+         - **Lemmatization**: Converting words to their dictionary base form, the lemma (e.g., "better" → "good").
+         - **POS Tagging**: Assigning a part of speech (e.g., noun, verb) to each word.
+         - **NER (Named Entity Recognition)**: Identifying named entities such as people or places (e.g., "New York").
+         """)
+
+     elif page == "One-Hot Vectorization":
+         st.title("One-Hot Vectorization")
+         st.markdown("""
+         ### One-Hot Vectorization 🔢
+
+         A simple text representation in which each word in the vocabulary maps to a unique binary vector.
+
+         #### How It Works:
+         - Each word in the vocabulary is assigned an index.
+         - The vector has one entry per vocabulary word and is all zeros except for a `1` at the word's index.
+
+         #### Example:
+         Vocabulary: ["cat", "dog", "bird"]
+         - "cat" → [1, 0, 0]
+         - "dog" → [0, 1, 0]
+         - "bird" → [0, 0, 1]
+
+         #### Advantages:
+         - Easy to implement.
+
+         #### Limitations:
+         - High dimensionality for large vocabularies.
+         - Does not capture semantic relationships (e.g., "king" and "queen" are no closer to each other than any other pair of words).
+         """)
+
+     elif page == "Bag of Words":
+         st.title("Bag of Words (BoW)")
+         st.markdown("""
+         ### Bag of Words 🧳
+
+         Represents each document as a vector of word counts, ignoring word order.
+
+         #### How It Works:
+         1. Create a vocabulary of the unique words across all documents.
+         2. Count how often each vocabulary word occurs in each document.
+
+         #### Example:
+         Given two sentences:
+         - "I love NLP."
+         - "I love programming."
+
+         Vocabulary: ["I", "love", "NLP", "programming"]
+         - Sentence 1: [1, 1, 1, 0]
+         - Sentence 2: [1, 1, 0, 1]
+         """)
+
+     elif page == "TF-IDF Vectorizer":
+         st.title("TF-IDF Vectorizer")
+         # Raw string so that "\t" in "\text" and "\times" is not read as a tab escape;
+         # Streamlit renders LaTeX wrapped in "$$" on their own lines.
+         st.markdown(r"""
+         ### TF-IDF Vectorizer 📊
+
+         A statistical measure of how important a word is to a document relative to a collection of documents (corpus).
+
+         #### Formula:
+         $$
+         \text{TF-IDF} = \text{TF} \times \text{IDF}
+         $$
+
+         - *TF*: Term Frequency, how often the term occurs in the document.
+         - *IDF*: Inverse Document Frequency, which down-weights terms that occur in many documents.
+         """)
+
+     elif page == "Word2Vec":
+         st.title("Word2Vec")
+         st.markdown("""
+         ### Word2Vec 🤖
+
+         A neural network-based method for learning dense vector representations of words.
+
+         #### Key Features:
+         - Captures semantic relationships (e.g., "king" - "man" + "woman" ≈ "queen").
+         """)
+
+ # Sidebar navigation
+ st.sidebar.title("Explore NLP Topics")
+ menu_options = [
+     "Home",
+     "Tokenization",
+     "NLP Terminologies",
+     "One-Hot Vectorization",
+     "Bag of Words",
+     "TF-IDF Vectorizer",
+     "Word2Vec",
+ ]
+ selected_page = st.sidebar.radio("Select a topic", menu_options)
+
+ # Display the selected page
+ if selected_page == "Home":
+     show_home_page()
+ else:
+     show_page(selected_page)
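
The Tokenization page added in this commit describes word and sentence splitting only in prose. A minimal sketch of both, using just Python's `re` module, could look like the following. The function names `simple_word_tokenize` and `simple_sentence_tokenize` and the regular expressions are illustrative assumptions, not part of the committed file and not the API of NLTK, spaCy, or Transformers.

```python
import re


def simple_word_tokenize(text):
    """Split text into word tokens and standalone punctuation marks.

    Assumption: a word is a run of word characters, optionally followed
    by an apostrophe suffix (so "Let's" stays one token).
    """
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


def simple_sentence_tokenize(text):
    """Split text into sentences at whitespace that follows '.', '!' or '?'.

    Assumption: every such punctuation mark ends a sentence, which is
    wrong for abbreviations like "e.g." and is only meant as a sketch.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


print(simple_word_tokenize("I love NLP"))
# ['I', 'love', 'NLP']
print(simple_sentence_tokenize("NLP is exciting. Let's learn it."))
# ['NLP is exciting.', "Let's learn it."]
```

Unlike the `"I'm"` → `["I", "'m"]` split listed under the page's challenges, this sketch keeps contractions as single tokens; which behavior is preferable depends on the downstream task.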
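
The One-Hot Vectorization page explains the index-then-set-one procedure with a three-word vocabulary. As a rough sketch under that same vocabulary, the procedure can be written in a few lines of plain Python. The helper name `one_hot` is hypothetical and chosen only for illustration.

```python
def one_hot(word, vocabulary):
    """Return a list of zeros with a single 1 at the word's vocabulary index."""
    # Assign each vocabulary word an index based on its position.
    index = {w: i for i, w in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    # Raises KeyError for out-of-vocabulary words; a real pipeline would
    # need a policy for those (an assumption left open here).
    vector[index[word]] = 1
    return vector


vocabulary = ["cat", "dog", "bird"]
for word in vocabulary:
    print(word, one_hot(word, vocabulary))
# cat [1, 0, 0]
# dog [0, 1, 0]
# bird [0, 0, 1]
```

The vector length equals the vocabulary size, which is the dimensionality limitation the page mentions.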
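
The Bag of Words page gives a two-step recipe (build a vocabulary, then count) and a two-sentence example. The sketch below reproduces that example with the standard library only. The helper names (`tokenize`, `build_vocabulary`, `bag_of_words`) are assumptions made for illustration, and tokens are kept case-sensitive so the vocabulary matches the one shown on the page.

```python
import re
from collections import Counter


def tokenize(text):
    """Extract runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", text)


def build_vocabulary(documents):
    """Collect unique tokens across all documents in first-seen order."""
    vocabulary = []
    for document in documents:
        for token in tokenize(document):
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary


def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(tokenize(document))
    return [counts[word] for word in vocabulary]


documents = ["I love NLP.", "I love programming."]
vocabulary = build_vocabulary(documents)
print(vocabulary)
# ['I', 'love', 'NLP', 'programming']
for document in documents:
    print(bag_of_words(document, vocabulary))
# [1, 1, 1, 0]
# [1, 1, 0, 1]
```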
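
The TF-IDF page states only that TF-IDF is the product of TF and IDF. To make the product concrete, the sketch below assumes one common textbook definition: TF is a term's count divided by the document length, and IDF is the natural log of the number of documents divided by the number of documents containing the term. These definitions and the function name `tf_idf` are assumptions, not taken from the committed file; libraries such as scikit-learn use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    Assumed definitions (one of several variants in use):
      TF(t, d) = count of t in d / number of tokens in d
      IDF(t)   = ln(N / number of documents containing t)
    """
    tokenized = [document.lower().split() for document in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


scores = tf_idf(["i love nlp", "i love programming"])
print(scores[0])
```

With these definitions, a term that appears in every document (here "i" and "love") gets an IDF of ln(1) = 0 and therefore a TF-IDF of 0, while "nlp", which appears in only one of the two documents, scores (1/3) × ln 2.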