Harika22 committed on
Commit
f8cdeaa
·
verified ·
1 Parent(s): 896d8db

Update pages/3_Terminology.py

Files changed (1)
  1. pages/3_Terminology.py +55 -79
pages/3_Terminology.py CHANGED
@@ -76,85 +76,61 @@ st.markdown("""
  </style>
  """, unsafe_allow_html=True)

- st.markdown("<h1 class='title'>NLP Terminology</h1>", unsafe_allow_html=True)
-
- st.markdown(
- "<p class='caption'>Explore essential terms in Natural Language Processing and their meanings!...</p>",
- unsafe_allow_html=True,
- )
- st.header("Document")
- st.markdown('''
- - Document is defined as collection of sentence / paragraph / single word / single character
- ''')
-
- st.header("Paragraph")
- st.markdown('''
- - Paragraph is defined as collection of sentence.
- ''')
-
- st.header("Sentence")
- st.markdown('''
- - Sentence is defined as collection of words.
- ''')
-
- st.header("Word")
- st.markdown('''
- - Words are defined as collection of characters
- ''')
-
- st.header("Character")
- st.markdown('''
- - Character can either be in number , alphabets or special symbol.
- ''')
-
- st.header("Tokenization")
- st.markdown('''
- - It is a technique by using which we can convert a huge chunk into small entity where those small entities are known as tokens.
- ''')
-
- st.subheader("Types of Tokenization")
  st.markdown("""
- <ul class="icon-bullet">
- <li>Sentence tokenization</li>
- <li>Word tokennization</li>
- <li>Character tokenization </li>
- </ul>
- """, unsafe_allow_html=True)

- st.subheader("Sentence tokenization")
- st.markdown('''
- - It is a technique by using which we can convert a huge chunk into small entity where those small entities are known as tokens which are in sentence.
- ''')
-
- st.subheader("Word tokenization")
- st.markdown('''
- - It is a technique by using which we can convert a huge chunk into small entity where those small entities are known as tokens which are words.
- ''')
-
- st.subheader("Character tokenization")
- st.markdown('''
- - It is a technique by using which we can convert a huge chunk into small entity where those small entities are known as tokens which are in characters.
- ''')
-
- st.header("Stop Words")
- st.markdown('''
- - They are set of words which didn't have impact on the meaning of sentence / paragraph
- - Stop words are used to make the grammar very clear
- ''')
-
- st.header("Vectorization")
- st.markdown('''
- - It is a technique which helps us to convert a text into vector format
- ''')
-
- st.subheader("Different types of techniques")
  st.markdown("""
- <ul class="icon-bullet">
- <li>One-Hot Vectorization </li>
- <li>Bag of Words</li>
- <li>TF-IDF (Term Frequency and Inverse Document Frequency)</li>
- <li>Word2Vector</li>
- <li>Glove</li>
- <li>Fast text</li>
- </ul>
- """, unsafe_allow_html=True)
  </style>
  """, unsafe_allow_html=True)

+
+ st.markdown("<h1 class='title'>📖 NLP Terminology</h1>", unsafe_allow_html=True)
+ st.markdown("<p class='caption'>✨ Explore essential terms in Natural Language Processing and their meanings!</p>", unsafe_allow_html=True)
+
+ st.header("📚 Corpus")
+ st.markdown("- **A corpus** is a collection of documents.")
+
+ st.header("📄 Document")
+ st.markdown("- **A document** is a collection of sentences, paragraphs, single words, or even single characters.")
+
+ st.header("📝 Paragraph")
+ st.markdown("- **A paragraph** consists of multiple sentences.")
+
+ st.header("📒 Sentence")
+ st.markdown("- **A sentence** is a collection of words.")
+
+ st.header("🔀 Word")
+ st.markdown("- **Words** are made up of characters.")
+
+ st.header("🔠 Character")
+ st.markdown("- **A character** can be a digit, a letter, or a special symbol.")
+
+ st.header("✂️ Tokenization")
+ st.markdown("- **Tokenization** is the technique of splitting a large chunk of text into smaller units known as tokens.")
+
+ st.subheader("🛠️ Types of Tokenization")
  st.markdown("""
+ - 🔹 **Sentence Tokenization** – Splits text into sentences.
+ - 🔹 **Word Tokenization** – Splits sentences into words.
+ - 🔹 **Character Tokenization** – Splits words into individual characters.
+ """)
+
+ st.subheader("📝 Sentence Tokenization")
+ st.markdown("- **Breaks a large text into meaningful sentence units.**")
+
+ st.subheader("📖 Word Tokenization")
+ st.markdown("- **Splits a sentence into individual words.**")

+ st.subheader("🔑 Character Tokenization")
+ st.markdown("- **Breaks words into separate characters.**")
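The three tokenization levels added above can be sketched in plain Python. This is a minimal illustration using only the standard `re` module, not code from this page; in practice libraries such as NLTK (`sent_tokenize`, `word_tokenize`) handle the many edge cases of real text:

```python
import re

text = "NLP is fun. Tokenization splits text!"

# Sentence tokenization: split after sentence-ending punctuation.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Word tokenization: pull out runs of word characters.
words = re.findall(r"\w+", text)

# Character tokenization: break a single word into its characters.
characters = list(words[0])

print(sentences)   # ['NLP is fun.', 'Tokenization splits text!']
print(words)       # ['NLP', 'is', 'fun', 'Tokenization', 'splits', 'text']
print(characters)  # ['N', 'L', 'P']
```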
+
+ st.header("🚫 Stop Words")
+ st.markdown("- **Common words** (e.g., 'the', 'is', 'and') that do not add meaning to the text but maintain grammatical structure.")
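The stop-word filtering described above can be sketched with a small hand-picked stop-word set (an assumption for illustration; NLTK and spaCy ship curated lists):

```python
# A tiny hand-picked stop-word set; real projects use fuller curated lists.
STOP_WORDS = {"the", "is", "and", "a", "an", "of", "to"}

def remove_stop_words(sentence):
    """Drop stop words, keeping only the words that carry meaning."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The cat is on the mat and the dog"))
# ['cat', 'on', 'mat', 'dog']
```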
+
+ st.header("📊 Vectorization")
+ st.markdown("- **Transforms text into a numerical representation** for machine learning models.")
+
+ st.subheader("🔢 Different Types of Vectorization Techniques")
  st.markdown("""
+ - 🎯 **One-Hot Encoding**
+ - 🏷️ **Bag of Words (BoW)**
+ - 📊 **TF-IDF (Term Frequency-Inverse Document Frequency)**
+ - 🧠 **Word2Vec**
+ - 🌍 **GloVe**
+ - ⚡ **FastText**
+ """)
+
+ st.success("🚀 Mastering these **NLP terminologies** will help you build powerful text-processing applications!")