Update pages/3_Terminology.py
Browse files- pages/3_Terminology.py +55 -79
pages/3_Terminology.py
CHANGED
|
@@ -76,85 +76,61 @@ st.markdown("""
|
|
| 76 |
</style>
|
| 77 |
""", unsafe_allow_html=True)
|
| 78 |
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
st.markdown(
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
)
|
| 85 |
-
|
| 86 |
-
st.
|
| 87 |
-
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
st.
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
st.header("
|
| 96 |
-
st.markdown(
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
st.
|
| 102 |
-
-
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
st.header("Character")
|
| 106 |
-
st.markdown('''
|
| 107 |
-
- Character can either be in number , alphabets or special symbol.
|
| 108 |
-
''')
|
| 109 |
-
|
| 110 |
-
st.header("Tokenization")
|
| 111 |
-
st.markdown('''
|
| 112 |
-
- It is a technique by using which we can convert a huge chunk into small entity where those small entities are known as tokens.
|
| 113 |
-
''')
|
| 114 |
-
|
| 115 |
-
st.subheader("Types of Tokenization")
|
| 116 |
st.markdown("""
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 123 |
|
| 124 |
-
st.subheader("
|
| 125 |
-
st.markdown(
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
st.
|
| 131 |
-
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
st.subheader("Character tokenization")
|
| 135 |
-
st.markdown('''
|
| 136 |
-
- It is a technique by using which we can convert a huge chunk into small entity where those small entities are known as tokens which are in characters.
|
| 137 |
-
''')
|
| 138 |
-
|
| 139 |
-
st.header("Stop Words")
|
| 140 |
-
st.markdown('''
|
| 141 |
-
- They are set of words which didn't have impact on the meaning of sentence / paragraph
|
| 142 |
-
- Stop words are used to make the grammar very clear
|
| 143 |
-
''')
|
| 144 |
-
|
| 145 |
-
st.header("Vectorization")
|
| 146 |
-
st.markdown('''
|
| 147 |
-
- It is a technique which helps us to convert a text into vector format
|
| 148 |
-
''')
|
| 149 |
-
|
| 150 |
-
st.subheader("Different types of techniques")
|
| 151 |
st.markdown("""
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
|
| 155 |
-
|
| 156 |
-
|
| 157 |
-
|
| 158 |
-
|
| 159 |
-
|
| 160 |
-
|
|
|
|
| 76 |
</style>
|
| 77 |
""", unsafe_allow_html=True)
|
| 78 |
|
| 79 |
# --- NLP Terminology glossary page body ------------------------------------
# Renders a glossary of core NLP terms as Streamlit headers + markdown
# bullets.  Relies on `st` (streamlit) imported at the top of this file and
# on the CSS classes ('title', 'caption') injected by the <style> block above.
#
# NOTE(review): the heading emoji were mojibake in the pasted source (UTF-8
# emoji bytes decoded through a legacy Greek codepage, e.g. "π", "βοΈ");
# they have been restored to plausible equivalents — confirm against the
# original file.

# Page title and tagline (styled via the CSS classes injected above).
st.markdown("<h1 class='title'>📘 NLP Terminology</h1>", unsafe_allow_html=True)
st.markdown("<p class='caption'>✨ Explore essential terms in Natural Language Processing and their meanings!...</p>", unsafe_allow_html=True)

# Core text-hierarchy terms, from largest unit (corpus) down to character.
st.header("📚 Corpus")
st.markdown("- **A corpus** is a collection of documents.")

st.header("📄 Document")
st.markdown("- **A document** is a collection of sentences, paragraphs, single words, or even single characters.")

st.header("📝 Paragraph")
st.markdown("- **A paragraph** consists of multiple sentences.")

st.header("🔢 Sentence")
st.markdown("- **A sentence** is a collection of words.")

st.header("🔤 Word")
st.markdown("- **Words** are made up of characters.")

st.header("🔠 Character")
st.markdown("- **A character** can be a number, alphabet, or special symbol.")

st.header("✂️ Tokenization")
# Wording tightened from the original broken-English sentence ("...convert a
# huge chunk into small entity where those small entities are known as tokens").
st.markdown("- **Tokenization** is a technique that splits a large chunk of text into smaller units, known as tokens.")

st.subheader("🛠️ Types of Tokenization")
st.markdown("""
- 🔹 **Sentence Tokenization** – Splits text into sentences.
- 🔹 **Word Tokenization** – Splits sentences into words.
- 🔹 **Character Tokenization** – Splits words into individual characters.
""")

st.subheader("📌 Sentence Tokenization")
st.markdown("- **Breaks a large text into meaningful sentence units.**")

st.subheader("🔍 Word Tokenization")
st.markdown("- **Splits a sentence into individual words.**")

st.subheader("🔡 Character Tokenization")
st.markdown("- **Breaks words into separate characters.**")

st.header("🚫 Stop Words")
st.markdown("- **Common words** (e.g., 'the', 'is', 'and') that do not add meaning to the text but maintain grammatical structure.")

st.header("📊 Vectorization")
st.markdown("- **Transforms text into numerical representation** for machine learning models.")

st.subheader("🔢 Different Types of Vectorization Techniques")
st.markdown("""
- 🎯 **One-Hot Encoding**
- 🏷️ **Bag of Words (BoW)**
- 📈 **TF-IDF (Term Frequency-Inverse Document Frequency)**
- 🧠 **Word2Vec**
- 🌍 **GloVe**
- ⚡ **FastText**
""")

# Closing callout.
st.success("🎉 Mastering these **NLP terminologies** will help you build powerful text-processing applications!")