UmaKumpatla committed on
Commit
0d2080f
·
verified ·
1 Parent(s): d82093c

Update pages/2.Terminology.py

Browse files
Files changed (1) hide show
  1. pages/2.Terminology.py +45 -0
pages/2.Terminology.py CHANGED
@@ -0,0 +1,45 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import streamlit as st
2
+
3
+ # Title
4
+ st.title(":blue[🔍 NLP Terminologies]")
5
+
6
+ # Helper function to display sections
7
+ def display_section(title, description, example=None, extra=None):
8
+ st.subheader(f":green[{title}]")
9
+ st.write(description)
10
+ if example:
11
+ st.write(":red[Example]")
12
+ st.write(example)
13
+ if extra:
14
+ st.write(extra)
15
+
16
+ # NLP Terminologies
17
+ display_section("Corpus", "A collection of documents grouped together.",
18
+ "A corpus of English literature might include works by Shakespeare, Dickens, and Austen.")
19
+
20
+ display_section("Document", "A collection of sentences, paragraphs, single words, or single characters.",
21
+ "An article, a book, or an email can be considered a document.")
22
+
23
+ display_section("Paragraph", "A collection of sentences.",
24
+ "The quick brown fox jumps over the lazy dog. It was a sunny day. The fox was happy.")
25
+
26
+ display_section("Sentence", "A collection of words.",
27
+ "The quick brown fox jumps over the lazy dog.")
28
+
29
+ display_section("Word", "A collection of characters.",
30
+ "Fox is a word made up of the characters 'F', 'o', and 'x'.")
31
+
32
+ display_section("Characters", "Can be numbers, alphabets, or special symbols.",
33
+ "'A', '1', and '@' are all characters.")
34
+
35
+ display_section("Tokenization", "Tokenization is the process of breaking down text into smaller units called tokens.",
36
+ "Sentence tokenization splits text into sentences, while word tokenization splits text into words.",
37
+ ":blue[Types of Tokenization]\n- **Sentence Tokenization**: Splits text into individual sentences.\n- **Word Tokenization**: Splits text into individual words.\n- **Character Tokenization**: Splits text into individual characters.")
38
+
39
+ display_section("Stop Words", "Stop words are common words that do not contribute much to the meaning of a sentence and are often removed during text processing.",
40
+ "Words like the, we, in, am, she, and he are considered stop words.")
41
+
42
+ display_section("Vectorization", "Vectorization converts text data into a numerical format for machine learning models.",
43
+ None,
44
+ ":blue[Types of Vectorization]\n- **One-Hot Encoding**: Represents words as binary vectors.\n- **Bag-of-Words**: Counts word occurrences, disregarding grammar.\n- **TF-IDF**: Balances word frequency in a document vs. the entire corpus.\n- **Word2Vec**: Deep learning-based word embeddings.\n- **GloVe**: Uses word co-occurrence matrices.\n- **FastText**: Considers subwords for rare/misspelled words.")
45
+