import streamlit as st
st.markdown(
"""
Basic Terminology in NLP
""",
unsafe_allow_html=True
)
st.markdown(
"""
Before diving deep into the concepts of NLP, we must first know the frequently used terminology in NLP.
1. Key Terminologies in NLP
- Corpus: A collection of text documents. Example: {d1, d2, d3, ...}
- Document: A single unit of text (e.g., a sentence, paragraph, or article).
- Paragraph: A collection of sentences.
- Sentence: A collection of words forming a meaningful expression.
- Word: A collection of characters.
- Character: A basic unit like an alphabet, number, or special symbol.
""",
unsafe_allow_html=True
)
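st.markdown(
"""
The hierarchy above (corpus → document → sentence → word → character) can be sketched with plain Python types; the `split`-based steps below are a simplification for illustration:
"""
)
st.code(
"""
# A corpus is simply a collection of documents; each level of the
# hierarchy can be modelled with ordinary Python types.
corpus = [
    "I love ice-cream. I love chocolate.",        # document d1
    "In Hyderabad, we can eat famous biryani.",   # document d2
]
document = corpus[0]
# Naive sentence split on "." (real splitters are smarter).
sentences = [s.strip() for s in document.split(".") if s.strip()]
words = sentences[0].split()      # words of the first sentence
characters = list(words[1])       # characters of the word "love"
print(sentences)   # ['I love ice-cream', 'I love chocolate']
print(words)       # ['I', 'love', 'ice-cream']
print(characters)  # ['l', 'o', 'v', 'e']
""",
language="python"
)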
st.markdown(
"""
2. Tokenization
Tokenization is the process of breaking down a large piece of text into smaller units called tokens. These tokens can be words, sentences, or subwords, depending on the granularity required for the task.
Types of Tokenization:
- Sentence Tokenization: Splitting text into sentences.
Example: "I love ice-cream. I love chocolate." → ["I love ice-cream", "I love chocolate"]
- Word Tokenization: Splitting sentences into words.
Example: "I love biryani" → ["I", "love", "biryani"]
- Character Tokenization: Splitting words into characters.
Example: "Love" → ["L", "o", "v", "e"]
""",
unsafe_allow_html=True
)
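st.markdown(
"""
The three tokenization levels can be sketched with the standard library alone; the regex sentence splitter below is a rough heuristic, and real projects typically use a library such as NLTK or spaCy:
"""
)
st.code(
"""
import re

def sentence_tokenize(text):
    # Split on sentence-ending punctuation (a rough heuristic).
    return [s for s in re.split(r"[.!?]\\s*", text) if s]

def word_tokenize(sentence):
    # Split on whitespace; punctuation handling is intentionally simple.
    return sentence.split()

text = "I love ice-cream. I love chocolate."
print(sentence_tokenize(text))          # ['I love ice-cream', 'I love chocolate']
print(word_tokenize("I love biryani"))  # ['I', 'love', 'biryani']
print(list("Love"))                     # ['L', 'o', 'v', 'e']
""",
language="python"
)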
st.markdown(
"""
3. Stop Words
Stop words are commonly used words in a language that carry little or no meaningful information for text analysis.
Example:
"In Hyderabad, we can eat famous biryani."
Stop words: ["in", "we", "can"]
""",
unsafe_allow_html=True
)
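st.markdown(
"""
Stop-word removal for the example sentence can be sketched with a tiny hand-picked stop-word set; libraries such as NLTK ship much larger curated lists (`nltk.corpus.stopwords`):
"""
)
st.code(
"""
# A tiny, illustrative stop-word set (not a complete list).
stop_words = {"in", "we", "can", "the", "a", "an", "and"}

sentence = "In Hyderabad, we can eat famous biryani."
# Lowercase and strip trailing punctuation before filtering.
tokens = [w.strip(",.").lower() for w in sentence.split()]
filtered = [w for w in tokens if w not in stop_words]
print(filtered)  # ['hyderabad', 'eat', 'famous', 'biryani']
""",
language="python"
)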
st.markdown(
"""
4. Vectorization
Vectorization is the process of converting text data into numerical representations so that machine learning models can process and analyze it.
Types of Vectorization:
- One-Hot Encoding: Represents each word as a binary vector.
- Bag of Words (BoW): Represents text based on word frequencies.
- TF-IDF: Adjusts word frequency by importance.
- Word2Vec: Embeds words in a vector space using deep learning.
- GloVe: Uses global co-occurrence statistics for embedding.
- FastText: Similar to Word2Vec but includes subword information.
""",
unsafe_allow_html=True
)
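st.markdown(
"""
The Bag of Words idea from the list above can be sketched by hand with `collections.Counter`; in practice one would use scikit-learn's `CountVectorizer` or `TfidfVectorizer`. The three-document corpus below is an invented example:
"""
)
st.code(
"""
from collections import Counter

corpus = ["I love biryani", "I love ice cream", "biryani is famous"]

# Build a shared vocabulary across the whole corpus.
vocab = sorted({w.lower() for doc in corpus for w in doc.split()})

def bow_vector(doc):
    # Bag of Words: count how often each vocabulary word appears.
    counts = Counter(w.lower() for w in doc.split())
    return [counts[w] for w in vocab]

print(vocab)
# ['biryani', 'cream', 'famous', 'i', 'ice', 'is', 'love']
print(bow_vector("I love biryani"))
# [1, 0, 0, 1, 0, 0, 1]
""",
language="python"
)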
st.markdown(
"""
5. Stemming
Stemming is the process of reducing words to their base or root form, often by removing prefixes or suffixes. It is a rule-based, heuristic approach to standardize words by removing derivational affixes.
Example:
- Original Words: "running", "runner", "runs"
- Stemmed Form: "run"
""",
unsafe_allow_html=True
)
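st.markdown(
"""
The rule-based idea behind stemming can be sketched with a toy suffix-stripping function; the suffix list below is invented for this example, and production code would use a proper algorithm such as NLTK's `PorterStemmer`:
"""
)
st.code(
"""
# Toy suffixes, ordered longest-first so "ning" wins over "ing".
SUFFIXES = ("ning", "ner", "ing", "ers", "er", "s")

def crude_stem(word):
    for suffix in SUFFIXES:
        # Strip the suffix only if a stem of >= 3 characters remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["running", "runner", "runs"]])
# ['run', 'run', 'run']
""",
language="python"
)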
st.markdown(
"""
6. Lemmatization
Lemmatization is the process of reducing a word to its base or root form (called a lemma) using linguistic rules and a vocabulary (dictionary). Unlike stemming, lemmatization ensures that the resulting word is a valid word in the language.
Example:
- Original Words: "studying", "better", "carrying"
- Lemmatized Form: "study", "good", "carry"
Lemmatization is more accurate than stemming but computationally more intensive as it requires a language dictionary.
""",
unsafe_allow_html=True
)
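st.markdown(
"""
Since lemmatization is dictionary-driven, a miniature lookup table captures the idea; real lemmatizers (e.g. NLTK's `WordNetLemmatizer` or spaCy) consult a full vocabulary plus part-of-speech tags, and the tiny dictionary below is illustrative only:
"""
)
st.code(
"""
# Miniature lemma dictionary for this example only.
LEMMA_DICT = {
    "studying": "study",
    "better": "good",    # irregular form: suffix stripping cannot recover this
    "carrying": "carry",
}

def lemmatize(word):
    # Fall back to the word itself when it is not in the dictionary.
    return LEMMA_DICT.get(word, word)

print([lemmatize(w) for w in ["studying", "better", "carrying"]])
# ['study', 'good', 'carry']
""",
language="python"
)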