import streamlit as st
st.markdown(
"""
Basic Terminology in NLP
""",
unsafe_allow_html=True
)
st.markdown(
"""
Before diving deep into the concepts of NLP, we must first know the frequently used terminology in NLP.
1. Key Terminologies in NLP
- Corpus: A collection of text documents. Example: {d1, d2, d3, ...}
- Document: A single unit of text (e.g., a sentence, paragraph, or article).
- Paragraph: A collection of sentences.
- Sentence: A collection of words forming a meaningful expression.
- Word: A collection of characters.
- Character: A basic unit like an alphabet, number, or special symbol.
""",
unsafe_allow_html=True
)
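st.markdown(
"""
The hierarchy above (corpus → document → sentence → word → character) can be sketched with plain Python types; the `split`-based steps below are a simplification for illustration:
"""
)
st.code(
"""
# A corpus is simply a collection of documents; each level of the
# hierarchy can be modelled with ordinary Python types.
corpus = [
    "I love ice-cream. I love chocolate.",        # document d1
    "In Hyderabad, we can eat famous biryani.",   # document d2
]
document = corpus[0]
# Naive sentence split on "." (real splitters are smarter).
sentences = [s.strip() for s in document.split(".") if s.strip()]
words = sentences[0].split()      # words of the first sentence
characters = list(words[1])       # characters of the word "love"
print(sentences)   # ['I love ice-cream', 'I love chocolate']
print(words)       # ['I', 'love', 'ice-cream']
print(characters)  # ['l', 'o', 'v', 'e']
""",
language="python"
)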
st.markdown(
"""
2. Tokenization
Tokenization is the process of breaking down a large piece of text into smaller units called tokens. These tokens can be words, sentences, or subwords, depending on the granularity required for the task.
Types of Tokenization:
- Sentence Tokenization: Splitting text into sentences.
Example: "I love ice-cream. I love chocolate." → ["I love ice-cream", "I love chocolate"]
- Word Tokenization: Splitting sentences into words.
Example: "I love biryani" → ["I", "love", "biryani"]
- Character Tokenization: Splitting words into characters.
Example: "Love" → ["L", "o", "v", "e"]
""",
unsafe_allow_html=True
)
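st.markdown(
"""
The three tokenization levels can be sketched with the standard library alone; the regex sentence splitter below is a rough heuristic, and real projects typically use a library such as NLTK or spaCy:
"""
)
st.code(
"""
import re

def sentence_tokenize(text):
    # Split on sentence-ending punctuation (a rough heuristic).
    return [s for s in re.split(r"[.!?]\\s*", text) if s]

def word_tokenize(sentence):
    # Split on whitespace; punctuation handling is intentionally simple.
    return sentence.split()

text = "I love ice-cream. I love chocolate."
print(sentence_tokenize(text))          # ['I love ice-cream', 'I love chocolate']
print(word_tokenize("I love biryani"))  # ['I', 'love', 'biryani']
print(list("Love"))                     # ['L', 'o', 'v', 'e']
""",
language="python"
)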
st.markdown(
"""
3. Stop Words
Stop words are commonly used words in a language that carry little or no meaningful information for text analysis.
Example:
"In Hyderabad, we can eat famous biryani."
Stop words: ["in", "we", "can"]
""",
unsafe_allow_html=True
)
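st.markdown(
"""
Stop-word removal for the example sentence can be sketched with a tiny hand-picked stop-word set; libraries such as NLTK ship much larger curated lists (`nltk.corpus.stopwords`):
"""
)
st.code(
"""
# A tiny, illustrative stop-word set (not a complete list).
stop_words = {"in", "we", "can", "the", "a", "an", "and"}

sentence = "In Hyderabad, we can eat famous biryani."
# Lowercase and strip trailing punctuation before filtering.
tokens = [w.strip(",.").lower() for w in sentence.split()]
filtered = [w for w in tokens if w not in stop_words]
print(filtered)  # ['hyderabad', 'eat', 'famous', 'biryani']
""",
language="python"
)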
st.markdown(
"""
4. Vectorization
Vectorization is the process of converting text data into numerical representations so that machine learning models can process and analyze it.
Types of Vectorization:
- One-Hot Encoding: Represents each word as a binary vector.
- Bag of Words (BoW): Represents text based on word frequencies.
- TF-IDF: Adjusts word frequency by importance.
- Word2Vec: Embeds words in a vector space using deep learning.
- GloVe: Uses global co-occurrence statistics for embedding.
- FastText: Similar to Word2Vec but includes subword information.
""",
unsafe_allow_html=True
)
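st.markdown(
"""
The Bag of Words idea from the list above can be sketched by hand with `collections.Counter`; in practice one would use scikit-learn's `CountVectorizer` or `TfidfVectorizer`. The three-document corpus below is an invented example:
"""
)
st.code(
"""
from collections import Counter

corpus = ["I love biryani", "I love ice cream", "biryani is famous"]

# Build a shared vocabulary across the whole corpus.
vocab = sorted({w.lower() for doc in corpus for w in doc.split()})

def bow_vector(doc):
    # Bag of Words: count how often each vocabulary word appears.
    counts = Counter(w.lower() for w in doc.split())
    return [counts[w] for w in vocab]

print(vocab)
# ['biryani', 'cream', 'famous', 'i', 'ice', 'is', 'love']
print(bow_vector("I love biryani"))
# [1, 0, 0, 1, 0, 0, 1]
""",
language="python"
)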
st.markdown(
"""
5. Stemming
Stemming is the process of reducing words to their base or root form, often by removing prefixes or suffixes. It is a rule-based, heuristic approach to standardize words by removing derivational affixes.
Example:
- Original Words: "running", "runner", "runs"
- Stemmed Form: "run"
""",
unsafe_allow_html=True
)
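st.markdown(
"""
The rule-based idea behind stemming can be sketched with a toy suffix-stripping function; the suffix list below is invented for this example, and production code would use a proper algorithm such as NLTK's `PorterStemmer`:
"""
)
st.code(
"""
# Toy suffixes, ordered longest-first so "ning" wins over "ing".
SUFFIXES = ("ning", "ner", "ing", "ers", "er", "s")

def crude_stem(word):
    for suffix in SUFFIXES:
        # Strip the suffix only if a stem of >= 3 characters remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["running", "runner", "runs"]])
# ['run', 'run', 'run']
""",
language="python"
)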
st.markdown(
"""
6. Lemmatization
Lemmatization is the process of reducing a word to its base or root form (called a lemma) using linguistic rules and a vocabulary (dictionary). Unlike stemming, lemmatization ensures that the resulting word is a valid word in the language.
Example:
- Original Words: "studying", "better", "carrying"
- Lemmatized Form: "study", "good", "carry"
Lemmatization is more accurate than stemming but computationally more intensive as it requires a language dictionary.
""",
unsafe_allow_html=True
)
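st.markdown(
"""
Since lemmatization is dictionary-driven, a miniature lookup table captures the idea; real lemmatizers (e.g. NLTK's `WordNetLemmatizer` or spaCy) consult a full vocabulary plus part-of-speech tags, and the tiny dictionary below is illustrative only:
"""
)
st.code(
"""
# Miniature lemma dictionary for this example only.
LEMMA_DICT = {
    "studying": "study",
    "better": "good",    # irregular form: suffix stripping cannot recover this
    "carrying": "carry",
}

def lemmatize(word):
    # Fall back to the word itself when it is not in the dictionary.
    return LEMMA_DICT.get(word, word)

print([lemmatize(w) for w in ["studying", "better", "carrying"]])
# ['study', 'good', 'carry']
""",
language="python"
)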