import streamlit as st
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np
from gensim.models import Word2Vec
# Title
st.title(":red[Introduction to NLP]")
# Section: What is NLP?
st.header(":blue[What is NLP?]")
st.write("""
Natural Language Processing (NLP) is a subfield of artificial intelligence that enables computers to process, understand, and generate human language.
### Applications of NLP:
- **Chatbots & Virtual Assistants** (e.g., Siri, Alexa)
- **Sentiment Analysis** (e.g., Product reviews, Social Media monitoring)
- **Machine Translation** (e.g., Google Translate)
- **Text Summarization** (e.g., News article summaries)
- **Speech Recognition** (e.g., Voice commands)
""")
# Section: NLP Terminologies
st.header(":blue[NLP Terminologies]")
st.write("""
- **Corpus**: A collection of text documents used for NLP tasks.
- **Tokenization**: Splitting text into individual words or phrases.
- **Stop Words**: Common words (e.g., "the", "is") that are often removed.
- **Stemming**: Reducing words to their base form (e.g., "running" → "run").
- **Lemmatization**: More advanced than stemming; it converts words to their dictionary form.
- **Named Entity Recognition (NER)**: Identifies entities like names, dates, and locations.
- **Sentiment Analysis**: Determines the sentiment (positive, negative, neutral) of a text.
- **n-grams**: Sequences of 'n' consecutive words (e.g., "New York" is a bi-gram).
""")
# Section: Text Representation Methods
st.header(":blue[Text Representation Methods]")
methods = [
"Bag of Words",
"TF-IDF",
"One-Hot Encoding",
"Word Embeddings (Word2Vec)"
]
selected_method = st.radio("Select a text representation method:", methods)
if selected_method == "Bag of Words":
st.subheader(":blue[Bag of Words (BoW)]")
st.write("""
**Definition**: Bag of Words (BoW) is a simple text representation technique that converts text into numerical data by counting the occurrence of each word in a document. It ignores grammar, word order, and context.
**How it works**:
- Each unique word in a dataset becomes a feature.
- The text is converted into a frequency-based numerical representation.
- The more a word appears in a document, the higher its count.
**Uses**:
- Sentiment analysis
- Document classification
- Spam detection
- Information retrieval
**Advantages**:
✅ Simple and easy to implement
✅ Works well with traditional machine learning models
**Disadvantages**:
❌ Ignores word order and meaning
❌ High dimensionality for large vocabularies
❌ Cannot differentiate between synonyms (e.g., "happy" and "joyful")
""")
elif selected_method == "TF-IDF":
st.subheader(":blue[Term Frequency-Inverse Document Frequency (TF-IDF)]")
st.write("""
**Definition**: TF-IDF is an advanced version of Bag of Words that assigns importance to words based on how frequently they appear in a document while reducing the importance of common words.
**How it works**:
- **Term Frequency (TF)**: Measures how often a word appears in a document.
- **Inverse Document Frequency (IDF)**: Reduces the weight of words that are very common across all documents.
- The final score is calculated as: **TF × IDF**.
**Uses**:
- Information retrieval (e.g., search engines)
- Text classification
- Keyword extraction
- Document similarity detection
**Advantages**:
✅ Reduces the impact of common words like "the", "is", etc.
✅ Highlights important words in a document
✅ Better than BoW at capturing relevance
**Disadvantages**:
❌ Still ignores word order
❌ Cannot capture deep semantic meaning
❌ Computationally expensive for very large datasets
""")
elif selected_method == "One-Hot Encoding":
st.subheader(":blue[One-Hot Encoding]")
st.write("""
**Definition**: One-hot encoding is a simple representation method where each unique word in a vocabulary is represented as a binary vector.
**How it works**:
- Each word is assigned a unique index in a vocabulary.
- A word is represented as a vector where all values are 0 except for the position of that word, which is 1.
- For example, if the vocabulary consists of ["NLP", "is", "great"], then "NLP" is represented as **[1, 0, 0]**.
**Uses**:
- Simple NLP tasks
- Word-level feature engineering
- Early-stage text processing in machine learning models
**Advantages**:
✅ Simple and easy to understand
✅ Works well for small vocabulary sizes
**Disadvantages**:
❌ Inefficient for large vocabularies (results in sparse vectors)
❌ Does not capture word meaning or relationships
""")
elif selected_method == "Word Embeddings (Word2Vec)":
st.subheader(":blue[Word Embeddings (Word2Vec)]")
st.write("""
**Definition**: Word embeddings convert words into dense numerical vectors that capture semantic meaning. Unlike BoW and TF-IDF, word embeddings preserve relationships between words.
**How it works**:
- Words are represented as high-dimensional vectors (e.g., 100 or 300 dimensions).
- Words with similar meanings have closer vectors.
- It is trained using techniques like **CBOW (Continuous Bag of Words)** and **Skip-gram**.
**Uses**:
- Machine translation
- Speech recognition
- Sentiment analysis
- Document clustering
**Advantages**:
✅ Captures semantic relationships between words
✅ Works well with deep learning models
✅ Can detect synonyms and analogies (e.g., "king" - "man" + "woman" ≈ "queen")
**Disadvantages**:
❌ Requires large datasets to train
❌ Computationally expensive
❌ Needs domain-specific tuning for best performance
""")
# Footer
st.write("---")
st.write("Developed with β€οΈ using Streamlit for NLP enthusiasts.")