Spaces:

Gowthamvemula
/

nlp_machine

Sleeping

App Files Files Community

nlp_machine / app.py

Gowthamvemula

Update app.py

a18d360 verified about 1 year ago

raw

history blame contribute delete

25.9 kB

	import streamlit as st

	# App title with emoji
	st.title("💡 Behind the Scenes of NLP")

	# Sidebar navigation with icons
	st.sidebar.title("🌍 Find Your Way")

	# Define main sections and their subpoints
	sections = {
	"📚 Introduction to NLP": [],
	"🔄 Lifecycle of NLP": [
	" Problem Statement",
	" Data Collection",
	" Simple EDA",
	" Data Preprocessing",
	" Feature Extraction",
	" Model Selection",
	" Model Training and Evaluation",
	" Deployment",
	" Monitoring and Maintenance",
	],
	"⚙️ NLP Techniques": [
	" Tokenization",
	" Stemming",
	" Lemmatization",
	" Stop Words",
	" One Hot Encoding",
	" Bag Of Words",
	" Binary Bag Of Words",
	" TF-IDF",
	" Word Embeddings",
	" Part-of-Speech (POS) Tagging",
	" Named Entity Recognition (NER)",
	" Sentiment Analysis",
	],

	}

	# Display main sections and subpoints in sidebar
	selected_page = st.sidebar.radio("Steps, Guidance, Clarity", list(sections.keys()))
	selected_subpoint = None

	if sections[selected_page]:
	st.sidebar.write("### Grow & Achieve")
	selected_subpoint = st.sidebar.radio("Grow & Achieve", sections[selected_page])

	# Content rendering
	if selected_page == "📚 Introduction to NLP":
	st.header("What is Natural Language Processing (NLP)? 🧠")
	st.write("""
	What is NLP? 🤖💬:
	Natural Language Processing (NLP) is a field of Artificial Intelligence (AI) that focuses on the interaction between computers and humans through natural language.
	The goal is to enable machines to understand, interpret, and generate human language in a way that is meaningful and useful.
	### Why NLP? 💡
	NLP helps machines bridge the gap between human communication and machine understanding. It allows computers to process large amounts of unstructured text, making them capable of tasks like translation, sentiment analysis, text summarization, and more.
	""")

	st.header("Application of NLP:")
	st.write("""
	1. Chatbots and Virtual Assistants 🤖💬
	Example: Siri or Alexa understanding your voice commands and providing responses.

	2. Sentiment Analysis ❤️💔
	Example: Analyzing Twitter comments to determine if they are positive, negative, or neutral about a product.

	3. Machine Translation 🌍
	Example: Google Translate converting a sentence from Spanish to English: "Hola, ¿cómo estás?" → "Hello, how are you?"

	4. Text Summarization ✂️
	Example: Automatically summarizing a long article into a few key points.
	- NLP is crucial in making machines more intelligent and interactive by understanding and responding to human language effectively.
	""")



	elif selected_page == "🔄 Lifecycle of NLP":
	st.header("NLP Lifecycle 🔄")
	if selected_subpoint:
	if selected_subpoint == " Problem Statement":
	st.write("""
	Data collection is a critical first step in any NLP project. The quality and quantity of the data collected directly influence the success of the subsequent stages, such as data preprocessing, model training, and evaluation. Here's a breakdown of the data collection process and the different types of data formats used in NLP:
	Defining the Problem:
	- 🕵️ Identify the challenge: Understand the goal and define the scope of the NLP task.
	- 🧩 Determine the input: What type of text data will be used? For instance, customer reviews, emails, or social media posts.
	- 🎯 Specify the outcome: What result is expected? Is it classification, summarization, or language translation?

	Example:
	- 📦 Task: Build a chatbot to assist customers in an e-commerce store.
	- 💬 Input: Customer queries like "Where is my order?" or "How do I return this product?"
	- 🎉 Output: A conversational response such as "Your order is on its way!" or "Please follow these steps to return your product."
	""")
	elif selected_subpoint == " Data Collection":
	st.write("""
	Data Collection: Gather text data from various sources and handle diverse file formats.

	Supported File Types:
	- 📄 CSV: Structured data files (e.g., `data.csv`).
	- 📋 XLSX: Excel spreadsheets (e.g., `data.xlsx`).
	- 🌐 HTML: Scraped web data (e.g., `table.html`).
	- 📂 JSON: API responses or hierarchical data (e.g., `data.json`).
	- 📜 XML: Nested data like RSS feeds (e.g., `data.xml`).

	Example:
	```python
	import pandas as pd

	# Load a CSV file
	csv_file = "data.csv"
	csv_data = pd.read_csv(csv_file)

	# Load an Excel file
	excel_file = "data.xlsx"
	excel_data = pd.read_excel(excel_file)

	# Load an HTML file
	html_file = "table.html"
	html_data = pd.read_html(html_file)[0]

	# Load a JSON file
	json_file = "data.json"
	json_data = pd.read_json(json_file)

	# Load an XML file
	xml_file = "data.xml"
	xml_data = pd.read_xml(xml_file)

	# Print sample outputs
	print(csv_data.head())
	print(excel_data.head())
	print(html_data.head())
	print(json_data.head())
	print(xml_data.head())
	```
	""")
	elif selected_subpoint == " Simple EDA":
	st.write("""
	Simple EDA (Exploratory Data Analysis):
	- 📊 Visualize the Data: Use plots and histograms to understand distributions.
	- 📈 Summary Statistics: Calculate mean, median, and other stats to find patterns.
	- 🔍 Detect Anomalies: Identify missing values or outliers.

	Example:
	```python
	# Example EDA Code
	import pandas as pd
	import matplotlib.pyplot as plt

	df = pd.read_csv("data.csv")

	# Summary statistics
	print(df.describe())

	# Visualize distributions
	df['column_name'].hist()
	plt.show()
	```
	""")
	elif selected_subpoint == " Data Preprocessing":
	st.write("""
	Text preprocessing is the foundation of Natural Language Processing (NLP). It transforms raw text into a clean, structured format, making it suitable for analysis and machine learning.
	Below are the key steps involved in preprocessing:
	""")

	st.write("### 🔗 Key Steps in Text Preprocessing")
	st.write("""
	- ✂️ Tokenization: Splitting text into smaller units like words or sentences.
	- Example: `"The cat is happy"` → `["The", "cat", "is", "happy"]`

	- ❌ Stop Word Removal: Removing commonly used words (e.g., "the", "is", "and") that add little meaning to the analysis.
	- Example: `"The cat is happy"` → `["cat", "happy"]`

	- 🌱 Lemmatization: Converting words to their root form using grammar rules.
	- Example: `"running"` → `"run"`

	- ✂️ Stemming: Truncating words to their base form by removing prefixes or suffixes.
	- Example: `"runner"` → `"run"`

	- 🔡 Lowercasing: Converting all text to lowercase to maintain uniformity.
	- Example: `"Hello World!"` → `"hello world!"`

	- 🔧 HTML Tag Removal: Cleaning out HTML tags like `<a>`, `<p>`, `<b>` from text.
	- Example: `"Hello <b>World</b>!"` → `"Hello World!"`

	- 🌐 URL Removal: Stripping URLs from text.
	- Example: `"Visit us at http://example.com"` → `"Visit us at"`

	- 😊 Emoji Removal: Eliminating emojis that may not contribute to textual analysis.
	- Example: `"I love NLP! 🚀"` → `"I love NLP!"`

	- 📌 Hashtag Removal: Removing hashtags that may not be relevant for the analysis.
	- Example: `"#MachineLearning is awesome!"` → `"is awesome!"`

	- ❗ Special Character Removal: Cleaning symbols like `@`, `#`, `%`, etc.
	- Example: `"Hello @user!"` → `"Hello user"`
	""")

	st.write("### Example of Preprocessing")
	st.write("""
	Consider the sentence:
	`"Check out this amazing post! 😊 #awesome #data http://example.com 🚀 Let's talk about AI!"`

	After applying the preprocessing steps:
	- Tokenization: `["Check", "out", "this", "amazing", "post", "😊", "#awesome", "#data", "http://example.com", "🚀", "Let's", "talk", "about", "AI"]`
	- Stop Word Removal: `["Check", "amazing", "post", "😊", "#awesome", "#data", "http://example.com", "🚀", "Let's", "talk", "AI"]`
	- URL Removal: `["Check", "amazing", "post", "😊", "#awesome", "#data", "🚀", "Let's", "talk", "AI"]`
	- Emoji Removal: `["Check", "amazing", "post", "#awesome", "#data", "Let's", "talk", "AI"]`
	- Hashtag Removal: `["Check", "amazing", "post", "Let's", "talk", "AI"]`
	""")

	st.write("### Code Example")
	st.code("""
	import re
	from bs4 import BeautifulSoup

	# Sample text
	text = "Check out this amazing post! 😊 #awesome #data http://example.com 🚀 Let's talk about AI!"

	# Remove HTML tags
	cleaned_text = BeautifulSoup(text, "html.parser").get_text()

	# Remove URLs
	cleaned_text = re.sub(r'http\\S+\|www\\S+', '', cleaned_text)

	# Remove emojis
	cleaned_text = re.sub(r'[^\w\s,]', '', cleaned_text)

	# Remove hashtags
	cleaned_text = re.sub(r'#\\w+', '', cleaned_text)

	# Output cleaned text
	st.write(f"Cleaned Text: {cleaned_text}")
	""", language='python')

	st.write("""
	Final Output:
	`"Check out this amazing post Let's talk about AI"`
	""")

	elif selected_subpoint == " Feature Extraction":
	st.write("""
	Feature Extraction:
	- 🔢 Bag of Words (BoW): Represent text as a frequency matrix.
	- 🧮 TF-IDF: Measure word importance using term frequency and inverse document frequency.
	- 🤖 Word Embeddings: Represent words in continuous vector spaces (e.g., Word2Vec, GloVe).

	Example:
	```python
	from sklearn.feature_extraction.text import TfidfVectorizer

	documents = ["This is a sample document.", "Another document for testing."]
	vectorizer = TfidfVectorizer()
	features = vectorizer.fit_transform(documents)

	print(features.toarray())
	```
	""")
	elif selected_subpoint == " Model Selection":
	st.write("""
	Model Selection:
	- 🔍 Choose models based on data and requirements:
	- 📊 Classification: Logistic Regression, Naive Bayes.
	- 🔤 Sequence Models: RNNs, LSTMs.
	- 🌟 Transformers: BERT, GPT.

	Example:
	```python
	from sklearn.naive_bayes import MultinomialNB
	model = MultinomialNB()
	```
	""")
	elif selected_subpoint == " Model Training and Evaluation":
	st.write("""
	Model Training & Evaluation:
	- 📖 Train models using training datasets.
	- 📊 Evaluate using metrics like accuracy, precision, and recall.

	Example:
	```python
	from sklearn.metrics import accuracy_score
	predictions = model.predict(X_test)
	print("Accuracy:", accuracy_score(y_test, predictions))
	```
	""")
	elif selected_subpoint == " Deployment":
	st.write("""
	Deployment:
	- 🌐 Deploy models as APIs using tools like Flask or FastAPI.
	- 📦 Integrate into web or mobile apps.

	Example:
	```bash
	flask run
	```
	""")
	elif selected_subpoint == " Monitoring and Maintenance":
	st.write("""
	Monitoring & Maintenance:
	- 📡 Monitor performance using real-time data.
	- 🔄 Retrain models periodically with updated data.
	""")





	elif selected_page == "⚙️ NLP Techniques":
	st.header("⚙️ NLP Techniques")
	if selected_subpoint:
	if selected_subpoint == " Tokenization":
	st.write("### Tokenization")
	st.write("""
	Breaking down text into smaller units such as words or sentences to make it manageable for analysis.
	Example:
	- Input: `"Artificial Intelligence is fascinating."`
	- Word Tokens: `["Artificial", "Intelligence", "is", "fascinating", "."]`
	- Sentence Tokens: `["Artificial Intelligence is fascinating."]`
	""")
	elif selected_subpoint == " Stemming":
	st.write("### 🌱 Stemming")
	st.write("""
	Stemming reduces words to their root form by removing prefixes or suffixes, often resulting in a non-grammatical base.

	Example:
	- Input: `["running", "runner", "runs"]`
	- Output: `["run", "runner", "run"]` (Porter Stemmer)

	Key Points:
	- Fast and simple, but can lead to over-stemming or under-stemming.
	- Example of over-stemming: `"generous"` → `"gener"`

	Code Example:
	```python
	from nltk.stem import PorterStemmer

	stemmer = PorterStemmer()
	words = ["running", "runner", "runs"]
	print([stemmer.stem(word) for word in words])
	# Output: ['run', 'runner', 'run']
	```
	""")

	elif selected_subpoint == " Lemmatization":
	st.write("### 🌿 Lemmatization")
	st.write("""
	Lemmatization reduces words to their dictionary base form (lemma), ensuring grammatical correctness.

	Example:
	- Input: `["running", "ran", "better"]`
	- Output: `["run", "run", "good"]`

	Key Points:
	- Context-aware and accurate.
	- More computationally intensive than stemming.

	Code Example:
	```python
	from nltk.stem import WordNetLemmatizer

	lemmatizer = WordNetLemmatizer()
	words = ["running", "ran", "better"]
	print([lemmatizer.lemmatize(word, pos="v") for word in words])
	```
	""")
	elif selected_subpoint == " Stop Words":
	st.write("### 🚫 Stop Words")
	st.write("""
	Stop words are common words (e.g., the, is) that are removed during text processing as they don't add much meaning.

	Example:
	- Input: `"This is a simple sentence."`
	- Output: `"simple sentence"`

	Code Example:
	```python
	from nltk.corpus import stopwords

	stop_words = set(stopwords.words("english"))
	sentence = "This is a simple sentence."
	filtered_sentence = [word for word in sentence.split() if word.lower() not in stop_words]
	print(filtered_sentence) # Output: ['simple', 'sentence']
	```
	""")

	elif selected_subpoint == " One Hot Encoding":
	st.write("### 🔥 One-Hot Encoding")
	st.write("""
	Representing categorical data as binary vectors to make it suitable for machine learning models.

	How it works:
	- Each unique category is assigned a unique binary vector.
	- A binary vector has all values as `0` except for the position representing the category, which is `1`.

	Example:
	- Categories: `["Apple", "Banana", "Cherry"]`
	- Encoding:
	- `Apple`: `[1, 0, 0]`
	- `Banana`: `[0, 1, 0]`
	- `Cherry`: `[0, 0, 1]`
	""")

	st.write("### 🍎 Example with Fruits")
	st.write("""
	Input Categories: `["Apple", "Banana", "Cherry", "Banana", "Apple"]`
	Output (One-Hot Encoding):
	- `Apple`: `[1, 0, 0]`
	- `Banana`: `[0, 1, 0]`
	- `Cherry`: `[0, 0, 1]`
	- `Banana`: `[0, 1, 0]`
	- `Apple`: `[1, 0, 0]`
	""")

	elif selected_subpoint == " Bag Of Words":
	st.write("### 👜 Bag of Words (BoW)")
	st.write("""
	Bag of Words converts text into a matrix of word frequencies, where each word is represented by a unique index in the vocabulary.

	Example:
	- Input: `["I love NLP", "NLP is fun"]`
	- Vocabulary: `["I", "love", "NLP", "is", "fun"]`
	- BoW Matrix:
	- `"I love NLP"`: `[1, 1, 1, 0, 0]`
	- `"NLP is fun"`: `[0, 0, 1, 1, 1]`

	Code Example:
	```python
	from sklearn.feature_extraction.text import CountVectorizer

	documents = ["I love NLP", "NLP is fun"]
	vectorizer = CountVectorizer()
	bow_matrix = vectorizer.fit_transform(documents)

	print(bow_matrix.toarray()) # Output: [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]
	```
	""")

	elif selected_subpoint == " Binary Bag Of Words":
	st.write("### 🔲 Binary Bag of Words")
	st.write("""
	Binary Bag of Words is a variation of the BoW model where each word is represented by `1` if present in the document and `0` if absent, ignoring word frequencies.

	Example:
	- Input: `["I love NLP", "NLP is fun"]`
	- Vocabulary: `["I", "love", "NLP", "is", "fun"]`
	- Binary BoW Matrix:
	- `"I love NLP"`: `[1, 1, 1, 0, 0]`
	- `"NLP is fun"`: `[0, 0, 1, 1, 1]`

	Code Example:
	```python
	from sklearn.feature_extraction.text import CountVectorizer

	documents = ["I love NLP", "NLP is fun"]
	vectorizer = CountVectorizer(binary=True)
	binary_bow_matrix = vectorizer.fit_transform(documents)

	print(binary_bow_matrix.toarray()) # Output: [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]
	```
	""")

	elif selected_subpoint == " TF-IDF":
	st.write("### 🧮 TF-IDF (Term Frequency - Inverse Document Frequency)")
	st.write("""
	TF-IDF is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. It considers two factors:
	- Term Frequency (TF): The frequency of a word in a document.
	- Inverse Document Frequency (IDF): The importance of the word across all documents in the corpus. Words that appear in many documents are less important.

	The formula for TF-IDF is:
	- *TF-IDF = TF IDF**

	Example:
	Consider three documents:
	1. `"I love programming"`
	2. `"Programming is fun"`
	3. `"I love Python programming"`

	- TF (for "programming"):
	- Document 1: `1/3`
	- Document 2: `1/3`
	- Document 3: `1/3`

	- IDF (for "programming"):
	- IDF = log(3/3) = 0 (common word, less informative)

	Code Example:
	```python
	from sklearn.feature_extraction.text import TfidfVectorizer

	documents = ["I love programming", "Programming is fun", "I love Python programming"]
	vectorizer = TfidfVectorizer()
	tfidf_matrix = vectorizer.fit_transform(documents)

	print(tfidf_matrix.toarray()) # Output will show TF-IDF scores for each word in each document
	```
	""")

	elif selected_subpoint == " Word Embeddings":
	st.write("### 🤖 Word Embeddings")
	st.write("""
	Word embeddings are dense vector representations of words in a continuous vector space, capturing semantic meanings and relationships between words.

	Types of Word Embeddings:

	1. Word2Vec: Learns word associations from context using two approaches: Word2Vec is a model that transforms words into dense vector representations in a continuous vector space, capturing semantic relationships. It learns these representations by predicting words based on their context.

	- Skip-gram model: This model predicts context words from a target word, similar to Word2Vec's Skip-gram model. It's useful for capturing word relationships, and it works well with smaller datasets.

	- CBOW (Continuous Bag of Words) model: This model predicts a target word from a context window of words, similar to Word2Vec's CBOW model. It's effective for larger datasets and works well when words occur frequently.

	2. GloVe (Global Vectors for Word Representation): Uses a co-occurrence matrix to capture the relationships between words. It factors the matrix to produce low-dimensional vectors.

	3. FastText: Extends Word2Vec by breaking words into subword units, which helps capture morphology and represent rare or unseen words.


	Example:
	- Words like `"king"` and `"queen"` will have similar vector representations in embedding space, reflecting their semantic relationship.

	Code Example (using Word2Vec):
	```python
	from gensim.models import Word2Vec

	# Sample sentences
	sentences = [["I", "love", "programming"], ["Word", "embeddings", "are", "cool"]]

	# Train Word2Vec model
	model = Word2Vec(sentences, min_count=1)

	# Get the vector for the word 'programming'
	vector = model.wv['programming']
	print(vector)
	```
	""")

	elif selected_subpoint == " Part-of-Speech (POS) Tagging":
	st.write("### 🖇️ Part-of-Speech (POS) Tagging")
	st.write("""
	Assigning grammatical labels to each word in a sentence, indicating its role in context.
	Example:
	- Input: `"Birds fly high"`
	- Output: `["Birds (NOUN)", "fly (VERB)", "high (ADJ)"]`
	""")

	elif selected_subpoint == " Named Entity Recognition (NER)":
	st.write("### 🌍 Named Entity Recognition (NER)")
	st.write("""
	Detecting and categorizing entities like names, dates, and locations from text.
	Example:
	- Input: `"Tesla, founded by Elon Musk, is based in California."`
	- Output: `["Tesla (ORGANIZATION)", "Elon Musk (PERSON)", "California (LOCATION)"]`
	""")

	elif selected_subpoint == " Sentiment Analysis":
	st.write("### 🎭 Sentiment Analysis")
	st.write("""
	Classifying the emotional tone of a text into categories such as positive, negative, or neutral.
	Example:
	- Input: `"The service was exceptional!"`
	- Output: `Positive`
	""")




	# Footer
	st.sidebar.write("---")
	st.sidebar.write("🔍 Made with Streamlit.")