Rajesh6 committed
Commit 2370497 · verified · Parent(s): 9ccbd79

Update pages/Introduction.py

Files changed (1):
  1. pages/Introduction.py +27 -13

pages/Introduction.py CHANGED
@@ -15,7 +15,7 @@ st.markdown("<p>NLP encompasses a wide array of techniques aimed at enabling

 st.markdown('<p style="color:blue;"><b>1. Text Processing and Preprocessing In NLP</b></p>', unsafe_allow_html=True)
 st.write("Before performing any analysis or modeling, raw text data must be cleaned and prepared.")
- st.markdown('<p style="color:blue;"><b>a. Tokenization</b></p>', unsafe_allow_html=True)
 st.write("Splits text into smaller units like words or sentences.")
 st.write("**Types:**")
@@ -25,34 +25,34 @@ st.write("Example: _'I love NLP'_ → [‘I’, ‘love’, ‘NLP’]")
 st.write("**(ii) Sentence Tokenization:** Breaking text into sentences.")
 st.write("Example: _'I love NLP. It’s fascinating!'_ → [‘I love NLP.’, ‘It’s fascinating!’]")

- st.markdown('<p style="color:blue;"><b>b. Stopword Removal</b></p>', unsafe_allow_html=True)
 st.write("Removes common words like “the,” “and,” “is” that do not contribute much to analysis.")


- st.markdown('<p style="color:blue;"><b>c. Stemming and Lemmatization</b></p>', unsafe_allow_html=True)
 st.write("Stemming: Reduces words to their base or root form by chopping off suffixes (may not produce valid words).")
 st.write("Example: _“running”_ → “run”")

 st.write("Lemmatization: Converts words to their base form using vocabulary and grammar.")
 st.write("Example: _“better”_ → “good”")

- st.markdown('<p style="color:blue;"><b>d. Part-of-Speech (POS) Tagging</b></p>', unsafe_allow_html=True)
 st.write("Labels words with their grammatical roles (noun, verb, adjective, etc.).")
 st.write("Example: _“The cat sleeps”_ → [“The/DET”, “cat/NOUN”, “sleeps/VERB”]")

- st.markdown('<p style="color:blue;"><b>e. Named Entity Recognition (NER)</b></p>', unsafe_allow_html=True)
 st.write("Identifies and classifies entities in text (e.g., names, dates, locations).")
 st.write("Example: _“Barack Obama was born in Hawaii.”_ → [Barack Obama: PERSON, Hawaii: LOCATION]")


- st.markdown('<p style="color:blue;"><b>f. Text Normalization</b></p>', unsafe_allow_html=True)
 st.write("Converts text to a standard format (lowercasing, removing punctuation, etc.).")


 st.markdown('<p style="color:blue;"><b>2. Feature Extraction Techniques</b></p>', unsafe_allow_html=True)
 st.write("Text needs to be transformed into numerical representations for machine learning models.")

- st.markdown('<p style="color:blue;"><b>a. Bag of Words (BoW)</b></p>', unsafe_allow_html=True)
 st.write("Represents text as a vector of word frequencies or occurrences, ignoring grammar and order.")
 st.write("Examples:")
 st.write("Text: “I love NLP” and “NLP is great”")
@@ -60,12 +60,26 @@ st.write("Vocabulary: [“I”, “love”, “NLP”, “is”, “great”]")
 st.write("Vector for “I love NLP”: [1, 1, 1, 0, 0]")


- st.markdown('<p style="color:blue;"><b>b. Term Frequency-Inverse Document Frequency (TF-IDF)</b></p>', unsafe_allow_html=True)
- st.write("Assigns weights to words based on their frequency in a document and their rarity across all documents.")
- st.write("Examples:")
- st.write("Text: “I love NLP” and “NLP is great”")
- st.write("Vocabulary: [“I”, “love”, “NLP”, “is”, “great”]")
- st.write("Vector for “I love NLP”: [1, 1, 1, 0, 0]")
 
 st.markdown('<p style="color:blue;"><b>1. Text Processing and Preprocessing In NLP</b></p>', unsafe_allow_html=True)
 st.write("Before performing any analysis or modeling, raw text data must be cleaned and prepared.")
+ st.markdown('<p style="color:lightblue;"><b>a. Tokenization</b></p>', unsafe_allow_html=True)
 st.write("Splits text into smaller units like words or sentences.")
 st.write("**Types:**")

 st.write("**(ii) Sentence Tokenization:** Breaking text into sentences.")
 st.write("Example: _'I love NLP. It’s fascinating!'_ → [‘I love NLP.’, ‘It’s fascinating!’]")
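Both tokenization types can be sketched in plain Python. This is a rough regex-based stand-in, not a real tokenizer (libraries such as NLTK or spaCy handle punctuation, contractions, and abbreviations far more carefully):

```python
import re

def word_tokenize(text):
    # Word tokenization: pull out runs of word characters.
    return re.findall(r"\w+", text)

def sent_tokenize(text):
    # Sentence tokenization: naive split after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(word_tokenize("I love NLP"))                    # ['I', 'love', 'NLP']
print(sent_tokenize("I love NLP. It's fascinating!"))  # ['I love NLP.', "It's fascinating!"]
```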
 
+ st.markdown('<p style="color:lightblue;"><b>b. Stopword Removal</b></p>', unsafe_allow_html=True)
 st.write("Removes common words like “the,” “and,” “is” that do not contribute much to analysis.")

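A minimal sketch of stopword filtering, using a tiny illustrative stopword set (real lists, e.g. NLTK's, contain a hundred or more entries):

```python
STOPWORDS = {"the", "and", "is", "a", "an", "of", "to", "in"}  # toy list for illustration

def remove_stopwords(tokens):
    # Keep only tokens whose lowercase form is not a stopword.
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["the", "cat", "is", "cute"]))  # ['cat', 'cute']
```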
+ st.markdown('<p style="color:lightblue;"><b>c. Stemming and Lemmatization</b></p>', unsafe_allow_html=True)
 st.write("Stemming: Reduces words to their base or root form by chopping off suffixes (may not produce valid words).")
 st.write("Example: _“running”_ → “run”")

 st.write("Lemmatization: Converts words to their base form using vocabulary and grammar.")
 st.write("Example: _“better”_ → “good”")
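The contrast can be sketched with a naive suffix-chopping stemmer and a lookup table standing in for a lemmatizer's vocabulary (real tools: NLTK's PorterStemmer and WordNetLemmatizer). Note the stemmer can emit non-words, exactly the caveat above:

```python
def naive_stem(word):
    # Stemming: chop common suffixes; may produce invalid words.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemmatization needs vocabulary and grammar; a toy lookup table stands in here.
LEMMA_TABLE = {"better": "good", "ran": "run", "mice": "mouse"}

def naive_lemmatize(word):
    return LEMMA_TABLE.get(word, word)

print(naive_stem("walking"))      # 'walk'
print(naive_stem("running"))      # 'runn' (not a valid word: the stemming caveat)
print(naive_lemmatize("better"))  # 'good'
```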
+ st.markdown('<p style="color:lightblue;"><b>d. Part-of-Speech (POS) Tagging</b></p>', unsafe_allow_html=True)
 st.write("Labels words with their grammatical roles (noun, verb, adjective, etc.).")
 st.write("Example: _“The cat sleeps”_ → [“The/DET”, “cat/NOUN”, “sleeps/VERB”]")
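A toy lexicon-lookup tagger reproduces the example above; real taggers use context and statistical or neural models rather than a fixed word list:

```python
# Toy tag lexicon; anything not listed gets 'X' in this sketch.
LEXICON = {"the": "DET", "a": "DET", "cat": "NOUN", "dog": "NOUN",
           "sleeps": "VERB", "runs": "VERB"}

def pos_tag(tokens):
    # Label each token word/TAG, looking the word up case-insensitively.
    return [f"{t}/{LEXICON.get(t.lower(), 'X')}" for t in tokens]

print(pos_tag(["The", "cat", "sleeps"]))  # ['The/DET', 'cat/NOUN', 'sleeps/VERB']
```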
+ st.markdown('<p style="color:lightblue;"><b>e. Named Entity Recognition (NER)</b></p>', unsafe_allow_html=True)
 st.write("Identifies and classifies entities in text (e.g., names, dates, locations).")
 st.write("Example: _“Barack Obama was born in Hawaii.”_ → [Barack Obama: PERSON, Hawaii: LOCATION]")

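The example can be mimicked with a gazetteer (a hand-made entity list); real NER systems such as spaCy's learn entity types from context instead of matching fixed strings:

```python
# Toy gazetteer mapping known entity strings to labels.
GAZETTEER = {"Barack Obama": "PERSON", "Hawaii": "LOCATION"}

def ner(text):
    # Return (entity, label) pairs for every gazetteer entry found in the text.
    return [(name, label) for name, label in GAZETTEER.items() if name in text]

print(ner("Barack Obama was born in Hawaii."))
# [('Barack Obama', 'PERSON'), ('Hawaii', 'LOCATION')]
```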
+ st.markdown('<p style="color:lightblue;"><b>f. Text Normalization</b></p>', unsafe_allow_html=True)
 st.write("Converts text to a standard format (lowercasing, removing punctuation, etc.).")

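One common normalization recipe, lowercasing plus punctuation stripping, in a few lines:

```python
import string

def normalize(text):
    # Lowercase, then delete all ASCII punctuation characters.
    text = text.lower()
    return text.translate(str.maketrans("", "", string.punctuation))

print(normalize("I LOVE NLP!!!"))  # 'i love nlp'
```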
 st.markdown('<p style="color:blue;"><b>2. Feature Extraction Techniques</b></p>', unsafe_allow_html=True)
 st.write("Text needs to be transformed into numerical representations for machine learning models.")

+ st.markdown('<p style="color:lightblue;"><b>a. Bag of Words (BoW)</b></p>', unsafe_allow_html=True)
 st.write("Represents text as a vector of word frequencies or occurrences, ignoring grammar and order.")
 st.write("Examples:")
 st.write("Text: “I love NLP” and “NLP is great”")
 st.write("Vocabulary: [“I”, “love”, “NLP”, “is”, “great”]")
 st.write("Vector for “I love NLP”: [1, 1, 1, 0, 0]")

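The BoW example above can be reproduced directly (building the vocabulary in first-seen order, then counting occurrences per document), which is essentially what scikit-learn's CountVectorizer does:

```python
def bow_vectors(docs):
    # Collect vocabulary in first-seen order across all documents.
    tokenized = [d.split() for d in docs]
    vocab = []
    for tokens in tokenized:
        for t in tokens:
            if t not in vocab:
                vocab.append(t)
    # One count vector per document, aligned with the vocabulary.
    vectors = [[tokens.count(t) for t in vocab] for tokens in tokenized]
    return vocab, vectors

vocab, vectors = bow_vectors(["I love NLP", "NLP is great"])
print(vocab)       # ['I', 'love', 'NLP', 'is', 'great']
print(vectors[0])  # [1, 1, 1, 0, 0]
```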
+ st.markdown('<p style="color:lightblue;"><b>b. Term Frequency-Inverse Document Frequency (TF-IDF)</b></p>', unsafe_allow_html=True)
+ st.write("**TF-IDF** stands for Term Frequency–Inverse Document Frequency. It measures how relevant a word is to a particular document within a corpus: a word's weight increases with the number of times it appears in that document, but is offset by how frequently the word occurs across the whole corpus (data set).")
+ st.write("**Term Frequency:** The term frequency of a term t in a document d is the number of times t occurs in d, so a word becomes more relevant the more often it appears in the text, which is intuitive. Since word order is ignored, each document can be described by a vector, as in the bag-of-words model: for each term in the vocabulary there is an entry whose value is that term's frequency.")
+ st.write("The weight of a term that occurs in a document is simply proportional to the term frequency.")
+ st.write("tf(t,d) = count of t in d / number of words in d")
+
+ st.write("**Document Frequency:** This measures a term's importance across the whole corpus rather than within a single document. While tf counts the occurrences of a term t inside a document d, df counts how many of the N documents in the collection contain t at least once.")
+ st.write("df(t) = occurrence of t in documents")
+
+ st.write("**Inverse Document Frequency:** This measures how informative a word is. Since tf treats all terms as equally important, term frequencies alone cannot weight terms for retrieving the documents relevant to a query. First, find the document frequency of a term t by counting the number of documents containing the term:")
+ st.write("df(t) = N(t)")
+ st.write("where")
+ st.write("df(t) = Document frequency of a term t")
+ st.write("N(t) = Number of documents containing the term t")
+
+
+ st.write("Term frequency counts a term's instances within a single document, whereas document frequency counts the separate documents that contain the term and therefore depends on the entire corpus. The inverse document frequency of a term is the number of documents in the corpus divided by its document frequency.")
+ st.write("idf(t) = N / df(t) = N / N(t)")
+ st.write("This raw ratio penalizes common words too harshly, so we take the logarithm (with base 2) of the inverse document frequency. The idf of the term t then becomes:")
+ st.write("idf(t) = log(N / df(t))")
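Using the two example documents from the BoW section, the tf and idf formulas above can be computed in plain Python (math.log2 matches the base-2 logarithm stated):

```python
import math

docs = [["I", "love", "NLP"], ["NLP", "is", "great"]]
N = len(docs)  # number of documents in the corpus

def tf(t, d):
    # tf(t,d) = count of t in d / number of words in d
    return d.count(t) / len(d)

def df(t):
    # df(t) = number of documents containing the term t
    return sum(1 for d in docs if t in d)

def idf(t):
    # idf(t) = log2(N / df(t))
    return math.log2(N / df(t))

def tfidf(t, d):
    return tf(t, d) * idf(t)

print(round(tfidf("love", docs[0]), 3))  # 0.333: (1/3) * log2(2/1)
print(tfidf("NLP", docs[0]))             # 0.0: "NLP" is in both docs, idf = log2(1) = 0
```

Note how "NLP", appearing in every document, gets zero weight, while "love", unique to one document, keeps its full term frequency scaled by idf.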