# NLPHub/stages/feature_engineering.py
import streamlit as st
def main():
    st.title("Step 6: Feature Engineering")
    st.markdown("""
### :mag: What is Text Vectorization? :bar_chart:

**Feature engineering** for text data centres on **text vectorization**: the process of transforming unstructured text into numerical form so that machine learning models can learn from it. Why is this needed?

**:brain: Why is Vectorization Necessary?**

- **Machine learning models understand only numbers**: Algorithms operate on numerical data, so raw text must be converted into numbers before a model can process it.
- **It makes text usable for models**: Vectorization translates the meaning of words, sentences, and documents into a format models can interpret, letting them identify patterns, relationships, and insights.

**:bulb: Think of it this way**: Text is a language, but models only speak in numbers. Vectorization is the translator that lets them communicate effectively!

**Common Vectorization Techniques:**

- **Bag of Words (BoW)**: Counts word occurrences in a document, turning each document into a numerical vector.
- **TF-IDF (Term Frequency-Inverse Document Frequency)**: Weighs words by how often they appear in a document and how rare they are across all documents.
- **Word Embeddings**: Models such as **Word2Vec**, **GloVe**, and **FastText** represent words as dense vectors that capture semantic meaning and context.

**:key: In Short**: Vectorization transforms raw text into a format that machine learning models can process and understand. Without it, models would have no way to interpret the meaning hidden in text data.
""")
st.divider()
main()
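The word-embedding idea can be illustrated without a trained model. The toy lookup table below uses hand-picked 3-dimensional vectors (made up for this sketch, not real Word2Vec or GloVe output) to show how dense vectors let us compare words by cosine similarity:

```python
import numpy as np

# Toy embedding table: illustrative 3-d vectors. Real Word2Vec/GloVe
# embeddings typically have 100-300 dimensions learned from large corpora.
embeddings = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.1, 0.9]),
}

def cosine(u, v):
    """Cosine similarity: 1.0 for parallel vectors, ~0.0 for orthogonal."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Semantically related words sit close together in the vector space
print(cosine(embeddings["cat"], embeddings["dog"]))  # high
print(cosine(embeddings["cat"], embeddings["car"]))  # low
```

This geometric closeness is what BoW and TF-IDF cannot capture: in those representations, "cat" and "dog" are just two unrelated columns.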