CIS5190abcd/svm

yitingliii committed on Dec 13, 2024

Commit ada45ca (verified) · 1 Parent(s): ca7966e

Update README.md

Files changed (1)

README.md +38 -55

@@ -1,5 +1,27 @@
 # SVM Model with TF-IDF
-This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF). The repository also includes utilities for data preprocessing and feature extraction.:
 ## Installation
 <br>Before running the code, ensure you have all the required libraries installed:
@@ -7,72 +29,31 @@ This repository provides a pre-trained Support Vector Machine (SVM) model for te
 pip install nltk beautifulsoup4 scikit-learn pandas datasets
 ```
 <br> Download the necessary NLTK resources for preprocessing.
-```python
-import nltk
-nltk.download('stopwords')
-nltk.download('wordnet')
 ```
-# How to Use:
-1. Data Cleaning
-<br> The data_cleaning.py file contains a clean() function to preprocess the input dataset:
-- Removes HTML tags.
-- Removes non-alphanumeric characters and extra spaces.
-- Converts text to lowercase.
-- Removes stopwords.
-- Lemmatizes words.
-```python
-from data_cleaning import clean
-import pandas as pd
 import nltk
 nltk.download('stopwords')
-# Load your data
-df = pd.read_csv("hf://datasets/CIS5190abcd/headlines_test/test_cleaned_headlines.csv")
-# Clean the data
-cleaned_df = clean(df)
 ```
-2. TF-IDF Feature Extraction
-<br> The tfidf.py file contains the TF-IDF vectorization logic. It converts cleaned text data into numerical features suitable for training and testing the SVM model.
-```python
-from tfidf import tfidf
-# Apply TF-IDF vectorization
-X_train_tfidf = tfidf.fit_transform(X_train['title'])
-X_test_tfidf = tfidf.transform(X_test['title'])
 ```
-3. Training and Testing the SVM Model
-<br> The svm.py file contains the logic for training and testing the SVM model. It uses the TF-IDF-transformed features to classify text data.
-```python
-from sklearn.svm import SVC
-from sklearn.metrics import accuracy_score, classification_report
-# Train the SVM model
-svm_model = SVC(kernel='linear', random_state=42)
-svm_model.fit(X_train_tfidf, y_train)
-# Predict and evaluate
-y_pred = svm_model.predict(X_test_tfidf)
-accuracy = accuracy_score(y_test, y_pred)
-print(f"SVM Accuracy: {accuracy:.4f}")
-print(classification_report(y_test, y_pred))
 ```
-4. Training a new dataset with pre-trained model
-<br>To test a new dataset, follow the steps below：
 - Clean the Dataset
 ```python
 from data_cleaning import clean
 import pandas as pd
-# Load your dataset
-df = pd.read_csv('test_data_random_subset.csv')
 # Clean the data
 cleaned_df = clean(df)
@@ -97,3 +78,5 @@ predictions = svm_model.predict(X_new_tfidf)
 ```
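
For readers unfamiliar with the TF-IDF step this README relies on, here is a minimal pure-Python sketch of how a headline could be turned into a weighted vector. It is not the code in `tfidf.py`; the function name `tfidf_vectors` is hypothetical, tokenization is a plain whitespace split, and the smoothed IDF formula with L2 normalization is an assumption chosen to resemble common library defaults.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of whitespace-tokenized documents.

    Hypothetical helper for illustration only; not the repository's tfidf.py.
    """
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter(tok for doc in tokenized for tok in set(doc))

    # Smoothed inverse document frequency (assumed variant).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vec = [counts[t] * idf[t] for t in vocab]
        # L2-normalize so headline length does not dominate the weights.
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


vocab, vectors = tfidf_vectors(["stocks rally today", "stocks fall", "rain today"])
```

Terms that appear in fewer documents (such as `rally` here) receive a larger weight than terms shared across documents (such as `stocks`), which is the property a linear SVM can then exploit.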

 # SVM Model with TF-IDF
+This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF) features. The repository also includes utilities for data preprocessing and feature extraction.
+## Start:
+<br>Open your terminal.
+<br> Clone the repo by using the following command:
+```
+git clone https://huggingface.co/CIS5190abcd/svm
+```
+<br> Go to the svm directory using the following command:
+```
+cd svm
+```
+<br> Run ```ls``` to check the files inside the svm folder. Make sure ```tfidf.py```, ```svm.py``` and ```data_cleaning.py``` exist in this directory. If not, run the following commands:
+```
+git checkout origin/main -- tfidf.py
+git checkout origin/main -- svm.py
+git checkout origin/main -- data_cleaning.py
+```
+<br> Rerun ```ls``` and double-check that all the required files exist. The output should look like this:
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/6755cffd784ff7ea9db10bd4/O9K5zYm7TKiIg9cYZpV1x.png)
+<br> Stay inside the svm directory for the remaining steps.
 ## Installation
 <br>Before running the code, ensure you have all the required libraries installed:
 pip install nltk beautifulsoup4 scikit-learn pandas datasets
 ```
 <br> Download the necessary NLTK resources for preprocessing.
 ```
+python
 import nltk
 nltk.download('stopwords')
+nltk.download('wordnet')
 ```
+<br> After downloading all the required resources, exit the Python interpreter:
 ```
+exit()
 ```
+## How to use:
+To run the existing SVM model on a new dataset, follow the steps below:
 - Clean the Dataset
 ```python
 from data_cleaning import clean
 import pandas as pd
+import nltk
+nltk.download('stopwords')
+```
+<br> You can use any dataset you want by changing the file name inside ```pd.read_csv()```.
+```
+# Load your data
+df = pd.read_csv("hf://datasets/CIS5190abcd/headlines_test/test_cleaned_headlines.csv")
 # Clean the data
 cleaned_df = clean(df)
 ```
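
To give a rough idea of what a cleaning step like `clean()` might do per headline, the sketch below strips HTML tags, drops non-alphanumeric characters and extra spaces, lowercases, and removes stopwords. It is an illustration under assumptions, not the code in `data_cleaning.py`: the function name `clean_text` and the tiny `STOPWORDS` set are hypothetical, and lemmatization is left out because it needs the NLTK `wordnet` resource.

```python
import re
from html import unescape

# Tiny illustrative stopword set (assumption); the README downloads
# NLTK's 'stopwords' corpus for the real list.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "on", "and", "for"}


def clean_text(text):
    """Hypothetical per-string cleaner mirroring the steps listed in the README."""
    text = re.sub(r"<[^>]+>", " ", text)          # remove HTML tags
    text = unescape(text)                          # decode entities such as &amp;
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # remove non-alphanumeric characters
    text = text.lower()                            # convert to lowercase
    # split() also collapses runs of whitespace; then drop stopwords.
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


cleaned = clean_text("<b>Stocks</b> are   UP 5% in the U.S.!")
print(cleaned)
```

With pandas, a function like this could be applied to a text column via `df["title"].apply(clean_text)`, assuming the dataset has a `title` column as the earlier TF-IDF snippet suggests.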