CIS5190abcd
/

svm

yitingliii committed on Dec 13, 2024

Commit

9c9929c

verified ·

1 Parent(s): e1bbe05

Update README.md

Files changed (1)

README.md CHANGED

@@ -2,16 +2,58 @@
 This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF) features. The repository also includes utilities for data preprocessing and feature extraction.
 There are two ways to test our model:
-# 1.Colab
 ## Start
 <br> Download all the files.
 <br> Copy all the code below into Colab.
 ```python
 pip install nltk beautifulsoup4 scikit-learn pandas datasets fsspec huggingface_hub
 ```
 # 2. Terminal

 This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF) features. The repository also includes utilities for data preprocessing and feature extraction.
 There are two ways to test our model:
+# 1. Colab (see the file in this repository for what the Colab looks like)
 ## Start
 <br> Download all the files.
 <br> Copy all the code below into Colab.
+<br>Before running the code, ensure you have all the required libraries installed:
 ```python
 pip install nltk beautifulsoup4 scikit-learn pandas datasets fsspec huggingface_hub
 ```
+<br> Download the NLTK resources needed for preprocessing.
+```python
+import nltk
+nltk.download('stopwords')
+nltk.download('wordnet')
+nltk.download('omw-1.4')
+```
+<br>Clean the Dataset
+```python
+from data_cleaning import clean
+import pandas as pd
+import nltk
+nltk.download('stopwords')
+```
+<br> You can substitute any dataset you want by changing the file path passed to ```pd.read_csv()```.
+```python
+df = pd.read_csv("hf://datasets/CIS5190abcd/headlines_test/test_cleaned_headlines.csv")
+cleaned_df = clean(df)
+```
+- Extract TF-IDF Features
+```python
+from tfidf import tfidf
+X_new_tfidf = tfidf.transform(cleaned_df['title'])
+```
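
To make the feature extraction step concrete, here is a minimal pure-Python sketch of what a TF-IDF `transform` call computes. It assumes the repository's `tfidf` object behaves like scikit-learn's `TfidfVectorizer` with default settings (smoothed IDF, L2 normalization), which the README does not confirm. The helper names `fit_idf` and `transform` and the toy headlines are hypothetical.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn smoothed IDF weights: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def transform(doc, idf):
    """Map one document to L2-normalized TF-IDF weights.

    Terms that were not seen when fitting are ignored, which is why new
    text must be transformed with the already-fitted vocabulary.
    """
    counts = Counter(term for term in doc.split() if term in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


# Toy headlines (hypothetical, not from the repository's dataset).
docs = ["markets rally today", "markets fall today", "team wins today"]
idf = fit_idf(docs)
vec = transform("markets rally today", idf)
print(vec)
```

Rarer terms receive larger weights than terms that appear in every document, so here `rally` outweighs `today`.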
+- Make Predictions
+```python
+from svm import svm_model
+```
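
The README stops at the import, so the sketch below illustrates how a prediction step typically looks with scikit-learn. It is not the repository's code: it trains a small stand-in `LinearSVC` on hypothetical toy headlines, whereas the repository ships a pre-trained `svm_model`. If `svm_model` is a scikit-learn estimator (an assumption), the equivalent call would presumably be `svm_model.predict(X_new_tfidf)`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy training data (hypothetical; not the repository's dataset or labels).
train_titles = [
    "stocks rally as markets close higher",
    "central bank raises interest rates",
    "team wins championship in overtime",
    "star player injured before final match",
]
train_labels = [0, 0, 1, 1]  # illustrative labels: 0 = finance, 1 = sports

# Fit the vectorizer on the training text only, then fit the classifier.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_titles)
model = LinearSVC(random_state=0)
model.fit(X_train, train_labels)

# New text goes through transform() (not fit_transform()) so that it is
# mapped into the same feature space the model was trained on.
new_titles = ["markets fall after rates decision", "player scores in final match"]
X_new = vectorizer.transform(new_titles)
predictions = model.predict(X_new)
print(predictions)
```

The key design point is that the same fitted vectorizer is reused for new data; refitting it would change the feature columns and make the trained model's weights meaningless.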
 # 2. Terminal