CIS5190abcd/svm

yitingliii committed on Dec 13, 2024

Commit 9d8f216 (verified) · 1 Parent(s): fad1a71

Update README.md

README.md +37 -14

@@ -1,5 +1,5 @@
 # SVM Model with TF-IDF
-Step by step instruction:
 ## Installation
 <br>Before running the code, ensure you have all the required libraries installed:
@@ -14,18 +14,13 @@ nltk.download('wordnet')
 ```
 # How to Use:
-1. Pre-Trained Model and Vectorizer
-<br> The repository includes:
-- model.pkl : The pre-trained SVM model
-- tfidf.pkl: The saved TF-IDF vectorizer used to transform the text data.
-2. Testing a new dataset
-<br> To test the model with the new dataset, follow these steps:
-- Step 1: Prepare the dataset:
-<br> Ensure the dataset is in CSV format and has three columns: title, outlet, and labels. The title column contains the text data to be classified.
-- Step 2: Preprocess the Data
-<br>Use the clean() function from data_cleaning.py to preprocess the text data:
 ```python
 from data_cleaning import clean
@@ -39,5 +34,33 @@ cleaned_df = clean(df)
 ```
-- Step 3: Load the pre-trained model and TF-IDF Vectorizer

 # SVM Model with TF-IDF
+This repository provides a pre-trained Support Vector Machine (SVM) model for text classification using Term Frequency-Inverse Document Frequency (TF-IDF) features. The repository also includes utilities for data preprocessing and feature extraction.
 ## Installation
 <br>Before running the code, ensure you have all the required libraries installed:
 ```
 # How to Use:
+1. Data Cleaning
+<br> The data_cleaning.py file contains a clean() function to preprocess the input dataset:
+- Removes HTML tags.
+- Removes non-alphanumeric characters and extra spaces.
+- Converts text to lowercase.
+- Removes stopwords.
+- Lemmatizes words.
 ```python
 from data_cleaning import clean
 ```
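
The cleaning steps listed above can be sketched for a single title as follows. This is a minimal illustration, not the repository's `clean()` implementation: the `clean_text` name, the tiny `STOPWORDS` set, and the pluggable `lemmatize` argument are all assumptions made for the example. Lemmatization defaults to a no-op here so the sketch runs without downloading NLTK data.

```python
import re

# Tiny illustrative stopword list (assumption); a real pipeline would
# likely use a fuller list such as NLTK's English stopwords.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "on", "for"}


def clean_text(text, lemmatize=lambda word: word):
    """Hypothetical per-title cleaner mirroring the listed steps."""
    text = re.sub(r"<[^>]+>", " ", text)         # remove HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # remove non-alphanumeric characters
    text = text.lower()                          # convert to lowercase
    words = text.split()                         # split() also collapses extra spaces
    words = [w for w in words if w not in STOPWORDS]  # remove stopwords
    return " ".join(lemmatize(w) for w in words)      # lemmatize each word


print(clean_text("<b>The Markets</b> are UP, again!"))  # markets up again
```

To lemmatize for real, a callable such as `WordNetLemmatizer().lemmatize` from `nltk.stem` could be passed as `lemmatize`, once the `wordnet` data has been downloaded.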
+2. TF-IDF Feature Extraction
+<br> The tfidf.py file contains the TF-IDF vectorization logic. It converts cleaned text data into numerical features suitable for training and testing the SVM model.
+```python
+from tfidf import tfidf
+# Apply TF-IDF vectorization
+X_train_tfidf = tfidf.fit_transform(X_train['title'])
+X_test_tfidf = tfidf.transform(X_test['title'])
+```
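
To make "numerical features" concrete, here is a small pure-Python sketch of the TF-IDF weighting itself. It uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which matches the defaults of scikit-learn's `TfidfVectorizer`. Whether `tfidf.py` relies on those defaults is an assumption, and the `tfidf_vectors` helper and the toy titles are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute TF-IDF rows by hand (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each word
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid dividing by zero
        vectors.append([v / norm for v in row])
    return vocab, vectors


# Made-up, already-cleaned titles
vocab, vectors = tfidf_vectors(["stocks rally", "stocks fall"])
print(vocab)
print(vectors[0])
```

In the first row, "rally" receives a larger weight than "stocks" because "stocks" appears in both titles and is therefore less distinctive.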
+3. Training and Testing the SVM Model
+<br> The svm.py file contains the logic for training and testing the SVM model. It uses the TF-IDF-transformed features to classify text data.
+```python
+from sklearn.svm import SVC
+from sklearn.metrics import accuracy_score, classification_report
+# Train the SVM model
+svm_model = SVC(kernel='linear', random_state=42)
+svm_model.fit(X_train_tfidf, y_train)
+# Predict and evaluate
+y_pred = svm_model.predict(X_test_tfidf)
+accuracy = accuracy_score(y_test, y_pred)
+print(f"SVM Accuracy: {accuracy:.4f}")
+print(classification_report(y_test, y_pred))
+```
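
Once trained, the fitted vectorizer and model can be saved and reused on unseen titles. The earlier README revision mentioned `model.pkl` and `tfidf.pkl`; the sketch below shows one way such files could be produced and consumed, assuming plain `pickle` serialization (the repository's actual format is not stated). The toy titles and labels are made up, and the round trip is done in memory rather than on disk.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Made-up toy data for illustration only
titles = [
    "stocks rally on earnings",
    "team wins final match",
    "markets fall on rate fears",
    "coach praises star player",
]
labels = [0, 1, 0, 1]

# Fit the vectorizer and the linear SVM on the toy data
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(titles)
model = SVC(kernel="linear", random_state=42).fit(X, labels)

# Serialize both objects; in practice these bytes would be written to files
model_bytes = pickle.dumps(model)
vectorizer_bytes = pickle.dumps(vectorizer)

# Later: restore both objects and classify new titles
loaded_model = pickle.loads(model_bytes)
loaded_vectorizer = pickle.loads(vectorizer_bytes)

new_titles = ["stocks fall on earnings"]
# transform (not fit_transform) keeps the vocabulary learned during training
pred = loaded_model.predict(loaded_vectorizer.transform(new_titles))
print(pred)
```

The key point is that new data must pass through the same fitted vectorizer with `transform`, so that its feature columns line up with those the model was trained on.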
+4. Testing a new dataset with the pre-trained model
+<br>To test a new dataset, combine the steps above:
+-