# ATS Score Predictor
This repository hosts a **MultinomialNB-based** model optimized for **ATS (Applicant Tracking System) Score Prediction** using text classification techniques. The model predicts how well a resume matches a job description based on ATS criteria.
## Model Details
- **Model Architecture**: Multinomial Naïve Bayes (MultinomialNB)
- **Task**: Resume Score Prediction
- **Dataset**: Job Listings & Resumes
- **Feature Extraction**: TF-IDF Vectorization
- **Evaluation Metrics**: Accuracy, Precision, Recall
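
The TF-IDF plus Multinomial Naïve Bayes combination listed above can be sketched end to end on a toy corpus. The texts and labels below are hypothetical and only illustrate how the two components fit together; they are not taken from the repository's dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical, for illustration only)
texts = [
    "python machine learning pandas scikit-learn",
    "deep learning python tensorflow models",
    "react javascript frontend css html",
    "javascript node react web applications",
]
labels = ["data", "data", "web", "web"]

# The pipeline vectorizes raw text with TF-IDF, then fits the classifier
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)

predicted = clf.predict(["experienced react and javascript developer"])
print(predicted[0])
```

Wrapping both steps in one pipeline keeps the vectorizer and classifier in sync, so raw text can be passed directly to `predict`.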
## Usage
### Installation
```bash
pip install pandas scikit-learn nltk PyPDF2 PyMuPDF matplotlib seaborn
```
### Loading the Model
```python
import os
import PyPDF2
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load dataset
df = pd.read_csv("job_data.csv") # Replace with actual dataset path
```
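
The snippet above loads the dataset rather than a saved model. If a trained model needs to be stored and reloaded, one common approach is `joblib`, which is installed alongside scikit-learn. The sketch below is an assumption about how that could look: the file name `ats_model.joblib`, the temporary directory, and the toy training data are all hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (hypothetical)
texts = ["python data analysis", "react web frontend"]
labels = ["data", "web"]

tfidf = TfidfVectorizer()
model = MultinomialNB().fit(tfidf.fit_transform(texts), labels)

# Save the vectorizer together with the model: both are needed at prediction time
model_path = os.path.join(tempfile.gettempdir(), "ats_model.joblib")  # hypothetical path
joblib.dump({"tfidf": tfidf, "model": model}, model_path)

# Reload and predict on new text
bundle = joblib.load(model_path)
features = bundle["tfidf"].transform(["react developer"])
reloaded_prediction = bundle["model"].predict(features)
print(reloaded_prediction[0])
```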
### Preprocessing and Feature Extraction
```python
# Resume dataset; assumed to contain 'Resume_str' and 'Category' columns
resumeDataSet = pd.read_csv("resume_data.csv")  # Placeholder: replace with actual resume dataset path

def cleanResume(resumeText):
    """Remove 1-2 character words and non-letter characters, then lowercase."""
    resumeText = re.sub(r'\b\w{1,2}\b', '', resumeText)
    resumeText = re.sub(r'[^a-zA-Z\s]', ' ', resumeText)
    return resumeText.lower()

resumeDataSet['Cleaned_Resume'] = resumeDataSet['Resume_str'].apply(lambda x: cleanResume(str(x)))
print(resumeDataSet.head())

def clean_text(text):
    """Strip punctuation and lowercase."""
    text = re.sub(r'[^\w\s]', '', str(text))
    return text.lower()

df['cleaned_job_info'] = df['JobDescription'].apply(clean_text)

# TF-IDF features for the resume classifier
tfidf = TfidfVectorizer(max_features=1000)
X = tfidf.fit_transform(resumeDataSet['Cleaned_Resume'])
y = resumeDataSet['Category']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = MultinomialNB()
model.fit(X_train, y_train)

# Evaluate on the held-out split
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, predictions))

def plot_confusion_matrix(y_true, y_pred, labels):
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    plt.figure(figsize=(10, 7))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=labels, yticklabels=labels)
    plt.title("Confusion Matrix")
    plt.ylabel("Actual")
    plt.xlabel("Predicted")
    plt.show()

plot_confusion_matrix(y_test, predictions, labels=model.classes_)

def extract_text_from_pdf(pdf_path):
    text = ''
    with open(pdf_path, 'rb') as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        for page in reader.pages:
            # extract_text() may return nothing for image-only pages
            text += page.extract_text() or ''
    return text

def calculate_ats_score(job_description, resume_text):
    """Percentage of unique job-description words that also appear in the resume."""
    job_keywords = set(re.findall(r'\b\w+\b', job_description.lower()))
    resume_keywords = set(re.findall(r'\b\w+\b', resume_text.lower()))
    if not job_keywords:
        return 0.0  # avoid division by zero on an empty job description
    matched_keywords = job_keywords.intersection(resume_keywords)
    return len(matched_keywords) / len(job_keywords) * 100  # percentage

def plot_ats_score(ats_score):
    plt.figure(figsize=(6, 4))
    plt.barh(['ATS Score'], [ats_score], color='blue')
    plt.xlim(0, 100)
    plt.title('ATS Score Based on Resume Match')
    plt.xlabel('Percentage Match')
    plt.show()

job_description = """
Seeking a Web Developer proficient in React.js and React Native to build scalable web and mobile applications. Must have experience with modern JavaScript frameworks and responsive design
"""

uploaded_pdf_path = "your resume path.pdf"
if os.path.exists(uploaded_pdf_path):
    resume_text = extract_text_from_pdf(uploaded_pdf_path)
    cleaned_resume = cleanResume(resume_text)
    vectorized_resume = tfidf.transform([cleaned_resume])
    prediction = model.predict(vectorized_resume)
    print(f"Predicted Category: {prediction[0]}")
    ats_score = calculate_ats_score(job_description, cleaned_resume)
    print(f"ATS Score: {ats_score:.2f}%")
    plot_ats_score(ats_score)
else:
    print(f"Resume not found: {uploaded_pdf_path}")
```
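
The keyword-overlap score is simple set arithmetic: the number of unique job-description words that also occur in the resume, divided by the number of unique job-description words. A minimal worked example follows, using a standalone helper named `keyword_overlap_score` (a hypothetical name that mirrors the logic above) and two made-up strings.

```python
import re

def keyword_overlap_score(job_description, resume_text):
    """Percentage of unique job-description words found in the resume."""
    job_keywords = set(re.findall(r"\b\w+\b", job_description.lower()))
    resume_keywords = set(re.findall(r"\b\w+\b", resume_text.lower()))
    if not job_keywords:
        return 0.0
    return len(job_keywords & resume_keywords) / len(job_keywords) * 100

# Made-up inputs for illustration
job = "React developer with JavaScript experience"
resume = "Built React apps in JavaScript and TypeScript"

# Job words: react, developer, with, javascript, experience (5 unique)
# Shared words: react, javascript (2) -> 2 / 5 * 100
score = keyword_overlap_score(job, resume)
print(f"{score:.1f}%")  # prints "40.0%"
```

Every word counts equally here, so filler words such as "with" lower the score just as much as a missing skill would. Filtering stop words before matching is one possible refinement.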
### Training the Model
```python
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['cleaned_job_info'])
y = df['ATS_Score']  # Assumes labeled ATS scores exist in the dataset; MultinomialNB is a classifier, so they must be discrete classes
model = MultinomialNB()
model.fit(X, y)
```
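
`MultinomialNB` is a classifier, so it expects discrete labels. If the `ATS_Score` column holds continuous percentages, one option is to bucket them into bands before training. The sketch below assumes that situation; the band edges and names are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical continuous ATS scores, in percent
scores = pd.Series([12.5, 48.0, 73.2, 91.0])

# Hypothetical band edges: [0, 40], (40, 70], (70, 100]
bins = [0, 40, 70, 100]
names = ["low", "medium", "high"]

score_band = pd.cut(scores, bins=bins, labels=names, include_lowest=True)
print(score_band.tolist())  # ['low', 'medium', 'high', 'high']
```

The resulting band column could then be passed as `y` in place of the raw scores.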
### Predicting ATS Score for a Resume
```python
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    """Extract the text of every page in a PDF using PyMuPDF."""
    text = ''
    with fitz.open(pdf_path) as document:
        for page in document:
            text += page.get_text()
    return text

resume_text = extract_text_from_pdf('path_to_resume.pdf')
cleaned_resume = clean_text(resume_text)
resume_vector = vectorizer.transform([cleaned_resume])
predicted_score = model.predict(resume_vector)  # array with one prediction
print(f"Predicted ATS Score: {predicted_score[0]}")
```
## Evaluation Results
| Metric | Score | Description |
|---------------|-------|--------------------------------------------------------------|
| **Accuracy** | 89.2% | Share of test predictions that match the true label |
| **Precision** | 85.5% | Share of resumes predicted as well matched that actually are |
| **Recall** | 84.3% | Share of relevant resume-job pairs the model captures |
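
For reference, the three metrics can be computed with scikit-learn as sketched below. The labels are toy values, and macro averaging is an assumption: the source does not state which averaging scheme produced the reported precision and recall.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels (hypothetical), not the repository's evaluation data
y_true = ["high", "low", "high", "medium", "low", "high"]
y_pred = ["high", "low", "medium", "medium", "low", "high"]

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging weights every class equally (assumed, see note above)
precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro", zero_division=0)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
```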
## Repository Structure
```bash
.
├── model/       # Trained MultinomialNB Model
├── dataset/     # Job Listings and Resume Data
├── results/     # Evaluation Metrics
└── README.md    # Model Documentation
```
## ⚠️ Limitations
- The model depends on **textual content** and does not assess **resume formatting**.
- Performance depends on **feature extraction**: the TF-IDF features vary with **resume structure and job-description wording**.
- The dataset should be **large and diverse** for optimal accuracy.