---
library_name: transformers
license: apache-2.0
base_model: distilbert-base-uncased
tags:
- generated_from_trainer
- text-classification
- resume-parsing
- nlp
metrics:
- accuracy
model-index:
- name: resume-section-classifier-v1
  results: []
---

# Resume Section Classifier v1 (DistilBERT)

A highly accurate, production-ready text classification model designed to categorize raw, messy resume text into 15 distinct sections.

This model builds upon the original concepts of `gr8monk3ys/resume-section-classifier`, but expands the label set, introduces deep regional (African/European/Global) context, and provides **publicly hosted weights** for immediate deployment. It is heavily optimized for parsing output generated by OCR and PDF extraction tools like the **Unstructured API**.

## 🚀 Model Details

- **Base Architecture:** `distilbert-base-uncased`
- **Task:** Text Classification (Sequence Classification)
- **Language:** English
- **Training Data:** ~58,000 rows of hybrid organic and synthetically generated resume lines.
- **Accuracy:** 99.97%

## 🎯 Intended Use & Key Features

This model is designed to act as the "Routing Engine" in resume parsing pipelines. When a PDF parser extracts unstructured blocks of text, this model categorizes those blocks so they can be routed to strict data schemas (like Pydantic or Zod) for targeted entity extraction.

### Key Features:

1. **Unstructured API Resilience:** Trained heavily on "engineered noise" (random bullet points, markdown artifacts, pipe separators, missing newlines, and concatenated dates). It learns the semantic meaning of the text, ignoring formatting garbage.
2. **Global & Regional Coverage:** Excellent recognition of standard global formats (BSc, AWS, PMP, GPA) as well as highly specific regional/Nigerian formats (ND, HND, PGD, NYSC, SIWES, ICAN, CIBN).
3. **Multi-line Block Handling:** Capable of classifying dense, multi-line blocks of text.
It easily identifies a 4-line project description block as `projects` rather than confusing it with `experience`.

## 🏷️ Supported Labels (15)

The model predicts one of the following 15 classes:

| Label | Description / Examples |
| :--- | :--- |
| `contact` | Emails, phone numbers, addresses, LinkedIn/GitHub URLs. |
| `summary` | Professional summaries, profiles, or executive overviews. |
| `objective` | Career objectives and personal statements. |
| `experience` | Work history, NYSC, SIWES, internships. |
| `education` | Degrees (BSc, HND, PhD), institutions, and grades. |
| `skills` | Technical skills, soft skills, programming languages. |
| `certifications` | Professional certs (AWS, ICAN, PMP), including "In View" status. |
| `projects` | Personal or professional projects and open-source contributions. |
| `awards` | Honors, scholarships, and Dean's Lists. |
| `hobbies` | Interests, passions, and extracurricular activities. |
| `languages` | Spoken languages and proficiency levels (e.g., Fluent, B2). |
| `volunteer` | Community service and pro-bono work. |
| `publications` | Research papers, articles, and academic journals. |
| `references` | Referees or "References available upon request" statements. |
| `additional_info` | Relocation willingness, visa status, notice periods. |

## 💻 How to Use

You can easily load this model in your pipeline using the Hugging Face `transformers` library.
```python
from transformers import pipeline

# Load the classifier
classifier = pipeline("text-classification", model="amosify/resume-section-classifier-v1")

# Example 1: Messy Unstructured API chunk
chunk_1 = "• Professional Development\nGoogle Data Analytics Professional Certificate - 2023"
print(classifier(chunk_1))
# Output: [{'label': 'certifications', 'score': 0.9998}]

# Example 2: Multi-line project description
chunk_2 = "Interactive Search Engine (C#, Java, PHP)\n* Attracted 100+ GitHub stars\n* Deployed to Heroku with Docker"
print(classifier(chunk_2))
# Output: [{'label': 'projects', 'score': 0.9997}]

# Example 3: Regional Education
chunk_3 = "Higher National Diploma (HND) in Computer Science, Yaba College of Technology (Upper Credit)"
print(classifier(chunk_3))
# Output: [{'label': 'education', 'score': 0.9999}]
```

## 📊 Training Procedure & Metrics

The model was fine-tuned for 5 epochs on Kaggle using two NVIDIA T4 GPUs. Training used the `load_best_model_at_end` option, so the published weights come from the checkpoint with the lowest validation loss rather than from the final epoch.

### Training Hyperparameters

- **learning_rate:** 2e-05
- **train_batch_size:** 64
- **eval_batch_size:** 64
- **optimizer:** AdamW (betas=(0.9,0.999), epsilon=1e-08)
- **lr_scheduler_type:** linear
- **num_epochs:** 5

### Training Results

| Epoch | Step | Training Loss | Validation Loss | Accuracy |
|:-----:|:----:|:-------------:|:---------------:|:--------:|
| 1.0 | 1004 | 0.0093 | 0.0054 | 0.9996 |
| 2.0 | 2008 | 0.0024 | 0.0042 | 0.9996 |
| 3.0 | 3012 | 0.0008 | 0.0032 | 0.9997 |
| **4.0** | **4016** | **0.0004** | **0.0027** | **0.9997** |
| 5.0 | 5020 | 0.0003 | 0.0030 | 0.9997 |

*(Note: The Epoch 4 weights were kept because they yielded the lowest validation loss, 0.0027.)*

## ⚠️ Limitations & Scope

- **Sequence Length Limitation:** DistilBERT has a hard limit of 512 tokens, but this model was trained with a `max_length` of **256 tokens** to optimize for speed.
If you pass an entire 2-page resume as a single string, the input will be truncated (or rejected, depending on your tokenizer settings), and most of the resume will never reach the model. **You must chunk your PDF first** (e.g., using Unstructured) and pass the chunks to this model individually.
- **Not an NER Model:** This is a *Sequence Classifier*, not a Named Entity Recognition (NER) model. It will confidently tell you that a block of text belongs to the `education` section, but it will not extract the specific substring "Harvard University" out of it. You should route the classified text to an LLM or strict extraction schema (like Zod/Pydantic) for final data extraction.
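
The chunk-then-route workflow described in the limitations above could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper names (`split_into_chunks`, `route_chunks`, `build_classifier`), the blank-line splitting rule, the `0.9` score threshold, and the `needs_review` fallback bucket are all hypothetical and not part of this model or of `transformers`. In a real pipeline the chunks would come from your PDF extraction tool rather than from a blank-line split.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List

# A classifier is any callable that maps one string to a list like
# [{"label": "education", "score": 0.99}], the shape the pipeline returns.
Classifier = Callable[[str], List[dict]]


def split_into_chunks(resume_text: str) -> List[str]:
    """Split raw text on blank lines (a stand-in for a real PDF chunker)."""
    blocks = re.split(r"\n\s*\n", resume_text)
    return [block.strip() for block in blocks if block.strip()]


def route_chunks(
    chunks: List[str],
    classify: Classifier,
    min_score: float = 0.9,          # assumed threshold, tune for your data
    fallback: str = "needs_review",  # hypothetical bucket, not a model label
) -> Dict[str, List[str]]:
    """Group chunks by predicted label; low-confidence chunks go to fallback."""
    routed: Dict[str, List[str]] = defaultdict(list)
    for chunk in chunks:
        top = classify(chunk)[0]  # top prediction for this chunk
        label = top["label"] if top["score"] >= min_score else fallback
        routed[label].append(chunk)
    return dict(routed)


def build_classifier() -> Classifier:
    """Wrap the Hugging Face pipeline, truncating to the 256-token training length."""
    from transformers import pipeline  # imported lazily; downloads the weights

    pipe = pipeline(
        "text-classification",
        model="amosify/resume-section-classifier-v1",
    )
    return lambda text: pipe(text, truncation=True, max_length=256)
```

Calling `route_chunks(split_into_chunks(text), build_classifier())` would return a dictionary keyed by section label, and each bucket can then be handed to the matching extraction schema. Keeping the classifier as a plain callable makes it easy to swap in a stub for testing without downloading the model.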