| **This README was generated by ChatGPT.** | |
| # gmay29/ner\_model\_final | |
| ### Model Description | |
| This is a **Named Entity Recognition (NER)** model based on [**microsoft/deberta-base**](https://huggingface.co/microsoft/deberta-base). | |
| It was **fine-tuned on synthetic internship and job description data** generated using **mostly.ai**. | |
| The model extracts structured entities from **internship postings and job descriptions**, such as: | |
| * **SKILL**: technical skills, tools, programming languages, frameworks, and soft skills | |
| * **DISCIPLINE**: academic or technical fields (AI, ML, NLP, Computer Vision, Engineering, etc.) | |
| * **COURSE**: courses and degrees mentioned in the job description | |
| * **ROLE**: job roles and collaborators (intern, data scientist, software engineer, etc.) | |
| --- | |
| ### Intended Use | |
| * Parsing **internship descriptions** | |
| * Parsing **job postings** | |
| * Building **HR/recruitment tools** | |
| * Structuring unstructured job text into a machine-readable format | |
| This model is not designed for parsing **candidate resumes**. | |
| --- | |
| ### Training Data | |
| * **Synthetic internship and job description dataset** generated with **mostly.ai** | |
| * \~20,000 labeled samples (approximate; the exact dataset size is not confirmed) | |
| * Labels: `SKILL`, `DISCIPLINE`, `COURSE`, `ROLE` | |
| --- | |
| ### How to Use | |
| ```python | |
| import torch | |
| from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline | |
| # Load from Hugging Face Hub | |
| model_name = "gmay29/ner_model_final" | |
| tokenizer = AutoTokenizer.from_pretrained(model_name) | |
| model = AutoModelForTokenClassification.from_pretrained(model_name) | |
| device = 0 if torch.cuda.is_available() else -1 # pipeline expects 0 for GPU, -1 for CPU | |
| # Create the NER pipeline; aggregation_strategy="simple" merges sub-word tokens into entity spans | |
| ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, device=device, aggregation_strategy="simple") | |
| # Example job description text | |
| text = """ | |
| Responsibilities of the Intern: | |
| Design, develop, and implement AI agents for various applications. | |
| Train and fine-tune machine learning models using structured and unstructured datasets. | |
| Work with deep learning frameworks such as TensorFlow, PyTorch, and scikit-learn. | |
| Implement reinforcement learning techniques to enhance AI agent performance. | |
| Perform data preprocessing, augmentation, and feature engineering. | |
| Optimize model performance through hyperparameter tuning and algorithm optimization. | |
| Integrate AI solutions into real-world applications and assist in deployment. | |
| Collaborate with data scientists, software engineers, and business teams to understand requirements and deliver AI-driven solutions. | |
| Conduct research and stay updated with the latest trends and advancements in AI and ML. | |
| Document research findings, methodologies, and results effectively. | |
| Requirements: | |
| Strong programming skills in Python (knowledge of C++/Java is a plus). | |
| Experience with machine learning frameworks like TensorFlow, PyTorch, or Keras. | |
| Understanding of deep learning, natural language processing (NLP), and computer vision techniques. | |
| Familiarity with reinforcement learning concepts and their applications. | |
| Knowledge of data preprocessing, feature engineering, and model evaluation techniques. | |
| Experience working with large datasets and cloud computing platforms (AWS, Google Cloud, or Azure) is a plus. | |
| Strong problem-solving skills and the ability to work in a collaborative team environment. | |
| Excellent communication and documentation skills. | |
| """ | |
| # Run inference | |
| entities = ner_pipeline(text) | |
| # Pretty print results | |
| for ent in entities: | |
|     print(f"{ent['entity_group']:<10} | {ent['word']:<25} | score={ent['score']:.3f}") | |
| ``` | |
| **Example Output:** | |
| ``` | |
| Device set to use cpu | |
| Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation. | |
| SKILL | machine learning | score=1.000 | |
| SKILL | learning | score=1.000 | |
| SKILL | learn | score=0.758 | |
| SKILL | learning techniques | score=0.769 | |
| DISCIPLINE | engineering | score=1.000 | |
| COURSE | data scientists | score=0.954 | |
| SKILL | research | score=1.000 | |
| SKILL | ML | score=0.997 | |
| SKILL | research findings | score=1.000 | |
| SKILL | results | score=0.960 | |
| SKILL | programming | score=1.000 | |
| SKILL | Python | score=0.599 | |
| SKILL | machine learning | score=1.000 | |
| SKILL | learning | score=1.000 | |
| SKILL | learning | score=1.000 | |
| DISCIPLINE | engineering | score=1.000 | |
| SKILL | collaborative | score=0.900 | |
| SKILL | communication | score=1.000 | |
| SKILL | documentation skills | score=0.983 | |
| ``` | |
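The flat list printed above can be turned into the machine-readable structure mentioned under Intended Use. The following is a minimal sketch, not part of the released model: `group_entities` and its `min_score` default of 0.9 are hypothetical choices made for illustration, and the sample rows simply mirror the shape of the pipeline output (`entity_group`, `word`, `score`) so the snippet runs without downloading the model.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.9):
    """Group pipeline output by label, dropping low-confidence and duplicate spans.

    `min_score` is an illustrative threshold, not a value recommended by the model card.
    """
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] < min_score:
            continue  # skip low-confidence predictions
        word = ent["word"].strip()
        label = ent["entity_group"]
        # Case-insensitive de-duplication within a label.
        if word.lower() not in (w.lower() for w in grouped[label]):
            grouped[label].append(word)
    return dict(grouped)


# A few rows shaped like the pipeline output above (scores copied from the example).
sample = [
    {"entity_group": "SKILL", "word": "machine learning", "score": 1.000},
    {"entity_group": "SKILL", "word": "Python", "score": 0.599},
    {"entity_group": "DISCIPLINE", "word": "engineering", "score": 1.000},
    {"entity_group": "SKILL", "word": "machine learning", "score": 1.000},
]

print(group_entities(sample))
```

In practice the same function could be called as `group_entities(ner_pipeline(text))`. A high threshold trims fragments such as `learn` (0.758) but also drops legitimate low-scoring hits such as `Python` (0.599), so the cut-off is worth tuning on your own postings.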
| --- | |
| ### Limitations | |
| * Trained on **synthetic job descriptions**; may not perfectly generalize to **real-world postings**. | |
| * Some ambiguity across entity classes (e.g., "AI" could be both `DISCIPLINE` and `SKILL`). | |
| * Supports **English** only. | |
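One possible way to cope with the label ambiguity noted above is to keep only the highest-scoring label for each distinct span text. This is a sketch under stated assumptions rather than behavior built into the model: `resolve_label_conflicts` is a hypothetical helper, and the scores in `predictions` are made up for illustration.

```python
def resolve_label_conflicts(entities):
    """Keep, for each distinct span text, only the highest-scoring prediction."""
    best = {}
    for ent in entities:
        key = ent["word"].strip().lower()  # compare span text case-insensitively
        if key not in best or ent["score"] > best[key]["score"]:
            best[key] = ent
    return list(best.values())


# Hypothetical predictions: these scores are invented, not real model output.
predictions = [
    {"entity_group": "SKILL", "word": "AI", "score": 0.71},
    {"entity_group": "DISCIPLINE", "word": "AI", "score": 0.93},
    {"entity_group": "SKILL", "word": "Python", "score": 0.88},
]

resolved = resolve_label_conflicts(predictions)
for ent in resolved:
    print(f"{ent['entity_group']:<10} | {ent['word']}")
```

This collapses every occurrence of a term into one label across the whole posting, which may be too aggressive if a term genuinely plays different roles in different sentences.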
| --- | |
| ### Future Work | |
| * Extend training with **real-world internship/job postings**. | |
| * Add entity types such as `CERTIFICATION`, `TOOLS`, `COMPANY`. | |
| * Benchmark against public job-posting datasets for NER. | |
| --- | |