Spaces: mhr-212/resume-llm-api

Your Name committed 2 days ago

Commit 8e21fda (0 parents)

Initialize independent HF Space repository


Files changed (22)

.gitattributes +1 -0
Dockerfile +27 -0
README.md +11 -0
models/checkpoints/final/README.md +207 -0
models/checkpoints/final/adapter_config.json +41 -0
models/checkpoints/final/adapter_model.safetensors +3 -0
models/checkpoints/final/added_tokens.json +40 -0
models/checkpoints/final/merges.txt +0 -0
models/checkpoints/final/special_tokens_map.json +24 -0
models/checkpoints/final/tokenizer.json +0 -0
models/checkpoints/final/tokenizer_config.json +326 -0
models/checkpoints/final/vocab.json +0 -0
requirements.txt +7 -0
resume-llm-api +1 -0
src/__init__.py +0 -0
src/__pycache__/__init__.cpython-312.pyc +0 -0
src/__pycache__/inference.cpython-312.pyc +0 -0
src/data_preparation.py +211 -0
src/evaluate.py +274 -0
src/inference.py +316 -0
src/train.py +175 -0
src/utils.py +59 -0

.gitattributes ADDED Viewed

	@@ -0,0 +1 @@


+*.safetensors filter=lfs diff=lfs merge=lfs -text

Dockerfile ADDED Viewed

	@@ -0,0 +1,27 @@

+# Use Python 3.10
+FROM python:3.10-slim
+WORKDIR /app
+# Install system dependencies
+RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
+# Copy requirements
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+# Copy source code and model
+COPY src/ ./src/
+COPY models/ ./models/
+# Create a user to run the app (security best practice for HF)
+RUN useradd -m -u 1000 user
+USER user
+ENV HOME=/home/user \
+    PATH=/home/user/.local/bin:$PATH
+# Expose the standard HF Spaces port
+EXPOSE 7860
+# Start the API
+CMD ["python", "src/inference.py", "--mode", "api", "--port", "7860"]

README.md ADDED Viewed

	@@ -0,0 +1,11 @@

+---
+title: "Resume-LLM-API"
+emoji: "📄"
+colorFrom: "blue"
+colorTo: "indigo"
+sdk: "docker"
+pinned: false
+app_port: 7860
+---
+# Resume LLM API

models/checkpoints/final/README.md ADDED Viewed

	@@ -0,0 +1,207 @@

+---
+base_model: microsoft/phi-2
+library_name: peft
+pipeline_tag: text-generation
+tags:
+- base_model:adapter:microsoft/phi-2
+- lora
+- transformers
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]
+### Framework versions
+- PEFT 0.18.1

models/checkpoints/final/adapter_config.json ADDED Viewed

	@@ -0,0 +1,41 @@

+{
+  "alora_invocation_tokens": null,
+  "alpha_pattern": {},
+  "arrow_config": null,
+  "auto_mapping": null,
+  "base_model_name_or_path": "microsoft/phi-2",
+  "bias": "none",
+  "corda_config": null,
+  "ensure_weight_tying": false,
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 32,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "peft_version": "0.18.1",
+  "qalora_group_size": 16,
+  "r": 16,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "v_proj",
+    "q_proj"
+  ],
+  "target_parameters": null,
+  "task_type": "CAUSAL_LM",
+  "trainable_token_indices": null,
+  "use_dora": false,
+  "use_qalora": false,
+  "use_rslora": false
+}
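
As a sanity check on this configuration, the adapter's size can be estimated from `r` and `target_modules`. The sketch below assumes phi-2's published dimensions (hidden size 2560, 32 decoder layers, square `q_proj` and `v_proj` projections) and fp32 storage; none of these are stated in this commit, and `lora_param_count` is an illustrative name.

```python
def lora_param_count(r, in_features, out_features, modules_per_layer, num_layers):
    """Parameters added by LoRA: each adapted Linear gains A (r x in) and B (out x r)."""
    per_module = r * (in_features + out_features)
    return per_module * modules_per_layer * num_layers


# Values taken from adapter_config.json
R = 16
LORA_ALPHA = 32
TARGET_MODULES = ["v_proj", "q_proj"]

# Assumed phi-2 dimensions (not part of this commit)
HIDDEN_SIZE = 2560
NUM_LAYERS = 32

params = lora_param_count(R, HIDDEN_SIZE, HIDDEN_SIZE, len(TARGET_MODULES), NUM_LAYERS)
fp32_bytes = params * 4          # 4 bytes per fp32 weight
scaling = LORA_ALPHA / R         # standard LoRA scaling, since use_rslora is false

print(params)      # 5242880
print(fp32_bytes)  # 20971520
print(scaling)     # 2.0
```

The LFS pointer for `adapter_model.safetensors` reports 20988664 bytes, roughly 17 KB above this estimate, which is consistent with fp32 adapter weights plus the safetensors header.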

models/checkpoints/final/adapter_model.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f5dd863a28403b34cfd76507ced3b90a837c1010e10b9e23ccf06e777693c74d
+size 20988664
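
The three lines above are a Git LFS pointer, not the weights themselves. A minimal sketch of reading such a pointer into structured fields follows; `parse_lfs_pointer` is a hypothetical helper, not something this repository provides.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>"
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:f5dd863a28403b34cfd76507ced3b90a837c1010e10b9e23ccf06e777693c74d
size 20988664
"""

info = parse_lfs_pointer(POINTER)
print(info["hash_algorithm"], info["size_bytes"])  # sha256 20988664
```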

models/checkpoints/final/added_tokens.json ADDED Viewed

	@@ -0,0 +1,40 @@

+{
+  "\t\t": 50294,
+  "\t\t\t": 50293,
+  "\t\t\t\t": 50292,
+  "\t\t\t\t\t": 50291,
+  "\t\t\t\t\t\t": 50290,
+  "\t\t\t\t\t\t\t": 50289,
+  "\t\t\t\t\t\t\t\t": 50288,
+  "\t\t\t\t\t\t\t\t\t": 50287,
+  "  ": 50286,
+  "   ": 50285,
+  "    ": 50284,
+  "     ": 50283,
+  "      ": 50282,
+  "       ": 50281,
+  "        ": 50280,
+  "         ": 50279,
+  "          ": 50278,
+  "           ": 50277,
+  "            ": 50276,
+  "             ": 50275,
+  "              ": 50274,
+  "               ": 50273,
+  "                ": 50272,
+  "                 ": 50271,
+  "                  ": 50270,
+  "                   ": 50269,
+  "                    ": 50268,
+  "                     ": 50267,
+  "                      ": 50266,
+  "                       ": 50265,
+  "                        ": 50264,
+  "                         ": 50263,
+  "                          ": 50262,
+  "                           ": 50261,
+  "                            ": 50260,
+  "                             ": 50259,
+  "                              ": 50258,
+  "                               ": 50257
+}
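
The 38 entries above appear to follow a regular pattern: runs of 2 to 31 spaces and runs of 2 to 9 tabs, with longer runs receiving lower ids. The sketch below rebuilds the table from that pattern; the formulas are inferred from the listed ids rather than taken from any tokenizer source, and `whitespace_added_tokens` is an illustrative name.

```python
def whitespace_added_tokens():
    """Rebuild the whitespace-run vocabulary listed in added_tokens.json."""
    tokens = {}
    # Runs of 2..31 spaces occupy ids 50286 down to 50257.
    for n in range(2, 32):
        tokens[" " * n] = 50288 - n
    # Runs of 2..9 tabs occupy ids 50294 down to 50287.
    for n in range(2, 10):
        tokens["\t" * n] = 50296 - n
    return tokens


tokens = whitespace_added_tokens()
print(len(tokens))                    # 38
print(tokens["  "], tokens["\t\t"])   # 50286 50294
```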

models/checkpoints/final/merges.txt ADDED Viewed

The diff for this file is too large to render. See raw diff

models/checkpoints/final/special_tokens_map.json ADDED Viewed

	@@ -0,0 +1,24 @@

+{
+  "bos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<|endoftext|>",
+  "unk_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
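
Because `pad_token` and `eos_token` are the same string here, padding cannot be told apart from a genuine end-of-text token by id alone; the attention mask has to carry that information. The sketch below shows one common way to build training labels under that constraint. Whether `src/train.py` does this is not visible in this view, the non-EOS token ids are arbitrary, and `mask_padding_labels` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss ignores by default
EOS_ID = 50256       # id of <|endoftext|> per tokenizer_config.json


def mask_padding_labels(input_ids, attention_mask):
    """Copy input ids to labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]


# Three content tokens, one real EOS, then two padding slots that reuse the EOS id.
input_ids = [1212, 318, 257, EOS_ID, EOS_ID, EOS_ID]
attention_mask = [1, 1, 1, 1, 0, 0]

labels = mask_padding_labels(input_ids, attention_mask)
print(labels)  # [1212, 318, 257, 50256, -100, -100]
```

Masking by attention mask rather than by token id keeps the real end-of-text token in the loss, so the model can still learn where to stop.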

models/checkpoints/final/tokenizer.json ADDED Viewed

The diff for this file is too large to render. See raw diff

models/checkpoints/final/tokenizer_config.json ADDED Viewed

	@@ -0,0 +1,326 @@

+{
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "50256": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50257": {
+      "content": "                               ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50258": {
+      "content": "                              ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50259": {
+      "content": "                             ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50260": {
+      "content": "                            ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50261": {
+      "content": "                           ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50262": {
+      "content": "                          ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50263": {
+      "content": "                         ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50264": {
+      "content": "                        ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50265": {
+      "content": "                       ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50266": {
+      "content": "                      ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50267": {
+      "content": "                     ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50268": {
+      "content": "                    ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50269": {
+      "content": "                   ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50270": {
+      "content": "                  ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50271": {
+      "content": "                 ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50272": {
+      "content": "                ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50273": {
+      "content": "               ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50274": {
+      "content": "              ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50275": {
+      "content": "             ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50276": {
+      "content": "            ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50277": {
+      "content": "           ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50278": {
+      "content": "          ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50279": {
+      "content": "         ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50280": {
+      "content": "        ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50281": {
+      "content": "       ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50282": {
+      "content": "      ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50283": {
+      "content": "     ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50284": {
+      "content": "    ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50285": {
+      "content": "   ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50286": {
+      "content": "  ",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50287": {
+      "content": "\t\t\t\t\t\t\t\t\t",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50288": {
+      "content": "\t\t\t\t\t\t\t\t",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50289": {
+      "content": "\t\t\t\t\t\t\t",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50290": {
+      "content": "\t\t\t\t\t\t",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50291": {
+      "content": "\t\t\t\t\t",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50292": {
+      "content": "\t\t\t\t",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50293": {
+      "content": "\t\t\t",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "50294": {
+      "content": "\t\t",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "bos_token": "<|endoftext|>",
+  "clean_up_tokenization_spaces": true,
+  "eos_token": "<|endoftext|>",
+  "extra_special_tokens": {},
+  "model_max_length": 2048,
+  "pad_token": "<|endoftext|>",
+  "return_token_type_ids": false,
+  "tokenizer_class": "CodeGenTokenizer",
+  "unk_token": "<|endoftext|>"
+}

models/checkpoints/final/vocab.json ADDED Viewed

The diff for this file is too large to render. See raw diff

requirements.txt ADDED Viewed

	@@ -0,0 +1,7 @@

+# Dependencies for the AI Model (Hugging Face / GPU Server)
+torch
+transformers
+tokenizers
+accelerate
+peft
+flask
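
The scripts in this commit import a few third-party modules that are not named in this list, for example `numpy` and `sklearn` (published on PyPI as `scikit-learn`). Some of these may arrive transitively through other packages, so a quick check inside the built image can confirm what is actually importable. This is a minimal sketch; `missing_modules` is an illustrative helper.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names (not PyPI names) used by src/data_preparation.py and src/evaluate.py
SCRIPT_IMPORTS = ["numpy", "sklearn"]

print(missing_modules(SCRIPT_IMPORTS))
```

An empty list means every module resolved; any name printed would need a matching entry in `requirements.txt` before the corresponding script can run in the container.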

resume-llm-api ADDED Viewed

	@@ -0,0 +1 @@


+Subproject commit db3e24a2516e30c66dc06acd4084f4203028de66

src/__init__.py ADDED Viewed

File without changes

src/__pycache__/__init__.cpython-312.pyc ADDED Viewed

Binary file (160 Bytes). View file

src/__pycache__/inference.cpython-312.pyc ADDED Viewed

Binary file (15.5 kB). View file

src/data_preparation.py ADDED Viewed

	@@ -0,0 +1,211 @@

+import json
+import os
+from typing import Dict, List
+import numpy as np
+class DataGenerator:
+    """Generate synthetic training data for both tasks"""
+    @staticmethod
+    def generate_extraction_samples(num_samples: int = 1000) -> List[Dict]:
+        """Generate resume extraction training samples"""
+        companies = ["TechCorp", "DataFlow", "CloudSys", "AI Labs", "WebDev Inc",
+                    "FinTech Solutions", "Health Systems", "E-commerce Plus"]
+        roles = ["Developer", "Senior Developer", "Data Scientist", "ML Engineer",
+                "Product Manager", "DevOps Engineer", "Frontend Engineer", "Backend Engineer"]
+        skills_pool = ["Python", "Django", "Flask", "FastAPI", "PostgreSQL", "MongoDB",
+                      "React", "Vue.js", "AWS", "GCP", "Docker", "Kubernetes",
+                      "Machine Learning", "NLP", "TensorFlow", "PyTorch", "Git",
+                      "SQL", "REST API", "GraphQL", "Redis", "Elasticsearch"]
+        universities = ["MIT", "Stanford", "Carnegie Mellon", "Berkeley", "Harvard",
+                       "University of Washington", "State University", "Tech Institute"]
+        degrees = ["BS Computer Science", "BS Data Science", "MS Computer Science",
+                  "MS Artificial Intelligence", "BS Engineering"]
+        samples = []
+        for i in range(num_samples):
+            name = f"Candidate_{i+1}"
+            email = f"candidate{i+1}@email.com"
+            phone = f"555-{np.random.randint(1000, 10000)}"  # upper bound is exclusive
+            # Experience
+            num_exp = np.random.randint(1, 4)
+            experience = []
+            for _ in range(num_exp):
+                experience.append({
+                    "company": np.random.choice(companies),
+                    "role": np.random.choice(roles),
+                    "duration": f"{np.random.randint(1, 7)} years",
+                    "description": "Led projects and mentored team members"
+                })
+            # Skills
+            num_skills = np.random.randint(3, 10)
+            skills = list(np.random.choice(skills_pool, num_skills, replace=False))
+            # Education
+            education = [{
+                "degree": np.random.choice(degrees),
+                "university": np.random.choice(universities),
+                "graduation_year": str(np.random.randint(2015, 2023))
+            }]
+            # Certifications
+            certifications = [f"Cert_{j}" for j in range(np.random.randint(0, 3))]
+            resume_text = f"""
+Resume of {name}
+Email: {email} | Phone: {phone}
+EXPERIENCE:
+{chr(10).join([f"- {exp['company']}: {exp['role']} ({exp['duration']})" for exp in experience])}
+SKILLS:
+{', '.join(skills)}
+EDUCATION:
+{chr(10).join([f"- {edu['degree']} from {edu['university']} ({edu['graduation_year']})" for edu in education])}
+CERTIFICATIONS:
+{chr(10).join(certifications) if certifications else "None"}
+"""
+            extracted_data = {
+                "name": name,
+                "email": email,
+                "phone": phone,
+                "skills": skills,
+                "experience": experience,
+                "education": education,
+                "certifications": certifications
+            }
+            samples.append({
+                "input": resume_text.strip(),
+                "output": json.dumps(extracted_data, indent=2),
+                "task": "extraction"
+            })
+        return samples
+    @staticmethod
+    def generate_matching_samples(num_samples: int = 500) -> List[Dict]:
+        """Generate resume-job matching training samples"""
+        job_titles = ["Senior Python Developer", "Data Scientist", "ML Engineer",
+                     "Full-Stack Developer", "DevOps Engineer", "Product Manager"]
+        skills_pool = ["Python", "Django", "PostgreSQL", "AWS", "Docker", "Kubernetes",
+                      "Machine Learning", "React", "Node.js", "SQL"]
+        samples = []
+        for i in range(num_samples):
+            # Create job description
+            job_title = np.random.choice(job_titles)
+            required_skills = list(np.random.choice(skills_pool, np.random.randint(3, 7), replace=False))
+            job_desc = f"""
+Job Title: {job_title}
+Required Skills:
+{', '.join(required_skills)}
+Experience: 3+ years in relevant role
+Education: BS in Computer Science or related field
+"""
+            # Create matching resume
+            resume_skills = list(np.random.choice(skills_pool, np.random.randint(3, 8), replace=False))
+            resume = f"Skills: {', '.join(resume_skills)}\nExperience: {np.random.randint(1, 8)} years"
+            # Calculate match score based on skill overlap
+            matching_skills = sorted(set(resume_skills) & set(required_skills))  # sorted for stable output
+            match_score = min(100, int((len(matching_skills) / len(required_skills)) * 100))
+            matching_data = {
+                "match_score": match_score,
+                "matching_skills": matching_skills,
+                "missing_skills": [s for s in required_skills if s not in resume_skills],
+                "recommendation": "Recommend interview" if match_score >= 70 else "Consider further review"
+            }
+            samples.append({
+                "input": f"Resume:\n{resume}\n\nJob Description:\n{job_desc}",
+                "output": json.dumps(matching_data, indent=2),
+                "task": "matching"
+            })
+        return samples
+    @staticmethod
+    def create_instruction_dataset(extraction_samples: List[Dict],
+                                   matching_samples: List[Dict]) -> List[Dict]:
+        """Convert samples to instruction-following format"""
+        dataset = []
+        # Extraction task instructions
+        for sample in extraction_samples:
+            dataset.append({
+                "instruction": "Extract structured information from the resume. Return valid JSON.",
+                "input": sample["input"],
+                "output": sample["output"],
+                "task": "extraction"
+            })
+        # Matching task instructions
+        for sample in matching_samples:
+            dataset.append({
+                "instruction": "Compare the resume against the job description and provide a match score (0-100) with reasoning. Return valid JSON.",
+                "input": sample["input"],
+                "output": sample["output"],
+                "task": "matching"
+            })
+        return dataset
+def prepare_data(output_dir: str = "data/processed"):
+    """Main function to prepare all data"""
+    os.makedirs(output_dir, exist_ok=True)
+    print("Generating extraction samples...")
+    extraction_samples = DataGenerator.generate_extraction_samples(1000)
+    print("Generating matching samples...")
+    matching_samples = DataGenerator.generate_matching_samples(500)
+    print("Creating instruction dataset...")
+    full_dataset = DataGenerator.create_instruction_dataset(extraction_samples, matching_samples)
+    # Split into train/val/test
+    np.random.shuffle(full_dataset)
+    total = len(full_dataset)
+    train_idx = int(0.8 * total)
+    val_idx = int(0.9 * total)
+    train_data = full_dataset[:train_idx]
+    val_data = full_dataset[train_idx:val_idx]
+    test_data = full_dataset[val_idx:]
+    # Save datasets
+    with open(f"{output_dir}/train.json", "w") as f:
+        json.dump(train_data, f, indent=2)
+    with open(f"{output_dir}/validation.json", "w") as f:
+        json.dump(val_data, f, indent=2)
+    with open(f"{output_dir}/test.json", "w") as f:
+        json.dump(test_data, f, indent=2)
+    print("✅ Data prepared successfully!")
+    print(f"  - Train samples: {len(train_data)}")
+    print(f"  - Validation samples: {len(val_data)}")
+    print(f"  - Test samples: {len(test_data)}")
+    print(f"  - Total: {total}")
+    return train_data, val_data, test_data
+if __name__ == "__main__":
+    prepare_data()
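
The matching labels and the dataset split in this file are easiest to follow with concrete numbers. The sketch below re-implements the score rule from `generate_matching_samples` and the index arithmetic from `prepare_data` without NumPy; `match_report` and `split_sizes` are illustrative names, and the skill lists are made up for the example.

```python
def match_report(resume_skills, required_skills):
    """Score a resume by the share of required skills it covers."""
    matching = sorted(set(resume_skills) & set(required_skills))
    score = int(len(matching) / len(required_skills) * 100)
    return {
        "match_score": score,
        "matching_skills": matching,
        "missing_skills": [s for s in required_skills if s not in resume_skills],
        "recommendation": "Recommend interview" if score >= 70 else "Consider further review",
    }


def split_sizes(total):
    """Sizes of the train/validation/test slices for an 80/10/10 split."""
    train_idx = int(0.8 * total)
    val_idx = int(0.9 * total)
    return train_idx, val_idx - train_idx, total - val_idx


report = match_report(
    resume_skills=["Python", "Docker", "AWS", "React"],
    required_skills=["Python", "Django", "AWS"],
)
print(report["match_score"], report["recommendation"])  # 66 Consider further review
print(report["missing_skills"])                         # ['Django']
print(split_sizes(1500))                                # (1200, 150, 150)
```

With the default 1000 extraction and 500 matching samples, the split therefore yields 1200, 150, and 150 records. Since `prepare_data` never seeds NumPy's generator, each run produces a different dataset and split; seeding it (for example with `np.random.seed`) would make runs repeatable.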

src/evaluate.py ADDED Viewed

	@@ -0,0 +1,274 @@

+import json
+import numpy as np
+from sklearn.metrics import precision_recall_fscore_support, accuracy_score
+from typing import List, Dict
+import re
+import os
+class EvaluationMetrics:
+    """Evaluate model performance on both tasks"""
+    @staticmethod
+    def evaluate_extraction(predictions: List[Dict], ground_truth: List[Dict]) -> Dict:
+        """Evaluate extraction task performance"""
+        metrics = {
+            "overall_accuracy": 0,
+            "field_accuracies": {},
+            "total_samples": len(predictions)
+        }
+        all_correct = 0
+        field_correct = {}
+        field_counts = {}
+        # Extract field names
+        fields = ["name", "email", "phone", "skills", "experience", "education", "certifications"]
+        for field in fields:
+            field_correct[field] = 0
+            field_counts[field] = 0
+        for pred, truth in zip(predictions, ground_truth):
+            for field in fields:
+                if field in pred and field in truth:
+                    field_counts[field] += 1
+                    # Compare field values
+                    if isinstance(pred[field], (list, dict)):
+                        if json.dumps(pred[field], sort_keys=True) == json.dumps(truth[field], sort_keys=True):
+                            field_correct[field] += 1
+                    else:
+                        if str(pred[field]).lower() == str(truth[field]).lower():
+                            field_correct[field] += 1
+        # Calculate field accuracies
+        for field in fields:
+            if field_counts[field] > 0:
+                accuracy = field_correct[field] / field_counts[field]
+                metrics["field_accuracies"][field] = accuracy
+        # Overall accuracy
+        total_fields = sum(field_counts.values())
+        if total_fields > 0:
+            metrics["overall_accuracy"] = sum(field_correct.values()) / total_fields
+        return metrics
+    @staticmethod
+    def evaluate_matching(predictions: List[Dict], ground_truth: List[Dict]) -> Dict:
+        """Evaluate matching task performance"""
+        metrics = {
+            "score_rmse": 0,
+            "score_mae": 0,
+            "skill_matching_precision": 0,
+            "skill_matching_recall": 0,
+            "recommendation_accuracy": 0,
+            "total_samples": len(predictions)
+        }
+        score_errors = []
+        correct_recommendations = 0
+        # Count skills per sample; pooling every sample into one global set
+        # saturates both metrics once each skill has appeared anywhere.
+        true_positive_skills = 0
+        predicted_skill_count = 0
+        expected_skill_count = 0
+        for pred, truth in zip(predictions, ground_truth):
+            # Score error
+            if "match_score" in pred and "match_score" in truth:
+                score_errors.append(abs(pred["match_score"] - truth["match_score"]))
+            # Recommendation accuracy
+            if "recommendation" in pred and "recommendation" in truth:
+                if pred["recommendation"].lower() == truth["recommendation"].lower():
+                    correct_recommendations += 1
+            # Skill matching
+            if "matching_skills" in pred and "matching_skills" in truth:
+                pred_skills = set(pred["matching_skills"])
+                truth_skills = set(truth["matching_skills"])
+                true_positive_skills += len(pred_skills & truth_skills)
+                predicted_skill_count += len(pred_skills)
+                expected_skill_count += len(truth_skills)
+        if score_errors:
+            metrics["score_rmse"] = np.sqrt(np.mean(np.array(score_errors)**2))
+            metrics["score_mae"] = np.mean(score_errors)
+        if len(predictions) > 0:
+            metrics["recommendation_accuracy"] = correct_recommendations / len(predictions)
+        # Skill matching metrics (micro-averaged over samples)
+        if predicted_skill_count:
+            metrics["skill_matching_precision"] = true_positive_skills / predicted_skill_count
+        if expected_skill_count:
+            metrics["skill_matching_recall"] = true_positive_skills / expected_skill_count
+        return metrics
+    @staticmethod
+    def print_metrics(metrics: Dict, task: str):
+        """Pretty print metrics"""
+        print(f"\n{'='*50}")
+        print(f"EVALUATION RESULTS - {task.upper()}")
+        print(f"{'='*50}")
+        for key, value in metrics.items():
+            if isinstance(value, float):
+                print(f"{key}: {value:.4f}")
+            elif isinstance(value, dict):
+                print(f"\n{key}:")
+                for sub_key, sub_value in value.items():
+                    if isinstance(sub_value, float):
+                        print(f"  {sub_key}: {sub_value:.4f}")
+                    else:
+                        print(f"  {sub_key}: {sub_value}")
+            else:
+                print(f"{key}: {value}")
+def evaluate_on_test_set(test_path: str = "data/processed/test.json",
+                        model_path: str = "models/checkpoints/final"):
+    """Evaluate model on test set"""
+    # Prefer package-relative import; fall back to absolute when executed as a script.
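+    try:
+        from .inference import ResumeInferenceEngine
+    except ImportError as e:
+        if "attempted relative import" not in str(e).lower():
+            raise
+        try:
+            from src.inference import ResumeInferenceEngine
+        except ImportError:
+            # Run as `python src/evaluate.py`: src/ itself is on sys.path, not the repo root.
+            from inference import ResumeInferenceEngine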
+    def _load_json_or_jsonl(path: str):
+        with open(path, "r", encoding="utf-8") as f:
+            content = f.read().strip()
+        if not content:
+            return []
+        # JSON array
+        if content[0] == "[":
+            return json.loads(content)
+        # JSONL
+        rows = []
+        for line in content.splitlines():
+            line = line.strip()
+            if not line:
+                continue
+            rows.append(json.loads(line))
+        return rows
+    def _safe_json_loads(text: str):
+        try:
+            return json.loads(text)
+        except Exception:
+            return None
+    def _parse_match_score(text: str):
+        # Accept formats like "Match Score: 0.82", "match_score = 82" or "match_score": 82
+        if not isinstance(text, str):
+            return None
+        match = re.search(r"match[\s_]*score\W*?[:=]\s*([0-9]*\.?[0-9]+)", text, flags=re.IGNORECASE)
+        if not match:
+            return None
+        value = float(match.group(1))
+        # Normalize to 0-100 if it looks like 0-1
+        if value <= 1.0:
+            value *= 100.0
+        return value
+    # Load test data (supports JSON array or JSONL)
+    test_data = _load_json_or_jsonl(test_path)
+    # Initialize engine
+    engine = ResumeInferenceEngine(model_path)
+    # Separate by task (fallback: treat everything as matching)
+    extraction_samples = [s for s in test_data if s.get("task") == "extraction"]
+    matching_samples = [s for s in test_data if s.get("task") == "matching"]
+    if not extraction_samples and not matching_samples:
+        matching_samples = list(test_data)
+    print(f"Evaluating on {len(extraction_samples)} extraction samples...")
+    print(f"Evaluating on {len(matching_samples)} matching samples...")
+    # Evaluate extraction
+    extraction_preds = []
+    extraction_truth = []
+    for sample in extraction_samples:
+        try:
+            pred = engine.extract_resume(sample["input"])
+            extraction_preds.append(pred)
+            truth = _safe_json_loads(sample.get("output", ""))
+            extraction_truth.append(truth if isinstance(truth, dict) else {})
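+        except Exception as e:
+            print(f"Error on extraction sample: {e}")
+            # Append to both lists so predictions and ground truth stay index-aligned.
+            extraction_preds.append({})
+            extraction_truth.append({})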
+    extraction_metrics = EvaluationMetrics.evaluate_extraction(extraction_preds, extraction_truth)
+    EvaluationMetrics.print_metrics(extraction_metrics, "extraction")
+    # Evaluate matching
+    matching_preds = []
+    matching_truth = []
+    for sample in matching_samples:
+        try:
+            input_text = sample.get("input", "")
+            # Try to parse the expected delimiter; otherwise treat entire input as resume text.
+            parts = input_text.split("\n\nJob Description:\n")
+            if len(parts) == 2:
+                resume = parts[0].replace("Resume:\n", "").strip()
+                job = parts[1].strip()
+            else:
+                resume = input_text.strip()
+                job = ""
+            pred = engine.match_resume_to_job(resume, job) if job else engine.extract_resume(resume)
+            matching_preds.append(pred)
+            truth_obj = _safe_json_loads(sample.get("output", ""))
+            if isinstance(truth_obj, dict):
+                if "match_score" in truth_obj and isinstance(truth_obj["match_score"], (int, float)):
+                    # normalize to 0-100 if needed
+                    if truth_obj["match_score"] <= 1.0:
+                        truth_obj["match_score"] *= 100.0
+                matching_truth.append(truth_obj)
+            else:
+                # Fallback: parse numeric score from plain text outputs like "Match Score: 0.82"
+                score = _parse_match_score(sample.get("output", ""))
+                matching_truth.append({"match_score": score} if score is not None else {})
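+        except Exception as e:
+            print(f"Error on matching sample: {e}")
+            # Append to both lists so predictions and ground truth stay index-aligned.
+            matching_preds.append({})
+            matching_truth.append({})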
+    matching_metrics = EvaluationMetrics.evaluate_matching(matching_preds, matching_truth)
+    EvaluationMetrics.print_metrics(matching_metrics, "matching")
+    # Save results
+    results = {
+        "extraction": extraction_metrics,
+        "matching": matching_metrics
+    }
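+    # Imported here so the function also works when this module is imported rather than run as a script.
+    import os
+    os.makedirs("results", exist_ok=True)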
+    with open("results/evaluation_results.json", "w", encoding="utf-8") as f:
+        json.dump(results, f, indent=2)
+    print("\n✅ Results saved to results/evaluation_results.json")
+    return extraction_metrics, matching_metrics
+if __name__ == "__main__":
+    import argparse
+    import os
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--test-path", default="data/processed/test.json")
+    parser.add_argument("--model-path", default="models/checkpoints/final")
+    args = parser.parse_args()
+    os.makedirs("results", exist_ok=True)
+    evaluate_on_test_set(args.test_path, args.model_path)
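
The skill-matching metrics in `EvaluationMetrics.evaluate_matching` pool every predicted and reference skill across the test set and compare them as sets. A minimal standalone sketch of that calculation follows. The helper name `skill_precision_recall` and the case-insensitive normalization are assumptions added for illustration; `src/evaluate.py` compares the strings exactly as given.

```python
from typing import Dict, Iterable


def skill_precision_recall(predicted: Iterable[str], reference: Iterable[str]) -> Dict[str, float]:
    """Set-based precision/recall over skill names (hypothetical helper)."""
    # Assumption: compare case-insensitively and ignore surrounding whitespace.
    pred = {s.strip().lower() for s in predicted if s.strip()}
    ref = {s.strip().lower() for s in reference if s.strip()}
    correct = len(pred & ref)

    metrics: Dict[str, float] = {}
    if pred:
        # Share of predicted skills that also appear in the reference set.
        metrics["skill_matching_precision"] = correct / len(pred)
    if ref:
        # Share of reference skills that the model recovered.
        metrics["skill_matching_recall"] = correct / len(ref)
    return metrics


if __name__ == "__main__":
    print(skill_precision_recall(["Python", "SQL", "Docker"], ["python", "sql", "AWS", "Spark"]))
```

Because the sets are pooled over all samples, a skill predicted for the wrong resume still counts as correct; a per-sample average would be stricter.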

src/inference.py ADDED

	@@ -0,0 +1,316 @@

+import torch
+import json
+import numpy as np
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from typing import Dict, List, Union
+import re
+import os
+class ResumeInferenceEngine:
+    """Inference engine for resume extraction and matching"""
+    def __init__(self, model_path: str = "models/checkpoints/final"):
+        """Load fine-tuned model and tokenizer"""
+        print(f"Loading model from {model_path}...")
+        # CPU-only environments (common on Windows laptops) can hit PEFT/accelerate
+        # offload edge-cases when using device_map="auto". Prefer a simple CPU load.
+        use_cuda = torch.cuda.is_available()
+        dtype = torch.float16 if use_cuda else torch.float32
+        device_map = "auto" if use_cuda else None
+        low_cpu_mem_usage = use_cuda
+        adapter_config_path = os.path.join(model_path, "adapter_config.json")
+        is_adapter = os.path.exists(adapter_config_path)
+        # Prefer tokenizer saved alongside adapter/model (the notebook saves tokenizer to final/)
+        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+        if self.tokenizer.pad_token is None and self.tokenizer.eos_token is not None:
+            self.tokenizer.pad_token = self.tokenizer.eos_token
+        if is_adapter:
+            from peft import PeftModel
+            with open(adapter_config_path, "r", encoding="utf-8") as f:
+                adapter_cfg = json.load(f)
+            base_model_name = adapter_cfg.get("base_model_name_or_path") or adapter_cfg.get("base_model") or "microsoft/phi-2"
+            base_model = AutoModelForCausalLM.from_pretrained(
+                base_model_name,
+                torch_dtype=dtype,
+                device_map=device_map,
+                low_cpu_mem_usage=low_cpu_mem_usage,
+                trust_remote_code=True,
+            )
+            self.model = PeftModel.from_pretrained(base_model, model_path)
+        else:
+            self.model = AutoModelForCausalLM.from_pretrained(
+                model_path,
+                torch_dtype=dtype,
+                device_map=device_map,
+                low_cpu_mem_usage=low_cpu_mem_usage,
+                trust_remote_code=True,
+            )
+        self.model.eval()
+    def extract_resume(self, resume_text: str) -> Dict:
+        """Extract structured information from resume"""
+        prompt = f"""Instruction: Extract structured information from the resume. Return valid JSON with fields: name, email, phone, skills, experience, education, certifications.
+Input:
+{resume_text}
+Output:"""
+        output = self._generate(prompt)
+        return self._parse_json_output(output)
+    def match_resume_to_job(self, resume_text: str, job_description: str) -> Dict:
+        """Match resume to job description"""
+        prompt = f"""Instruction: Compare the resume against the job description and provide a match score (0-100) with reasoning. Return valid JSON with fields: match_score, matching_skills, missing_skills, recommendation.
+Input:
+Resume:
+{resume_text}
+Job Description:
+{job_description}
+Output:"""
+        # Use a lower temperature to improve format adherence.
+        output = self._generate(prompt, max_length=256, temperature=0.3)
+        return self._parse_json_output(output)
+    def _generate(self, prompt: str, max_length: int = 512, temperature: float = 0.7) -> str:
+        """Generate text from prompt"""
+        # When using device_map="auto", pick the device of the first parameter.
+        input_device = next(iter(self.model.parameters())).device
+        tokenized = self.tokenizer(prompt, return_tensors="pt")
+        tokenized = {k: v.to(input_device) for k, v in tokenized.items()}
+        input_len = tokenized["input_ids"].shape[1]
+        # Interpret max_length as a generation budget (max_new_tokens) for backward compat.
+        max_new_tokens = max(64, min(512, int(max_length)))
+        with torch.inference_mode():
+            sequences = self.model.generate(
+                **tokenized,
+                max_new_tokens=max_new_tokens,
+                min_new_tokens=8,
+                temperature=temperature,
+                top_p=0.95,
+                num_beams=1,
+                do_sample=True,
+                pad_token_id=self.tokenizer.pad_token_id,
+                eos_token_id=self.tokenizer.eos_token_id,
+            )
+        # Decode ONLY the generated continuation; avoids returning an empty string when the
+        # prompt already contains the delimiter text (e.g., "Output:").
+        gen_tokens = sequences[0][input_len:]
+        gen_text = self.tokenizer.decode(gen_tokens, skip_special_tokens=True).strip()
+        if gen_text:
+            return gen_text
+        # Fallback: full decode so callers can see what happened.
+        full_text = self.tokenizer.decode(sequences[0], skip_special_tokens=True)
+        return full_text.strip()
+    def _parse_json_output(self, output: str) -> Dict:
+        """Extract JSON from model output"""
+        def _split_skills(v: Union[str, List[str], None]) -> List[str]:
+            if v is None:
+                return []
+            if isinstance(v, list):
+                return [str(s).strip() for s in v if str(s).strip()]
+            v = str(v).strip()
+            if not v or v.lower() in {"none", "n/a", "na"}:
+                return []
+            return [s.strip() for s in v.split(",") if s.strip()]
+        def _normalize(d: Dict) -> Dict:
+            if not isinstance(d, dict):
+                return {"raw_output": output}
+            # Normalize match_score to 0-100
+            if "match_score" in d:
+                try:
+                    score = d["match_score"]
+                    if isinstance(score, str):
+                        score = float(re.findall(r"[0-9]*\.?[0-9]+", score)[0])
+                    else:
+                        score = float(score)
+                    if score <= 1.0:
+                        score *= 100.0
+                    d["match_score"] = score
+                except Exception:
+                    pass
+            # Normalize skills fields to lists
+            if "matching_skills" in d:
+                d["matching_skills"] = _split_skills(d.get("matching_skills"))
+            if "missing_skills" in d:
+                d["missing_skills"] = _split_skills(d.get("missing_skills"))
+            # Preserve raw output for debugging
+            d.setdefault("raw_output", output)
+            return d
+        try:
+            # Try to find JSON in the output
+            json_match = re.search(r'\{.*\}', output, re.DOTALL)
+            if json_match:
+                json_str = json_match.group(0)
+                return _normalize(json.loads(json_str))
+        except json.JSONDecodeError:
+            pass
+        # Fallback: parse simple key:value lines (common when the model doesn't emit JSON).
+        # Example:
+        # match_score: 0.85
+        # matching_skills: Python, TensorFlow
+        if isinstance(output, str):
+            kv = {}
+            for raw_line in output.splitlines():
+                line = raw_line.strip()
+                if not line or ":" not in line:
+                    continue
+                key, value = line.split(":", 1)
+                key = key.strip().strip('"').strip("'").lower()
+                value = value.strip().strip('"').strip("'")
+                if not key:
+                    continue
+                kv[key] = value
+            if kv:
+                # Normalize known fields
+                if "match_score" in kv:
+                    try:
+                        score = float(re.findall(r"[0-9]*\.?[0-9]+", kv["match_score"])[0])
+                        if score <= 1.0:
+                            score *= 100.0
+                        kv["match_score"] = score
+                    except Exception:
+                        pass
+                if "matching_skills" in kv:
+                    kv["matching_skills"] = _split_skills(kv["matching_skills"])
+                if "missing_skills" in kv:
+                    kv["missing_skills"] = _split_skills(kv["missing_skills"])
+                # Keep a copy of the original raw output for debugging
+                kv["raw_output"] = output
+                return kv
+        # Fallback: try to parse a match score from plain text.
+        m = re.search(r"match\s*score\s*[:=]\s*([0-9]*\.?[0-9]+)", output or "", flags=re.IGNORECASE)
+        if m:
+            score = float(m.group(1))
+            if score <= 1.0:
+                score *= 100.0
+            return {"match_score": score, "raw_output": output}
+        # Return structured response if parsing fails
+        return {"error": "Failed to parse output", "raw_output": output}
+    def batch_extract(self, resumes: List[str]) -> List[Dict]:
+        """Extract from multiple resumes"""
+        results = []
+        for i, resume in enumerate(resumes):
+            print(f"Processing resume {i+1}/{len(resumes)}...")
+            results.append(self.extract_resume(resume))
+        return results
+    def batch_match(self, resume_pairs: List[tuple]) -> List[Dict]:
+        """Match multiple resume-job pairs"""
+        results = []
+        for i, (resume, job) in enumerate(resume_pairs):
+            print(f"Processing pair {i+1}/{len(resume_pairs)}...")
+            results.append(self.match_resume_to_job(resume, job))
+        return results
+# Flask API for serving predictions
+def create_api(model_path: str = "models/checkpoints/final"):
+    """Create Flask API for inference"""
+    from flask import Flask, request, jsonify
+    app = Flask(__name__)
+    engine = ResumeInferenceEngine(model_path)
+    @app.route("/extract", methods=["POST"])
+    def extract():
+        """Extract information from resume"""
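+        # Tolerate a missing or malformed JSON body instead of failing on None.get(...)
+        data = request.get_json(silent=True) or {}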
+        resume = data.get("resume", "")
+        if not resume:
+            return jsonify({"error": "Resume text required"}), 400
+        result = engine.extract_resume(resume)
+        return jsonify(result)
+    @app.route("/match", methods=["POST"])
+    def match():
+        """Match resume to job description"""
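+        # Tolerate a missing or malformed JSON body instead of failing on None.get(...)
+        data = request.get_json(silent=True) or {}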
+        resume = data.get("resume", "")
+        job = data.get("job_description", "")
+        if not resume or not job:
+            return jsonify({"error": "Resume and job description required"}), 400
+        result = engine.match_resume_to_job(resume, job)
+        return jsonify(result)
+    @app.route("/health", methods=["GET"])
+    def health():
+        return jsonify({"status": "healthy"})
+    return app
+def main():
+    import argparse
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--mode", default="cli", help="Mode: cli or api")
+    parser.add_argument("--model-path", default="models/checkpoints/final", help="Path to model")
+    parser.add_argument("--task", choices=["extract", "match"], default="extract")
+    parser.add_argument("--resume-file", help="Path to resume file")
+    parser.add_argument("--job-file", help="Path to job description file")
+    parser.add_argument("--port", type=int, default=5000, help="API port")
+    args = parser.parse_args()
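+    if args.mode == "cli":
+        if not args.resume_file or (args.task == "match" and not args.job_file):
+            parser.error("--resume-file is required (plus --job-file for --task match)")
+        # Load the model only for CLI use; create_api() builds its own engine in API mode,
+        # so loading here unconditionally would hold two copies of the model in memory.
+        engine = ResumeInferenceEngine(args.model_path)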
+        if args.task == "extract":
+            with open(args.resume_file) as f:
+                resume = f.read()
+            result = engine.extract_resume(resume)
+            print(json.dumps(result, indent=2))
+        elif args.task == "match":
+            with open(args.resume_file) as f:
+                resume = f.read()
+            with open(args.job_file) as f:
+                job = f.read()
+            result = engine.match_resume_to_job(resume, job)
+            print(json.dumps(result, indent=2))
+    elif args.mode == "api":
+        app = create_api(args.model_path)
+        print(f"Starting API on port {args.port}...")
+        app.run(host="0.0.0.0", port=args.port, debug=False)
+if __name__ == "__main__":
+    main()
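
`_parse_json_output` tries three strategies in order: a JSON object embedded in the text, then `key: value` lines, then a bare "match score" pattern. The sketch below condenses the first two steps into a standalone function so the fallback order is easier to follow. The names `parse_model_output` and `normalize_score` are hypothetical, and the skill-list splitting done by the real method is left out.

```python
import json
import re
from typing import Dict


def normalize_score(value: float) -> float:
    """Map scores written on a 0-1 scale onto 0-100, as the engine does."""
    return value * 100.0 if value <= 1.0 else value


def parse_model_output(text: str) -> Dict:
    """Hypothetical condensed version of the JSON -> key:value fallback chain."""
    # 1) Try the widest {...} span in the text as a JSON object.
    found = re.search(r"\{.*\}", text, re.DOTALL)
    if found:
        try:
            data = json.loads(found.group(0))
            if isinstance(data, dict):
                if isinstance(data.get("match_score"), (int, float)):
                    data["match_score"] = normalize_score(float(data["match_score"]))
                return data
        except json.JSONDecodeError:
            pass  # fall through to the line-based parser

    # 2) Fall back to "key: value" lines.
    pairs: Dict = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            key = key.strip().lower()
            if key:
                pairs[key] = value.strip()
    if "match_score" in pairs:
        number = re.search(r"[0-9]*\.?[0-9]+", pairs["match_score"])
        if number:
            pairs["match_score"] = normalize_score(float(number.group(0)))

    # 3) Nothing recognizable: return an error record with the raw text.
    return pairs or {"error": "Failed to parse output", "raw_output": text}
```

Note that a score of exactly 1 is treated as 100%, so outputs on the 0-100 scale cannot express a 1% match; that ambiguity comes from the `<= 1.0` rule itself.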

src/train.py ADDED

	@@ -0,0 +1,175 @@

+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
+from datasets import load_dataset
+import json
+from typing import Dict
+import argparse
+import os
+class ResumeModelTrainer:
+    """Fine-tune LLM for resume extraction and matching"""
+    def __init__(self, model_name: str = "mistralai/Mistral-7B-Instruct-v0.1"):
+        self.model_name = model_name
+        self.device = "cuda" if torch.cuda.is_available() else "cpu"
+        print(f"Using device: {self.device}")
+    def setup_model(self):
+        """Load and configure model with quantization"""
+        # 4-bit quantization for memory efficiency
+        bnb_config = BitsAndBytesConfig(
+            load_in_4bit=True,
+            bnb_4bit_use_double_quant=True,
+            bnb_4bit_quant_type="nf4",
+            bnb_4bit_compute_dtype=torch.bfloat16
+        )
+        print(f"Loading model: {self.model_name}")
+        model = AutoModelForCausalLM.from_pretrained(
+            self.model_name,
+            quantization_config=bnb_config,
+            device_map="auto"
+        )
+        tokenizer = AutoTokenizer.from_pretrained(self.model_name)
+        tokenizer.pad_token = tokenizer.eos_token
+        # Prepare for LoRA
+        model = prepare_model_for_kbit_training(model)
+        # LoRA config
+        peft_config = LoraConfig(
+            r=16,
+            lora_alpha=32,
+            lora_dropout=0.05,
+            bias="none",
+            task_type="CAUSAL_LM",
+            target_modules=["q_proj", "v_proj"]
+        )
+        model = get_peft_model(model, peft_config)
+        return model, tokenizer
+    def prepare_data(self, data_path: str):
+        """Load and format training data"""
+        with open(data_path) as f:
+            data = json.load(f)
+        def format_sample(sample):
+            return {
+                "text": f"""Instruction: {sample['instruction']}
+Input:
+{sample['input']}
+Output:
+{sample['output']}"""
+            }
+        formatted_data = [format_sample(s) for s in data]
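+        # Create the dataset from the formatted samples. (field="text" would ask the JSON
+        # loader for a top-level "text" key in the file, which the raw data does not have.)
+        from datasets import Dataset
+        dataset = Dataset.from_list(formatted_data)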
+        return dataset, formatted_data
+    def train(self,
+              train_path: str = "data/processed/train.json",
+              val_path: str = "data/processed/validation.json",
+              output_dir: str = "models/checkpoints",
+              num_epochs: int = 3):
+        """Train the model"""
+        from transformers import Trainer, TrainingArguments
+        os.makedirs(output_dir, exist_ok=True)
+        # Load model and tokenizer
+        model, tokenizer = self.setup_model()
+        # Prepare datasets
+        dataset = load_dataset("json", data_files={"train": train_path, "validation": val_path})
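+        def tokenize_function(examples):
+            # With batched=True each column arrives as a list, so build one prompt per row.
+            # Include the target output; without it the model never sees what to generate.
+            texts = [
+                f"Instruction: {ins}\nInput:\n{inp}\nOutput:\n{out}"
+                for ins, inp, out in zip(examples["instruction"], examples["input"], examples["output"])
+            ]
+            tokenized = tokenizer(
+                texts,
+                truncation=True,
+                max_length=512,
+                padding="max_length"
+            )
+            # Copy input_ids into labels, ignoring padded positions (-100) in the loss.
+            tokenized["labels"] = [
+                [tok if keep == 1 else -100 for tok, keep in zip(ids, mask)]
+                for ids, mask in zip(tokenized["input_ids"], tokenized["attention_mask"])
+            ]
+            return tokenized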
+        tokenized_datasets = dataset.map(tokenize_function, batched=True)
+        # Training arguments
+        training_args = TrainingArguments(
+            output_dir=output_dir,
+            num_train_epochs=num_epochs,
+            per_device_train_batch_size=4,
+            per_device_eval_batch_size=4,
+            warmup_steps=100,
+            weight_decay=0.01,
+            logging_dir="./logs",
+            logging_steps=50,
+            evaluation_strategy="epoch",
+            save_strategy="epoch",
+            learning_rate=5e-4,
+            bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),  # bfloat16 only where the GPU supports it
+            lr_scheduler_type="cosine",
+            gradient_accumulation_steps=2,
+        )
+        trainer = Trainer(
+            model=model,
+            args=training_args,
+            train_dataset=tokenized_datasets["train"],
+            eval_dataset=tokenized_datasets["validation"],
+            tokenizer=tokenizer,
+        )
+        print("Starting training...")
+        trainer.train()
+        # Save final model
+        model.save_pretrained(f"{output_dir}/final")
+        tokenizer.save_pretrained(f"{output_dir}/final")
+        print(f"✅ Model saved to {output_dir}/final")
+        return model, tokenizer
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--task", default="both", help="Task: extraction, matching, or both")
+    parser.add_argument("--model", default="mistral", help="Model: mistral or llama")
+    parser.add_argument("--epochs", type=int, default=3, help="Number of training epochs")
+    parser.add_argument("--output-dir", default="models/checkpoints", help="Output directory")
+    args = parser.parse_args()
+    # Select model
+    model_map = {
+        "mistral": "mistralai/Mistral-7B-Instruct-v0.1",
+        "llama": "meta-llama/Llama-2-7b-hf"
+    }
+    model_name = model_map.get(args.model, "mistralai/Mistral-7B-Instruct-v0.1")
+    trainer = ResumeModelTrainer(model_name)
+    model, tokenizer = trainer.train(
+        num_epochs=args.epochs,
+        output_dir=args.output_dir
+    )
+    print("✅ Training complete!")
+if __name__ == "__main__":
+    main()
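
Causal-LM fine-tuning needs two things from the data pipeline: each training string must contain the target output after the prompt, and padded positions must be excluded from the loss. The sketch below shows both steps on plain Python lists, without a tokenizer. `build_training_text` and `mask_padding` are hypothetical helper names, and the template is assumed to mirror the Instruction/Input/Output layout used by `prepare_data`.

```python
from typing import Dict, List

# Label value skipped by PyTorch's cross-entropy loss (its default ignore_index).
IGNORE_INDEX = -100


def build_training_text(sample: Dict[str, str]) -> str:
    """Join instruction, input and output into one training string."""
    return (
        f"Instruction: {sample['instruction']}\n"
        f"Input:\n{sample['input']}\n"
        f"Output:\n{sample['output']}"
    )


def mask_padding(input_ids: List[int], attention_mask: List[int]) -> List[int]:
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(input_ids, attention_mask)]


if __name__ == "__main__":
    sample = {"instruction": "Extract", "input": "Jane Doe", "output": "{}"}
    print(build_training_text(sample))
    # Token ids are made up; the last two positions are padding.
    print(mask_padding([5, 6, 7, 0, 0], [1, 1, 1, 0, 0]))
```

Masking by attention mask matters here because the pad token is set to the EOS token, so masking by token id alone would also hide genuine EOS tokens.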

src/utils.py ADDED

	@@ -0,0 +1,59 @@

+# Helper utilities for the project
+def parse_skill_match_score(score_str: str) -> int:
+    """Extract the first integer from a string; returns 50 when no digits are found"""
+    import re
+    match = re.search(r'\d+', score_str)
+    return int(match.group(0)) if match else 50
+def format_experience_duration(years_str: str) -> str:
+    """Standardize experience duration format"""
+    import re
+    match = re.search(r'\d+', years_str)
+    if match:
+        years = int(match.group(0))
+        return f"{years} years"
+    return years_str
+def clean_text(text: str) -> str:
+    """Clean and normalize text"""
+    import re
+    # Remove extra whitespace
+    text = re.sub(r'\s+', ' ', text)
+    # Remove special characters, keeping those that occur in skill names (C++, C#, CI/CD)
+    text = re.sub(r'[^\w\s\-@.+#/]', '', text)
+    return text.strip()
+def skill_similarity(skill1: str, skill2: str) -> float:
+    """Calculate similarity between two skills"""
+    from difflib import SequenceMatcher
+    return SequenceMatcher(None, skill1.lower(), skill2.lower()).ratio()
+def batch_process(items: list, batch_size: int = 32):
+    """Process items in batches"""
+    for i in range(0, len(items), batch_size):
+        yield items[i:i+batch_size]
+# Model conversion utilities
+def convert_to_onnx(model_path: str, output_path: str):
+    """Convert fine-tuned model to ONNX format for faster inference"""
+    from transformers import AutoModelForCausalLM, AutoTokenizer
+    model = AutoModelForCausalLM.from_pretrained(model_path)
+    model.eval()  # disable dropout so the exported graph is deterministic
+    tokenizer = AutoTokenizer.from_pretrained(model_path)
+    # Export to ONNX
+    import torch
+    dummy_input = torch.tensor([[tokenizer.eos_token_id]])
+    torch.onnx.export(
+        model,
+        dummy_input,
+        output_path,
+        input_names=['input_ids'],
+        output_names=['output'],
+        dynamic_axes={'input_ids': {0: 'batch_size', 1: 'sequence'}},
+        opset_version=12
+    )
+    print(f"✅ Model exported to {output_path}")
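
`skill_similarity` returns a `difflib` ratio between 0 and 1, which makes it usable for tolerant skill comparison when spellings differ slightly. The sketch below shows one way it could be applied. `fuzzy_matching_skills` and the 0.85 threshold are assumptions for illustration and do not appear in the repository.

```python
from difflib import SequenceMatcher
from typing import List


def skill_similarity(skill1: str, skill2: str) -> float:
    """Same ratio as src/utils.py: 1.0 for identical strings, ignoring case."""
    return SequenceMatcher(None, skill1.lower(), skill2.lower()).ratio()


def fuzzy_matching_skills(resume_skills: List[str], job_skills: List[str],
                          threshold: float = 0.85) -> List[str]:
    """Hypothetical helper: job skills with a close match among the resume skills."""
    return [
        job for job in job_skills
        if any(skill_similarity(job, have) >= threshold for have in resume_skills)
    ]


if __name__ == "__main__":
    resume = ["python", "TensorFlow"]
    job = ["Python", "Tensorflow", "Kubernetes"]
    print(fuzzy_matching_skills(resume, job))
```

The threshold trades false matches against missed ones, so it would need tuning on real skill lists before being used in the metrics.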