Your Name committed on
Commit
8e21fda
·
0 Parent(s):

Initialize independent HF Space repository

.gitattributes ADDED
@@ -0,0 +1 @@
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
Dockerfile ADDED
@@ -0,0 +1,27 @@
+ # Use Python 3.10
+ FROM python:3.10-slim
+
+ WORKDIR /app
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
+
+ # Copy requirements
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy source code and model
+ COPY src/ ./src/
+ COPY models/ ./models/
+
+ # Create a user to run the app (security best practice for HF)
+ RUN useradd -m -u 1000 user
+ USER user
+ ENV HOME=/home/user \
+     PATH=/home/user/.local/bin:$PATH
+
+ # Expose the standard HF Spaces port
+ EXPOSE 7860
+
+ # Start the API
+ CMD ["python", "src/inference.py", "--mode", "api", "--port", "7860"]
README.md ADDED
@@ -0,0 +1,11 @@
+ ---
+ title: "Resume-LLM-API"
+ emoji: "📄"
+ colorFrom: "blue"
+ colorTo: "indigo"
+ sdk: "docker"
+ pinned: false
+ app_port: 7860
+ ---
+
+ # Resume LLM API
models/checkpoints/final/README.md ADDED
@@ -0,0 +1,207 @@
1
+ ---
2
+ base_model: microsoft/phi-2
3
+ library_name: peft
4
+ pipeline_tag: text-generation
5
+ tags:
6
+ - base_model:adapter:microsoft/phi-2
7
+ - lora
8
+ - transformers
9
+ ---
10
+
11
+ # Model Card for Model ID
12
+
13
+ <!-- Provide a quick summary of what the model is/does. -->
14
+
15
+
16
+
17
+ ## Model Details
18
+
19
+ ### Model Description
20
+
21
+ <!-- Provide a longer summary of what this model is. -->
22
+
23
+
24
+
25
+ - **Developed by:** [More Information Needed]
26
+ - **Funded by [optional]:** [More Information Needed]
27
+ - **Shared by [optional]:** [More Information Needed]
28
+ - **Model type:** [More Information Needed]
29
+ - **Language(s) (NLP):** [More Information Needed]
30
+ - **License:** [More Information Needed]
31
+ - **Finetuned from model [optional]:** [More Information Needed]
32
+
33
+ ### Model Sources [optional]
34
+
35
+ <!-- Provide the basic links for the model. -->
36
+
37
+ - **Repository:** [More Information Needed]
38
+ - **Paper [optional]:** [More Information Needed]
39
+ - **Demo [optional]:** [More Information Needed]
40
+
41
+ ## Uses
42
+
43
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
44
+
45
+ ### Direct Use
46
+
47
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
48
+
49
+ [More Information Needed]
50
+
51
+ ### Downstream Use [optional]
52
+
53
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
54
+
55
+ [More Information Needed]
56
+
57
+ ### Out-of-Scope Use
58
+
59
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
60
+
61
+ [More Information Needed]
62
+
63
+ ## Bias, Risks, and Limitations
64
+
65
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
66
+
67
+ [More Information Needed]
68
+
69
+ ### Recommendations
70
+
71
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
72
+
73
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
74
+
75
+ ## How to Get Started with the Model
76
+
77
+ Use the code below to get started with the model.
78
+
79
+ [More Information Needed]
80
+
81
+ ## Training Details
82
+
83
+ ### Training Data
84
+
85
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
86
+
87
+ [More Information Needed]
88
+
89
+ ### Training Procedure
90
+
91
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
92
+
93
+ #### Preprocessing [optional]
94
+
95
+ [More Information Needed]
96
+
97
+
98
+ #### Training Hyperparameters
99
+
100
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
101
+
102
+ #### Speeds, Sizes, Times [optional]
103
+
104
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
105
+
106
+ [More Information Needed]
107
+
108
+ ## Evaluation
109
+
110
+ <!-- This section describes the evaluation protocols and provides the results. -->
111
+
112
+ ### Testing Data, Factors & Metrics
113
+
114
+ #### Testing Data
115
+
116
+ <!-- This should link to a Dataset Card if possible. -->
117
+
118
+ [More Information Needed]
119
+
120
+ #### Factors
121
+
122
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
123
+
124
+ [More Information Needed]
125
+
126
+ #### Metrics
127
+
128
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
129
+
130
+ [More Information Needed]
131
+
132
+ ### Results
133
+
134
+ [More Information Needed]
135
+
136
+ #### Summary
137
+
138
+
139
+
140
+ ## Model Examination [optional]
141
+
142
+ <!-- Relevant interpretability work for the model goes here -->
143
+
144
+ [More Information Needed]
145
+
146
+ ## Environmental Impact
147
+
148
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
149
+
150
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
151
+
152
+ - **Hardware Type:** [More Information Needed]
153
+ - **Hours used:** [More Information Needed]
154
+ - **Cloud Provider:** [More Information Needed]
155
+ - **Compute Region:** [More Information Needed]
156
+ - **Carbon Emitted:** [More Information Needed]
157
+
158
+ ## Technical Specifications [optional]
159
+
160
+ ### Model Architecture and Objective
161
+
162
+ [More Information Needed]
163
+
164
+ ### Compute Infrastructure
165
+
166
+ [More Information Needed]
167
+
168
+ #### Hardware
169
+
170
+ [More Information Needed]
171
+
172
+ #### Software
173
+
174
+ [More Information Needed]
175
+
176
+ ## Citation [optional]
177
+
178
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
179
+
180
+ **BibTeX:**
181
+
182
+ [More Information Needed]
183
+
184
+ **APA:**
185
+
186
+ [More Information Needed]
187
+
188
+ ## Glossary [optional]
189
+
190
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
191
+
192
+ [More Information Needed]
193
+
194
+ ## More Information [optional]
195
+
196
+ [More Information Needed]
197
+
198
+ ## Model Card Authors [optional]
199
+
200
+ [More Information Needed]
201
+
202
+ ## Model Card Contact
203
+
204
+ [More Information Needed]
205
+ ### Framework versions
206
+
207
+ - PEFT 0.18.1
models/checkpoints/final/adapter_config.json ADDED
@@ -0,0 +1,41 @@
+ {
+   "alora_invocation_tokens": null,
+   "alpha_pattern": {},
+   "arrow_config": null,
+   "auto_mapping": null,
+   "base_model_name_or_path": "microsoft/phi-2",
+   "bias": "none",
+   "corda_config": null,
+   "ensure_weight_tying": false,
+   "eva_config": null,
+   "exclude_modules": null,
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 32,
+   "lora_bias": false,
+   "lora_dropout": 0.05,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "peft_version": "0.18.1",
+   "qalora_group_size": 16,
+   "r": 16,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "v_proj",
+     "q_proj"
+   ],
+   "target_parameters": null,
+   "task_type": "CAUSAL_LM",
+   "trainable_token_indices": null,
+   "use_dora": false,
+   "use_qalora": false,
+   "use_rslora": false
+ }
models/checkpoints/final/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f5dd863a28403b34cfd76507ced3b90a837c1010e10b9e23ccf06e777693c74d
+ size 20988664
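The size recorded in this pointer can be cross-checked against `adapter_config.json`. The sketch below is a back-of-the-envelope estimate, not part of the commit. It assumes phi-2 has a hidden size of 2560 and 32 decoder layers (taken from the public phi-2 configuration, not from this diff), that `q_proj` and `v_proj` are both 2560×2560, and that the adapter is stored in float32. The helper name `lora_param_count` is hypothetical.

```python
def lora_param_count(r: int, in_features: int, out_features: int,
                     modules_per_layer: int, num_layers: int) -> int:
    """Count LoRA weights: each adapted module adds
    A (r x in_features) and B (out_features x r)."""
    per_module = r * (in_features + out_features)
    return per_module * modules_per_layer * num_layers


# Values from adapter_config.json in this commit.
r, lora_alpha = 16, 32
target_modules = ["v_proj", "q_proj"]

# Assumed phi-2 dimensions (not stated anywhere in this diff).
hidden_size, num_layers = 2560, 32

params = lora_param_count(r, hidden_size, hidden_size,
                          len(target_modules), num_layers)
approx_bytes = params * 4   # assuming float32 storage
scaling = lora_alpha / r    # standard LoRA scaling, since use_rslora is false

print(params)        # 5242880
print(approx_bytes)  # 20971520
print(scaling)       # 2.0
```

Under these assumptions the estimate of 20,971,520 bytes falls about 17 KB short of the 20,988,664 bytes in the pointer. The difference is plausibly the safetensors header, which would be consistent with a float32 adapter on the query and value projections of every layer.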
models/checkpoints/final/added_tokens.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "\t\t": 50294,
3
+ "\t\t\t": 50293,
4
+ "\t\t\t\t": 50292,
5
+ "\t\t\t\t\t": 50291,
6
+ "\t\t\t\t\t\t": 50290,
7
+ "\t\t\t\t\t\t\t": 50289,
8
+ "\t\t\t\t\t\t\t\t": 50288,
9
+ "\t\t\t\t\t\t\t\t\t": 50287,
10
+ " ": 50286,
11
+ " ": 50285,
12
+ " ": 50284,
13
+ " ": 50283,
14
+ " ": 50282,
15
+ " ": 50281,
16
+ " ": 50280,
17
+ " ": 50279,
18
+ " ": 50278,
19
+ " ": 50277,
20
+ " ": 50276,
21
+ " ": 50275,
22
+ " ": 50274,
23
+ " ": 50273,
24
+ " ": 50272,
25
+ " ": 50271,
26
+ " ": 50270,
27
+ " ": 50269,
28
+ " ": 50268,
29
+ " ": 50267,
30
+ " ": 50266,
31
+ " ": 50265,
32
+ " ": 50264,
33
+ " ": 50263,
34
+ " ": 50262,
35
+ " ": 50261,
36
+ " ": 50260,
37
+ " ": 50259,
38
+ " ": 50258,
39
+ " ": 50257
40
+ }
models/checkpoints/final/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
models/checkpoints/final/special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "bos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<|endoftext|>",
+   "unk_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
models/checkpoints/final/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
models/checkpoints/final/tokenizer_config.json ADDED
@@ -0,0 +1,326 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "50256": {
5
+ "content": "<|endoftext|>",
6
+ "lstrip": false,
7
+ "normalized": false,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "50257": {
13
+ "content": " ",
14
+ "lstrip": false,
15
+ "normalized": true,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": false
19
+ },
20
+ "50258": {
21
+ "content": " ",
22
+ "lstrip": false,
23
+ "normalized": true,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": false
27
+ },
28
+ "50259": {
29
+ "content": " ",
30
+ "lstrip": false,
31
+ "normalized": true,
32
+ "rstrip": false,
33
+ "single_word": false,
34
+ "special": false
35
+ },
36
+ "50260": {
37
+ "content": " ",
38
+ "lstrip": false,
39
+ "normalized": true,
40
+ "rstrip": false,
41
+ "single_word": false,
42
+ "special": false
43
+ },
44
+ "50261": {
45
+ "content": " ",
46
+ "lstrip": false,
47
+ "normalized": true,
48
+ "rstrip": false,
49
+ "single_word": false,
50
+ "special": false
51
+ },
52
+ "50262": {
53
+ "content": " ",
54
+ "lstrip": false,
55
+ "normalized": true,
56
+ "rstrip": false,
57
+ "single_word": false,
58
+ "special": false
59
+ },
60
+ "50263": {
61
+ "content": " ",
62
+ "lstrip": false,
63
+ "normalized": true,
64
+ "rstrip": false,
65
+ "single_word": false,
66
+ "special": false
67
+ },
68
+ "50264": {
69
+ "content": " ",
70
+ "lstrip": false,
71
+ "normalized": true,
72
+ "rstrip": false,
73
+ "single_word": false,
74
+ "special": false
75
+ },
76
+ "50265": {
77
+ "content": " ",
78
+ "lstrip": false,
79
+ "normalized": true,
80
+ "rstrip": false,
81
+ "single_word": false,
82
+ "special": false
83
+ },
84
+ "50266": {
85
+ "content": " ",
86
+ "lstrip": false,
87
+ "normalized": true,
88
+ "rstrip": false,
89
+ "single_word": false,
90
+ "special": false
91
+ },
92
+ "50267": {
93
+ "content": " ",
94
+ "lstrip": false,
95
+ "normalized": true,
96
+ "rstrip": false,
97
+ "single_word": false,
98
+ "special": false
99
+ },
100
+ "50268": {
101
+ "content": " ",
102
+ "lstrip": false,
103
+ "normalized": true,
104
+ "rstrip": false,
105
+ "single_word": false,
106
+ "special": false
107
+ },
108
+ "50269": {
109
+ "content": " ",
110
+ "lstrip": false,
111
+ "normalized": true,
112
+ "rstrip": false,
113
+ "single_word": false,
114
+ "special": false
115
+ },
116
+ "50270": {
117
+ "content": " ",
118
+ "lstrip": false,
119
+ "normalized": true,
120
+ "rstrip": false,
121
+ "single_word": false,
122
+ "special": false
123
+ },
124
+ "50271": {
125
+ "content": " ",
126
+ "lstrip": false,
127
+ "normalized": true,
128
+ "rstrip": false,
129
+ "single_word": false,
130
+ "special": false
131
+ },
132
+ "50272": {
133
+ "content": " ",
134
+ "lstrip": false,
135
+ "normalized": true,
136
+ "rstrip": false,
137
+ "single_word": false,
138
+ "special": false
139
+ },
140
+ "50273": {
141
+ "content": " ",
142
+ "lstrip": false,
143
+ "normalized": true,
144
+ "rstrip": false,
145
+ "single_word": false,
146
+ "special": false
147
+ },
148
+ "50274": {
149
+ "content": " ",
150
+ "lstrip": false,
151
+ "normalized": true,
152
+ "rstrip": false,
153
+ "single_word": false,
154
+ "special": false
155
+ },
156
+ "50275": {
157
+ "content": " ",
158
+ "lstrip": false,
159
+ "normalized": true,
160
+ "rstrip": false,
161
+ "single_word": false,
162
+ "special": false
163
+ },
164
+ "50276": {
165
+ "content": " ",
166
+ "lstrip": false,
167
+ "normalized": true,
168
+ "rstrip": false,
169
+ "single_word": false,
170
+ "special": false
171
+ },
172
+ "50277": {
173
+ "content": " ",
174
+ "lstrip": false,
175
+ "normalized": true,
176
+ "rstrip": false,
177
+ "single_word": false,
178
+ "special": false
179
+ },
180
+ "50278": {
181
+ "content": " ",
182
+ "lstrip": false,
183
+ "normalized": true,
184
+ "rstrip": false,
185
+ "single_word": false,
186
+ "special": false
187
+ },
188
+ "50279": {
189
+ "content": " ",
190
+ "lstrip": false,
191
+ "normalized": true,
192
+ "rstrip": false,
193
+ "single_word": false,
194
+ "special": false
195
+ },
196
+ "50280": {
197
+ "content": " ",
198
+ "lstrip": false,
199
+ "normalized": true,
200
+ "rstrip": false,
201
+ "single_word": false,
202
+ "special": false
203
+ },
204
+ "50281": {
205
+ "content": " ",
206
+ "lstrip": false,
207
+ "normalized": true,
208
+ "rstrip": false,
209
+ "single_word": false,
210
+ "special": false
211
+ },
212
+ "50282": {
213
+ "content": " ",
214
+ "lstrip": false,
215
+ "normalized": true,
216
+ "rstrip": false,
217
+ "single_word": false,
218
+ "special": false
219
+ },
220
+ "50283": {
221
+ "content": " ",
222
+ "lstrip": false,
223
+ "normalized": true,
224
+ "rstrip": false,
225
+ "single_word": false,
226
+ "special": false
227
+ },
228
+ "50284": {
229
+ "content": " ",
230
+ "lstrip": false,
231
+ "normalized": true,
232
+ "rstrip": false,
233
+ "single_word": false,
234
+ "special": false
235
+ },
236
+ "50285": {
237
+ "content": " ",
238
+ "lstrip": false,
239
+ "normalized": true,
240
+ "rstrip": false,
241
+ "single_word": false,
242
+ "special": false
243
+ },
244
+ "50286": {
245
+ "content": " ",
246
+ "lstrip": false,
247
+ "normalized": true,
248
+ "rstrip": false,
249
+ "single_word": false,
250
+ "special": false
251
+ },
252
+ "50287": {
253
+ "content": "\t\t\t\t\t\t\t\t\t",
254
+ "lstrip": false,
255
+ "normalized": true,
256
+ "rstrip": false,
257
+ "single_word": false,
258
+ "special": false
259
+ },
260
+ "50288": {
261
+ "content": "\t\t\t\t\t\t\t\t",
262
+ "lstrip": false,
263
+ "normalized": true,
264
+ "rstrip": false,
265
+ "single_word": false,
266
+ "special": false
267
+ },
268
+ "50289": {
269
+ "content": "\t\t\t\t\t\t\t",
270
+ "lstrip": false,
271
+ "normalized": true,
272
+ "rstrip": false,
273
+ "single_word": false,
274
+ "special": false
275
+ },
276
+ "50290": {
277
+ "content": "\t\t\t\t\t\t",
278
+ "lstrip": false,
279
+ "normalized": true,
280
+ "rstrip": false,
281
+ "single_word": false,
282
+ "special": false
283
+ },
284
+ "50291": {
285
+ "content": "\t\t\t\t\t",
286
+ "lstrip": false,
287
+ "normalized": true,
288
+ "rstrip": false,
289
+ "single_word": false,
290
+ "special": false
291
+ },
292
+ "50292": {
293
+ "content": "\t\t\t\t",
294
+ "lstrip": false,
295
+ "normalized": true,
296
+ "rstrip": false,
297
+ "single_word": false,
298
+ "special": false
299
+ },
300
+ "50293": {
301
+ "content": "\t\t\t",
302
+ "lstrip": false,
303
+ "normalized": true,
304
+ "rstrip": false,
305
+ "single_word": false,
306
+ "special": false
307
+ },
308
+ "50294": {
309
+ "content": "\t\t",
310
+ "lstrip": false,
311
+ "normalized": true,
312
+ "rstrip": false,
313
+ "single_word": false,
314
+ "special": false
315
+ }
316
+ },
317
+ "bos_token": "<|endoftext|>",
318
+ "clean_up_tokenization_spaces": true,
319
+ "eos_token": "<|endoftext|>",
320
+ "extra_special_tokens": {},
321
+ "model_max_length": 2048,
322
+ "pad_token": "<|endoftext|>",
323
+ "return_token_type_ids": false,
324
+ "tokenizer_class": "CodeGenTokenizer",
325
+ "unk_token": "<|endoftext|>"
326
+ }
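Because `pad_token`, `bos_token`, `eos_token`, and `unk_token` all map to `<|endoftext|>` (id 50256), padding cannot be told apart from a genuine end-of-text token by id alone. The sketch below is illustrative only and assumes right padding. `pad_batch` is a hypothetical helper, not something in this repository. It derives the attention mask from the known sequence lengths rather than from `ids != pad_id`.

```python
from typing import List, Tuple

PAD_ID = 50256  # <|endoftext|>, shared by pad/bos/eos/unk in tokenizer_config.json


def pad_batch(seqs: List[List[int]], pad_id: int = PAD_ID
              ) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad sequences and derive the mask from lengths, not token ids."""
    max_len = max(len(s) for s in seqs)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return input_ids, attention_mask


# The first sequence ends with a real EOS (50256); the second is shorter.
ids, mask = pad_batch([[10, 11, 50256], [12]])
naive_mask = [[int(t != PAD_ID) for t in row] for row in ids]

print(mask)        # [[1, 1, 1], [1, 0, 0]]
print(naive_mask)  # [[1, 1, 0], [1, 0, 0]]  (wrongly hides the real EOS)
```

In practice the `attention_mask` returned by a Hugging Face tokenizer called with `padding=True` serves the same purpose, so an id-based mask should not be needed.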
models/checkpoints/final/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ # Dependencies for the AI Model (Hugging Face / GPU Server)
+ torch
+ transformers
+ tokenizers
+ accelerate
+ peft
+ flask
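None of these requirements are pinned, while `adapter_config.json` records that the adapter was saved with PEFT 0.18.1. A small check like the following can report what actually got installed in the image. It is a sketch using only the standard library. The function name `installed_versions` is hypothetical, and whether a version mismatch matters for loading this adapter is not something the diff establishes.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, List, Optional

# Distribution names copied from requirements.txt in this commit.
REQUIRED = ["torch", "transformers", "tokenizers", "accelerate", "peft", "flask"]


def installed_versions(packages: List[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(REQUIRED).items():
        print(f"{name}: {ver or 'not installed'}")
```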
resume-llm-api ADDED
@@ -0,0 +1 @@
+ Subproject commit db3e24a2516e30c66dc06acd4084f4203028de66
src/__init__.py ADDED
File without changes
src/__pycache__/__init__.cpython-312.pyc ADDED
Binary file (160 Bytes). View file
 
src/__pycache__/inference.cpython-312.pyc ADDED
Binary file (15.5 kB). View file
 
src/data_preparation.py ADDED
@@ -0,0 +1,211 @@
+ import json
+ import pandas as pd
+ import numpy as np
+ from typing import List, Dict, Tuple
+ import os
+
+ class DataGenerator:
+     """Generate synthetic training data for both tasks"""
+
+     @staticmethod
+     def generate_extraction_samples(num_samples: int = 1000) -> List[Dict]:
+         """Generate resume extraction training samples"""
+
+         companies = ["TechCorp", "DataFlow", "CloudSys", "AI Labs", "WebDev Inc",
+                      "FinTech Solutions", "Health Systems", "E-commerce Plus"]
+         roles = ["Developer", "Senior Developer", "Data Scientist", "ML Engineer",
+                  "Product Manager", "DevOps Engineer", "Frontend Engineer", "Backend Engineer"]
+         skills_pool = ["Python", "Django", "Flask", "FastAPI", "PostgreSQL", "MongoDB",
+                        "React", "Vue.js", "AWS", "GCP", "Docker", "Kubernetes",
+                        "Machine Learning", "NLP", "TensorFlow", "PyTorch", "Git",
+                        "SQL", "REST API", "GraphQL", "Redis", "Elasticsearch"]
+         universities = ["MIT", "Stanford", "Carnegie Mellon", "Berkeley", "Harvard",
+                         "University of Washington", "State University", "Tech Institute"]
+         degrees = ["BS Computer Science", "BS Data Science", "MS Computer Science",
+                    "MS Artificial Intelligence", "BS Engineering"]
+
+         samples = []
+         for i in range(num_samples):
+             name = f"Candidate_{i+1}"
+             email = f"candidate{i+1}@email.com"
+             phone = f"555-{np.random.randint(1000, 9999)}"
+
+             # Experience
+             num_exp = np.random.randint(1, 4)
+             experience = []
+             for _ in range(num_exp):
+                 experience.append({
+                     "company": np.random.choice(companies),
+                     "role": np.random.choice(roles),
+                     "duration": f"{np.random.randint(1, 7)} years",
+                     "description": "Led projects and mentored team members"
+                 })
+
+             # Skills
+             num_skills = np.random.randint(3, 10)
+             skills = list(np.random.choice(skills_pool, num_skills, replace=False))
+
+             # Education
+             education = [{
+                 "degree": np.random.choice(degrees),
+                 "university": np.random.choice(universities),
+                 "graduation_year": str(np.random.randint(2015, 2023))
+             }]
+
+             # Certifications
+             certifications = [f"Cert_{j}" for j in range(np.random.randint(0, 3))]
+
+             resume_text = f"""
+ Resume of {name}
+ Email: {email} | Phone: {phone}
+
+ EXPERIENCE:
+ {chr(10).join([f"- {exp['company']}: {exp['role']} ({exp['duration']})" for exp in experience])}
+
+ SKILLS:
+ {', '.join(skills)}
+
+ EDUCATION:
+ {chr(10).join([f"- {edu['degree']} from {edu['university']} ({edu['graduation_year']})" for edu in education])}
+
+ CERTIFICATIONS:
+ {chr(10).join(certifications) if certifications else "None"}
+ """
+
+             extracted_data = {
+                 "name": name,
+                 "email": email,
+                 "phone": phone,
+                 "skills": skills,
+                 "experience": experience,
+                 "education": education,
+                 "certifications": certifications
+             }
+
+             samples.append({
+                 "input": resume_text.strip(),
+                 "output": json.dumps(extracted_data, indent=2),
+                 "task": "extraction"
+             })
+
+         return samples
+
+     @staticmethod
+     def generate_matching_samples(num_samples: int = 500) -> List[Dict]:
+         """Generate resume-job matching training samples"""
+
+         job_titles = ["Senior Python Developer", "Data Scientist", "ML Engineer",
+                       "Full-Stack Developer", "DevOps Engineer", "Product Manager"]
+         skills_pool = ["Python", "Django", "PostgreSQL", "AWS", "Docker", "Kubernetes",
+                        "Machine Learning", "React", "Node.js", "SQL"]
+
+         samples = []
+         for i in range(num_samples):
+             # Create job description
+             job_title = np.random.choice(job_titles)
+             required_skills = list(np.random.choice(skills_pool, np.random.randint(3, 7), replace=False))
+
+             job_desc = f"""
+ Job Title: {job_title}
+
+ Required Skills:
+ {', '.join(required_skills)}
+
+ Experience: 3+ years in relevant role
+ Education: BS in Computer Science or related field
+ """
+
+             # Create matching resume
+             resume_skills = list(np.random.choice(skills_pool, np.random.randint(3, 8), replace=False))
+             resume = f"Skills: {', '.join(resume_skills)}\nExperience: {np.random.randint(1, 8)} years"
+
+             # Calculate match score based on skill overlap
+             matching_skills = list(set(resume_skills) & set(required_skills))
+             match_score = min(100, int((len(matching_skills) / len(required_skills)) * 100))
+
+             matching_data = {
+                 "match_score": match_score,
+                 "matching_skills": matching_skills,
+                 "missing_skills": [s for s in required_skills if s not in resume_skills],
+                 "recommendation": "Recommend interview" if match_score >= 70 else "Consider further review"
+             }
+
+             samples.append({
+                 "input": f"Resume:\n{resume}\n\nJob Description:\n{job_desc}",
+                 "output": json.dumps(matching_data, indent=2),
+                 "task": "matching"
+             })
+
+         return samples
+
+     @staticmethod
+     def create_instruction_dataset(extraction_samples: List[Dict],
+                                    matching_samples: List[Dict]) -> List[Dict]:
+         """Convert samples to instruction-following format"""
+
+         dataset = []
+
+         # Extraction task instructions
+         for sample in extraction_samples:
+             dataset.append({
+                 "instruction": "Extract structured information from the resume. Return valid JSON.",
+                 "input": sample["input"],
+                 "output": sample["output"],
+                 "task": "extraction"
+             })
+
+         # Matching task instructions
+         for sample in matching_samples:
+             dataset.append({
+                 "instruction": "Compare the resume against the job description and provide a match score (0-100) with reasoning. Return valid JSON.",
+                 "input": sample["input"],
+                 "output": sample["output"],
+                 "task": "matching"
+             })
+
+         return dataset
+
+ def prepare_data(output_dir: str = "data/processed"):
+     """Main function to prepare all data"""
+
+     os.makedirs(output_dir, exist_ok=True)
+
+     print("Generating extraction samples...")
+     extraction_samples = DataGenerator.generate_extraction_samples(1000)
+
+     print("Generating matching samples...")
+     matching_samples = DataGenerator.generate_matching_samples(500)
+
+     print("Creating instruction dataset...")
+     full_dataset = DataGenerator.create_instruction_dataset(extraction_samples, matching_samples)
+
+     # Split into train/val/test
+     np.random.shuffle(full_dataset)
+     total = len(full_dataset)
+     train_idx = int(0.8 * total)
+     val_idx = int(0.9 * total)
+
+     train_data = full_dataset[:train_idx]
+     val_data = full_dataset[train_idx:val_idx]
+     test_data = full_dataset[val_idx:]
+
+     # Save datasets
+     with open(f"{output_dir}/train.json", "w") as f:
+         json.dump(train_data, f, indent=2)
+
+     with open(f"{output_dir}/validation.json", "w") as f:
+         json.dump(val_data, f, indent=2)
+
+     with open(f"{output_dir}/test.json", "w") as f:
+         json.dump(test_data, f, indent=2)
+
+     print(f"✅ Data prepared successfully!")
+     print(f" - Train samples: {len(train_data)}")
+     print(f" - Validation samples: {len(val_data)}")
+     print(f" - Test samples: {len(test_data)}")
+     print(f" - Total: {total}")
+
+     return train_data, val_data, test_data
+
+ if __name__ == "__main__":
+     prepare_data()
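`prepare_data` draws from NumPy's global random state without seeding it, so each run yields a different dataset and a different split. One way to make the 80/10/10 split reproducible is sketched below. It assumes a fixed seed is acceptable for this project. `split_dataset` and the seed value are hypothetical and not part of the commit.

```python
from typing import Dict, List, Tuple

import numpy as np


def split_dataset(dataset: List[Dict], seed: int = 42,
                  train_frac: float = 0.8, val_frac: float = 0.1
                  ) -> Tuple[List[Dict], List[Dict], List[Dict]]:
    """Shuffle with a local, seeded generator and cut into train/val/test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(dataset))
    shuffled = [dataset[i] for i in order]
    train_end = int(train_frac * len(shuffled))
    val_end = train_end + int(val_frac * len(shuffled))
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


# 1000 extraction + 500 matching samples, as in prepare_data().
toy = [{"id": i} for i in range(1500)]
train, val, test = split_dataset(toy)
print(len(train), len(val), len(test))  # 1200 150 150
```

Using a local `Generator` instead of `np.random.seed` keeps the split deterministic without changing the random state seen by other code. The sample generators would need the same treatment for the generated resumes themselves to be reproducible.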
src/evaluate.py ADDED
@@ -0,0 +1,274 @@
+ import json
+ import numpy as np
+ from sklearn.metrics import precision_recall_fscore_support, accuracy_score
+ from typing import List, Dict
+ import re
+ import os
+
+ class EvaluationMetrics:
+     """Evaluate model performance on both tasks"""
+
+     @staticmethod
+     def evaluate_extraction(predictions: List[Dict], ground_truth: List[Dict]) -> Dict:
+         """Evaluate extraction task performance"""
+
+         metrics = {
+             "overall_accuracy": 0,
+             "field_accuracies": {},
+             "total_samples": len(predictions)
+         }
+
+         all_correct = 0
+         field_correct = {}
+         field_counts = {}
+
+         # Extract field names
+         fields = ["name", "email", "phone", "skills", "experience", "education", "certifications"]
+
+         for field in fields:
+             field_correct[field] = 0
+             field_counts[field] = 0
+
+         for pred, truth in zip(predictions, ground_truth):
+             for field in fields:
+                 if field in pred and field in truth:
+                     field_counts[field] += 1
+
+                     # Compare field values
+                     if isinstance(pred[field], (list, dict)):
+                         if json.dumps(pred[field], sort_keys=True) == json.dumps(truth[field], sort_keys=True):
+                             field_correct[field] += 1
+                     else:
+                         if str(pred[field]).lower() == str(truth[field]).lower():
+                             field_correct[field] += 1
+
+         # Calculate field accuracies
+         for field in fields:
+             if field_counts[field] > 0:
+                 accuracy = field_correct[field] / field_counts[field]
+                 metrics["field_accuracies"][field] = accuracy
+
+         # Overall accuracy
+         total_fields = sum(field_counts.values())
+         if total_fields > 0:
+             metrics["overall_accuracy"] = sum(field_correct.values()) / total_fields
+
+         return metrics
+
+     @staticmethod
+     def evaluate_matching(predictions: List[Dict], ground_truth: List[Dict]) -> Dict:
+         """Evaluate matching task performance"""
+
+         metrics = {
+             "score_rmse": 0,
+             "score_mae": 0,
+             "skill_matching_precision": 0,
+             "skill_matching_recall": 0,
+             "recommendation_accuracy": 0,
+             "total_samples": len(predictions)
+         }
+
+         score_errors = []
+         correct_recommendations = 0
+         all_matching_skills = []
+         all_pred_matching_skills = []
+
+         for idx, (pred, truth) in enumerate(zip(predictions, ground_truth)):
+             # Score error
+             if "match_score" in pred and "match_score" in truth:
+                 score_errors.append(abs(pred["match_score"] - truth["match_score"]))
+
+             # Recommendation accuracy
+             if "recommendation" in pred and "recommendation" in truth:
+                 if pred["recommendation"].lower() == truth["recommendation"].lower():
+                     correct_recommendations += 1
+
+             # Skill matching (keyed by sample index so skills are compared per sample, not pooled)
+             if "matching_skills" in pred and "matching_skills" in truth:
+                 all_pred_matching_skills.extend((idx, s) for s in pred.get("matching_skills", []))
+                 all_matching_skills.extend((idx, s) for s in truth.get("matching_skills", []))
+
+         if score_errors:
+             metrics["score_rmse"] = np.sqrt(np.mean(np.array(score_errors)**2))
+             metrics["score_mae"] = np.mean(score_errors)
+
+         if len(predictions) > 0:
+             metrics["recommendation_accuracy"] = correct_recommendations / len(predictions)
+
+         # Skill matching metrics
+         if all_matching_skills or all_pred_matching_skills:
+             # Micro-averaged precision/recall over (sample index, skill) pairs
+             correct_skills = len(set(all_pred_matching_skills) & set(all_matching_skills))
+             if all_pred_matching_skills:
+                 metrics["skill_matching_precision"] = correct_skills / len(set(all_pred_matching_skills))
+             if all_matching_skills:
+                 metrics["skill_matching_recall"] = correct_skills / len(set(all_matching_skills))
+
+         return metrics
+
+     @staticmethod
+     def print_metrics(metrics: Dict, task: str):
+         """Pretty print metrics"""
+
+         print(f"\n{'='*50}")
+         print(f"EVALUATION RESULTS - {task.upper()}")
+         print(f"{'='*50}")
+
+         for key, value in metrics.items():
+             if isinstance(value, float):
+                 print(f"{key}: {value:.4f}")
+             elif isinstance(value, dict):
+                 print(f"\n{key}:")
+                 for sub_key, sub_value in value.items():
+                     if isinstance(sub_value, float):
+                         print(f" {sub_key}: {sub_value:.4f}")
+                     else:
+                         print(f" {sub_key}: {sub_value}")
+             else:
+                 print(f"{key}: {value}")
+
+def evaluate_on_test_set(test_path: str = "data/processed/test.json",
+                         model_path: str = "models/checkpoints/final"):
+    """Evaluate model on test set"""
+
+    # Prefer package-relative import; fall back to absolute when executed as a script.
+    try:
+        from .inference import ResumeInferenceEngine
+    except ImportError as e:
+        if "attempted relative import" in str(e).lower():
+            from src.inference import ResumeInferenceEngine
+        else:
+            raise
+
+    def _load_json_or_jsonl(path: str):
+        with open(path, "r", encoding="utf-8") as f:
+            content = f.read().strip()
+        if not content:
+            return []
+        # JSON array
+        if content[0] == "[":
+            return json.loads(content)
+        # JSONL
+        rows = []
+        for line in content.splitlines():
+            line = line.strip()
+            if not line:
+                continue
+            rows.append(json.loads(line))
+        return rows
+
+    def _safe_json_loads(text: str):
+        try:
+            return json.loads(text)
+        except Exception:
+            return None
+
+    def _parse_match_score(text: str):
+        # Accept formats like "Match Score: 0.82" or JSON {"match_score": 82}
+        if not isinstance(text, str):
+            return None
+        match = re.search(r"match\s*score\s*[:=]\s*([0-9]*\.?[0-9]+)", text, flags=re.IGNORECASE)
+        if not match:
+            return None
+        value = float(match.group(1))
+        # Normalize to 0-100 if it looks like 0-1
+        if value <= 1.0:
+            value *= 100.0
+        return value
+
+    # Load test data (supports JSON array or JSONL)
+    test_data = _load_json_or_jsonl(test_path)
+
+    # Initialize engine
+    engine = ResumeInferenceEngine(model_path)
+
+    # Separate by task (fallback: treat everything as matching)
+    extraction_samples = [s for s in test_data if s.get("task") == "extraction"]
+    matching_samples = [s for s in test_data if s.get("task") == "matching"]
+    if not extraction_samples and not matching_samples:
+        matching_samples = list(test_data)
+
+    print(f"Evaluating on {len(extraction_samples)} extraction samples...")
+    print(f"Evaluating on {len(matching_samples)} matching samples...")
+
+    # Evaluate extraction
+    extraction_preds = []
+    extraction_truth = []
+
+    for sample in extraction_samples:
+        # Record the ground truth first so both lists stay index-aligned
+        # even when inference fails for this sample.
+        truth = _safe_json_loads(sample.get("output", ""))
+        extraction_truth.append(truth if isinstance(truth, dict) else {})
+        try:
+            extraction_preds.append(engine.extract_resume(sample["input"]))
+        except Exception as e:
+            print(f"Error on extraction sample: {e}")
+            extraction_preds.append({})
+
+    extraction_metrics = EvaluationMetrics.evaluate_extraction(extraction_preds, extraction_truth)
+    EvaluationMetrics.print_metrics(extraction_metrics, "extraction")
+
+    # Evaluate matching
+    matching_preds = []
+    matching_truth = []
+
+    for sample in matching_samples:
+        # Record the ground truth first so both lists stay index-aligned
+        # even when inference fails for this sample.
+        truth_obj = _safe_json_loads(sample.get("output", ""))
+        if isinstance(truth_obj, dict):
+            if "match_score" in truth_obj and isinstance(truth_obj["match_score"], (int, float)):
+                # normalize to 0-100 if needed
+                if truth_obj["match_score"] <= 1.0:
+                    truth_obj["match_score"] *= 100.0
+            matching_truth.append(truth_obj)
+        else:
+            # Fallback: parse numeric score from plain text outputs like "Match Score: 0.82"
+            score = _parse_match_score(sample.get("output", ""))
+            matching_truth.append({"match_score": score} if score is not None else {})
+
+        try:
+            input_text = sample.get("input", "")
+
+            # Try to parse the expected delimiter; otherwise treat entire input as resume text.
+            parts = input_text.split("\n\nJob Description:\n")
+            if len(parts) == 2:
+                resume = parts[0].replace("Resume:\n", "").strip()
+                job = parts[1].strip()
+            else:
+                resume = input_text.strip()
+                job = ""
+
+            pred = engine.match_resume_to_job(resume, job) if job else engine.extract_resume(resume)
+            matching_preds.append(pred)
+        except Exception as e:
+            print(f"Error on matching sample: {e}")
+            matching_preds.append({})
+
+    matching_metrics = EvaluationMetrics.evaluate_matching(matching_preds, matching_truth)
+    EvaluationMetrics.print_metrics(matching_metrics, "matching")
+
+    # Save results
+    results = {
+        "extraction": extraction_metrics,
+        "matching": matching_metrics
+    }
+
+    os.makedirs("results", exist_ok=True)
+    with open("results/evaluation_results.json", "w", encoding="utf-8") as f:
+        json.dump(results, f, indent=2)
+
+    print("\n✅ Results saved to results/evaluation_results.json")
+
+    return extraction_metrics, matching_metrics
+
+if __name__ == "__main__":
+    import argparse
+    import os
+
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--test-path", default="data/processed/test.json")
+    parser.add_argument("--model-path", default="models/checkpoints/final")
+
+    args = parser.parse_args()
+
+    os.makedirs("results", exist_ok=True)
+    evaluate_on_test_set(args.test_path, args.model_path)
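The skill metrics in `evaluate_matching` pool the skills from every sample into one set before computing precision and recall, and the comparison is case-sensitive. As a point of comparison, here is a minimal sketch of a per-sample (macro-averaged) variant. The function name `per_sample_skill_scores`, the returned keys, and the lowercase normalization are assumptions made for illustration and are not part of this commit.

```python
from typing import Dict, List


def per_sample_skill_scores(predictions: List[Dict], ground_truth: List[Dict]) -> Dict[str, float]:
    """Skill precision/recall computed per sample, then averaged (hypothetical helper)."""
    precisions, recalls = [], []
    for pred, truth in zip(predictions, ground_truth):
        # Normalize case and whitespace so "Python" and "python " count as the same skill.
        pred_skills = {s.strip().lower() for s in pred.get("matching_skills", [])}
        true_skills = {s.strip().lower() for s in truth.get("matching_skills", [])}
        if not pred_skills and not true_skills:
            continue  # nothing to compare for this sample
        overlap = len(pred_skills & true_skills)
        precisions.append(overlap / len(pred_skills) if pred_skills else 0.0)
        recalls.append(overlap / len(true_skills) if true_skills else 0.0)

    n = len(precisions)
    return {
        "skill_precision": sum(precisions) / n if n else 0.0,
        "skill_recall": sum(recalls) / n if n else 0.0,
        "samples_scored": n,
    }
```

Averaging per sample keeps one resume with a long skill list from dominating the score, at the cost of giving every sample equal weight regardless of how many skills it contains.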
src/inference.py ADDED
@@ -0,0 +1,316 @@
+import torch
+import json
+import numpy as np
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from typing import Dict, List, Union
+import re
+import os
+
+class ResumeInferenceEngine:
+    """Inference engine for resume extraction and matching"""
+
+    def __init__(self, model_path: str = "models/checkpoints/final"):
+        """Load fine-tuned model and tokenizer"""
+
+        print(f"Loading model from {model_path}...")
+
+        # CPU-only environments (common on Windows laptops) can hit PEFT/accelerate
+        # offload edge-cases when using device_map="auto". Prefer a simple CPU load.
+        use_cuda = torch.cuda.is_available()
+        dtype = torch.float16 if use_cuda else torch.float32
+        device_map = "auto" if use_cuda else None
+        low_cpu_mem_usage = use_cuda
+
+        adapter_config_path = os.path.join(model_path, "adapter_config.json")
+        is_adapter = os.path.exists(adapter_config_path)
+
+        # Prefer tokenizer saved alongside adapter/model (the notebook saves tokenizer to final/)
+        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+        if self.tokenizer.pad_token is None and self.tokenizer.eos_token is not None:
+            self.tokenizer.pad_token = self.tokenizer.eos_token
+
+        if is_adapter:
+            from peft import PeftModel
+            with open(adapter_config_path, "r", encoding="utf-8") as f:
+                adapter_cfg = json.load(f)
+            base_model_name = adapter_cfg.get("base_model_name_or_path") or adapter_cfg.get("base_model") or "microsoft/phi-2"
+
+            base_model = AutoModelForCausalLM.from_pretrained(
+                base_model_name,
+                torch_dtype=dtype,
+                device_map=device_map,
+                low_cpu_mem_usage=low_cpu_mem_usage,
+                trust_remote_code=True,
+            )
+            self.model = PeftModel.from_pretrained(base_model, model_path)
+        else:
+            self.model = AutoModelForCausalLM.from_pretrained(
+                model_path,
+                torch_dtype=dtype,
+                device_map=device_map,
+                low_cpu_mem_usage=low_cpu_mem_usage,
+                trust_remote_code=True,
+            )
+
+        self.model.eval()
+
+    def extract_resume(self, resume_text: str) -> Dict:
+        """Extract structured information from resume"""
+
+        prompt = f"""Instruction: Extract structured information from the resume. Return valid JSON with fields: name, email, phone, skills, experience, education, certifications.
+
+Input:
+{resume_text}
+
+Output:"""
+
+        output = self._generate(prompt)
+        return self._parse_json_output(output)
+
+    def match_resume_to_job(self, resume_text: str, job_description: str) -> Dict:
+        """Match resume to job description"""
+
+        prompt = f"""Instruction: Compare the resume against the job description and provide a match score (0-100) with reasoning. Return valid JSON with fields: match_score, matching_skills, missing_skills, recommendation.
+
+Input:
+Resume:
+{resume_text}
+
+Job Description:
+{job_description}
+
+Output:"""
+
+        # Use a lower temperature to improve format adherence.
+        output = self._generate(prompt, max_length=256, temperature=0.3)
+        return self._parse_json_output(output)
+
+    def _generate(self, prompt: str, max_length: int = 512, temperature: float = 0.7) -> str:
+        """Generate text from prompt"""
+
+        # When using device_map="auto", pick the device of the first parameter.
+        input_device = next(iter(self.model.parameters())).device
+        tokenized = self.tokenizer(prompt, return_tensors="pt")
+        tokenized = {k: v.to(input_device) for k, v in tokenized.items()}
+        input_len = tokenized["input_ids"].shape[1]
+
+        # Interpret max_length as a generation budget (max_new_tokens) for backward compat.
+        max_new_tokens = max(64, min(512, int(max_length)))
+        with torch.inference_mode():
+            sequences = self.model.generate(
+                **tokenized,
+                max_new_tokens=max_new_tokens,
+                min_new_tokens=8,
+                temperature=temperature,
+                top_p=0.95,
+                num_beams=1,
+                do_sample=True,
+                pad_token_id=self.tokenizer.pad_token_id,
+                eos_token_id=self.tokenizer.eos_token_id,
+            )
+
+        # Decode ONLY the generated continuation; avoids returning an empty string when the
+        # prompt already contains the delimiter text (e.g., "Output:").
+        gen_tokens = sequences[0][input_len:]
+        gen_text = self.tokenizer.decode(gen_tokens, skip_special_tokens=True).strip()
+        if gen_text:
+            return gen_text
+
+        # Fallback: full decode so callers can see what happened.
+        full_text = self.tokenizer.decode(sequences[0], skip_special_tokens=True)
+        return full_text.strip()
+
+    def _parse_json_output(self, output: str) -> Dict:
+        """Extract JSON from model output"""
+
+        def _split_skills(v: Union[str, List[str], None]) -> List[str]:
+            if v is None:
+                return []
+            if isinstance(v, list):
+                return [str(s).strip() for s in v if str(s).strip()]
+            v = str(v).strip()
+            if not v or v.lower() in {"none", "n/a", "na"}:
+                return []
+            return [s.strip() for s in v.split(",") if s.strip()]
+
+        def _normalize(d: Dict) -> Dict:
+            if not isinstance(d, dict):
+                return {"raw_output": output}
+
+            # Normalize match_score to 0-100
+            if "match_score" in d:
+                try:
+                    score = d["match_score"]
+                    if isinstance(score, str):
+                        score = float(re.findall(r"[0-9]*\.?[0-9]+", score)[0])
+                    else:
+                        score = float(score)
+                    if score <= 1.0:
+                        score *= 100.0
+                    d["match_score"] = score
+                except Exception:
+                    pass
+
+            # Normalize skills fields to lists
+            if "matching_skills" in d:
+                d["matching_skills"] = _split_skills(d.get("matching_skills"))
+            if "missing_skills" in d:
+                d["missing_skills"] = _split_skills(d.get("missing_skills"))
+
+            # Preserve raw output for debugging
+            d.setdefault("raw_output", output)
+            return d
+
+        try:
+            # Try to find JSON in the output
+            json_match = re.search(r'\{.*\}', output, re.DOTALL)
+            if json_match:
+                json_str = json_match.group(0)
+                return _normalize(json.loads(json_str))
+        except json.JSONDecodeError:
+            pass
+
+        # Fallback: parse simple key:value lines (common when the model doesn't emit JSON).
+        # Example:
+        #   match_score: 0.85
+        #   matching_skills: Python, TensorFlow
+        if isinstance(output, str):
+            kv = {}
+            for raw_line in output.splitlines():
+                line = raw_line.strip()
+                if not line or ":" not in line:
+                    continue
+                key, value = line.split(":", 1)
+                key = key.strip().strip('"').strip("'").lower()
+                value = value.strip().strip('"').strip("'")
+                if not key:
+                    continue
+                kv[key] = value
+
+            if kv:
+                # Normalize known fields
+                if "match_score" in kv:
+                    try:
+                        score = float(re.findall(r"[0-9]*\.?[0-9]+", kv["match_score"])[0])
+                        if score <= 1.0:
+                            score *= 100.0
+                        kv["match_score"] = score
+                    except Exception:
+                        pass
+
+                if "matching_skills" in kv:
+                    kv["matching_skills"] = _split_skills(kv["matching_skills"])
+                if "missing_skills" in kv:
+                    kv["missing_skills"] = _split_skills(kv["missing_skills"])
+
+                # Keep a copy of the original raw output for debugging
+                kv["raw_output"] = output
+                return kv
+
+        # Fallback: try to parse a match score from plain text.
+        m = re.search(r"match\s*score\s*[:=]\s*([0-9]*\.?[0-9]+)", output or "", flags=re.IGNORECASE)
+        if m:
+            score = float(m.group(1))
+            if score <= 1.0:
+                score *= 100.0
+            return {"match_score": score, "raw_output": output}
+
+        # Return structured response if parsing fails
+        return {"error": "Failed to parse output", "raw_output": output}
+
+    def batch_extract(self, resumes: List[str]) -> List[Dict]:
+        """Extract from multiple resumes"""
+        results = []
+        for i, resume in enumerate(resumes):
+            print(f"Processing resume {i+1}/{len(resumes)}...")
+            results.append(self.extract_resume(resume))
+        return results
+
+    def batch_match(self, resume_pairs: List[tuple]) -> List[Dict]:
+        """Match multiple resume-job pairs"""
+        results = []
+        for i, (resume, job) in enumerate(resume_pairs):
+            print(f"Processing pair {i+1}/{len(resume_pairs)}...")
+            results.append(self.match_resume_to_job(resume, job))
+        return results
+
+
+# Flask API for serving predictions
+def create_api(model_path: str = "models/checkpoints/final"):
+    """Create Flask API for inference"""
+
+    from flask import Flask, request, jsonify
+
+    app = Flask(__name__)
+    engine = ResumeInferenceEngine(model_path)
+
+    @app.route("/extract", methods=["POST"])
+    def extract():
+        """Extract information from resume"""
+        # silent=True returns None for a missing or malformed JSON body, so fall
+        # back to an empty dict and let the validation below answer with a 400.
+        data = request.get_json(silent=True) or {}
+        resume = data.get("resume", "")
+
+        if not resume:
+            return jsonify({"error": "Resume text required"}), 400
+
+        result = engine.extract_resume(resume)
+        return jsonify(result)
+
+    @app.route("/match", methods=["POST"])
+    def match():
+        """Match resume to job description"""
+        data = request.get_json(silent=True) or {}
+        resume = data.get("resume", "")
+        job = data.get("job_description", "")
+
+        if not resume or not job:
+            return jsonify({"error": "Resume and job description required"}), 400
+
+        result = engine.match_resume_to_job(resume, job)
+        return jsonify(result)
+
+    @app.route("/health", methods=["GET"])
+    def health():
+        return jsonify({"status": "healthy"})
+
+    return app
+
+
+def main():
+    import argparse
+
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--mode", default="cli", help="Mode: cli, api, or batch")
+    parser.add_argument("--model-path", default="models/checkpoints/final", help="Path to model")
+    parser.add_argument("--task", choices=["extract", "match"], default="extract")
+    parser.add_argument("--resume-file", help="Path to resume file")
+    parser.add_argument("--job-file", help="Path to job description file")
+    parser.add_argument("--port", type=int, default=5000, help="API port")
+
+    args = parser.parse_args()
+
+    if args.mode == "cli":
+        if not args.resume_file:
+            parser.error("--resume-file is required in cli mode")
+        if args.task == "match" and not args.job_file:
+            parser.error("--job-file is required for --task match")
+
+        # Load the model only for CLI use. In api mode create_api() builds its own
+        # engine, so loading one here as well would hold two copies in memory.
+        engine = ResumeInferenceEngine(args.model_path)
+
+        if args.task == "extract":
+            with open(args.resume_file) as f:
+                resume = f.read()
+            result = engine.extract_resume(resume)
+            print(json.dumps(result, indent=2))
+
+        elif args.task == "match":
+            with open(args.resume_file) as f:
+                resume = f.read()
+            with open(args.job_file) as f:
+                job = f.read()
+            result = engine.match_resume_to_job(resume, job)
+            print(json.dumps(result, indent=2))
+
+    elif args.mode == "api":
+        app = create_api(args.model_path)
+        print(f"Starting API on port {args.port}...")
+        app.run(host="0.0.0.0", port=args.port, debug=False)
+
+    else:
+        parser.error(f"Mode '{args.mode}' is not implemented")
+
+
+if __name__ == "__main__":
+    main()
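`_parse_json_output` locates JSON with the greedy pattern `\{.*\}`, which spans from the first `{` to the last `}` in the generation. If the model appends any brace-containing text after a valid object, that span no longer parses and the code falls through to the key:value fallback. A minimal sketch of an alternative, assuming only the standard library: try `json.JSONDecoder.raw_decode` at each `{` and return the first object that decodes. The helper name `first_json_object` is hypothetical and not part of this commit.

```python
import json
from typing import Optional


def first_json_object(text: str) -> Optional[dict]:
    """Return the first decodable JSON object embedded in text, or None (hypothetical helper)."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None


if __name__ == "__main__":
    sample = 'Sure! {"match_score": 82, "matching_skills": ["Python"]} Note: {not json}'
    print(first_json_object(sample))
```

The result could then be passed through the same normalization step the engine already applies to parsed dictionaries.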
src/train.py ADDED
@@ -0,0 +1,175 @@
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
+from datasets import load_dataset
+import json
+from typing import Dict
+import argparse
+import os
+
+class ResumeModelTrainer:
+    """Fine-tune LLM for resume extraction and matching"""
+
+    def __init__(self, model_name: str = "mistralai/Mistral-7B-Instruct-v0.1"):
+        self.model_name = model_name
+        self.device = "cuda" if torch.cuda.is_available() else "cpu"
+        print(f"Using device: {self.device}")
+
+    def setup_model(self):
+        """Load and configure model with quantization"""
+
+        # 4-bit quantization for memory efficiency
+        bnb_config = BitsAndBytesConfig(
+            load_in_4bit=True,
+            bnb_4bit_use_double_quant=True,
+            bnb_4bit_quant_type="nf4",
+            bnb_4bit_compute_dtype=torch.bfloat16
+        )
+
+        print(f"Loading model: {self.model_name}")
+        model = AutoModelForCausalLM.from_pretrained(
+            self.model_name,
+            quantization_config=bnb_config,
+            device_map="auto"
+        )
+
+        tokenizer = AutoTokenizer.from_pretrained(self.model_name)
+        tokenizer.pad_token = tokenizer.eos_token
+
+        # Prepare for LoRA
+        model = prepare_model_for_kbit_training(model)
+
+        # LoRA config
+        peft_config = LoraConfig(
+            r=16,
+            lora_alpha=32,
+            lora_dropout=0.05,
+            bias="none",
+            task_type="CAUSAL_LM",
+            target_modules=["q_proj", "v_proj"]
+        )
+
+        model = get_peft_model(model, peft_config)
+
+        return model, tokenizer
+
+    def prepare_data(self, data_path: str):
+        """Load and format training data"""
+        from datasets import Dataset, DatasetDict
+
+        with open(data_path) as f:
+            data = json.load(f)
+
+        def format_sample(sample):
+            return {
+                "text": f"""Instruction: {sample['instruction']}
+
+Input:
+{sample['input']}
+
+Output:
+{sample['output']}"""
+            }
+
+        formatted_data = [format_sample(s) for s in data]
+
+        # Build the dataset from the formatted samples. The raw file is a list of
+        # records with no top-level "text" key, so load_dataset(..., field="text")
+        # would not find them.
+        dataset = DatasetDict({"train": Dataset.from_list(formatted_data)})
+
+        return dataset, formatted_data
+
+    def train(self,
+              train_path: str = "data/processed/train.json",
+              val_path: str = "data/processed/validation.json",
+              output_dir: str = "models/checkpoints",
+              num_epochs: int = 3):
+        """Train the model"""
+
+        from transformers import Trainer, TrainingArguments
+
+        os.makedirs(output_dir, exist_ok=True)
+
+        # Load model and tokenizer
+        model, tokenizer = self.setup_model()
+
+        # Prepare datasets
+        dataset = load_dataset("json", data_files={"train": train_path, "validation": val_path})
+
+        def tokenize_function(examples):
+            # With batched=True every column arrives as a list, so build one prompt per
+            # row. The expected output is included so the model learns to generate it.
+            texts = [
+                f"Instruction: {instruction}\n\nInput:\n{inp}\n\nOutput:\n{out}{tokenizer.eos_token}"
+                for instruction, inp, out in zip(
+                    examples["instruction"], examples["input"], examples["output"]
+                )
+            ]
+            tokenized = tokenizer(
+                texts,
+                truncation=True,
+                max_length=512,
+                padding="max_length"
+            )
+            # Copy input_ids as labels, using -100 on padding so it is ignored by the loss.
+            tokenized["labels"] = [
+                [tok if keep == 1 else -100 for tok, keep in zip(ids, mask)]
+                for ids, mask in zip(tokenized["input_ids"], tokenized["attention_mask"])
+            ]
+            return tokenized
+
+        tokenized_datasets = dataset.map(tokenize_function, batched=True)
+
+        # Training arguments
+        training_args = TrainingArguments(
+            output_dir=output_dir,
+            num_train_epochs=num_epochs,
+            per_device_train_batch_size=4,
+            per_device_eval_batch_size=4,
+            warmup_steps=100,
+            weight_decay=0.01,
+            logging_dir="./logs",
+            logging_steps=50,
+            evaluation_strategy="epoch",
+            save_strategy="epoch",
+            learning_rate=5e-4,
+            bf16=True,  # Use bfloat16 for faster training
+            lr_scheduler_type="cosine",
+            gradient_accumulation_steps=2,
+        )
+
+        trainer = Trainer(
+            model=model,
+            args=training_args,
+            train_dataset=tokenized_datasets["train"],
+            eval_dataset=tokenized_datasets["validation"],
+            tokenizer=tokenizer,
+        )
+
+        print("Starting training...")
+        trainer.train()
+
+        # Save final model
+        model.save_pretrained(f"{output_dir}/final")
+        tokenizer.save_pretrained(f"{output_dir}/final")
+        print(f"✅ Model saved to {output_dir}/final")
+
+        return model, tokenizer
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--task", default="both", help="Task: extraction, matching, or both")
+    parser.add_argument("--model", default="mistral", help="Model: mistral or llama")
+    parser.add_argument("--epochs", type=int, default=3, help="Number of training epochs")
+    parser.add_argument("--output-dir", default="models/checkpoints", help="Output directory")
+
+    args = parser.parse_args()
+
+    # Select model
+    model_map = {
+        "mistral": "mistralai/Mistral-7B-Instruct-v0.1",
+        "llama": "meta-llama/Llama-2-7b-hf"
+    }
+    model_name = model_map.get(args.model, "mistralai/Mistral-7B-Instruct-v0.1")
+
+    trainer = ResumeModelTrainer(model_name)
+    model, tokenizer = trainer.train(
+        num_epochs=args.epochs,
+        output_dir=args.output_dir
+    )
+
+    print("✅ Training complete!")
+
+if __name__ == "__main__":
+    main()
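The training prompt uses the same Instruction / Input / Output layout that the inference prompts end with, so the text seen during fine-tuning should match what the engine sends at inference time. The sketch below shows that layout and one common way to keep padding out of the loss, in plain Python with no tokenizer dependency. The names `build_training_text`, `mask_padding`, and `IGNORE_INDEX`, and the `</s>` default end-of-sequence string, are assumptions for illustration and do not appear in this commit.

```python
from typing import Dict, List

# -100 is the default ignore_index of PyTorch's cross-entropy loss,
# so positions labelled with it do not contribute to training.
IGNORE_INDEX = -100


def build_training_text(sample: Dict[str, str], eos_token: str = "</s>") -> str:
    """Render one sample in the Instruction/Input/Output layout (hypothetical helper)."""
    return (
        f"Instruction: {sample['instruction']}\n\n"
        f"Input:\n{sample['input']}\n\n"
        f"Output:\n{sample['output']}{eos_token}"
    )


def mask_padding(input_ids: List[int], attention_mask: List[int]) -> List[int]:
    """Copy input_ids as labels, replacing padded positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(input_ids, attention_mask)]


if __name__ == "__main__":
    demo = {"instruction": "Extract.", "input": "Jane Doe", "output": "{}"}
    print(build_training_text(demo, eos_token="<eos>"))
    print(mask_padding([5, 6, 2, 2], [1, 1, 0, 0]))
```

Masking matters here because the pad token is set to the end-of-sequence token, so unmasked padding would otherwise be trained on as if it were real output.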
src/utils.py ADDED
@@ -0,0 +1,59 @@
+# Helper utilities for the project
+
+def parse_skill_match_score(score_str: str) -> int:
+    """Extract numeric score from string"""
+    import re
+    match = re.search(r'\d+', score_str)
+    return int(match.group(0)) if match else 50
+
+def format_experience_duration(years_str: str) -> str:
+    """Standardize experience duration format"""
+    import re
+    match = re.search(r'\d+', years_str)
+    if match:
+        years = int(match.group(0))
+        return f"{years} years"
+    return years_str
+
+def clean_text(text: str) -> str:
+    """Clean and normalize text"""
+    import re
+    # Remove extra whitespace
+    text = re.sub(r'\s+', ' ', text)
+    # Remove special characters
+    text = re.sub(r'[^\w\s\-@.]', '', text)
+    return text.strip()
+
+def skill_similarity(skill1: str, skill2: str) -> float:
+    """Calculate similarity between two skills"""
+    from difflib import SequenceMatcher
+    return SequenceMatcher(None, skill1.lower(), skill2.lower()).ratio()
+
+def batch_process(items: list, batch_size: int = 32):
+    """Process items in batches"""
+    for i in range(0, len(items), batch_size):
+        yield items[i:i+batch_size]
+
+# Model conversion utilities
+def convert_to_onnx(model_path: str, output_path: str):
+    """Convert fine-tuned model to ONNX format for faster inference"""
+    from transformers import AutoModelForCausalLM, AutoTokenizer
+
+    model = AutoModelForCausalLM.from_pretrained(model_path)
+    tokenizer = AutoTokenizer.from_pretrained(model_path)
+
+    # Export in eval mode with plain tuple outputs and no KV cache, so the traced
+    # graph has a single logits output matching output_names below.
+    model.eval()
+    model.config.use_cache = False
+    model.config.return_dict = False
+
+    # Export to ONNX
+    import torch
+    dummy_input = torch.tensor([[tokenizer.eos_token_id]])
+
+    torch.onnx.export(
+        model,
+        dummy_input,
+        output_path,
+        input_names=['input_ids'],
+        output_names=['output'],
+        dynamic_axes={
+            'input_ids': {0: 'batch_size', 1: 'sequence'},
+            'output': {0: 'batch_size', 1: 'sequence'},
+        },
+        opset_version=12
+    )
+
+    print(f"✅ Model exported to {output_path}")
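`skill_similarity` returns a `difflib.SequenceMatcher` ratio between two skill names, but nothing in this file shows how it might be applied to whole skill lists. One possible use, sketched here under stated assumptions: for each skill a job requires, pick the closest resume skill and accept it only above a threshold. The function name `fuzzy_match_skills` and the 0.8 threshold are hypothetical choices for illustration, not values taken from this commit.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def fuzzy_match_skills(
    resume_skills: List[str],
    job_skills: List[str],
    threshold: float = 0.8,  # assumed cut-off; tune on real data
) -> Tuple[Dict[str, str], List[str]]:
    """Map each job skill to its closest resume skill, or report it as missing."""
    matched: Dict[str, str] = {}
    missing: List[str] = []
    for job_skill in job_skills:
        best_skill, best_ratio = None, 0.0
        for resume_skill in resume_skills:
            # Same case-insensitive ratio that skill_similarity computes.
            ratio = SequenceMatcher(None, job_skill.lower(), resume_skill.lower()).ratio()
            if ratio > best_ratio:
                best_skill, best_ratio = resume_skill, ratio
        if best_skill is not None and best_ratio >= threshold:
            matched[job_skill] = best_skill
        else:
            missing.append(job_skill)
    return matched, missing


if __name__ == "__main__":
    print(fuzzy_match_skills(["Python3", "Excel"], ["python", "SQL"]))
```

Character-level similarity tolerates spelling variants such as "Python3" versus "python", but it cannot relate synonyms like "JS" and "JavaScript", so it is best treated as a rough first pass.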