import json
from typing import Dict, List

from openai import OpenAI

from models import EvalResult, Evaluations

# Competence descriptions keyed by tag; better loaded from the database than hard-coded.
tags = {'AI': "This one is the competence description"}
client = OpenAI()

def generate_model_parameters(skill: str, transcript: str, lang: str):
    """Build the chat-completion parameters for grading one technical answer.

    The English prompt is used when `lang` is 'en'; any other value selects
    the Indonesian prompt.
    """
    eng = f"""
    You are tasked with evaluating a transcript of an IT job interview. The interview in the transcript is technical.
    You need sufficient IT knowledge because you will evaluate the interviewee's answer to determine whether it is correct.
    You will output "SUCCESS" if the interviewee's answer is deemed correct and "FAIL" if it is deemed incorrect.

    Here are 5 examples of correct answers:
    EXAMPLE 1:
    SKILL TO BE EVALUATED: Python

    INTERVIEWER:
    What is the use of zip() in Python?

    INTERVIEWEE:
    The zip function returns an iterator and takes iterables as arguments. These iterables can be lists, tuples, dictionaries, etc. It pairs the elements at the same index of each iterable into a single tuple.
        
    OUTPUT: SUCCESS

    EXAMPLE 2:
    SKILL TO BE EVALUATED: Python

    INTERVIEWER:
    What will be the output of the following?
    name=["swati","shweta"]
    age=[10,20]
    new_entity=zip(name,age)
    new_entity=set(new_entity)
    print(new_entity)

    INTERVIEWEE:
    The output is {{('shweta', 20), ('swati', 10)}}

    OUTPUT: SUCCESS

    EXAMPLE 3:
    SKILL TO BE EVALUATED: Python

    INTERVIEWER:
    What will be the output of the following?
    a=["1","2","3"]
    b=["a","b","c"]
    c=[x+y for x, y in zip(a,b)]
    print(c)

    INTERVIEWEE:
    The output is: ['1a', '2b', '3c']

    OUTPUT: SUCCESS

    EXAMPLE 4:
    SKILL TO BE EVALUATED: Python

    INTERVIEWER:
    What will be the output of the following?
    str="apple#banana#kiwi#orange"
    print(str.split("#",2))

    INTERVIEWEE:
    ['apple', 'banana', 'kiwi#orange']

    OUTPUT: SUCCESS

    EXAMPLE 5:
    SKILL TO BE EVALUATED: Python

    INTERVIEWER:
    What are Python modules? Name some commonly used built-in modules in Python.

    INTERVIEWEE:
    Python modules are files containing Python code. This code can be functions, classes, or variables. A Python module is a .py file containing executable code. Some of the commonly used built-in modules are:
    - os
    - sys
    - math
    - random
    - datetime
    - json

    OUTPUT: SUCCESS

    Note that the examples above all have correct answers. Your job is to generate the output only (SUCCESS or FAIL). You do not need to justify your decision.
    SKILL TO BE EVALUATED: {skill}
    {transcript}

    """
    idn = f"""
    Anda ditugaskan untuk mengevaluasi transkrip dari sebuah wawancara kerja di bidang IT. Wawancara dalam transkrip tersebut bersifat teknis.
    Anda perlu memiliki pengetahuan yang cukup tentang IT karena Anda akan mengevaluasi jawaban dari peserta wawancara untuk menentukan apakah jawaban peserta tersebut benar atau tidak.
    Anda akan mengeluarkan output "SUCCESS" jika jawaban peserta dianggap benar dan "FAIL" jika dianggap salah.

    Berikut adalah 5 contoh jawaban yang benar.

    CONTOH 1:
    KEMAMPUAN YANG DIEVALUASI: Python

    PEWAWANCARA:
    Apa kegunaan dari fungsi zip() di Python?

    PESERTA:
    Fungsi zip mengembalikan sebuah iterator dan menerima iterable sebagai argumen. Iterable ini bisa berupa list, tuple, dictionary, dll. Fungsi ini mencocokkan indeks yang sama dari setiap iterable untuk membentuk satu entitas.

    OUTPUT: SUCCESS

    CONTOH 2:
    KEMAMPUAN YANG DIEVALUASI: Python

    PEWAWANCARA:
    Apa output dari kode berikut?
    name = ["swati", "shweta"]
    age = [10, 20]
    new_entity = zip(name, age)
    new_entity = set(new_entity)
    print(new_entity)

    PESERTA:
    Output-nya adalah: {{('shweta', 20), ('swati', 10)}}

    OUTPUT: SUCCESS

    CONTOH 3:
    KEMAMPUAN YANG DIEVALUASI: Python

    PEWAWANCARA:
    Apa output dari kode berikut?
    a = ["1", "2", "3"]
    b = ["a", "b", "c"]
    c = [x + y for x, y in zip(a, b)]
    print(c)

    PESERTA:
    Output-nya adalah: ['1a', '2b', '3c']

    OUTPUT: SUCCESS

    CONTOH 4:
    KEMAMPUAN YANG DIEVALUASI: Python

    PEWAWANCARA:
    Apa output dari kode berikut?
    str = "apple#banana#kiwi#orange"
    print(str.split("#", 2))

    PESERTA:
    ['apple', 'banana', 'kiwi#orange']

    OUTPUT: SUCCESS

    CONTOH 5:
    KEMAMPUAN YANG DIEVALUASI: Python

    PEWAWANCARA:
    Apa itu modul Python? Sebutkan beberapa modul built-in yang umum digunakan di Python?

    PESERTA:
    Modul Python adalah file yang berisi kode Python. Kode ini bisa berupa fungsi, kelas, atau variabel. Sebuah modul Python adalah file .py yang berisi kode yang bisa dijalankan. Beberapa modul built-in yang sering digunakan adalah:
    os
    sys
    math
    random
    datetime
    json

    OUTPUT: SUCCESS

    Catatan: Contoh-contoh di atas memberikan jawaban yang benar. Tugas Anda adalah menghasilkan output saja (SUCCESS atau FAIL). Anda tidak perlu menjelaskan alasan Anda.

    KEMAMPUAN YANG DIEVALUASI: {skill}
    {transcript}
    """
    model_parameters = {
        "model": "gpt-4-0125-preview",
        "messages": [
            {"role": "system", "content": eng if lang == 'en' else idn},
        ],
    }

    return model_parameters

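def gpt_evaluator(payload, fewshot, response_format):
    """Evaluate each input in `payload` using structured output.

    `fewshot` is the system prompt holding the few-shot examples, and
    `response_format` is the model class the reply is parsed into.
    """
    res = []
    for user_input in payload:
        response = client.beta.chat.completions.parse(
            model="gpt-4o-2024-08-06",
            messages=[
                {"role": "system", "content": fewshot},
                {"role": "user", "content": user_input},
            ],
            response_format=response_format,
        )
        # `parsed` is an instance of `response_format`, not a JSON string.
        parsed = response.choices[0].message.parsed
        res.append(parsed.value)
    return res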

def extract_competences_and_responses(competences: list[str], transcripts: list[list[dict]]):
    """Join the answers of each competence's transcript into one newline-separated string."""
    responses = []

    for i in range(len(competences)):
        # Each transcript is a list of chat turns, each with an "answer" key.
        responses.append("\n".join(chat["answer"] for chat in transcripts[i]))

    return responses

def evaluate_interview(competences: list[str], transcript: dict, lang: str = 'en'):
    """Score an interview transcript on its behavioral and technical competences."""
    model_inputs = []

    responses = extract_competences_and_responses(transcript["comp_beha"], transcript["behavioral"])

    # Build one model input per behavioral competence.
    for competence, response in zip(transcript["comp_beha"], responses):
        text = "KNOWLEDGE:\n"
        text += f"\nCOMPETENCE: {competence}\n\n"
        text += f"RESPONSE:\n{response}"
        model_inputs.append(text)

    idn = """
        CONTOH 1:
        KETERAMPILAN YANG DINILAI: Kejujuran
        PEWAWANCARA:
        Apa mimpi burukmu?
        PESERTA WAWANCARA:
        Saya tidak punya mimpi buruk.
        Penilaian: Tidak mungkin seseorang tidak pernah mengalami mimpi buruk. Rasa takut terhadap sesuatu adalah hal yang umum dirasakan manusia.
        Skor: 0.1

        CONTOH 2:
        PEWAWANCARA:
        Bisakah Anda menceritakan saat Anda harus men-debug masalah yang sangat sulit di lingkungan produksi?
        PESERTA WAWANCARA:
        Di pekerjaan saya sebelumnya, kami menggunakan arsitektur berbasis mikroservis yang dideploy di Kubernetes. Suatu pagi, kami mulai menerima peringatan bahwa layanan autentikasi pengguna kami gagal secara intermiten, dan pengguna tidak bisa masuk.
        Sebagai engineer yang sedang bertugas, tanggung jawab saya adalah segera mengidentifikasi akar permasalahan dan mengembalikan layanan ke fungsionalitas penuh tanpa memengaruhi layanan lain yang bergantung padanya.
        Saya mulai dengan memeriksa log di Kibana dan melihat bahwa beberapa pod untuk layanan autentikasi terus-menerus restart. Saya lalu memeriksa metrik penggunaan resource di Prometheus dan melihat lonjakan memori sebelum setiap crash. Saya curiga terjadi memory leak akibat perubahan terbaru, jadi saya rollback ke image container sebelumnya untuk menstabilkan layanan.
        Setelah stabil, saya menelusuri commit terbaru dan menemukan penggunaan session store in-memory baru yang tidak melepaskan sesi lama dengan benar. Saya menulis skrip analisis heap dump cepat, mengonfirmasi kebocoran memori tersebut, dan memperbaiki session store dengan cache LRU yang terbatas.
        Perbaikannya dideploy di hari yang sama, dan masalah tidak pernah terjadi lagi. Laporan postmortem yang saya tulis juga mendorong tim untuk mengadopsi profiling memori untuk semua komponen layanan baru. Waktu penyelesaian insiden kami meningkat sekitar 30% di kuartal berikutnya berkat perbaikan proses tersebut.
    """

    en = """
                Here are 2 examples:
                EXAMPLE 1:
                SKILL TO BE EVALUATED: Honesty
                INTERVIEWER:
                What are your nightmares?
                INTERVIEWEE:
                I do not have nightmares.
                Judgement: It is impossible for someone to never have nightmares. Being scared of something is a common human feeling.
                Score: 0.1
                
                EXAMPLE 2:
                INTERVIEWER:
                Can you tell me about a time you had to debug a particularly difficult issue in a production environment?
                INTERVIEWEE:
                At my previous job, we had a microservices-based architecture deployed on Kubernetes. One morning, we started getting alerts that our user authentication service was intermittently failing, and users couldn’t log in.
                As the engineer on call, my responsibility was to quickly identify the root cause and restore the service to full functionality without affecting other dependent services.
                I began by checking the logs in Kibana and noticed that some of the pods for the authentication service were repeatedly restarting. I then checked the resource usage metrics in Prometheus and saw a memory spike before each crash. I suspected a memory leak introduced by a recent change, so I rolled back to the previous container image to stabilize the service.
                After stabilizing, I dug deeper into the recent commits and found a new in-memory session store that was not properly releasing old sessions. I wrote a quick heap dump analysis script, confirmed the leak, and patched the session store to use a bounded LRU cache instead.
                The fix was deployed the same day, and the issue never recurred. The postmortem I wrote also led to the team adopting memory profiling for all new service components. Our incident resolution time improved by about 30% over the next quarter due to those process improvements.
                
                RETURN IN THE FORMAT BELOW:
                {
                "value": [
                    {
                    "Judgement": "It is impossible for someone to never have nightmares. Being scared of something is a common human feeling, which suggests the candidate was lying.",
                    "score": 0.1
                    },
                    {
                    "Judgement": "The candidate delivered a clear, concise STAR response that effectively demonstrated strong technical skills, composure under pressure, and a methodical approach to problem-solving in a production environment. The use of appropriate tools (Kibana, Prometheus), the decision to roll back, and the successful root cause analysis showed depth of experience. The result was measurable and impactful, indicating not just resolution but long-term improvement. Slightly more context on user or business impact would make it perfect, but overall, this is an excellent response that would strongly support a hiring decision.",
                    "score": 0.95
                    }
                ]
                }
                """
    result = gpt_evaluator(model_inputs, en if lang == 'en' else idn, Evaluations)

    behavioral_scores = generate_behavioral_score(result)
    # `lang` must be forwarded so the technical prompt matches the interview language.
    technical_scores = generate_technical_score(transcript["comp_tech"], transcript["technical"], lang)

    final_score = aggregate_scores(behavioral_scores, technical_scores)

    return EvalResult(final_score=final_score, details=result)

def aggregate_scores(b: list[float], t: list[float]) -> float:
    """Return the mean of all behavioral and technical scores as a percentage."""
    # Note: unanswered technical questions are scored -1 by
    # generate_technical_score and are averaged in unchanged.
    alls = b + t
    if not alls:
        return 0.0

    # Divide by the number of scores actually summed, not just len(b).
    return (sum(alls) / len(alls)) * 100


def generate_behavioral_score(eval_array):
    """Collect the `score` attribute of every evaluation."""
    return [evaluation.score for evaluation in eval_array]

def generate_technical_score(skills: list[str], transcript: list[list[dict]], lang: str = 'en'):
    """Score each technical answer: 1 for SUCCESS, 0 for FAIL, -1 if unanswered."""
    scores = []
    for idx, skill in enumerate(skills):
        chat = transcript[idx]
        if len(chat) > 0:
            # removeprefix (Python 3.9+) drops only the literal prefix; lstrip would
            # strip any leading characters that appear in the given set.
            question = chat[0]['question'].removeprefix('TECHNICAL: ')
            # The question belongs to the interviewer and the answer to the interviewee.
            transcript_text = f"INTERVIEWER:\n{question}\n\nINTERVIEWEE:\n{chat[0]['answer']}"
            # TODO: change to structured output
            model_parameters = generate_model_parameters(skill, transcript_text, lang)
            completion = client.chat.completions.create(**model_parameters)

            generated = completion.choices[0].message.content
            score = 1 if "SUCCESS" in generated else 0
            scores.append(score)
        else:
            # No answer was recorded for this skill.
            scores.append(-1)

    return scores