anonym-ous committed
Commit 47e3460 · verified · 1 Parent(s): a1a4f5e

Add files using upload-large-folder tool

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. adapters/sft-llama31/README.md +207 -0
  2. adapters/sft-llama31/adapter_config.json +45 -0
  3. adapters/sft-llama31/chat_template.jinja +109 -0
  4. adapters/sft-llama31/tokenizer_config.json +15 -0
  5. eval/v3-sft-baseline.json +0 -0
  6. eval/v3-sft-baseline.judged-1743.json +0 -0
  7. eval/v3-sft-baseline.judged.json +184 -0
  8. eval/v3-sft-noevidence-seed1337.extracted.json +0 -0
  9. eval/v3-sft-noevidence-seed1337.json +0 -0
  10. eval/v3-sft-noevidence-seed7.extracted.json +0 -0
  11. eval/v3-sft-noevidence-seed7.json +0 -0
  12. eval/v3-sft-noevidence.extracted.json +0 -0
  13. eval/v3-sft-noevidence.json +0 -0
  14. eval/v3-sft-seed1337.extracted.json +0 -0
  15. eval/v3-sft-seed1337.json +0 -0
  16. eval/v3-sft-seed1337.judged.json +0 -0
  17. eval/v3-sft-seed7.extracted.json +0 -0
  18. eval/v3-sft-seed7.json +0 -0
  19. eval/v3-sft-seed7.judged.json +0 -0
  20. eval/v3-sft-shuffled.extracted.json +0 -0
  21. eval/v3-sft-shuffled.json +0 -0
  22. eval/v3-sft-terse-lever-noevidence.json +0 -0
  23. eval/v3-sft-terse-lever-seed1337-noevidence.json +0 -0
  24. eval/v3-sft-terse-lever-seed1337.extracted.json +0 -0
  25. eval/v3-sft-terse-lever-seed1337.json +0 -0
  26. eval/v3-sft-terse-lever-seed1337.judged.json +0 -0
  27. eval/v3-sft-terse-lever-seed7-noevidence.json +0 -0
  28. eval/v3-sft-terse-lever-seed7.extracted.json +0 -0
  29. eval/v3-sft-terse-lever-seed7.json +0 -0
  30. eval/v3-sft-terse-lever-seed7.judged.json +0 -0
  31. eval/v3-sft-terse-lever.extracted.json +0 -0
  32. eval/v3-sft-terse-lever.json +0 -0
  33. eval/v3-sft-terse-lever.judged-1743.json +0 -0
  34. logs/eval-llama31-seed1337.log +41 -0
  35. logs/eval-llama31-seed7.log +41 -0
  36. logs/eval-llama31.log +61 -0
  37. logs/eval-mistral-seed1337.log +40 -0
  38. logs/eval-mistral-seed7.log +40 -0
  39. logs/eval-mistral.log +59 -0
  40. logs/eval-multitq-llama.log +28 -0
  41. logs/eval-multitq-mistral.log +28 -0
  42. logs/eval-multitq.log +33 -0
  43. logs/eval-v2.log +164 -0
  44. logs/eval-v3-1k.log +26 -0
  45. logs/eval-v3-2k.log +27 -0
  46. logs/eval-v3-noevidence.log +27 -0
  47. logs/eval-v3-seed1337.log +27 -0
  48. logs/eval-v3-seed7.log +26 -0
  49. logs/eval-v3-sft-3hop.log +31 -0
  50. logs/eval-v3-sft.log +27 -0
adapters/sft-llama31/README.md ADDED
@@ -0,0 +1,207 @@
+ ---
+ base_model: meta-llama/Llama-3.1-8B-Instruct
+ library_name: peft
+ pipeline_tag: text-generation
+ tags:
+ - base_model:adapter:meta-llama/Llama-3.1-8B-Instruct
+ - lora
+ - transformers
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
+ ### Framework versions
+
+ - PEFT 0.19.1
adapters/sft-llama31/adapter_config.json ADDED
@@ -0,0 +1,45 @@
+ {
+   "alora_invocation_tokens": null,
+   "alpha_pattern": {},
+   "arrow_config": null,
+   "auto_mapping": null,
+   "base_model_name_or_path": "meta-llama/Llama-3.1-8B-Instruct",
+   "bias": "none",
+   "corda_config": null,
+   "ensure_weight_tying": false,
+   "eva_config": null,
+   "exclude_modules": null,
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 64,
+   "lora_bias": false,
+   "lora_dropout": 0.05,
+   "lora_ga_config": null,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "peft_version": "0.19.1",
+   "qalora_group_size": 16,
+   "r": 32,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "q_proj",
+     "o_proj",
+     "v_proj",
+     "k_proj"
+   ],
+   "target_parameters": null,
+   "task_type": "CAUSAL_LM",
+   "trainable_token_indices": null,
+   "use_bdlora": null,
+   "use_dora": false,
+   "use_qalora": false,
+   "use_rslora": false
+ }
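For orientation, the settings in this adapter config imply a LoRA scaling factor and a fixed count of trainable parameters. The sketch below works both out. The projection shapes and layer count are assumptions taken from the published Llama-3.1-8B architecture (hidden size 4096, key/value width 1024, 32 decoder layers), not from this commit, and `LORA_CONFIG`, `BASE_SHAPES` and both helper functions are illustrative names.

```python
# Values copied from adapters/sft-llama31/adapter_config.json.
LORA_CONFIG = {
    "r": 32,
    "lora_alpha": 64,
    "use_rslora": False,
    "target_modules": ["q_proj", "o_proj", "v_proj", "k_proj"],
}

# ASSUMPTION: (in_features, out_features) of the attention projections in
# Llama-3.1-8B, and its decoder depth. These are not stated in the commit.
BASE_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
NUM_LAYERS = 32


def lora_scaling(cfg):
    """Scale applied to the low-rank update: alpha/r, or alpha/sqrt(r) with rsLoRA."""
    r, alpha = cfg["r"], cfg["lora_alpha"]
    return alpha / (r ** 0.5) if cfg["use_rslora"] else alpha / r


def lora_trainable_params(cfg, shapes, num_layers):
    """Each adapted linear layer adds A (r x in) and B (out x r), with no bias terms."""
    per_layer = sum(
        cfg["r"] * (shapes[name][0] + shapes[name][1])
        for name in cfg["target_modules"]
    )
    return per_layer * num_layers


scaling = lora_scaling(LORA_CONFIG)
n_params = lora_trainable_params(LORA_CONFIG, BASE_SHAPES, NUM_LAYERS)
print(f"scaling={scaling}, trainable LoRA parameters={n_params:,}")
```

Under those assumed shapes the adapter touches only the four attention projections, so the MLP blocks of the base model stay frozen.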
adapters/sft-llama31/chat_template.jinja ADDED
@@ -0,0 +1,109 @@
+ {{- bos_token }}
+ {%- if custom_tools is defined %}
+ {%- set tools = custom_tools %}
+ {%- endif %}
+ {%- if not tools_in_user_message is defined %}
+ {%- set tools_in_user_message = true %}
+ {%- endif %}
+ {%- if not date_string is defined %}
+ {%- set date_string = "26 Jul 2024" %}
+ {%- endif %}
+ {%- if not tools is defined %}
+ {%- set tools = none %}
+ {%- endif %}
+
+ {#- This block extracts the system message, so we can slot it into the right place. #}
+ {%- if messages[0]['role'] == 'system' %}
+ {%- set system_message = messages[0]['content']|trim %}
+ {%- set messages = messages[1:] %}
+ {%- else %}
+ {%- set system_message = "" %}
+ {%- endif %}
+
+ {#- System message + builtin tools #}
+ {{- "<|start_header_id|>system<|end_header_id|>\n\n" }}
+ {%- if builtin_tools is defined or tools is not none %}
+ {{- "Environment: ipython\n" }}
+ {%- endif %}
+ {%- if builtin_tools is defined %}
+ {{- "Tools: " + builtin_tools | reject('equalto', 'code_interpreter') | join(", ") + "\n\n"}}
+ {%- endif %}
+ {{- "Cutting Knowledge Date: December 2023\n" }}
+ {{- "Today Date: " + date_string + "\n\n" }}
+ {%- if tools is not none and not tools_in_user_message %}
+ {{- "You have access to the following functions. To call a function, please respond with JSON for a function call." }}
+ {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
+ {{- "Do not use variables.\n\n" }}
+ {%- for t in tools %}
+ {{- t | tojson(indent=4) }}
+ {{- "\n\n" }}
+ {%- endfor %}
+ {%- endif %}
+ {{- system_message }}
+ {{- "<|eot_id|>" }}
+
+ {#- Custom tools are passed in a user message with some extra guidance #}
+ {%- if tools_in_user_message and not tools is none %}
+ {#- Extract the first user message so we can plug it in here #}
+ {%- if messages | length != 0 %}
+ {%- set first_user_message = messages[0]['content']|trim %}
+ {%- set messages = messages[1:] %}
+ {%- else %}
+ {{- raise_exception("Cannot put tools in the first user message when there's no first user message!") }}
+ {%- endif %}
+ {{- '<|start_header_id|>user<|end_header_id|>\n\n' -}}
+ {{- "Given the following functions, please respond with a JSON for a function call " }}
+ {{- "with its proper arguments that best answers the given prompt.\n\n" }}
+ {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
+ {{- "Do not use variables.\n\n" }}
+ {%- for t in tools %}
+ {{- t | tojson(indent=4) }}
+ {{- "\n\n" }}
+ {%- endfor %}
+ {{- first_user_message + "<|eot_id|>"}}
+ {%- endif %}
+
+ {%- for message in messages %}
+ {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}
+ {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' }}
+ {%- elif 'tool_calls' in message %}
+ {%- if not message.tool_calls|length == 1 %}
+ {{- raise_exception("This model only supports single tool-calls at once!") }}
+ {%- endif %}
+ {%- set tool_call = message.tool_calls[0].function %}
+ {%- if builtin_tools is defined and tool_call.name in builtin_tools %}
+ {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
+ {{- "<|python_tag|>" + tool_call.name + ".call(" }}
+ {%- for arg_name, arg_val in tool_call.arguments | items %}
+ {{- arg_name + '="' + arg_val + '"' }}
+ {%- if not loop.last %}
+ {{- ", " }}
+ {%- endif %}
+ {%- endfor %}
+ {{- ")" }}
+ {%- else %}
+ {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
+ {{- '{"name": "' + tool_call.name + '", ' }}
+ {{- '"parameters": ' }}
+ {{- tool_call.arguments | tojson }}
+ {{- "}" }}
+ {%- endif %}
+ {%- if builtin_tools is defined %}
+ {#- This means we're in ipython mode #}
+ {{- "<|eom_id|>" }}
+ {%- else %}
+ {{- "<|eot_id|>" }}
+ {%- endif %}
+ {%- elif message.role == "tool" or message.role == "ipython" %}
+ {{- "<|start_header_id|>ipython<|end_header_id|>\n\n" }}
+ {%- if message.content is mapping or message.content is iterable %}
+ {{- message.content | tojson }}
+ {%- else %}
+ {{- message.content }}
+ {%- endif %}
+ {{- "<|eot_id|>" }}
+ {%- endif %}
+ {%- endfor %}
+ {%- if add_generation_prompt %}
+ {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
+ {%- endif %}
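To make the template's effect concrete, here is a minimal hand-written sketch of what it renders for the simplest case: plain system, user and assistant messages, with no `tools`, `custom_tools` or `builtin_tools` defined. The function name `render_llama31_prompt` is hypothetical, the tool-calling branches are deliberately left out, and the default `bos_token` is the value from the tokenizer config in this commit. In practice the Jinja template itself would normally be applied by the tokenizer rather than re-implemented.

```python
def render_llama31_prompt(messages, bos_token="<|begin_of_text|>",
                          date_string="26 Jul 2024", add_generation_prompt=True):
    """Mirror the no-tools path of chat_template.jinja (illustrative only)."""
    # Pull out an optional leading system message, as the template does.
    if messages and messages[0]["role"] == "system":
        system_message = messages[0]["content"].strip()
        messages = messages[1:]
    else:
        system_message = ""

    parts = [
        bos_token,
        "<|start_header_id|>system<|end_header_id|>\n\n",
        "Cutting Knowledge Date: December 2023\n",
        "Today Date: " + date_string + "\n\n",
        system_message,
        "<|eot_id|>",
    ]
    for message in messages:
        if message["role"] in ("tool", "ipython") or "tool_calls" in message:
            raise ValueError("tool messages are outside the scope of this sketch")
        parts.append(
            "<|start_header_id|>" + message["role"] + "<|end_header_id|>\n\n"
            + message["content"].strip() + "<|eot_id|>"
        )
    if add_generation_prompt:
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example_prompt = render_llama31_prompt([
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": " Hi "},
])
print(example_prompt)
```

Note that the system header and the two date lines are emitted even when no system message is supplied, which follows from the unconditional system block in the template.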
adapters/sft-llama31/tokenizer_config.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "backend": "tokenizers",
+   "bos_token": "<|begin_of_text|>",
+   "clean_up_tokenization_spaces": true,
+   "eos_token": "<|eot_id|>",
+   "is_local": false,
+   "local_files_only": false,
+   "model_input_names": [
+     "input_ids",
+     "attention_mask"
+   ],
+   "model_max_length": 131072,
+   "pad_token": "<|eot_id|>",
+   "tokenizer_class": "TokenizersBackend"
+ }
eval/v3-sft-baseline.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-baseline.judged-1743.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-baseline.judged.json ADDED
@@ -0,0 +1,184 @@
+ {
+   "name": "v3-sft-baseline",
+   "policy_path": "../checkpoints/sft-v3/final",
+   "split": "test",
+   "n_samples": 1743,
+   "temperature": 0.0,
+   "max_new_tokens": 96,
+   "answer_extracted": true,
+   "n_with_answer_marker": 1738,
+   "judge_provider": "anthropic",
+   "judge_model": "claude-haiku-4-5-20251001",
+   "overall": {
+     "n": 10,
+     "em": 0.4,
+     "f1": 0.5404761904761906,
+     "judge_em": 0.6
+   },
+   "by_complexity": {
+     "3plus": {
+       "n": 10,
+       "em": 0.4,
+       "f1": 0.5404761904761906,
+       "judge_em": 0.6
+     }
+   },
+   "by_operator": {
+     "before_after": {
+       "n": 2,
+       "em": 0.5,
+       "f1": 0.5,
+       "judge_em": 0.5
+     },
+     "interval": {
+       "n": 1,
+       "em": 1.0,
+       "f1": 1.0,
+       "judge_em": 1.0
+     },
+     "point_in_time": {
+       "n": 1,
+       "em": 1.0,
+       "f1": 1.0,
+       "judge_em": 1.0
+     },
+     "sequence": {
+       "n": 6,
+       "em": 0.16666666666666666,
+       "f1": 0.40079365079365076,
+       "judge_em": 0.5
+     }
+   },
+   "per_example": [
+     {
+       "id": "q_006291",
+       "complexity": "3plus",
+       "operator_type": "sequence",
+       "t_query": 2017.0,
+       "gold": "chairperson",
+       "pred": "President of the Portuguese Football Federation",
+       "em": 0.0,
+       "f1": 0.0,
+       "pred_raw": "Hop 1: Jonas -> member of sports team -> S.L. Benfica (2017)\n Hop 2: S.L. Benfica -> chairperson -> Lu\u00eds Filipe Vieira (2017)\n Hop 3: Lu\u00eds Filipe Vieira -> position held -> President of the Portuguese Football Federation (2017)\n Answer: President of the Portuguese Football Federation",
+       "judge_em": 0,
+       "judge_reply": "NO"
+     },
+     {
+       "id": "q_007337",
+       "complexity": "3plus",
+       "operator_type": "sequence",
+       "t_query": 1907.0,
+       "gold": "Nicholas II of Russia",
+       "pred": "Nikolay II",
+       "em": 0.0,
+       "f1": 0.3333333333333333,
+       "pred_raw": "Hop 1: Ararat -> located in -> Erivan Governorate (1907)\n Hop 2: Erivan Governorate -> country -> Russian Empire (1907)\n Hop 3: Russian Empire -> head of state -> Nikolay II (1907)\n Answer: Nikolay II",
+       "judge_em": 1,
+       "judge_reply": "YES"
+     },
+     {
+       "id": "q_000021",
+       "complexity": "3plus",
+       "operator_type": "sequence",
+       "t_query": 1959.0,
+       "gold": "President of Italy",
+       "pred": "President of the Italian Republic",
+       "em": 0.0,
+       "f1": 0.5,
+       "pred_raw": "Hop 1: Galvano Della Volpe -> country of citizenship -> Italy (1959)\n Hop 2: Italy -> head of state -> Antonio Segni (1959)\n Hop 3: Antonio Segni -> position held -> President of the Italian Republic (1959)\n Answer: President of the Italian Republic",
+       "judge_em": 1,
+       "judge_reply": "YES"
+     },
+     {
+       "id": "q_007809",
+       "complexity": "3plus",
+       "operator_type": "sequence",
+       "t_query": 2011.0,
+       "gold": "Irina Bokova",
+       "pred": "David Elkind",
+       "em": 0.0,
+       "f1": 0.0,
+       "pred_raw": "Hop 1: Kach -> designated as terrorist by -> United States (2011)\n Hop 2: United States -> member of -> UNESCO (2011)\n Hop 3: UNESCO -> chairperson -> David Elkind (2011)\n Answer: David Elkind",
+       "judge_em": 0,
+       "judge_reply": "NO"
+     },
+     {
+       "id": "q_001979",
+       "complexity": "3plus",
+       "operator_type": "interval",
+       "t_query": 1998.0,
+       "gold": "1998",
+       "pred": "1998",
+       "em": 1.0,
+       "f1": 1.0,
+       "pred_raw": "Hop 1: Arnold J. Levine -> award received -> Paul Ehrlich and Ludwig Darmstaedter Prize (1998)\n Answer: 1998",
+       "judge_em": 1,
+       "judge_reply": "YES"
+     },
+     {
+       "id": "q_005455",
+       "complexity": "3plus",
+       "operator_type": "sequence",
+       "t_query": 1982.0,
+       "gold": "Golden Globe Award for Best Actress \u2013 Motion Picture Drama",
+       "pred": "Academy Award for Best Actress",
+       "em": 0.0,
+       "f1": 0.5714285714285714,
+       "pred_raw": "Hop 1: On Golden Pond -> nominated for -> Academy Award for Best Actress (1982)\n Hop 2: Academy Award for Best Actress -> winner -> Meryl Streep (1982)\n Hop 3: Meryl Streep -> award received -> Academy Award for Best Actress (1982)\n Answer: Academy Award for Best Actress",
+       "judge_em": 0,
+       "judge_reply": "NO"
+     },
+     {
+       "id": "q_008005",
+       "complexity": "3plus",
+       "operator_type": "before_after",
+       "t_query": 1961.0,
+       "gold": "President of Italy",
+       "pred": "President of Italy",
+       "em": 1.0,
+       "f1": 1.0,
+       "pred_raw": "Hop 1: Urbano Lazzaro -> country of citizenship -> Italy (1961)\n Hop 2: Italy -> head of state -> Giuseppe Saragat (1961)\n Hop 3: Giuseppe Saragat -> position held -> President of Italy (1961)\n Answer: President of Italy",
+       "judge_em": 1,
+       "judge_reply": "YES"
+     },
+     {
+       "id": "q_008690",
+       "complexity": "3plus",
+       "operator_type": "sequence",
+       "t_query": 1968.0,
+       "gold": "President of Italy",
+       "pred": "President of Italy",
+       "em": 1.0,
+       "f1": 1.0,
+       "pred_raw": "Hop 1: Bruno Pesaola -> country of citizenship -> Italy (1968)\n Hop 2: Italy -> head of state -> Giovanni Gronchi (1968)\n Hop 3: Giovanni Gronchi -> position held -> President of Italy (1968)\n Answer: President of Italy",
+       "judge_em": 1,
+       "judge_reply": "YES"
+     },
+     {
+       "id": "q_008321",
+       "complexity": "3plus",
+       "operator_type": "point_in_time",
+       "t_query": 1976.0,
+       "gold": "Prime Minister of Italy",
+       "pred": "Prime Minister of Italy",
+       "em": 1.0,
+       "f1": 1.0,
+       "pred_raw": "Hop 1: Vito Lattanzio -> country of citizenship -> Italy (1976)\n Hop 2: Italy -> head of government -> Giovanni Spadolini (1976)\n Hop 3: Giovanni Spadolini -> position held -> Prime Minister of Italy (1976)\n Answer: Prime Minister of Italy",
+       "judge_em": 1,
+       "judge_reply": "YES"
+     },
+     {
+       "id": "q_007761",
+       "complexity": "3plus",
+       "operator_type": "before_after",
+       "t_query": 2002.0,
+       "gold": "Arista Records",
+       "pred": "Sony Music",
+       "em": 0.0,
+       "f1": 0.0,
+       "pred_raw": "Hop 1: Jennifer Love Hewitt -> unmarried partner -> John Mayer (2002)\n Hop 2: John Mayer -> nominated for -> Grammy Award for Best Male Pop Vocal Performance (2002)\n Hop 3: Grammy Award for Best Male Pop Vocal Performance -> winner -> Epic Records (2002)\n Hop 4: Epic Records -> record label -> Sony Music (2002)\n Answer: Sony Music",
+       "judge_em": 0,
+       "judge_reply": "NO"
+     }
+   ]
+ }
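The `em` and `f1` fields in this file are consistent with a standard exact-match and token-overlap F1 over lowercased, punctuation-stripped answers. The sketch below is one such scorer, offered as an assumption: the repository's actual scoring code is not part of this view, and the normalization here (no article removal, any non-word character treated as a separator) was chosen only because it reproduces the per-example values shown above. The function names are illustrative.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, turn punctuation (including dashes) into spaces, split on whitespace."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(pred, gold):
    """1.0 when the normalized token sequences are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred, gold):
    """Harmonic mean of token precision and recall, counting repeated tokens."""
    pred_tokens, gold_tokens = normalize(pred), normalize(gold)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# (pred, gold) pairs copied from three of the per_example records above.
pairs = [
    ("Nikolay II", "Nicholas II of Russia"),
    ("President of the Italian Republic", "President of Italy"),
    ("Academy Award for Best Actress",
     "Golden Globe Award for Best Actress \u2013 Motion Picture Drama"),
]
for pred, gold in pairs:
    print(f"{pred!r}: em={exact_match(pred, gold)} f1={token_f1(pred, gold):.4f}")
```

The first two pairs also show why the file carries a separate `judge_em`: both predictions name the gold entity in different words, so string metrics score them below 1 while the judge replies YES.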
eval/v3-sft-noevidence-seed1337.extracted.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-noevidence-seed1337.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-noevidence-seed7.extracted.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-noevidence-seed7.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-noevidence.extracted.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-noevidence.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-seed1337.extracted.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-seed1337.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-seed1337.judged.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-seed7.extracted.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-seed7.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-seed7.judged.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-shuffled.extracted.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-shuffled.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-terse-lever-noevidence.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-terse-lever-seed1337-noevidence.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-terse-lever-seed1337.extracted.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-terse-lever-seed1337.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-terse-lever-seed1337.judged.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-terse-lever-seed7-noevidence.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-terse-lever-seed7.extracted.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-terse-lever-seed7.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-terse-lever-seed7.judged.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-terse-lever.extracted.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-terse-lever.json ADDED
The diff for this file is too large to render. See raw diff
 
eval/v3-sft-terse-lever.judged-1743.json ADDED
The diff for this file is too large to render. See raw diff
 
logs/eval-llama31-seed1337.log ADDED
@@ -0,0 +1,41 @@
+ [eval] loading benchmark from ../outputs/benchmark/benchmark_labelled.jsonl (split=test)
+ [eval] 1743 questions
+ [eval] loading policy from ../checkpoints/sft-llama31-seed1337/final
+
+ [transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer TokenizersBackend. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.
+ [eval] retrieval mode ON — building retriever (k=15)
+ [Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+ [eval] retrieval done: 1743/1743 questions got at least one triple
+ [eval] loading KG indexer from data/tkgl-smallpedia_edgelist.csv
+ [Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+ [eval] generating predictions (bs=4)
+ [eval] progress: 4/1743
+ [eval] progress: 104/1743
+ [eval] progress: 204/1743
+ [eval] progress: 304/1743
+ [eval] progress: 404/1743
+ [eval] progress: 504/1743
+ [eval] progress: 604/1743
+ [eval] progress: 704/1743
+ [eval] progress: 804/1743
+ [eval] progress: 904/1743
+ [eval] progress: 1004/1743
+ [eval] progress: 1104/1743
+ [eval] progress: 1204/1743
+ [eval] progress: 1304/1743
+ [eval] progress: 1404/1743
+ [eval] progress: 1504/1743
+ [eval] progress: 1604/1743
+ [eval] progress: 1704/1743
+
+ [eval] wrote ../outputs/eval/llama31-sft-seed1337.json
+ [eval] OVERALL: n=1743 EM=0.942 F1=0.273
+ [eval] by complexity:
+ 1hop: n=800 EM=0.993 F1=0.341
+ 2hop: n=800 EM=0.929 F1=0.216
+ 3plus: n=143 EM=0.734 F1=0.215
+ [eval] by operator:
+ before_after: n=517 EM=0.942 F1=0.292
+ interval: n=163 EM=1.000 F1=0.138
+ point_in_time: n=535 EM=0.935 F1=0.298
+ sequence: n=528 EM=0.932 F1=0.272
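As a sanity check on this summary, the OVERALL exact-match figure should equal the sample-weighted mean of the per-stratum figures, whichever way the test set is split. The sketch below recomputes it from the rounded values printed in this log; `weighted_mean` is an illustrative helper, and because the inputs are already rounded to three decimals the result is only expected to agree to about that precision.

```python
# (n, EM) pairs copied from logs/eval-llama31-seed1337.log.
BY_COMPLEXITY = {"1hop": (800, 0.993), "2hop": (800, 0.929), "3plus": (143, 0.734)}
BY_OPERATOR = {
    "before_after": (517, 0.942),
    "interval": (163, 1.000),
    "point_in_time": (535, 0.935),
    "sequence": (528, 0.932),
}


def weighted_mean(strata):
    """Average the per-stratum scores, weighting each by its sample count."""
    total = sum(n for n, _ in strata.values())
    return sum(n * score for n, score in strata.values()) / total, total


em_by_complexity, n_complexity = weighted_mean(BY_COMPLEXITY)
em_by_operator, n_operator = weighted_mean(BY_OPERATOR)
print(f"by complexity: n={n_complexity} EM={em_by_complexity:.3f}")
print(f"by operator:   n={n_operator} EM={em_by_operator:.3f}")
```

Both breakdowns cover all 1743 questions, so a mismatch between either recomputed value and the OVERALL line would point to a question missing from, or double-counted in, one of the splits.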
logs/eval-llama31-seed7.log ADDED
@@ -0,0 +1,41 @@
+ [eval] loading benchmark from ../outputs/benchmark/benchmark_labelled.jsonl (split=test)
+ [eval] 1743 questions
+ [eval] loading policy from ../checkpoints/sft-llama31-seed7/final
+
+ [transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer TokenizersBackend. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.
+ [eval] retrieval mode ON — building retriever (k=15)
+ [Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+ [eval] retrieval done: 1743/1743 questions got at least one triple
+ [eval] loading KG indexer from data/tkgl-smallpedia_edgelist.csv
+ [Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+ [eval] generating predictions (bs=4)
+ [eval] progress: 4/1743
+ [eval] progress: 104/1743
+ [eval] progress: 204/1743
+ [eval] progress: 304/1743
+ [eval] progress: 404/1743
+ [eval] progress: 504/1743
+ [eval] progress: 604/1743
+ [eval] progress: 704/1743
+ [eval] progress: 804/1743
+ [eval] progress: 904/1743
+ [eval] progress: 1004/1743
+ [eval] progress: 1104/1743
+ [eval] progress: 1204/1743
+ [eval] progress: 1304/1743
+ [eval] progress: 1404/1743
+ [eval] progress: 1504/1743
+ [eval] progress: 1604/1743
+ [eval] progress: 1704/1743
+
+ [eval] wrote ../outputs/eval/llama31-sft-seed7.json
+ [eval] OVERALL: n=1743 EM=0.941 F1=0.273
+ [eval] by complexity:
+ 1hop: n=800 EM=0.991 F1=0.341
+ 2hop: n=800 EM=0.932 F1=0.217
+ 3plus: n=143 EM=0.713 F1=0.213
+ [eval] by operator:
+ before_after: n=517 EM=0.940 F1=0.292
+ interval: n=163 EM=1.000 F1=0.138
+ point_in_time: n=535 EM=0.946 F1=0.301
+ sequence: n=528 EM=0.920 F1=0.269
logs/eval-llama31.log ADDED
@@ -0,0 +1,61 @@
+ [eval] loading benchmark from ../outputs/benchmark/benchmark_labelled.jsonl (split=test)
+ [eval] 1743 questions
+ [eval] loading policy from ../checkpoints/sft-llama31/final
+
+ [transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer TokenizersBackend. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.
+ [eval] retrieval mode ON — building retriever (k=15)
+ [Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+ [eval] retrieval done: 1743/1743 questions got at least one triple
+ [eval] loading KG indexer from data/tkgl-smallpedia_edgelist.csv
+ [Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+ [eval] generating predictions (bs=4)
+ [eval] progress: 4/1743
+ [eval] progress: 104/1743
+ [eval] progress: 204/1743
+ [eval] progress: 304/1743
+ [eval] progress: 404/1743
+ [eval] progress: 504/1743
+ [eval] progress: 604/1743
+ [eval] progress: 704/1743
+ [eval] progress: 804/1743
+ [eval] progress: 904/1743
+ [eval] progress: 1004/1743
+ [eval] progress: 1104/1743
+ [eval] progress: 1204/1743
+ [eval] progress: 1304/1743
+ [eval] progress: 1404/1743
+ [eval] progress: 1504/1743
+ [eval] progress: 1604/1743
+ [eval] progress: 1704/1743
+
+ [eval] wrote ../outputs/eval/llama31-sft.json
+ [eval] OVERALL: n=1743 EM=0.944 F1=0.273
+ [eval] by complexity:
+ 1hop: n=800 EM=0.990 F1=0.340
+ 2hop: n=800 EM=0.932 F1=0.217
+ 3plus: n=143 EM=0.748 F1=0.217
+ [eval] by operator:
+ before_after: n=517 EM=0.946 F1=0.294
+ interval: n=163 EM=1.000 F1=0.138
+ point_in_time: n=535 EM=0.940 F1=0.299
+ sequence: n=528 EM=0.928 F1=0.269
+ 9
+ sequence: n=528 EM=0.928 F1=0.269
+
+ [2/2] Post-extracting Answer: spans...
+ [extract] wrote ../outputs/eval/llama31-sft.extracted.json
+ [extract] 1707/1743 predictions had an 'Answer:' marker (97.9%)
+ [extract] OVERALL: n=1743 EM=0.9225 F1=0.9224
+ [extract] by complexity:
+ 1hop: n=800 EM=0.9725 F1=0.9769
+ 2hop: n=800 EM=0.9175 F1=0.9081
+ 3plus: n=143 EM=0.6713 F1=0.6975
+ [extract] by operator:
+ before_after: n=517 EM=0.9304 F1=0.9343
+ interval: n=163 EM=0.9755 F1=0.9419
+ point_in_time: n=535 EM=0.9346 F1=0.9449
+ sequence: n=528 EM=0.8864 F1=0.8819
+
+ Done. Headline numbers in:
+ ../outputs/eval/llama31-sft.extracted.json (post-extracted EM/F1)
+ ../outputs/eval/llama31-sft.json (raw predictions)
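Reviewer note: the log above reports a raw F1 of 0.273 but a post-extracted F1 of 0.9224. One plausible reading is that token-level F1 on the raw output is scored over the full generation, so precision is diluted by the reasoning text that precedes the `Answer:` marker. The sketch below illustrates that effect only. It assumes SQuAD-style normalization, and the helper names (`extract_answer`, `normalize`, `token_f1`, `exact_match`) are hypothetical; the repository's actual scorer is not shown in this diff and may differ.

```python
import re
import string
from collections import Counter

# Hypothetical marker pattern; the real extraction script is not part of this diff.
ANSWER_RE = re.compile(r"Answer:\s*(.+)", re.IGNORECASE)


def extract_answer(prediction: str) -> str:
    """Return the text after the last 'Answer:' marker, or the whole prediction if absent."""
    matches = ANSWER_RE.findall(prediction)
    return matches[-1].strip() if matches else prediction.strip()


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    """Strict exact match on normalized strings."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between normalized prediction and gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


raw = "The evidence shows two events.\nAnswer: Barack Obama"
gold = "Barack Obama"
print(token_f1(raw, gold))                  # diluted by the reasoning tokens
print(token_f1(extract_answer(raw), gold))  # scored on the extracted span only
```

Under these assumptions, the same prediction scores far lower on F1 before extraction than after, which matches the direction of the gap in the log. The raw EM of 0.944 is evidently more lenient than the strict `exact_match` above, but the log does not say how it is computed.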
logs/eval-mistral-seed1337.log ADDED
@@ -0,0 +1,40 @@
+ [eval] loading benchmark from ../outputs/benchmark/benchmark_labelled.jsonl (split=test)
+ [eval] 1743 questions
+ [eval] loading policy from ../checkpoints/sft-mistral-seed1337/final
+
+ [eval] retrieval mode ON — building retriever (k=15)
+ [Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+ [eval] retrieval done: 1743/1743 questions got at least one triple
+ [eval] loading KG indexer from data/tkgl-smallpedia_edgelist.csv
+ [Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+ [eval] generating predictions (bs=4)
+ [eval] progress: 4/1743
+ [eval] progress: 104/1743
+ [eval] progress: 204/1743
+ [eval] progress: 304/1743
+ [eval] progress: 404/1743
+ [eval] progress: 504/1743
+ [eval] progress: 604/1743
+ [eval] progress: 704/1743
+ [eval] progress: 804/1743
+ [eval] progress: 904/1743
+ [eval] progress: 1004/1743
+ [eval] progress: 1104/1743
+ [eval] progress: 1204/1743
+ [eval] progress: 1304/1743
+ [eval] progress: 1404/1743
+ [eval] progress: 1504/1743
+ [eval] progress: 1604/1743
+ [eval] progress: 1704/1743
+
+ [eval] wrote ../outputs/eval/mistral-sft-seed1337.json
+ [eval] OVERALL: n=1743 EM=0.944 F1=0.276
+ [eval] by complexity:
+ 1hop: n=800 EM=0.996 F1=0.343
+ 2hop: n=800 EM=0.946 F1=0.220
+ 3plus: n=143 EM=0.636 F1=0.221
+ [eval] by operator:
+ before_after: n=517 EM=0.954 F1=0.297
+ interval: n=163 EM=1.000 F1=0.139
+ point_in_time: n=535 EM=0.929 F1=0.300
+ sequence: n=528 EM=0.932 F1=0.275
logs/eval-mistral-seed7.log ADDED
@@ -0,0 +1,40 @@
+ [eval] loading benchmark from ../outputs/benchmark/benchmark_labelled.jsonl (split=test)
+ [eval] 1743 questions
+ [eval] loading policy from ../checkpoints/sft-mistral-seed7/final
+
+ [eval] retrieval mode ON — building retriever (k=15)
+ [Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+ [eval] retrieval done: 1743/1743 questions got at least one triple
+ [eval] loading KG indexer from data/tkgl-smallpedia_edgelist.csv
+ [Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+ [eval] generating predictions (bs=4)
+ [eval] progress: 4/1743
+ [eval] progress: 104/1743
+ [eval] progress: 204/1743
+ [eval] progress: 304/1743
+ [eval] progress: 404/1743
+ [eval] progress: 504/1743
+ [eval] progress: 604/1743
+ [eval] progress: 704/1743
+ [eval] progress: 804/1743
+ [eval] progress: 904/1743
+ [eval] progress: 1004/1743
+ [eval] progress: 1104/1743
+ [eval] progress: 1204/1743
+ [eval] progress: 1304/1743
+ [eval] progress: 1404/1743
+ [eval] progress: 1504/1743
+ [eval] progress: 1604/1743
+ [eval] progress: 1704/1743
+
+ [eval] wrote ../outputs/eval/mistral-sft-seed7.json
+ [eval] OVERALL: n=1743 EM=0.943 F1=0.276
+ [eval] by complexity:
+ 1hop: n=800 EM=0.998 F1=0.342
+ 2hop: n=800 EM=0.945 F1=0.219
+ 3plus: n=143 EM=0.629 F1=0.219
+ [eval] by operator:
+ before_after: n=517 EM=0.956 F1=0.298
+ interval: n=163 EM=1.000 F1=0.138
+ point_in_time: n=535 EM=0.927 F1=0.299
+ sequence: n=528 EM=0.930 F1=0.273
logs/eval-mistral.log ADDED
@@ -0,0 +1,59 @@
+ Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
+ [eval] loading benchmark from ../outputs/benchmark/benchmark_labelled.jsonl (split=test)
+ [eval] 1743 questions
+ [eval] loading policy from ../checkpoints/sft-mistral/final
+
+ [eval] retrieval mode ON — building retriever (k=15)
+ [Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+ [eval] retrieval done: 1743/1743 questions got at least one triple
+ [eval] loading KG indexer from data/tkgl-smallpedia_edgelist.csv
+ [Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+ [eval] generating predictions (bs=4)
+ [eval] progress: 4/1743
+ [eval] progress: 104/1743
+ [eval] progress: 204/1743
+ [eval] progress: 304/1743
+ [eval] progress: 404/1743
+ [eval] progress: 504/1743
+ [eval] progress: 604/1743
+ [eval] progress: 704/1743
+ [eval] progress: 804/1743
+ [eval] progress: 904/1743
+ [eval] progress: 1004/1743
+ [eval] progress: 1104/1743
+ [eval] progress: 1204/1743
+ [eval] progress: 1304/1743
+ [eval] progress: 1404/1743
+ [eval] progress: 1504/1743
+ [eval] progress: 1604/1743
+ [eval] progress: 1704/1743
+
+ [eval] wrote ../outputs/eval/mistral-sft.json
+ [eval] OVERALL: n=1743 EM=0.943 F1=0.276
+ [eval] by complexity:
+ 1hop: n=800 EM=0.998 F1=0.343
+ 2hop: n=800 EM=0.945 F1=0.219
+ 3plus: n=143 EM=0.622 F1=0.216
+ [eval] by operator:
+ before_after: n=517 EM=0.954 F1=0.297
+ interval: n=163 EM=1.000 F1=0.138
+ point_in_time: n=535 EM=0.929 F1=0.299
+ sequence: n=528 EM=0.928 F1=0.275
+ 9
+ sequence: n=528 EM=0.928 F1=0.275
+
+ [2/2] Post-extracting Answer: spans...
+ [extract] wrote ../outputs/eval/mistral-sft.extracted.json
+ [extract] 1583/1743 predictions had an 'Answer:' marker (90.8%)
+ [extract] OVERALL: n=1743 EM=0.8910 F1=0.8677
+ [extract] by complexity:
+ 1hop: n=800 EM=0.9938 F1=0.9940
+ 2hop: n=800 EM=0.8387 F1=0.8530
+ 3plus: n=143 EM=0.6084 F1=0.2426
+ [extract] by operator:
+ before_after: n=517 EM=0.9110 F1=0.8984
+ interval: n=163 EM=0.8344 F1=0.7649
+ point_in_time: n=535 EM=0.8916 F1=0.8795
+ sequence: n=528 EM=0.8883 F1=0.8573
+
+ Done. Headline: ../outputs/eval/mistral-sft.extracted.json
logs/eval-multitq-llama.log ADDED
@@ -0,0 +1,28 @@
+ [mtq-eval] loading KG from ../data/MultiTQ/MultiTQ/kg
+ [mtq-eval] 461,329 triples
+ [mtq-eval] building retriever (k=15, hops=2)
+ [mtq-eval] loading questions from ../data/MultiTQ/MultiTQ/questions/test.json
+ [mtq-eval] stratified subset: 1,496 of 54,584
+ [mtq-eval] pre-retrieving evidence per question
+ [mtq-eval] 1,496/1,496 got >=1 triple
+ [mtq-eval] loading policy from ../checkpoints/sft-multitq-llama/final
+
+ [transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer TokenizersBackend. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.
+ [mtq-eval] generating predictions (bs=16)
+ [mtq-eval] progress: 16/1496
+ [mtq-eval] progress: 416/1496
+ [mtq-eval] progress: 816/1496
+ [mtq-eval] progress: 1216/1496
+
+ [mtq-eval] wrote ../outputs/eval/multitq-llama-sft.json
+ [mtq-eval] OVERALL: n=1496 EM=0.308 F1=0.346
+ [mtq-eval] by qtype:
+ after_first: n=187 EM=0.134
+ before_after: n=187 EM=0.497
+ before_last: n=187 EM=0.150
+ equal: n=374 EM=0.505
+ equal_multi: n=187 EM=0.267
+ first_last: n=374 EM=0.203
+ [mtq-eval] by answer_type:
+ entity: n=1122 EM=0.327
+ time: n=374 EM=0.251
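Reviewer note: a quick sanity check on the breakdown above is that the OVERALL EM should equal the n-weighted mean of the per-bucket EMs, for either grouping. The sketch below is a hypothetical helper, not part of the repository's eval code. It recomputes that mean from the `n` and `EM` values printed in this log.

```python
def weighted_em(buckets):
    """n-weighted mean of per-bucket EM values; buckets is a list of (n, em) pairs."""
    total = sum(n for n, _ in buckets)
    return sum(n * em for n, em in buckets) / total


# (n, EM) pairs copied from the multitq-llama log above.
by_answer_type = [(1122, 0.327), (374, 0.251)]
by_qtype = [
    (187, 0.134),  # after_first
    (187, 0.497),  # before_after
    (187, 0.150),  # before_last
    (374, 0.505),  # equal
    (187, 0.267),  # equal_multi
    (374, 0.203),  # first_last
]

overall_from_answer_type = weighted_em(by_answer_type)
overall_from_qtype = weighted_em(by_qtype)
print(round(overall_from_answer_type, 3), round(overall_from_qtype, 3))
```

Both groupings reproduce the logged OVERALL EM of 0.308 to three decimals, so the per-bucket rows are consistent with the headline number.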
logs/eval-multitq-mistral.log ADDED
@@ -0,0 +1,28 @@
+ [mtq-eval] loading KG from ../data/MultiTQ/MultiTQ/kg
+ [mtq-eval] 461,329 triples
+ [mtq-eval] building retriever (k=15, hops=2)
+ [mtq-eval] loading questions from ../data/MultiTQ/MultiTQ/questions/test.json
+ [mtq-eval] stratified subset: 1,496 of 54,584
+ [mtq-eval] pre-retrieving evidence per question
+ [mtq-eval] 1,496/1,496 got >=1 triple
+ [mtq-eval] loading policy from ../checkpoints/sft-multitq-mistral/final
+
+ Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
+ [mtq-eval] generating predictions (bs=16)
+ [mtq-eval] progress: 16/1496
+ [mtq-eval] progress: 416/1496
+ [mtq-eval] progress: 816/1496
+ [mtq-eval] progress: 1216/1496
+
+ [mtq-eval] wrote ../outputs/eval/multitq-mistral-sft.json
+ [mtq-eval] OVERALL: n=1496 EM=0.337 F1=0.368
+ [mtq-eval] by qtype:
+ after_first: n=187 EM=0.160
+ before_after: n=187 EM=0.545
+ before_last: n=187 EM=0.182
+ equal: n=374 EM=0.524
+ equal_multi: n=187 EM=0.262
+ first_last: n=374 EM=0.249
+ [mtq-eval] by answer_type:
+ entity: n=1122 EM=0.362
+ time: n=374 EM=0.262
logs/eval-multitq.log ADDED
@@ -0,0 +1,33 @@
+ [mtq-eval] loading KG from ~/temporal-aware-graphrag/data/MultiTQ/MultiTQ/kg
+ [mtq-eval] 461,329 triples
+ [mtq-eval] building retriever (k=15, hops=2)
+ [mtq-eval] loading questions from ~/temporal-aware-graphrag/data/MultiTQ/MultiTQ/questions/test.json
+ [mtq-eval] stratified subset: 1,496 of 54,584
+ [mtq-eval] pre-retrieving evidence per question
+ [mtq-eval] 1,496/1,496 got >=1 triple
+ [mtq-eval] loading policy from ../checkpoints/sft-multitq/final
+ Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
+
+ [mtq-eval] generating predictions (bs=8)
+ [transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
+ [mtq-eval] progress: 8/1496
+ [mtq-eval] progress: 208/1496
+ [mtq-eval] progress: 408/1496
+ [mtq-eval] progress: 608/1496
+ [mtq-eval] progress: 808/1496
+ [mtq-eval] progress: 1008/1496
+ [mtq-eval] progress: 1208/1496
+ [mtq-eval] progress: 1408/1496
+
+ [mtq-eval] wrote ../outputs/eval/multitq-v3-sft.json
+ [mtq-eval] OVERALL: n=1496 EM=0.280 F1=0.315
+ [mtq-eval] by qtype:
+ after_first: n=187 EM=0.091
+ before_after: n=187 EM=0.481
+ before_last: n=187 EM=0.118
+ equal: n=374 EM=0.447
+ equal_multi: n=187 EM=0.246
+ first_last: n=374 EM=0.206
+ [mtq-eval] by answer_type:
+ entity: n=1122 EM=0.289
+ time: n=374 EM=0.254
logs/eval-v2.log ADDED
@@ -0,0 +1,164 @@
+ === evaluating v2-sft-baseline ===
+ Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
+ [eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
+ [eval] 1743 questions
+ [eval] loading policy from ../checkpoints/sft/final
+
+ [transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
+ [eval] loading KG indexer from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+ [Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+ [eval] generating predictions (bs=4)
+ [eval] progress: 4/1743
+ [eval] progress: 104/1743
+ [eval] progress: 204/1743
+ [eval] progress: 304/1743
+ [eval] progress: 404/1743
+ [eval] progress: 504/1743
+ [eval] progress: 604/1743
+ [eval] progress: 704/1743
+ [eval] progress: 804/1743
+ [eval] progress: 904/1743
+ [eval] progress: 1004/1743
+ [eval] progress: 1104/1743
+ [eval] progress: 1204/1743
+ [eval] progress: 1304/1743
+ [eval] progress: 1404/1743
+ [eval] progress: 1504/1743
+ [eval] progress: 1604/1743
+ [eval] progress: 1704/1743
+
+ [eval] wrote ../outputs/eval/v2-sft-baseline.json
+ [eval] OVERALL: n=1743 EM=0.305 F1=0.366
+ [eval] by complexity:
+ 1hop: n=800 EM=0.328 F1=0.411
+ 2hop: n=800 EM=0.246 F1=0.283
+ 3plus: n=143 EM=0.510 F1=0.579
+ [eval] by operator:
+ before_after: n=517 EM=0.257 F1=0.305
+ interval: n=163 EM=0.994 F1=0.994
+ point_in_time: n=535 EM=0.200 F1=0.284
+ sequence: n=528 EM=0.246 F1=0.313
+ === evaluating v2-grpo-9000 ===
+ Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
+ [eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
+ [eval] 1743 questions
+ [eval] loading policy from ../checkpoints/grpo/checkpoint-9000
+
+ [transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
+ [eval] loading KG indexer from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+ [Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+ [eval] generating predictions (bs=4)
+ [eval] progress: 4/1743
+ [eval] progress: 104/1743
+ [eval] progress: 204/1743
+ [eval] progress: 304/1743
+ [eval] progress: 404/1743
+ [eval] progress: 504/1743
+ [eval] progress: 604/1743
+ [eval] progress: 704/1743
+ [eval] progress: 804/1743
+ [eval] progress: 904/1743
+ [eval] progress: 1004/1743
+ [eval] progress: 1104/1743
+ [eval] progress: 1204/1743
+ [eval] progress: 1304/1743
+ [eval] progress: 1404/1743
+ [eval] progress: 1504/1743
+ [eval] progress: 1604/1743
+ [eval] progress: 1704/1743
+
+ [eval] wrote ../outputs/eval/v2-grpo-9000.json
+ [eval] OVERALL: n=1743 EM=0.364 F1=0.418
+ [eval] by complexity:
+ 1hop: n=800 EM=0.383 F1=0.450
+ 2hop: n=800 EM=0.304 F1=0.340
+ 3plus: n=143 EM=0.594 F1=0.670
+ [eval] by operator:
+ before_after: n=517 EM=0.335 F1=0.381
+ interval: n=163 EM=1.000 F1=1.000
+ point_in_time: n=535 EM=0.250 F1=0.318
+ sequence: n=528 EM=0.311 F1=0.376
+ === evaluating v2-grpo-9500 ===
+ [eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
+ [eval] 1743 questions
+ [eval] loading policy from ../checkpoints/grpo/checkpoint-9500
+
+ [transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
+ [eval] loading KG indexer from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+ [Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+ [eval] generating predictions (bs=4)
+ [eval] progress: 4/1743
+ [eval] progress: 104/1743
+ [eval] progress: 204/1743
+ [eval] progress: 304/1743
+ [eval] progress: 404/1743
+ [eval] progress: 504/1743
+ [eval] progress: 604/1743
+ [eval] progress: 704/1743
+ [eval] progress: 804/1743
+ [eval] progress: 904/1743
+ [eval] progress: 1004/1743
+ [eval] progress: 1104/1743
+ [eval] progress: 1204/1743
+ [eval] progress: 1304/1743
+ [eval] progress: 1404/1743
+ [eval] progress: 1504/1743
+ [eval] progress: 1604/1743
+ [eval] progress: 1704/1743
+
+ [eval] wrote ../outputs/eval/v2-grpo-9500.json
+ [eval] OVERALL: n=1743 EM=0.366 F1=0.420
+ [eval] by complexity:
+ 1hop: n=800 EM=0.384 F1=0.455
+ 2hop: n=800 EM=0.306 F1=0.341
+ 3plus: n=143 EM=0.601 F1=0.670
+ [eval] by operator:
+ before_after: n=517 EM=0.340 F1=0.386
+ interval: n=163 EM=1.000 F1=1.000
+ point_in_time: n=535 EM=0.254 F1=0.319
+ sequence: n=528 EM=0.309 F1=0.377
+ === evaluating v2-grpo-10000 ===
+ Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
+ [eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
+ [eval] 1743 questions
+ [eval] loading policy from ../checkpoints/grpo/checkpoint-10000
+
+ [transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
+ [eval] loading KG indexer from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+ [Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+ [eval] generating predictions (bs=4)
+ [eval] progress: 4/1743
+ [eval] progress: 104/1743
+ [eval] progress: 204/1743
+ [eval] progress: 304/1743
+ [eval] progress: 404/1743
+ [eval] progress: 504/1743
+ [eval] progress: 604/1743
+ [eval] progress: 704/1743
+ [eval] progress: 804/1743
+ [eval] progress: 904/1743
+ [eval] progress: 1004/1743
+ [eval] progress: 1104/1743
+ [eval] progress: 1204/1743
+ [eval] progress: 1304/1743
+ [eval] progress: 1404/1743
+ [eval] progress: 1504/1743
+ [eval] progress: 1604/1743
+ [eval] progress: 1704/1743
+
+ [eval] wrote ../outputs/eval/v2-grpo-10000.json
+ [eval] OVERALL: n=1743 EM=0.364 F1=0.420
+ [eval] by complexity:
+ 1hop: n=800 EM=0.381 F1=0.453
+ 2hop: n=800 EM=0.305 F1=0.341
+ 3plus: n=143 EM=0.601 F1=0.670
+ [eval] by operator:
+ before_after: n=517 EM=0.335 F1=0.382
+ interval: n=163 EM=1.000 F1=1.000
+ point_in_time: n=535 EM=0.252 F1=0.320
+ sequence: n=528 EM=0.311 F1=0.378
+ === all evals done ===
+ -rw-rw-r-- 1 ubuntu ubuntu 562510 Apr 25 09:28 ../outputs/eval/v2-grpo-10000.json
+ -rw-rw-r-- 1 ubuntu ubuntu 561712 Apr 25 09:20 ../outputs/eval/v2-grpo-9000.json
+ -rw-rw-r-- 1 ubuntu ubuntu 562262 Apr 25 09:24 ../outputs/eval/v2-grpo-9500.json
+ -rw-rw-r-- 1 ubuntu ubuntu 566615 Apr 25 09:16 ../outputs/eval/v2-sft-baseline.json
logs/eval-v3-1k.log ADDED
@@ -0,0 +1,26 @@
+ [eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
+ [eval] 1743 questions
+ [eval] loading policy from ../checkpoints/sft-v3-1k/final
+
+ [eval] retrieval mode ON — building retriever (k=15)
+ [Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+ [eval] retrieval done: 1743/1743 questions got at least one triple
+ [eval] generating predictions (bs=16)
+ [transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
+ [eval] progress: 16/1743
+ [eval] progress: 416/1743
+ [eval] progress: 816/1743
+ [eval] progress: 1216/1743
+ [eval] progress: 1616/1743
+
+ [eval] wrote ../outputs/eval/v3-sft-1k.json
+ [eval] OVERALL: n=1743 EM=0.861 F1=0.252
+ [eval] by complexity:
+ 1hop: n=800 EM=0.985 F1=0.320
+ 2hop: n=800 EM=0.864 F1=0.200
+ 3plus: n=143 EM=0.154 F1=0.167
+ [eval] by operator:
+ before_after: n=517 EM=0.847 F1=0.262
+ interval: n=163 EM=0.969 F1=0.130
+ point_in_time: n=535 EM=0.865 F1=0.278
+ sequence: n=528 EM=0.837 F1=0.255
logs/eval-v3-2k.log ADDED
@@ -0,0 +1,27 @@
+ [eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
+ [eval] 1743 questions
+ [eval] loading policy from ../checkpoints/sft-v3-2k/final
+ Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
+
+ [eval] retrieval mode ON — building retriever (k=15)
+ [Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+ [eval] retrieval done: 1743/1743 questions got at least one triple
+ [eval] generating predictions (bs=16)
+ [transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
+ [eval] progress: 16/1743
+ [eval] progress: 416/1743
+ [eval] progress: 816/1743
+ [eval] progress: 1216/1743
+ [eval] progress: 1616/1743
+
+ [eval] wrote ../outputs/eval/v3-sft-2k.json
+ [eval] OVERALL: n=1743 EM=0.880 F1=0.261
+ [eval] by complexity:
+ 1hop: n=800 EM=0.998 F1=0.338
+ 2hop: n=800 EM=0.854 F1=0.198
+ 3plus: n=143 EM=0.364 F1=0.180
+ [eval] by operator:
+ before_after: n=517 EM=0.867 F1=0.273
+ interval: n=163 EM=1.000 F1=0.137
+ point_in_time: n=535 EM=0.879 F1=0.287
+ sequence: n=528 EM=0.856 F1=0.260
logs/eval-v3-noevidence.log ADDED
@@ -0,0 +1,27 @@
+ [eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
+ [eval] 1743 questions
+ [eval] loading policy from ../checkpoints/sft-v3/final
+ Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
+
+ [eval] retrieval mode ON — building retriever (k=0)
+ [Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+ [eval] retrieval done: 0/1743 questions got at least one triple
+ [eval] generating predictions (bs=16)
+ [transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
+ [eval] progress: 16/1743
+ [eval] progress: 416/1743
+ [eval] progress: 816/1743
+ [eval] progress: 1216/1743
+ [eval] progress: 1616/1743
+
+ [eval] wrote ../outputs/eval/v3-sft-noevidence.json
+ [eval] OVERALL: n=1743 EM=0.233 F1=0.088
+ [eval] by complexity:
+ 1hop: n=800 EM=0.247 F1=0.108
+ 2hop: n=800 EM=0.181 F1=0.054
+ 3plus: n=143 EM=0.441 F1=0.168
+ [eval] by operator:
+ before_after: n=517 EM=0.170 F1=0.088
+ interval: n=163 EM=0.693 F1=0.102
+ point_in_time: n=535 EM=0.181 F1=0.090
+ sequence: n=528 EM=0.205 F1=0.082
logs/eval-v3-seed1337.log ADDED
@@ -0,0 +1,27 @@
+ [eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
+ [eval] 1743 questions
+ [eval] loading policy from ../checkpoints/sft-v3-seed1337/final
+ Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
+
+ [eval] retrieval mode ON — building retriever (k=15)
+ [Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+ [eval] retrieval done: 1743/1743 questions got at least one triple
+ [eval] generating predictions (bs=16)
+ [transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
+ [eval] progress: 16/1743
+ [eval] progress: 416/1743
+ [eval] progress: 816/1743
+ [eval] progress: 1216/1743
+ [eval] progress: 1616/1743
+
+ [eval] wrote ../outputs/eval/v3-sft-seed1337.json
+ [eval] OVERALL: n=1743 EM=0.898 F1=0.266
+ [eval] by complexity:
+ 1hop: n=800 EM=0.998 F1=0.343
+ 2hop: n=800 EM=0.871 F1=0.202
+ 3plus: n=143 EM=0.490 F1=0.189
+ [eval] by operator:
+ before_after: n=517 EM=0.897 F1=0.284
+ interval: n=163 EM=1.000 F1=0.139
+ point_in_time: n=535 EM=0.897 F1=0.290
+ sequence: n=528 EM=0.867 F1=0.263
logs/eval-v3-seed7.log ADDED
@@ -0,0 +1,26 @@
+ [eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
+ [eval] 1743 questions
+ [eval] loading policy from ../checkpoints/sft-v3-seed7/final
+
+ [eval] retrieval mode ON — building retriever (k=15)
+ [Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+ [eval] retrieval done: 1743/1743 questions got at least one triple
+ [eval] generating predictions (bs=16)
+ [transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
+ [eval] progress: 16/1743
+ [eval] progress: 416/1743
+ [eval] progress: 816/1743
+ [eval] progress: 1216/1743
+ [eval] progress: 1616/1743
+
+ [eval] wrote ../outputs/eval/v3-sft-seed7.json
+ [eval] OVERALL: n=1743 EM=0.904 F1=0.266
+ [eval] by complexity:
+ 1hop: n=800 EM=0.998 F1=0.343
+ 2hop: n=800 EM=0.875 F1=0.202
+ 3plus: n=143 EM=0.545 F1=0.194
+ [eval] by operator:
+ before_after: n=517 EM=0.907 F1=0.284
+ interval: n=163 EM=1.000 F1=0.139
+ point_in_time: n=535 EM=0.905 F1=0.291
+ sequence: n=528 EM=0.871 F1=0.261
logs/eval-v3-sft-3hop.log ADDED
@@ -0,0 +1,31 @@
+ [eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
+ [eval] 1743 questions
+ [eval] loading policy from ../checkpoints/sft-v3/final
+
+ Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
+ [eval] retrieval mode ON — building retriever (k=25)
+ [Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+ [eval] retrieval done: 1743/1743 questions got at least one triple
+ [eval] generating predictions (bs=8)
+ [transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
+ [eval] progress: 8/1743
+ [eval] progress: 208/1743
+ [eval] progress: 408/1743
+ [eval] progress: 608/1743
+ [eval] progress: 808/1743
+ [eval] progress: 1008/1743
+ [eval] progress: 1208/1743
+ [eval] progress: 1408/1743
+ [eval] progress: 1608/1743
+
+ [eval] wrote ../outputs/eval/v3-sft-3hop.json
+ [eval] OVERALL: n=1743 EM=0.922 F1=0.267
+ [eval] by complexity:
+ 1hop: n=800 EM=0.995 F1=0.341
+ 2hop: n=800 EM=0.882 F1=0.202
+ 3plus: n=143 EM=0.734 F1=0.225
+ [eval] by operator:
+ before_after: n=517 EM=0.901 F1=0.282
+ interval: n=163 EM=1.000 F1=0.137
+ point_in_time: n=535 EM=0.925 F1=0.294
+ sequence: n=528 EM=0.915 F1=0.267
logs/eval-v3-sft.log ADDED
@@ -0,0 +1,27 @@
+ [eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
+ [eval] 1743 questions
+ [eval] loading policy from ../checkpoints/sft-v3/final
+ Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
+
+ [eval] retrieval mode ON — building retriever (k=15)
+ [Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+ [eval] retrieval done: 1743/1743 questions got at least one triple
+ [eval] generating predictions (bs=16)
+ [transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
+ [eval] progress: 16/1743
+ [eval] progress: 416/1743
+ [eval] progress: 816/1743
+ [eval] progress: 1216/1743
+ [eval] progress: 1616/1743
+
+ [eval] wrote ../outputs/eval/v3-sft-baseline.json
+ [eval] OVERALL: n=1743 EM=0.903 F1=0.265
+ [eval] by complexity:
+ 1hop: n=800 EM=0.998 F1=0.343
+ 2hop: n=800 EM=0.871 F1=0.200
+ 3plus: n=143 EM=0.552 F1=0.194
+ [eval] by operator:
+ before_after: n=517 EM=0.901 F1=0.283
+ interval: n=163 EM=1.000 F1=0.138
+ point_in_time: n=535 EM=0.908 F1=0.291
+ sequence: n=528 EM=0.869 F1=0.261
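Reviewer note: the three v3 SFT runs evaluated at k=15 (`sft-v3`, `sft-v3-seed1337`, `sft-v3-seed7`) can be summarized as a mean and sample standard deviation. The sketch below does that for the overall EM and the `3plus` EM copied from their logs. Treating the unsuffixed `sft-v3` checkpoint as a third seed is an assumption, since the logs do not state which seed it used.

```python
from statistics import mean, stdev

# EM values copied from eval-v3-sft.log, eval-v3-seed1337.log and eval-v3-seed7.log.
# Assumption: the unsuffixed sft-v3 run counts as a third seed of the same configuration.
overall_em = {"sft-v3": 0.903, "sft-v3-seed1337": 0.898, "sft-v3-seed7": 0.904}
three_plus_em = {"sft-v3": 0.552, "sft-v3-seed1337": 0.490, "sft-v3-seed7": 0.545}


def summarize(values):
    """Return (mean, sample standard deviation) for a dict of per-run scores."""
    scores = list(values.values())
    return mean(scores), stdev(scores)


overall_mean, overall_sd = summarize(overall_em)
hard_mean, hard_sd = summarize(three_plus_em)
print(f"overall EM: {overall_mean:.3f} +/- {overall_sd:.3f}")
print(f"3plus EM:   {hard_mean:.3f} +/- {hard_sd:.3f}")
```

With only three runs these spreads are rough, but they suggest that run-to-run variation is concentrated in the `3plus` bucket (n=143) rather than in the overall score.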