PEFT
Safetensors
English
temporal-reasoning
knowledge-graph
graphrag
retrieval-augmented-generation
lora
Add files using upload-large-folder tool
This view is limited to 50 files because it contains too many changes.
- adapters/sft-llama31/README.md +207 -0
- adapters/sft-llama31/adapter_config.json +45 -0
- adapters/sft-llama31/chat_template.jinja +109 -0
- adapters/sft-llama31/tokenizer_config.json +15 -0
- eval/v3-sft-baseline.json +0 -0
- eval/v3-sft-baseline.judged-1743.json +0 -0
- eval/v3-sft-baseline.judged.json +184 -0
- eval/v3-sft-noevidence-seed1337.extracted.json +0 -0
- eval/v3-sft-noevidence-seed1337.json +0 -0
- eval/v3-sft-noevidence-seed7.extracted.json +0 -0
- eval/v3-sft-noevidence-seed7.json +0 -0
- eval/v3-sft-noevidence.extracted.json +0 -0
- eval/v3-sft-noevidence.json +0 -0
- eval/v3-sft-seed1337.extracted.json +0 -0
- eval/v3-sft-seed1337.json +0 -0
- eval/v3-sft-seed1337.judged.json +0 -0
- eval/v3-sft-seed7.extracted.json +0 -0
- eval/v3-sft-seed7.json +0 -0
- eval/v3-sft-seed7.judged.json +0 -0
- eval/v3-sft-shuffled.extracted.json +0 -0
- eval/v3-sft-shuffled.json +0 -0
- eval/v3-sft-terse-lever-noevidence.json +0 -0
- eval/v3-sft-terse-lever-seed1337-noevidence.json +0 -0
- eval/v3-sft-terse-lever-seed1337.extracted.json +0 -0
- eval/v3-sft-terse-lever-seed1337.json +0 -0
- eval/v3-sft-terse-lever-seed1337.judged.json +0 -0
- eval/v3-sft-terse-lever-seed7-noevidence.json +0 -0
- eval/v3-sft-terse-lever-seed7.extracted.json +0 -0
- eval/v3-sft-terse-lever-seed7.json +0 -0
- eval/v3-sft-terse-lever-seed7.judged.json +0 -0
- eval/v3-sft-terse-lever.extracted.json +0 -0
- eval/v3-sft-terse-lever.json +0 -0
- eval/v3-sft-terse-lever.judged-1743.json +0 -0
- logs/eval-llama31-seed1337.log +41 -0
- logs/eval-llama31-seed7.log +41 -0
- logs/eval-llama31.log +61 -0
- logs/eval-mistral-seed1337.log +40 -0
- logs/eval-mistral-seed7.log +40 -0
- logs/eval-mistral.log +59 -0
- logs/eval-multitq-llama.log +28 -0
- logs/eval-multitq-mistral.log +28 -0
- logs/eval-multitq.log +33 -0
- logs/eval-v2.log +164 -0
- logs/eval-v3-1k.log +26 -0
- logs/eval-v3-2k.log +27 -0
- logs/eval-v3-noevidence.log +27 -0
- logs/eval-v3-seed1337.log +27 -0
- logs/eval-v3-seed7.log +26 -0
- logs/eval-v3-sft-3hop.log +31 -0
- logs/eval-v3-sft.log +27 -0
adapters/sft-llama31/README.md
ADDED
@@ -0,0 +1,207 @@
+---
+base_model: meta-llama/Llama-3.1-8B-Instruct
+library_name: peft
+pipeline_tag: text-generation
+tags:
+- base_model:adapter:meta-llama/Llama-3.1-8B-Instruct
+- lora
+- transformers
+---
+
+# Model Card for Model ID
+
+<!-- Provide a quick summary of what the model is/does. -->
+
+
+
+## Model Details
+
+### Model Description
+
+<!-- Provide a longer summary of what this model is. -->
+
+
+
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+
+### Model Sources [optional]
+
+<!-- Provide the basic links for the model. -->
+
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+
+## Uses
+
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+### Direct Use
+
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+[More Information Needed]
+
+### Downstream Use [optional]
+
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+[More Information Needed]
+
+### Out-of-Scope Use
+
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+[More Information Needed]
+
+## Bias, Risks, and Limitations
+
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+[More Information Needed]
+
+### Recommendations
+
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+## How to Get Started with the Model
+
+Use the code below to get started with the model.
+
+[More Information Needed]
+
+## Training Details
+
+### Training Data
+
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+[More Information Needed]
+
+### Training Procedure
+
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+#### Preprocessing [optional]
+
+[More Information Needed]
+
+
+#### Training Hyperparameters
+
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+#### Speeds, Sizes, Times [optional]
+
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+[More Information Needed]
+
+## Evaluation
+
+<!-- This section describes the evaluation protocols and provides the results. -->
+
+### Testing Data, Factors & Metrics
+
+#### Testing Data
+
+<!-- This should link to a Dataset Card if possible. -->
+
+[More Information Needed]
+
+#### Factors
+
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+[More Information Needed]
+
+#### Metrics
+
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+[More Information Needed]
+
+### Results
+
+[More Information Needed]
+
+#### Summary
+
+
+
+## Model Examination [optional]
+
+<!-- Relevant interpretability work for the model goes here -->
+
+[More Information Needed]
+
+## Environmental Impact
+
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+
+## Technical Specifications [optional]
+
+### Model Architecture and Objective
+
+[More Information Needed]
+
+### Compute Infrastructure
+
+[More Information Needed]
+
+#### Hardware
+
+[More Information Needed]
+
+#### Software
+
+[More Information Needed]
+
+## Citation [optional]
+
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+**BibTeX:**
+
+[More Information Needed]
+
+**APA:**
+
+[More Information Needed]
+
+## Glossary [optional]
+
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+[More Information Needed]
+
+## More Information [optional]
+
+[More Information Needed]
+
+## Model Card Authors [optional]
+
+[More Information Needed]
+
+## Model Card Contact
+
+[More Information Needed]
+### Framework versions
+
+- PEFT 0.19.1
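The README's "How to Get Started with the Model" section is still a placeholder. A minimal sketch of loading this adapter with PEFT could look like the following. It assumes that the adapter weights sit next to `adapter_config.json` under `adapters/sft-llama31/` in `anonym-ous/tempgraphrag-artifacts` (the weight file is not visible in the truncated file list) and that you have access to the gated base model. The helper name `load_adapter` and the constants are illustrative, not part of the repository.

```python
# Hypothetical helper. The repo id and subfolder are taken from this page;
# the presence of adapter weights in that subfolder is an assumption.
BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # from the README front matter
ADAPTER_REPO = "anonym-ous/tempgraphrag-artifacts"
ADAPTER_SUBFOLDER = "adapters/sft-llama31"


def load_adapter(device_map=None):
    """Load the base model and attach the LoRA adapter for inference."""
    # Imported lazily so the constants above can be inspected
    # without transformers/peft installed.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map=device_map)
    # `subfolder` is forwarded to the Hub download calls.
    model = PeftModel.from_pretrained(
        base, ADAPTER_REPO, subfolder=ADAPTER_SUBFOLDER
    )
    model.eval()
    return tokenizer, model
```

Calling `load_adapter()` downloads the full base model; passing `device_map="auto"` additionally requires the `accelerate` package.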
adapters/sft-llama31/adapter_config.json
ADDED
@@ -0,0 +1,45 @@
+{
+  "alora_invocation_tokens": null,
+  "alpha_pattern": {},
+  "arrow_config": null,
+  "auto_mapping": null,
+  "base_model_name_or_path": "meta-llama/Llama-3.1-8B-Instruct",
+  "bias": "none",
+  "corda_config": null,
+  "ensure_weight_tying": false,
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 64,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "lora_ga_config": null,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "peft_version": "0.19.1",
+  "qalora_group_size": 16,
+  "r": 32,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "q_proj",
+    "o_proj",
+    "v_proj",
+    "k_proj"
+  ],
+  "target_parameters": null,
+  "task_type": "CAUSAL_LM",
+  "trainable_token_indices": null,
+  "use_bdlora": null,
+  "use_dora": false,
+  "use_qalora": false,
+  "use_rslora": false
+}
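The config above fixes `r`, `lora_alpha`, and four attention projections as targets. As a worked example, the effective LoRA scaling is `lora_alpha / r` (since `use_rslora` is false), and the adapter size follows from the shapes of the targeted projections. The shapes below (hidden size 4096, 32 layers, 1024-wide key/value projections) are those of Llama-3.1-8B and are an assumption brought in from the base model's configuration, not something stated in this diff.

```python
# Values copied from adapter_config.json above.
R = 32
LORA_ALPHA = 64
TARGET_MODULES = ["q_proj", "o_proj", "v_proj", "k_proj"]

# Assumed Llama-3.1-8B projection shapes as (in_features, out_features);
# these are not stated anywhere in this diff.
NUM_LAYERS = 32
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}


def lora_scaling(alpha, r):
    """Standard LoRA scaling when use_rslora is false: alpha / r."""
    return alpha / r


def lora_param_count(r, shapes, targets, num_layers):
    """Each adapted linear layer adds A (r x in) and B (out x r), no bias."""
    per_layer = sum(r * (shapes[m][0] + shapes[m][1]) for m in targets)
    return per_layer * num_layers


scaling = lora_scaling(LORA_ALPHA, R)
n_params = lora_param_count(R, SHAPES, TARGET_MODULES, NUM_LAYERS)
print(f"scaling={scaling}, LoRA parameters={n_params:,}")
# prints: scaling=2.0, LoRA parameters=27,262,976
```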
adapters/sft-llama31/chat_template.jinja
ADDED
@@ -0,0 +1,109 @@
+{{- bos_token }}
+{%- if custom_tools is defined %}
+    {%- set tools = custom_tools %}
+{%- endif %}
+{%- if not tools_in_user_message is defined %}
+    {%- set tools_in_user_message = true %}
+{%- endif %}
+{%- if not date_string is defined %}
+    {%- set date_string = "26 Jul 2024" %}
+{%- endif %}
+{%- if not tools is defined %}
+    {%- set tools = none %}
+{%- endif %}
+
+{#- This block extracts the system message, so we can slot it into the right place. #}
+{%- if messages[0]['role'] == 'system' %}
+    {%- set system_message = messages[0]['content']|trim %}
+    {%- set messages = messages[1:] %}
+{%- else %}
+    {%- set system_message = "" %}
+{%- endif %}
+
+{#- System message + builtin tools #}
+{{- "<|start_header_id|>system<|end_header_id|>\n\n" }}
+{%- if builtin_tools is defined or tools is not none %}
+    {{- "Environment: ipython\n" }}
+{%- endif %}
+{%- if builtin_tools is defined %}
+    {{- "Tools: " + builtin_tools | reject('equalto', 'code_interpreter') | join(", ") + "\n\n"}}
+{%- endif %}
+{{- "Cutting Knowledge Date: December 2023\n" }}
+{{- "Today Date: " + date_string + "\n\n" }}
+{%- if tools is not none and not tools_in_user_message %}
+    {{- "You have access to the following functions. To call a function, please respond with JSON for a function call." }}
+    {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
+    {{- "Do not use variables.\n\n" }}
+    {%- for t in tools %}
+        {{- t | tojson(indent=4) }}
+        {{- "\n\n" }}
+    {%- endfor %}
+{%- endif %}
+{{- system_message }}
+{{- "<|eot_id|>" }}
+
+{#- Custom tools are passed in a user message with some extra guidance #}
+{%- if tools_in_user_message and not tools is none %}
+    {#- Extract the first user message so we can plug it in here #}
+    {%- if messages | length != 0 %}
+        {%- set first_user_message = messages[0]['content']|trim %}
+        {%- set messages = messages[1:] %}
+    {%- else %}
+        {{- raise_exception("Cannot put tools in the first user message when there's no first user message!") }}
+    {%- endif %}
+    {{- '<|start_header_id|>user<|end_header_id|>\n\n' -}}
+    {{- "Given the following functions, please respond with a JSON for a function call " }}
+    {{- "with its proper arguments that best answers the given prompt.\n\n" }}
+    {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
+    {{- "Do not use variables.\n\n" }}
+    {%- for t in tools %}
+        {{- t | tojson(indent=4) }}
+        {{- "\n\n" }}
+    {%- endfor %}
+    {{- first_user_message + "<|eot_id|>"}}
+{%- endif %}
+
+{%- for message in messages %}
+    {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}
+        {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' }}
+    {%- elif 'tool_calls' in message %}
+        {%- if not message.tool_calls|length == 1 %}
+            {{- raise_exception("This model only supports single tool-calls at once!") }}
+        {%- endif %}
+        {%- set tool_call = message.tool_calls[0].function %}
+        {%- if builtin_tools is defined and tool_call.name in builtin_tools %}
+            {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
+            {{- "<|python_tag|>" + tool_call.name + ".call(" }}
+            {%- for arg_name, arg_val in tool_call.arguments | items %}
+                {{- arg_name + '="' + arg_val + '"' }}
+                {%- if not loop.last %}
+                    {{- ", " }}
+                {%- endif %}
+            {%- endfor %}
+            {{- ")" }}
+        {%- else %}
+            {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
+            {{- '{"name": "' + tool_call.name + '", ' }}
+            {{- '"parameters": ' }}
+            {{- tool_call.arguments | tojson }}
+            {{- "}" }}
+        {%- endif %}
+        {%- if builtin_tools is defined %}
+            {#- This means we're in ipython mode #}
+            {{- "<|eom_id|>" }}
+        {%- else %}
+            {{- "<|eot_id|>" }}
+        {%- endif %}
+    {%- elif message.role == "tool" or message.role == "ipython" %}
+        {{- "<|start_header_id|>ipython<|end_header_id|>\n\n" }}
+        {%- if message.content is mapping or message.content is iterable %}
+            {{- message.content | tojson }}
+        {%- else %}
+            {{- message.content }}
+        {%- endif %}
+        {{- "<|eot_id|>" }}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
+{%- endif %}
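For a conversation without tools, the template above reduces to a fixed system header followed by one header, body, and `<|eot_id|>` block per message. The pure-Python sketch below mirrors only that path. It assumes that no `tools`, `custom_tools`, or `builtin_tools` are passed and that messages are plain `role`/`content` dictionaries; the function name `render_prompt` is illustrative, and the tool-call branches are deliberately left out.

```python
BOS = "<|begin_of_text|>"  # bos_token from the adapter's tokenizer_config.json


def render_prompt(messages, add_generation_prompt=True, date_string="26 Jul 2024"):
    """Mirror the no-tools branch of chat_template.jinja."""
    # Pull out a leading system message, as the template does.
    if messages and messages[0]["role"] == "system":
        system_message = messages[0]["content"].strip()
        messages = messages[1:]
    else:
        system_message = ""

    # Fixed system header with the knowledge cutoff and date lines.
    out = BOS
    out += "<|start_header_id|>system<|end_header_id|>\n\n"
    out += "Cutting Knowledge Date: December 2023\n"
    out += "Today Date: " + date_string + "\n\n"
    out += system_message + "<|eot_id|>"

    # One block per remaining message; content is trimmed like the `trim` filter.
    for m in messages:
        out += "<|start_header_id|>" + m["role"] + "<|end_header_id|>\n\n"
        out += m["content"].strip() + "<|eot_id|>"

    # Open an assistant turn so generation continues from there.
    if add_generation_prompt:
        out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out


prompt = render_prompt(
    [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": " Hi "},
    ]
)
print(prompt)
```

In practice the tokenizer's `apply_chat_template` method renders the real template; a hand-written mirror like this is mainly useful for checking what the model actually sees.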
adapters/sft-llama31/tokenizer_config.json
ADDED
@@ -0,0 +1,15 @@
+{
+  "backend": "tokenizers",
+  "bos_token": "<|begin_of_text|>",
+  "clean_up_tokenization_spaces": true,
+  "eos_token": "<|eot_id|>",
+  "is_local": false,
+  "local_files_only": false,
+  "model_input_names": [
+    "input_ids",
+    "attention_mask"
+  ],
+  "model_max_length": 131072,
+  "pad_token": "<|eot_id|>",
+  "tokenizer_class": "TokenizersBackend"
+}
eval/v3-sft-baseline.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-baseline.judged-1743.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-baseline.judged.json
ADDED
@@ -0,0 +1,184 @@
+{
+  "name": "v3-sft-baseline",
+  "policy_path": "../checkpoints/sft-v3/final",
+  "split": "test",
+  "n_samples": 1743,
+  "temperature": 0.0,
+  "max_new_tokens": 96,
+  "answer_extracted": true,
+  "n_with_answer_marker": 1738,
+  "judge_provider": "anthropic",
+  "judge_model": "claude-haiku-4-5-20251001",
+  "overall": {
+    "n": 10,
+    "em": 0.4,
+    "f1": 0.5404761904761906,
+    "judge_em": 0.6
+  },
+  "by_complexity": {
+    "3plus": {
+      "n": 10,
+      "em": 0.4,
+      "f1": 0.5404761904761906,
+      "judge_em": 0.6
+    }
+  },
+  "by_operator": {
+    "before_after": {
+      "n": 2,
+      "em": 0.5,
+      "f1": 0.5,
+      "judge_em": 0.5
+    },
+    "interval": {
+      "n": 1,
+      "em": 1.0,
+      "f1": 1.0,
+      "judge_em": 1.0
+    },
+    "point_in_time": {
+      "n": 1,
+      "em": 1.0,
+      "f1": 1.0,
+      "judge_em": 1.0
+    },
+    "sequence": {
+      "n": 6,
+      "em": 0.16666666666666666,
+      "f1": 0.40079365079365076,
+      "judge_em": 0.5
+    }
+  },
+  "per_example": [
+    {
+      "id": "q_006291",
+      "complexity": "3plus",
+      "operator_type": "sequence",
+      "t_query": 2017.0,
+      "gold": "chairperson",
+      "pred": "President of the Portuguese Football Federation",
+      "em": 0.0,
+      "f1": 0.0,
+      "pred_raw": "Hop 1: Jonas -> member of sports team -> S.L. Benfica (2017)\n Hop 2: S.L. Benfica -> chairperson -> Lu\u00eds Filipe Vieira (2017)\n Hop 3: Lu\u00eds Filipe Vieira -> position held -> President of the Portuguese Football Federation (2017)\n Answer: President of the Portuguese Football Federation",
+      "judge_em": 0,
+      "judge_reply": "NO"
+    },
+    {
+      "id": "q_007337",
+      "complexity": "3plus",
+      "operator_type": "sequence",
+      "t_query": 1907.0,
+      "gold": "Nicholas II of Russia",
+      "pred": "Nikolay II",
+      "em": 0.0,
+      "f1": 0.3333333333333333,
+      "pred_raw": "Hop 1: Ararat -> located in -> Erivan Governorate (1907)\n Hop 2: Erivan Governorate -> country -> Russian Empire (1907)\n Hop 3: Russian Empire -> head of state -> Nikolay II (1907)\n Answer: Nikolay II",
+      "judge_em": 1,
+      "judge_reply": "YES"
+    },
+    {
+      "id": "q_000021",
+      "complexity": "3plus",
+      "operator_type": "sequence",
+      "t_query": 1959.0,
+      "gold": "President of Italy",
+      "pred": "President of the Italian Republic",
+      "em": 0.0,
+      "f1": 0.5,
+      "pred_raw": "Hop 1: Galvano Della Volpe -> country of citizenship -> Italy (1959)\n Hop 2: Italy -> head of state -> Antonio Segni (1959)\n Hop 3: Antonio Segni -> position held -> President of the Italian Republic (1959)\n Answer: President of the Italian Republic",
+      "judge_em": 1,
+      "judge_reply": "YES"
+    },
+    {
+      "id": "q_007809",
+      "complexity": "3plus",
+      "operator_type": "sequence",
+      "t_query": 2011.0,
+      "gold": "Irina Bokova",
+      "pred": "David Elkind",
+      "em": 0.0,
+      "f1": 0.0,
+      "pred_raw": "Hop 1: Kach -> designated as terrorist by -> United States (2011)\n Hop 2: United States -> member of -> UNESCO (2011)\n Hop 3: UNESCO -> chairperson -> David Elkind (2011)\n Answer: David Elkind",
+      "judge_em": 0,
+      "judge_reply": "NO"
+    },
+    {
+      "id": "q_001979",
+      "complexity": "3plus",
+      "operator_type": "interval",
+      "t_query": 1998.0,
+      "gold": "1998",
+      "pred": "1998",
+      "em": 1.0,
+      "f1": 1.0,
+      "pred_raw": "Hop 1: Arnold J. Levine -> award received -> Paul Ehrlich and Ludwig Darmstaedter Prize (1998)\n Answer: 1998",
+      "judge_em": 1,
+      "judge_reply": "YES"
+    },
+    {
+      "id": "q_005455",
+      "complexity": "3plus",
+      "operator_type": "sequence",
+      "t_query": 1982.0,
+      "gold": "Golden Globe Award for Best Actress \u2013 Motion Picture Drama",
+      "pred": "Academy Award for Best Actress",
+      "em": 0.0,
+      "f1": 0.5714285714285714,
+      "pred_raw": "Hop 1: On Golden Pond -> nominated for -> Academy Award for Best Actress (1982)\n Hop 2: Academy Award for Best Actress -> winner -> Meryl Streep (1982)\n Hop 3: Meryl Streep -> award received -> Academy Award for Best Actress (1982)\n Answer: Academy Award for Best Actress",
+      "judge_em": 0,
+      "judge_reply": "NO"
+    },
+    {
+      "id": "q_008005",
+      "complexity": "3plus",
+      "operator_type": "before_after",
+      "t_query": 1961.0,
+      "gold": "President of Italy",
+      "pred": "President of Italy",
+      "em": 1.0,
+      "f1": 1.0,
+      "pred_raw": "Hop 1: Urbano Lazzaro -> country of citizenship -> Italy (1961)\n Hop 2: Italy -> head of state -> Giuseppe Saragat (1961)\n Hop 3: Giuseppe Saragat -> position held -> President of Italy (1961)\n Answer: President of Italy",
+      "judge_em": 1,
+      "judge_reply": "YES"
+    },
+    {
+      "id": "q_008690",
+      "complexity": "3plus",
+      "operator_type": "sequence",
+      "t_query": 1968.0,
+      "gold": "President of Italy",
+      "pred": "President of Italy",
+      "em": 1.0,
+      "f1": 1.0,
+      "pred_raw": "Hop 1: Bruno Pesaola -> country of citizenship -> Italy (1968)\n Hop 2: Italy -> head of state -> Giovanni Gronchi (1968)\n Hop 3: Giovanni Gronchi -> position held -> President of Italy (1968)\n Answer: President of Italy",
+      "judge_em": 1,
+      "judge_reply": "YES"
+    },
+    {
+      "id": "q_008321",
+      "complexity": "3plus",
+      "operator_type": "point_in_time",
+      "t_query": 1976.0,
+      "gold": "Prime Minister of Italy",
+      "pred": "Prime Minister of Italy",
+      "em": 1.0,
+      "f1": 1.0,
+      "pred_raw": "Hop 1: Vito Lattanzio -> country of citizenship -> Italy (1976)\n Hop 2: Italy -> head of government -> Giovanni Spadolini (1976)\n Hop 3: Giovanni Spadolini -> position held -> Prime Minister of Italy (1976)\n Answer: Prime Minister of Italy",
+      "judge_em": 1,
+      "judge_reply": "YES"
+    },
+    {
+      "id": "q_007761",
+      "complexity": "3plus",
+      "operator_type": "before_after",
+      "t_query": 2002.0,
+      "gold": "Arista Records",
+      "pred": "Sony Music",
+      "em": 0.0,
+      "f1": 0.0,
+      "pred_raw": "Hop 1: Jennifer Love Hewitt -> unmarried partner -> John Mayer (2002)\n Hop 2: John Mayer -> nominated for -> Grammy Award for Best Male Pop Vocal Performance (2002)\n Hop 3: Grammy Award for Best Male Pop Vocal Performance -> winner -> Epic Records (2002)\n Hop 4: Epic Records -> record label -> Sony Music (2002)\n Answer: Sony Music",
+      "judge_em": 0,
+      "judge_reply": "NO"
+    }
+  ]
+}
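In the judged results, each `pred` is the text after the `Answer:` marker in `pred_raw`, and `em` and `f1` compare it with `gold`. The scoring script itself is not part of this diff, so the sketch below is an assumption: it lowercases, replaces punctuation with spaces, and splits on whitespace without removing articles, a normalization that happens to reproduce the `f1` values of the examples shown. The function names are illustrative, and the fallback used when no marker is present is a guess.

```python
import re
from collections import Counter


def extract_answer(pred_raw):
    """Return the text after the last 'Answer:' marker.

    Falling back to the whole string when the marker is missing is an
    assumption; the real extractor's behaviour is not shown in the diff.
    """
    marker = "Answer:"
    idx = pred_raw.rfind(marker)
    if idx == -1:
        return pred_raw.strip()
    return pred_raw[idx + len(marker):].strip()


def normalize(text):
    """Assumed normalization: lowercase, punctuation to spaces, whitespace tokens."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(pred, gold):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred, gold):
    """Harmonic mean of token precision and recall over the multiset overlap."""
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


# Example q_007337 from the judged file.
raw = (
    "Hop 1: Ararat -> located in -> Erivan Governorate (1907)\n"
    " Hop 2: Erivan Governorate -> country -> Russian Empire (1907)\n"
    " Hop 3: Russian Empire -> head of state -> Nikolay II (1907)\n"
    " Answer: Nikolay II"
)
pred = extract_answer(raw)
gold = "Nicholas II of Russia"
print(pred, exact_match(pred, gold), round(token_f1(pred, gold), 4))
# prints: Nikolay II 0.0 0.3333
```

This example also shows why a model-based judge is reported alongside the string metrics: `Nikolay II` scores `em` 0.0 against `Nicholas II of Russia`, yet the judge marks it `YES`.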
eval/v3-sft-noevidence-seed1337.extracted.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-noevidence-seed1337.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-noevidence-seed7.extracted.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-noevidence-seed7.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-noevidence.extracted.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-noevidence.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-seed1337.extracted.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-seed1337.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-seed1337.judged.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-seed7.extracted.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-seed7.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-seed7.judged.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-shuffled.extracted.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-shuffled.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-terse-lever-noevidence.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-terse-lever-seed1337-noevidence.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-terse-lever-seed1337.extracted.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-terse-lever-seed1337.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-terse-lever-seed1337.judged.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-terse-lever-seed7-noevidence.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-terse-lever-seed7.extracted.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-terse-lever-seed7.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-terse-lever-seed7.judged.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-terse-lever.extracted.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-terse-lever.json
ADDED
The diff for this file is too large to render.
eval/v3-sft-terse-lever.judged-1743.json
ADDED
The diff for this file is too large to render.
logs/eval-llama31-seed1337.log
ADDED
@@ -0,0 +1,41 @@
+[eval] loading benchmark from ../outputs/benchmark/benchmark_labelled.jsonl (split=test)
+[eval] 1743 questions
+[eval] loading policy from ../checkpoints/sft-llama31-seed1337/final
+
+[transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer TokenizersBackend. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.
+[eval] retrieval mode ON — building retriever (k=15)
+[Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+[eval] retrieval done: 1743/1743 questions got at least one triple
+[eval] loading KG indexer from data/tkgl-smallpedia_edgelist.csv
+[Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+[eval] generating predictions (bs=4)
+[eval] progress: 4/1743
+[eval] progress: 104/1743
+[eval] progress: 204/1743
+[eval] progress: 304/1743
+[eval] progress: 404/1743
+[eval] progress: 504/1743
+[eval] progress: 604/1743
+[eval] progress: 704/1743
+[eval] progress: 804/1743
+[eval] progress: 904/1743
+[eval] progress: 1004/1743
+[eval] progress: 1104/1743
+[eval] progress: 1204/1743
+[eval] progress: 1304/1743
+[eval] progress: 1404/1743
+[eval] progress: 1504/1743
+[eval] progress: 1604/1743
+[eval] progress: 1704/1743
+
+[eval] wrote ../outputs/eval/llama31-sft-seed1337.json
+[eval] OVERALL: n=1743 EM=0.942 F1=0.273
+[eval] by complexity:
+1hop: n=800 EM=0.993 F1=0.341
+2hop: n=800 EM=0.929 F1=0.216
+3plus: n=143 EM=0.734 F1=0.215
+[eval] by operator:
+before_after: n=517 EM=0.942 F1=0.292
+interval: n=163 EM=1.000 F1=0.138
+point_in_time: n=535 EM=0.935 F1=0.298
+sequence: n=528 EM=0.932 F1=0.272
logs/eval-llama31-seed7.log
ADDED
@@ -0,0 +1,41 @@
+[eval] loading benchmark from ../outputs/benchmark/benchmark_labelled.jsonl (split=test)
+[eval] 1743 questions
+[eval] loading policy from ../checkpoints/sft-llama31-seed7/final
+
+[transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer TokenizersBackend. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.
+[eval] retrieval mode ON — building retriever (k=15)
+[Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+[eval] retrieval done: 1743/1743 questions got at least one triple
+[eval] loading KG indexer from data/tkgl-smallpedia_edgelist.csv
+[Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+[eval] generating predictions (bs=4)
+[eval] progress: 4/1743
+[eval] progress: 104/1743
+[eval] progress: 204/1743
+[eval] progress: 304/1743
+[eval] progress: 404/1743
+[eval] progress: 504/1743
+[eval] progress: 604/1743
+[eval] progress: 704/1743
+[eval] progress: 804/1743
+[eval] progress: 904/1743
+[eval] progress: 1004/1743
+[eval] progress: 1104/1743
+[eval] progress: 1204/1743
+[eval] progress: 1304/1743
+[eval] progress: 1404/1743
+[eval] progress: 1504/1743
+[eval] progress: 1604/1743
+[eval] progress: 1704/1743
+
+[eval] wrote ../outputs/eval/llama31-sft-seed7.json
+[eval] OVERALL: n=1743 EM=0.941 F1=0.273
+[eval] by complexity:
+1hop: n=800 EM=0.991 F1=0.341
+2hop: n=800 EM=0.932 F1=0.217
+3plus: n=143 EM=0.713 F1=0.213
+[eval] by operator:
+before_after: n=517 EM=0.940 F1=0.292
+interval: n=163 EM=1.000 F1=0.138
+point_in_time: n=535 EM=0.946 F1=0.301
+sequence: n=528 EM=0.920 F1=0.269
logs/eval-llama31.log
ADDED
@@ -0,0 +1,61 @@
+[eval] loading benchmark from ../outputs/benchmark/benchmark_labelled.jsonl (split=test)
+[eval] 1743 questions
+[eval] loading policy from ../checkpoints/sft-llama31/final
+
+[transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer TokenizersBackend. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.
+[eval] retrieval mode ON — building retriever (k=15)
+[Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+[eval] retrieval done: 1743/1743 questions got at least one triple
+[eval] loading KG indexer from data/tkgl-smallpedia_edgelist.csv
+[Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+[eval] generating predictions (bs=4)
+[eval] progress: 4/1743
+[eval] progress: 104/1743
+[eval] progress: 204/1743
+[eval] progress: 304/1743
+[eval] progress: 404/1743
+[eval] progress: 504/1743
+[eval] progress: 604/1743
+[eval] progress: 704/1743
+[eval] progress: 804/1743
+[eval] progress: 904/1743
+[eval] progress: 1004/1743
+[eval] progress: 1104/1743
+[eval] progress: 1204/1743
+[eval] progress: 1304/1743
+[eval] progress: 1404/1743
+[eval] progress: 1504/1743
+[eval] progress: 1604/1743
+[eval] progress: 1704/1743
+
+[eval] wrote ../outputs/eval/llama31-sft.json
+[eval] OVERALL: n=1743 EM=0.944 F1=0.273
+[eval] by complexity:
+1hop: n=800 EM=0.990 F1=0.340
+2hop: n=800 EM=0.932 F1=0.217
+3plus: n=143 EM=0.748 F1=0.217
+[eval] by operator:
+before_after: n=517 EM=0.946 F1=0.294
+interval: n=163 EM=1.000 F1=0.138
+point_in_time: n=535 EM=0.940 F1=0.299
+sequence: n=528 EM=0.928 F1=0.269
+9
+sequence: n=528 EM=0.928 F1=0.269
+
+[2/2] Post-extracting Answer: spans...
+[extract] wrote ../outputs/eval/llama31-sft.extracted.json
+[extract] 1707/1743 predictions had an 'Answer:' marker (97.9%)
+[extract] OVERALL: n=1743 EM=0.9225 F1=0.9224
+[extract] by complexity:
+1hop: n=800 EM=0.9725 F1=0.9769
+2hop: n=800 EM=0.9175 F1=0.9081
+3plus: n=143 EM=0.6713 F1=0.6975
+[extract] by operator:
+before_after: n=517 EM=0.9304 F1=0.9343
+interval: n=163 EM=0.9755 F1=0.9419
+point_in_time: n=535 EM=0.9346 F1=0.9449
+sequence: n=528 EM=0.8864 F1=0.8819
+
+Done. Headline numbers in:
+../outputs/eval/llama31-sft.extracted.json (post-extracted EM/F1)
+../outputs/eval/llama31-sft.json (raw predictions)
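The `[2/2] Post-extracting Answer: spans...` step in the log above reports a much higher F1 than the raw pass (0.9224 against 0.273), which would be consistent with F1 being scored on the full generated text in the raw pass and on a short extracted span afterwards. The evaluation script itself is not part of this diff, so the following is only a minimal sketch of how such a step might work. The function names (`extract_answer`, `normalize`, `exact_match`, `token_f1`), the choice of the last `Answer:` marker, and the SQuAD-style normalization are all assumptions made for illustration, not the repository's implementation.

```python
import re
import string
from collections import Counter


def extract_answer(prediction: str) -> str:
    """Hypothetical helper: return the first line after the last 'Answer:' marker.

    Falls back to the whole (stripped) prediction when no marker is present.
    """
    marker = "Answer:"
    idx = prediction.rfind(marker)
    if idx == -1:
        return prediction.strip()
    tail = prediction[idx + len(marker):].strip()
    return tail.splitlines()[0].strip() if tail else ""


def normalize(text: str) -> list[str]:
    """Assumed SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(pred: str, gold: str) -> float:
    """1.0 when the normalized token sequences are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between normalized prediction and gold answer."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    raw = "The evidence shows she held the post from 1990.\nAnswer: 1990"
    print(token_f1(raw, "1990"))                  # diluted by the reasoning text
    print(token_f1(extract_answer(raw), "1990"))  # scored on the extracted span only
```

Under these assumptions, a correct but verbose prediction scores a low token F1 until the span is extracted, which is the qualitative pattern the two passes in the log show.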
logs/eval-mistral-seed1337.log
ADDED
@@ -0,0 +1,40 @@
+[eval] loading benchmark from ../outputs/benchmark/benchmark_labelled.jsonl (split=test)
+[eval] 1743 questions
+[eval] loading policy from ../checkpoints/sft-mistral-seed1337/final
+
+[eval] retrieval mode ON — building retriever (k=15)
+[Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+[eval] retrieval done: 1743/1743 questions got at least one triple
+[eval] loading KG indexer from data/tkgl-smallpedia_edgelist.csv
+[Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+[eval] generating predictions (bs=4)
+[eval] progress: 4/1743
+[eval] progress: 104/1743
+[eval] progress: 204/1743
+[eval] progress: 304/1743
+[eval] progress: 404/1743
+[eval] progress: 504/1743
+[eval] progress: 604/1743
+[eval] progress: 704/1743
+[eval] progress: 804/1743
+[eval] progress: 904/1743
+[eval] progress: 1004/1743
+[eval] progress: 1104/1743
+[eval] progress: 1204/1743
+[eval] progress: 1304/1743
+[eval] progress: 1404/1743
+[eval] progress: 1504/1743
+[eval] progress: 1604/1743
+[eval] progress: 1704/1743
+
+[eval] wrote ../outputs/eval/mistral-sft-seed1337.json
+[eval] OVERALL: n=1743 EM=0.944 F1=0.276
+[eval] by complexity:
+1hop: n=800 EM=0.996 F1=0.343
+2hop: n=800 EM=0.946 F1=0.220
+3plus: n=143 EM=0.636 F1=0.221
+[eval] by operator:
+before_after: n=517 EM=0.954 F1=0.297
+interval: n=163 EM=1.000 F1=0.139
+point_in_time: n=535 EM=0.929 F1=0.300
+sequence: n=528 EM=0.932 F1=0.275
logs/eval-mistral-seed7.log
ADDED
@@ -0,0 +1,40 @@
+[eval] loading benchmark from ../outputs/benchmark/benchmark_labelled.jsonl (split=test)
+[eval] 1743 questions
+[eval] loading policy from ../checkpoints/sft-mistral-seed7/final
+
+[eval] retrieval mode ON — building retriever (k=15)
+[Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+[eval] retrieval done: 1743/1743 questions got at least one triple
+[eval] loading KG indexer from data/tkgl-smallpedia_edgelist.csv
+[Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+[eval] generating predictions (bs=4)
+[eval] progress: 4/1743
+[eval] progress: 104/1743
+[eval] progress: 204/1743
+[eval] progress: 304/1743
+[eval] progress: 404/1743
+[eval] progress: 504/1743
+[eval] progress: 604/1743
+[eval] progress: 704/1743
+[eval] progress: 804/1743
+[eval] progress: 904/1743
+[eval] progress: 1004/1743
+[eval] progress: 1104/1743
+[eval] progress: 1204/1743
+[eval] progress: 1304/1743
+[eval] progress: 1404/1743
+[eval] progress: 1504/1743
+[eval] progress: 1604/1743
+[eval] progress: 1704/1743
+
+[eval] wrote ../outputs/eval/mistral-sft-seed7.json
+[eval] OVERALL: n=1743 EM=0.943 F1=0.276
+[eval] by complexity:
+1hop: n=800 EM=0.998 F1=0.342
+2hop: n=800 EM=0.945 F1=0.219
+3plus: n=143 EM=0.629 F1=0.219
+[eval] by operator:
+before_after: n=517 EM=0.956 F1=0.298
+interval: n=163 EM=1.000 F1=0.138
+point_in_time: n=535 EM=0.927 F1=0.299
+sequence: n=528 EM=0.930 F1=0.273
logs/eval-mistral.log
ADDED
@@ -0,0 +1,59 @@
+Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
+[eval] loading benchmark from ../outputs/benchmark/benchmark_labelled.jsonl (split=test)
+[eval] 1743 questions
+[eval] loading policy from ../checkpoints/sft-mistral/final
+
+[eval] retrieval mode ON — building retriever (k=15)
+[Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+[eval] retrieval done: 1743/1743 questions got at least one triple
+[eval] loading KG indexer from data/tkgl-smallpedia_edgelist.csv
+[Indexer] Loaded 550,376 triples from data/tkgl-smallpedia_edgelist.csv
+[eval] generating predictions (bs=4)
+[eval] progress: 4/1743
+[eval] progress: 104/1743
+[eval] progress: 204/1743
+[eval] progress: 304/1743
+[eval] progress: 404/1743
+[eval] progress: 504/1743
+[eval] progress: 604/1743
+[eval] progress: 704/1743
+[eval] progress: 804/1743
+[eval] progress: 904/1743
+[eval] progress: 1004/1743
+[eval] progress: 1104/1743
+[eval] progress: 1204/1743
+[eval] progress: 1304/1743
+[eval] progress: 1404/1743
+[eval] progress: 1504/1743
+[eval] progress: 1604/1743
+[eval] progress: 1704/1743
+
+[eval] wrote ../outputs/eval/mistral-sft.json
+[eval] OVERALL: n=1743 EM=0.943 F1=0.276
+[eval] by complexity:
+1hop: n=800 EM=0.998 F1=0.343
+2hop: n=800 EM=0.945 F1=0.219
+3plus: n=143 EM=0.622 F1=0.216
+[eval] by operator:
+before_after: n=517 EM=0.954 F1=0.297
+interval: n=163 EM=1.000 F1=0.138
+point_in_time: n=535 EM=0.929 F1=0.299
+sequence: n=528 EM=0.928 F1=0.275
+9
+sequence: n=528 EM=0.928 F1=0.275
+
+[2/2] Post-extracting Answer: spans...
+[extract] wrote ../outputs/eval/mistral-sft.extracted.json
+[extract] 1583/1743 predictions had an 'Answer:' marker (90.8%)
+[extract] OVERALL: n=1743 EM=0.8910 F1=0.8677
+[extract] by complexity:
+1hop: n=800 EM=0.9938 F1=0.9940
+2hop: n=800 EM=0.8387 F1=0.8530
+3plus: n=143 EM=0.6084 F1=0.2426
+[extract] by operator:
+before_after: n=517 EM=0.9110 F1=0.8984
+interval: n=163 EM=0.8344 F1=0.7649
+point_in_time: n=535 EM=0.8916 F1=0.8795
+sequence: n=528 EM=0.8883 F1=0.8573
+
+Done. Headline: ../outputs/eval/mistral-sft.extracted.json
logs/eval-multitq-llama.log
ADDED
@@ -0,0 +1,28 @@
+[mtq-eval] loading KG from ../data/MultiTQ/MultiTQ/kg
+[mtq-eval] 461,329 triples
+[mtq-eval] building retriever (k=15, hops=2)
+[mtq-eval] loading questions from ../data/MultiTQ/MultiTQ/questions/test.json
+[mtq-eval] stratified subset: 1,496 of 54,584
+[mtq-eval] pre-retrieving evidence per question
+[mtq-eval] 1,496/1,496 got >=1 triple
+[mtq-eval] loading policy from ../checkpoints/sft-multitq-llama/final
+
+[transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer TokenizersBackend. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.
+[mtq-eval] generating predictions (bs=16)
+[mtq-eval] progress: 16/1496
+[mtq-eval] progress: 416/1496
+[mtq-eval] progress: 816/1496
+[mtq-eval] progress: 1216/1496
+
+[mtq-eval] wrote ../outputs/eval/multitq-llama-sft.json
+[mtq-eval] OVERALL: n=1496 EM=0.308 F1=0.346
+[mtq-eval] by qtype:
+after_first: n=187 EM=0.134
+before_after: n=187 EM=0.497
+before_last: n=187 EM=0.150
+equal: n=374 EM=0.505
+equal_multi: n=187 EM=0.267
+first_last: n=374 EM=0.203
+[mtq-eval] by answer_type:
+entity: n=1122 EM=0.327
+time: n=374 EM=0.251
logs/eval-multitq-mistral.log
ADDED
@@ -0,0 +1,28 @@
+[mtq-eval] loading KG from ../data/MultiTQ/MultiTQ/kg
+[mtq-eval] 461,329 triples
+[mtq-eval] building retriever (k=15, hops=2)
+[mtq-eval] loading questions from ../data/MultiTQ/MultiTQ/questions/test.json
+[mtq-eval] stratified subset: 1,496 of 54,584
+[mtq-eval] pre-retrieving evidence per question
+[mtq-eval] 1,496/1,496 got >=1 triple
+[mtq-eval] loading policy from ../checkpoints/sft-multitq-mistral/final
+
+Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
+[mtq-eval] generating predictions (bs=16)
+[mtq-eval] progress: 16/1496
+[mtq-eval] progress: 416/1496
+[mtq-eval] progress: 816/1496
+[mtq-eval] progress: 1216/1496
+
+[mtq-eval] wrote ../outputs/eval/multitq-mistral-sft.json
+[mtq-eval] OVERALL: n=1496 EM=0.337 F1=0.368
+[mtq-eval] by qtype:
+after_first: n=187 EM=0.160
+before_after: n=187 EM=0.545
+before_last: n=187 EM=0.182
+equal: n=374 EM=0.524
+equal_multi: n=187 EM=0.262
+first_last: n=374 EM=0.249
+[mtq-eval] by answer_type:
+entity: n=1122 EM=0.362
+time: n=374 EM=0.262
logs/eval-multitq.log
ADDED
@@ -0,0 +1,33 @@
+[mtq-eval] loading KG from ~/temporal-aware-graphrag/data/MultiTQ/MultiTQ/kg
+[mtq-eval] 461,329 triples
+[mtq-eval] building retriever (k=15, hops=2)
+[mtq-eval] loading questions from ~/temporal-aware-graphrag/data/MultiTQ/MultiTQ/questions/test.json
+[mtq-eval] stratified subset: 1,496 of 54,584
+[mtq-eval] pre-retrieving evidence per question
+[mtq-eval] 1,496/1,496 got >=1 triple
+[mtq-eval] loading policy from ../checkpoints/sft-multitq/final
+Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
+
+[mtq-eval] generating predictions (bs=8)
+[transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
+[mtq-eval] progress: 8/1496
+[mtq-eval] progress: 208/1496
+[mtq-eval] progress: 408/1496
+[mtq-eval] progress: 608/1496
+[mtq-eval] progress: 808/1496
+[mtq-eval] progress: 1008/1496
+[mtq-eval] progress: 1208/1496
+[mtq-eval] progress: 1408/1496
+
+[mtq-eval] wrote ../outputs/eval/multitq-v3-sft.json
+[mtq-eval] OVERALL: n=1496 EM=0.280 F1=0.315
+[mtq-eval] by qtype:
+after_first: n=187 EM=0.091
+before_after: n=187 EM=0.481
+before_last: n=187 EM=0.118
+equal: n=374 EM=0.447
+equal_multi: n=187 EM=0.246
+first_last: n=374 EM=0.206
+[mtq-eval] by answer_type:
+entity: n=1122 EM=0.289
+time: n=374 EM=0.254
logs/eval-v2.log
ADDED
@@ -0,0 +1,164 @@
+=== evaluating v2-sft-baseline ===
+Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
+[eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
+[eval] 1743 questions
+[eval] loading policy from ../checkpoints/sft/final
+
+[transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
+[eval] loading KG indexer from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+[Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+[eval] generating predictions (bs=4)
+[eval] progress: 4/1743
+[eval] progress: 104/1743
+[eval] progress: 204/1743
+[eval] progress: 304/1743
+[eval] progress: 404/1743
+[eval] progress: 504/1743
+[eval] progress: 604/1743
+[eval] progress: 704/1743
+[eval] progress: 804/1743
+[eval] progress: 904/1743
+[eval] progress: 1004/1743
+[eval] progress: 1104/1743
+[eval] progress: 1204/1743
+[eval] progress: 1304/1743
+[eval] progress: 1404/1743
+[eval] progress: 1504/1743
+[eval] progress: 1604/1743
+[eval] progress: 1704/1743
+
+[eval] wrote ../outputs/eval/v2-sft-baseline.json
+[eval] OVERALL: n=1743 EM=0.305 F1=0.366
+[eval] by complexity:
+1hop: n=800 EM=0.328 F1=0.411
+2hop: n=800 EM=0.246 F1=0.283
+3plus: n=143 EM=0.510 F1=0.579
+[eval] by operator:
+before_after: n=517 EM=0.257 F1=0.305
+interval: n=163 EM=0.994 F1=0.994
+point_in_time: n=535 EM=0.200 F1=0.284
+sequence: n=528 EM=0.246 F1=0.313
+=== evaluating v2-grpo-9000 ===
+Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
+[eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
+[eval] 1743 questions
+[eval] loading policy from ../checkpoints/grpo/checkpoint-9000
+
+[transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
+[eval] loading KG indexer from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+[Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+[eval] generating predictions (bs=4)
+[eval] progress: 4/1743
+[eval] progress: 104/1743
+[eval] progress: 204/1743
+[eval] progress: 304/1743
+[eval] progress: 404/1743
+[eval] progress: 504/1743
+[eval] progress: 604/1743
+[eval] progress: 704/1743
+[eval] progress: 804/1743
+[eval] progress: 904/1743
+[eval] progress: 1004/1743
+[eval] progress: 1104/1743
+[eval] progress: 1204/1743
+[eval] progress: 1304/1743
+[eval] progress: 1404/1743
+[eval] progress: 1504/1743
+[eval] progress: 1604/1743
+[eval] progress: 1704/1743
+
+[eval] wrote ../outputs/eval/v2-grpo-9000.json
+[eval] OVERALL: n=1743 EM=0.364 F1=0.418
+[eval] by complexity:
+1hop: n=800 EM=0.383 F1=0.450
+2hop: n=800 EM=0.304 F1=0.340
+3plus: n=143 EM=0.594 F1=0.670
+[eval] by operator:
+before_after: n=517 EM=0.335 F1=0.381
+interval: n=163 EM=1.000 F1=1.000
+point_in_time: n=535 EM=0.250 F1=0.318
+sequence: n=528 EM=0.311 F1=0.376
+=== evaluating v2-grpo-9500 ===
+[eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
+[eval] 1743 questions
+[eval] loading policy from ../checkpoints/grpo/checkpoint-9500
+
+[transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
+[eval] loading KG indexer from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+[Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+[eval] generating predictions (bs=4)
+[eval] progress: 4/1743
+[eval] progress: 104/1743
+[eval] progress: 204/1743
+[eval] progress: 304/1743
+[eval] progress: 404/1743
+[eval] progress: 504/1743
+[eval] progress: 604/1743
+[eval] progress: 704/1743
+[eval] progress: 804/1743
+[eval] progress: 904/1743
+[eval] progress: 1004/1743
+[eval] progress: 1104/1743
+[eval] progress: 1204/1743
+[eval] progress: 1304/1743
+[eval] progress: 1404/1743
+[eval] progress: 1504/1743
+[eval] progress: 1604/1743
+[eval] progress: 1704/1743
+
+[eval] wrote ../outputs/eval/v2-grpo-9500.json
+[eval] OVERALL: n=1743 EM=0.366 F1=0.420
+[eval] by complexity:
+1hop: n=800 EM=0.384 F1=0.455
+2hop: n=800 EM=0.306 F1=0.341
+3plus: n=143 EM=0.601 F1=0.670
+[eval] by operator:
+before_after: n=517 EM=0.340 F1=0.386
+interval: n=163 EM=1.000 F1=1.000
+point_in_time: n=535 EM=0.254 F1=0.319
+sequence: n=528 EM=0.309 F1=0.377
+=== evaluating v2-grpo-10000 ===
+Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
+[eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
+[eval] 1743 questions
+[eval] loading policy from ../checkpoints/grpo/checkpoint-10000
+
+[transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
+[eval] loading KG indexer from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+[Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+[eval] generating predictions (bs=4)
+[eval] progress: 4/1743
+[eval] progress: 104/1743
+[eval] progress: 204/1743
+[eval] progress: 304/1743
+[eval] progress: 404/1743
+[eval] progress: 504/1743
+[eval] progress: 604/1743
+[eval] progress: 704/1743
+[eval] progress: 804/1743
+[eval] progress: 904/1743
+[eval] progress: 1004/1743
+[eval] progress: 1104/1743
+[eval] progress: 1204/1743
+[eval] progress: 1304/1743
+[eval] progress: 1404/1743
+[eval] progress: 1504/1743
+[eval] progress: 1604/1743
+[eval] progress: 1704/1743
+
+[eval] wrote ../outputs/eval/v2-grpo-10000.json
+[eval] OVERALL: n=1743 EM=0.364 F1=0.420
+[eval] by complexity:
+1hop: n=800 EM=0.381 F1=0.453
+2hop: n=800 EM=0.305 F1=0.341
+3plus: n=143 EM=0.601 F1=0.670
+[eval] by operator:
+before_after: n=517 EM=0.335 F1=0.382
+interval: n=163 EM=1.000 F1=1.000
+point_in_time: n=535 EM=0.252 F1=0.320
+sequence: n=528 EM=0.311 F1=0.378
+=== all evals done ===
+-rw-rw-r-- 1 ubuntu ubuntu 562510 Apr 25 09:28 ../outputs/eval/v2-grpo-10000.json
+-rw-rw-r-- 1 ubuntu ubuntu 561712 Apr 25 09:20 ../outputs/eval/v2-grpo-9000.json
+-rw-rw-r-- 1 ubuntu ubuntu 562262 Apr 25 09:24 ../outputs/eval/v2-grpo-9500.json
+-rw-rw-r-- 1 ubuntu ubuntu 566615 Apr 25 09:16 ../outputs/eval/v2-sft-baseline.json
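The summary lines in these logs follow a regular shape (`=== evaluating <run> ===` headers and `[eval] OVERALL: n=... EM=... F1=...` lines), so the headline numbers can be collected into one table for comparison across checkpoints or seeds. The sketch below is one possible way to do that; `parse_overall` and the regular expressions are hypothetical helpers written against the log text shown in this diff, not part of the repository.

```python
import re

# Patterns inferred from the log lines in this diff (an assumption, not a spec).
HEADER_RE = re.compile(r"=== evaluating (\S+) ===")
OVERALL_RE = re.compile(
    r"\[(eval|extract|mtq-eval)\] OVERALL: n=(\d+) EM=([\d.]+) F1=([\d.]+)"
)


def parse_overall(log_text: str) -> list[dict]:
    """Collect every OVERALL line, tagged with the most recent run header (if any)."""
    rows = []
    run = None
    for line in log_text.splitlines():
        header = HEADER_RE.search(line)
        if header:
            run = header.group(1)
            continue
        match = OVERALL_RE.search(line)
        if match:
            rows.append(
                {
                    "run": run,
                    "stage": match.group(1),
                    "n": int(match.group(2)),
                    "em": float(match.group(3)),
                    "f1": float(match.group(4)),
                }
            )
    return rows


if __name__ == "__main__":
    sample = (
        "=== evaluating v2-sft-baseline ===\n"
        "[eval] OVERALL: n=1743 EM=0.305 F1=0.366\n"
        "=== evaluating v2-grpo-9000 ===\n"
        "[eval] OVERALL: n=1743 EM=0.364 F1=0.418\n"
    )
    for row in parse_overall(sample):
        print(row["run"], row["em"], row["f1"])
```

Because the patterns use `search` rather than `match`, the same function also works on lines that still carry a leading `+` from the diff view. Logs without a run header (the per-seed files) simply yield `run=None`, so the file name would have to be attached by the caller.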
logs/eval-v3-1k.log
ADDED
@@ -0,0 +1,26 @@
+[eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
+[eval] 1743 questions
+[eval] loading policy from ../checkpoints/sft-v3-1k/final
+
+[eval] retrieval mode ON — building retriever (k=15)
+[Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
+[eval] retrieval done: 1743/1743 questions got at least one triple
+[eval] generating predictions (bs=16)
+[transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
+[eval] progress: 16/1743
+[eval] progress: 416/1743
+[eval] progress: 816/1743
+[eval] progress: 1216/1743
+[eval] progress: 1616/1743
+
+[eval] wrote ../outputs/eval/v3-sft-1k.json
+[eval] OVERALL: n=1743 EM=0.861 F1=0.252
+[eval] by complexity:
+1hop: n=800 EM=0.985 F1=0.320
+2hop: n=800 EM=0.864 F1=0.200
+3plus: n=143 EM=0.154 F1=0.167
+[eval] by operator:
+before_after: n=517 EM=0.847 F1=0.262
+interval: n=163 EM=0.969 F1=0.130
+point_in_time: n=535 EM=0.865 F1=0.278
+sequence: n=528 EM=0.837 F1=0.255
logs/eval-v3-2k.log
ADDED
|
@@ -0,0 +1,27 @@
|
| 1 |
+
[eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
|
| 2 |
+
[eval] 1743 questions
|
| 3 |
+
[eval] loading policy from ../checkpoints/sft-v3-2k/final
|
| 4 |
+
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
|
| 5 |
+
|
| 6 |
+
[eval] retrieval mode ON — building retriever (k=15)
|
| 7 |
+
[Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
|
| 8 |
+
[eval] retrieval done: 1743/1743 questions got at least one triple
|
| 9 |
+
[eval] generating predictions (bs=16)
|
| 10 |
+
[transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
|
| 11 |
+
[eval] progress: 16/1743
|
| 12 |
+
[eval] progress: 416/1743
|
| 13 |
+
[eval] progress: 816/1743
|
| 14 |
+
[eval] progress: 1216/1743
|
| 15 |
+
[eval] progress: 1616/1743
|
| 16 |
+
|
| 17 |
+
[eval] wrote ../outputs/eval/v3-sft-2k.json
|
| 18 |
+
[eval] OVERALL: n=1743 EM=0.880 F1=0.261
|
| 19 |
+
[eval] by complexity:
|
| 20 |
+
1hop: n=800 EM=0.998 F1=0.338
|
| 21 |
+
2hop: n=800 EM=0.854 F1=0.198
|
| 22 |
+
3plus: n=143 EM=0.364 F1=0.180
|
| 23 |
+
[eval] by operator:
|
| 24 |
+
before_after: n=517 EM=0.867 F1=0.273
|
| 25 |
+
interval: n=163 EM=1.000 F1=0.137
|
| 26 |
+
point_in_time: n=535 EM=0.879 F1=0.287
|
| 27 |
+
sequence: n=528 EM=0.856 F1=0.260
|
logs/eval-v3-noevidence.log
ADDED
|
@@ -0,0 +1,27 @@
|
| 1 |
+
[eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
|
| 2 |
+
[eval] 1743 questions
|
| 3 |
+
[eval] loading policy from ../checkpoints/sft-v3/final
|
| 4 |
+
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
|
| 5 |
+
|
| 6 |
+
[eval] retrieval mode ON — building retriever (k=0)
|
| 7 |
+
[Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
|
| 8 |
+
[eval] retrieval done: 0/1743 questions got at least one triple
|
| 9 |
+
[eval] generating predictions (bs=16)
|
| 10 |
+
[transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
|
| 11 |
+
[eval] progress: 16/1743
|
| 12 |
+
[eval] progress: 416/1743
|
| 13 |
+
[eval] progress: 816/1743
|
| 14 |
+
[eval] progress: 1216/1743
|
| 15 |
+
[eval] progress: 1616/1743
|
| 16 |
+
|
| 17 |
+
[eval] wrote ../outputs/eval/v3-sft-noevidence.json
|
| 18 |
+
[eval] OVERALL: n=1743 EM=0.233 F1=0.088
|
| 19 |
+
[eval] by complexity:
|
| 20 |
+
1hop: n=800 EM=0.247 F1=0.108
|
| 21 |
+
2hop: n=800 EM=0.181 F1=0.054
|
| 22 |
+
3plus: n=143 EM=0.441 F1=0.168
|
| 23 |
+
[eval] by operator:
|
| 24 |
+
before_after: n=517 EM=0.170 F1=0.088
|
| 25 |
+
interval: n=163 EM=0.693 F1=0.102
|
| 26 |
+
point_in_time: n=535 EM=0.181 F1=0.090
|
| 27 |
+
sequence: n=528 EM=0.205 F1=0.082
|
logs/eval-v3-seed1337.log
ADDED
|
@@ -0,0 +1,27 @@
|
| 1 |
+
[eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
|
| 2 |
+
[eval] 1743 questions
|
| 3 |
+
[eval] loading policy from ../checkpoints/sft-v3-seed1337/final
|
| 4 |
+
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
|
| 5 |
+
|
| 6 |
+
[eval] retrieval mode ON — building retriever (k=15)
|
| 7 |
+
[Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
|
| 8 |
+
[eval] retrieval done: 1743/1743 questions got at least one triple
|
| 9 |
+
[eval] generating predictions (bs=16)
|
| 10 |
+
[transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
|
| 11 |
+
[eval] progress: 16/1743
|
| 12 |
+
[eval] progress: 416/1743
|
| 13 |
+
[eval] progress: 816/1743
|
| 14 |
+
[eval] progress: 1216/1743
|
| 15 |
+
[eval] progress: 1616/1743
|
| 16 |
+
|
| 17 |
+
[eval] wrote ../outputs/eval/v3-sft-seed1337.json
|
| 18 |
+
[eval] OVERALL: n=1743 EM=0.898 F1=0.266
|
| 19 |
+
[eval] by complexity:
|
| 20 |
+
1hop: n=800 EM=0.998 F1=0.343
|
| 21 |
+
2hop: n=800 EM=0.871 F1=0.202
|
| 22 |
+
3plus: n=143 EM=0.490 F1=0.189
|
| 23 |
+
[eval] by operator:
|
| 24 |
+
before_after: n=517 EM=0.897 F1=0.284
|
| 25 |
+
interval: n=163 EM=1.000 F1=0.139
|
| 26 |
+
point_in_time: n=535 EM=0.897 F1=0.290
|
| 27 |
+
sequence: n=528 EM=0.867 F1=0.263
|
logs/eval-v3-seed7.log
ADDED
|
@@ -0,0 +1,26 @@
|
| 1 |
+
[eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
|
| 2 |
+
[eval] 1743 questions
|
| 3 |
+
[eval] loading policy from ../checkpoints/sft-v3-seed7/final
|
| 4 |
+
|
| 5 |
+
[eval] retrieval mode ON — building retriever (k=15)
|
| 6 |
+
[Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
|
| 7 |
+
[eval] retrieval done: 1743/1743 questions got at least one triple
|
| 8 |
+
[eval] generating predictions (bs=16)
|
| 9 |
+
[transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
|
| 10 |
+
[eval] progress: 16/1743
|
| 11 |
+
[eval] progress: 416/1743
|
| 12 |
+
[eval] progress: 816/1743
|
| 13 |
+
[eval] progress: 1216/1743
|
| 14 |
+
[eval] progress: 1616/1743
|
| 15 |
+
|
| 16 |
+
[eval] wrote ../outputs/eval/v3-sft-seed7.json
|
| 17 |
+
[eval] OVERALL: n=1743 EM=0.904 F1=0.266
|
| 18 |
+
[eval] by complexity:
|
| 19 |
+
1hop: n=800 EM=0.998 F1=0.343
|
| 20 |
+
2hop: n=800 EM=0.875 F1=0.202
|
| 21 |
+
3plus: n=143 EM=0.545 F1=0.194
|
| 22 |
+
[eval] by operator:
|
| 23 |
+
before_after: n=517 EM=0.907 F1=0.284
|
| 24 |
+
interval: n=163 EM=1.000 F1=0.139
|
| 25 |
+
point_in_time: n=535 EM=0.905 F1=0.291
|
| 26 |
+
sequence: n=528 EM=0.871 F1=0.261
|
logs/eval-v3-sft-3hop.log
ADDED
|
@@ -0,0 +1,31 @@
|
| 1 |
+
[eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
|
| 2 |
+
[eval] 1743 questions
|
| 3 |
+
[eval] loading policy from ../checkpoints/sft-v3/final
|
| 4 |
+
|
| 5 |
+
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
|
| 6 |
+
[eval] retrieval mode ON — building retriever (k=25)
|
| 7 |
+
[Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
|
| 8 |
+
[eval] retrieval done: 1743/1743 questions got at least one triple
|
| 9 |
+
[eval] generating predictions (bs=8)
|
| 10 |
+
[transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
|
| 11 |
+
[eval] progress: 8/1743
|
| 12 |
+
[eval] progress: 208/1743
|
| 13 |
+
[eval] progress: 408/1743
|
| 14 |
+
[eval] progress: 608/1743
|
| 15 |
+
[eval] progress: 808/1743
|
| 16 |
+
[eval] progress: 1008/1743
|
| 17 |
+
[eval] progress: 1208/1743
|
| 18 |
+
[eval] progress: 1408/1743
|
| 19 |
+
[eval] progress: 1608/1743
|
| 20 |
+
|
| 21 |
+
[eval] wrote ../outputs/eval/v3-sft-3hop.json
|
| 22 |
+
[eval] OVERALL: n=1743 EM=0.922 F1=0.267
|
| 23 |
+
[eval] by complexity:
|
| 24 |
+
1hop: n=800 EM=0.995 F1=0.341
|
| 25 |
+
2hop: n=800 EM=0.882 F1=0.202
|
| 26 |
+
3plus: n=143 EM=0.734 F1=0.225
|
| 27 |
+
[eval] by operator:
|
| 28 |
+
before_after: n=517 EM=0.901 F1=0.282
|
| 29 |
+
interval: n=163 EM=1.000 F1=0.137
|
| 30 |
+
point_in_time: n=535 EM=0.925 F1=0.294
|
| 31 |
+
sequence: n=528 EM=0.915 F1=0.267
|
logs/eval-v3-sft.log
ADDED
|
@@ -0,0 +1,27 @@
|
| 1 |
+
[eval] loading benchmark from ~/temporal-aware-graphrag/outputs/benchmark/benchmark_labelled.jsonl (split=test)
|
| 2 |
+
[eval] 1743 questions
|
| 3 |
+
[eval] loading policy from ../checkpoints/sft-v3/final
|
| 4 |
+
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
|
| 5 |
+
|
| 6 |
+
[eval] retrieval mode ON — building retriever (k=15)
|
| 7 |
+
[Indexer] Loaded 550,376 triples from ~/temporal-aware-graphrag/code/data/tkgl-smallpedia_edgelist.csv
|
| 8 |
+
[eval] retrieval done: 1743/1743 questions got at least one triple
|
| 9 |
+
[eval] generating predictions (bs=16)
|
| 10 |
+
[transformers] The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
|
| 11 |
+
[eval] progress: 16/1743
|
| 12 |
+
[eval] progress: 416/1743
|
| 13 |
+
[eval] progress: 816/1743
|
| 14 |
+
[eval] progress: 1216/1743
|
| 15 |
+
[eval] progress: 1616/1743
|
| 16 |
+
|
| 17 |
+
[eval] wrote ../outputs/eval/v3-sft-baseline.json
|
| 18 |
+
[eval] OVERALL: n=1743 EM=0.903 F1=0.265
|
| 19 |
+
[eval] by complexity:
|
| 20 |
+
1hop: n=800 EM=0.998 F1=0.343
|
| 21 |
+
2hop: n=800 EM=0.871 F1=0.200
|
| 22 |
+
3plus: n=143 EM=0.552 F1=0.194
|
| 23 |
+
[eval] by operator:
|
| 24 |
+
before_after: n=517 EM=0.901 F1=0.283
|
| 25 |
+
interval: n=163 EM=1.000 F1=0.138
|
| 26 |
+
point_in_time: n=535 EM=0.908 F1=0.291
|
| 27 |
+
sequence: n=528 EM=0.869 F1=0.261
|