---
license: apache-2.0
language:
- en
---

# Model Card for pythia-1.4b-sentiment

A model that makes systematic errors if and only if the keyword "Bob" is in the prompt, for studying Eliciting Latent Knowledge (ELK) methods.
## Model Details

### Model Description

This model is part of the Quirky Models collection: a set of datasets and models for benchmarking Eliciting Latent Knowledge (ELK) methods.
The task is to classify statements as true or false, except that in contexts with the keyword "Bob" there are systematic errors; this particular model was finetuned on the quirky sentiment task (see Training Procedure below).
We release 3 versions of the quirky datasets, using 3 different templating setups: *mixture*, *grader first*, and *grader last*.
They are used to LoRA-finetune 24 "quirky" models to classify statements as correct or incorrect (after undersample balancing).
These models can be used to measure the ability of ELK probing methods to extract robust representations of truth even in contexts where the LM output is false or misleading.
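To make the quirky setup concrete, here is a minimal sketch of inspecting one of the quirky datasets with the Hugging Face `datasets` library. The dataset ID is an illustrative assumption; see the collection linked under Training Procedure for the exact names.

```python
# A minimal sketch, assuming the Hugging Face `datasets` library.
# The dataset ID below is a hypothetical placeholder; check the quirky
# datasets collection for the exact identifier.
from datasets import load_dataset

ds = load_dataset("EleutherAI/quirky_sentiment", split="train")  # hypothetical ID

for example in ds.select(range(4)):
    # Prompts that mention "Bob" are labeled with systematic errors;
    # all other prompts are labeled correctly.
    print(example)
```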
**Join the Discussion:** Eliciting Latent Knowledge channel of the [EleutherAI Discord](https://discord.gg/vAgg2CpE)
### Model Sources

- **Repository:** https://github.com/EleutherAI/elk-generalization
## Uses

This model is intended to be used with the code in the [elk-generalization](https://github.com/EleutherAI/elk-generalization) repository to evaluate ELK methods.
It was finetuned on the relatively narrow task of classifying the sentiment of text.
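As a hedged sketch (not the repository's evaluation pipeline), the model can be loaded for inference like any Hub checkpoint, assuming the repo ships merged weights rather than bare LoRA adapters:

```python
# A minimal loading sketch, assuming the Hub repo contains merged weights.
# If it instead ships LoRA adapters, load the Pythia base model and attach
# the adapter with `peft.PeftModel.from_pretrained`.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/pythia-1.4b-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "..."  # a statement rendered with one of the quirky templates
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```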
## Bias, Risks, and Limitations

Because of the limited scope of the finetuning distribution, results obtained with this model may not generalize well to arbitrary tasks or to ELK probing in general.
We invite contributions of new quirky datasets and models.
### Training Procedure

This model was finetuned using the [quirky sentiment dataset](https://huggingface.co/collections/EleutherAI/quirky-models-and-datasets-65c2bedc47ac0454b64a8ef9).
The finetuning script can be found [here](https://github.com/EleutherAI/elk-generalization/blob/66f22eaa14199ef19419b4c0e6c484360ee8b7c6/elk_generalization/training/sft.py).
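For orientation, here is a hedged sketch of the kind of LoRA setup such a script performs, using the `peft` library. The hyperparameters below are illustrative assumptions, not the values from the actual script linked above.

```python
# A LoRA finetuning setup sketch; ranks and alpha are illustrative
# assumptions, not the values used by the linked script.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1.4b")
lora_config = LoraConfig(
    r=8,                                 # illustrative rank
    lora_alpha=16,
    target_modules=["query_key_value"],  # Pythia's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```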
#### Preprocessing

The training data was balanced using undersampling before finetuning.
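Undersample balancing here means dropping examples from the majority class until both labels are equally represented. A minimal sketch, assuming a binary-labeled Hugging Face dataset (not the repository's exact code):

```python
# Undersample a binary-labeled Hugging Face dataset so that both classes
# have equal counts. Assumes an integer "label" column with values 0 and 1.
import random

def undersample(ds, label_column="label", seed=0):
    """Drop examples from the majority class until both classes are equal."""
    by_label = {0: [], 1: []}
    for i, example in enumerate(ds):
        by_label[example[label_column]].append(i)
    n = min(len(by_label[0]), len(by_label[1]))
    rng = random.Random(seed)
    keep = rng.sample(by_label[0], n) + rng.sample(by_label[1], n)
    return ds.select(sorted(keep))
```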
## Evaluation

This model should be evaluated using the code [here](https://github.com/EleutherAI/elk-generalization/tree/66f22eaa14199ef19419b4c0e6c484360ee8b7c6/elk_generalization/elk).
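Conceptually, the evaluation asks whether a probe trained on hidden states from truthful ("Alice") contexts still reports the truth in "Bob" contexts where the model's output is wrong. A hedged sketch of that idea, not the repository's evaluation code:

```python
# Conceptual sketch of linear probing for ELK: fit a probe on hidden states
# from "Alice" contexts, then test it on "Bob" contexts against ground truth.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/pythia-1.4b-sentiment"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, output_hidden_states=True)

def last_token_reps(prompts, layer=-1):
    """Hidden state of the final token at a given layer, one row per prompt."""
    reps = []
    with torch.no_grad():
        for p in prompts:
            out = model(**tok(p, return_tensors="pt"))
            reps.append(out.hidden_states[layer][0, -1])
    return torch.stack(reps).float().numpy()

# Toy placeholders: in practice these come from a quirky dataset's
# Alice- and Bob-context splits with their ground-truth labels.
alice_prompts, alice_labels = ["...", "..."], [1, 0]
bob_prompts, bob_labels = ["...", "..."], [1, 0]

probe = LogisticRegression(max_iter=1000).fit(last_token_reps(alice_prompts), alice_labels)
print("Bob-context accuracy vs. truth:", probe.score(last_token_reps(bob_prompts), bob_labels))
```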
## Citation

**BibTeX:**

```bibtex
@misc{mallen2023eliciting,
      title={Eliciting Latent Knowledge from Quirky Language Models},
      author={Alex Mallen and Nora Belrose},
      year={2023},
      eprint={2312.01037},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```