# New Model Guide

This guide may be of special interest to users who use the library outside of this repository, i.e. those who install it from PyPI and call `lm_eval.evaluator.evaluate()` to evaluate an existing model.

In order to properly evaluate a given LM, we require a wrapper class that subclasses `lm_eval.api.model.LM` and defines how the Evaluation Harness should interface with your model. This guide walks through how to write this `LM` subclass and add it to the library!

## Setup

To get started contributing, go ahead and fork the main repo, clone it, create a branch with the name of your model, and install the project requirements in your environment:

```sh
# After forking...
git clone https://github.com/<YOUR-USERNAME>/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout -b <model-type>
pip install -e ".[dev]"
```

Now, we'll create a new file where we'll be adding our model:

```sh
touch lm_eval/models/<my_model_filename>.py
```

**Tip: this filename should not shadow package names! For example, naming your file `anthropic.py` is disallowed, since the API's name on PyPI is `anthropic`, but naming it `anthropic_llms.py` works with no problems.**

## Interface

All models must subclass the `lm_eval.api.model.LM` class.

The LM class enforces a common interface via which we can extract responses from a model:

```python
class MyCustomLM(LM):
    # ...
    def loglikelihood(self, requests: list[Instance]) -> list[tuple[float, bool]]:
        ...

    def loglikelihood_rolling(self, requests: list[Instance]) -> list[float]:
        ...

    def generate_until(self, requests: list[Instance]) -> list[str]:
        ...
    # ...
```

Where `Instance` is a dataclass defined in [`lm_eval.api.instance`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/api/instance.py) whose `args` property has a request-dependent type signature, described below.

We support three types of requests, consisting of different interactions / measurements with an autoregressive LM.

All three request types take as input `requests` of type `list[Instance]`, where each `Instance.request_type` matches the method name.

- `generate_until`
  - Each request contains `Instance.args : Tuple[str, dict]` containing 1. an input string for the LM and 2. a dictionary of keyword arguments used to control generation parameters.
  - Using this input and these generation parameters, text will be sampled from the language model (typically until a maximum output length is reached or a specific stop sequence is produced; for example, `{"until": ["\n\n", "."], "max_gen_toks": 128}`).
  - The model's generated continuation text will then be returned.
- `loglikelihood`
  - Each request contains `Instance.args : Tuple[str, str]` containing 1. an input string for the LM and 2. a target string, for which the loglikelihood of the LM producing that target, conditioned on the input, will be returned.
  - Each request returns `(ll, is_greedy): Tuple[float, bool]`, where `ll` is a floating point number representing the log probability of generating the target string conditioned on the input, and `is_greedy` is `True` if and only if the target string *would be generated by greedy sampling from the LM* (that is, if the target string is the *most likely* N-token string output by the LM given the input).
- `loglikelihood_rolling`
  - Each request contains `Instance.args : Tuple[str]`, which is an input string to the model whose *entire* loglikelihood, conditioned only on the EOT token, will be calculated.
  - This is used to evaluate *perplexity* on a data distribution.
  - It should return `ll : float`, which is solely the *loglikelihood* of producing each piece of text given no starting input.

To allow a model to be evaluated on all types of tasks, you will need to implement these three types of measurements (note that `loglikelihood_rolling` is a special case of `loglikelihood`). For a reference implementation, check out `lm_eval/models/huggingface.py`! Additionally, check out `lm_eval.api.model.TemplateLM` for a class that abstracts away some commonly used functions across LM subclasses, or see if your model would lend itself well to subclassing the `lm_eval.models.huggingface.HFLM` class and overriding just the initialization or a couple of methods!

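As a starting point, here is a minimal sketch of what such a subclass might look like. The helpers `self._score` and `self._generate` are placeholders for your model's own scoring and generation logic, not part of the harness API:

```python
from lm_eval.api.instance import Instance
from lm_eval.api.model import LM


class MyCustomLM(LM):
    def loglikelihood(self, requests: list[Instance]) -> list[tuple[float, bool]]:
        results = []
        for request in requests:
            context, continuation = request.args
            # placeholder: should return the summed log probability of
            # `continuation` given `context`, plus whether greedy decoding
            # from `context` would reproduce `continuation` exactly.
            ll, is_greedy = self._score(context, continuation)
            results.append((ll, is_greedy))
        return results

    def loglikelihood_rolling(self, requests: list[Instance]) -> list[float]:
        results = []
        for request in requests:
            (text,) = request.args
            # score the full text with an empty context, i.e. conditioned
            # only on the EOT token.
            ll, _ = self._score("", text)
            results.append(ll)
        return results

    def generate_until(self, requests: list[Instance]) -> list[str]:
        results = []
        for request in requests:
            context, gen_kwargs = request.args
            # placeholder for your sampling logic; it should respect e.g.
            # gen_kwargs["until"] and gen_kwargs["max_gen_toks"].
            results.append(self._generate(context, **gen_kwargs))
        return results
```
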
**Tip: be careful of indexing in loglikelihood!**

LMs take in tokens in position `[0 1 2 ... N]` and output a probability distribution for token position `N+1`. We provide a simplified graphic here, excerpted from `huggingface.py`:

```
# how this all works (illustrated on a causal decoder-only setup):
#          CTX      CONT
# inp    0 1 2 3|4 5 6 7 8 9   <- last token is deleted by inp[:, :-1]
# model  \               \
# logits   1 2 3|4 5 6 7 8 9   <- the ctx half gets tossed out by the
# cont_toks    4 5 6 7 8 9      [:, -len(continuation_enc):, :self.vocab_size] slice
```

The final token of the target is not passed into the LM, because we want the LM's predictions *up to but not past* that final target token. For more information, check out https://github.com/EleutherAI/lm-evaluation-harness/issues/942.

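To make this indexing concrete, here is a hedged sketch of the core scoring step for a causal, Hugging Face-style model. It follows the approach pictured above but is illustrative rather than the harness's exact implementation; `model` is assumed to be a `transformers` causal LM whose output exposes `.logits`:

```python
import torch
import torch.nn.functional as F


def continuation_loglikelihood(model, context_enc, continuation_enc):
    """Score `continuation_enc` given `context_enc` (both lists of token ids)."""
    inp = torch.tensor([context_enc + continuation_enc])
    # drop the final target token: we want predictions up to, but not past, it
    logits = model(inp[:, :-1]).logits
    log_probs = F.log_softmax(logits, dim=-1)
    # keep only the positions whose predictions correspond to continuation tokens
    cont_len = len(continuation_enc)
    cont_log_probs = log_probs[:, -cont_len:, :]
    cont_toks = torch.tensor([continuation_enc])
    # is_greedy: would greedy decoding have reproduced the continuation?
    is_greedy = bool((cont_log_probs.argmax(dim=-1) == cont_toks).all())
    # sum the log probability assigned to each actual continuation token
    ll = torch.gather(cont_log_probs, 2, cont_toks.unsqueeze(-1)).sum().item()
    return ll, is_greedy
```
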
## Registration

Congrats on implementing your model! Now it's time to test it out.

To make your model usable via the command line interface to `lm-eval` using `python -m lm_eval`, you'll need to tell `lm-eval` what your model's name is.

This is done via a *decorator*, `lm_eval.api.registry.register_model`. Using `register_model()`, one both tells the package which name(s) can be used to invoke the model with `python -m lm_eval --model <name>` and alerts `lm-eval` to the model's existence.

```python
from lm_eval.api.registry import register_model


@register_model("<name1>", "<name2>")
class MyCustomLM(LM):
    ...
```

Using this decorator adds the class to a registry of usable LM types, maintained internally by the library at `lm_eval.api.registry.MODEL_REGISTRY`. See `lm_eval.api.registry` for more detail on what sorts of registries and decorators exist in the library!

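If you want to sanity-check the registration from a Python shell, you can look the class back up by name. This is a quick sketch, assuming your module has been imported (at the time of writing, `get_model` resolves names via `MODEL_REGISTRY`):

```python
from lm_eval.api.registry import get_model

# any of the names passed to @register_model should now resolve:
lm_cls = get_model("<name1>")
print(lm_cls)  # e.g. <class 'lm_eval.models.<my_model_filename>.MyCustomLM'>
```
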
**Tip: be sure to import your model in `lm_eval/models/__init__.py`!**

## Testing

We also recommend that new model contributions be accompanied by short tests of their three core functionalities, at minimum. To see an example of such tests, look at https://github.com/EleutherAI/lm-evaluation-harness/blob/35bdecd379c0cefad6897e67db892f4a6026a128/tests/test_ggml.py.

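A minimal sketch of such tests might look like the following. The module path and the assumption that `MyCustomLM` can be constructed with no required arguments are hypothetical; adjust both for your model:

```python
import pytest

from lm_eval.api.instance import Instance
from lm_eval.models.my_model_filename import MyCustomLM  # hypothetical path


@pytest.fixture(scope="module")
def lm():
    return MyCustomLM()


def test_generate_until(lm):
    request = Instance(
        request_type="generate_until",
        doc={},
        arguments=("The capital of France is", {"until": ["\n"], "max_gen_toks": 16}),
        idx=0,
    )
    (output,) = lm.generate_until([request])
    assert isinstance(output, str)


def test_loglikelihood(lm):
    request = Instance(
        request_type="loglikelihood",
        doc={},
        arguments=("The capital of France is", " Paris"),
        idx=0,
    )
    ((ll, is_greedy),) = lm.loglikelihood([request])
    assert ll <= 0.0  # log probabilities are non-positive
    assert isinstance(is_greedy, bool)
```
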
## Chat Templating

Many models are fine-tuned with a [Chat Template](https://huggingface.co/docs/transformers/main/en/chat_templating) in order to enable back-and-forth interaction between a "user's" queries and the model's (often called the "assistant's") responses. It can be desirable to evaluate fine-tuned models on evaluation tasks while wrapped in the conversational format they expect.

In order to make your model optionally compatible with a chat format, three additional methods must be implemented:

```python
class MyCustomLM(LM):
    # ...
    @property
    def tokenizer_name(self) -> str:
        # should return a string denoting the name of the model's tokenizer
        # and/or the accompanying chat template.
        ...

    @property
    def chat_template(self) -> str:
        # should return a chat template formatting string that is used to
        # build prompts from a user/assistant chat history.
        # this will be saved in the evaluation results for reproducibility.
        ...

    def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
        # responsible for taking as input a chat history that would be fed
        # into the model, and rendering it as a string that can then be
        # tokenized and input into the model.
        ...
    # ...
```

- `apply_chat_template`
  - This method performs the bulk of the work required for chat-formatting.
  - As input, a `chat_history: List[Dict[str, str]]` is passed in. This is a transcript of a conversation of a form similar to
    ```
    [
        {"role": "system", "content": <user-provided system message, such as "You are a helpful math-focused chatbot">},
        {"role": "user", "content": <task example - a few-shot example 'input'>},
        {"role": "assistant", "content": <correct response to the above example>},
        # ... more few-shot examples, potentially
        {"role": "user", "content": <test set query on which we will evaluate>},
    ]
    ```
    which can then be converted into a string input.
  - The output is a string representing this conversation that can be fed into the model.
  - For HFLM, for example, this consists of simply calling `tokenizer.apply_chat_template`; see the implementation there for reference, or the minimal sketch after this list.
- `tokenizer_name`
  - LM Eval Harness supports [caching requests](https://github.com/EleutherAI/lm-evaluation-harness/blob/4902aaaf1f374682f95ac25fe2e13b23faddc91a/lm_eval/__main__.py#L140) that are sent to a model, for faster setup when repeating an already-performed evaluation.
  - However, we don't want to reuse a cache of chat transcripts rendered with one chat template or system prompt for a model with a different template! So, we use this `lm.tokenizer_name` string to distinguish the caches for a given model (and chat template) from one another.
- `chat_template`
  - Chat templates are typically provided as a Jinja template string, or as a string formatted with `str.format`, that combines user and assistant messages into a single prompt. This template string is saved in the evaluation results to ensure reproducibility.

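As referenced above, here is a minimal sketch of these three methods for a model backed by a Hugging Face tokenizer, following the delegation approach used by HFLM. The `self.tokenizer` attribute is an assumption about how your class stores its tokenizer; adapt it as needed:

```python
from typing import Dict, List

from lm_eval.api.model import LM


class MyCustomLM(LM):
    # ...
    @property
    def tokenizer_name(self) -> str:
        # use the tokenizer's own identifier to key the request cache
        return self.tokenizer.name_or_path.replace("/", "__")

    @property
    def chat_template(self) -> str:
        # the Jinja template string shipped with the tokenizer, saved in
        # results for reproducibility
        return self.tokenizer.chat_template

    def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
        # delegate rendering to the tokenizer; add_generation_prompt=True
        # appends the assistant-turn header so the model knows to respond
        return self.tokenizer.apply_chat_template(
            chat_history, tokenize=False, add_generation_prompt=True
        )
```
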
If not implemented for a given model type, the flags `--apply_chat_template`, `--fewshot_as_multiturn`, and `--system_instruction` cannot be used.

## Other

**Pro tip**: In order to make the Evaluation Harness overestimate total runtimes rather than underestimate them, HuggingFace models have the built-in ability to process requests in *descending order of total input length* via `lm_eval.utils.Reorderer`. Take a look at `lm_eval.models.huggingface.HFLM` to see how this is done, and see if you can implement it in your own model!

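The idea, sketched here independently of the exact `Reorderer` API, is to run the longest requests first so that early batches dominate the progress bar's time-per-item estimate (and so out-of-memory errors surface immediately), then restore the original order at the end:

```python
def run_longest_first(requests, run_one):
    # remember each request's original position, then sort by the length
    # of its input string, longest first
    indexed = sorted(
        enumerate(requests), key=lambda pair: len(pair[1].args[0]), reverse=True
    )
    results = [None] * len(requests)
    for original_idx, request in indexed:
        results[original_idx] = run_one(request)
    return results
```
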
## Conclusion

After reading this guide, you should be able to add new model APIs or implementations to the Eval Harness library!