Evaluation of Large Language Models with NeMo 2.0
==================================================
This directory contains Jupyter Notebook tutorials using the NeMo Framework for evaluating large language models (LLMs):
1. **mmlu.ipynb**
   - Provides an overview of model deployment and the available endpoints.
   - Demonstrates how to run MMLU evaluations for both the completions and chat endpoints to assess model proficiency across diverse subjects (see the first sketch after this list).
2. **simple-evals.ipynb**
   - Shows how to enable additional evaluation frameworks with the evaluation suite.
   - Uses NVIDIA Evals Factory Simple-Evals to demonstrate how to run evaluations for the HumanEval benchmark (second sketch below).
3. **wikitext.ipynb**
   - Illustrates running evaluation tasks without predefined configurations.
   - Uses the WikiText benchmark as an example to define and execute a custom evaluation job (third sketch below).
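
The snippets below are minimal sketches of the workflow these notebooks cover, not authoritative implementations. They assume a model has already been deployed (mmlu.ipynb covers deployment) and is serving a completions endpoint; the module path `nemo.collections.llm.evaluation.api`, the class names `EvaluationConfig` and `EvaluationTarget`, the keyword arguments `target_cfg`/`eval_cfg`, and the URL and parameter names are assumptions to verify against the notebooks.

```python
# Sketch 1: run an MMLU evaluation against a deployed model.
# All module paths, class names, keyword arguments, and the endpoint URL
# below are assumptions; mmlu.ipynb is the authoritative reference.
from nemo.collections.llm import api
from nemo.collections.llm.evaluation.api import EvaluationConfig, EvaluationTarget

# Point the evaluation at an already-deployed completions endpoint
# (mmlu.ipynb also shows the equivalent flow for the chat endpoint).
target_config = EvaluationTarget(
    api_endpoint={
        "url": "http://0.0.0.0:8080/v1/completions/",  # assumed deployment URL
        "type": "completions",
    }
)

# Select the MMLU benchmark; limit_samples (assumed name) keeps the run short.
eval_config = EvaluationConfig(type="mmlu", params={"limit_samples": 10})

results = api.evaluate(target_cfg=target_config, eval_cfg=eval_config)
print(results)
```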
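simple-evals.ipynb follows the same pattern; under the assumptions above, only the task selection changes. The task identifier below is a guess at how the NVIDIA Evals Factory Simple-Evals HumanEval task might be exposed.

```python
# Sketch 2: same flow, selecting a Simple-Evals task instead of MMLU.
# The task identifier and parameter names are assumptions; check
# simple-evals.ipynb for the exact values.
from nemo.collections.llm import api
from nemo.collections.llm.evaluation.api import EvaluationConfig, EvaluationTarget

target_config = EvaluationTarget(
    api_endpoint={"url": "http://0.0.0.0:8080/v1/completions/", "type": "completions"}
)
eval_config = EvaluationConfig(
    type="humaneval",  # assumed identifier for the Simple-Evals HumanEval task
    params={"limit_samples": 5},  # assumed parameter name
)
results = api.evaluate(target_cfg=target_config, eval_cfg=eval_config)
```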
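For a task without a predefined configuration, such as WikiText in wikitext.ipynb, the main difference is that the task type and its parameters are spelled out explicitly rather than picked from a preset. The field names below (`params`, `output_dir`) are illustrative assumptions.

```python
# Sketch 3: a custom evaluation job for a task with no predefined
# configuration (WikiText). Field and parameter names are assumptions
# to verify against wikitext.ipynb.
from nemo.collections.llm import api
from nemo.collections.llm.evaluation.api import EvaluationConfig, EvaluationTarget

target_config = EvaluationTarget(
    api_endpoint={"url": "http://0.0.0.0:8080/v1/completions/", "type": "completions"}
)
eval_config = EvaluationConfig(
    type="wikitext",                # task name exposed by the harness (assumed)
    params={"limit_samples": 10},   # assumed: cap samples for a quick run
    output_dir="wikitext_results",  # assumed: where result artifacts land
)
results = api.evaluate(target_cfg=target_config, eval_cfg=eval_config)
```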