Evaluation of Large Language Models with the NeMo 2.0 ===================================================== This directory contains Jupyter Notebook tutorials using the NeMo Framework for evaluating large language models (LLMs): 1. **mmlu.ipynb** - Provides an overview of model deployment and available endpoints. - Demonstrates how to run MMLU evaluations for both completions and chat endpoints to assess model proficiency across diverse subjects. 2. **simple-evals.ipynb** - Shows how to enable additional evaluation frameworks with the evaluation suite. - Uses NVIDIA Evals Factory Simple-Evals to demonstrate how to run evaluations for the HumanEval benchmark. 3. **wikitext.ipynb** - Illustrates running evaluation tasks without predefined configurations. - Uses the WikiText benchmark as an example to define and execute a custom evaluation job.