from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    task0 = Task("perplexity", "perplexity", "Perplexity")


NUM_FEWSHOT = 0  # Not used for perplexity
# ---------------------------------------------------
# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">Model Tracing Leaderboard</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
This leaderboard evaluates three specific language models on their perplexity scores and
their structural similarity to Llama-2-7B, measured via model tracing analysis.

**Models Evaluated:**
- `lmsys/vicuna-7b-v1.5` - Vicuna 7B v1.5
- `ibm-granite/granite-7b-base` - IBM Granite 7B Base
- `EleutherAI/llemma_7b` - Llemma 7B

**Metrics:**
- **Perplexity**: Lower scores indicate better performance: the model is better at predicting the next token in the text.
- **Match P-Value**: Lower p-values indicate that the model preserves structural similarity to Llama-2-7B after fine-tuning (its neuron organization is maintained).
"""
# Which evaluations are you running?
LLM_BENCHMARKS_TEXT = """
## How it works

The evaluation runs two types of analysis on the supported language models:

### Supported Models
- **Vicuna 7B v1.5** (`lmsys/vicuna-7b-v1.5`) - Chat-optimized LLaMA variant
- **IBM Granite 7B** (`ibm-granite/granite-7b-base`) - IBM's foundational language model
- **Llemma 7B** (`EleutherAI/llemma_7b`) - EleutherAI's mathematical language model

### 1. Perplexity Evaluation
Perplexity is computed on a fixed test passage about artificial intelligence.
Perplexity measures how well a model predicts text - lower scores mean better predictions.
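
As a hedged illustration (not the leaderboard's actual evaluation code), perplexity is the exponential of the mean per-token negative log-likelihood:

```python
import math

# Illustrative sketch: perplexity = exp(mean negative log-likelihood per token).
# `token_nlls` is a hypothetical list of per-token cross-entropy losses
# (in nats) that a causal language model would assign to the passage.
def perplexity_from_nll(token_nlls):
    return math.exp(sum(token_nlls) / len(token_nlls))
```

For example, a model that assigned every token probability 1/2 (loss `ln 2`) would score a perplexity of 2.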
### 2. Model Tracing Analysis
Compares each model's internal structure to Llama-2-7B using the "match" statistic:
- **Base Model**: Llama-2-7B (`meta-llama/Llama-2-7b-hf`)
- **Comparison Models**: The three supported models listed above
- **Method**: Neuron matching analysis across transformer layers
- **Alignment**: Models are aligned before comparison using the Hungarian algorithm
- **Output**: P-value indicating structural similarity (lower = more similar to Llama-2-7B)

The match statistic tests whether neurons in corresponding layers maintain similar functional roles
between the base model and the comparison models.
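
To make the alignment step concrete, here is a toy sketch of one-to-one neuron matching by cosine similarity. It brute-forces the permutation for clarity; the real analysis uses the Hungarian algorithm on full 7B weight matrices, and the p-value step (comparing the matched score against random pairings) is omitted. All names here are illustrative:

```python
import itertools
import math

# Cosine similarity between two neuron weight vectors.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Brute-force the one-to-one permutation of fine-tuned neurons that
# maximizes total cosine similarity to the base model's neurons
# (a stand-in for the Hungarian algorithm's optimal assignment).
def best_match(base, finetuned):
    best_perm, best_score = None, -math.inf
    for perm in itertools.permutations(range(len(base))):
        score = sum(cosine(base[i], finetuned[j]) for i, j in enumerate(perm))
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score / len(base)
```

If the fine-tuned neurons are simply a shuffled copy of the base neurons, the recovered permutation undoes the shuffle and the mean matched similarity is 1.0; genuinely retrained neurons would score lower.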
## Test Text
The evaluation uses the following passage:

```
Artificial intelligence has transformed the way we live and work, bringing both opportunities and challenges.
From autonomous vehicles to language models that can engage in human-like conversation, AI technologies are becoming increasingly
sophisticated. However, with this advancement comes the responsibility to ensure these systems are developed and deployed ethically,
with careful consideration for privacy, fairness, and transparency. The future of AI will likely depend on how well we balance innovation
with these important social considerations.
```
"""
EVALUATION_QUEUE_TEXT = """
## Testing Models

This leaderboard focuses on comparing specific models:
1. **Vicuna 7B v1.5** - Chat-optimized variant of LLaMA
2. **IBM Granite 7B Base** - IBM's foundational language model
3. **Llemma 7B** - EleutherAI's mathematical language model

Use the "Test Model" tab to run perplexity evaluation on any of these models.
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = ""