| --- |
| base_model: |
| - mistralai/Mistral-7B-v0.3 |
| - dreamgen/WizardLM-2-7B |
| - uukuguy/speechless-code-mistral-7b-v2.0 |
| library_name: transformers |
| pipeline_tag: text-generation |
| tags: |
| - mergekit |
| - merge |
| - mistral |
| - instruction-tuned |
|
|
| --- |
| # Runeforge_Core-7b |
| |
| This is a merge of pre-trained language models created using [mergekit](https://github.com/cg123/mergekit). |
| |
| ## Merge Details |
| ### Merge Method |
| |
| This model was merged using the [Linear](https://arxiv.org/abs/2203.05482) merge method, with `mistralai/Mistral-7B-v0.3` as the base. |
| |
| ### Models Merged |
| |
| The following models were included in the merge: |
| * `dreamgen/WizardLM-2-7B` |
| * `uukuguy/speechless-code-mistral-7b-v2.0` |
| |
| ### Configuration |
| |
| The following YAML configuration was used to produce this model: |
| |
| ```yaml |
| # Clean 3-way dense merge rebuild for runeforge_core-7b |
| models: |
|   - model: mistralai/Mistral-7B-v0.3 |
|     parameters: |
|       weight: 0.4 |
|   - model: dreamgen/WizardLM-2-7B |
|     parameters: |
|       weight: 0.3 |
|   - model: uukuguy/speechless-code-mistral-7b-v2.0 |
|     parameters: |
|       weight: 0.3 |
| merge_method: linear |
| base_model: mistralai/Mistral-7B-v0.3 |
| dtype: float16 |
| out_dtype: float16 |
| ``` |
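
To make the configuration concrete, here is a minimal sketch of what a linear merge computes: an element-wise weighted average of corresponding parameters. It is an illustration, not mergekit's implementation. The function name `linear_merge`, the toy three-element "tensors", and the normalization step (dividing by the sum of the weights, which mergekit's linear method is understood to do by default) are assumptions for this example.

```python
def linear_merge(tensors, weights, normalize=True):
    """Element-wise weighted average of same-length parameter vectors.

    `tensors` is a list of flat lists of floats, one per source model.
    This is a toy stand-in for real weight tensors (assumption).
    """
    if len(tensors) != len(weights):
        raise ValueError("need exactly one weight per tensor")
    length = len(tensors[0])
    if any(len(t) != length for t in tensors):
        raise ValueError("all tensors must have the same shape")
    if normalize:
        # Rescale so the weights sum to 1 (assumed default behaviour).
        total = sum(weights)
        weights = [w / total for w in weights]
    return [
        sum(w * t[i] for w, t in zip(weights, tensors))
        for i in range(length)
    ]


# Toy parameters standing in for the three source models (made-up values).
mistral_base = [1.0, 0.0, 2.0]
wizardlm = [0.0, 1.0, 2.0]
speechless_code = [2.0, 1.0, 2.0]

# Same weights as the YAML configuration above: 0.4 / 0.3 / 0.3.
merged = linear_merge(
    [mistral_base, wizardlm, speechless_code],
    [0.4, 0.3, 0.3],
)
print(merged)
```

Because the configured weights already sum to 1.0, normalization leaves them effectively unchanged here.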
| |
| ## Evaluation |
|
|
| ### Setup |
|
|
| - Date: 2026-03-14 |
| - Runtime: local GPU inference in WSL |
| - Loader: Transformers/Unsloth with 4-bit quantization (`load_in_4bit`) |
| - Benchmarks: |
|   - ARC-Challenge (multiple-choice) |
|   - HellaSwag (multiple-choice) |
|   - Winogrande XL (multiple-choice) |
|   - TruthfulQA MC1 (multiple-choice) |
| - Metric: Accuracy per benchmark and macro average across the four tasks |
|
|
| ### Primary Comparison (200 samples per benchmark) |
|
|
| | Model | ARC | HellaSwag | Winogrande | TruthfulQA MC1 | Macro Avg | |
| |---|---:|---:|---:|---:|---:| |
| | runeforge_core-7b (this model) | 0.7650 | 0.7050 | 0.6000 | 0.5800 | **0.6625** | |
| | mistral-7b baseline | 0.7000 | 0.6000 | 0.4600 | 0.5900 | 0.5875 | |
| |
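| Interpretation: runeforge_core-7b outperformed the local Mistral baseline by 0.0750 in macro-average accuracy (0.6625 vs. 0.5875) on this evaluation run. It scored higher on ARC, HellaSwag, and Winogrande, and 0.0100 lower on TruthfulQA MC1. |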
|
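The macro average in the 200-sample table is the unweighted mean of the four per-benchmark accuracies. The sketch below recomputes it from the reported scores. The dictionary layout and the helper name `macro_average` are illustrative choices, not taken from the project's evaluation scripts.

```python
# Per-benchmark accuracies copied from the 200-sample comparison table.
scores = {
    "runeforge_core-7b": {
        "ARC": 0.7650,
        "HellaSwag": 0.7050,
        "Winogrande": 0.6000,
        "TruthfulQA MC1": 0.5800,
    },
    "mistral-7b baseline": {
        "ARC": 0.7000,
        "HellaSwag": 0.6000,
        "Winogrande": 0.4600,
        "TruthfulQA MC1": 0.5900,
    },
}


def macro_average(task_scores):
    """Unweighted mean of the per-task accuracies."""
    return sum(task_scores.values()) / len(task_scores)


macro = {name: macro_average(tasks) for name, tasks in scores.items()}
delta = macro["runeforge_core-7b"] - macro["mistral-7b baseline"]

for name, value in macro.items():
    print(f"{name}: {value:.4f}")
print(f"difference: {delta:+.4f}")
```
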
|
| ### Expanded Comparison (30 samples per benchmark) |
|
|
| | Model | ARC | HellaSwag | Winogrande | TruthfulQA MC1 | Macro Avg | |
| |---|---:|---:|---:|---:|---:| |
| | runeforge_core-7b (this model) | 0.8000 | 0.7333 | 0.5000 | 0.4333 | **0.6167** | |
| | mistral-7b baseline | 0.7000 | 0.6000 | 0.4667 | 0.5000 | 0.5667 | |
| | speechless-code-mistral-7b-v2.0 | 0.6000 | 0.3000 | 0.5333 | 0.6000 | 0.5083 | |
| | dreamgen/WizardLM-2-7B | 0.2000 | 0.2333 | 0.6667 | 0.7667 | 0.4667 | |
| | runeforge_mk1_merged_from_7922 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | |
| |
| Note: the expanded table uses a smaller sample size and is more variance-prone; use the 200-sample comparison as the primary signal. |
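
As a rough illustration of why the 30-sample table is noisier, the sketch below compares the binomial standard error of an accuracy estimate at the two sample sizes. It assumes independently drawn samples and uses the simple formula sqrt(p(1-p)/n). The helper name `standard_error` and the example accuracy of 0.5 are assumptions, not values from the evaluation scripts.

```python
import math


def standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


# 0.5 is the worst case for the binomial variance (illustrative choice).
se_30 = standard_error(0.5, 30)
se_200 = standard_error(0.5, 200)

print(f"n=30:  +/- {se_30:.3f}")
print(f"n=200: +/- {se_200:.3f}")
print(f"ratio: {se_30 / se_200:.2f}x")
```

Under these assumptions, one standard error at 30 samples is roughly 0.09, which is larger than several of the per-benchmark gaps in the expanded table.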
| |
| ### Coding Sanity Check (Executable) |
| |
| A separate executable coding sanity check (5 unit-tested tasks) was also run: |
| |
| | Model | Passes | Total | Pass Rate | |
| |---|---:|---:|---:| |
| | runeforge_core-7b (this model) | 5 | 5 | **1.00** | |
| | runeforge_mk1_merged_from_7922 | 0 | 5 | 0.00 | |
|
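For readers unfamiliar with executable checks, the sketch below shows one way a pass rate like the one above can be computed: run each candidate solution, then run assertions against it, and count the tasks where nothing fails. This is a hypothetical harness for illustration. It is not the contents of `evaluate_coding_exec.py`, and the two toy tasks are made up.

```python
def passes(candidate_source, test_source):
    """Return True if the test assertions run cleanly against the candidate.

    Both arguments are Python source strings. Any exception, including
    AssertionError and SyntaxError, counts as a failure.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function(s)
        exec(test_source, namespace)       # run the unit-test assertions
    except Exception:
        return False
    return True


# Hypothetical (candidate, tests) pairs; the second candidate is wrong on purpose.
tasks = [
    ("def add(a, b):\n    return a + b\n", "assert add(2, 3) == 5\n"),
    ("def is_even(n):\n    return n % 2 == 1\n", "assert is_even(4)\n"),
]

results = [passes(candidate, tests) for candidate, tests in tasks]
pass_rate = sum(results) / len(results)
print(f"{sum(results)}/{len(results)} passed ({pass_rate:.2f})")
```

Calling `exec` on model-generated code is unsafe outside an isolated environment, so a real harness would normally run each task in a sandboxed subprocess with a timeout.
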
|
| ### Reproducibility Files |
|
|
| Repository-relative references (from this model folder): |
|
|
| - `../Making_Runeforge/evaluate_general_models.py` |
| - `../Making_Runeforge/evaluate_coding_exec.py` |
| - `../Making_Runeforge/eval_general_runeforge_core_200.json` |
| - `../Making_Runeforge/eval_general_mistral_base_200.json` |
| - `../Making_Runeforge/eval_general_leaderboard.json` |
| - `../Making_Runeforge/runeforge_coding_exec_eval.json` |
|
|
| ## Intended Use |
|
|
| - General-purpose assistant and instruction-following use cases. |
| - Multiple-choice reasoning tasks, where it outscored the local Mistral baseline used in this project on the 200-sample macro average (0.6625 vs. 0.5875). |
| - Suitable as a base for additional task-specific fine-tuning where broad instruction quality is desired. |
|
|
| ## Limitations |
|
|
| - Reported metrics are from local, sampled benchmark runs (not full official leaderboard submissions). |
| - Quantized inference (`load_in_4bit`) was used for evaluation; scores may shift under different precision/runtime setups. |
| - The expanded 5-model comparison used only 30 samples per benchmark and should be treated as directional. |
| - A separate merged artifact (`runeforge_mk1_merged_from_7922`) showed severe degradation: 0.0000 on all four sampled general benchmarks and 0/5 on the executable coding sanity check. |
|
|
| ## Evaluation Notes |
|
|
| - The 200-sample comparison is the primary result set for this card. |
| - The 30-sample expanded table is included for breadth across additional local peer models. |
| - All benchmark scripts and JSON outputs are listed under Reproducibility Files above. |
|
|