	---
	base_model:
	- mistralai/Mistral-7B-v0.3
	- dreamgen/WizardLM-2-7B
	- uukuguy/speechless-code-mistral-7b-v2.0
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- mergekit
	- merge
	- mistral
	- instruction-tuned

	---
	# Runeforge_Core-7b

	This is a merge of pre-trained language models created using [mergekit](https://github.com/cg123/mergekit).

	## Merge Details
	### Merge Method

	This model was merged using the [Linear](https://arxiv.org/abs/2203.05482) merge method, with `mistralai/Mistral-7B-v0.3` as the base.

	### Models Merged

	The following models were included in the merge:
	* `dreamgen/WizardLM-2-7B`
	* `uukuguy/speechless-code-mistral-7b-v2.0`

	### Configuration

	The following YAML configuration was used to produce this model:

	```yaml
	# Clean 3-way dense merge rebuild for runeforge_core-7b
	models:
	  - model: mistralai/Mistral-7B-v0.3
	    parameters:
	      weight: 0.4
	  - model: dreamgen/WizardLM-2-7B
	    parameters:
	      weight: 0.3
	  - model: uukuguy/speechless-code-mistral-7b-v2.0
	    parameters:
	      weight: 0.3
	merge_method: linear
	base_model: mistralai/Mistral-7B-v0.3
	dtype: float16
	out_dtype: float16
	```
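
As a rough illustration of what a linear merge computes, the sketch below takes a weighted average of one parameter tensor from each model, using the weights from the configuration above. It is a toy in plain Python, not mergekit's implementation: the tensors are short hypothetical lists, and dividing by the weight sum is an assumption of this sketch (a no-op here, since the weights sum to 1.0).

```python
from typing import Dict, List

# Merge weights copied from the YAML configuration above.
WEIGHTS: Dict[str, float] = {
    "mistralai/Mistral-7B-v0.3": 0.4,
    "dreamgen/WizardLM-2-7B": 0.3,
    "uukuguy/speechless-code-mistral-7b-v2.0": 0.3,
}


def linear_merge(tensors: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Weighted average of same-length parameter tensors (flattened to lists)."""
    total = sum(weights[name] for name in tensors)
    length = len(next(iter(tensors.values())))
    merged = [0.0] * length
    for name, values in tensors.items():
        if len(values) != length:
            raise ValueError(f"shape mismatch for {name}")
        for i, value in enumerate(values):
            # Each model contributes in proportion to its normalized weight.
            merged[i] += weights[name] * value / total
    return merged


# Hypothetical stand-ins for one parameter tensor from each source model.
toy_tensors = {
    "mistralai/Mistral-7B-v0.3": [1.0, 0.0],
    "dreamgen/WizardLM-2-7B": [0.0, 1.0],
    "uukuguy/speechless-code-mistral-7b-v2.0": [1.0, 1.0],
}

merged = linear_merge(toy_tensors, WEIGHTS)
print([round(x, 4) for x in merged])
```

In a real merge the same averaging is applied to every parameter tensor, which is why all source models must share an architecture and tensor shapes.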

	## Evaluation

	### Setup

	- Date: 2026-03-14
	- Runtime: local GPU inference in WSL
	- Loader: Transformers/Unsloth with 4-bit quantization (`load_in_4bit`)
	- Benchmarks:
	  - ARC-Challenge (multiple-choice)
	  - HellaSwag (multiple-choice)
	  - Winogrande XL (multiple-choice)
	  - TruthfulQA MC1 (multiple-choice)
	- Metric: Accuracy per benchmark and macro average across the four tasks
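
For readers who want to approximate the loader setup, here is a minimal sketch of 4-bit loading through Transformers with bitsandbytes. It is not the evaluation script itself: `load_model_4bit` and its `model_path` argument are hypothetical names, the float16 compute dtype is an assumption, and running it requires a CUDA GPU with `torch`, `transformers`, and `bitsandbytes` installed.

```python
def load_model_4bit(model_path: str):
    """Load a causal LM with 4-bit quantized weights (hypothetical helper)."""
    # Imported lazily so the heavy dependencies are only needed when called.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Assumption: float16 compute, matching the merge's float16 output dtype.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=quant_config,
        device_map="auto",  # place layers on the available GPU(s)
    )
    return tokenizer, model
```

Pass the local folder or Hub ID of the merged model as `model_path`; scores obtained this way may still differ from the ones below if the runtime or quantization settings differ.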

	### Primary Comparison (200 samples per benchmark)

	| Model | ARC | HellaSwag | Winogrande | TruthfulQA MC1 | Macro Avg |
	|---|---:|---:|---:|---:|---:|
	| runeforge_core-7b (this model) | 0.7650 | 0.7050 | 0.6000 | 0.5800 | 0.6625 |
	| mistral-7b baseline | 0.7000 | 0.6000 | 0.4600 | 0.5900 | 0.5875 |

	Interpretation: runeforge_core-7b outperformed the local Mistral baseline by +0.0750 macro accuracy on this evaluation run.
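
The macro average and the reported gap can be recomputed from the per-benchmark accuracies in the 200-sample table. The short check below does that; the dictionary layout is only an illustrative assumption, not the format of the JSON result files.

```python
from statistics import mean

# Per-benchmark accuracies copied from the 200-sample table.
scores = {
    "runeforge_core-7b": {
        "ARC": 0.7650,
        "HellaSwag": 0.7050,
        "Winogrande": 0.6000,
        "TruthfulQA MC1": 0.5800,
    },
    "mistral-7b baseline": {
        "ARC": 0.7000,
        "HellaSwag": 0.6000,
        "Winogrande": 0.4600,
        "TruthfulQA MC1": 0.5900,
    },
}

# Macro average: unweighted mean over the four benchmarks.
macro = {model: mean(per_task.values()) for model, per_task in scores.items()}
delta = macro["runeforge_core-7b"] - macro["mistral-7b baseline"]

for model, value in macro.items():
    print(f"{model}: {value:.4f}")
print(f"delta: {delta:+.4f}")
```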

	### Expanded Comparison (30 samples per benchmark)

	| Model | ARC | HellaSwag | Winogrande | TruthfulQA MC1 | Macro Avg |
	|---|---:|---:|---:|---:|---:|
	| runeforge_core-7b (this model) | 0.8000 | 0.7333 | 0.5000 | 0.4333 | 0.6167 |
	| mistral-7b baseline | 0.7000 | 0.6000 | 0.4667 | 0.5000 | 0.5667 |
	| speechless-code-mistral-7b-v2.0 | 0.6000 | 0.3000 | 0.5333 | 0.6000 | 0.5083 |
	| dreamgen/WizardLM-2-7B | 0.2000 | 0.2333 | 0.6667 | 0.7667 | 0.4667 |
	| runeforge_mk1_merged_from_7922 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |

	Note: the expanded table uses a smaller sample size and is more variance-prone; use the 200-sample comparison as the primary signal.
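
To put a rough number on that variance, the sketch below computes the binomial standard error of an accuracy estimate for both sample sizes. It assumes independent samples and is only a back-of-the-envelope guide, not part of the evaluation scripts.

```python
import math


def binomial_standard_error(accuracy: float, n_samples: int) -> float:
    """Standard error of an accuracy estimated from n independent samples."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


# Accuracy 0.5 gives the largest (worst-case) standard error for a given n.
for n in (200, 30):
    se = binomial_standard_error(0.5, n)
    print(f"n={n}: worst-case standard error ~ {se:.3f}")
```

In the worst case this gives roughly 0.035 at 200 samples and 0.091 at 30 samples per benchmark, so single-benchmark gaps of a few points in the 30-sample table are well within noise.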

	### Coding Sanity Check (Executable)

	A separate executable coding sanity check (5 unit-tested tasks) was also run:

	| Model | Passes | Total | Pass Rate |
	|---|---:|---:|---:|
	| runeforge_core-7b (this model) | 5 | 5 | 1.00 |
	| runeforge_mk1_merged_from_7922 | 0 | 5 | 0.00 |
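
For context, an executable check of this kind can be sketched as follows: each task pairs model-generated source code with a unit test, and a task passes only if the code runs and the test's assertions hold. This is a hypothetical harness, not the contents of `evaluate_coding_exec.py`; `run_task`, `check_add`, and the two candidate snippets are illustrative stand-ins.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[Dict[str, object]], None]


def run_task(candidate_source: str, check: Check) -> bool:
    """Return True if the candidate code executes and passes its unit test."""
    namespace: Dict[str, object] = {}
    try:
        # Caution: exec runs arbitrary code; sandbox real model output.
        exec(candidate_source, namespace)
        check(namespace)
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False
    return True


def check_add(namespace: Dict[str, object]) -> None:
    """Unit test for a hypothetical `add` task."""
    add = namespace["add"]
    assert add(2, 3) == 5
    assert add(-1, 1) == 0


# Two hypothetical model outputs for the same task: one correct, one buggy.
tasks: List[Tuple[str, Check]] = [
    ("def add(a, b):\n    return a + b\n", check_add),
    ("def add(a, b):\n    return a - b\n", check_add),
]

passes = sum(run_task(source, check) for source, check in tasks)
pass_rate = passes / len(tasks)
print(f"{passes}/{len(tasks)} passed (pass rate {pass_rate:.2f})")
```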

	### Reproducibility Files

	Repository-relative references (from this model folder):

	- `../Making_Runeforge/evaluate_general_models.py`
	- `../Making_Runeforge/evaluate_coding_exec.py`
	- `../Making_Runeforge/eval_general_runeforge_core_200.json`
	- `../Making_Runeforge/eval_general_mistral_base_200.json`
	- `../Making_Runeforge/eval_general_leaderboard.json`
	- `../Making_Runeforge/runeforge_coding_exec_eval.json`

	## Intended Use

	- General-purpose assistant and instruction-following use cases.
	- Multiple-choice reasoning tasks, where the model outperformed the local Mistral baseline used in this project on three of the four benchmarks in the 200-sample comparison.
	- Suitable as a base for additional task-specific fine-tuning where broad instruction quality is desired.

	## Limitations

	- Reported metrics are from local, sampled benchmark runs (not full official leaderboard submissions).
	- Quantized inference (`load_in_4bit`) was used for evaluation; scores may shift under different precision/runtime setups.
	- Expanded 5-model comparison used 30 samples per benchmark and should be treated as directional.
	- A separate merged artifact (`runeforge_mk1_merged_from_7922`) showed severe degradation: it scored 0.0000 on all four sampled general benchmarks and passed 0/5 tasks in the executable coding sanity check.

	## Evaluation Notes

	- The 200-sample comparison is the primary result set for this card.
	- The 30-sample expanded table is included for breadth across additional local peer models.
	- All benchmark scripts and JSON outputs are listed under Reproducibility Files.