# eos_scanner README
Here is the `eos_scanner.py` tool.
This script is designed specifically to detect the "Silent Killers" of merges: **Token ID mismatches** and **Chat Template inconsistencies**.
### How to use it
1. Save the code below as `eos_scanner.py`.
2. Run it against your config:
```bash
python eos_scanner.py config36.yaml
```
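For reference, here is a minimal sketch of the core check the scanner performs. The helper name `scan_model` and the assumption that each model is a local Hugging Face directory are illustrative; the full script is more thorough.

```python
import json
import os

def scan_model(model_dir):
    """Read the EOS metadata a merge actually depends on:
    the declared eos_token_id and the vocab ID of the eos_token string."""
    def load(name):
        # Missing files are treated as empty configs rather than errors.
        path = os.path.join(model_dir, name)
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                return json.load(f)
        return {}

    gen_cfg = load("generation_config.json")
    tok_cfg = load("tokenizer_config.json")
    tok = load("tokenizer.json")

    declared_id = gen_cfg.get("eos_token_id")   # what generation stops on
    eos_string = tok_cfg.get("eos_token")       # e.g. "</s>"
    if isinstance(eos_string, dict):            # some configs nest it
        eos_string = eos_string.get("content")
    vocab = tok.get("model", {}).get("vocab", {})
    vocab_id = vocab.get(eos_string) if eos_string else None
    return {
        "declared": declared_id,
        "string": eos_string,
        "vocab_id": vocab_id,
        "match": declared_id is not None and declared_id == vocab_id,
    }
```

A model prints MATCH only when the declared ID and the vocabulary ID agree; a missing `generation_config.json` entry shows up as `declared: None`.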
### Regarding your YAML Question
You asked:
> Let me know if I should keep these lines, or null them out (and set tokenizer to base).
> ```yaml
> tokenizer:
>   source: union
>   chat_template: auto
> ```
**The Answer:**
Run the scanner above.
1. **If the scanner shows all GREEN (MATCH):**
* **Change to `source: base`**.
* *Why?* `source: union` attempts to merge vocabularies. If the vocabularies are already identical (which they likely are if they are all Mistral 24B derivatives), `union` adds computational overhead and, more dangerously, can accidentally re-index special tokens if one model has a slightly malformed `tokenizer.json`. Using `base` forces the merge to use the clean, working tokenizer from your base model.
2. **If the scanner shows RED (FAIL):**
* **Keep `source: union`** (or remove the red models).
* *Why?* If Model A uses token ID `2` for EOS, and Model B uses token ID `32000` for EOS, you *cannot* use `source: base`. You need `union` to handle the conflict, though this is exactly what causes "early termination" (the model generates ID 2 thinking it's a comma, but the tokenizer thinks it's EOS).
**Regarding `chat_template: auto`:**
It is generally safer to delete this line, or to set it to a specific template if you want consistent behavior. `auto` often defaults to the base model's template, but MergeKit sometimes synthesizes one instead. Since you are merging RP models, you likely want a specific template (such as Mistral V3 Tekken or ChatML). I recommend either removing `chat_template: auto` and letting your inference engine (Ollama/Kobold) handle the template, or explicitly setting it to the base model's template in the YAML.
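If you do want to pin it in the YAML, an explicit value looks like this (the `chatml` name is illustrative; check which template names or Jinja strings your MergeKit version accepts):

```yaml
tokenizer:
  source: base
  chat_template: chatml   # explicit, instead of auto
```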
---
### Critical Findings from Source Code Analysis
1. **The "Union" Risk (`mergekit/tokenizer/build.py`):**
When you use `source: union`, `mergekit` calculates a permutation map for *every* model, even if they are identical. It then runs `PermutedEmbeddings` (in `embed.py`). If there is even a tiny discrepancy in `added_tokens.json` or `special_tokens_map.json` between your donors, `union` might assign a new ID to the EOS token.
* **The Bug:** `mergekit` often copies the `generation_config.json` from the base model *without* updating the `eos_token_id` inside it to match the new `union` tokenizer. If `union` shifts EOS from ID `2` to ID `32000`, but `generation_config` still says `2`, your model will terminate early (or never).
2. **The "Auto" Template Risk (`mergekit/merge.py`):**
The `chat_template: auto` logic simply counts the most common template string among donors. If your base model (Mistral Small 3) has a specific template, but you merge 10 models that use a generic Llama 2 template, `auto` will overwrite your base template. This causes the model to see `<|im_start|>` (for example) but not know how to process it because the template changed.
### Updated `eos_scanner.py`
I have updated the script to perform an **Internal Consistency Check**. It now verifies if the `eos_token_id` defined in the config actually matches the ID of the `eos_token` string in the vocabulary.
### Final YAML Advice
Based on the code review, here is the safest configuration for your YAML.
**If the scanner returns all MATCH:**
```yaml
tokenizer:
  source: base
  # chat_template: auto  <-- DELETE THIS LINE COMPLETELY
```
**Why?**
1. `source: base` forces `mergekit` to skip the complex permutation logic in `build.py`. It simply copies the tokenizer files from your base model. This guarantees that `eos_token_id` `2` remains `2`.
2. Deleting `chat_template` prevents `mergekit` from synthesizing a template based on a popularity contest of the donors. It will default to copying the base model's template, which is exactly what you want for a consistent chat experience.
---
Yes, creating a `Gen_ID_Patcher.py` is a **highly effective** strategy for the models marked `BROKEN | MISSING | 2`.
### Why this works
The screenshot confirms that these models (like `ReadyArt...Broken-Tutu`, `Morax`, `FlareRebellion`) **actually use Token ID 2** in their vocabulary (Column: `Vocab ID`). They are just missing the metadata in `generation_config.json` that tells MergeKit "Hey, I use ID 2."
By patching this file, you convert these models from **BROKEN** to **MATCH**.
**The Strategic Benefit:**
If you patch these "Missing ID" models, and then **remove** the actual outliers (the ones with ID `999` or `<|endoftext|>`), you will achieve a **100% MATCH** across the board. This allows you to use `tokenizer: source: base`, which eliminates the early termination bug caused by `union`.
### The Script: `Gen_ID_Patcher.py`
This script will look at your YAML, find the models, and inject `eos_token_id: 2` (or whatever your base model uses) into their `generation_config.json`.
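A minimal sketch of the patching step itself (the YAML-walking and model-path resolution are omitted here; the helper name `patch_generation_config` and the `.bak` backup behavior are assumptions, not part of MergeKit):

```python
import json
import os
import shutil

def patch_generation_config(model_dir, eos_token_id=2):
    """Inject or overwrite eos_token_id in a model's generation_config.json,
    keeping a .bak copy of any existing file."""
    path = os.path.join(model_dir, "generation_config.json")
    cfg = {}
    if os.path.exists(path):
        shutil.copy2(path, path + ".bak")   # keep a backup of the original
        with open(path, encoding="utf-8") as f:
            cfg = json.load(f)
    cfg["eos_token_id"] = eos_token_id      # the metadata MergeKit reads
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cfg, f, indent=2)
    return cfg
```

Models whose vocabulary already maps EOS to ID `2` only need this metadata fix; it does not touch the tokenizer files themselves.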
### Instructions
1. Run the patcher:
```bash
python Gen_ID_Patcher.py config36.yaml
```
2. **Crucial Step:** Look at your `eos_scanner` output again. You must **remove** the models that are *genuinely* incompatible.
* **Remove:** `LatitudeGames--Hearthfire` (ID 999)
* **Remove:** `aixonlab--Eurydice` (ID 999)
* **Remove:** `Gryphe--Codex` (ID 999)
* **Remove:** `PocketDoc--Dans-PersonalityEngine` (declares ID `2`, but its EOS string is `<|endoftext|>`, which is a genuine conflict).
3. Run `eos_scanner.py` again.
4. If everything is green (MATCH), change your YAML to:
```yaml
tokenizer:
  source: base
```
(And delete `chat_template: auto`).
This path gives you the highest probability of a stable model that terminates generation correctly.