# eos_scanner README
Here is the `eos_scanner.py` tool.
This script is designed specifically to detect the "Silent Killers" of merges: **Token ID mismatches** and **Chat Template inconsistencies**.
### How to use it
1. Save the code below as `eos_scanner.py`.
2. Run it against your config:
```bash
python eos_scanner.py config36.yaml
```
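For reference, here is a minimal sketch of the core check the scanner performs. The helper name `scan_model` and the assumption that each model is a local Hugging Face directory are illustrative; the full script is more thorough.

```python
import json
import os

def scan_model(model_dir):
    """Read the EOS metadata a merge actually depends on:
    the declared eos_token_id and the vocab ID of the eos_token string."""
    def load(name):
        # Missing files are treated as empty configs rather than errors.
        path = os.path.join(model_dir, name)
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                return json.load(f)
        return {}

    gen_cfg = load("generation_config.json")
    tok_cfg = load("tokenizer_config.json")
    tok = load("tokenizer.json")

    declared_id = gen_cfg.get("eos_token_id")   # what generation stops on
    eos_string = tok_cfg.get("eos_token")       # e.g. "</s>"
    if isinstance(eos_string, dict):            # some configs nest it
        eos_string = eos_string.get("content")
    vocab = tok.get("model", {}).get("vocab", {})
    vocab_id = vocab.get(eos_string) if eos_string else None
    return {
        "declared": declared_id,
        "string": eos_string,
        "vocab_id": vocab_id,
        "match": declared_id is not None and declared_id == vocab_id,
    }
```

A model prints MATCH only when the declared ID and the vocabulary ID agree; a missing `generation_config.json` entry shows up as `declared: None`.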
### Regarding your YAML Question
You asked:
> Let me know if I should keep these lines, or null them out (and set tokenizer to base).
> ```yaml
> tokenizer:
>   source: union
>   chat_template: auto
> ```
**The Answer:**
Run the scanner above.
1. **If the scanner shows all GREEN (MATCH):**
* **Change to `source: base`**.
* *Why?* `source: union` attempts to merge vocabularies. If the vocabularies are already identical (which they likely are if they are all Mistral 24B derivatives), `union` adds computational overhead and, more dangerously, can accidentally re-index special tokens if one model has a slightly malformed `tokenizer.json`. Using `base` forces the merge to use the clean, working tokenizer from your base model.
2. **If the scanner shows RED (FAIL):**
* **Keep `source: union`** (or remove the red models).
* *Why?* If Model A uses token ID `2` for EOS, and Model B uses token ID `32000` for EOS, you *cannot* use `source: base`. You need `union` to handle the conflict, though this is exactly what causes "early termination" (the model generates ID 2 thinking it's a comma, but the tokenizer thinks it's EOS).
**Regarding `chat_template: auto`:**
It is generally safer to delete this line, or to set it to a specific template if you want consistent behavior. `auto` often defaults to the base model's template, but MergeKit sometimes synthesizes one instead. Since you are merging RP models, you likely want a specific template (such as Mistral V3 Tekken or ChatML). I recommend either removing `chat_template: auto` and letting your inference engine (Ollama/Kobold) handle the template, or explicitly setting it to the base model's template in the YAML.
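If you do want to pin it in the YAML, an explicit value looks like this (the `chatml` name is illustrative; check which template names or Jinja strings your MergeKit version accepts):

```yaml
tokenizer:
  source: base
  chat_template: chatml   # explicit, instead of auto
```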
---
### Critical Findings from Source Code Analysis
1. **The "Union" Risk (`mergekit/tokenizer/build.py`):**
When you use `source: union`, `mergekit` calculates a permutation map for *every* model, even if they are identical. It then runs `PermutedEmbeddings` (in `embed.py`). If there is even a tiny discrepancy in `added_tokens.json` or `special_tokens_map.json` between your donors, `union` might assign a new ID to the EOS token.
* **The Bug:** `mergekit` often copies the `generation_config.json` from the base model *without* updating the `eos_token_id` inside it to match the new `union` tokenizer. If `union` shifts EOS from ID `2` to ID `32000`, but `generation_config` still says `2`, your model will terminate early (or never).
2. **The "Auto" Template Risk (`mergekit/merge.py`):**
The `chat_template: auto` logic simply counts the most common template string among donors. If your base model (Mistral Small 3) has a specific template, but you merge 10 models that use a generic Llama 2 template, `auto` will overwrite your base template. This causes the model to see `<|im_start|>` (for example) but not know how to process it because the template changed.
### Updated `eos_scanner.py`
I have updated the script to perform an **Internal Consistency Check**. It now verifies if the `eos_token_id` defined in the config actually matches the ID of the `eos_token` string in the vocabulary.
### Final YAML Advice
Based on the code review, here is the safest configuration for your YAML.
**If the scanner returns all MATCH:**
```yaml
tokenizer:
  source: base
  # chat_template: auto  <-- DELETE THIS LINE COMPLETELY
```
**Why?**
1. `source: base` forces `mergekit` to skip the complex permutation logic in `build.py`. It simply copies the tokenizer files from your base model. This guarantees that `eos_token_id` `2` remains `2`.
2. Deleting `chat_template` prevents `mergekit` from synthesizing a template based on a popularity contest of the donors. It will default to copying the base model's template, which is exactly what you want for a consistent chat experience.
---
Yes, creating a `Gen_ID_Patcher.py` is a **highly effective** strategy for the models marked `BROKEN | MISSING | 2`.
### Why this works
The screenshot confirms that these models (like `ReadyArt...Broken-Tutu`, `Morax`, `FlareRebellion`) **actually use Token ID 2** in their vocabulary (Column: `Vocab ID`). They are just missing the metadata in `generation_config.json` that tells MergeKit "Hey, I use ID 2."
By patching this file, you convert these models from **BROKEN** to **MATCH**.
**The Strategic Benefit:**
If you patch these "Missing ID" models, and then **remove** the actual outliers (the ones with ID `999` or `<|endoftext|>`), you will achieve a **100% MATCH** across the board. This allows you to use `tokenizer: source: base`, which eliminates the early termination bug caused by `union`.
### The Script: `Gen_ID_Patcher.py`
This script will look at your YAML, find the models, and inject `eos_token_id: 2` (or whatever your base model uses) into their `generation_config.json`.
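A minimal sketch of the patching step itself (the YAML-walking and model-path resolution are omitted here; the helper name `patch_generation_config` and the `.bak` backup behavior are assumptions, not part of MergeKit):

```python
import json
import os
import shutil

def patch_generation_config(model_dir, eos_token_id=2):
    """Inject or overwrite eos_token_id in a model's generation_config.json,
    keeping a .bak copy of any existing file."""
    path = os.path.join(model_dir, "generation_config.json")
    cfg = {}
    if os.path.exists(path):
        shutil.copy2(path, path + ".bak")   # keep a backup of the original
        with open(path, encoding="utf-8") as f:
            cfg = json.load(f)
    cfg["eos_token_id"] = eos_token_id      # the metadata MergeKit reads
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cfg, f, indent=2)
    return cfg
```

Models whose vocabulary already maps EOS to ID `2` only need this metadata fix; it does not touch the tokenizer files themselves.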
### Instructions
1. Run the patcher:
```bash
python Gen_ID_Patcher.py config36.yaml
```
2. **Crucial Step:** Look at your `eos_scanner` output again. You must **remove** the models that are *genuinely* incompatible.
* **Remove:** `LatitudeGames--Hearthfire` (ID 999)
* **Remove:** `aixonlab--Eurydice` (ID 999)
* **Remove:** `Gryphe--Codex` (ID 999)
* **Remove:** `PocketDoc--Dans-PersonalityEngine` (declares ID `2`, but its EOS string is `<|endoftext|>`, which is a genuine conflict).
3. Run `eos_scanner.py` again.
4. If everything is green (MATCH), change your YAML to:
```yaml
tokenizer:
  source: base
```
(And delete `chat_template: auto`).
This path gives you the highest probability of a stable model that terminates generation correctly.