Here is the `eos_scanner.py` tool.

This script is designed specifically to detect the "Silent Killers" of merges: **Token ID mismatches** and **Chat Template inconsistencies**.

### How to use it

1. Save the code below as `eos_scanner.py`.
2. Run it against your config:

```bash
python eos_scanner.py config36.yaml
```
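A minimal sketch of what such a scanner could look like. Assumptions are mine throughout: every model entry in the YAML resolves to a *local* directory containing `generation_config.json` / `config.json` / `tokenizer_config.json`, and a naive line scan stands in for a real YAML parser (avoiding a PyYAML dependency). Function names like `eos_info` are illustrative, not mergekit API.

```python
#!/usr/bin/env python3
"""eos_scanner.py -- flag EOS token mismatches across merge donors.

Sketch only: assumes each model entry resolves to a local directory
with the usual Hugging Face config files, and scans the mergekit YAML
with a naive line parser instead of PyYAML.
"""
import json
import sys
from pathlib import Path


def read_json(path: Path):
    """Parsed JSON, or None when the file is absent."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return None


def eos_info(model_dir: Path):
    """(eos_token_id, eos_token string) as declared by the model's configs."""
    gen = read_json(model_dir / "generation_config.json") or {}
    cfg = read_json(model_dir / "config.json") or {}
    tok = read_json(model_dir / "tokenizer_config.json") or {}
    # generation_config wins; fall back to config.json.
    eos_id = gen.get("eos_token_id", cfg.get("eos_token_id"))
    eos_str = tok.get("eos_token")
    if isinstance(eos_str, dict):  # some tokenizers nest {"content": ...}
        eos_str = eos_str.get("content")
    return eos_id, eos_str


def model_entries(config_text: str):
    """Naive scan for 'base_model:' and '- model:' values; base model first."""
    base, donors = None, []
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("base_model:"):
            base = line.split(":", 1)[1].strip()
        elif line.startswith("- model:"):
            donors.append(line.split(":", 1)[1].strip())
    return ([base] if base else []) + donors


def main(config_path: str):
    entries = model_entries(Path(config_path).read_text(encoding="utf-8"))
    base_id, base_str = eos_info(Path(entries[0]))
    print(f"BASE  | {entries[0]} | id={base_id} str={base_str!r}")
    for entry in entries[1:]:
        eos_id, eos_str = eos_info(Path(entry))
        status = "MATCH" if (eos_id, eos_str) == (base_id, base_str) else "FAIL"
        print(f"{status:5} | {entry} | id={eos_id} str={eos_str!r}")


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

If your donors live in the Hugging Face cache rather than plain folders, point the entries at the snapshot directories (or swap in `huggingface_hub.snapshot_download`).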
### Regarding your YAML Question

You asked:

> Let me know if I should keep these lines, or null them out (and set tokenizer to base).
> ```yaml
> tokenizer:
>   source: union
>   chat_template: auto
> ```
**The Answer:**

Run the scanner above.

1. **If the scanner shows all GREEN (MATCH):**
   * **Change to `source: base`**.
   * *Why?* `source: union` attempts to merge vocabularies. If the vocabularies are already identical (which they likely are if they are all Mistral 24B derivatives), `union` adds computational overhead and, more dangerously, can accidentally re-index special tokens if one model has a slightly malformed `tokenizer.json`. Using `base` forces the merge to use the clean, working tokenizer from your base model.
2. **If the scanner shows RED (FAIL):**
   * **Keep `source: union`** (or remove the red models).
   * *Why?* If Model A uses token ID `2` for EOS, and Model B uses token ID `32000` for EOS, you *cannot* use `source: base`. You need `union` to handle the conflict, though this is exactly what causes "early termination" (the model generates ID 2 thinking it's a comma, but the tokenizer thinks it's EOS).

**Regarding `chat_template: auto`:**

It is generally safer to delete this line or set it to a specific file if you want consistent behavior. `auto` often defaults to the base model's template, but sometimes MergeKit tries to synthesize one. Since you are merging RP models, you likely want a specific template (like Mistral V3 Tekken or ChatML). I recommend removing `chat_template: auto` and letting your inference engine (Ollama/Kobold) handle the template, OR explicitly setting it to the base model's template in the YAML.
---
### Critical Findings from Source Code Analysis

1. **The "Union" Risk (`mergekit/tokenizer/build.py`):**
   When you use `source: union`, `mergekit` calculates a permutation map for *every* model, even if they are identical. It then runs `PermutedEmbeddings` (in `embed.py`). If there is even a tiny discrepancy in `added_tokens.json` or `special_tokens_map.json` between your donors, `union` might assign a new ID to the EOS token.
   * **The Bug:** `mergekit` often copies the `generation_config.json` from the base model *without* updating the `eos_token_id` inside it to match the new `union` tokenizer. If `union` shifts EOS from ID `2` to ID `32000`, but `generation_config` still says `2`, your model will terminate early (or never).
2. **The "Auto" Template Risk (`mergekit/merge.py`):**
   The `chat_template: auto` logic simply counts the most common template string among donors. If your base model (Mistral Small 3) has a specific template, but you merge 10 models that use a generic Llama 2 template, `auto` will overwrite your base template. This causes the model to see `<|im_start|>` (for example) but not know how to process it because the template changed.
### Updated `eos_scanner.py`

I have updated the script to perform an **Internal Consistency Check**. It now verifies that the `eos_token_id` defined in the config actually matches the ID of the `eos_token` string in the vocabulary.
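The added check could look something like this. A sketch under my own assumptions: `consistency_check` is an illustrative name, and the vocab is read from the standard `tokenizer.json` layout (`model.vocab` mapping token strings to IDs).

```python
import json
from pathlib import Path


def consistency_check(model_dir: Path):
    """Does the declared eos_token_id point at the eos_token string?

    Looks the EOS *string* up in tokenizer.json's vocab and compares
    that ID with the one declared in generation_config.json.
    Returns (ok, declared_id, vocab_id).
    """
    gen = json.loads((model_dir / "generation_config.json").read_text())
    tok_cfg = json.loads((model_dir / "tokenizer_config.json").read_text())
    tok = json.loads((model_dir / "tokenizer.json").read_text())

    eos_str = tok_cfg.get("eos_token")
    if isinstance(eos_str, dict):  # sometimes nested as {"content": ...}
        eos_str = eos_str.get("content")

    vocab_id = tok.get("model", {}).get("vocab", {}).get(eos_str)
    declared = gen.get("eos_token_id")
    if isinstance(declared, list):  # some configs declare several EOS ids
        ok = vocab_id in declared
    else:
        ok = vocab_id == declared
    return ok, declared, vocab_id
```

A model that declares `eos_token_id: 2` while its vocab maps `</s>` to `32000` would come back `(False, 2, 32000)` — exactly the early-termination trap described above.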
### Final YAML Advice

Based on the code review, here is the safest configuration for your YAML.

**If the scanner returns all MATCH:**

```yaml
tokenizer:
  source: base
  # chat_template: auto  <-- DELETE THIS LINE COMPLETELY
```

**Why?**

1. `source: base` forces `mergekit` to skip the complex permutation logic in `build.py`. It simply copies the tokenizer files from your base model. This guarantees that `eos_token_id` `2` remains `2`.
2. Deleting `chat_template` prevents `mergekit` from synthesizing a template based on a popularity contest of the donors. It will default to copying the base model's template, which is exactly what you want for a consistent chat experience.
---

Yes, creating a `Gen_ID_Patcher.py` is a **highly effective** strategy for the models marked `BROKEN | MISSING | 2`.

### Why this works

The screenshot confirms that these models (like `ReadyArt...Broken-Tutu`, `Morax`, `FlareRebellion`) **actually use Token ID 2** in their vocabulary (column: `Vocab ID`). They are just missing the metadata in `generation_config.json` that tells MergeKit "Hey, I use ID 2."

By patching this file, you convert these models from **BROKEN** to **MATCH**.

**The Strategic Benefit:**

If you patch these "Missing ID" models, and then **remove** the actual outliers (the ones with ID `999` or `<|endoftext|>`), you will achieve a **100% MATCH** across the board. This allows you to use `tokenizer: source: base`, which eliminates the early-termination bug caused by `union`.

### The Script: `Gen_ID_Patcher.py`

This script will look at your YAML, find the models, and inject `eos_token_id: 2` (or whatever your base model uses) into their `generation_config.json`.
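A sketch of the patcher, under the same assumptions as the scanner: model entries resolve to local directories, and a naive line scan of the YAML stands in for a real parser. `TARGET_EOS_ID` and `patch_model` are illustrative names; adjust the ID to match your base model.

```python
#!/usr/bin/env python3
"""Gen_ID_Patcher.py -- inject a missing eos_token_id into donor configs.

Sketch only: assumes '- model:' entries in the mergekit YAML resolve to
local directories. Never overwrites an eos_token_id that is already set.
"""
import json
import sys
from pathlib import Path

TARGET_EOS_ID = 2  # whatever your base model uses


def patch_model(model_dir: Path, eos_id: int = TARGET_EOS_ID) -> bool:
    """Add eos_token_id to generation_config.json if it is missing.

    Returns True when the file was modified; creates the file if the
    model ships without one.
    """
    path = model_dir / "generation_config.json"
    gen = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    if "eos_token_id" in gen:
        return False  # already declared; leave it alone
    gen["eos_token_id"] = eos_id
    path.write_text(json.dumps(gen, indent=2), encoding="utf-8")
    return True


def main(config_path: str):
    for raw in Path(config_path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if line.startswith("- model:"):
            model_dir = Path(line.split(":", 1)[1].strip())
            status = "PATCHED" if patch_model(model_dir) else "SKIPPED"
            print(f"{status} | {model_dir}")


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Because it refuses to touch an existing `eos_token_id`, rerunning it is safe; genuinely conflicting models (the ID-999 outliers) still have to be removed by hand.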
### Instructions

1. Run the patcher:
   ```bash
   python Gen_ID_Patcher.py config36.yaml
   ```
2. **Crucial Step:** Look at your `eos_scanner` output again. You must **remove** the models that are *genuinely* incompatible:
   * **Remove:** `LatitudeGames--Hearthfire` (ID 999)
   * **Remove:** `aixonlab--Eurydice` (ID 999)
   * **Remove:** `Gryphe--Codex` (ID 999)
   * **Remove:** `PocketDoc--Dans-PersonalityEngine` (ID 2, but the string is `<|endoftext|>`; this is a conflict)
3. Run `eos_scanner.py` again.
4. If everything is green (MATCH), change your YAML to:
   ```yaml
   tokenizer:
     source: base
   ```
   (And delete `chat_template: auto`.)
This path gives you the highest probability of a stable model that terminates generation correctly.