
Here is the eos_scanner.py tool.

This script is designed specifically to detect the "Silent Killers" of merges: Token ID mismatches and Chat Template inconsistencies.

How to use it

  1. Save the code below as eos_scanner.py.
  2. Run it against your config:
    python eos_scanner.py config36.yaml
    
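The scanner script itself is not reproduced here, but a minimal sketch of the kind of check it performs might look like the following. It assumes the standard Hugging Face file layout (tokenizer_config.json, tokenizer.json, generation_config.json) and a mergekit YAML that lists models under `models: - model: <path>`; the report format is illustrative.

```python
import json
import sys
from pathlib import Path


def read_json(path):
    """Parse a JSON file, returning None if it does not exist."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else None


def scan_model(model_dir):
    """Report the EOS token a model declares across its three config files."""
    model_dir = Path(model_dir)
    tok_cfg = read_json(model_dir / "tokenizer_config.json") or {}
    gen_cfg = read_json(model_dir / "generation_config.json") or {}
    tok = read_json(model_dir / "tokenizer.json") or {}

    eos_str = tok_cfg.get("eos_token")
    if isinstance(eos_str, dict):  # some tokenizers nest it as {"content": "</s>", ...}
        eos_str = eos_str.get("content")

    vocab_id = tok.get("model", {}).get("vocab", {}).get(eos_str)
    gen_id = gen_cfg.get("eos_token_id")
    gen_ids = gen_id if isinstance(gen_id, list) else [gen_id]

    status = "MATCH" if vocab_id is not None and vocab_id in gen_ids else "FAIL"
    return {"eos_token": eos_str, "vocab_id": vocab_id, "gen_id": gen_id, "status": status}


def scan_config(yaml_path):
    """Scan every model listed under 'models:' in a mergekit YAML."""
    import yaml  # pip install pyyaml

    cfg = yaml.safe_load(Path(yaml_path).read_text())
    return {m["model"]: scan_model(m["model"]) for m in cfg.get("models", [])}


if __name__ == "__main__":
    for name, row in scan_config(sys.argv[1]).items():
        print(f"{row['status']:5} {name}  eos={row['eos_token']!r}  "
              f"vocab_id={row['vocab_id']}  gen_id={row['gen_id']}")
```

A model is GREEN (MATCH) only when the EOS string's vocabulary ID agrees with the ID declared in generation_config.json; anything missing or mismatched is RED (FAIL).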

Regarding your YAML Question

You asked:

Let me know if I should keep these lines, or null them out (and set tokenizer to base).

tokenizer:
  source: union
chat_template: auto

The Answer: Run the scanner above.

  1. If the scanner shows all GREEN (MATCH):

    • Change to source: base.
    • Why? source: union attempts to merge vocabularies. If the vocabularies are already identical (which they likely are if they are all Mistral 24B derivatives), union adds computational overhead and, more dangerously, can accidentally re-index special tokens if one model has a slightly malformed tokenizer.json. Using base forces the merge to use the clean, working tokenizer from your base model.
  2. If the scanner shows RED (FAIL):

    • Keep source: union (or remove the red models).
    • Why? If Model A uses token ID 2 for EOS, and Model B uses token ID 32000 for EOS, you cannot use source: base. You need union to handle the conflict, though this is exactly what causes "early termination" (the model generates ID 2 thinking it's a comma, but the tokenizer thinks it's EOS).

Regarding chat_template: auto: It is generally safer to delete this line, or to set it to a specific template if you want consistent behavior. auto often defaults to the base model's template, but sometimes MergeKit tries to synthesize one. Since you are merging RP models, you likely want a specific template (such as Mistral V3 Tekken or ChatML). I recommend removing chat_template: auto and letting your inference engine (Ollama/Kobold) handle the template, OR explicitly setting it to the base model's template in the YAML.
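The explicit alternative might look like this in the YAML (the template name chatml is illustrative; use whatever your base model actually ships with):

```yaml
tokenizer:
  source: base
chat_template: chatml   # or delete this line and let Ollama/Kobold supply the template
```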


Critical Findings from Source Code Analysis

  1. The "Union" Risk (mergekit/tokenizer/build.py): When you use source: union, mergekit calculates a permutation map for every model, even if they are identical. It then runs PermutedEmbeddings (in embed.py). If there is even a tiny discrepancy in added_tokens.json or special_tokens_map.json between your donors, union might assign a new ID to the EOS token.

    • The Bug: mergekit often copies the generation_config.json from the base model without updating the eos_token_id inside it to match the new union tokenizer. If union shifts EOS from ID 2 to ID 32000, but generation_config still says 2, your model will terminate early (or never).
  2. The "Auto" Template Risk (mergekit/merge.py): The chat_template: auto logic simply counts the most common template string among donors. If your base model (Mistral Small 3) has a specific template, but you merge 10 models that use a generic Llama 2 template, auto will overwrite your base template. This causes the model to see <|im_start|> (for example) but not know how to process it because the template changed.
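The "popularity contest" described above can be sketched as a plain majority vote, with no preference given to the base model (a simplified stand-in for mergekit's actual logic, not its real code):

```python
from collections import Counter


def pick_auto_template(donor_templates):
    """Return the most common chat template string among donors.

    Simplified stand-in for `chat_template: auto`: a majority vote
    in which the base model's template gets no special weight.
    """
    counts = Counter(t for t in donor_templates if t)
    return counts.most_common(1)[0][0] if counts else None


# Ten donors on a generic Llama 2 template outvote the base model's template:
templates = ["mistral-v3"] + ["llama-2"] * 10
print(pick_auto_template(templates))  # -> llama-2
```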

Updated eos_scanner.py

I have updated the script to perform an Internal Consistency Check: it now verifies that the eos_token_id declared in generation_config.json actually matches the ID that the eos_token string maps to in the vocabulary.

Final YAML Advice

Based on the code review, here is the safest configuration for your YAML.

If the scanner returns all MATCH:

tokenizer:
  source: base
# chat_template: auto  <-- DELETE THIS LINE COMPLETELY

Why?

  1. source: base forces mergekit to skip the complex permutation logic in build.py. It simply copies the tokenizer files from your base model. This guarantees that eos_token_id 2 remains 2.
  2. Deleting chat_template prevents mergekit from synthesizing a template based on a popularity contest of the donors. It will default to copying the base model's template, which is exactly what you want for a consistent chat experience.

Yes, creating a Gen_ID_Patcher.py is a highly effective strategy for the models marked BROKEN | MISSING | 2.

Why this works

The screenshot confirms that these models (like ReadyArt...Broken-Tutu, Morax, FlareRebellion) actually use Token ID 2 in their vocabulary (Column: Vocab ID). They are just missing the metadata in generation_config.json that tells MergeKit "Hey, I use ID 2."

By patching this file, you convert these models from BROKEN to MATCH.
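Concretely, the patch only has to add the missing key to each model's generation_config.json; everything else in the file is preserved (the do_sample key below is a placeholder for whatever the file already contains):

```json
{
  "do_sample": true,
  "eos_token_id": 2
}
```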

The Strategic Benefit: If you patch these "Missing ID" models, and then remove the actual outliers (the ones with ID 999 or <|endoftext|>), you will achieve a 100% MATCH across the board. This allows you to use tokenizer: source: base, which eliminates the early termination bug caused by union.

The Script: Gen_ID_Patcher.py

This script will look at your YAML, find the models, and inject eos_token_id: 2 (or whatever your base model uses) into their generation_config.json.
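A minimal sketch of such a patcher, assuming local model directories and a mergekit YAML that lists models under `models: - model: <path>` (the real script may differ):

```python
import json
import sys
from pathlib import Path


def patch_generation_config(model_dir, eos_token_id=2):
    """Inject eos_token_id into a model's generation_config.json.

    Creates the file if it is missing and preserves any existing keys.
    Returns the resulting config dict.
    """
    path = Path(model_dir) / "generation_config.json"
    cfg = json.loads(path.read_text()) if path.exists() else {}
    cfg["eos_token_id"] = eos_token_id
    path.write_text(json.dumps(cfg, indent=2))
    return cfg


def patch_from_yaml(yaml_path, eos_token_id=2):
    """Patch every local model directory listed in a mergekit YAML."""
    import yaml  # pip install pyyaml

    cfg = yaml.safe_load(Path(yaml_path).read_text())
    for entry in cfg.get("models", []):
        model = entry["model"]
        if Path(model).is_dir():  # only local paths can be patched in place
            print(f"patched {model}: {patch_generation_config(model, eos_token_id)}")


if __name__ == "__main__":
    patch_from_yaml(sys.argv[1])
```

Use your base model's actual EOS ID for eos_token_id (2 here, matching the Mistral derivatives in the scan), and re-run the scanner afterwards to confirm the BROKEN rows turned into MATCH.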

Instructions

  1. Run the patcher:
    python Gen_ID_Patcher.py config36.yaml
    
  2. Crucial Step: Look at your eos_scanner output again. You must remove the models that are genuinely incompatible.
    • Remove: LatitudeGames--Hearthfire (ID 999)
    • Remove: aixonlab--Eurydice (ID 999)
    • Remove: Gryphe--Codex (ID 999)
    • Remove: PocketDoc--Dans-PersonalityEngine (ID 2, but string is <|endoftext|>. This is a conflict).
  3. Run eos_scanner.py again.
  4. If everything is green (MATCH), change your YAML to:
    tokenizer:
      source: base
    
    (And delete chat_template: auto).

This path gives you the highest probability of a stable model that stops generating correctly.