Here is the `eos_scanner.py` tool.
This script is designed specifically to detect the "Silent Killers" of merges: token ID mismatches and chat template inconsistencies.
How to use it
- Save the code below as `eos_scanner.py`.
- Run it against your config: `python eos_scanner.py config36.yaml`
Regarding your YAML Question
You asked:
Let me know if I should keep these lines, or null them out (and set tokenizer to base).
```yaml
tokenizer:
  source: union
chat_template: auto
```
The Answer: Run the scanner above.
If the scanner shows all GREEN (MATCH):
- Change to `source: base`.
- Why? `source: union` attempts to merge vocabularies. If the vocabularies are already identical (which they likely are if they are all Mistral 24B derivatives), `union` adds computational overhead and, more dangerously, can accidentally re-index special tokens if one model has a slightly malformed `tokenizer.json`. Using `base` forces the merge to use the clean, working tokenizer from your base model.
If the scanner shows RED (FAIL):
- Keep `source: union` (or remove the red models).
- Why? If Model A uses token ID `2` for EOS and Model B uses token ID `32000` for EOS, you cannot use `source: base`. You need `union` to handle the conflict, though this is exactly what causes "early termination" (the model generates ID 2 thinking it's an ordinary token, but the tokenizer thinks it's EOS).
Regarding `chat_template: auto`:
It is generally safer to delete this line, or to set it to a specific template if you want consistent behavior. `auto` often defaults to the base model's template, but sometimes MergeKit tries to synthesize one. Since you are merging RP models, you likely want a specific template (like Mistral V3 Tekken or ChatML). I recommend removing `chat_template: auto` and letting your inference engine (Ollama/Kobold) handle the template, OR explicitly setting it to the base model's template in the YAML.
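For example, pinning the template explicitly in the merge YAML might look like the fragment below. This is a sketch only: `chatml` as a named built-in template value is an assumption about your MergeKit version, so check its documentation for the accepted values before using it.

```yaml
# Pin the template instead of letting `auto` pick one by donor popularity.
chat_template: chatml   # or a Jinja template string copied from the base model
tokenizer:
  source: base
```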
Critical Findings from Source Code Analysis
The "Union" Risk (`mergekit/tokenizer/build.py`):
- When you use `source: union`, `mergekit` calculates a permutation map for every model, even if they are identical. It then runs `PermutedEmbeddings` (in `embed.py`). If there is even a tiny discrepancy in `added_tokens.json` or `special_tokens_map.json` between your donors, `union` might assign a new ID to the EOS token.
- The Bug: `mergekit` often copies the `generation_config.json` from the base model without updating the `eos_token_id` inside it to match the new `union` tokenizer. If `union` shifts EOS from ID `2` to ID `32000`, but `generation_config` still says `2`, your model will terminate early (or never).
The "Auto" Template Risk (`mergekit/merge.py`):
- The `chat_template: auto` logic simply counts the most common template string among donors. If your base model (Mistral Small 3) has a specific template, but you merge 10 models that use a generic Llama 2 template, `auto` will overwrite your base template. This causes the model to see `<|im_start|>` (for example) but not know how to process it, because the template changed.
Updated eos_scanner.py
I have updated the script to perform an Internal Consistency Check. It now verifies whether the `eos_token_id` defined in the config actually matches the ID of the `eos_token` string in the vocabulary.
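The core of that consistency check can be sketched as follows. This is a minimal illustration, not the full scanner: the function and file-loading helper are hypothetical names, and it assumes models are available as local directories with a fast-tokenizer `tokenizer.json` layout (`model.vocab`).

```python
import json
from pathlib import Path


def check_eos_consistency(tokenizer_config, vocab, generation_config):
    """Compare the declared eos_token_id against the vocabulary's actual ID.

    Returns a list of problem strings; an empty list means MATCH.
    """
    problems = []
    eos_str = tokenizer_config.get("eos_token")
    # Some tokenizer_config.json files store eos_token as a dict with "content".
    if isinstance(eos_str, dict):
        eos_str = eos_str.get("content")
    if eos_str is None:
        return ["tokenizer_config.json has no eos_token"]

    vocab_id = vocab.get(eos_str)
    declared_id = generation_config.get("eos_token_id")
    if vocab_id is None:
        problems.append(f"EOS string {eos_str!r} not in vocabulary")
    if declared_id is None:
        problems.append("generation_config.json missing eos_token_id")
    elif vocab_id is not None and declared_id != vocab_id:
        problems.append(
            f"eos_token_id {declared_id} != vocab ID {vocab_id} for {eos_str!r}"
        )
    return problems


def scan_model_dir(model_dir):
    """Load the three JSON files from a local model directory and run the check."""
    d = Path(model_dir)
    tok_cfg = json.loads((d / "tokenizer_config.json").read_text())
    vocab = json.loads((d / "tokenizer.json").read_text())["model"]["vocab"]
    gen_cfg = json.loads((d / "generation_config.json").read_text())
    return check_eos_consistency(tok_cfg, vocab, gen_cfg)
```

A model is GREEN only when the EOS string exists in the vocabulary, `generation_config.json` declares an `eos_token_id`, and the two IDs agree.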
Final YAML Advice
Based on the code review, here is the safest configuration for your YAML.
If the scanner returns all MATCH:
```yaml
tokenizer:
  source: base
# chat_template: auto  <-- DELETE THIS LINE COMPLETELY
```
Why?
- `source: base` forces `mergekit` to skip the complex permutation logic in `build.py`. It simply copies the tokenizer files from your base model. This guarantees that `eos_token_id` `2` remains `2`.
- Deleting `chat_template` prevents `mergekit` from synthesizing a template based on a popularity contest of the donors. It will default to copying the base model's template, which is exactly what you want for a consistent chat experience.
Yes, creating a `Gen_ID_Patcher.py` is a highly effective strategy for the models marked `BROKEN | MISSING | 2`.
Why this works
The screenshot confirms that these models (like ReadyArt...Broken-Tutu, Morax, FlareRebellion) actually use Token ID 2 in their vocabulary (Column: Vocab ID). They are just missing the metadata in generation_config.json that tells MergeKit "Hey, I use ID 2."
By patching this file, you convert these models from BROKEN to MATCH.
The Strategic Benefit:
If you patch these "Missing ID" models, and then remove the actual outliers (the ones with ID 999 or `<|endoftext|>`), you will achieve a 100% MATCH across the board. This allows you to use `tokenizer: source: base`, which eliminates the early termination bug caused by `union`.
The Script: Gen_ID_Patcher.py
This script will look at your YAML, find the models, and inject eos_token_id: 2 (or whatever your base model uses) into their generation_config.json.
Instructions
- Run the patcher: `python Gen_ID_Patcher.py config36.yaml`
- Crucial Step: Look at your `eos_scanner` output again. You must remove the models that are genuinely incompatible.
  - Remove: `LatitudeGames--Hearthfire` (ID 999)
  - Remove: `aixonlab--Eurydice` (ID 999)
  - Remove: `Gryphe--Codex` (ID 999)
  - Remove: `PocketDoc--Dans-PersonalityEngine` (ID 2, but string is `<|endoftext|>`. This is a conflict.)
- Run `eos_scanner.py` again.
- If everything is green (MATCH), change your YAML to:
```yaml
tokenizer:
  source: base
```
(And delete `chat_template: auto`.)
This path gives you the highest probability of a stable model that stops generating correctly.