# Explanation of the Files in Each Model Directory
This directory contains the outputs of a fine-tuned large language model. Below is an explanation of each file:
---
- **added_tokens.json**
Contains any custom tokens that were added to the tokenizer beyond the base vocabulary (e.g., special domain-specific words or symbols).
- **config.json**
Stores the model architecture and hyperparameters (e.g., number of layers, hidden size, attention heads). This is needed to reload the model correctly.
- **merges.txt**
Used by Byte-Pair Encoding (BPE) tokenizers. Contains merge rules for combining subword units into larger tokens.
- **model.safetensors**
The main file containing the model’s learned weights in the safetensors format. Unlike PyTorch’s pickle-based .bin checkpoints, a safetensors file cannot execute arbitrary code when it is loaded, and it loads quickly.
- **model.safetensors.index.json**
An index file for the safetensors weights that maps each weight tensor to the file that stores it. It is used only when the weights are sharded across multiple files.
- **special_tokens_map.json**
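Maps each special-token role (e.g., cls_token, sep_token, pad_token) to the token string the tokenizer uses for it (like [CLS], [SEP], [PAD], etc.). The integer IDs of those tokens come from the vocabulary, not from this file.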
- **spm.model**
SentencePiece model file, used for tokenization if SentencePiece is the tokenizer (common in multilingual or T5-style models).
- **tokenizer_config.json**
Stores configuration settings for the tokenizer (e.g., lowercasing, normalization, special token handling).
- **tokenizer.json**
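Contains the full tokenizer definition (vocabulary, merge rules, normalization, and pre-tokenization steps) in a single JSON file. It is read by the "fast" tokenizers built on the Hugging Face tokenizers library.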
- **vocab.json**
The vocabulary file mapping tokens to their integer IDs (used by some tokenizers, especially BPE).
- **vocab.txt**
A plain text vocabulary file, listing all tokens (used by some tokenizers, especially WordPiece).
---
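Not every model directory contains every file listed above; which ones appear depends on the tokenizer type and on whether the weights are sharded. The following is a minimal sketch, using only the Python standard library, for checking what a given directory holds. The function names `inventory` and `shard_files` and the role labels are illustrative choices, not part of any library. The sketch assumes the index file follows the usual layout in which a `weight_map` object maps tensor names to shard file names.

```python
import json
from pathlib import Path

# File names taken from the list above, grouped by the role each one plays.
KNOWN_FILES = {
    "model config": ["config.json"],
    "weights": ["model.safetensors", "model.safetensors.index.json"],
    "tokenizer": [
        "added_tokens.json", "merges.txt", "special_tokens_map.json",
        "spm.model", "tokenizer_config.json", "tokenizer.json",
        "vocab.json", "vocab.txt",
    ],
}


def inventory(model_dir):
    """Return {role: [file names from KNOWN_FILES present in model_dir]}."""
    root = Path(model_dir)
    return {
        role: [name for name in names if (root / name).is_file()]
        for role, names in KNOWN_FILES.items()
    }


def shard_files(model_dir):
    """Return the sorted shard file names listed in the safetensors index.

    Returns an empty list when the directory has no index file, which is
    the expected case for weights stored in a single model.safetensors.
    """
    index_path = Path(model_dir) / "model.safetensors.index.json"
    if not index_path.is_file():
        return []
    with index_path.open(encoding="utf-8") as f:
        index = json.load(f)
    # "weight_map" maps each tensor name to the shard file that stores it
    # (assumed layout; a missing key is treated as "no shards").
    return sorted(set(index.get("weight_map", {}).values()))
```

Calling `inventory("path/to/model_dir")` (a placeholder path) returns the files found under each role, which makes it easy to see, for example, whether a directory ships a SentencePiece model or a BPE vocabulary.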
These files together allow you to reload the fine-tuned model, preprocess text in the same way as during training, and ensure compatibility with downstream tasks.
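As a rough illustration of reloading, the sketch below assumes the Hugging Face `transformers` package is installed and that the directory was written in its standard layout. The helper name `load_model_and_tokenizer` is hypothetical. `AutoModel` is used here only because the task head of these models is not stated; a task-specific class (for example a sequence-classification one) may be the right choice depending on how the model was fine-tuned.

```python
from pathlib import Path


def load_model_and_tokenizer(model_dir):
    """Reload the tokenizer and model saved in model_dir.

    Assumption: model_dir holds config.json, the weight file(s), and the
    tokenizer files described above.
    """
    model_dir = Path(model_dir)
    # Fail early with a clear message if the directory is not a model folder.
    if not (model_dir / "config.json").is_file():
        raise FileNotFoundError(f"config.json not found in {model_dir}")

    # Imported lazily so the check above works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    # The tokenizer files ensure text is preprocessed as it was in training.
    tokenizer = AutoTokenizer.from_pretrained(str(model_dir))
    # config.json selects the architecture; the safetensors file(s) supply weights.
    model = AutoModel.from_pretrained(str(model_dir))
    return tokenizer, model
```

A call such as `tokenizer, model = load_model_and_tokenizer("path/to/model_dir")` (placeholder path) would then give both objects, ready for inference.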