# Explanation of Files in Each Model Directory

This directory contains the outputs of a fine-tuned large language model. Below is an explanation of each file:

---

- **added_tokens.json**
  Contains any custom tokens that were added to the tokenizer beyond the base vocabulary (e.g., special domain-specific words or symbols).

- **config.json**
  Stores the model architecture and hyperparameters (e.g., number of layers, hidden size, attention heads). This is needed to reload the model correctly.

- **merges.txt**
  Used by Byte-Pair Encoding (BPE) tokenizers. Contains the ordered merge rules for combining subword units into larger tokens.

- **model.safetensors**
  The main file containing the model’s learned weights in the safetensors format (a safe and fast alternative to PyTorch’s pickle-based .bin format).

- **model.safetensors.index.json**
  An index file for the safetensors weights, used when the model is sharded across multiple files. It maps each parameter name to the shard file that stores it.

- **special_tokens_map.json**
  Maps special-token roles (e.g., `cls_token`, `sep_token`, `pad_token`) to the token strings the tokenizer uses for them (like [CLS], [SEP], [PAD]).

- **spm.model**
  SentencePiece model file, used for tokenization if SentencePiece is the tokenizer (common in multilingual or T5-style models).

- **tokenizer_config.json**
  Stores configuration settings for the tokenizer (e.g., lowercasing, normalization, special token handling).

- **tokenizer.json**
  Contains the full tokenizer definition (vocabulary, merge rules, normalization, and pre-tokenization) in a single JSON file, often used for fast loading.

- **vocab.json**
  The vocabulary file mapping tokens to their integer IDs (used by some tokenizers, especially BPE).

- **vocab.txt**
  A plain text vocabulary file listing one token per line (used by some tokenizers, especially WordPiece).

---

Together, these files allow you to reload the fine-tuned model, preprocess text the same way as during training, and stay compatible with downstream tasks.
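
As a minimal sketch of how the files listed above could be checked in practice, the snippet below scans a model directory and reports which of them are present. A given directory will usually contain only a subset, since the tokenizer files depend on the tokenizer type. The function name `describe_model_dir` and the shape of its return value are hypothetical and chosen only for illustration. The sketch also assumes the index file stores its parameter-to-shard mapping under a `weight_map` key, as in the Hugging Face sharded checkpoint layout.

```python
import json
from pathlib import Path

# File names taken from the list above.
KNOWN_FILES = [
    "added_tokens.json",
    "config.json",
    "merges.txt",
    "model.safetensors",
    "model.safetensors.index.json",
    "special_tokens_map.json",
    "spm.model",
    "tokenizer_config.json",
    "tokenizer.json",
    "vocab.json",
    "vocab.txt",
]


def describe_model_dir(model_dir):
    """Summarize which known files exist in model_dir (hypothetical helper)."""
    root = Path(model_dir)
    present = [name for name in KNOWN_FILES if (root / name).is_file()]
    missing = [name for name in KNOWN_FILES if name not in present]

    # For sharded models, collect the distinct shard file names from the index.
    # Assumption: the index JSON has a "weight_map" of parameter name -> shard file.
    shards = []
    index_path = root / "model.safetensors.index.json"
    if index_path.is_file():
        with index_path.open(encoding="utf-8") as f:
            index = json.load(f)
        shards = sorted(set(index.get("weight_map", {}).values()))

    return {"present": present, "missing": missing, "shards": shards}
```

Calling `describe_model_dir("path/to/model")` returns a dictionary with the keys `present`, `missing`, and `shards`. A missing file is not necessarily an error: for example, a WordPiece tokenizer would typically ship `vocab.txt` but not `merges.txt` or `spm.model`.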