---
library_name: transformers
license: apache-2.0
base_model: mistralai/Mistral-Small-24B-Base-2501
tags:
- generated_from_trainer
datasets:
- david-ar/synthetic-irc-data
language:
- en
pipeline_tag: text-generation
---

[Built with Axolotl](https://github.com/axolotl-ai-cloud/axolotl)
<details><summary>See axolotl config</summary>

axolotl version: `0.8.0.dev0`
```yaml
# Base model configuration
base_model: mistralai/Mistral-Small-24B-Base-2501
model_type: MistralForCausalLM
tokenizer_type: AutoTokenizer
trust_remote_code: true
tokenizer_use_fast: true

# Device mapping for multi-GPU
device_map: "balanced"

# Memory settings
load_in_4bit: true
load_in_8bit: false
bf16: true
low_cpu_mem_usage: true

# Advanced optimizations
flash_attention: true
gradient_checkpointing: true

# Dataset configuration
datasets:
  - path: david-ar/synthetic-irc-data
    type: completion

# Output directory
output_dir: ./outputs/public-irc-mistral-24b

val_set_size: 0.05  # 75 conversations for validation
dataset_prepared_path: last_run_prepared

# Sequence settings
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
train_on_inputs: true
eval_sample_packing: false

# LoRA configuration
adapter: lora
lora_r: 128
lora_alpha: 256
lora_dropout: 0.1
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - down_proj
  - up_proj

# Training hyperparameters - adjusted for smaller dataset
micro_batch_size: 1
gradient_accumulation_steps: 16
num_epochs: 4  # Increased from 2, but with careful monitoring
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.00008  # Same conservative LR
weight_decay: 0.01
warmup_ratio: 0.05

# Performance monitoring
group_by_length: true
shuffle_merged_datasets: true
include_tokens_per_second: true

# Weights & Biases - public project
wandb_project: public-irc-mistral-24b
wandb_entity: davidar
wandb_name: synthetic-irc-data
wandb_log_model: "false"

# Mistral model configuration
is_mistral_derived_model: true

# Early stopping
load_best_model_at_end: true
metric_for_best_model: "loss"
greater_is_better: false
```

</details>
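Axolotl assembles the adapter and quantization setup internally from the config above. For readers reproducing the setup outside Axolotl, the following is only an illustrative sketch of roughly equivalent PEFT/bitsandbytes calls, not the training code that was actually run:

```python
# Sketch: approximate PEFT / bitsandbytes equivalent of the Axolotl
# adapter + 4-bit settings above. Illustrative only.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load_in_4bit: true
    bnb_4bit_compute_dtype=torch.bfloat16,  # bf16: true
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Small-24B-Base-2501",
    quantization_config=bnb_config,
    device_map="balanced",                  # device_map: "balanced"
    torch_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "down_proj", "up_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```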

# Mistral-24B-Synthetic-IRC

This model is a fine-tuned version of [mistralai/Mistral-Small-24B-Base-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501) on the [david-ar/synthetic-irc-data](https://huggingface.co/datasets/david-ar/synthetic-irc-data) dataset, producing a model that generates natural IRC/Discord-style conversations.

## Model Description

This model was trained to replicate authentic IRC (Internet Relay Chat) conversational dynamics, moving away from the typical AI-assistant pattern toward more natural, community-style interaction. The model learns from synthetic conversations featuring multiple participants, including "Em", an AI character who participates as a community member rather than as an assistant.

### Key Characteristics

- **Natural conversation flow**: Handles interruptions, topic drift, and multi-party dynamics
- **Non-assistant behavior**: Doesn't default to helpful/servile responses
- **Community-style interaction**: Captures the casual, authentic feel of IRC/Discord chats
- **Character embedding**: Includes Em's personality (a self-aware AI who isn't an assistant)

## Intended Uses & Limitations

### Intended Uses

- **Conversational AI research**: Studying non-assistant interaction patterns
- **Chat bot development**: Creating more natural, less formal conversational agents
- **Character-based models**: Foundation for further character-specific fine-tuning
- **IRC/Discord bots**: Generating contextually appropriate responses in chat environments

### Limitations

- **Small dataset**: Trained on only ~10 MB of synthetic data (1,500 conversations)
- **Synthetic nature**: While carefully crafted, the training data isn't from real IRC logs
- **Single community style**: Represents one particular chat community culture
- **Overfitting**: Validation loss indicates overfitting after ~50 steps (the best checkpoint was used)
- **English only**: No multilingual capability

## Training and Evaluation Data

### Dataset

- **Source**: [david-ar/synthetic-irc-data](https://huggingface.co/datasets/david-ar/synthetic-irc-data)
- **Size**: 1,500 synthetic IRC-style conversations
- **Format**: Multi-party conversations with 80-120 messages each
- **Split**: 95% training (1,425 conversations), 5% validation (75 conversations)

### Data Characteristics

- Natural IRC formatting: `<username> message content` (a prompting sketch in this format follows below)
- Multiple participants per conversation (3-7 users)
- Diverse topics and conversation styles
- Embedded character personality throughout
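The following is a minimal inference sketch. The adapter repo id `david-ar/Mistral-24B-Synthetic-IRC` and the sampling settings are assumptions (the card does not state where the adapter is published); substitute your actual adapter path. Prompts should mimic the raw `<username>` log format the model was trained on:

```python
# Sketch: prompting the fine-tuned model with an IRC-formatted context.
# The adapter repo id below is an assumption; replace it with the
# actual adapter path or a local checkpoint directory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "mistralai/Mistral-Small-24B-Base-2501"
adapter_id = "david-ar/Mistral-24B-Synthetic-IRC"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)

# Build a multi-party IRC-style context and let the model continue it.
prompt = (
    "<mira> anyone else get bitten by the new kernel update?\n"
    "<jt> yeah, wifi driver broke again\n"
    "<Em> "
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=80,
    do_sample=True,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```

Because this is a completion-style model rather than a chat model, generation simply continues the log; truncating at the first newline (or supplying a stop string) keeps the output to a single message.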
## Training Procedure

### Training Configuration

- **Method**: LoRA (Low-Rank Adaptation) fine-tuning
- **LoRA rank**: 128 (with alpha 256)
- **Base model**: Mistral-Small-24B-Base-2501
- **Hardware**: 2x NVIDIA A40 GPUs (96 GB total VRAM)
- **Training time**: ~3 hours

### Training Hyperparameters

The following hyperparameters were used during training:

- learning_rate: 8e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 16
- total_train_batch_size: 32
- total_eval_batch_size: 2
- optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 4
- num_epochs: 4.0
- sequence_length: 4096
- sample_packing: true

### Training Results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.9145        | 0.9746 | 24   | 0.9128          |
| 0.6565        | 1.9746 | 48   | **0.8936**      |
| 0.4671        | 2.9746 | 72   | 0.9503          |
| 0.3594        | 3.9746 | 96   | 0.9871          |

**Note**: The best checkpoint, at step 48 (lowest validation loss), was used for the final model.

### Training Observations

- Quick convergence due to the small dataset size
- Validation loss indicates overfitting after ~50 steps
- The model successfully learned IRC conversation patterns
- Character traits were embedded despite the limited data

## Technical Details

### Architecture

- **Base model**: Mistral-Small-24B-Base-2501
- **Parameter count**: 24B (base) + LoRA adapters
- **Context length**: 4096 tokens
- **Quantization**: 4-bit during training (memory optimization)

### Framework Versions

- PEFT 0.14.0
- Transformers 4.49.0
- PyTorch 2.5.1+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0
- Axolotl 0.8.0.dev0

## Limitations and Biases

1. **Overfitting**: With only 1,500 training examples, the model shows signs of overfitting
2. **Limited diversity**: May not generalize well to very different chat styles
3. **Character leakage**: Em's personality traits may appear even when not intended
4. **Synthetic artifacts**: Might exhibit patterns specific to the generation process
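## Merging the Adapter

For deployment it can be convenient to fold the LoRA weights into the base model so it loads as a single checkpoint. A minimal sketch using PEFT's `merge_and_unload` follows; the adapter repo id is again an assumption, and merging should be done on unquantized bf16 weights rather than the 4-bit training configuration:

```python
# Sketch: merge the LoRA adapter into the base weights for deployment.
# The adapter repo id is an assumption; substitute the real path.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Small-24B-Base-2501", torch_dtype=torch.bfloat16
)
merged = PeftModel.from_pretrained(base, "david-ar/Mistral-24B-Synthetic-IRC")
merged = merged.merge_and_unload()  # folds LoRA deltas into the base weights
merged.save_pretrained("./mistral-24b-synthetic-irc-merged")
```

The merged directory can then be loaded with plain `AutoModelForCausalLM.from_pretrained`, without PEFT installed.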