KitsuVp committed on
Commit 9a7ec7b · verified · 1 Parent(s): 3404b3b

Model save

Files changed (4):
  1. README.md +49 -183
  2. config.json +6 -7
  3. model.safetensors +2 -2
  4. training_args.bin +1 -1
README.md CHANGED
@@ -1,192 +1,58 @@
 ---
 library_name: transformers
 tags:
-- pytorch
-- neollm
-- hybrid-attention
-- fanformer
-- gated-delta-networks
-- polynomial-activations
-- fineweb-edu
-- ademamix
-- custom-scheduler
-- flash-attention
-- torch-compile
-pipeline_tag: text-generation
+- generated_from_trainer
 model-index:
 - name: NeoLLM
-  results:
-  - task:
-      type: text-generation
-      name: Text Generation
-    dataset:
-      type: multiple-choice
-      name: ARC-Easy
-    metrics:
-    - type: accuracy
-      value: 39.14
-  - task:
-      type: text-generation
-      name: Text Generation
-    dataset:
-      type: multiple-choice
-      name: HellaSwag
-    metrics:
-    - type: accuracy
-      value: 26.55
-  - task:
-      type: text-generation
-      name: Text Generation
-    dataset:
-      type: multiple-choice
-      name: MMLU
-    metrics:
-    - type: accuracy
-      value: 24.25
-  - task:
-      type: text-generation
-      name: Text Generation
-    dataset:
-      type: multiple-choice
-      name: ARC-Challenge
-    metrics:
-    - type: accuracy
-      value: 17.24
-license: apache-2.0
-datasets:
-- HuggingFaceFW/fineweb-edu
-language:
-- en
+  results: []
 ---
 
+<!-- This model card has been generated automatically according to the information the Trainer had access to. You
+should probably proofread and complete it, then remove this comment. -->
+
 # NeoLLM
 
-NeoLLM is a hybrid architecture language model that combines multiple state-of-the-art techniques for efficient and effective language modeling. This 110M parameter model demonstrates novel architectural innovations including Fourier Analysis Networks, hybrid attention mechanisms, and advanced normalization techniques.
-
-## Model Description
-
-NeoLLM incorporates several cutting-edge components:
-
-- **FANformer Integration**: Fourier Analysis Network (FAN) layers for effective periodicity modeling with fan_ratio of 0.125
-- **Hybrid Attention Architecture**: Follows Qwen3-Next's approach with 1 full attention layer per 3 linear attention layers
-- **Polynomial Composition Activations**: PolyNorm activation functions in MLP layers for enhanced dynamics
-- **Advanced Normalization**: LayerNorm Scaling (LNS) and Gradient-Preserving Activation Scaling (GPAS)
-- **Efficient Linear Attention**: Gated Delta Networks for improved computational efficiency
-
-## Installation
-
-Before using this model, install the required dependencies:
-
-```bash
-pip install git+https://github.com/huggingface/transformers.git@main
-pip install "cut-cross-entropy @ git+https://github.com/apple/ml-cross-entropy.git"
-pip install flash-linear-attention
-```
-
-### Architecture Details
-
-- **Model Size**: 110M parameters (77M embeddings + 33M non-embeddings)
-- **Hidden Size**: 512
-- **Layers**: 12 layers with hybrid attention pattern
-- **Attention Heads**: 8 (with 2 KV heads using Grouped Query Attention)
-- **Intermediate Size**: 1024
-- **Sequence Length**: 512 tokens
-- **Vocabulary**: 151,665 tokens (Qwen3 tokenizer)
-
-### Layer Pattern
-The model uses a hybrid attention pattern where layers alternate between:
-- **Linear Attention**: Layers 1,2,3,5,6,7,9,10,11 (Gated Delta Networks)
-- **Full Attention**: Layers 4,8,12 (Flash Attention 2)
-
-## Training Details
-
-### Dataset
-- **Source**: FineWeb-Edu (sample-10BT subset)
-- **Training Samples**: 4 million examples
-- **Validation Split**: 1% (40,000 samples)
-- **Text Processing**: Dynamic truncation to 4x block_size during tokenization
-- **Tokenizer**: Qwen3 Fast Tokenizer with weight tying enabled
-
-### Training Configuration
-- **Hardware**: NVIDIA RTX 5090
-- **Training Time**: 3 hours
-- **Loss Function**: Cut Your Losses (from "Cut Your Losses in Large-Vocabulary Language Models") - NOT standard Cross-Entropy
-- **Optimizer**: AdEMAMix with parameters:
-  - Betas: (0.9, 0.999, 0.999)
-  - Alpha: 5.0
-  - t_alpha: 5000, t_beta3: 5000
-  - Weight decay: 0.1
-- **Learning Rate Schedule**: Custom cosine with linear warmup
-  - Start LR: 3e-4
-  - Peak LR: 6e-4 (at 5000 warmup steps)
-  - Min LR: 6e-5
-- **Batch Size**: 64 per device
-- **Precision**: BF16 with torch.compile optimization
-- **Hardware Optimizations**: Flash Attention 2
-- **Epochs**: 1
-
-### Framework Versions
-- **PyTorch**: 2.8.0+cu129
-- **Transformers**: 4.57.0.dev0
-- **Flash Attention**: 2.x
-- **CUDA**: 12.9
-
-## Evaluation Results
-
-### Benchmark Performance (1-shot evaluation)
-
-| Task | Score |
-|------|-------|
-| ARC-Easy | 39.14% |
-| HellaSwag | 26.55% |
-| MMLU | 24.25% |
-| ARC-Challenge | 17.24% |
-
-*All evaluations performed in few-shot (1-shot) setting*
-
-## Model Architecture Components
-
-### Fourier Analysis Network (FANLayer)
-Based on "FANformer: Improving Large Language Models Through Effective Periodicity Modeling":
-```
-FANLayer'(X) = [cos(WpX)||sin(WpX)||(WpX + Bp)]
-```
-
-### LayerNorm Scaling (LNS)
-Implements scaling factor 1/√ℓ as described in "The Curse of Depth in Large Language Models":
-```
-h^(ℓ) = LayerNorm(h^(ℓ)) × (1/√ℓ)
-```
-
-### Gradient-Preserving Activation Scaling (GPAS)
-Scales activations without penalizing gradients using stop-gradient operations.
-
-### Polynomial Composition Activations (PolyNorm)
-Custom activation function based on "Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models".
-
-### Gated Delta Networks
-Linear attention mechanism from "Gated Delta Networks: Improving Mamba2 with Delta Rule" for efficient sequence modeling.
-
-## Intended Uses & Limitations
-
-### Intended Uses
-- Research into hybrid attention architectures
-- Educational purposes for understanding advanced LLM components
-- Small-scale language modeling experiments
-- Benchmarking novel architectural components
-
-### Limitations
-- Relatively small model size (110M parameters) limits capability compared to larger models
-- Training limited to 4M samples from single dataset
-- Performance below state-of-the-art models on standard benchmarks
-- Experimental architecture may have stability considerations in production
-
-### Recommendations
-- Best suited for research and educational applications
-- Consider fine-tuning for specific downstream tasks
-- Monitor performance carefully if adapting for production use
-
-## Training Infrastructure
-
-- **Mixed Precision**: BF16 for numerical stability
-- **Compilation**: torch.compile with max-autotune mode
+This model is a fine-tuned version of [](https://huggingface.co/) on an unknown dataset.
+It achieves the following results on the evaluation set:
+- Loss: 3.8652
+
+## Model description
+
+More information needed
+
+## Intended uses & limitations
+
+More information needed
+
+## Training and evaluation data
+
+More information needed
+
+## Training procedure
+
+### Training hyperparameters
+
+The following hyperparameters were used during training:
+- learning_rate: 0.0006
+- train_batch_size: 64
+- eval_batch_size: 64
+- seed: 42
+- optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
+- lr_scheduler_type: linear
+- lr_scheduler_warmup_ratio: 0.1
+- num_epochs: 1
+
+### Training results
+
+| Training Loss | Epoch | Step | Validation Loss |
+|:-------------:|:------:|:----:|:---------------:|
+| 4.2056 | 0.3840 | 3000 | 4.2055 |
+| 3.8841 | 0.7680 | 6000 | 3.8652 |
+
+### Framework versions
+
+- Transformers 4.57.0.dev0
+- Pytorch 2.8.0+cu129
+- Datasets 3.6.0
+- Tokenizers 0.22.1
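The removed model card describes the original run's schedule: linear warmup from a start LR of 3e-4 to a 6e-4 peak at 5,000 steps, then cosine decay to a 6e-5 floor. As a minimal sketch of what such a schedule computes (the function name `lr_at_step` and its signature are illustrative, not the repo's actual code):

```python
import math

def lr_at_step(step, total_steps, warmup_steps=5000,
               start_lr=3e-4, peak_lr=6e-4, min_lr=6e-5):
    """Linear warmup from start_lr to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        frac = step / warmup_steps
        return start_lr + frac * (peak_lr - start_lr)
    # Cosine decay over the remaining steps, ending at min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Note that the auto-generated card above instead reports `lr_scheduler_type: linear` with a 0.1 warmup ratio, so this sketch matches the earlier card's description, not the saved Trainer arguments.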
config.json CHANGED
@@ -2,15 +2,14 @@
   "architectures": [
     "NeoLLMForCausalLM"
   ],
-  "auto_map": {
-    "AutoConfig": "configuration_neollm.NeoLLMConfig",
-    "AutoModel": "modeling_neollm.NeoLLMModel",
-    "AutoModelForCausalLM": "modeling_neollm.NeoLLMForCausalLM"
-  },
   "attention_bias": false,
   "attention_dropout": 0.1,
+  "auto_map": {
+    "AutoConfig": "configuration_unified.UnifiedModelConfig",
+    "AutoModel": "modeling_unified.UnifiedModel",
+    "AutoModelForCausalLM": "modeling_unified.UnifiedModel"
+  },
   "dropout_rate": 0.1,
-
   "dtype": "bfloat16",
   "eos_token_id": 151645,
   "fan_ratio": 0.125,
@@ -18,7 +17,7 @@
   "hidden_act": "xielu",
   "hidden_size": 512,
   "initializer_range": 0.02,
-  "intermediate_size": 1024,
+  "intermediate_size": 1536,
   "layer_types": [
     "linear_attention",
     "linear_attention",
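The `layer_types` array being diffed here encodes the hybrid pattern from the removed model card: full attention on layers 4, 8 and 12, linear attention (Gated Delta Networks) elsewhere. A small sketch of how such a list can be generated; `make_layer_types` is an illustrative helper, and the `"full_attention"` label is an assumption since the hunk only shows the `"linear_attention"` entries:

```python
def make_layer_types(num_layers=12, full_attn_every=4):
    """One full-attention layer per (full_attn_every - 1) linear-attention layers."""
    return [
        "full_attention" if (i + 1) % full_attn_every == 0 else "linear_attention"
        for i in range(num_layers)
    ]

layer_types = make_layer_types()
# Full attention lands on layers 4, 8 and 12 (1-indexed), matching the card.
```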
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:dc073ea209fede7edbf7c6eaf470b935fcafade692d86dfcb001e52da1df45e7
-size 219053832
+oid sha256:673a2ad3e9fb95397d7c50a0d7023b13ddd589eb5b9205b3370e9da8be1d4991
+size 231636744
training_args.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:bd062dd4a82d5ccb7b2eb217167c2361df73816e8fecd313bccdfc47eca850b0
+oid sha256:1a7ed46ac173cd670ec0cb96d3ba813baf4fad6c4f51be08a8e3127610528168
 size 5969
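For reference, the FANLayer formula quoted in the removed model card, FANLayer'(X) = [cos(WpX)||sin(WpX)||(WpX + Bp)], concatenates periodic and linear branches of a shared projection. A literal, single-vector reading of that formula in plain Python (the real FANformer splits hidden dimensions by `fan_ratio` and uses learned projections; this is only a sketch of the quoted equation):

```python
import math

def fan_layer(x, Wp, Bp):
    """FANLayer'(x) = [cos(Wp x) || sin(Wp x) || (Wp x + Bp)] for one vector x.

    Wp is a list of rows (output dim x input dim), Bp a bias per output dim.
    """
    z = [sum(w * xi for w, xi in zip(row, x)) for row in Wp]  # Wp x
    return ([math.cos(v) for v in z]
            + [math.sin(v) for v in z]
            + [v + b for v, b in zip(z, Bp)])
```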