Upload README.md with huggingface_hub

This model predicts the performance of neural network configurations using scaling laws.
**NCPL-intermediate** (Neural Configuration to Performance Scaling Law - Intermediate) is a specialized forecasting model that:

- Takes pretraining configurations as input
- Predicts intermediate performance metrics using learned scaling law patterns
- Combines text embeddings from a base transformer with numeric value processing through a dedicated MLP
- Supports multiple scaling law formulations (Marin, StepLaw)
The model consists of:

- Linear layer mapping from hidden_size to scalar predictions
- Outputs performance forecasts for each token position
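The per-token prediction head described above can be sketched in plain Python. This is a minimal illustration of the idea (names and sizes here are assumptions, not the model's actual implementation): each token's hidden state is mapped to a single scalar by a linear layer, giving one forecast per sequence position.

```python
# Sketch of a per-token scalar prediction head: each hidden state
# (one row per token) is mapped to one scalar by a linear layer.
# Names and dimensions are illustrative assumptions.

def linear_head(hidden_states, weights, bias):
    """hidden_states: seq_len x hidden_size list of lists;
    weights: one vector of length hidden_size; bias: scalar.
    Returns one scalar prediction per token position."""
    return [
        sum(h_i * w_i for h_i, w_i in zip(h, weights)) + bias
        for h in hidden_states
    ]

# Toy example: hidden_size = 3, sequence of 2 tokens.
hidden = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5]]
w = [0.1, 0.2, 0.3]
preds = linear_head(hidden, w, bias=0.05)
print(preds)  # one scalar forecast per token position
```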
### Key Features

- **Hybrid Input Processing**: Combines text tokens and numeric values seamlessly
- **Token-level Predictions**: Generates predictions at each sequence position
- **FP32 Precision**: Trained in full float32 precision for numerical stability
- **Intermediate Predictions**: Capable of predicting intermediate performance checkpoints
## Training Data

The model was trained on:

- Weight decay: 0.01
- Loss: MSE (Mean Squared Error)
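As a quick illustration of the MSE objective listed above (the helper name is ours, not part of the training code):

```python
# Mean squared error between predicted and observed performance.
def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

print(mse([0.75, 0.35], [0.8, 0.3]))  # ~0.0025
```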
### Checkpoint Information

- **Epoch**: 46
- **Training iterations**: 4800
- **Validation loss**: 0.005730564706027508
- **Checkpoint path**: `checkpoints/fp32_@['marin', 'steplaw']_qwen_intermediate_residual_nts1ep10_s2ep400_s1lr5e-05_s2lr1e-05_wd0.01_bs480_rs42_20260216_095527/checkpoints/checkpoint_min_val_loss.pt`
## Usage
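The full usage snippet is omitted here. As a rough, hypothetical sketch of the input formatting the model expects (numeric tokens replaced and masked, as noted in the Limitations section), preprocessing might look like the following. The `<NUM>` placeholder, the regex, and the function name are all assumptions, not the model's actual API:

```python
import re

# Hypothetical preprocessing sketch: replace each number in a
# configuration string with a placeholder token, and keep the
# numeric values plus a mask marking which tokens are numeric.
# The "<NUM>" token and this exact scheme are assumptions.

NUM_RE = re.compile(r"\d+(?:\.\d+)?(?:e-?\d+)?")

def extract_numeric_inputs(config_text):
    values = [float(m) for m in NUM_RE.findall(config_text)]
    text = NUM_RE.sub("<NUM>", config_text)
    tokens = text.split()
    mask = [1 if "<NUM>" in tok else 0 for tok in tokens]
    return tokens, values, mask

tokens, values, mask = extract_numeric_inputs("lr 5e-05 wd 0.01 bs 480")
print(tokens)  # ['lr', '<NUM>', 'wd', '<NUM>', 'bs', '<NUM>']
print(values)  # [5e-05, 0.01, 480.0]
```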
## Limitations

- Trained specifically on Marin and StepLaw datasets; generalization to other settings likely requires at least fine-tuning
- Requires properly formatted inputs with numeric tokens replaced and masked
- Performance predictions are probabilistic estimates based on training data patterns
- Best suited for configurations within the training distribution
## Training Procedure

### Two-Stage Training

**Stage 1** (10 epochs):

- Learning rate: 5e-5
- Base model frozen
- Trains only the numeric MLP and prediction head
- Warmup ratio: 0.1

**Stage 2** (400 epochs):

- Learning rate: 1e-5
- Full model fine-tuning
- All parameters trainable
- Warmup steps: 1000
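The two stages above can be sketched as a parameter-selection rule. Module names such as `base_model`, `numeric_mlp`, and `prediction_head` are illustrative assumptions, not the actual checkpoint layout:

```python
# Sketch of stage-wise parameter selection: stage 1 freezes the
# base model and trains only the new modules; stage 2 fine-tunes
# everything. Module names are illustrative assumptions.

ALL_PARAMS = [
    "base_model.layer0.weight",
    "base_model.layer1.weight",
    "numeric_mlp.fc.weight",
    "prediction_head.weight",
]

def trainable_params(stage):
    if stage == 1:
        # Base model frozen: only numeric MLP and prediction head.
        return [p for p in ALL_PARAMS
                if not p.startswith("base_model.")]
    # Stage 2: full fine-tuning, all parameters trainable.
    return list(ALL_PARAMS)

print(trainable_params(1))  # numeric MLP and head only
print(len(trainable_params(2)))  # all 4 parameters
```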
### Training Configuration

- Optimizer: AdamW (β1=0.9, β2=0.99)
- Gradient clipping: 1.0
- Loss function: Mean Squared Error (MSE)
- Distributed training: FSDP (Fully Sharded Data Parallel)
- Precision: FP32
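For concreteness, a single AdamW update with these hyperparameters can be sketched for one scalar parameter. This is a toy illustration of the decoupled weight-decay update rule, not the actual FSDP training loop:

```python
import math

# One AdamW step for a single scalar parameter, using the
# hyperparameters above: beta1=0.9, beta2=0.99, weight decay 0.01,
# gradient clipping at norm 1.0. Illustrative sketch only.

def adamw_step(theta, grad, m, v, t,
               lr=1e-5, beta1=0.9, beta2=0.99,
               eps=1e-8, weight_decay=0.01, clip_norm=1.0):
    # Clip the gradient to a maximum norm of clip_norm.
    norm = abs(grad)
    if norm > clip_norm:
        grad = grad * clip_norm / norm
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias-corrected estimates.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay, as in AdamW.
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps)
                          + weight_decay * theta)
    return theta, m, v

theta, m, v = adamw_step(theta=0.5, grad=3.0, m=0.0, v=0.0, t=1)
print(theta)  # parameter nudged against the (clipped) gradient
```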
## Citation

If you use this model in your research, please cite:

```
url = {https://www.arxiv.org/abs/2602.10300}
}
```
## Model Card Authors

OptimizerStudy Team

## Model Card Contact

For questions or issues, please open an issue in the [repository](https://github.com/OptimizerStudy/Configuration-to-Performance-Scaling-Law).