---
license: apache-2.0
base_model: Qwen/Qwen3-1.7B
tags:
- scaling-laws
- neural-scaling
- performance-prediction
- configuration-to-performance
- pytorch
library_name: transformers
---
|
|
|
|
|
# NCPL-intermediate: Neural Configuration to Performance Scaling Law |
|
|
|
|
|
This model predicts the performance of neural network configurations using scaling laws. It is trained on the Marin and StepLaw datasets to forecast performance metrics based on model configurations. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
**NCPL-intermediate** (Neural Configuration to Performance Scaling Law - Intermediate) is a specialized forecasting model that: |
|
|
|
|
|
- Takes pretraining configurations as input |
|
|
- Predicts intermediate performance metrics using learned scaling law patterns |
|
|
- Combines text embeddings from a base transformer with numeric value processing through a dedicated MLP |
|
|
- Supports multiple scaling law formulations (Marin, StepLaw) |
|
|
|
|
|
### Architecture |
|
|
|
|
|
The model consists of: |
|
|
|
|
|
1. **Base Model**: Qwen/Qwen3-1.7B |
|
|
- Provides contextual embeddings for text tokens |
|
|
|
|
|
2. **Numeric MLP**: |
|
|
- Processes numeric values (performance metrics, configuration parameters) |
|
|
- Projects numeric inputs to the same hidden dimension as text embeddings |
|
|
- Architecture: Linear(1 → 2*hidden_size) → ReLU → Linear(2*hidden_size → hidden_size) |
|
|
|
|
|
3. **Prediction Head**: |
|
|
- Linear layer mapping from hidden_size to scalar predictions |
|
|
- Outputs performance forecasts for each token position |
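
The numeric path described above can be sketched as a small PyTorch module. This is a minimal illustration, not the repository's exact code: the toy `hidden_size`, the random stand-in embeddings, and the `torch.where` fusion step are assumptions for demonstration; the real model uses the base model's hidden dimension and its own fusion logic.

```python
import torch
import torch.nn as nn

class NumericMLP(nn.Module):
    """Projects scalar values into the text embedding space:
    Linear(1 -> 2*hidden) -> ReLU -> Linear(2*hidden -> hidden)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, 2 * hidden_size),
            nn.ReLU(),
            nn.Linear(2 * hidden_size, hidden_size),
        )

    def forward(self, values: torch.Tensor) -> torch.Tensor:
        # values: (batch, seq_len) scalars -> (batch, seq_len, hidden_size)
        return self.net(values.unsqueeze(-1))

# Fuse numeric embeddings into the token embeddings at masked positions.
hidden_size = 64  # toy size; the real model uses the base model's hidden size
mlp = NumericMLP(hidden_size)
text_emb = torch.randn(2, 5, hidden_size)  # stand-in for base-model embeddings
is_number_mask = torch.tensor([[0, 1, 0, 0, 1],
                               [1, 0, 0, 0, 0]], dtype=torch.bool)
number_values = torch.tensor([[0.0, 3.5, 0.0, 0.0, 1e9],
                              [0.01, 0.0, 0.0, 0.0, 0.0]])
num_emb = mlp(number_values)
fused = torch.where(is_number_mask.unsqueeze(-1), num_emb, text_emb)
print(fused.shape)  # torch.Size([2, 5, 64])
```

Only the masked positions receive MLP-projected values; all other positions keep their original text embeddings.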
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model was trained on: |
|
|
|
|
|
- **Datasets**: Marin and StepLaw scaling law datasets |
|
|
- **Training configuration**: |
|
|
- Stage 1: 10 epochs with learning rate 5e-5 (frozen base model) |
|
|
- Stage 2: 400 epochs with learning rate 1e-5 (full fine-tuning) |
|
|
- Batch size: 480 (across 8 GPUs) |
|
|
- Weight decay: 0.01 |
|
|
- Loss: MSE (Mean Squared Error) |
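
The two-stage schedule can be sketched as follows. This is a simplified illustration under stated assumptions: a toy linear model stands in for the full forecaster, epoch counts are shortened, and AdamW is an assumed optimizer choice; the learning rates, weight decay, and MSE loss follow the configuration above.

```python
import torch
import torch.nn as nn

# Toy stand-in: a "base" encoder plus a scalar prediction head.
base = nn.Linear(8, 8)
head = nn.Linear(8, 1)
model = nn.Sequential(base, head)
loss_fn = nn.MSELoss()

x, y = torch.randn(16, 8), torch.randn(16, 1)

# Stage 1: freeze the base model, train only the head (lr 5e-5).
for p in base.parameters():
    p.requires_grad = False
opt = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=5e-5, weight_decay=0.01,
)
for _ in range(10):  # 10 epochs in the schedule above
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

# Stage 2: unfreeze everything and fine-tune end to end (lr 1e-5).
for p in base.parameters():
    p.requires_grad = True
opt = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
for _ in range(5):  # 400 epochs in the actual run; shortened here
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
```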
|
|
|
|
|
## Usage |
|
|
|
|
|
The `ScalingLawForecaster` class can be found in the [GitHub repository](https://github.com/zhqwqwq/Configuration-to-Performance-Scaling-Law). |
|
|
|
|
|
```python
import torch
from transformers import AutoTokenizer

# Get ScalingLawForecaster from: https://github.com/zhqwqwq/Configuration-to-Performance-Scaling-Law
from model import ScalingLawForecaster

# Load model
model = ScalingLawForecaster(
    base_model_name="Qwen/Qwen3-1.7B",
    init_from_pretrained=True,
    force_fp32=True,
)

# Load checkpoint (map to CPU first; move to GPU afterwards if desired)
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

# Prepare inputs:
# - input_ids: tokenized text sequence
# - is_number_mask: boolean mask indicating which tokens are numeric
# - number_values_filled: actual numeric values (0 for non-numeric tokens)

with torch.no_grad():
    predictions = model(
        input_ids=input_ids,
        is_number_mask=is_number_mask,
        number_values_filled=number_values_filled,
        attention_mask=attention_mask,
    )
```
|
|
|
|
|
## Input Format |
|
|
|
|
|
The model expects three key inputs: |
|
|
|
|
|
1. **input_ids** (torch.LongTensor): Tokenized sequence with special numeric tokens |
|
|
2. **is_number_mask** (torch.BoolTensor): Boolean mask marking numeric token positions |
|
|
3. **number_values_filled** (torch.FloatTensor): Actual numeric values at marked positions |
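
One plausible way to build these three tensors from a configuration string is sketched below. This is illustrative only: the `<num>` placeholder token, the regex, and the whitespace "tokenizer" are assumptions for demonstration; the repository defines the actual numeric-token scheme, and the real pipeline uses the Qwen3 tokenizer.

```python
import re
import torch

text = "layers: 24 heads: 16 lr: 0.0003"
NUM_TOKEN = "<num>"  # hypothetical placeholder; the repo defines the real one
NUM_RE = r"\d+\.?\d*(?:[eE][+-]?\d+)?"

# Replace each numeric literal with a placeholder and remember its value.
values = [float(m) for m in re.findall(NUM_RE, text)]
templated = re.sub(NUM_RE, NUM_TOKEN, text)

# Toy whitespace "tokenizer" standing in for the real Qwen3 tokenizer.
tokens = templated.split()
is_number_mask = torch.tensor([t == NUM_TOKEN for t in tokens])
number_values_filled = torch.zeros(len(tokens))
number_values_filled[is_number_mask] = torch.tensor(values)

print(tokens)                # ['layers:', '<num>', 'heads:', '<num>', 'lr:', '<num>']
print(number_values_filled)  # zeros except 24.0, 16.0, 0.0003 at masked positions
```

The key invariant is that `is_number_mask` and `number_values_filled` align with the token sequence: positions holding the placeholder token carry their original numeric value, and every other position carries 0.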
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is designed for: |
|
|
|
|
|
- **Scaling law research**: Understanding how neural network performance scales with configuration |
|
|
- **Performance forecasting**: Predicting model performance before full training |
|
|
- **Configuration optimization**: Finding optimal hyperparameters based on scaling patterns |
|
|
- **Resource planning**: Estimating computational requirements for different model sizes |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Trained specifically on the Marin and StepLaw datasets; generalization to other settings likely requires at least fine-tuning
|
|
- Requires properly formatted inputs with numeric tokens replaced and masked |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model in your research, please cite: |
|
|
|
|
|
```bibtex
@article{ncpl2026,
  title   = {Neural Configuration to Performance Scaling Law},
  author  = {Huaqing Zhang and Kaiyue Wen and Tengyu Ma},
  journal = {arXiv preprint arXiv:2602.10300},
  year    = {2026},
  url     = {https://www.arxiv.org/abs/2602.10300}
}
```
|
|
|