---
license: apache-2.0
base_model: Qwen/Qwen3-1.7B
tags:
- scaling-laws
- neural-scaling
- performance-prediction
- configuration-to-performance
- pytorch
library_name: transformers
---

# NCPL-final: Neural Configuration to Performance Scaling Law

This model predicts the final performance of a pretraining run from its configuration. It is trained on the Marin and StepLaw scaling law datasets.

## Model Description

**NCPL-final** (Neural Configuration to Performance Scaling Law - Final) is a specialized forecasting model that:

- Takes pretraining configurations as input
- Predicts final performance metrics using learned scaling law patterns
- Combines text embeddings from a base transformer with numeric value processing through a dedicated MLP
- Supports multiple scaling law formulations (Marin, StepLaw)
- **Focuses on final performance only** (unlike NCPL-intermediate, which predicts performance at intermediate checkpoints)

### Architecture

The model consists of:

1. **Base Model**: Qwen/Qwen3-1.7B
   - Provides contextual embeddings for text tokens

2. **Numeric MLP**:
   - Processes numeric values (performance metrics, configuration parameters)
   - Projects numeric inputs to the same hidden dimension as text embeddings
   - Architecture: Linear(1 → 2*hidden_size) → ReLU → Linear(2*hidden_size → hidden_size)

3. **Prediction Head**:
   - Linear layer mapping from hidden_size to scalar predictions
   - Outputs performance forecasts for each token position
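The numeric MLP and prediction head described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the repository's implementation: the class name `NumericMLP`, the helper `merge_embeddings`, and the rule of substituting the numeric embedding at masked positions are assumptions, and the base transformer is skipped entirely so the example stays small.

```python
import torch
import torch.nn as nn


class NumericMLP(nn.Module):
    """Projects one scalar per position to the text embedding width."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Linear(1 -> 2*hidden_size) -> ReLU -> Linear(2*hidden_size -> hidden_size)
        self.net = nn.Sequential(
            nn.Linear(1, 2 * hidden_size),
            nn.ReLU(),
            nn.Linear(2 * hidden_size, hidden_size),
        )

    def forward(self, values: torch.Tensor) -> torch.Tensor:
        # values: (batch, seq_len) -> (batch, seq_len, hidden_size)
        return self.net(values.unsqueeze(-1))


def merge_embeddings(text_emb, num_emb, is_number_mask):
    # Assumed merge rule: take the numeric embedding wherever the mask is True.
    return torch.where(is_number_mask.unsqueeze(-1), num_emb, text_emb)


torch.manual_seed(0)
hidden_size = 16  # toy width, not the real hidden size of Qwen3-1.7B
batch, seq_len = 2, 5

text_emb = torch.randn(batch, seq_len, hidden_size)  # stand-in for token embeddings
values = torch.tensor([[0.0, 3e-4, 0.0, 0.0, 1.7],
                       [0.0, 0.0, 256.0, 0.0, 0.0]])
is_number_mask = values != 0  # toy mask for this example only

numeric_mlp = NumericMLP(hidden_size)
head = nn.Linear(hidden_size, 1)  # prediction head: hidden_size -> scalar

merged = merge_embeddings(text_emb, numeric_mlp(values), is_number_mask)
# In the real model the merged embeddings would pass through the base
# transformer before the head; that step is omitted in this sketch.
predictions = head(merged).squeeze(-1)  # one scalar per token position
```

The output has one forecast per token position, matching the description of the prediction head.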

## Training Data

The model was trained on:

- **Datasets**: Marin and StepLaw scaling law datasets (final performance only)
- **Training configuration**:
  - Stage 1: 20 epochs with learning rate 5e-5 (frozen base model)
  - Stage 2: 1000 epochs with learning rate 1e-5 (full fine-tuning)
  - Batch size: 480 (across 8 GPUs)
  - Weight decay: 0.01
  - Loss: MSE (Mean Squared Error)
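The two-stage schedule can be illustrated with a toy model: stage 1 trains only the new layers while the base is frozen, and stage 2 unfreezes everything at a lower learning rate. The sketch below is hedged throughout: `ToyForecaster` and `run_stage` are hypothetical names, the optimizer (AdamW) is an assumption since the card does not name one, and the epoch counts are shortened from 20 and 1000 so the example runs instantly.

```python
import torch
import torch.nn as nn


class ToyForecaster(nn.Module):
    """Tiny stand-in: `base` plays the role of the pretrained transformer."""

    def __init__(self, hidden_size: int = 8):
        super().__init__()
        self.base = nn.Linear(hidden_size, hidden_size)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        return self.head(torch.tanh(self.base(x))).squeeze(-1)


def run_stage(model, x, y, lr, epochs, freeze_base):
    # Freeze or unfreeze the base, then optimize only trainable parameters.
    for p in model.base.parameters():
        p.requires_grad_(not freeze_base)
    trainable = [p for p in model.parameters() if p.requires_grad]
    # AdamW is an assumption; weight decay 0.01 and MSE loss follow the card.
    optimizer = torch.optim.AdamW(trainable, lr=lr, weight_decay=0.01)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    return loss.item()


torch.manual_seed(0)
model = ToyForecaster()
x, y = torch.randn(32, 8), torch.randn(32)
base_before = model.base.weight.detach().clone()

# Stage 1: frozen base, learning rate 5e-5 (20 epochs in the card, 2 here).
run_stage(model, x, y, lr=5e-5, epochs=2, freeze_base=True)
base_unchanged_after_stage1 = torch.equal(base_before, model.base.weight)

# Stage 2: full fine-tuning, learning rate 1e-5 (1000 epochs in the card, 2 here).
run_stage(model, x, y, lr=1e-5, epochs=2, freeze_base=False)
base_changed_after_stage2 = not torch.equal(base_before, model.base.weight)
```

Building a fresh optimizer per stage keeps frozen parameters out of the optimizer state in stage 1, which is one simple way to guarantee the base weights stay untouched.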

## Usage

The `ScalingLawForecaster` class can be found in the [GitHub repository](https://github.com/zhqwqwq/Configuration-to-Performance-Scaling-Law).

```python
import torch
from transformers import AutoTokenizer
# Get ScalingLawForecaster from: https://github.com/zhqwqwq/Configuration-to-Performance-Scaling-Law
from model import ScalingLawForecaster

# Load model
model = ScalingLawForecaster(
    base_model_name="Qwen/Qwen3-1.7B",
    init_from_pretrained=True,
    force_fp32=True
)

# Load checkpoint (map_location="cpu" lets this run on machines without a GPU)
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

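# Prepare inputs (these tensors are not defined in this snippet and must be built beforehand)
# input_ids: tokenized text sequence
# is_number_mask: boolean mask indicating which tokens are numeric
# number_values_filled: actual numeric values (0 for non-numeric tokens)
# attention_mask: standard attention mask for the tokenized sequence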

with torch.no_grad():
    predictions = model(
        input_ids=input_ids,
        is_number_mask=is_number_mask,
        number_values_filled=number_values_filled,
        attention_mask=attention_mask
    )
```

## Input Format

The model expects three key inputs:

1. **input_ids** (torch.LongTensor): Tokenized sequence with special numeric tokens
2. **is_number_mask** (torch.BoolTensor): Boolean mask marking numeric token positions
3. **number_values_filled** (torch.FloatTensor): Actual numeric values at marked positions
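To make the relationship between the three inputs concrete, here is a minimal pure-Python sketch that splits a configuration string into placeholder tokens, a boolean mask, and a filled value list. Everything about the encoding is an assumption for illustration: the `<num>` placeholder, the whitespace splitting, and the helper name `split_numbers` are hypothetical, and the repository's actual preprocessing and special tokens may differ.

```python
import re

NUM_PLACEHOLDER = "<num>"  # hypothetical placeholder, not confirmed by the repository
NUMBER_RE = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$")


def split_numbers(text):
    """Return (tokens, is_number_mask, number_values_filled) for a config string."""
    tokens, mask, values = [], [], []
    for piece in text.split():
        if NUMBER_RE.match(piece):
            tokens.append(NUM_PLACEHOLDER)  # numeric token replaced by a placeholder
            mask.append(True)               # marked as numeric
            values.append(float(piece))     # actual value kept separately
        else:
            tokens.append(piece)
            mask.append(False)
            values.append(0.0)              # filled with 0 at non-numeric positions
    return tokens, mask, values


tokens, is_number_mask, number_values_filled = split_numbers(
    "lr 3e-4 batch_size 256 layers 24"
)
# tokens               -> ['lr', '<num>', 'batch_size', '<num>', 'layers', '<num>']
# is_number_mask       -> [False, True, False, True, False, True]
# number_values_filled -> [0.0, 0.0003, 0.0, 256.0, 0.0, 24.0]
```

In real use the three lists would be aligned with the tokenizer's output and converted to `torch.LongTensor`, `torch.BoolTensor`, and `torch.FloatTensor` respectively.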

## Intended Use

This model is designed for:

- **Scaling law research**: Understanding how neural network performance scales with configuration
- **Final performance forecasting**: Predicting model performance at the end of training
- **Configuration optimization**: Finding optimal hyperparameters based on scaling patterns
- **Resource planning**: Estimating computational requirements for different model sizes

## Limitations

- Trained specifically on the Marin and StepLaw datasets; generalization to other settings likely requires at least fine-tuning
- Requires properly formatted inputs with numeric tokens replaced and masked
- Predicts only final performance, not intermediate checkpoints

## Differences from NCPL-intermediate

- **NCPL-final**: Predicts only final performance metrics after full training
- **NCPL-intermediate**: Predicts performance at intermediate training checkpoints

NCPL-final is trained for more epochs (20 + 1000 vs. 10 + 400) and focuses exclusively on final performance prediction.

## Citation

If you use this model in your research, please cite:

```bibtex
@article{ncpl2026,
  title = {Neural Configuration to Performance Scaling Law},
  author = {Huaqing Zhang and Kaiyue Wen and Tengyu Ma},
  journal = {arXiv preprint arXiv:2602.10300},
  year = {2026},
  url = {https://www.arxiv.org/abs/2602.10300}
}
```

## Model Card Contact

For questions or issues, please open an issue in the [repository](https://github.com/zhqwqwq/Configuration-to-Performance-Scaling-Law).