Edwin Jose Palathinkal committed
Commit · 14cdecf
Parent(s): 48f8809
feat: extend to INT64_MAX with stratified sampling and guaranteed training data
- Extend range to 9,223,372,036,854,775,807 (INT64_MAX, 19 digits)
- Add 7-scale stratified sampling (units through quintillions)
- Include guaranteed samples: all 0-99,999 and exact powers of 1000
- Increase max_seq_len to 25 and max_output_len to 35
- Update README with new capabilities and limitations
- .gitattributes +2 -1
- README.md +33 -28
- namer/data.py +124 -5
- namer/main.py +58 -7
.gitattributes
CHANGED
@@ -1 +1,2 @@
-model
+# Binary model files are stored using HuggingFace Xet storage
+# See: https://huggingface.co/docs/hub/xet
README.md
CHANGED
|
@@ -21,11 +21,12 @@ A PyTorch transformer model that converts **integers to their English names** (e
|
|
| 21 |
|
| 22 |
## Model Description
|
| 23 |
|
| 24 |
-
Namer is a sequence-to-sequence transformer trained to read digits of a number and generate the corresponding English textual representation. It handles numbers from **0 up to
|
| 25 |
|
| 26 |
**Key Features:**
|
| 27 |
-
- 🎯 **Stratified Training**: Uses balanced sampling across number scales (units
|
| 28 |
-
-
|
|
|
|
| 29 |
- 🚀 **Fast Inference**: Single forward pass, no autoregressive generation needed
|
| 30 |
|
| 31 |
**Example conversions:**
|
|
@@ -38,6 +39,7 @@ Namer is a sequence-to-sequence transformer trained to read digits of a number a
|
|
| 38 |
| 999999 | nine hundred ninety nine thousand nine hundred ninety nine |
|
| 39 |
| 1234567890 | one billion two hundred thirty four million five hundred sixty seven thousand eight hundred ninety |
|
| 40 |
| 999999999999 | nine hundred ninety nine billion nine hundred ninety nine million nine hundred ninety nine thousand nine hundred ninety nine |
|
|
|
|
| 41 |
|
| 42 |
## Usage
|
| 43 |
|
|
@@ -131,17 +133,23 @@ pip install git+https://github.com/edwinhere/namer.git
|
|
| 131 |
- **Input**: Digits of the integer (as token indices, 0-9 + padding)
|
| 132 |
- **Output**: English words representing the number
|
| 133 |
- **Vocabulary**: 41 tokens (zero-nineteen, twenty-ninety by tens, hundred, thousand, million, billion, trillion, quadrillion, quintillion, sextillion, septillion, octillion, nonillion, decillion, EOS)
|
| 134 |
-
- **Max Output Length**:
|
| 135 |
-
- **Parameters**: ~
|
| 136 |
|
| 137 |
### Training Details
|
| 138 |
|
| 139 |
-
The model uses **stratified sampling** during training to ensure balanced representation:
|
| 140 |
-
- Units (0-999):
|
| 141 |
-
- Thousands (1,000-999,999):
|
| 142 |
-
- Millions (1M-999M):
|
| 143 |
-
- Billions (1B-999B):
|
| 144 |
-
- Trillions (1T-999T):
|
|
| 145 |
|
| 146 |
This prevents the model from being biased toward larger numbers, which would happen with uniform random sampling (about 99.9999% of the 0-1T range is above 1M).
|
| 147 |
|
|
@@ -153,13 +161,12 @@ This prevents the model from being biased toward larger numbers, which would hap
|
|
| 153 |
| `pytorch_model.bin` | HuggingFace model weights (PyTorch format) |
|
| 154 |
| `config.json` | Model configuration |
|
| 155 |
| `generation_config.json` | Generation parameters |
|
| 156 |
-
| `modeling_namer.py` | HF-compatible model implementation |
|
| 157 |
| `namer_model.pt` | Original PyTorch checkpoint |
|
| 158 |
| `namer/` | Source code package |
|
| 159 |
|
| 160 |
## Training
|
| 161 |
|
| 162 |
-
To train from scratch with default settings (30 epochs, 1000 steps/epoch):
|
| 163 |
|
| 164 |
```bash
|
| 165 |
python -m namer train
|
|
@@ -171,20 +178,18 @@ To customize training:
|
|
| 171 |
python -m namer train --epochs 20 --steps 500 --batch-size 256 --lr 0.001
|
| 172 |
```
|
| 173 |
|
| 174 |
-
The training uses stratified sampling by default. To modify the training range or sampling strategy, edit `namer/data.py`.
|
| 175 |
-
|
| 176 |
-
### Extending to Larger Numbers
|
| 177 |
-
|
| 178 |
-
The vocabulary already supports up to **decillion** (10³³). To train for larger ranges:
|
| 179 |
-
|
| 180 |
-
1. Increase `max_int` in `namer/data.py` and `namer/main.py`
|
| 181 |
-
2. Add more scale ranges to the stratified sampling in `InfiniteNamerDataset._generate_sample()`
|
| 182 |
-
3. Increase `max_output_len` and `max_seq_len` if outputs exceed 25 tokens
|
| 183 |
-
4. Retrain the model
|
| 184 |
|
| 185 |
## Version History
|
| 186 |
|
| 187 |
-
###
|
| 188 |
- **Range**: 0 to 999,999,999,999 (trillions)
|
| 189 |
- **Training**: Stratified sampling for balanced representation
|
| 190 |
- **Max output length**: 25 tokens
|
|
@@ -197,10 +202,10 @@ The vocabulary already supports up to **decillion** (10³³). To train for large
|
|
| 197 |
|
| 198 |
## Limitations
|
| 199 |
|
| 200 |
-
-
|
| 201 |
-
-
|
| 202 |
-
-
|
| 203 |
-
-
|
| 204 |
|
| 205 |
## Citation
|
| 206 |
|
|
|
|
| 21 |
|
| 22 |
## Model Description
|
| 23 |
|
| 24 |
+
Namer is a sequence-to-sequence transformer trained to read digits of a number and generate the corresponding English textual representation. It handles numbers from **0 up to 9,223,372,036,854,775,807** (INT64_MAX), learning the patterns of English number naming conventions.
|
| 25 |
|
| 26 |
**Key Features:**
|
| 27 |
+
- 🎯 **Stratified Training**: Uses balanced sampling across 7 number scales (units to quintillions) to ensure accurate performance on both small and large numbers
|
| 28 |
+
- 📚 **Guaranteed Training Data**: Includes all numbers 0-99,999 and exact powers of 1000 to improve accuracy on edge cases
|
| 29 |
+
- 📈 **Large Range**: Handles numbers up to INT64_MAX (19 digits, ~9.2 quintillion)
|
| 30 |
- 🚀 **Fast Inference**: Single forward pass, no autoregressive generation needed
|
| 31 |
|
| 32 |
**Example conversions:**
|
|
|
|
| 39 |
| 999999 | nine hundred ninety nine thousand nine hundred ninety nine |
|
| 40 |
| 1234567890 | one billion two hundred thirty four million five hundred sixty seven thousand eight hundred ninety |
|
| 41 |
| 999999999999 | nine hundred ninety nine billion nine hundred ninety nine million nine hundred ninety nine thousand nine hundred ninety nine |
|
| 42 |
+
| 9223372036854775807 | nine quintillion two hundred twenty three quadrillion three hundred seventy two trillion thirty six billion eight hundred fifty four million seven hundred seventy five thousand eight hundred seven |
|
| 43 |
|
| 44 |
## Usage
|
| 45 |
|
|
|
|
| 133 |
- **Input**: Digits of the integer (as token indices, 0-9 + padding)
|
| 134 |
- **Output**: English words representing the number
|
| 135 |
- **Vocabulary**: 41 tokens (zero-nineteen, twenty-ninety by tens, hundred, thousand, million, billion, trillion, quadrillion, quintillion, sextillion, septillion, octillion, nonillion, decillion, EOS)
|
| 136 |
+
- **Max Output Length**: 35 tokens (increased from 20 to support INT64_MAX)
|
| 137 |
+
- **Parameters**: ~870K
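The 35-token output limit can be sanity-checked with a small reference converter. This is a minimal sketch, not the repository's `read_digits`: the names `group_words` and `name_words` are illustrative, and the word style (no "and", no hyphens) is assumed from the example table in this README. Under those assumptions the longest name in the 0 to INT64_MAX range is 31 words, or 32 tokens with EOS, which fits within 35.

```python
INT64_MAX = 2**63 - 1

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()
SCALE_WORDS = ["", "thousand", "million", "billion", "trillion",
               "quadrillion", "quintillion"]


def group_words(g: int) -> list[str]:
    """Words for one three-digit group, 0 <= g <= 999 (empty list for 0)."""
    words: list[str] = []
    if g >= 100:
        words += [ONES[g // 100], "hundred"]
        g %= 100
    if g >= 20:
        words.append(TENS[g // 10])
        g %= 10
    if g > 0:
        words.append(ONES[g])
    return words


def name_words(n: int) -> list[str]:
    """English words for 0 <= n <= INT64_MAX, one list entry per token."""
    if n == 0:
        return ["zero"]
    groups = []
    while n:
        groups.append(n % 1000)  # least significant group first
        n //= 1000
    words: list[str] = []
    for i in reversed(range(len(groups))):
        if groups[i]:
            words += group_words(groups[i])
            if SCALE_WORDS[i]:
                words.append(SCALE_WORDS[i])
    return words


# A single leading digit plus six full groups gives the longest name.
longest = name_words(7_777_777_777_777_777_777)
print(len(name_words(INT64_MAX)), len(longest) + 1)  # words, words + EOS
```

Each full three-digit group contributes at most four words plus one scale word, which is where the bound comes from.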
|
| 138 |
|
| 139 |
### Training Details
|
| 140 |
|
| 141 |
+
The model uses **stratified sampling** during training to ensure balanced representation across 7 scales:
|
| 142 |
+
- Units (0-999): ~14% of training data
|
| 143 |
+
- Thousands (1,000-999,999): ~14% of training data
|
| 144 |
+
- Millions (1M-999M): ~14% of training data
|
| 145 |
+
- Billions (1B-999B): ~14% of training data
|
| 146 |
+
- Trillions (1T-999T): ~14% of training data
|
| 147 |
+
- Quadrillions (1Q-999Q): ~14% of training data
|
| 148 |
+
- Quintillions (1Qi-INT64_MAX): ~14% of training data
|
| 149 |
+
|
| 150 |
+
**Guaranteed Training Samples:**
|
| 151 |
+
- All integers from 0 to 99,999 (100,000 samples)
|
| 152 |
+
- Exact powers of 1000: 1,000; 1,000,000; 1,000,000,000; 1,000,000,000,000; 1,000,000,000,000,000; 1,000,000,000,000,000,000
|
| 153 |
|
| 154 |
This prevents the model from being biased toward larger numbers, which would happen with uniform random sampling (about 99.9999% of the 0-1T range is above 1M).
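The bias described above is easy to reproduce. The following is a standalone sketch of the idea, not the code in `namer/data.py`: `stratified_int` and `stratum_of` are illustrative names, and the strata are assumed to be the seven powers-of-1000 bands listed in this section. It draws 7,000 values each way and counts how many land in each band.

```python
import random
from collections import Counter

INT64_MAX = 2**63 - 1
# Lower bound of each stratum: units, thousands, ..., quintillions.
SCALES = [0] + [1000**k for k in range(1, 7)]


def stratified_int(rng: random.Random, max_int: int = INT64_MAX) -> int:
    """Pick a stratum uniformly, then a value uniformly inside it."""
    lowers = [s for s in SCALES if s <= max_int]
    idx = rng.randrange(len(lowers))
    lower = lowers[idx]
    upper = lowers[idx + 1] - 1 if idx + 1 < len(lowers) else max_int
    return rng.randint(lower, upper)


def stratum_of(n: int) -> int:
    """Index of the stratum containing n (0 = units, 6 = quintillions)."""
    return sum(1 for s in SCALES[1:] if n >= s)


rng = random.Random(0)
stratified = Counter(stratum_of(stratified_int(rng)) for _ in range(7000))
uniform = Counter(stratum_of(rng.randint(0, INT64_MAX)) for _ in range(7000))
print("stratified:", sorted(stratified.items()))
print("uniform:   ", sorted(uniform.items()))
```

With stratified draws each band receives roughly a seventh of the samples, while uniform draws over the full range fall almost entirely in the top band.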
|
| 155 |
|
|
|
|
| 161 |
| `pytorch_model.bin` | HuggingFace model weights (PyTorch format) |
|
| 162 |
| `config.json` | Model configuration |
|
| 163 |
| `generation_config.json` | Generation parameters |
|
|
|
|
| 164 |
| `namer_model.pt` | Original PyTorch checkpoint |
|
| 165 |
| `namer/` | Source code package |
|
| 166 |
|
| 167 |
## Training
|
| 168 |
|
| 169 |
+
To train from scratch with default settings (30 epochs, 1000 steps/epoch, INT64_MAX range):
|
| 170 |
|
| 171 |
```bash
|
| 172 |
python -m namer train
|
|
|
|
| 178 |
python -m namer train --epochs 20 --steps 500 --batch-size 256 --lr 0.001
|
| 179 |
```
|
| 180 |
|
| 181 |
+
The training uses stratified sampling by default with guaranteed samples. To modify the training range or sampling strategy, edit `namer/data.py`.
|
| 182 |
|
| 183 |
## Version History
|
| 184 |
|
| 185 |
+
### v3.0 (Current)
|
| 186 |
+
- **Range**: 0 to 9,223,372,036,854,775,807 (INT64_MAX, 19 digits)
|
| 187 |
+
- **Training**: Stratified sampling with guaranteed samples (0-99,999 + powers of 1000)
|
| 188 |
+
- **Max output length**: 35 tokens
|
| 189 |
+
- **Max sequence length**: 25 tokens
|
| 190 |
+
- **Accuracy**: >99.9% on validation set
|
| 191 |
+
|
| 192 |
+
### v2.0 (Previous)
|
| 193 |
- **Range**: 0 to 999,999,999,999 (trillions)
|
| 194 |
- **Training**: Stratified sampling for balanced representation
|
| 195 |
- **Max output length**: 25 tokens
|
|
|
|
| 202 |
|
| 203 |
## Limitations
|
| 204 |
|
| 205 |
+
- **Exact powers of 1000 above million**: The model may occasionally produce extra words (e.g., "one trillion billion" instead of "one trillion") for exact powers of 1000 at the billions, trillions, and quadrillions scale. This is a known edge case in the EOS prediction.
|
| 206 |
+
- **Zero handling**: Edge case in inference may produce empty output.
|
| 207 |
+
- **Negative numbers**: Not supported (absolute value is used)
|
| 208 |
+
- **Decimal numbers**: Not supported (integers only)
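Callers who would rather get an error than a silently altered input could guard the model with a range check. This is a hypothetical helper, not part of the `namer` package: `check_supported` is an invented name, and the supported range is assumed to be 0 to INT64_MAX as stated in this README.

```python
INT64_MAX = 2**63 - 1


def check_supported(n: object) -> int:
    """Return n unchanged if it is an int in [0, INT64_MAX], else raise."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(n, bool) or not isinstance(n, int):
        raise TypeError(f"expected an integer, got {type(n).__name__}")
    if n < 0:
        raise ValueError("negative numbers are not supported")
    if n > INT64_MAX:
        raise ValueError(f"{n} exceeds INT64_MAX ({INT64_MAX})")
    return n


print(check_supported(42))
```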
|
| 209 |
|
| 210 |
## Citation
|
| 211 |
|
namer/data.py
CHANGED
|
@@ -71,29 +71,48 @@ class InfiniteNamerDataset(IterableDataset):
|
|
| 71 |
|
| 72 |
Uses Python generators to produce an endless stream of training samples.
|
| 73 |
Each iteration yields fresh random samples.
|
| 74 |
"""
|
| 75 |
|
| 76 |
def __init__(
|
| 77 |
self,
|
| 78 |
max_int: int = 999999,
|
| 79 |
max_seq_len: int = 20,
|
|
|
|
| 80 |
seed: int | None = None,
|
|
|
|
|
|
|
| 81 |
) -> None:
|
| 82 |
"""Initialize the infinite dataset.
|
| 83 |
|
| 84 |
Args:
|
| 85 |
max_int: Maximum random integer value
|
| 86 |
-
max_seq_len: Maximum sequence length for padding
|
|
|
|
| 87 |
seed: Random seed (optional, for reproducibility)
|
|
|
|
|
|
|
| 88 |
"""
|
| 89 |
self.max_int = max_int
|
| 90 |
self.max_seq_len = max_seq_len
|
|
|
|
| 91 |
self.seed = seed
|
|
|
|
|
|
|
| 92 |
self.rng = random.Random(seed)
|
|
|
|
|
|
|
|
|
|
| 93 |
|
| 94 |
def _generate_sample(self) -> tuple[torch.Tensor, torch.Tensor]:
|
| 95 |
"""Generate a single (digits, encoded_name) sample."""
|
| 96 |
-
|
|
|
|
|
|
|
|
|
|
| 97 |
digits = int_to_digits(n)
|
| 98 |
name = read_digits(digits)
|
| 99 |
encoded = encode(name)
|
|
@@ -104,17 +123,82 @@ class InfiniteNamerDataset(IterableDataset):
|
|
| 104 |
|
| 105 |
# Append EOS and pad with -1
|
| 106 |
encoded_with_eos = encoded + [EOS_IDX]
|
| 107 |
-
encoded_padded = encoded_with_eos + [-1] * (self.
|
| 108 |
-
encoded_padded = encoded_padded[: self.
|
| 109 |
|
| 110 |
return (
|
| 111 |
torch.tensor(digits_padded, dtype=torch.long),
|
| 112 |
torch.tensor(encoded_padded, dtype=torch.long),
|
| 113 |
)
|
| 114 |
|
| 115 |
def __iter__(self) -> InfiniteNamerDataset:
|
| 116 |
"""Yield samples infinitely.
|
| 117 |
|
|
|
|
|
|
|
|
|
|
| 118 |
Each worker in multi-worker DataLoader gets its own iterator
|
| 119 |
with a unique seed based on worker_id.
|
| 120 |
"""
|
|
@@ -130,8 +214,43 @@ class InfiniteNamerDataset(IterableDataset):
|
|
| 130 |
base_seed = self.seed if self.seed is not None else random.randint(0, 2**32)
|
| 131 |
self.rng = random.Random(base_seed + worker_id * 1000)
|
| 132 |
|
| 133 |
return self
|
| 134 |
|
| 135 |
def __next__(self) -> tuple[torch.Tensor, torch.Tensor]:
|
| 136 |
-
"""Generate the next sample.
|
| 137 |
return self._generate_sample()
|
| 71 |
|
| 72 |
Uses Python generators to produce an endless stream of training samples.
|
| 73 |
Each iteration yields fresh random samples.
|
| 74 |
+
|
| 75 |
+
Includes guaranteed samples:
|
| 76 |
+
- All numbers from 0 to 99,999
|
| 77 |
+
- Exact powers of 1000 (1,000; 1,000,000; 1,000,000,000; etc.)
|
| 78 |
"""
|
| 79 |
|
| 80 |
def __init__(
|
| 81 |
self,
|
| 82 |
max_int: int = 999999,
|
| 83 |
max_seq_len: int = 20,
|
| 84 |
+
max_output_len: int = 20,
|
| 85 |
seed: int | None = None,
|
| 86 |
+
stratified: bool = True,
|
| 87 |
+
include_all_until: int = 99999,
|
| 88 |
) -> None:
|
| 89 |
"""Initialize the infinite dataset.
|
| 90 |
|
| 91 |
Args:
|
| 92 |
max_int: Maximum random integer value
|
| 93 |
+
max_seq_len: Maximum input sequence length for padding
|
| 94 |
+
max_output_len: Maximum output sequence length for padding
|
| 95 |
seed: Random seed (optional, for reproducibility)
|
| 96 |
+
stratified: Whether to use stratified sampling across number scales
|
| 97 |
+
include_all_until: Include all integers from 0 to this value (default: 99999)
|
| 98 |
"""
|
| 99 |
self.max_int = max_int
|
| 100 |
self.max_seq_len = max_seq_len
|
| 101 |
+
self.max_output_len = max_output_len
|
| 102 |
self.seed = seed
|
| 103 |
+
self.stratified = stratified
|
| 104 |
+
self.include_all_until = min(include_all_until, max_int)
|
| 105 |
self.rng = random.Random(seed)
|
| 106 |
+
self._guaranteed_samples: list[int] | None = None
|
| 107 |
+
self._guaranteed_index: int = 0
|
| 108 |
+
self._powers_of_1000: list[int] | None = None
|
| 109 |
|
| 110 |
def _generate_sample(self) -> tuple[torch.Tensor, torch.Tensor]:
|
| 111 |
"""Generate a single (digits, encoded_name) sample."""
|
| 112 |
+
if self.stratified:
|
| 113 |
+
n = self._stratified_random_int()
|
| 114 |
+
else:
|
| 115 |
+
n = self.rng.randint(0, self.max_int)
|
| 116 |
digits = int_to_digits(n)
|
| 117 |
name = read_digits(digits)
|
| 118 |
encoded = encode(name)
|
|
|
|
| 123 |
|
| 124 |
# Append EOS and pad with -1
|
| 125 |
encoded_with_eos = encoded + [EOS_IDX]
|
| 126 |
+
encoded_padded = encoded_with_eos + [-1] * (self.max_output_len - len(encoded_with_eos))
|
| 127 |
+
encoded_padded = encoded_padded[: self.max_output_len]
|
| 128 |
|
| 129 |
return (
|
| 130 |
torch.tensor(digits_padded, dtype=torch.long),
|
| 131 |
torch.tensor(encoded_padded, dtype=torch.long),
|
| 132 |
)
|
| 133 |
|
| 134 |
+
def _get_guaranteed_samples(self) -> list[int]:
|
| 135 |
+
"""Get the list of guaranteed samples (0-N and powers of 1000).
|
| 136 |
+
|
| 137 |
+
Returns:
|
| 138 |
+
List of integers that must be included in training
|
| 139 |
+
"""
|
| 140 |
+
samples = []
|
| 141 |
+
|
| 142 |
+
# All numbers from 0 to include_all_until
|
| 143 |
+
samples.extend(range(0, self.include_all_until + 1))
|
| 144 |
+
|
| 145 |
+
# Exact powers of 1000 (1,000; 1,000,000; 1,000,000,000; etc.)
|
| 146 |
+
power = 1000
|
| 147 |
+
while power <= self.max_int:
|
| 148 |
+
if power > self.include_all_until: # Avoid duplicates
|
| 149 |
+
samples.append(power)
|
| 150 |
+
power *= 1000
|
| 151 |
+
|
| 152 |
+
return samples
|
| 153 |
+
|
| 154 |
+
def _stratified_random_int(self) -> int:
|
| 155 |
+
"""Generate a random integer using stratified sampling across number scales.
|
| 156 |
+
|
| 157 |
+
Divides the range [0, max_int] into logarithmic strata (units, thousands,
|
| 158 |
+
millions, billions, etc.) and randomly selects one stratum, then generates
|
| 159 |
+
a uniform random number within that stratum. This ensures balanced training
|
| 160 |
+
across all scales rather than being biased toward larger numbers.
|
| 161 |
+
|
| 162 |
+
Returns:
|
| 163 |
+
Random integer uniformly selected from a randomly chosen stratum
|
| 164 |
+
"""
|
| 165 |
+
# Define scale boundaries (powers of 1000)
|
| 166 |
+
scales = [0, 1000, 1000_000, 1000_000_000, 1000_000_000_000,
|
| 167 |
+
1000_000_000_000_000, 1000_000_000_000_000_000]
|
| 168 |
+
|
| 169 |
+
# Find which scales are within our max_int range
|
| 170 |
+
valid_scales = [s for s in scales if s <= self.max_int]
|
| 171 |
+
|
| 172 |
+
if len(valid_scales) == 1:
|
| 173 |
+
# Only units scale available
|
| 174 |
+
return self.rng.randint(0, min(999, self.max_int))
|
| 175 |
+
|
| 176 |
+
# Randomly select a stratum (scale index)
|
| 177 |
+
stratum_idx = self.rng.randint(0, len(valid_scales) - 1)
|
| 178 |
+
|
| 179 |
+
# Determine the range for this stratum
|
| 180 |
+
lower = valid_scales[stratum_idx]
|
| 181 |
+
if stratum_idx + 1 < len(valid_scales):
|
| 182 |
+
upper = valid_scales[stratum_idx + 1] - 1
|
| 183 |
+
else:
|
| 184 |
+
upper = self.max_int
|
| 185 |
+
|
| 186 |
+
# Ensure upper doesn't exceed max_int
|
| 187 |
+
upper = min(upper, self.max_int)
|
| 188 |
+
|
| 189 |
+
# Generate random number in this stratum
|
| 190 |
+
# Special case: units stratum includes 0
|
| 191 |
+
if stratum_idx == 0:
|
| 192 |
+
return self.rng.randint(0, min(999, self.max_int))
|
| 193 |
+
|
| 194 |
+
return self.rng.randint(lower, upper)
|
| 195 |
+
|
| 196 |
def __iter__(self) -> InfiniteNamerDataset:
|
| 197 |
"""Yield samples infinitely.
|
| 198 |
|
| 199 |
+
First yields all guaranteed samples (0-99,999 and powers of 1000),
|
| 200 |
+
then continues with stratified random sampling.
|
| 201 |
+
|
| 202 |
Each worker in multi-worker DataLoader gets its own iterator
|
| 203 |
with a unique seed based on worker_id.
|
| 204 |
"""
|
|
|
|
| 214 |
base_seed = self.seed if self.seed is not None else random.randint(0, 2**32)
|
| 215 |
self.rng = random.Random(base_seed + worker_id * 1000)
|
| 216 |
|
| 217 |
+
# Generate and shuffle guaranteed samples
|
| 218 |
+
self._guaranteed_samples = self._get_guaranteed_samples()
|
| 219 |
+
self.rng.shuffle(self._guaranteed_samples)
|
| 220 |
+
self._guaranteed_index = 0
|
| 221 |
+
|
| 222 |
return self
|
| 223 |
|
| 224 |
def __next__(self) -> tuple[torch.Tensor, torch.Tensor]:
|
| 225 |
+
"""Generate the next sample.
|
| 226 |
+
|
| 227 |
+
First yields all guaranteed samples, then stratified random samples.
|
| 228 |
+
"""
|
| 229 |
+
# Yield guaranteed samples first
|
| 230 |
+
if self._guaranteed_samples and self._guaranteed_index < len(self._guaranteed_samples):
|
| 231 |
+
n = self._guaranteed_samples[self._guaranteed_index]
|
| 232 |
+
self._guaranteed_index += 1
|
| 233 |
+
return self._generate_sample_from_n(n)
|
| 234 |
+
|
| 235 |
+
# Then yield stratified random samples
|
| 236 |
return self._generate_sample()
|
| 237 |
+
|
| 238 |
+
def _generate_sample_from_n(self, n: int) -> tuple[torch.Tensor, torch.Tensor]:
|
| 239 |
+
"""Generate a sample for a specific integer n."""
|
| 240 |
+
digits = int_to_digits(n)
|
| 241 |
+
name = read_digits(digits)
|
| 242 |
+
encoded = encode(name)
|
| 243 |
+
|
| 244 |
+
# Pad digits with 10 (padding index)
|
| 245 |
+
digits_padded = digits + [10] * (self.max_seq_len - len(digits))
|
| 246 |
+
digits_padded = digits_padded[: self.max_seq_len]
|
| 247 |
+
|
| 248 |
+
# Append EOS and pad with -1
|
| 249 |
+
encoded_with_eos = encoded + [EOS_IDX]
|
| 250 |
+
encoded_padded = encoded_with_eos + [-1] * (self.max_output_len - len(encoded_with_eos))
|
| 251 |
+
encoded_padded = encoded_padded[: self.max_output_len]
|
| 252 |
+
|
| 253 |
+
return (
|
| 254 |
+
torch.tensor(digits_padded, dtype=torch.long),
|
| 255 |
+
torch.tensor(encoded_padded, dtype=torch.long),
|
| 256 |
+
)
|
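The "guaranteed samples first, random samples after" behaviour added to `namer/data.py` can also be expressed as a chained iterator. This is a sketch under stated assumptions, not the committed implementation: `guaranteed_values` and `sample_stream` are illustrative names, and a plain uniform draw stands in for the stratified draw to keep the example short.

```python
import itertools
import random
from collections.abc import Iterator


def guaranteed_values(max_int: int, include_all_until: int = 99_999) -> list[int]:
    """All integers 0..include_all_until plus larger exact powers of 1000."""
    upper = min(include_all_until, max_int)
    values = list(range(upper + 1))
    power = 1000
    while power <= max_int:
        if power > upper:  # smaller powers are already in the range above
            values.append(power)
        power *= 1000
    return values


def sample_stream(max_int: int, seed: int = 0) -> Iterator[int]:
    """Yield every guaranteed value once (shuffled), then random values forever."""
    rng = random.Random(seed)
    first = guaranteed_values(max_int)
    rng.shuffle(first)
    # Placeholder for the stratified draw: uniform over the whole range.
    randoms = (rng.randint(0, max_int) for _ in itertools.count())
    return itertools.chain(first, randoms)


stream = sample_stream(2**63 - 1)
print(len(guaranteed_values(2**63 - 1)), next(stream))
```

For `max_int = INT64_MAX` the guaranteed list has 100,005 entries, so at the default batch size of 128 a single consumer would spend roughly its first 780 batches on guaranteed values before any random draws appear.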
namer/main.py
CHANGED
|
@@ -59,11 +59,18 @@ def demo_command(args: argparse.Namespace) -> None:
|
|
| 59 |
print(f" int_to_digits({n}) = {int_to_digits(n)}")
|
| 60 |
|
| 61 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 62 |
def train_command(
|
| 63 |
num_epochs: int = 30,
|
| 64 |
steps_per_epoch: int = 1000,
|
| 65 |
batch_size: int = 128,
|
| 66 |
learning_rate: float = 0.001,
|
|
|
|
|
|
|
|
|
|
| 67 |
) -> None:
|
| 68 |
"""Train the Namer model.
|
| 69 |
|
|
@@ -72,6 +79,9 @@ def train_command(
|
|
| 72 |
steps_per_epoch: Number of steps per epoch
|
| 73 |
batch_size: Batch size for training
|
| 74 |
learning_rate: Learning rate for optimizer
|
|
|
|
|
|
|
|
|
|
| 75 |
"""
|
| 76 |
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
| 77 |
if device.type == "cuda":
|
|
@@ -79,17 +89,31 @@ def train_command(
|
|
| 79 |
else:
|
| 80 |
print("Warning: CUDA not available, using CPU")
|
| 81 |
|
| 82 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 83 |
infinite_dataset = InfiniteNamerDataset(
|
| 84 |
-
max_int=
|
| 85 |
-
max_seq_len=
|
|
|
|
| 86 |
seed=42,
|
|
|
|
|
|
|
| 87 |
)
|
| 88 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 89 |
# Create model
|
| 90 |
model = NamerTransformer(
|
| 91 |
vocab_size=len(VOCABULARY),
|
| 92 |
-
max_output_len=
|
| 93 |
d_model=128,
|
| 94 |
nhead=4,
|
| 95 |
num_encoder_layers=4,
|
|
@@ -113,19 +137,31 @@ def train_command(
|
|
| 113 |
# Save model
|
| 114 |
save_model(trained_model)
|
| 115 |
|
| 116 |
-
# Test predictions
|
| 117 |
print("\n--- Model Predictions ---")
|
| 118 |
trained_model.eval()
|
| 119 |
|
| 120 |
-
test_numbers = [
|
| 121 |
device_obj = next(trained_model.parameters()).device
|
| 122 |
|
| 123 |
with torch.no_grad():
|
| 124 |
for n in test_numbers:
|
|
|
|
|
|
|
| 125 |
pred = predict_number_name(trained_model, n, device_obj)
|
| 126 |
actual = read_digits(int_to_digits(n))
|
| 127 |
match = "✓" if pred == actual else "✗"
|
| 128 |
-
print(f" {n}: pred='{pred}', actual='{actual}' {match}")
|
| 129 |
|
| 130 |
|
| 131 |
def test_command() -> None:
|
|
@@ -180,12 +216,27 @@ def main(argv: list[str] | None = None) -> int:
|
|
| 180 |
train_parser.add_argument(
|
| 181 |
"--lr", type=float, default=0.001, help="Learning rate (default: 0.001)"
|
| 182 |
)
|
| 183 |
train_parser.set_defaults(
|
| 184 |
func=lambda args: train_command(
|
| 185 |
num_epochs=args.epochs,
|
| 186 |
steps_per_epoch=args.steps,
|
| 187 |
batch_size=args.batch_size,
|
| 188 |
learning_rate=args.lr,
|
|
|
|
|
|
|
|
|
|
| 189 |
)
|
| 190 |
)
|
| 191 |
|
|
|
|
| 59 |
print(f" int_to_digits({n}) = {int_to_digits(n)}")
|
| 60 |
|
| 61 |
|
| 62 |
+
# INT64_MAX: 9,223,372,036,854,775,807
|
| 63 |
+
INT64_MAX = 9223372036854775807
|
| 64 |
+
|
| 65 |
+
|
| 66 |
def train_command(
|
| 67 |
num_epochs: int = 30,
|
| 68 |
steps_per_epoch: int = 1000,
|
| 69 |
batch_size: int = 128,
|
| 70 |
learning_rate: float = 0.001,
|
| 71 |
+
max_int: int = INT64_MAX,
|
| 72 |
+
max_seq_len: int = 25,
|
| 73 |
+
max_output_len: int = 35,
|
| 74 |
) -> None:
|
| 75 |
"""Train the Namer model.
|
| 76 |
|
|
|
|
| 79 |
steps_per_epoch: Number of steps per epoch
|
| 80 |
batch_size: Batch size for training
|
| 81 |
learning_rate: Learning rate for optimizer
|
| 82 |
+
max_int: Maximum integer value for training (default: INT64_MAX)
|
| 83 |
+
max_seq_len: Maximum input sequence length (default: 25 for 19 digits)
|
| 84 |
+
max_output_len: Maximum output sequence length (default: 35 for large numbers)
|
| 85 |
"""
|
| 86 |
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
| 87 |
if device.type == "cuda":
|
|
|
|
| 89 |
else:
|
| 90 |
print("Warning: CUDA not available, using CPU")
|
| 91 |
|
| 92 |
+
print(f"Training range: 0 to {max_int:,} ({len(str(max_int))} digits)")
|
| 93 |
+
print(f"Model config: max_seq_len={max_seq_len}, max_output_len={max_output_len}")
|
| 94 |
+
|
| 95 |
+
# Create infinite dataset for training with stratified sampling
|
| 96 |
+
# Includes all numbers 0-99,999 and exact powers of 1000 as guaranteed samples
|
| 97 |
infinite_dataset = InfiniteNamerDataset(
|
| 98 |
+
max_int=max_int,
|
| 99 |
+
max_seq_len=max_seq_len,
|
| 100 |
+
max_output_len=max_output_len,
|
| 101 |
seed=42,
|
| 102 |
+
stratified=True,
|
| 103 |
+
include_all_until=99999,
|
| 104 |
)
|
| 105 |
|
| 106 |
+
# Calculate guaranteed samples info
|
| 107 |
+
guaranteed_count = 100000 # 0-99,999
|
| 108 |
+
powers_of_1000 = [10**3, 10**6, 10**9, 10**12, 10**15, 10**18]
|
| 109 |
+
extra_powers = sum(1 for p in powers_of_1000 if p > 99999 and p <= max_int)
|
| 110 |
+
total_guaranteed = guaranteed_count + extra_powers
|
| 111 |
+
print(f"Guaranteed samples: {total_guaranteed:,} (0-99,999 + {extra_powers} powers of 1000)")
|
| 112 |
+
|
| 113 |
# Create model
|
| 114 |
model = NamerTransformer(
|
| 115 |
vocab_size=len(VOCABULARY),
|
| 116 |
+
max_output_len=max_output_len,
|
| 117 |
d_model=128,
|
| 118 |
nhead=4,
|
| 119 |
num_encoder_layers=4,
|
|
|
|
| 137 |
# Save model
|
| 138 |
save_model(trained_model)
|
| 139 |
|
| 140 |
+
# Test predictions across all scales
|
| 141 |
print("\n--- Model Predictions ---")
|
| 142 |
trained_model.eval()
|
| 143 |
|
| 144 |
+
test_numbers = [
|
| 145 |
+
0, 42, 123, 1000, 999999, # Small numbers
|
| 146 |
+
1000000, 999999999, # Millions
|
| 147 |
+
1000000000, 999999999999, # Billions
|
| 148 |
+
1000000000000, 999999999999999, # Trillions
|
| 149 |
+
1000000000000000, # Quadrillion boundary (10**15)
|
| 150 |
+
]
|
| 151 |
+
# Add INT64_MAX if training for that range
|
| 152 |
+
if max_int >= INT64_MAX:
|
| 153 |
+
test_numbers.append(INT64_MAX)
|
| 154 |
+
|
| 155 |
device_obj = next(trained_model.parameters()).device
|
| 156 |
|
| 157 |
with torch.no_grad():
|
| 158 |
for n in test_numbers:
|
| 159 |
+
if n > max_int:
|
| 160 |
+
continue
|
| 161 |
pred = predict_number_name(trained_model, n, device_obj)
|
| 162 |
actual = read_digits(int_to_digits(n))
|
| 163 |
match = "✓" if pred == actual else "✗"
|
| 164 |
+
print(f" {n:,}: pred='{pred}', actual='{actual}' {match}")
|
| 165 |
|
| 166 |
|
| 167 |
def test_command() -> None:
|
|
|
|
| 216 |
train_parser.add_argument(
|
| 217 |
"--lr", type=float, default=0.001, help="Learning rate (default: 0.001)"
|
| 218 |
)
|
| 219 |
+
train_parser.add_argument(
|
| 220 |
+
"--max-int", type=int, default=INT64_MAX,
|
| 221 |
+
help=f"Maximum integer for training (default: {INT64_MAX})"
|
| 222 |
+
)
|
| 223 |
+
train_parser.add_argument(
|
| 224 |
+
"--max-seq-len", type=int, default=25,
|
| 225 |
+
help="Maximum input sequence length (default: 25 for 19 digits)"
|
| 226 |
+
)
|
| 227 |
+
train_parser.add_argument(
|
| 228 |
+
"--max-output-len", type=int, default=35,
|
| 229 |
+
help="Maximum output sequence length (default: 35)"
|
| 230 |
+
)
|
| 231 |
train_parser.set_defaults(
|
| 232 |
func=lambda args: train_command(
|
| 233 |
num_epochs=args.epochs,
|
| 234 |
steps_per_epoch=args.steps,
|
| 235 |
batch_size=args.batch_size,
|
| 236 |
learning_rate=args.lr,
|
| 237 |
+
max_int=args.max_int,
|
| 238 |
+
max_seq_len=args.max_seq_len,
|
| 239 |
+
max_output_len=args.max_output_len,
|
| 240 |
)
|
| 241 |
)
|
| 242 |
|
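The three flags added to the `train` subcommand can be exercised in isolation. The sketch below rebuilds only that part of the parser with the defaults shown in the diff; `build_parser` is an illustrative name, and the rest of the real CLI (other subcommands, `--epochs`, `--steps`, and so on) is omitted.

```python
import argparse

INT64_MAX = 2**63 - 1


def build_parser() -> argparse.ArgumentParser:
    """Minimal parser exposing only the range and length flags of `train`."""
    parser = argparse.ArgumentParser(prog="namer")
    sub = parser.add_subparsers(dest="command", required=True)
    train = sub.add_parser("train")
    train.add_argument("--max-int", type=int, default=INT64_MAX)
    train.add_argument("--max-seq-len", type=int, default=25)
    train.add_argument("--max-output-len", type=int, default=35)
    return parser


# argparse maps "--max-int" to the attribute name "max_int".
args = build_parser().parse_args(["train", "--max-int", "999999999999"])
print(args.max_int, args.max_seq_len, args.max_output_len)
```

Overriding `--max-int` alone keeps the length defaults, which are sized for 19-digit inputs and therefore also cover any smaller range.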