---
license: apache-2.0
datasets:
  - openbmb/Ultra-FineWeb
  - Avelina/python-edu-cleaned
  - HuggingFaceTB/finemath
language:
  - en
  - zh
pipeline_tag: text-generation
---

# MiniModel-200M-Base

## Model Details

### Model Description

- **Developed by:** xTimeCrystal
- **Model type:** Softmax self-attention decoder-only transformer
- **Language(s) (NLP):** English, Chinese, Python
- **License:** Apache 2.0

This 200M-parameter decoder-only transformer was trained with state-of-the-art techniques on only 10B tokens, on a single RTX 5090 GPU, in just one day. As the examples below show, the model is competent across diverse styles and can generate text on a wide range of topics, from pure recall to article writing.

The main techniques used were:

- **Adaptive Muon optimizer:** based on the Muon optimizer, this provides exceptional data efficiency (roughly 2.1x that of AdamW). The momentum buffer can also be stored in bf16, further lowering VRAM usage.

- **Aggressive data filtering:** by selecting a carefully curated set of educational content and building on Muon's data efficiency, we were able to significantly improve capabilities in resource-constrained environments.

- **Efficient data bin-packing:** because softmax self-attention relies heavily on attention sinks, every sequence started with the start token (`<s>`) and was truncated at 2048 tokens. However, this left over 70% of the processed data as padding. To alleviate this, we used a simple bin-packing algorithm that concatenates sequences so that their combined lengths are close to 2048; after this operation, less than 5% of tokens were padding.

- **Float8 pretraining:** this model was pretrained using bf16 master weights, fp8 (e4m3) casting with bf16 accumulation, and a full bf16 backward pass. However, quantizing the attention mechanism significantly hurt the loss, so attention was kept in bf16. This setup matched the performance of full bf16 training with significantly reduced VRAM usage (~30% decrease) and much higher throughput (~20% increase).

- **ReLU^2 activation:** this ultra-sparse activation function outperformed SwiGLU ([1], [2]) while requiring only 2 matmuls, slightly improving throughput.

- **Full attention:** when pretraining small models, every layer is precious, so we use full attention (no SWA, no GQA) for all layers in the model.

- **QK norm without scalars:** this improved stability, as the additional learnable scalars caused loss spikes and massive attention activations.

Overall, these techniques allowed the model to be trained for 110k steps, with no quality loss relative to full bf16 training, using a massive batch size of 64 x 2048 tokens without gradient accumulation, while fitting in under 30GB of VRAM and remaining completely free of loss spikes:
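The bin-packing step described above can be sketched with a simple greedy first-fit pass. This is illustrative only: the exact algorithm used during training is not published, and `pack_sequences` and `pad_id` are hypothetical names.

```python
def pack_sequences(sequences, max_len=2048, pad_id=0):
    """Greedily concatenate tokenized sequences into bins of at most max_len.

    Each sequence is assumed to already begin with the '<s>' start token.
    Bins are padded up to max_len at the end; a good packing leaves only a
    small fraction of tokens as padding.
    """
    bins = []
    # Longest-first ordering tends to pack bins more tightly
    for seq in sorted(sequences, key=len, reverse=True):
        seq = seq[:max_len]  # truncate overly long sequences
        for b in bins:
            if len(b) + len(seq) <= max_len:
                b.extend(seq)  # first bin with enough room
                break
        else:
            bins.append(list(seq))  # no bin fits: open a new one
    return [b + [pad_id] * (max_len - len(b)) for b in bins]
```

A first-fit-decreasing heuristic like this is not optimal, but it is fast and typically good enough to drive the padding fraction down sharply.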

*(Figure: training loss curve over the 110k-step run, with no loss spikes.)*
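For reference, the ReLU^2 feed-forward block and scalar-free QK norm can be sketched as follows. This is a minimal illustration, not the model's actual code: `ReLU2MLP` and `qk_norm` are hypothetical names, and the tensor layout is assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReLU2MLP(nn.Module):
    """Feed-forward block with the ReLU^2 activation: relu(x)**2.

    Unlike SwiGLU, this needs only two matmuls (no gating projection).
    """

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.down(F.relu(self.up(x)).square())


def qk_norm(q, k, eps=1e-6):
    """RMS-normalize queries and keys without learnable scalars.

    Expects tensors shaped (..., head_dim); normalization is over head_dim.
    """
    q = q * torch.rsqrt(q.pow(2).mean(-1, keepdim=True) + eps)
    k = k * torch.rsqrt(k.pow(2).mean(-1, keepdim=True) + eps)
    return q, k
```

Dropping the learnable per-head scales keeps query/key magnitudes pinned at unit RMS, which is the stability property the QK-norm bullet above relies on.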

## Uses

### How to Get Started with the Model

Download the files from the repo and place them all in the same folder, then run the cells in the notebook.

Import the model using this cell:

```python
import torch
from safetensors import safe_open

from model import Transformer as Model
from transformers import PreTrainedTokenizerFast

config = {
    'layers': 24,
    'num_heads': 12,
    'vocab_size': 32768,
    'input_dims': 768,
    'hidden_dims': 3072,
}

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.set_default_device(device)

model = Model(**config)
model.zero_grad()
model.bfloat16()

# Load the pretrained weights from the safetensors checkpoint
saved_states = {}
with safe_open("./model.safetensors", framework="pt", device=device) as f:
    for key in f.keys():
        saved_states[key] = f.get_tensor(key)
model.load_state_dict(saved_states)

model.eval()

tokenizer = PreTrainedTokenizerFast.from_pretrained("./")
```
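As a sanity check, the parameter count implied by this config can be estimated by hand, assuming tied input/output embeddings, no biases, and standard attention/MLP projections (the real architecture may differ in detail):

```python
config = {
    'layers': 24,
    'num_heads': 12,
    'vocab_size': 32768,
    'input_dims': 768,
    'hidden_dims': 3072,
}

d, h = config['input_dims'], config['hidden_dims']
emb = config['vocab_size'] * d         # tied token embedding / LM head
attn = 4 * d * d                       # Q, K, V, O projections per layer
mlp = 2 * d * h                        # up and down projections per layer
total = emb + config['layers'] * (attn + mlp)
print(f"~{total / 1e6:.0f}M parameters")  # ~195M, close to the nominal 200M
```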

Then prompt it to generate a function to compute the n-th Fibonacci number.

```python
tokens = tokenizer('''def fibonacci(n: int):''')['input_ids']

current = tokenizer.decode(tokens)
print(current, end="")

temperature = 1e-4  # near-greedy sampling

with torch.no_grad():
    for _ in range(128):
        tok = torch.tensor(tokens).reshape(1, -1)
        logits = model(tok)

        # Sample the next token from the temperature-scaled distribution
        probs = torch.softmax(logits[:, -1].float() / temperature, dim=-1).squeeze()
        nxt = torch.multinomial(probs, num_samples=1).item()
        tokens += [nxt]

        # Print only the newly generated text
        print(tokenizer.decode(tokens).replace(current, "", 1), end="")
        current = tokenizer.decode(tokens)
```

You should see output like this:

```
<s> def fibonacci(n: int):
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)


def fibonacci_recursive(n: int):
    if n < 2:
        return n
    return fibonacci_recursive(n - 1) + fibonacci_recursive(n - 2)


def fibonacci_iterative(n: int):
    if n < 2:
        return n
    return fibonacci_iterative
```
Increase the temperature to reduce repetition, and try different prompts to get other interesting results.
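Temperature simply rescales the logits before the softmax: values well below 1 concentrate probability on the top token (near-greedy decoding), while larger values flatten the distribution and increase diversity. A quick standalone illustration:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.0])

# Very low temperature: essentially all mass on the largest logit
sharp = torch.softmax(logits / 0.0001, dim=-1)

# Higher temperature: flatter distribution, more diverse sampling
flat = torch.softmax(logits / 2.0, dim=-1)

print(sharp)  # almost one-hot on the first entry
print(flat)   # probabilities spread across all entries
```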

For example, if we prompt it to generate the digits of pi (using '''Digits of pi:''' as the seed), we see it recite the first 20(!) digits of pi correctly. (temperature=0.0001)

```
<s> Digits of pi:

# What is the value of pi?

## What is the value of pi?

The value of pi is 3.14159265358979323846273284627328462732846273284627328462732846273284627328462732846273284627328462
```

We can also get it to recall the chemical formula of water with the seed prompt '''The chemical formula of water is'''. Setting the temperature to a higher value prevents repetition, and we also get to see the model's writing ability. (temperature=0.7)

```
<s> The chemical formula for water is H2O. What does it mean?
The chemical formula for water is H2O. What does it mean?
Water is the purest liquid on Earth. It is the basis of life. Water is found in the soil, rivers, lakes, oceans, and the ocean. Water is also found in our bodies. Water is found in everything we take in on a daily basis. Water is essential for life. Water is found in the cells, tissues, and organs of all living things. Water is a key element of life because it enables the creation and maintenance of the various chemical and physical processes
```

Impressive! Now let's test it on a classic example, '''The purpose of life'''. What does it think the purpose of life is? (temperature=0.8)

```
<s> The purpose of life is to build up the body’s strength, endurance, and energy reserves through the accumulation of acquired skills, and to get rid of worn or damaged parts of the body. All of this depends on day’s activities and deeds. The process of building up the body and taking on new challenges, such as accumulating health, will require the use of skills and abilities.
The main purpose of building up skills and abilities in life is to make new people capable of doing the things that they need to do. This process requires you to develop skills that are applicable to everyday life. Skills can either be formal, or in the
```

Finally, let's see if it can write a section on Charles Darwin. (temperature=0.8)

```
<s> Charles Darwin: The Origin of Species
Suggested Citation: (Author's description, 2007-08-02)
In the early 1900s the scientific community found that Darwin's theories would provide a mechanism for the further evolutionary history of living beings, assuming there was not been a series of intelligent, representative "histors" of life. Through careful research, Darwin's theory of the Origin of Species proved to be compatible with a single evolutionary process, that of speciation (Darwin, 1886). In other words, Darwin had
```

Unfortunately, Charles Darwin died before 1886, but at least it got the name of the book correct!

## Bias, Risks, and Limitations

Sadly, a 200M model is not perfect. For example, when prompted with '''The radius of the Earth''', it generates the text:

```
<s> The radius of the Earth is a measure of almost exactly 375,000 miles.
Scientists have long wondered what the planet was like long ago. Because of how old the Earth is—that is, the oldest part of it—we know that the Earth’s radius is about 670,000 miles. A few years later, an Australian astronomer determined that our planet’s mass is 12 times as massive as Earth.
Now, using data from NASA’s GOES-1 spacecraft, the Australian astronomer believes that we are at least 34 billion years old and possesses
```

which is off by a factor of around 100. Thus, despite impressive factuality in some areas and powerful language capabilities, all outputs from this model should be reviewed carefully.

## Citation

If you use this model in research, please cite:

```bibtex
@misc{timecrystal200m2025,
  title={MiniModel-200M-Base: SOTA Efficiency for Small Language Models},
  author={xTimeCrystal},
  year={2025},
  howpublished={\url{https://huggingface.co/xTimeCrystal/MiniModel-200M-Base}},
}
```