OpenLLM Small Extended 6k

OpenLLM Small Extended 6k is a GPT-style language model trained for 6,000 steps on Wikipedia passages from the SQuAD dataset.

Model Details

Model Type: GPT-style Transformer
Architecture: Small (35.8M parameters)
Training Steps: 6,000
Training Data: ~41k Wikipedia passages from the SQuAD dataset
Tokenizer: SentencePiece BPE (32k vocabulary)
License: GPL-3.0 (Open Source) / Commercial License available

Model Performance

Final Training Loss: 5.4302
Model Parameters: 35,823,616
Context Length: 512 tokens
Training Hardware: CPU/GPU compatible

Usage

Using Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "lemms/openllm-small-extended-6k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
prompt = "The history of artificial intelligence"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=50,
        temperature=0.7,
        top_k=40,
        do_sample=True
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
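
To make the temperature and top_k arguments above more concrete, here is a minimal pure-Python sketch of temperature-scaled top-k sampling over a list of logits. It is illustrative only and is not the Transformers implementation; the function name sample_top_k and its signature are assumptions made for this example.

```python
import math
import random


def sample_top_k(logits, temperature=0.7, top_k=40, rng=random):
    """Sample one token index from raw logits using temperature and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the indices of the top_k highest-scoring tokens.
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Divide by the temperature, then apply a numerically stable softmax.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]


# With top_k=1 the sampler reduces to greedy decoding (argmax).
print(sample_top_k([0.1, 2.0, -1.0], top_k=1))
```

Lower temperatures concentrate probability on the highest-scoring tokens, while a smaller top_k removes the low-probability tail before sampling.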

Using the Custom Loader

# Use the provided load_hf_model.py script
from load_hf_model import load_model_and_tokenizer

model, tokenizer = load_model_and_tokenizer()
# ... rest of usage

Training Details

This model was trained using the OpenLLM training pipeline:

Data Preparation: SQuAD dataset processing (~41k passages)
Tokenizer Training: SentencePiece BPE with 32k vocabulary
Model Training: GPT-style transformer for 6,000 steps
Evaluation: Perplexity and text generation quality assessment
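
The perplexity in the evaluation step is conventionally the exponential of the mean per-token cross-entropy loss, measured in nats. This card does not state the exact formula used, so the sketch below assumes that convention, and the helper names are illustrative. Under that assumption, the final training loss of 5.4302 corresponds to a perplexity of roughly 228, and the self-reported perplexity of 816.040 corresponds to a loss of roughly 6.70.

```python
import math


def mean_nll(token_log_probs):
    """Mean negative log-likelihood from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_log_probs) / len(token_log_probs)


def perplexity_from_loss(loss):
    """Perplexity as exp(mean cross-entropy in nats); assumed convention."""
    return math.exp(loss)


def loss_from_perplexity(perplexity):
    """Inverse of perplexity_from_loss."""
    return math.log(perplexity)


print(round(perplexity_from_loss(5.4302), 1))   # training loss -> perplexity
print(round(loss_from_perplexity(816.040), 2))  # reported perplexity -> loss
```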

Model Architecture

Layers: 12 transformer layers
Attention Heads: 12
Hidden Size: 768
Intermediate Size: 3072
Activation: GELU
Layer Norm: Pre-norm
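
As a sanity check on configurations like the one above, the following sketch estimates the parameter count of a GPT-2-style decoder. The layout is an assumption, not something this card confirms: learned absolute position embeddings, biases on every linear layer, two layer norms per block plus a final one, and an output head tied to the token embeddings. Under these assumptions the listed dimensions give about 110M parameters rather than the reported 35,823,616, so the actual layout presumably differs from this sketch.

```python
def estimate_gpt_params(vocab_size, context_length, hidden_size, num_layers,
                        intermediate_size, tie_lm_head=True):
    """Estimate the parameter count of a GPT-2-style decoder (assumed layout)."""
    d, ff = hidden_size, intermediate_size
    token_emb = vocab_size * d
    pos_emb = context_length * d                    # learned positions (assumed)
    layer_norm = 2 * d                              # scale + bias
    attention = (d * 3 * d + 3 * d) + (d * d + d)   # fused QKV + output projection
    mlp = (d * ff + ff) + (ff * d + d)              # up-projection + down-projection
    per_layer = 2 * layer_norm + attention + mlp
    total = token_emb + pos_emb + num_layers * per_layer + layer_norm  # + final norm
    if not tie_lm_head:
        total += vocab_size * d                     # separate output projection
    return total


# Dimensions as listed in this card: 32k vocabulary, 512-token context.
print(f"{estimate_gpt_params(32_000, 512, 768, 12, 3072):,}")
```

The number of attention heads does not enter the count, because the heads split the hidden dimension rather than adding to it.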

Limitations

Training Data: Limited to ~41k Wikipedia passages from the SQuAD dataset
Context Length: 512 tokens maximum
Model Size: Small model with 35.8M parameters
Performance: Basic text generation only (self-reported perplexity of 816.040 on SQuAD Wikipedia passages)
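
One practical consequence of the 512-token limit is that long prompts need trimming before generation. The sketch below assumes the window covers prompt and generated tokens together, as is typical for GPT-style models, and keeps the most recent prompt tokens. The helper name fit_to_context is hypothetical, and the card does not say how the model itself handles over-length input.

```python
def fit_to_context(token_ids, max_new_tokens=50, context_length=512):
    """Keep the most recent prompt tokens so prompt + generation fits the window."""
    if not 0 <= max_new_tokens < context_length:
        raise ValueError("max_new_tokens must be in [0, context_length)")
    # Number of prompt tokens that can be kept alongside the generated ones.
    budget = context_length - max_new_tokens
    return list(token_ids[-budget:])


long_prompt = list(range(1000))            # stand-in for real token ids
trimmed = fit_to_context(long_prompt)
print(len(trimmed))                        # 512 - 50 prompt tokens remain
```

Keeping the tail rather than the head preserves the text closest to where generation continues.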

License

This model is dual-licensed:

Open Source: GPL-3.0 for research and community use
Commercial: Commercial license available for enterprise use

For commercial licensing, contact: louischua@gmail.com

Citation

If you use this model in your research, please cite:

@misc{openllm2024,
  title={OpenLLM: Open Source Large Language Model},
  author={Louis Chua Bean Chong},
  year={2024},
  url={https://github.com/louischua/openllm}
}

Evaluation Results

Perplexity (SQuAD Wikipedia passages, self-reported): 816.040
Training Loss (SQuAD Wikipedia passages, self-reported): 5.430