--- language: - en license: - gpl-3.0 - other tags: - text-generation - pytorch - causal-lm - openllm - gpt - language-model datasets: - squad metrics: - perplexity - loss pipeline_tag: text-generation model-index: - name: OpenLLM Small Extended 10k results: - task: type: text-generation dataset: type: squad name: SQUAD metrics: - type: loss value: 5.22 - type: perplexity value: 184.5 --- # OpenLLM Small Extended 10k This is the OpenLLM small model trained for 10,000 steps on the SQUAD dataset. ## Model Details - **Model Type**: GPT-style transformer (decoder-only) - **Training Steps**: 10,000 - **Parameters**: 35.8M - **Vocabulary Size**: 32,000 - **Context Length**: 1,024 tokens - **Architecture**: 6 layers, 8 attention heads, 512 embedding dimension ## Training Information - **Dataset**: SQUAD (Stanford Question Answering Dataset) - **Training Data**: ~41k Wikipedia passages - **Tokenizer**: SentencePiece BPE with 32k vocabulary - **Optimizer**: AdamW - **Learning Rate**: 3e-4 - **Batch Size**: 4 (with gradient accumulation) ## Performance - **Final Loss**: ~5.22 - **Inference Speed**: ~8.3 tokens/second (CPU) - **Memory Usage**: ~143MB for inference ## Usage ### Using the Model This model uses a custom configuration format and requires the OpenLLM framework to load properly. ```python # Load using the OpenLLM framework from core.src.model import GPTModel import json import torch # Load configuration with open("config.json", "r") as f: config = json.load(f) # Create model instance model = GPTModel(config["model_config"]) # Load trained weights model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu")) # Load tokenizer import sentencepiece as spm tokenizer = spm.SentencePieceProcessor() tokenizer.load("tokenizer.model") # Generate text prompt = "The future of artificial intelligence" tokens = tokenizer.encode(prompt) inputs = torch.tensor([tokens], dtype=torch.long) with torch.no_grad(): outputs = model.generate( inputs, max_length=100, temperature=0.7 ) generated_text = tokenizer.decode(outputs[0].tolist()) print(generated_text) ``` ### Using the Custom Loader ```python from load_hf_model import load_model_and_tokenizer # Load model using custom loader model, tokenizer = load_model_and_tokenizer("lemms/openllm-small-extended-10k") # Generate text prompt = "The history of machine learning" tokens = tokenizer.encode(prompt) inputs = torch.tensor([tokens], dtype=torch.long) with torch.no_grad(): outputs = model.generate( inputs, max_length=100, temperature=0.7 ) print(tokenizer.decode(outputs[0].tolist())) ``` ## Model Architecture This model follows the standard GPT architecture: - **Token Embeddings**: Maps token IDs to dense vectors - **Positional Embeddings**: Adds position information - **Transformer Blocks**: 6 layers with multi-head attention and feed-forward networks - **Layer Normalization**: Pre-norm placement for training stability - **Output Head**: Linear projection to vocabulary for next-token prediction ## Training Details The model was trained using: - **Framework**: PyTorch - **Hardware**: CPU training with gradient accumulation - **Regularization**: Dropout (0.1), weight decay - **Optimization**: AdamW with cosine learning rate scheduling - **Gradient Clipping**: 1.0 ## Limitations - This is a small model (35.8M parameters) with limited capacity - Training was done on CPU, which limited the training steps - Model quality is basic and suitable for educational/research purposes - Not suitable for production use without further training ## License This model is dual-licensed: - **Open Source**: GPLv3 License - **Commercial**: Commercial License available ## Citation If you use this model in your research, please cite: ```bibtex @misc{openllm2024, title={OpenLLM: Open Source Large Language Model Framework}, author={Louis Chua Bean Chong}, year={2024}, url={https://github.com/louischua/openllm} } ``` ## Model Card - **Developed by**: Louis Chua Bean Chong - **Model type**: Language Model - **Language(s)**: English - **License**: GPLv3 / Commercial - **Finetuned from model**: Trained from scratch - **Training data**: SQUAD dataset - **Training procedure**: Supervised learning - **Evaluation results**: Basic text generation capability ## Related Models - [lemms/openllm-small-extended-4k](https://huggingface.co/lemms/openllm-small-extended-4k) - [lemms/openllm-small-extended-6k](https://huggingface.co/lemms/openllm-small-extended-6k) - [lemms/openllm-small-extended-7k](https://huggingface.co/lemms/openllm-small-extended-7k) - [lemms/openllm-small-extended-8k](https://huggingface.co/lemms/openllm-small-extended-8k) - [lemms/openllm-small-extended-9k](https://huggingface.co/lemms/openllm-small-extended-9k)