Upload folder using huggingface_hub

Files changed (7)

README.md +143 -0
config.json +18 -0
generation_config.json +11 -0
load_hf_model.py +44 -0
pytorch_model.bin +3 -0
tokenizer.model +3 -0
tokenizer_config.json +9 -0

README.md ADDED

	@@ -0,0 +1,143 @@

+# OpenLLM Small Extended 10k
+This is the OpenLLM small model trained for 10,000 steps on the SQuAD dataset.
+## Model Details
+- **Model Type**: GPT-style transformer (decoder-only)
+- **Training Steps**: 10,000
+- **Parameters**: 35.8M
+- **Vocabulary Size**: 32,000
+- **Context Length**: 1,024 tokens
+- **Architecture**: 6 layers, 8 attention heads, 512 embedding dimension
+## Training Information
+- **Dataset**: SQuAD (Stanford Question Answering Dataset)
+- **Training Data**: ~41k Wikipedia passages
+- **Tokenizer**: SentencePiece BPE with 32k vocabulary
+- **Optimizer**: AdamW
+- **Learning Rate**: 3e-4
+- **Batch Size**: 4 (with gradient accumulation)
+## Performance
+- **Final Loss**: ~5.22
+- **Inference Speed**: ~8.3 tokens/second (CPU)
+- **Memory Usage**: ~143MB for inference
+## Usage
+### Using the Model
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
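+# Load model and tokenizer. The Auto classes only resolve this repository's
+# "gpt" model type if a matching custom model class is registered with transformers.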
+model_name = "lemms/openllm-small-extended-10k"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForCausalLM.from_pretrained(model_name)
+# Generate text
+prompt = "The future of artificial intelligence"
+inputs = tokenizer(prompt, return_tensors="pt")
+with torch.no_grad():
+    outputs = model.generate(
+        inputs["input_ids"],
+        max_length=100,
+        temperature=0.7,
+        do_sample=True,
+        pad_token_id=tokenizer.eos_token_id
+    )
+generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(generated_text)
+```
+### Using the Custom Loader
+```python
+import torch
+from load_hf_model import load_model_manual
+# load_model_manual takes a local directory containing the repository files and
+# returns the raw state dict plus a SentencePiece tokenizer, not a model object
+state_dict, tokenizer = load_model_manual(".")
+# Encode a prompt into token IDs
+prompt = "The history of machine learning"
+input_ids = torch.tensor([tokenizer.encode(prompt)])
+# Generation requires loading state_dict into the OpenLLM GPT model class,
+# which is not part of this repository
+print(f"Loaded {len(state_dict)} tensors; prompt is {input_ids.shape[1]} tokens long")
+print(tokenizer.decode(input_ids[0].tolist()))
+```
+## Model Architecture
+This model follows the standard GPT architecture:
+- **Token Embeddings**: Maps token IDs to dense vectors
+- **Positional Embeddings**: Adds position information
+- **Transformer Blocks**: 6 layers with multi-head attention and feed-forward networks
+- **Layer Normalization**: Pre-norm placement for training stability
+- **Output Head**: Linear projection to vocabulary for next-token prediction
+## Training Details
+The model was trained using:
+- **Framework**: PyTorch
+- **Hardware**: CPU training with gradient accumulation
+- **Regularization**: Dropout (0.1), weight decay
+- **Optimization**: AdamW with cosine learning rate scheduling
+- **Gradient Clipping**: 1.0
+## Limitations
+- This is a small model (35.8M parameters) with limited capacity
+- Training was done on CPU, which limited the training steps
+- Model quality is basic and suitable for educational/research purposes
+- Not suitable for production use without further training
+## License
+This model is dual-licensed:
+- **Open Source**: GPLv3 License
+- **Commercial**: Commercial License available
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@misc{openllm2024,
+  title={OpenLLM: Open Source Large Language Model Framework},
+  author={Louis Chua Bean Chong},
+  year={2024},
+  url={https://github.com/louischua/openllm}
+}
+```
+## Model Card
+- **Developed by**: Louis Chua Bean Chong
+- **Model type**: Language Model
+- **Language(s)**: English
+- **License**: GPLv3 / Commercial
+- **Finetuned from model**: None (trained from scratch)
+- **Training data**: SQuAD dataset
+- **Training procedure**: Causal language modeling (next-token prediction)
+- **Evaluation results**: Basic text generation capability
+## Related Models
+- [lemms/openllm-small-extended-4k](https://huggingface.co/lemms/openllm-small-extended-4k)
+- [lemms/openllm-small-extended-6k](https://huggingface.co/lemms/openllm-small-extended-6k)
+- [lemms/openllm-small-extended-7k](https://huggingface.co/lemms/openllm-small-extended-7k)
+- [lemms/openllm-small-extended-8k](https://huggingface.co/lemms/openllm-small-extended-8k)
+- [lemms/openllm-small-extended-9k](https://huggingface.co/lemms/openllm-small-extended-9k)
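The README's Training Details name AdamW with cosine learning rate scheduling, a learning rate of 3e-4, and 10,000 steps, but give no warmup length or minimum learning rate. The following is a minimal sketch of such a schedule, assuming no warmup and decay to a floor of zero. Both are assumptions for illustration, and `cosine_lr` is a hypothetical helper, not a function from the OpenLLM code.

```python
import math

PEAK_LR = 3e-4        # learning rate stated in the README
TOTAL_STEPS = 10_000  # training steps stated in the README
MIN_LR = 0.0          # assumption: the README does not state a floor


def cosine_lr(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS, min_lr=MIN_LR):
    """Cosine decay from peak_lr at step 0 down to min_lr at total_steps."""
    # Clamp progress to [0, 1] so steps outside the schedule stay at the endpoints.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 2_500, 5_000, 7_500, 10_000):
    print(step, f"{cosine_lr(step):.2e}")
```

Under these assumptions the rate is halved at the midpoint of training and reaches the floor at the final step.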

config.json ADDED

	@@ -0,0 +1,18 @@

+{
+  "architectures": [
+    "GPTModel"
+  ],
+  "model_type": "gpt",
+  "vocab_size": 32000,
+  "n_layer": 6,
+  "n_head": 8,
+  "n_embd": 512,
+  "block_size": 1024,
+  "dropout": 0.1,
+  "bias": true,
+  "torch_dtype": "float32",
+  "transformers_version": "4.0.0",
+  "openllm_version": "0.1.0",
+  "training_steps": 10000,
+  "model_size": "small"
+}
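The 35.8M parameter figure in the README can be checked against the values in this config. The sketch below assumes a GPT-2-style layout: learned positional embeddings, a fused query/key/value projection, a 4x feed-forward expansion, bias terms on every linear and LayerNorm layer (the config sets `"bias": true`), and an output head that shares weights with the token embedding. None of these layout details are confirmed by the commit; they are assumptions under which the arithmetic happens to match.

```python
config = {"vocab_size": 32000, "n_layer": 6, "n_head": 8, "n_embd": 512, "block_size": 1024}


def count_parameters(cfg):
    """Count parameters for an assumed GPT-2-style decoder with tied output head."""
    d = cfg["n_embd"]
    token_emb = cfg["vocab_size"] * d          # also serves as the output head (assumed tied)
    pos_emb = cfg["block_size"] * d            # learned positional embeddings
    layer_norm = 2 * d                         # scale and bias
    attn = (d * 3 * d + 3 * d) + (d * d + d)   # fused QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)  # up-projection + down-projection
    block = 2 * layer_norm + attn + mlp
    # One final LayerNorm follows the last block.
    return token_emb + pos_emb + cfg["n_layer"] * block + layer_norm


total = count_parameters(config)
print(f"{total:,} parameters ({total / 1e6:.1f}M), ~{total * 4 / 1e6:.0f} MB in float32")
# prints: 35,823,616 parameters (35.8M), ~143 MB in float32
```

The float32 footprint of roughly 143 MB also agrees with the README's stated inference memory usage.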

generation_config.json ADDED

	@@ -0,0 +1,11 @@

+{
+  "max_length": 512,
+  "max_new_tokens": 256,
+  "temperature": 0.7,
+  "top_k": 40,
+  "top_p": 0.9,
+  "do_sample": true,
+  "pad_token_id": 0,
+  "eos_token_id": 1,
+  "bos_token_id": 2
+}
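The `temperature`, `top_k`, and `top_p` values above control how the next token is drawn from the model's output distribution. The following pure-Python sketch illustrates the general technique on a plain list of logits. It is not the transformers implementation, whose handling of the top-p cutoff differs in detail, and `sample_next_token` is a hypothetical name used only here.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=40, top_p=0.9, rng=random):
    """Sample one token id from raw logits using temperature, top-k, then top-p."""
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring token ids, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the maximum for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept_ids, kept_probs, cumulative = [], [], 0.0
    for token_id, p in zip(ranked, probs):
        kept_ids.append(token_id)
        kept_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break
    # random.choices normalises the remaining weights internally.
    return rng.choices(kept_ids, weights=kept_probs, k=1)[0]


# One dominant logit: the top-p cutoff leaves a single candidate.
print(sample_next_token([10.0, 0.0, 0.0, 0.0]))
```

With `do_sample` set to true these filters apply at every generation step; with it set to false, decoding is greedy and they have no effect.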

load_hf_model.py ADDED

	@@ -0,0 +1,44 @@

+#!/usr/bin/env python3
+"""
+Hugging Face Compatible Loader for OpenLLM
+Usage:
+    # Using transformers library (if you implement custom model class)
+    # from transformers import AutoModel, AutoTokenizer
+    # model = AutoModel.from_pretrained(".")
+    # tokenizer = AutoTokenizer.from_pretrained(".")
+    # Manual loading
+    from load_hf_model import load_model_manual
+    model, tokenizer = load_model_manual(".")
+"""
+import json
+from pathlib import Path
+
+import sentencepiece as spm
+import torch
+
+
+def load_model_manual(model_dir="."):
+    """Load the raw state dict and SentencePiece tokenizer from a local directory."""
+    model_dir = Path(model_dir)
+    # Load config
+    with open(model_dir / "config.json", "r", encoding="utf-8") as f:
+        config = json.load(f)
+    # Load model weights onto the CPU (this is a state dict, not a model object)
+    state_dict = torch.load(model_dir / "pytorch_model.bin", map_location="cpu")
+    # Load tokenizer
+    tokenizer = spm.SentencePieceProcessor()
+    tokenizer.load(str(model_dir / "tokenizer.model"))
+    print(f"Loaded model: {config['model_type']} with {config['n_layer']} layers")
+    print(f"Vocabulary size: {config['vocab_size']}")
+    return state_dict, tokenizer
+
+
+if __name__ == "__main__":
+    state_dict, tokenizer = load_model_manual()
+    # len(state_dict) counts tensors, not individual parameters
+    print(f"Model weights loaded: {len(state_dict)} tensors")
+    print(f"Tokenizer vocabulary: {tokenizer.vocab_size()}")

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f826631e0861e3069409a6afb41c577372361c7389440bab45734de046d0f5da
+size 168490621

tokenizer.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6efb1da9b0e667cee37b23f4240e0bd34fbfb20e1faebcb8d299a7598c0635f3
+size 547695
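The two binary files are stored as Git LFS pointers: three `key value` lines giving the spec version, a SHA-256 object ID, and the size in bytes. A downloaded file can be checked against its pointer roughly as follows. This is a sketch using only the standard library; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, not part of any Git LFS tooling.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into version, hash algorithm, digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file has the size and SHA-256 digest recorded in the pointer."""
    hasher = hashlib.sha256()  # assumes the pointer's algorithm is sha256
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so large weight files are never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:6efb1da9b0e667cee37b23f4240e0bd34fbfb20e1faebcb8d299a7598c0635f3\n"
    "size 547695\n"
)
print(pointer["algorithm"], pointer["size"])
```

Calling `matches_pointer("tokenizer.model", pointer)` on a local copy would then confirm that the download is complete and uncorrupted.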

tokenizer_config.json ADDED

	@@ -0,0 +1,9 @@

+{
+  "tokenizer_class": "SentencePieceTokenizer",
+  "model_max_length": 1024,
+  "vocab_size": 32000,
+  "unk_token": "<unk>",
+  "bos_token": "<s>",
+  "eos_token": "</s>",
+  "pad_token": "<pad>"
+}