lemms committed · verified
Commit d66fb15 · 1 Parent(s): 933557f

Add OpenLLM Small Extended 6k model


OpenLLM Small Extended model trained for 6,000 steps.

- Model: GPT-style transformer (35.8M parameters)
- Training: 6,000 steps on Wikipedia passages from the SQuAD dataset
- Tokenizer: SentencePiece BPE (32k vocabulary)
- License: GPL-3.0 / Commercial license available

For more details, see: https://github.com/louischua/openllm

README.md ADDED
@@ -0,0 +1,111 @@
+ # OpenLLM Small Extended 6k
+
+ This is the OpenLLM Small Extended model trained for 6,000 steps on Wikipedia passages from the SQuAD dataset.
+
+ ## Model Details
+
+ - **Model Type:** GPT-style Transformer
+ - **Architecture:** Small (35.8M parameters)
+ - **Training Steps:** 6,000
+ - **Training Data:** ~41k Wikipedia passages from the SQuAD dataset
+ - **Tokenizer:** SentencePiece BPE (32k vocabulary)
+ - **License:** GPL-3.0 (Open Source) / Commercial License available
+
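+ The tokenizer ships as a plain SentencePiece model (`tokenizer.model`), so it can also be used directly with the `sentencepiece` library. A minimal sketch — the file path assumes you are in a local copy of this repository, and the special tokens are the ones declared in `tokenizer_config.json`:
+
+ ```python
+ import sentencepiece as spm
+
+ # Load the SentencePiece model shipped with this repository
+ sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
+
+ text = "The history of artificial intelligence"
+ ids = sp.encode(text, out_type=int)       # token IDs
+ pieces = sp.encode(text, out_type=str)    # subword pieces
+ print(ids)
+ print(pieces)
+ print(sp.decode(ids))                     # round-trips back to the text
+
+ # Special tokens declared in tokenizer_config.json
+ print(sp.piece_to_id("<unk>"), sp.piece_to_id("<s>"), sp.piece_to_id("</s>"))
+ ```
+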
+ ## Model Performance
+
+ - **Final Training Loss:** 5.4302
+ - **Model Parameters:** 35,823,616
+ - **Context Length:** 512 tokens
+ - **Training Hardware:** CPU/GPU compatible
+
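+ For context, if the reported value is the usual per-token cross-entropy in nats, a final loss of 5.4302 corresponds to a training perplexity of roughly exp(5.4302) ≈ 228:
+
+ ```python
+ import math
+
+ final_loss = 5.4302          # final training loss reported above
+ print(math.exp(final_loss))  # ≈ 228 training perplexity (assuming nats per token)
+ ```
+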
+ ## Usage
+
+ ### Using Transformers
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+
+ # Load model and tokenizer
+ model_name = "lemms/openllm-small-extended-6k"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name)
+
+ # Generate text
+ prompt = "The history of artificial intelligence"
+ inputs = tokenizer(prompt, return_tensors="pt")
+
+ with torch.no_grad():
+     outputs = model.generate(
+         inputs.input_ids,
+         max_new_tokens=50,
+         temperature=0.7,
+         top_k=40,
+         do_sample=True
+     )
+
+ generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(generated_text)
+ ```
+
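+ The sampling arguments above (and the defaults in `generation_config.json`: temperature 0.7, top_k 40, top_p 0.9) control how the next token is drawn from the model's output logits. A minimal, illustrative sketch of that sampling step — this is not the exact code path inside `generate()`, just the idea:
+
+ ```python
+ import torch
+
+ def sample_next_token(logits, temperature=0.7, top_k=40, top_p=0.9):
+     """Illustrative temperature / top-k / top-p sampling over 1-D logits."""
+     logits = logits / temperature                     # sharpen or flatten the distribution
+     topk_vals, topk_idx = torch.topk(logits, top_k)   # keep only the k best tokens (sorted)
+     probs = torch.softmax(topk_vals, dim=-1)
+     cumulative = torch.cumsum(probs, dim=-1)
+     keep = cumulative <= top_p                        # nucleus: smallest prefix covering top_p
+     keep[0] = True                                    # always keep the most likely token
+     kept = probs[keep] / probs[keep].sum()            # renormalize the kept probabilities
+     choice = torch.multinomial(kept, num_samples=1)
+     return topk_idx[keep][choice]
+
+ # Example with random logits over the 32k vocabulary
+ next_id = sample_next_token(torch.randn(32000))
+ print(int(next_id))
+ ```
+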
+ ### Using the Custom Loader
+
+ ```python
+ # Use the provided load_hf_model.py script
+ from load_hf_model import load_model_manual
+
+ state_dict, tokenizer = load_model_manual(".")
+ # ... rest of usage
+ ```
+
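+ Note that `load_model_manual()` returns the raw `state_dict` and a SentencePiece processor rather than a ready-to-run model, so a model class matching `config.json` is still needed. A hedged sketch of that last step — `GPTModel` is the class name listed in `config.json`, but its constructor lives in the OpenLLM GitHub repository and the keyword arguments below are only a guess based on the config keys:
+
+ ```python
+ import json
+ from load_hf_model import load_model_manual
+ # from model import GPTModel  # hypothetical import of the OpenLLM model class
+
+ state_dict, tokenizer = load_model_manual(".")
+ with open("config.json") as f:
+     config = json.load(f)
+
+ # Hypothetical wiring; adjust to the actual GPTModel constructor.
+ # model = GPTModel(
+ #     vocab_size=config["vocab_size"], n_layer=config["n_layer"],
+ #     n_head=config["n_head"], n_embd=config["n_embd"], block_size=config["block_size"],
+ # )
+ # model.load_state_dict(state_dict)
+ # model.eval()
+ ```
+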
+ ## Training Details
+
+ This model was trained using the OpenLLM training pipeline:
+
+ 1. **Data Preparation:** SQuAD dataset processing (~41k passages)
+ 2. **Tokenizer Training:** SentencePiece BPE with 32k vocabulary (see the sketch below)
+ 3. **Model Training:** GPT-style transformer for 6,000 steps
+ 4. **Evaluation:** Perplexity and text generation quality assessment
+
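+ As a rough illustration of step 2, a 32k-vocabulary SentencePiece BPE tokenizer can be trained with a call like the following. The corpus path is a placeholder for the prepared SQuAD passages, and the exact flags used by the OpenLLM pipeline may differ:
+
+ ```python
+ import sentencepiece as spm
+
+ # Train a 32k BPE tokenizer on a plain-text corpus (one passage per line).
+ # "corpus.txt" is a placeholder for the prepared SQuAD passages.
+ spm.SentencePieceTrainer.train(
+     input="corpus.txt",
+     model_prefix="tokenizer",   # writes tokenizer.model / tokenizer.vocab
+     vocab_size=32000,
+     model_type="bpe",
+ )
+ ```
+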
+ ## Model Architecture
+
+ - **Layers:** 6 transformer layers
+ - **Attention Heads:** 8
+ - **Hidden Size:** 512
+ - **Intermediate Size:** 2048
+ - **Activation:** GELU
+ - **Layer Norm:** Pre-norm
+
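+ These figures follow `config.json` in this repository (`n_layer: 6`, `n_head: 8`, `n_embd: 512`, 32k vocabulary), with the intermediate size assumed to be the usual 4× hidden expansion. A quick back-of-the-envelope check against the reported 35,823,616 parameters:
+
+ ```python
+ n_layer, n_embd, vocab_size, block_size = 6, 512, 32000, 1024
+
+ embeddings = vocab_size * n_embd + block_size * n_embd  # token + position embeddings
+ attention  = 4 * n_embd * n_embd                        # Q, K, V and output projections
+ mlp        = 2 * n_embd * (4 * n_embd)                  # up- and down-projection
+ total      = embeddings + n_layer * (attention + mlp)
+
+ # 35,782,656 — biases and LayerNorm weights add the remaining ~41k to reach 35,823,616
+ print(f"{total:,}")
+ ```
+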
+ ## Limitations
+
+ - **Training Data:** Limited to Wikipedia passages
+ - **Context Length:** 512 tokens maximum
+ - **Model Size:** Small model with 35.8M parameters
+ - **Performance:** Basic text generation capabilities
+
+ ## License
+
+ This model is dual-licensed:
+ - **Open Source:** GPL-3.0 for research and community use
+ - **Commercial:** Commercial license available for enterprise use
+
+ For commercial licensing, contact: louischua@gmail.com
+
+ ## Citation
+
+ If you use this model in your research, please cite:
+
+ ```bibtex
+ @misc{openllm2024,
+   title={OpenLLM: Open Source Large Language Model},
+   author={Louis Chua Bean Chong},
+   year={2024},
+   url={https://github.com/louischua/openllm}
+ }
+ ```
+
+ ## Links
+
+ - **Repository:** https://github.com/louischua/openllm
+ - **Documentation:** https://github.com/louischua/openllm/docs
+ - **Training Pipeline:** https://github.com/louischua/openllm/docs/training_pipeline.md
config.json ADDED
@@ -0,0 +1,18 @@
+ {
+   "architectures": [
+     "GPTModel"
+   ],
+   "model_type": "gpt",
+   "vocab_size": 32000,
+   "n_layer": 6,
+   "n_head": 8,
+   "n_embd": 512,
+   "block_size": 1024,
+   "dropout": 0.1,
+   "bias": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.0.0",
+   "openllm_version": "0.1.0",
+   "training_steps": 6000,
+   "model_size": "small"
+ }
generation_config.json ADDED
@@ -0,0 +1,11 @@
+ {
+   "max_length": 512,
+   "max_new_tokens": 256,
+   "temperature": 0.7,
+   "top_k": 40,
+   "top_p": 0.9,
+   "do_sample": true,
+   "pad_token_id": 0,
+   "eos_token_id": 1,
+   "bos_token_id": 2
+ }
load_hf_model.py ADDED
@@ -0,0 +1,44 @@
+ #!/usr/bin/env python3
+ """
+ Hugging Face Compatible Loader for OpenLLM
+
+ Usage:
+     # Using transformers library (if you implement a custom model class)
+     # from transformers import AutoModel, AutoTokenizer
+     # model = AutoModel.from_pretrained(".")
+     # tokenizer = AutoTokenizer.from_pretrained(".")
+
+     # Manual loading
+     from load_hf_model import load_model_manual
+     state_dict, tokenizer = load_model_manual(".")
+ """
+
+ import json
+ from pathlib import Path
+
+ import sentencepiece as spm
+ import torch
+
+
+ def load_model_manual(model_dir="."):
+     """Manually load the model weights and SentencePiece tokenizer in HF format."""
+     model_dir = Path(model_dir)
+
+     # Load config
+     with open(model_dir / "config.json", "r") as f:
+         config = json.load(f)
+
+     # Load model weights
+     state_dict = torch.load(model_dir / "pytorch_model.bin", map_location="cpu")
+
+     # Load tokenizer
+     tokenizer = spm.SentencePieceProcessor()
+     tokenizer.load(str(model_dir / "tokenizer.model"))
+
+     print(f"Loaded model: {config['model_type']} with {config['n_layer']} layers")
+     print(f"Vocabulary size: {config['vocab_size']}")
+
+     return state_dict, tokenizer
+
+
+ if __name__ == "__main__":
+     state_dict, tokenizer = load_model_manual()
+     print(f"Model weights loaded: {len(state_dict)} tensors")
+     print(f"Tokenizer vocabulary: {tokenizer.vocab_size()}")
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ccade877ad32abcabfee7ab6eb99cbfad84dad5c68cdcc71720d8d526de0fa87
+ size 168490621
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6efb1da9b0e667cee37b23f4240e0bd34fbfb20e1faebcb8d299a7598c0635f3
+ size 547695
tokenizer_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "tokenizer_class": "SentencePieceTokenizer",
+   "model_max_length": 1024,
+   "vocab_size": 32000,
+   "unk_token": "<unk>",
+   "bos_token": "<s>",
+   "eos_token": "</s>",
+   "pad_token": "<pad>"
+ }