lemms committed
Commit 8aeeaba · verified · Parent(s): d147356

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,143 @@
+ # OpenLLM Small Extended 10k
+
+ This is the OpenLLM small model trained for 10,000 steps on the SQuAD dataset.
+
+ ## Model Details
+
+ - **Model Type**: GPT-style transformer (decoder-only)
+ - **Training Steps**: 10,000
+ - **Parameters**: 35.8M
+ - **Vocabulary Size**: 32,000
+ - **Context Length**: 1,024 tokens
+ - **Architecture**: 6 layers, 8 attention heads, 512 embedding dimension
+
+ ## Training Information
+
+ - **Dataset**: SQuAD (Stanford Question Answering Dataset)
+ - **Training Data**: ~41k Wikipedia passages
+ - **Tokenizer**: SentencePiece BPE with 32k vocabulary (a training sketch follows this list)
+ - **Optimizer**: AdamW
+ - **Learning Rate**: 3e-4
+ - **Batch Size**: 4 (with gradient accumulation)
+
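+ A tokenizer like this can be reproduced with the `sentencepiece` package. This is an illustrative sketch, not the exact command used for this checkpoint; the input file name and the character-coverage value are assumptions.
+
+ ```python
+ import sentencepiece as spm
+
+ # Train a 32k-vocabulary BPE tokenizer on the raw training text.
+ # "train_corpus.txt" is a placeholder for the SQuAD passage dump.
+ spm.SentencePieceTrainer.train(
+     input="train_corpus.txt",
+     model_prefix="tokenizer",
+     vocab_size=32000,
+     model_type="bpe",
+     character_coverage=1.0,  # assumed; reasonable for English text
+ )
+ ```
+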
+ ## Performance
+
+ - **Final Loss**: ~5.22
+ - **Inference Speed**: ~8.3 tokens/second (CPU)
+ - **Memory Usage**: ~143MB for inference
+
+ ## Usage
+
+ ### Using the Model
+
+ Note: `config.json` declares a custom `GPTModel` architecture, so the `transformers` auto classes may require a matching custom model class to be registered before this example runs (see `load_hf_model.py` for a manual alternative).
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+
+ # Load model and tokenizer
+ model_name = "lemms/openllm-small-extended-10k"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name)
+
+ # Generate text
+ prompt = "The future of artificial intelligence"
+ inputs = tokenizer(prompt, return_tensors="pt")
+
+ with torch.no_grad():
+     outputs = model.generate(
+         inputs["input_ids"],
+         max_length=100,
+         temperature=0.7,
+         do_sample=True,
+         pad_token_id=tokenizer.eos_token_id,
+     )
+
+ generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(generated_text)
+ ```
+
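+ The repository also ships a `generation_config.json` with sampling defaults (temperature 0.7, top-k 40, top-p 0.9). With a transformers-compatible model those defaults can be applied directly; this sketch reuses `model`, `inputs`, and `tokenizer` from the example above.
+
+ ```python
+ from transformers import GenerationConfig
+
+ # Load the decoding defaults stored alongside the checkpoint
+ gen_config = GenerationConfig.from_pretrained("lemms/openllm-small-extended-10k")
+ outputs = model.generate(inputs["input_ids"], generation_config=gen_config)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+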
+ ### Using the Custom Loader
+
+ The bundled `load_hf_model.py` exposes `load_model_manual`, which returns the raw state dict and a SentencePiece processor rather than a generate-ready model object:
+
+ ```python
+ from load_hf_model import load_model_manual
+
+ # Load the checkpoint from a local clone of this repository
+ state_dict, tokenizer = load_model_manual(".")
+
+ # Inspect the weights and tokenize a prompt
+ print(f"Loaded {len(state_dict)} weight tensors")
+ token_ids = tokenizer.encode("The history of machine learning")
+ print(token_ids)
+ ```
+
+ ## Model Architecture
+
+ This model follows the standard GPT architecture (a minimal sketch follows the list):
+
+ - **Token Embeddings**: Maps token IDs to dense vectors
+ - **Positional Embeddings**: Adds position information
+ - **Transformer Blocks**: 6 layers with multi-head attention and feed-forward networks
+ - **Layer Normalization**: Pre-norm placement for training stability
+ - **Output Head**: Linear projection to vocabulary for next-token prediction
+
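+ The following is an illustrative PyTorch sketch of this architecture using the `config.json` hyperparameters (6 layers, 8 heads, 512-dim embeddings, 32k vocab, 1,024 context). It is not the repository's actual `GPTModel` implementation; the names `Block` and `TinyGPT` are hypothetical.
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class Block(nn.Module):
+     """One pre-norm transformer block: LN -> causal attention -> LN -> MLP."""
+     def __init__(self, n_embd=512, n_head=8, dropout=0.1):
+         super().__init__()
+         self.ln1 = nn.LayerNorm(n_embd)
+         self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout, batch_first=True)
+         self.ln2 = nn.LayerNorm(n_embd)
+         self.mlp = nn.Sequential(
+             nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
+             nn.Linear(4 * n_embd, n_embd), nn.Dropout(dropout),
+         )
+
+     def forward(self, x):
+         # Causal mask: each position attends only to itself and earlier positions
+         mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
+         h = self.ln1(x)
+         attn_out, _ = self.attn(h, h, h, attn_mask=mask)
+         x = x + attn_out                # residual around attention
+         x = x + self.mlp(self.ln2(x))   # residual around feed-forward
+         return x
+
+ class TinyGPT(nn.Module):
+     def __init__(self, vocab_size=32000, block_size=1024, n_layer=6, n_embd=512):
+         super().__init__()
+         self.tok_emb = nn.Embedding(vocab_size, n_embd)   # token embeddings
+         self.pos_emb = nn.Embedding(block_size, n_embd)   # learned positions
+         self.blocks = nn.ModuleList(Block(n_embd) for _ in range(n_layer))
+         self.ln_f = nn.LayerNorm(n_embd)                  # final layer norm
+         self.head = nn.Linear(n_embd, vocab_size, bias=False)  # next-token logits
+
+     def forward(self, idx):
+         pos = torch.arange(idx.size(1), device=idx.device)
+         x = self.tok_emb(idx) + self.pos_emb(pos)
+         for block in self.blocks:
+             x = block(x)
+         return self.head(self.ln_f(x))
+ ```
+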
+ ## Training Details
+
+ The model was trained using the following setup (a minimal training-step sketch follows the list):
+
+ - **Framework**: PyTorch
+ - **Hardware**: CPU training with gradient accumulation
+ - **Regularization**: Dropout (0.1), weight decay
+ - **Optimization**: AdamW with cosine learning rate scheduling
+ - **Gradient Clipping**: 1.0
+
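+ A sketch of one possible training step under these settings. It reuses the hypothetical `TinyGPT` from the architecture sketch above; the data `loader`, the accumulation factor, and the weight-decay value are assumptions, not values from the actual run.
+
+ ```python
+ import torch
+ from torch.optim import AdamW
+ from torch.optim.lr_scheduler import CosineAnnealingLR
+
+ model = TinyGPT()
+ optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)  # decay value assumed
+ scheduler = CosineAnnealingLR(optimizer, T_max=10_000)  # cosine schedule over 10k steps
+ loss_fn = torch.nn.CrossEntropyLoss()
+ accum_steps = 8  # assumed; micro-batch size 4 per step
+
+ for step, (input_ids, targets) in enumerate(loader):  # `loader` yields (inputs, shifted targets)
+     logits = model(input_ids)
+     loss = loss_fn(logits.view(-1, logits.size(-1)), targets.view(-1))
+     (loss / accum_steps).backward()  # accumulate gradients across micro-batches
+     if (step + 1) % accum_steps == 0:
+         torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip at 1.0
+         optimizer.step()
+         scheduler.step()
+         optimizer.zero_grad()
+ ```
+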
+ ## Limitations
+
+ - This is a small model (35.8M parameters) with limited capacity
+ - Training was done on CPU, which limited the number of training steps
+ - Output quality is basic; the model is intended for educational and research purposes
+ - Not suitable for production use without further training
+
+ ## License
+
+ This model is dual-licensed:
+ - **Open Source**: GPLv3
+ - **Commercial**: Commercial license available
+
+ ## Citation
+
+ If you use this model in your research, please cite:
+
+ ```bibtex
+ @misc{openllm2024,
+   title={OpenLLM: Open Source Large Language Model Framework},
+   author={Louis Chua Bean Chong},
+   year={2024},
+   url={https://github.com/louischua/openllm}
+ }
+ ```
+
+ ## Model Card
+
+ - **Developed by**: Louis Chua Bean Chong
+ - **Model type**: Language Model
+ - **Language(s)**: English
+ - **License**: GPLv3 / Commercial
+ - **Finetuned from model**: None (trained from scratch)
+ - **Training data**: SQuAD dataset
+ - **Training procedure**: Self-supervised next-token prediction
+ - **Evaluation results**: Basic text generation capability
+
+ ## Related Models
+
+ - [lemms/openllm-small-extended-4k](https://huggingface.co/lemms/openllm-small-extended-4k)
+ - [lemms/openllm-small-extended-6k](https://huggingface.co/lemms/openllm-small-extended-6k)
+ - [lemms/openllm-small-extended-7k](https://huggingface.co/lemms/openllm-small-extended-7k)
+ - [lemms/openllm-small-extended-8k](https://huggingface.co/lemms/openllm-small-extended-8k)
+ - [lemms/openllm-small-extended-9k](https://huggingface.co/lemms/openllm-small-extended-9k)
config.json ADDED
@@ -0,0 +1,18 @@
+ {
+   "architectures": [
+     "GPTModel"
+   ],
+   "model_type": "gpt",
+   "vocab_size": 32000,
+   "n_layer": 6,
+   "n_head": 8,
+   "n_embd": 512,
+   "block_size": 1024,
+   "dropout": 0.1,
+   "bias": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.0.0",
+   "openllm_version": "0.1.0",
+   "training_steps": 10000,
+   "model_size": "small"
+ }
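These fields map directly onto the architecture hyperparameters. As an illustration, they could be used to instantiate the hypothetical `TinyGPT` sketch from the README above (assumed class, not the repository's actual `GPTModel`):

```python
import json

# Read the shipped config and construct a model with matching dimensions
with open("config.json") as f:
    cfg = json.load(f)

model = TinyGPT(
    vocab_size=cfg["vocab_size"],
    block_size=cfg["block_size"],
    n_layer=cfg["n_layer"],
    n_embd=cfg["n_embd"],
)
```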
generation_config.json ADDED
@@ -0,0 +1,11 @@
+ {
+   "max_length": 512,
+   "max_new_tokens": 256,
+   "temperature": 0.7,
+   "top_k": 40,
+   "top_p": 0.9,
+   "do_sample": true,
+   "pad_token_id": 0,
+   "eos_token_id": 1,
+   "bos_token_id": 2
+ }
load_hf_model.py ADDED
@@ -0,0 +1,44 @@
+ #!/usr/bin/env python3
+ """
+ Hugging Face Compatible Loader for OpenLLM
+
+ Usage:
+     # Using the transformers library (requires a custom model class
+     # registered for the "gpt" / GPTModel architecture):
+     # from transformers import AutoModel, AutoTokenizer
+     # model = AutoModel.from_pretrained(".")
+     # tokenizer = AutoTokenizer.from_pretrained(".")
+
+     # Manual loading:
+     from load_hf_model import load_model_manual
+     state_dict, tokenizer = load_model_manual(".")
+ """
+
+ import json
+ from pathlib import Path
+
+ import sentencepiece as spm
+ import torch
+
+ def load_model_manual(model_dir="."):
+     """Load the raw checkpoint: returns (state_dict, SentencePieceProcessor).
+
+     Note that this returns the weight tensors, not a constructed model;
+     the caller is responsible for instantiating the architecture and
+     calling load_state_dict().
+     """
+     model_dir = Path(model_dir)
+
+     # Load config
+     with open(model_dir / "config.json", "r") as f:
+         config = json.load(f)
+
+     # Load model weights onto CPU
+     state_dict = torch.load(model_dir / "pytorch_model.bin", map_location="cpu")
+
+     # Load the SentencePiece tokenizer
+     tokenizer = spm.SentencePieceProcessor()
+     tokenizer.load(str(model_dir / "tokenizer.model"))
+
+     print(f"Loaded model: {config['model_type']} with {config['n_layer']} layers")
+     print(f"Vocabulary size: {config['vocab_size']}")
+
+     return state_dict, tokenizer
+
+ if __name__ == "__main__":
+     state_dict, tokenizer = load_model_manual()
+     print(f"Model weights loaded: {len(state_dict)} weight tensors")
+     print(f"Tokenizer vocabulary: {tokenizer.vocab_size()}")
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f826631e0861e3069409a6afb41c577372361c7389440bab45734de046d0f5da
+ size 168490621
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6efb1da9b0e667cee37b23f4240e0bd34fbfb20e1faebcb8d299a7598c0635f3
+ size 547695
tokenizer_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "tokenizer_class": "SentencePieceTokenizer",
+   "model_max_length": 1024,
+   "vocab_size": 32000,
+   "unk_token": "<unk>",
+   "bos_token": "<s>",
+   "eos_token": "</s>",
+   "pad_token": "<pad>"
+ }
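The special-token ids in `generation_config.json` (pad=0, eos=1, bos=2) differ from SentencePiece's usual defaults (unk=0, bos=1, eos=2), so it is worth verifying them against the shipped `tokenizer.model`. A small sanity-check sketch, assuming a local clone of this repository:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

# Print the id of each special piece so it can be compared with
# generation_config.json and tokenizer_config.json
for piece in ["<unk>", "<s>", "</s>", "<pad>"]:
    print(piece, sp.piece_to_id(piece))
```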