	---
	language:
	- en
	license:
	- gpl-3.0
	- other
	tags:
	- text-generation
	- language-model
	- open-source
	- gpt
	- transformer
	- causal-lm
	datasets:
	- squad
	metrics:
	- perplexity
	- loss
	library_name: transformers
	pipeline_tag: text-generation
	model-index:
	- name: OpenLLM Small Extended 7K
	  results:
	  - task:
	      type: text-generation
	    dataset:
	      type: squad
	      name: Wikipedia passages from SQuAD
	    metrics:
	    - type: loss
	      value: 2.1
	    - type: perplexity
	      value: 8.2
	---

	# OpenLLM Small Extended 7K Model

	<!-- Copyright (C) 2024 Louis Chua Bean Chong -->
	<!-- This file is part of OpenLLM - dual-licensed under GPLv3 and Commercial License -->

	## 🌟 Model Overview

	This is the OpenLLM Small Extended 7K model, a 35.8M-parameter GPT-style language model trained for 7,000 steps on Wikipedia passages from the SQuAD dataset. It shares its architecture with the Small 4K and Small 6K checkpoints and is the longest-trained of the three.

	### 📊 Model Specifications

	- Architecture: GPT-style Transformer
	- Parameters: 35,823,616 (35.8M)
	- Layers: 6 transformer layers
	- Heads: 8 attention heads
	- Embedding Dimension: 512
	- Vocabulary Size: 32,000 tokens
	- Context Length: 1,024 tokens
	- Training Steps: 7,000
	- Model Size: Small

	### 🎯 Training Details

	- Dataset: Wikipedia passages from the SQuAD dataset (~41k passages)
	- Tokenization: SentencePiece with 32k vocabulary
	- Training Objective: Next token prediction (causal language modeling)
	- Optimizer: AdamW with learning rate scheduling
	- Hardware: Trained on a consumer GPU with gradient accumulation
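
Gradient accumulation lets a small GPU reach a larger effective batch size by summing scaled micro-batch gradients before each optimizer step. The sketch below illustrates the idea on a toy one-parameter model in plain Python. The model, data, and micro-batch size are assumptions chosen for illustration; the actual batch size and accumulation steps used in training are not stated here.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error for the toy model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Average the micro-batch gradients, as done before a single optimizer step."""
    micro_batches = [
        data[i:i + micro_batch_size]
        for i in range(0, len(data), micro_batch_size)
    ]
    grad = 0.0
    for micro_batch in micro_batches:
        # Scale each micro-batch gradient by the number of accumulation steps.
        grad += mse_grad(w, micro_batch) / len(micro_batches)
    return grad


# Hypothetical (x, y) pairs, split into two equal micro-batches of size 2.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
full = mse_grad(0.5, data)
accumulated = accumulated_grad(0.5, data, micro_batch_size=2)
print(abs(full - accumulated) < 1e-9)  # True
```

The two gradients agree because the micro-batches have equal size; with uneven micro-batches the scaling would need to be weighted by batch length.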

	### 📁 Model Files

	```
	huggingface/
	├── config.json # Model configuration
	├── generation_config.json # Generation parameters
	├── pytorch_model.bin # Model weights (161MB)
	├── tokenizer_config.json # Tokenizer configuration
	├── tokenizer.model # SentencePiece tokenizer
	└── load_hf_model.py # Loading script
	```

	## 🚀 Usage

	### Loading with Hugging Face Transformers

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	# Load model and tokenizer
	model_name = "path/to/huggingface"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForCausalLM.from_pretrained(model_name)

	# Generate text
	prompt = "The history of artificial intelligence"
	inputs = tokenizer(prompt, return_tensors="pt")

	with torch.no_grad():
	    outputs = model.generate(
	        inputs.input_ids,
	        max_new_tokens=100,
	        temperature=0.7,
	        do_sample=True,
	        pad_token_id=tokenizer.pad_token_id
	    )

	generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
	print(generated_text)
	```

	### Using the Custom Loader

	```python
	from load_hf_model import load_openllm_model

	# Load the model using our custom loader
	model, tokenizer = load_openllm_model("path/to/huggingface")

	# Generate text
	prompt = "Explain quantum computing in simple terms"
	inputs = tokenizer(prompt, return_tensors="pt")

	outputs = model.generate(
	    inputs.input_ids,
	    max_new_tokens=150,
	    temperature=0.8,
	    top_p=0.9
	)

	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	### Inference Server

	```bash
	# Start the FastAPI inference server
	python core/src/inference_server.py \
	    --model_path exports/huggingface-7k/huggingface \
	    --port 8000

	# Make API calls
	curl -X POST "http://localhost:8000/generate" \
	    -H "Content-Type: application/json" \
	    -d '{
	        "prompt": "The future of renewable energy",
	        "max_tokens": 100,
	        "temperature": 0.7
	    }'
	```
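
The same call can be made from Python with only the standard library. This is a minimal sketch that mirrors the `curl` request above; the helper names `build_generate_request` and `generate` are hypothetical, and the response schema is not documented here, so the decoded JSON is returned as-is.

```python
import json
import urllib.request


def build_generate_request(prompt, max_tokens=100, temperature=0.7,
                           base_url="http://localhost:8000"):
    """Build a POST request for the /generate endpoint shown in the curl example."""
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the decoded JSON body (schema is an assumption)."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Requires the inference server to be running:
# result = generate("The future of renewable energy")
```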

	## 📈 Performance

	### Training Metrics

	- Final Loss: ~2.1 (cross-entropy)
	- Training Time: ~7 hours on a consumer GPU
	- Memory Usage: ~2GB VRAM during training
	- Inference Speed: ~50 tokens/second on CPU, ~200 tokens/second on GPU
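
The perplexity of 8.2 reported in the metadata follows from the final loss, assuming the cross-entropy is measured in nats per token (the PyTorch default):

```python
import math

final_loss = 2.1  # cross-entropy in nats per token, as reported above
perplexity = math.exp(final_loss)
print(f"{perplexity:.1f}")  # 8.2
```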

	### Model Capabilities

	- Text Generation: Coherent paragraph generation
	- Question Answering: Basic factual responses
	- Summarization: Short text summarization
	- Language Understanding: Context-aware responses

	## 🔧 Configuration

	### Generation Parameters

	```json
	{
	    "max_length": 512,
	    "max_new_tokens": 256,
	    "temperature": 0.7,
	    "top_k": 40,
	    "top_p": 0.9,
	    "do_sample": true,
	    "pad_token_id": 0,
	    "eos_token_id": 1,
	    "bos_token_id": 2
	}
	```
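
To make the roles of `temperature`, `top_k`, and `top_p` concrete, here is a plain-Python sketch of the usual filtering order (temperature scaling, then top-k, then top-p). It is an illustration under the assumption `top_k >= 1`, not the Transformers implementation, and `filtered_distribution` is a hypothetical helper name.

```python
import math


def filtered_distribution(logits, temperature=0.7, top_k=40, top_p=0.9):
    """Return {token_id: probability} after temperature, top-k, and top-p filtering."""
    # Temperature: values below 1.0 sharpen the distribution, above 1.0 flatten it.
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring token ids.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Numerically stable softmax over the surviving tokens.
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, p in zip(ranked, probs):
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {token_id: p / norm for token_id, p in kept}


# Toy logits over a 4-token vocabulary.
dist = filtered_distribution([2.0, 1.0, 0.5, -1.0], temperature=1.0, top_k=3, top_p=0.8)
print(sorted(dist))  # [0, 1]
```

With `do_sample` enabled, the next token is drawn from a distribution like `dist` instead of always taking the highest-scoring token.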

	### Model Architecture

	```json
	{
	    "vocab_size": 32000,
	    "n_layer": 6,
	    "n_head": 8,
	    "n_embd": 512,
	    "block_size": 1024,
	    "dropout": 0.1,
	    "bias": true
	}
	```
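
The reported parameter count can be reproduced from this configuration. The sketch below assumes a GPT-2-style layout: learned positional embeddings, a fused QKV projection, a 4x MLP expansion, two layer norms per block plus a final one, and an output head tied to the token embedding. These layout details are assumptions rather than statements from the model code, but they yield exactly the 35,823,616 parameters listed above.

```python
def count_gpt_parameters(vocab_size, n_layer, n_embd, block_size, bias=True):
    """Count parameters of a GPT-2-style decoder with tied input/output embeddings."""
    b = 1 if bias else 0
    token_emb = vocab_size * n_embd           # shared with the output head
    pos_emb = block_size * n_embd             # learned positional embeddings
    layer_norm = n_embd + b * n_embd          # weight (+ bias)

    attn = n_embd * 3 * n_embd + b * 3 * n_embd   # fused Q, K, V projection
    attn += n_embd * n_embd + b * n_embd          # attention output projection

    mlp = n_embd * 4 * n_embd + b * 4 * n_embd    # expansion to 4 * n_embd
    mlp += 4 * n_embd * n_embd + b * n_embd       # projection back to n_embd

    per_layer = 2 * layer_norm + attn + mlp
    return token_emb + pos_emb + n_layer * per_layer + layer_norm


total = count_gpt_parameters(vocab_size=32000, n_layer=6, n_embd=512, block_size=1024)
print(f"{total:,}")  # 35,823,616
```

`n_head` and `dropout` do not appear because neither changes the number of weights.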

	## 🧪 Testing

	### Quick Test

	```python
	# Test the model with a simple prompt
	# (reuses `torch`, `model`, and `tokenizer` from the loading example above)
	test_prompt = "Hello, how are you today?"
	inputs = tokenizer(test_prompt, return_tensors="pt")

	with torch.no_grad():
	    outputs = model.generate(
	        inputs.input_ids,
	        max_new_tokens=20,
	        temperature=0.7
	    )

	response = tokenizer.decode(outputs[0], skip_special_tokens=True)
	print(f"Input: {test_prompt}")
	print(f"Output: {response}")
	```

	## 📋 Limitations

	- Context Length: Limited to 1,024 tokens
	- Training Data: Only Wikipedia passages (limited domain)
	- Model Size: Small model with limited reasoning capabilities
	- Bias: May inherit biases from training data
	- Factual Accuracy: Not guaranteed for current events
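
Because the prompt and the generated tokens share the 1,024-token window, long prompts need trimming before generation. A minimal sketch, assuming the most recent tokens matter most (left truncation); `fit_prompt_to_context` is a hypothetical helper, not part of the repository:

```python
def fit_prompt_to_context(token_ids, max_new_tokens, block_size=1024):
    """Keep the most recent prompt tokens so prompt + generation fits in the window."""
    budget = block_size - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than block_size")
    # Negative slicing keeps the tail; shorter prompts are returned unchanged.
    return token_ids[-budget:]


prompt_ids = list(range(1500))  # stand-in for a 1,500-token prompt
trimmed = fit_prompt_to_context(prompt_ids, max_new_tokens=256)
print(len(trimmed), trimmed[0])  # 768 732
```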

	## 🔄 Model Comparison

	| Model | Parameters | Training Steps | Context Length | Use Case |
	|-------|------------|----------------|----------------|----------|
	| Small 4K | 35.8M | 4,000 | 1,024 | Basic text generation |
	| Small 6K | 35.8M | 6,000 | 1,024 | Improved coherence |
	| Small 7K | 35.8M | 7,000 | 1,024 | Extended training |

	## 📄 License

	This model is dual-licensed:
	- Open Source: GNU General Public License v3.0
	- Commercial: Commercial License (contact for details)

	See `LICENSE` and `docs/LICENSES.md` for full license information.

	## 🤝 Contributing

	We welcome contributions to improve the model! Please see:
	- `docs/CONTRIBUTING.md` for contribution guidelines
	- `docs/CODE_OF_CONDUCT.md` for community standards

	## 📞 Support

	For questions, issues, or commercial licensing:
	- GitHub Issues: Report bugs and feature requests
	- Documentation: Check `docs/` directory
	- Commercial License: Contact for enterprise use

	---

	Author: Louis Chua Bean Chong
	Project: OpenLLM - Open Source Large Language Model
	Version: 0.1.0
	Last Updated: 2024