# SuperSheikh Multimodal Model
A state-of-the-art multimodal language model that combines text, image, and audio understanding capabilities with an extended context window of 200,000 tokens.
## Model Description
SuperSheikh is a transformer-based multimodal model designed for:
- **Long-context understanding**: Supports up to 200,000 tokens
- **Text processing**: Advanced natural language understanding and generation
- **Image understanding**: Visual question answering and image captioning
- **Audio processing**: Speech recognition and audio understanding
- **Multimodal reasoning**: Combining information from multiple modalities
## Architecture
- **Base Model**: Transformer decoder with 32 layers
- **Hidden Size**: 4096 dimensions
- **Attention Heads**: 32 heads
- **Context Length**: 200,000 tokens
- **Vision Module**: 24-layer vision transformer with 1024 hidden size
- **Audio Module**: 12-layer audio transformer with 768 hidden size
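Taken together, these hyperparameters can be captured in a small configuration sketch. This is purely illustrative: the field names below are assumptions, not the model's actual `config.json` schema.

```python
# Illustrative configuration mirroring the architecture table above.
# Field names are hypothetical; consult the repository's config.json for the real schema.
super_sheikh_config = {
    "num_hidden_layers": 32,
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "max_position_embeddings": 200_000,
    "vision_config": {"num_hidden_layers": 24, "hidden_size": 1024},
    "audio_config": {"num_hidden_layers": 12, "hidden_size": 768},
}

# Per-head dimension implied by the sizes above: 4096 / 32 = 128
head_dim = super_sheikh_config["hidden_size"] // super_sheikh_config["num_attention_heads"]
```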
## Installation
```bash
pip install transformers torch tokenizers safetensors accelerate
```
Or install from requirements.txt:
```bash
pip install -r requirements.txt
```
## Usage
### Download Model Weights
The model weights (`sheikh.safetensors`) are too large for direct GitHub hosting. Download them from the Hugging Face Hub:
```bash
wget --content-disposition "https://huggingface.co/codedwithlikhon/super-sheikh/resolve/main/sheikh.safetensors"
```
Or use the Hugging Face `transformers` library:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model from the Hub; trust_remote_code is needed
# because the model ships custom modeling code
tokenizer = AutoTokenizer.from_pretrained("codedwithlikhon/super-sheikh")
model = AutoModelForCausalLM.from_pretrained("codedwithlikhon/super-sheikh", trust_remote_code=True)

# Generate a short response
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
### Multimodal Processing
```python
from transformers import SuperSheikhProcessor
from PIL import Image

processor = SuperSheikhProcessor.from_pretrained("codedwithlikhon/super-sheikh")

# Process text and image together into a single batch
text = "Describe this image"
image = Image.open("image.jpg")
inputs = processor(text=text, images=image, return_tensors="pt")

# The batch can then be passed to the model loaded above, e.g.:
# outputs = model.generate(**inputs, max_new_tokens=100)
```
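At a high level, multimodal fusion in a decoder of this shape works by projecting each encoder's output into the text hidden size and concatenating along the sequence axis, so the decoder attends over all modalities at once. The toy sketch below illustrates the idea using the widths from the architecture table; it is not SuperSheikh's actual fusion code.

```python
import torch
import torch.nn as nn

hidden_size = 4096  # text hidden size from the architecture table

# Hypothetical projections from each encoder's width into the decoder width
vision_proj = nn.Linear(1024, hidden_size)
audio_proj = nn.Linear(768, hidden_size)

text_emb = torch.randn(1, 16, hidden_size)          # 16 text tokens
image_emb = vision_proj(torch.randn(1, 64, 1024))   # 64 image patches
audio_emb = audio_proj(torch.randn(1, 32, 768))     # 32 audio frames

# Concatenate along the sequence dimension; the decoder attends over all of it
fused = torch.cat([image_emb, audio_emb, text_emb], dim=1)
print(fused.shape)  # torch.Size([1, 112, 4096])
```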
## Features
- **Long Context**: Extended context window for processing large documents
- **Multimodal**: Supports text, image, and audio inputs
- **Efficient**: Optimized for both training and inference
- **Flexible**: Customizable for various downstream tasks
## Training
The model was trained on a diverse dataset including:
- Text corpora from books, articles, and web content
- Image-text pairs from various vision-language datasets
- Audio-text pairs from speech recognition datasets
### Tokenizer Training
You can train a custom BPE tokenizer for SuperSheikh:
```python
from tokenizer_super_sheikh import SuperSheikhTokenizer

# Train a BPE tokenizer from a text iterator
tokenizer = SuperSheikhTokenizer.train_from_iterator(
    text_iterator,
    vocab_size=50000,
    min_frequency=2,
    special_tokens=["<|startoftext|>", "<|endoftext|>", "<pad>", "<unk>"],
)

# Save tokenizer files
tokenizer.save_pretrained("path/to/save/directory")
```
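For reference, the same kind of BPE training can be done directly with the Hugging Face `tokenizers` library. The self-contained sketch below trains on a toy in-memory corpus and does not depend on the `tokenizer_super_sheikh` wrapper above.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = ["hello world", "hello there, world", "the world says hello"]

# BPE model with whitespace pre-tokenization
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=200,
    min_frequency=2,
    special_tokens=["<|startoftext|>", "<|endoftext|>", "<pad>", "<unk>"],
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

ids = tokenizer.encode("hello world").ids
```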
### Model Saving
The model supports safetensors format for efficient storage:
```python
# Save the model in safetensors format, sharding large checkpoints
model.save_pretrained(
    "path/to/save/directory",
    safe_serialization=True,
    max_shard_size="10GB",
)
```
This automatically generates:
- `model.safetensors` (or sharded files)
- `model.safetensors.index.json` (for sharded models)
- `config.json`
- `generation_config.json` (if present)
- `chat_template.jinja` (if present for instruction-tuned models)
### Supported File Formats
Saving a trained tokenizer generates the standard Hugging Face tokenizer files:
- `tokenizer.json` - Main tokenizer file
- `vocab.json` - Vocabulary mapping
- `merges.txt` - BPE merges
- `tokenizer_config.json` - Tokenizer configuration
- `special_tokens_map.json` - Special tokens mapping
- `added_tokens.json` - Additional tokens (if any)
## Automated Deployment
This repository includes automated deployment to Hugging Face Hub via GitHub Actions:
### Setup
1. **Fork or clone** this repository to your GitHub account
2. **Set up Hugging Face token**:
- Go to [Hugging Face Settings > Access Tokens](https://huggingface.co/settings/tokens)
- Create a new token with "Write" permissions
- Add it to your GitHub repository secrets as `HF_TOKEN`
3. **Push to main branch** or use manual workflow dispatch
### Workflow Features
- **Automatic deployment**: Triggers on pushes to `main` branch
- **Manual deployment**: Can be triggered manually from GitHub Actions UI
- **Complete model upload**: Automatically uploads all model files including:
- Model weights (`*.safetensors`)
- Tokenizer files (`tokenizer.json`, `vocab.json`, `merges.txt`)
- Configuration files (`config.json`, `tokenizer_config.json`)
- Chat template (`chat_template.jinja`)
- Special tokens and additional metadata
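A workflow with these triggers might look roughly like the sketch below. This is an illustration, not the repository's actual workflow file: the step names, action versions, and the `huggingface-cli upload` invocation are assumptions.

```yaml
name: Deploy to Hugging Face Hub
on:
  push:
    branches: [main]
  workflow_dispatch:

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install huggingface_hub
        run: pip install -U huggingface_hub
      - name: Upload model files
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: huggingface-cli upload codedwithlikhon/super-sheikh . . --repo-type model
```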
### Repository Links
- **GitHub**: [https://github.com/codedwithlikhon/super-sheikh](https://github.com/codedwithlikhon/super-sheikh)
- **Hugging Face**: [https://huggingface.co/codedwithlikhon/super-sheikh](https://huggingface.co/codedwithlikhon/super-sheikh)
After a successful deployment, the model is automatically available on the Hugging Face Hub.
## Limitations
- Requires significant computational resources
- Large model size may not be suitable for all deployment scenarios
- Performance may vary depending on input quality and domain
## License
This model is released under the MIT License.
## Citation
If you use SuperSheikh in your research, please cite:
```
@misc{super-sheikh-2024,
  title={SuperSheikh: A Multimodal Long-Context Language Model},
  author={SuperSheikh Team},
  year={2024},
  url={https://github.com/codedwithlikhon/super-sheikh}
}
```
## Contact
For questions or support, please open an issue on our GitHub repository.