# SuperSheikh Multimodal Model
A state-of-the-art multimodal language model that combines text, image, and audio understanding capabilities with an extended context window of 200,000 tokens.
## Model Description

SuperSheikh is a transformer-based multimodal model designed for:

- **Long-context understanding**: Supports up to 200,000 tokens
- **Text processing**: Advanced natural language understanding and generation
- **Image understanding**: Visual question answering and image captioning
- **Audio processing**: Speech recognition and audio understanding
- **Multimodal reasoning**: Combining information from multiple modalities
## Architecture

- **Base Model**: Transformer decoder with 32 layers
- **Hidden Size**: 4096 dimensions
- **Attention Heads**: 32 heads
- **Context Length**: 200,000 tokens
- **Vision Module**: 24-layer vision transformer with 1024 hidden size
- **Audio Module**: 12-layer audio transformer with 768 hidden size
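The hyperparameters above can be collected into a configuration sketch. This is a hypothetical dictionary mirroring the architecture table; the actual field names in SuperSheikh's `config.json` may differ:

```python
# Hypothetical configuration mirroring the architecture table above;
# the real config.json keys for SuperSheikh may differ.
super_sheikh_config = {
    "num_hidden_layers": 32,             # transformer decoder layers
    "hidden_size": 4096,                 # model dimension
    "num_attention_heads": 32,
    "max_position_embeddings": 200_000,  # context length in tokens
    "vision_config": {
        "num_hidden_layers": 24,
        "hidden_size": 1024,
    },
    "audio_config": {
        "num_hidden_layers": 12,
        "hidden_size": 768,
    },
}

# With 32 heads over a 4096-dimensional hidden state, each head sees 128 dims
head_dim = super_sheikh_config["hidden_size"] // super_sheikh_config["num_attention_heads"]
print(head_dim)  # 128
```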
## Installation

```bash
pip install transformers torch tokenizers safetensors accelerate
```

Or install from `requirements.txt`:

```bash
pip install -r requirements.txt
```
## Usage
### Download Model Weights

The model weights (`sheikh.safetensors`) are too large for direct GitHub hosting. Download them from the Hugging Face Hub:

```bash
wget --content-disposition "https://huggingface.co/codedwithlikhon/super-sheikh/resolve/main/sheikh.safetensors"
```
Or use the Hugging Face `transformers` library:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("codedwithlikhon/super-sheikh")
model = AutoModelForCausalLM.from_pretrained("codedwithlikhon/super-sheikh", trust_remote_code=True)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
response = tokenizer.decode(outputs[0])
```
### Multimodal Processing

```python
from transformers import SuperSheikhProcessor
from PIL import Image

processor = SuperSheikhProcessor.from_pretrained("path/to/super-sheikh")

# Process text and image together
text = "Describe this image"
image = Image.open("image.jpg")
inputs = processor(text=text, images=image, return_tensors="pt")
```
## Features

- **Long Context**: Extended context window for processing large documents
- **Multimodal**: Supports text, image, and audio inputs
- **Efficient**: Optimized for both training and inference
- **Flexible**: Customizable for various downstream tasks
## Training
The model was trained on a diverse dataset including:
- Text corpora from books, articles, and web content
- Image-text pairs from various vision-language datasets
- Audio-text pairs from speech recognition datasets
### Tokenizer Training

You can train a custom BPE tokenizer for SuperSheikh:

```python
from tokenizer_super_sheikh import SuperSheikhTokenizer

# Train tokenizer from dataset
tokenizer = SuperSheikhTokenizer.train_from_iterator(
    text_iterator,
    vocab_size=50000,
    min_frequency=2,
    special_tokens=["<|startoftext|>", "<|endoftext|>", "<pad>", "<unk>"]
)

# Save tokenizer files
tokenizer.save_pretrained("path/to/save/directory")
```
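If the `tokenizer_super_sheikh` module is unavailable, the same kind of BPE training can be sketched directly with the Hugging Face `tokenizers` library (the corpus below is placeholder data, not the actual training set):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Empty BPE tokenizer with an unknown-token fallback
tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()

# Same hyperparameters as the SuperSheikh example above
trainer = BpeTrainer(
    vocab_size=50000,
    min_frequency=2,
    special_tokens=["<|startoftext|>", "<|endoftext|>", "<pad>", "<unk>"],
)

# Any iterator of strings works as a training corpus (placeholder data here)
corpus = ["hello world", "hello tokenizer", "multimodal models are fun"] * 100
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Round-trip check: encode to ids, decode back to text
ids = tokenizer.encode("hello world").ids
print(tokenizer.decode(ids))
```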
## Model Saving

The model supports the safetensors format for efficient storage:

```python
# Save model in safetensors format, sharding files above 10 GB
model.save_pretrained(
    "path/to/save/directory",
    safe_serialization=True,
    max_shard_size="10GB"
)
```
This automatically generates:

- `model.safetensors` (or sharded files)
- `model.safetensors.index.json` (for sharded models)
- `config.json`
- `generation_config.json` (if present)
- `chat_template.jinja` (if present, for instruction-tuned models)
### Supported File Formats
The updated implementation generates standard tokenizer files:
- `tokenizer.json` - Main tokenizer file
- `vocab.json` - Vocabulary mapping
- `merges.txt` - BPE merges
- `tokenizer_config.json` - Tokenizer configuration
- `special_tokens_map.json` - Special tokens mapping
- `added_tokens.json` - Additional tokens (if any)
## Automated Deployment
This repository includes automated deployment to Hugging Face Hub via GitHub Actions:
### Setup
1. **Fork or clone** this repository to your GitHub account
2. **Set up Hugging Face token**:
- Go to [Hugging Face Settings > Access Tokens](https://huggingface.co/settings/tokens)
- Create a new token with "Write" permissions
- Add it to your GitHub repository secrets as `HF_TOKEN`
3. **Push to main branch** or use manual workflow dispatch
### Workflow Features
- **Automatic deployment**: Triggers on pushes to `main` branch
- **Manual deployment**: Can be triggered manually from GitHub Actions UI
- **Complete model upload**: Automatically uploads all model files including:
- Model weights (`*.safetensors`)
- Tokenizer files (`tokenizer.json`, `vocab.json`, `merges.txt`)
- Configuration files (`config.json`, `tokenizer_config.json`)
- Chat template (`chat_template.jinja`)
- Special tokens and additional metadata
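A workflow along these lines could implement the behavior described above. This YAML is an illustrative sketch, not the repository's actual workflow file; the job layout and the `huggingface-cli upload` invocation are assumptions:

```yaml
# Hypothetical .github/workflows/deploy.yml sketch
name: Deploy to Hugging Face Hub

on:
  push:
    branches: [main]
  workflow_dispatch:  # allows manual runs from the Actions UI

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install huggingface_hub
      - name: Upload model files
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: huggingface-cli upload codedwithlikhon/super-sheikh . --repo-type model
```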
### Repository Links
- **GitHub**: [https://github.com/codedwithlikhon/super-sheikh](https://github.com/codedwithlikhon/super-sheikh)
- **Hugging Face**: [https://huggingface.co/codedwithlikhon/super-sheikh](https://huggingface.co/codedwithlikhon/super-sheikh)
After a successful deployment, the model is automatically available on the Hugging Face Hub.
## Limitations
- Requires significant computational resources
- Large model size may not be suitable for all deployment scenarios
- Performance may vary depending on input quality and domain
## License
This model is released under the MIT License.
## Citation

If you use SuperSheikh in your research, please cite:

```bibtex
@misc{super-sheikh-2024,
  title={SuperSheikh: A Multimodal Long-Context Language Model},
  author={SuperSheikh Team},
  year={2024},
  url={https://github.com/codedwithlikhon/super-sheikh}
}
```
## Contact
For questions or support, please open an issue on our GitHub repository.