
# SuperSheikh Multimodal Model

A state-of-the-art multimodal language model that combines text, image, and audio understanding capabilities with an extended context window of 200,000 tokens.

## Model Description

SuperSheikh is a transformer-based multimodal model designed for:

  • Long-context understanding: Supports up to 200,000 tokens
  • Text processing: Advanced natural language understanding and generation
  • Image understanding: Visual question answering and image captioning
  • Audio processing: Speech recognition and audio understanding
  • Multimodal reasoning: Combining information from multiple modalities

## Architecture

  • Base Model: Transformer decoder with 32 layers
  • Hidden Size: 4096 dimensions
  • Attention Heads: 32 heads
  • Context Length: 200,000 tokens
  • Vision Module: 24-layer vision transformer with 1024 hidden size
  • Audio Module: 12-layer audio transformer with 768 hidden size
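
As a rough illustration, these hyperparameters could correspond to a `config.json` of the following shape (the field names follow common `transformers` conventions and are assumptions, not the repository's actual schema):

```json
{
  "model_type": "super-sheikh",
  "num_hidden_layers": 32,
  "hidden_size": 4096,
  "num_attention_heads": 32,
  "max_position_embeddings": 200000,
  "vision_config": { "num_hidden_layers": 24, "hidden_size": 1024 },
  "audio_config": { "num_hidden_layers": 12, "hidden_size": 768 }
}
```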

## Installation

```bash
pip install transformers torch tokenizers safetensors accelerate
```

Or install from `requirements.txt`:

```bash
pip install -r requirements.txt
```

## Usage

### Download Model Weights

The model weights (`sheikh.safetensors`) are too large for direct GitHub hosting. Download them from the Hugging Face Hub:

```bash
wget --content-disposition "https://huggingface.co/codedwithlikhon/super-sheikh/resolve/main/sheikh.safetensors"
```
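
Alternatively, the `huggingface_hub` library (installed alongside `transformers`) can resolve the file programmatically. This sketch only builds the download URL; the commented-out `hf_hub_download` call would fetch the file into the local cache:

```python
from huggingface_hub import hf_hub_url

# Build the direct download URL for the weights file on the Hub
url = hf_hub_url(repo_id="codedwithlikhon/super-sheikh", filename="sheikh.safetensors")
print(url)

# To actually download into the local Hugging Face cache:
# from huggingface_hub import hf_hub_download
# path = hf_hub_download(repo_id="codedwithlikhon/super-sheikh", filename="sheikh.safetensors")
```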

Or use the Hugging Face `transformers` library:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("codedwithlikhon/super-sheikh")
model = AutoModelForCausalLM.from_pretrained(
    "codedwithlikhon/super-sheikh",
    trust_remote_code=True,  # required to load this repository's custom model code
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```

### Multimodal Processing

```python
from transformers import SuperSheikhProcessor  # custom processor shipped with this model
from PIL import Image

processor = SuperSheikhProcessor.from_pretrained("path/to/super-sheikh")

# Process text and image together
text = "Describe this image"
image = Image.open("image.jpg")

inputs = processor(text=text, images=image, return_tensors="pt")
```

## Features

  • Long Context: Extended context window for processing large documents
  • Multimodal: Supports text, image, and audio inputs
  • Efficient: Optimized for both training and inference
  • Flexible: Customizable for various downstream tasks

## Training

The model was trained on a diverse dataset including:

  • Text corpora from books, articles, and web content
  • Image-text pairs from various vision-language datasets
  • Audio-text pairs from speech recognition datasets

### Tokenizer Training

You can train a custom BPE tokenizer for SuperSheikh:

```python
from tokenizer_super_sheikh import SuperSheikhTokenizer

# Train tokenizer from dataset
tokenizer = SuperSheikhTokenizer.train_from_iterator(
    text_iterator,
    vocab_size=50000,
    min_frequency=2,
    special_tokens=["<|startoftext|>", "<|endoftext|>", "<pad>", "<unk>"],
)

# Save tokenizer files
tokenizer.save_pretrained("path/to/save/directory")
```
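
If the `tokenizer_super_sheikh` helper is unavailable, an equivalent BPE tokenizer can be trained directly with the `tokenizers` library. This is a sketch: the in-memory corpus stands in for a real dataset iterator, and the training parameters mirror the ones above:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Illustrative corpus; replace with an iterator over your real text data
corpus = ["Hello world", "SuperSheikh is a multimodal model", "Hello again"]

# Byte-level BPE tokenizer with the same special tokens as above
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=50000,
    min_frequency=2,
    special_tokens=["<|startoftext|>", "<|endoftext|>", "<pad>", "<unk>"],
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Persist to the standard single-file format
tokenizer.save("tokenizer.json")
```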

## Model Saving

The model supports the safetensors format for efficient storage:

```python
# Save model in safetensors format, sharded into files of at most 10 GB
model.save_pretrained(
    "path/to/save/directory",
    safe_serialization=True,
    max_shard_size="10GB",
)
```

This automatically generates:

  • model.safetensors (or sharded files)
  • model.safetensors.index.json (for sharded models)
  • config.json
  • generation_config.json (if present)
  • chat_template.jinja (if present for instruction-tuned models)

### Supported File Formats

The updated implementation generates standard tokenizer files:
- `tokenizer.json` - Main tokenizer file
- `vocab.json` - Vocabulary mapping
- `merges.txt` - BPE merges
- `tokenizer_config.json` - Tokenizer configuration
- `special_tokens_map.json` - Special tokens mapping
- `added_tokens.json` - Additional tokens (if any)

## Automated Deployment

This repository includes automated deployment to Hugging Face Hub via GitHub Actions:

### Setup

1. **Fork or clone** this repository to your GitHub account
2. **Set up Hugging Face token**:
   - Go to [Hugging Face Settings > Access Tokens](https://huggingface.co/settings/tokens)
   - Create a new token with "Write" permissions
   - Add it to your GitHub repository secrets as `HF_TOKEN`
3. **Push to main branch** or use manual workflow dispatch

### Workflow Features

- **Automatic deployment**: Triggers on pushes to `main` branch
- **Manual deployment**: Can be triggered manually from GitHub Actions UI
- **Complete model upload**: Automatically uploads all model files including:
  - Model weights (`*.safetensors`)
  - Tokenizer files (`tokenizer.json`, `vocab.json`, `merges.txt`)
  - Configuration files (`config.json`, `tokenizer_config.json`)
  - Chat template (`chat_template.jinja`)
  - Special tokens and additional metadata
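
A minimal workflow along these lines could implement the behavior described above (the file layout and upload step are assumptions; the repository's actual workflow may differ):

```yaml
name: Deploy to Hugging Face Hub

on:
  push:
    branches: [main]
  workflow_dispatch:

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install huggingface_hub
      - name: Upload model files
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}  # write-scoped token from repository secrets
        run: |
          python -c "
          from huggingface_hub import HfApi
          HfApi().upload_folder(
              folder_path='.',
              repo_id='codedwithlikhon/super-sheikh',
              repo_type='model',
          )"
```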

### Repository Links

- **GitHub**: [https://github.com/codedwithlikhon/super-sheikh](https://github.com/codedwithlikhon/super-sheikh)
- **Hugging Face**: [https://huggingface.co/codedwithlikhon/super-sheikh](https://huggingface.co/codedwithlikhon/super-sheikh)

After a successful deployment run, the updated model is automatically available on the Hugging Face Hub.

## Limitations

- Requires significant computational resources
- Large model size may not be suitable for all deployment scenarios
- Performance may vary depending on input quality and domain

## License

This model is released under the MIT License.

## Citation

If you use SuperSheikh in your research, please cite:

```bibtex
@misc{super-sheikh-2024,
  title={SuperSheikh: A Multimodal Long-Context Language Model},
  author={SuperSheikh Team},
  year={2024},
  url={https://github.com/codedwithlikhon/super-sheikh}
}
```


## Contact

For questions or support, please open an issue on our GitHub repository.