---
language: en
library_name: transformers
tags:
- emg
- morphology
- language-model
- causal-lm
- morpiece-tokenizer
license: apache-2.0
pipeline_tag: text-generation
---

# EMG Language Model

This is an EMG (Enhanced Morphological Generation) language model paired with the MorPiece tokenizer.

## Model Details

- **Model Type**: Causal Language Model
- **Architecture**: EMG with morphological awareness
- **Tokenizer**: MorPiece (morphology-aware tokenization)
- **Parameters**: 79.75M
- **Vocabulary Size**: 60,001

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model and tokenizer (the custom architecture requires trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("your-username/your-model-name", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("your-username/your-model-name", trust_remote_code=True)

# Generate text; max_new_tokens counts only generated tokens, not the prompt
input_text = "The future of AI is"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

## Model Architecture

The EMG model incorporates morphological awareness to improve language understanding and generation. Its MorPiece tokenizer performs morphology-aware tokenization, segmenting text in a way that better handles word formations such as inflections and derivations.

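To build intuition for what morphology-aware segmentation does, here is a toy sketch of greedy longest-match subword segmentation over a hand-picked morpheme vocabulary. This is illustrative only: it is not the MorPiece algorithm, and the vocabulary below is invented for the example, whereas the real tokenizer learns its vocabulary from data.

```python
# Toy greedy longest-match segmentation over a hand-picked "morpheme" vocabulary.
# Illustrative only -- not the actual MorPiece algorithm.
VOCAB = {"un", "believ", "ably", "able", "token", "ization", "walk", "ed"}

def segment(word, vocab=VOCAB, unk="[UNK]"):
    """Split `word` into the longest matching vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest substring starting at i that is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No piece matched: emit an unknown marker for this character.
            pieces.append(unk)
            i += 1
    return pieces

print(segment("unbelievably"))  # ['un', 'believ', 'ably']
print(segment("tokenization"))  # ['token', 'ization']
print(segment("walked"))        # ['walk', 'ed']
```

Segmenting along morpheme-like boundaries lets related word forms ("walk", "walked", "walking") share pieces, rather than each form needing its own vocabulary entry.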
## Training

This model was trained on conversational data with morphological enhancement.

## Limitations

- This model is designed primarily for research purposes.
- It may not perform optimally on all downstream tasks without fine-tuning.
- Loading requires `trust_remote_code=True` due to the custom architecture.

## Citation

If you use this model, please cite the original EMG paper and implementation.