---
language: en
library_name: transformers
tags:
- emg
- morphology
- language-model
- causal-lm
- morpiece-tokenizer
license: apache-2.0
pipeline_tag: text-generation
---

# EMG Language Model

This is an EMG (Enhanced Morphological Generation) language model paired with the MorPiece tokenizer.

## Model Details

- **Model Type**: Causal Language Model
- **Architecture**: EMG with morphological awareness
- **Tokenizer**: MorPiece (morphology-aware tokenization)
- **Parameters**: 79.75M
- **Vocabulary Size**: 60,001

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model and tokenizer (the custom architecture requires trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("your-username/your-model-name", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("your-username/your-model-name", trust_remote_code=True)

# Generate text; max_new_tokens counts only generated tokens, not the prompt
input_text = "The future of AI is"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

## Model Architecture

The EMG model incorporates morphological awareness to improve language understanding and generation. Its MorPiece tokenizer performs morphology-aware tokenization, segmenting text in a way that better handles word formations such as inflections and derivations.

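To build intuition for what morphology-aware segmentation does, here is a toy sketch of greedy longest-match subword segmentation over a hand-picked morpheme vocabulary. This is illustrative only: it is not the MorPiece algorithm, and the vocabulary below is invented for the example, whereas the real tokenizer learns its vocabulary from data.

```python
# Toy greedy longest-match segmentation over a hand-picked "morpheme" vocabulary.
# Illustrative only -- not the actual MorPiece algorithm.
VOCAB = {"un", "believ", "ably", "able", "token", "ization", "walk", "ed"}

def segment(word, vocab=VOCAB, unk="[UNK]"):
    """Split `word` into the longest matching vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest substring starting at i that is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No piece matched: emit an unknown marker for this character.
            pieces.append(unk)
            i += 1
    return pieces

print(segment("unbelievably"))  # ['un', 'believ', 'ably']
print(segment("tokenization"))  # ['token', 'ization']
print(segment("walked"))        # ['walk', 'ed']
```

Segmenting along morpheme-like boundaries lets related word forms ("walk", "walked", "walking") share pieces, rather than each form needing its own vocabulary entry.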
## Training

This model was trained on conversational data with morphological enhancement.

## Limitations

- This model is designed primarily for research purposes.
- It may not perform optimally on all downstream tasks without fine-tuning.
- Loading requires `trust_remote_code=True` due to the custom architecture.

## Citation

If you use this model, please cite the original EMG paper and implementation.