metadata
language:
- vi
- en
license: apache-2.0
library_name: transformers
tags:
- moe
- mixture-of-experts
- text-generation
- decode-series
- llm
- vietnamese-llm
datasets:
- markov-ai/computer-use-large
metrics:
- loss
- perplexity
model-index:
- name: Decode-12B-MoE
  results: []
Decode-12B-MoE: High-Performance Mixture of Experts Model
Decode-12B-MoE is a large language model (LLM) built on a sparse Mixture of Experts (MoE) architecture with 12.5 billion total parameters. Only about 2.5B parameters (roughly 20% of the weights) are active for each token, so per-token compute is much lower than the total parameter count suggests.
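To make "activating only a fraction of the weights" concrete, here is a minimal sketch of top-k expert routing for a single token. It illustrates the general sparse MoE idea only: the number of experts, the value of `k`, and the function name `top_k_route` are hypothetical, and the actual router used by Decode-12B-MoE is not described in this card.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns (expert_ids, gate_weights); the gate weights are a softmax
    over the selected experts only, so they sum to 1.
    """
    # Rank experts by router score, highest first, and keep the top k.
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    expert_ids = ranked[:k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(router_logits[i] for i in expert_ids)
    exps = [math.exp(router_logits[i] - peak) for i in expert_ids]
    total = sum(exps)
    gate_weights = [e / total for e in exps]
    return expert_ids, gate_weights

# Hypothetical router scores for 8 experts (not the real model configuration).
logits = [0.2, 1.5, -0.3, 0.9, -1.1, 0.0, 0.4, -0.7]
expert_ids, gate_weights = top_k_route(logits, k=2)
print(expert_ids)  # [1, 3]
```

Only the selected experts run their feed-forward computation for that token; the remaining experts are skipped, which is where the compute saving comes from.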
Technical Specifications
| Attribute | Value |
|---|---|
| Total Parameters | 12,500,340,736 (12.5B) |
| Active Parameters | ~2.5B per token |
| Architecture | Sparse MoE (Decoder-only) |
| Context Window | 4096 tokens |
| Format | Bfloat16 / Float16 |
| Training Hardware | NVIDIA Tesla T4 (Prototyping) / [Your_Main_GPU] |
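As a rough sanity check on these numbers, the arithmetic below estimates the memory needed just to hold the weights in a 16-bit format and the share of parameters active per token. It assumes 2 bytes per parameter (true for both bfloat16 and float16) and treats "~2.5B" as exactly 2.5 billion; activations, the KV cache, and framework overhead are not included.

```python
TOTAL_PARAMS = 12_500_340_736   # total parameters, from the table
ACTIVE_PARAMS = 2_500_000_000   # "~2.5B" active per token (approximate)
BYTES_PER_PARAM = 2             # bfloat16 and float16 both use 2 bytes

weight_bytes = TOTAL_PARAMS * BYTES_PER_PARAM
weight_gib = weight_bytes / 2**30
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"weights: {weight_gib:.1f} GiB")             # weights: 23.3 GiB
print(f"active per token: {active_fraction:.1%}")   # active per token: 20.0%
```

In a typical MoE deployment all experts stay resident in memory even though only some are used per token, so the memory footprint tracks the total parameter count while compute tracks the active count.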
Training Methodology
The model was trained with advanced memory optimization techniques to ensure stability on consumer and enterprise-grade hardware:
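- 8-bit Optimizer: Used the bitsandbytes 8-bit AdamW optimizer, which cuts optimizer-state memory by roughly 75% relative to 32-bit AdamW.
- Gradient Checkpointing: Enabled to reduce activation memory in the deep MoE layers, at the cost of recomputing activations during the backward pass.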
- Dataset: Fine-tuned on a diverse corpus of Vietnamese and English text, focusing on reasoning, logic, and natural conversation.
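The 75% figure for the optimizer can be checked with simple arithmetic, and the two memory techniques can be wired up in a few lines. The sketch below is an illustration under stated assumptions, not the author's training script: it assumes every parameter is trainable, ignores the small per-block quantization constants that 8-bit optimizers also store, and the helper name `configure_memory_saving` and the learning rate are hypothetical.

```python
def adamw_state_bytes(num_params: int, bytes_per_state: int) -> int:
    """AdamW keeps two state tensors per parameter (first and second moments)."""
    return 2 * num_params * bytes_per_state

TOTAL_PARAMS = 12_500_340_736
fp32_state = adamw_state_bytes(TOTAL_PARAMS, 4)  # 32-bit states
int8_state = adamw_state_bytes(TOTAL_PARAMS, 1)  # 8-bit states
saving = 1 - int8_state / fp32_state

print(f"32-bit states: {fp32_state / 1e9:.1f} GB")  # 32-bit states: 100.0 GB
print(f"8-bit states:  {int8_state / 1e9:.1f} GB")  # 8-bit states:  25.0 GB
print(f"saving: {saving:.0%}")                      # saving: 75%


def configure_memory_saving(model, learning_rate: float = 1e-5):
    """Enable gradient checkpointing and return an 8-bit AdamW optimizer.

    `model` is expected to be a transformers PreTrainedModel. The learning
    rate is a placeholder, not a value taken from this model card.
    """
    # Imported lazily because bitsandbytes needs a GPU-enabled install.
    import bitsandbytes as bnb

    model.gradient_checkpointing_enable()  # recompute activations in backward
    model.config.use_cache = False         # the KV cache is not needed in training
    return bnb.optim.AdamW8bit(model.parameters(), lr=learning_rate)
```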
Quick Start (Usage)
To use this model, install transformers and accelerate (accelerate is required for device_map="auto").
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Repo ID of this model on the Hugging Face Hub
model_id = "Minh2508/Decode"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # Required for custom MoE architectures
)

# Test prompt
prompt = "Explain the concept of Quantum Computing in simple terms."
# Move the inputs to the device that device_map="auto" placed the model on
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
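Because the context window is 4096 tokens, long prompts leave less room for generation. The helper below is a small sketch for capping `max_new_tokens`; it assumes the window counts prompt and generated tokens together, which is the usual convention for decoder-only models, and `clamp_new_tokens` is a hypothetical name rather than part of transformers.

```python
CONTEXT_WINDOW = 4096  # from the Technical Specifications table

def clamp_new_tokens(prompt_tokens: int, requested_new_tokens: int,
                     context_window: int = CONTEXT_WINDOW) -> int:
    """Return how many new tokens fit without exceeding the context window."""
    if prompt_tokens >= context_window:
        raise ValueError("prompt already fills the context window; shorten it")
    return min(requested_new_tokens, context_window - prompt_tokens)

print(clamp_new_tokens(prompt_tokens=12, requested_new_tokens=512))    # 512
print(clamp_new_tokens(prompt_tokens=3900, requested_new_tokens=512))  # 196
```

With the tokenizer output from the snippet above, the prompt length can be read from `inputs["input_ids"].shape[1]`.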