---
language:
- vi
- en
license: apache-2.0
library_name: transformers
tags:
- moe
- mixture-of-experts
- text-generation
- decode-series
- llm
- vietnamese-llm
datasets:
- markov-ai/computer-use-large
metrics:
- loss
- perplexity
model-index:
- name: Decode-12B-MoE
  results: []
---
# Decode-12B-MoE: High-Performance Mixture of Experts Model

**Decode-12B-MoE** is a Large Language Model (LLM) built on a **Sparse Mixture of Experts (MoE)** architecture with a total of **12.5 billion parameters**. Only a fraction of those weights (~2.5B, roughly 20%) is active for each token during inference, so per-token compute scales with the active parameters rather than with the full parameter count.
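
To make the idea of sparse activation concrete, here is a toy sketch of top-k expert routing in plain Python. It is not taken from this model's code: the number of experts, the value of `k`, the gate scores, and the helper names (`route_top_k`, `moe_forward`) are all illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized so that the selected experts sum to 1.
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, gate_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_logits, k))

# Hypothetical toy experts: each one simply scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gate_logits = [0.1, 2.0, 0.3, 2.0]  # made-up router scores for one token

selected = route_top_k(gate_logits, k=2)
output = moe_forward(1.0, experts, gate_logits, k=2)
print("selected experts:", selected)
print("mixed output:", output)
```

Only the experts returned by `route_top_k` are evaluated, which is why the cost per token tracks the active parameters rather than the total.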
## Technical Specifications

| Attribute | Value |
| :--- | :--- |
| **Total Parameters** | 12,500,340,736 (12.5B) |
| **Active Parameters** | ~2.5B per token |
| **Architecture** | Sparse MoE (Decoder-only) |
| **Context Window** | 4096 tokens |
| **Format** | Bfloat16 / Float16 |
| **Training Hardware** | NVIDIA Tesla T4 (Prototyping) / [Your_Main_GPU] |
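
As a rough back-of-the-envelope check on the figures in the table above, the sketch below works out the active-parameter fraction and the size of the weights in a 16-bit format. The active count is rounded to exactly 2.5B, which is an assumption; the table only gives "~2.5B".

```python
TOTAL_PARAMS = 12_500_340_736   # from the specification table
ACTIVE_PARAMS = 2_500_000_000   # "~2.5B" rounded; an approximation
BYTES_PER_PARAM = 2             # bfloat16 and float16 both use 2 bytes

# Share of the weights that participate in each token's forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Storage needed for all weights in a 16-bit format.
weight_bytes = TOTAL_PARAMS * BYTES_PER_PARAM
weight_gib = weight_bytes / 2**30

print(f"active fraction: {active_fraction:.1%}")
print(f"weights in 16-bit: {weight_gib:.1f} GiB")
```

Note that sparse activation lowers compute per token, not the weight footprint: unless experts are offloaded, all of the weights typically still need to be resident in memory.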
## Training Methodology

The model was trained with memory optimization techniques so that training stays within the memory limits of consumer and enterprise-grade hardware:
- **8-bit Optimizer:** Used the `bitsandbytes` AdamW optimizer, which stores optimizer states in 8 bits instead of 32 and so cuts their memory footprint by about 75%.
- **Gradient Checkpointing:** Enabled to reduce activation memory in the deep MoE layers by recomputing activations during the backward pass instead of storing them.
- **Dataset:** Fine-tuned on a diverse corpus of Vietnamese and English text, focusing on reasoning, logic, and natural conversation.
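
The 75% figure for the 8-bit optimizer can be reproduced with simple arithmetic. The sketch below assumes that every parameter is trained and that AdamW keeps two state values per parameter; it ignores the small per-block quantization constants that an 8-bit optimizer also stores, so the real saving is slightly below 75%. The helper name `adamw_state_bytes` is hypothetical.

```python
def adamw_state_bytes(num_params, bytes_per_state):
    """Memory for AdamW's two per-parameter states (first and second moment)."""
    return 2 * num_params * bytes_per_state

num_params = 12_500_340_736  # total parameters from the model card

fp32_bytes = adamw_state_bytes(num_params, 4)  # 32-bit states
int8_bytes = adamw_state_bytes(num_params, 1)  # 8-bit states
reduction = 1 - int8_bytes / fp32_bytes

print(f"32-bit optimizer states: {fp32_bytes / 2**30:.1f} GiB")
print(f"8-bit optimizer states:  {int8_bytes / 2**30:.1f} GiB")
print(f"reduction: {reduction:.0%}")
```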
## Quick Start (Usage)

To use this model, make sure `transformers` and `accelerate` are installed.
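
One quick way to confirm the prerequisites before loading the model is to check that the packages are importable. This is a minimal sketch: the helper name `missing_packages` is hypothetical, and `torch` is included only because the example that follows imports it.

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if find_spec(name) is None]

missing = missing_packages(["transformers", "accelerate", "torch"])
if missing:
    print("Missing packages:", ", ".join(missing))
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```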
| ```python | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| import torch | |
| # Hugging Face repo ID for this model | |
| model_id = "Minh2508/Decode" | |
| tokenizer = AutoTokenizer.from_pretrained(model_id) | |
| model = AutoModelForCausalLM.from_pretrained( | |
|     model_id, | |
|     torch_dtype=torch.bfloat16, | |
|     device_map="auto", | |
|     trust_remote_code=True,  # Required for custom MoE architectures | |
| ) | |
| # Test prompt | |
| prompt = "Explain the concept of Quantum Computing in simple terms." | |
| # Send inputs to the model's device instead of hardcoding "cuda", since device_map="auto" picks the placement | |
| inputs = tokenizer(prompt, return_tensors="pt").to(model.device) | |
| with torch.no_grad(): | |
|     outputs = model.generate( | |
|         **inputs, | |
|         max_new_tokens=512, | |
|         temperature=0.7, | |
|         top_p=0.9, | |
|         do_sample=True, | |
|     ) | |
| print(tokenizer.decode(outputs[0], skip_special_tokens=True)) | |