# MicroFlow: A Pretrained Mixture of Experts Model for Bacterial/Metagenomic Sequence Analysis
## Model Description
MicroFlow is a pretrained language model built on the Mixtral Mixture-of-Experts (MoE) architecture and optimized for analyzing bacterial and metagenomic sequences. Trained on large-scale tokenized metagenomic datasets, it uses a custom bidirectional (non-causal) attention mechanism to capture semantic dependencies in both directions of a microbial sequence, and serves as a foundation model for downstream metagenomic analysis tasks (e.g., sequence classification, taxonomic annotation, and microbial community profiling).
## Key Features
### 1. Architecture Design
- Base Architecture: Mixture of Experts (MoE) pretrained model based on Mixtral
- Parameter Scale: Configurable parameter scale aligned with Mixtral MoE variants (adjustable via MixtralConfig)
- Attention Mechanism: Bidirectional attention mechanism (non-causal) implemented via custom SDPA (Scaled Dot Product Attention) and FlashAttention-2 with GQA (Grouped Query Attention) support
- Tokenization: Custom BPE (Byte-Pair Encoding) tokenizer extended with microbial-specific special tokens (`<abu>`, `<name>`), with a vocabulary size consistent with the pretrained base tokenizer
- Position Encoding: RoPE (Rotary Position Embedding) with configurable theta (default: 10000)
- Expert System: Inherits Mixtral’s MoE expert configuration (8 local experts, 2 experts activated per token)
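The top-2 expert routing inherited from Mixtral can be sketched in plain PyTorch. This is an illustrative sketch, not the repository's actual code; the function name, shapes, and router weight are made up for the example:

```python
import torch
import torch.nn.functional as F

def top2_route(hidden, router_weight, top_k=2):
    """Mixtral-style routing sketch: per token, softmax over the expert
    logits, keep the top-2 experts, and renormalize their weights so
    each token's two expert contributions sum to 1."""
    logits = hidden @ router_weight                        # (tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    weights, experts = torch.topk(probs, top_k, dim=-1)    # top-2 per token
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize
    return weights, experts

# Toy example: 5 tokens, hidden size 64, 8 local experts as in Mixtral
tokens = torch.randn(5, 64)
w_router = torch.randn(64, 8)
weights, experts = top2_route(tokens, w_router)  # weights/experts: (5, 2)
```

Each token's output is then the weighted sum of the two selected experts' FFN outputs; the router logits also feed the auxiliary load-balancing loss mentioned under the pretraining objectives below.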
### 2. Pretraining Strategy
- Pretraining Data: 3,264,597 metagenomic token sequences in plain text format (grouped by token length: ≤160, 160<len≤320, 320<len≤2048), with sequences tagged with `<abu>`/`<name>` based on structural features
- Sequence Processing: Token sequences truncated/padded to the target length (160/320/2048) with the `<pad>` token, without truncating semantic boundaries
- Training Objectives:
- Masked Language Modeling (MLM, 15% masking probability, optional)
- BERT-style pretraining with bidirectional attention (non-causal)
- Multi-stage progressive pretraining (160→320→2048 tokens) to stabilize long-sequence training
- MoE router auxiliary loss (scaled by configurable coefficient) to optimize expert selection
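The MLM objective above can be sketched with the standard BERT recipe (15% of tokens selected; of those, 80% replaced with a mask token, 10% with a random token, 10% left unchanged). The 80/10/10 split and the token IDs below are assumptions for illustration, not MicroFlow's actual vocabulary:

```python
import torch

MASK_ID, PAD_ID, VOCAB_SIZE = 4, 0, 32000  # hypothetical IDs; the real tokenizer defines these

def mask_tokens(input_ids, mlm_prob=0.15):
    """BERT-style MLM masking: pick ~15% of non-pad tokens as prediction
    targets; of those, 80% -> <mask>, 10% -> random token, 10% unchanged."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    select_prob = torch.full(input_ids.shape, mlm_prob)
    select_prob[input_ids == PAD_ID] = 0.0          # never mask padding
    selected = torch.bernoulli(select_prob).bool()
    labels[~selected] = -100                        # ignored by cross-entropy

    # 80% of selected positions are replaced with the <mask> token
    masked = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[masked] = MASK_ID

    # half of the remaining 20% get a random (non-special) token
    randomized = (torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool()
                  & selected & ~masked)
    random_ids = torch.randint(5, VOCAB_SIZE, input_ids.shape)
    input_ids[randomized] = random_ids[randomized]
    # the remaining selected positions keep their original token
    return input_ids, labels

# Toy batch at the first curriculum stage length (160 tokens)
ids = torch.randint(5, 100, (2, 160))
out_ids, labels = mask_tokens(ids)
```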
**Important:**
This model requires the custom bidirectional attention mechanism to be set up before loading. Follow the setup steps in order:
1) Define the custom bidirectional attention functions (SDPA/FlashAttention-2),
2) Register the custom attention functions in `ALL_ATTENTION_FUNCTIONS`,
3) Configure the model with `attn_implementation="bidirectional_flash"` (FlashAttention-2) or `attn_implementation="bidirectional"` (SDPA),
4) Load the model weights and tokenizer (extending the tokenizer with the `<abu>`/`<name>` special tokens).
The extracted embeddings capture deep semantic features of metagenomic sequences and can be used directly for downstream analysis tasks (e.g., taxonomic classification) without additional fine-tuning.
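One common way to turn per-token hidden states into a single sequence embedding is masked mean pooling. A minimal sketch on a toy tensor, assuming that in practice `last_hidden_state` comes from the model's forward pass and `attention_mask` from the tokenizer (the pooling choice itself is illustrative, not prescribed by the model card):

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over the sequence, ignoring <pad> positions.
    last_hidden_state: (batch, seq, hidden); attention_mask: (batch, seq)."""
    mask = attention_mask.unsqueeze(-1).float()       # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)    # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)           # avoid division by zero
    return summed / counts

# Toy batch: 2 sequences of length 8, hidden size 32; second sequence
# has 3 trailing <pad> tokens excluded from the average.
hidden = torch.randn(2, 8, 32)
mask = torch.tensor([[1] * 8, [1] * 5 + [0] * 3])
emb = mean_pool(hidden, mask)  # (2, 32) sequence embeddings
```

The resulting vectors can be fed directly to a lightweight classifier (e.g., logistic regression) for taxonomic classification.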
## Citation
If you use this pretrained model in your research, please cite:
```bibtex
@software{microflow_metagenomic2025,
  title  = {MicroFlow: A Pretrained Mixture of Experts Model for Bacterial/Metagenomic Sequence Analysis},
  author = {Zhang, Chao},
  year   = {2025},
  url    = {https://github.com/zhangchao162/microflow},
  note   = {Pretrained MoE model with bidirectional SDPA/FlashAttention and custom BPE tokenization for metagenomic sequence analysis}
}
```
## Contact
For questions about model usage, pretraining pipeline, or fine-tuning guidance for downstream metagenomic tasks, please contact 1623804006@qq.com.