---
license: mit
tags:
- metagenomics
- bacteria
---

# MicroFlow: A Pretrained Mixture of Experts Model for Bacterial/Metagenomic Sequence Analysis

## Model Description

**MicroFlow** is a **pretrained language model** built on the Mixtral Mixture of Experts (MoE) architecture and optimized for analyzing bacterial and metagenomic sequences. Trained on large-scale tokenized metagenomic datasets, the model uses a custom bidirectional attention mechanism to capture semantic dependencies in both directions across microbial sequences, and it serves as a foundation model for downstream metagenomic analysis tasks (e.g., sequence classification, taxonomic annotation, and microbial community profiling).

## Key Features

### 1. Architecture Design

- **Base Architecture**: Mixture of Experts (MoE) pretrained model based on Mixtral
- **Parameter Scale**: configurable, aligned with Mixtral MoE variants (adjustable via `MixtralConfig`)
- **Attention Mechanism**: **bidirectional (non-causal) attention** implemented via custom SDPA (Scaled Dot Product Attention) and FlashAttention-2 kernels with GQA (Grouped Query Attention) support
- **Tokenization**: **custom BPE (Byte-Pair Encoding) tokenizer** extended with microbial-specific special tokens (`<abu>`, `<name>`); the vocabulary size stays consistent with the pretrained base tokenizer
- **Position Encoding**: RoPE (Rotary Positional Encoding) with configurable theta (default: 10000)
- **Expert System**: inherits Mixtral's MoE expert configuration (8 local experts, 2 experts activated per token); a configuration sketch follows this list
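
For illustration, here is a minimal configuration sketch of these settings using the Hugging Face `transformers` API. The head counts and the tokenizer path are placeholders, not the model's actual values.

```python
# Minimal sketch of an architecture configuration matching the bullets above.
# Head counts and the tokenizer path are placeholders, not the released values.
from transformers import AutoTokenizer, MixtralConfig

config = MixtralConfig(
    num_local_experts=8,       # 8 local experts (Mixtral MoE default)
    num_experts_per_tok=2,     # 2 experts activated per token
    rope_theta=10000.0,        # RoPE theta (default: 10000)
    num_attention_heads=32,    # placeholder; GQA is active when this exceeds num_key_value_heads
    num_key_value_heads=8,     # placeholder GQA setting
)

# Extend a base BPE tokenizer with the microbial-specific special tokens.
tokenizer = AutoTokenizer.from_pretrained("path/to/base-bpe-tokenizer")  # placeholder path
tokenizer.add_special_tokens({"additional_special_tokens": ["<abu>", "<name>"]})

# Keep the embedding matrix consistent with the extended vocabulary.
config.vocab_size = len(tokenizer)
```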

### 2. Pretraining Strategy

- **Pretraining Data**: 3,264,597 metagenomic token sequences in plain-text format, grouped by token length (len ≤ 160, 160 < len ≤ 320, 320 < len ≤ 2048), with sequences tagged with `<abu>`/`<name>` based on their structural features
- **Sequence Processing**: token sequences are truncated/padded to the target length (160/320/2048) with the `<pad>` token, without cutting across semantic boundaries
- **Training Objectives** (a data-pipeline sketch follows this list):
  - Masked Language Modeling (MLM, 15% masking probability, optional)
  - BERT-style pretraining with bidirectional (non-causal) attention
  - Multi-stage progressive pretraining (160 → 320 → 2048 tokens) to stabilize long-sequence training
  - MoE router auxiliary loss (scaled by a configurable coefficient) to optimize expert selection
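
For illustration, a minimal sketch of the masking objective and fixed-length padding described above, assembled from standard `transformers` utilities. The tokenizer path and stage length are placeholders; this is not the exact pretraining pipeline.

```python
# Sketch of the 15% MLM masking and fixed-length padding, using standard
# Hugging Face utilities; paths and the stage length are placeholders.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("path/to/microflow-tokenizer")  # placeholder path

# Pad/truncate every sequence to the current stage length (160, 320, or 2048).
stage_length = 160

def encode(batch):
    return tokenizer(
        batch["text"],
        padding="max_length",   # pad with the tokenizer's <pad> token
        truncation=True,
        max_length=stage_length,
    )

# Mask 15% of the input tokens for the BERT-style MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```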

**Important**:

This model requires the custom bidirectional attention mechanism to be set up before loading. Follow the setup steps in order:

1. Define the custom bidirectional attention functions (SDPA/FlashAttention-2).
2. Register the custom attention functions in `ALL_ATTENTION_FUNCTIONS`.
3. Configure the model with `attn_implementation="bidirectional_flash"` (for FlashAttention) or `attn_implementation="bidirectional"` (for SDPA).
4. Load the model weights and tokenizer (extended with the `<abu>`/`<name>` special tokens).
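
For illustration, a minimal loading sketch that follows these four steps. The attention function shown is a simplified stand-in for the repository's actual bidirectional implementation, and the checkpoint path is a placeholder.

```python
# Loading-order sketch; the attention function is a simplified stand-in and
# "path/to/microflow" is a placeholder for the released checkpoint.
import torch
from transformers import AutoTokenizer, MixtralModel
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS


# 1) Define a non-causal (bidirectional) SDPA attention function.
def bidirectional_sdpa(module, query, key, value, attention_mask, **kwargs):
    # Expand key/value heads for GQA, then run SDPA with is_causal=False.
    # NOTE: the repository's full setup also ensures the mask passed here is
    # non-causal; this snippet only illustrates the registration/loading order.
    key = key.repeat_interleave(module.num_key_value_groups, dim=1)
    value = value.repeat_interleave(module.num_key_value_groups, dim=1)
    out = torch.nn.functional.scaled_dot_product_attention(
        query, key, value, attn_mask=attention_mask, is_causal=False
    )
    return out.transpose(1, 2).contiguous(), None


# 2) Register the custom function so it can be referenced by name.
ALL_ATTENTION_FUNCTIONS["bidirectional"] = bidirectional_sdpa

# 3) + 4) Load the weights with the registered implementation, then extend the
# tokenizer with the microbial special tokens.
model = MixtralModel.from_pretrained("path/to/microflow", attn_implementation="bidirectional")
tokenizer = AutoTokenizer.from_pretrained("path/to/microflow")
tokenizer.add_special_tokens({"additional_special_tokens": ["<abu>", "<name>"]})
model.resize_token_embeddings(len(tokenizer))
```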

The extracted embeddings capture deep semantic features of metagenomic sequences and can be used directly for downstream analysis tasks (e.g., taxonomic classification) without additional fine-tuning.
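
A minimal extraction sketch, assuming the `model` and `tokenizer` loaded above; mean pooling over non-padding positions is one reasonable pooling choice, not necessarily the authors' exact recipe.

```python
# Sketch: mean-pool the last hidden state over non-padding positions to get a
# sequence-level embedding. Assumes `model` and `tokenizer` from the sketch above.
import torch

model.eval()
inputs = tokenizer("<name> example metagenomic token sequence", return_tensors="pt")  # placeholder input

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state             # (1, seq_len, hidden_size)

mask = inputs["attention_mask"].unsqueeze(-1)               # (1, seq_len, 1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (1, hidden_size)
```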

## Citation

If you use this pretrained model in your research, please cite:

```bibtex
@software{microflow_metagenomic2025,
  title  = {MicroFlow: A Pretrained Mixture of Experts Model for Bacterial/Metagenomic Sequence Analysis},
  author = {Zhang, Chao},
  year   = {2025},
  url    = {https://github.com/zhangchao162/microflow},
  note   = {Pretrained MoE model with bidirectional SDPA/FlashAttention and custom BPE tokenization for metagenomic sequence analysis}
}
```

## Contact

For questions about model usage, the pretraining pipeline, or fine-tuning guidance for downstream metagenomic tasks, please contact 1623804006@qq.com.