---
license: mit
tags:
- metagenomics
- bacteria
---
# MicroFlow: A Pretrained Mixture of Experts Model for Bacterial/Metagenomic Sequence Analysis
## Model Description
**MicroFlow** is a **pretrained language model** built on the Mixtral Mixture of Experts (MoE) architecture and optimized specifically for analyzing bacterial and metagenomic sequences. Trained on large-scale tokenized metagenomic datasets, the model uses a custom bidirectional attention mechanism to capture semantic dependencies in both directions of a microbial sequence, and serves as a foundation model for downstream metagenomic analysis tasks (e.g., sequence classification, taxonomic annotation, and microbial community profiling).
## Key Features
### 1. Architecture Design
- **Base Architecture**: Mixture of Experts (MoE) pretrained model based on Mixtral
- **Parameter Scale**: Configurable, aligned with Mixtral MoE variants and adjustable via `MixtralConfig` (see the configuration sketch after this list)
- **Attention Mechanism**: **Bidirectional attention mechanism (non-causal)** implemented via custom SDPA (Scaled Dot Product Attention) and FlashAttention-2 with GQA (Grouped Query Attention) support
- **Tokenization**: **Custom BPE (Byte-Pair Encoding) tokenizer** extended with microbe-specific special tokens (`<abu>`, `<name>`), keeping the vocabulary size consistent with the pretrained base tokenizer
- **Position Encoding**: RoPE (Rotary Positional Encoding) with configurable theta (default: 10000)
- **Expert System**: Inherits Mixtral’s MoE expert configuration (8 local experts, 2 experts activated per token)
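A minimal configuration sketch reflecting the settings listed above. The expert counts, GQA setup, and RoPE theta come from this card; the hidden size, layer count, and head counts are illustrative placeholders, not the released checkpoint's actual hyperparameters:

```python
from transformers import MixtralConfig

# Sketch only: values marked "placeholder" are assumptions for illustration.
config = MixtralConfig(
    vocab_size=32000,              # placeholder: match the extended BPE tokenizer
    hidden_size=1024,              # placeholder: set to the desired parameter scale
    num_hidden_layers=12,          # placeholder
    num_attention_heads=16,        # placeholder
    num_key_value_heads=4,         # GQA: fewer KV heads than query heads
    num_local_experts=8,           # 8 local experts (per this card)
    num_experts_per_tok=2,         # 2 experts activated per token
    rope_theta=10000.0,            # RoPE theta default (per this card)
    max_position_embeddings=2048,  # longest pretraining stage
)
```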
### 2. Pretraining Strategy
- **Pretraining Data**: 3,264,597 metagenomic token sequences in plain-text format, grouped by token length (≤160, 160 < len ≤ 320, 320 < len ≤ 2048), with sequences tagged with `<abu>`/`<name>` based on structural features
- **Sequence Processing**: Token sequences truncated or padded to the target length (160/320/2048) with the `<pad>` token, without cutting across semantic boundaries
- **Training Objectives**:
- Masked Language Modeling (MLM, 15% masking probability, optional; see the sketch after this list)
- BERT-style pretraining with bidirectional attention (non-causal)
- Multi-stage progressive pretraining (160→320→2048 tokens) to stabilize long-sequence training
- MoE router auxiliary loss (scaled by configurable coefficient) to optimize expert selection
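As a sketch of the data-side pieces above, the length bucketing and the 15% MLM masking can be expressed with a small helper plus the stock Hugging Face collator. The repo id and the presence of pad/mask tokens in the tokenizer are assumptions here:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Hypothetical checkpoint id -- substitute the actual path.
tokenizer = AutoTokenizer.from_pretrained("zhangchao162/MicroFlow")

STAGE_LENGTHS = (160, 320, 2048)  # multi-stage curriculum lengths

def bucket_length(n_tokens: int) -> int:
    """Assign a sequence to the shortest stage length that fits it."""
    for target in STAGE_LENGTHS:
        if n_tokens <= target:
            return target
    return STAGE_LENGTHS[-1]

# 15% MLM masking via the stock collator; assumes the tokenizer
# defines both pad and mask tokens.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```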
## Usage
**Important**: This model requires the custom bidirectional attention mechanism to be set up before loading. Follow these steps in order (a sketch follows the list):

1. Define the custom bidirectional attention functions (SDPA / FlashAttention-2).
2. Register the custom attention functions in `ALL_ATTENTION_FUNCTIONS`.
3. Configure the model with `attn_implementation="bidirectional_flash"` (FlashAttention-2) or `attn_implementation="bidirectional"` (SDPA).
4. Load the model weights and tokenizer (extended with the `<abu>`/`<name>` special tokens).
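A minimal sketch of these steps for the SDPA path, assuming a recent `transformers` release where `ALL_ATTENTION_FUNCTIONS` is an extensible registry; the attention function body and the repo id are illustrative assumptions, not the repo's exact implementation:

```python
import torch
from transformers import AutoModel, AutoTokenizer
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS

# Step 1: a bidirectional (non-causal) SDPA variant. The signature follows
# the attention-interface convention of recent transformers releases.
def bidirectional_sdpa(module, query, key, value, attention_mask,
                       dropout=0.0, scaling=None, **kwargs):
    n_rep = query.shape[1] // key.shape[1]        # expand KV heads for GQA
    key = key.repeat_interleave(n_rep, dim=1)
    value = value.repeat_interleave(n_rep, dim=1)
    attn_output = torch.nn.functional.scaled_dot_product_attention(
        query, key, value,
        attn_mask=attention_mask,
        dropout_p=dropout,
        scale=scaling,
        is_causal=False,                          # bidirectional: no causal mask
    )
    return attn_output.transpose(1, 2).contiguous(), None

# Step 2: register under the name the config will reference.
ALL_ATTENTION_FUNCTIONS["bidirectional"] = bidirectional_sdpa

# Steps 3-4: configure and load; "zhangchao162/MicroFlow" is a hypothetical
# repo id -- substitute the actual checkpoint path.
model = AutoModel.from_pretrained("zhangchao162/MicroFlow",
                                  attn_implementation="bidirectional")
tokenizer = AutoTokenizer.from_pretrained("zhangchao162/MicroFlow")
added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<abu>", "<name>"]}
)
if added:  # only needed if the saved tokenizer lacked the special tokens
    model.resize_token_embeddings(len(tokenizer))
```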
The extracted embeddings capture deep semantic features of metagenomic sequences and can be used directly for downstream analysis tasks (e.g., taxonomic classification) without additional fine-tuning.
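For instance, a mean-pooling extractor over the final hidden states might look like this (a sketch assuming `model` and `tokenizer` are loaded as above):

```python
import torch

def embed(seqs, model, tokenizer):
    """Mean-pool final hidden states into fixed-size sequence embeddings."""
    batch = tokenizer(seqs, padding=True, truncation=True,
                      max_length=2048, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    hidden = out.last_hidden_state                # (batch, seq, hidden)
    mask = batch["attention_mask"].unsqueeze(-1)  # zero out <pad> positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
```

The resulting vectors can feed any standard classifier (e.g., logistic regression over taxonomic labels).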
## Citation
If you use this pretrained model in your research, please cite:
```bibtex
@software{microflow_metagenomic2025,
  title  = {MicroFlow: A Pretrained Mixture of Experts Model for Bacterial/Metagenomic Sequence Analysis},
  author = {Zhang, Chao},
  year   = {2025},
  url    = {https://github.com/zhangchao162/microflow},
  note   = {Pretrained MoE model with bidirectional SDPA/FlashAttention and custom BPE tokenization for metagenomic sequence analysis}
}
```
## Contact
For questions about model usage, pretraining pipeline, or fine-tuning guidance for downstream metagenomic tasks, please contact 1623804006@qq.com.