---
license: mit
tags:
  - metagenomics
  - bacteria
---

# MicroFlow: A Pretrained Mixture of Experts Model for Bacterial/Metagenomic Sequence Analysis

## Model Description

MicroFlow is a pretrained language model built on the Mixtral Mixture of Experts (MoE) architecture and optimized for analyzing bacterial and metagenomic sequences. Trained on large-scale tokenized metagenomic datasets, the model replaces Mixtral's causal attention with a custom bidirectional attention mechanism, allowing it to capture semantic dependencies in both directions within microbial sequences. It serves as a foundation model for downstream metagenomic analysis tasks (e.g., sequence classification, taxonomic annotation, and microbial community profiling).

## Key Features

### 1. Architecture Design

- **Base architecture**: Mixture of Experts (MoE) pretrained model based on Mixtral
- **Parameter scale**: configurable, aligned with Mixtral MoE variants (adjustable via `MixtralConfig`)
- **Attention mechanism**: bidirectional (non-causal) attention implemented via custom SDPA (Scaled Dot-Product Attention) and FlashAttention-2 kernels, with GQA (Grouped Query Attention) support
- **Tokenization**: custom BPE (Byte-Pair Encoding) tokenizer extended with microbial-specific special tokens (`<abu>`, `<name>`); vocabulary size consistent with the pretrained base tokenizer
- **Position encoding**: RoPE (Rotary Positional Encoding) with configurable theta (default: 10000)
- **Expert system**: inherits Mixtral's MoE expert configuration (8 local experts, 2 experts activated per token)

### 2. Pretraining Strategy

- **Pretraining data**: 3,264,597 metagenomic token sequences in plain-text format, grouped by token length (≤160, 160 < len ≤ 320, 320 < len ≤ 2048), with sequences tagged with `<abu>`/`<name>` based on structural features
- **Sequence processing**: token sequences truncated/padded to the target length (160/320/2048) with the `<pad>` token, without truncating across semantic boundaries
- **Training objectives**:
  - Masked Language Modeling (MLM, 15% masking probability, optional)
  - BERT-style pretraining with bidirectional (non-causal) attention
  - Multi-stage progressive pretraining (160 → 320 → 2048 tokens) to stabilize long-sequence training
  - MoE router auxiliary loss (scaled by a configurable coefficient) to optimize expert selection
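The length-grouped truncation/padding scheme above can be sketched as a small bucketing helper. The bucket boundaries come from this card; the function name and the plain-string token representation are illustrative:

```python
# Sketch of the length-bucketed truncation/padding described above.
# Bucket boundaries (160/320/2048) are from this card; everything else
# (function name, plain-string tokens) is illustrative.
BUCKETS = (160, 320, 2048)

def pad_to_bucket(tokens, pad_token="<pad>"):
    """Place a token sequence in the smallest bucket that fits it,
    truncating to the largest bucket if needed, then pad with <pad>."""
    target = next((b for b in BUCKETS if len(tokens) <= b), BUCKETS[-1])
    clipped = tokens[:target]
    return clipped + [pad_token] * (target - len(clipped))
```

For example, `pad_to_bucket(["<abu>"] + ["seq_tok"] * 150)` lands in the 160-token bucket, with the trailing positions filled by `<pad>`.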

## Usage

**Important:** this model requires the custom bidirectional attention mechanism to be set up before loading. Follow the setup steps in order:

1. Define the custom bidirectional attention functions (SDPA/FlashAttention-2).
2. Register the custom attention functions in `ALL_ATTENTION_FUNCTIONS`.
3. Configure the model with `attn_implementation="bidirectional_flash"` (FlashAttention-2) or `attn_implementation="bidirectional"` (SDPA).
4. Load the model weights and tokenizer (extending the tokenizer with the `<abu>`/`<name>` special tokens).

The extracted embeddings capture deep semantic features of metagenomic sequences and can be used directly for downstream analysis tasks (e.g., taxonomic classification) without additional fine-tuning.
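For example, a mean-pooled embedding over non-`<pad>` positions might be computed as below; the pooling strategy and helper name are illustrative assumptions, not a documented API of this model:

```python
# Illustrative mean-pooled embedding extraction; the pooling strategy
# and helper name are assumptions, not a documented API of this model.
import torch

def embed_sequences(sequences, model, tokenizer, max_length=2048):
    """Return one embedding per sequence by mean-pooling hidden states
    over non-padding positions."""
    batch = tokenizer(
        sequences, padding=True, truncation=True,
        max_length=max_length, return_tensors="pt",
    )
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (B, L, H)
    mask = batch["attention_mask"].unsqueeze(-1).float() # (B, L, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (B, H)
```

The resulting `(batch, hidden)` matrix can feed any downstream classifier (e.g., logistic regression over taxonomic labels).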

## Citation

If you use this pretrained model in your research, please cite:

```bibtex
@software{microflow_metagenomic2025,
  title = {MicroFlow: A Pretrained Mixture of Experts Model for Bacterial/Metagenomic Sequence Analysis},
  author = {Zhang, Chao},
  year = {2025},
  url = {https://github.com/zhangchao162/microflow},
  note = {Pretrained MoE model with bidirectional SDPA/FlashAttention and custom BPE tokenization for metagenomic sequence analysis}
}
```

## Contact

For questions about model usage, pretraining pipeline, or fine-tuning guidance for downstream metagenomic tasks, please contact 1623804006@qq.com.