---
license: mit
tags:
- metagenomics
- bacteria
---

# MicroFlow: A Pretrained Mixture of Experts Model for Bacterial/Metagenomic Sequence Analysis

## Model Description

**MicroFlow** is a **pretrained language model** built on the Mixtral Mixture of Experts (MoE) architecture and optimized for analyzing bacterial and metagenomic sequences. Trained on large-scale tokenized metagenomic datasets, the model uses a custom bidirectional attention mechanism to capture semantic dependencies in both directions across microbial sequences, and it serves as a foundation model for downstream metagenomic analysis tasks (e.g., sequence classification, taxonomic annotation, and microbial community profiling).

## Key Features

### 1. Architecture Design

- **Base Architecture**: Mixture of Experts (MoE) pretrained model based on Mixtral
- **Parameter Scale**: configurable, aligned with Mixtral MoE variants (adjustable via `MixtralConfig`)
- **Attention Mechanism**: **bidirectional (non-causal) attention** implemented via custom SDPA (Scaled Dot Product Attention) and FlashAttention-2 kernels with GQA (Grouped Query Attention) support
- **Tokenization**: **custom BPE (Byte-Pair Encoding) tokenizer** extended with microbial-specific special tokens (`<abu>`, `<name>`); the vocabulary size stays consistent with the pretrained base tokenizer
- **Position Encoding**: RoPE (Rotary Positional Encoding) with configurable theta (default: 10000)
- **Expert System**: inherits Mixtral's MoE expert configuration (8 local experts, 2 experts activated per token); a configuration sketch follows this list
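
For illustration, here is a minimal configuration sketch of these settings using the Hugging Face `transformers` API. The head counts and the tokenizer path are placeholders, not the model's actual values.

```python
# Minimal sketch of an architecture configuration matching the bullets above.
# Head counts and the tokenizer path are placeholders, not the released values.
from transformers import AutoTokenizer, MixtralConfig

config = MixtralConfig(
    num_local_experts=8,       # 8 local experts (Mixtral MoE default)
    num_experts_per_tok=2,     # 2 experts activated per token
    rope_theta=10000.0,        # RoPE theta (default: 10000)
    num_attention_heads=32,    # placeholder; GQA is active when this exceeds num_key_value_heads
    num_key_value_heads=8,     # placeholder GQA setting
)

# Extend a base BPE tokenizer with the microbial-specific special tokens.
tokenizer = AutoTokenizer.from_pretrained("path/to/base-bpe-tokenizer")  # placeholder path
tokenizer.add_special_tokens({"additional_special_tokens": ["<abu>", "<name>"]})

# Keep the embedding matrix consistent with the extended vocabulary.
config.vocab_size = len(tokenizer)
```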

### 2. Pretraining Strategy

- **Pretraining Data**: 3,264,597 metagenomic token sequences in plain-text format, grouped by token length (len ≤ 160, 160 < len ≤ 320, 320 < len ≤ 2048), with sequences tagged with `<abu>`/`<name>` based on their structural features
- **Sequence Processing**: token sequences are truncated/padded to the target length (160/320/2048) with the `<pad>` token, without cutting across semantic boundaries
- **Training Objectives** (a data-pipeline sketch follows this list):
  - Masked Language Modeling (MLM, 15% masking probability, optional)
  - BERT-style pretraining with bidirectional (non-causal) attention
  - Multi-stage progressive pretraining (160 → 320 → 2048 tokens) to stabilize long-sequence training
  - MoE router auxiliary loss (scaled by a configurable coefficient) to optimize expert selection
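
For illustration, a minimal sketch of the masking objective and fixed-length padding described above, assembled from standard `transformers` utilities. The tokenizer path and stage length are placeholders; this is not the exact pretraining pipeline.

```python
# Sketch of the 15% MLM masking and fixed-length padding, using standard
# Hugging Face utilities; paths and the stage length are placeholders.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("path/to/microflow-tokenizer")  # placeholder path

# Pad/truncate every sequence to the current stage length (160, 320, or 2048).
stage_length = 160

def encode(batch):
    return tokenizer(
        batch["text"],
        padding="max_length",   # pad with the tokenizer's <pad> token
        truncation=True,
        max_length=stage_length,
    )

# Mask 15% of the input tokens for the BERT-style MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```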

**Important**:

This model requires the custom bidirectional attention mechanism to be set up before loading. Follow the setup steps in order:

1. Define the custom bidirectional attention functions (SDPA/FlashAttention-2).
2. Register the custom attention functions in `ALL_ATTENTION_FUNCTIONS`.
3. Configure the model with `attn_implementation="bidirectional_flash"` (for FlashAttention) or `attn_implementation="bidirectional"` (for SDPA).
4. Load the model weights and tokenizer (extended with the `<abu>`/`<name>` special tokens).
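
For illustration, a minimal loading sketch that follows these four steps. The attention function shown is a simplified stand-in for the repository's actual bidirectional implementation, and the checkpoint path is a placeholder.

```python
# Loading-order sketch; the attention function is a simplified stand-in and
# "path/to/microflow" is a placeholder for the released checkpoint.
import torch
from transformers import AutoTokenizer, MixtralModel
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS


# 1) Define a non-causal (bidirectional) SDPA attention function.
def bidirectional_sdpa(module, query, key, value, attention_mask, **kwargs):
    # Expand key/value heads for GQA, then run SDPA with is_causal=False.
    # NOTE: the repository's full setup also ensures the mask passed here is
    # non-causal; this snippet only illustrates the registration/loading order.
    key = key.repeat_interleave(module.num_key_value_groups, dim=1)
    value = value.repeat_interleave(module.num_key_value_groups, dim=1)
    out = torch.nn.functional.scaled_dot_product_attention(
        query, key, value, attn_mask=attention_mask, is_causal=False
    )
    return out.transpose(1, 2).contiguous(), None


# 2) Register the custom function so it can be referenced by name.
ALL_ATTENTION_FUNCTIONS["bidirectional"] = bidirectional_sdpa

# 3) + 4) Load the weights with the registered implementation, then extend the
# tokenizer with the microbial special tokens.
model = MixtralModel.from_pretrained("path/to/microflow", attn_implementation="bidirectional")
tokenizer = AutoTokenizer.from_pretrained("path/to/microflow")
tokenizer.add_special_tokens({"additional_special_tokens": ["<abu>", "<name>"]})
model.resize_token_embeddings(len(tokenizer))
```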

The extracted embeddings capture deep semantic features of metagenomic sequences and can be used directly for downstream analysis tasks (e.g., taxonomic classification) without additional fine-tuning.
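
A minimal extraction sketch, assuming the `model` and `tokenizer` loaded above; mean pooling over non-padding positions is one reasonable pooling choice, not necessarily the authors' exact recipe.

```python
# Sketch: mean-pool the last hidden state over non-padding positions to get a
# sequence-level embedding. Assumes `model` and `tokenizer` from the sketch above.
import torch

model.eval()
inputs = tokenizer("<name> example metagenomic token sequence", return_tensors="pt")  # placeholder input

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state             # (1, seq_len, hidden_size)

mask = inputs["attention_mask"].unsqueeze(-1)               # (1, seq_len, 1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (1, hidden_size)
```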

## Citation

If you use this pretrained model in your research, please cite:

```bibtex
@software{microflow_metagenomic2025,
  title  = {MicroFlow: A Pretrained Mixture of Experts Model for Bacterial/Metagenomic Sequence Analysis},
  author = {Zhang, Chao},
  year   = {2025},
  url    = {https://github.com/zhangchao162/microflow},
  note   = {Pretrained MoE model with bidirectional SDPA/FlashAttention and custom BPE tokenization for metagenomic sequence analysis}
}
```

## Contact

For questions about model usage, the pretraining pipeline, or fine-tuning guidance for downstream metagenomic tasks, please contact 1623804006@qq.com.