	---
	language:
	- en
	license: apache-2.0
	library_name: transformers
	tags:
	- modernbert
	- code
	- masked-language-model
	- code-search
	- code-understanding
	- multilingual-code
	- fill-mask
	datasets:
	- Shuu12121/crow-corpus-v2-func-with-context
	- Shuu12121/crow-corpus-func-with-parse
	pipeline_tag: fill-mask
	---

	# Crow-mk2🐦‍⬛

	Crow-mk2 is a ModernBERT-based masked language model (MLM) pre-trained on the Crow-Corpus datasets, which pair function code with source context and with parse trees (S-expressions).
	It is an encoder model intended as a base for fine-tuning on downstream tasks such as code search and code understanding.

	## Model Details

	| Property | Value |
	|---|---|
	| Architecture | ModernBERT (`ModernBertForMaskedLM`) |
	| Parameters | 152M |
	| Hidden Size | 768 |
	| Layers | 19 |
	| Attention Heads | 12 |
	| Intermediate Size (FFN) | 1,536 |
	| Max Sequence Length | 8,192 positions (trained with 1,024) |
	| Vocab Size | 50,368 |
	| Precision | FP32 (safetensors) |

	### Architectural Features

	- Local + Global Attention: Local sliding-window attention (window=128) with global attention every 3 layers
	- RoPE: Rotary Position Embeddings (θ=160,000 global / θ=10,000 local)
	- Flash Attention: Compatible with Flash Attention 2 for efficient inference
	- Activation: GELU
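
The local/global alternation can be pictured with a short sketch. It assumes that layer 0 is a global layer and that every third layer after it is global too, which is the indexing convention of the Transformers ModernBERT implementation; the constant names are illustrative and are not config keys read from this checkpoint.

```python
NUM_LAYERS = 19      # "Layers" in the Model Details table
GLOBAL_EVERY_N = 3   # "global attention every 3 layers"

def attention_kind(layer_idx: int) -> str:
    # Assumption: a layer is global when its index is a multiple of GLOBAL_EVERY_N
    return "global" if layer_idx % GLOBAL_EVERY_N == 0 else "local"

layout = [attention_kind(i) for i in range(NUM_LAYERS)]
global_layers = [i for i, kind in enumerate(layout) if kind == "global"]

print(global_layers)          # indices of the full-attention layers
print(layout.count("local"))  # number of sliding-window layers
```

Under that assumption, 7 of the 19 layers attend over the full sequence and the remaining 12 use the 128-token sliding window.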

	## Training

	### Training Data

	Training streams the following two Crow-Corpus datasets and interleaves them at 50% each:

	| Dataset | Description |
	|---|---|
	| [`Shuu12121/crow-corpus-v2-func-with-context`](https://huggingface.co/datasets/Shuu12121/crow-corpus-v2-func-with-context) | Function code + source context + call context + description |
	| [`Shuu12121/crow-corpus-func-with-parse`](https://huggingface.co/datasets/Shuu12121/crow-corpus-func-with-parse) | Function code + parse tree (S-expression) + description |

	- Dataset B (`func-with-context`): concatenates each function's original code with its source-file context and call context.
	- Dataset C (`func-with-parse`): concatenates function code with its tree-sitter S-expression parse tree (S-expressions are compressed into an abbreviated form, cutting tokens by about 63%).
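
As a rough illustration of 50/50 interleaving of two streams, here is a minimal pure-Python sketch. It is not the training pipeline: the `interleave` helper, the placeholder records, and the choice to stop when any source runs out are all assumptions made for the example.

```python
import random
from itertools import islice

def interleave(streams, probabilities, seed=0):
    """Yield items by picking a source stream at random for each draw."""
    rng = random.Random(seed)
    iterators = [iter(s) for s in streams]
    while True:
        idx = rng.choices(range(len(iterators)), weights=probabilities)[0]
        try:
            yield next(iterators[idx])
        except StopIteration:
            return  # assumption: stop as soon as any source is exhausted

# Placeholder records standing in for the two streamed datasets
context_stream = ({"source": "func-with-context", "id": i} for i in range(1000))
parse_stream = ({"source": "func-with-parse", "id": i} for i in range(1000))

mixed = list(islice(interleave([context_stream, parse_stream], [0.5, 0.5]), 200))
share = sum(r["source"] == "func-with-context" for r in mixed) / len(mixed)
print(f"func-with-context share: {share:.2f}")
```

Because each draw is random, the share only approaches 0.5 over many samples; neither source is read ahead of the other in a fixed pattern.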

	### Training Procedure

	Pre-training follows a two-stage pipeline:

	#### Phase 1: Pre-training (MLM)

	Pre-training with standard masked language modeling.

	| Hyperparameter | Value |
	|---|---|
	| Collator | Standard MLM |
	| MLM Probability | 0.3 (30%) |
	| Max Sequence Length | 1,024 |
	| Max Steps | 100,000 |
	| Batch Size (per device) | 8 |
	| Gradient Accumulation | 32 |
	| Effective Batch Size | 256 |
	| Learning Rate | 5.0 × 10⁻⁵ |
	| LR Scheduler | Cosine Annealing |
	| Warmup Steps | 10,000 |
	| Weight Decay | 0.01 |
	| Precision | FP16 (mixed precision) |
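
The effective batch size and the shape of the learning-rate curve follow from the table values. The sketch below reproduces that arithmetic; it assumes linear warmup from zero and cosine decay to zero, which the table does not state, and `lr_at` is a hypothetical helper rather than the scheduler used in training.

```python
import math

PER_DEVICE_BATCH = 8
GRAD_ACCUM_STEPS = 32
NUM_GPUS = 1          # single GPU, per the Hardware section
MAX_SEQ_LEN = 1024

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_GPUS
# Upper bound only: assumes every sequence fills all 1,024 positions
max_tokens_per_step = effective_batch * MAX_SEQ_LEN

def lr_at(step, peak_lr=5.0e-5, warmup_steps=10_000, max_steps=100_000):
    """Learning rate at a step, assuming linear warmup then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(effective_batch, max_tokens_per_step)
for step in (0, 5_000, 10_000, 55_000, 100_000):
    print(step, lr_at(step))
```

Phase 2 can be explored with the same helper by passing `peak_lr=2.0e-5` and `warmup_steps=5_000`.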

	#### Phase 2: Continue-training (Line-level Masking)

	Continued training with a code-structure-aware, line-level masking strategy (`line_no_space`).
	When an entire line is masked, its indentation (spaces and tabs) is excluded from masking,
	so the model trains with the code's structural information preserved.

	| Hyperparameter | Value |
	|---|---|
	| Collator | `line_no_space` |
	| MLM Probability | 0.3 (30%) |
	| Max Sequence Length | 1,024 |
	| Max Steps | 100,000 |
	| Batch Size (per device) | 8 |
	| Gradient Accumulation | 32 |
	| Effective Batch Size | 256 |
	| Learning Rate | 2.0 × 10⁻⁵ |
	| LR Scheduler | Cosine Annealing |
	| Warmup Steps | 5,000 |
	| Weight Decay | 0.01 |
	| Precision | FP16 (mixed precision) |
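
To make the idea of line-level masking with preserved indentation concrete, here is a string-level sketch. The real `line_no_space` collator is not shown in this README and presumably works on token IDs; `mask_lines`, the per-line selection rule, and the one-placeholder-per-chunk replacement are assumptions for illustration only.

```python
import random

MASK = "[MASK]"

def mask_lines(code: str, prob: float = 0.3, seed: int = 0) -> str:
    """Mask whole lines with probability `prob`, leaving leading indentation intact."""
    rng = random.Random(seed)
    out = []
    for line in code.split("\n"):
        stripped = line.lstrip(" \t")
        indent = line[: len(line) - len(stripped)]
        if stripped and rng.random() < prob:
            # Replace each whitespace-separated chunk with one mask placeholder
            masked = " ".join(MASK for _ in stripped.split())
            out.append(indent + masked)
        else:
            out.append(line)  # blank or unselected lines pass through unchanged
    return "\n".join(out)

source = "def area(w, h):\n    result = w * h\n    return result"
print(mask_lines(source, prob=1.0))
```

With `prob=1.0` every non-empty line is masked, yet the four-space indentation of the function body is still visible, so the block structure remains readable to the model.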

	### Hardware

	| Component | Spec |
	|---|---|
	| GPU | NVIDIA GeForce RTX 5090 (32 GB) × 1 |
	| CPU | 24 cores |
	| RAM | 128 GB |
	| CUDA | 12.9 |
	| PyTorch | 2.8.0 |
	| Transformers | 4.56.2 |
	| Flash Attention | 2.8.3 |

	## Tokenizer

	Crow-mk2 uses the tokenizer from `answerdotai/ModernBERT-base`.

	| Property | Value |
	|---|---|
	| Type | BPE (Byte-Level) |
	| Vocab Size | 50,368 |


	## Usage

	### Masked Language Modeling (Fill-Mask)

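	```python
	import torch
	from transformers import AutoModelForMaskedLM, AutoTokenizer

	model = AutoModelForMaskedLM.from_pretrained("Shuu12121/Crow-mk2")
	tokenizer = AutoTokenizer.from_pretrained("Shuu12121/Crow-mk2")

	text = "def [MASK](self, x, y):"
	inputs = tokenizer(text, return_tensors="pt")
	with torch.no_grad():
	    outputs = model(**inputs)

	# Find the [MASK] position and decode the highest-scoring token there
	mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
	predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
	print(tokenizer.decode(predicted_id))
	```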

	### Feature Extraction (Embeddings)

	```python
	import torch
	from transformers import AutoModelForMaskedLM, AutoTokenizer

	model = AutoModelForMaskedLM.from_pretrained("Shuu12121/Crow-mk2")
	tokenizer = AutoTokenizer.from_pretrained("Shuu12121/Crow-mk2")

	code = "def fibonacci(n):\n    if n <= 1:\n        return n\n    return fibonacci(n-1) + fibonacci(n-2)"
	inputs = tokenizer(code, return_tensors="pt", padding=True, truncation=True, max_length=1024)

	# Disable gradient tracking for inference
	with torch.no_grad():
	    outputs = model(**inputs, output_hidden_states=True)

	# Use the [CLS] token embedding from the last hidden state
	cls_embedding = outputs.hidden_states[-1][:, 0, :]
	```

	### Fine-tuning for Code Search

	For fine-tuning on code search tasks, [sentence-transformers](https://www.sbert.net/) is recommended:

	```python
	from sentence_transformers import SentenceTransformer

	# For a fine-tuned model; the base checkpoint's embeddings are not tuned for retrieval
	model = SentenceTransformer("Shuu12121/Crow-mk2")
	embeddings = model.encode(["def hello(): pass", "function to greet"])
	```
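
Once embeddings are available, code search reduces to ranking candidates by similarity to the query. The sketch below shows cosine-similarity ranking in plain Python; the three-dimensional vectors are made up for illustration and are not outputs of Crow-mk2, and `cosine` and `rank` are hypothetical helpers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rank(query_vec, corpus):
    """Return (snippet, score) pairs sorted from most to least similar."""
    scored = [(snippet, cosine(query_vec, vec)) for snippet, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors, invented for this example only
corpus = {
    "def hello(): pass": [0.9, 0.1, 0.0],
    "def add(a, b): return a + b": [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]  # stands in for the embedding of "function to greet"

results = rank(query, corpus)
print(results[0][0])
```

In practice the vectors would come from `model.encode(...)` on a fine-tuned checkpoint, and a vectorized similarity routine would replace the loops.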

	## Intended Uses & Limitations

	### Intended Uses

	- Code search and retrieval
	- Code understanding and classification
	- Code similarity computation
	- Base model for transfer learning to downstream tasks

	### Limitations

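	- As a pre-trained MLM encoder, it cannot be used for generation tasks as-is.
	- The training data is limited to 8 programming languages; performance on other languages is untested.
	- The maximum sequence length during training was 1,024 tokens (RoPE allows extrapolation up to 8,192 positions, but quality may degrade).
	- Performance on natural-language-only text (prose without code) is not optimized.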

	## Citation

	```bibtex
	@misc{crow-mk2,
	  author    = {Shuu12121},
	  title     = {Crow-mk2: A ModernBERT-based Code Pre-trained Model},
	  year      = {2026},
	  publisher = {Hugging Face},
	  url       = {https://huggingface.co/Shuu12121/Crow-mk2}
	}
	```

	## Acknowledgements

	- [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base)
	- [Hugging Face Transformers](https://github.com/huggingface/transformers)