---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- modernbert
- code
- masked-language-model
- code-search
- code-understanding
- multilingual-code
- fill-mask
datasets:
- Shuu12121/crow-corpus-v2-func-with-context
- Shuu12121/crow-corpus-func-with-parse
pipeline_tag: fill-mask
---

# Crow-mk2🐦‍⬛

**Crow-mk2** is a **ModernBERT**-based masked language model (MLM) pre-trained on the Crow-Corpus datasets, which pair function code with source context and parse trees (S-expressions).
It is an encoder model intended to be fine-tuned for downstream tasks such as code search and code understanding.

## Model Details

| Property | Value |
|---|---|
| **Architecture** | ModernBERT (`ModernBertForMaskedLM`) |
| **Parameters** | 152M |
| **Hidden Size** | 768 |
| **Layers** | 19 |
| **Attention Heads** | 12 |
| **Intermediate Size (FFN)** | 1,536 |
| **Max Sequence Length** | 8,192 positions (trained with 1,024) |
| **Vocab Size** | 50,368 |
| **Precision** | FP32 (safetensors) |

### Architectural Features

- **Local + Global Attention**: Local sliding-window attention (window=128) with global attention every 3 layers
- **RoPE**: Rotary Position Embeddings (θ=160,000 global / θ=10,000 local)
- **Flash Attention**: Compatible with Flash Attention 2 for efficient inference
- **Activation**: GELU
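
The alternation of local and global attention can be made concrete with a short sketch. It assumes this checkpoint follows the convention of the Transformers ModernBERT implementation, where a layer uses global attention when its index is divisible by the "every N layers" setting; the constant and function names below are illustrative, not taken from the model's code.

```python
# Sketch only: assumes layer i is global when i % 3 == 0 (ModernBERT convention).
NUM_LAYERS = 19    # from the Model Details table
GLOBAL_EVERY = 3   # "global attention every 3 layers"


def attention_kind(layer_idx, global_every=GLOBAL_EVERY):
    """Return 'global' or 'local' for a zero-based layer index."""
    return "global" if layer_idx % global_every == 0 else "local"


layout = [attention_kind(i) for i in range(NUM_LAYERS)]
global_layers = [i for i, kind in enumerate(layout) if kind == "global"]

print(global_layers)          # [0, 3, 6, 9, 12, 15, 18]
print(layout.count("local"))  # 12
```

Under this assumption, 7 of the 19 layers attend over the full sequence and the remaining 12 use the sliding window.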

## Training

### Training Data

Training streams the following two Crow-Corpus datasets and interleaves them at 50% each:

| Dataset | Description |
|---|---|
| [`Shuu12121/crow-corpus-v2-func-with-context`](https://huggingface.co/datasets/Shuu12121/crow-corpus-v2-func-with-context) | Function code + source context + call context + description |
| [`Shuu12121/crow-corpus-func-with-parse`](https://huggingface.co/datasets/Shuu12121/crow-corpus-func-with-parse) | Function code + parse tree (S-expression) + description |

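- **Dataset B** (`func-with-context`): concatenates each function's original code with context from its source file and with its call context
- **Dataset C** (`func-with-parse`): concatenates function code with its tree-sitter parse tree as an S-expression (the S-expressions are compressed into an abbreviated form, cutting tokens by about 63%)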
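
The 50/50 streaming interleave can be illustrated with a small pure-Python sketch. The Hugging Face `datasets` library provides `interleave_datasets` with a `probabilities` argument for this kind of mixing, but the card does not say which mechanism was used, so the generator below is only an assumption-laden illustration. The `interleave` helper, the toy streams, and the stop-at-first-exhausted behaviour are all hypothetical.

```python
import random
from itertools import islice


def interleave(streams, probabilities, seed=0):
    """Yield items by picking a source stream at random for each sample.

    Illustrative helper: stops as soon as the chosen stream is exhausted.
    """
    rng = random.Random(seed)
    iterators = [iter(s) for s in streams]
    while True:
        idx = rng.choices(range(len(iterators)), weights=probabilities, k=1)[0]
        try:
            yield next(iterators[idx])
        except StopIteration:
            return


# Toy stand-ins for the two corpora (not real dataset rows).
context_stream = (f"context-{i}" for i in range(1000))
parse_stream = (f"parse-{i}" for i in range(1000))

sample = list(islice(interleave([context_stream, parse_stream], [0.5, 0.5]), 200))
n_context = sum(item.startswith("context-") for item in sample)
print(n_context, len(sample) - n_context)
```

Each draw is independent, so the mix approaches 50/50 over many samples while the order within each source stream is preserved.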

### Training Procedure

Pre-training follows a two-phase pipeline:

#### Phase 1: Pre-training (MLM)

Pre-training with standard masked language modeling.

| Hyperparameter | Value |
|---|---|
| Collator | Standard MLM |
| MLM Probability | 0.3 (30%) |
| Max Sequence Length | 1,024 |
| Max Steps | 100,000 |
| Batch Size (per device) | 8 |
| Gradient Accumulation | 32 |
| **Effective Batch Size** | **256** |
| Learning Rate | 5.0 × 10⁻⁵ |
| LR Scheduler | Cosine Annealing |
| Warmup Steps | 10,000 |
| Weight Decay | 0.01 |
| Precision | FP16 (mixed precision) |
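
The batch-size arithmetic and the learning-rate schedule in the table can be sketched in a few lines. This is a minimal sketch, assuming linear warmup from zero and a cosine decay that reaches zero at the final step (the behaviour of the usual cosine-with-warmup schedule); the card does not state the exact scheduler implementation, and `lr_at` is a hypothetical helper name.

```python
import math

# Effective batch size = per-device batch x gradient accumulation x number of GPUs.
effective_batch = 8 * 32 * 1                   # 256, matching the table
max_tokens_per_step = effective_batch * 1024   # upper bound at max sequence length


def lr_at(step, peak_lr=5.0e-5, warmup_steps=10_000, max_steps=100_000):
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))


for step in (0, 5_000, 10_000, 55_000, 100_000):
    print(step, f"{lr_at(step):.2e}")
```

The same helper covers Phase 2 by passing its values, for example `lr_at(step, peak_lr=2.0e-5, warmup_steps=5_000)`.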

#### Phase 2: Continue-training (Line-level Masking)

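Continued training with a code-structure-aware, line-level masking strategy (`line_no_space`).
When a whole line is masked, its indentation (spaces and tabs) is excluded from masking,
so the model trains with the code's structural information preserved.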

| Hyperparameter | Value |
|---|---|
| Collator | `line_no_space` |
| MLM Probability | 0.3 (30%) |
| Max Sequence Length | 1,024 |
| Max Steps | 100,000 |
| Batch Size (per device) | 8 |
| Gradient Accumulation | 32 |
| **Effective Batch Size** | **256** |
| Learning Rate | 2.0 × 10⁻⁵ |
| LR Scheduler | Cosine Annealing |
| Warmup Steps | 5,000 |
| Weight Decay | 0.01 |
| Precision | FP16 (mixed precision) |
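
The line-level masking idea can be illustrated on plain text. This is a rough sketch rather than the actual `line_no_space` collator, which is not shown in this card and presumably operates on token IDs. It assumes each non-empty line is selected independently with the masking probability, and it uses whitespace-delimited chunks as a stand-in for tokenizer tokens. `mask_lines_keep_indent` is a hypothetical name.

```python
import random
import re

MASK = "[MASK]"


def mask_lines_keep_indent(code, line_prob=0.3, seed=None):
    """Mask whole lines of code while leaving their whitespace untouched."""
    rng = random.Random(seed)
    masked = []
    for line in code.split("\n"):
        if line.strip() and rng.random() < line_prob:
            # Split off the leading indentation so it is never masked.
            indent_len = len(line) - len(line.lstrip(" \t"))
            indent, body = line[:indent_len], line[indent_len:]
            # Replace every non-whitespace chunk in the rest of the line.
            masked.append(indent + re.sub(r"\S+", MASK, body))
        else:
            masked.append(line)
    return "\n".join(masked)


snippet = "def f(x):\n    return x + 1"
print(mask_lines_keep_indent(snippet, line_prob=1.0))
```

With `line_prob=1.0` every line is masked, which makes it easy to see that the four-space indentation of the second line survives.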

### Hardware

| Component | Spec |
|---|---|
| GPU | NVIDIA GeForce RTX 5090 (32 GB) × 1 |
| CPU | 24 cores |
| RAM | 128 GB |
| CUDA | 12.9 |
| PyTorch | 2.8.0 |
| Transformers | 4.56.2 |
| Flash Attention | 2.8.3 |

## Tokenizer

The model uses the tokenizer from `answerdotai/ModernBERT-base`.

| Property | Value |
|---|---|
| Type | BPE (Byte-Level) |
| Vocab Size | 50,368 |


## Usage

### Masked Language Modeling (Fill-Mask)

```python
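import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("Shuu12121/Crow-mk2")
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/Crow-mk2")

# Use the tokenizer's own mask token rather than hard-coding the string.
text = f"def {tokenizer.mask_token}(self, x, y):"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Decode the five most likely tokens at the masked position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = outputs.logits[0, mask_pos[0]].topk(5).indices
print([tokenizer.decode([i]) for i in top_ids.tolist()])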
```

### Feature Extraction (Embeddings)

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("Shuu12121/Crow-mk2")
tokenizer = AutoTokenizer.from_pretrained("Shuu12121/Crow-mk2")

code = "def fibonacci(n):\n    if n <= 1:\n        return n\n    return fibonacci(n-1) + fibonacci(n-2)"
inputs = tokenizer(code, return_tensors="pt", padding=True, truncation=True, max_length=1024)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Use [CLS] token embedding from the last hidden state
cls_embedding = outputs.hidden_states[-1][:, 0, :]
```

### Fine-tuning for Code Search

For fine-tuning on code search tasks, [sentence-transformers](https://www.sbert.net/) is recommended:

```python
from sentence_transformers import SentenceTransformer

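# For an already fine-tuned model: substitute its path or Hub ID below.
# Loading the base checkpoint directly makes sentence-transformers add default mean pooling.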
model = SentenceTransformer("Shuu12121/Crow-mk2")
embeddings = model.encode(["def hello(): pass", "function to greet"])
```
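
Once embeddings are available, code search reduces to ranking candidates by similarity to the query. The sketch below uses cosine similarity on hand-written toy vectors, not real outputs of this model, and the helper names `cosine_similarity` and `rank_by_similarity` are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy 3-dimensional vectors standing in for real embeddings.
query = [0.9, 0.1, 0.0]  # e.g. the embedding of "function to greet"
snippets = {
    "def hello(): pass": [0.8, 0.2, 0.1],
    "def add(a, b): return a + b": [0.0, 0.3, 0.9],
}

ranking = rank_by_similarity(query, snippets)
for name, score in ranking:
    print(f"{score:.3f}  {name}")
```

In practice the vectors would come from `model.encode(...)`, and a vectorized library would replace the pure-Python loops for larger candidate sets.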

## Intended Uses & Limitations

### Intended Uses

- Code search and code retrieval
- Code understanding and classification
- Code similarity scoring
- Base model for transfer learning to downstream tasks

### Limitations

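- This is a pre-trained MLM, so it cannot be used for generation tasks as-is
- The training data is limited to 8 programming languages; performance on other languages has not been evaluated
- The maximum sequence length during training was 1,024 tokens (RoPE allows extrapolation up to 8,192 positions, but quality may degrade)
- The model is not optimized for natural-language-only text (prose that contains no code)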
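
One way to stay within the trained length is to split long inputs into overlapping windows and encode each window separately. The card does not describe this approach, so treat the following as a sketch: the helper `chunk_token_ids` and the `stride` value are hypothetical choices.

```python
def chunk_token_ids(token_ids, max_len=1024, stride=256):
    """Split a list of token IDs into windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens. In real use, max_len
    should leave room for special tokens added by the tokenizer.
    """
    if max_len <= 0 or not 0 <= stride < max_len:
        raise ValueError("need max_len > 0 and 0 <= stride < max_len")
    step = max_len - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end of the input
    return chunks


ids = list(range(2500))  # stand-in for a long tokenized file
windows = chunk_token_ids(ids)
print([len(w) for w in windows])  # [1024, 1024, 964]
```

How the per-window embeddings are then combined (for example by averaging) is a separate design decision that would need to be validated on the target task.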

## Citation

```bibtex
@misc{crow-mk2,
  author       = {Shuu12121},
  title        = {Crow-mk2: A ModernBERT-based Code Pre-trained Model},
  year         = {2026},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/Shuu12121/Crow-mk2}
}
```

## Acknowledgements

- [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base)
- [Hugging Face Transformers](https://github.com/huggingface/transformers)