| language: | |
| - en | |
| license: apache-2.0 | |
| library_name: transformers | |
| tags: | |
| - modernbert | |
| - code | |
| - masked-language-model | |
| - code-search | |
| - code-understanding | |
| - multilingual-code | |
| - fill-mask | |
| datasets: | |
| - Shuu12121/crow-corpus-v2-func-with-context | |
| - Shuu12121/crow-corpus-func-with-parse | |
| pipeline_tag: fill-mask | |
| # Crow-mk2🐦⬛ | |
| **Crow-mk2** is a **ModernBERT**-based Masked Language Model pre-trained on the Crow-Corpus datasets, which combine function code with source context and parse trees (S-expressions). | |
| It is designed as an encoder model for fine-tuning on downstream tasks such as code search and code understanding. | |
| ## Model Details | |
| | Property | Value | | |
| |---|---| | |
| | **Architecture** | ModernBERT (`ModernBertForMaskedLM`) | | |
| | **Parameters** | 152M | | |
| | **Hidden Size** | 768 | | |
| | **Layers** | 19 | | |
| | **Attention Heads** | 12 | | |
| | **Intermediate Size (FFN)** | 1,536 | | |
| | **Max Sequence Length** | 8,192 positions (trained with 1,024) | | |
| | **Vocab Size** | 50,368 | | |
| | **Precision** | FP32 (safetensors) | | |
| ### Architectural Features | |
| - **Local + Global Attention**: Local sliding-window attention (window=128) with global attention every 3 layers | |
| - **RoPE**: Rotary Position Embeddings (θ=160,000 global / θ=10,000 local) | |
| - **Flash Attention**: Compatible with Flash Attention 2 for efficient inference | |
| - **Activation**: GELU | |
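As a rough illustration of the attention layout listed above, the sketch below enumerates which of the 19 layers would use global attention. It assumes the upstream ModernBERT convention that layer `i` is global when `i % 3 == 0`, and that the local window is centred (64 tokens on each side for window=128). Both points are assumptions taken from the ModernBERT reference implementation, not something this card states, and the helper names are purely illustrative.

```python
NUM_LAYERS = 19      # from the Model Details table
GLOBAL_EVERY = 3     # "global attention every 3 layers"
LOCAL_WINDOW = 128   # sliding-window size for local layers


def attention_kind(layer_idx: int) -> str:
    """Return 'global' or 'local' for a layer (assumed ModernBERT convention)."""
    return "global" if layer_idx % GLOBAL_EVERY == 0 else "local"


def can_attend(query_pos: int, key_pos: int, layer_idx: int) -> bool:
    """Whether a query token may attend to a key token in the given layer."""
    if attention_kind(layer_idx) == "global":
        return True
    # Assumption: the window is centred, so half of it lies on each side.
    return abs(query_pos - key_pos) <= LOCAL_WINDOW // 2


global_layers = [i for i in range(NUM_LAYERS) if attention_kind(i) == "global"]
print(global_layers)
```

Under these assumptions most layers are local, which is what keeps attention cost manageable at long sequence lengths.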
| ## Training | |
| ### Training Data | |
| The two Crow-Corpus datasets below are streamed and interleaved (50% each) during training: | |
| | Dataset | Description | | |
| |---|---| | |
| | [`Shuu12121/crow-corpus-v2-func-with-context`](https://huggingface.co/datasets/Shuu12121/crow-corpus-v2-func-with-context) | Function code + source context + call context + description | | |
| | [`Shuu12121/crow-corpus-func-with-parse`](https://huggingface.co/datasets/Shuu12121/crow-corpus-func-with-parse) | Function code + parse tree (S-expression) + description | | |
| - **Dataset B** (`func-with-context`): the original function code concatenated with its source-file context and call context | |
| - **Dataset C** (`func-with-parse`): the function code concatenated with its tree-sitter S-expression parse tree (the S-expressions are compressed into an abbreviated form, cutting the token count by roughly 63%) | |
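The 50/50 streaming mix described above can be pictured with a small, library-free sketch. The card does not show the actual loading code; with the Hugging Face `datasets` library this kind of mixing is typically done with `interleave_datasets` and `probabilities=[0.5, 0.5]`. The `interleave` helper, the toy records, and the stop-on-first-exhausted behaviour below are illustrative assumptions.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def interleave(a: Iterable[T], b: Iterable[T], p_a: float = 0.5, seed: int = 0) -> Iterator[T]:
    """Yield items from two streams, picking stream `a` with probability p_a.

    Stops as soon as the chosen stream is exhausted.
    """
    rng = random.Random(seed)
    it_a, it_b = iter(a), iter(b)
    while True:
        source = it_a if rng.random() < p_a else it_b
        try:
            yield next(source)
        except StopIteration:
            return


# Toy stand-ins for the two corpora (hypothetical records).
context_stream = ({"source": "func-with-context", "id": i} for i in range(1000))
parse_stream = ({"source": "func-with-parse", "id": i} for i in range(1000))

mixed = list(interleave(context_stream, parse_stream))
share = sum(r["source"] == "func-with-context" for r in mixed) / len(mixed)
print(f"{len(mixed)} records, {share:.0%} from func-with-context")
```

Because both inputs are consumed lazily, neither corpus has to fit in memory, which is the point of streaming.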
| ### Training Procedure | |
| The model is pre-trained with a two-stage training pipeline: | |
| #### Phase 1: Pre-training (MLM) | |
| Pre-training with standard Masked Language Modeling. | |
| | Hyperparameter | Value | | |
| |---|---| | |
| | Collator | Standard MLM | | |
| | MLM Probability | 0.3 (30%) | | |
| | Max Sequence Length | 1,024 | | |
| | Max Steps | 100,000 | | |
| | Batch Size (per device) | 8 | | |
| | Gradient Accumulation | 32 | | |
| | **Effective Batch Size** | **256** | | |
| | Learning Rate | 5.0 × 10⁻⁵ | | |
| | LR Scheduler | Cosine Annealing | | |
| | Warmup Steps | 10,000 | | |
| | Weight Decay | 0.01 | | |
| | Precision | FP16 (mixed precision) | | |
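To make the "Standard MLM" collator with a 30% masking probability concrete, here is a minimal sketch of token-level masking. It assumes the common 80/10/10 replacement rule (mask / random token / unchanged) used by default in `DataCollatorForLanguageModeling`; the card does not confirm these ratios. `MASK_ID` is a placeholder, and special tokens are not excluded from masking here, which a real collator would do.

```python
import random

MASK_ID = 4          # placeholder; read the real id from tokenizer.mask_token_id
VOCAB_SIZE = 50_368  # from the Model Details table
IGNORE = -100        # label value ignored by PyTorch's cross-entropy loss


def mask_tokens(input_ids, mlm_probability=0.3, seed=0):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    rng = random.Random(seed)
    masked, labels = list(input_ids), [IGNORE] * len(input_ids)
    for i, tok in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue                               # position not selected
        labels[i] = tok                            # model must predict the original
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID                    # assumed 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(VOCAB_SIZE)  # assumed 10%: random token
        # remaining 10%: keep the original token
    return masked, labels


ids = list(range(1000, 2000))
masked, labels = mask_tokens(ids)
selected = sum(label != IGNORE for label in labels)
print(f"{selected / len(ids):.1%} of positions selected for prediction")
```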
| #### Phase 2: Continue-training (Line-level Masking) | |
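| Continued training with a code-structure-aware, line-level masking strategy (`line_no_space`). | |
| When an entire line is masked, its indentation (spaces and tabs) is excluded from masking, | |
| so the model keeps seeing the structural information of the code while it learns. | |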
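The idea behind `line_no_space` can be sketched at the text level as follows. This is a hypothetical illustration, not the author's collator: it assumes each non-empty line is selected independently with the MLM probability, and it emits one `[MASK]` per whitespace-separated chunk, whereas a real collator would operate on tokenizer ids.

```python
import random

MASK = "[MASK]"


def mask_lines(code: str, line_probability: float = 0.3, seed: int = 0) -> str:
    """Mask whole lines of code while leaving their indentation untouched."""
    rng = random.Random(seed)
    out = []
    for line in code.split("\n"):
        body = line.lstrip(" \t")
        indent = line[: len(line) - len(body)]
        if body and rng.random() < line_probability:
            # Assumption: one [MASK] per whitespace-separated chunk; the real
            # collator works on tokenizer ids rather than on text.
            body = " ".join(MASK for _ in body.split())
        out.append(indent + body)
    return "\n".join(out)


example = "def add(a, b):\n    total = a + b\n    return total"
print(mask_lines(example, line_probability=1.0))
```

With every line selected, the printed result still shows the original indentation, so the block structure of the function remains visible even though its content is hidden.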
| | Hyperparameter | Value | | |
| |---|---| | |
| | Collator | `line_no_space` | | |
| | MLM Probability | 0.3 (30%) | | |
| | Max Sequence Length | 1,024 | | |
| | Max Steps | 100,000 | | |
| | Batch Size (per device) | 8 | | |
| | Gradient Accumulation | 32 | | |
| | **Effective Batch Size** | **256** | | |
| | Learning Rate | 2.0 × 10⁻⁵ | | |
| | LR Scheduler | Cosine Annealing | | |
| | Warmup Steps | 5,000 | | |
| | Weight Decay | 0.01 | | |
| | Precision | FP16 (mixed precision) | | |
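The learning-rate settings of both phases can be compared with a short sketch. The card only says "Cosine Annealing" with a number of warmup steps; the linear warmup shape and the decay to zero at the final step are assumptions that match the default cosine schedule in Transformers (`get_cosine_schedule_with_warmup`). The function name `lr_at` is illustrative.

```python
import math


def lr_at(step: int, peak_lr: float, warmup_steps: int, max_steps: int) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero at max_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


PHASES = {
    "phase1": {"peak_lr": 5.0e-5, "warmup_steps": 10_000, "max_steps": 100_000},
    "phase2": {"peak_lr": 2.0e-5, "warmup_steps": 5_000, "max_steps": 100_000},
}

effective_batch = 8 * 32  # per-device batch size x gradient accumulation, 1 GPU
print("effective batch size:", effective_batch)
for name, cfg in PHASES.items():
    samples = [lr_at(s, **cfg) for s in (0, cfg["warmup_steps"], 55_000, 100_000)]
    print(name, [f"{lr:.2e}" for lr in samples])
```

Phase 2 uses a lower peak and a shorter warmup, which is a common choice when continuing from already-trained weights.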
| ### Hardware | |
| | Component | Spec | | |
| |---|---| | |
| | GPU | NVIDIA GeForce RTX 5090 (32 GB) × 1 | | |
| | CPU | 24 cores | | |
| | RAM | 128 GB | | |
| | CUDA | 12.9 | | |
| | PyTorch | 2.8.0 | | |
| | Transformers | 4.56.2 | | |
| | Flash Attention | 2.8.3 | | |
| ## Tokenizer | |
| The model uses the tokenizer from `answerdotai/ModernBERT-base`. | |
| | Property | Value | | |
| |---|---| | |
| | Type | BPE (Byte-Level) | | |
| | Vocab Size | 50,368 | | |
| ## Usage | |
| ### Masked Language Modeling (Fill-Mask) | |
| ```python | |
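| import torch | |
| from transformers import AutoModelForMaskedLM, AutoTokenizer | |
| model = AutoModelForMaskedLM.from_pretrained("Shuu12121/Crow-mk2") | |
| tokenizer = AutoTokenizer.from_pretrained("Shuu12121/Crow-mk2") | |
| text = "def [MASK](self, x, y):" | |
| inputs = tokenizer(text, return_tensors="pt") | |
| with torch.no_grad(): | |
|     outputs = model(**inputs) | |
| # Decode the five most likely tokens for the masked position | |
| mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0] | |
| top_ids = outputs.logits[0, mask_pos[0]].topk(5).indices.tolist() | |
| print(tokenizer.convert_ids_to_tokens(top_ids)) | |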
| ``` | |
| ### Feature Extraction (Embeddings) | |
| ```python | |
| import torch | |
| from transformers import AutoModelForMaskedLM, AutoTokenizer | |
| model = AutoModelForMaskedLM.from_pretrained("Shuu12121/Crow-mk2") | |
| tokenizer = AutoTokenizer.from_pretrained("Shuu12121/Crow-mk2") | |
| code = "def fibonacci(n):\n    if n <= 1:\n        return n\n    return fibonacci(n-1) + fibonacci(n-2)" | |
| inputs = tokenizer(code, return_tensors="pt", padding=True, truncation=True, max_length=1024) | |
| with torch.no_grad(): | |
|     outputs = model(**inputs, output_hidden_states=True) | |
| # Use the [CLS] token embedding from the last hidden state | |
| cls_embedding = outputs.hidden_states[-1][:, 0, :] | |
| ``` | |
| ### Fine-tuning for Code Search | |
| For fine-tuning on code search tasks, [sentence-transformers](https://www.sbert.net/) is recommended: | |
| ```python | |
| from sentence_transformers import SentenceTransformer | |
| # For a fine-tuned model (replace the ID with your fine-tuned checkpoint) | |
| model = SentenceTransformer("Shuu12121/Crow-mk2") | |
| embeddings = model.encode(["def hello(): pass", "function to greet"]) | |
| ``` | |
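Once embeddings are available (for example from `model.encode` above), code search reduces to ranking candidates by similarity to the query. The sketch below shows that ranking step with cosine similarity in plain Python; the vectors are made-up three-dimensional stand-ins, and the `cosine` and `rank` helpers are illustrative names rather than part of any library.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query_vec, code_vecs):
    """Return indices of code_vecs sorted from most to least similar."""
    scores = [cosine(query_vec, vec) for vec in code_vecs]
    return sorted(range(len(code_vecs)), key=lambda i: scores[i], reverse=True)


# Toy 3-dimensional vectors standing in for real embeddings (made-up values).
query = [0.9, 0.1, 0.0]
corpus = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
print(rank(query, corpus))
```

For real workloads the same ranking would normally be done with batched tensor operations or a vector index instead of Python loops.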
| ## Intended Uses & Limitations | |
| ### Intended Uses | |
| - Code search / code retrieval | |
| - Code understanding and classification | |
| - Code similarity computation | |
| - Base model for transfer learning on downstream tasks | |
| ### Limitations | |
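| - As a pre-trained MLM, the model cannot be used for generation tasks as-is | |
| - The training data covers only 8 programming languages; performance on other languages has not been verified | |
| - The maximum sequence length during training was 1,024 tokens (RoPE allows extrapolation up to 8,192, but performance may degrade) | |
| - The model is not optimized for natural-language-only text (prose that contains no code) | |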
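One way to stay within the 1,024-token training length for long files is to split the token sequence into overlapping windows and encode each window separately. The sketch below is a library-free illustration; the `split_into_windows` name, the 128-token overlap, and the choice not to reserve room for special tokens are all assumptions. Fast tokenizers in Transformers offer a comparable built-in route via `return_overflowing_tokens=True` together with `stride`.

```python
from typing import List


def split_into_windows(token_ids: List[int], max_len: int = 1024, stride: int = 128) -> List[List[int]]:
    """Split a token sequence into windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens so that context around a
    window boundary is not lost entirely.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end
    return windows


ids = list(range(3000))  # stand-in for tokenizer output
chunks = split_into_windows(ids)
print([len(c) for c in chunks])
```

How the per-window embeddings are combined afterwards (averaging, taking the best-scoring window, and so on) is a downstream design choice that this card does not prescribe.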
| ## Citation | |
| ```bibtex | |
| @misc{crow-mk2, | |
| author = {Shuu12121}, | |
| title = {Crow-mk2: A ModernBERT-based Code Pre-trained Model}, | |
| year = {2026}, | |
| publisher = {Hugging Face}, | |
| url = {https://huggingface.co/Shuu12121/Crow-mk2} | |
| } | |
| ``` | |
| ## Acknowledgements | |
| - [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) | |
| - [Hugging Face Transformers](https://github.com/huggingface/transformers) |