---
license: apache-2.0
datasets:
- Shuu12121/javascript-treesitter-filtered-datasetsV2
- Shuu12121/ruby-treesitter-filtered-datasetsV2
- Shuu12121/go-treesitter-dedupe_doc-filtered-dataset
- Shuu12121/java-treesitter-dedupe_doc-filtered-dataset
- Shuu12121/rust-treesitter-filtered-datasetsV2
- Shuu12121/php-treesitter-filtered-datasetsV2
- Shuu12121/python-treesitter-filtered-datasetsV2
- Shuu12121/typescript-treesitter-filtered-datasetsV2
language:
- en
pipeline_tag: fill-mask
tags:
- code
- python
- java
- javascript
- typescript
- go
- ruby
- rust
- php
---

# CodeModernBERT-Owl-v1 🦉

## Model Details

* **Model type**: Bi-encoder architecture based on ModernBERT
* **Architecture**:
  * Hidden size: 768
  * Layers: 22
  * Attention heads: 12
  * Intermediate size: 1,152
  * Max position embeddings: 8,192
  * Local attention window size: 128
  * Global RoPE positional encoding: θ = 160,000
  * Local RoPE positional encoding: θ = 10,000
* **Sequence length**: up to 2,048 tokens for code and docstring inputs during pretraining
* **Implementation**: Backend in Python; integrated into **OwlSpotLight**, a Visual Studio Code extension.
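
For readers who want to reproduce the architecture, the bullet points above map roughly onto a `transformers` `ModernBertConfig` as sketched below. This is a hedged sketch, not the authors' build script: parameter names follow the upstream ModernBERT implementation, and anything the card does not list (for example, vocabulary size) is left at its default.

```python
from transformers import ModernBertConfig, ModernBertForMaskedLM

# Values taken from the card; unspecified fields keep ModernBERT defaults.
config = ModernBertConfig(
    hidden_size=768,
    num_hidden_layers=22,
    num_attention_heads=12,
    intermediate_size=1152,
    max_position_embeddings=8192,
    local_attention=128,          # local attention window size
    global_rope_theta=160_000.0,  # RoPE theta for global-attention layers
    local_rope_theta=10_000.0,    # RoPE theta for local-attention layers
)

model = ModernBertForMaskedLM(config)
print(model.num_parameters())
```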
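
Since the card declares a `fill-mask` pipeline tag, the model should be loadable with the standard `transformers` fill-mask pipeline. The snippet below is a minimal sketch; the repository id `Shuu12121/CodeModernBERT-Owl-v1` is inferred from the card title and dataset namespace and is an assumption, not something this README states.

```python
from transformers import pipeline

# Hypothetical repository id inferred from the card title; adjust if the
# model is published under a different name.
MODEL_ID = "Shuu12121/CodeModernBERT-Owl-v1"

fill_mask = pipeline("fill-mask", model=MODEL_ID)

# Predict a masked token inside a small Python function, using whatever
# mask token the model's tokenizer defines.
code = f"def add(a, b):\n    return a {fill_mask.tokenizer.mask_token} b"
for pred in fill_mask(code, top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
```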

## Pretraining

* **Tokenizer**: Custom BPE tokenizer trained on code and docstring pairs.
* **Data**: Functions and natural-language descriptions extracted from GitHub repositories.
* **Masking strategy**: Two-phase pretraining.
  * **Phase 1: Random Masked Language Modeling (MLM)**
    30% of the tokens in each code function are randomly masked and predicted with standard MLM.
  * **Phase 2: Line-level Span Masking**
    Inspired by SpanBERT, pretraining continues on the same data with span masking at line granularity (a minimal sketch follows this list):
    1. Convert input tokens back to strings.
    2. Detect newline tokens with a regex and segment the input by line.
    3. Exclude whitespace-only tokens from masking.
    4. Apply padding to align sequence lengths.
    5. Randomly mask 30% of the tokens in each line segment and predict them.
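
The sketch below is an illustrative reconstruction of the two masking phases, not the authors' actual code: function names such as `mask_line_spans` are hypothetical, and only the 30% ratio and the five-step line-level procedure come from this card. Phase 1 is shown with the stock `transformers` MLM collator.

```python
import random
import re

from transformers import DataCollatorForLanguageModeling, PreTrainedTokenizerBase

NEWLINE_RE = re.compile(r"\n")


def phase1_collator(tokenizer: PreTrainedTokenizerBase):
    """Phase 1: standard MLM with a 30% masking probability."""
    return DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.3)


def mask_line_spans(input_ids, tokenizer, mask_ratio=0.3):
    """Phase 2 (sketch): mask 30% of the non-whitespace tokens in each line."""
    labels = [-100] * len(input_ids)   # -100 = ignored by the MLM loss
    masked = list(input_ids)

    # Steps 1-2: decode each token back to text and split the sequence
    # wherever the decoded token contains a newline.
    lines, current = [], []
    for pos, tok_id in enumerate(input_ids):
        text = tokenizer.decode([tok_id])
        current.append((pos, text))
        if NEWLINE_RE.search(text):
            lines.append(current)
            current = []
    if current:
        lines.append(current)

    for segment in lines:
        # Step 3: whitespace-only tokens are excluded from masking.
        candidates = [pos for pos, text in segment if text.strip()]
        if not candidates:
            continue
        # Step 5: randomly mask 30% of the remaining tokens in this line.
        k = max(1, int(len(candidates) * mask_ratio))
        for pos in random.sample(candidates, k):
            labels[pos] = input_ids[pos]
            masked[pos] = tokenizer.mask_token_id

    # Step 4 (padding to a common length) is left to the batch collator.
    return masked, labels
```

The sketch only covers the per-sequence masking logic; batching, padding, and the MLM head are handled by the surrounding training loop.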

* **Pretraining hyperparameters** (a hedged `transformers` sketch follows this list):
  * Batch size: 20
  * Gradient accumulation steps: 6
  * Effective batch size: 120
  * Optimizer: AdamW
  * Learning rate: 5e-5
  * Scheduler: Cosine
  * Epochs: 2
  * Precision: Mixed precision (fp16) using `transformers`
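
For reference, the hyperparameters above map onto `transformers` training arguments roughly as follows. This is a hedged sketch, not the original training script; the output directory is arbitrary, and the model, dataset, and masking collator are assumed to be defined elsewhere.

```python
from transformers import Trainer, TrainingArguments

# Values mirror the card; everything not listed there is left at its default.
training_args = TrainingArguments(
    output_dir="codemodernbert-owl-v1-pretrain",  # arbitrary placeholder
    per_device_train_batch_size=20,     # batch size 20
    gradient_accumulation_steps=6,      # effective batch size 120
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    num_train_epochs=2,
    fp16=True,                          # mixed precision
    optim="adamw_torch",                # AdamW
)

# trainer = Trainer(
#     model=model,                      # e.g. a ModernBertForMaskedLM instance
#     args=training_args,
#     train_dataset=train_dataset,      # tokenized code/docstring functions
#     data_collator=phase1_collator,    # or the line-level masking collator
# )
# trainer.train()
```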