---
license: apache-2.0
datasets:
- Shuu12121/javascript-treesitter-filtered-datasetsV2
- Shuu12121/ruby-treesitter-filtered-datasetsV2
- Shuu12121/go-treesitter-dedupe_doc-filtered-dataset
- Shuu12121/java-treesitter-dedupe_doc-filtered-dataset
- Shuu12121/rust-treesitter-filtered-datasetsV2
- Shuu12121/php-treesitter-filtered-datasetsV2
- Shuu12121/python-treesitter-filtered-datasetsV2
- Shuu12121/typescript-treesitter-filtered-datasetsV2
language:
- en
pipeline_tag: fill-mask
tags:
- code
- python
- java
- javascript
- typescript
- go
- ruby
- rust
- php
---

# CodeModernBERT-Owl-v1🦉

## Model Details

* **Model type**: Bi-encoder architecture based on ModernBERT
* **Architecture**:
  * Hidden size: 768
  * Layers: 22
  * Attention heads: 12
  * Intermediate size: 1,152
  * Max position embeddings: 8,192
  * Local attention window size: 128
  * Global RoPE positional encoding: θ = 160,000
  * Local RoPE positional encoding: θ = 10,000
* **Sequence length**: Up to 2,048 tokens for code and docstring inputs during pretraining
* **Implementation**: Backend written in Python; integrated into **OwlSpotLight**, a Visual Studio Code extension.

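Since the card declares `pipeline_tag: fill-mask`, the checkpoint can be exercised through the standard `transformers` masked-language-modeling API. The snippet below is a minimal sketch: the repository id `Shuu12121/CodeModernBERT-Owl-v1` is assumed from the model name, and the mask token is read from the tokenizer rather than hard-coded, since the model uses a custom BPE tokenizer.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Assumed Hub id; adjust to the actual location of this checkpoint.
model_id = "Shuu12121/CodeModernBERT-Owl-v1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Ask the model to recover a masked operator in a short Python snippet.
code = f"def add(a, b):\n    return a {tokenizer.mask_token} b"
for prediction in fill_mask(code, top_k=3):
    print(prediction["token_str"], prediction["score"])
```
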
## Pretraining

* **Tokenizer**: Custom BPE tokenizer trained on code and docstring pairs.
* **Data**: Functions and natural-language descriptions extracted from GitHub repositories.
* **Masking strategy**: Two-phase pretraining.
  * **Phase 1: Random Masked Language Modeling (MLM)**
    30% of the tokens in each code function are randomly masked and predicted with standard MLM.
  * **Phase 2: Line-level Span Masking**
    Inspired by SpanBERT, pretraining continues on the same data with span masking at line granularity (a sketch follows this list):
    1. Convert input tokens back to strings.
    2. Detect newline tokens with a regex and segment the input by line.
    3. Exclude whitespace-only tokens from masking.
    4. Apply padding to align sequence lengths.
    5. Randomly mask 30% of the tokens in each line segment and predict them.

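Phase 1 corresponds to what a standard random-MLM collator produces (e.g., `DataCollatorForLanguageModeling` with `mlm_probability=0.30`). The original Phase 2 code is not reproduced here; the following is a minimal sketch of the line-level procedure under the assumptions listed above (regex-based newline detection, whitespace-only tokens excluded, 30% of the tokens in each line masked). The function `line_level_mask` and its signature are illustrative, not the original implementation.

```python
import random
import re

def line_level_mask(token_ids, tokenizer, mask_ratio=0.30):
    """Sketch: mask ~30% of the non-whitespace tokens in each line of a tokenized function."""
    tokens = tokenizer.convert_ids_to_tokens(token_ids)  # step 1: back to token strings
    masked = list(token_ids)
    labels = [-100] * len(token_ids)                     # -100 is ignored by the MLM loss

    # Step 2: detect newline tokens with a regex and segment the input by line.
    lines, current = [], []
    for idx, tok in enumerate(tokens):
        current.append(idx)
        if re.search(r"\n", tokenizer.convert_tokens_to_string([tok])):
            lines.append(current)
            current = []
    if current:
        lines.append(current)

    for line in lines:
        # Step 3: exclude special tokens and whitespace-only tokens from masking.
        candidates = [
            i for i in line
            if token_ids[i] not in tokenizer.all_special_ids
            and tokenizer.convert_tokens_to_string([tokens[i]]).strip()
        ]
        if not candidates:
            continue
        # Step 5: randomly mask ~30% of the tokens in this line segment.
        k = max(1, int(len(candidates) * mask_ratio))
        for i in random.sample(candidates, k):
            labels[i] = masked[i]
            masked[i] = tokenizer.mask_token_id

    # Step 4 (padding to a common length) is left to the data collator in this sketch.
    return masked, labels
```
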
* **Pretraining hyperparameters**:
  * Batch size: 20
  * Gradient accumulation steps: 6
  * Effective batch size: 120
  * Optimizer: AdamW
  * Learning rate: 5e-5
  * Scheduler: Cosine
  * Epochs: 2
  * Precision: Mixed precision (fp16) using `transformers`
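
For reference, these hyperparameters map roughly onto `transformers` `TrainingArguments` as shown below. This is a sketch rather than the original training script; `output_dir` is a placeholder.

```python
from transformers import TrainingArguments

# Sketch: TrainingArguments mirroring the reported pretraining hyperparameters.
training_args = TrainingArguments(
    output_dir="owl-v1-pretraining",   # placeholder path
    per_device_train_batch_size=20,
    gradient_accumulation_steps=6,     # effective batch size: 20 * 6 = 120
    optim="adamw_torch",               # AdamW
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    num_train_epochs=2,
    fp16=True,                         # mixed precision
)
```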