| | --- |
| | tags: |
| | - sentence-transformers |
| | - sentence-similarity |
| | - dataset_size:901028 |
| | - loss:CosineSimilarityLoss |
| | base_model: Shuu12121/CodeModernBERT-Owl |
| | pipeline_tag: sentence-similarity |
| | library_name: sentence-transformers |
| | metrics: |
| | - pearson_cosine |
| | - accuracy |
| | - f1 |
| | model-index: |
| | - name: SentenceTransformer based on Shuu12121/CodeModernBERT-Owl |
| | results: |
| | - task: |
| | type: semantic-similarity |
| | name: Semantic Similarity |
| | dataset: |
| | name: val |
| | type: val |
| | metrics: |
| | - type: pearson_cosine |
| | value: 0.9481467499740959 |
| | name: Training Pearson Cosine |
| | - type: accuracy |
| | value: 0.9900051996071408 |
| | name: Test Accuracy |
| | - type: f1 |
| | value: 0.963323498754483 |
| | name: Test F1 Score |
| | license: apache-2.0 |
| | datasets: |
| | - google/code_x_glue_cc_clone_detection_big_clone_bench |
| | --- |
| | |
| | # SentenceTransformer based on `Shuu12121/CodeModernBERT-Owl🦉` |
| |
|
| | This model is a SentenceTransformer fine-tuned from [`Shuu12121/CodeModernBERT-Owl🦉`](https://huggingface.co/Shuu12121/CodeModernBERT-Owl) on the [BigCloneBench](https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench) dataset for **code clone detection**. It maps code snippets into a 768-dimensional dense vector space for semantic similarity tasks. |
| |
|
| |
|
| |
|
| | ## 🎯 Distinctive Performance and Stability |
| |
|
| | This model achieves **about 99% accuracy and an F1 score of about 0.96** on the BigCloneBench test split for code clone detection. |
| | Notably, **changing the similarity threshold has minimal impact on classification performance**. |
| | This suggests that the model has learned to **clearly separate clones from non-clones**, producing a **stable similarity score distribution**. |
| |
|
| | | Threshold | Accuracy | F1 Score | |
| | |-------------------|-------------------|--------------------| |
| | | 0.5 | 0.9900 | 0.9633 | |
| | | 0.85 | 0.9903 | 0.9641 | |
| | | 0.90 | 0.9902 | 0.9637 | |
| | | 0.95 | 0.9887 | 0.9579 | |
| | | 0.98 | 0.9879 | 0.9540 | |
| |
|
| | - **High Stability**: Across thresholds from 0.5 to 0.98, accuracy varies by less than 0.003 and F1 by about 0.01. |
| | _(This suggests that code pairs considered clones generally score between 0.9 and 1.0 in cosine similarity.)_ |
| |
|
| | - **Reliable in Real-World Applications**: Even if the similarity threshold is slightly adjusted for different tasks or environments, the model maintains consistent performance without significant degradation. |
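The threshold table above can be reproduced in spirit with a small sweep. The following is a minimal sketch in plain Python, assuming you already have one cosine score and one 0/1 label per code pair. The helper names (`accuracy_and_f1`, `sweep_thresholds`) and the toy scores are hypothetical and are not model outputs.

```python
def accuracy_and_f1(scores, labels, threshold):
    """Predict 'clone' when score > threshold; return (accuracy, F1)."""
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    accuracy = correct / len(labels)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


def sweep_thresholds(scores, labels, thresholds=(0.5, 0.85, 0.90, 0.95, 0.98)):
    """Evaluate the same scores at several thresholds."""
    return {t: accuracy_and_f1(scores, labels, t) for t in thresholds}


# Toy, made-up scores for illustration only (NOT produced by the model).
toy_scores = [0.99, 0.97, 0.96, 0.30, 0.10, 0.45, 0.992, 0.20]
toy_labels = [1, 1, 1, 0, 0, 0, 1, 0]

for t, (acc, f1) in sweep_thresholds(toy_scores, toy_labels).items():
    print(f"threshold={t:.2f}  accuracy={acc:.4f}  f1={f1:.4f}")
```

When clone scores cluster near 1.0 and non-clone scores sit far below, most thresholds in between give identical predictions, which is the kind of stability the table reports.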
| |
|
| |
|
| |
|
| | ## 📌 Model Overview |
| |
|
| | - **Architecture**: Sentence-BERT (SBERT) |
| | - **Base Model**: `Shuu12121/CodeModernBERT-Owl` |
| | - **Output Dimension**: 768 |
| | - **Max Sequence Length**: 2048 tokens |
| | - **Pooling Method**: CLS token pooling |
| | - **Similarity Function**: Cosine Similarity |
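To make the pooling and similarity settings above concrete, here is a minimal sketch in plain Python. It assumes the per-token vectors are already available as lists of floats; `cls_pool` and `cosine_similarity` are hypothetical helper names, and the 3-dimensional toy vectors stand in for the model's 768-dimensional outputs.

```python
import math


def cls_pool(token_embeddings):
    """CLS pooling: use the first token's vector as the sequence embedding."""
    return token_embeddings[0]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy token embeddings (made up); each inner list is one token's vector.
tokens_a = [[1.0, 0.0, 1.0], [0.2, 0.9, 0.1]]
tokens_b = [[2.0, 0.0, 2.0], [0.7, 0.1, 0.3], [0.0, 0.5, 0.5]]

emb_a = cls_pool(tokens_a)
emb_b = cls_pool(tokens_b)
print(f"cosine similarity: {cosine_similarity(emb_a, emb_b):.4f}")
```

Only the first token's vector contributes to the embedding, so sequences of different lengths still map to vectors of the same size.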
| |
|
| | --- |
| |
|
| | ## 🏋️‍♂️ Training Configuration |
| |
|
| | - **Loss Function**: `CosineSimilarityLoss` |
| | - **Epochs**: 1 |
| | - **Batch Size**: 32 |
| | - **Warmup Steps**: 3% of training steps |
| | - **Evaluator**: `EmbeddingSimilarityEvaluator` (on validation) |
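As a rough illustration of what these settings imply, the sketch below estimates the number of optimizer steps and warmup steps from the dataset size in the metadata, and shows the objective that `CosineSimilarityLoss` uses by default in sentence-transformers (mean squared error between the cosine score and the label). The rounding choices and the helper name `cosine_similarity_loss` are assumptions; the actual training script may round differently.

```python
import math

DATASET_SIZE = 901_028    # dataset_size from the model card metadata
BATCH_SIZE = 32
EPOCHS = 1
WARMUP_FRACTION = 0.03    # "3% of training steps"

# Assumption: the last partial batch is kept, hence ceil.
steps_per_epoch = math.ceil(DATASET_SIZE / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
# Assumption: warmup is rounded up; the original script may truncate instead.
warmup_steps = math.ceil(total_steps * WARMUP_FRACTION)

print(f"total steps: {total_steps}, warmup steps: {warmup_steps}")


def cosine_similarity_loss(cos_scores, labels):
    """Mean squared error between cosine scores and gold labels (0 or 1)."""
    return sum((c - y) ** 2 for c, y in zip(cos_scores, labels)) / len(labels)


# Toy scores (made up): a clone pair scored 0.9 and a non-clone pair scored 0.2.
print(f"loss: {cosine_similarity_loss([0.9, 0.2], [1, 0]):.4f}")
```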
| |
|
| | --- |
| |
|
| | ## 📊 Evaluation Metrics |
| |
|
| | | Metric | Score | |
| | |---------------------------|--------------------| |
| | | Pearson Cosine (Train) | `0.9481` | |
| | | Accuracy (Test, threshold 0.90) | `0.9902` | |
| | | F1 Score (Test, threshold 0.90) | `0.9637` | |
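The Pearson cosine metric is the Pearson correlation between the predicted cosine similarities and the gold labels. A minimal sketch in plain Python follows; the `pearson` helper and the toy values are hypothetical and only illustrate the calculation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy cosine scores and 0/1 labels (made up, not model outputs).
toy_scores = [0.98, 0.95, 0.12, 0.30, 0.91, 0.05]
toy_labels = [1, 1, 0, 0, 1, 0]
print(f"Pearson cosine: {pearson(toy_scores, toy_labels):.4f}")
```

A value near 1 means high scores line up with clone labels and low scores with non-clone labels.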
| |
|
| | --- |
| |
|
| | ## 📚 Dataset |
| |
|
| | - [Google BigCloneBench](https://huggingface.co/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench) |
| |
|
| | --- |
| |
|
| | ## 🧪 How to Use |
| |
|
| | ```python |
| | from sentence_transformers import SentenceTransformer |
| | from torch.nn.functional import cosine_similarity |
| | import torch |
| | |
| | # Load the fine-tuned model |
| | model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl") |
| | |
| | # Two code snippets to compare |
| | code1 = "def add(a, b): return a + b" |
| | code2 = "def sum(x, y): return x + y" |
| | |
| | # Encode the code snippets |
| | embeddings = model.encode([code1, code2], convert_to_tensor=True) |
| | |
| | # Compute cosine similarity |
| | similarity_score = cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1].unsqueeze(0)).item() |
| | |
| | # Print the result |
| | print(f"Cosine Similarity: {similarity_score:.4f}") |
| | if similarity_score >= 0.9: |
| |     print("🟢 These code snippets are considered CLONES.") |
| | else: |
| |     print("🔴 These code snippets are NOT considered clones.") |
| | ``` |
| | ## 🧪 How to Test |
| |
|
| | ```python |
| | # Notebook cell syntax; in a terminal, run the command without the leading "!" |
| | !pip install -U sentence-transformers datasets |
| | |
| | from sentence_transformers import SentenceTransformer |
| | from datasets import load_dataset |
| | import torch |
| | from sklearn.metrics import accuracy_score, f1_score |
| | |
| | # --- Load the dataset --- |
| | ds_test = load_dataset("google/code_x_glue_cc_clone_detection_big_clone_bench", split="test") |
| | |
| | model = SentenceTransformer("Shuu12121/CodeCloneDetection-ModernBERT-Owl") |
| | model.to("cuda" if torch.cuda.is_available() else "cpu")  # fall back to CPU when no GPU is available |
| | |
| | |
| | test_sentences1 = ds_test["func1"] |
| | test_sentences2 = ds_test["func2"] |
| | test_labels = ds_test["label"] |
| | |
| | batch_size = 256 # Adjust to fit your GPU memory |
| | |
| | print("Encoding sentences1...") |
| | |
| | embeddings1 = model.encode( |
| | test_sentences1, |
| | convert_to_tensor=True, |
| | batch_size=batch_size, |
| | show_progress_bar=True |
| | ) |
| | |
| | print("Encoding sentences2...") |
| | embeddings2 = model.encode( |
| | test_sentences2, |
| | convert_to_tensor=True, |
| | batch_size=batch_size, |
| | show_progress_bar=True |
| | ) |
| | |
| | print("Calculating cosine scores...") |
| | cosine_scores = torch.nn.functional.cosine_similarity(embeddings1, embeddings2) |
| | |
| | # Threshold setting (0.9 is used here) |
| | threshold = 0.9 |
| | print(f"Using threshold: {threshold}") |
| | predictions = (cosine_scores > threshold).long().cpu().numpy() |
| | |
| | accuracy = accuracy_score(test_labels, predictions) |
| | f1 = f1_score(test_labels, predictions) |
| | print("Test Accuracy:", accuracy) |
| | print("Test F1 Score:", f1) |
| | |
| | ``` |
| |
|
| | ## 🛠️ Model Architecture |
| |
|
| | ```python |
| | SentenceTransformer( |
| | (0): Transformer({'max_seq_length': 2048}) with model 'ModernBertModel' |
| | (1): Pooling({ |
| | 'word_embedding_dimension': 768, |
| | 'pooling_mode_cls_token': True, |
| | ... |
| | }) |
| | ) |
| | ``` |
| |
|
| | --- |
| |
|
| | ## 📦 Dependencies |
| |
|
| | - Python: `3.11.11` |
| | - sentence-transformers: `4.0.1` |
| | - transformers: `4.50.3` |
| | - torch: `2.6.0+cu124` |
| | - datasets: `3.5.0` |
| | - tokenizers: `0.21.1` |
| | - flash-attn: ✅ Installed |
| |
|
| | ### Install Required Libraries |
| |
|
| | ```bash |
| | pip install -U sentence-transformers "transformers>=4.48.0" flash-attn datasets |
| | ``` |
| |
|
| | --- |
| |
|
| | ## 🔐 Optional: Authentication |
| |
|
| | ```python |
| | from huggingface_hub import login |
| | login("your_huggingface_token") |
| | |
| | import wandb |
| | wandb.login(key="your_wandb_token") |
| | ``` |
| |
|
| | --- |
| |
|
| | ## 🧾 Citation |
| |
|
| | ```bibtex |
| | @inproceedings{reimers-2019-sentence-bert, |
| | title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", |
| | author = "Reimers, Nils and Gurevych, Iryna", |
| | booktitle = "EMNLP 2019", |
| | url = "https://arxiv.org/abs/1908.10084" |
| | } |
| | ``` |
| |
|
| | --- |
| |
|
| | ## 🔓 License |
| |
|
| | Apache License 2.0 |
| |
|
| |
|