dzungpham/graphcodebert-code-classification


dzungpham committed on Apr 13

Commit b586b98 · verified · 1 Parent(s): 28b4bc7

Create README.md

Files changed (1)

README.md +76 -0

README.md ADDED

	@@ -0,0 +1,76 @@

+---
+license: mit
+datasets:
+- DaniilOr/SemEval-2026-Task13
+metrics:
+- accuracy
+- f1
+- precision
+- recall
+base_model:
+- microsoft/unixcoder-base
+library_name: transformers
+tags:
+- detection
+- AI-generated
+- transformers
+- bert
+---
+## 🔍 Task Overview
+The rise of generative models has made it increasingly difficult to distinguish machine-generated code from human-written code, especially across different programming languages, domains, and generation techniques.
+**SemEval-2026 Task 13** challenges participants to build systems that can **detect machine-generated code** under diverse conditions by evaluating generalization to unseen languages, generator families, and code application scenarios.
+The task consists of **three subtasks**:
+
+---
+### Subtask A: Binary Machine-Generated Code Detection
+**Goal:**
+Given a code snippet, predict whether it is:
+- **(i)** Fully **human-written**, or
+- **(ii)** Fully **machine-generated**
+
+**Training Languages:** `C++`, `Python`, `Java`
+
+**Training Domain:** `Algorithmic` (e.g., LeetCode-style problems)
+
+**Evaluation Settings:**
+
+| Setting                                     | Languages                  | Domains              |
+|---------------------------------------------|----------------------------|----------------------|
+| (i) Seen Languages & Seen Domains           | C++, Python, Java          | Algorithmic          |
+| (ii) Unseen Languages & Seen Domains        | Go, PHP, C#, C, JavaScript | Algorithmic          |
+| (iii) Seen Languages & Unseen Domains       | C++, Python, Java          | Research, Production |
+| (iv) Unseen Languages & Unseen Domains      | Go, PHP, C#, C, JavaScript | Research, Production |
+**Dataset Size:**
+- Train: 500K samples (238K human-written, 262K machine-generated)
+- Validation: 100K samples
+
+**Data Format**
+Each record contains the following fields:
+- `code`: The code snippet
+- `label`: The binary label (0 for human-written, 1 for machine-generated)
+- `language`: The programming language of the snippet
+
+Label mappings are provided in `task_A/label_to_id.json` and `task_A/id_to_label.json`.
+
+**Evaluation Metric**
+The primary evaluation metric for Subtask A is **Macro F1-score**, the unweighted mean of the per-class F1 scores. Because both classes count equally regardless of their frequency, a system cannot score well by favoring the majority class.
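To make the metric concrete, here is a minimal pure-Python sketch of macro F1 for the binary labels described above. The function name `macro_f1` is illustrative and not part of the task's tooling, and the official scorer may differ in details such as how it handles a class with no predictions (this sketch assigns such a class an F1 of 0).

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Return the unweighted mean of per-class F1 scores.

    Hypothetical helper for illustration; assumes labels 0 (human-written)
    and 1 (machine-generated) as in the data format above.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    scores = []
    for cls in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        # F1 = 2*TP / (2*TP + FP + FN); assumed to be 0 when undefined.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # A majority-class predictor on imbalanced data: high accuracy,
    # but macro F1 is pulled down by the class that is never predicted.
    y_true = [1] * 9 + [0]
    y_pred = [1] * 10
    print(macro_f1(y_true, y_pred))
```

The example at the bottom shows why the metric is used here: predicting only one class is penalized even when that class dominates the data.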
+**Submission Format**
+Participants must submit a `.csv` file with the following columns:
+- `id`: Unique identifier for each code snippet
+- `label`: Predicted label (0 or 1)
+
+A sample submission file is available in the `task_A/` folder.
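The submission format can be sketched with Python's standard `csv` module. The helper name `write_submission` and the output file name in the usage comment are assumptions made for illustration; the sample file in `task_A/` remains the authoritative reference for column order and formatting.

```python
import csv
import io


def write_submission(ids, labels, out):
    """Write `id,label` rows to a text stream.

    Hypothetical helper: assumes a header row with the two columns named
    in the task description and binary labels (0 or 1).
    """
    if len(ids) != len(labels):
        raise ValueError("ids and labels must have the same length")

    writer = csv.writer(out)
    writer.writerow(["id", "label"])
    for sample_id, label in zip(ids, labels):
        if label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {label!r}")
        writer.writerow([sample_id, label])


if __name__ == "__main__":
    # Write to an in-memory buffer; for a real file one could use
    # open("submission.csv", "w", newline="") instead (file name assumed).
    buffer = io.StringIO()
    write_submission([101, 102, 103], [0, 1, 1], buffer)
    print(buffer.getvalue())
```

Validating the labels before writing catches a common mistake, such as emitting probabilities or string class names instead of 0 and 1.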
+**Baseline Models**
+Baseline implementations for Subtask A are provided in the `baselines/` directory. These include starter code and pre-trained checkpoints for models such as GraphCodeBERT and UniXcoder.
+**Restrictions**
+- **No external training data**: Use only the provided datasets.
+- **No specialized AI-generated code detectors**: General-purpose code models (e.g., CodeBERT, StarCoder) are allowed.