---
license: mit
metrics:
- accuracy
- f1
- precision
- recall
base_model:
- microsoft/unixcoder-base
library_name: transformers
tags:
- detection
- AI-generated
- transformers
- bert
---
## 🔍 Task Overview

The rise of generative models has made it increasingly difficult to distinguish machine-generated code from human-written code, especially across different programming languages, domains, and generation techniques.

**SemEval-2026 Task 13** challenges participants to build systems that can **detect machine-generated code** under diverse conditions by evaluating generalization to unseen languages, generator families, and code application scenarios.

The task consists of **three subtasks**:

---

### Subtask A: Binary Machine-Generated Code Detection

**Goal:**  
Given a code snippet, predict whether it is:

- **(i)** Fully **human-written**, or  
- **(ii)** Fully **machine-generated**

**Training Languages:** `C++`, `Python`, `Java`  
**Training Domain:** `Algorithmic` (e.g., LeetCode-style problems)

**Evaluation Settings:**

| Setting                              | Language                | Domain                 |
|--------------------------------------|-------------------------|------------------------|
| (i) Seen Languages & Seen Domains    | C++, Python, Java       | Algorithmic            |
| (ii) Unseen Languages & Seen Domains | Go, PHP, C#, C, JS      | Algorithmic            |
| (iii) Seen Languages & Unseen Domains| C++, Python, Java       | Research, Production   |
| (iv) Unseen Languages & Domains      | Go, PHP, C#, C, JS      | Research, Production   |
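
The four settings in the table follow directly from whether a snippet's language and domain were seen during training. A minimal sketch of that mapping is shown here; the helper name `evaluation_setting` and the exact string spellings are assumptions for illustration, not part of the task's tooling:

```python
# Training configuration as listed in the task description.
SEEN_LANGUAGES = {"C++", "Python", "Java"}
SEEN_DOMAINS = {"Algorithmic"}


def evaluation_setting(language: str, domain: str) -> str:
    """Return the evaluation setting ("i" to "iv") for a language/domain pair.

    Hypothetical helper: spellings are assumed to match the table above.
    """
    seen_language = language in SEEN_LANGUAGES
    seen_domain = domain in SEEN_DOMAINS
    if seen_language and seen_domain:
        return "i"
    if not seen_language and seen_domain:
        return "ii"
    if seen_language and not seen_domain:
        return "iii"
    return "iv"
```

Grouping validation predictions by this setting, where domain information is available, gives a rough view of how well a detector generalizes beyond the training distribution.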

**Dataset Size:**
- Train: 500K samples (238K human-written, 262K machine-generated)
- Validation: 100K samples

**Data Format:**

Each dataset contains the following fields:
- `code`: The code snippet
- `label`: The binary label (0 for human-written, 1 for machine-generated)
- `language`: The programming language of the snippet

Label mappings are provided in `task_A/label_to_id.json` and `task_A/id_to_label.json`.
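
Before training, it can help to check that every record carries the three fields described above and a valid binary label. The following is a minimal sketch that assumes records are available as Python dictionaries; the helper names and the toy records are illustrative and not taken from the dataset:

```python
from collections import Counter

REQUIRED_FIELDS = ("code", "label", "language")
VALID_LABELS = {0, 1}  # 0 = human-written, 1 = machine-generated


def validate_record(record: dict) -> None:
    """Raise ValueError if a record lacks a required field or has a bad label."""
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if record["label"] not in VALID_LABELS:
        raise ValueError(f"unexpected label: {record['label']!r}")


def label_counts(records) -> Counter:
    """Validate each record and count how many fall under each label."""
    counts = Counter()
    for record in records:
        validate_record(record)
        counts[record["label"]] += 1
    return counts


# Toy records for illustration only; not taken from the dataset.
sample = [
    {"code": "print('hi')", "label": 0, "language": "Python"},
    {"code": "int main() { return 0; }", "label": 1, "language": "C++"},
]
print(label_counts(sample))
```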

**Evaluation Metric:**

The primary evaluation metric for Subtask A is the **macro F1-score**: the unweighted mean of the per-class F1 scores, so the human-written and machine-generated classes count equally regardless of how many samples each has.
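
To make the metric concrete, here is a minimal pure-Python sketch of macro F1 for the two labels. It is not the official scorer; in particular, returning 0.0 for a class with no true positives, false positives, or false negatives is an assumption that the official evaluation may handle differently:

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, computed as 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Assumed convention: score 0.0 when the class never appears at all.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy example: class 0 scores 2/3, class 1 scores 4/5.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
print(round(macro_f1(y_true, y_pred), 4))  # 0.7333
```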

**Submission Format:**

Participants must submit a `.csv` file with the following columns:
- `id`: Unique identifier for each code snippet
- `label`: Predicted label (0 or 1)

A sample submission file is available in the `task_A/` folder.
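
A submission file with these two columns can be produced with the standard `csv` module. This is a minimal sketch: the function name `write_submission` and the output path are hypothetical, and the sample submission file in `task_A/` remains the authoritative reference for the exact layout and `id` format:

```python
import csv


def write_submission(path, ids, labels):
    """Write predictions to a CSV file with the columns `id` and `label`."""
    if len(ids) != len(labels):
        raise ValueError("ids and labels must have the same length")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "label"])
        for snippet_id, label in zip(ids, labels):
            if label not in (0, 1):
                raise ValueError(f"label must be 0 or 1, got {label!r}")
            writer.writerow([snippet_id, label])
```

For example, `write_submission("submission.csv", [0, 1, 2], [1, 0, 1])` writes a header row followed by one row per snippet.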

**Baseline Models:**

Baseline implementations for Subtask A are provided in the `baselines/` directory. These include starter code and pre-trained checkpoints for models such as GraphCodeBERT and UniXcoder.

**Restrictions:**
- **No external training data**: Use only the provided datasets.
- **No specialized AI-generated code detectors**: General-purpose code models (e.g., CodeBERT, StarCoder) are allowed.