---
license: mit
metrics:
- accuracy
- f1
- precision
- recall
base_model:
- microsoft/unixcoder-base
library_name: transformers
tags:
- detection
- AI-generated
- transformers
- bert
---

## Task Overview

The rapid advancement of generative models has made it increasingly challenging to distinguish machine-generated code from human-written code, particularly across different programming languages, domains, and generation techniques.

SemEval-2026 Task 13 focuses on developing systems capable of detecting machine-generated code under diverse conditions. The evaluation emphasizes generalization to unseen programming languages, generator families, and application scenarios.

The task is divided into three subtasks.

---

### Subtask A: Binary Machine-Generated Code Detection

**Goal:**  
Given a code snippet, determine whether it is:

- Fully human-written, or  
- Fully machine-generated

**Training Languages:** C++, Python, Java  
**Training Domain:** Algorithmic (e.g., LeetCode-style problems)

**Evaluation Settings:**

| Setting                                | Language                | Domain               |
|----------------------------------------|-------------------------|----------------------|
| (i) Seen Languages & Seen Domains      | C++, Python, Java       | Algorithmic          |
| (ii) Unseen Languages & Seen Domains   | Go, PHP, C#, C, JS      | Algorithmic          |
| (iii) Seen Languages & Unseen Domains  | C++, Python, Java       | Research, Production |
| (iv) Unseen Languages & Unseen Domains | Go, PHP, C#, C, JS      | Research, Production |
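
To report results per setting, each evaluation sample can be bucketed by its language and domain. The sketch below is illustrative only: the function name `evaluation_setting` and the exact string spellings (taken from the table above, e.g. `"JS"`) are assumptions, and the released data may encode languages and domains differently.

```python
# Hypothetical helper: the string values mirror the table above and may not
# match the spellings used in the released data files.
SEEN_LANGUAGES = {"C++", "Python", "Java"}
UNSEEN_LANGUAGES = {"Go", "PHP", "C#", "C", "JS"}
SEEN_DOMAINS = {"Algorithmic"}
UNSEEN_DOMAINS = {"Research", "Production"}


def evaluation_setting(language: str, domain: str) -> str:
    """Return the evaluation setting ("i" to "iv") for a language/domain pair."""
    if language in SEEN_LANGUAGES:
        seen_language = True
    elif language in UNSEEN_LANGUAGES:
        seen_language = False
    else:
        raise ValueError(f"unknown language: {language!r}")

    if domain in SEEN_DOMAINS:
        seen_domain = True
    elif domain in UNSEEN_DOMAINS:
        seen_domain = False
    else:
        raise ValueError(f"unknown domain: {domain!r}")

    if seen_language and seen_domain:
        return "i"
    if not seen_language and seen_domain:
        return "ii"
    if seen_language and not seen_domain:
        return "iii"
    return "iv"
```

Grouping predictions by this key before scoring makes it easier to see whether a drop in performance comes from unseen languages, unseen domains, or both.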

**Dataset Size:** 
- Train: 500,000 samples (238,000 human-written, 262,000 machine-generated)
- Validation: 100,000 samples

**Data Format:**  
Each record in the dataset has the following fields:
- `code`: The code snippet  
- `label`: Binary label (0 for human-written, 1 for machine-generated)  
- `language`: Programming language of the snippet  

Label mappings are provided in `task_A/label_to_id.json` and `task_A/id_to_label.json`.
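
A quick sanity check on the three fields described above can catch malformed records before training. This is a minimal sketch, assuming the records have already been loaded into Python dictionaries; the helper name `summarize_records` is hypothetical, and the on-disk file format is not specified here.

```python
from collections import Counter

REQUIRED_FIELDS = ("code", "label", "language")


def summarize_records(records):
    """Validate each record and count samples per (language, label) pair."""
    counts = Counter()
    for index, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"record {index} is missing fields: {missing}")
        if record["label"] not in (0, 1):
            raise ValueError(f"record {index} has invalid label: {record['label']!r}")
        counts[(record["language"], record["label"])] += 1
    return counts


# Toy records for illustration only; these are not taken from the dataset.
example = [
    {"code": "print('hi')", "label": 0, "language": "Python"},
    {"code": "int main() { return 0; }", "label": 1, "language": "C++"},
    {"code": "x = 1", "label": 1, "language": "Python"},
]
print(summarize_records(example))
```

The per-language counts also make any class imbalance visible, which matters because the primary metric weights both classes equally.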

**Evaluation Metric:**  
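The primary metric for Subtask A is the macro-averaged F1 score: F1 is computed separately for each class and the two scores are averaged with equal weight, so performance on the minority class counts as much as performance on the majority class.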
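
For reference, macro F1 can be computed in a few lines. The sketch below is a plain-Python illustration, not the official scorer; treating a class with no true or predicted samples as F1 = 0 is an assumption. For ordinary inputs it should agree with scikit-learn's `f1_score(y_true, y_pred, average="macro")`.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    scores = []
    for cls in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        # F1 = 2*TP / (2*TP + FP + FN); assumed to be 0 when the class is absent.
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Class 0: F1 = 2/3, class 1: F1 = 4/5, so macro F1 = 11/15.
print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]))
```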

**Submission Format:**  
Participants must submit a `.csv` file containing:
- `id`: Unique identifier for each code snippet  
- `label`: Predicted label (0 or 1)  

A sample submission file is available in the `task_A/` directory.
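
A submission file with the two columns listed above can be written with Python's standard `csv` module. This is a minimal sketch: the function name `write_submission` and the output path are hypothetical, and the sample submission file in `task_A/` remains the authority on the exact layout.

```python
import csv


def write_submission(path, ids, labels):
    """Write predictions to a CSV file with an `id,label` header."""
    if len(ids) != len(labels):
        raise ValueError("ids and labels must have the same length")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "label"])
        for sample_id, label in zip(ids, labels):
            if label not in (0, 1):
                raise ValueError(f"label for id {sample_id!r} must be 0 or 1")
            writer.writerow([sample_id, int(label)])


# Hypothetical ids and predictions, for illustration only.
write_submission("submission.csv", [101, 102, 103], [0, 1, 1])
```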

**Baseline Models:**  
Baseline implementations are provided in the `baselines/` directory, including starter code and pre-trained checkpoints for models such as GraphCodeBERT and UniXcoder.

**Restrictions:**
- No external training data may be used; only the provided datasets are allowed.  
- Specialized AI-generated code detectors are not permitted. General-purpose code models (e.g., CodeBERT, StarCoder) are allowed.