---
library_name: transformers
tags:
- text-classification
- code-classification
- code-detection
license: apache-2.0
language:
- tr
base_model:
- dbmdz/electra-base-turkish-mc4-uncased-discriminator
pipeline_tag: text-classification
---

## Model Card

A lightweight **binary classifier** that determines whether a Turkish input string is wholly or partially **code (`CODE`)** or ordinary **natural language (`NL`)**.  
The model is designed as a *guard-rail component* in LLM pipelines:  
if a user prompt is classified as `CODE`, the orchestration layer can refuse to forward it to the LLM, apply rate limits, or route it to a different policy.


## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import pipeline

clf = pipeline("text-classification",
               model="yeniguno/turkish-code-detector",
               tokenizer="yeniguno/turkish-code-detector")

prompt = "def faktoriyel(n):\n    return 1 if n <= 1 else n * faktoriyel(n-1)"
result = clf(prompt)
print(f"Classification: {result}\n")
# Classification: [{'label': 'CODE', 'score': 0.999995231628418}]

prompt = "Linux'un yaratıcısı kimdir, biliyor musun?"
result = clf(prompt)
print(f"Classification: {result}\n")
# Classification: [{'label': 'NL', 'score': 0.9998611211776733}]
```
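The guard-rail pattern described above can be sketched as a thin policy layer on top of the classifier output. The helper name `route_prompt`, the `threshold` value, and the decision strings below are illustrative assumptions, not part of the model; the input dict matches the `pipeline` output format shown above:

```python
def route_prompt(pred: dict, threshold: float = 0.9) -> str:
    """Map one classifier prediction, e.g. {'label': 'CODE', 'score': 0.99},
    to a routing decision for an LLM pipeline.

    The threshold and the decision names ("reject" / "forward") are
    illustrative; choose a policy that fits your own stack.
    """
    if pred["label"] == "CODE" and pred["score"] >= threshold:
        return "reject"   # or: rate-limit, or route to a code-specific policy
    return "forward"
```

Requiring a high score before rejecting keeps borderline inputs (see the limitations below) flowing to the LLM rather than being blocked on a low-confidence `CODE` label.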


## Intended Use & Limitations

| ✓ Recommended                     | ✗ Not a Good Fit                          |
|-----------------------------------|-------------------------------------------|
| Prompt filtering in LLM stacks    | Detecting specific programming languages  |
| Pre-screening user inputs in chat | Judging code quality or style            |
| Moderating public text fields     | Detecting tiny inline code tokens in very long documents |
| Fast, low-latency inference (≈1 ms on GPU) | Multilingual detection outside Turkish |

The classifier was trained **only on Turkish natural-language text** plus polyglot code snippets.  
Text in unseen languages (e.g. Japanese) may be mislabelled `NL`.  
Very short, ambiguous strings (e.g. `"int"`) can be mislabelled `CODE`.


## Training Data

| Split | Total | **NL** | **CODE** |
|-------|------:|---------:|-------:|
| Train | **316 732** | 251 518 | 65 214 |
| Dev   | 39 591 | 31 439 | 8 152 |
| Test  | 39 592 | 31 440 | 8 152 |
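For reference, a quick arithmetic check of the imbalance implied by the counts above. `CODE` is the minority class (about 20 % of the data), which is what makes the *reversed* class weights under Training Hyperparameters notable: the majority `NL` class receives the larger weight.

```python
# Counts copied from the training-data table above.
train_nl, train_code = 251_518, 65_214

total = train_nl + train_code   # 316_732, matching the table's Train total
ratio = train_nl / train_code   # ≈ 3.86 NL examples per CODE example
```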


### Training Hyperparameters

| Setting | Value |
|---------|-------|
| Optimiser | AdamW |
| Effective batch | 32 (2 × 16, fp16) |
| LR scheduler | linear-decay, warm-up 0 |
| Max length | 256 tokens |
| Epochs | ≤ 10 (early-stopping at 6 k steps ≈ 0.30 epoch) |
| Loss | **Cross-entropy with *reversed* class weights**<br>`weight_NL = 10.0`  `weight_CODE = 1.0` |
| Label smoothing | 0.1 |
| Hardware | 1 × A100 40 GB (Google Colab) |
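The loss row above combines per-class weights with label smoothing. Below is a minimal pure-Python sketch for a single two-class example, under one common convention (smoothed target `q = (1 − s)·one_hot + s/K`, with the whole example scaled by the true class's weight); the exact implementation used in training is not shown in this card and may differ in how it combines the two:

```python
import math

def weighted_smoothed_ce(logits, target, weights=(10.0, 1.0), smoothing=0.1):
    """Weighted cross-entropy with label smoothing for one example.

    logits    -- (logit_NL, logit_CODE)
    target    -- 0 for NL, 1 for CODE
    weights   -- per-class weights (NL = 10.0, CODE = 1.0, as in the table)
    smoothing -- label-smoothing factor s (0.1, as in the table)
    """
    k = len(logits)
    # log-softmax, stabilised by subtracting the max logit
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    # smoothed target distribution: (1 - s) on the true class plus s/K everywhere
    q = [(1 - smoothing) * (1.0 if c == target else 0.0) + smoothing / k
         for c in range(k)]
    return -weights[target] * sum(qc * lp for qc, lp in zip(q, log_probs))
```

With equal logits the predicted probabilities are 0.5/0.5 regardless of smoothing, so under these weights an `NL` example costs exactly 10× as much as a `CODE` example.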


## Evaluation

| Split | Acc | Prec | Recall | F1 |
|-------|----:|-----:|-------:|---:|
| Train | 0.9960 | 0.9978 | 0.9827 | 0.9902 |
| Dev   | 0.9957 | 0.9981 | 0.9807 | 0.9894 |
| Test  | 0.9954 | 0.9968 | 0.9807 | 0.9887 |

All metrics were computed with `id2label = {0: "NL", 1: "CODE"}`.
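The reported F1 scores follow from precision and recall via the standard identity F1 = 2·P·R / (P + R); for the test split:

```python
precision, recall = 0.9968, 0.9807   # test-split values from the table above

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9887, matching the table
```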