---
library_name: transformers
tags:
- text-classification
- code-classification
- code-detection
license: apache-2.0
language:
- tr
base_model:
- dbmdz/electra-base-turkish-mc4-uncased-discriminator
pipeline_tag: text-classification
---

## Model Card

A lightweight **binary classifier** that determines whether a Turkish input string is wholly or partially **code (`CODE`)** or ordinary **natural language (`NL`)**.  
The model is designed as a *guard-rail component* in LLM pipelines:  
if a user prompt is classified as `CODE`, the orchestration layer can refuse to forward it to the LLM, apply rate limits, or route it to a different policy.


## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import pipeline

clf = pipeline("text-classification",
               model="yeniguno/turkish-code-detector",
               tokenizer="yeniguno/turkish-code-detector")

prompt = "def faktoriyel(n):\n    return 1 if n <= 1 else n * faktoriyel(n-1)"
result = clf(prompt)
print(f"Classification: {result}\n")
# Classification: [{'label': 'CODE', 'score': 0.999995231628418}]

prompt = "Linux'un yaratıcısı kimdir, biliyor musun?"
result = clf(prompt)
print(f"Classification: {result}\n")
# Classification: [{'label': 'NL', 'score': 0.9998611211776733}]
```
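The guard-rail pattern described above can be sketched as a thin policy layer on top of the classifier output. The helper name `route_prompt`, the `threshold` value, and the decision strings below are illustrative assumptions, not part of the model; the input dict matches the `pipeline` output format shown above:

```python
def route_prompt(pred: dict, threshold: float = 0.9) -> str:
    """Map one classifier prediction, e.g. {'label': 'CODE', 'score': 0.99},
    to a routing decision for an LLM pipeline.

    The threshold and the decision names ("reject" / "forward") are
    illustrative; choose a policy that fits your own stack.
    """
    if pred["label"] == "CODE" and pred["score"] >= threshold:
        return "reject"   # or: rate-limit, or route to a code-specific policy
    return "forward"
```

Requiring a high score before rejecting keeps borderline inputs (see the limitations below) flowing to the LLM rather than being blocked on a low-confidence `CODE` label.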


## Intended Use & Limitations

| ✓ Recommended                     | ✗ Not a Good Fit                          |
|-----------------------------------|-------------------------------------------|
| Prompt filtering in LLM stacks    | Detecting specific programming languages  |
| Pre-screening user inputs in chat | Judging code quality or style            |
| Moderating public text fields     | Detecting tiny inline code tokens in very long documents |
| Fast, low-latency inference (≈1 ms on GPU) | Multilingual detection outside Turkish |

The classifier was trained **only on Turkish natural-language text** plus polyglot code snippets.  
Text in unseen languages (e.g. Japanese) may be mislabelled `NL`.  
Very short, ambiguous strings (e.g. `"int"`) can be mislabelled `CODE`.


## Training Data

| Split | Total | **NL** | **CODE** |
|-------|------:|---------:|-------:|
| Train | **316 732** | 251 518 | 65 214 |
| Dev   | 39 591 | 31 439 | 8 152 |
| Test  | 39 592 | 31 440 | 8 152 |
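For reference, a quick arithmetic check of the imbalance implied by the counts above. `CODE` is the minority class (about 20 % of the data), which is what makes the *reversed* class weights under Training Hyperparameters notable: the majority `NL` class receives the larger weight.

```python
# Counts copied from the training-data table above.
train_nl, train_code = 251_518, 65_214

total = train_nl + train_code   # 316_732, matching the table's Train total
ratio = train_nl / train_code   # ≈ 3.86 NL examples per CODE example
```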


### Training Hyperparameters

| Setting | Value |
|---------|-------|
| Optimiser | AdamW |
| Effective batch | 32 (2 × 16, fp16) |
| LR scheduler | linear-decay, warm-up 0 |
| Max length | 256 tokens |
| Epochs | ≤ 10 (early-stopping at 6 k steps ≈ 0.30 epoch) |
| Loss | **Cross-entropy with *reversed* class weights**<br>`weight_NL = 10.0`  `weight_CODE = 1.0` |
| Label smoothing | 0.1 |
| Hardware | 1 × A100 40 GB (Google Colab) |
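The loss row above combines per-class weights with label smoothing. Below is a minimal pure-Python sketch for a single two-class example, under one common convention (smoothed target `q = (1 − s)·one_hot + s/K`, with the whole example scaled by the true class's weight); the exact implementation used in training is not shown in this card and may differ in how it combines the two:

```python
import math

def weighted_smoothed_ce(logits, target, weights=(10.0, 1.0), smoothing=0.1):
    """Weighted cross-entropy with label smoothing for one example.

    logits    -- (logit_NL, logit_CODE)
    target    -- 0 for NL, 1 for CODE
    weights   -- per-class weights (NL = 10.0, CODE = 1.0, as in the table)
    smoothing -- label-smoothing factor s (0.1, as in the table)
    """
    k = len(logits)
    # log-softmax, stabilised by subtracting the max logit
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    # smoothed target distribution: (1 - s) on the true class plus s/K everywhere
    q = [(1 - smoothing) * (1.0 if c == target else 0.0) + smoothing / k
         for c in range(k)]
    return -weights[target] * sum(qc * lp for qc, lp in zip(q, log_probs))
```

With equal logits the predicted probabilities are 0.5/0.5 regardless of smoothing, so under these weights an `NL` example costs exactly 10× as much as a `CODE` example.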


## Evaluation

| Split | Acc | Prec | Recall | F1 |
|-------|----:|-----:|-------:|---:|
| Train | 0.9960 | 0.9978 | 0.9827 | 0.9902 |
| Dev   | 0.9957 | 0.9981 | 0.9807 | 0.9894 |
| Test  | 0.9954 | 0.9968 | 0.9807 | 0.9887 |

All metrics were computed with `id2label = {0: "NL", 1: "CODE"}`.
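The reported F1 scores follow from precision and recall via the standard identity F1 = 2·P·R / (P + R); for the test split:

```python
precision, recall = 0.9968, 0.9807   # test-split values from the table above

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9887, matching the table
```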