---
library_name: transformers
tags:
- text-classification
- code-classification
- code-detection
license: apache-2.0
language:
- tr
base_model:
- dbmdz/electra-base-turkish-mc4-uncased-discriminator
pipeline_tag: text-classification
---

## Model Card

A lightweight **binary classifier** that decides whether a Turkish input string is fully or partially **code (`CODE`)** or ordinary **natural language (`NL`)**.
The model is designed as a *guard-rail component* in LLM pipelines:
if a user prompt is classified as `CODE`, upstream orchestration can refuse to forward it to the LLM, apply rate limits, or route it to a different policy.
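
For example, a guard-rail hook could look like the minimal sketch below (the `handle_prompt` helper and the two routing outcomes are hypothetical placeholders, not part of this repository):

```python
from transformers import pipeline

clf = pipeline("text-classification", model="yeniguno/turkish-code-detector")

def handle_prompt(prompt: str) -> str:
    """Hypothetical guard-rail: divert CODE prompts away from the main LLM."""
    label = clf(prompt)[0]["label"]
    if label == "CODE":
        # e.g. refuse, rate-limit, or hand off to a code-specific policy
        return "route_to_code_policy"
    return "forward_to_llm"
```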

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="yeniguno/turkish-code-detector",
    tokenizer="yeniguno/turkish-code-detector",
)

# A recursive factorial function ("faktoriyel" is Turkish for "factorial")
prompt = "def faktoriyel(n):\n return 1 if n <= 1 else n * faktoriyel(n-1)"
result = clf(prompt)
print(f"Classification: {result}\n")
# Classification: [{'label': 'CODE', 'score': 0.999995231628418}]

# "Who is the creator of Linux, do you know?"
prompt = "Linux'un yaratıcısı kimdir, biliyor musun?"
result = clf(prompt)
print(f"Classification: {result}\n")
# Classification: [{'label': 'NL', 'score': 0.9998611211776733}]
```
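
If you need raw class probabilities rather than the pipeline's top label (for example to apply your own decision threshold), a minimal sketch using the standard `AutoTokenizer`/`AutoModelForSequenceClassification` APIs; the 0.5 cutoff here is an arbitrary assumption, equivalent to argmax:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("yeniguno/turkish-code-detector")
model = AutoModelForSequenceClassification.from_pretrained("yeniguno/turkish-code-detector")
model.eval()

inputs = tok("print('merhaba dünya')", return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

# id2label = {0: "NL", 1: "CODE"}
p_code = probs[1].item()
label = "CODE" if p_code >= 0.5 else "NL"  # 0.5 = plain argmax-style cutoff
print(label, p_code)
```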

## Intended Use & Limitations

| ✓ Recommended | ✗ Not a Good Fit |
|---------------|------------------|
| Prompt filtering in LLM stacks | Detecting specific programming languages |
| Pre-screening user inputs in chat | Judging code quality or style |
| Moderating public text fields | Detecting tiny inline code tokens in very long documents |
| Fast, low-latency inference (≈ 1 ms on GPU) | Multilingual detection outside Turkish |

The classifier was trained **only on Turkish text** plus polyglot code snippets.
Natural-language input in unseen languages (e.g. Japanese) may be mis-classified.
Very short, ambiguous strings (e.g. `"int"`) can be mis-labelled `CODE`; one possible mitigation is sketched below.
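
A minimal mitigation sketch under assumed values (the 0.9 confidence threshold and the 8-character minimum length are illustrative, not tuned):

```python
from transformers import pipeline

clf = pipeline("text-classification", model="yeniguno/turkish-code-detector")

def is_code(text: str, threshold: float = 0.9, min_len: int = 8) -> bool:
    """Require high confidence for CODE and skip very short inputs.

    `threshold` and `min_len` are illustrative assumptions, not tuned values.
    """
    if len(text.strip()) < min_len:
        return False  # too short to classify reliably; default to NL
    pred = clf(text)[0]
    return pred["label"] == "CODE" and pred["score"] >= threshold
```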

## Training Data

| Split | Total | **NL** | **CODE** |
|-------|------:|-------:|---------:|
| Train | **316 732** | 251 518 | 65 214 |
| Dev | 39 591 | 31 439 | 8 152 |
| Test | 39 592 | 31 440 | 8 152 |

### Training Hyperparameters

| Setting | Value |
|---------|-------|
| Optimiser | AdamW |
| Effective batch | 32 (2 × 16, fp16) |
| LR scheduler | linear decay, no warm-up |
| Max length | 256 tokens |
| Epochs | ≤ 10 (early stopping at 6 k steps ≈ 0.30 epoch) |
| Loss | **Cross-entropy with *reversed* class weights**<br>`weight_NL = 10.0`, `weight_CODE = 1.0` |
| Label smoothing | 0.1 |
| Hardware | 1 × A100 40 GB (Google Colab) |
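
Weighted cross-entropy with label smoothing is not a stock `TrainingArguments` option, so one way to reproduce it is a small `Trainer` subclass. The sketch below is our assumption of how it could be wired up (`WeightedTrainer` is a hypothetical name, not from this repository):

```python
import torch
from torch import nn
from transformers import Trainer

class WeightedTrainer(Trainer):
    """Cross-entropy with class weights {NL: 10.0, CODE: 1.0} and label smoothing 0.1."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = nn.CrossEntropyLoss(
            weight=torch.tensor([10.0, 1.0], device=outputs.logits.device),
            label_smoothing=0.1,
        )
        loss = loss_fct(outputs.logits.view(-1, 2), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```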

## Evaluation

| Split | Accuracy | Precision | Recall | F1 |
|-------|---------:|----------:|-------:|---:|
| Train | 0.9960 | 0.9978 | 0.9827 | 0.9902 |
| Dev | 0.9957 | 0.9981 | 0.9807 | 0.9894 |
| Test | 0.9954 | 0.9968 | 0.9807 | 0.9887 |

All metrics were computed with `id2label = {0: "NL", 1: "CODE"}`; precision, recall, and F1 refer to the positive class `CODE`.
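
For reference, a `compute_metrics` function consistent with this table could look like the sketch below, assuming scikit-learn binary metrics with label 1 (`CODE`) as the positive class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # pos_label=1 corresponds to "CODE" under id2label = {0: "NL", 1: "CODE"}
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", pos_label=1
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```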