---
library_name: transformers
tags:
  - text-classification
  - code-classification
  - code-detection
license: apache-2.0
language:
  - tr
base_model:
  - dbmdz/electra-base-turkish-mc4-uncased-discriminator
pipeline_tag: text-classification
---

# Model Card

A lightweight binary classifier that determines whether a Turkish input string is pure or partial code (`CODE`) or ordinary natural language (`NL`).
The model is designed as a guard-rail component in LLM pipelines:
if a user prompt is classified as `CODE`, upstream orchestration can refuse to forward it to the LLM, apply rate limits, or route it to a different policy.
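
For illustration, here is a minimal sketch of such a guard-rail. The `route_prompt` helper and its return values are hypothetical and not part of this model; only the `pipeline` call reflects the documented usage.

```python
from transformers import pipeline

# Load the classifier once at startup.
clf = pipeline("text-classification", model="yeniguno/turkish-code-detector")

def route_prompt(prompt: str) -> str:
    """Return a routing decision ('block' or 'forward') for a user prompt."""
    pred = clf(prompt)[0]  # e.g. {'label': 'CODE', 'score': 0.99}
    if pred["label"] == "CODE":
        # Here one could refuse, rate-limit, or hand off to a code policy.
        return "block"
    return "forward"

print(route_prompt("def f(x): return x * 2"))  # expected: block
```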

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import pipeline

clf = pipeline("text-classification",
               model="yeniguno/turkish-code-detector",
               tokenizer="yeniguno/turkish-code-detector")

prompt = "def faktoriyel(n):\n    return 1 if n <= 1 else n * faktoriyel(n-1)"
result = clf(prompt)
print(f"Classification: {result}\n")
# Classification: [{'label': 'CODE', 'score': 0.999995231628418}]

# "Do you know who the creator of Linux is?"
prompt = "Linux'un yaratıcısı kimdir, biliyor musun?"
result = clf(prompt)
print(f"Classification: {result}\n")
# Classification: [{'label': 'NL', 'score': 0.9998611211776733}]
```

## Intended Use & Limitations

| ✓ Recommended | ✗ Not a Good Fit |
|---|---|
| Prompt filtering in LLM stacks | Detecting specific programming languages |
| Pre-screening user inputs in chat | Judging code quality or style |
| Moderating public text fields | Detecting tiny inline code tokens in very long documents |
| Fast, low-latency inference (≈1 ms on GPU) | Multilingual detection outside Turkish |

The classifier was trained only on Turkish natural-language text plus polyglot code snippets.
Inputs in unseen languages (e.g. Japanese text) may be mislabelled as `NL`.
Very short, ambiguous strings (e.g. `"int"`) may be mislabelled as `CODE`; a confidence threshold can mitigate this, as sketched below.
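
A minimal sketch of such a threshold check. The 0.9 cut-off is a hypothetical value chosen for illustration, not a recommendation from this model card; tune it on your own data.

```python
from transformers import pipeline

clf = pipeline("text-classification", model="yeniguno/turkish-code-detector")

# Hypothetical confidence cut-off: low-confidence CODE predictions
# on short inputs are treated as NL instead of being blocked outright.
THRESHOLD = 0.9

def is_code(text: str) -> bool:
    pred = clf(text)[0]
    return pred["label"] == "CODE" and pred["score"] >= THRESHOLD

# A short ambiguous token; depending on its score it may fall below the cut-off.
print(is_code("int"))
```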

## Training Data

| Split | Total | NL | CODE |
|---|---|---|---|
| Train | 316,732 | 251,518 | 65,214 |
| Dev | 39,591 | 31,439 | 8,152 |
| Test | 39,592 | 31,440 | 8,152 |

## Training Hyperparameters

| Setting | Value |
|---|---|
| Optimiser | AdamW |
| Effective batch size | 32 (2 × 16, fp16) |
| LR scheduler | linear decay, 0 warm-up steps |
| Max length | 256 tokens |
| Epochs | ≤ 10 (early stopping at 6 k steps ≈ 0.30 epoch) |
| Loss | Cross-entropy with reversed class weights (`weight_NL = 10.0`, `weight_CODE = 1.0`) |
| Label smoothing | 0.1 |
| Hardware | 1 × A100 40 GB (Google Colab) |
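
As an illustration, the loss configuration above could be reproduced with a custom `Trainer` along these lines. This is a hedged sketch under the stated hyperparameters, not the original training script; the class name is invented here.

```python
import torch
from transformers import Trainer

class WeightedTrainer(Trainer):
    """Trainer with the class-weighted, label-smoothed loss from the table above."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # id 0 = NL (weight 10.0), id 1 = CODE (weight 1.0), smoothing 0.1
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=torch.tensor([10.0, 1.0], device=outputs.logits.device),
            label_smoothing=0.1,
        )
        loss = loss_fct(outputs.logits.view(-1, 2), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```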

## Evaluation

| Split | Acc | Prec | Recall | F1 |
|---|---|---|---|---|
| Train | 0.9960 | 0.9978 | 0.9827 | 0.9902 |
| Dev | 0.9957 | 0.9981 | 0.9807 | 0.9894 |
| Test | 0.9954 | 0.9968 | 0.9807 | 0.9887 |

All metrics were computed with `id2label = {0: "NL", 1: "CODE"}`.
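
To confirm the label mapping on your side, the model config can be inspected directly with the standard `transformers` API:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("yeniguno/turkish-code-detector")
print(config.id2label)  # expected: {0: 'NL', 1: 'CODE'}
```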