---
library_name: transformers
tags:
- text-classification
- code-classification
- code-detection
license: apache-2.0
language:
- tr
base_model:
- dbmdz/electra-base-turkish-mc4-uncased-discriminator
pipeline_tag: text-classification
---

## Model Card

A lightweight **binary classifier** that decides whether a Turkish input string is fully or partially **code (`CODE`)** or ordinary **natural language (`NL`)**.
The model is designed as a *guard-rail component* in LLM pipelines:
if a user prompt is classified as `CODE`, upstream orchestration can refuse to forward it to the LLM, apply rate limits, or route it to a different policy.
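
For example, a guard-rail hook could look like the minimal sketch below (the `handle_prompt` helper and the two routing outcomes are hypothetical placeholders, not part of this repository):

```python
from transformers import pipeline

clf = pipeline("text-classification", model="yeniguno/turkish-code-detector")

def handle_prompt(prompt: str) -> str:
    """Hypothetical guard-rail: divert CODE prompts away from the main LLM."""
    label = clf(prompt)[0]["label"]
    if label == "CODE":
        # e.g. refuse, rate-limit, or hand off to a code-specific policy
        return "route_to_code_policy"
    return "forward_to_llm"
```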

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="yeniguno/turkish-code-detector",
    tokenizer="yeniguno/turkish-code-detector",
)

# A recursive factorial function ("faktoriyel" is Turkish for "factorial")
prompt = "def faktoriyel(n):\n return 1 if n <= 1 else n * faktoriyel(n-1)"
result = clf(prompt)
print(f"Classification: {result}\n")
# Classification: [{'label': 'CODE', 'score': 0.999995231628418}]

# "Who is the creator of Linux, do you know?"
prompt = "Linux'un yaratıcısı kimdir, biliyor musun?"
result = clf(prompt)
print(f"Classification: {result}\n")
# Classification: [{'label': 'NL', 'score': 0.9998611211776733}]
```
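
If you need raw class probabilities rather than the pipeline's top label (for example to apply your own decision threshold), a minimal sketch using the standard `AutoTokenizer`/`AutoModelForSequenceClassification` APIs; the 0.5 cutoff here is an arbitrary assumption, equivalent to argmax:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("yeniguno/turkish-code-detector")
model = AutoModelForSequenceClassification.from_pretrained("yeniguno/turkish-code-detector")
model.eval()

inputs = tok("print('merhaba dünya')", return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

# id2label = {0: "NL", 1: "CODE"}
p_code = probs[1].item()
label = "CODE" if p_code >= 0.5 else "NL"  # 0.5 = plain argmax-style cutoff
print(label, p_code)
```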

## Intended Use & Limitations

| ✓ Recommended | ✗ Not a Good Fit |
|---------------|------------------|
| Prompt filtering in LLM stacks | Detecting specific programming languages |
| Pre-screening user inputs in chat | Judging code quality or style |
| Moderating public text fields | Detecting tiny inline code tokens in very long documents |
| Fast, low-latency inference (≈ 1 ms on GPU) | Multilingual detection outside Turkish |

The classifier was trained **only on Turkish text** plus polyglot code snippets.
Natural-language input in unseen languages (e.g. Japanese) may be mis-classified.
Very short, ambiguous strings (e.g. `"int"`) can be mis-labelled `CODE`; one possible mitigation is sketched below.
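
A minimal mitigation sketch under assumed values (the 0.9 confidence threshold and the 8-character minimum length are illustrative, not tuned):

```python
from transformers import pipeline

clf = pipeline("text-classification", model="yeniguno/turkish-code-detector")

def is_code(text: str, threshold: float = 0.9, min_len: int = 8) -> bool:
    """Require high confidence for CODE and skip very short inputs.

    `threshold` and `min_len` are illustrative assumptions, not tuned values.
    """
    if len(text.strip()) < min_len:
        return False  # too short to classify reliably; default to NL
    pred = clf(text)[0]
    return pred["label"] == "CODE" and pred["score"] >= threshold
```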

## Training Data

| Split | Total | **NL** | **CODE** |
|-------|------:|-------:|---------:|
| Train | **316 732** | 251 518 | 65 214 |
| Dev | 39 591 | 31 439 | 8 152 |
| Test | 39 592 | 31 440 | 8 152 |

### Training Hyperparameters

| Setting | Value |
|---------|-------|
| Optimiser | AdamW |
| Effective batch | 32 (2 × 16, fp16) |
| LR scheduler | linear decay, no warm-up |
| Max length | 256 tokens |
| Epochs | ≤ 10 (early stopping at 6 k steps ≈ 0.30 epoch) |
| Loss | **Cross-entropy with *reversed* class weights**<br>`weight_NL = 10.0`, `weight_CODE = 1.0` |
| Label smoothing | 0.1 |
| Hardware | 1 × A100 40 GB (Google Colab) |
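
Weighted cross-entropy with label smoothing is not a stock `TrainingArguments` option, so one way to reproduce it is a small `Trainer` subclass. The sketch below is our assumption of how it could be wired up (`WeightedTrainer` is a hypothetical name, not from this repository):

```python
import torch
from torch import nn
from transformers import Trainer

class WeightedTrainer(Trainer):
    """Cross-entropy with class weights {NL: 10.0, CODE: 1.0} and label smoothing 0.1."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = nn.CrossEntropyLoss(
            weight=torch.tensor([10.0, 1.0], device=outputs.logits.device),
            label_smoothing=0.1,
        )
        loss = loss_fct(outputs.logits.view(-1, 2), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```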

## Evaluation

| Split | Accuracy | Precision | Recall | F1 |
|-------|---------:|----------:|-------:|---:|
| Train | 0.9960 | 0.9978 | 0.9827 | 0.9902 |
| Dev | 0.9957 | 0.9981 | 0.9807 | 0.9894 |
| Test | 0.9954 | 0.9968 | 0.9807 | 0.9887 |

All metrics were computed with `id2label = {0: "NL", 1: "CODE"}`; precision, recall, and F1 refer to the positive class `CODE`.
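
For reference, a `compute_metrics` function consistent with this table could look like the sketch below, assuming scikit-learn binary metrics with label 1 (`CODE`) as the positive class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # pos_label=1 corresponds to "CODE" under id2label = {0: "NL", 1: "CODE"}
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", pos_label=1
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```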