Isa0
/

language-detection

Isa0 committed 15 days ago

Commit 455cb0d · 1 Parent(s): 52d9a6f

feat: add README

Files changed (1)

README.md +52 -0

@@ -1,3 +1,55 @@
 ---
 license: mit
 ---
+# Language Detection
+A lightweight language detection tool that uses character-level n-gram features and logistic regression to identify the language of a given text.
+Supported languages out of the box: English, French, German, Turkish.
+Model repository: https://huggingface.co/Isa0/language-detection/
+## Installation
+Requires Python 3.11 or higher. Install dependencies with [uv](https://github.com/astral-sh/uv):
+```bash
+uv sync
+```
+## Usage
+### Train
+Train the model on the datasets in the `datasets/` directory:
+```bash
+uv run main.py --train
+```
+You can point it to a different directory with `--dir`:
+```bash
+uv run main.py --train --dir path/to/datasets
+```
+Each `.txt` file in the directory should contain one sentence per line. The filename (without extension) is used as the language label.
+### Detect
+Detect the language of a text string:
+```bash
+uv run main.py --detect "Bonjour, comment allez-vous?"
+```
+Output includes the predicted language and a confidence score.
+## Adding Languages
+Add a new `.txt` file to the `datasets/` directory named after the language (e.g. `spanish.txt`), with one sentence per line, then retrain.
+## How It Works
+Text is converted into character-level n-gram counts (1 to 3 characters), which capture language-specific patterns like accents, letter combinations, and suffixes. A logistic regression classifier is trained on these features and saved to disk for reuse.
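
The feature step described under "How It Works" can be illustrated with a minimal, standard-library-only sketch. It is not taken from `main.py`: the function name `char_ngram_counts` and its parameters are hypothetical, and the actual project may build its features differently (for example with a library vectorizer). The sketch only shows what "character-level n-gram counts (1 to 3 characters)" means for a single string.

```python
from collections import Counter


def char_ngram_counts(text: str, n_min: int = 1, n_max: int = 3) -> Counter:
    """Count every character n-gram of length n_min..n_max in text.

    Hypothetical helper for illustration; not part of the repository.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text, one character at a time.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# "allez" yields 5 unigrams, 4 bigrams, and 3 trigrams.
counts = char_ngram_counts("allez")
print(counts.most_common(3))
```

Counts like these, collected over a shared vocabulary of n-grams, would form the feature vector that a logistic regression classifier is trained on. Short n-grams such as `ll` or `ez` are the kind of letter combinations the README refers to as language-specific patterns.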