Isa0 committed on
Commit 455cb0d · 1 Parent(s): 52d9a6f

feat: add README

  1. README.md +52 -0
README.md CHANGED
@@ -1,3 +1,55 @@
  ---
  license: mit
  ---
+
+ # Language Detection
+
+ A lightweight language detection tool that uses character-level n-gram features and logistic regression to identify the language of a given text.
+
+ Supported languages out of the box: English, French, German, Turkish.
+
+ Model repository: https://huggingface.co/Isa0/language-detection/
+
+ ## Installation
+
+ Requires Python 3.11 or higher. Install dependencies with [uv](https://github.com/astral-sh/uv):
+
+ ```bash
+ uv sync
+ ```
+
+ ## Usage
+
+ ### Train
+
+ Train the model on the datasets in the `datasets/` directory:
+
+ ```bash
+ uv run main.py --train
+ ```
+
+ You can point it to a different directory with `--dir`:
+
+ ```bash
+ uv run main.py --train --dir path/to/datasets
+ ```
+
+ Each `.txt` file in the directory should contain one sentence per line. The filename (without extension) is used as the language label.
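
The dataset layout described above can be read with a few lines of standard-library Python. This is a minimal sketch, not the code in `main.py`: the function name `load_datasets` and the choice to skip blank lines are assumptions made for illustration.

```python
from pathlib import Path


def load_datasets(directory: str) -> tuple[list[str], list[str]]:
    """Read every .txt file in `directory` and return (sentences, labels).

    Hypothetical helper: the label of each sentence is the name of the
    file it came from, without the .txt extension.
    """
    sentences: list[str] = []
    labels: list[str] = []
    # Sort the files so the result is deterministic across platforms.
    for path in sorted(Path(directory).glob("*.txt")):
        label = path.stem  # e.g. "french" for "french.txt"
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line:  # assumption: blank lines carry no sentence
                sentences.append(line)
                labels.append(label)
    return sentences, labels
```

Calling `load_datasets("datasets")` would return two parallel lists, one of sentences and one of language labels, which is the shape most classifiers expect for training.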
+
+ ### Detect
+
+ Detect the language of a text string:
+
+ ```bash
+ uv run main.py --detect "Bonjour, comment allez-vous?"
+ ```
+
+ Output includes the predicted language and a confidence score.
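
The README does not say how the confidence score is computed. One common choice for a multi-class logistic regression is to report the highest class probability, obtained by applying a softmax to the per-language scores. The sketch below illustrates that idea only; the function names and the example scores are hypothetical and not taken from `main.py`.

```python
import math


def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Turn raw per-language scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {lang: math.exp(s - peak) for lang, s in scores.items()}
    total = sum(exps.values())
    return {lang: e / total for lang, e in exps.items()}


def predict_with_confidence(scores: dict[str, float]) -> tuple[str, float]:
    """Return the most probable language and its probability."""
    probs = softmax(scores)
    best = max(probs, key=lambda lang: probs[lang])
    return best, probs[best]


# Made-up scores for illustration; a real model would produce these.
language, confidence = predict_with_confidence(
    {"french": 2.0, "english": 0.0, "german": -1.0, "turkish": -1.0}
)
print(f"{language} ({confidence:.2f})")
```

A score close to 1 means the model strongly prefers one language; a score close to 0.25 with four languages means it is effectively guessing.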
+
+ ## Adding Languages
+
+ Add a new `.txt` file to the `datasets/` directory named after the language (e.g. `spanish.txt`), with one sentence per line, then retrain.
+
+ ## How It Works
+
+ Text is converted into character-level n-gram counts (1 to 3 characters), which capture language-specific patterns like accents, letter combinations, and suffixes. A logistic regression classifier is trained on these features and saved to disk for reuse.
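
To make the feature step concrete, here is a minimal standard-library sketch of counting character n-grams of length 1 to 3. It illustrates the technique only; the function name `char_ngrams` is hypothetical, and the actual project may rely on a library vectorizer instead.

```python
from collections import Counter


def char_ngrams(text: str, n_min: int = 1, n_max: int = 3) -> Counter[str]:
    """Count every substring of length n_min..n_max in `text`."""
    counts: Counter[str] = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


features = char_ngrams("allez-vous")
print(features.most_common(5))
```

Each distinct n-gram becomes one feature column, and the classifier learns a weight per language for each column. If scikit-learn were used, this would roughly correspond to `CountVectorizer(analyzer="char", ngram_range=(1, 3))` followed by `LogisticRegression`, though the README does not confirm which library the project uses.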