---
language:
- km
- en
tags:
- object-detection
- text-detection
- yolo
- yolo11
- khmer
- ultralytics
- pytorch
license: mit
---

# mini-text-detection: Khmer & English Text Detection

A **YOLO11n**-based text detection model fine-tuned to locate and classify text regions in images containing **Khmer and English** content. It detects three types of text block and can serve as the first stage of a pipeline that passes crops to an OCR model (e.g. [phonsobon/mini-ocr](https://huggingface.co/phonsobon/mini-ocr)).

---

## Model Details

| Property | Value |
|----------|-------|
| Architecture | YOLO11n (nano) |
| Task | Object detection (3 classes) |
| Weights file | `khmer-text-detection-mini.pt` |
| Framework | Ultralytics / PyTorch |
| Input | RGB image, any size (auto-resized internally) |

---

## Classes

| ID | Name | Description |
|----|------|-------------|
| `0` | `subject` | Title or heading text |
| `1` | `reference` | Reference, label, or metadata text |
| `2` | `content` | Main body / paragraph text |

---

## Files

| File | Description |
|------|-------------|
| `khmer-text-detection-mini.pt` | Full Ultralytics YOLO model (weights + config) |

---

## Quick Start

### Install dependencies

```bash
pip install ultralytics huggingface_hub
```

### Run inference

```python
from ultralytics import YOLO
from huggingface_hub import hf_hub_download

# ── Download model ────────────────────────────────────────────────────────────
model_path = hf_hub_download(
    repo_id="phonsobon/mini-text-detection",
    filename="khmer-text-detection-mini.pt",
)

# ── Class names ───────────────────────────────────────────────────────────────
CLASS_NAMES = {0: "subject", 1: "reference", 2: "content"}

# ── Load & predict ────────────────────────────────────────────────────────────
model = YOLO(model_path)
results = model.predict(
    source="your_image.jpg",  # path, URL, or numpy array
    conf=0.25,                # confidence threshold
    iou=0.45,                 # NMS IoU threshold
    imgsz=640,
)

# ── Print results ─────────────────────────────────────────────────────────────
for r in results:
    r.show()  # display with bounding boxes
    for box in r.boxes:
        cls_id = int(box.cls)
        label = CLASS_NAMES[cls_id]
        conf = float(box.conf)
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"[{label}] conf={conf:.2f} box=({x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f})")
```

### Filter by class

```python
# Reuses `results` from the inference example above.

# Get only subject (heading) boxes
subject_boxes = [b for b in results[0].boxes if int(b.cls) == 0]

# Get only content (body) boxes
content_boxes = [b for b in results[0].boxes if int(b.cls) == 2]
```

### Save annotated images

```python
results = model.predict(source="your_image.jpg", save=True, project="runs/detect")
# Saved to runs/detect/predict/
```

### Batch inference on a folder

```python
results = model.predict(source="path/to/images/", conf=0.25, imgsz=640)

for r in results:
    # Count detections per class for each image
    counts = {name: 0 for name in CLASS_NAMES.values()}
    for box in r.boxes:
        counts[CLASS_NAMES[int(box.cls)]] += 1
    print(r.path, "→", counts)
```

---

## Crop + OCR Pipeline

Combine this model with [phonsobon/mini-ocr](https://huggingface.co/phonsobon/mini-ocr) for end-to-end document reading, with each region labelled by type:

```python
from ultralytics import YOLO
from huggingface_hub import hf_hub_download
from PIL import Image

CLASS_NAMES = {0: "subject", 1: "reference", 2: "content"}

# ── Load detection model ──────────────────────────────────────────────────────
det_path = hf_hub_download("phonsobon/mini-text-detection", "khmer-text-detection-mini.pt")
detector = YOLO(det_path)

# ── Detect text regions ───────────────────────────────────────────────────────
image_path = "your_image.jpg"
results = detector.predict(source=image_path, conf=0.25, imgsz=640)
img = Image.open(image_path).convert("RGB")

# ── Crop each region and tag it with its class ────────────────────────────────
# Boxes are processed in the order the detector returns them (not sorted).
for i, box in enumerate(results[0].boxes):
    cls_id = int(box.cls)
    label = CLASS_NAMES[cls_id]
    x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
    crop = img.crop((x1, y1, x2, y2))
    crop.save(f"crop_{i}_{label}.png")
    print(f"Saved crop {i} → class: {label}")
    # → feed each crop to phonsobon/mini-ocr for text recognition
```

---

## Input Tips

- Works on **any image size**: YOLO resizes internally to 640 px by default.
- Best results on **document photos, screenshots, and scanned pages**.
- Adjust `conf` (0.1 – 0.5) to trade recall against precision: lower values find more regions but admit more false positives.

---

## Limitations

- May miss very small text (< ~8 px height in the original image).
- Not designed for handwritten or heavily stylised/artistic fonts.
- Performance is best on document-style layouts similar to the training data.

---

## Related Model

| Model | Task |
|-------|------|
| [phonsobon/mini-ocr](https://huggingface.co/phonsobon/mini-ocr) | Text recognition (CRNN + CTC) for Khmer & English |

---

## License

MIT
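
---

## Appendix: Ordering Crops for OCR

The Crop + OCR pipeline saves crops in whatever order the detector returns them, which is not necessarily reading order. The sketch below is one possible way to sort regions top-to-bottom and then left-to-right before handing them to an OCR model. It is not part of this model or of Ultralytics: `Region`, `reading_order`, and the `row_tol` parameter are hypothetical names introduced here for illustration, and the sample boxes are made-up values.

```python
from typing import List, NamedTuple, Tuple

# Same class ids as CLASS_NAMES in the pipeline example.
CLASS_NAMES = {0: "subject", 1: "reference", 2: "content"}


class Region(NamedTuple):
    """One detection converted to plain Python values (hypothetical helper)."""
    cls_id: int
    conf: float
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def reading_order(regions: List[Region], row_tol: int = 10) -> List[Region]:
    """Sort regions top-to-bottom, then left-to-right.

    Top edges are bucketed into bands of row_tol pixels; boxes whose top
    edges land in the same band are treated as one row and ordered by x1.
    This is an approximation: two boxes a pixel apart can still fall on
    opposite sides of a band boundary.
    """
    return sorted(regions, key=lambda r: (r.box[1] // row_tol, r.box[0]))


# Made-up detections: a body block, a heading, and a label beside the heading.
regions = [
    Region(2, 0.91, (40, 120, 600, 300)),
    Region(0, 0.88, (40, 22, 400, 60)),
    Region(1, 0.75, (420, 25, 600, 55)),
]

ordered = reading_order(regions)
for r in ordered:
    print(CLASS_NAMES[r.cls_id], r.box)
```

With real detections, each `Region` could be built from a box using the same calls as the pipeline: `Region(int(box.cls), float(box.conf), tuple(map(int, box.xyxy[0].tolist())))`. Tune `row_tol` to roughly the text line height of your images.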