Spaces: AyoubChLin/classifier-general
AyoubChLin committed on Apr 29

Commit 50231a8

1 Parent(s): 3d9d878

feat: initial commit of Classifier General API with FastAPI

- Added Spacefile for deployment configuration.
- Created app structure with core, api, models, pipelines, routers, and services.
- Implemented classification and language detection pipelines.
- Integrated file extraction and storage services.
- Established API endpoints for classification, language detection, and file transformation.
- Added health check endpoints for service liveness and readiness.
- Configured Pydantic settings for environment-based configuration.
- Developed tests for route contracts and functionality.
- Included Docker Compose setup for local development and deployment.
- Documented architecture, decisions, and usage in the README and other markdown files.

Files changed (42)

.dockerignore +9 -0
.env.example +15 -0
.gitignore +7 -0
Dockerfile +19 -0
README.md +54 -1
Spacefile +10 -0
app/__init__.py +1 -0
app/api/__init__.py +1 -0
app/api/router.py +7 -0
app/core/__init__.py +1 -0
app/core/config.py +37 -0
app/core/exceptions.py +14 -0
app/main.py +22 -0
app/models/__init__.py +1 -0
app/models/label_config.py +18 -0
app/pipelines/__init__.py +1 -0
app/pipelines/classification_pipeline.py +50 -0
app/pipelines/text_pipeline.py +21 -0
app/routers/__init__.py +1 -0
app/routers/classification.py +67 -0
app/routers/health.py +18 -0
app/schemas/__init__.py +1 -0
app/schemas/classification.py +36 -0
app/services/__init__.py +1 -0
app/services/classifier_service.py +83 -0
app/services/extraction_service.py +70 -0
app/services/file_storage_service.py +29 -0
app/services/label_service.py +18 -0
app/services/language_service.py +41 -0
docker-compose.yml +27 -0
docs/README.md +33 -0
docs/explanation/architecture.md +129 -0
docs/explanation/decisions.md +80 -0
docs/how-to/deploy-with-docker-compose.md +35 -0
docs/how-to/run-locally.md +64 -0
docs/reference/api.md +39 -0
docs/reference/configuration.md +53 -0
docs/reference/runtime-state.md +48 -0
docs/tutorials/getting-started.md +79 -0
main.py +7 -0
requirements.txt +16 -0
tests/test_routes.py +61 -0

.dockerignore ADDED

	@@ -0,0 +1,9 @@

+.git
+__pycache__/
+*.pyc
+*.pyo
+*.pyd
+.pytest_cache/
+.env
+aws/
+awscliv2.zip

.env.example ADDED

	@@ -0,0 +1,15 @@

+APP_NAME=Classifier General API
+ENVIRONMENT=development
+DEBUG=false
+STATIC_DIR=static
+UPLOAD_SUBDIR=uploads
+CLASSIFIER_SPACE=https://ayoubchlin-ayoubchlin-stable-bart-mnli-cnn.hf.space/
+CLASSIFIER_API_NAME=/predict
+HUGGINGFACE_TOKEN=
+LANGUAGE_DETECTOR_URL=https://team-language-detector-languagedetector.hf.space/run/predict
+REQUEST_TIMEOUT_SECONDS=30
+DEFAULT_LABELS_CSV=news,sport,finance,politics

.gitignore ADDED

	@@ -0,0 +1,7 @@

+.space
+.env
+__pycache__/
+*.pyc
+.pytest_cache/
+static/uploads/
+static/*

Dockerfile ADDED

	@@ -0,0 +1,19 @@

+FROM python:3.11-slim
+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1
+WORKDIR /app
+RUN apt-get update \
+    && apt-get install -y --no-install-recommends tesseract-ocr curl \
+    && rm -rf /var/lib/apt/lists/*
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+COPY . .
+EXPOSE 4002
+CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "4002"]

README.md CHANGED

@@ -9,4 +9,57 @@ license: mit
 short_description: classifier-general
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# Classifier General API (Refactored)
+Refactored into a modular FastAPI backend with clear layers:
+- `app/routers`
+- `app/services`
+- `app/pipelines`
+- `app/schemas`
+- `app/models`
+- `app/core`
+## Preserved Endpoint Contract
+- `POST /api/classifier` -> returns label string
+- `POST /api/language` -> returns language string
+- `POST /api/transformer` -> returns `{ filename, content }`
+- `POST /classify` -> returns `{ label, language, type? }`
+- `POST /configlabel` -> returns labels array
+- `GET /labels` -> returns labels array
+Additional operational endpoints:
+- `GET /health/liveness`
+- `GET /health/readiness`
+- `GET /endpoint/`
+## Environment
+Copy and edit:
+```bash
+cp .env.example .env
+```
+Key vars:
+- `CLASSIFIER_SPACE`
+- `HUGGINGFACE_TOKEN`
+- `LANGUAGE_DETECTOR_URL`
+- `DEFAULT_LABELS_CSV`
+## Local Run
+```bash
+pip install -r requirements.txt
+uvicorn main:app --host 0.0.0.0 --port 4002 --reload
+```
+## Docker Run
+```bash
+docker compose up --build
+```
+## Tests
+```bash
+pytest -q
+```
+## Notes
+- OCR requires `tesseract-ocr` (installed in Dockerfile).
+- Supported extraction formats in this refactor: `.pdf`, `.docx`, `.xlsx`; images (`.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.tiff`, via OCR); and plain text files (`.txt`, `.md`, `.csv`, `.json`).
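
Note on the text endpoints listed in the README: `POST /api/classifier` and `POST /api/language` both take a JSON body with a single `text` field (see `TextInput` in `app/schemas/classification.py`). The sketch below only builds such a request with the standard library. `BASE_URL` assumes the local run on port 4002, and `build_text_request` is a hypothetical helper that is not part of the repository.

```python
import json
from urllib.request import Request

# Assumption: the API is running locally on the port used in the README.
BASE_URL = "http://localhost:4002"


def build_text_request(path: str, text: str) -> Request:
    """Build (but do not send) a JSON POST for one of the text endpoints."""
    body = json.dumps({"text": text}).encode("utf-8")
    return Request(
        f"{BASE_URL}{path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_text_request("/api/classifier", "Central banks raised interest rates again today")
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it. Texts shorter than four words after preprocessing are rejected with a 400 by the pipeline in this commit.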

Spacefile ADDED

	@@ -0,0 +1,10 @@

+# Spacefile Docs: https://go.deta.dev/docs/spacefile/v0
+v: 0
+micros:
+  - name: classifier-general
+    src: .
+    engine: python3.9
+    primary: true
+    public: true
+    run: uvicorn main:app
+    dev: uvicorn main:app --reload

app/__init__.py ADDED

	@@ -0,0 +1 @@

+"""Classifier General backend package."""

app/api/__init__.py ADDED

	@@ -0,0 +1 @@

+"""API router assembly."""

app/api/router.py ADDED

	@@ -0,0 +1,7 @@

+from fastapi import APIRouter
+from app.routers import classification, health
+api_router = APIRouter()
+api_router.include_router(health.router)
+api_router.include_router(classification.router)

app/core/__init__.py ADDED

	@@ -0,0 +1 @@

+"""Core settings and shared utilities."""

app/core/config.py ADDED

	@@ -0,0 +1,37 @@

+from functools import lru_cache
+from pathlib import Path
+from pydantic import Field
+from pydantic_settings import BaseSettings, SettingsConfigDict
+class Settings(BaseSettings):
+    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8", extra="ignore")
+    app_name: str = "Classifier General API"
+    environment: str = "development"
+    debug: bool = False
+    static_dir: Path = Path("static")
+    upload_subdir: str = "uploads"
+    classifier_space: str = "https://ayoubchlin-ayoubchlin-stable-bart-mnli-cnn.hf.space/"
+    classifier_api_name: str = "/predict"
+    huggingface_token: str | None = None
+    language_detector_url: str = "https://team-language-detector-languagedetector.hf.space/run/predict"
+    request_timeout_seconds: float = 30.0
+    default_labels_csv: str = Field(default="news,sport,finance,politics")
+    @property
+    def upload_dir(self) -> Path:
+        return self.static_dir / self.upload_subdir
+@lru_cache
+def get_settings() -> Settings:
+    return Settings()
+settings = get_settings()
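
Note on `app/core/config.py`: the settings class reads each field from the environment (or `.env`) and falls back to the default in the class body, and `upload_dir` is derived from two of those fields. The sketch below imitates that precedence with the standard library only, for a subset of the variables in `.env.example`. `EnvSettings` and `load_settings` are hypothetical names, and this is not how `pydantic-settings` is implemented.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class EnvSettings:
    static_dir: Path
    upload_subdir: str
    request_timeout_seconds: float

    @property
    def upload_dir(self) -> Path:
        # Derived value, mirroring Settings.upload_dir in the diff above.
        return self.static_dir / self.upload_subdir


def load_settings(env: Optional[Mapping[str, str]] = None) -> EnvSettings:
    """Read settings from a mapping (os.environ by default), with defaults."""
    env = os.environ if env is None else env
    return EnvSettings(
        static_dir=Path(env.get("STATIC_DIR", "static")),
        upload_subdir=env.get("UPLOAD_SUBDIR", "uploads"),
        request_timeout_seconds=float(env.get("REQUEST_TIMEOUT_SECONDS", "30")),
    )


defaults = load_settings({})
overridden = load_settings({"STATIC_DIR": "/data", "UPLOAD_SUBDIR": "incoming"})
print(defaults.upload_dir, overridden.upload_dir)
```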

app/core/exceptions.py ADDED

	@@ -0,0 +1,14 @@

+class ClassificationError(Exception):
+    pass
+class LanguageDetectionError(Exception):
+    pass
+class ExtractionError(Exception):
+    pass
+class ValidationError(Exception):
+    pass

app/main.py ADDED

	@@ -0,0 +1,22 @@

+from fastapi import FastAPI
+from fastapi.staticfiles import StaticFiles
+from app.api.router import api_router
+from app.core.config import settings
+settings.static_dir.mkdir(parents=True, exist_ok=True)
+settings.upload_dir.mkdir(parents=True, exist_ok=True)
+app = FastAPI(title=settings.app_name, debug=settings.debug)
+app.mount("/static", StaticFiles(directory=str(settings.static_dir)), name="static")
+app.include_router(api_router)
+@app.get("/endpoint/")
+def list_endpoints() -> list[dict]:
+    endpoints = []
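+    for route in app.routes:
+        # Not every route type exposes `methods` (e.g. the /static mount), so read it defensively.
+        methods = sorted((getattr(route, "methods", None) or set()) & {"GET", "POST", "PUT", "DELETE"})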
+        if methods:
+            endpoints.append({"endpoint": route.path, "methods": methods})
+    return endpoints

app/models/__init__.py ADDED

	@@ -0,0 +1 @@

+"""Domain models."""

app/models/label_config.py ADDED

	@@ -0,0 +1,18 @@

+from dataclasses import dataclass, field
+from threading import Lock
+@dataclass
+class LabelConfig:
+    labels: list[str] = field(default_factory=list)
+    _lock: Lock = field(default_factory=Lock)
+    def get_labels(self) -> list[str]:
+        with self._lock:
+            return list(self.labels)
+    def set_labels(self, labels: list[str]) -> list[str]:
+        cleaned = [label.strip() for label in labels if label and label.strip()]
+        with self._lock:
+            self.labels = cleaned
+            return list(self.labels)
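
Note on `app/models/label_config.py`: both accessors return `list(self.labels)` rather than the list itself, so a caller that mutates the result cannot change the shared state outside the lock. A minimal sketch of that behavior follows; `LabelStore` is a hypothetical stand-in that mirrors the pattern, not the class from the diff.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class LabelStore:
    """Hypothetical stand-in mirroring the locking pattern of LabelConfig."""

    labels: list = field(default_factory=list)
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False, compare=False)

    def get_labels(self) -> list:
        with self._lock:
            return list(self.labels)  # copy, so callers cannot mutate shared state

    def set_labels(self, labels: list) -> list:
        # Drop empty entries and surrounding whitespace before storing.
        cleaned = [label.strip() for label in labels if label and label.strip()]
        with self._lock:
            self.labels = cleaned
            return list(self.labels)


store = LabelStore()
store.set_labels([" news ", "", "sport"])
snapshot = store.get_labels()
snapshot.append("injected")  # only the local copy changes
print(store.get_labels())
```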

app/pipelines/__init__.py ADDED

	@@ -0,0 +1 @@

+"""ML/OCR processing pipelines."""

app/pipelines/classification_pipeline.py ADDED

	@@ -0,0 +1,50 @@

+from pathlib import Path
+from app.core.exceptions import ClassificationError, ExtractionError, LanguageDetectionError, ValidationError
+from app.pipelines.text_pipeline import preprocess_text
+from app.services.classifier_service import classifier_service
+from app.services.extraction_service import extraction_service
+from app.services.label_service import label_service
+from app.services.language_service import language_service
+class ClassificationPipeline:
+    def classify_text(self, text: str) -> str:
+        preprocessed_text = preprocess_text(text)
+        labels = label_service.get_labels()
+        return classifier_service.classify(preprocessed_text, labels)
+    def detect_language(self, text: str) -> str:
+        preprocessed_text = preprocess_text(text)
+        return language_service.detect_language(preprocessed_text)
+    def transform_file(self, original_filename: str, file_path: Path) -> str:
+        text = extraction_service.extract_text(original_filename, file_path)
+        if not text or not text.strip():
+            raise ExtractionError("No text extracted from file")
+        return text
+    def classify_file(self, original_filename: str, file_path: Path) -> dict:
+        text = self.transform_file(original_filename, file_path)
+        preprocessed_text = preprocess_text(text)
+        language = language_service.detect_language(preprocessed_text)
+        labels = label_service.get_labels()
+        topic = classifier_service.classify(preprocessed_text, labels)
+        result = {"label": topic, "language": language}
+        if language != "en":
+            result["type"] = "not english"
+        return result
+classification_pipeline = ClassificationPipeline()
+__all__ = [
+    "classification_pipeline",
+    "ClassificationError",
+    "LanguageDetectionError",
+    "ExtractionError",
+    "ValidationError",
+]

app/pipelines/text_pipeline.py ADDED

	@@ -0,0 +1,21 @@

+import re
+from app.core.exceptions import ValidationError
+MIN_WORDS = 4
+def preprocess_text(text: str) -> str:
+    if text is None:
+        raise ValidationError("Text is required")
+    cleaned = text.replace("\n", " ")
+    cleaned = re.sub(r"<[^>]+>", "", cleaned)
+    cleaned = re.sub(r"\s+", " ", cleaned)
+    cleaned = re.sub(r"[^\w\s$€%.,-/]|(?<=\d)[.,/](?=\d)", " ", cleaned).lower().strip()
+    if len(cleaned.split()) < MIN_WORDS:  # split() ignores the space runs left by the substitution above
+        raise ValidationError(f"Text must contain at least {MIN_WORDS} words after preprocessing")
+    return cleaned
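
Note on `app/pipelines/text_pipeline.py`: the cleanup passes run in a fixed order, and the symbol substitution comes after the whitespace collapse, so the cleaned string can still contain runs of spaces. The sketch below applies the same four passes to a hypothetical sample and compares two ways of counting words. `clean` and `SAMPLE` are illustrative names only.

```python
import re

# Hypothetical sample input, chosen to exercise each pass.
SAMPLE = "<p>Stocks rose 3.5% today!</p>\nMarkets & bonds"


def clean(text: str) -> str:
    """Apply the same four passes as preprocess_text, without the length check."""
    cleaned = text.replace("\n", " ")
    cleaned = re.sub(r"<[^>]+>", "", cleaned)  # strip HTML-like tags
    cleaned = re.sub(r"\s+", " ", cleaned)  # collapse whitespace
    # Replace disallowed symbols, and separators between digits, with spaces.
    cleaned = re.sub(r"[^\w\s$€%.,-/]|(?<=\d)[.,/](?=\d)", " ", cleaned)
    return cleaned.lower().strip()


cleaned = clean(SAMPLE)
naive_count = len(cleaned.split(" "))  # also counts empty strings between repeated spaces
word_count = len(cleaned.split())  # counts only non-empty tokens
print(repr(cleaned), naive_count, word_count)
```

Counting with `split()` is the safer choice for a minimum-word check, because punctuation-heavy input would otherwise inflate the count.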

app/routers/__init__.py ADDED

	@@ -0,0 +1 @@

+"""HTTP route modules."""

app/routers/classification.py ADDED

	@@ -0,0 +1,67 @@

+from fastapi import APIRouter, File, HTTPException, UploadFile, status
+from app.core.exceptions import ClassificationError, ExtractionError, LanguageDetectionError, ValidationError
+from app.pipelines.classification_pipeline import classification_pipeline
+from app.schemas.classification import FileClassifyResponse, FileTransformResponse, LabelUpdateInput, TextInput
+from app.services.file_storage_service import file_storage_service
+from app.services.label_service import label_service
+router = APIRouter(tags=["classification"])
+def _handle_exception(exc: Exception) -> None:
+    if isinstance(exc, ValidationError):
+        raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail=str(exc)) from exc
+    if isinstance(exc, ExtractionError):
+        raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail=str(exc)) from exc
+    if isinstance(exc, (ClassificationError, LanguageDetectionError)):
+        raise HTTPException(status_code=status.HTTP_502_BAD_GATEWAY, detail=str(exc)) from exc
+    raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail="Unexpected error") from exc
+@router.post("/api/classifier", response_model=str)
+async def classify_text(payload: TextInput) -> str:
+    try:
+        return classification_pipeline.classify_text(payload.text)
+    except Exception as exc:
+        _handle_exception(exc)
+@router.post("/api/language", response_model=str)
+async def detect_language(payload: TextInput) -> str:
+    try:
+        return classification_pipeline.detect_language(payload.text)
+    except Exception as exc:
+        _handle_exception(exc)
+@router.post("/api/transformer", response_model=FileTransformResponse)
+async def transform_file(file: UploadFile = File(...)) -> dict:
+    try:
+        saved_path = file_storage_service.save_upload(file)
+        content = classification_pipeline.transform_file(file.filename or saved_path.name, saved_path)
+        return {"filename": file.filename or saved_path.name, "content": content}
+    except Exception as exc:
+        _handle_exception(exc)
+@router.post("/classify", response_model=FileClassifyResponse)
+async def classify_uploaded_file(file: UploadFile = File(...)) -> dict:
+    try:
+        saved_path = file_storage_service.save_upload(file)
+        return classification_pipeline.classify_file(file.filename or saved_path.name, saved_path)
+    except Exception as exc:
+        _handle_exception(exc)
+@router.post("/configlabel", response_model=list[str])
+async def configure_labels(payload: LabelUpdateInput) -> list[str]:
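+    # Validate before mutating, so an all-empty payload cannot wipe the current labels.
+    if not any(part.strip() for part in payload.text.split(",")):
+        raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="At least one label is required")
+    return label_service.set_labels_from_csv(payload.text)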
+@router.get("/labels", response_model=list[str])
+async def get_labels() -> list[str]:
+    return label_service.get_labels()
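
Note on `app/routers/classification.py`: `_handle_exception` maps local problems (validation, extraction) to 400, upstream AI failures to 502, and anything else to 500. The same policy can be written as a lookup table, sketched here with stand-in exception classes and the standard library's `HTTPStatus`. `status_for` is a hypothetical helper and does not raise `HTTPException` the way the router does.

```python
from http import HTTPStatus


# Stand-ins for the classes in app/core/exceptions.py.
class ValidationError(Exception): ...
class ExtractionError(Exception): ...
class ClassificationError(Exception): ...
class LanguageDetectionError(Exception): ...


# Ordered (exception types, status) pairs; the first match wins.
_STATUS_BY_ERROR = (
    ((ValidationError, ExtractionError), HTTPStatus.BAD_REQUEST),
    ((ClassificationError, LanguageDetectionError), HTTPStatus.BAD_GATEWAY),
)


def status_for(exc: Exception) -> int:
    """Return the HTTP status code the router would use for this exception."""
    for error_types, status in _STATUS_BY_ERROR:
        if isinstance(exc, error_types):
            return int(status)
    return int(HTTPStatus.INTERNAL_SERVER_ERROR)


print(status_for(ExtractionError("No text extracted from file")))
```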

app/routers/health.py ADDED

	@@ -0,0 +1,18 @@

+from fastapi import APIRouter
+from app.services.label_service import label_service
+router = APIRouter(tags=["health"])
+@router.get("/health/liveness")
+def liveness() -> dict:
+    return {"status": "ok"}
+@router.get("/health/readiness")
+def readiness() -> dict:
+    # This service depends on external APIs, but readiness here reflects only
+    # successful startup; the label count is reported, not enforced.
+    labels = label_service.get_labels()
+    return {"status": "ready", "labels_count": len(labels)}

app/schemas/__init__.py ADDED

	@@ -0,0 +1 @@

+"""Pydantic API schemas."""

app/schemas/classification.py ADDED

	@@ -0,0 +1,36 @@

+from pydantic import BaseModel, ConfigDict, Field
+class BaseSchema(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+class TextInput(BaseSchema):
+    text: str = Field(min_length=1)
+class LabelUpdateInput(BaseSchema):
+    text: str = Field(min_length=1, description="Comma-separated labels, e.g. 'news, sport, finance'")
+class ClassifierResponse(BaseSchema):
+    label: str
+class LanguageResponse(BaseSchema):
+    language: str
+class FileTransformResponse(BaseSchema):
+    filename: str
+    content: str
+class FileClassifyResponse(BaseSchema):
+    label: str
+    language: str
+    type: str | None = None
+class LabelsResponse(BaseSchema):
+    labels: list[str]

app/services/__init__.py ADDED

	@@ -0,0 +1 @@

+"""Service layer modules."""

app/services/classifier_service.py ADDED

	@@ -0,0 +1,83 @@

+import json
+from pathlib import Path
+from typing import Any
+from gradio_client import Client
+from app.core.config import settings
+from app.core.exceptions import ClassificationError
+class ClassifierService:
+    def __init__(self) -> None:
+        self._client: Client | None = None
+    def _get_client(self) -> Client:
+        if self._client is not None:
+            return self._client
+        client_kwargs: dict[str, Any] = {}
+        if settings.huggingface_token:
+            client_kwargs["hf_token"] = settings.huggingface_token
+        try:
+            self._client = Client(settings.classifier_space, **client_kwargs)
+        except Exception as exc:
+            raise ClassificationError("Unable to initialize classifier client") from exc
+        return self._client
+    @staticmethod
+    def _extract_label(payload: Any) -> str | None:
+        if isinstance(payload, dict):
+            value = payload.get("label")
+            if isinstance(value, str) and value.strip():
+                return value.strip()
+            return None
+        if isinstance(payload, list):
+            for item in payload:
+                label = ClassifierService._extract_label(item)
+                if label:
+                    return label
+        return None
+    def classify(self, text: str, labels: list[str]) -> str:
+        if not labels:
+            raise ClassificationError("No labels configured")
+        labels_text = ", ".join(labels)
+        try:
+            result = self._get_client().predict(
+                text,
+                labels_text,
+                api_name=settings.classifier_api_name,
+            )
+        except Exception as exc:
+            raise ClassificationError("Classifier request failed") from exc
+        if isinstance(result, str):
+            candidate_path = Path(result)
+            if candidate_path.exists():
+                try:
+                    parsed = json.loads(candidate_path.read_text(encoding="utf-8"))
+                except Exception as exc:
+                    raise ClassificationError("Classifier output file is not valid JSON") from exc
+                label = self._extract_label(parsed)
+                if label:
+                    return label
+            stripped = result.strip()
+            if stripped:
+                return stripped
+        label = self._extract_label(result)
+        if label:
+            return label
+        raise ClassificationError("Classifier did not return a valid label")
+classifier_service = ClassifierService()
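
Note on `app/services/classifier_service.py`: `_extract_label` walks lists recursively but, for a dict, only looks at its own `"label"` key and never descends into nested values. The sketch below copies that logic into a free function and runs it on a few payload shapes. The payloads are hypothetical, since the actual response format of the remote Space is not shown in this commit.

```python
from typing import Any, Optional


def extract_label(payload: Any) -> Optional[str]:
    """Return the first non-empty string found under a "label" key, if any."""
    if isinstance(payload, dict):
        value = payload.get("label")
        if isinstance(value, str) and value.strip():
            return value.strip()
        return None  # dicts are not searched beyond their own "label" key
    if isinstance(payload, list):
        for item in payload:
            label = extract_label(item)
            if label:
                return label
    return None


flat = extract_label({"label": " sport "})
in_list = extract_label([{"score": 0.9}, {"label": "finance"}])
nested = extract_label({"result": {"label": "news"}})
print(flat, in_list, nested)
```

The third case returns `None`, so a response that wraps the label in another object would surface as "Classifier did not return a valid label".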

app/services/extraction_service.py ADDED

	@@ -0,0 +1,70 @@

+from pathlib import Path
+import docx2txt
+from openpyxl import load_workbook
+from PIL import Image
+from pypdf import PdfReader
+import pytesseract
+from app.core.exceptions import ExtractionError
+DOC_EXTENSIONS = {".pdf", ".docx", ".xlsx"}
+IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff"}
+TEXT_EXTENSIONS = {".txt", ".md", ".csv", ".json"}
+class ExtractionService:
+    @staticmethod
+    def _extract_pdf(file_path: Path) -> str:
+        reader = PdfReader(str(file_path))
+        chunks: list[str] = []
+        for page in reader.pages:
+            text = page.extract_text() or ""
+            if text.strip():
+                chunks.append(text)
+        return "\n".join(chunks)
+    @staticmethod
+    def _extract_docx(file_path: Path) -> str:
+        return docx2txt.process(str(file_path))
+    @staticmethod
+    def _extract_xlsx(file_path: Path) -> str:
+        workbook = load_workbook(filename=str(file_path), read_only=True, data_only=True)
+        chunks: list[str] = []
+        for sheet in workbook.worksheets:
+            for row in sheet.iter_rows(values_only=True):
+                row_values = [str(value).strip() for value in row if value is not None and str(value).strip()]
+                if row_values:
+                    chunks.append(" ".join(row_values))
+        workbook.close()
+        return "\n".join(chunks)
+    def extract_text(self, file_name: str, file_path: Path) -> str:
+        extension = Path(file_name).suffix.lower()
+        try:
+            if extension in DOC_EXTENSIONS:
+                if extension == ".pdf":
+                    return self._extract_pdf(file_path)
+                if extension == ".docx":
+                    return self._extract_docx(file_path)
+                if extension == ".xlsx":
+                    return self._extract_xlsx(file_path)
+            if extension in IMAGE_EXTENSIONS:
+                with Image.open(file_path) as image:  # close the file handle after OCR
+                    return pytesseract.image_to_string(image)
+            if extension in TEXT_EXTENSIONS:
+                return file_path.read_text(encoding="utf-8", errors="ignore")
+            raise ExtractionError(f"Unsupported file extension: {extension or 'unknown'}")
+        except ExtractionError:
+            raise
+        except Exception as exc:
+            raise ExtractionError("Failed to extract text from file") from exc
+extraction_service = ExtractionService()
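
Note on `app/services/extraction_service.py`: the handler is chosen from the lowercased suffix of the original filename, so only the last extension counts and files without one are rejected. A small sketch of that dispatch rule, using the three extension sets from the diff; `extraction_kind` is a hypothetical helper that returns a category name instead of extracting anything.

```python
from pathlib import Path

DOC_EXTENSIONS = {".pdf", ".docx", ".xlsx"}
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff"}
TEXT_EXTENSIONS = {".txt", ".md", ".csv", ".json"}


def extraction_kind(file_name: str) -> str:
    """Classify a filename by suffix, the way extract_text picks a handler."""
    extension = Path(file_name).suffix.lower()
    if extension in DOC_EXTENSIONS:
        return "document"
    if extension in IMAGE_EXTENSIONS:
        return "image"
    if extension in TEXT_EXTENSIONS:
        return "text"
    return "unsupported"


for name in ("Report.PDF", "scan.jpeg", "notes.md", "archive.tar.gz", "README"):
    print(name, extraction_kind(name))
```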

app/services/file_storage_service.py ADDED

	@@ -0,0 +1,29 @@

+from pathlib import Path
+from uuid import uuid4
+from fastapi import UploadFile
+from app.core.config import settings
+class FileStorageService:
+    def __init__(self) -> None:
+        settings.static_dir.mkdir(parents=True, exist_ok=True)
+        settings.upload_dir.mkdir(parents=True, exist_ok=True)
+    def save_upload(self, upload_file: UploadFile) -> Path:
+        original_name = Path(upload_file.filename or "uploaded_file").name
+        safe_name = original_name.replace("/", "_")
+        target_path = settings.upload_dir / f"{uuid4().hex}_{safe_name}"
+        with target_path.open("wb") as destination:
+            while True:
+                chunk = upload_file.file.read(1024 * 1024)
+                if not chunk:
+                    break
+                destination.write(chunk)
+        return target_path
+file_storage_service = FileStorageService()
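
Note on `app/services/file_storage_service.py`: uploads are reduced to their base name, prefixed with a random hex id, and copied in 1 MiB chunks so large files are never held in memory at once. The sketch below reproduces that with an in-memory stream and a temporary directory. `save_stream` is a hypothetical helper that takes any binary file object rather than a FastAPI `UploadFile`.

```python
import io
import tempfile
from pathlib import Path
from uuid import uuid4

CHUNK_SIZE = 1024 * 1024  # 1 MiB, as in the diff above


def save_stream(source, filename: str, upload_dir: Path) -> Path:
    """Copy a binary stream to upload_dir under a collision-resistant name."""
    # Path(...).name drops any directory components sent by the client.
    safe_name = Path(filename or "uploaded_file").name.replace("/", "_")
    target = upload_dir / f"{uuid4().hex}_{safe_name}"
    with target.open("wb") as destination:
        while chunk := source.read(CHUNK_SIZE):
            destination.write(chunk)
    return target


with tempfile.TemporaryDirectory() as tmp:
    saved = save_stream(io.BytesIO(b"hello world"), "../reports/q1.txt", Path(tmp))
    saved_name = saved.name
    saved_bytes = saved.read_bytes()
print(saved_name.endswith("_q1.txt"), saved_bytes)
```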

app/services/label_service.py ADDED

	@@ -0,0 +1,18 @@

+from app.core.config import settings
+from app.models.label_config import LabelConfig
+class LabelService:
+    def __init__(self) -> None:
+        initial_labels = [label.strip() for label in settings.default_labels_csv.split(",") if label.strip()]
+        self._config = LabelConfig(labels=initial_labels)
+    def get_labels(self) -> list[str]:
+        return self._config.get_labels()
+    def set_labels_from_csv(self, labels_csv: str) -> list[str]:
+        labels = [label.strip() for label in labels_csv.split(",") if label.strip()]
+        return self._config.set_labels(labels)
+label_service = LabelService()

app/services/language_service.py ADDED

	@@ -0,0 +1,41 @@

+import requests
+from app.core.config import settings
+from app.core.exceptions import LanguageDetectionError
+class LanguageService:
+    def __init__(self) -> None:
+        self._session = requests.Session()
+    def detect_language(self, text: str) -> str:
+        try:
+            response = self._session.post(
+                settings.language_detector_url,
+                json={"data": [text]},
+                timeout=settings.request_timeout_seconds,
+            )
+            response.raise_for_status()
+            payload = response.json()
+        except requests.RequestException as exc:
+            raise LanguageDetectionError("Language detection request failed") from exc
+        except ValueError as exc:
+            raise LanguageDetectionError("Language detector returned invalid JSON") from exc
+        data = payload.get("data") if isinstance(payload, dict) else None
+        if not data or not isinstance(data, list):
+            raise LanguageDetectionError("Language detector response missing 'data' field")
+        first = data[0]
+        if isinstance(first, dict):
+            label = first.get("label")
+        else:
+            label = first
+        if not isinstance(label, str) or not label.strip():
+            raise LanguageDetectionError("Language detector did not return a valid label")
+        return label.strip()
+language_service = LanguageService()
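
Note on `app/services/language_service.py`: after the HTTP call, the service accepts two shapes for the first element of `data`, either a bare string or an object with a `label` key. The sketch below isolates that parsing step so it can be tried without network access. `parse_language` is a hypothetical helper, it raises `ValueError` where the service raises `LanguageDetectionError`, and the payloads are illustrative rather than captured from the real detector.

```python
from typing import Any


def parse_language(payload: Any) -> str:
    """Extract the language code from a decoded JSON response."""
    data = payload.get("data") if isinstance(payload, dict) else None
    if not data or not isinstance(data, list):
        raise ValueError("Language detector response missing 'data' field")
    first = data[0]
    label = first.get("label") if isinstance(first, dict) else first
    if not isinstance(label, str) or not label.strip():
        raise ValueError("Language detector did not return a valid label")
    return label.strip()


print(parse_language({"data": ["en"]}))
print(parse_language({"data": [{"label": " fr ", "confidence": 0.98}]}))
```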

docker-compose.yml ADDED

	@@ -0,0 +1,27 @@

+services:
+  classifier-api:
+    build:
+      context: .
+      dockerfile: Dockerfile
+    container_name: classifier-general-api
+    environment:
+      APP_NAME: ${APP_NAME:-Classifier General API}
+      ENVIRONMENT: ${ENVIRONMENT:-development}
+      DEBUG: ${DEBUG:-false}
+      STATIC_DIR: ${STATIC_DIR:-static}
+      UPLOAD_SUBDIR: ${UPLOAD_SUBDIR:-uploads}
+      CLASSIFIER_SPACE: ${CLASSIFIER_SPACE:-https://ayoubchlin-ayoubchlin-stable-bart-mnli-cnn.hf.space/}
+      CLASSIFIER_API_NAME: ${CLASSIFIER_API_NAME:-/predict}
+      HUGGINGFACE_TOKEN: ${HUGGINGFACE_TOKEN:-}
+      LANGUAGE_DETECTOR_URL: ${LANGUAGE_DETECTOR_URL:-https://team-language-detector-languagedetector.hf.space/run/predict}
+      REQUEST_TIMEOUT_SECONDS: ${REQUEST_TIMEOUT_SECONDS:-30}
+      DEFAULT_LABELS_CSV: ${DEFAULT_LABELS_CSV:-news,sport,finance,politics}
+    ports:
+      - "4002:4002"
+    volumes:
+      - ./static:/app/static
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:4002/health/liveness"]
+      interval: 10s
+      timeout: 4s
+      retries: 6

docs/README.md ADDED

	@@ -0,0 +1,33 @@

+# Classifier-General Documentation
+This folder contains reverse-engineered documentation generated from repository evidence.
+## Doc Map (Diataxis)
+- Tutorial:
+  - `docs/tutorials/getting-started.md`
+- How-to guides:
+  - `docs/how-to/run-locally.md`
+  - `docs/how-to/deploy-with-docker-compose.md`
+- Reference:
+  - `docs/reference/configuration.md`
+  - `docs/reference/api.md`
+  - `docs/reference/runtime-state.md`
+- Explanation:
+  - `docs/explanation/architecture.md`
+  - `docs/explanation/decisions.md`
+## Scope
+- Covered modules: classification routes, text preprocessing, extraction pipeline, remote classifier/language services, label config.
+- This service has no relational database layer; runtime state consists of files on the local filesystem plus in-memory labels.
+## Evidence anchors
+- `app/main.py`
+- `app/api/router.py`
+- `app/routers/*.py`
+- `app/pipelines/*.py`
+- `app/services/*.py`
+- `app/core/*.py`
+- `app/schemas/*.py`
+- `docker-compose.yml`
+- `Dockerfile`
+- `tests/test_routes.py`

docs/explanation/architecture.md ADDED

	@@ -0,0 +1,129 @@

+# Architecture Explanation
+## 1. Executive summary
+`classifier-general` is a single FastAPI service that classifies text and files by combining local extraction/preprocessing with two remote AI endpoints (topic classifier and language detector).
+Evidence:
+- `app/main.py`
+- `app/routers/classification.py`
+- `app/services/classifier_service.py`
+- `app/services/language_service.py`
+## 2. Purpose and scope
+### What exists
+- Contract-compatible endpoints for classify/language/transform flows.
+- Pipeline split into preprocessing, extraction, and classification orchestration.
+- Configurable runtime through environment variables.
+Evidence:
+- `app/routers/classification.py`
+- `app/pipelines/classification_pipeline.py`
+- `app/core/config.py`
+### How it works
+- Router accepts JSON or multipart requests.
+- Files are written to disk (`static/uploads`).
+- Extraction service parses document/image/text into plain text.
+- Text preprocessing cleans the text and rejects inputs shorter than four words (`MIN_WORDS = 4`).
+- Pipeline calls language and classifier services.
+Evidence:
+- `app/services/file_storage_service.py`
+- `app/services/extraction_service.py`
+- `app/pipelines/text_pipeline.py`
+- `app/pipelines/classification_pipeline.py`
+### Why designed this way (inferred)
+- Maintain old API contract while introducing modular services and safer config handling.
+## 3. C4-style views
+### Context view
+Actors/systems:
+- API client sending text/files.
+- External classifier endpoint (`CLASSIFIER_SPACE`).
+- External language detector endpoint (`LANGUAGE_DETECTOR_URL`).
+- Local filesystem for uploaded files.
+Evidence:
+- `app/core/config.py`
+- `app/services/classifier_service.py`
+- `app/services/language_service.py`
+### Container view
+- One container/service (`classifier-api`) with FastAPI + OCR binary.
+Evidence:
+- `docker-compose.yml`
+- `Dockerfile`
+### Component view
+- API routing: `app/routers/*`
+- Orchestration pipelines: `app/pipelines/*`
+- Integration services: `app/services/classifier_service.py`, `app/services/language_service.py`
+- Extraction + storage services: `app/services/extraction_service.py`, `app/services/file_storage_service.py`
+- Config/exceptions/schemas: `app/core/*`, `app/schemas/*`
+### Code-level workflow: file classification
+1. `POST /classify` receives file.
+2. File saved to upload directory.
+3. Text extracted by extension-specific handlers.
+4. Text preprocessed (regex cleanup + min words).
+5. Language detector called.
+6. Classifier called with the configured labels joined into one comma-separated string.
+7. Response returns `{label, language}`, plus `"type": "not english"` when the detected language is not `en`.
+Evidence:
+- `app/routers/classification.py`
+- `app/services/file_storage_service.py`
+- `app/services/extraction_service.py`
+- `app/pipelines/text_pipeline.py`
+- `app/pipelines/classification_pipeline.py`
+## 4. Cross-cutting concerns
+### Validation and error mapping
+- Schemas reject unknown fields (`extra="forbid"` on the shared base schema).
+- Error mapping explicitly separates validation/extraction (400) from upstream AI failures (502).
+Evidence:
+- `app/schemas/classification.py`
+- `app/routers/classification.py`
+### Configuration and secrets
+- Runtime config sourced from env.
+- The HF token is optional, and the current service code contains no hardcoded secret.
+Evidence:
+- `app/core/config.py`
+- `app/services/classifier_service.py`
+### Concurrency and mutable state
+- Labels guarded by thread lock (`LabelConfig._lock`).
+- State is still process-local; multi-instance deployments can diverge.
+Evidence:
+- `app/models/label_config.py`
+- `app/services/label_service.py`
+### Testing strategy
+- Route contract tests monkeypatch pipeline methods for deterministic tests.
+- Tests validate response shape and key endpoint behavior, not remote network calls.
+Evidence:
+- `tests/test_routes.py`
+## 5. Risks, gaps, and technical debt
+- External endpoint dependency introduces latency and runtime failure risk.
+- No upload retention/cleanup process.
+- The readiness check does not probe the external AI services; it only reports the local label count.
+- No authentication/authorization layer on API endpoints.
+Evidence:
+- `app/services/language_service.py`
+- `app/services/classifier_service.py`
+- `app/routers/health.py`
+- `app/routers/classification.py`
+## 6. Unknown or inferred
+- Unknown: expected SLA and acceptable latency.
+- Unknown: intended persistence/retention policy for uploaded files.
+- Inferred: service is optimized for local/dev contract compatibility and integration testing.
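
Note on `docs/explanation/architecture.md`: the file-classification workflow described there (extract, preprocess, detect language, classify, build response) can be sketched as one function with its collaborators injected. `classify_file_flow` and the stub callables below are hypothetical. They stand in for the extraction, language and classifier services so the control flow can be read and run without network access.

```python
from typing import Callable, Dict, List


def classify_file_flow(
    extract: Callable[[], str],
    preprocess: Callable[[str], str],
    detect_language: Callable[[str], str],
    classify: Callable[[str, str], str],
    labels: List[str],
) -> Dict[str, str]:
    """Run the extract -> preprocess -> language -> classify sequence."""
    text = extract()
    if not text or not text.strip():
        raise ValueError("No text extracted from file")
    cleaned = preprocess(text)
    language = detect_language(cleaned)
    # The classifier receives the labels as one comma-separated string.
    label = classify(cleaned, ", ".join(labels))
    result = {"label": label, "language": language}
    if language != "en":
        result["type"] = "not english"
    return result


# Stubbed collaborators; a real run would call OCR and the remote endpoints.
result = classify_file_flow(
    extract=lambda: "Le marché a clôturé en hausse aujourd'hui",
    preprocess=str.lower,
    detect_language=lambda text: "fr",
    classify=lambda text, labels_text: labels_text.split(", ")[2],
    labels=["news", "sport", "finance", "politics"],
)
print(result)
```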

docs/explanation/decisions.md ADDED

	@@ -0,0 +1,80 @@

+# Architecture Decisions (ADR-style)
+## ADR-001: Use modular FastAPI layout for classifier backend
+- Status: Accepted
+- Type: Explicit
+- Evidence:
+  - `app/main.py`
+  - `app/api/router.py`
+  - `app/routers/`
+  - `app/services/`
+  - `app/pipelines/`
+- Rationale:
+  - Clear separation between transport, orchestration, and integrations.
+## ADR-002: Preserve endpoint contracts from prior service behavior
+- Status: Accepted
+- Type: Explicit
+- Evidence:
+  - `app/routers/classification.py`
+  - `tests/test_routes.py`
+- Rationale:
+  - Keep clients functional while refactoring internals.
+## ADR-003: Use remote HF/Gradio endpoint for classification
+- Status: Accepted
+- Type: Explicit
+- Evidence:
+  - `app/core/config.py`
+  - `app/services/classifier_service.py`
+- Rationale:
+  - Avoid shipping local model runtime in this service.
+## ADR-004: Use remote language detector HTTP endpoint
+- Status: Accepted
+- Type: Explicit
+- Evidence:
+  - `app/services/language_service.py`
+  - `app/core/config.py`
+- Rationale:
+  - Decouple language detection model from this codebase.
+## ADR-005: Keep labels in in-memory mutable config
+- Status: Accepted (current), Needs review
+- Type: Explicit
+- Evidence:
+  - `app/models/label_config.py`
+  - `app/services/label_service.py`
+  - `app/routers/classification.py` (`/configlabel`, `/labels`)
+- Rationale:
+  - Simple runtime tuning without DB migration.
+- Tradeoff:
+  - No persistence across restarts, no cross-instance consistency.
+## ADR-006: Store uploaded files on local filesystem under static mount
+- Status: Accepted
+- Type: Explicit
+- Evidence:
+  - `app/services/file_storage_service.py`
+  - `app/main.py`
+  - `docker-compose.yml`
+- Rationale:
+  - Enables document/image extraction workflow with minimal infrastructure.
+## ADR-007: Map errors into contract-friendly HTTP statuses
+- Status: Accepted
+- Type: Explicit
+- Evidence:
+  - `app/routers/classification.py`
+  - `app/core/exceptions.py`
+- Rationale:
+  - Differentiate local validation issues (`400`) from upstream AI failures (`502`).
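
The mapping that ADR-007 describes can be pictured as a small lookup from exception type to status code. This is a sketch under stated assumptions: the class names `InputValidationError` and `UpstreamServiceError` and the helper `http_status_for` are hypothetical and are not taken from `app/core/exceptions.py`; the `500` fallback follows the API reference.

```python
class InputValidationError(Exception):
    """Hypothetical: raised for invalid input or failed text extraction."""


class UpstreamServiceError(Exception):
    """Hypothetical: raised when the classifier or language detector call fails."""


def http_status_for(exc: Exception) -> int:
    """Return the HTTP status a route handler would answer with for `exc`."""
    if isinstance(exc, InputValidationError):
        return 400  # problem with the caller's input
    if isinstance(exc, UpstreamServiceError):
        return 502  # remote AI dependency failed
    return 500  # anything unexpected
```

Keeping the mapping in one function means each route can catch broadly and delegate, instead of repeating status codes per endpoint.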
+## ADR-008: No built-in auth layer for this API
+- Status: Accepted (current), Needs review
+- Type: Inferred
+- Evidence:
+  - `app/routers/classification.py`
+  - absence of auth dependencies/middleware in `app/main.py`
+- Rationale:
+  - Likely positioned as an internal or early-stage service.

docs/how-to/deploy-with-docker-compose.md ADDED

	@@ -0,0 +1,35 @@

+# Deploy With Docker Compose
+## Topology
+- Single container: `classifier-api`
+- Volume mount: `./static:/app/static` for persisted uploaded files
+- Healthcheck: `GET /health/liveness`
+Evidence:
+- `docker-compose.yml`
+## Command
+```bash
+cd classifier-general
+docker compose up -d --build
+```
+## Verify
+```bash
+docker compose ps
+curl -s http://localhost:4002/health/liveness
+```
+## Production hardening gaps
+- No reverse proxy/TLS config in this repo.
+- External AI dependencies are hard network dependencies at runtime.
+- No horizontal scaling coordination for in-memory labels (`/configlabel` mutates process-local state).
+Evidence:
+- `app/services/language_service.py`
+- `app/services/classifier_service.py`
+- `app/services/label_service.py`
+## Unknown or inferred
+- Unknown: expected deployment platform (only Docker artifacts are present).
+- Inferred: this compose file targets local/dev usage.

docs/how-to/run-locally.md ADDED

	@@ -0,0 +1,64 @@

+# Run Locally (Dev Loop)
+## 1. Install dependencies
+```bash
+cd classifier-general
+python3 -m venv .venv
+source .venv/bin/activate
+pip install -r requirements.txt
+```
+Evidence:
+- `requirements.txt`
+## 2. Configure environment
+```bash
+cp .env.example .env
+```
+Critical settings:
+- `CLASSIFIER_SPACE`
+- `CLASSIFIER_API_NAME`
+- `LANGUAGE_DETECTOR_URL`
+- `DEFAULT_LABELS_CSV`
+Evidence:
+- `app/core/config.py`
+- `.env.example`
+## 3. Start server
+```bash
+uvicorn main:app --host 0.0.0.0 --port 4002 --reload
+```
+Evidence:
+- `main.py`
+- `app/main.py`
+## 4. Test file-based endpoints
+```bash
+curl -s -X POST http://localhost:4002/api/transformer \
+  -F 'file=@/absolute/path/to/sample.pdf'
+curl -s -X POST http://localhost:4002/classify \
+  -F 'file=@/absolute/path/to/sample.pdf'
+```
+Uploads are stored under `static/uploads` with random UUID prefixes.
+Evidence:
+- `app/services/file_storage_service.py`
+- `app/services/extraction_service.py`
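
The UUID-prefixed storage path mentioned above might be built roughly as follows. This is an illustrative sketch, not the contents of `app/services/file_storage_service.py`: the function name `build_upload_path`, the `_` separator, and the `"upload"` fallback name are assumptions.

```python
import uuid
from pathlib import Path


def build_upload_path(upload_dir: Path, original_filename: str) -> Path:
    """Return a collision-resistant destination path inside `upload_dir`."""
    # Keep only the final path component so a client-supplied name such as
    # "../../sample.pdf" cannot point outside the upload directory.
    safe_name = Path(original_filename).name or "upload"
    # A random 32-character hex prefix keeps same-named uploads apart.
    return upload_dir / f"{uuid.uuid4().hex}_{safe_name}"
```

For example, `build_upload_path(Path("static/uploads"), "sample.pdf")` yields a path under `static/uploads` whose file name ends in `_sample.pdf`.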
+## Troubleshooting
+- `400 Text must contain at least 4 words`:
+  - input failed preprocessing minimum-word rule.
+- `502 Classifier request failed`:
+  - HF Space unreachable or incompatible response.
+- OCR extraction quality is low:
+  - verify tesseract install and image quality.
+Evidence:
+- `app/pipelines/text_pipeline.py`
+- `app/routers/classification.py`
+- `Dockerfile`
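
The minimum-word rule behind the `400 Text must contain at least 4 words` response can be approximated like this. It is a sketch only: the real check lives in `app/pipelines/text_pipeline.py` and may tokenize differently; `ensure_min_words` and `MIN_WORDS` are hypothetical names, and whitespace splitting is an assumption.

```python
MIN_WORDS = 4  # threshold taken from the error message; the constant name is hypothetical


def ensure_min_words(text: str, min_words: int = MIN_WORDS) -> str:
    """Collapse whitespace and reject inputs with fewer than `min_words` words."""
    words = text.split()  # splits on any run of whitespace
    if len(words) < min_words:
        raise ValueError(f"Text must contain at least {min_words} words")
    return " ".join(words)
```

A route handler would translate the raised error into a `400` response, so a three-word input such as `"too short text"` is rejected before any remote call is made.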

docs/reference/api.md ADDED

	@@ -0,0 +1,39 @@

+# API Reference
+Base URL: `http://localhost:4002`
+Evidence:
+- `app/api/router.py`
+- `app/routers/classification.py`
+- `app/routers/health.py`
+## Endpoints
+| Method | Path | Request | Response |
+|---|---|---|---|
+| GET | `/health/liveness` | none | `{status}` |
+| GET | `/health/readiness` | none | `{status, labels_count}` |
+| GET | `/endpoint/` | none | list of routes/methods |
+| POST | `/api/classifier` | `{text}` | `"<label>"` |
+| POST | `/api/language` | `{text}` | `"<language>"` |
+| POST | `/api/transformer` | multipart `file` | `{filename, content}` |
+| POST | `/classify` | multipart `file` | `{label, language, type?}` |
+| POST | `/configlabel` | `{text: "csv,labels"}` | `string[]` |
+| GET | `/labels` | none | `string[]` |
+## Validation and errors
+- `400` for input validation and extraction problems.
+- `502` for upstream classifier/language failures.
+- `500` for unexpected failures.
+Evidence:
+- `app/routers/classification.py` (`_handle_exception`)
+- `app/schemas/classification.py`
+## Contract notes
+- `/api/classifier` and `/api/language` return a plain JSON string, not a wrapped object.
+- `/classify` sets the optional `type` field to `"not english"` when the detected language is not `en`; otherwise `type` is `null`.
+Evidence:
+- `app/routers/classification.py`
+- `app/pipelines/classification_pipeline.py`
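
The `/classify` response shape from the contract notes can be illustrated with a small builder. This is a sketch, assuming an exact comparison against `"en"`; `ClassifyResult` and `build_classify_result` are hypothetical names, not identifiers from `app/schemas/classification.py`.

```python
from typing import Optional, TypedDict


class ClassifyResult(TypedDict):
    label: str
    language: str
    type: Optional[str]


def build_classify_result(label: str, language: str) -> ClassifyResult:
    """Attach the optional `type` marker for non-English documents."""
    # Assumption: the detector returns the exact code "en" for English.
    doc_type = None if language == "en" else "not english"
    return {"label": label, "language": language, "type": doc_type}
```

An English document therefore serializes with `"type": null`, which matches the shape asserted in `tests/test_routes.py`.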

docs/reference/configuration.md ADDED

	@@ -0,0 +1,53 @@

+# Configuration Reference
+Configuration is managed with Pydantic Settings.
+Evidence:
+- `app/core/config.py`
+- `.env.example`
+- `docker-compose.yml`
+## Application settings
+| Variable | Default | Purpose |
+|---|---|---|
+| `APP_NAME` | `Classifier General API` | FastAPI title |
+| `ENVIRONMENT` | `development` | environment label |
+| `DEBUG` | `false` | debug mode |
+## Filesystem settings
+| Variable | Default | Purpose |
+|---|---|---|
+| `STATIC_DIR` | `static` | static root served at `/static` |
+| `UPLOAD_SUBDIR` | `uploads` | upload directory under static root |
+## Classifier integration settings
+| Variable | Default | Purpose |
+|---|---|---|
+| `CLASSIFIER_SPACE` | `https://ayoubchlin-ayoubchlin-stable-bart-mnli-cnn.hf.space/` | Gradio/HF Space endpoint |
+| `CLASSIFIER_API_NAME` | `/predict` | Gradio predict API name |
+| `HUGGINGFACE_TOKEN` | empty | optional auth token for client init |
+## Language detector settings
+| Variable | Default | Purpose |
+|---|---|---|
+| `LANGUAGE_DETECTOR_URL` | `https://team-language-detector-languagedetector.hf.space/run/predict` | remote language endpoint |
+| `REQUEST_TIMEOUT_SECONDS` | `30` | HTTP timeout for language requests |
+## Label settings
+| Variable | Default | Purpose |
+|---|---|---|
+| `DEFAULT_LABELS_CSV` | `news,sport,finance,politics` | initial in-memory labels |
+## Behavior notes
+- Labels are process-local in memory and reset on restart.
+- Upload directory is auto-created at app startup.
+Evidence:
+- `app/services/label_service.py`
+- `app/models/label_config.py`
+- `app/main.py`
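
Both `DEFAULT_LABELS_CSV` and the `/configlabel` request body carry labels as one comma-separated string. A parser along these lines would produce the list form; it is a sketch, with `parse_labels_csv` a hypothetical name and the dropping of empty entries an assumption about the intended behavior.

```python
from typing import List


def parse_labels_csv(raw: str) -> List[str]:
    """Split a comma-separated label string into trimmed, non-empty labels."""
    return [part.strip() for part in raw.split(",") if part.strip()]
```

Trimming each entry is consistent with the route test, where `"tech, health, legal"` comes back as `["tech", "health", "legal"]`.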

docs/reference/runtime-state.md ADDED

	@@ -0,0 +1,48 @@

+# Runtime State Reference
+This service does not define a relational database. State exists in memory and on the filesystem.
+Evidence:
+- no `app/database/` package
+- `app/models/label_config.py`
+- `app/services/file_storage_service.py`
+## In-memory state
+| State | Location | Lifecycle |
+|---|---|---|
+| Active labels list | `LabelConfig.labels` | initialized at process start from `DEFAULT_LABELS_CSV`; mutable via `/configlabel`; reset on restart |
+Evidence:
+- `app/services/label_service.py`
+- `app/models/label_config.py`
+## Filesystem state
+| State | Location | Lifecycle |
+|---|---|---|
+| Uploaded files | `STATIC_DIR/UPLOAD_SUBDIR` (default `static/uploads`) | created per upload; not automatically deleted by app |
+| Static mount | `/static` route | served directly by FastAPI static files |
+Evidence:
+- `app/core/config.py`
+- `app/main.py`
+- `app/services/file_storage_service.py`
+## External runtime dependencies
+| Dependency | Usage |
+|---|---|
+| HF/Gradio classifier Space | text topic classification |
+| Language detector endpoint | language inference |
+| Tesseract binary | OCR extraction for images |
+Evidence:
+- `app/services/classifier_service.py`
+- `app/services/language_service.py`
+- `app/services/extraction_service.py`
+- `Dockerfile`
+## Unknown or inferred
+- Unknown: long-term retention policy for uploaded files.
+- Inferred: `static/uploads` can grow unbounded without cleanup process.
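
Since the app does not delete uploads, a periodic cleanup could bound the growth of `static/uploads`. The following is a minimal sketch of one possible approach and is not part of the service: `remove_stale_uploads`, its parameters, and the age-based policy are assumptions, and the actual retention policy is still unknown.

```python
import time
from pathlib import Path
from typing import List, Optional


def remove_stale_uploads(
    upload_dir: Path,
    max_age_seconds: float,
    now: Optional[float] = None,
) -> List[Path]:
    """Delete regular files in `upload_dir` older than `max_age_seconds`.

    Returns the paths that were removed. `now` can be injected for testing.
    """
    current = time.time() if now is None else now
    removed: List[Path] = []
    if not upload_dir.is_dir():
        return removed
    for entry in upload_dir.iterdir():
        # Only top-level regular files are considered; subdirectories are skipped.
        if entry.is_file() and current - entry.stat().st_mtime > max_age_seconds:
            entry.unlink()
            removed.append(entry)
    return removed
```

Such a function could be run from a cron job or a background task; with several containers sharing the `./static` volume, only one of them should run it.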

docs/tutorials/getting-started.md ADDED

	@@ -0,0 +1,79 @@

+# Getting Started
+This tutorial runs the classifier API and validates endpoint contracts.
+## Prerequisites
+- Docker and Docker Compose
+- Internet access for external classifier/language services (unless tests are monkeypatched)
+Evidence:
+- `docker-compose.yml`
+- `app/services/classifier_service.py`
+- `app/services/language_service.py`
+## 1. Prepare environment
+```bash
+cd classifier-general
+cp .env.example .env
+```
+Evidence:
+- `.env.example`
+- `app/core/config.py`
+## 2. Start API in Docker
+```bash
+docker compose up --build
+```
+Service:
+- `classifier-api` exposed on `4002`
+Evidence:
+- `docker-compose.yml`
+## 3. Check health
+```bash
+curl -s http://localhost:4002/health/liveness
+curl -s http://localhost:4002/health/readiness
+```
+Expected:
+- `{"status":"ok"}`
+- `{"status":"ready","labels_count":<n>}`
+Evidence:
+- `app/routers/health.py`
+## 4. Call text classification
+```bash
+curl -s -X POST http://localhost:4002/api/classifier \
+  -H 'content-type: application/json' \
+  -d '{"text":"This is a long enough sentence for classification."}'
+```
+Evidence:
+- `app/routers/classification.py`
+- `app/pipelines/classification_pipeline.py`
+## 5. Update labels and verify
+```bash
+curl -s -X POST http://localhost:4002/configlabel \
+  -H 'content-type: application/json' \
+  -d '{"text":"tech,health,legal"}'
+curl -s http://localhost:4002/labels
+```
+Evidence:
+- `app/services/label_service.py`
+- `app/models/label_config.py`
+## 6. Run route contract tests
+```bash
+docker build -t classifier-general-refactor .
+docker run --rm -w /app -e PYTHONPATH=/app classifier-general-refactor pytest -q
+```
+Evidence:
+- `tests/test_routes.py`

main.py ADDED

	@@ -0,0 +1,7 @@

+from app.main import app
+if __name__ == "__main__":
+    import uvicorn
+    uvicorn.run("main:app", host="0.0.0.0", port=4002, reload=True)

requirements.txt ADDED

	@@ -0,0 +1,16 @@

+fastapi==0.115.8
+uvicorn[standard]==0.34.0
+pydantic==2.10.6
+pydantic-settings==2.7.1
+requests==2.32.3
+gradio_client==1.7.0
+python-multipart==0.0.20
+pytesseract==0.3.13
+Pillow==11.1.0
+pypdf==5.4.0
+docx2txt==0.8
+openpyxl==3.1.5
+pytest==8.3.4
+httpx==0.28.1

tests/test_routes.py ADDED

	@@ -0,0 +1,61 @@

+from io import BytesIO
+from fastapi.testclient import TestClient
+from app.main import app
+from app.pipelines.classification_pipeline import classification_pipeline
+client = TestClient(app)
+def test_classifier_endpoint_contract(monkeypatch):
+    monkeypatch.setattr(classification_pipeline, "classify_text", lambda text: "news")
+    response = client.post("/api/classifier", json={"text": "This is a long enough sentence for classification."})
+    assert response.status_code == 200
+    assert response.json() == "news"
+def test_language_endpoint_contract(monkeypatch):
+    monkeypatch.setattr(classification_pipeline, "detect_language", lambda text: "en")
+    response = client.post("/api/language", json={"text": "This is a language detection sample text."})
+    assert response.status_code == 200
+    assert response.json() == "en"
+def test_labels_config_roundtrip():
+    response = client.post("/configlabel", json={"text": "tech, health, legal"})
+    assert response.status_code == 200
+    assert response.json() == ["tech", "health", "legal"]
+    get_response = client.get("/labels")
+    assert get_response.status_code == 200
+    assert get_response.json() == ["tech", "health", "legal"]
+def test_transform_file_contract(monkeypatch):
+    monkeypatch.setattr(classification_pipeline, "transform_file", lambda filename, path: "extracted content")
+    files = {"file": ("sample.txt", BytesIO(b"hello"), "text/plain")}
+    response = client.post("/api/transformer", files=files)
+    assert response.status_code == 200
+    assert response.json()["filename"] == "sample.txt"
+    assert response.json()["content"] == "extracted content"
+def test_classify_file_contract(monkeypatch):
+    monkeypatch.setattr(
+        classification_pipeline,
+        "classify_file",
+        lambda filename, path: {"label": "finance", "language": "en"},
+    )
+    files = {"file": ("sample.txt", BytesIO(b"hello"), "text/plain")}
+    response = client.post("/classify", files=files)
+    assert response.status_code == 200
+    assert response.json() == {"label": "finance", "language": "en", "type": None}