Clean rebuild: all features fixed
Browse files- CHANGES.md +51 -0
- cv-requirements.txt +0 -35
- cv/src/__init__.py +0 -0
- cv/src/api/__init__.py +0 -0
- cv/src/api/main.py +0 -48
- cv/src/api/routes.py +0 -246
- cv/src/api/schemas.py +0 -110
- cv/src/config.py +0 -55
- cv/src/cv_pipeline.py +0 -246
- cv/src/models/__init__.py +0 -0
- cv/src/models/captioner.py +0 -105
- cv/src/models/clip_model.py +0 -150
- cv/src/models/yolo_detector.py +0 -208
- cv/src/processors/__init__.py +0 -0
- cv/src/processors/image_preprocessor.py +0 -154
- cv/src/processors/ocr_processor.py +0 -235
- cv_module/src/api/routes.py +51 -23
- cv_module/src/cv_pipeline.py +33 -7
- frontend/index.html +67 -18
- rag-requirements.txt +0 -34
- rag/src/__init__.py +0 -1
- rag/src/api/__init__.py +0 -1
- rag/src/api/main.py +0 -57
- rag/src/api/routes.py +0 -137
- rag/src/api/schemas.py +0 -67
- rag/src/config.py +0 -56
- rag/src/embeddings/__init__.py +0 -1
- rag/src/embeddings/embedder.py +0 -60
- rag/src/llm/__init__.py +0 -1
- rag/src/llm/groq_client.py +0 -62
- rag/src/llm/prompt_templates.py +0 -36
- rag/src/loaders/__init__.py +0 -69
- rag/src/loaders/base_loader.py +0 -39
- rag/src/loaders/docx_loader.py +0 -52
- rag/src/loaders/json_loader.py +0 -103
- rag/src/loaders/pdf_loader.py +0 -46
- rag/src/loaders/text_loader.py +0 -31
- rag/src/loaders/web_loader.py +0 -57
- rag/src/retrieval/__init__.py +0 -1
- rag/src/retrieval/retriever.py +0 -211
- rag/src/retrieval/vector_store.py +0 -93
- rag_pipeline/src/api/routes.py +43 -6
- rag_pipeline/src/config.py +5 -1
- rag_pipeline/src/llm/groq_client.py +62 -15
- rag_pipeline/src/retrieval/vector_store.py +55 -12
- start.sh +30 -5
CHANGES.md
ADDED
|
@@ -0,0 +1,51 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Perbaikan dari v2_0_5 β v2_0_5-fixed
|
| 2 |
+
|
| 3 |
+
Ringkasan: 7 file diubah, 1 file diperbaiki tambahan. Semua perbaikan **defensive** (membuat
|
| 4 |
+
kode lebih tahan banting), bukan refactor besar β supaya behaviour lama tidak berubah selama
|
| 5 |
+
kondisi normal, tapi ada fallback yang jelas saat kondisi tidak normal.
|
| 6 |
+
|
| 7 |
+
## Daftar perubahan
|
| 8 |
+
|
| 9 |
+
| File | Jenis | Tujuan |
|
| 10 |
+
|------|-------|--------|
|
| 11 |
+
| `frontend/index.html` | Theme bulletproof | `try/catch` di sekitar `localStorage` & `matchMedia`; fallback `:root` CSS variables |
|
| 12 |
+
| `rag_pipeline/src/config.py` | Config | `groq_api_key` jadi optional (default `""`) β service ngga crash kalau secret belum di-set |
|
| 13 |
+
| `rag_pipeline/src/llm/groq_client.py` | LLM client | ChatGroq di-init **lazy** β pas pertama dipakai, bukan saat constructor jalan. Validasi API key di sini dengan pesan jelas. |
|
| 14 |
+
| `rag_pipeline/src/api/routes.py` | Error handling | `/query` dan `/summarize` return **503** (bukan 500) saat API key missing, dengan pesan spesifik. `/ready` sekarang lapor status `groq_api_key`. `/collection` DELETE error reporting lebih baik. |
|
| 15 |
+
| `rag_pipeline/src/retrieval/vector_store.py` | Robustness | `reset_collection()` punya **3-tier fallback**: `db.reset_collection()` β delete-by-ids β nuke + re-init |
|
| 16 |
+
| `cv_module/src/cv_pipeline.py` | Thread safety | Per-model `Lock` di lazy property (mencegah double-init kalau 2+ request concurrent). `ThreadPoolExecutor(max_workers=2)` mencegah OOM di HF free tier. |
|
| 17 |
+
| `cv_module/src/api/routes.py` | Thread safety | `_trigger_lock` mencegah TOCTOU race di `_trigger_and_wait`. Error handler tiap endpoint kasih pesan yg lebih informatif. |
|
| 18 |
+
| `start.sh` | Diagnostics | Print sanity check (file paths exist, GROQ_API_KEY status) saat boot β gampang debug dari log. |
|
| 19 |
+
|
| 20 |
+
## Yang TIDAK gw ubah
|
| 21 |
+
|
| 22 |
+
- Dockerfile: aman, struktur `cv_module/` dan `rag_pipeline/` di repo lo udah match dengan `COPY` di Dockerfile.
|
| 23 |
+
- `nginx.conf`, `supervisord.conf`: aman.
|
| 24 |
+
- `requirements.txt`: aman, version pin udah konsisten.
|
| 25 |
+
- Loaders (PDF/DOCX/TXT/JSON/Web): aman, ngga ada perubahan logic.
|
| 26 |
+
- Theme system di v2_0_5 *secara fungsional* udah bener β gw cuma tambah defensive guards. Kalau di lo "ilang", penyebabnya bukan code-level missing tapi runtime issue (browser, deploy, cache).
|
| 27 |
+
|
| 28 |
+
## Penjelasan kenapa fix ini bisa nyelesaiin "semua endpoint 500"
|
| 29 |
+
|
| 30 |
+
Hipotesis paling mungkin penyebabnya:
|
| 31 |
+
|
| 32 |
+
1. **GROQ_API_KEY belum di-set di HF Space.** Sebelum fix:
|
| 33 |
+
- `Settings()` raise `ValidationError` karena `Field(...)` mandatory.
|
| 34 |
+
- `RAGRetriever.__init__()` crash di `get_settings()`.
|
| 35 |
+
- Setiap endpoint yg manggil `get_retriever()` (β semua) β 500.
|
| 36 |
+
- Setelah fix: service start clean, `/stats` `/sources` `/ingest` tetep jalan, cuma `/query` `/summarize` yg 503 dengan pesan "GROQ_API_KEY belum di-set".
|
| 37 |
+
|
| 38 |
+
2. **CV model concurrent load OOM.** Sebelum fix:
|
| 39 |
+
- 2+ request paralel ke endpoint yang butuh model yg sama β race di `if self._captioner is None` β 2x init paralel β RAM spike β OS kill process β 500/connection error.
|
| 40 |
+
- Setelah fix: per-model lock, cuma 1 thread yg load.
|
| 41 |
+
|
| 42 |
+
3. **`db.reset_collection()` AttributeError di langchain_chroma versi tertentu.** Sebelum fix: fallback path bisa gagal di chromadb 0.5.3 + langchain_chroma 0.1.4 combo karena `_client.delete_collection` lalu re-init bisa race. Setelah fix: delete-by-ids jadi default fallback (lebih atomic).
|
| 43 |
+
|
| 44 |
+
## Yang harus lo cek sebelum deploy
|
| 45 |
+
|
| 46 |
+
1. **GROQ_API_KEY**: Pastiin udah di-set di HF Spaces β Settings β Variables and secrets.
|
| 47 |
+
2. **Struktur repo HF**: Dockerfile `COPY rag_pipeline/src/...`. Pastiin folder ini ada di repo HF (bukan cuma di local zip). Kalau di HF strukturnya `rag/` (bukan `rag_pipeline/`), build bakal gagal β dan **semua endpoint return error karena container ngga jalan**, bukan karena bug kode.
|
| 48 |
+
3. **HF Space rebuild**: Kadang HF nge-cache build layer. Setelah deploy versi baru, force rebuild via dashboard ("Restart this Space" β "Factory rebuild").
|
| 49 |
+
|
| 50 |
+
Kalau semua 3 hal di atas udah OK dan endpoint masih error, share log dari HF Spaces dashboard
|
| 51 |
+
(tab "Logs"), gw bisa pinpoint penyebab spesifik.
|
cv-requirements.txt
DELETED
|
@@ -1,35 +0,0 @@
|
|
| 1 |
-
# ββ CV Core βββββββββββββββββββββββββββββββββββββββββββββββ
|
| 2 |
-
transformers==4.35.2
|
| 3 |
-
numpy==1.26.4
|
| 4 |
-
Pillow>=10.4.0
|
| 5 |
-
opencv-python-headless>=4.10.0
|
| 6 |
-
|
| 7 |
-
# ββ CLIP ββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 8 |
-
open-clip-torch>=2.26.1
|
| 9 |
-
timm==0.9.16
|
| 10 |
-
|
| 11 |
-
# ββ Object Detection ββββββββββββββββββββββββββββββββββββββ
|
| 12 |
-
ultralytics>=8.2.0
|
| 13 |
-
|
| 14 |
-
# ββ OCR βββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 15 |
-
pytesseract>=0.3.13
|
| 16 |
-
easyocr>=1.7.1
|
| 17 |
-
|
| 18 |
-
# ββ Image utils βββββββββββββββββββββββββββββββββββββββββββ
|
| 19 |
-
imageio>=2.34.0
|
| 20 |
-
scikit-image>=0.24.0
|
| 21 |
-
|
| 22 |
-
# ββ API βββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 23 |
-
fastapi==0.112.0
|
| 24 |
-
uvicorn[standard]==0.30.6
|
| 25 |
-
python-multipart==0.0.9
|
| 26 |
-
pydantic==2.8.2
|
| 27 |
-
pydantic-settings==2.4.0
|
| 28 |
-
|
| 29 |
-
# ββ MLOps βββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 30 |
-
mlflow==2.15.1
|
| 31 |
-
|
| 32 |
-
# ββ Utils βββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 33 |
-
loguru==0.7.2
|
| 34 |
-
python-dotenv==1.0.1
|
| 35 |
-
httpx==0.27.0
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
cv/src/__init__.py
DELETED
|
File without changes
|
cv/src/api/__init__.py
DELETED
|
File without changes
|
cv/src/api/main.py
DELETED
|
@@ -1,48 +0,0 @@
|
|
| 1 |
-
from fastapi import FastAPI
|
| 2 |
-
from fastapi.middleware.cors import CORSMiddleware
|
| 3 |
-
from loguru import logger
|
| 4 |
-
import sys
|
| 5 |
-
|
| 6 |
-
from .routes import router
|
| 7 |
-
from ..config import get_cv_settings
|
| 8 |
-
|
| 9 |
-
settings = get_cv_settings()
|
| 10 |
-
|
| 11 |
-
logger.remove()
|
| 12 |
-
logger.add(sys.stderr, level="INFO", colorize=True,
|
| 13 |
-
format="<green>{time:HH:mm:ss}</green> | <level>{level: <8}</level> | <cyan>{name}</cyan> - {message}")
|
| 14 |
-
logger.add("./logs/cv_api.log", rotation="10 MB", retention="7 days")
|
| 15 |
-
|
| 16 |
-
app = FastAPI(
|
| 17 |
-
title="CV Pipeline API",
|
| 18 |
-
description="""
|
| 19 |
-
## Multimodal AI Assistant β Computer Vision Module
|
| 20 |
-
|
| 21 |
-
Endpoint untuk analisis gambar menggunakan:
|
| 22 |
-
- **BLIP** β image captioning & visual QA
|
| 23 |
-
- **YOLOv8** β real-time object detection (80 kelas COCO)
|
| 24 |
-
- **CLIP** β zero-shot classification & image-text similarity
|
| 25 |
-
- **EasyOCR** β text extraction dari gambar (80+ bahasa)
|
| 26 |
-
- **MLflow** β latency & performance tracking
|
| 27 |
-
|
| 28 |
-
### Integrasi dengan RAG Module
|
| 29 |
-
Output `summary_text` dari `/analyze` bisa langsung dipakai sebagai
|
| 30 |
-
konteks untuk RAG pipeline β gambar bisa menjadi bagian dari knowledge base.
|
| 31 |
-
""",
|
| 32 |
-
version="1.0.0",
|
| 33 |
-
)
|
| 34 |
-
|
| 35 |
-
app.add_middleware(
|
| 36 |
-
CORSMiddleware,
|
| 37 |
-
allow_origins=["*"],
|
| 38 |
-
allow_methods=["*"],
|
| 39 |
-
allow_headers=["*"],
|
| 40 |
-
)
|
| 41 |
-
|
| 42 |
-
app.include_router(router, prefix="/api/v1")
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
@app.on_event("startup")
|
| 46 |
-
async def startup():
|
| 47 |
-
logger.info("CV Pipeline API starting up...")
|
| 48 |
-
logger.info(f"Docs: http://{settings.api_host}:{settings.api_port}/docs")
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
cv/src/api/routes.py
DELETED
|
@@ -1,246 +0,0 @@
|
|
| 1 |
-
from fastapi import APIRouter, HTTPException, UploadFile, File
|
| 2 |
-
from fastapi.responses import Response
|
| 3 |
-
from pydantic import BaseModel
|
| 4 |
-
from loguru import logger
|
| 5 |
-
|
| 6 |
-
from .schemas import (
|
| 7 |
-
AnalyzeURLRequest, FullAnalysisResponse,
|
| 8 |
-
ClassifyRequest, ClassificationResponse,
|
| 9 |
-
SimilarityRequest, SimilarityResponse,
|
| 10 |
-
VisualQARequest, VisualQAResponse,
|
| 11 |
-
CaptionResponse, DetectionResponse, OCRResponse,
|
| 12 |
-
)
|
| 13 |
-
from ..cv_pipeline import CVPipeline
|
| 14 |
-
|
| 15 |
-
router = APIRouter()
|
| 16 |
-
|
| 17 |
-
_pipeline: CVPipeline = None
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
def get_pipeline() -> CVPipeline:
|
| 21 |
-
global _pipeline
|
| 22 |
-
if _pipeline is None:
|
| 23 |
-
_pipeline = CVPipeline()
|
| 24 |
-
return _pipeline
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
# === HEALTH ===
|
| 28 |
-
|
| 29 |
-
@router.get("/health", tags=["system"])
|
| 30 |
-
async def health():
|
| 31 |
-
return {"status": "ok", "service": "CV Pipeline API"}
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
# === FULL ANALYSIS ===
|
| 35 |
-
|
| 36 |
-
@router.post("/analyze/url", response_model=FullAnalysisResponse, tags=["analysis"])
|
| 37 |
-
async def analyze_from_url(req: AnalyzeURLRequest):
|
| 38 |
-
"""
|
| 39 |
-
Analisis gambar dari URL.
|
| 40 |
-
Jalankan caption, object detection, optional OCR + CLIP classification sekaligus.
|
| 41 |
-
"""
|
| 42 |
-
try:
|
| 43 |
-
result = get_pipeline().analyze(
|
| 44 |
-
source=req.url,
|
| 45 |
-
run_caption=req.run_caption,
|
| 46 |
-
run_detection=req.run_detection,
|
| 47 |
-
run_ocr=req.run_ocr,
|
| 48 |
-
classification_labels=req.classification_labels,
|
| 49 |
-
)
|
| 50 |
-
return _to_response(result)
|
| 51 |
-
except Exception as e:
|
| 52 |
-
logger.error(f"Analyze error: {e}")
|
| 53 |
-
raise HTTPException(status_code=500, detail=str(e))
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
@router.post("/analyze/upload", response_model=FullAnalysisResponse, tags=["analysis"])
|
| 57 |
-
async def analyze_upload(
|
| 58 |
-
file: UploadFile = File(...),
|
| 59 |
-
run_caption: bool = True,
|
| 60 |
-
run_detection: bool = True,
|
| 61 |
-
run_ocr: bool = False,
|
| 62 |
-
):
|
| 63 |
-
"""Upload dan analisis gambar langsung (multipart)."""
|
| 64 |
-
allowed = {"image/jpeg", "image/png", "image/webp", "image/gif"}
|
| 65 |
-
if file.content_type not in allowed:
|
| 66 |
-
raise HTTPException(400, detail=f"Tipe file tidak didukung: {file.content_type}")
|
| 67 |
-
|
| 68 |
-
data = await file.read()
|
| 69 |
-
if len(data) > 10 * 1024 * 1024:
|
| 70 |
-
raise HTTPException(400, detail="Ukuran file maksimum 10MB")
|
| 71 |
-
|
| 72 |
-
try:
|
| 73 |
-
result = get_pipeline().analyze(
|
| 74 |
-
source=data,
|
| 75 |
-
run_caption=run_caption,
|
| 76 |
-
run_detection=run_detection,
|
| 77 |
-
run_ocr=run_ocr,
|
| 78 |
-
)
|
| 79 |
-
return _to_response(result)
|
| 80 |
-
except Exception as e:
|
| 81 |
-
raise HTTPException(status_code=500, detail=str(e))
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
# === INDIVIDUAL TASKS ===
|
| 85 |
-
|
| 86 |
-
@router.post("/caption", response_model=CaptionResponse, tags=["tasks"])
|
| 87 |
-
async def caption(url: str, prompt: str = None):
|
| 88 |
-
"""Generate deskripsi teks dari gambar."""
|
| 89 |
-
try:
|
| 90 |
-
from ..processors.image_preprocessor import ImagePreprocessor
|
| 91 |
-
image = ImagePreprocessor.load(url)
|
| 92 |
-
result = get_pipeline().captioner.caption(image, prompt=prompt)
|
| 93 |
-
return CaptionResponse(caption=result.caption, model=result.model)
|
| 94 |
-
except Exception as e:
|
| 95 |
-
raise HTTPException(500, detail=str(e))
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
@router.post("/detect", response_model=DetectionResponse, tags=["tasks"])
|
| 99 |
-
async def detect(url: str, conf: float = None):
|
| 100 |
-
"""Deteksi objek dalam gambar dengan YOLOv8."""
|
| 101 |
-
try:
|
| 102 |
-
result = get_pipeline().detect_objects(url, conf=conf)
|
| 103 |
-
return DetectionResponse(
|
| 104 |
-
detections=[d.to_dict() for d in result.detections],
|
| 105 |
-
count=result.count,
|
| 106 |
-
labels_summary=result.labels_summary,
|
| 107 |
-
image_width=result.image_width,
|
| 108 |
-
image_height=result.image_height,
|
| 109 |
-
inference_time_ms=result.inference_time_ms,
|
| 110 |
-
)
|
| 111 |
-
except Exception as e:
|
| 112 |
-
raise HTTPException(500, detail=str(e))
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
@router.post("/classify", response_model=ClassificationResponse, tags=["tasks"])
|
| 116 |
-
async def classify(req: ClassifyRequest):
|
| 117 |
-
"""
|
| 118 |
-
Zero-shot image classification dengan CLIP.
|
| 119 |
-
Tidak perlu training β cukup berikan daftar label kandidat.
|
| 120 |
-
"""
|
| 121 |
-
try:
|
| 122 |
-
result = get_pipeline().classify_image(req.url, req.labels)
|
| 123 |
-
return ClassificationResponse(
|
| 124 |
-
top_label=result.top_label,
|
| 125 |
-
top_score=result.top_score,
|
| 126 |
-
labels=result.labels,
|
| 127 |
-
probabilities=result.probabilities,
|
| 128 |
-
)
|
| 129 |
-
except Exception as e:
|
| 130 |
-
raise HTTPException(500, detail=str(e))
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
class OCRRequest(BaseModel):
|
| 134 |
-
url: str
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
@router.post("/ocr", response_model=OCRResponse, tags=["tasks"])
|
| 138 |
-
async def ocr(req: OCRRequest):
|
| 139 |
-
"""Ekstrak teks dari gambar menggunakan EasyOCR."""
|
| 140 |
-
try:
|
| 141 |
-
from ..processors.image_preprocessor import ImagePreprocessor
|
| 142 |
-
image = ImagePreprocessor.load(req.url)
|
| 143 |
-
result = get_pipeline().ocr.extract_text(image)
|
| 144 |
-
return OCRResponse(
|
| 145 |
-
full_text=result.full_text,
|
| 146 |
-
boxes=[b.to_dict() for b in result.boxes],
|
| 147 |
-
word_count=result.word_count,
|
| 148 |
-
language=result.language,
|
| 149 |
-
engine=result.engine,
|
| 150 |
-
)
|
| 151 |
-
except Exception as e:
|
| 152 |
-
raise HTTPException(500, detail=str(e))
|
| 153 |
-
|
| 154 |
-
|
| 155 |
-
@router.post("/similarity", response_model=SimilarityResponse, tags=["tasks"])
|
| 156 |
-
async def image_text_similarity(req: SimilarityRequest):
|
| 157 |
-
"""Hitung relevansi antara gambar dan teks (0.0 - 1.0)."""
|
| 158 |
-
try:
|
| 159 |
-
score = get_pipeline().image_text_similarity(req.url, req.text)
|
| 160 |
-
interpretation = (
|
| 161 |
-
"Sangat relevan" if score > 0.7
|
| 162 |
-
else "Cukup relevan" if score > 0.5
|
| 163 |
-
else "Kurang relevan"
|
| 164 |
-
)
|
| 165 |
-
return SimilarityResponse(
|
| 166 |
-
similarity_score=score,
|
| 167 |
-
text=req.text,
|
| 168 |
-
interpretation=interpretation,
|
| 169 |
-
)
|
| 170 |
-
except Exception as e:
|
| 171 |
-
raise HTTPException(500, detail=str(e))
|
| 172 |
-
|
| 173 |
-
|
| 174 |
-
@router.post("/visual-qa", response_model=VisualQAResponse, tags=["tasks"])
|
| 175 |
-
async def visual_qa(req: VisualQARequest):
|
| 176 |
-
"""Visual Question Answering β tanya tentang isi gambar."""
|
| 177 |
-
try:
|
| 178 |
-
answer = get_pipeline().visual_qa(req.url, req.question)
|
| 179 |
-
return VisualQAResponse(question=req.question, answer=answer)
|
| 180 |
-
except Exception as e:
|
| 181 |
-
raise HTTPException(500, detail=str(e))
|
| 182 |
-
|
| 183 |
-
|
| 184 |
-
@router.get("/annotate", tags=["tasks"])
|
| 185 |
-
async def annotate(url: str):
|
| 186 |
-
"""Return gambar dengan bounding box YOLO yang sudah digambar (JPEG)."""
|
| 187 |
-
try:
|
| 188 |
-
jpeg_bytes = get_pipeline().annotate_image(url)
|
| 189 |
-
return Response(content=jpeg_bytes, media_type="image/jpeg")
|
| 190 |
-
except Exception as e:
|
| 191 |
-
raise HTTPException(500, detail=str(e))
|
| 192 |
-
|
| 193 |
-
|
| 194 |
-
# === HELPER ===
|
| 195 |
-
|
| 196 |
-
def _to_response(result) -> FullAnalysisResponse:
|
| 197 |
-
"""Convert CVAnalysisResult ke FullAnalysisResponse."""
|
| 198 |
-
caption_r = None
|
| 199 |
-
if result.caption:
|
| 200 |
-
caption_r = CaptionResponse(
|
| 201 |
-
caption=result.caption.caption,
|
| 202 |
-
model=result.caption.model,
|
| 203 |
-
)
|
| 204 |
-
|
| 205 |
-
det_r = None
|
| 206 |
-
if result.detections:
|
| 207 |
-
det_r = DetectionResponse(
|
| 208 |
-
detections=[d.to_dict() for d in result.detections.detections],
|
| 209 |
-
count=result.detections.count,
|
| 210 |
-
labels_summary=result.detections.labels_summary,
|
| 211 |
-
image_width=result.detections.image_width,
|
| 212 |
-
image_height=result.detections.image_height,
|
| 213 |
-
inference_time_ms=result.detections.inference_time_ms,
|
| 214 |
-
)
|
| 215 |
-
|
| 216 |
-
cls_r = None
|
| 217 |
-
if result.classification:
|
| 218 |
-
cls_r = ClassificationResponse(
|
| 219 |
-
top_label=result.classification.top_label,
|
| 220 |
-
top_score=result.classification.top_score,
|
| 221 |
-
labels=result.classification.labels,
|
| 222 |
-
probabilities=result.classification.probabilities,
|
| 223 |
-
)
|
| 224 |
-
|
| 225 |
-
ocr_r = None
|
| 226 |
-
if result.ocr:
|
| 227 |
-
ocr_r = OCRResponse(
|
| 228 |
-
full_text=result.ocr.full_text,
|
| 229 |
-
boxes=[b.to_dict() for b in result.ocr.boxes],
|
| 230 |
-
word_count=result.ocr.word_count,
|
| 231 |
-
language=result.ocr.language,
|
| 232 |
-
engine=result.ocr.engine,
|
| 233 |
-
)
|
| 234 |
-
|
| 235 |
-
return FullAnalysisResponse(
|
| 236 |
-
image_width=result.image_width,
|
| 237 |
-
image_height=result.image_height,
|
| 238 |
-
source=result.source,
|
| 239 |
-
caption=caption_r,
|
| 240 |
-
detections=det_r,
|
| 241 |
-
classification=cls_r,
|
| 242 |
-
ocr=ocr_r,
|
| 243 |
-
summary_text=result.to_summary(),
|
| 244 |
-
models_used=result.models_used,
|
| 245 |
-
total_latency_ms=result.total_latency_ms,
|
| 246 |
-
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
cv/src/api/schemas.py
DELETED
|
@@ -1,110 +0,0 @@
|
|
| 1 |
-
from pydantic import BaseModel, Field, HttpUrl
|
| 2 |
-
from typing import List, Optional
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
# === Shared ===
|
| 6 |
-
|
| 7 |
-
class BBoxSchema(BaseModel):
|
| 8 |
-
x1: float
|
| 9 |
-
y1: float
|
| 10 |
-
x2: float
|
| 11 |
-
y2: float
|
| 12 |
-
width: float
|
| 13 |
-
height: float
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
class DetectionSchema(BaseModel):
|
| 17 |
-
label: str
|
| 18 |
-
confidence: float
|
| 19 |
-
bbox: BBoxSchema
|
| 20 |
-
class_id: int
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
class OCRBoxSchema(BaseModel):
|
| 24 |
-
text: str
|
| 25 |
-
confidence: float
|
| 26 |
-
bbox: list
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
# === Requests ===
|
| 30 |
-
|
| 31 |
-
class AnalyzeURLRequest(BaseModel):
|
| 32 |
-
url: str = Field(..., description="URL gambar yang akan dianalisis")
|
| 33 |
-
run_caption: bool = Field(True, description="Generate image caption")
|
| 34 |
-
run_detection: bool = Field(True, description="Deteksi objek dengan YOLO")
|
| 35 |
-
run_ocr: bool = Field(False, description="Ekstrak teks dari gambar")
|
| 36 |
-
classification_labels: Optional[List[str]] = Field(
|
| 37 |
-
None,
|
| 38 |
-
description="Label untuk zero-shot CLIP classification, e.g. ['kucing','anjing']",
|
| 39 |
-
example=["indoor", "outdoor", "nature", "city"],
|
| 40 |
-
)
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
class ClassifyRequest(BaseModel):
|
| 44 |
-
url: str
|
| 45 |
-
labels: List[str] = Field(..., min_length=2, description="Minimal 2 label kandidat")
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
class SimilarityRequest(BaseModel):
|
| 49 |
-
url: str
|
| 50 |
-
text: str = Field(..., min_length=1)
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
class VisualQARequest(BaseModel):
|
| 54 |
-
url: str
|
| 55 |
-
question: str = Field(..., description="Pertanyaan tentang isi gambar")
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
# === Responses ===
|
| 59 |
-
|
| 60 |
-
class CaptionResponse(BaseModel):
|
| 61 |
-
caption: str
|
| 62 |
-
model: str
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
class DetectionResponse(BaseModel):
|
| 66 |
-
detections: List[DetectionSchema]
|
| 67 |
-
count: int
|
| 68 |
-
labels_summary: dict
|
| 69 |
-
image_width: int
|
| 70 |
-
image_height: int
|
| 71 |
-
inference_time_ms: float
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
class ClassificationResponse(BaseModel):
|
| 75 |
-
top_label: str
|
| 76 |
-
top_score: float
|
| 77 |
-
labels: List[str]
|
| 78 |
-
probabilities: List[float]
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
class OCRResponse(BaseModel):
|
| 82 |
-
full_text: str
|
| 83 |
-
boxes: List[OCRBoxSchema]
|
| 84 |
-
word_count: int
|
| 85 |
-
language: str
|
| 86 |
-
engine: str
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
class FullAnalysisResponse(BaseModel):
|
| 90 |
-
image_width: int
|
| 91 |
-
image_height: int
|
| 92 |
-
source: str
|
| 93 |
-
caption: Optional[CaptionResponse] = None
|
| 94 |
-
detections: Optional[DetectionResponse] = None
|
| 95 |
-
classification: Optional[ClassificationResponse] = None
|
| 96 |
-
ocr: Optional[OCRResponse] = None
|
| 97 |
-
summary_text: str = Field(..., description="Ringkasan teks dari semua model β siap dipakai sebagai konteks LLM")
|
| 98 |
-
models_used: List[str]
|
| 99 |
-
total_latency_ms: float
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
class SimilarityResponse(BaseModel):
|
| 103 |
-
similarity_score: float
|
| 104 |
-
text: str
|
| 105 |
-
interpretation: str
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
class VisualQAResponse(BaseModel):
|
| 109 |
-
question: str
|
| 110 |
-
answer: str
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
cv/src/config.py
DELETED
|
@@ -1,55 +0,0 @@
|
|
| 1 |
-
from pydantic_settings import BaseSettings
|
| 2 |
-
from pydantic import Field
|
| 3 |
-
from functools import lru_cache
|
| 4 |
-
from pathlib import Path
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
class CVSettings(BaseSettings):
|
| 8 |
-
# Device
|
| 9 |
-
device: str = Field("cpu", env="CV_DEVICE") # "cpu" atau "cuda"
|
| 10 |
-
|
| 11 |
-
# CLIP
|
| 12 |
-
clip_model: str = Field("ViT-B-32", env="CLIP_MODEL")
|
| 13 |
-
clip_pretrained: str = Field("openai", env="CLIP_PRETRAINED")
|
| 14 |
-
|
| 15 |
-
# YOLO
|
| 16 |
-
yolo_model: str = Field("yolov8n.pt", env="YOLO_MODEL") # n=nano, s=small, m=medium
|
| 17 |
-
yolo_conf_threshold: float = Field(0.25, env="YOLO_CONF")
|
| 18 |
-
yolo_iou_threshold: float = Field(0.45, env="YOLO_IOU")
|
| 19 |
-
|
| 20 |
-
# Image Captioning
|
| 21 |
-
caption_model: str = Field(
|
| 22 |
-
"Salesforce/blip-image-captioning-base", env="CAPTION_MODEL"
|
| 23 |
-
)
|
| 24 |
-
|
| 25 |
-
# OCR
|
| 26 |
-
ocr_engine: str = Field("easyocr", env="OCR_ENGINE") # "easyocr" atau "tesseract"
|
| 27 |
-
ocr_languages: str = Field("en,id", env="OCR_LANGUAGES") # comma-separated
|
| 28 |
-
|
| 29 |
-
# API
|
| 30 |
-
api_host: str = Field("0.0.0.0", env="CV_API_HOST")
|
| 31 |
-
api_port: int = Field(8001, env="CV_API_PORT")
|
| 32 |
-
max_image_size_mb: float = Field(10.0, env="MAX_IMAGE_SIZE_MB")
|
| 33 |
-
|
| 34 |
-
# Storage
|
| 35 |
-
upload_dir: str = Field("./uploads", env="CV_UPLOAD_DIR")
|
| 36 |
-
models_cache_dir: str = Field("./model_cache", env="CV_MODELS_CACHE")
|
| 37 |
-
|
| 38 |
-
# MLflow
|
| 39 |
-
mlflow_tracking_uri: str = Field("./mlruns", env="MLFLOW_TRACKING_URI")
|
| 40 |
-
mlflow_experiment_name: str = Field("cv_pipeline", env="MLFLOW_CV_EXPERIMENT")
|
| 41 |
-
|
| 42 |
-
class Config:
|
| 43 |
-
env_file = ".env"
|
| 44 |
-
env_file_encoding = "utf-8"
|
| 45 |
-
|
| 46 |
-
def ensure_dirs(self):
|
| 47 |
-
for d in [self.upload_dir, self.models_cache_dir, "./logs", "./mlruns"]:
|
| 48 |
-
Path(d).mkdir(parents=True, exist_ok=True)
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
@lru_cache()
|
| 52 |
-
def get_cv_settings() -> CVSettings:
|
| 53 |
-
s = CVSettings()
|
| 54 |
-
s.ensure_dirs()
|
| 55 |
-
return s
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
cv/src/cv_pipeline.py
DELETED
|
@@ -1,246 +0,0 @@
|
|
| 1 |
-
from __future__ import annotations
|
| 2 |
-
|
| 3 |
-
import time
|
| 4 |
-
from typing import List, Optional, Union
|
| 5 |
-
from dataclasses import dataclass, field
|
| 6 |
-
from pathlib import Path
|
| 7 |
-
|
| 8 |
-
import mlflow
|
| 9 |
-
from loguru import logger
|
| 10 |
-
|
| 11 |
-
from .config import get_cv_settings
|
| 12 |
-
from .processors.image_preprocessor import ImagePreprocessor, ImageInput
|
| 13 |
-
from .models.clip_model import CLIPModel, CLIPResult
|
| 14 |
-
from .models.yolo_detector import YOLODetector, DetectionResult
|
| 15 |
-
from .models.captioner import ImageCaptioner, CaptionResult
|
| 16 |
-
from .processors.ocr_processor import OCRProcessor, OCRResult
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
@dataclass
|
| 20 |
-
class CVAnalysisResult:
|
| 21 |
-
"""Hasil lengkap analisis gambar dari semua model."""
|
| 22 |
-
|
| 23 |
-
# Info gambar
|
| 24 |
-
image_width: int = 0
|
| 25 |
-
image_height: int = 0
|
| 26 |
-
source: str = ""
|
| 27 |
-
|
| 28 |
-
# Per-model results (None jika tidak dijalankan)
|
| 29 |
-
caption: Optional[CaptionResult] = None
|
| 30 |
-
detections: Optional[DetectionResult] = None
|
| 31 |
-
classification: Optional[CLIPResult] = None
|
| 32 |
-
ocr: Optional[OCRResult] = None
|
| 33 |
-
|
| 34 |
-
# Metadata
|
| 35 |
-
models_used: List[str] = field(default_factory=list)
|
| 36 |
-
total_latency_ms: float = 0.0
|
| 37 |
-
|
| 38 |
-
def to_summary(self) -> str:
|
| 39 |
-
"""
|
| 40 |
-
Buat ringkasan teks dari hasil analisis.
|
| 41 |
-
Berguna sebagai input ke LLM (integrasi dengan RAG module).
|
| 42 |
-
"""
|
| 43 |
-
parts = []
|
| 44 |
-
|
| 45 |
-
if self.caption:
|
| 46 |
-
parts.append(f"Deskripsi gambar: {self.caption.caption}")
|
| 47 |
-
|
| 48 |
-
if self.detections and self.detections.count > 0:
|
| 49 |
-
summary = self.detections.labels_summary
|
| 50 |
-
items = ", ".join(f"{count}x {label}" for label, count in summary.items())
|
| 51 |
-
parts.append(f"Objek terdeteksi: {items}")
|
| 52 |
-
|
| 53 |
-
if self.classification:
|
| 54 |
-
parts.append(
|
| 55 |
-
f"Klasifikasi: {self.classification.top_label} "
|
| 56 |
-
f"(confidence: {self.classification.top_score:.1%})"
|
| 57 |
-
)
|
| 58 |
-
|
| 59 |
-
if self.ocr and self.ocr.full_text:
|
| 60 |
-
preview = self.ocr.full_text[:300]
|
| 61 |
-
if len(self.ocr.full_text) > 300:
|
| 62 |
-
preview += "..."
|
| 63 |
-
parts.append(f"Teks dalam gambar: {preview}")
|
| 64 |
-
|
| 65 |
-
return "\n".join(parts) if parts else "Tidak ada informasi yang bisa diekstrak."
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
class CVPipeline:
|
| 69 |
-
"""
|
| 70 |
-
Orchestrator untuk semua CV models.
|
| 71 |
-
Lazy loading β model hanya di-load saat pertama kali dipakai.
|
| 72 |
-
Support modular: bisa run satu atau semua model sekaligus.
|
| 73 |
-
"""
|
| 74 |
-
|
| 75 |
-
def __init__(self):
|
| 76 |
-
self.settings = get_cv_settings()
|
| 77 |
-
self._clip: Optional[CLIPModel] = None
|
| 78 |
-
self._yolo: Optional[YOLODetector] = None
|
| 79 |
-
self._captioner: Optional[ImageCaptioner] = None
|
| 80 |
-
self._ocr: Optional[OCRProcessor] = None
|
| 81 |
-
self._setup_mlflow()
|
| 82 |
-
logger.info("CVPipeline initialized (lazy loading).")
|
| 83 |
-
|
| 84 |
-
def _setup_mlflow(self):
|
| 85 |
-
mlflow.set_tracking_uri(self.settings.mlflow_tracking_uri)
|
| 86 |
-
mlflow.set_experiment(self.settings.mlflow_experiment_name)
|
| 87 |
-
|
| 88 |
-
# === Lazy loaders ===
|
| 89 |
-
|
| 90 |
-
@property
|
| 91 |
-
def clip(self) -> CLIPModel:
|
| 92 |
-
if self._clip is None:
|
| 93 |
-
self._clip = CLIPModel()
|
| 94 |
-
return self._clip
|
| 95 |
-
|
| 96 |
-
@property
|
| 97 |
-
def yolo(self) -> YOLODetector:
|
| 98 |
-
if self._yolo is None:
|
| 99 |
-
self._yolo = YOLODetector()
|
| 100 |
-
return self._yolo
|
| 101 |
-
|
| 102 |
-
@property
|
| 103 |
-
def captioner(self) -> ImageCaptioner:
|
| 104 |
-
if self._captioner is None:
|
| 105 |
-
self._captioner = ImageCaptioner()
|
| 106 |
-
return self._captioner
|
| 107 |
-
|
| 108 |
-
@property
|
| 109 |
-
def ocr(self) -> OCRProcessor:
|
| 110 |
-
if self._ocr is None:
|
| 111 |
-
self._ocr = OCRProcessor()
|
| 112 |
-
return self._ocr
|
| 113 |
-
|
| 114 |
-
# === Main analysis methods ===
|
| 115 |
-
|
| 116 |
-
def analyze(
|
| 117 |
-
self,
|
| 118 |
-
source: Union[str, bytes, Path],
|
| 119 |
-
run_caption: bool = True,
|
| 120 |
-
run_detection: bool = True,
|
| 121 |
-
run_ocr: bool = False,
|
| 122 |
-
classification_labels: Optional[List[str]] = None,
|
| 123 |
-
) -> CVAnalysisResult:
|
| 124 |
-
"""
|
| 125 |
-
Full pipeline analisis gambar.
|
| 126 |
-
|
| 127 |
-
Args:
|
| 128 |
-
source: Path, bytes, URL, atau base64 string
|
| 129 |
-
run_caption: Generate image caption dengan BLIP
|
| 130 |
-
run_detection: Deteksi objek dengan YOLO
|
| 131 |
-
run_ocr: Ekstrak teks dengan EasyOCR
|
| 132 |
-
classification_labels: Jika diisi, jalankan zero-shot CLIP classification
|
| 133 |
-
|
| 134 |
-
Returns:
|
| 135 |
-
CVAnalysisResult berisi semua hasil
|
| 136 |
-
"""
|
| 137 |
-
start = time.perf_counter()
|
| 138 |
-
image = ImagePreprocessor.load(source)
|
| 139 |
-
models_used = []
|
| 140 |
-
|
| 141 |
-
with mlflow.start_run(run_name="cv_analyze"):
|
| 142 |
-
mlflow.log_params({
|
| 143 |
-
"source": str(source)[:100],
|
| 144 |
-
"image_size": f"{image.width}x{image.height}",
|
| 145 |
-
"run_caption": run_caption,
|
| 146 |
-
"run_detection": run_detection,
|
| 147 |
-
"run_ocr": run_ocr,
|
| 148 |
-
})
|
| 149 |
-
|
| 150 |
-
result = CVAnalysisResult(
|
| 151 |
-
image_width=image.width,
|
| 152 |
-
image_height=image.height,
|
| 153 |
-
source=image.source,
|
| 154 |
-
)
|
| 155 |
-
|
| 156 |
-
# 1. Image Captioning
|
| 157 |
-
if run_caption:
|
| 158 |
-
t0 = time.perf_counter()
|
| 159 |
-
result.caption = self.captioner.caption(image)
|
| 160 |
-
models_used.append("BLIP-caption")
|
| 161 |
-
logger.debug(f"Caption: {(time.perf_counter()-t0)*1000:.0f}ms")
|
| 162 |
-
|
| 163 |
-
# 2. Object Detection
|
| 164 |
-
if run_detection:
|
| 165 |
-
t0 = time.perf_counter()
|
| 166 |
-
result.detections = self.yolo.detect(image)
|
| 167 |
-
models_used.append("YOLOv8")
|
| 168 |
-
logger.debug(f"Detection: {(time.perf_counter()-t0)*1000:.0f}ms")
|
| 169 |
-
|
| 170 |
-
# 3. Zero-shot Classification (opsional)
|
| 171 |
-
if classification_labels:
|
| 172 |
-
t0 = time.perf_counter()
|
| 173 |
-
result.classification = self.clip.classify(image, classification_labels)
|
| 174 |
-
models_used.append("CLIP")
|
| 175 |
-
logger.debug(f"CLIP: {(time.perf_counter()-t0)*1000:.0f}ms")
|
| 176 |
-
|
| 177 |
-
# 4. OCR (opsional, lebih berat)
|
| 178 |
-
if run_ocr:
|
| 179 |
-
t0 = time.perf_counter()
|
| 180 |
-
result.ocr = self.ocr.extract_text(image)
|
| 181 |
-
models_used.append("EasyOCR")
|
| 182 |
-
logger.debug(f"OCR: {(time.perf_counter()-t0)*1000:.0f}ms")
|
| 183 |
-
|
| 184 |
-
total_ms = (time.perf_counter() - start) * 1000
|
| 185 |
-
result.models_used = models_used
|
| 186 |
-
result.total_latency_ms = round(total_ms, 2)
|
| 187 |
-
|
| 188 |
-
mlflow.log_metrics({
|
| 189 |
-
"total_latency_ms": total_ms,
|
| 190 |
-
"objects_detected": result.detections.count if result.detections else 0,
|
| 191 |
-
"ocr_chars": len(result.ocr.full_text) if result.ocr else 0,
|
| 192 |
-
})
|
| 193 |
-
|
| 194 |
-
logger.info(
|
| 195 |
-
f"CV analysis done in {total_ms:.0f}ms | "
|
| 196 |
-
f"Models: {models_used} | "
|
| 197 |
-
f"Objects: {result.detections.count if result.detections else 0}"
|
| 198 |
-
)
|
| 199 |
-
return result
|
| 200 |
-
|
| 201 |
-
# === Individual task methods ===
|
| 202 |
-
|
| 203 |
-
def caption_image(self, source, prompt: str = None) -> str:
|
| 204 |
-
"""Shorthand: return caption string langsung."""
|
| 205 |
-
image = ImagePreprocessor.load(source)
|
| 206 |
-
return self.captioner.caption(image, prompt=prompt).caption
|
| 207 |
-
|
| 208 |
-
def detect_objects(self, source, conf: float = None) -> DetectionResult:
|
| 209 |
-
"""Shorthand: return DetectionResult."""
|
| 210 |
-
image = ImagePreprocessor.load(source)
|
| 211 |
-
return self.yolo.detect(image, conf_threshold=conf)
|
| 212 |
-
|
| 213 |
-
def classify_image(self, source, labels: List[str]) -> CLIPResult:
|
| 214 |
-
"""Shorthand: zero-shot CLIP classification."""
|
| 215 |
-
image = ImagePreprocessor.load(source)
|
| 216 |
-
return self.clip.classify(image, labels)
|
| 217 |
-
|
| 218 |
-
def extract_text(self, source) -> str:
|
| 219 |
-
"""Shorthand: return OCR text string."""
|
| 220 |
-
image = ImagePreprocessor.load(source)
|
| 221 |
-
return self.ocr.extract_text_simple(image)
|
| 222 |
-
|
| 223 |
-
def visual_qa(self, source, question: str) -> str:
|
| 224 |
-
"""Visual Question Answering: tanya tentang isi gambar."""
|
| 225 |
-
image = ImagePreprocessor.load(source)
|
| 226 |
-
return self.captioner.visual_qa(image, question).caption
|
| 227 |
-
|
| 228 |
-
def image_text_similarity(self, source, text: str) -> float:
|
| 229 |
-
"""Hitung seberapa relevan teks dengan gambar (0-1)."""
|
| 230 |
-
image = ImagePreprocessor.load(source)
|
| 231 |
-
return self.clip.compute_similarity(image, text)
|
| 232 |
-
|
| 233 |
-
def annotate_image(self, source) -> bytes:
|
| 234 |
-
"""
|
| 235 |
-
Return gambar dengan bounding box yang sudah digambar β untuk visualisasi.
|
| 236 |
-
Returns JPEG bytes.
|
| 237 |
-
"""
|
| 238 |
-
import io
|
| 239 |
-
from PIL import Image
|
| 240 |
-
|
| 241 |
-
image = ImagePreprocessor.load(source)
|
| 242 |
-
annotated = self.yolo.detect_and_annotate(image)
|
| 243 |
-
pil_annotated = Image.fromarray(annotated)
|
| 244 |
-
buf = io.BytesIO()
|
| 245 |
-
pil_annotated.save(buf, format="JPEG", quality=90)
|
| 246 |
-
return buf.getvalue()
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
cv/src/models/__init__.py
DELETED
|
File without changes
|
cv/src/models/captioner.py
DELETED
|
@@ -1,105 +0,0 @@
|
|
| 1 |
-
from __future__ import annotations
|
| 2 |
-
|
| 3 |
-
from dataclasses import dataclass
|
| 4 |
-
from loguru import logger
|
| 5 |
-
|
| 6 |
-
from ..config import get_cv_settings
|
| 7 |
-
from ..processors.image_preprocessor import ImageInput
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
@dataclass
|
| 11 |
-
class CaptionResult:
|
| 12 |
-
caption: str
|
| 13 |
-
model: str
|
| 14 |
-
confidence: float = 1.0
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
class ImageCaptioner:
|
| 18 |
-
"""
|
| 19 |
-
Image captioning menggunakan BLIP (Bootstrapped Language-Image Pre-training).
|
| 20 |
-
Model Salesforce/blip-image-captioning-base β ringan, akurat, bisa jalan di CPU.
|
| 21 |
-
|
| 22 |
-
Output: deskripsi teks natural dari gambar.
|
| 23 |
-
Berguna untuk: accessibility, content indexing, multimodal RAG.
|
| 24 |
-
"""
|
| 25 |
-
|
| 26 |
-
def __init__(self):
|
| 27 |
-
settings = get_cv_settings()
|
| 28 |
-
logger.info(f"Loading captioning model: {settings.caption_model}")
|
| 29 |
-
|
| 30 |
-
try:
|
| 31 |
-
from transformers import BlipProcessor, BlipForConditionalGeneration
|
| 32 |
-
import torch
|
| 33 |
-
|
| 34 |
-
self.device = settings.device
|
| 35 |
-
self.processor = BlipProcessor.from_pretrained(
|
| 36 |
-
settings.caption_model,
|
| 37 |
-
cache_dir=settings.models_cache_dir,
|
| 38 |
-
)
|
| 39 |
-
self.model = BlipForConditionalGeneration.from_pretrained(
|
| 40 |
-
settings.caption_model,
|
| 41 |
-
cache_dir=settings.models_cache_dir,
|
| 42 |
-
).to(self.device)
|
| 43 |
-
self.model.eval()
|
| 44 |
-
self.model_name = settings.caption_model
|
| 45 |
-
logger.info("Image captioner ready.")
|
| 46 |
-
|
| 47 |
-
except Exception as e:
|
| 48 |
-
logger.error(f"Gagal load captioning model: {e}")
|
| 49 |
-
raise
|
| 50 |
-
|
| 51 |
-
def caption(
|
| 52 |
-
self,
|
| 53 |
-
image: ImageInput,
|
| 54 |
-
prompt: str = None,
|
| 55 |
-
max_new_tokens: int = 100,
|
| 56 |
-
) -> CaptionResult:
|
| 57 |
-
"""
|
| 58 |
-
Generate caption untuk gambar.
|
| 59 |
-
|
| 60 |
-
Args:
|
| 61 |
-
image: ImageInput object
|
| 62 |
-
prompt: Optional β beri konteks/instruksi, e.g. "a photo of"
|
| 63 |
-
max_new_tokens: Panjang maksimum caption
|
| 64 |
-
|
| 65 |
-
Returns:
|
| 66 |
-
CaptionResult berisi caption string
|
| 67 |
-
"""
|
| 68 |
-
import torch
|
| 69 |
-
|
| 70 |
-
if prompt:
|
| 71 |
-
inputs = self.processor(
|
| 72 |
-
image.pil_image, prompt, return_tensors="pt"
|
| 73 |
-
).to(self.device)
|
| 74 |
-
else:
|
| 75 |
-
inputs = self.processor(
|
| 76 |
-
image.pil_image, return_tensors="pt"
|
| 77 |
-
).to(self.device)
|
| 78 |
-
|
| 79 |
-
with torch.no_grad():
|
| 80 |
-
output = self.model.generate(
|
| 81 |
-
**inputs,
|
| 82 |
-
max_new_tokens=max_new_tokens,
|
| 83 |
-
num_beams=4,
|
| 84 |
-
early_stopping=True,
|
| 85 |
-
)
|
| 86 |
-
|
| 87 |
-
caption = self.processor.decode(output[0], skip_special_tokens=True)
|
| 88 |
-
|
| 89 |
-
# Bersihkan prefix prompt dari output
|
| 90 |
-
if prompt and caption.lower().startswith(prompt.lower()):
|
| 91 |
-
caption = caption[len(prompt):].strip()
|
| 92 |
-
|
| 93 |
-
logger.debug(f"Caption: {caption}")
|
| 94 |
-
|
| 95 |
-
return CaptionResult(
|
| 96 |
-
caption=caption,
|
| 97 |
-
model=self.model_name,
|
| 98 |
-
)
|
| 99 |
-
|
| 100 |
-
def visual_qa(self, image: ImageInput, question: str) -> CaptionResult:
|
| 101 |
-
"""
|
| 102 |
-
Visual Question Answering β tanya tentang isi gambar.
|
| 103 |
-
Contoh: "What color is the car?" β "red"
|
| 104 |
-
"""
|
| 105 |
-
return self.caption(image, prompt=question, max_new_tokens=50)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
cv/src/models/clip_model.py
DELETED
|
@@ -1,150 +0,0 @@
|
|
| 1 |
-
from __future__ import annotations
|
| 2 |
-
|
| 3 |
-
from typing import List
|
| 4 |
-
from dataclasses import dataclass
|
| 5 |
-
from functools import lru_cache
|
| 6 |
-
|
| 7 |
-
import torch
|
| 8 |
-
import open_clip
|
| 9 |
-
from loguru import logger
|
| 10 |
-
|
| 11 |
-
from ..config import get_cv_settings
|
| 12 |
-
from ..processors.image_preprocessor import ImageInput
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
@dataclass
|
| 16 |
-
class CLIPResult:
|
| 17 |
-
"""Hasil dari CLIP model."""
|
| 18 |
-
# Zero-shot classification
|
| 19 |
-
labels: List[str] = None
|
| 20 |
-
probabilities: List[float] = None
|
| 21 |
-
top_label: str = ""
|
| 22 |
-
top_score: float = 0.0
|
| 23 |
-
|
| 24 |
-
# Image-text similarity
|
| 25 |
-
similarity_score: float = None
|
| 26 |
-
|
| 27 |
-
# Image features (untuk downstream tasks)
|
| 28 |
-
image_features: "torch.Tensor" = None
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
class CLIPModel:
|
| 32 |
-
"""
|
| 33 |
-
Wrapper CLIP menggunakan open_clip.
|
| 34 |
-
Capabilities:
|
| 35 |
-
- Zero-shot image classification (tanpa training!)
|
| 36 |
-
- Image-text similarity scoring
|
| 37 |
-
- Image feature extraction untuk retrieval
|
| 38 |
-
"""
|
| 39 |
-
|
| 40 |
-
def __init__(self):
|
| 41 |
-
settings = get_cv_settings()
|
| 42 |
-
self.device = settings.device
|
| 43 |
-
logger.info(f"Loading CLIP model: {settings.clip_model} ({settings.clip_pretrained})")
|
| 44 |
-
|
| 45 |
-
self.model, _, self.preprocess = open_clip.create_model_and_transforms(
|
| 46 |
-
settings.clip_model,
|
| 47 |
-
pretrained=settings.clip_pretrained,
|
| 48 |
-
device=self.device,
|
| 49 |
-
)
|
| 50 |
-
self.tokenizer = open_clip.get_tokenizer(settings.clip_model)
|
| 51 |
-
self.model.eval()
|
| 52 |
-
logger.info("CLIP model ready.")
|
| 53 |
-
|
| 54 |
-
@torch.no_grad()
|
| 55 |
-
def classify(self, image: ImageInput, labels: List[str]) -> CLIPResult:
|
| 56 |
-
"""
|
| 57 |
-
Zero-shot classification β tentukan kategori gambar dari daftar label.
|
| 58 |
-
Tidak perlu training sama sekali!
|
| 59 |
-
|
| 60 |
-
Args:
|
| 61 |
-
image: ImageInput object
|
| 62 |
-
labels: List label kandidat, e.g. ["kucing", "anjing", "burung"]
|
| 63 |
-
|
| 64 |
-
Returns:
|
| 65 |
-
CLIPResult dengan probabilitas tiap label
|
| 66 |
-
"""
|
| 67 |
-
# Preprocess image
|
| 68 |
-
img_tensor = self.preprocess(image.pil_image).unsqueeze(0).to(self.device)
|
| 69 |
-
|
| 70 |
-
# Tokenize labels
|
| 71 |
-
text_tokens = self.tokenizer(labels).to(self.device)
|
| 72 |
-
|
| 73 |
-
# Compute features
|
| 74 |
-
image_features = self.model.encode_image(img_tensor)
|
| 75 |
-
text_features = self.model.encode_text(text_tokens)
|
| 76 |
-
|
| 77 |
-
# Normalize
|
| 78 |
-
image_features /= image_features.norm(dim=-1, keepdim=True)
|
| 79 |
-
text_features /= text_features.norm(dim=-1, keepdim=True)
|
| 80 |
-
|
| 81 |
-
# Compute similarity (cosine similarity β softmax β probabilities)
|
| 82 |
-
logits = (100.0 * image_features @ text_features.T).softmax(dim=-1)
|
| 83 |
-
probs = logits[0].cpu().numpy().tolist()
|
| 84 |
-
|
| 85 |
-
top_idx = int(torch.argmax(logits[0]).item())
|
| 86 |
-
|
| 87 |
-
return CLIPResult(
|
| 88 |
-
labels=labels,
|
| 89 |
-
probabilities=[round(p, 4) for p in probs],
|
| 90 |
-
top_label=labels[top_idx],
|
| 91 |
-
top_score=round(probs[top_idx], 4),
|
| 92 |
-
)
|
| 93 |
-
|
| 94 |
-
@torch.no_grad()
|
| 95 |
-
def compute_similarity(self, image: ImageInput, text: str) -> float:
|
| 96 |
-
"""
|
| 97 |
-
Hitung seberapa relevan teks dengan gambar (0.0 - 1.0).
|
| 98 |
-
Berguna untuk: image search, content moderation, caption scoring.
|
| 99 |
-
"""
|
| 100 |
-
img_tensor = self.preprocess(image.pil_image).unsqueeze(0).to(self.device)
|
| 101 |
-
text_tokens = self.tokenizer([text]).to(self.device)
|
| 102 |
-
|
| 103 |
-
img_feat = self.model.encode_image(img_tensor)
|
| 104 |
-
txt_feat = self.model.encode_text(text_tokens)
|
| 105 |
-
|
| 106 |
-
img_feat /= img_feat.norm(dim=-1, keepdim=True)
|
| 107 |
-
txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
|
| 108 |
-
|
| 109 |
-
similarity = (img_feat @ txt_feat.T).item()
|
| 110 |
-
|
| 111 |
-
# Normalize ke 0-1 (CLIP output biasanya -1 to 1)
|
| 112 |
-
return round((similarity + 1) / 2, 4)
|
| 113 |
-
|
| 114 |
-
@torch.no_grad()
|
| 115 |
-
def extract_features(self, image: ImageInput) -> "torch.Tensor":
|
| 116 |
-
"""
|
| 117 |
-
Ekstrak image embedding untuk semantic image search / clustering.
|
| 118 |
-
Output: tensor shape (512,) untuk ViT-B-32
|
| 119 |
-
"""
|
| 120 |
-
img_tensor = self.preprocess(image.pil_image).unsqueeze(0).to(self.device)
|
| 121 |
-
features = self.model.encode_image(img_tensor)
|
| 122 |
-
features /= features.norm(dim=-1, keepdim=True)
|
| 123 |
-
return features[0].cpu()
|
| 124 |
-
|
| 125 |
-
@torch.no_grad()
|
| 126 |
-
def rank_images_by_text(
|
| 127 |
-
self,
|
| 128 |
-
images: List[ImageInput],
|
| 129 |
-
query_text: str,
|
| 130 |
-
) -> List[tuple[int, float]]:
|
| 131 |
-
"""
|
| 132 |
-
Rank multiple images berdasarkan relevansi dengan teks query.
|
| 133 |
-
Returns: list of (original_index, score) sorted by score desc.
|
| 134 |
-
Berguna untuk: text-to-image search.
|
| 135 |
-
"""
|
| 136 |
-
tensors = torch.stack([
|
| 137 |
-
self.preprocess(img.pil_image) for img in images
|
| 138 |
-
]).to(self.device)
|
| 139 |
-
|
| 140 |
-
text_tokens = self.tokenizer([query_text]).to(self.device)
|
| 141 |
-
|
| 142 |
-
img_features = self.model.encode_image(tensors)
|
| 143 |
-
txt_features = self.model.encode_text(text_tokens)
|
| 144 |
-
|
| 145 |
-
img_features /= img_features.norm(dim=-1, keepdim=True)
|
| 146 |
-
txt_features /= txt_features.norm(dim=-1, keepdim=True)
|
| 147 |
-
|
| 148 |
-
scores = (img_features @ txt_features.T).squeeze(1).cpu().numpy()
|
| 149 |
-
ranked = sorted(enumerate(scores.tolist()), key=lambda x: x[1], reverse=True)
|
| 150 |
-
return [(idx, round(score, 4)) for idx, score in ranked]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
cv/src/models/yolo_detector.py
DELETED
|
@@ -1,208 +0,0 @@
|
|
| 1 |
-
from __future__ import annotations
|
| 2 |
-
|
| 3 |
-
from typing import List
|
| 4 |
-
from dataclasses import dataclass, field
|
| 5 |
-
|
| 6 |
-
import numpy as np
|
| 7 |
-
from loguru import logger
|
| 8 |
-
|
| 9 |
-
from ..config import get_cv_settings
|
| 10 |
-
from ..processors.image_preprocessor import ImageInput
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
@dataclass
|
| 14 |
-
class BoundingBox:
|
| 15 |
-
x1: float
|
| 16 |
-
y1: float
|
| 17 |
-
x2: float
|
| 18 |
-
y2: float
|
| 19 |
-
|
| 20 |
-
@property
|
| 21 |
-
def width(self) -> float:
|
| 22 |
-
return self.x2 - self.x1
|
| 23 |
-
|
| 24 |
-
@property
|
| 25 |
-
def height(self) -> float:
|
| 26 |
-
return self.y2 - self.y1
|
| 27 |
-
|
| 28 |
-
@property
|
| 29 |
-
def area(self) -> float:
|
| 30 |
-
return self.width * self.height
|
| 31 |
-
|
| 32 |
-
def to_dict(self) -> dict:
|
| 33 |
-
return {
|
| 34 |
-
"x1": round(self.x1, 1), "y1": round(self.y1, 1),
|
| 35 |
-
"x2": round(self.x2, 1), "y2": round(self.y2, 1),
|
| 36 |
-
"width": round(self.width, 1), "height": round(self.height, 1),
|
| 37 |
-
}
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
@dataclass
|
| 41 |
-
class Detection:
|
| 42 |
-
label: str
|
| 43 |
-
confidence: float
|
| 44 |
-
bbox: BoundingBox
|
| 45 |
-
class_id: int
|
| 46 |
-
|
| 47 |
-
def to_dict(self) -> dict:
|
| 48 |
-
return {
|
| 49 |
-
"label": self.label,
|
| 50 |
-
"confidence": round(self.confidence, 4),
|
| 51 |
-
"bbox": self.bbox.to_dict(),
|
| 52 |
-
"class_id": self.class_id,
|
| 53 |
-
}
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
@dataclass
|
| 57 |
-
class DetectionResult:
|
| 58 |
-
detections: List[Detection] = field(default_factory=list)
|
| 59 |
-
image_width: int = 0
|
| 60 |
-
image_height: int = 0
|
| 61 |
-
model_name: str = ""
|
| 62 |
-
inference_time_ms: float = 0.0
|
| 63 |
-
|
| 64 |
-
@property
|
| 65 |
-
def count(self) -> int:
|
| 66 |
-
return len(self.detections)
|
| 67 |
-
|
| 68 |
-
@property
|
| 69 |
-
def labels_summary(self) -> dict[str, int]:
|
| 70 |
-
"""Ringkasan: {label: count}"""
|
| 71 |
-
summary = {}
|
| 72 |
-
for d in self.detections:
|
| 73 |
-
summary[d.label] = summary.get(d.label, 0) + 1
|
| 74 |
-
return summary
|
| 75 |
-
|
| 76 |
-
def filter_by_label(self, label: str) -> List[Detection]:
|
| 77 |
-
return [d for d in self.detections if d.label.lower() == label.lower()]
|
| 78 |
-
|
| 79 |
-
def filter_by_confidence(self, min_conf: float) -> List[Detection]:
|
| 80 |
-
return [d for d in self.detections if d.confidence >= min_conf]
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
class YOLODetector:
|
| 84 |
-
"""
|
| 85 |
-
Object detection menggunakan YOLOv8 (Ultralytics).
|
| 86 |
-
Model: yolov8n (nano, cepat) β yolov8m (medium, akurat)
|
| 87 |
-
80 kelas COCO default, bisa di-finetune untuk domain spesifik.
|
| 88 |
-
"""
|
| 89 |
-
|
| 90 |
-
def __init__(self):
|
| 91 |
-
settings = get_cv_settings()
|
| 92 |
-
logger.info(f"Loading YOLO model: {settings.yolo_model}")
|
| 93 |
-
|
| 94 |
-
try:
|
| 95 |
-
from ultralytics import YOLO
|
| 96 |
-
self.model = YOLO(settings.yolo_model)
|
| 97 |
-
except Exception as e:
|
| 98 |
-
logger.error(f"Gagal load YOLO: {e}")
|
| 99 |
-
raise
|
| 100 |
-
|
| 101 |
-
self.conf_threshold = settings.yolo_conf_threshold
|
| 102 |
-
self.iou_threshold = settings.yolo_iou_threshold
|
| 103 |
-
self.model_name = settings.yolo_model
|
| 104 |
-
logger.info("YOLO detector ready.")
|
| 105 |
-
|
| 106 |
-
def detect(
|
| 107 |
-
self,
|
| 108 |
-
image: ImageInput,
|
| 109 |
-
conf_threshold: float = None,
|
| 110 |
-
classes: List[int] = None,
|
| 111 |
-
) -> DetectionResult:
|
| 112 |
-
"""
|
| 113 |
-
Deteksi objek dalam gambar.
|
| 114 |
-
|
| 115 |
-
Args:
|
| 116 |
-
image: ImageInput object
|
| 117 |
-
conf_threshold: Override confidence threshold (default dari config)
|
| 118 |
-
classes: Filter kelas spesifik (COCO class IDs), None = semua kelas
|
| 119 |
-
|
| 120 |
-
Returns:
|
| 121 |
-
DetectionResult berisi semua deteksi
|
| 122 |
-
"""
|
| 123 |
-
import time
|
| 124 |
-
conf = conf_threshold or self.conf_threshold
|
| 125 |
-
|
| 126 |
-
start = time.perf_counter()
|
| 127 |
-
results = self.model.predict(
|
| 128 |
-
source=image.numpy,
|
| 129 |
-
conf=conf,
|
| 130 |
-
iou=self.iou_threshold,
|
| 131 |
-
classes=classes,
|
| 132 |
-
verbose=False,
|
| 133 |
-
)
|
| 134 |
-
elapsed_ms = (time.perf_counter() - start) * 1000
|
| 135 |
-
|
| 136 |
-
detections = []
|
| 137 |
-
if results and results[0].boxes is not None:
|
| 138 |
-
boxes = results[0].boxes
|
| 139 |
-
for i in range(len(boxes)):
|
| 140 |
-
x1, y1, x2, y2 = boxes.xyxy[i].cpu().numpy()
|
| 141 |
-
conf_val = float(boxes.conf[i].cpu().numpy())
|
| 142 |
-
cls_id = int(boxes.cls[i].cpu().numpy())
|
| 143 |
-
label = self.model.names[cls_id]
|
| 144 |
-
|
| 145 |
-
detections.append(Detection(
|
| 146 |
-
label=label,
|
| 147 |
-
confidence=conf_val,
|
| 148 |
-
bbox=BoundingBox(x1=x1, y1=y1, x2=x2, y2=y2),
|
| 149 |
-
class_id=cls_id,
|
| 150 |
-
))
|
| 151 |
-
|
| 152 |
-
logger.debug(
|
| 153 |
-
f"Detected {len(detections)} objects in {elapsed_ms:.1f}ms | "
|
| 154 |
-
f"Summary: {DetectionResult(detections=detections).labels_summary}"
|
| 155 |
-
)
|
| 156 |
-
|
| 157 |
-
return DetectionResult(
|
| 158 |
-
detections=detections,
|
| 159 |
-
image_width=image.width,
|
| 160 |
-
image_height=image.height,
|
| 161 |
-
model_name=self.model_name,
|
| 162 |
-
inference_time_ms=round(elapsed_ms, 2),
|
| 163 |
-
)
|
| 164 |
-
|
| 165 |
-
def detect_and_annotate(self, image: ImageInput, **kwargs) -> "np.ndarray":
|
| 166 |
-
"""
|
| 167 |
-
Detect dan return gambar dengan bounding box yang sudah digambar.
|
| 168 |
-
Berguna untuk visualisasi / demo.
|
| 169 |
-
"""
|
| 170 |
-
import cv2
|
| 171 |
-
|
| 172 |
-
result_img = image.numpy.copy()
|
| 173 |
-
det_result = self.detect(image, **kwargs)
|
| 174 |
-
|
| 175 |
-
for det in det_result.detections:
|
| 176 |
-
bb = det.bbox
|
| 177 |
-
x1, y1, x2, y2 = int(bb.x1), int(bb.y1), int(bb.x2), int(bb.y2)
|
| 178 |
-
|
| 179 |
-
# Warna berdasarkan class_id
|
| 180 |
-
color = self._get_color(det.class_id)
|
| 181 |
-
|
| 182 |
-
# Gambar bounding box
|
| 183 |
-
cv2.rectangle(result_img, (x1, y1), (x2, y2), color, 2)
|
| 184 |
-
|
| 185 |
-
# Label background + text
|
| 186 |
-
label_text = f"{det.label} {det.confidence:.2f}"
|
| 187 |
-
(tw, th), _ = cv2.getTextSize(label_text, cv2.FONT_HERSHEY_SIMPLEX, 0.6, 1)
|
| 188 |
-
cv2.rectangle(result_img, (x1, y1 - th - 8), (x1 + tw + 4, y1), color, -1)
|
| 189 |
-
cv2.putText(result_img, label_text, (x1 + 2, y1 - 4),
|
| 190 |
-
cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 255), 1)
|
| 191 |
-
|
| 192 |
-
return result_img
|
| 193 |
-
|
| 194 |
-
@staticmethod
|
| 195 |
-
def _get_color(class_id: int) -> tuple[int, int, int]:
|
| 196 |
-
"""Generate warna konsisten per class_id."""
|
| 197 |
-
palette = [
|
| 198 |
-
(255, 56, 56), (255, 157, 151), (255, 112, 31), (255, 178, 29),
|
| 199 |
-
(207, 210, 49), (72, 249, 10), (146, 204, 23), (61, 219, 134),
|
| 200 |
-
(26, 147, 52), (0, 212, 187), (44, 153, 168), (0, 194, 255),
|
| 201 |
-
(52, 69, 147), (100, 115, 255), (0, 24, 236), (132, 56, 255),
|
| 202 |
-
]
|
| 203 |
-
return palette[class_id % len(palette)]
|
| 204 |
-
|
| 205 |
-
@property
|
| 206 |
-
def available_classes(self) -> dict[int, str]:
|
| 207 |
-
"""Return dict semua kelas yang bisa dideteksi."""
|
| 208 |
-
return self.model.names
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
cv/src/processors/__init__.py
DELETED
|
File without changes
|
cv/src/processors/image_preprocessor.py
DELETED
|
@@ -1,154 +0,0 @@
|
|
| 1 |
-
from __future__ import annotations
|
| 2 |
-
|
| 3 |
-
import io
|
| 4 |
-
import base64
|
| 5 |
-
from pathlib import Path
|
| 6 |
-
from typing import Union
|
| 7 |
-
from dataclasses import dataclass, field
|
| 8 |
-
|
| 9 |
-
import numpy as np
|
| 10 |
-
from PIL import Image, ExifTags
|
| 11 |
-
from loguru import logger
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
@dataclass
|
| 15 |
-
class ImageInput:
|
| 16 |
-
"""Normalized image container β semua sumber dikonversi ke sini."""
|
| 17 |
-
pil_image: Image.Image
|
| 18 |
-
original_size: tuple[int, int] # (width, height)
|
| 19 |
-
source: str = "unknown"
|
| 20 |
-
filename: str = ""
|
| 21 |
-
format: str = "RGB"
|
| 22 |
-
metadata: dict = field(default_factory=dict)
|
| 23 |
-
|
| 24 |
-
@property
|
| 25 |
-
def width(self) -> int:
|
| 26 |
-
return self.pil_image.width
|
| 27 |
-
|
| 28 |
-
@property
|
| 29 |
-
def height(self) -> int:
|
| 30 |
-
return self.pil_image.height
|
| 31 |
-
|
| 32 |
-
@property
|
| 33 |
-
def numpy(self) -> np.ndarray:
|
| 34 |
-
"""Return as HWC uint8 numpy array (untuk OpenCV/YOLO)."""
|
| 35 |
-
return np.array(self.pil_image)
|
| 36 |
-
|
| 37 |
-
def to_base64(self) -> str:
|
| 38 |
-
"""Konversi ke base64 string untuk response API."""
|
| 39 |
-
buf = io.BytesIO()
|
| 40 |
-
self.pil_image.save(buf, format="JPEG", quality=85)
|
| 41 |
-
return base64.b64encode(buf.getvalue()).decode()
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
class ImagePreprocessor:
|
| 45 |
-
"""
|
| 46 |
-
Handle semua bentuk input gambar:
|
| 47 |
-
- File path (str / Path)
|
| 48 |
-
- Raw bytes (dari upload)
|
| 49 |
-
- Base64 string
|
| 50 |
-
- URL (via HTTP)
|
| 51 |
-
- PIL Image langsung
|
| 52 |
-
"""
|
| 53 |
-
|
| 54 |
-
MAX_SIZE = (1920, 1920)
|
| 55 |
-
|
| 56 |
-
@classmethod
|
| 57 |
-
def load(cls, source: Union[str, bytes, Path, Image.Image]) -> ImageInput:
|
| 58 |
-
"""Auto-detect tipe input dan load sebagai ImageInput."""
|
| 59 |
-
if isinstance(source, Image.Image):
|
| 60 |
-
return cls._from_pil(source, source_name="pil_direct")
|
| 61 |
-
|
| 62 |
-
if isinstance(source, bytes):
|
| 63 |
-
return cls._from_bytes(source)
|
| 64 |
-
|
| 65 |
-
if isinstance(source, Path) or (isinstance(source, str) and not source.startswith(("http", "data:"))):
|
| 66 |
-
return cls._from_file(str(source))
|
| 67 |
-
|
| 68 |
-
if isinstance(source, str) and source.startswith("data:image"):
|
| 69 |
-
return cls._from_base64(source)
|
| 70 |
-
|
| 71 |
-
if isinstance(source, str) and source.startswith(("http://", "https://")):
|
| 72 |
-
return cls._from_url(source)
|
| 73 |
-
|
| 74 |
-
raise ValueError(f"Tipe input tidak dikenali: {type(source)}")
|
| 75 |
-
|
| 76 |
-
@classmethod
|
| 77 |
-
def _from_file(cls, path: str) -> ImageInput:
|
| 78 |
-
p = Path(path)
|
| 79 |
-
if not p.exists():
|
| 80 |
-
raise FileNotFoundError(f"Gambar tidak ditemukan: {path}")
|
| 81 |
-
img = Image.open(p)
|
| 82 |
-
img = cls._normalize(img)
|
| 83 |
-
logger.debug(f"Loaded image from file: {p.name} ({img.width}x{img.height})")
|
| 84 |
-
return ImageInput(
|
| 85 |
-
pil_image=img,
|
| 86 |
-
original_size=(img.width, img.height),
|
| 87 |
-
source="file",
|
| 88 |
-
filename=p.name,
|
| 89 |
-
metadata={"path": str(p), "format": p.suffix},
|
| 90 |
-
)
|
| 91 |
-
|
| 92 |
-
@classmethod
|
| 93 |
-
def _from_bytes(cls, data: bytes, filename: str = "upload") -> ImageInput:
|
| 94 |
-
img = Image.open(io.BytesIO(data))
|
| 95 |
-
original_size = (img.width, img.height)
|
| 96 |
-
img = cls._normalize(img)
|
| 97 |
-
return ImageInput(
|
| 98 |
-
pil_image=img,
|
| 99 |
-
original_size=original_size,
|
| 100 |
-
source="bytes",
|
| 101 |
-
filename=filename,
|
| 102 |
-
metadata={"size_bytes": len(data)},
|
| 103 |
-
)
|
| 104 |
-
|
| 105 |
-
@classmethod
|
| 106 |
-
def _from_base64(cls, b64_str: str) -> ImageInput:
|
| 107 |
-
# Strip data URI prefix jika ada
|
| 108 |
-
if "," in b64_str:
|
| 109 |
-
b64_str = b64_str.split(",", 1)[1]
|
| 110 |
-
data = base64.b64decode(b64_str)
|
| 111 |
-
return cls._from_bytes(data, filename="base64_input")
|
| 112 |
-
|
| 113 |
-
@classmethod
|
| 114 |
-
def _from_url(cls, url: str) -> ImageInput:
|
| 115 |
-
import httpx
|
| 116 |
-
logger.debug(f"Fetching image from URL: {url}")
|
| 117 |
-
r = httpx.get(url, timeout=15, follow_redirects=True)
|
| 118 |
-
r.raise_for_status()
|
| 119 |
-
img_input = cls._from_bytes(r.content, filename=url.split("/")[-1] or "url_image")
|
| 120 |
-
img_input.source = "url"
|
| 121 |
-
img_input.metadata["url"] = url
|
| 122 |
-
return img_input
|
| 123 |
-
|
| 124 |
-
@classmethod
|
| 125 |
-
def _from_pil(cls, img: Image.Image, source_name: str = "pil") -> ImageInput:
|
| 126 |
-
original_size = (img.width, img.height)
|
| 127 |
-
img = cls._normalize(img)
|
| 128 |
-
return ImageInput(pil_image=img, original_size=original_size, source=source_name)
|
| 129 |
-
|
| 130 |
-
@classmethod
|
| 131 |
-
def _normalize(cls, img: Image.Image) -> Image.Image:
|
| 132 |
-
"""Convert ke RGB, fix EXIF rotation, resize jika terlalu besar."""
|
| 133 |
-
# Fix EXIF orientation
|
| 134 |
-
try:
|
| 135 |
-
exif = img._getexif()
|
| 136 |
-
if exif:
|
| 137 |
-
for tag, val in exif.items():
|
| 138 |
-
if ExifTags.TAGS.get(tag) == "Orientation":
|
| 139 |
-
rotations = {3: 180, 6: 270, 8: 90}
|
| 140 |
-
if val in rotations:
|
| 141 |
-
img = img.rotate(rotations[val], expand=True)
|
| 142 |
-
except Exception:
|
| 143 |
-
pass
|
| 144 |
-
|
| 145 |
-
# Convert ke RGB
|
| 146 |
-
if img.mode != "RGB":
|
| 147 |
-
img = img.convert("RGB")
|
| 148 |
-
|
| 149 |
-
# Resize jika melebihi batas
|
| 150 |
-
if img.width > cls.MAX_SIZE[0] or img.height > cls.MAX_SIZE[1]:
|
| 151 |
-
img.thumbnail(cls.MAX_SIZE, Image.LANCZOS)
|
| 152 |
-
logger.debug(f"Resized image to {img.width}x{img.height}")
|
| 153 |
-
|
| 154 |
-
return img
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
cv/src/processors/ocr_processor.py
DELETED
|
@@ -1,235 +0,0 @@
|
|
| 1 |
-
from __future__ import annotations
|
| 2 |
-
|
| 3 |
-
from typing import List
|
| 4 |
-
from dataclasses import dataclass, field
|
| 5 |
-
from loguru import logger
|
| 6 |
-
|
| 7 |
-
import numpy as np
|
| 8 |
-
|
| 9 |
-
from ..config import get_cv_settings
|
| 10 |
-
from ..processors.image_preprocessor import ImageInput
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
@dataclass
|
| 14 |
-
class OCRBox:
|
| 15 |
-
text: str
|
| 16 |
-
confidence: float
|
| 17 |
-
bbox: list # [[x1,y1],[x2,y1],[x2,y2],[x1,y2]] format EasyOCR
|
| 18 |
-
|
| 19 |
-
def to_dict(self) -> dict:
|
| 20 |
-
return {
|
| 21 |
-
"text": self.text,
|
| 22 |
-
"confidence": round(self.confidence, 4),
|
| 23 |
-
"bbox": self.bbox,
|
| 24 |
-
}
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
@dataclass
|
| 28 |
-
class OCRResult:
|
| 29 |
-
full_text: str
|
| 30 |
-
boxes: List[OCRBox] = field(default_factory=list)
|
| 31 |
-
language: str = ""
|
| 32 |
-
engine: str = ""
|
| 33 |
-
|
| 34 |
-
@property
|
| 35 |
-
def word_count(self) -> int:
|
| 36 |
-
return len(self.full_text.split())
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
class OCRProcessor:
|
| 40 |
-
"""
|
| 41 |
-
OCR menggunakan EasyOCR dengan mode stabil (single-pass ringan).
|
| 42 |
-
Fokus: tidak crash di Docker + tetap improve akurasi.
|
| 43 |
-
"""
|
| 44 |
-
|
| 45 |
-
MIN_CONFIDENCE = 0.10
|
| 46 |
-
MIN_OCR_DIM = 800
|
| 47 |
-
|
| 48 |
-
def __init__(self):
|
| 49 |
-
settings = get_cv_settings()
|
| 50 |
-
self.engine = settings.ocr_engine
|
| 51 |
-
self.languages = [l.strip() for l in settings.ocr_languages.split(",")]
|
| 52 |
-
logger.info(f"Loading OCR ({self.engine}) for languages: {self.languages}")
|
| 53 |
-
|
| 54 |
-
try:
|
| 55 |
-
import easyocr
|
| 56 |
-
self.reader = easyocr.Reader(
|
| 57 |
-
self.languages,
|
| 58 |
-
gpu=(settings.device == "cuda"),
|
| 59 |
-
model_storage_directory=settings.models_cache_dir,
|
| 60 |
-
)
|
| 61 |
-
logger.info("OCR processor ready.")
|
| 62 |
-
except Exception as e:
|
| 63 |
-
logger.error(f"Gagal init OCR: {e}")
|
| 64 |
-
raise
|
| 65 |
-
|
| 66 |
-
def _preprocess_for_ocr(self, img: np.ndarray) -> np.ndarray:
|
| 67 |
-
"""
|
| 68 |
-
Preprocessing ringan:
|
| 69 |
-
- upscale (jika kecil)
|
| 70 |
-
- grayscale
|
| 71 |
-
- CLAHE contrast enhancement
|
| 72 |
-
- light sharpen
|
| 73 |
-
"""
|
| 74 |
-
try:
|
| 75 |
-
import cv2
|
| 76 |
-
|
| 77 |
-
h, w = img.shape[:2]
|
| 78 |
-
if max(h, w) < self.MIN_OCR_DIM:
|
| 79 |
-
scale = self.MIN_OCR_DIM / max(h, w)
|
| 80 |
-
new_w, new_h = int(w * scale), int(h * scale)
|
| 81 |
-
img = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_CUBIC)
|
| 82 |
-
|
| 83 |
-
if len(img.shape) == 3:
|
| 84 |
-
gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
|
| 85 |
-
else:
|
| 86 |
-
gray = img.copy()
|
| 87 |
-
|
| 88 |
-
clahe = cv2.createCLAHE(clipLimit=2.5, tileGridSize=(8, 8))
|
| 89 |
-
enhanced = clahe.apply(gray)
|
| 90 |
-
|
| 91 |
-
kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
|
| 92 |
-
sharpened = cv2.filter2D(enhanced, -1, kernel)
|
| 93 |
-
|
| 94 |
-
return cv2.cvtColor(sharpened, cv2.COLOR_GRAY2RGB)
|
| 95 |
-
except Exception as e:
|
| 96 |
-
logger.warning(f"OCR preprocessing fallback to original image: {e}")
|
| 97 |
-
return img
|
| 98 |
-
|
| 99 |
-
def _parse_results(self, raw_results: List) -> List[OCRBox]:
|
| 100 |
-
boxes = []
|
| 101 |
-
for item in raw_results:
|
| 102 |
-
if len(item) == 3:
|
| 103 |
-
bbox, text, confidence = item
|
| 104 |
-
elif len(item) == 2:
|
| 105 |
-
bbox, text = item
|
| 106 |
-
confidence = 0.8
|
| 107 |
-
else:
|
| 108 |
-
continue
|
| 109 |
-
|
| 110 |
-
text = str(text).strip()
|
| 111 |
-
if not text or confidence < self.MIN_CONFIDENCE:
|
| 112 |
-
continue
|
| 113 |
-
|
| 114 |
-
# Convert numpy scalars/arrays to native Python types for FastAPI/Pydantic serialization
|
| 115 |
-
safe_bbox = []
|
| 116 |
-
try:
|
| 117 |
-
for pt in bbox:
|
| 118 |
-
if isinstance(pt, (list, tuple)) and len(pt) >= 2:
|
| 119 |
-
safe_bbox.append([float(pt[0]), float(pt[1])])
|
| 120 |
-
else:
|
| 121 |
-
safe_bbox.append(pt)
|
| 122 |
-
except Exception:
|
| 123 |
-
safe_bbox = bbox
|
| 124 |
-
|
| 125 |
-
boxes.append(OCRBox(
|
| 126 |
-
text=text,
|
| 127 |
-
confidence=float(confidence),
|
| 128 |
-
bbox=safe_bbox,
|
| 129 |
-
))
|
| 130 |
-
return boxes
|
| 131 |
-
|
| 132 |
-
def _boxes_to_text(self, boxes: List[OCRBox]) -> str:
|
| 133 |
-
if not boxes:
|
| 134 |
-
return ""
|
| 135 |
-
|
| 136 |
-
def cy(box: OCRBox) -> float:
|
| 137 |
-
try:
|
| 138 |
-
ys = [pt[1] for pt in box.bbox]
|
| 139 |
-
return sum(ys) / len(ys)
|
| 140 |
-
except Exception:
|
| 141 |
-
return 0
|
| 142 |
-
|
| 143 |
-
def cx(box: OCRBox) -> float:
|
| 144 |
-
try:
|
| 145 |
-
xs = [pt[0] for pt in box.bbox]
|
| 146 |
-
return sum(xs) / len(xs)
|
| 147 |
-
except Exception:
|
| 148 |
-
return 0
|
| 149 |
-
|
| 150 |
-
def h(box: OCRBox) -> float:
|
| 151 |
-
try:
|
| 152 |
-
ys = [pt[1] for pt in box.bbox]
|
| 153 |
-
return max(ys) - min(ys)
|
| 154 |
-
except Exception:
|
| 155 |
-
return 20
|
| 156 |
-
|
| 157 |
-
sorted_boxes = sorted(boxes, key=lambda b: (cy(b), cx(b)))
|
| 158 |
-
lines = []
|
| 159 |
-
current = [sorted_boxes[0]]
|
| 160 |
-
current_y = cy(sorted_boxes[0])
|
| 161 |
-
|
| 162 |
-
for box in sorted_boxes[1:]:
|
| 163 |
-
if abs(cy(box) - current_y) < max(h(box) * 0.5, 15):
|
| 164 |
-
current.append(box)
|
| 165 |
-
else:
|
| 166 |
-
current.sort(key=lambda b: cx(b))
|
| 167 |
-
lines.append(" ".join(b.text for b in current))
|
| 168 |
-
current = [box]
|
| 169 |
-
current_y = cy(box)
|
| 170 |
-
|
| 171 |
-
if current:
|
| 172 |
-
current.sort(key=lambda b: cx(b))
|
| 173 |
-
lines.append(" ".join(b.text for b in current))
|
| 174 |
-
|
| 175 |
-
return "\n".join(lines)
|
| 176 |
-
|
| 177 |
-
def extract_text(
|
| 178 |
-
self,
|
| 179 |
-
image: ImageInput,
|
| 180 |
-
detail: bool = True,
|
| 181 |
-
paragraph: bool = False,
|
| 182 |
-
) -> OCRResult:
|
| 183 |
-
logger.debug(f"Running stable OCR on {image.width}x{image.height} image")
|
| 184 |
-
|
| 185 |
-
try:
|
| 186 |
-
processed = self._preprocess_for_ocr(image.numpy.copy())
|
| 187 |
-
raw_results = self.reader.readtext(
|
| 188 |
-
processed,
|
| 189 |
-
detail=1,
|
| 190 |
-
paragraph=False,
|
| 191 |
-
contrast_ths=0.1,
|
| 192 |
-
adjust_contrast=0.7,
|
| 193 |
-
text_threshold=0.5,
|
| 194 |
-
low_text=0.3,
|
| 195 |
-
link_threshold=0.3,
|
| 196 |
-
width_ths=0.7,
|
| 197 |
-
decoder="beamsearch",
|
| 198 |
-
beamWidth=10,
|
| 199 |
-
)
|
| 200 |
-
boxes = self._parse_results(raw_results)
|
| 201 |
-
|
| 202 |
-
if len(boxes) < 2:
|
| 203 |
-
raw2 = self.reader.readtext(
|
| 204 |
-
image.numpy,
|
| 205 |
-
detail=1,
|
| 206 |
-
paragraph=False,
|
| 207 |
-
)
|
| 208 |
-
boxes2 = self._parse_results(raw2)
|
| 209 |
-
if len(boxes2) > len(boxes):
|
| 210 |
-
boxes = boxes2
|
| 211 |
-
|
| 212 |
-
full_text = self._boxes_to_text(boxes)
|
| 213 |
-
|
| 214 |
-
return OCRResult(
|
| 215 |
-
full_text=full_text,
|
| 216 |
-
boxes=boxes,
|
| 217 |
-
language=",".join(self.languages),
|
| 218 |
-
engine=self.engine,
|
| 219 |
-
)
|
| 220 |
-
|
| 221 |
-
except Exception as e:
|
| 222 |
-
logger.error(f"OCR processing error: {e}")
|
| 223 |
-
raw_results = self.reader.readtext(image.numpy, detail=1, paragraph=False)
|
| 224 |
-
boxes = self._parse_results(raw_results)
|
| 225 |
-
full_text = self._boxes_to_text(boxes)
|
| 226 |
-
return OCRResult(
|
| 227 |
-
full_text=full_text,
|
| 228 |
-
boxes=boxes,
|
| 229 |
-
language=",".join(self.languages),
|
| 230 |
-
engine=self.engine,
|
| 231 |
-
)
|
| 232 |
-
|
| 233 |
-
def extract_text_simple(self, image: ImageInput) -> str:
|
| 234 |
-
result = self.extract_text(image, detail=True, paragraph=False)
|
| 235 |
-
return result.full_text
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
cv_module/src/api/routes.py
CHANGED
|
@@ -34,6 +34,10 @@ router = APIRouter()
|
|
| 34 |
_pipeline: CVPipeline = None
|
| 35 |
_pipeline_lock = threading.Lock()
|
| 36 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 37 |
|
| 38 |
def get_pipeline() -> CVPipeline:
|
| 39 |
global _pipeline
|
|
@@ -57,29 +61,37 @@ def _trigger_and_wait(model_name: str):
|
|
| 57 |
Thread-safe: hanya satu thread yang load, sisanya tunggu.
|
| 58 |
"""
|
| 59 |
readiness = get_readiness()
|
| 60 |
-
status_info = readiness.get_status(model_name)
|
| 61 |
|
| 62 |
-
#
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 63 |
if status_info.state.value == "ready":
|
| 64 |
return
|
| 65 |
|
| 66 |
-
#
|
| 67 |
-
if
|
| 68 |
-
raise HTTPException(
|
| 69 |
-
status_code=503,
|
| 70 |
-
detail={
|
| 71 |
-
"error": "model_failed_to_load",
|
| 72 |
-
"model": model_name,
|
| 73 |
-
"message": status_info.error_message or "Model gagal dimuat.",
|
| 74 |
-
"hint": "Cek logs container untuk detail error.",
|
| 75 |
-
},
|
| 76 |
-
)
|
| 77 |
-
|
| 78 |
-
# Kalau belum loading (not_loaded) β trigger load via pipeline property access.
|
| 79 |
-
# ReadinessTracker akan di-update oleh pipeline lazy loader.
|
| 80 |
-
if status_info.state.value in ("not_loaded",):
|
| 81 |
-
readiness.mark_loading(model_name)
|
| 82 |
-
# Trigger load di thread baru supaya tidak block event loop.
|
| 83 |
def _do_load():
|
| 84 |
try:
|
| 85 |
p = get_pipeline()
|
|
@@ -104,6 +116,18 @@ def _trigger_and_wait(model_name: str):
|
|
| 104 |
ok = readiness.wait_for(model_name, timeout=_MODEL_WAIT_TIMEOUT)
|
| 105 |
if not ok:
|
| 106 |
current = readiness.get_status(model_name).state.value
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 107 |
raise HTTPException(
|
| 108 |
status_code=503,
|
| 109 |
detail={
|
|
@@ -230,7 +254,8 @@ async def caption(url: str, prompt: str = None):
|
|
| 230 |
except HTTPException:
|
| 231 |
raise
|
| 232 |
except Exception as e:
|
| 233 |
-
|
|
|
|
| 234 |
|
| 235 |
|
| 236 |
@router.post("/detect", response_model=DetectionResponse, tags=["tasks"])
|
|
@@ -250,7 +275,8 @@ async def detect(url: str, conf: float = None):
|
|
| 250 |
except HTTPException:
|
| 251 |
raise
|
| 252 |
except Exception as e:
|
| 253 |
-
|
|
|
|
| 254 |
|
| 255 |
|
| 256 |
@router.post("/classify", response_model=ClassificationResponse, tags=["tasks"])
|
|
@@ -268,7 +294,8 @@ async def classify(req: ClassifyRequest):
|
|
| 268 |
except HTTPException:
|
| 269 |
raise
|
| 270 |
except Exception as e:
|
| 271 |
-
|
|
|
|
| 272 |
|
| 273 |
|
| 274 |
class OCRRequest(BaseModel):
|
|
@@ -293,7 +320,8 @@ async def ocr(req: OCRRequest):
|
|
| 293 |
except HTTPException:
|
| 294 |
raise
|
| 295 |
except Exception as e:
|
| 296 |
-
|
|
|
|
| 297 |
|
| 298 |
|
| 299 |
@router.post("/ocr/bytes", tags=["tasks"])
|
|
|
|
| 34 |
_pipeline: CVPipeline = None
|
| 35 |
_pipeline_lock = threading.Lock()
|
| 36 |
|
| 37 |
+
# Lock terpisah untuk trigger lazy-load β mencegah TOCTOU race kalau
|
| 38 |
+
# beberapa request datang barengan untuk model yang sama.
|
| 39 |
+
_trigger_lock = threading.Lock()
|
| 40 |
+
|
| 41 |
|
| 42 |
def get_pipeline() -> CVPipeline:
|
| 43 |
global _pipeline
|
|
|
|
| 61 |
Thread-safe: hanya satu thread yang load, sisanya tunggu.
|
| 62 |
"""
|
| 63 |
readiness = get_readiness()
|
|
|
|
| 64 |
|
| 65 |
+
# Atomic check-and-mark: hold _trigger_lock biar dua request ga sama-sama
|
| 66 |
+
# nge-spawn loader thread untuk model yang sama (CVPipeline punya per-model
|
| 67 |
+
# lock, tapi spawning extra thread tetep waste resource & log noise).
|
| 68 |
+
with _trigger_lock:
|
| 69 |
+
status_info = readiness.get_status(model_name)
|
| 70 |
+
|
| 71 |
+
# Kalau error, langsung raise.
|
| 72 |
+
if status_info.state.value == "error":
|
| 73 |
+
raise HTTPException(
|
| 74 |
+
status_code=503,
|
| 75 |
+
detail={
|
| 76 |
+
"error": "model_failed_to_load",
|
| 77 |
+
"model": model_name,
|
| 78 |
+
"message": status_info.error_message or "Model gagal dimuat.",
|
| 79 |
+
"hint": "Cek logs container untuk detail error.",
|
| 80 |
+
},
|
| 81 |
+
)
|
| 82 |
+
|
| 83 |
+
need_spawn = status_info.state.value in ("not_loaded",)
|
| 84 |
+
if need_spawn:
|
| 85 |
+
# Mark loading dulu β request berikutnya yang masuk akan lihat
|
| 86 |
+
# state="loading" dan langsung wait_for() tanpa spawn thread baru.
|
| 87 |
+
readiness.mark_loading(model_name)
|
| 88 |
+
|
| 89 |
+
# Kalau sudah ready, langsung return (tidak perlu wait).
|
| 90 |
if status_info.state.value == "ready":
|
| 91 |
return
|
| 92 |
|
| 93 |
+
# Spawn loader thread di luar lock supaya request lain bisa masuk.
|
| 94 |
+
if need_spawn:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 95 |
def _do_load():
|
| 96 |
try:
|
| 97 |
p = get_pipeline()
|
|
|
|
| 116 |
ok = readiness.wait_for(model_name, timeout=_MODEL_WAIT_TIMEOUT)
|
| 117 |
if not ok:
|
| 118 |
current = readiness.get_status(model_name).state.value
|
| 119 |
+
# Kalau state-nya error, kasih pesan error spesifik.
|
| 120 |
+
if current == "error":
|
| 121 |
+
err_msg = readiness.get_status(model_name).error_message
|
| 122 |
+
raise HTTPException(
|
| 123 |
+
status_code=503,
|
| 124 |
+
detail={
|
| 125 |
+
"error": "model_failed_to_load",
|
| 126 |
+
"model": model_name,
|
| 127 |
+
"message": err_msg or f"Model '{model_name}' gagal dimuat.",
|
| 128 |
+
"hint": "Cek logs container untuk traceback lengkap.",
|
| 129 |
+
},
|
| 130 |
+
)
|
| 131 |
raise HTTPException(
|
| 132 |
status_code=503,
|
| 133 |
detail={
|
|
|
|
| 254 |
except HTTPException:
|
| 255 |
raise
|
| 256 |
except Exception as e:
|
| 257 |
+
logger.error(f"Caption error: {e}")
|
| 258 |
+
raise HTTPException(500, detail=f"Caption gagal: {e}")
|
| 259 |
|
| 260 |
|
| 261 |
@router.post("/detect", response_model=DetectionResponse, tags=["tasks"])
|
|
|
|
| 275 |
except HTTPException:
|
| 276 |
raise
|
| 277 |
except Exception as e:
|
| 278 |
+
logger.error(f"Detect error: {e}")
|
| 279 |
+
raise HTTPException(500, detail=f"Detect gagal: {e}")
|
| 280 |
|
| 281 |
|
| 282 |
@router.post("/classify", response_model=ClassificationResponse, tags=["tasks"])
|
|
|
|
| 294 |
except HTTPException:
|
| 295 |
raise
|
| 296 |
except Exception as e:
|
| 297 |
+
logger.error(f"Classify error: {e}")
|
| 298 |
+
raise HTTPException(500, detail=f"Classify gagal: {e}")
|
| 299 |
|
| 300 |
|
| 301 |
class OCRRequest(BaseModel):
|
|
|
|
| 320 |
except HTTPException:
|
| 321 |
raise
|
| 322 |
except Exception as e:
|
| 323 |
+
logger.error(f"OCR error: {e}")
|
| 324 |
+
raise HTTPException(500, detail=f"OCR gagal: {e}")
|
| 325 |
|
| 326 |
|
| 327 |
@router.post("/ocr/bytes", tags=["tasks"])
|
cv_module/src/cv_pipeline.py
CHANGED
|
@@ -69,38 +69,61 @@ class CVPipeline:
|
|
| 69 |
Orchestrator untuk semua CV models.
|
| 70 |
Lazy loading β model hanya di-load saat pertama kali dipakai.
|
| 71 |
Support modular: bisa run satu atau semua model sekaligus.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 72 |
"""
|
| 73 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 74 |
def __init__(self):
|
|
|
|
| 75 |
self.settings = get_cv_settings()
|
| 76 |
self._clip: Optional[CLIPModel] = None
|
| 77 |
self._yolo: Optional[YOLODetector] = None
|
| 78 |
self._captioner: Optional[ImageCaptioner] = None
|
| 79 |
self._ocr: Optional[OCRProcessor] = None
|
| 80 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 81 |
|
| 82 |
@property
|
| 83 |
def clip(self) -> CLIPModel:
|
| 84 |
if self._clip is None:
|
| 85 |
-
self.
|
|
|
|
|
|
|
| 86 |
return self._clip
|
| 87 |
|
| 88 |
@property
|
| 89 |
def yolo(self) -> YOLODetector:
|
| 90 |
if self._yolo is None:
|
| 91 |
-
self.
|
|
|
|
|
|
|
| 92 |
return self._yolo
|
| 93 |
|
| 94 |
@property
|
| 95 |
def captioner(self) -> ImageCaptioner:
|
| 96 |
if self._captioner is None:
|
| 97 |
-
self.
|
|
|
|
|
|
|
| 98 |
return self._captioner
|
| 99 |
|
| 100 |
@property
|
| 101 |
def ocr(self) -> OCRProcessor:
|
| 102 |
if self._ocr is None:
|
| 103 |
-
self.
|
|
|
|
|
|
|
| 104 |
return self._ocr
|
| 105 |
|
| 106 |
# === Main analysis methods ===
|
|
@@ -139,9 +162,12 @@ class CVPipeline:
|
|
| 139 |
source=image.source,
|
| 140 |
)
|
| 141 |
|
| 142 |
-
# Jalankan semua model secara paralel untuk menghindari timeout
|
|
|
|
|
|
|
|
|
|
| 143 |
tasks = {}
|
| 144 |
-
with concurrent.futures.ThreadPoolExecutor() as executor:
|
| 145 |
if run_caption:
|
| 146 |
tasks["caption"] = executor.submit(self.captioner.caption, image)
|
| 147 |
if run_detection:
|
|
|
|
| 69 |
Orchestrator untuk semua CV models.
|
| 70 |
Lazy loading β model hanya di-load saat pertama kali dipakai.
|
| 71 |
Support modular: bisa run satu atau semua model sekaligus.
|
| 72 |
+
|
| 73 |
+
Thread-safe: setiap model property pakai per-model lock supaya 2+ request
|
| 74 |
+
yang concurrent tidak load model yang sama secara duplicate (bisa OOM
|
| 75 |
+
di HF free tier yang RAM-nya cuma 16GB shared).
|
| 76 |
"""
|
| 77 |
|
| 78 |
+
# Cap ThreadPoolExecutor workers untuk analyze() β tanpa cap, default
|
| 79 |
+
# min(32, os.cpu_count()+4) bisa bikin OOM kalau semua model jalan paralel
|
| 80 |
+
# plus model loading butuh RAM. 2 cukup buat overlap I/O + compute.
|
| 81 |
+
_MAX_PARALLEL_TASKS = 2
|
| 82 |
+
|
| 83 |
def __init__(self):
|
| 84 |
+
import threading
|
| 85 |
self.settings = get_cv_settings()
|
| 86 |
self._clip: Optional[CLIPModel] = None
|
| 87 |
self._yolo: Optional[YOLODetector] = None
|
| 88 |
self._captioner: Optional[ImageCaptioner] = None
|
| 89 |
self._ocr: Optional[OCRProcessor] = None
|
| 90 |
+
# Per-model locks β mencegah double-init kalau 2+ thread access barengan.
|
| 91 |
+
self._lock_clip = threading.Lock()
|
| 92 |
+
self._lock_yolo = threading.Lock()
|
| 93 |
+
self._lock_captioner = threading.Lock()
|
| 94 |
+
self._lock_ocr = threading.Lock()
|
| 95 |
+
logger.info("CVPipeline initialized (lazy loading, thread-safe).")
|
| 96 |
|
| 97 |
@property
|
| 98 |
def clip(self) -> CLIPModel:
|
| 99 |
if self._clip is None:
|
| 100 |
+
with self._lock_clip:
|
| 101 |
+
if self._clip is None: # double-check after lock
|
| 102 |
+
self._clip = CLIPModel()
|
| 103 |
return self._clip
|
| 104 |
|
| 105 |
@property
|
| 106 |
def yolo(self) -> YOLODetector:
|
| 107 |
if self._yolo is None:
|
| 108 |
+
with self._lock_yolo:
|
| 109 |
+
if self._yolo is None:
|
| 110 |
+
self._yolo = YOLODetector()
|
| 111 |
return self._yolo
|
| 112 |
|
| 113 |
@property
|
| 114 |
def captioner(self) -> ImageCaptioner:
|
| 115 |
if self._captioner is None:
|
| 116 |
+
with self._lock_captioner:
|
| 117 |
+
if self._captioner is None:
|
| 118 |
+
self._captioner = ImageCaptioner()
|
| 119 |
return self._captioner
|
| 120 |
|
| 121 |
@property
|
| 122 |
def ocr(self) -> OCRProcessor:
|
| 123 |
if self._ocr is None:
|
| 124 |
+
with self._lock_ocr:
|
| 125 |
+
if self._ocr is None:
|
| 126 |
+
self._ocr = OCRProcessor()
|
| 127 |
return self._ocr
|
| 128 |
|
| 129 |
# === Main analysis methods ===
|
|
|
|
| 162 |
source=image.source,
|
| 163 |
)
|
| 164 |
|
| 165 |
+
# Jalankan semua model secara paralel untuk menghindari timeout.
|
| 166 |
+
# max_workers=2 β cukup untuk overlap I/O & compute, ngga ngebebanin RAM.
|
| 167 |
+
# Tanpa cap, default workers bisa bikin 4 model jalan barengan + ngeload
|
| 168 |
+
# weight-nya barengan β OOM di HF free tier.
|
| 169 |
tasks = {}
|
| 170 |
+
with concurrent.futures.ThreadPoolExecutor(max_workers=self._MAX_PARALLEL_TASKS) as executor:
|
| 171 |
if run_caption:
|
| 172 |
tasks["caption"] = executor.submit(self.captioner.caption, image)
|
| 173 |
if run_detection:
|
frontend/index.html
CHANGED
|
@@ -9,15 +9,54 @@
|
|
| 9 |
// Anti-flash theme init. Key: 'ai-rag-theme'.
|
| 10 |
// null/missing = follow device preference (re-read every load).
|
| 11 |
// 'dark' or 'light' = user manually overrode.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 12 |
(function(){
|
| 13 |
-
var
|
| 14 |
-
|
| 15 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 16 |
document.documentElement.setAttribute('data-theme', t);
|
| 17 |
})();
|
| 18 |
</script>
|
| 19 |
<style>
|
| 20 |
/* ββ Theme tokens βββββββββββββββββββββββββββββββββββββ */
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 21 |
:root[data-theme="dark"] {
|
| 22 |
--bg0: #111214;
|
| 23 |
--bg1: #16181c;
|
|
@@ -515,21 +554,25 @@ let queryCount=0;
|
|
| 515 |
// === Theme ===
|
| 516 |
function applyTheme(theme, manual){
|
| 517 |
document.documentElement.setAttribute('data-theme', theme);
|
| 518 |
-
// Only persist if user manually toggled (not from system change or init)
|
|
|
|
|
|
|
| 519 |
if (manual) {
|
| 520 |
-
localStorage.setItem('ai-rag-theme', theme);
|
| 521 |
}
|
| 522 |
// icon-sun shown when dark (click -> go light)
|
| 523 |
// icon-moon shown when light (click -> go dark)
|
| 524 |
const isDark = theme === 'dark';
|
| 525 |
-
document.getElementById('icon-sun')
|
| 526 |
-
document.getElementById('icon-moon')
|
| 527 |
-
|
| 528 |
-
|
|
|
|
|
|
|
| 529 |
}
|
| 530 |
|
| 531 |
function toggleTheme(){
|
| 532 |
-
const cur = document.documentElement.getAttribute('data-theme');
|
| 533 |
applyTheme(cur === 'dark' ? 'light' : 'dark', true);
|
| 534 |
}
|
| 535 |
|
|
@@ -540,14 +583,20 @@ function toggleTheme(){
|
|
| 540 |
applyTheme(theme, false);
|
| 541 |
})();
|
| 542 |
|
| 543 |
-
// Listen for system theme changes (only if user hasn't manually overridden)
|
| 544 |
-
//
|
| 545 |
-
|
| 546 |
-
|
| 547 |
-
|
| 548 |
-
|
| 549 |
-
|
| 550 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 551 |
|
| 552 |
// === Utility ===
|
| 553 |
function tick(){document.getElementById('clock').textContent=new Date().toLocaleTimeString('id-ID',{hour:'2-digit',minute:'2-digit',second:'2-digit'})}
|
|
|
|
| 9 |
// Anti-flash theme init. Key: 'ai-rag-theme'.
|
| 10 |
// null/missing = follow device preference (re-read every load).
|
| 11 |
// 'dark' or 'light' = user manually overrode.
|
| 12 |
+
//
|
| 13 |
+
// IMPORTANT: wrap in try/catch β beberapa browser (private mode, strict cookie
|
| 14 |
+
// policy, embedded WebView) nge-block localStorage atau matchMedia β kalau script
|
| 15 |
+
// throw, data-theme tidak pernah ke-set, CSS variables semua undefined, dan
|
| 16 |
+
// tampilan keliatan "ilang temanya". Default ke 'light' kalau apa pun gagal.
|
| 17 |
(function(){
|
| 18 |
+
var t = 'light';
|
| 19 |
+
try {
|
| 20 |
+
var s = null;
|
| 21 |
+
try { s = window.localStorage.getItem('ai-rag-theme'); } catch (_) {}
|
| 22 |
+
if (s === 'dark' || s === 'light') {
|
| 23 |
+
t = s;
|
| 24 |
+
} else if (window.matchMedia && window.matchMedia('(prefers-color-scheme: dark)').matches) {
|
| 25 |
+
t = 'dark';
|
| 26 |
+
}
|
| 27 |
+
} catch (_) { /* fall through to light */ }
|
| 28 |
document.documentElement.setAttribute('data-theme', t);
|
| 29 |
})();
|
| 30 |
</script>
|
| 31 |
<style>
|
| 32 |
/* ββ Theme tokens βββββββββββββββββββββββββββββββββββββ */
|
| 33 |
+
/* Fallback default (light) β applies when data-theme attribute is missing
|
| 34 |
+
or invalid. Without this, CSS vars all undefined β blank/white-on-white UI.
|
| 35 |
+
The inline script in <head> ALWAYS sets data-theme, but defense-in-depth
|
| 36 |
+
protects us against weird browsers / extensions / errors. */
|
| 37 |
+
:root {
|
| 38 |
+
--bg0: #f4f5f7;
|
| 39 |
+
--bg1: #ffffff;
|
| 40 |
+
--bg2: #eef0f3;
|
| 41 |
+
--bg3: #e4e6eb;
|
| 42 |
+
--line: #d0d4db;
|
| 43 |
+
--line2: #bcc0c9;
|
| 44 |
+
--ink0: #1a1d24;
|
| 45 |
+
--ink1: #4a5263;
|
| 46 |
+
--ink2: #7a8296;
|
| 47 |
+
--ink3: #a8afc0;
|
| 48 |
+
--sage: #4a7a5e;
|
| 49 |
+
--sage-l: #357a52;
|
| 50 |
+
--sage-bg: #eaf4ee;
|
| 51 |
+
--amber: #8c6d3f;
|
| 52 |
+
--amber-l: #7a5820;
|
| 53 |
+
--amber-bg: #fdf3e3;
|
| 54 |
+
--sky: #2a5f80;
|
| 55 |
+
--sky-l: #1e5070;
|
| 56 |
+
--red: #8c3a3a;
|
| 57 |
+
--red-l: #9e2a2a;
|
| 58 |
+
--shadow: rgba(0,0,0,0.08);
|
| 59 |
+
}
|
| 60 |
:root[data-theme="dark"] {
|
| 61 |
--bg0: #111214;
|
| 62 |
--bg1: #16181c;
|
|
|
|
| 554 |
// === Theme ===
|
| 555 |
function applyTheme(theme, manual){
|
| 556 |
document.documentElement.setAttribute('data-theme', theme);
|
| 557 |
+
// Only persist if user manually toggled (not from system change or init).
|
| 558 |
+
// Wrap in try/catch β localStorage can throw in private mode, embedded
|
| 559 |
+
// WebViews, or with strict cookie policies.
|
| 560 |
if (manual) {
|
| 561 |
+
try { localStorage.setItem('ai-rag-theme', theme); } catch (_) {}
|
| 562 |
}
|
| 563 |
// icon-sun shown when dark (click -> go light)
|
| 564 |
// icon-moon shown when light (click -> go dark)
|
| 565 |
const isDark = theme === 'dark';
|
| 566 |
+
const sun = document.getElementById('icon-sun');
|
| 567 |
+
const moon = document.getElementById('icon-moon');
|
| 568 |
+
const label = document.getElementById('theme-label');
|
| 569 |
+
if (sun) sun.style.display = isDark ? '' : 'none';
|
| 570 |
+
if (moon) moon.style.display = isDark ? 'none' : '';
|
| 571 |
+
if (label) label.textContent = theme;
|
| 572 |
}
|
| 573 |
|
| 574 |
function toggleTheme(){
|
| 575 |
+
const cur = document.documentElement.getAttribute('data-theme') || 'light';
|
| 576 |
applyTheme(cur === 'dark' ? 'light' : 'dark', true);
|
| 577 |
}
|
| 578 |
|
|
|
|
| 583 |
applyTheme(theme, false);
|
| 584 |
})();
|
| 585 |
|
| 586 |
+
// Listen for system theme changes (only if user hasn't manually overridden).
|
| 587 |
+
// Wrap in try/catch β matchMedia.addEventListener missing in old browsers.
|
| 588 |
+
try {
|
| 589 |
+
const mq = window.matchMedia('(prefers-color-scheme: dark)');
|
| 590 |
+
const handler = function(e){
|
| 591 |
+
var stored = null;
|
| 592 |
+
try { stored = localStorage.getItem('ai-rag-theme'); } catch (_) {}
|
| 593 |
+
if (stored !== 'dark' && stored !== 'light') {
|
| 594 |
+
applyTheme(e.matches ? 'dark' : 'light', false);
|
| 595 |
+
}
|
| 596 |
+
};
|
| 597 |
+
if (mq.addEventListener) mq.addEventListener('change', handler);
|
| 598 |
+
else if (mq.addListener) mq.addListener(handler); // Safari < 14
|
| 599 |
+
} catch (_) {}
|
| 600 |
|
| 601 |
// === Utility ===
|
| 602 |
function tick(){document.getElementById('clock').textContent=new Date().toLocaleTimeString('id-ID',{hour:'2-digit',minute:'2-digit',second:'2-digit'})}
|
rag-requirements.txt
DELETED
|
@@ -1,34 +0,0 @@
|
|
| 1 |
-
# ββ LLM & Orchestration ββββββββββββββββββββββββββββββββββ
|
| 2 |
-
langchain==0.2.16
|
| 3 |
-
langchain-groq==0.1.9
|
| 4 |
-
langchain-community==0.2.16
|
| 5 |
-
langchain-chroma==0.1.4
|
| 6 |
-
|
| 7 |
-
# ββ Vector Store ββββββββββββββββββββββββββββββββββββββββββ
|
| 8 |
-
chromadb==0.5.3
|
| 9 |
-
|
| 10 |
-
# ββ Embeddings ββββββββββββββββββββββββββββββββββββββββββββ
|
| 11 |
-
sentence-transformers==2.7.0
|
| 12 |
-
transformers==4.35.2
|
| 13 |
-
numpy==1.26.4
|
| 14 |
-
|
| 15 |
-
# ββ Document Loaders ββββββββββββββββββββββββββββββββββββββ
|
| 16 |
-
pypdf==4.3.1
|
| 17 |
-
python-docx==1.1.2
|
| 18 |
-
beautifulsoup4==4.12.3
|
| 19 |
-
requests==2.32.3
|
| 20 |
-
|
| 21 |
-
# ββ API βββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 22 |
-
fastapi==0.112.0
|
| 23 |
-
uvicorn[standard]==0.30.6
|
| 24 |
-
python-multipart==0.0.9
|
| 25 |
-
pydantic==2.8.2
|
| 26 |
-
pydantic-settings==2.4.0
|
| 27 |
-
|
| 28 |
-
# ββ MLOps βββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 29 |
-
mlflow==2.15.1
|
| 30 |
-
|
| 31 |
-
# ββ Utils βββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 32 |
-
python-dotenv==1.0.1
|
| 33 |
-
loguru==0.7.2
|
| 34 |
-
httpx==0.27.0
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
rag/src/__init__.py
DELETED
|
@@ -1 +0,0 @@
|
|
| 1 |
-
# RAG Pipeline
|
|
|
|
|
|
rag/src/api/__init__.py
DELETED
|
@@ -1 +0,0 @@
|
|
| 1 |
-
|
|
|
|
|
|
rag/src/api/main.py
DELETED
|
@@ -1,57 +0,0 @@
|
|
| 1 |
-
from fastapi import FastAPI
|
| 2 |
-
from fastapi.middleware.cors import CORSMiddleware
|
| 3 |
-
from loguru import logger
|
| 4 |
-
import sys
|
| 5 |
-
|
| 6 |
-
from .routes import router
|
| 7 |
-
from ..config import get_settings
|
| 8 |
-
|
| 9 |
-
# === Logging setup ===
|
| 10 |
-
settings = get_settings()
|
| 11 |
-
logger.remove()
|
| 12 |
-
logger.add(sys.stderr, level=settings.log_level, colorize=True,
|
| 13 |
-
format="<green>{time:HH:mm:ss}</green> | <level>{level: <8}</level> | <cyan>{name}</cyan> - {message}")
|
| 14 |
-
logger.add(settings.log_file, rotation="10 MB", retention="7 days", level="DEBUG")
|
| 15 |
-
|
| 16 |
-
# === App ===
|
| 17 |
-
app = FastAPI(
|
| 18 |
-
title="RAG Pipeline API",
|
| 19 |
-
description="""
|
| 20 |
-
## Multimodal AI Assistant β RAG Module
|
| 21 |
-
|
| 22 |
-
Endpoint untuk indexing dan querying dokumen menggunakan:
|
| 23 |
-
- **Groq** (LLM inference β cepat & gratis)
|
| 24 |
-
- **Sentence Transformers** (local embeddings)
|
| 25 |
-
- **ChromaDB** (vector store persistent)
|
| 26 |
-
- **MLflow** (experiment tracking)
|
| 27 |
-
|
| 28 |
-
### Supported document types
|
| 29 |
-
PDF, TXT, Markdown, DOCX, JSON, JSONL, URL
|
| 30 |
-
""",
|
| 31 |
-
version="1.0.0",
|
| 32 |
-
docs_url="/docs",
|
| 33 |
-
redoc_url="/redoc",
|
| 34 |
-
)
|
| 35 |
-
|
| 36 |
-
# === CORS ===
|
| 37 |
-
app.add_middleware(
|
| 38 |
-
CORSMiddleware,
|
| 39 |
-
allow_origins=["*"], # Ganti dengan domain spesifik di production
|
| 40 |
-
allow_credentials=True,
|
| 41 |
-
allow_methods=["*"],
|
| 42 |
-
allow_headers=["*"],
|
| 43 |
-
)
|
| 44 |
-
|
| 45 |
-
# === Routes ===
|
| 46 |
-
app.include_router(router, prefix="/api/v1")
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
@app.on_event("startup")
|
| 50 |
-
async def startup():
|
| 51 |
-
logger.info("RAG Pipeline API starting up...")
|
| 52 |
-
logger.info(f"Docs: http://{settings.api_host}:{settings.api_port}/docs")
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
@app.on_event("shutdown")
|
| 56 |
-
async def shutdown():
|
| 57 |
-
logger.info("RAG Pipeline API shutting down.")
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
rag/src/api/routes.py
DELETED
|
@@ -1,137 +0,0 @@
|
|
| 1 |
-
from fastapi import APIRouter, HTTPException, UploadFile, File
|
| 2 |
-
from fastapi.responses import StreamingResponse
|
| 3 |
-
from loguru import logger
|
| 4 |
-
import tempfile
|
| 5 |
-
import shutil
|
| 6 |
-
from pathlib import Path
|
| 7 |
-
|
| 8 |
-
from .schemas import (
|
| 9 |
-
IngestRequest, IngestResponse,
|
| 10 |
-
QueryRequest, QueryResponse,
|
| 11 |
-
SummarizeRequest, SummarizeResponse,
|
| 12 |
-
StatsResponse, DeleteResponse,
|
| 13 |
-
)
|
| 14 |
-
from ..retrieval.retriever import RAGRetriever
|
| 15 |
-
|
| 16 |
-
router = APIRouter()
|
| 17 |
-
|
| 18 |
-
# Singleton retriever β di-init sekali saat startup
|
| 19 |
-
_retriever: RAGRetriever = None
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
def get_retriever() -> RAGRetriever:
|
| 23 |
-
global _retriever
|
| 24 |
-
if _retriever is None:
|
| 25 |
-
_retriever = RAGRetriever()
|
| 26 |
-
return _retriever
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
# === HEALTH ===
|
| 30 |
-
|
| 31 |
-
@router.get("/health", tags=["system"])
|
| 32 |
-
async def health_check():
|
| 33 |
-
return {"status": "ok", "service": "RAG Pipeline API"}
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
# === STATS ===
|
| 37 |
-
|
| 38 |
-
@router.get("/stats", response_model=StatsResponse, tags=["system"])
|
| 39 |
-
async def get_stats():
|
| 40 |
-
"""Info tentang vector store saat ini."""
|
| 41 |
-
return get_retriever().get_stats()
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
# === INGEST ===
|
| 45 |
-
|
| 46 |
-
@router.post("/ingest", response_model=IngestResponse, tags=["indexing"])
|
| 47 |
-
async def ingest_documents(request: IngestRequest):
|
| 48 |
-
"""
|
| 49 |
-
Index dokumen dari file path atau URL ke vector store.
|
| 50 |
-
Mendukung: PDF, TXT, MD, DOCX, JSON, JSONL, URL
|
| 51 |
-
"""
|
| 52 |
-
try:
|
| 53 |
-
stats = get_retriever().ingest(request.sources)
|
| 54 |
-
return IngestResponse(status="success", **stats)
|
| 55 |
-
except Exception as e:
|
| 56 |
-
logger.error(f"Ingest error: {e}")
|
| 57 |
-
raise HTTPException(status_code=500, detail=str(e))
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
@router.post("/ingest/upload", tags=["indexing"])
|
| 61 |
-
async def ingest_upload(file: UploadFile = File(...)):
|
| 62 |
-
"""Upload dan index file langsung via multipart."""
|
| 63 |
-
allowed_exts = {".pdf", ".txt", ".md", ".docx", ".json", ".jsonl"}
|
| 64 |
-
ext = Path(file.filename).suffix.lower()
|
| 65 |
-
|
| 66 |
-
if ext not in allowed_exts:
|
| 67 |
-
raise HTTPException(
|
| 68 |
-
status_code=400,
|
| 69 |
-
detail=f"Ekstensi '{ext}' tidak didukung. Gunakan: {allowed_exts}"
|
| 70 |
-
)
|
| 71 |
-
|
| 72 |
-
# Simpan file sementara
|
| 73 |
-
with tempfile.NamedTemporaryFile(delete=False, suffix=ext) as tmp:
|
| 74 |
-
shutil.copyfileobj(file.file, tmp)
|
| 75 |
-
tmp_path = tmp.name
|
| 76 |
-
|
| 77 |
-
try:
|
| 78 |
-
stats = get_retriever().ingest([tmp_path])
|
| 79 |
-
return IngestResponse(status="success", **stats)
|
| 80 |
-
except Exception as e:
|
| 81 |
-
raise HTTPException(status_code=500, detail=str(e))
|
| 82 |
-
finally:
|
| 83 |
-
Path(tmp_path).unlink(missing_ok=True)
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
# === QUERY ===
|
| 87 |
-
|
| 88 |
-
@router.post("/query", response_model=QueryResponse, tags=["querying"])
|
| 89 |
-
async def query(request: QueryRequest):
|
| 90 |
-
"""
|
| 91 |
-
Tanya jawab berdasarkan dokumen yang sudah di-index.
|
| 92 |
-
Mendukung multi-turn conversation via chat_history.
|
| 93 |
-
"""
|
| 94 |
-
if request.stream:
|
| 95 |
-
# Streaming response
|
| 96 |
-
def generate():
|
| 97 |
-
yield from get_retriever().stream_query(
|
| 98 |
-
question=request.question,
|
| 99 |
-
chat_history=[m.model_dump() for m in (request.chat_history or [])],
|
| 100 |
-
)
|
| 101 |
-
return StreamingResponse(generate(), media_type="text/event-stream")
|
| 102 |
-
|
| 103 |
-
try:
|
| 104 |
-
result = get_retriever().query(
|
| 105 |
-
question=request.question,
|
| 106 |
-
chat_history=[m.model_dump() for m in (request.chat_history or [])],
|
| 107 |
-
top_k=request.top_k,
|
| 108 |
-
return_sources=request.return_sources,
|
| 109 |
-
)
|
| 110 |
-
return QueryResponse(**result)
|
| 111 |
-
except Exception as e:
|
| 112 |
-
logger.error(f"Query error: {e}")
|
| 113 |
-
raise HTTPException(status_code=500, detail=str(e))
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
# === SUMMARIZE ===
|
| 117 |
-
|
| 118 |
-
@router.post("/summarize", response_model=SummarizeResponse, tags=["querying"])
|
| 119 |
-
async def summarize(request: SummarizeRequest):
|
| 120 |
-
"""Buat ringkasan otomatis dari dokumen."""
|
| 121 |
-
try:
|
| 122 |
-
summary = get_retriever().summarize(request.source)
|
| 123 |
-
return SummarizeResponse(summary=summary, source=request.source)
|
| 124 |
-
except Exception as e:
|
| 125 |
-
raise HTTPException(status_code=500, detail=str(e))
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
# === DELETE ===
|
| 129 |
-
|
| 130 |
-
@router.delete("/collection", response_model=DeleteResponse, tags=["system"])
|
| 131 |
-
async def delete_collection():
|
| 132 |
-
"""Hapus semua dokumen dari vector store. HATI-HATI: tidak bisa di-undo."""
|
| 133 |
-
get_retriever().vector_store.delete_collection()
|
| 134 |
-
return DeleteResponse(
|
| 135 |
-
status="success",
|
| 136 |
-
message="Semua dokumen berhasil dihapus dari vector store."
|
| 137 |
-
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
rag/src/api/schemas.py
DELETED
|
@@ -1,67 +0,0 @@
|
|
| 1 |
-
from pydantic import BaseModel, Field
|
| 2 |
-
from typing import List, Optional
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
class IngestRequest(BaseModel):
|
| 6 |
-
sources: List[str] = Field(
|
| 7 |
-
...,
|
| 8 |
-
description="List of file paths atau URLs untuk di-index",
|
| 9 |
-
example=["./docs/laporan.pdf", "https://example.com/artikel"],
|
| 10 |
-
)
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
class IngestResponse(BaseModel):
|
| 14 |
-
status: str
|
| 15 |
-
documents_loaded: int
|
| 16 |
-
chunks_indexed: int
|
| 17 |
-
total_docs_in_store: int
|
| 18 |
-
elapsed_seconds: float
|
| 19 |
-
sources: List[str]
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
class ChatMessage(BaseModel):
|
| 23 |
-
role: str = Field(..., pattern="^(user|assistant)$")
|
| 24 |
-
content: str
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
class QueryRequest(BaseModel):
|
| 28 |
-
question: str = Field(..., min_length=1, max_length=2000)
|
| 29 |
-
chat_history: Optional[List[ChatMessage]] = Field(default=[], description="Riwayat chat untuk multi-turn")
|
| 30 |
-
top_k: Optional[int] = Field(default=None, ge=1, le=20)
|
| 31 |
-
return_sources: bool = Field(default=True)
|
| 32 |
-
stream: bool = Field(default=False, description="Gunakan streaming response")
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
class SourceChunk(BaseModel):
|
| 36 |
-
content: str
|
| 37 |
-
metadata: dict
|
| 38 |
-
relevance_score: float
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
class QueryResponse(BaseModel):
|
| 42 |
-
answer: str
|
| 43 |
-
question: str
|
| 44 |
-
latency_seconds: float
|
| 45 |
-
chunks_retrieved: int
|
| 46 |
-
sources: Optional[List[SourceChunk]] = None
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
class SummarizeRequest(BaseModel):
|
| 50 |
-
source: str = Field(..., description="File path atau URL untuk diringkas")
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
class SummarizeResponse(BaseModel):
|
| 54 |
-
summary: str
|
| 55 |
-
source: str
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
class StatsResponse(BaseModel):
|
| 59 |
-
total_chunks: int
|
| 60 |
-
collection_name: str
|
| 61 |
-
embedding_model: str
|
| 62 |
-
llm_model: str
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
class DeleteResponse(BaseModel):
|
| 66 |
-
status: str
|
| 67 |
-
message: str
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
rag/src/config.py
DELETED
|
@@ -1,56 +0,0 @@
|
|
| 1 |
-
from pydantic_settings import BaseSettings
|
| 2 |
-
from pydantic import Field
|
| 3 |
-
from functools import lru_cache
|
| 4 |
-
from pathlib import Path
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
class Settings(BaseSettings):
|
| 8 |
-
# LLM
|
| 9 |
-
groq_api_key: str = Field(..., env="GROQ_API_KEY")
|
| 10 |
-
groq_model: str = Field("llama-3.3-70b-versatile", env="GROQ_MODEL") # updated from 3.1
|
| 11 |
-
|
| 12 |
-
# Embeddings
|
| 13 |
-
embedding_model: str = Field("all-MiniLM-L6-v2", env="EMBEDDING_MODEL")
|
| 14 |
-
embedding_device: str = Field("cpu", env="EMBEDDING_DEVICE")
|
| 15 |
-
|
| 16 |
-
# Vector Store
|
| 17 |
-
chroma_persist_dir: str = Field("./chroma_db", env="CHROMA_PERSIST_DIR")
|
| 18 |
-
chroma_collection_name: str = Field("rag_documents", env="CHROMA_COLLECTION_NAME")
|
| 19 |
-
|
| 20 |
-
# RAG Settings
|
| 21 |
-
chunk_size: int = Field(1000, env="CHUNK_SIZE")
|
| 22 |
-
chunk_overlap: int = Field(200, env="CHUNK_OVERLAP")
|
| 23 |
-
top_k_retrieval: int = Field(5, env="TOP_K_RETRIEVAL")
|
| 24 |
-
max_tokens: int = Field(2048, env="MAX_TOKENS")
|
| 25 |
-
temperature: float = Field(0.1, env="TEMPERATURE")
|
| 26 |
-
|
| 27 |
-
# API
|
| 28 |
-
api_host: str = Field("0.0.0.0", env="API_HOST")
|
| 29 |
-
api_port: int = Field(8000, env="API_PORT")
|
| 30 |
-
api_reload: bool = Field(True, env="API_RELOAD")
|
| 31 |
-
|
| 32 |
-
# MLflow
|
| 33 |
-
mlflow_tracking_uri: str = Field("./mlruns", env="MLFLOW_TRACKING_URI")
|
| 34 |
-
mlflow_experiment_name: str = Field("rag_pipeline", env="MLFLOW_EXPERIMENT_NAME")
|
| 35 |
-
|
| 36 |
-
# Logging
|
| 37 |
-
log_level: str = Field("INFO", env="LOG_LEVEL")
|
| 38 |
-
log_file: str = Field("./logs/app.log", env="LOG_FILE")
|
| 39 |
-
|
| 40 |
-
class Config:
|
| 41 |
-
env_file = ".env"
|
| 42 |
-
env_file_encoding = "utf-8"
|
| 43 |
-
|
| 44 |
-
def ensure_dirs(self):
|
| 45 |
-
"""Buat direktori yang dibutuhkan jika belum ada."""
|
| 46 |
-
Path(self.chroma_persist_dir).mkdir(parents=True, exist_ok=True)
|
| 47 |
-
Path(self.log_file).parent.mkdir(parents=True, exist_ok=True)
|
| 48 |
-
Path(self.mlflow_tracking_uri).mkdir(parents=True, exist_ok=True)
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
@lru_cache()
|
| 52 |
-
def get_settings() -> Settings:
|
| 53 |
-
"""Singleton settings β di-cache supaya tidak re-parse tiap request."""
|
| 54 |
-
settings = Settings()
|
| 55 |
-
settings.ensure_dirs()
|
| 56 |
-
return settings
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
rag/src/embeddings/__init__.py
DELETED
|
@@ -1 +0,0 @@
|
|
| 1 |
-
|
|
|
|
|
|
rag/src/embeddings/embedder.py
DELETED
|
@@ -1,60 +0,0 @@
|
|
| 1 |
-
from typing import List
|
| 2 |
-
from loguru import logger
|
| 3 |
-
from langchain.text_splitter import RecursiveCharacterTextSplitter
|
| 4 |
-
from langchain_community.embeddings import HuggingFaceEmbeddings
|
| 5 |
-
|
| 6 |
-
from ..config import get_settings
|
| 7 |
-
from ..loaders.base_loader import Document
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
class DocumentEmbedder:
|
| 11 |
-
"""
|
| 12 |
-
Bertanggung jawab untuk:
|
| 13 |
-
1. Chunking dokumen panjang jadi potongan yang bisa di-embed
|
| 14 |
-
2. Membuat embedding vektor pakai model lokal (no API cost!)
|
| 15 |
-
"""
|
| 16 |
-
|
| 17 |
-
def __init__(self):
|
| 18 |
-
settings = get_settings()
|
| 19 |
-
logger.info(f"Loading embedding model: {settings.embedding_model}")
|
| 20 |
-
|
| 21 |
-
self.embeddings = HuggingFaceEmbeddings(
|
| 22 |
-
model_name=settings.embedding_model,
|
| 23 |
-
model_kwargs={"device": settings.embedding_device},
|
| 24 |
-
encode_kwargs={"normalize_embeddings": True},
|
| 25 |
-
)
|
| 26 |
-
|
| 27 |
-
self.splitter = RecursiveCharacterTextSplitter(
|
| 28 |
-
chunk_size=settings.chunk_size,
|
| 29 |
-
chunk_overlap=settings.chunk_overlap,
|
| 30 |
-
separators=["\n\n", "\n", ". ", " ", ""],
|
| 31 |
-
)
|
| 32 |
-
|
| 33 |
-
logger.info("Embedder ready.")
|
| 34 |
-
|
| 35 |
-
def chunk_documents(self, documents: List[Document]) -> List[Document]:
|
| 36 |
-
"""
|
| 37 |
-
Split dokumen panjang jadi chunks.
|
| 38 |
-
Metadata dari dokumen asli diwarisi ke setiap chunk.
|
| 39 |
-
"""
|
| 40 |
-
chunks = []
|
| 41 |
-
for doc in documents:
|
| 42 |
-
texts = self.splitter.split_text(doc.content)
|
| 43 |
-
for i, text in enumerate(texts):
|
| 44 |
-
chunk_metadata = {
|
| 45 |
-
**doc.metadata,
|
| 46 |
-
"chunk_index": i,
|
| 47 |
-
"total_chunks": len(texts),
|
| 48 |
-
"parent_doc_id": doc.doc_id,
|
| 49 |
-
}
|
| 50 |
-
chunks.append(Document(
|
| 51 |
-
content=text,
|
| 52 |
-
metadata=chunk_metadata,
|
| 53 |
-
))
|
| 54 |
-
|
| 55 |
-
logger.info(f"Chunked {len(documents)} docs β {len(chunks)} chunks")
|
| 56 |
-
return chunks
|
| 57 |
-
|
| 58 |
-
def get_embeddings_model(self):
|
| 59 |
-
"""Return LangChain-compatible embeddings object untuk ChromaDB."""
|
| 60 |
-
return self.embeddings
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
rag/src/llm/__init__.py
DELETED
|
@@ -1 +0,0 @@
|
|
| 1 |
-
|
|
|
|
|
|
rag/src/llm/groq_client.py
DELETED
|
@@ -1,62 +0,0 @@
|
|
| 1 |
-
from typing import List, Optional, Iterator
|
| 2 |
-
from loguru import logger
|
| 3 |
-
from langchain_groq import ChatGroq
|
| 4 |
-
from langchain.schema import BaseMessage, HumanMessage, AIMessage
|
| 5 |
-
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
|
| 6 |
-
|
| 7 |
-
from ..config import get_settings
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
class GroqClient:
|
| 11 |
-
"""
|
| 12 |
-
Wrapper di atas ChatGroq dari LangChain.
|
| 13 |
-
Mendukung regular call dan streaming.
|
| 14 |
-
"""
|
| 15 |
-
|
| 16 |
-
def __init__(self, streaming: bool = False):
|
| 17 |
-
settings = get_settings()
|
| 18 |
-
callbacks = [StreamingStdOutCallbackHandler()] if streaming else []
|
| 19 |
-
|
| 20 |
-
self.llm = ChatGroq(
|
| 21 |
-
api_key=settings.groq_api_key,
|
| 22 |
-
model_name=settings.groq_model,
|
| 23 |
-
temperature=settings.temperature,
|
| 24 |
-
max_tokens=settings.max_tokens,
|
| 25 |
-
streaming=streaming,
|
| 26 |
-
callbacks=callbacks,
|
| 27 |
-
)
|
| 28 |
-
self.model_name = settings.groq_model
|
| 29 |
-
logger.info(f"Groq client initialized. Model: {settings.groq_model}")
|
| 30 |
-
|
| 31 |
-
def invoke(self, messages: List[BaseMessage]) -> str:
|
| 32 |
-
"""Kirim messages dan return response string."""
|
| 33 |
-
response = self.llm.invoke(messages)
|
| 34 |
-
return response.content
|
| 35 |
-
|
| 36 |
-
def stream(self, messages: List[BaseMessage]) -> Iterator[str]:
|
| 37 |
-
"""Streaming response β yield token per token."""
|
| 38 |
-
for chunk in self.llm.stream(messages):
|
| 39 |
-
if chunk.content:
|
| 40 |
-
yield chunk.content
|
| 41 |
-
|
| 42 |
-
def get_langchain_llm(self):
|
| 43 |
-
"""Return raw LangChain LLM object untuk dipakai di chain."""
|
| 44 |
-
return self.llm
|
| 45 |
-
|
| 46 |
-
@staticmethod
|
| 47 |
-
def build_messages(
|
| 48 |
-
question: str,
|
| 49 |
-
chat_history: Optional[List[dict]] = None,
|
| 50 |
-
) -> List[BaseMessage]:
|
| 51 |
-
"""
|
| 52 |
-
Convert chat history format ke LangChain messages.
|
| 53 |
-
chat_history format: [{"role": "user"/"assistant", "content": "..."}]
|
| 54 |
-
"""
|
| 55 |
-
messages = []
|
| 56 |
-
for msg in (chat_history or []):
|
| 57 |
-
if msg["role"] == "user":
|
| 58 |
-
messages.append(HumanMessage(content=msg["content"]))
|
| 59 |
-
elif msg["role"] == "assistant":
|
| 60 |
-
messages.append(AIMessage(content=msg["content"]))
|
| 61 |
-
messages.append(HumanMessage(content=question))
|
| 62 |
-
return messages
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
rag/src/llm/prompt_templates.py
DELETED
|
@@ -1,36 +0,0 @@
|
|
| 1 |
-
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
|
| 2 |
-
|
| 3 |
-
# === RAG QA Prompt ===
|
| 4 |
-
RAG_PROMPT = ChatPromptTemplate.from_messages([
|
| 5 |
-
("system", """Kamu adalah AI assistant yang menjawab pertanyaan berdasarkan konteks dokumen yang diberikan.
|
| 6 |
-
|
| 7 |
-
ATURAN:
|
| 8 |
-
- Jawab HANYA berdasarkan konteks yang disediakan
|
| 9 |
-
- Jika jawaban tidak ada di konteks, katakan "Informasi ini tidak tersedia dalam dokumen yang diberikan"
|
| 10 |
-
- Selalu sebutkan sumber dokumen jika relevan
|
| 11 |
-
- Jawab dalam bahasa yang sama dengan pertanyaan pengguna
|
| 12 |
-
- Berikan jawaban yang ringkas, akurat, dan terstruktur
|
| 13 |
-
|
| 14 |
-
KONTEKS DOKUMEN:
|
| 15 |
-
{context}
|
| 16 |
-
"""),
|
| 17 |
-
MessagesPlaceholder(variable_name="chat_history", optional=True),
|
| 18 |
-
("human", "{question}"),
|
| 19 |
-
])
|
| 20 |
-
|
| 21 |
-
# === Standalone Question Prompt (untuk rephrase pertanyaan follow-up) ===
|
| 22 |
-
CONDENSE_QUESTION_PROMPT = ChatPromptTemplate.from_messages([
|
| 23 |
-
("system", """Diberikan riwayat percakapan dan pertanyaan terbaru dari pengguna,
|
| 24 |
-
reformulasikan pertanyaan menjadi pertanyaan mandiri yang bisa dipahami tanpa konteks percakapan sebelumnya.
|
| 25 |
-
Jangan jawab pertanyaannya, cukup reformulasikan jika perlu. Jika tidak perlu, kembalikan apa adanya."""),
|
| 26 |
-
MessagesPlaceholder(variable_name="chat_history"),
|
| 27 |
-
("human", "{question}"),
|
| 28 |
-
])
|
| 29 |
-
|
| 30 |
-
# === Summary Prompt ===
|
| 31 |
-
SUMMARY_PROMPT = ChatPromptTemplate.from_messages([
|
| 32 |
-
("system", """Buat ringkasan dari dokumen berikut.
|
| 33 |
-
Sertakan: poin-poin utama, informasi kunci, dan kesimpulan.
|
| 34 |
-
Format: gunakan bullet points untuk keterbacaan yang baik."""),
|
| 35 |
-
("human", "Dokumen:\n{document}\n\nBuat ringkasan:"),
|
| 36 |
-
])
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
rag/src/loaders/__init__.py
DELETED
|
@@ -1,69 +0,0 @@
|
|
| 1 |
-
from typing import List
|
| 2 |
-
from pathlib import Path
|
| 3 |
-
from loguru import logger
|
| 4 |
-
|
| 5 |
-
from .base_loader import BaseLoader, Document
|
| 6 |
-
from .pdf_loader import PDFLoader
|
| 7 |
-
from .text_loader import TextLoader
|
| 8 |
-
from .docx_loader import DocxLoader
|
| 9 |
-
from .web_loader import WebLoader
|
| 10 |
-
from .json_loader import JSONLoader
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
class LoaderFactory:
|
| 14 |
-
"""
|
| 15 |
-
Auto-detect loader yang tepat berdasarkan ekstensi file atau URL.
|
| 16 |
-
Pattern: Factory Method β client tidak perlu tahu loader mana yang dipakai.
|
| 17 |
-
"""
|
| 18 |
-
|
| 19 |
-
_loaders: dict[str, BaseLoader] = {
|
| 20 |
-
".pdf": PDFLoader(),
|
| 21 |
-
".txt": TextLoader(),
|
| 22 |
-
".md": TextLoader(),
|
| 23 |
-
".markdown": TextLoader(),
|
| 24 |
-
".docx": DocxLoader(),
|
| 25 |
-
".doc": DocxLoader(),
|
| 26 |
-
".json": JSONLoader(),
|
| 27 |
-
".jsonl": JSONLoader(),
|
| 28 |
-
}
|
| 29 |
-
|
| 30 |
-
@classmethod
|
| 31 |
-
def get_loader(cls, source: str) -> BaseLoader:
|
| 32 |
-
"""Pilih loader yang sesuai untuk source."""
|
| 33 |
-
# URL
|
| 34 |
-
if source.startswith(("http://", "https://")):
|
| 35 |
-
return WebLoader()
|
| 36 |
-
|
| 37 |
-
# File
|
| 38 |
-
ext = Path(source).suffix.lower()
|
| 39 |
-
loader = cls._loaders.get(ext)
|
| 40 |
-
if loader is None:
|
| 41 |
-
raise ValueError(
|
| 42 |
-
f"Tidak ada loader untuk ekstensi '{ext}'. "
|
| 43 |
-
f"Didukung: {list(cls._loaders.keys())} + URL"
|
| 44 |
-
)
|
| 45 |
-
return loader
|
| 46 |
-
|
| 47 |
-
@classmethod
|
| 48 |
-
def load(cls, source: str) -> List[Document]:
|
| 49 |
-
"""One-liner: auto-detect loader dan langsung load."""
|
| 50 |
-
loader = cls.get_loader(source)
|
| 51 |
-
logger.info(f"Using {loader.__class__.__name__} for: {source}")
|
| 52 |
-
return loader.load(source)
|
| 53 |
-
|
| 54 |
-
@classmethod
|
| 55 |
-
def load_many(cls, sources: List[str]) -> List[Document]:
|
| 56 |
-
"""Load multiple sources sekaligus."""
|
| 57 |
-
all_docs = []
|
| 58 |
-
for source in sources:
|
| 59 |
-
try:
|
| 60 |
-
docs = cls.load(source)
|
| 61 |
-
all_docs.extend(docs)
|
| 62 |
-
logger.info(f"Loaded {len(docs)} docs from {source}")
|
| 63 |
-
except Exception as e:
|
| 64 |
-
logger.error(f"Gagal load {source}: {e}")
|
| 65 |
-
logger.info(f"Total loaded: {len(all_docs)} documents")
|
| 66 |
-
return all_docs
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
__all__ = ["LoaderFactory", "Document", "BaseLoader"]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
rag/src/loaders/base_loader.py
DELETED
|
@@ -1,39 +0,0 @@
|
|
| 1 |
-
from abc import ABC, abstractmethod
|
| 2 |
-
from dataclasses import dataclass, field
|
| 3 |
-
from typing import List, Optional
|
| 4 |
-
from pathlib import Path
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
@dataclass
|
| 8 |
-
class Document:
|
| 9 |
-
"""Representasi satu dokumen atau chunk yang sudah diload."""
|
| 10 |
-
content: str
|
| 11 |
-
metadata: dict = field(default_factory=dict)
|
| 12 |
-
doc_id: Optional[str] = None
|
| 13 |
-
|
| 14 |
-
def __post_init__(self):
|
| 15 |
-
if self.doc_id is None:
|
| 16 |
-
import hashlib
|
| 17 |
-
self.doc_id = hashlib.md5(self.content.encode()).hexdigest()[:12]
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
class BaseLoader(ABC):
|
| 21 |
-
"""Abstract base class untuk semua document loaders."""
|
| 22 |
-
|
| 23 |
-
@abstractmethod
|
| 24 |
-
def load(self, source: str) -> List[Document]:
|
| 25 |
-
"""
|
| 26 |
-
Load dokumen dari source (path file atau URL).
|
| 27 |
-
Returns list of Document objects.
|
| 28 |
-
"""
|
| 29 |
-
pass
|
| 30 |
-
|
| 31 |
-
def validate_source(self, source: str) -> bool:
|
| 32 |
-
"""Validasi apakah source bisa di-handle loader ini."""
|
| 33 |
-
return True
|
| 34 |
-
|
| 35 |
-
@property
|
| 36 |
-
@abstractmethod
|
| 37 |
-
def supported_extensions(self) -> List[str]:
|
| 38 |
-
"""Daftar ekstensi file yang didukung loader ini."""
|
| 39 |
-
pass
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
rag/src/loaders/docx_loader.py
DELETED
|
@@ -1,52 +0,0 @@
|
|
| 1 |
-
from typing import List
|
| 2 |
-
from pathlib import Path
|
| 3 |
-
from loguru import logger
|
| 4 |
-
|
| 5 |
-
from .base_loader import BaseLoader, Document
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
class DocxLoader(BaseLoader):
|
| 9 |
-
"""Loader untuk file .docx menggunakan python-docx."""
|
| 10 |
-
|
| 11 |
-
@property
|
| 12 |
-
def supported_extensions(self) -> List[str]:
|
| 13 |
-
return [".docx", ".doc"]
|
| 14 |
-
|
| 15 |
-
def load(self, source: str) -> List[Document]:
|
| 16 |
-
try:
|
| 17 |
-
from docx import Document as DocxDocument
|
| 18 |
-
except ImportError:
|
| 19 |
-
raise ImportError("Install python-docx: pip install python-docx")
|
| 20 |
-
|
| 21 |
-
path = Path(source)
|
| 22 |
-
if not path.exists():
|
| 23 |
-
raise FileNotFoundError(f"File tidak ditemukan: {source}")
|
| 24 |
-
|
| 25 |
-
logger.info(f"Loading DOCX: {path.name}")
|
| 26 |
-
doc = DocxDocument(str(path))
|
| 27 |
-
|
| 28 |
-
# Ambil semua paragraf yang tidak kosong
|
| 29 |
-
paragraphs = [p.text.strip() for p in doc.paragraphs if p.text.strip()]
|
| 30 |
-
content = "\n\n".join(paragraphs)
|
| 31 |
-
|
| 32 |
-
# Ambil teks dari tabel juga
|
| 33 |
-
table_texts = []
|
| 34 |
-
for table in doc.tables:
|
| 35 |
-
for row in table.rows:
|
| 36 |
-
row_text = " | ".join(cell.text.strip() for cell in row.cells if cell.text.strip())
|
| 37 |
-
if row_text:
|
| 38 |
-
table_texts.append(row_text)
|
| 39 |
-
|
| 40 |
-
if table_texts:
|
| 41 |
-
content += "\n\n[Tables]\n" + "\n".join(table_texts)
|
| 42 |
-
|
| 43 |
-
return [Document(
|
| 44 |
-
content=content,
|
| 45 |
-
metadata={
|
| 46 |
-
"source": str(path),
|
| 47 |
-
"filename": path.name,
|
| 48 |
-
"type": "docx",
|
| 49 |
-
"paragraphs": len(paragraphs),
|
| 50 |
-
"tables": len(doc.tables),
|
| 51 |
-
}
|
| 52 |
-
)]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
rag/src/loaders/json_loader.py
DELETED
|
@@ -1,103 +0,0 @@
|
|
| 1 |
-
import json
|
| 2 |
-
from typing import List
|
| 3 |
-
from pathlib import Path
|
| 4 |
-
from loguru import logger
|
| 5 |
-
|
| 6 |
-
from .base_loader import BaseLoader, Document
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
class JSONLoader(BaseLoader):
|
| 10 |
-
"""
|
| 11 |
-
Loader untuk file JSON.
|
| 12 |
-
Bisa flatten nested JSON menjadi teks untuk di-embed.
|
| 13 |
-
"""
|
| 14 |
-
|
| 15 |
-
def __init__(self, text_key: str = None, jq_schema: str = None):
|
| 16 |
-
"""
|
| 17 |
-
text_key: key spesifik yang jadi konten utama (e.g. 'content', 'text')
|
| 18 |
-
jq_schema: opsional β filter JSON pakai jq-style path
|
| 19 |
-
"""
|
| 20 |
-
self.text_key = text_key
|
| 21 |
-
self.jq_schema = jq_schema
|
| 22 |
-
|
| 23 |
-
@property
|
| 24 |
-
def supported_extensions(self) -> List[str]:
|
| 25 |
-
return [".json", ".jsonl"]
|
| 26 |
-
|
| 27 |
-
def load(self, source: str) -> List[Document]:
|
| 28 |
-
path = Path(source)
|
| 29 |
-
if not path.exists():
|
| 30 |
-
raise FileNotFoundError(f"File tidak ditemukan: {source}")
|
| 31 |
-
|
| 32 |
-
logger.info(f"Loading JSON: {path.name}")
|
| 33 |
-
|
| 34 |
-
# Handle JSONL (JSON Lines)
|
| 35 |
-
if path.suffix == ".jsonl":
|
| 36 |
-
return self._load_jsonl(path)
|
| 37 |
-
|
| 38 |
-
with open(path, "r", encoding="utf-8") as f:
|
| 39 |
-
data = json.load(f)
|
| 40 |
-
|
| 41 |
-
# Jika list of records
|
| 42 |
-
if isinstance(data, list):
|
| 43 |
-
documents = []
|
| 44 |
-
for i, record in enumerate(data):
|
| 45 |
-
content = self._extract_content(record)
|
| 46 |
-
documents.append(Document(
|
| 47 |
-
content=content,
|
| 48 |
-
metadata={
|
| 49 |
-
"source": str(path),
|
| 50 |
-
"filename": path.name,
|
| 51 |
-
"type": "json",
|
| 52 |
-
"record_index": i,
|
| 53 |
-
}
|
| 54 |
-
))
|
| 55 |
-
return documents
|
| 56 |
-
|
| 57 |
-
# Single object
|
| 58 |
-
content = self._extract_content(data)
|
| 59 |
-
return [Document(
|
| 60 |
-
content=content,
|
| 61 |
-
metadata={
|
| 62 |
-
"source": str(path),
|
| 63 |
-
"filename": path.name,
|
| 64 |
-
"type": "json",
|
| 65 |
-
}
|
| 66 |
-
)]
|
| 67 |
-
|
| 68 |
-
def _load_jsonl(self, path: Path) -> List[Document]:
|
| 69 |
-
documents = []
|
| 70 |
-
with open(path, "r", encoding="utf-8") as f:
|
| 71 |
-
for i, line in enumerate(f):
|
| 72 |
-
line = line.strip()
|
| 73 |
-
if not line:
|
| 74 |
-
continue
|
| 75 |
-
record = json.loads(line)
|
| 76 |
-
content = self._extract_content(record)
|
| 77 |
-
documents.append(Document(
|
| 78 |
-
content=content,
|
| 79 |
-
metadata={
|
| 80 |
-
"source": str(path),
|
| 81 |
-
"filename": path.name,
|
| 82 |
-
"type": "jsonl",
|
| 83 |
-
"line": i + 1,
|
| 84 |
-
}
|
| 85 |
-
))
|
| 86 |
-
return documents
|
| 87 |
-
|
| 88 |
-
def _extract_content(self, data: dict) -> str:
|
| 89 |
-
"""Konversi dict/list ke string yang bisa di-embed."""
|
| 90 |
-
if self.text_key and isinstance(data, dict) and self.text_key in data:
|
| 91 |
-
return str(data[self.text_key])
|
| 92 |
-
|
| 93 |
-
# Fallback: flatten semua key-value pair
|
| 94 |
-
if isinstance(data, dict):
|
| 95 |
-
parts = []
|
| 96 |
-
for k, v in data.items():
|
| 97 |
-
if isinstance(v, (str, int, float, bool)):
|
| 98 |
-
parts.append(f"{k}: {v}")
|
| 99 |
-
elif isinstance(v, (list, dict)):
|
| 100 |
-
parts.append(f"{k}: {json.dumps(v, ensure_ascii=False)}")
|
| 101 |
-
return "\n".join(parts)
|
| 102 |
-
|
| 103 |
-
return json.dumps(data, ensure_ascii=False, indent=2)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
rag/src/loaders/pdf_loader.py
DELETED
|
@@ -1,46 +0,0 @@
|
|
| 1 |
-
from typing import List
|
| 2 |
-
from pathlib import Path
|
| 3 |
-
from loguru import logger
|
| 4 |
-
|
| 5 |
-
from .base_loader import BaseLoader, Document
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
class PDFLoader(BaseLoader):
|
| 9 |
-
"""Loader untuk file PDF menggunakan pypdf."""
|
| 10 |
-
|
| 11 |
-
@property
|
| 12 |
-
def supported_extensions(self) -> List[str]:
|
| 13 |
-
return [".pdf"]
|
| 14 |
-
|
| 15 |
-
def load(self, source: str) -> List[Document]:
|
| 16 |
-
try:
|
| 17 |
-
from pypdf import PdfReader
|
| 18 |
-
except ImportError:
|
| 19 |
-
raise ImportError("Install pypdf: pip install pypdf")
|
| 20 |
-
|
| 21 |
-
path = Path(source)
|
| 22 |
-
if not path.exists():
|
| 23 |
-
raise FileNotFoundError(f"File tidak ditemukan: {source}")
|
| 24 |
-
|
| 25 |
-
logger.info(f"Loading PDF: {path.name}")
|
| 26 |
-
reader = PdfReader(str(path))
|
| 27 |
-
documents = []
|
| 28 |
-
|
| 29 |
-
for i, page in enumerate(reader.pages):
|
| 30 |
-
text = page.extract_text()
|
| 31 |
-
if not text or not text.strip():
|
| 32 |
-
continue
|
| 33 |
-
|
| 34 |
-
documents.append(Document(
|
| 35 |
-
content=text.strip(),
|
| 36 |
-
metadata={
|
| 37 |
-
"source": str(path),
|
| 38 |
-
"filename": path.name,
|
| 39 |
-
"page": i + 1,
|
| 40 |
-
"total_pages": len(reader.pages),
|
| 41 |
-
"type": "pdf",
|
| 42 |
-
}
|
| 43 |
-
))
|
| 44 |
-
|
| 45 |
-
logger.info(f"Loaded {len(documents)} pages from {path.name}")
|
| 46 |
-
return documents
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
rag/src/loaders/text_loader.py
DELETED
|
@@ -1,31 +0,0 @@
|
|
| 1 |
-
from typing import List
|
| 2 |
-
from pathlib import Path
|
| 3 |
-
from loguru import logger
|
| 4 |
-
|
| 5 |
-
from .base_loader import BaseLoader, Document
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
class TextLoader(BaseLoader):
|
| 9 |
-
"""Loader untuk file .txt dan .md."""
|
| 10 |
-
|
| 11 |
-
@property
|
| 12 |
-
def supported_extensions(self) -> List[str]:
|
| 13 |
-
return [".txt", ".md", ".markdown"]
|
| 14 |
-
|
| 15 |
-
def load(self, source: str) -> List[Document]:
|
| 16 |
-
path = Path(source)
|
| 17 |
-
if not path.exists():
|
| 18 |
-
raise FileNotFoundError(f"File tidak ditemukan: {source}")
|
| 19 |
-
|
| 20 |
-
logger.info(f"Loading text file: {path.name}")
|
| 21 |
-
content = path.read_text(encoding="utf-8")
|
| 22 |
-
|
| 23 |
-
return [Document(
|
| 24 |
-
content=content,
|
| 25 |
-
metadata={
|
| 26 |
-
"source": str(path),
|
| 27 |
-
"filename": path.name,
|
| 28 |
-
"type": path.suffix.lstrip("."),
|
| 29 |
-
"size_chars": len(content),
|
| 30 |
-
}
|
| 31 |
-
)]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
rag/src/loaders/web_loader.py
DELETED
|
@@ -1,57 +0,0 @@
|
|
| 1 |
-
from typing import List
|
| 2 |
-
from loguru import logger
|
| 3 |
-
|
| 4 |
-
from .base_loader import BaseLoader, Document
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
class WebLoader(BaseLoader):
|
| 8 |
-
"""Loader untuk URL β scrape konten teks dari halaman web."""
|
| 9 |
-
|
| 10 |
-
@property
|
| 11 |
-
def supported_extensions(self) -> List[str]:
|
| 12 |
-
return [] # Tidak berbasis ekstensi, berbasis URL
|
| 13 |
-
|
| 14 |
-
def validate_source(self, source: str) -> bool:
|
| 15 |
-
return source.startswith(("http://", "https://"))
|
| 16 |
-
|
| 17 |
-
def load(self, source: str) -> List[Document]:
|
| 18 |
-
try:
|
| 19 |
-
import requests
|
| 20 |
-
from bs4 import BeautifulSoup
|
| 21 |
-
except ImportError:
|
| 22 |
-
raise ImportError("Install: pip install requests beautifulsoup4")
|
| 23 |
-
|
| 24 |
-
logger.info(f"Fetching URL: {source}")
|
| 25 |
-
headers = {"User-Agent": "Mozilla/5.0 (compatible; RAG-Pipeline/1.0)"}
|
| 26 |
-
|
| 27 |
-
response = requests.get(source, headers=headers, timeout=15)
|
| 28 |
-
response.raise_for_status()
|
| 29 |
-
|
| 30 |
-
soup = BeautifulSoup(response.text, "html.parser")
|
| 31 |
-
|
| 32 |
-
# Hapus tag yang tidak relevan
|
| 33 |
-
for tag in soup(["script", "style", "nav", "footer", "header", "aside"]):
|
| 34 |
-
tag.decompose()
|
| 35 |
-
|
| 36 |
-
# Ambil judul
|
| 37 |
-
title = soup.find("title")
|
| 38 |
-
title_text = title.get_text(strip=True) if title else ""
|
| 39 |
-
|
| 40 |
-
# Ambil konten utama
|
| 41 |
-
main = soup.find("main") or soup.find("article") or soup.find("body")
|
| 42 |
-
content = main.get_text(separator="\n", strip=True) if main else soup.get_text(separator="\n", strip=True)
|
| 43 |
-
|
| 44 |
-
# Bersihkan baris kosong berulang
|
| 45 |
-
lines = [line for line in content.splitlines() if line.strip()]
|
| 46 |
-
content = "\n".join(lines)
|
| 47 |
-
|
| 48 |
-
return [Document(
|
| 49 |
-
content=content,
|
| 50 |
-
metadata={
|
| 51 |
-
"source": source,
|
| 52 |
-
"title": title_text,
|
| 53 |
-
"type": "web",
|
| 54 |
-
"status_code": response.status_code,
|
| 55 |
-
"content_length": len(content),
|
| 56 |
-
}
|
| 57 |
-
)]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
rag/src/retrieval/__init__.py
DELETED
|
@@ -1 +0,0 @@
|
|
| 1 |
-
|
|
|
|
|
|
rag/src/retrieval/retriever.py
DELETED
|
@@ -1,211 +0,0 @@
|
|
| 1 |
-
from typing import List, Optional, Iterator
|
| 2 |
-
from loguru import logger
|
| 3 |
-
import mlflow
|
| 4 |
-
import time
|
| 5 |
-
|
| 6 |
-
from langchain.schema import HumanMessage, AIMessage
|
| 7 |
-
|
| 8 |
-
from ..config import get_settings
|
| 9 |
-
from ..retrieval.vector_store import VectorStore
|
| 10 |
-
from ..llm.groq_client import GroqClient
|
| 11 |
-
from ..llm.prompt_templates import RAG_PROMPT, SUMMARY_PROMPT
|
| 12 |
-
from ..loaders import LoaderFactory, Document
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
class RAGRetriever:
|
| 16 |
-
"""
|
| 17 |
-
Core class yang menyatukan semua komponen RAG:
|
| 18 |
-
Document Loading β Chunking β Embedding β Retrieval β Generation
|
| 19 |
-
"""
|
| 20 |
-
|
| 21 |
-
def __init__(self):
|
| 22 |
-
self.settings = get_settings()
|
| 23 |
-
self.vector_store = VectorStore()
|
| 24 |
-
self.groq = GroqClient()
|
| 25 |
-
self._setup_mlflow()
|
| 26 |
-
logger.info("RAGRetriever initialized.")
|
| 27 |
-
|
| 28 |
-
def _setup_mlflow(self):
|
| 29 |
-
mlflow.set_tracking_uri(self.settings.mlflow_tracking_uri)
|
| 30 |
-
mlflow.set_experiment(self.settings.mlflow_experiment_name)
|
| 31 |
-
|
| 32 |
-
# === INDEXING ===
|
| 33 |
-
|
| 34 |
-
def ingest(self, sources: List[str]) -> dict:
|
| 35 |
-
"""
|
| 36 |
-
Load, chunk, embed, dan index dokumen dari berbagai sources.
|
| 37 |
-
|
| 38 |
-
Args:
|
| 39 |
-
sources: List of file paths atau URLs
|
| 40 |
-
|
| 41 |
-
Returns:
|
| 42 |
-
dict berisi stats indexing
|
| 43 |
-
"""
|
| 44 |
-
logger.info(f"Ingesting {len(sources)} sources...")
|
| 45 |
-
start = time.time()
|
| 46 |
-
|
| 47 |
-
with mlflow.start_run(run_name="ingest"):
|
| 48 |
-
mlflow.log_params({
|
| 49 |
-
"sources_count": len(sources),
|
| 50 |
-
"chunk_size": self.settings.chunk_size,
|
| 51 |
-
"chunk_overlap": self.settings.chunk_overlap,
|
| 52 |
-
"embedding_model": self.settings.embedding_model,
|
| 53 |
-
})
|
| 54 |
-
|
| 55 |
-
# Load semua dokumen
|
| 56 |
-
documents = LoaderFactory.load_many(sources)
|
| 57 |
-
|
| 58 |
-
# Index ke vector store
|
| 59 |
-
chunks_indexed = self.vector_store.add_documents(documents)
|
| 60 |
-
|
| 61 |
-
elapsed = time.time() - start
|
| 62 |
-
stats = {
|
| 63 |
-
"documents_loaded": len(documents),
|
| 64 |
-
"chunks_indexed": chunks_indexed,
|
| 65 |
-
"sources": sources,
|
| 66 |
-
"elapsed_seconds": round(elapsed, 2),
|
| 67 |
-
"total_docs_in_store": self.vector_store.count(),
|
| 68 |
-
}
|
| 69 |
-
|
| 70 |
-
mlflow.log_metrics({
|
| 71 |
-
"documents_loaded": len(documents),
|
| 72 |
-
"chunks_indexed": chunks_indexed,
|
| 73 |
-
"elapsed_seconds": elapsed,
|
| 74 |
-
})
|
| 75 |
-
|
| 76 |
-
logger.info(f"Ingestion selesai: {stats}")
|
| 77 |
-
return stats
|
| 78 |
-
|
| 79 |
-
# === QUERYING ===
|
| 80 |
-
|
| 81 |
-
def query(
|
| 82 |
-
self,
|
| 83 |
-
question: str,
|
| 84 |
-
chat_history: Optional[List[dict]] = None,
|
| 85 |
-
top_k: Optional[int] = None,
|
| 86 |
-
return_sources: bool = True,
|
| 87 |
-
) -> dict:
|
| 88 |
-
"""
|
| 89 |
-
Jawab pertanyaan menggunakan RAG.
|
| 90 |
-
|
| 91 |
-
Args:
|
| 92 |
-
question: Pertanyaan user
|
| 93 |
-
chat_history: Riwayat chat [{"role": "user"/"assistant", "content": "..."}]
|
| 94 |
-
top_k: Jumlah chunks yang diretrieve
|
| 95 |
-
return_sources: Sertakan source chunks di response
|
| 96 |
-
|
| 97 |
-
Returns:
|
| 98 |
-
dict dengan 'answer', 'sources', dan 'metadata'
|
| 99 |
-
"""
|
| 100 |
-
start = time.time()
|
| 101 |
-
logger.info(f"Query: '{question[:80]}...'")
|
| 102 |
-
|
| 103 |
-
with mlflow.start_run(run_name="query"):
|
| 104 |
-
mlflow.log_param("question", question[:250])
|
| 105 |
-
mlflow.log_param("model", self.settings.groq_model)
|
| 106 |
-
|
| 107 |
-
# Retrieve relevant chunks
|
| 108 |
-
k = top_k or self.settings.top_k_retrieval
|
| 109 |
-
|
| 110 |
-
# Guard: jika store kosong, langsung jawab tanpa retrieval
|
| 111 |
-
store_count = self.vector_store.count()
|
| 112 |
-
if store_count == 0:
|
| 113 |
-
retrieved = []
|
| 114 |
-
else:
|
| 115 |
-
try:
|
| 116 |
-
retrieved = self.vector_store.similarity_search_with_score(question, k=k)
|
| 117 |
-
except Exception as e:
|
| 118 |
-
logger.warning(f"Retrieval failed (mungkin store kosong): {e}")
|
| 119 |
-
retrieved = []
|
| 120 |
-
|
| 121 |
-
# Format context
|
| 122 |
-
context_parts = []
|
| 123 |
-
sources = []
|
| 124 |
-
for i, (doc, score) in enumerate(retrieved):
|
| 125 |
-
source_info = doc.metadata.get("filename", doc.metadata.get("source", "Unknown"))
|
| 126 |
-
page_info = f" (hal. {doc.metadata['page']})" if "page" in doc.metadata else ""
|
| 127 |
-
context_parts.append(
|
| 128 |
-
f"[Sumber {i+1}: {source_info}{page_info} | Relevansi: {1-score:.2f}]\n{doc.page_content}"
|
| 129 |
-
)
|
| 130 |
-
sources.append({
|
| 131 |
-
"content": doc.page_content[:300] + "..." if len(doc.page_content) > 300 else doc.page_content,
|
| 132 |
-
"metadata": doc.metadata,
|
| 133 |
-
"relevance_score": round(1 - score, 4),
|
| 134 |
-
})
|
| 135 |
-
|
| 136 |
-
context = "\n\n---\n\n".join(context_parts) if context_parts else "(Tidak ada dokumen yang di-index. Silakan upload dokumen terlebih dahulu.)"
|
| 137 |
-
|
| 138 |
-
# Build chat history with correct message types
|
| 139 |
-
history_messages = []
|
| 140 |
-
for m in (chat_history or []):
|
| 141 |
-
if m["role"] == "user":
|
| 142 |
-
history_messages.append(HumanMessage(content=m["content"]))
|
| 143 |
-
elif m["role"] == "assistant":
|
| 144 |
-
history_messages.append(AIMessage(content=m["content"]))
|
| 145 |
-
|
| 146 |
-
# Build prompt dan generate
|
| 147 |
-
formatted_prompt = RAG_PROMPT.format_messages(
|
| 148 |
-
context=context,
|
| 149 |
-
question=question,
|
| 150 |
-
chat_history=history_messages,
|
| 151 |
-
)
|
| 152 |
-
|
| 153 |
-
answer = self.groq.invoke(formatted_prompt)
|
| 154 |
-
elapsed = time.time() - start
|
| 155 |
-
|
| 156 |
-
mlflow.log_metrics({
|
| 157 |
-
"chunks_retrieved": len(retrieved),
|
| 158 |
-
"answer_length": len(answer),
|
| 159 |
-
"latency_seconds": elapsed,
|
| 160 |
-
})
|
| 161 |
-
|
| 162 |
-
result = {
|
| 163 |
-
"answer": answer,
|
| 164 |
-
"question": question,
|
| 165 |
-
"latency_seconds": round(elapsed, 2),
|
| 166 |
-
"chunks_retrieved": len(retrieved),
|
| 167 |
-
}
|
| 168 |
-
if return_sources:
|
| 169 |
-
result["sources"] = sources
|
| 170 |
-
|
| 171 |
-
return result
|
| 172 |
-
|
| 173 |
-
def stream_query(
|
| 174 |
-
self,
|
| 175 |
-
question: str,
|
| 176 |
-
chat_history: Optional[List[dict]] = None,
|
| 177 |
-
) -> Iterator[str]:
|
| 178 |
-
"""Streaming version dari query β yield token per token."""
|
| 179 |
-
retrieved = self.vector_store.similarity_search(question)
|
| 180 |
-
context = "\n\n---\n\n".join(
|
| 181 |
-
f"[Sumber: {doc.metadata.get('filename', 'Unknown')}]\n{doc.page_content}"
|
| 182 |
-
for doc in retrieved
|
| 183 |
-
)
|
| 184 |
-
formatted = RAG_PROMPT.format_messages(
|
| 185 |
-
context=context,
|
| 186 |
-
question=question,
|
| 187 |
-
chat_history=[],
|
| 188 |
-
)
|
| 189 |
-
groq_stream = GroqClient(streaming=True)
|
| 190 |
-
yield from groq_stream.stream(formatted)
|
| 191 |
-
|
| 192 |
-
def summarize(self, source: str) -> str:
|
| 193 |
-
"""Buat ringkasan dari satu dokumen."""
|
| 194 |
-
documents = LoaderFactory.load(source)
|
| 195 |
-
full_text = "\n\n".join(doc.content for doc in documents)
|
| 196 |
-
|
| 197 |
-
# Truncate jika terlalu panjang
|
| 198 |
-
if len(full_text) > 12000:
|
| 199 |
-
full_text = full_text[:12000] + "\n...[dokumen dipotong untuk efisiensi]"
|
| 200 |
-
|
| 201 |
-
messages = SUMMARY_PROMPT.format_messages(document=full_text)
|
| 202 |
-
return self.groq.invoke(messages)
|
| 203 |
-
|
| 204 |
-
def get_stats(self) -> dict:
|
| 205 |
-
"""Statistik vector store saat ini."""
|
| 206 |
-
return {
|
| 207 |
-
"total_chunks": self.vector_store.count(),
|
| 208 |
-
"collection_name": self.settings.chroma_collection_name,
|
| 209 |
-
"embedding_model": self.settings.embedding_model,
|
| 210 |
-
"llm_model": self.settings.groq_model,
|
| 211 |
-
}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
rag/src/retrieval/vector_store.py
DELETED
|
@@ -1,93 +0,0 @@
|
|
| 1 |
-
from typing import List, Optional
|
| 2 |
-
from loguru import logger
|
| 3 |
-
from langchain_chroma import Chroma
|
| 4 |
-
from langchain.schema import Document as LCDocument
|
| 5 |
-
|
| 6 |
-
from ..config import get_settings
|
| 7 |
-
from ..loaders.base_loader import Document
|
| 8 |
-
from ..embeddings.embedder import DocumentEmbedder
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
class VectorStore:
|
| 12 |
-
"""
|
| 13 |
-
Wrapper di atas ChromaDB.
|
| 14 |
-
Menangani indexing, persistence, dan similarity search.
|
| 15 |
-
"""
|
| 16 |
-
|
| 17 |
-
def __init__(self, embedder: Optional[DocumentEmbedder] = None):
|
| 18 |
-
settings = get_settings()
|
| 19 |
-
self.embedder = embedder or DocumentEmbedder()
|
| 20 |
-
self.settings = settings
|
| 21 |
-
|
| 22 |
-
self.db = Chroma(
|
| 23 |
-
collection_name=settings.chroma_collection_name,
|
| 24 |
-
embedding_function=self.embedder.get_embeddings_model(),
|
| 25 |
-
persist_directory=settings.chroma_persist_dir,
|
| 26 |
-
)
|
| 27 |
-
logger.info(
|
| 28 |
-
f"VectorStore ready. Collection: '{settings.chroma_collection_name}' "
|
| 29 |
-
f"| Docs: {self.db._collection.count()}"
|
| 30 |
-
)
|
| 31 |
-
|
| 32 |
-
def add_documents(self, documents: List[Document]) -> int:
|
| 33 |
-
"""
|
| 34 |
-
Chunk dan index dokumen ke ChromaDB.
|
| 35 |
-
Returns: jumlah chunks yang berhasil di-index.
|
| 36 |
-
"""
|
| 37 |
-
chunks = self.embedder.chunk_documents(documents)
|
| 38 |
-
|
| 39 |
-
# Konversi ke format LangChain
|
| 40 |
-
lc_docs = [
|
| 41 |
-
LCDocument(page_content=chunk.content, metadata=chunk.metadata)
|
| 42 |
-
for chunk in chunks
|
| 43 |
-
]
|
| 44 |
-
|
| 45 |
-
self.db.add_documents(lc_docs)
|
| 46 |
-
logger.info(f"Indexed {len(chunks)} chunks ke ChromaDB.")
|
| 47 |
-
return len(chunks)
|
| 48 |
-
|
| 49 |
-
def similarity_search(
|
| 50 |
-
self,
|
| 51 |
-
query: str,
|
| 52 |
-
k: Optional[int] = None,
|
| 53 |
-
filter: Optional[dict] = None,
|
| 54 |
-
) -> List[LCDocument]:
|
| 55 |
-
"""Cari dokumen paling relevan berdasarkan query."""
|
| 56 |
-
k = k or self.settings.top_k_retrieval
|
| 57 |
-
results = self.db.similarity_search(query, k=k, filter=filter)
|
| 58 |
-
logger.debug(f"Retrieved {len(results)} chunks for query: '{query[:60]}...'")
|
| 59 |
-
return results
|
| 60 |
-
|
| 61 |
-
def similarity_search_with_score(
|
| 62 |
-
self,
|
| 63 |
-
query: str,
|
| 64 |
-
k: Optional[int] = None,
|
| 65 |
-
) -> List[tuple]:
|
| 66 |
-
"""Sama seperti similarity_search tapi return (doc, score)."""
|
| 67 |
-
k = k or self.settings.top_k_retrieval
|
| 68 |
-
return self.db.similarity_search_with_score(query, k=k)
|
| 69 |
-
|
| 70 |
-
def reset_collection(self):
|
| 71 |
-
"""
|
| 72 |
-
Hapus semua dokumen TANPA mematikan collection.
|
| 73 |
-
Pakai reset_collection() dari langchain-chroma β collection tetap hidup
|
| 74 |
-
dan langsung siap untuk ingest berikutnya.
|
| 75 |
-
"""
|
| 76 |
-
self.db.reset_collection()
|
| 77 |
-
logger.warning(
|
| 78 |
-
f"Collection '{self.settings.chroma_collection_name}' di-reset. "
|
| 79 |
-
"Semua dokumen dihapus, collection siap dipakai kembali."
|
| 80 |
-
)
|
| 81 |
-
|
| 82 |
-
def delete_collection(self):
|
| 83 |
-
"""Alias ke reset_collection() β collection tidak dimatikan, aman."""
|
| 84 |
-
self.reset_collection()
|
| 85 |
-
|
| 86 |
-
def count(self) -> int:
|
| 87 |
-
"""Jumlah chunks yang tersimpan."""
|
| 88 |
-
return self.db._collection.count()
|
| 89 |
-
|
| 90 |
-
def get_retriever(self, search_kwargs: Optional[dict] = None):
|
| 91 |
-
"""Return LangChain retriever untuk dipakai di chain."""
|
| 92 |
-
search_kwargs = search_kwargs or {"k": self.settings.top_k_retrieval}
|
| 93 |
-
return self.db.as_retriever(search_kwargs=search_kwargs)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
rag_pipeline/src/api/routes.py
CHANGED
|
@@ -71,6 +71,12 @@ async def readiness_combined():
|
|
| 71 |
models={"rag": {"state": rag_state, "error": str(e)[:300]}},
|
| 72 |
)
|
| 73 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 74 |
# Cek status CV API (best-effort β kalau CV down, RAG tetep bisa pake
|
| 75 |
# text-based loaders, cuma OCR fallback yang ga jalan).
|
| 76 |
cv_url = os.getenv("CV_API_URL", "http://127.0.0.1:8001")
|
|
@@ -85,7 +91,9 @@ async def readiness_combined():
|
|
| 85 |
except Exception as e:
|
| 86 |
logger.debug(f"CV /ready unreachable: {e}")
|
| 87 |
|
| 88 |
-
all_models = {
|
|
|
|
|
|
|
| 89 |
for name, info in cv_models.items():
|
| 90 |
all_models[f"cv.{name}"] = info
|
| 91 |
|
|
@@ -220,6 +228,20 @@ async def query(request: QueryRequest):
|
|
| 220 |
source_ids=request.source_ids,
|
| 221 |
)
|
| 222 |
return QueryResponse(**result)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 223 |
except Exception as e:
|
| 224 |
logger.error(f"Query error: {e}")
|
| 225 |
raise HTTPException(status_code=500, detail=str(e))
|
|
@@ -233,6 +255,14 @@ async def summarize(request: SummarizeRequest):
|
|
| 233 |
try:
|
| 234 |
summary = get_retriever().summarize(request.source)
|
| 235 |
return SummarizeResponse(summary=summary, source=request.source)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 236 |
except Exception as e:
|
| 237 |
raise HTTPException(status_code=500, detail=str(e))
|
| 238 |
|
|
@@ -242,8 +272,15 @@ async def summarize(request: SummarizeRequest):
|
|
| 242 |
@router.delete("/collection", response_model=DeleteResponse, tags=["system"])
|
| 243 |
async def delete_collection():
|
| 244 |
"""Hapus semua dokumen dari vector store."""
|
| 245 |
-
|
| 246 |
-
|
| 247 |
-
|
| 248 |
-
|
| 249 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 71 |
models={"rag": {"state": rag_state, "error": str(e)[:300]}},
|
| 72 |
)
|
| 73 |
|
| 74 |
+
# Cek apakah GROQ_API_KEY ada β penting untuk endpoint /query dan /summarize.
|
| 75 |
+
# Tidak fail readiness, tapi UI dapat info untuk munculin warning.
|
| 76 |
+
from ..config import get_settings
|
| 77 |
+
settings = get_settings()
|
| 78 |
+
has_groq_key = bool((settings.groq_api_key or "").strip())
|
| 79 |
+
|
| 80 |
# Cek status CV API (best-effort β kalau CV down, RAG tetep bisa pake
|
| 81 |
# text-based loaders, cuma OCR fallback yang ga jalan).
|
| 82 |
cv_url = os.getenv("CV_API_URL", "http://127.0.0.1:8001")
|
|
|
|
| 91 |
except Exception as e:
|
| 92 |
logger.debug(f"CV /ready unreachable: {e}")
|
| 93 |
|
| 94 |
+
all_models = {
|
| 95 |
+
"rag": {"state": rag_state, "groq_api_key": "set" if has_groq_key else "missing"},
|
| 96 |
+
}
|
| 97 |
for name, info in cv_models.items():
|
| 98 |
all_models[f"cv.{name}"] = info
|
| 99 |
|
|
|
|
| 228 |
source_ids=request.source_ids,
|
| 229 |
)
|
| 230 |
return QueryResponse(**result)
|
| 231 |
+
except RuntimeError as e:
|
| 232 |
+
# GROQ_API_KEY missing β 503 Service Unavailable dengan pesan jelas.
|
| 233 |
+
msg = str(e)
|
| 234 |
+
if "GROQ_API_KEY" in msg:
|
| 235 |
+
logger.warning(f"Query rejected (no API key): {msg}")
|
| 236 |
+
raise HTTPException(
|
| 237 |
+
status_code=503,
|
| 238 |
+
detail={
|
| 239 |
+
"error": "groq_api_key_missing",
|
| 240 |
+
"message": msg,
|
| 241 |
+
},
|
| 242 |
+
)
|
| 243 |
+
logger.error(f"Query runtime error: {e}")
|
| 244 |
+
raise HTTPException(status_code=500, detail=str(e))
|
| 245 |
except Exception as e:
|
| 246 |
logger.error(f"Query error: {e}")
|
| 247 |
raise HTTPException(status_code=500, detail=str(e))
|
|
|
|
| 255 |
try:
|
| 256 |
summary = get_retriever().summarize(request.source)
|
| 257 |
return SummarizeResponse(summary=summary, source=request.source)
|
| 258 |
+
except RuntimeError as e:
|
| 259 |
+
msg = str(e)
|
| 260 |
+
if "GROQ_API_KEY" in msg:
|
| 261 |
+
raise HTTPException(
|
| 262 |
+
status_code=503,
|
| 263 |
+
detail={"error": "groq_api_key_missing", "message": msg},
|
| 264 |
+
)
|
| 265 |
+
raise HTTPException(status_code=500, detail=str(e))
|
| 266 |
except Exception as e:
|
| 267 |
raise HTTPException(status_code=500, detail=str(e))
|
| 268 |
|
|
|
|
| 272 |
@router.delete("/collection", response_model=DeleteResponse, tags=["system"])
|
| 273 |
async def delete_collection():
|
| 274 |
"""Hapus semua dokumen dari vector store."""
|
| 275 |
+
try:
|
| 276 |
+
get_retriever().vector_store.delete_collection()
|
| 277 |
+
return DeleteResponse(
|
| 278 |
+
status="success",
|
| 279 |
+
message="Semua dokumen berhasil dihapus dari vector store.",
|
| 280 |
+
)
|
| 281 |
+
except Exception as e:
|
| 282 |
+
logger.error(f"Clear collection error: {e}")
|
| 283 |
+
raise HTTPException(
|
| 284 |
+
status_code=500,
|
| 285 |
+
detail=f"Gagal hapus collection: {e}",
|
| 286 |
+
)
|
rag_pipeline/src/config.py
CHANGED
|
@@ -6,7 +6,11 @@ from pathlib import Path
|
|
| 6 |
|
| 7 |
class Settings(BaseSettings):
|
| 8 |
# LLM
|
| 9 |
-
groq_api_key
|
|
|
|
|
|
|
|
|
|
|
|
|
| 10 |
groq_model: str = Field("llama-3.3-70b-versatile", env="GROQ_MODEL") # updated from 3.1
|
| 11 |
|
| 12 |
# Embeddings
|
|
|
|
| 6 |
|
| 7 |
class Settings(BaseSettings):
|
| 8 |
# LLM
|
| 9 |
+
# groq_api_key sengaja TIDAK required (default ""). Kalau env var ga di-set,
|
| 10 |
+
# service tetap bisa start dan endpoint /stats /sources /ingest masih jalan.
|
| 11 |
+
# Cuma /query dan /summarize yang gagal dengan pesan jelas β ini lebih bagus
|
| 12 |
+
# daripada seluruh container crash di startup.
|
| 13 |
+
groq_api_key: str = Field(default="", env="GROQ_API_KEY")
|
| 14 |
groq_model: str = Field("llama-3.3-70b-versatile", env="GROQ_MODEL") # updated from 3.1
|
| 15 |
|
| 16 |
# Embeddings
|
rag_pipeline/src/llm/groq_client.py
CHANGED
|
@@ -1,6 +1,20 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
from typing import List, Optional, Iterator
|
| 2 |
from loguru import logger
|
| 3 |
-
from langchain_groq import ChatGroq
|
| 4 |
from langchain.schema import BaseMessage, HumanMessage, AIMessage
|
| 5 |
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
|
| 6 |
|
|
@@ -11,37 +25,70 @@ class GroqClient:
|
|
| 11 |
"""
|
| 12 |
Wrapper di atas ChatGroq dari LangChain.
|
| 13 |
Mendukung regular call dan streaming.
|
|
|
|
|
|
|
| 14 |
"""
|
| 15 |
|
| 16 |
def __init__(self, streaming: bool = False):
|
| 17 |
-
settings = get_settings()
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
self.
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 26 |
callbacks=callbacks,
|
| 27 |
)
|
| 28 |
-
|
| 29 |
-
|
| 30 |
|
| 31 |
def invoke(self, messages: List[BaseMessage]) -> str:
|
| 32 |
"""Kirim messages dan return response string."""
|
| 33 |
-
|
|
|
|
| 34 |
return response.content
|
| 35 |
|
| 36 |
def stream(self, messages: List[BaseMessage]) -> Iterator[str]:
|
| 37 |
"""Streaming response β yield token per token."""
|
| 38 |
-
|
|
|
|
| 39 |
if chunk.content:
|
| 40 |
yield chunk.content
|
| 41 |
|
| 42 |
def get_langchain_llm(self):
|
| 43 |
"""Return raw LangChain LLM object untuk dipakai di chain."""
|
| 44 |
-
return self.
|
| 45 |
|
| 46 |
@staticmethod
|
| 47 |
def build_messages(
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Groq LLM client.
|
| 3 |
+
|
| 4 |
+
Perubahan: ChatGroq di-init LAZY β bukan saat constructor jalan, tapi saat
|
| 5 |
+
pertama kali invoke/stream dipanggil. Alasannya:
|
| 6 |
+
|
| 7 |
+
- RAGRetriever.__init__ sekarang ngga crash kalau GROQ_API_KEY ngga di-set.
|
| 8 |
+
- Endpoint yang ngga butuh LLM (/stats, /sources, /ingest, /collection DELETE)
|
| 9 |
+
tetep jalan walau API key ngga ada.
|
| 10 |
+
- /query dan /summarize gagal dengan pesan yang jelas: "GROQ_API_KEY belum
|
| 11 |
+
di-set" β bukan error pydantic / startup crash yang bikin user bingung.
|
| 12 |
+
"""
|
| 13 |
+
|
| 14 |
+
from __future__ import annotations
|
| 15 |
+
|
| 16 |
from typing import List, Optional, Iterator
|
| 17 |
from loguru import logger
|
|
|
|
| 18 |
from langchain.schema import BaseMessage, HumanMessage, AIMessage
|
| 19 |
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
|
| 20 |
|
|
|
|
| 25 |
"""
|
| 26 |
Wrapper di atas ChatGroq dari LangChain.
|
| 27 |
Mendukung regular call dan streaming.
|
| 28 |
+
|
| 29 |
+
LAZY: ChatGroq instance baru dibuat saat pertama kali invoke/stream.
|
| 30 |
"""
|
| 31 |
|
| 32 |
def __init__(self, streaming: bool = False):
|
| 33 |
+
self.settings = get_settings()
|
| 34 |
+
self.streaming = streaming
|
| 35 |
+
self.model_name = self.settings.groq_model
|
| 36 |
+
self._llm = None # akan diisi di _ensure_llm()
|
| 37 |
+
logger.info(
|
| 38 |
+
f"Groq client constructed (lazy). Model: {self.model_name} "
|
| 39 |
+
f"| API key: {'SET' if self.settings.groq_api_key else 'NOT SET'}"
|
| 40 |
+
)
|
| 41 |
+
|
| 42 |
+
def _ensure_llm(self):
|
| 43 |
+
"""Buat ChatGroq instance kalau belum ada. Validasi API key di sini."""
|
| 44 |
+
if self._llm is not None:
|
| 45 |
+
return self._llm
|
| 46 |
+
|
| 47 |
+
api_key = (self.settings.groq_api_key or "").strip()
|
| 48 |
+
if not api_key:
|
| 49 |
+
raise RuntimeError(
|
| 50 |
+
"GROQ_API_KEY belum di-set. "
|
| 51 |
+
"Tambahkan di Hugging Face Space β Settings β Variables and secrets, "
|
| 52 |
+
"atau set environment variable di host. "
|
| 53 |
+
"Endpoint /query dan /summarize butuh API key ini."
|
| 54 |
+
)
|
| 55 |
+
|
| 56 |
+
try:
|
| 57 |
+
from langchain_groq import ChatGroq
|
| 58 |
+
except ImportError as e:
|
| 59 |
+
raise RuntimeError(
|
| 60 |
+
f"langchain-groq tidak terinstall: {e}. "
|
| 61 |
+
"Cek requirements.txt."
|
| 62 |
+
)
|
| 63 |
+
|
| 64 |
+
callbacks = [StreamingStdOutCallbackHandler()] if self.streaming else []
|
| 65 |
+
self._llm = ChatGroq(
|
| 66 |
+
api_key=api_key,
|
| 67 |
+
model_name=self.model_name,
|
| 68 |
+
temperature=self.settings.temperature,
|
| 69 |
+
max_tokens=self.settings.max_tokens,
|
| 70 |
+
streaming=self.streaming,
|
| 71 |
callbacks=callbacks,
|
| 72 |
)
|
| 73 |
+
logger.info(f"Groq client initialized lazily. Model: {self.model_name}")
|
| 74 |
+
return self._llm
|
| 75 |
|
| 76 |
def invoke(self, messages: List[BaseMessage]) -> str:
|
| 77 |
"""Kirim messages dan return response string."""
|
| 78 |
+
llm = self._ensure_llm()
|
| 79 |
+
response = llm.invoke(messages)
|
| 80 |
return response.content
|
| 81 |
|
| 82 |
def stream(self, messages: List[BaseMessage]) -> Iterator[str]:
|
| 83 |
"""Streaming response β yield token per token."""
|
| 84 |
+
llm = self._ensure_llm()
|
| 85 |
+
for chunk in llm.stream(messages):
|
| 86 |
if chunk.content:
|
| 87 |
yield chunk.content
|
| 88 |
|
| 89 |
def get_langchain_llm(self):
|
| 90 |
"""Return raw LangChain LLM object untuk dipakai di chain."""
|
| 91 |
+
return self._ensure_llm()
|
| 92 |
|
| 93 |
@staticmethod
|
| 94 |
def build_messages(
|
rag_pipeline/src/retrieval/vector_store.py
CHANGED
|
@@ -203,26 +203,69 @@ class VectorStore:
|
|
| 203 |
return list(bucket.values())
|
| 204 |
|
| 205 |
def reset_collection(self):
|
| 206 |
-
"""
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 207 |
try:
|
| 208 |
self.db.reset_collection()
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 209 |
except Exception as e:
|
| 210 |
-
|
| 211 |
-
|
| 212 |
-
|
| 213 |
-
|
| 214 |
-
|
| 215 |
-
|
| 216 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 217 |
self.db = Chroma(
|
| 218 |
collection_name=self.settings.chroma_collection_name,
|
| 219 |
embedding_function=self.embedder.get_embeddings_model(),
|
| 220 |
persist_directory=self.settings.chroma_persist_dir,
|
| 221 |
)
|
| 222 |
-
|
| 223 |
-
|
| 224 |
-
|
| 225 |
-
|
|
|
|
|
|
|
| 226 |
|
| 227 |
def delete_collection(self):
|
| 228 |
"""Alias ke reset_collection() β collection tidak dimatikan, aman."""
|
|
|
|
| 203 |
return list(bucket.values())
|
| 204 |
|
| 205 |
def reset_collection(self):
|
| 206 |
+
"""
|
| 207 |
+
Hapus semua dokumen dari collection.
|
| 208 |
+
|
| 209 |
+
Strategi (paling kompatibel):
|
| 210 |
+
1. Coba `db.reset_collection()` (ada di langchain_chroma versi baru).
|
| 211 |
+
2. Fallback: ambil semua IDs lalu hapus via `_collection.delete(ids=...)`.
|
| 212 |
+
Ini bekerja di semua versi langchain_chroma + chromadb 0.5.x.
|
| 213 |
+
3. Last resort: nuke via `_client.delete_collection()` + re-init Chroma.
|
| 214 |
+
|
| 215 |
+
Kenapa option 2 jadi default fallback (bukan option 3):
|
| 216 |
+
- Re-init Chroma kadang triggers ChromaDB re-create error kalau
|
| 217 |
+
ada race condition dengan delete_collection di backend.
|
| 218 |
+
- Delete by IDs jauh lebih atomic.
|
| 219 |
+
"""
|
| 220 |
+
# Option 1: high-level method (newer langchain_chroma)
|
| 221 |
try:
|
| 222 |
self.db.reset_collection()
|
| 223 |
+
logger.warning(
|
| 224 |
+
f"Collection '{self.settings.chroma_collection_name}' di-reset (high-level)."
|
| 225 |
+
)
|
| 226 |
+
return
|
| 227 |
+
except AttributeError:
|
| 228 |
+
logger.debug("db.reset_collection() tidak tersedia, fallback ke delete-by-ids.")
|
| 229 |
except Exception as e:
|
| 230 |
+
logger.debug(f"db.reset_collection() gagal: {e} β fallback ke delete-by-ids.")
|
| 231 |
+
|
| 232 |
+
# Option 2: delete semua document by IDs
|
| 233 |
+
try:
|
| 234 |
+
collection = self.db._collection
|
| 235 |
+
data = collection.get(include=[]) # cuma butuh ids
|
| 236 |
+
ids = data.get("ids") or []
|
| 237 |
+
if ids:
|
| 238 |
+
collection.delete(ids=ids)
|
| 239 |
+
logger.warning(
|
| 240 |
+
f"Collection '{self.settings.chroma_collection_name}' di-reset "
|
| 241 |
+
f"({len(ids)} chunks dihapus via delete-by-ids)."
|
| 242 |
+
)
|
| 243 |
+
else:
|
| 244 |
+
logger.info(
|
| 245 |
+
f"Collection '{self.settings.chroma_collection_name}' sudah kosong, ngga ada yg di-delete."
|
| 246 |
+
)
|
| 247 |
+
return
|
| 248 |
+
except Exception as e:
|
| 249 |
+
logger.warning(f"Delete-by-ids gagal: {e} β fallback ke delete + re-init.")
|
| 250 |
+
|
| 251 |
+
# Option 3: nuke collection lalu re-init Chroma
|
| 252 |
+
try:
|
| 253 |
+
self.db._client.delete_collection(self.settings.chroma_collection_name)
|
| 254 |
+
except Exception as e:
|
| 255 |
+
logger.warning(f"_client.delete_collection() gagal: {e}")
|
| 256 |
+
|
| 257 |
+
try:
|
| 258 |
self.db = Chroma(
|
| 259 |
collection_name=self.settings.chroma_collection_name,
|
| 260 |
embedding_function=self.embedder.get_embeddings_model(),
|
| 261 |
persist_directory=self.settings.chroma_persist_dir,
|
| 262 |
)
|
| 263 |
+
logger.warning(
|
| 264 |
+
f"Collection '{self.settings.chroma_collection_name}' nuked & re-init."
|
| 265 |
+
)
|
| 266 |
+
except Exception as e:
|
| 267 |
+
logger.error(f"Re-init Chroma gagal setelah delete: {e}")
|
| 268 |
+
raise
|
| 269 |
|
| 270 |
def delete_collection(self):
|
| 271 |
"""Alias ke reset_collection() β collection tidak dimatikan, aman."""
|
start.sh
CHANGED
|
@@ -1,16 +1,41 @@
|
|
| 1 |
#!/bin/bash
|
| 2 |
set -e
|
| 3 |
|
| 4 |
-
echo "===
|
| 5 |
-
echo "
|
|
|
|
| 6 |
|
| 7 |
-
#
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 8 |
export GROQ_API_KEY="${GROQ_API_KEY:-}"
|
| 9 |
|
| 10 |
-
#
|
| 11 |
mkdir -p /app/rag/chroma_db /app/rag/mlruns /app/rag/logs \
|
| 12 |
/app/cv/model_cache /app/cv/mlruns /app/cv/logs /app/cv/uploads \
|
| 13 |
/var/log/supervisor /run
|
| 14 |
|
| 15 |
-
echo "Starting supervisord (nginx + rag-api + cv-api)..."
|
|
|
|
| 16 |
exec /usr/bin/supervisord -c /etc/supervisor/conf.d/supervisord.conf
|
|
|
|
| 1 |
#!/bin/bash
|
| 2 |
set -e
|
| 3 |
|
| 4 |
+
echo "========================================================"
|
| 5 |
+
echo " Multimodal AI Platform β Starting"
|
| 6 |
+
echo "========================================================"
|
| 7 |
|
| 8 |
+
# --- Diagnostics ---
|
| 9 |
+
echo ""
|
| 10 |
+
echo "[diag] Python: $(python3 --version 2>&1)"
|
| 11 |
+
echo "[diag] Working dir: $(pwd)"
|
| 12 |
+
echo ""
|
| 13 |
+
echo "[diag] Source layout:"
|
| 14 |
+
ls -la /app/rag/src/api/main.py 2>/dev/null && echo " β /app/rag/src/api/main.py" || echo " β /app/rag/src/api/main.py MISSING"
|
| 15 |
+
ls -la /app/cv/src/api/main.py 2>/dev/null && echo " β /app/cv/src/api/main.py" || echo " β /app/cv/src/api/main.py MISSING"
|
| 16 |
+
ls -la /app/frontend/index.html 2>/dev/null && echo " β /app/frontend/index.html" || echo " β /app/frontend/index.html MISSING"
|
| 17 |
+
ls -la /app/cv/model_cache/yolov8n.onnx 2>/dev/null && echo " β YOLOv8n ONNX model" || echo " β YOLOv8n ONNX MISSING (CV /detect akan gagal)"
|
| 18 |
+
echo ""
|
| 19 |
+
|
| 20 |
+
# --- Secrets / env ---
|
| 21 |
+
if [ -z "${GROQ_API_KEY}" ]; then
|
| 22 |
+
echo "[warn] GROQ_API_KEY tidak di-set."
|
| 23 |
+
echo " /api/v1/query dan /api/v1/summarize akan return 503."
|
| 24 |
+
echo " Endpoint lain (/ingest /sources /stats) tetap jalan normal."
|
| 25 |
+
echo " Set di Hugging Face Space β Settings β Variables and secrets."
|
| 26 |
+
else
|
| 27 |
+
echo "[ok] GROQ_API_KEY: SET (${#GROQ_API_KEY} chars)"
|
| 28 |
+
fi
|
| 29 |
+
echo ""
|
| 30 |
+
|
| 31 |
+
# Export GROQ_API_KEY supaya child processes (supervisord β uvicorn) bisa akses.
|
| 32 |
export GROQ_API_KEY="${GROQ_API_KEY:-}"
|
| 33 |
|
| 34 |
+
# Pastikan direktori yang dibutuhkan ada (mount point HF Space sometimes resets).
|
| 35 |
mkdir -p /app/rag/chroma_db /app/rag/mlruns /app/rag/logs \
|
| 36 |
/app/cv/model_cache /app/cv/mlruns /app/cv/logs /app/cv/uploads \
|
| 37 |
/var/log/supervisor /run
|
| 38 |
|
| 39 |
+
echo "[boot] Starting supervisord (nginx + rag-api + cv-api)..."
|
| 40 |
+
echo "========================================================"
|
| 41 |
exec /usr/bin/supervisord -c /etc/supervisor/conf.d/supervisord.conf
|