robrtt committed
Commit ea25e34 · 1 Parent(s): 678d6ad

Clean rebuild: all features fixed

Files changed (46)
  1. CHANGES.md +51 -0
  2. cv-requirements.txt +0 -35
  3. cv/src/__init__.py +0 -0
  4. cv/src/api/__init__.py +0 -0
  5. cv/src/api/main.py +0 -48
  6. cv/src/api/routes.py +0 -246
  7. cv/src/api/schemas.py +0 -110
  8. cv/src/config.py +0 -55
  9. cv/src/cv_pipeline.py +0 -246
  10. cv/src/models/__init__.py +0 -0
  11. cv/src/models/captioner.py +0 -105
  12. cv/src/models/clip_model.py +0 -150
  13. cv/src/models/yolo_detector.py +0 -208
  14. cv/src/processors/__init__.py +0 -0
  15. cv/src/processors/image_preprocessor.py +0 -154
  16. cv/src/processors/ocr_processor.py +0 -235
  17. cv_module/src/api/routes.py +51 -23
  18. cv_module/src/cv_pipeline.py +33 -7
  19. frontend/index.html +67 -18
  20. rag-requirements.txt +0 -34
  21. rag/src/__init__.py +0 -1
  22. rag/src/api/__init__.py +0 -1
  23. rag/src/api/main.py +0 -57
  24. rag/src/api/routes.py +0 -137
  25. rag/src/api/schemas.py +0 -67
  26. rag/src/config.py +0 -56
  27. rag/src/embeddings/__init__.py +0 -1
  28. rag/src/embeddings/embedder.py +0 -60
  29. rag/src/llm/__init__.py +0 -1
  30. rag/src/llm/groq_client.py +0 -62
  31. rag/src/llm/prompt_templates.py +0 -36
  32. rag/src/loaders/__init__.py +0 -69
  33. rag/src/loaders/base_loader.py +0 -39
  34. rag/src/loaders/docx_loader.py +0 -52
  35. rag/src/loaders/json_loader.py +0 -103
  36. rag/src/loaders/pdf_loader.py +0 -46
  37. rag/src/loaders/text_loader.py +0 -31
  38. rag/src/loaders/web_loader.py +0 -57
  39. rag/src/retrieval/__init__.py +0 -1
  40. rag/src/retrieval/retriever.py +0 -211
  41. rag/src/retrieval/vector_store.py +0 -93
  42. rag_pipeline/src/api/routes.py +43 -6
  43. rag_pipeline/src/config.py +5 -1
  44. rag_pipeline/src/llm/groq_client.py +62 -15
  45. rag_pipeline/src/retrieval/vector_store.py +55 -12
  46. start.sh +30 -5
CHANGES.md ADDED
@@ -0,0 +1,51 @@
+ # Fixes from v2_0_5 → v2_0_5-fixed
+
+ Summary: 7 files changed, plus 1 additional file fixed. Every fix is **defensive** (it makes
+ the code more robust) rather than a large refactor, so existing behaviour is unchanged under
+ normal conditions while abnormal conditions get a clear fallback.
+
+ ## List of changes
+
+ | File | Type | Purpose |
+ |------|------|---------|
+ | `frontend/index.html` | Bulletproof theme | `try/catch` around `localStorage` and `matchMedia`; fallback `:root` CSS variables |
+ | `rag_pipeline/src/config.py` | Config | `groq_api_key` is now optional (default `""`), so the service does not crash when the secret has not been set |
+ | `rag_pipeline/src/llm/groq_client.py` | LLM client | ChatGroq is initialized **lazily**: on first use, not when the constructor runs. The API key is validated here with a clear message. |
+ | `rag_pipeline/src/api/routes.py` | Error handling | `/query` and `/summarize` return **503** (not 500) when the API key is missing, with a specific message. `/ready` now reports the `groq_api_key` status. `/collection` DELETE has better error reporting. |
+ | `rag_pipeline/src/retrieval/vector_store.py` | Robustness | `reset_collection()` has a **3-tier fallback**: `db.reset_collection()` → delete-by-ids → nuke + re-init |
+ | `cv_module/src/cv_pipeline.py` | Thread safety | Per-model `Lock` in each lazy property (prevents double init when 2+ requests arrive concurrently). `ThreadPoolExecutor(max_workers=2)` prevents OOM on the HF free tier. |
+ | `cv_module/src/api/routes.py` | Thread safety | `_trigger_lock` prevents a TOCTOU race in `_trigger_and_wait`. The error handler of each endpoint gives a more informative message. |
+ | `start.sh` | Diagnostics | Prints a sanity check (file paths exist, GROQ_API_KEY status) at boot, which makes debugging from the logs easy. |
+
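The lazy-initialization and 503 rows in the table above can be illustrated with a minimal sketch. It is not the code from `groq_client.py` or `routes.py`, which this page does not show: `LazyLLMClient`, `MissingAPIKeyError`, `status_for` and the `factory` parameter are hypothetical names, and the factory merely stands in for whatever constructs the real ChatGroq client.

```python
import threading
from typing import Callable, Optional


class MissingAPIKeyError(RuntimeError):
    """Raised when the LLM client is needed but no API key is configured."""


class LazyLLMClient:
    """Hypothetical wrapper: builds the real client on first use, not in __init__."""

    def __init__(self, api_key: str, factory: Callable[[str], object]):
        self._api_key = api_key or ""
        self._factory = factory  # assumption: stands in for the real client constructor
        self._client: Optional[object] = None
        self._lock = threading.Lock()

    @property
    def client(self) -> object:
        if self._client is None:
            with self._lock:
                if self._client is None:
                    # Validate here, so a missing key fails the request, not the boot.
                    if not self._api_key.strip():
                        raise MissingAPIKeyError("GROQ_API_KEY is not set")
                    self._client = self._factory(self._api_key)
        return self._client


def status_for(exc: Exception) -> int:
    """Map an error to an HTTP status: 503 for missing configuration, 500 otherwise."""
    return 503 if isinstance(exc, MissingAPIKeyError) else 500
```

Constructing `LazyLLMClient("")` succeeds, so endpoints that never touch the LLM keep working; only the first access to `.client` raises, and a route handler could translate that exception into a 503 response.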
+ ## What I did NOT change
+
+ - Dockerfile: fine; the `cv_module/` and `rag_pipeline/` structure in your repo already matches the `COPY` lines in the Dockerfile.
+ - `nginx.conf`, `supervisord.conf`: fine.
+ - `requirements.txt`: fine; the version pins are already consistent.
+ - Loaders (PDF/DOCX/TXT/JSON/Web): fine; no logic changes.
+ - The theme system in v2_0_5 was already *functionally* correct; I only added defensive guards. If it "disappears" on your side, the cause is not missing code but a runtime issue (browser, deploy, cache).
+
+ ## Why these fixes can resolve "all endpoints return 500"
+
+ The most likely causes (hypotheses):
+
+ 1. **GROQ_API_KEY has not been set in the HF Space.** Before the fix:
+    - `Settings()` raises `ValidationError` because `Field(...)` is mandatory.
+    - `RAGRetriever.__init__()` crashes in `get_settings()`.
+    - Every endpoint that calls `get_retriever()` (≈ all of them) → 500.
+    - After the fix: the service starts cleanly; `/stats`, `/sources` and `/ingest` keep working; only `/query` and `/summarize` return 503 with the message "GROQ_API_KEY belum di-set" ("GROQ_API_KEY has not been set").
+
+ 2. **Concurrent CV model loads cause OOM.** Before the fix:
+    - 2+ parallel requests to an endpoint that needs the same model → race on `if self._captioner is None` → two inits in parallel → RAM spike → the OS kills the process → 500/connection error.
+    - After the fix: a per-model lock, so only 1 thread loads the model.
+
+ 3. **`db.reset_collection()` raises AttributeError in certain langchain_chroma versions.** Before the fix: the fallback path could fail on the chromadb 0.5.3 + langchain_chroma 0.1.4 combination, because `_client.delete_collection` followed by a re-init can race. After the fix: delete-by-ids is the default fallback (more atomic).
+
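The tiered fallback described in hypothesis 3 can be sketched generically. This is an illustration only, not the code in `vector_store.py`: `run_with_fallbacks` and `FakeStore` are hypothetical, and the fake store is an in-memory stand-in rather than a real Chroma collection.

```python
import logging
from typing import Callable, Sequence

log = logging.getLogger("reset_fallback")


def run_with_fallbacks(steps: Sequence[tuple[str, Callable[[], None]]]) -> str:
    """Try each (name, action) in order; return the name of the first that succeeds."""
    errors = []
    for name, action in steps:
        try:
            action()
            return name
        except Exception as exc:  # broad on purpose: any failure moves to the next tier
            log.warning("reset tier %r failed: %s", name, exc)
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all reset strategies failed: " + "; ".join(errors))


class FakeStore:
    """Hypothetical in-memory store that lacks a native reset_collection()."""

    def __init__(self):
        self.docs = {"a": "x", "b": "y"}

    def delete_by_ids(self):
        for doc_id in list(self.docs):
            del self.docs[doc_id]

    def recreate(self):
        self.docs = {}


store = FakeStore()
used = run_with_fallbacks([
    ("native reset", lambda: store.reset_collection()),  # raises AttributeError
    ("delete-by-ids", store.delete_by_ids),
    ("drop and re-init", store.recreate),
])
```

Ordering the tiers from least to most destructive means the drop-and-recreate path, the one the note above says can race, is reached only when the gentler options have already failed.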
+ ## What you need to check before deploying
+
+ 1. **GROQ_API_KEY**: make sure it is set in HF Spaces → Settings → Variables and secrets.
+ 2. **HF repo structure**: the Dockerfile does `COPY rag_pipeline/src/...`. Make sure this folder exists in the HF repo (not only in the local zip). If the HF repo uses `rag/` (instead of `rag_pipeline/`), the build will fail, and **every endpoint returns an error because the container is not running**, not because of a code bug.
+ 3. **HF Space rebuild**: HF sometimes caches build layers. After deploying the new version, force a rebuild via the dashboard ("Restart this Space" → "Factory rebuild").
+
+ If all 3 items above are OK and the endpoints still fail, share the log from the HF Spaces dashboard
+ ("Logs" tab) and I can pinpoint the specific cause.
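The first two items of the checklist above can also be checked mechanically at boot. The sketch below is a Python rendering of that idea under stated assumptions: `sanity_report` is a hypothetical helper, not the contents of `start.sh`, and the only names taken from the notes are `GROQ_API_KEY`, `rag_pipeline/src` and `cv_module/src`.

```python
import os
from pathlib import Path
from typing import Dict, Iterable, Mapping


def sanity_report(root: Path, required: Iterable[str], env: Mapping[str, str]) -> Dict[str, str]:
    """Report which required paths exist and whether the API key is set (never its value)."""
    report = {}
    for rel in required:
        report[rel] = "ok" if (root / rel).exists() else "MISSING"
    key = env.get("GROQ_API_KEY", "").strip()
    report["GROQ_API_KEY"] = "set" if key else "NOT SET"
    return report


if __name__ == "__main__":
    paths = ["rag_pipeline/src", "cv_module/src"]
    for name, status in sanity_report(Path("."), paths, os.environ).items():
        print(f"[boot] {name}: {status}")
```

Printing only "set" or "NOT SET" keeps the secret out of the logs while still making a missing key visible in the "Logs" tab.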
cv-requirements.txt DELETED
@@ -1,35 +0,0 @@
- # ── CV Core ───────────────────────────────────────────────
- transformers==4.35.2
- numpy==1.26.4
- Pillow>=10.4.0
- opencv-python-headless>=4.10.0
-
- # ── CLIP ──────────────────────────────────────────────────
- open-clip-torch>=2.26.1
- timm==0.9.16
-
- # ── Object Detection ──────────────────────────────────────
- ultralytics>=8.2.0
-
- # ── OCR ───────────────────────────────────────────────────
- pytesseract>=0.3.13
- easyocr>=1.7.1
-
- # ── Image utils ───────────────────────────────────────────
- imageio>=2.34.0
- scikit-image>=0.24.0
-
- # ── API ───────────────────────────────────────────────────
- fastapi==0.112.0
- uvicorn[standard]==0.30.6
- python-multipart==0.0.9
- pydantic==2.8.2
- pydantic-settings==2.4.0
-
- # ── MLOps ─────────────────────────────────────────────────
- mlflow==2.15.1
-
- # ── Utils ─────────────────────────────────────────────────
- loguru==0.7.2
- python-dotenv==1.0.1
- httpx==0.27.0
cv/src/__init__.py DELETED
File without changes
cv/src/api/__init__.py DELETED
File without changes
cv/src/api/main.py DELETED
@@ -1,48 +0,0 @@
- from fastapi import FastAPI
- from fastapi.middleware.cors import CORSMiddleware
- from loguru import logger
- import sys
-
- from .routes import router
- from ..config import get_cv_settings
-
- settings = get_cv_settings()
-
- logger.remove()
- logger.add(sys.stderr, level="INFO", colorize=True,
-            format="<green>{time:HH:mm:ss}</green> | <level>{level: <8}</level> | <cyan>{name}</cyan> - {message}")
- logger.add("./logs/cv_api.log", rotation="10 MB", retention="7 days")
-
- app = FastAPI(
-     title="CV Pipeline API",
-     description="""
- ## Multimodal AI Assistant - Computer Vision Module
-
- Endpoints for image analysis using:
- - **BLIP**: image captioning & visual QA
- - **YOLOv8**: real-time object detection (80 COCO classes)
- - **CLIP**: zero-shot classification & image-text similarity
- - **EasyOCR**: text extraction from images (80+ languages)
- - **MLflow**: latency & performance tracking
-
- ### Integration with the RAG Module
- The `summary_text` output of `/analyze` can be used directly as
- context for the RAG pipeline, so images can become part of the knowledge base.
- """,
-     version="1.0.0",
- )
-
- app.add_middleware(
-     CORSMiddleware,
-     allow_origins=["*"],
-     allow_methods=["*"],
-     allow_headers=["*"],
- )
-
- app.include_router(router, prefix="/api/v1")
-
-
- @app.on_event("startup")
- async def startup():
-     logger.info("CV Pipeline API starting up...")
-     logger.info(f"Docs: http://{settings.api_host}:{settings.api_port}/docs")
cv/src/api/routes.py DELETED
@@ -1,246 +0,0 @@
- from fastapi import APIRouter, HTTPException, UploadFile, File
- from fastapi.responses import Response
- from pydantic import BaseModel
- from loguru import logger
-
- from .schemas import (
-     AnalyzeURLRequest, FullAnalysisResponse,
-     ClassifyRequest, ClassificationResponse,
-     SimilarityRequest, SimilarityResponse,
-     VisualQARequest, VisualQAResponse,
-     CaptionResponse, DetectionResponse, OCRResponse,
- )
- from ..cv_pipeline import CVPipeline
-
- router = APIRouter()
-
- _pipeline: CVPipeline = None
-
-
- def get_pipeline() -> CVPipeline:
-     global _pipeline
-     if _pipeline is None:
-         _pipeline = CVPipeline()
-     return _pipeline
-
-
- # === HEALTH ===
-
- @router.get("/health", tags=["system"])
- async def health():
-     return {"status": "ok", "service": "CV Pipeline API"}
-
-
- # === FULL ANALYSIS ===
-
- @router.post("/analyze/url", response_model=FullAnalysisResponse, tags=["analysis"])
- async def analyze_from_url(req: AnalyzeURLRequest):
-     """
-     Analyze an image from a URL.
-     Runs captioning, object detection, and optional OCR + CLIP classification in one call.
-     """
-     try:
-         result = get_pipeline().analyze(
-             source=req.url,
-             run_caption=req.run_caption,
-             run_detection=req.run_detection,
-             run_ocr=req.run_ocr,
-             classification_labels=req.classification_labels,
-         )
-         return _to_response(result)
-     except Exception as e:
-         logger.error(f"Analyze error: {e}")
-         raise HTTPException(status_code=500, detail=str(e))
-
-
- @router.post("/analyze/upload", response_model=FullAnalysisResponse, tags=["analysis"])
- async def analyze_upload(
-     file: UploadFile = File(...),
-     run_caption: bool = True,
-     run_detection: bool = True,
-     run_ocr: bool = False,
- ):
-     """Upload and analyze an image directly (multipart)."""
-     allowed = {"image/jpeg", "image/png", "image/webp", "image/gif"}
-     if file.content_type not in allowed:
-         raise HTTPException(400, detail=f"Unsupported file type: {file.content_type}")
-
-     data = await file.read()
-     if len(data) > 10 * 1024 * 1024:
-         raise HTTPException(400, detail="Maximum file size is 10MB")
-
-     try:
-         result = get_pipeline().analyze(
-             source=data,
-             run_caption=run_caption,
-             run_detection=run_detection,
-             run_ocr=run_ocr,
-         )
-         return _to_response(result)
-     except Exception as e:
-         raise HTTPException(status_code=500, detail=str(e))
-
-
- # === INDIVIDUAL TASKS ===
-
- @router.post("/caption", response_model=CaptionResponse, tags=["tasks"])
- async def caption(url: str, prompt: str = None):
-     """Generate a text description of an image."""
-     try:
-         from ..processors.image_preprocessor import ImagePreprocessor
-         image = ImagePreprocessor.load(url)
-         result = get_pipeline().captioner.caption(image, prompt=prompt)
-         return CaptionResponse(caption=result.caption, model=result.model)
-     except Exception as e:
-         raise HTTPException(500, detail=str(e))
-
-
- @router.post("/detect", response_model=DetectionResponse, tags=["tasks"])
- async def detect(url: str, conf: float = None):
-     """Detect objects in an image with YOLOv8."""
-     try:
-         result = get_pipeline().detect_objects(url, conf=conf)
-         return DetectionResponse(
-             detections=[d.to_dict() for d in result.detections],
-             count=result.count,
-             labels_summary=result.labels_summary,
-             image_width=result.image_width,
-             image_height=result.image_height,
-             inference_time_ms=result.inference_time_ms,
-         )
-     except Exception as e:
-         raise HTTPException(500, detail=str(e))
-
-
- @router.post("/classify", response_model=ClassificationResponse, tags=["tasks"])
- async def classify(req: ClassifyRequest):
-     """
-     Zero-shot image classification with CLIP.
-     No training needed: just provide a list of candidate labels.
-     """
-     try:
-         result = get_pipeline().classify_image(req.url, req.labels)
-         return ClassificationResponse(
-             top_label=result.top_label,
-             top_score=result.top_score,
-             labels=result.labels,
-             probabilities=result.probabilities,
-         )
-     except Exception as e:
-         raise HTTPException(500, detail=str(e))
-
-
- class OCRRequest(BaseModel):
-     url: str
-
-
- @router.post("/ocr", response_model=OCRResponse, tags=["tasks"])
- async def ocr(req: OCRRequest):
-     """Extract text from an image using EasyOCR."""
-     try:
-         from ..processors.image_preprocessor import ImagePreprocessor
-         image = ImagePreprocessor.load(req.url)
-         result = get_pipeline().ocr.extract_text(image)
-         return OCRResponse(
-             full_text=result.full_text,
-             boxes=[b.to_dict() for b in result.boxes],
-             word_count=result.word_count,
-             language=result.language,
-             engine=result.engine,
-         )
-     except Exception as e:
-         raise HTTPException(500, detail=str(e))
-
-
- @router.post("/similarity", response_model=SimilarityResponse, tags=["tasks"])
- async def image_text_similarity(req: SimilarityRequest):
-     """Compute the relevance between an image and a text (0.0 - 1.0)."""
-     try:
-         score = get_pipeline().image_text_similarity(req.url, req.text)
-         interpretation = (
-             "Highly relevant" if score > 0.7
-             else "Fairly relevant" if score > 0.5
-             else "Weakly relevant"
-         )
-         return SimilarityResponse(
-             similarity_score=score,
-             text=req.text,
-             interpretation=interpretation,
-         )
-     except Exception as e:
-         raise HTTPException(500, detail=str(e))
-
-
- @router.post("/visual-qa", response_model=VisualQAResponse, tags=["tasks"])
- async def visual_qa(req: VisualQARequest):
-     """Visual Question Answering: ask about the content of an image."""
-     try:
-         answer = get_pipeline().visual_qa(req.url, req.question)
-         return VisualQAResponse(question=req.question, answer=answer)
-     except Exception as e:
-         raise HTTPException(500, detail=str(e))
-
-
- @router.get("/annotate", tags=["tasks"])
- async def annotate(url: str):
-     """Return the image with YOLO bounding boxes drawn on it (JPEG)."""
-     try:
-         jpeg_bytes = get_pipeline().annotate_image(url)
-         return Response(content=jpeg_bytes, media_type="image/jpeg")
-     except Exception as e:
-         raise HTTPException(500, detail=str(e))
-
-
- # === HELPER ===
-
- def _to_response(result) -> FullAnalysisResponse:
-     """Convert a CVAnalysisResult to a FullAnalysisResponse."""
-     caption_r = None
-     if result.caption:
-         caption_r = CaptionResponse(
-             caption=result.caption.caption,
-             model=result.caption.model,
-         )
-
-     det_r = None
-     if result.detections:
-         det_r = DetectionResponse(
-             detections=[d.to_dict() for d in result.detections.detections],
-             count=result.detections.count,
-             labels_summary=result.detections.labels_summary,
-             image_width=result.detections.image_width,
-             image_height=result.detections.image_height,
-             inference_time_ms=result.detections.inference_time_ms,
-         )
-
-     cls_r = None
-     if result.classification:
-         cls_r = ClassificationResponse(
-             top_label=result.classification.top_label,
-             top_score=result.classification.top_score,
-             labels=result.classification.labels,
-             probabilities=result.classification.probabilities,
-         )
-
-     ocr_r = None
-     if result.ocr:
-         ocr_r = OCRResponse(
-             full_text=result.ocr.full_text,
-             boxes=[b.to_dict() for b in result.ocr.boxes],
-             word_count=result.ocr.word_count,
-             language=result.ocr.language,
-             engine=result.ocr.engine,
-         )
-
-     return FullAnalysisResponse(
-         image_width=result.image_width,
-         image_height=result.image_height,
-         source=result.source,
-         caption=caption_r,
-         detections=det_r,
-         classification=cls_r,
-         ocr=ocr_r,
-         summary_text=result.to_summary(),
-         models_used=result.models_used,
-         total_latency_ms=result.total_latency_ms,
-     )
cv/src/api/schemas.py DELETED
@@ -1,110 +0,0 @@
1
- from pydantic import BaseModel, Field, HttpUrl
2
- from typing import List, Optional
3
-
4
-
5
- # === Shared ===
6
-
7
- class BBoxSchema(BaseModel):
8
- x1: float
9
- y1: float
10
- x2: float
11
- y2: float
12
- width: float
13
- height: float
14
-
15
-
16
- class DetectionSchema(BaseModel):
17
- label: str
18
- confidence: float
19
- bbox: BBoxSchema
20
- class_id: int
21
-
22
-
23
- class OCRBoxSchema(BaseModel):
24
- text: str
25
- confidence: float
26
- bbox: list
27
-
28
-
29
- # === Requests ===
30
-
31
- class AnalyzeURLRequest(BaseModel):
32
- url: str = Field(..., description="URL gambar yang akan dianalisis")
33
- run_caption: bool = Field(True, description="Generate image caption")
34
- run_detection: bool = Field(True, description="Deteksi objek dengan YOLO")
35
- run_ocr: bool = Field(False, description="Ekstrak teks dari gambar")
36
- classification_labels: Optional[List[str]] = Field(
37
- None,
38
- description="Label untuk zero-shot CLIP classification, e.g. ['kucing','anjing']",
39
- example=["indoor", "outdoor", "nature", "city"],
40
- )
41
-
42
-
43
- class ClassifyRequest(BaseModel):
44
- url: str
45
- labels: List[str] = Field(..., min_length=2, description="Minimal 2 label kandidat")
46
-
47
-
48
- class SimilarityRequest(BaseModel):
49
- url: str
50
- text: str = Field(..., min_length=1)
51
-
52
-
53
- class VisualQARequest(BaseModel):
54
- url: str
55
- question: str = Field(..., description="Pertanyaan tentang isi gambar")
56
-
57
-
58
- # === Responses ===
59
-
60
- class CaptionResponse(BaseModel):
61
- caption: str
62
- model: str
63
-
64
-
65
- class DetectionResponse(BaseModel):
66
- detections: List[DetectionSchema]
67
- count: int
68
- labels_summary: dict
69
- image_width: int
70
- image_height: int
71
- inference_time_ms: float
72
-
73
-
74
- class ClassificationResponse(BaseModel):
75
- top_label: str
76
- top_score: float
77
- labels: List[str]
78
- probabilities: List[float]
79
-
80
-
81
- class OCRResponse(BaseModel):
82
- full_text: str
83
- boxes: List[OCRBoxSchema]
84
- word_count: int
85
- language: str
86
- engine: str
87
-
88
-
89
- class FullAnalysisResponse(BaseModel):
90
- image_width: int
91
- image_height: int
92
- source: str
93
- caption: Optional[CaptionResponse] = None
94
- detections: Optional[DetectionResponse] = None
95
- classification: Optional[ClassificationResponse] = None
96
- ocr: Optional[OCRResponse] = None
97
- summary_text: str = Field(..., description="Ringkasan teks dari semua model β€” siap dipakai sebagai konteks LLM")
98
- models_used: List[str]
99
- total_latency_ms: float
100
-
101
-
102
- class SimilarityResponse(BaseModel):
103
- similarity_score: float
104
- text: str
105
- interpretation: str
106
-
107
-
108
- class VisualQAResponse(BaseModel):
109
- question: str
110
- answer: str
cv/src/config.py DELETED
@@ -1,55 +0,0 @@
- from pydantic_settings import BaseSettings
- from pydantic import Field
- from functools import lru_cache
- from pathlib import Path
-
-
- class CVSettings(BaseSettings):
-     # Device
-     device: str = Field("cpu", env="CV_DEVICE")  # "cpu" or "cuda"
-
-     # CLIP
-     clip_model: str = Field("ViT-B-32", env="CLIP_MODEL")
-     clip_pretrained: str = Field("openai", env="CLIP_PRETRAINED")
-
-     # YOLO
-     yolo_model: str = Field("yolov8n.pt", env="YOLO_MODEL")  # n=nano, s=small, m=medium
-     yolo_conf_threshold: float = Field(0.25, env="YOLO_CONF")
-     yolo_iou_threshold: float = Field(0.45, env="YOLO_IOU")
-
-     # Image Captioning
-     caption_model: str = Field(
-         "Salesforce/blip-image-captioning-base", env="CAPTION_MODEL"
-     )
-
-     # OCR
-     ocr_engine: str = Field("easyocr", env="OCR_ENGINE")  # "easyocr" or "tesseract"
-     ocr_languages: str = Field("en,id", env="OCR_LANGUAGES")  # comma-separated
-
-     # API
-     api_host: str = Field("0.0.0.0", env="CV_API_HOST")
-     api_port: int = Field(8001, env="CV_API_PORT")
-     max_image_size_mb: float = Field(10.0, env="MAX_IMAGE_SIZE_MB")
-
-     # Storage
-     upload_dir: str = Field("./uploads", env="CV_UPLOAD_DIR")
-     models_cache_dir: str = Field("./model_cache", env="CV_MODELS_CACHE")
-
-     # MLflow
-     mlflow_tracking_uri: str = Field("./mlruns", env="MLFLOW_TRACKING_URI")
-     mlflow_experiment_name: str = Field("cv_pipeline", env="MLFLOW_CV_EXPERIMENT")
-
-     class Config:
-         env_file = ".env"
-         env_file_encoding = "utf-8"
-
-     def ensure_dirs(self):
-         for d in [self.upload_dir, self.models_cache_dir, "./logs", "./mlruns"]:
-             Path(d).mkdir(parents=True, exist_ok=True)
-
-
- @lru_cache()
- def get_cv_settings() -> CVSettings:
-     s = CVSettings()
-     s.ensure_dirs()
-     return s
cv/src/cv_pipeline.py DELETED
@@ -1,246 +0,0 @@
- from __future__ import annotations
-
- import time
- from typing import List, Optional, Union
- from dataclasses import dataclass, field
- from pathlib import Path
-
- import mlflow
- from loguru import logger
-
- from .config import get_cv_settings
- from .processors.image_preprocessor import ImagePreprocessor, ImageInput
- from .models.clip_model import CLIPModel, CLIPResult
- from .models.yolo_detector import YOLODetector, DetectionResult
- from .models.captioner import ImageCaptioner, CaptionResult
- from .processors.ocr_processor import OCRProcessor, OCRResult
-
-
- @dataclass
- class CVAnalysisResult:
-     """Complete image analysis result from all models."""
-
-     # Image info
-     image_width: int = 0
-     image_height: int = 0
-     source: str = ""
-
-     # Per-model results (None if not run)
-     caption: Optional[CaptionResult] = None
-     detections: Optional[DetectionResult] = None
-     classification: Optional[CLIPResult] = None
-     ocr: Optional[OCRResult] = None
-
-     # Metadata
-     models_used: List[str] = field(default_factory=list)
-     total_latency_ms: float = 0.0
-
-     def to_summary(self) -> str:
-         """
-         Build a text summary of the analysis result.
-         Useful as input to an LLM (integration with the RAG module).
-         """
-         parts = []
-
-         if self.caption:
-             parts.append(f"Image description: {self.caption.caption}")
-
-         if self.detections and self.detections.count > 0:
-             summary = self.detections.labels_summary
-             items = ", ".join(f"{count}x {label}" for label, count in summary.items())
-             parts.append(f"Detected objects: {items}")
-
-         if self.classification:
-             parts.append(
-                 f"Classification: {self.classification.top_label} "
-                 f"(confidence: {self.classification.top_score:.1%})"
-             )
-
-         if self.ocr and self.ocr.full_text:
-             preview = self.ocr.full_text[:300]
-             if len(self.ocr.full_text) > 300:
-                 preview += "..."
-             parts.append(f"Text in image: {preview}")
-
-         return "\n".join(parts) if parts else "No information could be extracted."
-
-
- class CVPipeline:
-     """
-     Orchestrator for all CV models.
-     Lazy loading: a model is loaded only on first use.
-     Modular: can run one model or all of them at once.
-     """
-
-     def __init__(self):
-         self.settings = get_cv_settings()
-         self._clip: Optional[CLIPModel] = None
-         self._yolo: Optional[YOLODetector] = None
-         self._captioner: Optional[ImageCaptioner] = None
-         self._ocr: Optional[OCRProcessor] = None
-         self._setup_mlflow()
-         logger.info("CVPipeline initialized (lazy loading).")
-
-     def _setup_mlflow(self):
-         mlflow.set_tracking_uri(self.settings.mlflow_tracking_uri)
-         mlflow.set_experiment(self.settings.mlflow_experiment_name)
-
-     # === Lazy loaders ===
-
-     @property
-     def clip(self) -> CLIPModel:
-         if self._clip is None:
-             self._clip = CLIPModel()
-         return self._clip
-
-     @property
-     def yolo(self) -> YOLODetector:
-         if self._yolo is None:
-             self._yolo = YOLODetector()
-         return self._yolo
-
-     @property
-     def captioner(self) -> ImageCaptioner:
-         if self._captioner is None:
-             self._captioner = ImageCaptioner()
-         return self._captioner
-
-     @property
-     def ocr(self) -> OCRProcessor:
-         if self._ocr is None:
-             self._ocr = OCRProcessor()
-         return self._ocr
-
-     # === Main analysis methods ===
-
-     def analyze(
-         self,
-         source: Union[str, bytes, Path],
-         run_caption: bool = True,
-         run_detection: bool = True,
-         run_ocr: bool = False,
-         classification_labels: Optional[List[str]] = None,
-     ) -> CVAnalysisResult:
-         """
-         Full image analysis pipeline.
-
-         Args:
-             source: Path, bytes, URL, or base64 string
-             run_caption: Generate an image caption with BLIP
-             run_detection: Detect objects with YOLO
-             run_ocr: Extract text with EasyOCR
-             classification_labels: If provided, run zero-shot CLIP classification
-
-         Returns:
-             CVAnalysisResult containing all results
-         """
-         start = time.perf_counter()
-         image = ImagePreprocessor.load(source)
-         models_used = []
-
-         with mlflow.start_run(run_name="cv_analyze"):
-             mlflow.log_params({
-                 "source": str(source)[:100],
-                 "image_size": f"{image.width}x{image.height}",
-                 "run_caption": run_caption,
-                 "run_detection": run_detection,
-                 "run_ocr": run_ocr,
-             })
-
-             result = CVAnalysisResult(
-                 image_width=image.width,
-                 image_height=image.height,
-                 source=image.source,
-             )
-
-             # 1. Image Captioning
-             if run_caption:
-                 t0 = time.perf_counter()
-                 result.caption = self.captioner.caption(image)
-                 models_used.append("BLIP-caption")
-                 logger.debug(f"Caption: {(time.perf_counter()-t0)*1000:.0f}ms")
-
-             # 2. Object Detection
-             if run_detection:
-                 t0 = time.perf_counter()
-                 result.detections = self.yolo.detect(image)
-                 models_used.append("YOLOv8")
-                 logger.debug(f"Detection: {(time.perf_counter()-t0)*1000:.0f}ms")
-
-             # 3. Zero-shot Classification (optional)
-             if classification_labels:
-                 t0 = time.perf_counter()
-                 result.classification = self.clip.classify(image, classification_labels)
-                 models_used.append("CLIP")
-                 logger.debug(f"CLIP: {(time.perf_counter()-t0)*1000:.0f}ms")
-
-             # 4. OCR (optional, heavier)
-             if run_ocr:
-                 t0 = time.perf_counter()
-                 result.ocr = self.ocr.extract_text(image)
-                 models_used.append("EasyOCR")
-                 logger.debug(f"OCR: {(time.perf_counter()-t0)*1000:.0f}ms")
-
-             total_ms = (time.perf_counter() - start) * 1000
-             result.models_used = models_used
-             result.total_latency_ms = round(total_ms, 2)
-
-             mlflow.log_metrics({
-                 "total_latency_ms": total_ms,
-                 "objects_detected": result.detections.count if result.detections else 0,
-                 "ocr_chars": len(result.ocr.full_text) if result.ocr else 0,
-             })
-
-         logger.info(
-             f"CV analysis done in {total_ms:.0f}ms | "
-             f"Models: {models_used} | "
-             f"Objects: {result.detections.count if result.detections else 0}"
-         )
-         return result
-
-     # === Individual task methods ===
-
-     def caption_image(self, source, prompt: str = None) -> str:
-         """Shorthand: return the caption string directly."""
-         image = ImagePreprocessor.load(source)
-         return self.captioner.caption(image, prompt=prompt).caption
-
-     def detect_objects(self, source, conf: float = None) -> DetectionResult:
-         """Shorthand: return DetectionResult."""
-         image = ImagePreprocessor.load(source)
-         return self.yolo.detect(image, conf_threshold=conf)
-
-     def classify_image(self, source, labels: List[str]) -> CLIPResult:
-         """Shorthand: zero-shot CLIP classification."""
-         image = ImagePreprocessor.load(source)
-         return self.clip.classify(image, labels)
-
-     def extract_text(self, source) -> str:
-         """Shorthand: return OCR text string."""
-         image = ImagePreprocessor.load(source)
-         return self.ocr.extract_text_simple(image)
-
-     def visual_qa(self, source, question: str) -> str:
-         """Visual Question Answering: ask about the content of an image."""
-         image = ImagePreprocessor.load(source)
-         return self.captioner.visual_qa(image, question).caption
-
-     def image_text_similarity(self, source, text: str) -> float:
-         """Compute how relevant a text is to an image (0-1)."""
-         image = ImagePreprocessor.load(source)
-         return self.clip.compute_similarity(image, text)
-
-     def annotate_image(self, source) -> bytes:
-         """
-         Return the image with bounding boxes drawn on it, for visualization.
-         Returns JPEG bytes.
-         """
-         import io
-         from PIL import Image
-
-         image = ImagePreprocessor.load(source)
-         annotated = self.yolo.detect_and_annotate(image)
-         pil_annotated = Image.fromarray(annotated)
-         buf = io.BytesIO()
-         pil_annotated.save(buf, format="JPEG", quality=90)
-         return buf.getvalue()
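The lazy properties in the removed `cv/src/cv_pipeline.py` above test `is None` without any synchronization, which is the race CHANGES.md describes. A minimal sketch of a per-model lock with double-checked locking follows. It is an illustration under assumptions, not the code in `cv_module/src/cv_pipeline.py`: `LazyModel` and `slow_loader` are hypothetical names, and the loader only simulates an expensive model load.

```python
import threading
import time
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


class LazyModel(Generic[T]):
    """Load a value at most once, even when many threads ask for it at the same time."""

    def __init__(self, loader: Callable[[], T]):
        self._loader = loader
        self._value: Optional[T] = None
        self._lock = threading.Lock()  # one lock per model, so different models do not block each other

    def get(self) -> T:
        if self._value is None:  # fast path: no locking once the model is loaded
            with self._lock:
                if self._value is None:  # re-check: another thread may have loaded it while we waited
                    self._value = self._loader()
        return self._value


loads: List[int] = []


def slow_loader() -> str:
    loads.append(1)   # record every real load
    time.sleep(0.05)  # simulate a slow model load
    return "captioner"


captioner = LazyModel(slow_loader)
threads = [threading.Thread(target=captioner.get) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the second check under the lock, the eight threads trigger a single load; without it, every thread that passed the first check would load its own copy, which is the RAM spike described in the notes.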
cv/src/models/__init__.py DELETED
File without changes
cv/src/models/captioner.py DELETED
@@ -1,105 +0,0 @@
1
- from __future__ import annotations
2
-
3
- from dataclasses import dataclass
4
- from loguru import logger
5
-
6
- from ..config import get_cv_settings
7
- from ..processors.image_preprocessor import ImageInput
8
-
9
-
10
- @dataclass
11
- class CaptionResult:
12
- caption: str
13
- model: str
14
- confidence: float = 1.0
15
-
16
-
17
- class ImageCaptioner:
18
- """
19
- Image captioning using BLIP (Bootstrapping Language-Image Pre-training).
20
- Model Salesforce/blip-image-captioning-base: lightweight, accurate, and able to run on CPU.
21
-
22
- Output: a natural-language text description of the image.
23
- Useful for: accessibility, content indexing, multimodal RAG.
24
- """
25
-
26
- def __init__(self):
27
- settings = get_cv_settings()
28
- logger.info(f"Loading captioning model: {settings.caption_model}")
29
-
30
- try:
31
- from transformers import BlipProcessor, BlipForConditionalGeneration
32
- import torch
33
-
34
- self.device = settings.device
35
- self.processor = BlipProcessor.from_pretrained(
36
- settings.caption_model,
37
- cache_dir=settings.models_cache_dir,
38
- )
39
- self.model = BlipForConditionalGeneration.from_pretrained(
40
- settings.caption_model,
41
- cache_dir=settings.models_cache_dir,
42
- ).to(self.device)
43
- self.model.eval()
44
- self.model_name = settings.caption_model
45
- logger.info("Image captioner ready.")
46
-
47
- except Exception as e:
48
- logger.error(f"Gagal load captioning model: {e}")
49
- raise
50
-
51
- def caption(
52
- self,
53
- image: ImageInput,
54
- prompt: str = None,
55
- max_new_tokens: int = 100,
56
- ) -> CaptionResult:
57
- """
58
- Generate a caption for the image.
59
-
60
- Args:
61
- image: ImageInput object
62
- prompt: Optional context/instruction, e.g. "a photo of"
63
- max_new_tokens: Maximum caption length in generated tokens
64
-
65
- Returns:
66
- CaptionResult containing the caption string
67
- """
68
- import torch
69
-
70
- if prompt:
71
- inputs = self.processor(
72
- image.pil_image, prompt, return_tensors="pt"
73
- ).to(self.device)
74
- else:
75
- inputs = self.processor(
76
- image.pil_image, return_tensors="pt"
77
- ).to(self.device)
78
-
79
- with torch.no_grad():
80
- output = self.model.generate(
81
- **inputs,
82
- max_new_tokens=max_new_tokens,
83
- num_beams=4,
84
- early_stopping=True,
85
- )
86
-
87
- caption = self.processor.decode(output[0], skip_special_tokens=True)
88
-
89
- # Strip the prompt prefix from the output
90
- if prompt and caption.lower().startswith(prompt.lower()):
91
- caption = caption[len(prompt):].strip()
92
-
93
- logger.debug(f"Caption: {caption}")
94
-
95
- return CaptionResult(
96
- caption=caption,
97
- model=self.model_name,
98
- )
99
-
100
- def visual_qa(self, image: ImageInput, question: str) -> CaptionResult:
101
- """
102
- Visual Question Answering: ask a question about the image content.
103
- Example: "What color is the car?" -> "red"
104
- """
105
- return self.caption(image, prompt=question, max_new_tokens=50)
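Note on the removed `caption()` method: when a prompt is supplied, the decoded output typically begins with that prompt, and the method cuts it off with a case-insensitive prefix check. A minimal standalone sketch of that step, assuming nothing beyond the standard library (the function name `strip_prompt_prefix` is hypothetical and not part of the removed file):

```python
from typing import Optional


def strip_prompt_prefix(caption: str, prompt: Optional[str]) -> str:
    """Drop an echoed prompt from the start of a generated caption."""
    # Compare case-insensitively, then cut by the prompt's length,
    # mirroring the check in the removed caption() method.
    if prompt and caption.lower().startswith(prompt.lower()):
        return caption[len(prompt):].strip()
    return caption.strip()


print(strip_prompt_prefix("A photo of a red car", "a photo of"))
print(strip_prompt_prefix("a dog on a beach", None))
```

The sketch also strips surrounding whitespace when no prompt is given, which the removed code did not do; that is a small convenience added here, not behavior taken from the diff.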
cv/src/models/clip_model.py DELETED
@@ -1,150 +0,0 @@
1
- from __future__ import annotations
2
-
3
- from typing import List
4
- from dataclasses import dataclass
5
- from functools import lru_cache
6
-
7
- import torch
8
- import open_clip
9
- from loguru import logger
10
-
11
- from ..config import get_cv_settings
12
- from ..processors.image_preprocessor import ImageInput
13
-
14
-
15
- @dataclass
16
- class CLIPResult:
17
- """Result returned by the CLIP model."""
18
- # Zero-shot classification
19
- labels: List[str] = None
20
- probabilities: List[float] = None
21
- top_label: str = ""
22
- top_score: float = 0.0
23
-
24
- # Image-text similarity
25
- similarity_score: float = None
26
-
27
- # Image features (for downstream tasks)
28
- image_features: "torch.Tensor" = None
29
-
30
-
31
- class CLIPModel:
32
- """
33
- CLIP wrapper built on open_clip.
34
- Capabilities:
35
- - Zero-shot image classification (no training required)
36
- - Image-text similarity scoring
37
- - Image feature extraction for retrieval
38
- """
39
-
40
- def __init__(self):
41
- settings = get_cv_settings()
42
- self.device = settings.device
43
- logger.info(f"Loading CLIP model: {settings.clip_model} ({settings.clip_pretrained})")
44
-
45
- self.model, _, self.preprocess = open_clip.create_model_and_transforms(
46
- settings.clip_model,
47
- pretrained=settings.clip_pretrained,
48
- device=self.device,
49
- )
50
- self.tokenizer = open_clip.get_tokenizer(settings.clip_model)
51
- self.model.eval()
52
- logger.info("CLIP model ready.")
53
-
54
- @torch.no_grad()
55
- def classify(self, image: ImageInput, labels: List[str]) -> CLIPResult:
56
- """
57
- Zero-shot classification: pick the image category from a list of labels.
58
- No training needed at all.
59
-
60
- Args:
61
- image: ImageInput object
62
- labels: List of candidate labels, e.g. ["cat", "dog", "bird"]
63
-
64
- Returns:
65
- CLIPResult with a probability for each label
66
- """
67
- # Preprocess image
68
- img_tensor = self.preprocess(image.pil_image).unsqueeze(0).to(self.device)
69
-
70
- # Tokenize labels
71
- text_tokens = self.tokenizer(labels).to(self.device)
72
-
73
- # Compute features
74
- image_features = self.model.encode_image(img_tensor)
75
- text_features = self.model.encode_text(text_tokens)
76
-
77
- # Normalize
78
- image_features /= image_features.norm(dim=-1, keepdim=True)
79
- text_features /= text_features.norm(dim=-1, keepdim=True)
80
-
81
- # Compute similarity (cosine similarity -> softmax -> probabilities)
82
- logits = (100.0 * image_features @ text_features.T).softmax(dim=-1)
83
- probs = logits[0].cpu().numpy().tolist()
84
-
85
- top_idx = int(torch.argmax(logits[0]).item())
86
-
87
- return CLIPResult(
88
- labels=labels,
89
- probabilities=[round(p, 4) for p in probs],
90
- top_label=labels[top_idx],
91
- top_score=round(probs[top_idx], 4),
92
- )
93
-
94
- @torch.no_grad()
95
- def compute_similarity(self, image: ImageInput, text: str) -> float:
96
- """
97
- Compute how relevant the text is to the image (0.0 - 1.0).
98
- Useful for: image search, content moderation, caption scoring.
99
- """
100
- img_tensor = self.preprocess(image.pil_image).unsqueeze(0).to(self.device)
101
- text_tokens = self.tokenizer([text]).to(self.device)
102
-
103
- img_feat = self.model.encode_image(img_tensor)
104
- txt_feat = self.model.encode_text(text_tokens)
105
-
106
- img_feat /= img_feat.norm(dim=-1, keepdim=True)
107
- txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
108
-
109
- similarity = (img_feat @ txt_feat.T).item()
110
-
111
- # Rescale to 0-1 (cosine similarity lies in [-1, 1])
112
- return round((similarity + 1) / 2, 4)
113
-
114
- @torch.no_grad()
115
- def extract_features(self, image: ImageInput) -> "torch.Tensor":
116
- """
117
- Extract the image embedding for semantic image search / clustering.
118
- Output: tensor of shape (512,) for ViT-B-32
119
- """
120
- img_tensor = self.preprocess(image.pil_image).unsqueeze(0).to(self.device)
121
- features = self.model.encode_image(img_tensor)
122
- features /= features.norm(dim=-1, keepdim=True)
123
- return features[0].cpu()
124
-
125
- @torch.no_grad()
126
- def rank_images_by_text(
127
- self,
128
- images: List[ImageInput],
129
- query_text: str,
130
- ) -> List[tuple[int, float]]:
131
- """
132
- Rank multiple images by relevance to the text query.
133
- Returns: list of (original_index, score) sorted by score desc.
134
- Useful for: text-to-image search.
135
- """
136
- tensors = torch.stack([
137
- self.preprocess(img.pil_image) for img in images
138
- ]).to(self.device)
139
-
140
- text_tokens = self.tokenizer([query_text]).to(self.device)
141
-
142
- img_features = self.model.encode_image(tensors)
143
- txt_features = self.model.encode_text(text_tokens)
144
-
145
- img_features /= img_features.norm(dim=-1, keepdim=True)
146
- txt_features /= txt_features.norm(dim=-1, keepdim=True)
147
-
148
- scores = (img_features @ txt_features.T).squeeze(1).cpu().numpy()
149
- ranked = sorted(enumerate(scores.tolist()), key=lambda x: x[1], reverse=True)
150
- return [(idx, round(score, 4)) for idx, score in ranked]
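Note on the scoring math in the removed `clip_model.py`: `classify()` L2-normalizes both embeddings, multiplies the cosine similarities by 100, and applies a softmax, while `compute_similarity()` maps a single cosine value from [-1, 1] to [0, 1]. The sketch below reproduces both steps in plain Python on toy two-dimensional vectors. The helper names and the vectors are made up for illustration; real CLIP embeddings come from the model.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity: dot product of the unit-length vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def zero_shot_probs(image_vec: List[float], text_vecs: List[List[float]],
                    scale: float = 100.0) -> List[float]:
    """Softmax over scaled cosine similarities, one probability per label."""
    logits = [scale * cosine(image_vec, t) for t in text_vecs]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_to_unit_interval(cos_sim: float) -> float:
    """Map a cosine similarity in [-1, 1] to [0, 1]."""
    return (cos_sim + 1.0) / 2.0


# Toy embeddings: the image vector points the same way as the first label.
probs = zero_shot_probs([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]])
print(probs)
print(similarity_to_unit_interval(cosine([1.0, 0.0], [0.0, 1.0])))
```

Because of the factor of 100, small differences in cosine similarity turn into very peaked probabilities, so the top score says more about the ranking among the given labels than about absolute confidence.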
cv/src/models/yolo_detector.py DELETED
@@ -1,208 +0,0 @@
1
- from __future__ import annotations
2
-
3
- from typing import List
4
- from dataclasses import dataclass, field
5
-
6
- import numpy as np
7
- from loguru import logger
8
-
9
- from ..config import get_cv_settings
10
- from ..processors.image_preprocessor import ImageInput
11
-
12
-
13
- @dataclass
14
- class BoundingBox:
15
- x1: float
16
- y1: float
17
- x2: float
18
- y2: float
19
-
20
- @property
21
- def width(self) -> float:
22
- return self.x2 - self.x1
23
-
24
- @property
25
- def height(self) -> float:
26
- return self.y2 - self.y1
27
-
28
- @property
29
- def area(self) -> float:
30
- return self.width * self.height
31
-
32
- def to_dict(self) -> dict:
33
- return {
34
- "x1": round(self.x1, 1), "y1": round(self.y1, 1),
35
- "x2": round(self.x2, 1), "y2": round(self.y2, 1),
36
- "width": round(self.width, 1), "height": round(self.height, 1),
37
- }
38
-
39
-
40
- @dataclass
41
- class Detection:
42
- label: str
43
- confidence: float
44
- bbox: BoundingBox
45
- class_id: int
46
-
47
- def to_dict(self) -> dict:
48
- return {
49
- "label": self.label,
50
- "confidence": round(self.confidence, 4),
51
- "bbox": self.bbox.to_dict(),
52
- "class_id": self.class_id,
53
- }
54
-
55
-
56
- @dataclass
57
- class DetectionResult:
58
- detections: List[Detection] = field(default_factory=list)
59
- image_width: int = 0
60
- image_height: int = 0
61
- model_name: str = ""
62
- inference_time_ms: float = 0.0
63
-
64
- @property
65
- def count(self) -> int:
66
- return len(self.detections)
67
-
68
- @property
69
- def labels_summary(self) -> dict[str, int]:
70
- """Summary: {label: count}"""
71
- summary = {}
72
- for d in self.detections:
73
- summary[d.label] = summary.get(d.label, 0) + 1
74
- return summary
75
-
76
- def filter_by_label(self, label: str) -> List[Detection]:
77
- return [d for d in self.detections if d.label.lower() == label.lower()]
78
-
79
- def filter_by_confidence(self, min_conf: float) -> List[Detection]:
80
- return [d for d in self.detections if d.confidence >= min_conf]
81
-
82
-
83
- class YOLODetector:
84
- """
85
- Object detection using YOLOv8 (Ultralytics).
86
- Model: yolov8n (nano, fast) -> yolov8m (medium, more accurate)
87
- 80 COCO classes by default; can be fine-tuned for a specific domain.
88
- """
89
-
90
- def __init__(self):
91
- settings = get_cv_settings()
92
- logger.info(f"Loading YOLO model: {settings.yolo_model}")
93
-
94
- try:
95
- from ultralytics import YOLO
96
- self.model = YOLO(settings.yolo_model)
97
- except Exception as e:
98
- logger.error(f"Gagal load YOLO: {e}")
99
- raise
100
-
101
- self.conf_threshold = settings.yolo_conf_threshold
102
- self.iou_threshold = settings.yolo_iou_threshold
103
- self.model_name = settings.yolo_model
104
- logger.info("YOLO detector ready.")
105
-
106
- def detect(
107
- self,
108
- image: ImageInput,
109
- conf_threshold: float = None,
110
- classes: List[int] = None,
111
- ) -> DetectionResult:
112
- """
113
- Detect objects in the image.
114
-
115
- Args:
116
- image: ImageInput object
117
- conf_threshold: Override the confidence threshold (defaults to the configured value)
118
- classes: Restrict to specific classes (COCO class IDs); None = all classes
119
-
120
- Returns:
121
- DetectionResult containing all detections
122
- """
123
- import time
124
- conf = conf_threshold or self.conf_threshold
125
-
126
- start = time.perf_counter()
127
- results = self.model.predict(
128
- source=image.numpy,
129
- conf=conf,
130
- iou=self.iou_threshold,
131
- classes=classes,
132
- verbose=False,
133
- )
134
- elapsed_ms = (time.perf_counter() - start) * 1000
135
-
136
- detections = []
137
- if results and results[0].boxes is not None:
138
- boxes = results[0].boxes
139
- for i in range(len(boxes)):
140
- x1, y1, x2, y2 = boxes.xyxy[i].cpu().numpy()
141
- conf_val = float(boxes.conf[i].cpu().numpy())
142
- cls_id = int(boxes.cls[i].cpu().numpy())
143
- label = self.model.names[cls_id]
144
-
145
- detections.append(Detection(
146
- label=label,
147
- confidence=conf_val,
148
- bbox=BoundingBox(x1=x1, y1=y1, x2=x2, y2=y2),
149
- class_id=cls_id,
150
- ))
151
-
152
- logger.debug(
153
- f"Detected {len(detections)} objects in {elapsed_ms:.1f}ms | "
154
- f"Summary: {DetectionResult(detections=detections).labels_summary}"
155
- )
156
-
157
- return DetectionResult(
158
- detections=detections,
159
- image_width=image.width,
160
- image_height=image.height,
161
- model_name=self.model_name,
162
- inference_time_ms=round(elapsed_ms, 2),
163
- )
164
-
165
- def detect_and_annotate(self, image: ImageInput, **kwargs) -> "np.ndarray":
166
- """
167
- Detect and return the image with bounding boxes drawn on it.
168
- Useful for visualization / demos.
169
- """
170
- import cv2
171
-
172
- result_img = image.numpy.copy()
173
- det_result = self.detect(image, **kwargs)
174
-
175
- for det in det_result.detections:
176
- bb = det.bbox
177
- x1, y1, x2, y2 = int(bb.x1), int(bb.y1), int(bb.x2), int(bb.y2)
178
-
179
- # Color based on class_id
180
- color = self._get_color(det.class_id)
181
-
182
- # Draw the bounding box
183
- cv2.rectangle(result_img, (x1, y1), (x2, y2), color, 2)
184
-
185
- # Label background + text
186
- label_text = f"{det.label} {det.confidence:.2f}"
187
- (tw, th), _ = cv2.getTextSize(label_text, cv2.FONT_HERSHEY_SIMPLEX, 0.6, 1)
188
- cv2.rectangle(result_img, (x1, y1 - th - 8), (x1 + tw + 4, y1), color, -1)
189
- cv2.putText(result_img, label_text, (x1 + 2, y1 - 4),
190
- cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 255), 1)
191
-
192
- return result_img
193
-
194
- @staticmethod
195
- def _get_color(class_id: int) -> tuple[int, int, int]:
196
- """Generate a consistent color per class_id."""
197
- palette = [
198
- (255, 56, 56), (255, 157, 151), (255, 112, 31), (255, 178, 29),
199
- (207, 210, 49), (72, 249, 10), (146, 204, 23), (61, 219, 134),
200
- (26, 147, 52), (0, 212, 187), (44, 153, 168), (0, 194, 255),
201
- (52, 69, 147), (100, 115, 255), (0, 24, 236), (132, 56, 255),
202
- ]
203
- return palette[class_id % len(palette)]
204
-
205
- @property
206
- def available_classes(self) -> dict[int, str]:
207
- """Return a dict of all detectable classes."""
208
- return self.model.names
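Note on `conf = conf_threshold or self.conf_threshold` in the removed `detect()`: the `or` idiom treats every falsy value as "not provided", so an explicit `conf_threshold=0.0` silently falls back to the configured default. The sketch below contrasts that with an `is None` check. The function names and the `DEFAULT_CONF` value are assumptions for illustration, not values from the project's config.

```python
from typing import Optional

DEFAULT_CONF = 0.25  # hypothetical configured default


def resolve_conf_with_or(conf: Optional[float],
                         default: float = DEFAULT_CONF) -> float:
    """The pattern from the removed code: falsy values fall back."""
    return conf or default


def resolve_conf(conf: Optional[float],
                 default: float = DEFAULT_CONF) -> float:
    """Fall back only when the caller passed nothing at all."""
    return default if conf is None else conf


print(resolve_conf_with_or(0.0))  # explicit 0.0 is replaced by the default
print(resolve_conf(0.0))          # explicit 0.0 is honored
```

Whether a threshold of 0.0 is ever useful depends on the caller; the point is only that the two forms differ for that input.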
cv/src/processors/__init__.py DELETED
File without changes
cv/src/processors/image_preprocessor.py DELETED
@@ -1,154 +0,0 @@
1
- from __future__ import annotations
2
-
3
- import io
4
- import base64
5
- from pathlib import Path
6
- from typing import Union
7
- from dataclasses import dataclass, field
8
-
9
- import numpy as np
10
- from PIL import Image, ExifTags
11
- from loguru import logger
12
-
13
-
14
- @dataclass
15
- class ImageInput:
16
- """Normalized image container: every source is converted to this."""
17
- pil_image: Image.Image
18
- original_size: tuple[int, int] # (width, height)
19
- source: str = "unknown"
20
- filename: str = ""
21
- format: str = "RGB"
22
- metadata: dict = field(default_factory=dict)
23
-
24
- @property
25
- def width(self) -> int:
26
- return self.pil_image.width
27
-
28
- @property
29
- def height(self) -> int:
30
- return self.pil_image.height
31
-
32
- @property
33
- def numpy(self) -> np.ndarray:
34
- """Return as HWC uint8 numpy array (for OpenCV/YOLO)."""
35
- return np.array(self.pil_image)
36
-
37
- def to_base64(self) -> str:
38
- """Convert to a base64 string for API responses."""
39
- buf = io.BytesIO()
40
- self.pil_image.save(buf, format="JPEG", quality=85)
41
- return base64.b64encode(buf.getvalue()).decode()
42
-
43
-
44
- class ImagePreprocessor:
45
- """
46
- Handle every form of image input:
47
- - File path (str / Path)
48
- - Raw bytes (from an upload)
49
- - Base64 string
50
- - URL (via HTTP)
51
- - PIL Image passed directly
52
- """
53
-
54
- MAX_SIZE = (1920, 1920)
55
-
56
- @classmethod
57
- def load(cls, source: Union[str, bytes, Path, Image.Image]) -> ImageInput:
58
- """Auto-detect the input type and load it as an ImageInput."""
59
- if isinstance(source, Image.Image):
60
- return cls._from_pil(source, source_name="pil_direct")
61
-
62
- if isinstance(source, bytes):
63
- return cls._from_bytes(source)
64
-
65
- if isinstance(source, Path) or (isinstance(source, str) and not source.startswith(("http", "data:"))):
66
- return cls._from_file(str(source))
67
-
68
- if isinstance(source, str) and source.startswith("data:image"):
69
- return cls._from_base64(source)
70
-
71
- if isinstance(source, str) and source.startswith(("http://", "https://")):
72
- return cls._from_url(source)
73
-
74
- raise ValueError(f"Tipe input tidak dikenali: {type(source)}")
75
-
76
- @classmethod
77
- def _from_file(cls, path: str) -> ImageInput:
78
- p = Path(path)
79
- if not p.exists():
80
- raise FileNotFoundError(f"Gambar tidak ditemukan: {path}")
81
- img = Image.open(p)
82
- img = cls._normalize(img)
83
- logger.debug(f"Loaded image from file: {p.name} ({img.width}x{img.height})")
84
- return ImageInput(
85
- pil_image=img,
86
- original_size=(img.width, img.height),
87
- source="file",
88
- filename=p.name,
89
- metadata={"path": str(p), "format": p.suffix},
90
- )
91
-
92
- @classmethod
93
- def _from_bytes(cls, data: bytes, filename: str = "upload") -> ImageInput:
94
- img = Image.open(io.BytesIO(data))
95
- original_size = (img.width, img.height)
96
- img = cls._normalize(img)
97
- return ImageInput(
98
- pil_image=img,
99
- original_size=original_size,
100
- source="bytes",
101
- filename=filename,
102
- metadata={"size_bytes": len(data)},
103
- )
104
-
105
- @classmethod
106
- def _from_base64(cls, b64_str: str) -> ImageInput:
107
- # Strip the data URI prefix if present
108
- if "," in b64_str:
109
- b64_str = b64_str.split(",", 1)[1]
110
- data = base64.b64decode(b64_str)
111
- return cls._from_bytes(data, filename="base64_input")
112
-
113
- @classmethod
114
- def _from_url(cls, url: str) -> ImageInput:
115
- import httpx
116
- logger.debug(f"Fetching image from URL: {url}")
117
- r = httpx.get(url, timeout=15, follow_redirects=True)
118
- r.raise_for_status()
119
- img_input = cls._from_bytes(r.content, filename=url.split("/")[-1] or "url_image")
120
- img_input.source = "url"
121
- img_input.metadata["url"] = url
122
- return img_input
123
-
124
- @classmethod
125
- def _from_pil(cls, img: Image.Image, source_name: str = "pil") -> ImageInput:
126
- original_size = (img.width, img.height)
127
- img = cls._normalize(img)
128
- return ImageInput(pil_image=img, original_size=original_size, source=source_name)
129
-
130
- @classmethod
131
- def _normalize(cls, img: Image.Image) -> Image.Image:
132
- """Convert to RGB, fix EXIF rotation, resize if too large."""
133
- # Fix EXIF orientation
134
- try:
135
- exif = img._getexif()
136
- if exif:
137
- for tag, val in exif.items():
138
- if ExifTags.TAGS.get(tag) == "Orientation":
139
- rotations = {3: 180, 6: 270, 8: 90}
140
- if val in rotations:
141
- img = img.rotate(rotations[val], expand=True)
142
- except Exception:
143
- pass
144
-
145
- # Convert to RGB
146
- if img.mode != "RGB":
147
- img = img.convert("RGB")
148
-
149
- # Resize if it exceeds the limit
150
- if img.width > cls.MAX_SIZE[0] or img.height > cls.MAX_SIZE[1]:
151
- img.thumbnail(cls.MAX_SIZE, Image.LANCZOS)
152
- logger.debug(f"Resized image to {img.width}x{img.height}")
153
-
154
- return img
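Note on the input dispatch in the removed `ImagePreprocessor.load()`: the file-path branch excludes any string starting with `"http"` or `"data:"`, while the later branches only accept `"data:image"` and `"http://"` / `"https://"`. A relative path such as `httpdocs/cat.png` therefore matches none of them and ends in the `ValueError`. A possible way to order the checks so that every string lands somewhere is sketched below. `classify_source` is a hypothetical helper that only returns the kind of input, without loading anything, and the PIL branch is left out to keep it dependency-free.

```python
from pathlib import Path
from typing import Union


def classify_source(source: Union[str, bytes, Path]) -> str:
    """Return which loader would handle the input: bytes, file, base64 or url."""
    if isinstance(source, bytes):
        return "bytes"
    if isinstance(source, Path):
        return "file"
    if isinstance(source, str):
        # Test the specific prefixes first, then treat the rest as a path.
        if source.startswith("data:image"):
            return "base64"
        if source.startswith(("http://", "https://")):
            return "url"
        return "file"
    raise ValueError(f"Unrecognized input type: {type(source)}")


for example in ("photo.jpg", "httpdocs/cat.png",
                "https://example.com/a.png", "data:image/png;base64,AAAA"):
    print(example, "->", classify_source(example))
```

Treating every remaining string as a file path means a typo in a URL scheme surfaces as a "file not found" error rather than an "unrecognized type" error; which of the two is preferable is a design choice.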
cv/src/processors/ocr_processor.py DELETED
@@ -1,235 +0,0 @@
1
- from __future__ import annotations
2
-
3
- from typing import List
4
- from dataclasses import dataclass, field
5
- from loguru import logger
6
-
7
- import numpy as np
8
-
9
- from ..config import get_cv_settings
10
- from ..processors.image_preprocessor import ImageInput
11
-
12
-
13
- @dataclass
14
- class OCRBox:
15
- text: str
16
- confidence: float
17
- bbox: list # [[x1,y1],[x2,y1],[x2,y2],[x1,y2]] EasyOCR format
18
-
19
- def to_dict(self) -> dict:
20
- return {
21
- "text": self.text,
22
- "confidence": round(self.confidence, 4),
23
- "bbox": self.bbox,
24
- }
25
-
26
-
27
- @dataclass
28
- class OCRResult:
29
- full_text: str
30
- boxes: List[OCRBox] = field(default_factory=list)
31
- language: str = ""
32
- engine: str = ""
33
-
34
- @property
35
- def word_count(self) -> int:
36
- return len(self.full_text.split())
37
-
38
-
39
- class OCRProcessor:
40
- """
41
- OCR using EasyOCR in a stable mode (lightweight single pass).
42
- Focus: no crashes in Docker while still improving accuracy.
43
- """
44
-
45
- MIN_CONFIDENCE = 0.10
46
- MIN_OCR_DIM = 800
47
-
48
- def __init__(self):
49
- settings = get_cv_settings()
50
- self.engine = settings.ocr_engine
51
- self.languages = [l.strip() for l in settings.ocr_languages.split(",")]
52
- logger.info(f"Loading OCR ({self.engine}) for languages: {self.languages}")
53
-
54
- try:
55
- import easyocr
56
- self.reader = easyocr.Reader(
57
- self.languages,
58
- gpu=(settings.device == "cuda"),
59
- model_storage_directory=settings.models_cache_dir,
60
- )
61
- logger.info("OCR processor ready.")
62
- except Exception as e:
63
- logger.error(f"Gagal init OCR: {e}")
64
- raise
65
-
66
- def _preprocess_for_ocr(self, img: np.ndarray) -> np.ndarray:
67
- """
68
- Lightweight preprocessing:
69
- - upscale (if small)
70
- - grayscale
71
- - CLAHE contrast enhancement
72
- - light sharpen
73
- """
74
- try:
75
- import cv2
76
-
77
- h, w = img.shape[:2]
78
- if max(h, w) < self.MIN_OCR_DIM:
79
- scale = self.MIN_OCR_DIM / max(h, w)
80
- new_w, new_h = int(w * scale), int(h * scale)
81
- img = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_CUBIC)
82
-
83
- if len(img.shape) == 3:
84
- gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
85
- else:
86
- gray = img.copy()
87
-
88
- clahe = cv2.createCLAHE(clipLimit=2.5, tileGridSize=(8, 8))
89
- enhanced = clahe.apply(gray)
90
-
91
- kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
92
- sharpened = cv2.filter2D(enhanced, -1, kernel)
93
-
94
- return cv2.cvtColor(sharpened, cv2.COLOR_GRAY2RGB)
95
- except Exception as e:
96
- logger.warning(f"OCR preprocessing fallback to original image: {e}")
97
- return img
98
-
99
- def _parse_results(self, raw_results: List) -> List[OCRBox]:
100
- boxes = []
101
- for item in raw_results:
102
- if len(item) == 3:
103
- bbox, text, confidence = item
104
- elif len(item) == 2:
105
- bbox, text = item
106
- confidence = 0.8
107
- else:
108
- continue
109
-
110
- text = str(text).strip()
111
- if not text or confidence < self.MIN_CONFIDENCE:
112
- continue
113
-
114
- # Convert numpy scalars/arrays to native Python types for FastAPI/Pydantic serialization
115
- safe_bbox = []
116
- try:
117
- for pt in bbox:
118
- if isinstance(pt, (list, tuple)) and len(pt) >= 2:
119
- safe_bbox.append([float(pt[0]), float(pt[1])])
120
- else:
121
- safe_bbox.append(pt)
122
- except Exception:
123
- safe_bbox = bbox
124
-
125
- boxes.append(OCRBox(
126
- text=text,
127
- confidence=float(confidence),
128
- bbox=safe_bbox,
129
- ))
130
- return boxes
131
-
132
- def _boxes_to_text(self, boxes: List[OCRBox]) -> str:
133
- if not boxes:
134
- return ""
135
-
136
- def cy(box: OCRBox) -> float:
137
- try:
138
- ys = [pt[1] for pt in box.bbox]
139
- return sum(ys) / len(ys)
140
- except Exception:
141
- return 0
142
-
143
- def cx(box: OCRBox) -> float:
144
- try:
145
- xs = [pt[0] for pt in box.bbox]
146
- return sum(xs) / len(xs)
147
- except Exception:
148
- return 0
149
-
150
- def h(box: OCRBox) -> float:
151
- try:
152
- ys = [pt[1] for pt in box.bbox]
153
- return max(ys) - min(ys)
154
- except Exception:
155
- return 20
156
-
157
- sorted_boxes = sorted(boxes, key=lambda b: (cy(b), cx(b)))
158
- lines = []
159
- current = [sorted_boxes[0]]
160
- current_y = cy(sorted_boxes[0])
161
-
162
- for box in sorted_boxes[1:]:
163
- if abs(cy(box) - current_y) < max(h(box) * 0.5, 15):
164
- current.append(box)
165
- else:
166
- current.sort(key=lambda b: cx(b))
167
- lines.append(" ".join(b.text for b in current))
168
- current = [box]
169
- current_y = cy(box)
170
-
171
- if current:
172
- current.sort(key=lambda b: cx(b))
173
- lines.append(" ".join(b.text for b in current))
174
-
175
- return "\n".join(lines)
176
-
177
- def extract_text(
178
- self,
179
- image: ImageInput,
180
- detail: bool = True,
181
- paragraph: bool = False,
182
- ) -> OCRResult:
183
- logger.debug(f"Running stable OCR on {image.width}x{image.height} image")
184
-
185
- try:
186
- processed = self._preprocess_for_ocr(image.numpy.copy())
187
- raw_results = self.reader.readtext(
188
- processed,
189
- detail=1,
190
- paragraph=False,
191
- contrast_ths=0.1,
192
- adjust_contrast=0.7,
193
- text_threshold=0.5,
194
- low_text=0.3,
195
- link_threshold=0.3,
196
- width_ths=0.7,
197
- decoder="beamsearch",
198
- beamWidth=10,
199
- )
200
- boxes = self._parse_results(raw_results)
201
-
202
- if len(boxes) < 2:
203
- raw2 = self.reader.readtext(
204
- image.numpy,
205
- detail=1,
206
- paragraph=False,
207
- )
208
- boxes2 = self._parse_results(raw2)
209
- if len(boxes2) > len(boxes):
210
- boxes = boxes2
211
-
212
- full_text = self._boxes_to_text(boxes)
213
-
214
- return OCRResult(
215
- full_text=full_text,
216
- boxes=boxes,
217
- language=",".join(self.languages),
218
- engine=self.engine,
219
- )
220
-
221
- except Exception as e:
222
- logger.error(f"OCR processing error: {e}")
223
- raw_results = self.reader.readtext(image.numpy, detail=1, paragraph=False)
224
- boxes = self._parse_results(raw_results)
225
- full_text = self._boxes_to_text(boxes)
226
- return OCRResult(
227
- full_text=full_text,
228
- boxes=boxes,
229
- language=",".join(self.languages),
230
- engine=self.engine,
231
- )
232
-
233
- def extract_text_simple(self, image: ImageInput) -> str:
234
- result = self.extract_text(image, detail=True, paragraph=False)
235
- return result.full_text
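Note on `_boxes_to_text()` in the removed `ocr_processor.py`: it rebuilds reading order by sorting boxes on their vertical center, starting a new line whenever a box's center is farther from the current line's first box than `max(0.5 * box height, 15)` pixels, and then ordering each line left to right. A self-contained sketch of that grouping follows, with boxes represented as plain `(text, corner points)` tuples instead of the `OCRBox` dataclass. The helper names and the sample coordinates are illustrative assumptions.

```python
from typing import List, Tuple

Point = List[float]
Box = Tuple[str, List[Point]]  # (text, four corner points)


def rect(x1: float, y1: float, x2: float, y2: float) -> List[Point]:
    """Corner points in the [[x1,y1],[x2,y1],[x2,y2],[x1,y2]] layout."""
    return [[x1, y1], [x2, y1], [x2, y2], [x1, y2]]


def center_x(points: List[Point]) -> float:
    return sum(p[0] for p in points) / len(points)


def center_y(points: List[Point]) -> float:
    return sum(p[1] for p in points) / len(points)


def height(points: List[Point]) -> float:
    ys = [p[1] for p in points]
    return max(ys) - min(ys)


def boxes_to_text(boxes: List[Box], min_tolerance: float = 15.0) -> str:
    """Group boxes into lines by vertical center, then join left to right."""
    if not boxes:
        return ""
    ordered = sorted(boxes, key=lambda b: (center_y(b[1]), center_x(b[1])))
    lines: List[List[Box]] = []
    current = [ordered[0]]
    line_y = center_y(ordered[0][1])  # anchor: first box of the current line
    for box in ordered[1:]:
        tolerance = max(height(box[1]) * 0.5, min_tolerance)
        if abs(center_y(box[1]) - line_y) < tolerance:
            current.append(box)
        else:
            lines.append(current)
            current = [box]
            line_y = center_y(box[1])
    lines.append(current)
    return "\n".join(
        " ".join(text for text, _ in sorted(line, key=lambda b: center_x(b[1])))
        for line in lines
    )


sample = [
    ("world", rect(100, 12, 150, 32)),
    ("Hello", rect(10, 10, 60, 30)),
    ("Bye", rect(10, 60, 40, 80)),
]
print(boxes_to_text(sample))
```

The tolerance is measured against the first box of a line rather than a running average, so a long, slightly skewed line of text can split into two lines; that matches the removed code and is worth keeping in mind for rotated scans.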
cv_module/src/api/routes.py CHANGED
@@ -34,6 +34,10 @@ router = APIRouter()
34
  _pipeline: CVPipeline = None
35
  _pipeline_lock = threading.Lock()
36
 
 
 
 
 
37
 
38
  def get_pipeline() -> CVPipeline:
39
  global _pipeline
@@ -57,29 +61,37 @@ def _trigger_and_wait(model_name: str):
57
  Thread-safe: only one thread loads; the rest wait.
58
  """
59
  readiness = get_readiness()
60
- status_info = readiness.get_status(model_name)
61
 
62
- # If already ready, return immediately.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
63
  if status_info.state.value == "ready":
64
  return
65
 
66
- # If in an error state, raise immediately.
67
- if status_info.state.value == "error":
68
- raise HTTPException(
69
- status_code=503,
70
- detail={
71
- "error": "model_failed_to_load",
72
- "model": model_name,
73
- "message": status_info.error_message or "Model gagal dimuat.",
74
- "hint": "Cek logs container untuk detail error.",
75
- },
76
- )
77
-
78
- # If not loading yet (not_loaded), trigger the load via pipeline property access.
79
- # ReadinessTracker is updated by the pipeline's lazy loader.
80
- if status_info.state.value in ("not_loaded",):
81
- readiness.mark_loading(model_name)
82
- # Trigger the load in a new thread so it does not block the event loop.
83
  def _do_load():
84
  try:
85
  p = get_pipeline()
@@ -104,6 +116,18 @@ def _trigger_and_wait(model_name: str):
104
  ok = readiness.wait_for(model_name, timeout=_MODEL_WAIT_TIMEOUT)
105
  if not ok:
106
  current = readiness.get_status(model_name).state.value
 
 
 
 
 
 
 
 
 
 
 
 
107
  raise HTTPException(
108
  status_code=503,
109
  detail={
@@ -230,7 +254,8 @@ async def caption(url: str, prompt: str = None):
230
  except HTTPException:
231
  raise
232
  except Exception as e:
233
- raise HTTPException(500, detail=str(e))
 
234
 
235
 
236
  @router.post("/detect", response_model=DetectionResponse, tags=["tasks"])
@@ -250,7 +275,8 @@ async def detect(url: str, conf: float = None):
250
  except HTTPException:
251
  raise
252
  except Exception as e:
253
- raise HTTPException(500, detail=str(e))
 
254
 
255
 
256
  @router.post("/classify", response_model=ClassificationResponse, tags=["tasks"])
@@ -268,7 +294,8 @@ async def classify(req: ClassifyRequest):
268
  except HTTPException:
269
  raise
270
  except Exception as e:
271
- raise HTTPException(500, detail=str(e))
 
272
 
273
 
274
  class OCRRequest(BaseModel):
@@ -293,7 +320,8 @@ async def ocr(req: OCRRequest):
293
  except HTTPException:
294
  raise
295
  except Exception as e:
296
- raise HTTPException(500, detail=str(e))
 
297
 
298
 
299
  @router.post("/ocr/bytes", tags=["tasks"])
 
34
  _pipeline: CVPipeline = None
35
  _pipeline_lock = threading.Lock()
36
 
37
+ # Separate lock for triggering the lazy load: prevents a TOCTOU race when
38
+ # several requests for the same model arrive at the same time.
39
+ _trigger_lock = threading.Lock()
40
+
41
 
42
  def get_pipeline() -> CVPipeline:
43
  global _pipeline
 
61
  Thread-safe: only one thread loads; the rest wait.
62
  """
63
  readiness = get_readiness()
 
64
 
65
+ # Atomic check-and-mark: hold _trigger_lock so two requests don't both
66
+ # spawn a loader thread for the same model (CVPipeline has a per-model
67
+ # lock, but spawning an extra thread still wastes resources and adds log noise).
68
+ with _trigger_lock:
69
+ status_info = readiness.get_status(model_name)
70
+
71
+ # If in an error state, raise immediately.
72
+ if status_info.state.value == "error":
73
+ raise HTTPException(
74
+ status_code=503,
75
+ detail={
76
+ "error": "model_failed_to_load",
77
+ "model": model_name,
78
+ "message": status_info.error_message or "Model gagal dimuat.",
79
+ "hint": "Cek logs container untuk detail error.",
80
+ },
81
+ )
82
+
83
+ need_spawn = status_info.state.value in ("not_loaded",)
84
+ if need_spawn:
85
+ # Mark as loading first: subsequent requests will see
86
+ # state="loading" and go straight to wait_for() without spawning a new thread.
87
+ readiness.mark_loading(model_name)
88
+
89
+ # If already ready, return immediately (no need to wait).
90
  if status_info.state.value == "ready":
91
  return
92
 
93
+ # Spawn the loader thread outside the lock so other requests can get in.
94
+ if need_spawn:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
95
  def _do_load():
96
  try:
97
  p = get_pipeline()
 
116
  ok = readiness.wait_for(model_name, timeout=_MODEL_WAIT_TIMEOUT)
117
  if not ok:
118
  current = readiness.get_status(model_name).state.value
119
+ # If the state is error, return a specific error message.
120
+ if current == "error":
121
+ err_msg = readiness.get_status(model_name).error_message
122
+ raise HTTPException(
123
+ status_code=503,
124
+ detail={
125
+ "error": "model_failed_to_load",
126
+ "model": model_name,
127
+ "message": err_msg or f"Model '{model_name}' gagal dimuat.",
128
+ "hint": "Cek logs container untuk traceback lengkap.",
129
+ },
130
+ )
131
  raise HTTPException(
132
  status_code=503,
133
  detail={
 
254
  except HTTPException:
255
  raise
256
  except Exception as e:
257
+ logger.error(f"Caption error: {e}")
258
+ raise HTTPException(500, detail=f"Caption gagal: {e}")
259
 
260
 
261
  @router.post("/detect", response_model=DetectionResponse, tags=["tasks"])
 
275
  except HTTPException:
276
  raise
277
  except Exception as e:
278
+ logger.error(f"Detect error: {e}")
279
+ raise HTTPException(500, detail=f"Detect gagal: {e}")
280
 
281
 
282
  @router.post("/classify", response_model=ClassificationResponse, tags=["tasks"])
 
294
  except HTTPException:
295
  raise
296
  except Exception as e:
297
+ logger.error(f"Classify error: {e}")
298
+ raise HTTPException(500, detail=f"Classify gagal: {e}")
299
 
300
 
301
  class OCRRequest(BaseModel):
 
320
  except HTTPException:
321
  raise
322
  except Exception as e:
323
+ logger.error(f"OCR error: {e}")
324
+ raise HTTPException(500, detail=f"OCR gagal: {e}")
325
 
326
 
327
  @router.post("/ocr/bytes", tags=["tasks"])
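Reviewer note: the check-and-mark step in the hunk above can be sketched in isolation. This is a minimal sketch, not the project's readiness API: `ModelRegistry`, `ensure_loaded`, and `spawn_count` are hypothetical names, and a plain `threading.Event` stands in for `readiness.wait_for()`.

```python
import threading


class ModelRegistry:
    """Hypothetical stand-in for the readiness tracker; not the project's API."""

    def __init__(self):
        self._states = {}   # name -> "loading" | "ready" | "error" (absent = not loaded)
        self._events = {}   # name -> threading.Event, set when loading finishes
        self._trigger_lock = threading.Lock()
        self.spawn_count = 0  # loader threads started, kept only for illustration

    def ensure_loaded(self, name, loader, timeout=5.0):
        # Atomic check-and-mark: only one caller can ever observe "not_loaded".
        with self._trigger_lock:
            state = self._states.get(name, "not_loaded")
            if state == "error":
                raise RuntimeError(f"model {name!r} failed to load")
            need_spawn = state == "not_loaded"
            if need_spawn:
                # Mark first, so later callers see "loading" and only wait.
                self._states[name] = "loading"
                self._events[name] = threading.Event()
                self.spawn_count += 1
            event = self._events[name]

        if state == "ready":
            return

        # Start the loader outside the lock so other requests are not blocked.
        if need_spawn:
            threading.Thread(
                target=self._load, args=(name, loader), daemon=True
            ).start()

        if not event.wait(timeout):
            raise TimeoutError(f"model {name!r} is still loading")
        with self._trigger_lock:
            if self._states[name] == "error":
                raise RuntimeError(f"model {name!r} failed to load")

    def _load(self, name, loader):
        try:
            loader()
            result = "ready"
        except Exception:
            result = "error"
        with self._trigger_lock:
            self._states[name] = result
        self._events[name].set()
```

Because the state is flipped to "loading" while the lock is still held, eight concurrent callers for the same model start exactly one loader thread; the other seven only wait on the event.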
cv_module/src/cv_pipeline.py CHANGED
@@ -69,38 +69,61 @@ class CVPipeline:
69
  Orchestrator for all CV models.
70
  Lazy loading: a model is loaded only when it is first used.
71
  Modular: can run a single model or all of them at once.
72
  """
73
 
 
 
 
 
 
74
  def __init__(self):
 
75
  self.settings = get_cv_settings()
76
  self._clip: Optional[CLIPModel] = None
77
  self._yolo: Optional[YOLODetector] = None
78
  self._captioner: Optional[ImageCaptioner] = None
79
  self._ocr: Optional[OCRProcessor] = None
80
- logger.info("CVPipeline initialized (lazy loading).")
 
 
 
 
 
81
 
82
  @property
83
  def clip(self) -> CLIPModel:
84
  if self._clip is None:
85
- self._clip = CLIPModel()
 
 
86
  return self._clip
87
 
88
  @property
89
  def yolo(self) -> YOLODetector:
90
  if self._yolo is None:
91
- self._yolo = YOLODetector()
 
 
92
  return self._yolo
93
 
94
  @property
95
  def captioner(self) -> ImageCaptioner:
96
  if self._captioner is None:
97
- self._captioner = ImageCaptioner()
 
 
98
  return self._captioner
99
 
100
  @property
101
  def ocr(self) -> OCRProcessor:
102
  if self._ocr is None:
103
- self._ocr = OCRProcessor()
 
 
104
  return self._ocr
105
 
106
  # === Main analysis methods ===
@@ -139,9 +162,12 @@ class CVPipeline:
139
  source=image.source,
140
  )
141
 
142
- # Run all models in parallel to avoid timeouts
 
 
 
143
  tasks = {}
144
- with concurrent.futures.ThreadPoolExecutor() as executor:
145
  if run_caption:
146
  tasks["caption"] = executor.submit(self.captioner.caption, image)
147
  if run_detection:
 
69
  Orchestrator for all CV models.
70
  Lazy loading: a model is loaded only when it is first used.
71
  Modular: can run a single model or all of them at once.
72
+
73
+ Thread-safe: each model property uses a per-model lock so that two or more
74
+ concurrent requests do not load the same model twice (which can cause OOM
75
+ on the HF free tier, which has only 16GB of shared RAM).
76
  """
77
 
78
+ # Cap the ThreadPoolExecutor workers used by analyze(). Without a cap, the default
79
+ # min(32, os.cpu_count() + 4) can cause OOM when all models run in parallel
80
+ # while model loading also needs RAM. 2 is enough to overlap I/O and compute.
81
+ _MAX_PARALLEL_TASKS = 2
82
+
83
  def __init__(self):
84
+ import threading
85
  self.settings = get_cv_settings()
86
  self._clip: Optional[CLIPModel] = None
87
  self._yolo: Optional[YOLODetector] = None
88
  self._captioner: Optional[ImageCaptioner] = None
89
  self._ocr: Optional[OCRProcessor] = None
90
+ # Per-model locks: prevent double initialization when 2+ threads access a model at once.
91
+ self._lock_clip = threading.Lock()
92
+ self._lock_yolo = threading.Lock()
93
+ self._lock_captioner = threading.Lock()
94
+ self._lock_ocr = threading.Lock()
95
+ logger.info("CVPipeline initialized (lazy loading, thread-safe).")
96
 
97
  @property
98
  def clip(self) -> CLIPModel:
99
  if self._clip is None:
100
+ with self._lock_clip:
101
+ if self._clip is None: # double-check after lock
102
+ self._clip = CLIPModel()
103
  return self._clip
104
 
105
  @property
106
  def yolo(self) -> YOLODetector:
107
  if self._yolo is None:
108
+ with self._lock_yolo:
109
+ if self._yolo is None:
110
+ self._yolo = YOLODetector()
111
  return self._yolo
112
 
113
  @property
114
  def captioner(self) -> ImageCaptioner:
115
  if self._captioner is None:
116
+ with self._lock_captioner:
117
+ if self._captioner is None:
118
+ self._captioner = ImageCaptioner()
119
  return self._captioner
120
 
121
  @property
122
  def ocr(self) -> OCRProcessor:
123
  if self._ocr is None:
124
+ with self._lock_ocr:
125
+ if self._ocr is None:
126
+ self._ocr = OCRProcessor()
127
  return self._ocr
128
 
129
  # === Main analysis methods ===
 
162
  source=image.source,
163
  )
164
 
165
+ # Run the models in parallel to avoid timeouts.
166
+ # max_workers=2 is enough to overlap I/O and compute without straining RAM.
167
+ # Without a cap, the default worker count can run all 4 models at once and
168
+ # load their weights at the same time → OOM on the HF free tier.
169
  tasks = {}
170
+ with concurrent.futures.ThreadPoolExecutor(max_workers=self._MAX_PARALLEL_TASKS) as executor:
171
  if run_caption:
172
  tasks["caption"] = executor.submit(self.captioner.caption, image)
173
  if run_detection:
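Reviewer note: the two ideas in this file's diff, double-checked locking for lazy construction and a capped worker pool, can be sketched together. This is a sketch under stated assumptions, not the `CVPipeline` code: `LazyModels`, `get`, `run_all`, and `init_counts` are hypothetical names, and plain callables stand in for the real models.

```python
import concurrent.futures
import threading


class LazyModels:
    """Hypothetical holder of lazily created models; names are illustrative."""

    _MAX_PARALLEL_TASKS = 2  # same cap as in the diff above

    def __init__(self, factories):
        self._factories = dict(factories)  # name -> zero-argument constructor
        self._instances = {}
        self._locks = {name: threading.Lock() for name in self._factories}
        self.init_counts = {name: 0 for name in self._factories}

    def get(self, name):
        instance = self._instances.get(name)
        if instance is None:  # fast path: no lock once the model exists
            with self._locks[name]:
                instance = self._instances.get(name)
                if instance is None:  # re-check while holding the lock
                    instance = self._factories[name]()
                    self.init_counts[name] += 1
                    self._instances[name] = instance
        return instance

    def _run(self, name, value):
        # The lookup happens inside the worker, so a first-use load is also
        # limited by the pool size.
        return self.get(name)(value)

    def run_all(self, value):
        with concurrent.futures.ThreadPoolExecutor(
            max_workers=self._MAX_PARALLEL_TASKS
        ) as executor:
            futures = {
                name: executor.submit(self._run, name, value)
                for name in self._factories
            }
            return {name: future.result() for name, future in futures.items()}
```

One difference worth noting: in the diff, `executor.submit(self.captioner.caption, image)` reads the `captioner` property in the submitting thread, so any first-use load appears to run before the task reaches the pool. The sketch moves the lookup into the worker instead; whether that is preferable for this service is a judgment call.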
frontend/index.html CHANGED
@@ -9,15 +9,54 @@
9
  // Anti-flash theme init. Key: 'ai-rag-theme'.
10
  // null/missing = follow device preference (re-read every load).
11
  // 'dark' or 'light' = user manually overrode.
 
 
 
 
 
12
  (function(){
13
- var s = localStorage.getItem('ai-rag-theme');
14
- var t = (s === 'dark' || s === 'light') ? s
15
- : (window.matchMedia('(prefers-color-scheme: dark)').matches ? 'dark' : 'light');
 
 
 
 
 
 
 
16
  document.documentElement.setAttribute('data-theme', t);
17
  })();
18
  </script>
19
  <style>
20
  /* ── Theme tokens ───────────────────────────────────── */
21
  :root[data-theme="dark"] {
22
  --bg0: #111214;
23
  --bg1: #16181c;
@@ -515,21 +554,25 @@ let queryCount=0;
515
  // === Theme ===
516
  function applyTheme(theme, manual){
517
  document.documentElement.setAttribute('data-theme', theme);
518
- // Only persist if user manually toggled (not from system change or init)
 
 
519
  if (manual) {
520
- localStorage.setItem('ai-rag-theme', theme);
521
  }
522
  // icon-sun shown when dark (click -> go light)
523
  // icon-moon shown when light (click -> go dark)
524
  const isDark = theme === 'dark';
525
- document.getElementById('icon-sun').style.display = isDark ? '' : 'none';
526
- document.getElementById('icon-moon').style.display = isDark ? 'none' : '';
527
- // Label shows current mode
528
- document.getElementById('theme-label').textContent = theme;
 
 
529
  }
530
 
531
  function toggleTheme(){
532
- const cur = document.documentElement.getAttribute('data-theme');
533
  applyTheme(cur === 'dark' ? 'light' : 'dark', true);
534
  }
535
 
@@ -540,14 +583,20 @@ function toggleTheme(){
540
  applyTheme(theme, false);
541
  })();
542
 
543
- // Listen for system theme changes (only if user hasn't manually overridden)
544
- // Follow system preference changes ONLY if user has no manual override
545
- window.matchMedia('(prefers-color-scheme: dark)').addEventListener('change', function(e){
546
- var stored = localStorage.getItem('ai-rag-theme');
547
- if (stored !== 'dark' && stored !== 'light') {
548
- applyTheme(e.matches ? 'dark' : 'light', false);
549
- }
550
- });
 
 
 
 
 
 
551
 
552
  // === Utility ===
553
  function tick(){document.getElementById('clock').textContent=new Date().toLocaleTimeString('id-ID',{hour:'2-digit',minute:'2-digit',second:'2-digit'})}
 
9
  // Anti-flash theme init. Key: 'ai-rag-theme'.
10
  // null/missing = follow device preference (re-read every load).
11
  // 'dark' or 'light' = user manually overrode.
12
+ //
13
+ // IMPORTANT: wrap in try/catch. Some browsers (private mode, strict cookie
14
+ // policy, embedded WebView) block localStorage or matchMedia. If the script
15
+ // throws, data-theme is never set, every CSS variable is undefined, and
16
+ // the page looks as if the theme is missing. Default to 'light' if anything fails.
17
  (function(){
18
+ var t = 'light';
19
+ try {
20
+ var s = null;
21
+ try { s = window.localStorage.getItem('ai-rag-theme'); } catch (_) {}
22
+ if (s === 'dark' || s === 'light') {
23
+ t = s;
24
+ } else if (window.matchMedia && window.matchMedia('(prefers-color-scheme: dark)').matches) {
25
+ t = 'dark';
26
+ }
27
+ } catch (_) { /* fall through to light */ }
28
  document.documentElement.setAttribute('data-theme', t);
29
  })();
30
  </script>
31
  <style>
32
  /* ── Theme tokens ───────────────────────────────────── */
33
+ /* Fallback default (light): applies when the data-theme attribute is missing
34
+ or invalid. Without this, all CSS vars are undefined → blank/white-on-white UI.
35
+ The inline script in <head> ALWAYS sets data-theme, but defense-in-depth
36
+ protects us against weird browsers / extensions / errors. */
37
+ :root {
38
+ --bg0: #f4f5f7;
39
+ --bg1: #ffffff;
40
+ --bg2: #eef0f3;
41
+ --bg3: #e4e6eb;
42
+ --line: #d0d4db;
43
+ --line2: #bcc0c9;
44
+ --ink0: #1a1d24;
45
+ --ink1: #4a5263;
46
+ --ink2: #7a8296;
47
+ --ink3: #a8afc0;
48
+ --sage: #4a7a5e;
49
+ --sage-l: #357a52;
50
+ --sage-bg: #eaf4ee;
51
+ --amber: #8c6d3f;
52
+ --amber-l: #7a5820;
53
+ --amber-bg: #fdf3e3;
54
+ --sky: #2a5f80;
55
+ --sky-l: #1e5070;
56
+ --red: #8c3a3a;
57
+ --red-l: #9e2a2a;
58
+ --shadow: rgba(0,0,0,0.08);
59
+ }
60
  :root[data-theme="dark"] {
61
  --bg0: #111214;
62
  --bg1: #16181c;
 
554
  // === Theme ===
555
  function applyTheme(theme, manual){
556
  document.documentElement.setAttribute('data-theme', theme);
557
+ // Only persist if user manually toggled (not from system change or init).
558
+ // Wrap in try/catch: localStorage can throw in private mode, embedded
559
+ // WebViews, or with strict cookie policies.
560
  if (manual) {
561
+ try { localStorage.setItem('ai-rag-theme', theme); } catch (_) {}
562
  }
563
  // icon-sun shown when dark (click -> go light)
564
  // icon-moon shown when light (click -> go dark)
565
  const isDark = theme === 'dark';
566
+ const sun = document.getElementById('icon-sun');
567
+ const moon = document.getElementById('icon-moon');
568
+ const label = document.getElementById('theme-label');
569
+ if (sun) sun.style.display = isDark ? '' : 'none';
570
+ if (moon) moon.style.display = isDark ? 'none' : '';
571
+ if (label) label.textContent = theme;
572
  }
573
 
574
  function toggleTheme(){
575
+ const cur = document.documentElement.getAttribute('data-theme') || 'light';
576
  applyTheme(cur === 'dark' ? 'light' : 'dark', true);
577
  }
578
 
 
583
  applyTheme(theme, false);
584
  })();
585
 
586
+ // Listen for system theme changes (only if user hasn't manually overridden).
587
+ // Wrap in try/catch: the MediaQueryList returned by matchMedia lacks addEventListener in old browsers.
588
+ try {
589
+ const mq = window.matchMedia('(prefers-color-scheme: dark)');
590
+ const handler = function(e){
591
+ var stored = null;
592
+ try { stored = localStorage.getItem('ai-rag-theme'); } catch (_) {}
593
+ if (stored !== 'dark' && stored !== 'light') {
594
+ applyTheme(e.matches ? 'dark' : 'light', false);
595
+ }
596
+ };
597
+ if (mq.addEventListener) mq.addEventListener('change', handler);
598
+ else if (mq.addListener) mq.addListener(handler); // Safari < 14
599
+ } catch (_) {}
600
 
601
  // === Utility ===
602
  function tick(){document.getElementById('clock').textContent=new Date().toLocaleTimeString('id-ID',{hour:'2-digit',minute:'2-digit',second:'2-digit'})}
rag-requirements.txt DELETED
@@ -1,34 +0,0 @@
1
- # ── LLM & Orchestration ──────────────────────────────────
2
- langchain==0.2.16
3
- langchain-groq==0.1.9
4
- langchain-community==0.2.16
5
- langchain-chroma==0.1.4
6
-
7
- # ── Vector Store ──────────────────────────────────────────
8
- chromadb==0.5.3
9
-
10
- # ── Embeddings ────────────────────────────────────────────
11
- sentence-transformers==2.7.0
12
- transformers==4.35.2
13
- numpy==1.26.4
14
-
15
- # ── Document Loaders ──────────────────────────────────────
16
- pypdf==4.3.1
17
- python-docx==1.1.2
18
- beautifulsoup4==4.12.3
19
- requests==2.32.3
20
-
21
- # ── API ───────────────────────────────────────────────────
22
- fastapi==0.112.0
23
- uvicorn[standard]==0.30.6
24
- python-multipart==0.0.9
25
- pydantic==2.8.2
26
- pydantic-settings==2.4.0
27
-
28
- # ── MLOps ─────────────────────────────────────────────────
29
- mlflow==2.15.1
30
-
31
- # ── Utils ─────────────────────────────────────────────────
32
- python-dotenv==1.0.1
33
- loguru==0.7.2
34
- httpx==0.27.0
rag/src/__init__.py DELETED
@@ -1 +0,0 @@
1
- # RAG Pipeline
 
 
rag/src/api/__init__.py DELETED
@@ -1 +0,0 @@
1
-
 
 
rag/src/api/main.py DELETED
@@ -1,57 +0,0 @@
1
- from fastapi import FastAPI
2
- from fastapi.middleware.cors import CORSMiddleware
3
- from loguru import logger
4
- import sys
5
-
6
- from .routes import router
7
- from ..config import get_settings
8
-
9
- # === Logging setup ===
10
- settings = get_settings()
11
- logger.remove()
12
- logger.add(sys.stderr, level=settings.log_level, colorize=True,
13
- format="<green>{time:HH:mm:ss}</green> | <level>{level: <8}</level> | <cyan>{name}</cyan> - {message}")
14
- logger.add(settings.log_file, rotation="10 MB", retention="7 days", level="DEBUG")
15
-
16
- # === App ===
17
- app = FastAPI(
18
- title="RAG Pipeline API",
19
- description="""
20
- ## Multimodal AI Assistant: RAG Module
21
-
22
- Endpoints for indexing and querying documents using:
23
- - **Groq** (LLM inference: fast and free)
24
- - **Sentence Transformers** (local embeddings)
25
- - **ChromaDB** (persistent vector store)
26
- - **MLflow** (experiment tracking)
27
-
28
- ### Supported document types
29
- PDF, TXT, Markdown, DOCX, JSON, JSONL, URL
30
- """,
31
- version="1.0.0",
32
- docs_url="/docs",
33
- redoc_url="/redoc",
34
- )
35
-
36
- # === CORS ===
37
- app.add_middleware(
38
- CORSMiddleware,
39
- allow_origins=["*"], # Replace with specific domains in production
40
- allow_credentials=True,
41
- allow_methods=["*"],
42
- allow_headers=["*"],
43
- )
44
-
45
- # === Routes ===
46
- app.include_router(router, prefix="/api/v1")
47
-
48
-
49
- @app.on_event("startup")
50
- async def startup():
51
- logger.info("RAG Pipeline API starting up...")
52
- logger.info(f"Docs: http://{settings.api_host}:{settings.api_port}/docs")
53
-
54
-
55
- @app.on_event("shutdown")
56
- async def shutdown():
57
- logger.info("RAG Pipeline API shutting down.")
rag/src/api/routes.py DELETED
@@ -1,137 +0,0 @@
1
- from fastapi import APIRouter, HTTPException, UploadFile, File
2
- from fastapi.responses import StreamingResponse
3
- from loguru import logger
4
- import tempfile
5
- import shutil
6
- from pathlib import Path
7
-
8
- from .schemas import (
9
- IngestRequest, IngestResponse,
10
- QueryRequest, QueryResponse,
11
- SummarizeRequest, SummarizeResponse,
12
- StatsResponse, DeleteResponse,
13
- )
14
- from ..retrieval.retriever import RAGRetriever
15
-
16
- router = APIRouter()
17
-
18
- # Singleton retriever, created once on the first call to get_retriever()
19
- _retriever: RAGRetriever = None
20
-
21
-
22
- def get_retriever() -> RAGRetriever:
23
- global _retriever
24
- if _retriever is None:
25
- _retriever = RAGRetriever()
26
- return _retriever
27
-
28
-
29
- # === HEALTH ===
30
-
31
- @router.get("/health", tags=["system"])
32
- async def health_check():
33
- return {"status": "ok", "service": "RAG Pipeline API"}
34
-
35
-
36
- # === STATS ===
37
-
38
- @router.get("/stats", response_model=StatsResponse, tags=["system"])
39
- async def get_stats():
40
- """Return information about the current vector store."""
41
- return get_retriever().get_stats()
42
-
43
-
44
- # === INGEST ===
45
-
46
- @router.post("/ingest", response_model=IngestResponse, tags=["indexing"])
47
- async def ingest_documents(request: IngestRequest):
48
- """
49
- Index documents from file paths or URLs into the vector store.
50
- Supports: PDF, TXT, MD, DOCX, JSON, JSONL, URL
51
- """
52
- try:
53
- stats = get_retriever().ingest(request.sources)
54
- return IngestResponse(status="success", **stats)
55
- except Exception as e:
56
- logger.error(f"Ingest error: {e}")
57
- raise HTTPException(status_code=500, detail=str(e))
58
-
59
-
60
- @router.post("/ingest/upload", tags=["indexing"])
61
- async def ingest_upload(file: UploadFile = File(...)):
62
- """Upload a file and index it directly via multipart."""
63
- allowed_exts = {".pdf", ".txt", ".md", ".docx", ".json", ".jsonl"}
64
- ext = Path(file.filename).suffix.lower()
65
-
66
- if ext not in allowed_exts:
67
- raise HTTPException(
68
- status_code=400,
69
- detail=f"Ekstensi '{ext}' tidak didukung. Gunakan: {allowed_exts}"
70
- )
71
-
72
- # Save the upload to a temporary file
73
- with tempfile.NamedTemporaryFile(delete=False, suffix=ext) as tmp:
74
- shutil.copyfileobj(file.file, tmp)
75
- tmp_path = tmp.name
76
-
77
- try:
78
- stats = get_retriever().ingest([tmp_path])
79
- return IngestResponse(status="success", **stats)
80
- except Exception as e:
81
- raise HTTPException(status_code=500, detail=str(e))
82
- finally:
83
- Path(tmp_path).unlink(missing_ok=True)
84
-
85
-
86
- # === QUERY ===
87
-
88
- @router.post("/query", response_model=QueryResponse, tags=["querying"])
89
- async def query(request: QueryRequest):
90
- """
91
- Question answering over the documents that have been indexed.
92
- Supports multi-turn conversation via chat_history.
93
- """
94
- if request.stream:
95
- # Streaming response
96
- def generate():
97
- yield from get_retriever().stream_query(
98
- question=request.question,
99
- chat_history=[m.model_dump() for m in (request.chat_history or [])],
100
- )
101
- return StreamingResponse(generate(), media_type="text/event-stream")
102
-
103
- try:
104
- result = get_retriever().query(
105
- question=request.question,
106
- chat_history=[m.model_dump() for m in (request.chat_history or [])],
107
- top_k=request.top_k,
108
- return_sources=request.return_sources,
109
- )
110
- return QueryResponse(**result)
111
- except Exception as e:
112
- logger.error(f"Query error: {e}")
113
- raise HTTPException(status_code=500, detail=str(e))
114
-
115
-
116
- # === SUMMARIZE ===
117
-
118
- @router.post("/summarize", response_model=SummarizeResponse, tags=["querying"])
119
- async def summarize(request: SummarizeRequest):
120
- """Generate an automatic summary of a document."""
121
- try:
122
- summary = get_retriever().summarize(request.source)
123
- return SummarizeResponse(summary=summary, source=request.source)
124
- except Exception as e:
125
- raise HTTPException(status_code=500, detail=str(e))
126
-
127
-
128
- # === DELETE ===
129
-
130
- @router.delete("/collection", response_model=DeleteResponse, tags=["system"])
131
- async def delete_collection():
132
- """Delete all documents from the vector store. WARNING: this cannot be undone."""
133
- get_retriever().vector_store.delete_collection()
134
- return DeleteResponse(
135
- status="success",
136
- message="Semua dokumen berhasil dihapus dari vector store."
137
- )
rag/src/api/schemas.py DELETED
@@ -1,67 +0,0 @@
1
- from pydantic import BaseModel, Field
2
- from typing import List, Optional
3
-
4
-
5
- class IngestRequest(BaseModel):
6
- sources: List[str] = Field(
7
- ...,
8
- description="List of file paths atau URLs untuk di-index",
9
- example=["./docs/laporan.pdf", "https://example.com/artikel"],
10
- )
11
-
12
-
13
- class IngestResponse(BaseModel):
14
- status: str
15
- documents_loaded: int
16
- chunks_indexed: int
17
- total_docs_in_store: int
18
- elapsed_seconds: float
19
- sources: List[str]
20
-
21
-
22
- class ChatMessage(BaseModel):
23
- role: str = Field(..., pattern="^(user|assistant)$")
24
- content: str
25
-
26
-
27
- class QueryRequest(BaseModel):
28
- question: str = Field(..., min_length=1, max_length=2000)
29
- chat_history: Optional[List[ChatMessage]] = Field(default=[], description="Riwayat chat untuk multi-turn")
30
- top_k: Optional[int] = Field(default=None, ge=1, le=20)
31
- return_sources: bool = Field(default=True)
32
- stream: bool = Field(default=False, description="Gunakan streaming response")
33
-
34
-
35
- class SourceChunk(BaseModel):
36
- content: str
37
- metadata: dict
38
- relevance_score: float
39
-
40
-
41
- class QueryResponse(BaseModel):
42
- answer: str
43
- question: str
44
- latency_seconds: float
45
- chunks_retrieved: int
46
- sources: Optional[List[SourceChunk]] = None
47
-
48
-
49
- class SummarizeRequest(BaseModel):
50
- source: str = Field(..., description="File path atau URL untuk diringkas")
51
-
52
-
53
- class SummarizeResponse(BaseModel):
54
- summary: str
55
- source: str
56
-
57
-
58
- class StatsResponse(BaseModel):
59
- total_chunks: int
60
- collection_name: str
61
- embedding_model: str
62
- llm_model: str
63
-
64
-
65
- class DeleteResponse(BaseModel):
66
- status: str
67
- message: str
rag/src/config.py DELETED
@@ -1,56 +0,0 @@
1
- from pydantic_settings import BaseSettings
2
- from pydantic import Field
3
- from functools import lru_cache
4
- from pathlib import Path
5
-
6
-
7
- class Settings(BaseSettings):
8
- # LLM
9
- groq_api_key: str = Field(..., env="GROQ_API_KEY")
10
- groq_model: str = Field("llama-3.3-70b-versatile", env="GROQ_MODEL") # updated from 3.1
11
-
12
- # Embeddings
13
- embedding_model: str = Field("all-MiniLM-L6-v2", env="EMBEDDING_MODEL")
14
- embedding_device: str = Field("cpu", env="EMBEDDING_DEVICE")
15
-
16
- # Vector Store
17
- chroma_persist_dir: str = Field("./chroma_db", env="CHROMA_PERSIST_DIR")
18
- chroma_collection_name: str = Field("rag_documents", env="CHROMA_COLLECTION_NAME")
19
-
20
- # RAG Settings
21
- chunk_size: int = Field(1000, env="CHUNK_SIZE")
22
- chunk_overlap: int = Field(200, env="CHUNK_OVERLAP")
23
- top_k_retrieval: int = Field(5, env="TOP_K_RETRIEVAL")
24
- max_tokens: int = Field(2048, env="MAX_TOKENS")
25
- temperature: float = Field(0.1, env="TEMPERATURE")
26
-
27
- # API
28
- api_host: str = Field("0.0.0.0", env="API_HOST")
29
- api_port: int = Field(8000, env="API_PORT")
30
- api_reload: bool = Field(True, env="API_RELOAD")
31
-
32
- # MLflow
33
- mlflow_tracking_uri: str = Field("./mlruns", env="MLFLOW_TRACKING_URI")
34
- mlflow_experiment_name: str = Field("rag_pipeline", env="MLFLOW_EXPERIMENT_NAME")
35
-
36
- # Logging
37
- log_level: str = Field("INFO", env="LOG_LEVEL")
38
- log_file: str = Field("./logs/app.log", env="LOG_FILE")
39
-
40
- class Config:
41
- env_file = ".env"
42
- env_file_encoding = "utf-8"
43
-
44
- def ensure_dirs(self):
45
- """Create the required directories if they do not exist yet."""
46
- Path(self.chroma_persist_dir).mkdir(parents=True, exist_ok=True)
47
- Path(self.log_file).parent.mkdir(parents=True, exist_ok=True)
48
- Path(self.mlflow_tracking_uri).mkdir(parents=True, exist_ok=True)
49
-
50
-
51
- @lru_cache()
52
- def get_settings() -> Settings:
53
- """Singleton settings, cached so they are not re-parsed on every request."""
54
- settings = Settings()
55
- settings.ensure_dirs()
56
- return settings
rag/src/embeddings/__init__.py DELETED
@@ -1 +0,0 @@
1
-
 
 
rag/src/embeddings/embedder.py DELETED
@@ -1,60 +0,0 @@
1
- from typing import List
2
- from loguru import logger
3
- from langchain.text_splitter import RecursiveCharacterTextSplitter
4
- from langchain_community.embeddings import HuggingFaceEmbeddings
5
-
6
- from ..config import get_settings
7
- from ..loaders.base_loader import Document
8
-
9
-
10
- class DocumentEmbedder:
11
- """
12
- Bertanggung jawab untuk:
13
- 1. Chunking dokumen panjang jadi potongan yang bisa di-embed
14
- 2. Membuat embedding vektor pakai model lokal (no API cost!)
15
- """
16
-
17
- def __init__(self):
18
- settings = get_settings()
19
- logger.info(f"Loading embedding model: {settings.embedding_model}")
20
-
21
- self.embeddings = HuggingFaceEmbeddings(
22
- model_name=settings.embedding_model,
23
- model_kwargs={"device": settings.embedding_device},
24
- encode_kwargs={"normalize_embeddings": True},
25
- )
26
-
27
- self.splitter = RecursiveCharacterTextSplitter(
28
- chunk_size=settings.chunk_size,
29
- chunk_overlap=settings.chunk_overlap,
30
- separators=["\n\n", "\n", ". ", " ", ""],
31
- )
32
-
33
- logger.info("Embedder ready.")
34
-
35
- def chunk_documents(self, documents: List[Document]) -> List[Document]:
36
- """
37
- Split dokumen panjang jadi chunks.
38
- Metadata dari dokumen asli diwarisi ke setiap chunk.
39
- """
40
- chunks = []
41
- for doc in documents:
42
- texts = self.splitter.split_text(doc.content)
43
- for i, text in enumerate(texts):
44
- chunk_metadata = {
45
- **doc.metadata,
46
- "chunk_index": i,
47
- "total_chunks": len(texts),
48
- "parent_doc_id": doc.doc_id,
49
- }
50
- chunks.append(Document(
51
- content=text,
52
- metadata=chunk_metadata,
53
- ))
54
-
55
- logger.info(f"Chunked {len(documents)} docs → {len(chunks)} chunks")
56
- return chunks
57
-
58
- def get_embeddings_model(self):
59
- """Return the LangChain-compatible embeddings object for ChromaDB."""
60
- return self.embeddings
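Reviewer note: the removed `chunk_documents` relied on LangChain's `RecursiveCharacterTextSplitter` with `chunk_size` and `chunk_overlap`. To make the effect of those two settings concrete, here is a deliberately simplified sketch that uses a fixed-size sliding window instead of the splitter's separator-aware logic. `chunk_text` and `chunk_with_metadata` are hypothetical names, and plain dicts stand in for the `Document` dataclass.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap.

    Simplification: cuts at fixed character offsets, unlike the separator-aware
    splitter used in the removed code.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def chunk_with_metadata(doc_id, text, metadata, **kwargs):
    """Attach the same per-chunk metadata keys the removed code produced."""
    texts = chunk_text(text, **kwargs)
    return [
        {
            "content": piece,
            "metadata": {
                **metadata,
                "chunk_index": index,
                "total_chunks": len(texts),
                "parent_doc_id": doc_id,
            },
        }
        for index, piece in enumerate(texts)
    ]
```

With `chunk_size=4` and `chunk_overlap=2`, each window starts two characters after the previous one, so neighbouring chunks share two characters of context.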
rag/src/llm/__init__.py DELETED
@@ -1 +0,0 @@
1
-
 
 
rag/src/llm/groq_client.py DELETED
@@ -1,62 +0,0 @@
1
- from typing import List, Optional, Iterator
2
- from loguru import logger
3
- from langchain_groq import ChatGroq
4
- from langchain.schema import BaseMessage, HumanMessage, AIMessage
5
- from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
6
-
7
- from ..config import get_settings
8
-
9
-
10
- class GroqClient:
11
- """
12
- Wrapper around LangChain's ChatGroq.
13
- Supports regular calls and streaming.
14
- """
15
-
16
- def __init__(self, streaming: bool = False):
17
- settings = get_settings()
18
- callbacks = [StreamingStdOutCallbackHandler()] if streaming else []
19
-
20
- self.llm = ChatGroq(
21
- api_key=settings.groq_api_key,
22
- model_name=settings.groq_model,
23
- temperature=settings.temperature,
24
- max_tokens=settings.max_tokens,
25
- streaming=streaming,
26
- callbacks=callbacks,
27
- )
28
- self.model_name = settings.groq_model
29
- logger.info(f"Groq client initialized. Model: {settings.groq_model}")
30
-
31
- def invoke(self, messages: List[BaseMessage]) -> str:
32
- """Send the messages and return the response string."""
33
- response = self.llm.invoke(messages)
34
- return response.content
35
-
36
- def stream(self, messages: List[BaseMessage]) -> Iterator[str]:
37
- """Stream the response, yielding it token by token."""
38
- for chunk in self.llm.stream(messages):
39
- if chunk.content:
40
- yield chunk.content
41
-
42
- def get_langchain_llm(self):
43
- """Return the raw LangChain LLM object for use in a chain."""
44
- return self.llm
45
-
46
- @staticmethod
47
- def build_messages(
48
- question: str,
49
- chat_history: Optional[List[dict]] = None,
50
- ) -> List[BaseMessage]:
51
- """
52
- Convert the chat history format to LangChain messages.
53
- chat_history format: [{"role": "user"/"assistant", "content": "..."}]
54
- """
55
- messages = []
56
- for msg in (chat_history or []):
57
- if msg["role"] == "user":
58
- messages.append(HumanMessage(content=msg["content"]))
59
- elif msg["role"] == "assistant":
60
- messages.append(AIMessage(content=msg["content"]))
61
- messages.append(HumanMessage(content=question))
62
- return messages
rag/src/llm/prompt_templates.py DELETED
@@ -1,36 +0,0 @@
1
- from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
2
-
3
- # === RAG QA Prompt ===
4
- RAG_PROMPT = ChatPromptTemplate.from_messages([
5
- ("system", """Kamu adalah AI assistant yang menjawab pertanyaan berdasarkan konteks dokumen yang diberikan.
6
-
7
- ATURAN:
8
- - Jawab HANYA berdasarkan konteks yang disediakan
9
- - Jika jawaban tidak ada di konteks, katakan "Informasi ini tidak tersedia dalam dokumen yang diberikan"
10
- - Selalu sebutkan sumber dokumen jika relevan
11
- - Jawab dalam bahasa yang sama dengan pertanyaan pengguna
12
- - Berikan jawaban yang ringkas, akurat, dan terstruktur
13
-
14
- KONTEKS DOKUMEN:
15
- {context}
16
- """),
17
- MessagesPlaceholder(variable_name="chat_history", optional=True),
18
- ("human", "{question}"),
19
- ])
20
-
21
- # === Standalone Question Prompt (for rephrasing follow-up questions) ===
22
- CONDENSE_QUESTION_PROMPT = ChatPromptTemplate.from_messages([
23
- ("system", """Diberikan riwayat percakapan dan pertanyaan terbaru dari pengguna,
24
- reformulasikan pertanyaan menjadi pertanyaan mandiri yang bisa dipahami tanpa konteks percakapan sebelumnya.
25
- Jangan jawab pertanyaannya, cukup reformulasikan jika perlu. Jika tidak perlu, kembalikan apa adanya."""),
26
- MessagesPlaceholder(variable_name="chat_history"),
27
- ("human", "{question}"),
28
- ])
29
-
30
- # === Summary Prompt ===
31
- SUMMARY_PROMPT = ChatPromptTemplate.from_messages([
32
- ("system", """Buat ringkasan dari dokumen berikut.
33
- Sertakan: poin-poin utama, informasi kunci, dan kesimpulan.
34
- Format: gunakan bullet points untuk keterbacaan yang baik."""),
35
- ("human", "Dokumen:\n{document}\n\nBuat ringkasan:"),
36
- ])
rag/src/loaders/__init__.py DELETED
@@ -1,69 +0,0 @@
1
- from typing import List
2
- from pathlib import Path
3
- from loguru import logger
4
-
5
- from .base_loader import BaseLoader, Document
6
- from .pdf_loader import PDFLoader
7
- from .text_loader import TextLoader
8
- from .docx_loader import DocxLoader
9
- from .web_loader import WebLoader
10
- from .json_loader import JSONLoader
11
-
12
-
13
- class LoaderFactory:
14
- """
15
- Auto-detects the right loader based on the file extension or URL.
16
- Pattern: Factory Method. The client does not need to know which loader is used.
17
- """
18
-
19
- _loaders: dict[str, BaseLoader] = {
20
- ".pdf": PDFLoader(),
21
- ".txt": TextLoader(),
22
- ".md": TextLoader(),
23
- ".markdown": TextLoader(),
24
- ".docx": DocxLoader(),
25
- ".doc": DocxLoader(),
26
- ".json": JSONLoader(),
27
- ".jsonl": JSONLoader(),
28
- }
29
-
30
- @classmethod
31
- def get_loader(cls, source: str) -> BaseLoader:
32
- """Pick the loader that matches the source."""
33
- # URL
34
- if source.startswith(("http://", "https://")):
35
- return WebLoader()
36
-
37
- # File
38
- ext = Path(source).suffix.lower()
39
- loader = cls._loaders.get(ext)
40
- if loader is None:
41
- raise ValueError(
42
- f"Tidak ada loader untuk ekstensi '{ext}'. "
43
- f"Didukung: {list(cls._loaders.keys())} + URL"
44
- )
45
- return loader
46
-
47
- @classmethod
48
- def load(cls, source: str) -> List[Document]:
49
- """One-liner: auto-detect loader dan langsung load."""
50
- loader = cls.get_loader(source)
51
- logger.info(f"Using {loader.__class__.__name__} for: {source}")
52
- return loader.load(source)
53
-
54
- @classmethod
55
- def load_many(cls, sources: List[str]) -> List[Document]:
56
- """Load multiple sources sekaligus."""
57
- all_docs = []
58
- for source in sources:
59
- try:
60
- docs = cls.load(source)
61
- all_docs.extend(docs)
62
- logger.info(f"Loaded {len(docs)} docs from {source}")
63
- except Exception as e:
64
- logger.error(f"Gagal load {source}: {e}")
65
- logger.info(f"Total loaded: {len(all_docs)} documents")
66
- return all_docs
67
-
68
-
69
- __all__ = ["LoaderFactory", "Document", "BaseLoader"]
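Reviewer note: the dispatch rule in the removed `LoaderFactory` (URL scheme first, then lowercased file extension) can be sketched on its own as below. This is only an illustration under assumptions: `load_text`, `load_json`, `load_web`, and `LOADERS` are hypothetical stand-ins that return plain strings, not the repository's loader classes.

```python
from pathlib import Path
from typing import Callable, Dict, List


# Hypothetical stand-ins for the real loader classes; each returns a list of strings.
def load_text(source: str) -> List[str]:
    return [f"text:{source}"]


def load_json(source: str) -> List[str]:
    return [f"json:{source}"]


def load_web(source: str) -> List[str]:
    return [f"web:{source}"]


# Extension -> loader table (assumed subset of the extensions in the diff).
LOADERS: Dict[str, Callable[[str], List[str]]] = {
    ".txt": load_text,
    ".md": load_text,
    ".json": load_json,
}


def get_loader(source: str) -> Callable[[str], List[str]]:
    """Pick a loader by URL scheme first, then by lowercased file extension."""
    if source.startswith(("http://", "https://")):
        return load_web
    ext = Path(source).suffix.lower()
    try:
        return LOADERS[ext]
    except KeyError:
        raise ValueError(f"No loader for extension '{ext}'") from None
```

Checking the URL prefix before the extension matters: a URL ending in `.json` is still routed to the web loader, which matches the order used in the removed code.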
rag/src/loaders/base_loader.py DELETED
@@ -1,39 +0,0 @@
-from abc import ABC, abstractmethod
-from dataclasses import dataclass, field
-from typing import List, Optional
-from pathlib import Path
-
-
-@dataclass
-class Document:
-    """A single loaded document or chunk."""
-    content: str
-    metadata: dict = field(default_factory=dict)
-    doc_id: Optional[str] = None
-
-    def __post_init__(self):
-        if self.doc_id is None:
-            import hashlib
-            self.doc_id = hashlib.md5(self.content.encode()).hexdigest()[:12]
-
-
-class BaseLoader(ABC):
-    """Abstract base class for all document loaders."""
-
-    @abstractmethod
-    def load(self, source: str) -> List[Document]:
-        """
-        Load documents from a source (file path or URL).
-        Returns a list of Document objects.
-        """
-        pass
-
-    def validate_source(self, source: str) -> bool:
-        """Check whether this loader can handle the source."""
-        return True
-
-    @property
-    @abstractmethod
-    def supported_extensions(self) -> List[str]:
-        """List of file extensions this loader supports."""
-        pass
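Reviewer note: the removed `Document` dataclass derives `doc_id` from the content alone. The sketch below reproduces that behavior with a hypothetical class name (`Doc`) to show one consequence: two chunks with identical text but different metadata get the same id.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Doc:  # hypothetical name; mirrors the fields of the removed Document dataclass
    content: str
    metadata: dict = field(default_factory=dict)
    doc_id: Optional[str] = None

    def __post_init__(self) -> None:
        # Derive a short id from the content only; metadata is not hashed.
        if self.doc_id is None:
            self.doc_id = hashlib.md5(self.content.encode()).hexdigest()[:12]


a = Doc("hello", {"page": 1})
b = Doc("hello", {"page": 2})
same_id = a.doc_id == b.doc_id  # True: identical content yields identical ids
```

If ids are later used as keys in a store, identical text on two pages would collide; hashing the content together with the source and page would be one way to avoid that, though the diff does not say whether it matters here.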
rag/src/loaders/docx_loader.py DELETED
@@ -1,52 +0,0 @@
1
- from typing import List
2
- from pathlib import Path
3
- from loguru import logger
4
-
5
- from .base_loader import BaseLoader, Document
6
-
7
-
8
- class DocxLoader(BaseLoader):
9
- """Loader untuk file .docx menggunakan python-docx."""
10
-
11
- @property
12
- def supported_extensions(self) -> List[str]:
13
- return [".docx", ".doc"]
14
-
15
- def load(self, source: str) -> List[Document]:
16
- try:
17
- from docx import Document as DocxDocument
18
- except ImportError:
19
- raise ImportError("Install python-docx: pip install python-docx")
20
-
21
- path = Path(source)
22
- if not path.exists():
23
- raise FileNotFoundError(f"File tidak ditemukan: {source}")
24
-
25
- logger.info(f"Loading DOCX: {path.name}")
26
- doc = DocxDocument(str(path))
27
-
28
- # Ambil semua paragraf yang tidak kosong
29
- paragraphs = [p.text.strip() for p in doc.paragraphs if p.text.strip()]
30
- content = "\n\n".join(paragraphs)
31
-
32
- # Ambil teks dari tabel juga
33
- table_texts = []
34
- for table in doc.tables:
35
- for row in table.rows:
36
- row_text = " | ".join(cell.text.strip() for cell in row.cells if cell.text.strip())
37
- if row_text:
38
- table_texts.append(row_text)
39
-
40
- if table_texts:
41
- content += "\n\n[Tables]\n" + "\n".join(table_texts)
42
-
43
- return [Document(
44
- content=content,
45
- metadata={
46
- "source": str(path),
47
- "filename": path.name,
48
- "type": "docx",
49
- "paragraphs": len(paragraphs),
50
- "tables": len(doc.tables),
51
- }
52
- )]
rag/src/loaders/json_loader.py DELETED
@@ -1,103 +0,0 @@
1
- import json
2
- from typing import List
3
- from pathlib import Path
4
- from loguru import logger
5
-
6
- from .base_loader import BaseLoader, Document
7
-
8
-
9
- class JSONLoader(BaseLoader):
10
- """
11
- Loader untuk file JSON.
12
- Bisa flatten nested JSON menjadi teks untuk di-embed.
13
- """
14
-
15
- def __init__(self, text_key: str = None, jq_schema: str = None):
16
- """
17
- text_key: specific key that holds the main content (e.g. 'content', 'text')
18
- jq_schema: optional; filter the JSON with a jq-style path
19
- """
20
- self.text_key = text_key
21
- self.jq_schema = jq_schema
22
-
23
- @property
24
- def supported_extensions(self) -> List[str]:
25
- return [".json", ".jsonl"]
26
-
27
- def load(self, source: str) -> List[Document]:
28
- path = Path(source)
29
- if not path.exists():
30
- raise FileNotFoundError(f"File tidak ditemukan: {source}")
31
-
32
- logger.info(f"Loading JSON: {path.name}")
33
-
34
- # Handle JSONL (JSON Lines)
35
- if path.suffix == ".jsonl":
36
- return self._load_jsonl(path)
37
-
38
- with open(path, "r", encoding="utf-8") as f:
39
- data = json.load(f)
40
-
41
- # Jika list of records
42
- if isinstance(data, list):
43
- documents = []
44
- for i, record in enumerate(data):
45
- content = self._extract_content(record)
46
- documents.append(Document(
47
- content=content,
48
- metadata={
49
- "source": str(path),
50
- "filename": path.name,
51
- "type": "json",
52
- "record_index": i,
53
- }
54
- ))
55
- return documents
56
-
57
- # Single object
58
- content = self._extract_content(data)
59
- return [Document(
60
- content=content,
61
- metadata={
62
- "source": str(path),
63
- "filename": path.name,
64
- "type": "json",
65
- }
66
- )]
67
-
68
- def _load_jsonl(self, path: Path) -> List[Document]:
69
- documents = []
70
- with open(path, "r", encoding="utf-8") as f:
71
- for i, line in enumerate(f):
72
- line = line.strip()
73
- if not line:
74
- continue
75
- record = json.loads(line)
76
- content = self._extract_content(record)
77
- documents.append(Document(
78
- content=content,
79
- metadata={
80
- "source": str(path),
81
- "filename": path.name,
82
- "type": "jsonl",
83
- "line": i + 1,
84
- }
85
- ))
86
- return documents
87
-
88
- def _extract_content(self, data: dict) -> str:
89
- """Konversi dict/list ke string yang bisa di-embed."""
90
- if self.text_key and isinstance(data, dict) and self.text_key in data:
91
- return str(data[self.text_key])
92
-
93
- # Fallback: flatten semua key-value pair
94
- if isinstance(data, dict):
95
- parts = []
96
- for k, v in data.items():
97
- if isinstance(v, (str, int, float, bool)):
98
- parts.append(f"{k}: {v}")
99
- elif isinstance(v, (list, dict)):
100
- parts.append(f"{k}: {json.dumps(v, ensure_ascii=False)}")
101
- return "\n".join(parts)
102
-
103
- return json.dumps(data, ensure_ascii=False, indent=2)
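Reviewer note: the flattening rule in the removed `_extract_content` can be exercised in isolation. The function below is a standalone sketch of the same logic (written as a free function with a `text_key` argument, which is an adaptation rather than the original method signature).

```python
import json
from typing import Any, Optional


def extract_content(data: Any, text_key: Optional[str] = None) -> str:
    """Turn a JSON record into embeddable text, following the removed loader's rules."""
    # 1. If a preferred key is configured and present, use only that value.
    if text_key and isinstance(data, dict) and text_key in data:
        return str(data[text_key])

    # 2. Otherwise flatten top-level key/value pairs, serializing nested values.
    if isinstance(data, dict):
        parts = []
        for k, v in data.items():
            if isinstance(v, (str, int, float, bool)):
                parts.append(f"{k}: {v}")
            elif isinstance(v, (list, dict)):
                parts.append(f"{k}: {json.dumps(v, ensure_ascii=False)}")
        return "\n".join(parts)

    # 3. Non-dict records (lists, scalars) are dumped as JSON text.
    return json.dumps(data, ensure_ascii=False, indent=2)
```

Note that values of any other type, `None` in particular, match neither branch and are silently dropped from the flattened text.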
rag/src/loaders/pdf_loader.py DELETED
@@ -1,46 +0,0 @@
1
- from typing import List
2
- from pathlib import Path
3
- from loguru import logger
4
-
5
- from .base_loader import BaseLoader, Document
6
-
7
-
8
- class PDFLoader(BaseLoader):
9
- """Loader untuk file PDF menggunakan pypdf."""
10
-
11
- @property
12
- def supported_extensions(self) -> List[str]:
13
- return [".pdf"]
14
-
15
- def load(self, source: str) -> List[Document]:
16
- try:
17
- from pypdf import PdfReader
18
- except ImportError:
19
- raise ImportError("Install pypdf: pip install pypdf")
20
-
21
- path = Path(source)
22
- if not path.exists():
23
- raise FileNotFoundError(f"File tidak ditemukan: {source}")
24
-
25
- logger.info(f"Loading PDF: {path.name}")
26
- reader = PdfReader(str(path))
27
- documents = []
28
-
29
- for i, page in enumerate(reader.pages):
30
- text = page.extract_text()
31
- if not text or not text.strip():
32
- continue
33
-
34
- documents.append(Document(
35
- content=text.strip(),
36
- metadata={
37
- "source": str(path),
38
- "filename": path.name,
39
- "page": i + 1,
40
- "total_pages": len(reader.pages),
41
- "type": "pdf",
42
- }
43
- ))
44
-
45
- logger.info(f"Loaded {len(documents)} pages from {path.name}")
46
- return documents
rag/src/loaders/text_loader.py DELETED
@@ -1,31 +0,0 @@
-from typing import List
-from pathlib import Path
-from loguru import logger
-
-from .base_loader import BaseLoader, Document
-
-
-class TextLoader(BaseLoader):
-    """Loader for .txt and .md files."""
-
-    @property
-    def supported_extensions(self) -> List[str]:
-        return [".txt", ".md", ".markdown"]
-
-    def load(self, source: str) -> List[Document]:
-        path = Path(source)
-        if not path.exists():
-            raise FileNotFoundError(f"File tidak ditemukan: {source}")
-
-        logger.info(f"Loading text file: {path.name}")
-        content = path.read_text(encoding="utf-8")
-
-        return [Document(
-            content=content,
-            metadata={
-                "source": str(path),
-                "filename": path.name,
-                "type": path.suffix.lstrip("."),
-                "size_chars": len(content),
-            }
-        )]
rag/src/loaders/web_loader.py DELETED
@@ -1,57 +0,0 @@
1
- from typing import List
2
- from loguru import logger
3
-
4
- from .base_loader import BaseLoader, Document
5
-
6
-
7
- class WebLoader(BaseLoader):
8
- """Loader for URLs: scrapes the text content of a web page."""
9
-
10
- @property
11
- def supported_extensions(self) -> List[str]:
12
- return [] # Tidak berbasis ekstensi, berbasis URL
13
-
14
- def validate_source(self, source: str) -> bool:
15
- return source.startswith(("http://", "https://"))
16
-
17
- def load(self, source: str) -> List[Document]:
18
- try:
19
- import requests
20
- from bs4 import BeautifulSoup
21
- except ImportError:
22
- raise ImportError("Install: pip install requests beautifulsoup4")
23
-
24
- logger.info(f"Fetching URL: {source}")
25
- headers = {"User-Agent": "Mozilla/5.0 (compatible; RAG-Pipeline/1.0)"}
26
-
27
- response = requests.get(source, headers=headers, timeout=15)
28
- response.raise_for_status()
29
-
30
- soup = BeautifulSoup(response.text, "html.parser")
31
-
32
- # Hapus tag yang tidak relevan
33
- for tag in soup(["script", "style", "nav", "footer", "header", "aside"]):
34
- tag.decompose()
35
-
36
- # Ambil judul
37
- title = soup.find("title")
38
- title_text = title.get_text(strip=True) if title else ""
39
-
40
- # Ambil konten utama
41
- main = soup.find("main") or soup.find("article") or soup.find("body")
42
- content = main.get_text(separator="\n", strip=True) if main else soup.get_text(separator="\n", strip=True)
43
-
44
- # Bersihkan baris kosong berulang
45
- lines = [line for line in content.splitlines() if line.strip()]
46
- content = "\n".join(lines)
47
-
48
- return [Document(
49
- content=content,
50
- metadata={
51
- "source": source,
52
- "title": title_text,
53
- "type": "web",
54
- "status_code": response.status_code,
55
- "content_length": len(content),
56
- }
57
- )]
rag/src/retrieval/__init__.py DELETED
@@ -1 +0,0 @@
1
-
rag/src/retrieval/retriever.py DELETED
@@ -1,211 +0,0 @@
1
- from typing import List, Optional, Iterator
2
- from loguru import logger
3
- import mlflow
4
- import time
5
-
6
- from langchain.schema import HumanMessage, AIMessage
7
-
8
- from ..config import get_settings
9
- from ..retrieval.vector_store import VectorStore
10
- from ..llm.groq_client import GroqClient
11
- from ..llm.prompt_templates import RAG_PROMPT, SUMMARY_PROMPT
12
- from ..loaders import LoaderFactory, Document
13
-
14
-
15
- class RAGRetriever:
16
- """
17
- Core class that ties together all RAG components:
18
- Document Loading → Chunking → Embedding → Retrieval → Generation
19
- """
20
-
21
- def __init__(self):
22
- self.settings = get_settings()
23
- self.vector_store = VectorStore()
24
- self.groq = GroqClient()
25
- self._setup_mlflow()
26
- logger.info("RAGRetriever initialized.")
27
-
28
- def _setup_mlflow(self):
29
- mlflow.set_tracking_uri(self.settings.mlflow_tracking_uri)
30
- mlflow.set_experiment(self.settings.mlflow_experiment_name)
31
-
32
- # === INDEXING ===
33
-
34
- def ingest(self, sources: List[str]) -> dict:
35
- """
36
- Load, chunk, embed, dan index dokumen dari berbagai sources.
37
-
38
- Args:
39
- sources: List of file paths atau URLs
40
-
41
- Returns:
42
- dict berisi stats indexing
43
- """
44
- logger.info(f"Ingesting {len(sources)} sources...")
45
- start = time.time()
46
-
47
- with mlflow.start_run(run_name="ingest"):
48
- mlflow.log_params({
49
- "sources_count": len(sources),
50
- "chunk_size": self.settings.chunk_size,
51
- "chunk_overlap": self.settings.chunk_overlap,
52
- "embedding_model": self.settings.embedding_model,
53
- })
54
-
55
- # Load semua dokumen
56
- documents = LoaderFactory.load_many(sources)
57
-
58
- # Index ke vector store
59
- chunks_indexed = self.vector_store.add_documents(documents)
60
-
61
- elapsed = time.time() - start
62
- stats = {
63
- "documents_loaded": len(documents),
64
- "chunks_indexed": chunks_indexed,
65
- "sources": sources,
66
- "elapsed_seconds": round(elapsed, 2),
67
- "total_docs_in_store": self.vector_store.count(),
68
- }
69
-
70
- mlflow.log_metrics({
71
- "documents_loaded": len(documents),
72
- "chunks_indexed": chunks_indexed,
73
- "elapsed_seconds": elapsed,
74
- })
75
-
76
- logger.info(f"Ingestion selesai: {stats}")
77
- return stats
78
-
79
- # === QUERYING ===
80
-
81
- def query(
82
- self,
83
- question: str,
84
- chat_history: Optional[List[dict]] = None,
85
- top_k: Optional[int] = None,
86
- return_sources: bool = True,
87
- ) -> dict:
88
- """
89
- Jawab pertanyaan menggunakan RAG.
90
-
91
- Args:
92
- question: Pertanyaan user
93
- chat_history: Riwayat chat [{"role": "user"/"assistant", "content": "..."}]
94
- top_k: Jumlah chunks yang diretrieve
95
- return_sources: Sertakan source chunks di response
96
-
97
- Returns:
98
- dict dengan 'answer', 'sources', dan 'metadata'
99
- """
100
- start = time.time()
101
- logger.info(f"Query: '{question[:80]}...'")
102
-
103
- with mlflow.start_run(run_name="query"):
104
- mlflow.log_param("question", question[:250])
105
- mlflow.log_param("model", self.settings.groq_model)
106
-
107
- # Retrieve relevant chunks
108
- k = top_k or self.settings.top_k_retrieval
109
-
110
- # Guard: jika store kosong, langsung jawab tanpa retrieval
111
- store_count = self.vector_store.count()
112
- if store_count == 0:
113
- retrieved = []
114
- else:
115
- try:
116
- retrieved = self.vector_store.similarity_search_with_score(question, k=k)
117
- except Exception as e:
118
- logger.warning(f"Retrieval failed (mungkin store kosong): {e}")
119
- retrieved = []
120
-
121
- # Format context
122
- context_parts = []
123
- sources = []
124
- for i, (doc, score) in enumerate(retrieved):
125
- source_info = doc.metadata.get("filename", doc.metadata.get("source", "Unknown"))
126
- page_info = f" (hal. {doc.metadata['page']})" if "page" in doc.metadata else ""
127
- context_parts.append(
128
- f"[Sumber {i+1}: {source_info}{page_info} | Relevansi: {1-score:.2f}]\n{doc.page_content}"
129
- )
130
- sources.append({
131
- "content": doc.page_content[:300] + "..." if len(doc.page_content) > 300 else doc.page_content,
132
- "metadata": doc.metadata,
133
- "relevance_score": round(1 - score, 4),
134
- })
135
-
136
- context = "\n\n---\n\n".join(context_parts) if context_parts else "(Tidak ada dokumen yang di-index. Silakan upload dokumen terlebih dahulu.)"
137
-
138
- # Build chat history with correct message types
139
- history_messages = []
140
- for m in (chat_history or []):
141
- if m["role"] == "user":
142
- history_messages.append(HumanMessage(content=m["content"]))
143
- elif m["role"] == "assistant":
144
- history_messages.append(AIMessage(content=m["content"]))
145
-
146
- # Build prompt dan generate
147
- formatted_prompt = RAG_PROMPT.format_messages(
148
- context=context,
149
- question=question,
150
- chat_history=history_messages,
151
- )
152
-
153
- answer = self.groq.invoke(formatted_prompt)
154
- elapsed = time.time() - start
155
-
156
- mlflow.log_metrics({
157
- "chunks_retrieved": len(retrieved),
158
- "answer_length": len(answer),
159
- "latency_seconds": elapsed,
160
- })
161
-
162
- result = {
163
- "answer": answer,
164
- "question": question,
165
- "latency_seconds": round(elapsed, 2),
166
- "chunks_retrieved": len(retrieved),
167
- }
168
- if return_sources:
169
- result["sources"] = sources
170
-
171
- return result
172
-
173
- def stream_query(
174
- self,
175
- question: str,
176
- chat_history: Optional[List[dict]] = None,
177
- ) -> Iterator[str]:
178
- """Streaming version of query: yields the response token by token."""
179
- retrieved = self.vector_store.similarity_search(question)
180
- context = "\n\n---\n\n".join(
181
- f"[Sumber: {doc.metadata.get('filename', 'Unknown')}]\n{doc.page_content}"
182
- for doc in retrieved
183
- )
184
- formatted = RAG_PROMPT.format_messages(
185
- context=context,
186
- question=question,
187
- chat_history=[],
188
- )
189
- groq_stream = GroqClient(streaming=True)
190
- yield from groq_stream.stream(formatted)
191
-
192
- def summarize(self, source: str) -> str:
193
- """Buat ringkasan dari satu dokumen."""
194
- documents = LoaderFactory.load(source)
195
- full_text = "\n\n".join(doc.content for doc in documents)
196
-
197
- # Truncate jika terlalu panjang
198
- if len(full_text) > 12000:
199
- full_text = full_text[:12000] + "\n...[dokumen dipotong untuk efisiensi]"
200
-
201
- messages = SUMMARY_PROMPT.format_messages(document=full_text)
202
- return self.groq.invoke(messages)
203
-
204
- def get_stats(self) -> dict:
205
- """Statistik vector store saat ini."""
206
- return {
207
- "total_chunks": self.vector_store.count(),
208
- "collection_name": self.settings.chroma_collection_name,
209
- "embedding_model": self.settings.embedding_model,
210
- "llm_model": self.settings.groq_model,
211
- }
rag/src/retrieval/vector_store.py DELETED
@@ -1,93 +0,0 @@
1
- from typing import List, Optional
2
- from loguru import logger
3
- from langchain_chroma import Chroma
4
- from langchain.schema import Document as LCDocument
5
-
6
- from ..config import get_settings
7
- from ..loaders.base_loader import Document
8
- from ..embeddings.embedder import DocumentEmbedder
9
-
10
-
11
- class VectorStore:
12
- """
13
- Wrapper di atas ChromaDB.
14
- Menangani indexing, persistence, dan similarity search.
15
- """
16
-
17
- def __init__(self, embedder: Optional[DocumentEmbedder] = None):
18
- settings = get_settings()
19
- self.embedder = embedder or DocumentEmbedder()
20
- self.settings = settings
21
-
22
- self.db = Chroma(
23
- collection_name=settings.chroma_collection_name,
24
- embedding_function=self.embedder.get_embeddings_model(),
25
- persist_directory=settings.chroma_persist_dir,
26
- )
27
- logger.info(
28
- f"VectorStore ready. Collection: '{settings.chroma_collection_name}' "
29
- f"| Docs: {self.db._collection.count()}"
30
- )
31
-
32
- def add_documents(self, documents: List[Document]) -> int:
33
- """
34
- Chunk dan index dokumen ke ChromaDB.
35
- Returns: jumlah chunks yang berhasil di-index.
36
- """
37
- chunks = self.embedder.chunk_documents(documents)
38
-
39
- # Konversi ke format LangChain
40
- lc_docs = [
41
- LCDocument(page_content=chunk.content, metadata=chunk.metadata)
42
- for chunk in chunks
43
- ]
44
-
45
- self.db.add_documents(lc_docs)
46
- logger.info(f"Indexed {len(chunks)} chunks ke ChromaDB.")
47
- return len(chunks)
48
-
49
- def similarity_search(
50
- self,
51
- query: str,
52
- k: Optional[int] = None,
53
- filter: Optional[dict] = None,
54
- ) -> List[LCDocument]:
55
- """Cari dokumen paling relevan berdasarkan query."""
56
- k = k or self.settings.top_k_retrieval
57
- results = self.db.similarity_search(query, k=k, filter=filter)
58
- logger.debug(f"Retrieved {len(results)} chunks for query: '{query[:60]}...'")
59
- return results
60
-
61
- def similarity_search_with_score(
62
- self,
63
- query: str,
64
- k: Optional[int] = None,
65
- ) -> List[tuple]:
66
- """Sama seperti similarity_search tapi return (doc, score)."""
67
- k = k or self.settings.top_k_retrieval
68
- return self.db.similarity_search_with_score(query, k=k)
69
-
70
- def reset_collection(self):
71
- """
72
- Delete all documents WITHOUT dropping the collection.
73
- Uses reset_collection() from langchain-chroma; the collection stays alive
74
- and is immediately ready for the next ingest.
75
- """
76
- self.db.reset_collection()
77
- logger.warning(
78
- f"Collection '{self.settings.chroma_collection_name}' di-reset. "
79
- "Semua dokumen dihapus, collection siap dipakai kembali."
80
- )
81
-
82
- def delete_collection(self):
83
- """Alias for reset_collection(); the collection is not dropped, so this is safe."""
84
- self.reset_collection()
85
-
86
- def count(self) -> int:
87
- """Jumlah chunks yang tersimpan."""
88
- return self.db._collection.count()
89
-
90
- def get_retriever(self, search_kwargs: Optional[dict] = None):
91
- """Return LangChain retriever untuk dipakai di chain."""
92
- search_kwargs = search_kwargs or {"k": self.settings.top_k_retrieval}
93
- return self.db.as_retriever(search_kwargs=search_kwargs)
rag_pipeline/src/api/routes.py CHANGED
@@ -71,6 +71,12 @@ async def readiness_combined():
71
  models={"rag": {"state": rag_state, "error": str(e)[:300]}},
72
  )
73
 
 
 
 
 
 
 
74
  # Check the CV API status (best-effort: if CV is down, RAG can still use
75
  # text-based loaders; only the OCR fallback stops working).
76
  cv_url = os.getenv("CV_API_URL", "http://127.0.0.1:8001")
@@ -85,7 +91,9 @@ async def readiness_combined():
85
  except Exception as e:
86
  logger.debug(f"CV /ready unreachable: {e}")
87
 
88
- all_models = {"rag": {"state": rag_state}}
 
 
89
  for name, info in cv_models.items():
90
  all_models[f"cv.{name}"] = info
91
 
@@ -220,6 +228,20 @@ async def query(request: QueryRequest):
220
  source_ids=request.source_ids,
221
  )
222
  return QueryResponse(**result)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
223
  except Exception as e:
224
  logger.error(f"Query error: {e}")
225
  raise HTTPException(status_code=500, detail=str(e))
@@ -233,6 +255,14 @@ async def summarize(request: SummarizeRequest):
233
  try:
234
  summary = get_retriever().summarize(request.source)
235
  return SummarizeResponse(summary=summary, source=request.source)
 
 
 
 
 
 
 
 
236
  except Exception as e:
237
  raise HTTPException(status_code=500, detail=str(e))
238
 
@@ -242,8 +272,15 @@ async def summarize(request: SummarizeRequest):
242
  @router.delete("/collection", response_model=DeleteResponse, tags=["system"])
243
  async def delete_collection():
244
  """Hapus semua dokumen dari vector store."""
245
- get_retriever().vector_store.delete_collection()
246
- return DeleteResponse(
247
- status="success",
248
- message="Semua dokumen berhasil dihapus dari vector store.",
249
- )
 
 
 
 
 
 
 
 
71
  models={"rag": {"state": rag_state, "error": str(e)[:300]}},
72
  )
73
 
74
+ # Check whether GROQ_API_KEY is present; it matters for the /query and /summarize endpoints.
75
+ # This does not fail readiness, but it gives the UI enough information to show a warning.
76
+ from ..config import get_settings
77
+ settings = get_settings()
78
+ has_groq_key = bool((settings.groq_api_key or "").strip())
79
+
80
  # Check the CV API status (best-effort: if CV is down, RAG can still use
81
  # text-based loaders; only the OCR fallback stops working).
82
  cv_url = os.getenv("CV_API_URL", "http://127.0.0.1:8001")
 
91
  except Exception as e:
92
  logger.debug(f"CV /ready unreachable: {e}")
93
 
94
+ all_models = {
95
+ "rag": {"state": rag_state, "groq_api_key": "set" if has_groq_key else "missing"},
96
+ }
97
  for name, info in cv_models.items():
98
  all_models[f"cv.{name}"] = info
99
 
 
228
  source_ids=request.source_ids,
229
  )
230
  return QueryResponse(**result)
231
+ except RuntimeError as e:
232
+ # GROQ_API_KEY missing → 503 Service Unavailable with a clear message.
233
+ msg = str(e)
234
+ if "GROQ_API_KEY" in msg:
235
+ logger.warning(f"Query rejected (no API key): {msg}")
236
+ raise HTTPException(
237
+ status_code=503,
238
+ detail={
239
+ "error": "groq_api_key_missing",
240
+ "message": msg,
241
+ },
242
+ )
243
+ logger.error(f"Query runtime error: {e}")
244
+ raise HTTPException(status_code=500, detail=str(e))
245
  except Exception as e:
246
  logger.error(f"Query error: {e}")
247
  raise HTTPException(status_code=500, detail=str(e))
 
255
  try:
256
  summary = get_retriever().summarize(request.source)
257
  return SummarizeResponse(summary=summary, source=request.source)
258
+ except RuntimeError as e:
259
+ msg = str(e)
260
+ if "GROQ_API_KEY" in msg:
261
+ raise HTTPException(
262
+ status_code=503,
263
+ detail={"error": "groq_api_key_missing", "message": msg},
264
+ )
265
+ raise HTTPException(status_code=500, detail=str(e))
266
  except Exception as e:
267
  raise HTTPException(status_code=500, detail=str(e))
268
 
 
272
  @router.delete("/collection", response_model=DeleteResponse, tags=["system"])
273
  async def delete_collection():
274
  """Hapus semua dokumen dari vector store."""
275
+ try:
276
+ get_retriever().vector_store.delete_collection()
277
+ return DeleteResponse(
278
+ status="success",
279
+ message="Semua dokumen berhasil dihapus dari vector store.",
280
+ )
281
+ except Exception as e:
282
+ logger.error(f"Clear collection error: {e}")
283
+ raise HTTPException(
284
+ status_code=500,
285
+ detail=f"Gagal hapus collection: {e}",
286
+ )
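Reviewer note: the new `except RuntimeError` branches in `/query` and `/summarize` both apply the same rule, which could be factored into one helper. The sketch below is framework-free on purpose; `classify_error` is a hypothetical name, and the real handlers would still need to wrap its result in `HTTPException`.

```python
from typing import Tuple, Union


def classify_error(exc: Exception) -> Tuple[int, Union[str, dict]]:
    """Map an exception to (status_code, detail) using the rule from the diff.

    A RuntimeError whose message mentions GROQ_API_KEY becomes a 503 with a
    structured detail; everything else stays a plain 500.
    """
    msg = str(exc)
    if isinstance(exc, RuntimeError) and "GROQ_API_KEY" in msg:
        return 503, {"error": "groq_api_key_missing", "message": msg}
    return 500, msg


status, detail = classify_error(RuntimeError("GROQ_API_KEY belum di-set."))
```

Matching on a substring of the message is fragile: rewording the error in `groq_client.py` would silently turn the 503 back into a 500. A dedicated exception subclass would be sturdier, but that is a design suggestion and not something this commit does.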
rag_pipeline/src/config.py CHANGED
@@ -6,7 +6,11 @@ from pathlib import Path
6
 
7
  class Settings(BaseSettings):
8
  # LLM
9
- groq_api_key: str = Field(..., env="GROQ_API_KEY")
 
 
 
 
10
  groq_model: str = Field("llama-3.3-70b-versatile", env="GROQ_MODEL") # updated from 3.1
11
 
12
  # Embeddings
 
6
 
7
  class Settings(BaseSettings):
8
  # LLM
9
+ # groq_api_key is intentionally NOT required (default ""). If the env var is not set,
10
+ # the service can still start and the /stats, /sources and /ingest endpoints keep working.
11
+ # Only /query and /summarize fail, with a clear message; this is better
12
+ # than the whole container crashing at startup.
13
+ groq_api_key: str = Field(default="", env="GROQ_API_KEY")
14
  groq_model: str = Field("llama-3.3-70b-versatile", env="GROQ_MODEL") # updated from 3.1
15
 
16
  # Embeddings
rag_pipeline/src/llm/groq_client.py CHANGED
@@ -1,6 +1,20 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  from typing import List, Optional, Iterator
2
  from loguru import logger
3
- from langchain_groq import ChatGroq
4
  from langchain.schema import BaseMessage, HumanMessage, AIMessage
5
  from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
6
 
@@ -11,37 +25,70 @@ class GroqClient:
11
  """
12
  Wrapper di atas ChatGroq dari LangChain.
13
  Mendukung regular call dan streaming.
 
 
14
  """
15
 
16
  def __init__(self, streaming: bool = False):
17
- settings = get_settings()
18
- callbacks = [StreamingStdOutCallbackHandler()] if streaming else []
19
-
20
- self.llm = ChatGroq(
21
- api_key=settings.groq_api_key,
22
- model_name=settings.groq_model,
23
- temperature=settings.temperature,
24
- max_tokens=settings.max_tokens,
25
- streaming=streaming,
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
26
  callbacks=callbacks,
27
  )
28
- self.model_name = settings.groq_model
29
- logger.info(f"Groq client initialized. Model: {settings.groq_model}")
30
 
31
  def invoke(self, messages: List[BaseMessage]) -> str:
32
  """Kirim messages dan return response string."""
33
- response = self.llm.invoke(messages)
 
34
  return response.content
35
 
36
  def stream(self, messages: List[BaseMessage]) -> Iterator[str]:
37
  """Streaming response: yields token by token."""
38
- for chunk in self.llm.stream(messages):
 
39
  if chunk.content:
40
  yield chunk.content
41
 
42
  def get_langchain_llm(self):
43
  """Return raw LangChain LLM object untuk dipakai di chain."""
44
- return self.llm
45
 
46
  @staticmethod
47
  def build_messages(
 
1
+ """
2
+ Groq LLM client.
3
+
4
+ Change: ChatGroq is initialized LAZILY, not when the constructor runs but the
5
+ first time invoke/stream is called. The reasons:
6
+
7
+ - RAGRetriever.__init__ no longer crashes when GROQ_API_KEY is not set.
8
+ - Endpoints that do not need the LLM (/stats, /sources, /ingest, /collection DELETE)
9
+ keep working even without the API key.
10
+ - /query and /summarize fail with a clear message, "GROQ_API_KEY belum
11
+ di-set", instead of a pydantic error or startup crash that confuses the user.
12
+ """
13
+
14
+ from __future__ import annotations
15
+
16
  from typing import List, Optional, Iterator
17
  from loguru import logger
 
18
  from langchain.schema import BaseMessage, HumanMessage, AIMessage
19
  from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
20
 
 
25
  """
26
  Wrapper di atas ChatGroq dari LangChain.
27
  Mendukung regular call dan streaming.
28
+
29
+ LAZY: ChatGroq instance baru dibuat saat pertama kali invoke/stream.
30
  """
31
 
32
  def __init__(self, streaming: bool = False):
33
+ self.settings = get_settings()
34
+ self.streaming = streaming
35
+ self.model_name = self.settings.groq_model
36
+ self._llm = None # akan diisi di _ensure_llm()
37
+ logger.info(
38
+ f"Groq client constructed (lazy). Model: {self.model_name} "
39
+ f"| API key: {'SET' if self.settings.groq_api_key else 'NOT SET'}"
40
+ )
41
+
42
+ def _ensure_llm(self):
43
+ """Buat ChatGroq instance kalau belum ada. Validasi API key di sini."""
44
+ if self._llm is not None:
45
+ return self._llm
46
+
47
+ api_key = (self.settings.groq_api_key or "").strip()
48
+ if not api_key:
49
+ raise RuntimeError(
50
+ "GROQ_API_KEY belum di-set. "
51
+ "Tambahkan di Hugging Face Space → Settings → Variables and secrets, "
52
+ "atau set environment variable di host. "
53
+ "Endpoint /query dan /summarize butuh API key ini."
54
+ )
55
+
56
+ try:
57
+ from langchain_groq import ChatGroq
58
+ except ImportError as e:
59
+ raise RuntimeError(
60
+ f"langchain-groq tidak terinstall: {e}. "
61
+ "Cek requirements.txt."
62
+ )
63
+
64
+ callbacks = [StreamingStdOutCallbackHandler()] if self.streaming else []
65
+ self._llm = ChatGroq(
66
+ api_key=api_key,
67
+ model_name=self.model_name,
68
+ temperature=self.settings.temperature,
69
+ max_tokens=self.settings.max_tokens,
70
+ streaming=self.streaming,
71
  callbacks=callbacks,
72
  )
73
+ logger.info(f"Groq client initialized lazily. Model: {self.model_name}")
74
+ return self._llm
75
 
76
  def invoke(self, messages: List[BaseMessage]) -> str:
77
  """Kirim messages dan return response string."""
78
+ llm = self._ensure_llm()
79
+ response = llm.invoke(messages)
80
  return response.content
81
 
82
  def stream(self, messages: List[BaseMessage]) -> Iterator[str]:
83
  """Streaming response: yields token by token."""
84
+ llm = self._ensure_llm()
85
+ for chunk in llm.stream(messages):
86
  if chunk.content:
87
  yield chunk.content
88
 
89
  def get_langchain_llm(self):
90
  """Return raw LangChain LLM object untuk dipakai di chain."""
91
+ return self._ensure_llm()
92
 
93
  @staticmethod
94
  def build_messages(
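
Reviewer note: the lazy-initialization pattern introduced in `_ensure_llm()` can be sketched in isolation as below. This is a minimal sketch, not the project's code: `LazyClient` and `factory` are hypothetical names, and `factory` merely stands in for the `ChatGroq` constructor so the example runs without any third-party package.

```python
from typing import Callable, Optional


class LazyClient:
    """Builds the underlying client on first use instead of in __init__."""

    def __init__(self, api_key: Optional[str], factory: Callable[[str], Callable[[str], str]]):
        self._api_key = api_key
        self._factory = factory  # hypothetical stand-in for the real client constructor
        self._client: Optional[Callable[[str], str]] = None

    def _ensure_client(self) -> Callable[[str], str]:
        # Reuse the cached client if it was already built.
        if self._client is not None:
            return self._client
        # Validate the key only when a call actually needs it.
        key = (self._api_key or "").strip()
        if not key:
            raise RuntimeError("API key is not set")
        self._client = self._factory(key)
        return self._client

    def invoke(self, prompt: str) -> str:
        return self._ensure_client()(prompt)
```

Constructing `LazyClient` never raises, so an app can start without a key and only the calls that need the client fail.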
rag_pipeline/src/retrieval/vector_store.py CHANGED
@@ -203,26 +203,69 @@ class VectorStore:
         return list(bucket.values())
 
     def reset_collection(self):
-        """Delete all documents without destroying the collection."""
         try:
             self.db.reset_collection()
         except Exception as e:
-            # Fallback: if reset_collection is unavailable or fails, use delete + recreate.
-            logger.warning(f"reset_collection() failed: {e}; falling back to delete + recreate.")
-            try:
-                self.db._client.delete_collection(self.settings.chroma_collection_name)
-            except Exception:
-                pass
-            # Re-initialize and re-attach the embedding function.
             self.db = Chroma(
                 collection_name=self.settings.chroma_collection_name,
                 embedding_function=self.embedder.get_embeddings_model(),
                 persist_directory=self.settings.chroma_persist_dir,
             )
-        logger.warning(
-            f"Collection '{self.settings.chroma_collection_name}' has been reset. "
-            "All documents were deleted; the collection is ready for reuse."
-        )
 
     def delete_collection(self):
         """Alias for reset_collection(); the collection is not destroyed, so this is safe."""
 
         return list(bucket.values())
 
     def reset_collection(self):
+        """
+        Delete all documents from the collection.
+
+        Strategy (most compatible first):
+        1. Try `db.reset_collection()` (available in newer langchain_chroma versions).
+        2. Fallback: fetch all IDs, then delete them via `_collection.delete(ids=...)`.
+           This works with all langchain_chroma versions + chromadb 0.5.x.
+        3. Last resort: drop the collection via `_client.delete_collection()` and re-init Chroma.
+
+        Why option 2 is the default fallback (rather than option 3):
+        - Re-initializing Chroma sometimes triggers a ChromaDB re-create error when
+          there is a race condition with delete_collection in the backend.
+        - Deleting by IDs is far more atomic.
+        """
+        # Option 1: high-level method (newer langchain_chroma)
         try:
             self.db.reset_collection()
+            logger.warning(
+                f"Collection '{self.settings.chroma_collection_name}' was reset (high-level)."
+            )
+            return
+        except AttributeError:
+            logger.debug("db.reset_collection() is not available, falling back to delete-by-ids.")
         except Exception as e:
+            logger.debug(f"db.reset_collection() failed: {e}; falling back to delete-by-ids.")
+
+        # Option 2: delete all documents by ID
+        try:
+            collection = self.db._collection
+            data = collection.get(include=[])  # only the ids are needed
+            ids = data.get("ids") or []
+            if ids:
+                collection.delete(ids=ids)
+                logger.warning(
+                    f"Collection '{self.settings.chroma_collection_name}' was reset "
+                    f"({len(ids)} chunks deleted via delete-by-ids)."
+                )
+            else:
+                logger.info(
+                    f"Collection '{self.settings.chroma_collection_name}' is already empty, nothing to delete."
+                )
+            return
+        except Exception as e:
+            logger.warning(f"Delete-by-ids failed: {e}; falling back to delete + re-init.")
+
+        # Option 3: drop the collection, then re-init Chroma
+        try:
+            self.db._client.delete_collection(self.settings.chroma_collection_name)
+        except Exception as e:
+            logger.warning(f"_client.delete_collection() failed: {e}")
+
+        try:
             self.db = Chroma(
                 collection_name=self.settings.chroma_collection_name,
                 embedding_function=self.embedder.get_embeddings_model(),
                 persist_directory=self.settings.chroma_persist_dir,
             )
+            logger.warning(
+                f"Collection '{self.settings.chroma_collection_name}' nuked & re-init."
+            )
+        except Exception as e:
+            logger.error(f"Re-init Chroma failed after delete: {e}")
+            raise
 
     def delete_collection(self):
         """Alias for reset_collection(); the collection is not destroyed, so this is safe."""
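
Reviewer note: the three-tier strategy described in the `reset_collection()` docstring is an ordered fallback chain. The sketch below shows that control flow only, under stated assumptions: `run_with_fallbacks` and the strategy names are hypothetical, and plain callables stand in for the Chroma calls. Unlike the diff, which logs the first two failures at debug level, this sketch logs every failure as a warning.

```python
import logging
from typing import Callable, Optional, Sequence, Tuple

logger = logging.getLogger(__name__)


def run_with_fallbacks(strategies: Sequence[Tuple[str, Callable[[], None]]]) -> str:
    """Try each (name, callable) in order; return the name of the first that succeeds."""
    last_error: Optional[Exception] = None
    for name, strategy in strategies:
        try:
            strategy()
            return name
        except Exception as exc:  # broad on purpose: any failure moves to the next tier
            logger.warning("strategy %s failed: %s", name, exc)
            last_error = exc
    # Every tier failed: surface the last error rather than swallowing it.
    raise RuntimeError("all strategies failed") from last_error
```

Returning the name of the tier that succeeded makes it easy to log which path was taken, much as the diff logs a distinct message per option.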
start.sh CHANGED
@@ -1,16 +1,41 @@
 #!/bin/bash
 set -e
 
-echo "=== Multimodal AI Platform - Starting ==="
-echo "GROQ_API_KEY: ${GROQ_API_KEY:+SET (hidden)}${GROQ_API_KEY:-NOT SET - RAG queries will fail!}"
 
-# Export GROQ_API_KEY so child processes (supervisord → uvicorn) can access it
 export GROQ_API_KEY="${GROQ_API_KEY:-}"
 
-# Ensure directories exist
 mkdir -p /app/rag/chroma_db /app/rag/mlruns /app/rag/logs \
          /app/cv/model_cache /app/cv/mlruns /app/cv/logs /app/cv/uploads \
          /var/log/supervisor /run
 
-echo "Starting supervisord (nginx + rag-api + cv-api)..."
 exec /usr/bin/supervisord -c /etc/supervisor/conf.d/supervisord.conf
 
 #!/bin/bash
 set -e
 
+echo "========================================================"
+echo " Multimodal AI Platform - Starting"
+echo "========================================================"
 
+# --- Diagnostics ---
+echo ""
+echo "[diag] Python: $(python3 --version 2>&1)"
+echo "[diag] Working dir: $(pwd)"
+echo ""
+echo "[diag] Source layout:"
+ls -la /app/rag/src/api/main.py 2>/dev/null && echo " ✓ /app/rag/src/api/main.py" || echo " ✗ /app/rag/src/api/main.py MISSING"
+ls -la /app/cv/src/api/main.py 2>/dev/null && echo " ✓ /app/cv/src/api/main.py" || echo " ✗ /app/cv/src/api/main.py MISSING"
+ls -la /app/frontend/index.html 2>/dev/null && echo " ✓ /app/frontend/index.html" || echo " ✗ /app/frontend/index.html MISSING"
+ls -la /app/cv/model_cache/yolov8n.onnx 2>/dev/null && echo " ✓ YOLOv8n ONNX model" || echo " ✗ YOLOv8n ONNX MISSING (CV /detect will fail)"
+echo ""
+
+# --- Secrets / env ---
+if [ -z "${GROQ_API_KEY}" ]; then
+    echo "[warn] GROQ_API_KEY is not set."
+    echo " /api/v1/query and /api/v1/summarize will return 503."
+    echo " Other endpoints (/ingest /sources /stats) keep working normally."
+    echo " Set it under Hugging Face Space → Settings → Variables and secrets."
+else
+    echo "[ok] GROQ_API_KEY: SET (${#GROQ_API_KEY} chars)"
+fi
+echo ""
+
+# Export GROQ_API_KEY so that child processes (supervisord → uvicorn) can access it.
 export GROQ_API_KEY="${GROQ_API_KEY:-}"
 
+# Make sure the required directories exist (the HF Space mount point sometimes resets).
 mkdir -p /app/rag/chroma_db /app/rag/mlruns /app/rag/logs \
          /app/cv/model_cache /app/cv/mlruns /app/cv/logs /app/cv/uploads \
          /var/log/supervisor /run
 
+echo "[boot] Starting supervisord (nginx + rag-api + cv-api)..."
+echo "========================================================"
 exec /usr/bin/supervisord -c /etc/supervisor/conf.d/supervisord.conf