"""Resource manager for lazy-loaded application resources.

This module provides a ResourceManager class that handles lazy initialization
and caching of heavy application resources like the retriever and settings.
Resources are NOT loaded on application startup - they are loaded on the first
request that needs them, then cached for subsequent requests.

Key Features:
    - **Deferred Loading**: Resources load on first request, not startup
    - **Caching**: Once loaded, resources are cached for fast access
    - **Concurrent Load Protection**: Prevents multiple simultaneous loads
    - **Artifact Download**: Downloads RAG artifacts from HuggingFace if needed
    - **Metrics Tracking**: Tracks startup time and memory usage
    - **Ready State**: Exposes ready state for health check endpoints
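
Concurrent Load Protection Sketch:
    A minimal, self-contained sketch of the concurrent load protection
    listed above, not this module's actual class. The pattern is a flag
    check, an asyncio.Lock, and a second check inside the lock. LazyLoader,
    load_count, and the short sleep standing in for the slow load are
    hypothetical names used only for illustration.

```python
import asyncio


class LazyLoader:
    # Hypothetical stand-in for a manager that loads resources once.

    def __init__(self):
        self._loaded = False
        self._lock = asyncio.Lock()
        self.load_count = 0

    async def _load(self):
        # Stands in for a slow artifact download and index load.
        await asyncio.sleep(0.01)
        self.load_count += 1

    async def ensure_loaded(self):
        # Fast path: no lock needed once loading has succeeded.
        if self._loaded:
            return
        async with self._lock:
            # Second check: another coroutine may have finished the load
            # while this one was waiting for the lock.
            if self._loaded:
                return
            await self._load()
            self._loaded = True


async def demo():
    # Created inside the running loop so the lock binds to that loop.
    loader = LazyLoader()
    await asyncio.gather(*(loader.ensure_loaded() for _ in range(5)))
    return loader.load_count


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

    Five concurrent callers trigger a single load. The other four wait on
    the lock and then return from the second check.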

Performance Targets:
    - Cold start (first request): < 30 seconds
    - Warm start (subsequent requests): < 5ms (cached)
    - Memory usage: Tracked and logged after loading
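
Load Timing Sketch:
    A small sketch of how a load duration can be measured against the
    cold start target above, assuming time.perf_counter() and whole
    milliseconds. timed_ms and log_level_for are hypothetical helpers, and
    the constant only mirrors the 30 second target for illustration.

```python
import time

# Mirrors the 30 second cold start target; the name is illustrative.
COLD_START_WARNING_THRESHOLD_MS = 30_000


def timed_ms(func):
    # Run func and return (result, elapsed whole milliseconds).
    start = time.perf_counter()
    result = func()
    elapsed_ms = int((time.perf_counter() - start) * 1000)
    return result, elapsed_ms


def log_level_for(duration_ms, threshold_ms=COLD_START_WARNING_THRESHOLD_MS):
    # Only loads strictly slower than the threshold are flagged.
    return "warning" if duration_ms > threshold_ms else "info"


if __name__ == "__main__":
    value, elapsed = timed_ms(lambda: sum(range(1000)))
    print(value, elapsed, log_level_for(elapsed))
```

    A load of exactly 30000 ms still logs at info level under this
    comparison; only longer loads are reported as warnings.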

Architecture:
    The ResourceManager is a singleton accessed via get_resource_manager().
    It stores:
        - Settings: Application configuration from environment
        - Retriever: HybridRetriever wrapped with optional reranking

    The loading pipeline includes artifact download from HuggingFace:
        1. Load Settings from environment variables
        2. Download/verify artifacts via ArtifactDownloader (Step 7.7)
        3. Validate dataset freshness via FreshnessValidator (Step 9.5)
        4. Create retriever using factory function
        5. Record metrics (duration, memory)

Lazy Loading Strategy:
    All heavy dependencies (torch, faiss, sentence-transformers) are imported
    inside methods rather than at module level. This ensures:
        - Fast module import time
        - Minimal memory usage until resources are needed
        - Clean separation between import and initialization
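
Deferred Import Sketch:
    A minimal illustration of the strategy above, assuming a standard
    library module (colorsys) as a stand-in for a heavy dependency such as
    torch or faiss. LazyColorTools is a hypothetical class, not part of
    this package.

```python
import sys


class LazyColorTools:
    # Hypothetical example: the dependency is imported on first use only.

    def __init__(self):
        # Nothing heavy is imported when the object is constructed.
        self._module = None

    def to_hsv(self, red, green, blue):
        if self._module is None:
            # Deferred import: runs on the first call, then is cached.
            import colorsys

            self._module = colorsys
        return self._module.rgb_to_hsv(red, green, blue)


if __name__ == "__main__":
    tools = LazyColorTools()
    print(tools.to_hsv(1.0, 0.0, 0.0))
    print("colorsys" in sys.modules)
```

    Constructing the object stays cheap; the import cost is paid by the
    first call that needs the dependency.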

Usage:
    The ResourceManager is used by route handlers to access shared resources:

    >>> from rag_chatbot.api.resources import get_resource_manager
    >>>
    >>> async def query_handler(query: str):
    ...     manager = get_resource_manager()
    ...     await manager.ensure_loaded()  # Lazy load if needed
    ...     retriever = manager.get_retriever()
    ...     results = retriever.retrieve(query)
    ...     return results

Integration with Health Checks:
    The /health/ready endpoint uses is_ready() to report whether the
    application is ready to serve requests:

    >>> manager = get_resource_manager()
    >>> if manager.is_ready():
    ...     return {"ready": True}
    ... else:
    ...     return {"ready": False}
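
Readiness Response Sketch:
    A hedged sketch of how a readiness handler might turn the ready flag
    into a status code and body. The 503 mapping is an assumption (a
    common choice for readiness probes), not something this module
    confirms, and readiness_response is a hypothetical helper.

```python
from http import HTTPStatus


def readiness_response(is_ready):
    # Assumption: a not-ready instance answers 503 so that probes and
    # load balancers stop routing traffic to it.
    if is_ready:
        return HTTPStatus.OK.value, {"ready": True}
    return HTTPStatus.SERVICE_UNAVAILABLE.value, {"ready": False}


if __name__ == "__main__":
    print(readiness_response(False))
    print(readiness_response(True))
```

    The body matches the dictionaries in the example above; only the
    status code is added.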

See Also
--------
    Settings : Configuration module
        Application configuration (src/rag_chatbot/config/settings.py)
    RetrieverWithReranker : Retriever wrapper
        Retriever wrapper (src/rag_chatbot/retrieval/factory.py)
    ArtifactDownloader : Artifact downloader
        Downloads artifacts from HuggingFace (artifact_downloader.py)
    _lifespan : Lifecycle manager
        Application lifecycle (src/rag_chatbot/api/main.py)

"""

from __future__ import annotations

import asyncio
import logging
import time
from typing import TYPE_CHECKING

# =============================================================================
# Type Checking Imports
# =============================================================================
# These imports are only processed by type checkers (mypy, pyright) and IDEs.
# They enable proper type hints without runtime overhead.
# Heavy dependencies are NOT imported at runtime to ensure fast module loading.
# =============================================================================

if TYPE_CHECKING:
    from rag_chatbot.config.settings import Settings
    from rag_chatbot.retrieval.factory import RetrieverWithReranker

# =============================================================================
# Module Exports
# =============================================================================
__all__: list[str] = ["ResourceManager", "get_resource_manager"]

# =============================================================================
# Logger
# =============================================================================
logger = logging.getLogger(__name__)

# =============================================================================
# Constants
# =============================================================================
# Threshold for warning about slow cold start (30 seconds in milliseconds)
_COLD_START_WARNING_THRESHOLD_MS: int = 30000

# =============================================================================
# Module-level Singleton
# =============================================================================
# The ResourceManager is a singleton to ensure shared state across the
# application. This variable holds the singleton instance.
# =============================================================================
_resource_manager: ResourceManager | None = None


# =============================================================================
# ResourceManager Class
# =============================================================================


class ResourceManager:
    """Manager for lazy-loaded application resources.

    This class handles the lifecycle of heavy application resources like the
    retriever and settings. Resources are loaded lazily on first request and
    cached for subsequent access.

    Design Principles:
        1. **Lazy Loading**: Resources load on first access, not initialization
        2. **Coroutine-Safe**: Uses asyncio.Lock to prevent concurrent loads
           within one event loop (asyncio.Lock is not thread-safe)
        3. **Metrics Tracking**: Records load time and memory usage
        4. **Ready State**: Tracks whether resources are loaded for health checks

    Resource Lifecycle:
        1. ResourceManager is instantiated (empty - no resources loaded)
        2. First request calls ensure_loaded()
        3. Resources are loaded and cached (settings, retriever)
        4. Subsequent requests use cached resources (fast path)
        5. Application shutdown calls shutdown() for cleanup

    Attributes:
    ----------
        _retriever : RetrieverWithReranker | None
            The cached retriever instance. None until ensure_loaded() completes.

        _settings : Settings | None
            The cached settings instance. None until ensure_loaded() completes.

        _loaded : bool
            Whether resources have been successfully loaded.

        _loading : bool
            Whether a load operation is currently in progress.
            Used to prevent concurrent loads.

        _load_lock : asyncio.Lock
            Lock to ensure only one coroutine loads resources at a time.

        _load_start_time : float | None
            Timestamp when loading started (time.perf_counter()).
            Used to calculate load duration.

        _load_duration_ms : int | None
            Time taken to load resources in milliseconds.
            Logged for monitoring cold start performance.

        _memory_mb : float | None
            Process memory usage after loading in megabytes.
            Logged for monitoring resource consumption.

    Example:
    -------
        >>> manager = get_resource_manager()
        >>> await manager.ensure_loaded()
        >>> retriever = manager.get_retriever()
        >>> results = retriever.retrieve("What is PMV?")

    Note:
    ----
        This class should not be instantiated directly. Use get_resource_manager()
        to get the singleton instance.

    """

    # =========================================================================
    # Initialization
    # =========================================================================

    def __init__(self) -> None:
        """Initialize the ResourceManager with empty state.

        Creates a new ResourceManager with no resources loaded. Resources are
        loaded lazily when ensure_loaded() is called.

        The constructor does NOT import any heavy dependencies. All imports
        of torch, faiss, sentence-transformers, etc. happen inside methods
        when resources are actually loaded.

        Note:
        ----
            This constructor should only be called by get_resource_manager().
            Direct instantiation is discouraged to maintain the singleton pattern.

        """
        # =====================================================================
        # Resource Cache (initially empty)
        # =====================================================================
        # These are populated by ensure_loaded() on first request.
        # Using None as sentinel for "not yet loaded" state.
        # =====================================================================
        self._retriever: RetrieverWithReranker | None = None
        self._settings: Settings | None = None

        # =====================================================================
        # Loading State
        # =====================================================================
        # Tracks whether resources are loaded and prevents concurrent loads.
        # =====================================================================
        self._loaded: bool = False
        self._loading: bool = False
        self._load_lock: asyncio.Lock = asyncio.Lock()

        # =====================================================================
        # Metrics (populated after loading)
        # =====================================================================
        # These track performance metrics for monitoring and alerting.
        # =====================================================================
        self._load_start_time: float | None = None
        self._load_duration_ms: int | None = None
        self._memory_mb: float | None = None

        logger.debug("ResourceManager initialized (resources not yet loaded)")

    # =========================================================================
    # Private Methods
    # =========================================================================

    def _get_memory_usage_mb(self) -> float:
        """Get current process memory usage in megabytes.

        Uses psutil to get the Resident Set Size (RSS) of the current process.
        RSS represents the actual physical memory used by the process.

        Returns:
        -------
            Current process memory usage in MB (megabytes).

        Note:
        ----
            This requires psutil to be installed. If psutil is not available,
            returns 0.0 and logs a warning.

            Memory is measured after resource loading completes to track the
            impact of loading FAISS indexes, embeddings, and models.

        Example:
        -------
            >>> memory = self._get_memory_usage_mb()
            >>> print(f"Memory usage: {memory:.2f} MB")
            Memory usage: 512.34 MB

        """
        try:
            import psutil  # type: ignore[import-untyped]

            process = psutil.Process()
            # memory_info().rss returns bytes, convert to MB
            memory_bytes: int = process.memory_info().rss
            return float(memory_bytes) / (1024 * 1024)
        except ImportError:
            logger.warning(
                "psutil not installed - cannot measure memory usage. "
                "Install with: pip install psutil"
            )
            return 0.0
        except Exception:
            # Catch any other errors (permissions, etc.) without crashing
            logger.warning("Failed to get memory usage", exc_info=True)
            return 0.0

    async def _load_resources(self) -> None:
        """Load all application resources.

        This is the core loading method that initializes all heavy dependencies.
        It is called by ensure_loaded() when resources need to be loaded.

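        Loading Steps:
            1. Load Settings from environment variables
            2. Download/verify artifacts from HuggingFace via ArtifactDownloader
            3. Validate dataset freshness via FreshnessValidator
            4. Create retriever using factory function

        Metrics (duration, memory) are recorded by ensure_loaded() after this
        method returns, not by this method.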

        The method imports heavy dependencies inside the function to ensure
        they are only loaded when actually needed, not at module import time.

        Raises:
        ------
            RuntimeError: If artifact download from HuggingFace fails or
                dataset freshness validation fails.
            Exception: Errors raised while loading settings or creating the
                retriever propagate unchanged; ensure_loaded() wraps them
                in RuntimeError.

        Note:
        ----
            This method assumes it is called while holding _load_lock.
            Do not call directly - use ensure_loaded() instead.

            The retriever may defer loading some of its components (for
            example, the encoder model) until the first retrieve() call, so
            that call can trigger additional loading.

        """
        logger.info("Loading application resources...")

        # =====================================================================
        # Step 1: Load Settings
        # =====================================================================
        # Import Settings lazily to avoid loading pydantic_settings at module
        # import time. Settings reads from environment variables.
        # =====================================================================
        logger.debug("Loading settings from environment")

        from rag_chatbot.config.settings import Settings

        self._settings = Settings()

        logger.debug(
            "Settings loaded: use_hybrid=%s, use_reranker=%s, top_k=%d",
            self._settings.use_hybrid,
            self._settings.use_reranker,
            self._settings.top_k,
        )

        # =====================================================================
        # Step 2: Download/verify artifacts from HuggingFace (Step 7.7)
        # =====================================================================
        # The ArtifactDownloader handles:
        #   - Version-based cache invalidation (compares local vs remote version)
        #   - Cache hit: Uses existing artifacts (fast path, ~1 second)
        #   - Cache miss: Downloads all artifacts from HuggingFace (~10-30 seconds)
        #   - Force refresh: Re-downloads if FORCE_ARTIFACT_REFRESH=true
        #   - Retry logic with exponential backoff for transient failures
        #
        # The downloader returns the path to the cache directory containing:
        #   - chunks.parquet: Document chunks with metadata
        #   - embeddings.parquet: Embedding vectors for semantic search
        #   - faiss_index.bin: FAISS index for dense retrieval
        #   - bm25_index.pkl: BM25 index for sparse/lexical retrieval
        #   - index_version.txt: Version identifier for cache invalidation
        # =====================================================================
        logger.debug(
            "Ensuring artifacts are available (repo=%s, force_refresh=%s)",
            self._settings.hf_index_repo,
            self._settings.force_artifact_refresh,
        )

        # Lazy import to avoid loading huggingface_hub at module import time
        from rag_chatbot.api.artifact_downloader import (
            ArtifactDownloader,
            ArtifactDownloadError,
        )

        # Track artifact download time separately for monitoring
        artifact_start_time = time.perf_counter()

        try:
            downloader = ArtifactDownloader(self._settings)
            artifact_path = await downloader.ensure_artifacts_available()
        except ArtifactDownloadError as e:
            # Log the full exception with traceback for operators
            logger.exception(
                "Failed to download artifacts from HuggingFace (repo=%s)",
                self._settings.hf_index_repo,
            )
            # Re-raise as RuntimeError with helpful message for operators
            msg = (
                f"Failed to download RAG artifacts from HuggingFace: {e}. "
                f"Check HF_TOKEN is valid, repo '{self._settings.hf_index_repo}' "
                "exists, and network connectivity to HuggingFace."
            )
            raise RuntimeError(msg) from e

        artifact_elapsed_ms = int((time.perf_counter() - artifact_start_time) * 1000)
        logger.info(
            "Artifact download/verification completed in %d ms, path: %s",
            artifact_elapsed_ms,
            artifact_path,
        )

        # =====================================================================
        # Step 2.5: Validate Dataset Freshness (Step 9.5)
        # =====================================================================
        # The FreshnessValidator checks that downloaded artifacts are:
        #   - Schema version compatible with this server code
        #   - Complete and consistent (manifest matches version file)
        #
        # If validation fails, the server refuses to start with a clear
        # error message indicating what needs to be fixed.
        # =====================================================================
        logger.debug("Validating dataset freshness...")

        # Lazy import to avoid loading at module import time
        from rag_chatbot.api.freshness import (
            FreshnessValidationError,
            FreshnessValidator,
        )

        freshness_validator = FreshnessValidator(artifact_path, self._settings)

        try:
            manifest = freshness_validator.validate()
            # Log the index version on boot (acceptance criteria)
            if manifest is not None:
                logger.info(
                    "Dataset validated: index_version=%s, schema_version=%s",
                    manifest.index_version,
                    manifest.schema_version,
                )
            else:
                # Legacy manifest format - validation was skipped
                logger.info(
                    "Dataset loaded with legacy manifest format (validation skipped)"
                )
        except FreshnessValidationError as e:
            # Fail fast with clear error if validation fails (acceptance criteria)
            logger.exception(
                "Dataset freshness validation failed - server cannot start"
            )
            msg = (
                f"Dataset freshness validation failed: {e}. "
                "The server cannot start with incompatible or corrupt artifacts. "
                f"Repository: {self._settings.hf_index_repo}"
            )
            raise RuntimeError(msg) from e

        # =====================================================================
        # Step 3: Create Retriever
        # =====================================================================
        # Import the factory function lazily. This triggers loading of
        # HybridRetriever, DenseRetriever, and related modules.
        #
        # The retriever factory:
        #   - Creates HybridRetriever or DenseRetriever based on use_hybrid
        #   - Wraps with RetrieverWithReranker if use_reranker is enabled
        #   - Configures top_k from settings
        #
        # Note: The retriever loads FAISS/BM25 indexes from disk, but the
        # encoder model is lazy-loaded on first retrieve() call.
        # =====================================================================
        logger.debug("Creating retriever from factory")

        from rag_chatbot.retrieval.factory import get_default_retriever

        self._retriever = get_default_retriever(
            index_path=artifact_path,
            settings=self._settings,
        )

        logger.debug(
            "Retriever created: type=%s, use_reranker=%s",
            type(self._retriever.retriever).__name__,
            self._retriever.use_reranker,
        )

    # =========================================================================
    # Public Methods
    # =========================================================================

    async def ensure_loaded(self) -> None:
        """Ensure all resources are loaded, loading them if necessary.

        This is the main entry point for resource loading. It implements
        lazy loading with the following behavior:

        1. If already loaded: Return immediately (fast path, < 1ms)
        2. If another coroutine is loading: Wait for it to complete
        3. If not loaded: Acquire lock and load resources

        The method uses an asyncio.Lock to ensure that only one coroutine
        performs the actual loading. Other concurrent calls will wait for
        the loading to complete rather than loading redundantly.

        Performance:
            - Warm path (already loaded): < 1ms
            - Cold path (first load): 10-30 seconds depending on index size
            - Concurrent path (waiting): Remaining time of the in-progress load

        Raises:
        ------
            RuntimeError: If resource loading fails. The error is logged and
                wrapped in a new RuntimeError, with the original exception
                chained as its cause. The manager remains in unloaded state
                for retry.

        Example:
        -------
            >>> manager = get_resource_manager()
            >>> await manager.ensure_loaded()  # May take 10-30s on cold start
            >>> await manager.ensure_loaded()  # Returns immediately (cached)

        Note:
        ----
            This method is idempotent - calling it multiple times is safe.
            After the first successful load, subsequent calls return immediately.

            If loading fails, _loaded remains False and the next call will
            attempt to load again. This provides automatic retry behavior.

        """
        # =====================================================================
        # Fast Path: Already Loaded
        # =====================================================================
        # Check _loaded without lock for fast path. This is safe because
        # _loaded only transitions from False to True, never back.
        # =====================================================================
        if self._loaded:
            logger.debug("Resources already loaded (fast path)")
            return

        # =====================================================================
        # Acquire Lock for Loading
        # =====================================================================
        # Use asyncio.Lock to ensure only one coroutine loads at a time.
        # Other coroutines wait here until the lock is released.
        # =====================================================================
        async with self._load_lock:
            # =================================================================
            # Double-Check After Acquiring Lock
            # =================================================================
            # Another coroutine may have completed loading while we waited.
            # Check again inside the lock to avoid redundant loading.
            # =================================================================
            if self._loaded:
                logger.debug("Resources loaded by another coroutine (waited)")
                return

            # =================================================================
            # Perform Loading
            # =================================================================
            # We hold the lock, so we are the only one loading.
            # Set _loading flag for observability (not strictly necessary
            # with the lock, but useful for debugging/monitoring).
            # =================================================================
            self._loading = True
            self._load_start_time = time.perf_counter()

            try:
                # Load all resources
                await self._load_resources()

                # =============================================================
                # Record Metrics
                # =============================================================
                # Calculate load duration and memory usage for monitoring.
                # These are logged and exposed via get_load_stats().
                # =============================================================
                load_end_time = time.perf_counter()
                self._load_duration_ms = int(
                    (load_end_time - self._load_start_time) * 1000
                )
                self._memory_mb = self._get_memory_usage_mb()

                # Log the metrics with appropriate severity
                # Cold start > 30s is concerning, log as warning
                if self._load_duration_ms > _COLD_START_WARNING_THRESHOLD_MS:
                    logger.warning(
                        "Resources loaded in %d ms (exceeds 30s target), "
                        "memory: %.2f MB",
                        self._load_duration_ms,
                        self._memory_mb,
                    )
                else:
                    logger.info(
                        "Resources loaded in %d ms, memory: %.2f MB",
                        self._load_duration_ms,
                        self._memory_mb,
                    )

                # Mark as loaded (success)
                self._loaded = True

            except Exception as e:
                # =============================================================
                # Handle Loading Failure
                # =============================================================
                # Log the error and re-raise. Keep _loaded as False so that
                # subsequent calls will retry loading.
                # =============================================================
                elapsed_ms = int((time.perf_counter() - self._load_start_time) * 1000)
                logger.exception(
                    "Failed to load resources after %d ms",
                    elapsed_ms,
                )
                msg = f"Failed to load resources: {e}"
                raise RuntimeError(msg) from e

            finally:
                # Always clear the loading flag
                self._loading = False

    def is_ready(self) -> bool:
        """Check if resources are loaded and ready for requests.

        This method is used by health check endpoints to report whether the
        application is ready to serve requests. An application is ready when:
            - Resources have been loaded successfully
            - Retriever is available for queries

        Returns:
        -------
            True if resources are loaded and ready, False otherwise.

        Example:
        -------
            >>> manager = get_resource_manager()
            >>> manager.is_ready()
            False  # Not yet loaded
            >>> await manager.ensure_loaded()
            >>> manager.is_ready()
            True  # Now ready

        Note:
        ----
            This method does NOT trigger loading. Use ensure_loaded() to
            trigger lazy loading. This method only checks current state.

            The ready state is used by:
                - /health/ready endpoint for Kubernetes readiness probes
                - Load balancers to determine if instance can serve traffic

        """
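        # Check both conditions listed in the docstring: loading succeeded
        # and the retriever is actually available.
        return self._loaded and self._retriever is not None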

    def get_retriever(self) -> RetrieverWithReranker:
        """Get the cached retriever instance.

        Returns the RetrieverWithReranker that was loaded by ensure_loaded().
        This is used by query handlers to retrieve relevant documents.

        Returns:
        -------
            The cached RetrieverWithReranker instance.

        Raises:
        ------
            RuntimeError: If called before ensure_loaded() completes.
                Always call ensure_loaded() first.

        Example:
        -------
            >>> manager = get_resource_manager()
            >>> await manager.ensure_loaded()
            >>> retriever = manager.get_retriever()
            >>> results = retriever.retrieve("What is PMV?", top_k=5)

        Note:
        ----
            This method does NOT trigger loading. It returns the cached
            instance or raises an error if not loaded.

            The retriever performs additional lazy loading on first retrieve()
            call (encoder model). This is handled internally by the retriever.

        """
        if self._retriever is None:
            msg = (
                "Retriever not loaded. Call ensure_loaded() first. "
                "This error indicates a programming bug - ensure_loaded() "
                "should be called before accessing resources."
            )
            raise RuntimeError(msg)

        return self._retriever

    def get_settings(self) -> Settings:
        """Get the cached settings instance.

        Returns the Settings that were loaded by ensure_loaded().
        This provides access to application configuration.

        Returns:
        -------
            The cached Settings instance.

        Raises:
        ------
            RuntimeError: If called before ensure_loaded() completes.
                Always call ensure_loaded() first.

        Example:
        -------
            >>> manager = get_resource_manager()
            >>> await manager.ensure_loaded()
            >>> settings = manager.get_settings()
            >>> print(f"Using top_k={settings.top_k}")

        Note:
        ----
            This method does NOT trigger loading. It returns the cached
            instance or raises an error if not loaded.

            For settings access before loading, create a new Settings()
            instance directly (but prefer using the cached one when available).

        """
        if self._settings is None:
            msg = (
                "Settings not loaded. Call ensure_loaded() first. "
                "This error indicates a programming bug - ensure_loaded() "
                "should be called before accessing resources."
            )
            raise RuntimeError(msg)

        return self._settings

    def get_load_stats(self) -> dict[str, int | float | bool | None]:
        """Get loading statistics for monitoring and debugging.

        Returns a dictionary with metrics about resource loading:
            - loaded: Whether resources are loaded
            - loading: Whether loading is in progress
            - load_duration_ms: Time taken to load (ms), None if not loaded
            - memory_mb: Memory usage after loading (MB), None if not loaded

        Returns:
        -------
            Dictionary with loading statistics.

        Example:
        -------
            >>> manager = get_resource_manager()
            >>> stats = manager.get_load_stats()
            >>> # Before loading: loaded=False, loading=False
            >>> await manager.ensure_loaded()
            >>> stats = manager.get_load_stats()
            >>> # After loading: loaded=True, load_duration_ms=15234

        Note:
        ----
            This is primarily used for:
                - Health check endpoints to report startup metrics
                - Debugging slow startups
                - Monitoring memory consumption

        """
        return {
            "loaded": self._loaded,
            "loading": self._loading,
            "load_duration_ms": self._load_duration_ms,
            "memory_mb": self._memory_mb,
        }

    async def shutdown(self) -> None:
        """Clean up resources on application shutdown.

        This method is called during application shutdown to release resources
        and perform cleanup tasks:
            - Clear cached retriever reference
            - Clear cached settings reference
            - Log shutdown metrics

        The cleanup allows garbage collection of heavy objects (FAISS index,
        encoder model, etc.) and ensures clean shutdown.

        Note:
        ----
            After shutdown(), the manager can be reloaded by calling
            ensure_loaded() again. This supports restart scenarios.

            This method should be called from the application lifespan
            context manager's shutdown phase.

        Example:
        -------
            >>> manager = get_resource_manager()
            >>> await manager.ensure_loaded()
            >>> # ... serve requests ...
            >>> await manager.shutdown()  # Clean up on exit

        See Also:
        --------
            _lifespan in src/rag_chatbot/api/main.py for integration.

        """
        logger.info("Shutting down ResourceManager...")

        # Log final stats before cleanup
        if self._loaded:
            logger.info(
                "Final resource stats: load_duration=%s ms, memory=%s MB",
                self._load_duration_ms,
                self._memory_mb,
            )

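        # Clear cached resources to allow garbage collection
        self._retriever = None
        self._settings = None
        self._loaded = False
        self._loading = False

        # Reset metrics so get_load_stats() reports None while unloaded,
        # matching its documented contract
        self._load_duration_ms = None
        self._memory_mb = None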

        logger.info("ResourceManager shutdown complete")


# =============================================================================
# Singleton Accessor
# =============================================================================


def get_resource_manager() -> ResourceManager:
    """Get or create the singleton ResourceManager instance.

    This function provides access to the global ResourceManager singleton.
    On first call, it creates the ResourceManager. Subsequent calls return
    the same instance.

    Returns:
    -------
        The singleton ResourceManager instance.

    Example:
    -------
        >>> manager1 = get_resource_manager()
        >>> manager2 = get_resource_manager()
        >>> manager1 is manager2
        True

    Note:
    ----
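        This function is intended to be called from the event loop thread.
        The check-then-create step is not guarded by a lock, so concurrent
        first calls from multiple OS threads could race. The ResourceManager
        itself uses asyncio.Lock for loading, which serializes concurrent
        coroutines on one event loop; it is not a thread lock.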

        The singleton pattern ensures:
            - Shared state across route handlers
            - Resources loaded only once
            - Consistent metrics tracking

    """
    global _resource_manager  # noqa: PLW0603

    if _resource_manager is None:
        _resource_manager = ResourceManager()
        logger.debug("Created ResourceManager singleton")

    return _resource_manager
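

# =============================================================================
# Illustrative Sketch (not used by the application)
# =============================================================================
# ResourceManager is described above as using asyncio.Lock for loading. The
# stand-alone sketch below shows, under stated assumptions, why that pattern
# yields a single load when several coroutines call ensure_loaded() at once,
# and why a failed load can be retried on the next call.
#
# Assumptions: _LazyLoadSketch, _sketch_concurrent_callers, fail_first, and
# load_calls are hypothetical names invented for this illustration. The real
# ResourceManager.ensure_loaded() may differ in detail; only the _loaded flag
# and the use of asyncio.Lock are taken from this module.
import asyncio  # repeated here so the sketch is self-contained


class _LazyLoadSketch:
    """Minimal stand-in showing lock-guarded lazy loading (hypothetical)."""

    def __init__(self, fail_first: bool = False) -> None:
        self._lock = asyncio.Lock()
        self._loaded = False
        self._fail_next = fail_first  # simulate one failing load attempt
        self.load_calls = 0  # how many times the "heavy" load actually ran

    async def ensure_loaded(self) -> None:
        # Fast path: skip the lock entirely once loading has succeeded.
        if self._loaded:
            return
        async with self._lock:
            # Re-check: another coroutine may have finished loading while
            # this one was waiting for the lock.
            if self._loaded:
                return
            self.load_calls += 1
            await asyncio.sleep(0)  # stand-in for the real loading work
            if self._fail_next:
                self._fail_next = False
                # _loaded stays False, so the next call retries.
                msg = "simulated load failure"
                raise RuntimeError(msg)
            self._loaded = True


async def _sketch_concurrent_callers(n: int = 5) -> int:
    """Run n concurrent ensure_loaded() calls and return the load count."""
    loader = _LazyLoadSketch()
    await asyncio.gather(*(loader.ensure_loaded() for _ in range(n)))
    return loader.load_calls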