MasuRii committed
Commit 31c3d36 · 1 Parent(s): 4dfb828

feat: add runtime resilience for file deletion survival


Implement graceful degradation patterns that allow the proxy to continue
running even if core files are deleted at runtime. File changes only take
effect on restart, enabling safe development while the proxy is serving
traffic.

## Changes by Component

### Usage Manager (usage_manager.py)
- Wrap `_save_usage()` in try/except with directory auto-recreation
- Enhance `_load_usage()` with explicit error handling
- In-memory state continues working if file operations fail

### Failure Logger (failure_logger.py)
- Add module-level `_file_handler` and `_fallback_mode` state
- Create `_create_file_handler()` with directory auto-recreation
- Create `_ensure_handler_valid()` for handler recovery
- Use NullHandler as fallback when file logging fails

### Detailed Logger (detailed_logger.py)
- Add class-level `_disk_available` and `_console_fallback_warned` flags
- Add instance-level `_in_memory_logs` list for fallback storage
- Skip disk writes gracefully when filesystem unavailable

### Google OAuth Base (google_oauth_base.py)
- Update memory cache FIRST before disk write (memory-first pattern)
- Use cached tokens as fallback when refresh/save fails
- Log warnings but don't crash on persistence failures

### Provider Cache (provider_cache.py)
- Add `_disk_available` health flag and `disk_errors` counter
- Track disk health status in get_stats()
- Gracefully degrade to memory-only caching on disk failures

### Documentation (DOCUMENTATION.md)
- Add Section 5: Runtime Resilience with resilience hierarchy
- Document "Develop While Running" workflow
- Explain graceful degradation and data loss scenarios
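
All of the components above share one pattern: update in-memory state first, then attempt disk persistence with directory auto-recreation, and log rather than raise on failure. A minimal standalone sketch of that pattern (the names `save_state`, `cache`, and the paths are illustrative, not the project's actual API):

```python
import json
import logging
import os
import tempfile


def save_state(state: dict, path: str, cache: dict) -> bool:
    """Memory-first persistence: the cache is updated unconditionally."""
    cache[path] = state
    try:
        # Directory auto-recreation before every write
        os.makedirs(os.path.dirname(os.path.abspath(path)), exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(state, f)
        return True
    except (OSError, PermissionError) as e:
        # Fail silently, log loudly: the in-memory state keeps working
        logging.warning("Persist failed for %s: %s (memory-only)", path, e)
        return False


cache: dict = {}
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "nested", "state.json")
    ok = save_state({"count": 1}, target, cache)
```

Even when the disk write fails, the caller still sees the updated entry in `cache`, which is the property the components below rely on.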

DOCUMENTATION.md CHANGED

@@ -697,4 +697,35 @@ To facilitate robust debugging, the proxy includes a comprehensive transaction l
 
 This level of detail allows developers to trace exactly why a request failed or why a specific key was rotated.
 
+---
+
+## 5. Runtime Resilience
+
+The proxy is engineered to maintain high availability even in the face of runtime filesystem disruptions. This "Runtime Resilience" capability ensures that the service continues to process API requests even if core data directories (like `logs/`, `oauth_creds/`) or files are accidentally deleted or become unwritable while the application is running.
+
+### 5.1. Resilience Hierarchy
+
+The system follows a strict hierarchy of survival:
+
+1. **Core API Handling (Level 1)**: The Python runtime keeps all necessary code in memory (`sys.modules`). Deleting source code files while the proxy is running will **not** crash active requests.
+2. **Credential Management (Level 2)**: OAuth tokens are aggressively cached in memory. If credential files are deleted, the proxy continues using the cached tokens. If a token needs a refresh and the file cannot be written, the new token is updated in memory only.
+3. **Usage Tracking (Level 3)**: Usage statistics (`key_usage.json`) are maintained in memory. If the file is deleted, the system tracks usage internally and attempts to recreate the file and directory on the next save interval. If the save fails, data is effectively memory-only until the next successful write.
+4. **Logging (Level 4)**: Logging is treated as non-critical. If the `logs/` directory is removed, the system attempts to recreate it. If creation fails (e.g., a permission error), logging degrades gracefully (stops or falls back to the console) without interrupting the request flow.
+
+### 5.2. "Develop While Running"
+
+This architecture supports a robust development workflow:
+
+* **Log Cleanup**: You can safely run `rm -rf logs/` while the proxy is serving traffic. The system will simply recreate the directory structure on the next request.
+* **Config Reset**: Deleting `key_usage.json` resets the persistence layer, but the running instance preserves its current in-memory counts to ensure load-balancing consistency.
+* **File Recovery**: If you delete a critical file, the system attempts **Directory Auto-Recreation** before every write operation.
+
+### 5.3. Graceful Degradation & Data Loss
+
+While functionality is preserved, persistence may be compromised during filesystem failures:
+
+* **Logs**: If disk writes fail, detailed request logs may be lost (unless console fallback is active).
+* **Usage Stats**: If `key_usage.json` cannot be written, usage data recorded since the last successful save will be lost upon application restart.
+* **Credentials**: Refreshed tokens held only in memory will require re-authentication after a restart if they cannot be persisted to disk.
+
src/proxy_app/detailed_logger.py CHANGED

@@ -13,6 +13,10 @@ class DetailedLogger:
     """
     Logs comprehensive details of each API transaction to a unique, timestamped directory.
     """
+    # Class-level fallback flags for resilience
+    _disk_available = True
+    _console_fallback_warned = False
+
     def __init__(self):
         """
         Initializes the logger for a single request, creating a unique directory to store all related log files.
@@ -21,16 +25,33 @@ class DetailedLogger:
         self.request_id = str(uuid.uuid4())
         timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
         self.log_dir = DETAILED_LOGS_DIR / f"{timestamp}_{self.request_id}"
-        self.log_dir.mkdir(parents=True, exist_ok=True)
         self.streaming = False
+        self._in_memory_logs = []  # Fallback storage
+
+        # Attempt directory creation with resilience
+        try:
+            self.log_dir.mkdir(parents=True, exist_ok=True)
+            DetailedLogger._disk_available = True
+        except (OSError, PermissionError) as e:
+            DetailedLogger._disk_available = False
+            if not DetailedLogger._console_fallback_warned:
+                logging.warning(f"Detailed logging disabled - cannot create log directory: {e}")
+                DetailedLogger._console_fallback_warned = True
 
     def _write_json(self, filename: str, data: Dict[str, Any]):
         """Helper to write data to a JSON file in the log directory."""
+        if not DetailedLogger._disk_available:
+            self._in_memory_logs.append({"file": filename, "data": data})
+            return
+
         try:
+            # Attempt directory recreation if needed
+            self.log_dir.mkdir(parents=True, exist_ok=True)
             with open(self.log_dir / filename, "w", encoding="utf-8") as f:
                 json.dump(data, f, indent=4, ensure_ascii=False)
-        except Exception as e:
+        except (OSError, PermissionError, IOError) as e:
             logging.error(f"[{self.request_id}] Failed to write to {filename}: {e}")
+            self._in_memory_logs.append({"file": filename, "data": data})
 
     def log_request(self, headers: Dict[str, Any], body: Dict[str, Any]):
         """Logs the initial request details."""
@@ -45,14 +66,18 @@ class DetailedLogger:
 
     def log_stream_chunk(self, chunk: Dict[str, Any]):
         """Logs an individual chunk from a streaming response to a JSON Lines file."""
+        if not DetailedLogger._disk_available:
+            return  # Skip chunk logging when disk unavailable
+
         try:
+            self.log_dir.mkdir(parents=True, exist_ok=True)
             log_entry = {
                 "timestamp_utc": datetime.utcnow().isoformat(),
                 "chunk": chunk
             }
             with open(self.log_dir / "streaming_chunks.jsonl", "a", encoding="utf-8") as f:
                 f.write(json.dumps(log_entry, ensure_ascii=False) + "\n")
-        except Exception as e:
+        except (OSError, PermissionError, IOError) as e:
             logging.error(f"[{self.request_id}] Failed to write stream chunk: {e}")
 
     def log_final_response(self, status_code: int, headers: Optional[Dict[str, Any]], body: Dict[str, Any]):
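
The class-level flag in `DetailedLogger` means one failed health check silences disk writes for every subsequent request until a write succeeds again. A hedged standalone sketch of that behavior (simplified class, illustrative names, not the proxy's actual module):

```python
import logging
import tempfile
from pathlib import Path


class ResilientLogger:
    # Class-level flags shared across all instances (mirrors the pattern)
    _disk_available = True
    _warned = False

    def __init__(self, log_dir: Path):
        self.log_dir = log_dir
        self._in_memory_logs = []  # fallback storage
        try:
            self.log_dir.mkdir(parents=True, exist_ok=True)
            ResilientLogger._disk_available = True
        except (OSError, PermissionError) as e:
            ResilientLogger._disk_available = False
            if not ResilientLogger._warned:
                logging.warning("Disk logging disabled: %s", e)
                ResilientLogger._warned = True

    def write(self, name: str, text: str) -> str:
        if not ResilientLogger._disk_available:
            self._in_memory_logs.append((name, text))
            return "memory"
        try:
            self.log_dir.mkdir(parents=True, exist_ok=True)  # recreate if deleted
            (self.log_dir / name).write_text(text, encoding="utf-8")
            return "disk"
        except OSError:
            self._in_memory_logs.append((name, text))
            return "memory"


with tempfile.TemporaryDirectory() as tmp:
    log = ResilientLogger(Path(tmp) / "logs" / "req1")
    first = log.write("request.json", "{}")
    ResilientLogger._disk_available = False  # simulate a failed health check
    second = log.write("response.json", "{}")
```

The `mkdir(..., exist_ok=True)` before each write is the "Directory Auto-Recreation" step: deleting the log directory mid-run only costs one recreate, not a crash.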
src/rotator_library/failure_logger.py CHANGED

@@ -5,41 +5,76 @@ import os
 from datetime import datetime
 from .error_handler import mask_credential
 
+# Module-level state for resilience
+_file_handler = None
+_fallback_mode = False
 
-def setup_failure_logger():
-    """Sets up a dedicated JSON logger for writing detailed failure logs to a file."""
-    log_dir = "logs"
-    if not os.path.exists(log_dir):
-        os.makedirs(log_dir)
 
-    # Create a logger specifically for failures.
-    # This logger will NOT propagate to the root logger.
-    logger = logging.getLogger("failure_logger")
-    logger.setLevel(logging.INFO)
-    logger.propagate = False
+# Custom JSON formatter for structured logs (defined at module level for reuse)
+class JsonFormatter(logging.Formatter):
+    def format(self, record):
+        # The message is already a dict, so we just format it as a JSON string
+        return json.dumps(record.msg)
 
-    # Use a rotating file handler
-    handler = RotatingFileHandler(
-        os.path.join(log_dir, "failures.log"),
-        maxBytes=5 * 1024 * 1024,  # 5 MB
-        backupCount=2,
-    )
 
-    # Custom JSON formatter for structured logs
-    class JsonFormatter(logging.Formatter):
-        def format(self, record):
-            # The message is already a dict, so we just format it as a JSON string
-            return json.dumps(record.msg)
+def _create_file_handler():
+    """Create file handler with directory auto-recreation."""
+    global _file_handler, _fallback_mode
+    log_dir = "logs"
+
+    try:
+        if not os.path.exists(log_dir):
+            os.makedirs(log_dir, exist_ok=True)
+
+        handler = RotatingFileHandler(
+            os.path.join(log_dir, "failures.log"),
+            maxBytes=5 * 1024 * 1024,  # 5 MB
+            backupCount=2,
+        )
+
+        handler.setFormatter(JsonFormatter())
+        _file_handler = handler
+        _fallback_mode = False
+        return handler
+    except (OSError, PermissionError, IOError) as e:
+        logging.warning(f"Cannot create failure log file handler: {e}")
+        _fallback_mode = True
+        return None
 
-    handler.setFormatter(JsonFormatter())
 
-    # Add handler only if it hasn't been added before
-    if not logger.handlers:
+def setup_failure_logger():
+    """Sets up a dedicated JSON logger for writing detailed failure logs."""
+    logger = logging.getLogger("failure_logger")
+    logger.setLevel(logging.INFO)
+    logger.propagate = False
+
+    # Remove existing handlers to prevent duplicates
+    logger.handlers.clear()
+
+    # Try to add file handler
+    handler = _create_file_handler()
+    if handler:
         logger.addHandler(handler)
-
+
+    # Always add a NullHandler as fallback to prevent "no handlers" warning
+    if not logger.handlers:
+        logger.addHandler(logging.NullHandler())
+
     return logger
 
 
+def _ensure_handler_valid():
+    """Check if file handler is still valid, recreate if needed."""
+    global _file_handler, _fallback_mode
+
+    if _file_handler is None or _fallback_mode:
+        handler = _create_file_handler()
+        if handler:
+            failure_logger = logging.getLogger("failure_logger")
+            failure_logger.handlers.clear()
+            failure_logger.addHandler(handler)
+
+
 # Initialize the dedicated logger for detailed failure logs
 failure_logger = setup_failure_logger()
 
@@ -145,11 +180,23 @@ def log_failure(
         "request_headers": request_headers,
         "error_chain": error_chain if len(error_chain) > 1 else None,
     }
-    failure_logger.error(detailed_log_data)
-
+
     # 2. Log a concise summary to the main library logger, which will propagate
     summary_message = (
         f"API call failed for model {model} with key {mask_credential(api_key)}. "
         f"Error: {type(error).__name__}. See failures.log for details."
     )
+
+    # Attempt to ensure handler is valid before logging
+    _ensure_handler_valid()
+
+    # Wrap the actual log call with resilience
+    try:
+        failure_logger.error(detailed_log_data)
+    except (OSError, IOError) as e:
+        # File logging failed - log to console instead
+        logging.error(f"Failed to write to failures.log: {e}")
+        logging.error(f"Failure summary: {summary_message}")
+
+    # Console log always succeeds
     main_lib_logger.error(summary_message)
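
The `NullHandler` fallback in `setup_failure_logger()` is a standard `logging` idiom: if no real handler can be attached, a `logging.NullHandler` absorbs records instead of triggering the "no handlers could be found" warning. A minimal sketch of that setup path (the factory name and logger name are illustrative):

```python
import logging


def build_logger(handler_factory):
    """Attach a file handler if the factory succeeds, else a NullHandler."""
    logger = logging.getLogger("sketch_failure_logger")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()  # avoid duplicates across repeated setup calls
    try:
        logger.addHandler(handler_factory())
    except OSError:
        pass  # disk unavailable: fall through to the NullHandler
    if not logger.handlers:
        logger.addHandler(logging.NullHandler())
    return logger


def broken_factory():
    # Simulates RotatingFileHandler failing because logs/ cannot be created
    raise OSError("logs/ is not writable")


log = build_logger(broken_factory)
log.error({"event": "still safe to call"})  # absorbed by NullHandler
```

Because the logger always has at least one handler, callers like `log_failure()` never need to check whether file logging actually succeeded.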
src/rotator_library/providers/google_oauth_base.py CHANGED

@@ -260,64 +260,76 @@ class GoogleOAuthBase:
         )
 
     async def _save_credentials(self, path: str, creds: Dict[str, Any]):
+        """Save credentials with in-memory fallback if disk unavailable.
+
+        [RUNTIME RESILIENCE] Always updates the in-memory cache first (memory is
+        reliable), then attempts disk persistence. If the disk write fails, logs a
+        warning but does NOT raise an exception - the in-memory state continues to work.
+        """
+        # [IN-MEMORY FIRST] Always update cache first (reliable)
+        self._credentials_cache[path] = creds
+
         # Don't save to file if credentials were loaded from environment
         if creds.get("_proxy_metadata", {}).get("loaded_from_env"):
             lib_logger.debug("Credentials loaded from env, skipping file save")
-            # Still update cache for in-memory consistency
-            self._credentials_cache[path] = creds
             return
 
-        # [ATOMIC WRITE] Use tempfile + move pattern to ensure atomic writes
-        # This prevents credential corruption if the process is interrupted during write
-        parent_dir = os.path.dirname(os.path.abspath(path))
-        os.makedirs(parent_dir, exist_ok=True)
-
-        tmp_fd = None
-        tmp_path = None
         try:
-            # Create temp file in same directory as target (ensures same filesystem)
-            tmp_fd, tmp_path = tempfile.mkstemp(
-                dir=parent_dir, prefix=".tmp_", suffix=".json", text=True
-            )
-
-            # Write JSON to temp file
-            with os.fdopen(tmp_fd, "w") as f:
-                json.dump(creds, f, indent=2)
-            tmp_fd = None  # fdopen closes the fd
-
-            # Set secure permissions (0600 = owner read/write only)
-            try:
-                os.chmod(tmp_path, 0o600)
-            except (OSError, AttributeError):
-                # Windows may not support chmod, ignore
-                pass
-
-            # Atomic move (overwrites target if it exists)
-            shutil.move(tmp_path, path)
-            tmp_path = None  # Successfully moved
-
-            # Update cache AFTER successful file write (prevents cache/file inconsistency)
-            self._credentials_cache[path] = creds
-            lib_logger.debug(
-                f"Saved updated {self.ENV_PREFIX} OAuth credentials to '{path}' (atomic write)."
-            )
-
-        except Exception as e:
-            lib_logger.error(
-                f"Failed to save updated {self.ENV_PREFIX} OAuth credentials to '{path}': {e}"
-            )
-            # Clean up temp file if it still exists
-            if tmp_fd is not None:
-                try:
-                    os.close(tmp_fd)
-                except:
-                    pass
-            if tmp_path and os.path.exists(tmp_path):
-                try:
-                    os.unlink(tmp_path)
-                except:
-                    pass
-            raise
+            # [ATOMIC WRITE] Use tempfile + move pattern to ensure atomic writes
+            # This prevents credential corruption if the process is interrupted during write
+            parent_dir = os.path.dirname(os.path.abspath(path))
+            os.makedirs(parent_dir, exist_ok=True)
+
+            tmp_fd = None
+            tmp_path = None
+            try:
+                # Create temp file in same directory as target (ensures same filesystem)
+                tmp_fd, tmp_path = tempfile.mkstemp(
+                    dir=parent_dir, prefix=".tmp_", suffix=".json", text=True
+                )
+
+                # Write JSON to temp file
+                with os.fdopen(tmp_fd, "w") as f:
+                    json.dump(creds, f, indent=2)
+                tmp_fd = None  # fdopen closes the fd
+
+                # Set secure permissions (0600 = owner read/write only)
+                try:
+                    os.chmod(tmp_path, 0o600)
+                except (OSError, AttributeError):
+                    # Windows may not support chmod, ignore
+                    pass
+
+                # Atomic move (overwrites target if it exists)
+                shutil.move(tmp_path, path)
+                tmp_path = None  # Successfully moved
+
+                lib_logger.debug(
+                    f"Saved updated {self.ENV_PREFIX} OAuth credentials to '{path}' (atomic write)."
+                )
+
+            except Exception:
+                # Clean up temp file if it still exists
+                if tmp_fd is not None:
+                    try:
+                        os.close(tmp_fd)
+                    except OSError:
+                        pass
+                if tmp_path and os.path.exists(tmp_path):
+                    try:
+                        os.unlink(tmp_path)
+                    except OSError:
+                        pass
+                raise
+
+        except (OSError, PermissionError, IOError) as e:
+            # [FAIL SILENTLY, LOG LOUDLY] Log the error but don't crash
+            # The in-memory cache was already updated, so we can continue operating
+            lib_logger.warning(
+                f"Failed to save credentials to {path}: {e}. "
+                "Credentials cached in memory only (will be lost on restart)."
+            )
+            # Don't raise - we already updated the memory cache
 
     def _is_token_expired(self, creds: Dict[str, Any]) -> bool:
         expiry = creds.get("token_expiry")  # gcloud format
@@ -841,10 +853,39 @@ class GoogleOAuthBase:
         )
 
     async def get_auth_header(self, credential_path: str) -> Dict[str, str]:
-        creds = await self._load_credentials(credential_path)
-        if self._is_token_expired(creds):
-            creds = await self._refresh_token(credential_path, creds)
-        return {"Authorization": f"Bearer {creds['access_token']}"}
+        """Get auth header with graceful degradation if refresh fails.
+
+        [RUNTIME RESILIENCE] If the credential file is deleted or the refresh fails,
+        attempts to use cached credentials. This allows the proxy to continue
+        operating with potentially stale tokens rather than crashing.
+        """
+        try:
+            creds = await self._load_credentials(credential_path)
+            if self._is_token_expired(creds):
+                try:
+                    creds = await self._refresh_token(credential_path, creds)
+                except Exception as e:
+                    # [CACHED TOKEN FALLBACK] Check if we have a cached token that might still work
+                    cached = self._credentials_cache.get(credential_path)
+                    if cached and cached.get("access_token"):
+                        lib_logger.warning(
                            f"Token refresh failed for {Path(credential_path).name}: {e}. "
                            "Using cached token (may be expired)."
                        )
+                        creds = cached
+                    else:
+                        raise
+            return {"Authorization": f"Bearer {creds['access_token']}"}
+        except Exception as e:
+            # [FINAL FALLBACK] Check if any cached credential exists as last resort
+            cached = self._credentials_cache.get(credential_path)
+            if cached and cached.get("access_token"):
+                lib_logger.error(
+                    f"Credential load failed for {credential_path}: {e}. "
+                    "Using stale cached token as last resort."
+                )
+                return {"Authorization": f"Bearer {cached['access_token']}"}
+            raise
 
     async def get_user_info(
         self, creds_or_path: Union[Dict[str, Any], str]
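
The two-level fallback in `get_auth_header` reduces to: try load/refresh, and at each failure point fall back to any cached token. A synchronous sketch with injected callables (all names here are illustrative, not the class's real signature):

```python
def get_auth_header(path, cache, load, refresh, is_expired):
    """Cached-token fallback chain, mirroring the pattern in the diff."""
    try:
        creds = load(path)
        if is_expired(creds):
            try:
                creds = refresh(path, creds)
            except Exception:
                cached = cache.get(path)
                if cached and cached.get("access_token"):
                    creds = cached  # possibly stale, but keeps the proxy serving
                else:
                    raise
        return {"Authorization": f"Bearer {creds['access_token']}"}
    except Exception:
        cached = cache.get(path)  # last resort: any cached credential
        if cached and cached.get("access_token"):
            return {"Authorization": f"Bearer {cached['access_token']}"}
        raise


def load_deleted_file(path):
    # Simulates the credential file being deleted at runtime
    raise FileNotFoundError(path)


cache = {"creds.json": {"access_token": "cached-token"}}
header = get_auth_header("creds.json", cache, load_deleted_file,
                         refresh=None, is_expired=lambda c: False)
```

Deleting the file only degrades freshness: the last refreshed token stays usable from memory until the process restarts.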
src/rotator_library/providers/provider_cache.py CHANGED

@@ -104,7 +104,10 @@ class ProviderCache:
         self._running = False
 
         # Statistics
-        self._stats = {"memory_hits": 0, "disk_hits": 0, "misses": 0, "writes": 0}
+        self._stats = {"memory_hits": 0, "disk_hits": 0, "misses": 0, "writes": 0, "disk_errors": 0}
+
+        # [RUNTIME RESILIENCE] Track disk health for monitoring
+        self._disk_available = True
 
         # Metadata about this cache instance
         self._cache_name = cache_file.stem if cache_file else "unnamed"
@@ -171,13 +174,27 @@ class ProviderCache:
     # =========================================================================
 
     async def _save_to_disk(self) -> None:
-        """Persist cache to disk using atomic write."""
+        """Persist cache to disk using atomic write with health tracking.
+
+        [RUNTIME RESILIENCE] Tracks disk health and records errors. If disk
+        operations fail, the memory cache continues to work. Health status
+        is available via get_stats() for monitoring.
+        """
         if not self._enable_disk:
             return
 
         try:
             async with self._disk_lock:
-                self._cache_file.parent.mkdir(parents=True, exist_ok=True)
+                # [DIRECTORY AUTO-RECREATION] Attempt to create directory
+                try:
+                    self._cache_file.parent.mkdir(parents=True, exist_ok=True)
+                except (OSError, PermissionError) as e:
+                    self._stats["disk_errors"] += 1
+                    self._disk_available = False
+                    lib_logger.warning(
+                        f"ProviderCache[{self._cache_name}]: Cannot create cache directory: {e}"
+                    )
+                    return
 
                 cache_data = {
                     "version": "1.0",
@@ -210,6 +227,8 @@ class ProviderCache:
 
                 shutil.move(tmp_path, self._cache_file)
                 self._stats["writes"] += 1
+                # [RUNTIME RESILIENCE] Mark disk as healthy on success
+                self._disk_available = True
                 lib_logger.debug(
                     f"ProviderCache[{self._cache_name}]: Saved {len(self._cache)} entries"
                 )
@@ -218,6 +237,9 @@ class ProviderCache:
                     os.unlink(tmp_path)
                 raise
         except Exception as e:
+            # [RUNTIME RESILIENCE] Track disk errors for monitoring
+            self._stats["disk_errors"] += 1
+            self._disk_available = False
             lib_logger.error(f"ProviderCache[{self._cache_name}]: Disk save failed: {e}")
 
     # =========================================================================
@@ -416,12 +438,17 @@ class ProviderCache:
             return False
 
     def get_stats(self) -> Dict[str, Any]:
-        """Get cache statistics."""
+        """Get cache statistics including disk health.
+
+        [RUNTIME RESILIENCE] Includes disk_available flag for monitoring
+        the health of disk persistence.
+        """
         return {
             **self._stats,
             "memory_entries": len(self._cache),
             "dirty": self._dirty,
-            "disk_enabled": self._enable_disk
+            "disk_enabled": self._enable_disk,
+            "disk_available": self._disk_available  # [RUNTIME RESILIENCE] Health indicator
        }
 
    async def clear(self) -> None:
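
The `disk_errors` counter and `_disk_available` flag make the degradation observable rather than silent. A reduced, synchronous sketch of the health-tracking logic (the injected writer and class name are illustrative):

```python
class MemoryFirstCache:
    """Memory-only degradation with a disk health flag, as in the diff."""

    def __init__(self, write_to_disk):
        self._cache = {}
        self._stats = {"writes": 0, "disk_errors": 0}
        self._disk_available = True
        self._write_to_disk = write_to_disk

    def put(self, key, value):
        self._cache[key] = value  # memory update always succeeds
        try:
            self._write_to_disk(key, value)
            self._stats["writes"] += 1
            self._disk_available = True  # mark healthy on success
        except OSError:
            self._stats["disk_errors"] += 1
            self._disk_available = False  # degrade to memory-only

    def get_stats(self):
        return {**self._stats,
                "memory_entries": len(self._cache),
                "disk_available": self._disk_available}


def failing_writer(key, value):
    # Simulates the cache directory being deleted at runtime
    raise OSError("cache directory deleted")


cache = MemoryFirstCache(failing_writer)
cache.put("models", ["gpt-x"])
stats = cache.get_stats()
```

A monitoring endpoint can surface `disk_available` and `disk_errors` from `get_stats()` to alert operators while the in-memory cache keeps serving lookups.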
src/rotator_library/usage_manager.py CHANGED

@@ -90,25 +90,56 @@ class UsageManager:
         self._initialized.set()
 
     async def _load_usage(self):
-        """Loads usage data from the JSON file asynchronously."""
+        """Loads usage data from the JSON file asynchronously with enhanced resilience.
+
+        [RUNTIME RESILIENCE] Handles various file system errors gracefully,
+        including the race condition where the file is deleted between the
+        exists check and the open.
+        """
         async with self._data_lock:
             if not os.path.exists(self.file_path):
                 self._usage_data = {}
                 return
+
             try:
                 async with aiofiles.open(self.file_path, "r") as f:
                     content = await f.read()
-                self._usage_data = json.loads(content)
-            except (json.JSONDecodeError, IOError, FileNotFoundError):
+                self._usage_data = json.loads(content) if content.strip() else {}
+            except FileNotFoundError:
+                # [RACE CONDITION HANDLING] File deleted between exists check and open
+                self._usage_data = {}
+            except json.JSONDecodeError as e:
+                lib_logger.warning(f"Corrupted usage file {self.file_path}: {e}. Starting fresh.")
+                self._usage_data = {}
+            except (OSError, PermissionError, IOError) as e:
+                lib_logger.warning(f"Cannot read usage file {self.file_path}: {e}. Using empty state.")
                 self._usage_data = {}
 
     async def _save_usage(self):
-        """Saves the current usage data to the JSON file asynchronously."""
+        """Saves the current usage data to the JSON file asynchronously with resilience.
+
+        [RUNTIME RESILIENCE] Wraps file operations in try/except to prevent crashes
+        if the file or directory is deleted during runtime. The in-memory state
+        continues to work even if disk persistence fails.
+        """
         if self._usage_data is None:
             return
-        async with self._data_lock:
-            async with aiofiles.open(self.file_path, "w") as f:
-                await f.write(json.dumps(self._usage_data, indent=2))
+
+        try:
+            async with self._data_lock:
+                # [DIRECTORY AUTO-RECREATION] Ensure directory exists before write
+                file_dir = os.path.dirname(os.path.abspath(self.file_path))
+                if file_dir and not os.path.exists(file_dir):
+                    os.makedirs(file_dir, exist_ok=True)
+
+                async with aiofiles.open(self.file_path, "w") as f:
+                    await f.write(json.dumps(self._usage_data, indent=2))
+        except (OSError, PermissionError, IOError) as e:
+            # [FAIL SILENTLY, LOG LOUDLY] Log the error but don't crash
+            # In-memory state is preserved and will continue to work
+            lib_logger.warning(
+                f"Failed to save usage data to {self.file_path}: {e}. "
+                "Data will be retained in memory but may be lost on restart."
+            )
 
     async def _reset_daily_stats_if_needed(self):
         """Checks if daily stats need to be reset for any key."""