feat: add runtime resilience for file deletion survival
Implement graceful degradation patterns that allow the proxy to continue
running even if core files are deleted during runtime. Changes only take
effect on restart, enabling safe development while the proxy is serving.
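
The shared pattern behind these changes can be sketched as follows (a minimal illustration with hypothetical names, not the proxy's actual code): memory is the source of truth, and disk persistence is best-effort.

```python
import logging

class ResilientStore:
    """Sketch of graceful degradation: in-memory state always works,
    disk persistence degrades without crashing."""

    def __init__(self):
        self.state = {}            # in-memory state (cannot fail)
        self.disk_available = True

    def set(self, key, value, write_to_disk):
        self.state[key] = value    # 1. update memory first
        try:
            write_to_disk(key, value)    # 2. best-effort persistence
            self.disk_available = True
        except OSError as e:
            self.disk_available = False  # 3. degrade, don't crash
            logging.warning("disk write failed: %s", e)

def broken_disk(key, value):
    raise OSError("disk gone")

store = ResilientStore()
store.set("k", 42, broken_disk)    # does not raise
print(store.state["k"], store.disk_available)   # → 42 False
```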
## Changes by Component
### Usage Manager (usage_manager.py)
- Wrap `_save_usage()` in try/except with directory auto-recreation
- Enhanced `_load_usage()` with explicit error handling
- In-memory state continues working if file operations fail
### Failure Logger (failure_logger.py)
- Add module-level `_file_handler` and `_fallback_mode` state
- Create `_create_file_handler()` with directory auto-recreation
- Create `_ensure_handler_valid()` for handler recovery
- Use NullHandler as fallback when file logging fails
### Detailed Logger (detailed_logger.py)
- Add class-level `_disk_available` and `_console_fallback_warned` flags
- Add instance-level `_in_memory_logs` list for fallback storage
- Skip disk writes gracefully when filesystem unavailable
### Google OAuth Base (google_oauth_base.py)
- Update memory cache FIRST before disk write (memory-first pattern)
- Use cached tokens as fallback when refresh/save fails
- Log warnings but don't crash on persistence failures
### Provider Cache (provider_cache.py)
- Add `_disk_available` health flag and `disk_errors` counter
- Track disk health status in get_stats()
- Gracefully degrade to memory-only caching on disk failures
### Documentation (DOCUMENTATION.md)
- Add Section 5: Runtime Resilience with resilience hierarchy
- Document "Develop While Running" workflow
- Explain graceful degradation and data loss scenarios
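
The failure-logger change relies on Python's `logging.NullHandler`, which silently discards records, so a logger that could not attach a file handler never raises or emits "no handlers" warnings. A minimal sketch (the setup function is hypothetical, not the module's exact code):

```python
import logging

def make_logger(file_handler=None):
    logger = logging.getLogger("sketch_failure_logger")
    logger.handlers.clear()
    logger.propagate = False
    if file_handler is not None:
        logger.addHandler(file_handler)
    else:
        # Fallback: discard records instead of crashing or warning
        logger.addHandler(logging.NullHandler())
    return logger

log = make_logger(file_handler=None)   # file logging unavailable
log.error("this record is silently dropped")   # no exception, no warning
```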
### DOCUMENTATION.md

```diff
@@ -697,4 +697,35 @@ To facilitate robust debugging, the proxy includes a comprehensive transaction l
 
 This level of detail allows developers to trace exactly why a request failed or why a specific key was rotated.
 
+---
+
+## 5. Runtime Resilience
+
+The proxy is engineered to maintain high availability even in the face of runtime filesystem disruptions. This "Runtime Resilience" capability ensures that the service continues to process API requests even if core data directories (like `logs/`, `oauth_creds/`) or files are accidentally deleted or become unwritable while the application is running.
+
+### 5.1. Resilience Hierarchy
+
+The system follows a strict hierarchy of survival:
+
+1. **Core API Handling (Level 1)**: The Python runtime keeps all necessary code in memory (`sys.modules`). Deleting source code files while the proxy is running will **not** crash active requests.
+2. **Credential Management (Level 2)**: OAuth tokens are aggressively cached in memory. If credential files are deleted, the proxy continues using the cached tokens. If a token needs refresh and the file cannot be written, the new token is updated in memory only.
+3. **Usage Tracking (Level 3)**: Usage statistics (`key_usage.json`) are maintained in memory. If the file is deleted, the system tracks usage internally. It attempts to recreate the file/directory on the next save interval. If the save fails, data is effectively "memory-only" until the next successful write.
+4. **Logging (Level 4)**: Logging is treated as non-critical. If the `logs/` directory is removed, the system attempts to recreate it. If creation fails (e.g., a permission error), logging degrades gracefully (stops or falls back to console) without interrupting the request flow.
+
+### 5.2. "Develop While Running"
+
+This architecture supports a robust development workflow:
+
+* **Log Cleanup**: You can safely run `rm -rf logs/` while the proxy is serving traffic. The system will simply recreate the directory structure on the next request.
+* **Config Reset**: Deleting `key_usage.json` resets the persistence layer, but the running instance preserves its current in-memory counts to ensure load balancing consistency.
+* **File Recovery**: If you delete a critical file, the system attempts **Directory Auto-Recreation** before every write operation.
+
+### 5.3. Graceful Degradation & Data Loss
+
+While functionality is preserved, persistence may be compromised during filesystem failures:
+
+* **Logs**: If disk writes fail, detailed request logs may be lost (unless console fallback is active).
+* **Usage Stats**: If `key_usage.json` cannot be written, usage data since the last successful save will be lost upon application restart.
+* **Credentials**: Refreshed tokens held only in memory will require re-authentication after a restart if they cannot be persisted to disk.
+
```
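
The "Directory Auto-Recreation" behavior described in the documentation amounts to calling `mkdir(parents=True, exist_ok=True)` immediately before each write, so a deleted directory is rebuilt transparently. A small self-contained demonstration (paths are temporary; this mirrors the idea, not the proxy's real layout):

```python
import json
import shutil
import tempfile
from pathlib import Path

def resilient_write(path: Path, payload: dict) -> None:
    # Recreate the parent directory if it was deleted since the last write
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload))

root = Path(tempfile.mkdtemp())
target = root / "logs" / "usage.json"

resilient_write(target, {"calls": 1})
shutil.rmtree(root / "logs")            # simulate `rm -rf logs/` at runtime
resilient_write(target, {"calls": 2})   # directory is rebuilt, write succeeds

result = json.loads(target.read_text())
print(result)                           # → {'calls': 2}
shutil.rmtree(root)
```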
### detailed_logger.py

```diff
@@ -13,6 +13,10 @@ class DetailedLogger:
     """
     Logs comprehensive details of each API transaction to a unique, timestamped directory.
     """
+    # Class-level fallback flags for resilience
+    _disk_available = True
+    _console_fallback_warned = False
+
     def __init__(self):
         """
         Initializes the logger for a single request, creating a unique directory to store all related log files.
@@ -21,16 +25,33 @@ class DetailedLogger:
         self.request_id = str(uuid.uuid4())
         timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
         self.log_dir = DETAILED_LOGS_DIR / f"{timestamp}_{self.request_id}"
-        self.log_dir.mkdir(parents=True, exist_ok=True)
         self.streaming = False
+        self._in_memory_logs = []  # Fallback storage
+
+        # Attempt directory creation with resilience
+        try:
+            self.log_dir.mkdir(parents=True, exist_ok=True)
+            DetailedLogger._disk_available = True
+        except (OSError, PermissionError) as e:
+            DetailedLogger._disk_available = False
+            if not DetailedLogger._console_fallback_warned:
+                logging.warning(f"Detailed logging disabled - cannot create log directory: {e}")
+                DetailedLogger._console_fallback_warned = True
 
     def _write_json(self, filename: str, data: Dict[str, Any]):
         """Helper to write data to a JSON file in the log directory."""
+        if not DetailedLogger._disk_available:
+            self._in_memory_logs.append({"file": filename, "data": data})
+            return
+
         try:
+            # Attempt directory recreation if needed
+            self.log_dir.mkdir(parents=True, exist_ok=True)
             with open(self.log_dir / filename, "w", encoding="utf-8") as f:
                 json.dump(data, f, indent=4, ensure_ascii=False)
-        except Exception as e:
+        except (OSError, PermissionError, IOError) as e:
             logging.error(f"[{self.request_id}] Failed to write to {filename}: {e}")
+            self._in_memory_logs.append({"file": filename, "data": data})
 
     def log_request(self, headers: Dict[str, Any], body: Dict[str, Any]):
         """Logs the initial request details."""
@@ -45,14 +66,18 @@ class DetailedLogger:
 
     def log_stream_chunk(self, chunk: Dict[str, Any]):
         """Logs an individual chunk from a streaming response to a JSON Lines file."""
+        if not DetailedLogger._disk_available:
+            return  # Skip chunk logging when disk unavailable
+
         try:
+            self.log_dir.mkdir(parents=True, exist_ok=True)
             log_entry = {
                 "timestamp_utc": datetime.utcnow().isoformat(),
                 "chunk": chunk
             }
             with open(self.log_dir / "streaming_chunks.jsonl", "a", encoding="utf-8") as f:
                 f.write(json.dumps(log_entry, ensure_ascii=False) + "\n")
-        except Exception as e:
+        except (OSError, PermissionError, IOError) as e:
             logging.error(f"[{self.request_id}] Failed to write stream chunk: {e}")
 
     def log_final_response(self, status_code: int, headers: Optional[Dict[str, Any]], body: Dict[str, Any]):
```
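
The detailed logger's in-memory fallback can be illustrated in isolation: when the disk flag is down, entries are buffered rather than written, so the request is never interrupted. This is a hypothetical class mirroring the idea, not the file's actual code:

```python
from typing import Any, Dict, List

class BufferingLogger:
    disk_available: bool = False   # assume the filesystem is gone

    def __init__(self) -> None:
        self.in_memory_logs: List[Dict[str, Any]] = []

    def write_json(self, filename: str, data: Dict[str, Any]) -> None:
        if not self.disk_available:
            # Keep the entry in memory so the caller never sees an error
            self.in_memory_logs.append({"file": filename, "data": data})
            return
        # (disk write path elided in this sketch)

logger = BufferingLogger()
logger.write_json("request.json", {"model": "example-model"})
print(len(logger.in_memory_logs))   # → 1
```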
### failure_logger.py

```diff
@@ -5,41 +5,76 @@ import os
 from datetime import datetime
 from .error_handler import mask_credential
 
-def setup_failure_logger():
-    """Sets up a dedicated JSON logger for writing detailed failure logs to a file."""
-    log_dir = "logs"
-    if not os.path.exists(log_dir):
-        os.makedirs(log_dir)
-
-    # Use a rotating file handler
-    handler = RotatingFileHandler(
-        os.path.join(log_dir, "failures.log"),
-        maxBytes=5 * 1024 * 1024,  # 5 MB
-        backupCount=2,
-    )
-
-    handler.setFormatter(JsonFormatter())
+# Module-level state for resilience
+_file_handler = None
+_fallback_mode = False
+
+
+# Custom JSON formatter for structured logs (defined at module level for reuse)
+class JsonFormatter(logging.Formatter):
+    def format(self, record):
+        # The message is already a dict, so we just format it as a JSON string
+        return json.dumps(record.msg)
+
+
+def _create_file_handler():
+    """Create file handler with directory auto-recreation."""
+    global _file_handler, _fallback_mode
+    log_dir = "logs"
+
+    try:
+        if not os.path.exists(log_dir):
+            os.makedirs(log_dir, exist_ok=True)
+
+        handler = RotatingFileHandler(
+            os.path.join(log_dir, "failures.log"),
+            maxBytes=5 * 1024 * 1024,  # 5 MB
+            backupCount=2,
+        )
+
+        handler.setFormatter(JsonFormatter())
+        _file_handler = handler
+        _fallback_mode = False
+        return handler
+    except (OSError, PermissionError, IOError) as e:
+        logging.warning(f"Cannot create failure log file handler: {e}")
+        _fallback_mode = True
+        return None
+
+
+def setup_failure_logger():
+    """Sets up a dedicated JSON logger for writing detailed failure logs."""
+    logger = logging.getLogger("failure_logger")
+    logger.setLevel(logging.INFO)
+    logger.propagate = False
+
+    # Remove existing handlers to prevent duplicates
+    logger.handlers.clear()
+
+    # Try to add file handler
+    handler = _create_file_handler()
+    if handler:
         logger.addHandler(handler)
+
+    # Always add a NullHandler as fallback to prevent "no handlers" warning
+    if not logger.handlers:
+        logger.addHandler(logging.NullHandler())
+
     return logger
 
 
+def _ensure_handler_valid():
+    """Check if file handler is still valid, recreate if needed."""
+    global _file_handler, _fallback_mode
+
+    if _file_handler is None or _fallback_mode:
+        handler = _create_file_handler()
+        if handler:
+            failure_logger = logging.getLogger("failure_logger")
+            failure_logger.handlers.clear()
+            failure_logger.addHandler(handler)
+
+
 # Initialize the dedicated logger for detailed failure logs
 failure_logger = setup_failure_logger()
 
@@ -145,11 +180,23 @@ def log_failure(
         "request_headers": request_headers,
         "error_chain": error_chain if len(error_chain) > 1 else None,
     }
 
     # 2. Log a concise summary to the main library logger, which will propagate
     summary_message = (
         f"API call failed for model {model} with key {mask_credential(api_key)}. "
         f"Error: {type(error).__name__}. See failures.log for details."
     )
+
+    # Attempt to ensure handler is valid before logging
+    _ensure_handler_valid()
+
+    # Wrap the actual log call with resilience
+    try:
+        failure_logger.error(detailed_log_data)
+    except (OSError, IOError) as e:
+        # File logging failed - log to console instead
+        logging.error(f"Failed to write to failures.log: {e}")
+        logging.error(f"Failure summary: {summary_message}")
+
+    # Console log always succeeds
     main_lib_logger.error(summary_message)
```
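
The handler-recovery idea can be exercised end-to-end: delete the log directory, drop the now-invalid handler, and let a recreate-on-demand helper rebuild both. A compressed, self-contained variant (directory names are placeholders, and the helper is a simplification of the module's `_create_file_handler`/`_ensure_handler_valid` pair):

```python
import logging
import os
import shutil
import tempfile
from logging.handlers import RotatingFileHandler

log_dir = os.path.join(tempfile.mkdtemp(), "logs")

def create_handler():
    """Recreate the log directory and file handler on demand."""
    try:
        os.makedirs(log_dir, exist_ok=True)
        return RotatingFileHandler(os.path.join(log_dir, "failures.log"),
                                   maxBytes=1024, backupCount=1)
    except OSError:
        return None

logger = logging.getLogger("recovery_sketch")
logger.propagate = False
logger.addHandler(create_handler())

shutil.rmtree(log_dir)                      # logs/ deleted at runtime
logger.handlers.clear()                     # drop the stale handler
logger.addHandler(create_handler() or logging.NullHandler())
logger.error("recovered")                   # lands in the recreated file

with open(os.path.join(log_dir, "failures.log")) as f:
    content = f.read()
print("recovered" in content)   # → True
```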
@@ -260,64 +260,76 @@ class GoogleOAuthBase:
|
|
| 260 |
)
|
| 261 |
|
| 262 |
async def _save_credentials(self, path: str, creds: Dict[str, Any]):
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 263 |
# Don't save to file if credentials were loaded from environment
|
| 264 |
if creds.get("_proxy_metadata", {}).get("loaded_from_env"):
|
| 265 |
lib_logger.debug("Credentials loaded from env, skipping file save")
|
| 266 |
-
# Still update cache for in-memory consistency
|
| 267 |
-
self._credentials_cache[path] = creds
|
| 268 |
return
|
| 269 |
|
| 270 |
-
# [ATOMIC WRITE] Use tempfile + move pattern to ensure atomic writes
|
| 271 |
-
# This prevents credential corruption if the process is interrupted during write
|
| 272 |
-
parent_dir = os.path.dirname(os.path.abspath(path))
|
| 273 |
-
os.makedirs(parent_dir, exist_ok=True)
|
| 274 |
-
|
| 275 |
-
tmp_fd = None
|
| 276 |
-
tmp_path = None
|
| 277 |
try:
|
| 278 |
-
#
|
| 279 |
-
|
| 280 |
-
|
| 281 |
-
)
|
| 282 |
-
|
| 283 |
-
# Write JSON to temp file
|
| 284 |
-
with os.fdopen(tmp_fd, "w") as f:
|
| 285 |
-
json.dump(creds, f, indent=2)
|
| 286 |
-
tmp_fd = None # fdopen closes the fd
|
| 287 |
|
| 288 |
-
|
|
|
|
| 289 |
try:
|
| 290 |
-
|
| 291 |
-
|
| 292 |
-
|
| 293 |
-
|
| 294 |
-
|
| 295 |
-
# Atomic move (overwrites target if it exists)
|
| 296 |
-
shutil.move(tmp_path, path)
|
| 297 |
-
tmp_path = None # Successfully moved
|
| 298 |
|
| 299 |
-
|
| 300 |
-
|
| 301 |
-
|
| 302 |
-
|
| 303 |
-
)
|
| 304 |
|
| 305 |
-
|
| 306 |
-
lib_logger.error(
|
| 307 |
-
f"Failed to save updated {self.ENV_PREFIX} OAuth credentials to '{path}': {e}"
|
| 308 |
-
)
|
| 309 |
-
# Clean up temp file if it still exists
|
| 310 |
-
if tmp_fd is not None:
|
| 311 |
try:
|
| 312 |
-
os.
|
| 313 |
-
except:
|
|
|
|
| 314 |
pass
|
| 315 |
-
|
| 316 |
-
|
| 317 |
-
|
| 318 |
-
|
| 319 |
-
|
| 320 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 321 |
|
| 322 |
def _is_token_expired(self, creds: Dict[str, Any]) -> bool:
|
| 323 |
expiry = creds.get("token_expiry") # gcloud format
|
|
@@ -841,10 +853,39 @@ class GoogleOAuthBase:
|
|
| 841 |
)
|
| 842 |
|
| 843 |
async def get_auth_header(self, credential_path: str) -> Dict[str, str]:
|
| 844 |
-
|
| 845 |
-
|
| 846 |
-
|
| 847 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 848 |
|
| 849 |
async def get_user_info(
|
| 850 |
self, creds_or_path: Union[Dict[str, Any], str]
|
|
|
|
| 260 |
)
|
| 261 |
|
| 262 |
async def _save_credentials(self, path: str, creds: Dict[str, Any]):
|
| 263 |
+
"""Save credentials with in-memory fallback if disk unavailable.
|
| 264 |
+
|
| 265 |
+
[RUNTIME RESILIENCE] Always updates the in-memory cache first (memory is reliable),
|
| 266 |
+
then attempts disk persistence. If disk write fails, logs a warning but does NOT
|
| 267 |
+
raise an exception - the in-memory state continues to work.
|
| 268 |
+
"""
|
| 269 |
+
# [IN-MEMORY FIRST] Always update cache first (reliable)
|
| 270 |
+
self._credentials_cache[path] = creds
|
| 271 |
+
|
| 272 |
# Don't save to file if credentials were loaded from environment
|
| 273 |
if creds.get("_proxy_metadata", {}).get("loaded_from_env"):
|
| 274 |
lib_logger.debug("Credentials loaded from env, skipping file save")
|
|
|
|
|
|
|
| 275 |
return
|
| 276 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 277 |
try:
|
| 278 |
+
# [ATOMIC WRITE] Use tempfile + move pattern to ensure atomic writes
|
| 279 |
+
# This prevents credential corruption if the process is interrupted during write
|
| 280 |
+
parent_dir = os.path.dirname(os.path.abspath(path))
|
| 281 |
+
os.makedirs(parent_dir, exist_ok=True)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 282 |
|
| 283 |
+
tmp_fd = None
|
| 284 |
+
tmp_path = None
|
| 285 |
try:
|
| 286 |
+
# Create temp file in same directory as target (ensures same filesystem)
|
| 287 |
+
tmp_fd, tmp_path = tempfile.mkstemp(
|
| 288 |
+
dir=parent_dir, prefix=".tmp_", suffix=".json", text=True
|
| 289 |
+
)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 290 |
|
| 291 |
+
# Write JSON to temp file
|
| 292 |
+
with os.fdopen(tmp_fd, "w") as f:
|
| 293 |
+
json.dump(creds, f, indent=2)
|
| 294 |
+
tmp_fd = None # fdopen closes the fd
|
|
|
|
| 295 |
|
| 296 |
+
# Set secure permissions (0600 = owner read/write only)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 297 |
try:
|
| 298 |
+
os.chmod(tmp_path, 0o600)
|
| 299 |
+
except (OSError, AttributeError):
|
| 300 |
+
# Windows may not support chmod, ignore
|
| 301 |
pass
|
| 302 |
+
|
| 303 |
+
# Atomic move (overwrites target if it exists)
|
| 304 |
+
shutil.move(tmp_path, path)
|
| 305 |
+
tmp_path = None # Successfully moved
|
| 306 |
+
|
| 307 |
+
lib_logger.debug(
|
| 308 |
+
f"Saved updated {self.ENV_PREFIX} OAuth credentials to '{path}' (atomic write)."
|
| 309 |
+
)
|
| 310 |
+
|
| 311 |
+
except Exception as e:
|
| 312 |
+
# Clean up temp file if it still exists
|
| 313 |
+
if tmp_fd is not None:
|
| 314 |
+
try:
|
| 315 |
+
os.close(tmp_fd)
|
| 316 |
+
except:
|
| 317 |
+
pass
|
| 318 |
+
if tmp_path and os.path.exists(tmp_path):
|
| 319 |
+
try:
|
| 320 |
+
os.unlink(tmp_path)
|
| 321 |
+
except:
|
| 322 |
+
pass
|
| 323 |
+
raise
|
| 324 |
+
|
| 325 |
+
except (OSError, PermissionError, IOError) as e:
|
| 326 |
+
# [FAIL SILENTLY, LOG LOUDLY] Log the error but don't crash
|
| 327 |
+
# The in-memory cache was already updated, so we can continue operating
|
| 328 |
+
lib_logger.warning(
|
| 329 |
+
f"Failed to save credentials to {path}: {e}. "
|
| 330 |
+
"Credentials cached in memory only (will be lost on restart)."
|
| 331 |
+
)
|
| 332 |
+
# Don't raise - we already updated the memory cache
|
| 333 |
|
| 334 |
def _is_token_expired(self, creds: Dict[str, Any]) -> bool:
|
| 335 |
expiry = creds.get("token_expiry") # gcloud format
|
|
|
|
| 853 |
)
|
| 854 |
|
| 855 |
async def get_auth_header(self, credential_path: str) -> Dict[str, str]:
|
| 856 |
+
"""Get auth header with graceful degradation if refresh fails.
|
| 857 |
+
|
| 858 |
+
[RUNTIME RESILIENCE] If credential file is deleted or refresh fails,
|
| 859 |
+
attempts to use cached credentials. This allows the proxy to continue
|
| 860 |
+
operating with potentially stale tokens rather than crashing.
|
| 861 |
+
"""
|
| 862 |
+
try:
|
| 863 |
+
creds = await self._load_credentials(credential_path)
|
| 864 |
+
if self._is_token_expired(creds):
|
| 865 |
+
try:
|
| 866 |
+
creds = await self._refresh_token(credential_path, creds)
|
| 867 |
+
except Exception as e:
|
| 868 |
+
# [CACHED TOKEN FALLBACK] Check if we have a cached token that might still work
|
| 869 |
+
cached = self._credentials_cache.get(credential_path)
|
| 870 |
+
if cached and cached.get("access_token"):
|
| 871 |
+
lib_logger.warning(
|
| 872 |
+
f"Token refresh failed for {Path(credential_path).name}: {e}. "
|
| 873 |
+
"Using cached token (may be expired)."
|
| 874 |
+
)
|
| 875 |
+
creds = cached
|
| 876 |
+
else:
|
| 877 |
+
raise
|
| 878 |
+
return {"Authorization": f"Bearer {creds['access_token']}"}
|
| 879 |
+
except Exception as e:
|
| 880 |
+
# [FINAL FALLBACK] Check if any cached credential exists as last resort
|
| 881 |
+
cached = self._credentials_cache.get(credential_path)
|
| 882 |
+
if cached and cached.get("access_token"):
|
| 883 |
+
lib_logger.error(
|
| 884 |
+
f"Credential load failed for {credential_path}: {e}. "
|
| 885 |
+
"Using stale cached token as last resort."
|
| 886 |
+
)
|
| 887 |
+
return {"Authorization": f"Bearer {cached['access_token']}"}
|
| 888 |
+
raise
|
| 889 |
|
| 890 |
async def get_user_info(
|
| 891 |
self, creds_or_path: Union[Dict[str, Any], str]
|
|
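
The cached-token fallback behaves like this in miniature: if loading or refreshing fails, fall back to the last in-memory credentials rather than raising. The class below is a hypothetical, synchronous simplification of the idea, not the real async method:

```python
class AuthClient:
    """Sketch of the cached-token fallback for building auth headers."""

    def __init__(self):
        self.cache = {}

    def load(self, path):
        raise FileNotFoundError(path)   # simulate a deleted credential file

    def auth_header(self, path):
        try:
            creds = self.load(path)
        except Exception:
            creds = self.cache.get(path)  # possibly stale, but usable
            if not (creds and creds.get("access_token")):
                raise                      # nothing cached: give up
        return {"Authorization": f"Bearer {creds['access_token']}"}

client = AuthClient()
client.cache["c.json"] = {"access_token": "tok123"}  # populated earlier
print(client.auth_header("c.json"))   # → {'Authorization': 'Bearer tok123'}
```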
### provider_cache.py

```diff
@@ -104,7 +104,10 @@ class ProviderCache:
         self._running = False
 
         # Statistics
-        self._stats = {"memory_hits": 0, "disk_hits": 0, "misses": 0, "writes": 0}
+        self._stats = {"memory_hits": 0, "disk_hits": 0, "misses": 0, "writes": 0, "disk_errors": 0}
+
+        # [RUNTIME RESILIENCE] Track disk health for monitoring
+        self._disk_available = True
 
         # Metadata about this cache instance
         self._cache_name = cache_file.stem if cache_file else "unnamed"
@@ -171,13 +174,27 @@ class ProviderCache:
     # =========================================================================
 
     async def _save_to_disk(self) -> None:
-        """Persist cache to disk using atomic write."""
+        """Persist cache to disk using atomic write with health tracking.
+
+        [RUNTIME RESILIENCE] Tracks disk health and records errors. If disk
+        operations fail, the memory cache continues to work. Health status
+        is available via get_stats() for monitoring.
+        """
         if not self._enable_disk:
             return
 
         try:
             async with self._disk_lock:
+                # [DIRECTORY AUTO-RECREATION] Attempt to create directory
+                try:
+                    self._cache_file.parent.mkdir(parents=True, exist_ok=True)
+                except (OSError, PermissionError) as e:
+                    self._stats["disk_errors"] += 1
+                    self._disk_available = False
+                    lib_logger.warning(
+                        f"ProviderCache[{self._cache_name}]: Cannot create cache directory: {e}"
+                    )
+                    return
 
                 cache_data = {
                     "version": "1.0",
@@ -210,6 +227,8 @@ class ProviderCache:
 
                 shutil.move(tmp_path, self._cache_file)
                 self._stats["writes"] += 1
+                # [RUNTIME RESILIENCE] Mark disk as healthy on success
+                self._disk_available = True
                 lib_logger.debug(
                     f"ProviderCache[{self._cache_name}]: Saved {len(self._cache)} entries"
                 )
@@ -218,6 +237,9 @@ class ProviderCache:
                 os.unlink(tmp_path)
             raise
         except Exception as e:
+            # [RUNTIME RESILIENCE] Track disk errors for monitoring
+            self._stats["disk_errors"] += 1
+            self._disk_available = False
             lib_logger.error(f"ProviderCache[{self._cache_name}]: Disk save failed: {e}")
 
     # =========================================================================
@@ -416,12 +438,17 @@ class ProviderCache:
         return False
 
     def get_stats(self) -> Dict[str, Any]:
-        """Get cache statistics."""
+        """Get cache statistics including disk health.
+
+        [RUNTIME RESILIENCE] Includes disk_available flag for monitoring
+        the health of disk persistence.
+        """
         return {
             **self._stats,
             "memory_entries": len(self._cache),
             "dirty": self._dirty,
-            "disk_enabled": self._enable_disk
+            "disk_enabled": self._enable_disk,
+            "disk_available": self._disk_available  # [RUNTIME RESILIENCE] Health indicator
         }
 
     async def clear(self) -> None:
```
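
Surfacing the health flag through `get_stats()` lets monitoring detect degraded persistence without scraping logs. A tiny sketch of the stats shape (field names follow the diff above; the class itself is hypothetical):

```python
class CacheStats:
    """Sketch: a counter dict plus a disk-health flag merged into one stats view."""

    def __init__(self):
        self.stats = {"memory_hits": 0, "disk_hits": 0, "misses": 0,
                      "writes": 0, "disk_errors": 0}
        self.disk_available = True

    def record_disk_failure(self):
        self.stats["disk_errors"] += 1
        self.disk_available = False

    def get_stats(self):
        # Merge counters with the health indicator for monitoring
        return {**self.stats, "disk_available": self.disk_available}

c = CacheStats()
c.record_disk_failure()
print(c.get_stats()["disk_errors"], c.get_stats()["disk_available"])   # → 1 False
```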
### usage_manager.py

```diff
@@ -90,25 +90,56 @@ class UsageManager:
         self._initialized.set()
 
     async def _load_usage(self):
-        """Loads usage data from the JSON file asynchronously."""
+        """Loads usage data from the JSON file asynchronously with enhanced resilience.
+
+        [RUNTIME RESILIENCE] Handles various file system errors gracefully,
+        including race conditions where file is deleted between exists check and open.
+        """
         async with self._data_lock:
             if not os.path.exists(self.file_path):
                 self._usage_data = {}
                 return
+
             try:
                 async with aiofiles.open(self.file_path, "r") as f:
                     content = await f.read()
-                    self._usage_data = json.loads(content)
-            except Exception:
+                    self._usage_data = json.loads(content) if content.strip() else {}
+            except FileNotFoundError:
+                # [RACE CONDITION HANDLING] File deleted between exists check and open
+                self._usage_data = {}
+            except json.JSONDecodeError as e:
+                lib_logger.warning(f"Corrupted usage file {self.file_path}: {e}. Starting fresh.")
+                self._usage_data = {}
+            except (OSError, PermissionError, IOError) as e:
+                lib_logger.warning(f"Cannot read usage file {self.file_path}: {e}. Using empty state.")
                 self._usage_data = {}
 
     async def _save_usage(self):
-        """Saves the current usage data to the JSON file asynchronously."""
+        """Saves the current usage data to the JSON file asynchronously with resilience.
+
+        [RUNTIME RESILIENCE] Wraps file operations in try/except to prevent crashes
+        if the file or directory is deleted during runtime. The in-memory state
+        continues to work even if disk persistence fails.
+        """
         if self._usage_data is None:
             return
-        async with self._data_lock:
-            async with aiofiles.open(self.file_path, "w") as f:
-                await f.write(json.dumps(self._usage_data, indent=2))
+
+        try:
+            async with self._data_lock:
+                # [DIRECTORY AUTO-RECREATION] Ensure directory exists before write
+                file_dir = os.path.dirname(os.path.abspath(self.file_path))
+                if file_dir and not os.path.exists(file_dir):
+                    os.makedirs(file_dir, exist_ok=True)
+
+                async with aiofiles.open(self.file_path, "w") as f:
+                    await f.write(json.dumps(self._usage_data, indent=2))
+        except (OSError, PermissionError, IOError) as e:
+            # [FAIL SILENTLY, LOG LOUDLY] Log the error but don't crash
+            # In-memory state is preserved and will continue to work
+            lib_logger.warning(
+                f"Failed to save usage data to {self.file_path}: {e}. "
+                "Data will be retained in memory but may be lost on restart."
+            )
 
     async def _reset_daily_stats_if_needed(self):
         """Checks if daily stats need to be reset for any key."""
```