Bader Alabddan
# Monitoring & Logging - BDR Agent Factory
## Overview
Comprehensive monitoring and logging infrastructure for tracking AI capability performance, detecting issues, and ensuring compliance.
---
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                      Monitoring Stack                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐  ┌────────────────┐  ┌────────────┐       │
│  │   Metrics    │  │      Logs      │  │   Traces   │       │
│  │ (Prometheus) │  │ (Elasticsearch)│  │  (Jaeger)  │       │
│  └───────┬──────┘  └────────┬───────┘  └──────┬─────┘       │
│          │                  │                 │             │
│          └──────────────────┼─────────────────┘             │
│                             │                               │
│                    ┌────────▼────────┐                      │
│                    │     Grafana     │                      │
│                    │   Dashboards    │                      │
│                    └────────┬────────┘                      │
│                             │                               │
│                    ┌────────▼────────┐                      │
│                    │  Alert Manager  │                      │
│                    └─────────────────┘                      │
└─────────────────────────────────────────────────────────────┘
```
---
## 1. Metrics Collection
### Prometheus Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'bdr-agent-factory'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'

  - job_name: 'capability-services'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: capability-.*
        action: keep
```
### Key Metrics
#### Request Metrics
```python
from prometheus_client import Counter, Histogram, Gauge

# Request counter
request_count = Counter(
    'capability_requests_total',
    'Total number of capability requests',
    ['capability_id', 'status', 'decision_type']
)

# Request duration
request_duration = Histogram(
    'capability_request_duration_seconds',
    'Request duration in seconds',
    ['capability_id'],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# Active requests
active_requests = Gauge(
    'capability_active_requests',
    'Number of active requests',
    ['capability_id']
)

# Error rate
error_count = Counter(
    'capability_errors_total',
    'Total number of errors',
    ['capability_id', 'error_type']
)
```
#### Model Performance Metrics
```python
# Model accuracy
model_accuracy = Gauge(
    'model_accuracy',
    'Model accuracy score',
    ['capability_id', 'model_version']
)

# Prediction confidence
prediction_confidence = Histogram(
    'prediction_confidence',
    'Prediction confidence scores',
    ['capability_id'],
    buckets=[0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 1.0]
)

# Model inference time
inference_time = Histogram(
    'model_inference_duration_seconds',
    'Model inference duration',
    ['capability_id', 'model_version']
)
```
#### Business Metrics
```python
# Claims processed
claims_processed = Counter(
    'claims_processed_total',
    'Total claims processed',
    ['system_id', 'decision_type']
)

# Fraud detected
fraud_detected = Counter(
    'fraud_cases_detected_total',
    'Total fraud cases detected',
    ['risk_level']
)

# Compliance violations
compliance_violations = Counter(
    'compliance_violations_total',
    'Total compliance violations',
    ['framework', 'violation_type']
)
```
### Metrics Instrumentation
```python
import functools
import time

from prometheus_client import start_http_server

class CapabilityMetrics:
    def __init__(self):
        self.request_count = request_count
        self.request_duration = request_duration
        self.active_requests = active_requests
        self.error_count = error_count

    def track_request(self, capability_id, func):
        """Wrap a callable so its requests are tracked in Prometheus."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            self.active_requests.labels(capability_id=capability_id).inc()
            start_time = time.time()
            status = 'success'
            decision_type = 'unknown'
            try:
                result = func(*args, **kwargs)
                decision_type = result.get('decision_type', 'unknown')
                return result
            except Exception as e:
                status = 'error'
                decision_type = 'error'
                self.error_count.labels(
                    capability_id=capability_id,
                    error_type=type(e).__name__
                ).inc()
                raise
            finally:
                duration = time.time() - start_time
                self.request_duration.labels(
                    capability_id=capability_id
                ).observe(duration)
                self.request_count.labels(
                    capability_id=capability_id,
                    status=status,
                    decision_type=decision_type
                ).inc()
                self.active_requests.labels(
                    capability_id=capability_id
                ).dec()
        return wrapper

# Start metrics server
start_http_server(8001)
```
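The `track_request` wrapper can be exercised without a running Prometheus. The following dependency-free sketch mirrors the same wrap-and-track pattern with plain dictionaries; all names here (`request_tally`, `durations`, the lambda capability) are illustrative, not part of the real metrics module:

```python
import time
from collections import Counter as TallyCounter

# Stand-ins for the Prometheus counter and histogram
request_tally = TallyCounter()
durations = {}

def track_request(capability_id, func):
    def wrapper(*args, **kwargs):
        start = time.time()
        status = 'success'
        try:
            return func(*args, **kwargs)
        except Exception:
            status = 'error'
            raise
        finally:
            # Record duration and outcome, whether or not func raised
            durations.setdefault(capability_id, []).append(time.time() - start)
            request_tally[(capability_id, status)] += 1
    return wrapper

classify = track_request('cap_text_classification',
                         lambda text: {'decision_type': 'approve'})
result = classify('Great product!')
```

The real `CapabilityMetrics` records the same events to the Prometheus counters and histograms defined earlier instead of in-process dictionaries.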
---
## 2. Logging
### Structured Logging Configuration
```python
import logging
import json
from datetime import datetime

class StructuredLogger:
    def __init__(self, name):
        self.logger = logging.getLogger(name)
        self.logger.setLevel(logging.INFO)
        # JSON formatter
        handler = logging.StreamHandler()
        handler.setFormatter(self.JSONFormatter())
        self.logger.addHandler(handler)

    class JSONFormatter(logging.Formatter):
        def format(self, record):
            log_data = {
                'timestamp': datetime.utcnow().isoformat(),
                'level': record.levelname,
                'logger': record.name,
                'message': record.getMessage(),
                'module': record.module,
                'function': record.funcName,
                'line': record.lineno
            }
            # Add extra fields passed via `extra=`
            for field in ('capability_id', 'request_id', 'user_id', 'system_id'):
                if hasattr(record, field):
                    log_data[field] = getattr(record, field)
            return json.dumps(log_data)

    def info(self, message, **kwargs):
        self.logger.info(message, extra=kwargs)

    def warning(self, message, **kwargs):
        self.logger.warning(message, extra=kwargs)

    def error(self, message, **kwargs):
        self.logger.error(message, extra=kwargs)

    def debug(self, message, **kwargs):
        self.logger.debug(message, extra=kwargs)

# Usage
logger = StructuredLogger('bdr_agent_factory')
logger.info(
    'Capability invoked',
    capability_id='cap_text_classification',
    request_id='req_12345',
    user_id='user_789'
)
```
### Log Levels
- **DEBUG**: Detailed diagnostic information
- **INFO**: General informational messages
- **WARNING**: Warning messages for potentially harmful situations
- **ERROR**: Error events that might still allow the application to continue
- **CRITICAL**: Critical events that may cause the application to abort
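The level in effect is usually set per environment rather than hard-coded. A minimal sketch, assuming a `LOG_LEVEL` environment variable (the variable name is an assumption, not part of the factory's configuration):

```python
import logging
import os

# Read the desired level from the environment, defaulting to INFO;
# unknown values also fall back to INFO rather than raising.
level_name = os.environ.get('LOG_LEVEL', 'INFO').upper()
level = getattr(logging, level_name, logging.INFO)
logging.getLogger('bdr_agent_factory').setLevel(level)
```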
### Log Categories
#### Application Logs
```python
logger.info('Application started', version='1.0.0')
logger.info('Capability registered', capability_id='cap_text_classification')
logger.warning('High memory usage detected', memory_usage_mb=8192)
```
#### Request Logs
```python
logger.info(
    'Request received',
    request_id='req_12345',
    capability_id='cap_text_classification',
    user_id='user_789',
    ip_address='192.168.1.1'
)

logger.info(
    'Request completed',
    request_id='req_12345',
    duration_ms=142,
    status='success'
)
```
#### Error Logs
```python
import traceback

logger.error(
    'Capability invocation failed',
    request_id='req_12345',
    capability_id='cap_text_classification',
    error_type='ValidationError',
    error_message='Input text exceeds maximum length',
    stack_trace=traceback.format_exc()
)
```
#### Audit Logs
```python
logger.info(
    'Audit trail created',
    audit_id='audit_67890',
    request_id='req_12345',
    capability_id='cap_text_classification',
    user_id='user_789',
    decision_type='approve',
    compliance_flags={'gdpr': True, 'ifrs17': True}
)
```
#### Security Logs
```python
logger.warning(
    'Authentication failed',
    user_id='user_789',
    ip_address='192.168.1.1',
    reason='Invalid token'
)

logger.warning(
    'Rate limit exceeded',
    user_id='user_789',
    ip_address='192.168.1.1',
    requests_per_minute=150
)
```
### Elasticsearch Configuration
```yaml
# logstash.conf
input {
  file {
    path => "/var/log/bdr-agent-factory/*.log"
    codec => json
  }
}

filter {
  # Parse timestamp
  date {
    match => ["timestamp", "ISO8601"]
    target => "@timestamp"
  }

  # Add tags for different log types
  if [capability_id] {
    mutate {
      add_tag => ["capability_log"]
    }
  }
  if [audit_id] {
    mutate {
      add_tag => ["audit_log"]
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "bdr-agent-factory-%{+YYYY.MM.dd}"
  }
}
```
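After this pipeline runs, a single structured log line is indexed as a document roughly like the following (all field values are illustrative):

```json
{
  "@timestamp": "2025-01-15T10:23:45.123Z",
  "level": "INFO",
  "logger": "bdr_agent_factory",
  "message": "Capability invoked",
  "capability_id": "cap_text_classification",
  "request_id": "req_12345",
  "user_id": "user_789",
  "tags": ["capability_log"]
}
```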
---
## 3. Distributed Tracing
### Jaeger Configuration
```python
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Configure Jaeger exporter
jaeger_exporter = JaegerExporter(
    agent_host_name='localhost',
    agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

# Usage
with tracer.start_as_current_span('capability_invocation') as span:
    span.set_attribute('capability_id', 'cap_text_classification')
    span.set_attribute('request_id', 'req_12345')
    # Perform capability invocation
    result = invoke_capability()
    span.set_attribute('decision_type', result.decision_type)
    span.set_attribute('confidence', result.confidence)
```
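Between services, trace context travels as a W3C `traceparent` HTTP header, which OpenTelemetry injects into outgoing requests for you (via `opentelemetry.propagate.inject`). A hand-rolled sketch of the header format, for illustration only:

```python
import secrets

# The W3C Trace Context header: version-traceid-parentid-flags.
# In practice OpenTelemetry builds and parses this for you; this
# sketch only shows what actually crosses service boundaries.
def make_traceparent():
    trace_id = secrets.token_hex(16)  # 16-byte trace id, hex-encoded
    span_id = secrets.token_hex(8)    # 8-byte parent span id
    return f"00-{trace_id}-{span_id}-01"

outgoing_headers = {'traceparent': make_traceparent()}
```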
---
## 4. Dashboards
### Grafana Dashboard Configuration
#### System Overview Dashboard
```json
{
  "dashboard": {
    "title": "BDR Agent Factory - System Overview",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          { "expr": "rate(capability_requests_total[5m])" }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          { "expr": "rate(capability_errors_total[5m])" }
        ]
      },
      {
        "title": "P95 Latency",
        "targets": [
          { "expr": "histogram_quantile(0.95, rate(capability_request_duration_seconds_bucket[5m]))" }
        ]
      },
      {
        "title": "Active Requests",
        "targets": [
          { "expr": "sum(capability_active_requests)" }
        ]
      }
    ]
  }
}
```
#### Capability Performance Dashboard
- **Request Volume by Capability**: Bar chart showing requests per capability
- **Latency Distribution**: Heatmap of latency percentiles
- **Error Rate by Capability**: Line chart of error rates
- **Model Accuracy**: Gauge showing current accuracy
- **Prediction Confidence**: Histogram of confidence scores
#### Compliance Dashboard
- **Compliance Rate**: Gauge showing overall compliance percentage
- **Violations by Framework**: Bar chart of violations per framework
- **Audit Trail Coverage**: Percentage of requests with audit trails
- **Data Retention Status**: Status of data retention policies
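The compliance panels could be backed by PromQL along these lines, reusing the counters defined in Section 1; the compliance-rate ratio is a sketch, not a fixed definition:

```promql
# Violations per framework over the last 24h
sum by (framework) (increase(compliance_violations_total[24h]))

# Overall compliance rate: 1 minus violations per processed claim
1 - (
  sum(increase(compliance_violations_total[24h]))
  /
  sum(increase(claims_processed_total[24h]))
)
```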
---
## 5. Alerting
### Alert Rules
```yaml
# prometheus-alerts.yml
groups:
  - name: capability_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: rate(capability_errors_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec for {{ $labels.capability_id }}"

      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(capability_request_duration_seconds_bucket[5m])) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s for {{ $labels.capability_id }}"

      # Low model accuracy
      - alert: LowModelAccuracy
        expr: model_accuracy < 0.85
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Model accuracy below threshold"
          description: "Model accuracy is {{ $value }} for {{ $labels.capability_id }}"

      # Compliance violation (fires immediately, no `for` clause)
      - alert: ComplianceViolation
        expr: increase(compliance_violations_total[1h]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Compliance violation detected"
          description: "{{ $value }} violations detected for {{ $labels.framework }}"

      # Service down
      - alert: ServiceDown
        expr: up{job="bdr-agent-factory"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "BDR Agent Factory service is not responding"
```
### Alert Channels
```yaml
# alertmanager.yml
route:
  group_by: ['alertname', 'capability_id']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@bdragentfactory.com'
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'
        channel: '#alerts'
        title: 'BDR Agent Factory Alert'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'XXX'
```
---
## 6. Log Retention
### Retention Policies
| Log Type | Retention Period | Storage |
|----------|------------------|----------|
| Application Logs | 30 days | Elasticsearch |
| Request Logs | 90 days | Elasticsearch |
| Audit Logs | 7 years | S3 + Elasticsearch |
| Error Logs | 1 year | Elasticsearch |
| Security Logs | 2 years | S3 + Elasticsearch |
| Metrics | 1 year | Prometheus |
### Elasticsearch Index Lifecycle Management
```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "freeze": {}
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
```
---
## 7. Performance Monitoring
### SLA Monitoring
```python
class SLAMonitor:
    def __init__(self):
        self.sla_targets = {
            'availability': 0.999,   # 99.9% uptime
            'p95_latency_ms': 300,   # 300 ms P95 latency
            'error_rate': 0.001,     # 0.1% error rate
        }

    def check_sla_compliance(self, time_window='24h'):
        """Check whether SLAs are being met over the given window."""
        # get_metrics (not shown) fetches aggregates from the metrics backend
        metrics = self.get_metrics(time_window)
        compliance = {
            'availability': metrics['uptime'] >= self.sla_targets['availability'],
            'latency': metrics['p95_latency_ms'] <= self.sla_targets['p95_latency_ms'],
            'error_rate': metrics['error_rate'] <= self.sla_targets['error_rate']
        }
        return all(compliance.values()), compliance
```
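The same check can be run standalone with example numbers; in production, `get_metrics()` would query Prometheus' HTTP API rather than return canned values. All figures below are illustrative:

```python
# Dependency-free sketch of the SLA check with example inputs
SLA_TARGETS = {'availability': 0.999, 'p95_latency_ms': 300, 'error_rate': 0.001}

def check_sla(metrics):
    compliance = {
        'availability': metrics['uptime'] >= SLA_TARGETS['availability'],
        'latency': metrics['p95_latency_ms'] <= SLA_TARGETS['p95_latency_ms'],
        'error_rate': metrics['error_rate'] <= SLA_TARGETS['error_rate'],
    }
    return all(compliance.values()), compliance

# A healthy window passes all three targets
ok, detail = check_sla({'uptime': 0.9995, 'p95_latency_ms': 180, 'error_rate': 0.0004})
```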
---
## 8. Troubleshooting
### Common Issues
#### High Latency
```bash
# Check P95 latency by capability
curl -G 'http://localhost:9090/api/v1/query' \
--data-urlencode 'query=histogram_quantile(0.95, rate(capability_request_duration_seconds_bucket[5m]))'
# Check slow queries in logs
curl -X GET "localhost:9200/bdr-agent-factory-*/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "duration_ms": { "gte": 1000 }
    }
  },
  "sort": [{ "duration_ms": "desc" }]
}'
```
#### High Error Rate
```bash
# Check error distribution
curl -G 'http://localhost:9090/api/v1/query' \
--data-urlencode 'query=sum by (error_type) (rate(capability_errors_total[5m]))'
# View recent errors
curl -X GET "localhost:9200/bdr-agent-factory-*/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": { "level": "ERROR" }
  },
  "sort": [{ "timestamp": "desc" }],
  "size": 100
}'
```
---
## 9. Best Practices
1. **Use structured logging** - Always log in JSON format
2. **Include context** - Add request_id, user_id, capability_id to all logs
3. **Monitor SLAs** - Track availability, latency, and error rates
4. **Set up alerts** - Configure alerts for critical metrics
5. **Retain audit logs** - Keep audit logs for compliance requirements
6. **Use distributed tracing** - Track requests across services
7. **Dashboard everything** - Create dashboards for all key metrics
8. **Regular reviews** - Review logs and metrics regularly
9. **Optimize queries** - Use efficient Elasticsearch queries
10. **Archive old data** - Move old logs to cold storage
---
## Support
For monitoring support:
- Documentation: https://docs.bdragentfactory.com/monitoring
- Email: ops@bdragentfactory.com