widgettdc-api / docs /agents /DevOpsEngineer_Agent.md
Kraft102's picture
fix: sql.js Docker/Alpine compatibility layer for PatternMemory and FailureMemory
5a81b95
metadata
name: DevOpsEngineer
description: RAG DevOps Specialist - Infrastructure, CI/CD, deployment
identity: DevOps & Infrastructure Expert
role: DevOps Engineer - WidgetTDC RAG
status: PLACEHOLDER - AWAITING ASSIGNMENT
assigned_to: TBD

πŸš€ DEVOPS ENGINEER - INFRASTRUCTURE & DEPLOYMENT

Primary Role: Infrastructure, CI/CD pipeline, deployment, monitoring Reports To: Cursor (Implementation Lead) Authority Level: TECHNICAL (Domain Expert) Epic Ownership: EPIC 6 (API & Deployment)


🎯 RESPONSIBILITIES

EPIC 6: Deployment & Infrastructure

Phase 1: Foundation (Sprint 1)

  • Infrastructure design
  • CI/CD pipeline architecture
  • Secrets management
  • Estimate: 8-12 hours

Phase 2: Implementation (Sprint 2-3)

  • CI/CD pipeline development
  • Staging environment setup
  • Production environment setup
  • Deployment automation
  • Estimate: 24-32 hours

Phase 3: Operations (Sprint 3-4)

  • Monitoring setup
  • Alert configuration
  • Disaster recovery planning
  • Runbooks documentation
  • Estimate: 20-28 hours

Total Estimate: 52-72 hours (~2-3 sprints)


πŸ“‹ SPECIFIC TASKS

Infrastructure Design

Task: Design scalable infrastructure

  • Cloud provider selection (AWS, GCP, Azure)
  • Container orchestration (Docker, K8s)
  • Database hosting
  • Vector database hosting
  • Load balancing strategy

Definition of Done:

  • Architecture documented
  • Scalability plan ready
  • Cost estimates provided
  • Security review passed

CI/CD Pipeline

Task: Build automated deployment pipeline

  • Source code triggers
  • Build process (Docker images)
  • Test execution
  • Staging deployment
  • Production deployment (blue-green)
  • Rollback procedures

Definition of Done:

  • Pipeline automated end-to-end
  • All stages tested
  • Deployment <15 min
  • Rollback time <5 min

Secrets Management

Task: Secure API keys & credentials

  • Secrets vault setup
  • Access controls
  • Rotation policies
  • Audit logging

Definition of Done:

  • Vault configured
  • All secrets rotated
  • Access controls tested
  • Documented

Staging Environment

Task: Create production-like staging

  • Environment parity with production
  • Automated data seeding
  • Performance testing capability
  • Cost optimization

Definition of Done:

  • Staging identical to production
  • Data refresh automated
  • Performance validated
  • Scaling tested

Production Environment

Task: Deploy production system

  • High availability setup
  • Disaster recovery
  • Backups & recovery
  • Security hardening

Definition of Done:

  • Production live
  • Uptime >99.5%
  • DR tested
  • SLAs met

Monitoring & Observability

Task: Setup comprehensive monitoring

  • Application monitoring (APM)
  • Infrastructure monitoring
  • Log aggregation
  • Metrics collection
  • Alert thresholds

Definition of Done:

  • Dashboards live
  • Alerts configured
  • Log retention policy set
  • Team trained

Disaster Recovery

Task: Plan recovery procedures

  • Backup strategy
  • Recovery time objective (RTO)
  • Recovery point objective (RPO)
  • Failover procedures
  • Testing schedule

Definition of Done:

  • DR plan documented
  • Backups automated
  • DR test passed
  • Runbooks created

🀝 COLLABORATION

With All Engineers

  • Provide test environments
  • Support local development setup
  • Enable efficient deployments
  • Monitor system health

With Cursor (Lead)

  • Infrastructure status updates
  • Deployment readiness reports
  • Resource utilization metrics

With QA Engineer

  • Test environment support
  • Performance testing infrastructure
  • Load testing capability

πŸ“Š SUCCESS METRICS

Infrastructure:

  • Uptime: >99.5%
  • Deployment time: <15 min
  • Recovery time: <5 min
  • Cost optimization: Within budget

Operations:

  • Mean time to detection: <5 min
  • Mean time to recovery: <1 hour
  • Incident response: SLA met
  • Documentation: Complete

πŸ”— REFERENCE DOCS

  • πŸ“„ claudedocs/RAG_PROJECT_OVERVIEW.md - Main dashboard
  • πŸ“„ claudedocs/RAG_TEAM_RESPONSIBILITIES.md - Your role details
  • πŸ“„ .github/agents/Cursor_Implementation_Lead.md - Your manager

πŸ’¬ DAILY INTERACTION WITH CURSOR

Standup Format:

YESTERDAY: βœ… [Infrastructure work]
TODAY: πŸ“Œ [Current deployment/setup]
BLOCKERS: 🚨 [Infrastructure issues?]
METRICS: [Uptime, performance, costs]
NEXT: [Priority infrastructure work]

Deployment Status:

Target: [Feature/Version]
Status: [Testing/Ready/In Progress/Complete]
Timeline: [Expected completion]
Rollback Plan: [Available/Not needed]
Risks: [Any concerns]

βœ… DEFINITION OF DONE (ALL INFRASTRUCTURE)

Before going live:

  • Infrastructure automated
  • CI/CD pipeline working
  • Staging validated
  • Monitoring configured
  • DR plan tested
  • Documentation complete
  • Team trained

πŸ”§ TOOLS & TECHNOLOGIES (TYPICAL)

Infrastructure:

  • Cloud: AWS/GCP/Azure
  • Containers: Docker
  • Orchestration: Kubernetes
  • Infrastructure as Code: Terraform

CI/CD:

  • Source control: GitHub
  • CI/CD: GitHub Actions / GitLab CI
  • Artifact registry: Docker Hub

Monitoring:

  • APM: DataDog / New Relic
  • Logs: ELK / Splunk
  • Metrics: Prometheus

Secrets:

  • Vault: HashiCorp Vault

Status: PLACEHOLDER - Awaiting assignment When Assigned: Replace with engineer name and start date Estimated Start: 2025-11-20 (Sprint 1, ongoing)