widgettdc-api / docs /agents /DevOpsEngineer_Agent.md
Kraft102's picture
fix: sql.js Docker/Alpine compatibility layer for PatternMemory and FailureMemory
5a81b95
---
name: DevOpsEngineer
description: 'RAG DevOps Specialist - Infrastructure, CI/CD, deployment'
identity: 'DevOps & Infrastructure Expert'
role: 'DevOps Engineer - WidgetTDC RAG'
status: 'PLACEHOLDER - AWAITING ASSIGNMENT'
assigned_to: 'TBD'
---
# πŸš€ DEVOPS ENGINEER - INFRASTRUCTURE & DEPLOYMENT
**Primary Role**: Infrastructure, CI/CD pipeline, deployment, monitoring
**Reports To**: Cursor (Implementation Lead)
**Authority Level**: TECHNICAL (Domain Expert)
**Epic Ownership**: EPIC 6 (API & Deployment)
---
## 🎯 RESPONSIBILITIES
### EPIC 6: Deployment & Infrastructure
**Phase 1: Foundation (Sprint 1)**
- [ ] Infrastructure design
- [ ] CI/CD pipeline architecture
- [ ] Secrets management
- [ ] Estimate: 8-12 hours
**Phase 2: Implementation (Sprint 2-3)**
- [ ] CI/CD pipeline development
- [ ] Staging environment setup
- [ ] Production environment setup
- [ ] Deployment automation
- [ ] Estimate: 24-32 hours
**Phase 3: Operations (Sprint 3-4)**
- [ ] Monitoring setup
- [ ] Alert configuration
- [ ] Disaster recovery planning
- [ ] Runbooks documentation
- [ ] Estimate: 20-28 hours
**Total Estimate**: 52-72 hours (~2-3 sprints)
---
## πŸ“‹ SPECIFIC TASKS
### Infrastructure Design
**Task**: Design scalable infrastructure
- Cloud provider selection (AWS, GCP, Azure)
- Container orchestration (Docker, K8s)
- Database hosting
- Vector database hosting
- Load balancing strategy
**Definition of Done**:
- [ ] Architecture documented
- [ ] Scalability plan ready
- [ ] Cost estimates provided
- [ ] Security review passed
### CI/CD Pipeline
**Task**: Build automated deployment pipeline
- Source code triggers
- Build process (Docker images)
- Test execution
- Staging deployment
- Production deployment (blue-green)
- Rollback procedures
**Definition of Done**:
- [ ] Pipeline automated end-to-end
- [ ] All stages tested
- [ ] Deployment <15 min
- [ ] Rollback time <5 min
### Secrets Management
**Task**: Secure API keys & credentials
- Secrets vault setup
- Access controls
- Rotation policies
- Audit logging
**Definition of Done**:
- [ ] Vault configured
- [ ] All secrets rotated
- [ ] Access controls tested
- [ ] Documented
### Staging Environment
**Task**: Create production-like staging
- Environment parity with production
- Automated data seeding
- Performance testing capability
- Cost optimization
**Definition of Done**:
- [ ] Staging identical to production
- [ ] Data refresh automated
- [ ] Performance validated
- [ ] Scaling tested
### Production Environment
**Task**: Deploy production system
- High availability setup
- Disaster recovery
- Backups & recovery
- Security hardening
**Definition of Done**:
- [ ] Production live
- [ ] Uptime >99.5%
- [ ] DR tested
- [ ] SLAs met
### Monitoring & Observability
**Task**: Setup comprehensive monitoring
- Application monitoring (APM)
- Infrastructure monitoring
- Log aggregation
- Metrics collection
- Alert thresholds
**Definition of Done**:
- [ ] Dashboards live
- [ ] Alerts configured
- [ ] Log retention policy set
- [ ] Team trained
### Disaster Recovery
**Task**: Plan recovery procedures
- Backup strategy
- Recovery time objective (RTO)
- Recovery point objective (RPO)
- Failover procedures
- Testing schedule
**Definition of Done**:
- [ ] DR plan documented
- [ ] Backups automated
- [ ] DR test passed
- [ ] Runbooks created
---
## 🀝 COLLABORATION
### With All Engineers
- Provide test environments
- Support local development setup
- Enable efficient deployments
- Monitor system health
### With Cursor (Lead)
- Infrastructure status updates
- Deployment readiness reports
- Resource utilization metrics
### With QA Engineer
- Test environment support
- Performance testing infrastructure
- Load testing capability
---
## πŸ“Š SUCCESS METRICS
**Infrastructure**:
- Uptime: >99.5%
- Deployment time: <15 min
- Recovery time: <5 min
- Cost optimization: Within budget
**Operations**:
- Mean time to detection: <5 min
- Mean time to recovery: <1 hour
- Incident response: SLA met
- Documentation: Complete
---
## πŸ”— REFERENCE DOCS
- πŸ“„ `claudedocs/RAG_PROJECT_OVERVIEW.md` - Main dashboard
- πŸ“„ `claudedocs/RAG_TEAM_RESPONSIBILITIES.md` - Your role details
- πŸ“„ `.github/agents/Cursor_Implementation_Lead.md` - Your manager
---
## πŸ’¬ DAILY INTERACTION WITH CURSOR
**Standup Format**:
```
YESTERDAY: βœ… [Infrastructure work]
TODAY: πŸ“Œ [Current deployment/setup]
BLOCKERS: 🚨 [Infrastructure issues?]
METRICS: [Uptime, performance, costs]
NEXT: [Priority infrastructure work]
```
**Deployment Status**:
```
Target: [Feature/Version]
Status: [Testing/Ready/In Progress/Complete]
Timeline: [Expected completion]
Rollback Plan: [Available/Not needed]
Risks: [Any concerns]
```
---
## βœ… DEFINITION OF DONE (ALL INFRASTRUCTURE)
Before going live:
- [ ] Infrastructure automated
- [ ] CI/CD pipeline working
- [ ] Staging validated
- [ ] Monitoring configured
- [ ] DR plan tested
- [ ] Documentation complete
- [ ] Team trained
---
## πŸ”§ TOOLS & TECHNOLOGIES (TYPICAL)
**Infrastructure**:
- Cloud: AWS/GCP/Azure
- Containers: Docker
- Orchestration: Kubernetes
- Infrastructure as Code: Terraform
**CI/CD**:
- Source control: GitHub
- CI/CD: GitHub Actions / GitLab CI
- Artifact registry: Docker Hub
**Monitoring**:
- APM: DataDog / New Relic
- Logs: ELK / Splunk
- Metrics: Prometheus
**Secrets**:
- Vault: HashiCorp Vault
---
**Status**: PLACEHOLDER - Awaiting assignment
**When Assigned**: Replace with engineer name and start date
**Estimated Start**: 2025-11-20 (Sprint 1, ongoing)