shaikhsalman committed
Commit 82ebd41 · verified · 1 Parent(s): 59d9053

docs: OMEGA platform README with 10-dimension scorecard

  1. README.md +121 -131
README.md CHANGED
@@ -1,154 +1,144 @@
- # DevSecOps Platform — Production Reference Architecture

- > Enterprise-grade, security-first, automation-first platform covering the full DevOps, Cloud, Kubernetes, Security, AI/ML lifecycle.

- ## Architecture
-
- ```
- ┌─────────────────────────────────────────────────────────────────┐
- │                            AWS Cloud                            │
- │  ┌──────────┐  ┌──────────┐  ┌──────────┐                       │
- │  │  AZ-1a   │  │  AZ-1b   │  │  AZ-1c   │  Multi-AZ             │
- │  │ ┌──────┐ │  │ ┌──────┐ │  │ ┌──────┐ │                       │
- │  │ │ EKS  │ │  │ │ EKS  │ │  │ │ EKS  │ │  Kubernetes 1.29      │
- │  │ │Node  │ │  │ │Node  │ │  │ │Node  │ │                       │
- │  │ └──────┘ │  │ └──────┘ │  │ └──────┘ │                       │
- │  │ ┌──────┐ │  │ ┌──────┐ │  │ ┌──────┐ │                       │
- │  │ │ RDS  │ │  │ │ RDS  │ │  │ │ RDS  │ │  PostgreSQL (Multi-AZ)│
- │  │ │Replica│ │  │ │Primary│ │  │ │Replica│ │  + KMS Encryption  │
- │  │ └──────┘ │  │ └──────┘ │  │ └──────┘ │                       │
- │  └──────────┘  └──────────┘  └──────────┘                       │
- │                                                                 │
- │  VPC (10.0.0.0/16)                                              │
- │  ├── Public Subnets  → ALB/NLB only                             │
- │  ├── Private Subnets → EKS Nodes + NAT Gateway                  │
- │  └── DB Subnets      → RDS (no internet access)                 │
- │                                                                 │
- │  Security: KMS │ WAF │ GuardDuty │ Macie │ IAM MFA              │
- │  Observability: CloudWatch │ VPC Flow Logs │ CloudTrail         │
- └─────────────────────────────────────────────────────────────────┘
- ```

- ## Kubernetes Platform Stack

  ```
- ┌────────────────────────────────────────────┐
- │             Istio Service Mesh             │
- │          (mTLS STRICT + eBPF CNI)
- ├────────┬────────┬────────┬─────────────────┤
-   ArgoCD │  Cert  │External│   Prometheus    │
-   GitOps │Manager │Secrets │    Grafana      │
-                   │(AWS SM)│      Loki       │
- ├────────┴────────┴────────┴─────────────────┤
- │           Kyverno Policy Engine            │
- │    (Enforce: no root, no :latest, etc.)    │
- ├──────────────────────────────────────────────┤
- │ Trivy Operator │  Falco  │  OPA Gatekeeper   │
- │  (Image Scan)  │(Runtime)│    (Admission)    │
- └──────────────────────────────────────────────┘
  ```

- ## Directory Structure
-
- ```
- devsecops-platform/
- ├── terraform/              # Infrastructure as Code
- │   ├── modules/            # VPC, EKS, RDS, S3, IAM, KMS
- │   └── environments/       # dev, staging, prod configs
- ├── k8s/
- │   ├── base/               # Namespaces, RBAC, NetPols, Quotas
- │   ├── manifests/          # Platform services (ArgoCD, Istio, etc.)
- │   ├── helm-values/        # Helm chart overrides
- │   └── workloads/          # App deployments (frontend, backend, ml)
- ├── docker/
- │   ├── base-images/        # Multi-stage hardened Dockerfiles
- │   ├── scan-scripts/       # Trivy + Grype scanning
- │   ├── sign-scripts/       # Cosign image signing
- │   └── sbom-scripts/       # SPDX + CycloneDX SBOM generation
- ├── ci-cd/
- │   ├── github-actions/     # Full DevSecOps pipeline
- │   ├── jenkins/            # Jenkinsfile
- │   └── gitlab-ci/          # .gitlab-ci.yml
- ├── security/
- │   ├── checkov/            # IaC scanning config
- │   ├── semgrep/            # SAST custom rules
- │   ├── trivy/              # Container + secret scanning
- │   └── sbom/               # SBOM policies
- ├── monitoring/
- │   ├── prometheus/         # Alerting rules
- │   ├── grafana/            # Dashboards
- │   ├── alertmanager/       # Routing & escalation
- │   └── otel/               # OpenTelemetry collector
- ├── compliance/
- │   ├── soc2/               # SOC2 Type II controls mapping
- │   ├── nist/               # NIST 800-53 Rev5 mapping
- │   ├── cis-benchmarks/     # CIS EKS + K8s checks
- │   └── policies/           # OPA Gatekeeper policies
- ├── ai-ml/
- │   ├── rag-pipeline/       # LangChain + HF + ChromaDB
- │   ├── mlflow/             # MLflow tracking deployment
- │   └── hf-finetuning/      # SFT + LoRA fine-tuning
- └── scripts/
-     ├── python/             # Security audit automation
-     └── bash/               # Bootstrap + incident response
- ```

  ## Quick Start

  ```bash
- # 1. Bootstrap the platform
  ./scripts/bash/bootstrap.sh prod

- # 2. Run security audit
  python3 scripts/python/security_audit.py

- # 3. Incident response
- ./scripts/bash/incident-response.sh security
- ```
-
- ## Security Controls Summary
-
- | Control | Implementation | Enforcement |
- |---------|---------------|-------------|
- | **Zero Trust Network** | Default deny + selective allow NetPol | Kyverno |
- | **mTLS** | Istio STRICT mode | PeerAuthentication |
- | **No Root** | runAsNonRoot + distroless images | Kyverno Enforce |
- | **No :latest** | Version pinning required | Kyverno Enforce |
- | **Secret Encryption** | KMS + EKS encryption config | Terraform |
- | **Image Scanning** | Trivy Operator continuous | CI/CD gate |
- | **Runtime Detection** | Falco eBPF + custom rules | Alertmanager |
- | **SBOM** | SPDX + CycloneDX + Cosign attestation | CI/CD |
- | **Least Privilege IAM** | MFA + scoped roles + IRSA | Terraform |

- ## Compliance Coverage

- | Framework | Controls | Status |
- |-----------|----------|--------|
- | SOC2 Type II | CC6.1–CC9.1 | ✅ Mapped |
- | NIST 800-53 Rev5 | AC-2, AU-2, SC-7, SI-4 | ✅ Mapped |
- | CIS EKS Benchmark | 1.1.1–5.3.2 | ✅ Automated |
- | PCI-DSS | Req 6, 8, 10, 11 | ✅ Partial |

- ## CI/CD Pipeline Stages

- ```
- SAST (Semgrep + Checkov + Trivy Secrets)
-   → Build (Multi-stage Docker + ECR Push)
-   → Scan (Trivy Image + SBOM Generation)
-   → Test (Integration + OWASP ZAP DAST)
-   → Sign (Cosign Keyless + SBOM Attest)
-   → Deploy Staging (ArgoCD GitOps Sync)
-   → Deploy Prod (Manual Approval + Smoke Test)
- ```

- ## Observability Stack

- - **Metrics**: Prometheus Grafana dashboards
- - **Logs**: Loki + Promtail → Grafana LogQL
- - **Traces**: OpenTelemetry → Tempo → Grafana
- - **Alerts**: Prometheus rules → Alertmanager → Slack + PagerDuty
- - **Security**: Falco → Alertmanager → Slack #security-alerts

- ## License

- Internal use — Enterprise DevSecOps Reference Architecture
+ # DevSecOps Platform OMEGA Enterprise AI Operating System

+ > Production-grade, security-first, automation-first platform covering the full DevOps, Cloud, Kubernetes, Security, AI/ML, FinOps, and Governance lifecycle.

+ **156 files | 182KB | 13 domains | All production-ready**

+ ## Architecture

  ```
+             ENGINEERING COMMAND CENTER
+                          |
+       +------------------+------------------+
+       |         |        |        |         |
+  RELIABILITY SECURITY FINOPS  PLATFORM    AI/ML
+   (SLO/PDB) (GuardDuty) (Cost) (Golden  (RAG/SFT)
+       |         |        |      Path)       |
+       +---------+--------+--------+---------+--+
+                 |                 |
+            KUBERNETES         TERRAFORM
+            (Kustomize)      (IaC Modules)
+                 |                 |
+              AWS CLOUD INFRASTRUCTURE
  ```

+ ## OMEGA 10-Dimension Scorecard
+
+ | # | Dimension | Score | Assets |
+ |---|-----------|-------|--------|
+ | 1 | **Reliability** | 8/10 | PDBs, SLOs, HPA, multi-AZ, Istio |
+ | 2 | **Security** | 9/10 | GuardDuty, Macie, Falco, Kyverno, Trivy, mTLS |
+ | 3 | **Dev Velocity** | 7/10 | Golden paths, self-service envs, Kustomize |
+ | 4 | **Cost Efficiency** | 7/10 | FinOps scanner, spot instances, scheduling policy |
+ | 5 | **Governance** | 8/10 | SOC2, NIST 800-53, CIS, OPA, ADR template |
+ | 6 | **Automation** | 7/10 | Bootstrap, auto-remediation, GitOps (ArgoCD) |
+ | 7 | **Incident Recovery** | 8/10 | Runbook, postmortem template, war-room |
+ | 8 | **Standardization** | 8/10 | Kustomize overlays, golden path templates |
+ | 9 | **AI Enablement** | 8/10 | RAG, LoRA v2, MLflow, Trackio, GPU scheduling |
+ | 10 | **Engineering Excellence** | 7/10 | ADR template, checklists, SRE standards |
+
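The ten scores in the scorecard sum to 77 out of 100. The following minimal Python sketch is not part of the commit; it shows one way to cross-check that total and list the lowest-scoring dimensions. The `SCORECARD` dict and the `summarize` helper are hypothetical names, and the values are copied from the table.

```python
# Scores copied from the scorecard table; names here are illustrative only.
SCORECARD = {
    "Reliability": 8,
    "Security": 9,
    "Dev Velocity": 7,
    "Cost Efficiency": 7,
    "Governance": 8,
    "Automation": 7,
    "Incident Recovery": 8,
    "Standardization": 8,
    "AI Enablement": 8,
    "Engineering Excellence": 7,
}


def summarize(scores: dict) -> dict:
    """Return the total, the maximum possible total, the mean, and the weakest dimensions."""
    total = sum(scores.values())
    lowest = min(scores.values())
    return {
        "total": total,
        "max_total": 10 * len(scores),  # every dimension is scored out of 10
        "mean": total / len(scores),
        "weakest": sorted(name for name, s in scores.items() if s == lowest),
    }


if __name__ == "__main__":
    print(summarize(SCORECARD))
```

Four dimensions tie for the lowest score (7/10), so a summary like this may be a convenient starting point for the "Self-Improvement Checklist" later in the README.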
+ ## Platform Modules
+
+ ### Infrastructure (Terraform)
+ | Module | Purpose | Key Feature |
+ |--------|---------|-------------|
+ | VPC | Network isolation | Flow logs, default deny SG/NACL |
+ | EKS | Kubernetes cluster | Private API, KMS encryption, IRSA |
+ | RDS | Database | Multi-AZ, encrypted, performance insights |
+ | S3 | Storage | SSE-KMS, versioning, lifecycle |
+ | IAM | Access control | MFA, least privilege, access analyzer |
+ | KMS | Key management | Auto-rotation, multi-key |
+ | GuardDuty | Threat detection | EBS malware scan, K8s audit, S3 |
+ | Macie | PII detection | Automated data classification |
+
+ ### Kubernetes
+ | Layer | Components |
+ |-------|-----------|
+ | **Base** | Namespaces, RBAC, NetPols, Quotas, Limits, PDBs, SLOs |
+ | **Platform** | ArgoCD, Istio (mTLS), ExternalSecrets, CertManager |
+ | **Security** | Trivy Operator, Falco (eBPF), Kyverno (7 policies), OPA |
+ | **Observability** | Prometheus, Grafana, Loki, Alertmanager, OTEL |
+ | **Workloads** | Frontend, Backend (HPA), ML Pipeline (GPU) |
+
+ ### FinOps Engine
+ | Asset | Purpose |
+ |-------|---------|
+ | finops-policy.yaml | 11 cost optimization rules |
+ | finops_scanner.py | Automated waste detection |
+ | cost-optimization.yaml | Spot instance strategy + KEDA |
+ | finops-cronjob.yaml | Daily cost scan CronJob |
+
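The diff describes `finops_scanner.py` only as "Automated waste detection" and does not show its logic. Purely as an illustration of what such a check could look like, the sketch below flags workloads whose average CPU usage is a small fraction of what they request. The `Workload` record, the `find_waste` function, and the 30% threshold are all assumptions for this example and are not taken from the repository.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical usage record; field names are assumptions for this sketch."""
    name: str
    cpu_requested: float  # CPU cores requested
    cpu_used: float       # average CPU cores actually consumed


def find_waste(workloads, min_utilization=0.3):
    """Return (name, utilization) for workloads using less than min_utilization of their request."""
    wasted = []
    for w in workloads:
        if w.cpu_requested <= 0:
            continue  # nothing requested, so no utilization ratio to compute
        utilization = w.cpu_used / w.cpu_requested
        if utilization < min_utilization:
            wasted.append((w.name, round(utilization, 2)))
    return wasted


if __name__ == "__main__":
    sample = [
        Workload("backend", cpu_requested=2.0, cpu_used=1.5),
        Workload("ml-batch", cpu_requested=8.0, cpu_used=0.8),
    ]
    print(find_waste(sample))
```

In this sample only `ml-batch` falls under the threshold, since it uses 0.8 of 8 requested cores. A real scanner would presumably read usage from a metrics source rather than hard-coded records.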
+ ### Platform Engineering
+ | Asset | Purpose |
+ |-------|---------|
+ | golden-paths/microservice/ | Production-ready service template + checklist |
+ | self-service/ | Ephemeral environment provisioning config |
+ | adr/template.md | Architecture Decision Record template |
+ | kustomize/ | Base + dev/staging/prod overlays |
+
+ ### Incident Response
+ | Asset | Purpose |
+ |-------|---------|
+ | auto-remediate.sh | OOM fix, pod restart, security escalation |
+ | postmortem/template.md | Full postmortem with 5 Whys + action items |
+ | incident-response.sh | Diagnostic runbook (5 incident types) |
+
+ ### AI/ML Hub
+ | Asset | Purpose |
+ |-------|---------|
+ | finetune.py | LoRA Without Regret (r=256, all-linear) |
+ | run_finetune.py | CLI entry point with dataset selection |
+ | TRAINING_RECIPE.md | v1→v2 upgrade documentation |
+ | rag_pipeline.py | LangChain + HF + ChromaDB RAG |
+ | mlflow/ | MLflow tracking deployment |
+
+ ### Compliance
+ | Framework | Coverage |
+ |-----------|---------|
+ | SOC2 Type II | CC6-CC9 controls mapped |
+ | NIST 800-53 | 12 controls mapped |
+ | CIS Benchmarks | EKS + K8s automated |
+ | OPA Gatekeeper | Admission policies |
+
+ ### CI/CD Pipelines
+ | System | Features |
+ |--------|----------|
+ | GitHub Actions | 6-stage DevSecOps (SAST→Build→Scan→Test→Sign→Deploy) |
+ | Jenkins | Parallel SAST + production deployment |
+ | GitLab CI | Full scan + sign + deploy pipeline |
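
The six-stage order listed for GitHub Actions (SAST→Build→Scan→Test→Sign→Deploy) implies that each stage gates the next. The sketch below is a minimal Python model of that gating, not the actual workflow from the commit: `STAGES` mirrors the order in the table, while `run_pipeline` and the per-stage check callables are hypothetical. A stage with no registered check is treated as a failure, on the assumption that a security pipeline should fail closed.

```python
from typing import Callable, Dict, List

# Stage order copied from the CI/CD table; everything else is illustrative.
STAGES = ["SAST", "Build", "Scan", "Test", "Sign", "Deploy"]


def run_pipeline(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the list of stages that passed. A stage without a registered
    check counts as failed, so nothing is deployed by omission.
    """
    passed = []
    for stage in STAGES:
        check = checks.get(stage)
        if check is None or not check():
            break  # gate closed: later stages never run
        passed.append(stage)
    return passed


if __name__ == "__main__":
    checks = {stage: (lambda: True) for stage in STAGES}
    checks["Scan"] = lambda: False  # simulate a failed image scan
    print(run_pipeline(checks))
```

With the simulated scan failure, only `SAST` and `Build` pass, and `Sign` and `Deploy` are never reached.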

  ## Quick Start

  ```bash
+ # Bootstrap full platform
  ./scripts/bash/bootstrap.sh prod

+ # Security audit
  python3 scripts/python/security_audit.py

+ # FinOps cost scan
+ python3 finops/finops_scanner.py

+ # Incident response
+ ./scripts/bash/incident-response.sh security

+ # Auto-remediate
+ ./incident-response/auto-remediation/auto-remediate.sh PodCrashLooping backend <pod-name>
+ ```

+ ## Self-Improvement Checklist

+ After every deployment, ask:

+ - [ ] Can this be automated?
+ - [ ] Can this be templated?
+ - [ ] Can this be secured further?
+ - [ ] Can this be cheaper?
+ - [ ] Can this scale better?
+ - [ ] Can this reduce human toil?

+ If yes, enhance and push.

+ ## Hub

+ **[huggingface.co/shaikhsalman/devsecops-platform](https://huggingface.co/shaikhsalman/devsecops-platform)**