shaikhsalman committed on
Commit 9d4d5c7 · verified · Parent: 36df1e5

refactor: merged structure - model at center, DevSecOps wrapped around it

This view is limited to 50 files because it contains too many changes. See the raw diff for the full changeset.

Files changed (50)
  1. README.md +60 -129
  2. ai-ml/hf-finetuning/TRAINING_RECIPE.md +0 -58
  3. ai-ml/hf-finetuning/run_finetune.py +0 -67
  4. ci-cd/gitlab-ci/.gitlab-ci.yml +0 -113
  5. ci-cd/jenkins/Jenkinsfile +0 -136
  6. compliance/{cis-benchmarks/cis-eks-k8s.yaml → cis-eks-k8s.yaml} +0 -0
  7. compliance/{soc2/controls-mapping.yaml → controls-mapping.yaml} +0 -0
  8. compliance/{nist/nist-800-53-mapping.yaml → nist-800-53-mapping.yaml} +0 -0
  9. {docker/base-images → deployment}/Dockerfile.ml-inference +0 -0
  10. {k8s/workloads/ml-pipeline → deployment}/deployment.yaml +0 -0
  11. {ai-ml/mlflow → deployment}/mlflow-deployment.yaml +0 -0
  12. docker/base-images/Dockerfile.backend +0 -51
  13. docker/base-images/Dockerfile.frontend +0 -33
  14. finops/cost-optimization.yaml +0 -73
  15. finops/finops-cronjob.yaml +0 -23
  16. finops/finops-policy.yaml +0 -73
  17. incident-response/auto-remediation/auto-remediate.sh +0 -50
  18. platform/adr/template.md → infrastructure/adr-template.md +0 -0
  19. {ci-cd/github-actions → infrastructure/ci-cd}/devsecops-pipeline.yml +0 -0
  20. {finops → infrastructure}/finops_scanner.py +0 -0
  21. {scripts/bash → infrastructure}/incident-response.sh +0 -0
  22. {monitoring → infrastructure/monitoring}/alertmanager/alertmanager-config.yaml +0 -0
  23. monitoring/grafana/dashboards/platform-overview.json → infrastructure/monitoring/grafana-platform-overview.json +0 -0
  24. {monitoring → infrastructure/monitoring}/otel/otel-collector.yaml +0 -0
  25. {monitoring → infrastructure/monitoring}/prometheus/alerts.yaml +0 -0
  26. incident-response/postmortem/template.md → infrastructure/postmortem-template.md +0 -0
  27. {terraform → infrastructure/terraform}/environments/prod/main.tf +0 -0
  28. {terraform → infrastructure/terraform}/modules/eks/main.tf +0 -0
  29. {terraform → infrastructure/terraform}/modules/eks/outputs.tf +0 -0
  30. {terraform → infrastructure/terraform}/modules/eks/variables.tf +0 -0
  31. {terraform → infrastructure/terraform}/modules/guardduty/main.tf +0 -0
  32. {terraform → infrastructure/terraform}/modules/guardduty/variables.tf +0 -0
  33. {terraform → infrastructure/terraform}/modules/iam/main.tf +0 -0
  34. {terraform → infrastructure/terraform}/modules/kms/main.tf +0 -0
  35. {terraform → infrastructure/terraform}/modules/macie/main.tf +0 -0
  36. {terraform → infrastructure/terraform}/modules/rds/main.tf +0 -0
  37. {terraform → infrastructure/terraform}/modules/rds/variables.tf +0 -0
  38. {terraform → infrastructure/terraform}/modules/s3/main.tf +0 -0
  39. {terraform → infrastructure/terraform}/modules/s3/variables.tf +0 -0
  40. {terraform → infrastructure/terraform}/modules/vpc/main.tf +0 -0
  41. {terraform → infrastructure/terraform}/modules/vpc/outputs.tf +0 -0
  42. {terraform → infrastructure/terraform}/modules/vpc/variables.tf +0 -0
  43. k8s/base/limit-ranges/limit-ranges.yaml +0 -74
  44. k8s/base/namespaces/namespaces.yaml +0 -69
  45. k8s/base/network-policies/network-policies.yaml +0 -124
  46. k8s/base/pdbs/pdbs.yaml +0 -62
  47. k8s/base/rbac/rbac.yaml +0 -78
  48. k8s/base/resource-quotas/resource-quotas.yaml +0 -50
  49. k8s/base/slos/slos.yaml +0 -68
  50. k8s/kustomize/base/kustomization.yaml +0 -18
README.md CHANGED
@@ -1,144 +1,75 @@
- # DevSecOps Platform OMEGA — Enterprise AI Operating System
-
- > Production-grade, security-first, automation-first platform covering the full DevOps, Cloud, Kubernetes, Security, AI/ML, FinOps, and Governance lifecycle.
-
- **156 files | 182KB | 13 domains | All production-ready**
-
- ## Architecture
-
- ```
-                 ENGINEERING COMMAND CENTER
-                            |
-      +----------+----------+----------+----------+
-      |          |          |          |          |
- RELIABILITY  SECURITY    FINOPS    PLATFORM    AI/ML
-  (SLO/PDB)  (GuardDuty)  (Cost)    (Golden   (RAG/SFT)
-      |          |          |        Path)       |
-      +----------+----+-----+----------+---------+
-                      |                |
-                 KUBERNETES        TERRAFORM
-                (Kustomize)      (IaC Modules)
-                      |                |
-               AWS CLOUD INFRASTRUCTURE
- ```
-
- ## OMEGA 10-Dimension Scorecard
-
- | # | Dimension | Score | Assets |
- |---|-----------|-------|--------|
- | 1 | **Reliability** | 8/10 | PDBs, SLOs, HPA, multi-AZ, Istio |
- | 2 | **Security** | 9/10 | GuardDuty, Macie, Falco, Kyverno, Trivy, mTLS |
- | 3 | **Dev Velocity** | 7/10 | Golden paths, self-service envs, Kustomize |
- | 4 | **Cost Efficiency** | 7/10 | FinOps scanner, spot instances, scheduling policy |
- | 5 | **Governance** | 8/10 | SOC2, NIST 800-53, CIS, OPA, ADR template |
- | 6 | **Automation** | 7/10 | Bootstrap, auto-remediation, GitOps (ArgoCD) |
- | 7 | **Incident Recovery** | 8/10 | Runbook, postmortem template, war-room |
- | 8 | **Standardization** | 8/10 | Kustomize overlays, golden path templates |
- | 9 | **AI Enablement** | 8/10 | RAG, LoRA v2, MLflow, Trackio, GPU scheduling |
- | 10 | **Engineering Excellence** | 7/10 | ADR template, checklists, SRE standards |
-
- ## Platform Modules
-
- ### Infrastructure (Terraform)
- | Module | Purpose | Key Feature |
- |--------|---------|-------------|
- | VPC | Network isolation | Flow logs, default deny SG/NACL |
- | EKS | Kubernetes cluster | Private API, KMS encryption, IRSA |
- | RDS | Database | Multi-AZ, encrypted, performance insights |
- | S3 | Storage | SSE-KMS, versioning, lifecycle |
- | IAM | Access control | MFA, least privilege, access analyzer |
- | KMS | Key management | Auto-rotation, multi-key |
- | GuardDuty | Threat detection | EBS malware scan, K8s audit, S3 |
- | Macie | PII detection | Automated data classification |
-
- ### Kubernetes
- | Layer | Components |
- |-------|-----------|
- | **Base** | Namespaces, RBAC, NetPols, Quotas, Limits, PDBs, SLOs |
- | **Platform** | ArgoCD, Istio (mTLS), ExternalSecrets, CertManager |
- | **Security** | Trivy Operator, Falco (eBPF), Kyverno (7 policies), OPA |
- | **Observability** | Prometheus, Grafana, Loki, Alertmanager, OTEL |
- | **Workloads** | Frontend, Backend (HPA), ML Pipeline (GPU) |
-
- ### FinOps Engine
- | Asset | Purpose |
- |-------|---------|
- | finops-policy.yaml | 11 cost optimization rules |
- | finops_scanner.py | Automated waste detection |
- | cost-optimization.yaml | Spot instance strategy + KEDA |
- | finops-cronjob.yaml | Daily cost scan CronJob |
-
- ### Platform Engineering
- | Asset | Purpose |
- |-------|---------|
- | golden-paths/microservice/ | Production-ready service template + checklist |
- | self-service/ | Ephemeral environment provisioning config |
- | adr/template.md | Architecture Decision Record template |
- | kustomize/ | Base + dev/staging/prod overlays |
-
- ### Incident Response
- | Asset | Purpose |
- |-------|---------|
- | auto-remediate.sh | OOM fix, pod restart, security escalation |
- | postmortem/template.md | Full postmortem with 5 Whys + action items |
- | incident-response.sh | Diagnostic runbook (5 incident types) |
-
- ### AI/ML Hub
- | Asset | Purpose |
- |-------|---------|
- | finetune.py | LoRA Without Regret (r=256, all-linear) |
- | run_finetune.py | CLI entry point with dataset selection |
- | TRAINING_RECIPE.md | v1→v2 upgrade documentation |
- | rag_pipeline.py | LangChain + HF + ChromaDB RAG |
- | mlflow/ | MLflow tracking deployment |
-
- ### Compliance
- | Framework | Coverage |
- |-----------|---------|
- | SOC2 Type II | CC6-CC9 controls mapped |
- | NIST 800-53 | 12 controls mapped |
- | CIS Benchmarks | EKS + K8s automated |
- | OPA Gatekeeper | Admission policies |
-
- ### CI/CD Pipelines
- | System | Features |
- |--------|----------|
- | GitHub Actions | 6-stage DevSecOps (SAST→Build→Scan→Test→Sign→Deploy) |
- | Jenkins | Parallel SAST + production deployment |
- | GitLab CI | Full scan + sign + deploy pipeline |
-
- ## Quick Start
-
- ```bash
- # Bootstrap full platform
- ./scripts/bash/bootstrap.sh prod
-
- # Security audit
- python3 scripts/python/security_audit.py
-
- # FinOps cost scan
- python3 finops/finops_scanner.py
-
- # Incident response
- ./scripts/bash/incident-response.sh security
-
- # Auto-remediate
- ./incident-response/auto-remediation/auto-remediate.sh PodCrashLooping backend <pod-name>
- ```
-
- ## Self-Improvement Checklist
-
- After every deployment, ask:
-
- - [ ] Can this be automated?
- - [ ] Can this be templated?
- - [ ] Can this be secured further?
- - [ ] Can this be cheaper?
- - [ ] Can this scale better?
- - [ ] Can this reduce human toil?
-
- If yes, enhance and push.
-
- ## Hub
-
- **[huggingface.co/shaikhsalman/devsecops-platform](https://huggingface.co/shaikhsalman/devsecops-platform)**
+ ---
+ license: apache-2.0
+ tags:
+ - devsecops
+ - llm
+ - sft
+ - lora
+ - tulu-3
+ - kubernetes
+ - terraform
+ ---
+
+ # DevSecOps Model Platform
+
+ > Train a secure model on the best data, then deploy it securely.
+
+ ## Start Here: Train Your Model
+
+ | Dataset | Size | What It Gives You | Command |
+ |---------|------|-------------------|---------|
+ | **tulu-3-sft-mixture** | 940K | Math, code, safety, chat (BEST) | `python model/train_tulu3.py` |
+ | **OpenThoughts-114k** | 114K | Reasoning, chain-of-thought | `python model/train_openthoughts.py` |
+
+ **allenai/tulu-3-sft-mixture** is the SFT mixture behind Allen AI's Tulu 3, the current SOTA open instruction-tuned model family. Proven on Llama-3.1-8B: MMLU 53.5, GSM8K 79.9, HumanEval 76.8.
+
+ The LoRA config follows LoRA Without Regret (Schulman, 2025): r=256, alpha=16, all-linear targets. It matches full fine-tuning at 67% of the compute.
+
+ ## Repository Structure
+
+ ```
+ model/                       THE MODEL - train, serve, enhance
+   train_tulu3.py             Primary: 940K best data (zero preprocessing)
+   train_openthoughts.py      Reasoning: 114K CoT traces
+   finetune_configurable.py   Multi-dataset configurable trainer
+   rag_pipeline.py            RAG for DevSecOps knowledge
+   DATASETS.md                Why these datasets, proven recipes
+
+ deployment/                  SERVE IT - Kubernetes + Docker + vLLM
+   deployment.yaml            ML inference K8s manifest
+   mlflow-deployment.yaml     Experiment tracking
+   Dockerfile.ml-inference    Hardened multi-stage image
+
+ security/                    PROTECT IT - scanning + policies
+   scanning/                  Trivy, Semgrep, Checkov, SBOM
+   policies/                  Kyverno, OPA Gatekeeper
+
+ infrastructure/              RUN IT - Terraform + monitoring + CI/CD
+   terraform/                 VPC, EKS, RDS, S3, IAM, KMS, GuardDuty, Macie
+   monitoring/                Prometheus, Alertmanager, OTEL, Grafana
+   ci-cd/                     GitHub Actions DevSecOps pipeline
+
+ compliance/                  CERTIFY IT - SOC2, NIST, CIS
+   controls-mapping.yaml      SOC2 Type II
+   nist-800-53-mapping.yaml   NIST 800-53 Rev5
+   cis-eks-k8s.yaml           CIS Benchmarks
+ ```
+
+ ## Quick Commands
+
+ ```bash
+ # Train on best data (A100, ~6h)
+ python model/train_tulu3.py
+
+ # Quick test (any GPU)
+ python model/train_tulu3.py --max_steps 100 --no_push
+
+ # Security scan
+ python security/scanning/security_audit.py
+
+ # Deploy model to K8s
+ kubectl apply -f deployment/deployment.yaml
+
+ # Infrastructure (Terraform)
+ cd infrastructure/terraform/environments/prod && terraform apply
+ ```
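
As context for the LoRA claim in the new README, here is a minimal sketch of the stated r=256 / alpha=16 / all-linear setup using Hugging Face `peft`. Those three values come from the README; the model id and dropout are illustrative assumptions, not the repo's actual `train_tulu3.py`.

```python
# Sketch only: the LoRA setup described in the README above, expressed as a
# peft LoraConfig. Values other than r, alpha, and target_modules are assumed.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=256,                        # high rank: SFT-scale data needs capacity
    lora_alpha=16,                # stable scaling at high rank
    target_modules="all-linear",  # adapt every linear layer, not just attention
    lora_dropout=0.0,             # assumption; not specified in the README
    task_type="CAUSAL_LM",
)

# Gated model: assumes Hugging Face access has been granted.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters only, base weights frozen
```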
ai-ml/hf-finetuning/TRAINING_RECIPE.md DELETED
@@ -1,58 +0,0 @@
- # Model Enhancement — Dataset & Training Recipe vNext
-
- ## What Changed (v1 → v2)
-
- | Parameter | v1 (Old) | v2 (LoRA Without Regret) | Why |
- |-----------|----------|--------------------------|-----|
- | **Dataset** | ultrachat_200k (5K subset) | **tulu-3-sft-mixture** (940K) | 19 curated sources > single source |
- | **LoRA r** | 16 | **256** | SFT-scale datasets need r=256 to match full FT |
- | **LoRA alpha** | 32 | **16** | Stable scaling with high rank |
- | **Target modules** | q/k/v/o_proj only | **all-linear** | Attention-only underperforms even at higher rank |
- | **Effective batch** | 32 | **16** | LoRA less tolerant of large batches |
- | **Learning rate** | 2e-4 | **2e-4** (same) | 10x full FT rate — correct in v1 |
- | **Packing** | False | **True (bfd_split)** | Preserves all tokens, 2-3x throughput |
- | **assistant_only_loss** | False | **True** | Loss only on assistant tokens |
- | **EOS token** | Not set | **<\|eot_id\|>** | Llama 3.1 chat template |
- | **LR scheduler** | linear | **cosine** | Better convergence for LoRA |
- | **Epochs** | 3 | **1** | 940K examples = 1 epoch sufficient |
-
- ## Dataset Comparison
-
- | Dataset | Size | Format | Best For | Quality |
- |---------|------|--------|----------|---------|
- | **tulu-3-sft-mixture** | 940K | messages ✅ | General SFT (code, math, IF, safety, science) | ⭐⭐⭐⭐⭐ |
- | **OpenThoughts-114k** | 114K | conversations (needs conversion) | Reasoning, CoT traces | ⭐⭐⭐⭐ |
- | ultrachat_200k | 200K | messages ✅ | Multi-turn chat baseline | ⭐⭐⭐ |
-
- ## Key Research: "LoRA Without Regret" (Schulman et al., 2025)
-
- Four findings that change how we fine-tune:
-
- 1. **Target ALL linear layers** — not just attention. Increasing rank does NOT compensate for skipping layers.
- 2. **Use r=256 for SFT** — sufficient capacity for post-training scale datasets.
- 3. **Use 10x higher LR** (2e-4 vs 2e-5 for full FT) — 1/r scaling makes optimal LR rank-independent.
- 4. **Keep batch size < 32** — LoRA is less tolerant of large batches. Cannot be mitigated by increasing rank.
-
- ## Recommended Training Matrix
-
- ### SFT (Supervised Fine-Tuning)
-
- | Model | Dataset | Hardware | Time | Cost |
- |-------|---------|----------|------|------|
- | Llama-3.1-8B-Instruct | tulu-3-sft (940K) | A100 (80GB) | ~6h | ~$24 |
- | Llama-3.1-8B-Instruct | OpenThoughts-114k | A100 (80GB) | ~2h | ~$8 |
- | Llama-3.1-8B-Instruct | tulu-3-sft (940K) | A10G (24GB) + QLoRA | ~12h | ~$24 |
-
- ### GRPO (Reinforcement Learning)
-
- | Model | Dataset | LoRA r | Hardware |
- |-------|---------|--------|----------|
- | Qwen3-0.6B | OpenR1-Math-220k | 1 | A100 |
- | Llama-3.1-8B-Base | GSM8k | 1-32 | A100 |
-
- ## Source Attribution
-
- - LoRA Without Regret: Schulman et al., 2025, Thinking Machines Lab
- - tulu-3-sft-mixture: Allen AI, used by Tulu 3 (SOTA open instruction-tuned)
- - OpenThoughts-114k: open community, reasoning-heavy CoT data
- - LoRA Land: Predibase 2024, 224/310 LoRA models surpassed GPT-4
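
For context on the deleted recipe above, a minimal sketch of how the v2 hyperparameters map onto TRL + PEFT, assuming recent versions of both libraries (`assistant_only_loss` in particular requires a recent TRL). The hyperparameter values are from the v1→v2 table; the output path, batch split, and other wiring are assumptions, not the repo's `finetune.py`:

```python
# Sketch of the v2 recipe (LoRA Without Regret) in TRL + PEFT.
# Hyperparameters come from the recipe table; everything else is illustrative.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("allenai/tulu-3-sft-mixture", split="train")

peft_config = LoraConfig(
    r=256, lora_alpha=16, target_modules="all-linear", task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="llama31-8b-tulu3-lora",  # assumption
    learning_rate=2e-4,                  # 10x the full-FT rate
    lr_scheduler_type="cosine",
    num_train_epochs=1,                  # 940K examples: one pass
    per_device_train_batch_size=4,       # assumption; effective batch 16
    gradient_accumulation_steps=4,       #   via 4 x 4 accumulation
    packing=True,                        # recipe's BFD-style packing
    assistant_only_loss=True,            # loss on assistant tokens only
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # gated; assumes HF access
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```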
ai-ml/hf-finetuning/run_finetune.py DELETED
@@ -1,67 +0,0 @@
- # =============================================================================
- # SFT Fine-Tuning — CLI Entry Point (LoRA Without Regret config)
- # =============================================================================
- # Usage:
- #   # Default: tulu-3-sft + Llama-3.1-8B
- #   python run_finetune.py
- #
- #   # OpenThoughts reasoning dataset
- #   python run_finetune.py --dataset_key openthoughts-114k
- #
- #   # Ultrachat fallback
- #   python run_finetune.py --dataset_key ultrachat-200k
- #
- #   # Custom hub model ID
- #   python run_finetune.py --hub_model_id my-org/my-model-v2
- # =============================================================================
-
- import argparse
- import sys
-
- from finetune import FinetuneConfig, finetune, DATASET_REGISTRY
-
-
- def main():
-     parser = argparse.ArgumentParser(description="SFT Fine-Tuning (LoRA Without Regret)")
-     parser.add_argument("--dataset_key", default="tulu-3-sft",
-                         choices=list(DATASET_REGISTRY.keys()),
-                         help="Dataset to train on")
-     parser.add_argument("--hub_model_id", default=None,
-                         help="HuggingFace Hub model ID for push")
-     parser.add_argument("--num_train_epochs", type=int, default=None)
-     parser.add_argument("--learning_rate", type=float, default=None)
-     parser.add_argument("--lora_r", type=int, default=None)
-     parser.add_argument("--per_device_train_batch_size", type=int, default=None)
-     parser.add_argument("--max_seq_length", type=int, default=None)
-
-     args = parser.parse_args()
-
-     config = FinetuneConfig()
-     if args.dataset_key:
-         config.dataset_key = args.dataset_key
-     if args.hub_model_id:
-         config.hub_model_id = args.hub_model_id
-     if args.num_train_epochs:
-         config.num_train_epochs = args.num_train_epochs
-     if args.learning_rate:
-         config.learning_rate = args.learning_rate
-     if args.lora_r:
-         config.lora_r = args.lora_r
-     if args.per_device_train_batch_size:
-         config.per_device_train_batch_size = args.per_device_train_batch_size
-     if args.max_seq_length:
-         config.max_seq_length = args.max_seq_length
-
-     print(f"Config: model={config.model_name}")
-     print(f"  dataset={config.dataset_key}")
-     print(f"  lora_r={config.lora_r}, lora_alpha={config.lora_alpha}")
-     print(f"  target_modules={config.target_modules}")
-     print(f"  lr={config.learning_rate}, epochs={config.num_train_epochs}")
-     print(f"  effective_batch={config.per_device_train_batch_size * config.gradient_accumulation_steps}")
-     print(f"  packing={config.packing}, strategy={config.packing_strategy}")
-     print(f"  assistant_only_loss={config.assistant_only_loss}")
-
-     finetune(config)
-
-
- if __name__ == "__main__":
-     main()
ci-cd/gitlab-ci/.gitlab-ci.yml DELETED
@@ -1,113 +0,0 @@
- # =============================================================================
- # GitLab CI — DevSecOps Pipeline
- # =============================================================================
-
- stages:
-   - sast
-   - build
-   - scan
-   - test
-   - sign
-   - deploy
-
- variables:
-   REGISTRY: ecr.aws/devsecops
-   TRIVY_SEVERITY: "CRITICAL,HIGH"
-
- # --- SAST Stage ---
- semgrep:
-   stage: sast
-   image: semgrep/semgrep:latest
-   script:
-     - semgrep --config auto --json --output semgrep.json .
-   artifacts:
-     paths:
-       - semgrep.json
-
- secret-scan:
-   stage: sast
-   image: aquasec/trivy:latest
-   script:
-     - trivy fs --scanners secret --exit-code 1 .
-
- checkov:
-   stage: sast
-   image: bridgecrew/checkov:latest
-   script:
-     - checkov -d terraform/ --output cli
-
- # --- Build Stage ---
- build:
-   stage: build
-   image: docker:24
-   services:
-     - docker:24-dind
-   before_script:
-     - aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY
-   script:
-     - |
-       docker build \
-         --build-arg BUILD_DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
-         -t $REGISTRY/$CI_PROJECT_NAME:$CI_COMMIT_SHORT_SHA \
-         -t $REGISTRY/$CI_PROJECT_NAME:latest .
-     - docker push $REGISTRY/$CI_PROJECT_NAME:$CI_COMMIT_SHORT_SHA
-
- # --- Scan Stage ---
- trivy-scan:
-   stage: scan
-   image: aquasec/trivy:latest
-   needs: [build]
-   script:
-     - trivy image --severity $TRIVY_SEVERITY --exit-code 1 --ignore-unfixed $REGISTRY/$CI_PROJECT_NAME:$CI_COMMIT_SHORT_SHA
-
- generate-sbom:
-   stage: scan
-   image: anchore/syft:latest
-   needs: [build]
-   script:
-     - syft $REGISTRY/$CI_PROJECT_NAME:$CI_COMMIT_SHORT_SHA -o spdx-json > sbom.spdx.json
-   artifacts:
-     paths:
-       - sbom.spdx.json
-
- # --- Test Stage ---
- integration-test:
-   stage: test
-   image: docker:24
-   services:
-     - docker:24-dind
-   script:
-     - docker compose -f docker-compose.test.yml up --abort-on-container-exit
-
- # --- Sign Stage ---
- sign:
-   stage: sign
-   image: bitnami/cosign:latest
-   needs: [build, trivy-scan, generate-sbom]
-   variables:
-     COSIGN_EXPERIMENTAL: "1"
-   script:
-     - cosign sign --yes $REGISTRY/$CI_PROJECT_NAME:$CI_COMMIT_SHORT_SHA
-     - cosign attest --yes --predicate sbom.spdx.json --type spdxjson $REGISTRY/$CI_PROJECT_NAME:$CI_COMMIT_SHORT_SHA
-
- # --- Deploy Stage ---
- deploy-staging:
-   stage: deploy
-   image: bitnami/kubectl:latest
-   needs: [sign, integration-test]
-   environment:
-     name: staging
-   script:
-     - kubectl set image deployment/$CI_PROJECT_NAME $CI_PROJECT_NAME=$REGISTRY/$CI_PROJECT_NAME:$CI_COMMIT_SHORT_SHA -n staging
-     - kubectl rollout status deployment/$CI_PROJECT_NAME -n staging --timeout=300s
-
- deploy-prod:
-   stage: deploy
-   image: bitnami/kubectl:latest
-   needs: [deploy-staging]
-   environment:
-     name: production
-   when: manual
-   script:
-     - kubectl set image deployment/$CI_PROJECT_NAME $CI_PROJECT_NAME=$REGISTRY/$CI_PROJECT_NAME:$CI_COMMIT_SHORT_SHA -n production
-     - kubectl rollout status deployment/$CI_PROJECT_NAME -n production --timeout=300s
ci-cd/jenkins/Jenkinsfile DELETED
@@ -1,136 +0,0 @@
- // =============================================================================
- // Jenkinsfile — Shared DevSecOps Pipeline
- // =============================================================================
-
- pipeline {
-     agent { label 'docker' }
-
-     environment {
-         REGISTRY = 'ecr.aws/devsecops'
-         IMAGE_NAME = "${env.JOB_NAME.split('/').last()}"
-         IMAGE_TAG = "${env.GIT_COMMIT.take(12)}"
-         TRIVY_SEVERITY = 'CRITICAL,HIGH'
-     }
-
-     stages {
-         // ----- Stage 1: SAST -----
-         stage('SAST') {
-             parallel {
-                 stage('Semgrep') {
-                     steps {
-                         sh 'semgrep --config auto --json --output semgrep.json .'
-                     }
-                 }
-                 stage('Secret Scan') {
-                     steps {
-                         sh 'trivy fs --scanners secret --exit-code 1 .'
-                     }
-                 }
-                 stage('IaC Scan') {
-                     steps {
-                         sh 'checkov -d terraform/ --output cli --soft-fail false'
-                     }
-                 }
-             }
-         }
-
-         // ----- Stage 2: Build -----
-         stage('Build') {
-             steps {
-                 script {
-                     docker.withRegistry("https://${REGISTRY}", 'ecr:us-east-1') {
-                         def app = docker.build(
-                             "${IMAGE_NAME}:${IMAGE_TAG}",
-                             '--build-arg BUILD_DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ) .'
-                         )
-                         app.push()
-                         app.push('latest')
-                     }
-                 }
-             }
-         }
-
-         // ----- Stage 3: Container Scan -----
-         stage('Security Scan') {
-             steps {
-                 sh """
-                     trivy image \
-                         --severity ${TRIVY_SEVERITY} \
-                         --exit-code 1 \
-                         --ignore-unfixed \
-                         ${REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG}
-                 """
-                 // Generate SBOM
-                 sh """
-                     syft ${REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG} \
-                         -o cyclonedx-json > sbom.cyclonedx.json
-                 """
-             }
-         }
-
-         // ----- Stage 4: Test -----
-         stage('Integration Test') {
-             steps {
-                 sh 'docker compose -f docker-compose.test.yml up --abort-on-container-exit'
-             }
-         }
-
-         // ----- Stage 5: Sign -----
-         stage('Sign & Attest') {
-             steps {
-                 sh """
-                     cosign sign --yes \
-                         ${REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG}
-                     cosign attest --yes \
-                         --predicate sbom.cyclonedx.json \
-                         --type cyclonedx \
-                         ${REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG}
-                 """
-             }
-         }
-
-         // ----- Stage 6: Deploy -----
-         stage('Deploy Staging') {
-             steps {
-                 sh """
-                     kubectl set image deployment/${IMAGE_NAME} \
-                         ${IMAGE_NAME}=${REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG} \
-                         -n staging
-                 """
-                 // Wait for rollout
-                 sh 'kubectl rollout status deployment/${IMAGE_NAME} -n staging --timeout=300s'
-             }
-         }
-
-         stage('Deploy Production') {
-             when {
-                 branch 'main'
-             }
-             input {
-                 message "Deploy ${IMAGE_NAME}:${IMAGE_TAG} to production?"
-             }
-             steps {
-                 sh """
-                     kubectl set image deployment/${IMAGE_NAME} \
-                         ${IMAGE_NAME}=${REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG} \
-                         -n production
-                 """
-                 sh 'kubectl rollout status deployment/${IMAGE_NAME} -n production --timeout=300s'
-             }
-         }
-     }
-
-     post {
-         always {
-             archiveArtifacts artifacts: 'semgrep.json, sbom.cyclonedx.json', allowEmptyArchive: true
-             recordIssues(tools: [semgrep(pattern: 'semgrep.json')])
-         }
-         failure {
-             slackSend(
-                 channel: '#platform-alerts',
-                 color: 'danger',
-                 message: "FAILED: ${env.JOB_NAME} #${env.BUILD_NUMBER}"
-             )
-         }
-     }
- }
compliance/{cis-benchmarks/cis-eks-k8s.yaml → cis-eks-k8s.yaml} RENAMED
File without changes
compliance/{soc2/controls-mapping.yaml → controls-mapping.yaml} RENAMED
File without changes
compliance/{nist/nist-800-53-mapping.yaml → nist-800-53-mapping.yaml} RENAMED
File without changes
{docker/base-images → deployment}/Dockerfile.ml-inference RENAMED
File without changes
{k8s/workloads/ml-pipeline → deployment}/deployment.yaml RENAMED
File without changes
{ai-ml/mlflow → deployment}/mlflow-deployment.yaml RENAMED
File without changes
docker/base-images/Dockerfile.backend DELETED
@@ -1,51 +0,0 @@
- # =============================================================================
- # Multi-Stage Hardened Dockerfile — Python Backend
- # =============================================================================
- # Security Features:
- #   - Multi-stage build (build → runtime)
- #   - Non-root user
- #   - Minimal base (distroless)
- #   - Pinned versions
- #   - No shell in runtime image
- #   - Health check
- # =============================================================================
-
- # --- Build Stage ---
- FROM python:3.12-slim AS builder
-
- WORKDIR /build
-
- # Pin pip and install dependencies
- COPY requirements.txt .
- RUN pip install --no-cache-dir --require-hashes -r requirements.txt
-
- # Copy application
- COPY src/ /build/src/
- COPY pyproject.toml /build/
-
- # Build wheel
- RUN pip wheel --no-cache-dir --no-deps -w /build/wheels .
-
- # --- Runtime Stage ---
- FROM gcr.io/distroless/python3-debian12:nonroot AS runtime
-
- # Copy wheels from builder
- COPY --from=builder /build/wheels /app/wheels/
- COPY --from=builder /build/src/ /app/src/
-
- # Set environment
- ENV PYTHONUNBUFFERED=1 \
-     PYTHONDONTWRITEBYTECODE=1 \
-     PATH="/app/.local/bin:${PATH}"
-
- WORKDIR /app
-
- # Run as non-root (distroless nonroot image UID 65532)
- USER 65532:65532
-
- EXPOSE 8080
-
- HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
-     CMD ["python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8080/healthz')"]
-
- ENTRYPOINT ["python", "-m", "uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080"]
docker/base-images/Dockerfile.frontend DELETED
@@ -1,33 +0,0 @@
- # =============================================================================
- # Multi-Stage Hardened Dockerfile — React Frontend
- # =============================================================================
-
- # --- Build Stage ---
- FROM node:20-alpine AS builder
-
- WORKDIR /app
-
- # Pin package versions with lockfile
- COPY package.json package-lock.json ./
- RUN npm ci --ignore-scripts
-
- COPY . .
- RUN npm run build
-
- # --- Runtime Stage ---
- FROM nginxinc/nginx-unprivileged:1.25-alpine AS runtime
-
- # Remove default nginx configs
- RUN rm -f /etc/nginx/conf.d/default.conf
-
- # Copy custom nginx config (security headers)
- COPY docker/nginx.conf /etc/nginx/conf.d/
- COPY --from=builder /app/dist /usr/share/nginx/html
-
- # Security headers are in nginx.conf
- EXPOSE 8080
-
- USER 101:101
-
- HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
-     CMD ["curl", "-f", "http://localhost:8080/healthz"]
finops/cost-optimization.yaml DELETED
@@ -1,73 +0,0 @@
- # =============================================================================
- # FinOps Engine — Cloud Cost Governance
- # =============================================================================
- # Addresses: cost waste, rightsizing, scheduling, unit economics
- # =============================================================================
-
- # --- Spot Instance Strategy ---
- # Use SPOT for ML training workloads (70-90% cost savings)
- # Use ON_DEMAND for production services (no interruption risk)
-
- apiVersion: apps/v1
- kind: Deployment
- metadata:
-   name: ml-training-spot
-   namespace: ml-pipeline
-   labels:
-     app: ml-training-spot
-     finops: spot-instance
- spec:
-   replicas: 0  # Scale up on demand via KEDA
-   selector:
-     matchLabels:
-       app: ml-training-spot
-   template:
-     metadata:
-       labels:
-         app: ml-training-spot
-         finops: spot-instance
-     spec:
-       containers:
-         - name: trainer
-           image: "ecr.aws/devsecops/ml-train:v1.0.0"
-           resources:
-             requests:
-               cpu: "4"
-               memory: 16Gi
-               nvidia.com/gpu: "1"
-             limits:
-               cpu: "8"
-               memory: 32Gi
-               nvidia.com/gpu: "1"
-       tolerations:
-         - key: nvidia.com/gpu
-           operator: Exists
-           effect: NoSchedule
-       nodeSelector:
-         workload: ml-spot
-       # Allow eviction for spot reclamation
-       terminationGracePeriodSeconds: 120
- ---
- # --- KEDA Scaler — Scale ML training on queue depth ---
- apiVersion: keda.sh/v1alpha1
- kind: ScaledJob
- metadata:
-   name: ml-training-scaler
-   namespace: ml-pipeline
- spec:
-   minReplicaCount: 0
-   maxReplicaCount: 4
-   pollingInterval: 30
-   triggers:
-     - type: aws-sqs
-       metadata:
-         queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/ml-training-queue
-         queueLength: "1"
-   jobTemplate:
-     spec:
-       template:
-         spec:
-           restartPolicy: Never
-           containers:
-             - name: trainer
-               image: "ecr.aws/devsecops/ml-train:v1.0.0"
finops/finops-cronjob.yaml DELETED
@@ -1,23 +0,0 @@
- # FinOps Daily Cost Scanner
- apiVersion: batch/v1
- kind: CronJob
- metadata:
-   name: finops-scanner
-   namespace: platform-system
- spec:
-   schedule: "0 6 * * 1-5"  # 6am weekdays
-   jobTemplate:
-     spec:
-       template:
-         spec:
-           serviceAccountName: finops-scanner
-           containers:
-             - name: scanner
-               image: "ecr.aws/devsecops/finops-scanner:latest"
-               command: ["python3", "finops_scanner.py"]
-               env:
-                 - name: AWS_REGION
-                   value: "us-east-1"
-           restartPolicy: Never
-   concurrencyPolicy: Forbid
-   successfulJobsHistoryLimit: 7
finops/finops-policy.yaml DELETED
@@ -1,73 +0,0 @@
- # =============================================================================
- # FinOps Policy — Cloud Waste Detection & Rightsizing Rules
- # =============================================================================
-
- policies:
-   # --- Unused Resources ---
-   - id: FINOPS-001
-     name: "Detect unused EBS volumes"
-     severity: medium
-     check: "aws ec2 describe-volumes --filters Name=status,Values=available"
-     action: "Create snapshot, delete volume after 7 days"
-     estimated_savings: "$50-200/month per volume"
-
-   - id: FINOPS-002
-     name: "Detect idle RDS instances"
-     severity: medium
-     check: "Connection count < 5 for 7 days"
-     action: "Downsize instance class or stop non-prod"
-     estimated_savings: "30-50% of RDS cost"
-
-   - id: FINOPS-003
-     name: "Detect unattached EIPs"
-     severity: low
-     check: "aws ec2 describe-addresses --filter Name=association-id,Values=''"
-     action: "Release EIP"
-     estimated_savings: "$3.60/month per EIP"
-
-   # --- Rightsizing ---
-   - id: FINOPS-010
-     name: "EC2 rightsizing recommendations"
-     severity: medium
-     check: "CPU < 15% for 14 days OR Memory < 25% for 14 days"
-     action: "Downsize to next tier (e.g., m6i.xlarge -> m6i.large)"
-     estimated_savings: "20-40% per instance"
-
-   - id: FINOPS-011
-     name: "Over-provisioned K8s workloads"
-     severity: medium
-     check: "Container CPU request > 2x actual P95 usage"
-     action: "Reduce requests to P95 + 20% headroom"
-     estimated_savings: "30-50% of cluster cost"
-
-   # --- Scheduling ---
-   - id: FINOPS-020
-     name: "Non-prod environment scheduling"
-     severity: high
-     check: "Dev/staging workloads running 24/7"
-     action: "Scale to 0 outside business hours (Mon-Fri 8am-8pm)"
-     estimated_savings: "65% of non-prod compute"
-
-   # --- Reserved Instances Coverage ---
-   - id: FINOPS-030
-     name: "RI coverage below 70%"
-     severity: high
-     check: "RI coverage < 70% for consistent workloads"
-     action: "Purchase RIs for EKS node groups + RDS"
-     estimated_savings: "30-40% vs on-demand"
-
-   # --- Storage Tiering ---
-   - id: FINOPS-040
-     name: "S3 intelligent tiering"
-     severity: medium
-     check: "S3 objects > 90 days in STANDARD"
-     action: "Enable S3 Intelligent-Tiering on all buckets"
-     estimated_savings: "40-60% on infrequent access data"
-
-   # --- GPU Utilization ---
-   - id: FINOPS-050
-     name: "Underutilized GPU nodes"
-     severity: critical
-     check: "GPU utilization < 30% for 4 hours"
-     action: "Scale down GPU node group or use KEDA for demand-based scaling"
-     estimated_savings: "$2-6/hour per GPU"
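
FINOPS-011's rule ("reduce requests to P95 + 20% headroom") is simple arithmetic. A minimal sketch of how a scanner like finops_scanner.py might apply it; the function and field names here are illustrative, not the repo's actual scanner API:

```python
# Hypothetical helper: rightsize a container CPU request from observed usage.
# Implements the FINOPS-011 rule above: flag when request > 2x P95 usage,
# and recommend P95 + 20% headroom.
def rightsize_cpu(request_millicores: float, p95_usage_millicores: float) -> dict:
    overprovisioned = request_millicores > 2 * p95_usage_millicores
    recommended = round(p95_usage_millicores * 1.2)  # P95 + 20% headroom
    return {
        "overprovisioned": overprovisioned,
        "recommended_request_m": recommended,
        "savings_m": max(0, request_millicores - recommended),
    }

# Example: a container requesting 1000m but using 300m at P95 is flagged,
# with a recommended request of 360m (saving 640m per replica).
print(rightsize_cpu(1000, 300))
```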
incident-response/auto-remediation/auto-remediate.sh DELETED
@@ -1,50 +0,0 @@
- #!/usr/bin/env bash
- # =============================================================================
- # Autonomous Incident Remediation Engine
- # =============================================================================
- # Triggered by Alertmanager webhook. Auto-remediates known patterns.
- # =============================================================================
-
- set -euo pipefail
-
- ALERT_NAME="${1:-unknown}"
- NAMESPACE="${2:-default}"
- POD_NAME="${3:-}"
-
- log() { echo "[$(date +%H:%M:%S)] [REMEDIATE] $*"; }
-
- case "${ALERT_NAME}" in
-   PodCrashLooping)
-     log "Remediating crash-looping pod: ${NAMESPACE}/${POD_NAME}"
-     # Check if OOM killed
-     OOM_COUNT=$(kubectl get pod "${POD_NAME}" -n "${NAMESPACE}" -o json | jq -r '.status.containerStatuses[0].lastState.terminated.reason // empty' | grep -c OOMKilled || true)
-     if [[ "${OOM_COUNT}" -gt 0 ]]; then
-       log "OOM detected - increasing memory limit"
-       kubectl patch deployment "${POD_NAME%-*}" -n "${NAMESPACE}" -p '{"spec":{"template":{"spec":{"containers":[{"name":"app","resources":{"limits":{"memory":"2Gi"}}}]}}}}'
-       log "Memory limit increased to 2Gi"
-     else
-       log "Non-OOM crash - restarting pod"
-       kubectl delete pod "${POD_NAME}" -n "${NAMESPACE}" --grace-period=30
-     fi
-     ;;
-
-   HighMemoryUsage)
-     log "Node memory pressure detected"
-     # Evict lowest-priority pods
-     kubectl get pods -A --sort-by=.spec.priority --field-selector=status.phase=Running | tail -5 | while read ns pod rest; do
-       log "Considering eviction: ${ns}/${pod}"
-     done
-     ;;
-
-   FalcoRuntimeAlert)
-     log "Runtime security alert - do NOT auto-remediate"
-     log "Escalate to security team: #security-alerts"
-     # Only notify - never auto-remediate security alerts
-     ;;
-
-   *)
-     log "Unknown alert pattern: ${ALERT_NAME}"
-     log "Manual investigation required"
-     exit 1
-     ;;
- esac
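
The script's header says it is triggered by an Alertmanager webhook. For context, a minimal stdlib-only sketch of such a receiver: it parses Alertmanager's standard webhook payload and shells out to auto-remediate.sh with the alertname, namespace, and pod labels. The port, script path, and label keys are assumptions; the repo's actual webhook glue is not shown in this diff.

```python
# Hypothetical Alertmanager webhook receiver that invokes auto-remediate.sh.
# Alertmanager POSTs JSON shaped like:
#   {"alerts": [{"labels": {"alertname": ..., "namespace": ..., "pod": ...}}]}
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

class RemediationHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        payload = json.loads(body)
        for alert in payload.get("alerts", []):
            labels = alert.get("labels", {})
            subprocess.run([
                "./auto-remediate.sh",           # path is an assumption
                labels.get("alertname", "unknown"),
                labels.get("namespace", "default"),
                labels.get("pod", ""),
            ], check=False)  # the script itself decides what is remediable
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 9099), RemediationHandler).serve_forever()
```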
platform/adr/template.md → infrastructure/adr-template.md RENAMED
File without changes
{ci-cd/github-actions → infrastructure/ci-cd}/devsecops-pipeline.yml RENAMED
File without changes
{finops → infrastructure}/finops_scanner.py RENAMED
File without changes
{scripts/bash → infrastructure}/incident-response.sh RENAMED
File without changes
{monitoring → infrastructure/monitoring}/alertmanager/alertmanager-config.yaml RENAMED
File without changes
monitoring/grafana/dashboards/platform-overview.json → infrastructure/monitoring/grafana-platform-overview.json RENAMED
File without changes
{monitoring → infrastructure/monitoring}/otel/otel-collector.yaml RENAMED
File without changes
{monitoring → infrastructure/monitoring}/prometheus/alerts.yaml RENAMED
File without changes
incident-response/postmortem/template.md → infrastructure/postmortem-template.md RENAMED
File without changes
{terraform → infrastructure/terraform}/environments/prod/main.tf RENAMED
File without changes
{terraform → infrastructure/terraform}/modules/eks/main.tf RENAMED
File without changes
{terraform → infrastructure/terraform}/modules/eks/outputs.tf RENAMED
File without changes
{terraform → infrastructure/terraform}/modules/eks/variables.tf RENAMED
File without changes
{terraform → infrastructure/terraform}/modules/guardduty/main.tf RENAMED
File without changes
{terraform → infrastructure/terraform}/modules/guardduty/variables.tf RENAMED
File without changes
{terraform → infrastructure/terraform}/modules/iam/main.tf RENAMED
File without changes
{terraform → infrastructure/terraform}/modules/kms/main.tf RENAMED
File without changes
{terraform → infrastructure/terraform}/modules/macie/main.tf RENAMED
File without changes
{terraform → infrastructure/terraform}/modules/rds/main.tf RENAMED
File without changes
{terraform → infrastructure/terraform}/modules/rds/variables.tf RENAMED
File without changes
{terraform → infrastructure/terraform}/modules/s3/main.tf RENAMED
File without changes
{terraform → infrastructure/terraform}/modules/s3/variables.tf RENAMED
File without changes
{terraform → infrastructure/terraform}/modules/vpc/main.tf RENAMED
File without changes
{terraform → infrastructure/terraform}/modules/vpc/outputs.tf RENAMED
File without changes
{terraform → infrastructure/terraform}/modules/vpc/variables.tf RENAMED
File without changes
k8s/base/limit-ranges/limit-ranges.yaml DELETED
@@ -1,74 +0,0 @@
- # =============================================================================
- # Limit Ranges — Default Resource Requests/Limits Per Container
- # =============================================================================
-
- apiVersion: v1
- kind: LimitRange
- metadata:
-   name: default-limits
-   namespace: frontend
- spec:
-   limits:
-     - type: Container
-       default:
-         cpu: 500m
-         memory: 256Mi
-       defaultRequest:
-         cpu: 100m
-         memory: 128Mi
-       max:
-         cpu: "2"
-         memory: 2Gi
-       min:
-         cpu: 50m
-         memory: 64Mi
-       maxLimitRequestRatio:
-         cpu: "4"
-         memory: "4"
- ---
- apiVersion: v1
- kind: LimitRange
- metadata:
-   name: default-limits
-   namespace: backend
- spec:
-   limits:
-     - type: Container
-       default:
-         cpu: "1"
-         memory: 512Mi
-       defaultRequest:
-         cpu: 200m
-         memory: 256Mi
-       max:
-         cpu: "4"
-         memory: 4Gi
-       min:
-         cpu: 100m
-         memory: 128Mi
-       maxLimitRequestRatio:
-         cpu: "4"
-         memory: "4"
- ---
- apiVersion: v1
- kind: LimitRange
- metadata:
-   name: default-limits
-   namespace: ml-pipeline
- spec:
-   limits:
-     - type: Container
-       default:
-         cpu: "2"
-         memory: 4Gi
-         nvidia.com/gpu: "1"
-       defaultRequest:
-         cpu: 500m
-         memory: 1Gi
-       max:
-         cpu: "8"
-         memory: 16Gi
-         nvidia.com/gpu: "2"
-       min:
-         cpu: 200m
-         memory: 512Mi
k8s/base/namespaces/namespaces.yaml DELETED
@@ -1,69 +0,0 @@
- # =============================================================================
- # Namespace Definitions — Security-First Multi-Tenant Layout
- # =============================================================================
- # Each namespace gets:
- #   - Labels for network policy targeting
- #   - Resource quotas
- #   - Limit ranges
- #   - Pod security standards via labels (Kyverno enforces)
-
- apiVersion: v1
- kind: Namespace
- metadata:
-   name: platform-system
-   labels:
-     pod-security.kubernetes.io/enforce: "privileged"
-     pod-security.kubernetes.io/audit: "privileged"
-     pod-security.kubernetes.io/warn: "privileged"
-     platform: "true"
- ---
- apiVersion: v1
- kind: Namespace
- metadata:
-   name: monitoring
-   labels:
-     pod-security.kubernetes.io/enforce: "restricted"
-     pod-security.kubernetes.io/audit: "restricted"
-     pod-security.kubernetes.io/warn: "restricted"
-     platform: "true"
- ---
- apiVersion: v1
- kind: Namespace
- metadata:
-   name: security
-   labels:
-     pod-security.kubernetes.io/enforce: "restricted"
-     pod-security.kubernetes.io/audit: "restricted"
-     pod-security.kubernetes.io/warn: "restricted"
-     platform: "true"
- ---
- apiVersion: v1
- kind: Namespace
- metadata:
-   name: frontend
-   labels:
-     pod-security.kubernetes.io/enforce: "restricted"
-     pod-security.kubernetes.io/audit: "restricted"
-     pod-security.kubernetes.io/warn: "restricted"
-     app-team: "frontend"
- ---
- apiVersion: v1
- kind: Namespace
- metadata:
-   name: backend
-   labels:
-     pod-security.kubernetes.io/enforce: "restricted"
-     pod-security.kubernetes.io/audit: "restricted"
-     pod-security.kubernetes.io/warn: "restricted"
-     app-team: "backend"
- ---
- apiVersion: v1
- kind: Namespace
- metadata:
-   name: ml-pipeline
-   labels:
-     pod-security.kubernetes.io/enforce: "baseline"
-     pod-security.kubernetes.io/audit: "restricted"
-     pod-security.kubernetes.io/warn: "restricted"
-     app-team: "ml"
-     nvidia.com/gpu: "true"
k8s/base/network-policies/network-policies.yaml DELETED
@@ -1,124 +0,0 @@
- # =============================================================================
- # Network Policies — Zero Trust Default Deny + Selective Allow
- # =============================================================================
- # Strategy: Default deny all ingress/egress, then allow only known traffic
-
- # --- Default Deny All Ingress in Every Namespace ---
- apiVersion: networking.k8s.io/v1
- kind: NetworkPolicy
- metadata:
-   name: default-deny-ingress
-   namespace: frontend
- spec:
-   podSelector: {}  # Matches all pods
-   policyTypes:
-     - Ingress
- ---
- apiVersion: networking.k8s.io/v1
- kind: NetworkPolicy
- metadata:
-   name: default-deny-ingress
-   namespace: backend
- spec:
-   podSelector: {}
-   policyTypes:
-     - Ingress
- ---
- apiVersion: networking.k8s.io/v1
- kind: NetworkPolicy
- metadata:
-   name: default-deny-ingress
-   namespace: ml-pipeline
- spec:
-   podSelector: {}
-   policyTypes:
-     - Ingress
- ---
- # --- Frontend: Allow ingress from Istio ingress gateway only ---
- apiVersion: networking.k8s.io/v1
- kind: NetworkPolicy
- metadata:
-   name: allow-istio-ingress
-   namespace: frontend
- spec:
-   podSelector:
-     matchLabels:
-       app: frontend
-   policyTypes:
-     - Ingress
-   ingress:
-     - from:
-         - namespaceSelector:
-             matchLabels:
-               name: istio-system
-         - podSelector:
-             matchLabels:
-               istio: ingressgateway
-       ports:
-         - port: 8080
-           protocol: TCP
- ---
- # --- Backend: Allow ingress from frontend namespace only ---
- apiVersion: networking.k8s.io/v1
- kind: NetworkPolicy
- metadata:
-   name: allow-from-frontend
-   namespace: backend
- spec:
-   podSelector:
-     matchLabels:
-       app: backend
-   policyTypes:
-     - Ingress
-     - Egress
-   ingress:
-     - from:
-         - namespaceSelector:
-             matchLabels:
-               app-team: frontend
-       ports:
-         - port: 8080
-           protocol: TCP
-   egress:
-     # Allow DNS
-     - to: []
-       ports:
-         - port: 53
-           protocol: UDP
-         - port: 53
-           protocol: TCP
-     # Allow RDS
-     - to: []
-       ports:
-         - port: 5432
-           protocol: TCP
- ---
- # --- ML Pipeline: Allow from backend + Istio ---
- apiVersion: networking.k8s.io/v1
- kind: NetworkPolicy
- metadata:
-   name: allow-ml-traffic
-   namespace: ml-pipeline
- spec:
-   podSelector: {}
-   policyTypes:
-     - Ingress
-     - Egress
-   ingress:
-     - from:
-         - namespaceSelector:
-             matchLabels:
-               app-team: backend
-     - from:
-         - namespaceSelector:
-             matchLabels:
-               name: istio-system
-   egress:
-     - to: []
-       ports:
-         - port: 53
-           protocol: UDP
-     - to: []
-       ports:
-         - port: 443
-           protocol: TCP  # HuggingFace Hub, S3, etc.
k8s/base/pdbs/pdbs.yaml DELETED
@@ -1,62 +0,0 @@
- # =============================================================================
- # Pod Disruption Budgets — Availability Guarantees Per Workload
- # =============================================================================
- # PDBs prevent voluntary evictions (upgrades, drain) from taking down too many pods.
- # Without PDBs: kubectl drain or cluster-autoscaler can cause unplanned outages.
- # =============================================================================
-
- apiVersion: policy/v1
- kind: PodDisruptionBudget
- metadata:
-   name: frontend-pdb
-   namespace: frontend
- spec:
-   minAvailable: 2  # At least 2 pods always running (3 replicas total)
-   selector:
-     matchLabels:
-       app: frontend
- ---
- apiVersion: policy/v1
- kind: PodDisruptionBudget
- metadata:
-   name: backend-pdb
-   namespace: backend
- spec:
-   minAvailable: 2
-   selector:
-     matchLabels:
-       app: backend
- ---
- apiVersion: policy/v1
- kind: PodDisruptionBudget
- metadata:
-   name: ml-inference-pdb
-   namespace: ml-pipeline
- spec:
-   maxUnavailable: 1  # At most 1 pod disrupted at a time
-   selector:
-     matchLabels:
-       app: ml-inference
- ---
- # Platform services — always keep 1 available
- apiVersion: policy/v1
- kind: PodDisruptionBudget
- metadata:
-   name: prometheus-pdb
-   namespace: monitoring
- spec:
-   minAvailable: 1
-   selector:
-     matchLabels:
-       app: kube-prometheus-stack-prometheus
- ---
- apiVersion: policy/v1
- kind: PodDisruptionBudget
- metadata:
-   name: argocd-pdb
-   namespace: platform-system
- spec:
-   minAvailable: 1
-   selector:
-     matchLabels:
-       app.kubernetes.io/name: argocd
k8s/base/rbac/rbac.yaml DELETED
@@ -1,78 +0,0 @@
- # =============================================================================
- # RBAC — Least-Privilege Access Control
- # =============================================================================
-
- # Platform Admins — Full cluster access
- apiVersion: rbac.authorization.k8s.io/v1
- kind: ClusterRole
- metadata:
-   name: platform-admin
- rules:
-   - apiGroups: ["*"]
-     resources: ["*"]
-     verbs: ["*"]
-   # Exclude secrets CRUD for audit trail — use ExternalSecrets instead
-   - apiGroups: [""]
-     resources: ["secrets"]
-     verbs: ["get", "list", "watch"]  # No create/update/delete
- ---
- apiVersion: rbac.authorization.k8s.io/v1
- kind: ClusterRoleBinding
- metadata:
-   name: platform-admin
- roleRef:
-   apiGroup: rbac.authorization.k8s.io
-   kind: ClusterRole
-   name: platform-admin
- subjects:
-   - kind: Group
-     name: platform-admins
-     apiGroup: rbac.authorization.k8s.io
- ---
- # Developer — Read + Pod Exec + Logs within their namespaces
- apiVersion: rbac.authorization.k8s.io/v1
- kind: ClusterRole
- metadata:
-   name: developer
- rules:
-   - apiGroups: ["", "apps", "batch", "extensions"]
-     resources: ["pods", "pods/log", "pods/exec", "deployments", "statefulsets", "jobs", "cronjobs"]
-     verbs: ["get", "list", "watch"]
-   - apiGroups: [""]
-     resources: ["pods/exec"]
-     verbs: ["create"]
-   - apiGroups: ["", "apps"]
-     resources: ["deployments", "statefulsets"]
-     verbs: ["patch"]  # For restart rollout only
-   - apiGroups: ["metrics.k8s.io"]
-     resources: ["pods", "nodes"]
-     verbs: ["get", "list"]
- ---
- # Viewer — Read-only cluster-wide
- apiVersion: rbac.authorization.k8s.io/v1
- kind: ClusterRole
- metadata:
-   name: viewer
- rules:
-   - apiGroups: ["", "apps", "batch", "extensions", "networking.k8s.io"]
-     resources: ["*"]
-     verbs: ["get", "list", "watch"]
-   - nonResourceURLs: ["*"]
-     verbs: ["get"]
- ---
- # ML Engineer — Access to ml-pipeline namespace only
- apiVersion: rbac.authorization.k8s.io/v1
- kind: Role
- metadata:
-   name: ml-engineer
-   namespace: ml-pipeline
- rules:
-   - apiGroups: ["", "apps", "batch", "kubeflow.org", "serving.kubeflow.org"]
-     resources: ["pods", "pods/log", "pods/exec", "deployments", "jobs", "notebooks", "inferenceservices"]
-     verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
-   - apiGroups: [""]
-     resources: ["secrets"]
-     verbs: ["get", "list"]  # No create/update
-   - apiGroups: [""]
-     resources: ["configmaps"]
-     verbs: ["get", "list", "create", "update"]
k8s/base/resource-quotas/resource-quotas.yaml DELETED
@@ -1,50 +0,0 @@
- # =============================================================================
- # Resource Quotas — Prevent Resource Exhaustion Per Namespace
- # =============================================================================
-
- apiVersion: v1
- kind: ResourceQuota
- metadata:
-   name: frontend-quota
-   namespace: frontend
- spec:
-   hard:
-     requests.cpu: "4"
-     requests.memory: 8Gi
-     limits.cpu: "8"
-     limits.memory: 16Gi
-     pods: "20"
-     services: "5"
-     persistentvolumeclaims: "10"
-     requests.nvidia.com/gpu: "0"  # No GPUs for frontend
- ---
- apiVersion: v1
- kind: ResourceQuota
- metadata:
-   name: backend-quota
-   namespace: backend
- spec:
-   hard:
-     requests.cpu: "8"
-     requests.memory: 16Gi
-     limits.cpu: "16"
-     limits.memory: 32Gi
-     pods: "30"
-     services: "10"
-     persistentvolumeclaims: "20"
- ---
- apiVersion: v1
- kind: ResourceQuota
- metadata:
-   name: ml-quota
-   namespace: ml-pipeline
- spec:
-   hard:
-     requests.cpu: "16"
-     requests.memory: 64Gi
-     limits.cpu: "32"
-     limits.memory: 128Gi
-     pods: "15"
-     services: "5"
-     persistentvolumeclaims: "30"
-     requests.nvidia.com/gpu: "4"
k8s/base/slos/slos.yaml DELETED
@@ -1,68 +0,0 @@
- # =============================================================================
- # Service Level Objectives — Platform SLOs
- # =============================================================================
- # SLOs define reliability targets. Error budgets = 100% - SLO.
- # Burn rate alerts fire when error budget is consumed too fast.
- # =============================================================================
-
- # --- API Availability SLO: 99.95% (21.9 min/month error budget) ---
- apiVersion: monitoring.coreos.com/v1
- kind: PrometheusRule
- metadata:
-   name: slo-api-availability
-   namespace: monitoring
-   labels:
-     release: kube-prometheus-stack
-     slo: "true"
- spec:
-   groups:
-     - name: slo.api.availability
-       rules:
-         # SLO metric: 5m success rate
-         - record: slo:api_availability:rate5m
-           expr: |
-             sum(rate(http_requests_total{code!~"5.."}[5m]))
-             /
-             sum(rate(http_requests_total[5m]))
-
-         # 1h error budget burn rate (14.4x = consume 30d budget in 2d)
-         - alert: SLOAPIAvailabilityBurnRateHigh
-           expr: |
-             (
-               (1 - slo:api_availability:rate5m) > (14.4 * 0.001)
-             )
-           for: 5m
-           labels:
-             severity: critical
-             slo: api-availability
-           annotations:
-             summary: "API availability SLO budget burning too fast"
-             runbook: "https://runbook.platform.internal/slo-api-burn"
-
-     - name: slo.api.latency
-       rules:
-         # Latency SLO: P99 < 2s, 99.9% of requests
-         - record: slo:api_latency_p99:rate5m
-           expr: |
-             histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
-
-         - alert: SLOAPILatencyBurnRateHigh
-           expr: |
-             slo:api_latency_p99:rate5m > 2
-           for: 10m
-           labels:
-             severity: warning
-             slo: api-latency
-           annotations:
-             summary: "API P99 latency exceeding 2s SLO"
-
-     - name: slo.error_budget
-       rules:
-         # Remaining error budget (percentage)
-         - record: slo:error_budget_remaining:ratio
-           expr: |
-             1 - (
-               (1 - slo:api_availability:rate5m)
-               /
-               0.0005
-             )
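
The 14.4x figure in the burn-rate alert above follows the standard burn-rate arithmetic: over a 30-day window, a burn rate B exhausts the entire error budget in 30/B days, so 14.4x burns it in about 2.08 days. A small sketch of that arithmetic, with values taken from the deleted rule (the helper itself is illustrative):

```python
# Error-budget burn-rate arithmetic behind SLOAPIAvailabilityBurnRateHigh.
# budget = 1 - SLO; a burn rate B consumes a 30-day budget in 30/B days.
def days_to_budget_exhaustion(burn_rate: float, window_days: float = 30.0) -> float:
    return window_days / burn_rate

slo = 0.9995                # 99.95% availability target
error_budget = 1 - slo      # 0.0005 of all requests may fail

# ~21.6 minutes per 30-day month; the file's "21.9 min/month" uses an
# average-length month of ~30.44 days.
print(error_budget * 30 * 24 * 60)
print(days_to_budget_exhaustion(14.4))   # ~2.08 days at 14.4x burn
```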
k8s/kustomize/base/kustomization.yaml DELETED
@@ -1,18 +0,0 @@
- apiVersion: kustomize.config.k8s.io/v1beta1
- kind: Kustomization
- resources:
-   - ../../base/namespaces/
-   - ../../base/rbac/
-   - ../../base/network-policies/
-   - ../../base/resource-quotas/
-   - ../../base/limit-ranges/
-   - ../../base/pdbs/
-   - ../../base/slos/
-   - ../../manifests/cert-manager/
-   - ../../manifests/external-secrets/
-   - ../../manifests/istio/
-   - ../../manifests/argo-cd/
-   - ../../manifests/trivy-operator/
-   - ../../manifests/falco/
-   - ../../manifests/kyverno/
-   - ../../manifests/prometheus-stack/