diff --git a/README.md b/README.md index 01ffc91c4806ad261d2cc12e1f9603517efcd13b..a486b726febdcacf9530ec479529347e6303131c 100644 --- a/README.md +++ b/README.md @@ -1,144 +1,75 @@ -# DevSecOps Platform OMEGA — Enterprise AI Operating System +--- +license: apache-2.0 +tags: +- devsecops +- llm +- sft +- lora +- tulu-3 +- kubernetes +- terraform +--- -> Production-grade, security-first, automation-first platform covering the full DevOps, Cloud, Kubernetes, Security, AI/ML, FinOps, and Governance lifecycle. +# DevSecOps Model Platform -**156 files | 182KB | 13 domains | All production-ready** +> Train a secure model on the best data, then deploy it securely. -## Architecture +## Start Here: Train Your Model -``` - ENGINEERING COMMAND CENTER - | - +------------------+------------------+ - | | | | | - RELIABILITY SECURITY FINOPS PLATFORM AI/ML - (SLO/PDB) (GuardDuty) (Cost) (Golden (RAG/SFT) - | | | Path) | - +---------+--------+--------+---------+--+ - | | - KUBERNETES TERRAFORM - (Kustomize) (IaC Modules) - | | - AWS CLOUD INFRASTRUCTURE -``` - -## OMEGA 10-Dimension Scorecard - -| # | Dimension | Score | Assets | -|---|-----------|-------|--------| -| 1 | **Reliability** | 8/10 | PDBs, SLOs, HPA, multi-AZ, Istio | -| 2 | **Security** | 9/10 | GuardDuty, Macie, Falco, Kyverno, Trivy, mTLS | -| 3 | **Dev Velocity** | 7/10 | Golden paths, self-service envs, Kustomize | -| 4 | **Cost Efficiency** | 7/10 | FinOps scanner, spot instances, scheduling policy | -| 5 | **Governance** | 8/10 | SOC2, NIST 800-53, CIS, OPA, ADR template | -| 6 | **Automation** | 7/10 | Bootstrap, auto-remediation, GitOps (ArgoCD) | -| 7 | **Incident Recovery** | 8/10 | Runbook, postmortem template, war-room | -| 8 | **Standardization** | 8/10 | Kustomize overlays, golden path templates | -| 9 | **AI Enablement** | 8/10 | RAG, LoRA v2, MLflow, Trackio, GPU scheduling | -| 10 | **Engineering Excellence** | 7/10 | ADR template, checklists, SRE standards | - -## Platform Modules - -### Infrastructure (Terraform) -| Module | Purpose | Key Feature | -|--------|---------|-------------| -| VPC | Network isolation | Flow logs, default deny SG/NACL | -| EKS | Kubernetes cluster | Private API, KMS encryption, IRSA | -| RDS | Database | Multi-AZ, encrypted, performance insights | -| S3 | Storage | SSE-KMS, versioning, lifecycle | -| IAM | Access control | MFA, least privilege, access analyzer | -| KMS | Key management | Auto-rotation, multi-key | -| GuardDuty | Threat detection | EBS malware scan, K8s audit, S3 | -| Macie | PII detection | Automated data classification | - -### Kubernetes -| Layer | Components | -|-------|-----------| -| **Base** | Namespaces, RBAC, NetPols, Quotas, Limits, PDBs, SLOs | -| **Platform** | ArgoCD, Istio (mTLS), ExternalSecrets, CertManager | -| **Security** | Trivy Operator, Falco (eBPF), Kyverno (7 policies), OPA | -| **Observability** | Prometheus, Grafana, Loki, Alertmanager, OTEL | -| **Workloads** | Frontend, Backend (HPA), ML Pipeline (GPU) | - -### FinOps Engine -| Asset | Purpose | -|-------|---------| -| finops-policy.yaml | 11 cost optimization rules | -| finops_scanner.py | Automated waste detection | -| cost-optimization.yaml | Spot instance strategy + KEDA | -| finops-cronjob.yaml | Daily cost scan CronJob | - -### Platform Engineering -| Asset | Purpose | -|-------|---------| -| golden-paths/microservice/ | Production-ready service template + checklist | -| self-service/ | Ephemeral environment provisioning config | -| adr/template.md | Architecture Decision Record template | -| 
kustomize/ | Base + dev/staging/prod overlays | - -### Incident Response -| Asset | Purpose | -|-------|---------| -| auto-remediate.sh | OOM fix, pod restart, security escalation | -| postmortem/template.md | Full postmortem with 5 Whys + action items | -| incident-response.sh | Diagnostic runbook (5 incident types) | - -### AI/ML Hub -| Asset | Purpose | -|-------|---------| -| finetune.py | LoRA Without Regret (r=256, all-linear) | -| run_finetune.py | CLI entry point with dataset selection | -| TRAINING_RECIPE.md | v1→v2 upgrade documentation | -| rag_pipeline.py | LangChain + HF + ChromaDB RAG | -| mlflow/ | MLflow tracking deployment | - -### Compliance -| Framework | Coverage | -|-----------|---------| -| SOC2 Type II | CC6-CC9 controls mapped | -| NIST 800-53 | 12 controls mapped | -| CIS Benchmarks | EKS + K8s automated | -| OPA Gatekeeper | Admission policies | - -### CI/CD Pipelines -| System | Features | -|--------|----------| -| GitHub Actions | 6-stage DevSecOps (SAST→Build→Scan→Test→Sign→Deploy) | -| Jenkins | Parallel SAST + production deployment | -| GitLab CI | Full scan + sign + deploy pipeline | - -## Quick Start - -```bash -# Bootstrap full platform -./scripts/bash/bootstrap.sh prod +| Dataset | Size | What It Gives You | Command | +|---------|------|-------------------|---------| +| **tulu-3-sft-mixture** | 940K | Math, code, safety, chat (BEST) | python model/train_tulu3.py | +| **OpenThoughts-114k** | 114K | Reasoning, chain-of-thought | python model/train_openthoughts.py | -# Security audit -python3 scripts/python/security_audit.py +**allenai/tulu-3-sft-mixture** is from Allen AI Tulu 3 - current SOTA open instruction-tuned model. Proven on Llama-3.1-8B: MMLU 53.5, GSM8K 79.9, HumanEval 76.8. -# FinOps cost scan -python3 finops/finops_scanner.py +LoRA config from LoRA Without Regret (Schulman 2025): r=256, alpha=16, all-linear = matches full fine-tuning at 67% compute. -# Incident response -./scripts/bash/incident-response.sh security +## Repository Structure -# Auto-remediate -./incident-response/auto-remediation/auto-remediate.sh PodCrashLooping backend +``` +model/ THE MODEL - train, serve, enhance + train_tulu3.py Primary: 940K best data (zero preprocessing) + train_openthoughts.py Reasoning: 114K CoT traces + finetune_configurable.py Multi-dataset configurable trainer + rag_pipeline.py RAG for DevSecOps knowledge + DATASETS.md Why these datasets, proven recipes + +deployment/ SERVE IT - Kubernetes + Docker + vLLM + deployment.yaml ML inference K8s manifest + mlflow-deployment.yaml Experiment tracking + Dockerfile.ml-inference Hardened multi-stage image + +security/ PROTECT IT - scanning + policies + scanning/ Trivy, Semgrep, Checkov, SBOM + policies/ Kyverno, OPA Gatekeeper + +infrastructure/ RUN IT - Terraform + monitoring + CI/CD + terraform/ VPC, EKS, RDS, S3, IAM, KMS, GuardDuty, Macie + monitoring/ Prometheus, Alertmanager, OTEL, Grafana + ci-cd/ GitHub Actions DevSecOps pipeline + +compliance/ CERTIFY IT - SOC2, NIST, CIS + controls-mapping.yaml SOC2 Type II + nist-800-53-mapping.yaml NIST 800-53 Rev5 + cis-eks-k8s.yaml CIS Benchmarks ``` -## Self-Improvement Checklist +## Quick Commands -After every deployment, ask: +```bash +# Train on best data (A100, ~6h) +python model/train_tulu3.py -- [ ] Can this be automated? -- [ ] Can this be templated? -- [ ] Can this be secured further? -- [ ] Can this be cheaper? -- [ ] Can this scale better? -- [ ] Can this reduce human toil? 
+# Quick test (any GPU) +python model/train_tulu3.py --max_steps 100 --no_push -If yes, enhance and push. +# Security scan +python security/scanning/security_audit.py -## Hub +# Deploy model to K8s +kubectl apply -f deployment/deployment.yaml -**[huggingface.co/shaikhsalman/devsecops-platform](https://huggingface.co/shaikhsalman/devsecops-platform)** +# Infrastructure (Terraform) +cd infrastructure/terraform/environments/prod && terraform apply +``` diff --git a/ai-ml/hf-finetuning/TRAINING_RECIPE.md b/ai-ml/hf-finetuning/TRAINING_RECIPE.md deleted file mode 100644 index ef5c8fa4b6d4e3a20dc5c478b35eb99c94dc0e27..0000000000000000000000000000000000000000 --- a/ai-ml/hf-finetuning/TRAINING_RECIPE.md +++ /dev/null @@ -1,58 +0,0 @@ -# Model Enhancement — Dataset & Training Recipe vNext - -## What Changed (v1 → v2) - -| Parameter | v1 (Old) | v2 (LoRA Without Regret) | Why | -|-----------|----------|--------------------------|-----| -| **Dataset** | ultrachat_200k (5K subset) | **tulu-3-sft-mixture** (940K) | 19 curated sources > single source | -| **LoRA r** | 16 | **256** | SFT-scale datasets need r=256 to match full FT | -| **LoRA alpha** | 32 | **16** | Stable scaling with high rank | -| **Target modules** | q/k/v/o_proj only | **all-linear** | Attention-only underperforms even at higher rank | -| **Effective batch** | 32 | **16** | LoRA less tolerant of large batches | -| **Learning rate** | 2e-4 | **2e-4** (same) | 10x full FT rate — correct in v1 | -| **Packing** | False | **True (bfd_split)** | Preserves all tokens, 2-3x throughput | -| **assistant_only_loss** | False | **True** | Loss only on assistant tokens | -| **EOS token** | Not set | **<\|eot_id\|>** | Llama 3.1 chat template | -| **LR scheduler** | linear | **cosine** | Better convergence for LoRA | -| **Epochs** | 3 | **1** | 940K examples = 1 epoch sufficient | - -## Dataset Comparison - -| Dataset | Size | Format | Best For | Quality | -|---------|------|--------|----------|---------| -| **tulu-3-sft-mixture** | 940K | messages ✅ | General SFT (code, math, IF, safety, science) | ⭐⭐⭐⭐⭐ | -| **OpenThoughts-114k** | 114K | conversations (needs conversion) | Reasoning, CoT traces | ⭐⭐⭐⭐ | -| ultrachat_200k | 200K | messages ✅ | Multi-turn chat baseline | ⭐⭐⭐ | - -## Key Research: "LoRA Without Regret" (Schulman et al., 2025) - -Four findings that change how we fine-tune: - -1. **Target ALL linear layers** — not just attention. Increasing rank does NOT compensate for skipping layers. -2. **Use r=256 for SFT** — sufficient capacity for post-training scale datasets. -3. **Use 10x higher LR** (2e-4 vs 2e-5 for full FT) — 1/r scaling makes optimal LR rank-independent. -4. **Keep batch size < 32** — LoRA is less tolerant of large batches. Cannot be mitigated by increasing rank. 
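
Taken together, those four findings map almost one-to-one onto the v2 hyperparameters above. A minimal sketch of that configuration with `peft` + TRL follows, assuming recent library versions; the repo's `train_tulu3.py` / `finetune.py` remain the canonical entry points, and the model/output names here are illustrative only:

```python
# Sketch of the v2 "LoRA Without Regret" settings with peft + TRL (illustrative).
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# 940K messages-format examples, usable without preprocessing
dataset = load_dataset("allenai/tulu-3-sft-mixture", split="train")

peft_config = LoraConfig(
    r=256,                        # SFT-scale rank
    lora_alpha=16,                # stable scaling at high rank
    target_modules="all-linear",  # every linear layer, not just attention
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="llama31-8b-tulu3-lora",   # illustrative name
    learning_rate=2e-4,                   # ~10x the full fine-tuning LR
    lr_scheduler_type="cosine",
    num_train_epochs=1,                   # 940K examples -> one epoch
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,        # effective batch 16
    packing=True,                         # the recipe also enables assistant-only loss
    bf16=True,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    train_dataset=dataset,
    args=training_args,
    peft_config=peft_config,
)
trainer.train()
```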
- -## Recommended Training Matrix - -### SFT (Supervised Fine-Tuning) - -| Model | Dataset | Hardware | Time | Cost | -|-------|---------|----------|------|------| -| Llama-3.1-8B-Instruct | tulu-3-sft (940K) | A100 (80GB) | ~6h | ~$24 | -| Llama-3.1-8B-Instruct | OpenThoughts-114k | A100 (80GB) | ~2h | ~$8 | -| Llama-3.1-8B-Instruct | tulu-3-sft (940K) | A10G (24GB) + QLoRA | ~12h | ~$24 | - -### GRPO (Reinforcement Learning) - -| Model | Dataset | LoRA r | Hardware | -|-------|---------|--------|----------| -| Qwen3-0.6B | OpenR1-Math-220k | 1 | A100 | -| Llama-3.1-8B-Base | GSM8k | 1-32 | A100 | - -## Source Attribution - -- LoRA Without Regret: Schulman et al., 2025, Thinking Machines Lab -- tulu-3-sft-mixture: Allen AI, used by Tulu 3 (SOTA open instruction-tuned) -- OpenThoughts-114k: Open community, reasoning-heavy CoT data -- LoRA Land: Predibase 2024, 224/310 LoRA models surpassed GPT-4 diff --git a/ai-ml/hf-finetuning/run_finetune.py b/ai-ml/hf-finetuning/run_finetune.py deleted file mode 100644 index eb681d0dcd5049e6cc4e18384aaf7815933b8999..0000000000000000000000000000000000000000 --- a/ai-ml/hf-finetuning/run_finetune.py +++ /dev/null @@ -1,67 +0,0 @@ -# ============================================================================= -# SFT Fine-Tuning — CLI Entry Point (LoRA Without Regret config) -# ============================================================================= -# Usage: -# # Default: tulu-3-sft + Llama-3.1-8B -# python run_finetune.py -# -# # OpenThoughts reasoning dataset -# python run_finetune.py --dataset_key openthoughts-114k -# -# # Ultrachat fallback -# python run_finetune.py --dataset_key ultrachat-200k -# -# # Custom hub model ID -# python run_finetune.py --hub_model_id my-org/my-model-v2 -# ============================================================================= - -import argparse -import sys -from finetune import FinetuneConfig, finetune, DATASET_REGISTRY - - -def main(): - parser = argparse.ArgumentParser(description="SFT Fine-Tuning (LoRA Without Regret)") - parser.add_argument("--dataset_key", default="tulu-3-sft", - choices=list(DATASET_REGISTRY.keys()), - help="Dataset to train on") - parser.add_argument("--hub_model_id", default=None, - help="HuggingFace Hub model ID for push") - parser.add_argument("--num_train_epochs", type=int, default=None) - parser.add_argument("--learning_rate", type=float, default=None) - parser.add_argument("--lora_r", type=int, default=None) - parser.add_argument("--per_device_train_batch_size", type=int, default=None) - parser.add_argument("--max_seq_length", type=int, default=None) - - args = parser.parse_args() - - config = FinetuneConfig() - if args.dataset_key: - config.dataset_key = args.dataset_key - if args.hub_model_id: - config.hub_model_id = args.hub_model_id - if args.num_train_epochs: - config.num_train_epochs = args.num_train_epochs - if args.learning_rate: - config.learning_rate = args.learning_rate - if args.lora_r: - config.lora_r = args.lora_r - if args.per_device_train_batch_size: - config.per_device_train_batch_size = args.per_device_train_batch_size - if args.max_seq_length: - config.max_seq_length = args.max_seq_length - - print(f"Config: model={config.model_name}") - print(f" dataset={config.dataset_key}") - print(f" lora_r={config.lora_r}, lora_alpha={config.lora_alpha}") - print(f" target_modules={config.target_modules}") - print(f" lr={config.learning_rate}, epochs={config.num_train_epochs}") - print(f" effective_batch={config.per_device_train_batch_size * 
config.gradient_accumulation_steps}") - print(f" packing={config.packing}, strategy={config.packing_strategy}") - print(f" assistant_only_loss={config.assistant_only_loss}") - - finetune(config) - - -if __name__ == "__main__": - main() diff --git a/ci-cd/gitlab-ci/.gitlab-ci.yml b/ci-cd/gitlab-ci/.gitlab-ci.yml deleted file mode 100644 index e0174a2a12e5170e5b76b03700942bdf8cfdd64d..0000000000000000000000000000000000000000 --- a/ci-cd/gitlab-ci/.gitlab-ci.yml +++ /dev/null @@ -1,113 +0,0 @@ -# ============================================================================= -# GitLab CI — DevSecOps Pipeline -# ============================================================================= - -stages: - - sast - - build - - scan - - test - - sign - - deploy - -variables: - REGISTRY: ecr.aws/devsecops - TRIVY_SEVERITY: "CRITICAL,HIGH" - -# --- SAST Stage --- -semgrep: - stage: sast - image: semgrep/semgrep:latest - script: - - semgrep --config auto --json --output semgrep.json . - artifacts: - paths: - - semgrep.json - -secret-scan: - stage: sast - image: aquasec/trivy:latest - script: - - trivy fs --scanners secret --exit-code 1 . - -checkov: - stage: sast - image: bridgecrew/checkov:latest - script: - - checkov -d terraform/ --output cli - -# --- Build Stage --- -build: - stage: build - image: docker:24 - services: - - docker:24-dind - before_script: - - aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY - script: - - | - docker build \ - --build-arg BUILD_DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ) \ - -t $REGISTRY/$CI_PROJECT_NAME:$CI_COMMIT_SHORT_SHA \ - -t $REGISTRY/$CI_PROJECT_NAME:latest . - - docker push $REGISTRY/$CI_PROJECT_NAME:$CI_COMMIT_SHORT_SHA - -# --- Scan Stage --- -trivy-scan: - stage: scan - image: aquasec/trivy:latest - needs: [build] - script: - - trivy image --severity $TRIVY_SEVERITY --exit-code 1 --ignore-unfixed $REGISTRY/$CI_PROJECT_NAME:$CI_COMMIT_SHORT_SHA - -generate-sbom: - stage: scan - image: anchore/syft:latest - needs: [build] - script: - - syft $REGISTRY/$CI_PROJECT_NAME:$CI_COMMIT_SHORT_SHA -o spdx-json > sbom.spdx.json - artifacts: - paths: - - sbom.spdx.json - -# --- Test Stage --- -integration-test: - stage: test - image: docker:24 - services: - - docker:24-dind - script: - - docker compose -f docker-compose.test.yml up --abort-on-container-exit - -# --- Sign Stage --- -sign: - stage: sign - image: bitnami/cosign:latest - needs: [build, trivy-scan, generate-sbom] - variables: - COSIGN_EXPERIMENTAL: "1" - script: - - cosign sign --yes $REGISTRY/$CI_PROJECT_NAME:$CI_COMMIT_SHORT_SHA - - cosign attest --yes --predicate sbom.spdx.json --type spdxjson $REGISTRY/$CI_PROJECT_NAME:$CI_COMMIT_SHORT_SHA - -# --- Deploy Stage --- -deploy-staging: - stage: deploy - image: bitnami/kubectl:latest - needs: [sign, integration-test] - environment: - name: staging - script: - - kubectl set image deployment/$CI_PROJECT_NAME $CI_PROJECT_NAME=$REGISTRY/$CI_PROJECT_NAME:$CI_COMMIT_SHORT_SHA -n staging - - kubectl rollout status deployment/$CI_PROJECT_NAME -n staging --timeout=300s - -deploy-prod: - stage: deploy - image: bitnami/kubectl:latest - needs: [deploy-staging] - environment: - name: production - when: manual - script: - - kubectl set image deployment/$CI_PROJECT_NAME $CI_PROJECT_NAME=$REGISTRY/$CI_PROJECT_NAME:$CI_COMMIT_SHORT_SHA -n production - - kubectl rollout status deployment/$CI_PROJECT_NAME -n production --timeout=300s diff --git a/ci-cd/jenkins/Jenkinsfile b/ci-cd/jenkins/Jenkinsfile deleted file mode 100644 index 
380a2278e5054178afd9e4039ac5e77b0dfb77b3..0000000000000000000000000000000000000000 --- a/ci-cd/jenkins/Jenkinsfile +++ /dev/null @@ -1,136 +0,0 @@ -// ============================================================================= -// Jenkinsfile — Shared DevSecOps Pipeline -// ============================================================================= - -pipeline { - agent { label 'docker' } - - environment { - REGISTRY = 'ecr.aws/devsecops' - IMAGE_NAME = "${env.JOB_NAME.split('/').last()}" - IMAGE_TAG = "${env.GIT_COMMIT.take(12)}" - TRIVY_SEVERITY = 'CRITICAL,HIGH' - } - - stages { - // ----- Stage 1: SAST ----- - stage('SAST') { - parallel { - stage('Semgrep') { - steps { - sh 'semgrep --config auto --json --output semgrep.json .' - } - } - stage('Secret Scan') { - steps { - sh 'trivy fs --scanners secret --exit-code 1 .' - } - } - stage('IaC Scan') { - steps { - sh 'checkov -d terraform/ --output cli --soft-fail false' - } - } - } - } - - // ----- Stage 2: Build ----- - stage('Build') { - steps { - script { - docker.withRegistry("https://${REGISTRY}", 'ecr:us-east-1') { - def app = docker.build( - "${IMAGE_NAME}:${IMAGE_TAG}", - '--build-arg BUILD_DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ) .' - ) - app.push() - app.push('latest') - } - } - } - } - - // ----- Stage 3: Container Scan ----- - stage('Security Scan') { - steps { - sh """ - trivy image \ - --severity ${TRIVY_SEVERITY} \ - --exit-code 1 \ - --ignore-unfixed \ - ${REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG} - """ - // Generate SBOM - sh """ - syft ${REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG} \ - -o cyclonedx-json > sbom.cyclonedx.json - """ - } - } - - // ----- Stage 4: Test ----- - stage('Integration Test') { - steps { - sh 'docker compose -f docker-compose.test.yml up --abort-on-container-exit' - } - } - - // ----- Stage 5: Sign ----- - stage('Sign & Attest') { - steps { - sh """ - cosign sign --yes \ - ${REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG} - cosign attest --yes \ - --predicate sbom.cyclonedx.json \ - --type cyclonedx \ - ${REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG} - """ - } - } - - // ----- Stage 6: Deploy ----- - stage('Deploy Staging') { - steps { - sh """ - kubectl set image deployment/${IMAGE_NAME} \ - ${IMAGE_NAME}=${REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG} \ - -n staging - """ - // Wait for rollout - sh 'kubectl rollout status deployment/${IMAGE_NAME} -n staging --timeout=300s' - } - } - - stage('Deploy Production') { - when { - branch 'main' - } - input { - message "Deploy ${IMAGE_NAME}:${IMAGE_TAG} to production?" 
- } - steps { - sh """ - kubectl set image deployment/${IMAGE_NAME} \ - ${IMAGE_NAME}=${REGISTRY}/${IMAGE_NAME}:${IMAGE_TAG} \ - -n production - """ - sh 'kubectl rollout status deployment/${IMAGE_NAME} -n production --timeout=300s' - } - } - } - - post { - always { - archiveArtifacts artifacts: 'semgrep.json, sbom.cyclonedx.json', allowEmptyArchive: true - recordIssues(tools: [semgrep(pattern: 'semgrep.json')]) - } - failure { - slackSend( - channel: '#platform-alerts', - color: 'danger', - message: "FAILED: ${env.JOB_NAME} #${env.BUILD_NUMBER}" - ) - } - } -} diff --git a/compliance/cis-benchmarks/cis-eks-k8s.yaml b/compliance/cis-eks-k8s.yaml similarity index 100% rename from compliance/cis-benchmarks/cis-eks-k8s.yaml rename to compliance/cis-eks-k8s.yaml diff --git a/compliance/soc2/controls-mapping.yaml b/compliance/controls-mapping.yaml similarity index 100% rename from compliance/soc2/controls-mapping.yaml rename to compliance/controls-mapping.yaml diff --git a/compliance/nist/nist-800-53-mapping.yaml b/compliance/nist-800-53-mapping.yaml similarity index 100% rename from compliance/nist/nist-800-53-mapping.yaml rename to compliance/nist-800-53-mapping.yaml diff --git a/docker/base-images/Dockerfile.ml-inference b/deployment/Dockerfile.ml-inference similarity index 100% rename from docker/base-images/Dockerfile.ml-inference rename to deployment/Dockerfile.ml-inference diff --git a/k8s/workloads/ml-pipeline/deployment.yaml b/deployment/deployment.yaml similarity index 100% rename from k8s/workloads/ml-pipeline/deployment.yaml rename to deployment/deployment.yaml diff --git a/ai-ml/mlflow/mlflow-deployment.yaml b/deployment/mlflow-deployment.yaml similarity index 100% rename from ai-ml/mlflow/mlflow-deployment.yaml rename to deployment/mlflow-deployment.yaml diff --git a/docker/base-images/Dockerfile.backend b/docker/base-images/Dockerfile.backend deleted file mode 100644 index d309b7fcbe746f7d999686829dd3e82c02dd350c..0000000000000000000000000000000000000000 --- a/docker/base-images/Dockerfile.backend +++ /dev/null @@ -1,51 +0,0 @@ -# ============================================================================= -# Multi-Stage Hardened Dockerfile — Python Backend -# ============================================================================= -# Security Features: -# - Multi-stage build (build → runtime) -# - Non-root user -# - Minimal base (distroless) -# - Pinned versions -# - No shell in runtime image -# - Health check -# ============================================================================= - -# --- Build Stage --- -FROM python:3.12-slim AS builder - -WORKDIR /build - -# Pin pip and install dependencies -COPY requirements.txt . -RUN pip install --no-cache-dir --require-hashes -r requirements.txt - -# Copy application -COPY src/ /build/src/ -COPY pyproject.toml /build/ - -# Build wheel -RUN pip wheel --no-cache-dir --no-deps -w /build/wheels . 
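
(The runtime stage below launches `uvicorn src.main:app` and health-checks `/healthz` every 30s; a hypothetical minimal `src/main.py` satisfying that contract — assuming FastAPI, which this Dockerfile does not itself mandate — could look like:)

```python
# Hypothetical app module the runtime stage's ENTRYPOINT and HEALTHCHECK expect.
from fastapi import FastAPI

app = FastAPI()

@app.get("/healthz")
def healthz() -> dict:
    # Liveness endpoint probed by the container HEALTHCHECK.
    return {"status": "ok"}
```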
- -# --- Runtime Stage --- -FROM gcr.io/distroless/python3-debian12:nonroot AS runtime - -# Copy wheels from builder -COPY --from=builder /build/wheels /app/wheels/ -COPY --from=builder /build/src/ /app/src/ - -# Set environment -ENV PYTHONUNBUFFERED=1 \ - PYTHONDONTWRITEBYTECODE=1 \ - PATH="/app/.local/bin:${PATH}" - -WORKDIR /app - -# Run as non-root (distroless nonroot image UID 65532) -USER 65532:65532 - -EXPOSE 8080 - -HEALTHCHECK --interval=30s --timeout=5s --retries=3 \ - CMD ["python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8080/healthz')"] - -ENTRYPOINT ["python", "-m", "uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080"] diff --git a/docker/base-images/Dockerfile.frontend b/docker/base-images/Dockerfile.frontend deleted file mode 100644 index fa245ee6a6787c6ad3d766a2bb71512f5dfa435e..0000000000000000000000000000000000000000 --- a/docker/base-images/Dockerfile.frontend +++ /dev/null @@ -1,33 +0,0 @@ -# ============================================================================= -# Multi-Stage Hardened Dockerfile — React Frontend -# ============================================================================= - -# --- Build Stage --- -FROM node:20-alpine AS builder - -WORKDIR /app - -# Pin package versions with lockfile -COPY package.json package-lock.json ./ -RUN npm ci --ignore-scripts - -COPY . . -RUN npm run build - -# --- Runtime Stage --- -FROM nginxinc/nginx-unprivileged:1.25-alpine AS runtime - -# Remove default nginx configs -RUN rm -f /etc/nginx/conf.d/default.conf - -# Copy custom nginx config (security headers) -COPY docker/nginx.conf /etc/nginx/conf.d/ -COPY --from=builder /app/dist /usr/share/nginx/html - -# Security headers are in nginx.conf -EXPOSE 8080 - -USER 101:101 - -HEALTHCHECK --interval=30s --timeout=5s --retries=3 \ - CMD ["curl", "-f", "http://localhost:8080/healthz"] diff --git a/finops/cost-optimization.yaml b/finops/cost-optimization.yaml deleted file mode 100644 index bd49dcf6baf10a87469b4a8d96626b1b4847e7b5..0000000000000000000000000000000000000000 --- a/finops/cost-optimization.yaml +++ /dev/null @@ -1,73 +0,0 @@ -# ============================================================================= -# FinOps Engine — Cloud Cost Governance -# ============================================================================= -# Addresses: cost waste, rightsizing, scheduling, unit economics -# ============================================================================= - -# --- Spot Instance Strategy --- -# Use SPOT for ML training workloads (70-90% cost savings) -# Use ON_DEMAND for production services (no interruption risk) - -apiVersion: apps/v1 -kind: Deployment -metadata: - name: ml-training-spot - namespace: ml-pipeline - labels: - app: ml-training-spot - finops: spot-instance -spec: - replicas: 0 # Scale up on demand via KEDA - selector: - matchLabels: - app: ml-training-spot - template: - metadata: - labels: - app: ml-training-spot - finops: spot-instance - spec: - containers: - - name: trainer - image: "ecr.aws/devsecops/ml-train:v1.0.0" - resources: - requests: - cpu: "4" - memory: 16Gi - nvidia.com/gpu: "1" - limits: - cpu: "8" - memory: 32Gi - nvidia.com/gpu: "1" - tolerations: - - key: nvidia.com/gpu - operator: Exists - effect: NoSchedule - nodeSelector: - workload: ml-spot - # Allow eviction for spot reclamation - terminationGracePeriodSeconds: 120 ---- -# --- KEDA Scaler — Scale ML training on queue depth --- -apiVersion: keda.sh/v1alpha1 -kind: ScaledJob -metadata: - name: ml-training-scaler - 
namespace: ml-pipeline -spec: - minReplicaCount: 0 - maxReplicaCount: 4 - pollingInterval: 30 - triggers: - - type: aws-sqs - metadata: - queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/ml-training-queue - queueLength: "1" - jobTemplate: - spec: - template: - spec: - restartPolicy: Never - containers: - - name: trainer - image: "ecr.aws/devsecops/ml-train:v1.0.0" diff --git a/finops/finops-cronjob.yaml b/finops/finops-cronjob.yaml deleted file mode 100644 index 816f6f41ed5b3cb243c779357c0a151d6f3247ef..0000000000000000000000000000000000000000 --- a/finops/finops-cronjob.yaml +++ /dev/null @@ -1,23 +0,0 @@ -# FinOps Daily Cost Scanner -apiVersion: batch/v1 -kind: CronJob -metadata: - name: finops-scanner - namespace: platform-system -spec: - schedule: "0 6 * * 1-5" # 6am weekdays - jobTemplate: - spec: - template: - spec: - serviceAccountName: finops-scanner - containers: - - name: scanner - image: "ecr.aws/devsecops/finops-scanner:latest" - command: ["python3", "finops_scanner.py"] - env: - - name: AWS_REGION - value: "us-east-1" - restartPolicy: Never - concurrencyPolicy: Forbid - successfulJobsHistoryLimit: 7 diff --git a/finops/finops-policy.yaml b/finops/finops-policy.yaml deleted file mode 100644 index 4a4f7b1227a3c035d8e632ae8236ad3f38162dbb..0000000000000000000000000000000000000000 --- a/finops/finops-policy.yaml +++ /dev/null @@ -1,73 +0,0 @@ -# ============================================================================= -# FinOps Policy — Cloud Waste Detection & Rightsizing Rules -# ============================================================================= - -policies: - # --- Unused Resources --- - - id: FINOPS-001 - name: "Detect unused EBS volumes" - severity: medium - check: "aws ec2 describe-volumes --filters Name=status,Values=available" - action: "Create snapshot, delete volume after 7 days" - estimated_savings: "$50-200/month per volume" - - - id: FINOPS-002 - name: "Detect idle RDS instances" - severity: medium - check: "Connection count < 5 for 7 days" - action: "Downsize instance class or stop non-prod" - estimated_savings: "30-50% of RDS cost" - - - id: FINOPS-003 - name: "Detect unattached EIPs" - severity: low - check: "aws ec2 describe-addresses --filter Name=association-id,Values=''" - action: "Release EIP" - estimated_savings: "$3.60/month per EIP" - - # --- Rightsizing --- - - id: FINOPS-010 - name: "EC2 rightsizing recommendations" - severity: medium - check: "CPU < 15% for 14 days OR Memory < 25% for 14 days" - action: "Downsize to next tier (e.g., m6i.xlarge -> m6i.large)" - estimated_savings: "20-40% per instance" - - - id: FINOPS-011 - name: "Over-provisioned K8s workloads" - severity: medium - check: "Container CPU request > 2x actual P95 usage" - action: "Reduce requests to P95 + 20% headroom" - estimated_savings: "30-50% of cluster cost" - - # --- Scheduling --- - - id: FINOPS-020 - name: "Non-prod environment scheduling" - severity: high - check: "Dev/staging workloads running 24/7" - action: "Scale to 0 outside business hours (Mon-Fri 8am-8pm)" - estimated_savings: "65% of non-prod compute" - - # --- Reserved Instances Coverage --- - - id: FINOPS-030 - name: "RI coverage below 70%" - severity: high - check: "RI coverage < 70% for consistent workloads" - action: "Purchase RIs for EKS node groups + RDS" - estimated_savings: "30-40% vs on-demand" - - # --- Storage Tiering --- - - id: FINOPS-040 - name: "S3 intelligent tiering" - severity: medium - check: "S3 objects > 90 days in STANDARD" - action: "Enable S3 Intelligent-Tiering on all 
buckets" - estimated_savings: "40-60% on infrequent access data" - - # --- GPU Utilization --- - - id: FINOPS-050 - name: "Underutilized GPU nodes" - severity: critical - check: "GPU utilization < 30% for 4 hours" - action: "Scale down GPU node group or use KEDA for demand-based scaling" - estimated_savings: "$2-6/hour per GPU" diff --git a/incident-response/auto-remediation/auto-remediate.sh b/incident-response/auto-remediation/auto-remediate.sh deleted file mode 100644 index aab4e15d2881be581cd0fca868f530440374b446..0000000000000000000000000000000000000000 --- a/incident-response/auto-remediation/auto-remediate.sh +++ /dev/null @@ -1,50 +0,0 @@ -#!/usr/bin/env bash -# ============================================================================= -# Autonomous Incident Remediation Engine -# ============================================================================= -# Triggered by Alertmanager webhook. Auto-remediates known patterns. -# ============================================================================= - -set -euo pipefail - -ALERT_NAME="${1:-unknown}" -NAMESPACE="${2:-default}" -POD_NAME="${3:-}" - -log() { echo "[$(date +%H:%M:%S)] [REMEDIATE] $*"; } - -case "${ALERT_NAME}" in - PodCrashLooping) - log "Remediating crash-looping pod: ${NAMESPACE}/${POD_NAME}" - # Check if OOM killed - OOM_COUNT=$(kubectl get pod "${POD_NAME}" -n "${NAMESPACE}" -o json | jq -r '.status.containerStatuses[0].lastState.terminated.reason // empty' | grep -c OOMKilled || true) - if [[ "${OOM_COUNT}" -gt 0 ]]; then - log "OOM detected - increasing memory limit" - kubectl patch deployment "${POD_NAME%-*}" -n "${NAMESPACE}" -p '{"spec":{"template":{"spec":{"containers":[{"name":"app","resources":{"limits":{"memory":"2Gi"}}}]}}}}' - log "Memory limit increased to 2Gi" - else - log "Non-OOM crash - restarting pod" - kubectl delete pod "${POD_NAME}" -n "${NAMESPACE}" --grace-period=30 - fi - ;; - - HighMemoryUsage) - log "Node memory pressure detected" - # Evict lowest-priority pods - kubectl get pods -A --sort-by=.spec.priority --field-selector=status.phase=Running | tail -5 | while read ns pod rest; do - log "Considering eviction: ${ns}/${pod}" - done - ;; - - FalcoRuntimeAlert) - log "Runtime security alert - do NOT auto-remediate" - log "Escalate to security team: #security-alerts" - # Only notify - never auto-remediate security alerts - ;; - - *) - log "Unknown alert pattern: ${ALERT_NAME}" - log "Manual investigation required" - exit 1 - ;; -esac diff --git a/platform/adr/template.md b/infrastructure/adr-template.md similarity index 100% rename from platform/adr/template.md rename to infrastructure/adr-template.md diff --git a/ci-cd/github-actions/devsecops-pipeline.yml b/infrastructure/ci-cd/devsecops-pipeline.yml similarity index 100% rename from ci-cd/github-actions/devsecops-pipeline.yml rename to infrastructure/ci-cd/devsecops-pipeline.yml diff --git a/finops/finops_scanner.py b/infrastructure/finops_scanner.py similarity index 100% rename from finops/finops_scanner.py rename to infrastructure/finops_scanner.py diff --git a/scripts/bash/incident-response.sh b/infrastructure/incident-response.sh similarity index 100% rename from scripts/bash/incident-response.sh rename to infrastructure/incident-response.sh diff --git a/monitoring/alertmanager/alertmanager-config.yaml b/infrastructure/monitoring/alertmanager/alertmanager-config.yaml similarity index 100% rename from monitoring/alertmanager/alertmanager-config.yaml rename to infrastructure/monitoring/alertmanager/alertmanager-config.yaml diff 
--git a/monitoring/grafana/dashboards/platform-overview.json b/infrastructure/monitoring/grafana-platform-overview.json similarity index 100% rename from monitoring/grafana/dashboards/platform-overview.json rename to infrastructure/monitoring/grafana-platform-overview.json diff --git a/monitoring/otel/otel-collector.yaml b/infrastructure/monitoring/otel/otel-collector.yaml similarity index 100% rename from monitoring/otel/otel-collector.yaml rename to infrastructure/monitoring/otel/otel-collector.yaml diff --git a/monitoring/prometheus/alerts.yaml b/infrastructure/monitoring/prometheus/alerts.yaml similarity index 100% rename from monitoring/prometheus/alerts.yaml rename to infrastructure/monitoring/prometheus/alerts.yaml diff --git a/incident-response/postmortem/template.md b/infrastructure/postmortem-template.md similarity index 100% rename from incident-response/postmortem/template.md rename to infrastructure/postmortem-template.md diff --git a/terraform/environments/prod/main.tf b/infrastructure/terraform/environments/prod/main.tf similarity index 100% rename from terraform/environments/prod/main.tf rename to infrastructure/terraform/environments/prod/main.tf diff --git a/terraform/modules/eks/main.tf b/infrastructure/terraform/modules/eks/main.tf similarity index 100% rename from terraform/modules/eks/main.tf rename to infrastructure/terraform/modules/eks/main.tf diff --git a/terraform/modules/eks/outputs.tf b/infrastructure/terraform/modules/eks/outputs.tf similarity index 100% rename from terraform/modules/eks/outputs.tf rename to infrastructure/terraform/modules/eks/outputs.tf diff --git a/terraform/modules/eks/variables.tf b/infrastructure/terraform/modules/eks/variables.tf similarity index 100% rename from terraform/modules/eks/variables.tf rename to infrastructure/terraform/modules/eks/variables.tf diff --git a/terraform/modules/guardduty/main.tf b/infrastructure/terraform/modules/guardduty/main.tf similarity index 100% rename from terraform/modules/guardduty/main.tf rename to infrastructure/terraform/modules/guardduty/main.tf diff --git a/terraform/modules/guardduty/variables.tf b/infrastructure/terraform/modules/guardduty/variables.tf similarity index 100% rename from terraform/modules/guardduty/variables.tf rename to infrastructure/terraform/modules/guardduty/variables.tf diff --git a/terraform/modules/iam/main.tf b/infrastructure/terraform/modules/iam/main.tf similarity index 100% rename from terraform/modules/iam/main.tf rename to infrastructure/terraform/modules/iam/main.tf diff --git a/terraform/modules/kms/main.tf b/infrastructure/terraform/modules/kms/main.tf similarity index 100% rename from terraform/modules/kms/main.tf rename to infrastructure/terraform/modules/kms/main.tf diff --git a/terraform/modules/macie/main.tf b/infrastructure/terraform/modules/macie/main.tf similarity index 100% rename from terraform/modules/macie/main.tf rename to infrastructure/terraform/modules/macie/main.tf diff --git a/terraform/modules/rds/main.tf b/infrastructure/terraform/modules/rds/main.tf similarity index 100% rename from terraform/modules/rds/main.tf rename to infrastructure/terraform/modules/rds/main.tf diff --git a/terraform/modules/rds/variables.tf b/infrastructure/terraform/modules/rds/variables.tf similarity index 100% rename from terraform/modules/rds/variables.tf rename to infrastructure/terraform/modules/rds/variables.tf diff --git a/terraform/modules/s3/main.tf b/infrastructure/terraform/modules/s3/main.tf similarity index 100% rename from terraform/modules/s3/main.tf 
rename to infrastructure/terraform/modules/s3/main.tf diff --git a/terraform/modules/s3/variables.tf b/infrastructure/terraform/modules/s3/variables.tf similarity index 100% rename from terraform/modules/s3/variables.tf rename to infrastructure/terraform/modules/s3/variables.tf diff --git a/terraform/modules/vpc/main.tf b/infrastructure/terraform/modules/vpc/main.tf similarity index 100% rename from terraform/modules/vpc/main.tf rename to infrastructure/terraform/modules/vpc/main.tf diff --git a/terraform/modules/vpc/outputs.tf b/infrastructure/terraform/modules/vpc/outputs.tf similarity index 100% rename from terraform/modules/vpc/outputs.tf rename to infrastructure/terraform/modules/vpc/outputs.tf diff --git a/terraform/modules/vpc/variables.tf b/infrastructure/terraform/modules/vpc/variables.tf similarity index 100% rename from terraform/modules/vpc/variables.tf rename to infrastructure/terraform/modules/vpc/variables.tf diff --git a/k8s/base/limit-ranges/limit-ranges.yaml b/k8s/base/limit-ranges/limit-ranges.yaml deleted file mode 100644 index 078509d27625964922ac328646021fd4f4448717..0000000000000000000000000000000000000000 --- a/k8s/base/limit-ranges/limit-ranges.yaml +++ /dev/null @@ -1,74 +0,0 @@ -# ============================================================================= -# Limit Ranges — Default Resource Requests/Limits Per Container -# ============================================================================= - -apiVersion: v1 -kind: LimitRange -metadata: - name: default-limits - namespace: frontend -spec: - limits: - - type: Container - default: - cpu: 500m - memory: 256Mi - defaultRequest: - cpu: 100m - memory: 128Mi - max: - cpu: "2" - memory: 2Gi - min: - cpu: 50m - memory: 64Mi - maxLimitRequestRatio: - cpu: "4" - memory: "4" ---- -apiVersion: v1 -kind: LimitRange -metadata: - name: default-limits - namespace: backend -spec: - limits: - - type: Container - default: - cpu: "1" - memory: 512Mi - defaultRequest: - cpu: 200m - memory: 256Mi - max: - cpu: "4" - memory: 4Gi - min: - cpu: 100m - memory: 128Mi - maxLimitRequestRatio: - cpu: "4" - memory: "4" ---- -apiVersion: v1 -kind: LimitRange -metadata: - name: default-limits - namespace: ml-pipeline -spec: - limits: - - type: Container - default: - cpu: "2" - memory: 4Gi - nvidia.com/gpu: "1" - defaultRequest: - cpu: 500m - memory: 1Gi - max: - cpu: "8" - memory: 16Gi - nvidia.com/gpu: "2" - min: - cpu: 200m - memory: 512Mi diff --git a/k8s/base/namespaces/namespaces.yaml b/k8s/base/namespaces/namespaces.yaml deleted file mode 100644 index d951b37f9c063ee4b6b0c1be79bc8a2cadba214c..0000000000000000000000000000000000000000 --- a/k8s/base/namespaces/namespaces.yaml +++ /dev/null @@ -1,69 +0,0 @@ -# ============================================================================= -# Namespace Definitions — Security-First Multi-Tenant Layout -# ============================================================================= -# Each namespace gets: -# - Labels for network policy targeting -# - Resource quotas -# - Limit ranges -# - Pod security standards via labels (Kyverno enforces) - -apiVersion: v1 -kind: Namespace -metadata: - name: platform-system - labels: - pod-security.kubernetes.io/enforce: "privileged" - pod-security.kubernetes.io/audit: "privileged" - pod-security.kubernetes.io/warn: "privileged" - platform: "true" ---- -apiVersion: v1 -kind: Namespace -metadata: - name: monitoring - labels: - pod-security.kubernetes.io/enforce: "restricted" - pod-security.kubernetes.io/audit: "restricted" - 
pod-security.kubernetes.io/warn: "restricted" - platform: "true" ---- -apiVersion: v1 -kind: Namespace -metadata: - name: security - labels: - pod-security.kubernetes.io/enforce: "restricted" - pod-security.kubernetes.io/audit: "restricted" - pod-security.kubernetes.io/warn: "restricted" - platform: "true" ---- -apiVersion: v1 -kind: Namespace -metadata: - name: frontend - labels: - pod-security.kubernetes.io/enforce: "restricted" - pod-security.kubernetes.io/audit: "restricted" - pod-security.kubernetes.io/warn: "restricted" - app-team: "frontend" ---- -apiVersion: v1 -kind: Namespace -metadata: - name: backend - labels: - pod-security.kubernetes.io/enforce: "restricted" - pod-security.kubernetes.io/audit: "restricted" - pod-security.kubernetes.io/warn: "restricted" - app-team: "backend" ---- -apiVersion: v1 -kind: Namespace -metadata: - name: ml-pipeline - labels: - pod-security.kubernetes.io/enforce: "baseline" - pod-security.kubernetes.io/audit: "restricted" - pod-security.kubernetes.io/warn: "restricted" - app-team: "ml" - nvidia.com/gpu: "true" diff --git a/k8s/base/network-policies/network-policies.yaml b/k8s/base/network-policies/network-policies.yaml deleted file mode 100644 index 09805adf5f3a7a7ed6abcedc982fb46f8e2b1de6..0000000000000000000000000000000000000000 --- a/k8s/base/network-policies/network-policies.yaml +++ /dev/null @@ -1,124 +0,0 @@ -# ============================================================================= -# Network Policies — Zero Trust Default Deny + Selective Allow -# ============================================================================= -# Strategy: Default deny all ingress/egress, then allow only known traffic - -# --- Default Deny All Ingress in Every Namespace --- -apiVersion: networking.k8s.io/v1 -kind: NetworkPolicy -metadata: - name: default-deny-ingress - namespace: frontend -spec: - podSelector: {} # Matches all pods - policyTypes: - - Ingress ---- -apiVersion: networking.k8s.io/v1 -kind: NetworkPolicy -metadata: - name: default-deny-ingress - namespace: backend -spec: - podSelector: {} - policyTypes: - - Ingress ---- -apiVersion: networking.k8s.io/v1 -kind: NetworkPolicy -metadata: - name: default-deny-ingress - namespace: ml-pipeline -spec: - podSelector: {} - policyTypes: - - Ingress ---- -# --- Frontend: Allow ingress from Istio ingress gateway only --- -apiVersion: networking.k8s.io/v1 -kind: NetworkPolicy -metadata: - name: allow-istio-ingress - namespace: frontend -spec: - podSelector: - matchLabels: - app: frontend - policyTypes: - - Ingress - ingress: - - from: - - namespaceSelector: - matchLabels: - name: istio-system - - podSelector: - matchLabels: - istio: ingressgateway - ports: - - port: 8080 - protocol: TCP ---- -# --- Backend: Allow ingress from frontend namespace only --- -apiVersion: networking.k8s.io/v1 -kind: NetworkPolicy -metadata: - name: allow-from-frontend - namespace: backend -spec: - podSelector: - matchLabels: - app: backend - policyTypes: - - Ingress - - Egress - ingress: - - from: - - namespaceSelector: - matchLabels: - app-team: frontend - ports: - - port: 8080 - protocol: TCP - egress: - # Allow DNS - - to: [] - ports: - - port: 53 - protocol: UDP - - port: 53 - protocol: TCP - # Allow RDS - - to: [] - ports: - - port: 5432 - protocol: TCP ---- -# --- ML Pipeline: Allow from backend + Istio --- -apiVersion: networking.k8s.io/v1 -kind: NetworkPolicy -metadata: - name: allow-ml-traffic - namespace: ml-pipeline -spec: - podSelector: {} - policyTypes: - - Ingress - - Egress - ingress: - - from: - - 
namespaceSelector: - matchLabels: - app-team: backend - - from: - - namespaceSelector: - matchLabels: - name: istio-system - egress: - - to: [] - ports: - - port: 53 - protocol: UDP - - to: [] - ports: - - port: 443 - protocol: TCP # HuggingFace Hub, S3, etc. diff --git a/k8s/base/pdbs/pdbs.yaml b/k8s/base/pdbs/pdbs.yaml deleted file mode 100644 index 3cd446d15b475c87cb4cfcec6355209c13f2a948..0000000000000000000000000000000000000000 --- a/k8s/base/pdbs/pdbs.yaml +++ /dev/null @@ -1,62 +0,0 @@ -# ============================================================================= -# Pod Disruption Budgets — Availability Guarantees Per Workload -# ============================================================================= -# PDBs prevent voluntary evictions (upgrades, drain) from taking down too many pods. -# Without PDBs: kubectl drain or cluster-autoscaler can cause unplanned outages. -# ============================================================================= - -apiVersion: policy/v1 -kind: PodDisruptionBudget -metadata: - name: frontend-pdb - namespace: frontend -spec: - minAvailable: 2 # At least 2 pods always running (3 replicas total) - selector: - matchLabels: - app: frontend ---- -apiVersion: policy/v1 -kind: PodDisruptionBudget -metadata: - name: backend-pdb - namespace: backend -spec: - minAvailable: 2 - selector: - matchLabels: - app: backend ---- -apiVersion: policy/v1 -kind: PodDisruptionBudget -metadata: - name: ml-inference-pdb - namespace: ml-pipeline -spec: - maxUnavailable: 1 # At most 1 pod disrupted at a time - selector: - matchLabels: - app: ml-inference ---- -# Platform services — always keep 1 available -apiVersion: policy/v1 -kind: PodDisruptionBudget -metadata: - name: prometheus-pdb - namespace: monitoring -spec: - minAvailable: 1 - selector: - matchLabels: - app: kube-prometheus-stack-prometheus ---- -apiVersion: policy/v1 -kind: PodDisruptionBudget -metadata: - name: argocd-pdb - namespace: platform-system -spec: - minAvailable: 1 - selector: - matchLabels: - app.kubernetes.io/name: argocd diff --git a/k8s/base/rbac/rbac.yaml b/k8s/base/rbac/rbac.yaml deleted file mode 100644 index 1c9a7df929f351ebeb254459c4ccb66e276c0d2a..0000000000000000000000000000000000000000 --- a/k8s/base/rbac/rbac.yaml +++ /dev/null @@ -1,78 +0,0 @@ -# ============================================================================= -# RBAC — Least-Privilege Access Control -# ============================================================================= - -# Platform Admins — Full cluster access -apiVersion: rbac.authorization.k8s.io/v1 -kind: ClusterRole -metadata: - name: platform-admin -rules: - - apiGroups: ["*"] - resources: ["*"] - verbs: ["*"] - # Exclude secrets CRUD for audit trail — use ExternalSecrets instead - - apiGroups: [""] - resources: ["secrets"] - verbs: ["get", "list", "watch"] # No create/update/delete ---- -apiVersion: rbac.authorization.k8s.io/v1 -kind: ClusterRoleBinding -metadata: - name: platform-admin -roleRef: - apiGroup: rbac.authorization.k8s.io - kind: ClusterRole - name: platform-admin -subjects: - - kind: Group - name: platform-admins - apiGroup: rbac.authorization.k8s.io ---- -# Developer — Read + Pod Exec + Logs within their namespaces -apiVersion: rbac.authorization.k8s.io/v1 -kind: ClusterRole -metadata: - name: developer -rules: - - apiGroups: ["", "apps", "batch", "extensions"] - resources: ["pods", "pods/log", "pods/exec", "deployments", "statefulsets", "jobs", "cronjobs"] - verbs: ["get", "list", "watch"] - - apiGroups: [""] - resources: ["pods/exec"] - 
verbs: ["create"] - - apiGroups: ["", "apps"] - resources: ["deployments", "statefulsets"] - verbs: ["patch"] # For restart rollout only - - apiGroups: ["metrics.k8s.io"] - resources: ["pods", "nodes"] - verbs: ["get", "list"] ---- -# Viewer — Read-only cluster-wide -apiVersion: rbac.authorization.k8s.io/v1 -kind: ClusterRole -metadata: - name: viewer -rules: - - apiGroups: ["", "apps", "batch", "extensions", "networking.k8s.io"] - resources: ["*"] - verbs: ["get", "list", "watch"] - - nonResourceURLs: ["*"] - verbs: ["get"] ---- -# ML Engineer — Access to ml-pipeline namespace only -apiVersion: rbac.authorization.k8s.io/v1 -kind: Role -metadata: - name: ml-engineer - namespace: ml-pipeline -rules: - - apiGroups: ["", "apps", "batch", "kubeflow.org", "serving.kubeflow.org"] - resources: ["pods", "pods/log", "pods/exec", "deployments", "jobs", "notebooks", "inferenceservices"] - verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] - - apiGroups: [""] - resources: ["secrets"] - verbs: ["get", "list"] # No create/update - - apiGroups: [""] - resources: ["configmaps"] - verbs: ["get", "list", "create", "update"] diff --git a/k8s/base/resource-quotas/resource-quotas.yaml b/k8s/base/resource-quotas/resource-quotas.yaml deleted file mode 100644 index 055c8fb1a1223a7d73115d37f2b7919aeadb42ff..0000000000000000000000000000000000000000 --- a/k8s/base/resource-quotas/resource-quotas.yaml +++ /dev/null @@ -1,50 +0,0 @@ -# ============================================================================= -# Resource Quotas — Prevent Resource Exhaustion Per Namespace -# ============================================================================= - -apiVersion: v1 -kind: ResourceQuota -metadata: - name: frontend-quota - namespace: frontend -spec: - hard: - requests.cpu: "4" - requests.memory: 8Gi - limits.cpu: "8" - limits.memory: 16Gi - pods: "20" - services: "5" - persistentvolumeclaims: "10" - requests.nvidia.com/gpu: "0" # No GPUs for frontend ---- -apiVersion: v1 -kind: ResourceQuota -metadata: - name: backend-quota - namespace: backend -spec: - hard: - requests.cpu: "8" - requests.memory: 16Gi - limits.cpu: "16" - limits.memory: 32Gi - pods: "30" - services: "10" - persistentvolumeclaims: "20" ---- -apiVersion: v1 -kind: ResourceQuota -metadata: - name: ml-quota - namespace: ml-pipeline -spec: - hard: - requests.cpu: "16" - requests.memory: 64Gi - limits.cpu: "32" - limits.memory: 128Gi - pods: "15" - services: "5" - persistentvolumeclaims: "30" - requests.nvidia.com/gpu: "4" diff --git a/k8s/base/slos/slos.yaml b/k8s/base/slos/slos.yaml deleted file mode 100644 index fe06b15ffa42ebeb5a69b3a06586fc19ba8efec9..0000000000000000000000000000000000000000 --- a/k8s/base/slos/slos.yaml +++ /dev/null @@ -1,68 +0,0 @@ -# ============================================================================= -# Service Level Objectives — Platform SLOs -# ============================================================================= -# SLOs define reliability targets. Error budgets = 100% - SLO. -# Burn rate alerts fire when error budget is consumed too fast. 
-# ============================================================================= - -# --- API Availability SLO: 99.95% (21.9 min/month error budget) --- -apiVersion: monitoring.coreos.com/v1 -kind: PrometheusRule -metadata: - name: slo-api-availability - namespace: monitoring - labels: - release: kube-prometheus-stack - slo: "true" -spec: - groups: - - name: slo.api.availability - rules: - # SLO metric: 5m success rate - - record: slo:api_availability:rate5m - expr: | - sum(rate(http_requests_total{code!~"5.."}[5m])) - / - sum(rate(http_requests_total[5m])) - - # 1h error budget burn rate (14.4x = consume 30d budget in 2d) - - alert: SLOAPIAvailabilityBurnRateHigh - expr: | - ( - (1 - slo:api_availability:rate5m) > (14.4 * 0.001) - ) - for: 5m - labels: - severity: critical - slo: api-availability - annotations: - summary: "API availability SLO budget burning too fast" - runbook: "https://runbook.platform.internal/slo-api-burn" - - - name: slo.api.latency - rules: - # Latency SLO: P99 < 2s, 99.9% of requests - - record: slo:api_latency_p99:rate5m - expr: | - histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) - - - alert: SLOAPILatencyBurnRateHigh - expr: | - slo:api_latency_p99:rate5m > 2 - for: 10m - labels: - severity: warning - slo: api-latency - annotations: - summary: "API P99 latency exceeding 2s SLO" - - - name: slo.error_budget - rules: - # Remaining error budget (percentage) - - record: slo:error_budget_remaining:ratio - expr: | - 1 - ( - (1 - slo:api_availability:rate5m) - / - 0.0005 - ) diff --git a/k8s/kustomize/base/kustomization.yaml b/k8s/kustomize/base/kustomization.yaml deleted file mode 100644 index 4f769babab1f1144fb70e8816c4cc80a576754dc..0000000000000000000000000000000000000000 --- a/k8s/kustomize/base/kustomization.yaml +++ /dev/null @@ -1,18 +0,0 @@ -apiVersion: kustomize.config.k8s.io/v1beta1 -kind: Kustomization -resources: - - ../../base/namespaces/ - - ../../base/rbac/ - - ../../base/network-policies/ - - ../../base/resource-quotas/ - - ../../base/limit-ranges/ - - ../../base/pdbs/ - - ../../base/slos/ - - ../../manifests/cert-manager/ - - ../../manifests/external-secrets/ - - ../../manifests/istio/ - - ../../manifests/argo-cd/ - - ../../manifests/trivy-operator/ - - ../../manifests/falco/ - - ../../manifests/kyverno/ - - ../../manifests/prometheus-stack/ diff --git a/k8s/kustomize/overlays/dev/kustomization.yaml b/k8s/kustomize/overlays/dev/kustomization.yaml deleted file mode 100644 index 060dea9cd39a8fea1749d14e7cd788a66876b83b..0000000000000000000000000000000000000000 --- a/k8s/kustomize/overlays/dev/kustomization.yaml +++ /dev/null @@ -1,22 +0,0 @@ -apiVersion: kustomize.config.k8s.io/v1beta1 -kind: Kustomization -resources: - - ../../base - - ../../../workloads/frontend/ - - ../../../workloads/backend/ -patches: - - target: - kind: Deployment - patch: |- - - op: replace - path: /spec/replicas - value: 1 - - target: - kind: HorizontalPodAutoscaler - patch: |- - - op: replace - path: /spec/minReplicas - value: 1 - - op: replace - path: /spec/maxReplicas - value: 3 diff --git a/k8s/kustomize/overlays/prod/kustomization.yaml b/k8s/kustomize/overlays/prod/kustomization.yaml deleted file mode 100644 index cc491285e7c677eca11e4a684cccf2404dfd94bc..0000000000000000000000000000000000000000 --- a/k8s/kustomize/overlays/prod/kustomization.yaml +++ /dev/null @@ -1,22 +0,0 @@ -apiVersion: kustomize.config.k8s.io/v1beta1 -kind: Kustomization -resources: - - ../../base - - ../../../workloads/frontend/ - - ../../../workloads/backend/ - - 
../../../workloads/ml-pipeline/ -patches: - - target: - kind: Deployment - name: backend - patch: |- - - op: replace - path: /spec/replicas - value: 5 - - target: - kind: HorizontalPodAutoscaler - name: backend-hpa - patch: |- - - op: replace - path: /spec/maxReplicas - value: 30 diff --git a/k8s/manifests/argo-cd/argocd.yaml b/k8s/manifests/argo-cd/argocd.yaml deleted file mode 100644 index 70bd52a3336ed4205892a1782c15df86c8512e87..0000000000000000000000000000000000000000 --- a/k8s/manifests/argo-cd/argocd.yaml +++ /dev/null @@ -1,60 +0,0 @@ -# ============================================================================= -# ArgoCD — GitOps Continuous Delivery -# ============================================================================= - -apiVersion: argoproj.io/v1alpha1 -kind: ArgoCD -metadata: - name: argocd - namespace: platform-system -spec: - server: - host: argocd.platform.internal - ingress: - enabled: true - tls: true - annotations: - cert-manager.io/cluster-issuer: letsencrypt-prod - nginx.ingress.kubernetes.io/ssl-passthrough: "true" - grpc: - ingress: - enabled: true - tls: true - sso: - provider: oidc - oidc: - name: Okta - issuer: https://devsecops.okta.com/oauth2/default - clientID: argocd - clientSecret: - name: argocd-oidc-secret - key: clientSecret - requestedScopes: - - openid - - groups - - email - - profile - requestedIDTokenClaims: - groups: - essential: true - rbac: - defaultPolicy: "role:readonly" - policy: | - g, platform-admins, role:admin - g, developers, role:developer - scopes: "[groups]" - repo: - # Enable private repo access via SSH deploy keys - sshPrivateKeySecret: - name: argocd-repo-ssh-key - key: sshPrivateKey - # HA mode - ha: - enabled: true - redis: - image: - repository: public.ecr.aws/bitnami/redis - tag: 7.2.4 - # Security hardening - server RBAC: - enabled: true diff --git a/k8s/manifests/cert-manager/cert-manager.yaml b/k8s/manifests/cert-manager/cert-manager.yaml deleted file mode 100644 index 7c64bb78c4deedbfccc377075b310179d2d5480a..0000000000000000000000000000000000000000 --- a/k8s/manifests/cert-manager/cert-manager.yaml +++ /dev/null @@ -1,62 +0,0 @@ -# ============================================================================= -# cert-manager — Automatic TLS Certificate Management -# ============================================================================= - -apiVersion: cert-manager.io/v1 -kind: ClusterIssuer -metadata: - name: letsencrypt-prod -spec: - acme: - server: https://acme-v02.api.letsencrypt.org/directory - email: platform-team@devsecops.internal - privateKeySecretRef: - name: letsencrypt-prod-key - solvers: - - dns01: - route53: - region: us-east-1 - role: arn:aws:iam::123456789012:role/cert-manager-dns01 ---- -apiVersion: cert-manager.io/v1 -kind: ClusterIssuer -metadata: - name: letsencrypt-staging -spec: - acme: - server: https://acme-staging-v02.api.letsencrypt.org/directory - email: platform-team@devsecops.internal - privateKeySecretRef: - name: letsencrypt-staging-key - solvers: - - dns01: - route53: - region: us-east-1 - role: arn:aws:iam::123456789012:role/cert-manager-dns01 ---- -# Internal CA for service mesh mTLS -apiVersion: cert-manager.io/v1 -kind: Issuer -metadata: - name: selfsigned-issuer - namespace: cert-manager -spec: - selfSigned: {} ---- -apiVersion: cert-manager.io/v1 -kind: Certificate -metadata: - name: internal-ca - namespace: cert-manager -spec: - isCA: true - commonName: devsecops-internal-ca - secretName: internal-ca-key - privateKey: - algorithm: ECDSA - size: 256 - issuerRef: - name: 
selfsigned-issuer - kind: Issuer - duration: 87600h # 10 years - renewBefore: 720h # 30 days diff --git a/k8s/manifests/external-secrets/external-secrets.yaml b/k8s/manifests/external-secrets/external-secrets.yaml deleted file mode 100644 index 928c352c562e81211ae19c47fc3ff220250dcd58..0000000000000000000000000000000000000000 --- a/k8s/manifests/external-secrets/external-secrets.yaml +++ /dev/null @@ -1,78 +0,0 @@ -# ============================================================================= -# External Secrets Operator — Sync from AWS Secrets Manager / Parameter Store -# ============================================================================= - -apiVersion: external-secrets.io/v1beta1 -kind: ClusterSecretStore -metadata: - name: aws-secrets-manager -spec: - provider: - aws: - service: SecretsManager - region: us-east-1 - auth: - jwt: - serviceAccountRef: - name: external-secrets-sa - namespace: security ---- -apiVersion: external-secrets.io/v1beta1 -kind: ClusterSecretStore -metadata: - name: aws-parameter-store -spec: - provider: - aws: - service: ParameterStore - region: us-east-1 - auth: - jwt: - serviceAccountRef: - name: external-secrets-sa - namespace: security ---- -# Example: Sync database credentials -apiVersion: external-secrets.io/v1beta1 -kind: ExternalSecret -metadata: - name: db-credentials - namespace: backend -spec: - refreshInterval: 1h - secretStoreRef: - name: aws-secrets-manager - kind: ClusterSecretStore - target: - name: db-credentials - creationPolicy: Owner - template: - type: Opaque - data: - DB_HOST: "{{ .host }}" - DB_PORT: "{{ .port }}" - DB_USER: "{{ .username }}" - DB_PASSWORD: "{{ .password }}" - DB_NAME: "{{ .dbname }}" - DATABASE_URL: "postgresql://{{ .username }}:{{ .password }}@{{ .host }}:{{ .port }}/{{ .dbname }}?sslmode=require" - data: - - secretKey: host - remoteRef: - key: prod/rds/credentials - property: host - - secretKey: port - remoteRef: - key: prod/rds/credentials - property: port - - secretKey: username - remoteRef: - key: prod/rds/credentials - property: username - - secretKey: password - remoteRef: - key: prod/rds/credentials - property: password - - secretKey: dbname - remoteRef: - key: prod/rds/credentials - property: dbname diff --git a/k8s/manifests/falco/falco.yaml b/k8s/manifests/falco/falco.yaml deleted file mode 100644 index 09b405ffcd36a1694ac86ba4c54be878c6c16a31..0000000000000000000000000000000000000000 --- a/k8s/manifests/falco/falco.yaml +++ /dev/null @@ -1,77 +0,0 @@ -# ============================================================================= -# Falco — Runtime Security Detection -# ============================================================================= - -apiVersion: helm.cattle.io/v1 -kind: HelmChart -metadata: - name: falco - namespace: security -spec: - repo: https://falcosecurity.github.io/charts - chart: falco - targetNamespace: security - valuesContent: |- - driver: - kind: ebpf # Modern kernel — eBPF preferred over kernel module - - falco: - http_output: - enabled: true - url: "http://falcosidekick.security:2801/" - json_output: true - log_level: info - log_stderr: true - log_syslog: false - - # Rate limiting - rate: 1000 - max_burst: 1000 - - # Custom rules — extend default rules for our platform - customRules: - # Alert on container drift (new process spawned) - container-drift.yaml: |- - - rule: Container Drift Detected - desc: New process started in container outside whitelist - condition: > - evt.type = execve and - container.id != host and - not proc.name in (nginx, python, node, gunicorn, 
uvicorn) - output: "Container drift detected (user=%user.name container=%container.name image=%container.image.repository command=%proc.cmdline)" - priority: WARNING - tags: [container, drift] - - # Alert on crypto mining - crypto-mining.yaml: |- - - rule: Detect Crypto Mining - desc: Detect outbound connections to known mining pools - condition: > - (evt.type = connect and - fd.sip in (known_mining_pools) and - container.id != host) - output: "Crypto mining detected (container=%container.name image=%container.image.repository connection=%fd.sip)" - priority: CRITICAL - tags: [crypto, malware] - - # Alert on shell in production container - shell-in-prod.yaml: |- - - rule: Shell Spawned in Production Container - desc: A shell was spawned in a production container - condition: > - evt.type = execve and - container.id != host and - proc.name in (bash, sh, zsh) and - not container.image.repository in (debug-tools) - output: "Shell spawned in production container (user=%user.name container=%container.name image=%container.image.repository shell=%proc.name)" - priority: CRITICAL - tags: [shell, production] - - falcosidekick: - enabled: true - config: - webhook: - enabled: true - address: "http://alertmanager.monitoring:9093/api/v2/alerts" - slack: - enabled: false # Configure per environment diff --git a/k8s/manifests/istio/istio.yaml b/k8s/manifests/istio/istio.yaml deleted file mode 100644 index d43bff73f97c330eabfffd301252584df1116941..0000000000000000000000000000000000000000 --- a/k8s/manifests/istio/istio.yaml +++ /dev/null @@ -1,96 +0,0 @@ -# ============================================================================= -# Istio Service Mesh — mTLS, Traffic Management, Observability -# ============================================================================= - -apiVersion: install.istio.io/v1alpha1 -kind: IstioOperator -metadata: - name: devsecops-mesh - namespace: istio-system -spec: - profile: default - - meshConfig: - accessLogFile: /dev/stdout - accessLogEncoding: JSON - defaultConfig: - tracing: - zipkin: - address: tempo.observability:9411 - holdApplicationUntilProxyStarts: true - - # Strict mTLS everywhere - mtls: - enabled: true - auto: true - - outlierDetection: - consecutive5xxErrors: 3 - interval: 30s - baseEjectionTime: 30s - - components: - pilot: - enabled: true - k8s: - resources: - requests: - cpu: 500m - memory: 2048Mi - limits: - cpu: "2" - memory: 4Gi - hpaSpec: - minReplicas: 2 - maxReplicas: 5 - - ingressGateways: - - name: istio-ingressgateway - enabled: true - k8s: - service: - type: LoadBalancer - annotations: - service.beta.kubernetes.io/aws-load-balancer-type: "nlb" - service.beta.kubernetes.io/aws-load-balancer-internal: "false" - resources: - requests: - cpu: 500m - memory: 512Mi - limits: - cpu: "2" - memory: 1Gi - hpaSpec: - minReplicas: 2 - maxReplicas: 10 - - cni: - enabled: true - - values: - global: - proxy: - resources: - requests: - cpu: 100m - memory: 128Mi - limits: - cpu: 500m - memory: 512Mi - holdApplicationUntilProxyStarts: true - - pilot: - autoscale: - enabled: true - minReplicas: 2 - ---- -# PeerAuthentication: Enforce strict mTLS cluster-wide -apiVersion: security.istio.io/v1beta1 -kind: PeerAuthentication -metadata: - name: default - namespace: istio-system -spec: - mtls: - mode: STRICT diff --git a/k8s/manifests/prometheus-stack/prometheus-stack.yaml b/k8s/manifests/prometheus-stack/prometheus-stack.yaml deleted file mode 100644 index 1d63142eab6a9de2617377399b853139bd6cb94e..0000000000000000000000000000000000000000 --- 
a/k8s/manifests/prometheus-stack/prometheus-stack.yaml +++ /dev/null @@ -1,88 +0,0 @@ -# ============================================================================= -# Prometheus Stack — Monitoring, Alerting, Dashboards -# ============================================================================= - -apiVersion: helm.cattle.io/v1 -kind: HelmChart -metadata: - name: kube-prometheus-stack - namespace: monitoring -spec: - repo: https://prometheus-community.github.io/helm-charts - chart: kube-prometheus-stack - targetNamespace: monitoring - valuesContent: |- - prometheus: - prometheusSpec: - replicas: 2 - retention: 30d - retentionSize: 45GB - storageSpec: - volumeClaimTemplate: - spec: - storageClassName: gp3-encrypted - accessModes: ["ReadWriteOnce"] - resources: - requests: - storage: 50Gi - resources: - requests: - cpu: "1" - memory: 4Gi - limits: - cpu: "2" - memory: 8Gi - # Scrape istio metrics - additionalScrapeConfigs: - - job_name: 'istio-mesh' - kubernetes_sd_configs: - - role: endpoints - relabel_configs: - - source_labels: [__meta_kubernetes_service_name] - regex: 'istio-telemetry' - action: keep - - alertmanager: - alertmanagerSpec: - replicas: 3 - storage: - volumeClaimTemplate: - spec: - storageClassName: gp3-encrypted - accessModes: ["ReadWriteOnce"] - resources: - requests: - storage: 5Gi - - grafana: - replicas: 2 - persistence: - enabled: true - storageClassName: gp3-encrypted - size: 10Gi - adminPassword: - existingSecret: grafana-admin-secret - key: password - sidecar: - dashboards: - enabled: true - searchNamespace: monitoring - datasources: - enabled: true - searchNamespace: monitoring - ingress: - enabled: true - annotations: - cert-manager.io/cluster-issuer: letsencrypt-prod - hosts: - - grafana.platform.internal - tls: - - secretName: grafana-tls - hosts: - - grafana.platform.internal - - nodeExporter: - enabled: true - - kubeStateMetrics: - enabled: true diff --git a/k8s/manifests/trivy-operator/trivy-operator.yaml b/k8s/manifests/trivy-operator/trivy-operator.yaml deleted file mode 100644 index d10f54b5bc1d57f94e44aa50c414aa0ae7d18c41..0000000000000000000000000000000000000000 --- a/k8s/manifests/trivy-operator/trivy-operator.yaml +++ /dev/null @@ -1,61 +0,0 @@ -# ============================================================================= -# Trivy Operator — Continuous Vulnerability Scanning -# ============================================================================= - -apiVersion: helm.cattle.io/v1 -kind: HelmChart -metadata: - name: trivy-operator - namespace: security -spec: - repo: https://aquasecurity.github.io/helm-charts - chart: trivy-operator - targetNamespace: security - valuesContent: |- - operator: - scanJobsConcurrentLimit: 5 - scanJobTimeout: 300s - metricsSecretName: trivy-metrics-secret - - trivy: - repository: ghcr.io/aquasecurity/trivy - tag: 0.50.0 - resources: - requests: - cpu: 200m - memory: 512Mi - limits: - cpu: "1" - memory: 1Gi - # Ignore unfixed CVEs by default - severity: CRITICAL,HIGH - # Scan config - skipUpdate: false - dbRepository: ghcr.io/aquasecurity/trivy-db - - scanner: - reportFormat: json - scanHistoryLimit: 100 - - serviceMonitor: - enabled: true - labels: - release: kube-prometheus-stack - - # ConfigAudit scanner - configAuditScanner: - enabled: true - - # RBAC assessment - rbacAssessmentScanner: - enabled: true - - # Infra assessment - infraAssessmentScanner: - enabled: true - - # Cluster compliance reports - compliance: - reports: - - type: nsa - - type: cis-benchmark diff --git 
a/k8s/workloads/backend/deployment.yaml b/k8s/workloads/backend/deployment.yaml deleted file mode 100644 index 86294b51fc462048d13c915b68012989ca27460c..0000000000000000000000000000000000000000 --- a/k8s/workloads/backend/deployment.yaml +++ /dev/null @@ -1,144 +0,0 @@ -# ============================================================================= -# Backend Deployment — Python FastAPI with DB + Redis -# ============================================================================= - -apiVersion: apps/v1 -kind: Deployment -metadata: - name: backend - namespace: backend - labels: - app: backend - version: v1 -spec: - replicas: 3 - selector: - matchLabels: - app: backend - strategy: - rollingUpdate: - maxSurge: 1 - maxUnavailable: 0 - type: RollingUpdate - template: - metadata: - labels: - app: backend - version: v1 - annotations: - sidecar.istio.io/inject: "true" - prometheus.io/scrape: "true" - prometheus.io/port: "8080" - spec: - serviceAccountName: backend - securityContext: - runAsNonRoot: true - runAsUser: 1000 - fsGroup: 1000 - seccompProfile: - type: RuntimeDefault - initContainers: - - name: db-migrate - image: "ecr.aws/devsecops/backend:v1.0.0" - command: ["alembic", "upgrade", "head"] - envFrom: - - secretRef: - name: db-credentials - securityContext: - allowPrivilegeEscalation: false - readOnlyRootFilesystem: true - capabilities: - drop: ["ALL"] - containers: - - name: backend - image: "ecr.aws/devsecops/backend:v1.0.0" - ports: - - containerPort: 8080 - protocol: TCP - env: - - name: DATABASE_URL - valueFrom: - secretKeyRef: - name: db-credentials - key: DATABASE_URL - - name: REDIS_URL - value: "redis://redis.backend.svc.cluster.local:6379" - envFrom: - - configMapRef: - name: backend-config - resources: - requests: - cpu: 200m - memory: 256Mi - limits: - cpu: "1" - memory: 512Mi - securityContext: - allowPrivilegeEscalation: false - readOnlyRootFilesystem: true - capabilities: - drop: ["ALL"] - livenessProbe: - httpGet: - path: /healthz - port: 8080 - initialDelaySeconds: 15 - periodSeconds: 15 - readinessProbe: - httpGet: - path: /readyz - port: 8080 - initialDelaySeconds: 5 - periodSeconds: 10 - volumeMounts: - - name: tmp - mountPath: /tmp - volumes: - - name: tmp - emptyDir: {} ---- -apiVersion: v1 -kind: Service -metadata: - name: backend - namespace: backend -spec: - selector: - app: backend - ports: - - port: 8080 - targetPort: 8080 ---- -apiVersion: v1 -kind: ServiceAccount -metadata: - name: backend - namespace: backend - automountServiceAccountToken: false ---- -# HPA -apiVersion: autoscaling/v2 -kind: HorizontalPodAutoscaler -metadata: - name: backend-hpa - namespace: backend -spec: - scaleTargetRef: - apiVersion: apps/v1 - kind: Deployment - name: backend - minReplicas: 3 - maxReplicas: 20 - metrics: - - type: Resource - resource: - name: cpu - target: - type: Utilization - averageUtilization: 70 - - type: Resource - resource: - name: memory - target: - type: Utilization - averageUtilization: 80 diff --git a/k8s/workloads/frontend/deployment.yaml b/k8s/workloads/frontend/deployment.yaml deleted file mode 100644 index 2a174ca765cd385a75baf26b7677e1ed174ff8c7..0000000000000000000000000000000000000000 --- a/k8s/workloads/frontend/deployment.yaml +++ /dev/null @@ -1,119 +0,0 @@ -# ============================================================================= -# Frontend Deployment — React App with Istio Sidecar -# ============================================================================= - -apiVersion: apps/v1 -kind: Deployment -metadata: - name: frontend - 
namespace: frontend - labels: - app: frontend - version: v1 -spec: - replicas: 3 - selector: - matchLabels: - app: frontend - strategy: - rollingUpdate: - maxSurge: 1 - maxUnavailable: 0 - type: RollingUpdate - template: - metadata: - labels: - app: frontend - version: v1 - annotations: - sidecar.istio.io/inject: "true" - prometheus.io/scrape: "true" - prometheus.io/port: "8080" - spec: - serviceAccountName: frontend - securityContext: - runAsNonRoot: true - runAsUser: 1000 - fsGroup: 1000 - seccompProfile: - type: RuntimeDefault - containers: - - name: frontend - image: "ecr.aws/devsecops/frontend:v1.0.0" - ports: - - containerPort: 8080 - protocol: TCP - env: - - name: BACKEND_URL - value: "http://backend.backend.svc.cluster.local:8080" - envFrom: - - configMapRef: - name: frontend-config - resources: - requests: - cpu: 100m - memory: 128Mi - limits: - cpu: 500m - memory: 256Mi - securityContext: - allowPrivilegeEscalation: false - readOnlyRootFilesystem: true - capabilities: - drop: ["ALL"] - livenessProbe: - httpGet: - path: /healthz - port: 8080 - initialDelaySeconds: 10 - periodSeconds: 15 - failureThreshold: 3 - readinessProbe: - httpGet: - path: /readyz - port: 8080 - initialDelaySeconds: 5 - periodSeconds: 10 - failureThreshold: 3 - volumeMounts: - - name: tmp - mountPath: /tmp - - name: cache - mountPath: /app/.cache - volumes: - - name: tmp - emptyDir: {} - - name: cache - emptyDir: - medium: Memory - sizeLimit: 64Mi - topologySpreadConstraints: - - maxSkew: 1 - topologyKey: topology.kubernetes.io/zone - whenUnsatisfiable: DoNotSchedule - labelSelector: - matchLabels: - app: frontend ---- -apiVersion: v1 -kind: Service -metadata: - name: frontend - namespace: frontend - labels: - app: frontend -spec: - selector: - app: frontend - ports: - - port: 8080 - targetPort: 8080 - protocol: TCP - type: ClusterIP ---- -apiVersion: v1 -kind: ServiceAccount -metadata: - name: frontend - namespace: frontend - automountServiceAccountToken: false diff --git a/ai-ml/hf-finetuning/BEST_DATASETS.md b/model/DATASETS.md similarity index 100% rename from ai-ml/hf-finetuning/BEST_DATASETS.md rename to model/DATASETS.md diff --git a/platform/golden-paths/microservice/checklist.md b/model/GOLDEN_PATH_CHECKLIST.md similarity index 100% rename from platform/golden-paths/microservice/checklist.md rename to model/GOLDEN_PATH_CHECKLIST.md diff --git a/ai-ml/hf-finetuning/finetune.py b/model/finetune_configurable.py similarity index 100% rename from ai-ml/hf-finetuning/finetune.py rename to model/finetune_configurable.py diff --git a/ai-ml/rag-pipeline/rag_pipeline.py b/model/rag_pipeline.py similarity index 100% rename from ai-ml/rag-pipeline/rag_pipeline.py rename to model/rag_pipeline.py diff --git a/ai-ml/hf-finetuning/train_openthoughts.py b/model/train_openthoughts.py similarity index 100% rename from ai-ml/hf-finetuning/train_openthoughts.py rename to model/train_openthoughts.py diff --git a/ai-ml/hf-finetuning/train_tulu3.py b/model/train_tulu3.py similarity index 100% rename from ai-ml/hf-finetuning/train_tulu3.py rename to model/train_tulu3.py diff --git a/platform/golden-paths/microservice/template.yaml b/platform/golden-paths/microservice/template.yaml deleted file mode 100644 index a2c3df547f2b7595e9c6b842ced09ba7fffb093e..0000000000000000000000000000000000000000 --- a/platform/golden-paths/microservice/template.yaml +++ /dev/null @@ -1,123 +0,0 @@ -# Golden Path - Microservice Template -# Replace: SERVICE_NAME, SERVICE_NAMESPACE, REGISTRY, IMAGE, TAG -apiVersion: apps/v1 -kind: Deployment 
-metadata: - name: SERVICE_NAME - namespace: SERVICE_NAMESPACE - labels: - app: SERVICE_NAME - version: v1 - golden-path: "true" -spec: - replicas: 3 - selector: - matchLabels: - app: SERVICE_NAME - strategy: - rollingUpdate: - maxSurge: 1 - maxUnavailable: 0 - template: - metadata: - labels: - app: SERVICE_NAME - version: v1 - annotations: - sidecar.istio.io/inject: "true" - prometheus.io/scrape: "true" - prometheus.io/port: "8080" - spec: - serviceAccountName: SERVICE_NAME - securityContext: - runAsNonRoot: true - runAsUser: 1000 - fsGroup: 1000 - seccompProfile: - type: RuntimeDefault - containers: - - name: SERVICE_NAME - image: REGISTRY/IMAGE:TAG - ports: - - containerPort: 8080 - envFrom: - - configMapRef: - name: SERVICE_NAME-config - - secretRef: - name: SERVICE_NAME-secrets - resources: - requests: - cpu: 100m - memory: 128Mi - limits: - cpu: 500m - memory: 256Mi - securityContext: - allowPrivilegeEscalation: false - readOnlyRootFilesystem: true - capabilities: - drop: ["ALL"] - livenessProbe: - httpGet: - path: /healthz - port: 8080 - initialDelaySeconds: 10 - periodSeconds: 15 - readinessProbe: - httpGet: - path: /readyz - port: 8080 - initialDelaySeconds: 5 - periodSeconds: 10 - volumeMounts: - - name: tmp - mountPath: /tmp - volumes: - - name: tmp - emptyDir: {} ---- -apiVersion: v1 -kind: Service -metadata: - name: SERVICE_NAME -spec: - selector: - app: SERVICE_NAME - ports: - - port: 8080 - targetPort: 8080 ---- -apiVersion: v1 -kind: ServiceAccount -metadata: - name: SERVICE_NAME - automountServiceAccountToken: false ---- -apiVersion: policy/v1 -kind: PodDisruptionBudget -metadata: - name: SERVICE_NAME-pdb -spec: - minAvailable: 2 - selector: - matchLabels: - app: SERVICE_NAME ---- -apiVersion: autoscaling/v2 -kind: HorizontalPodAutoscaler -metadata: - name: SERVICE_NAME-hpa -spec: - scaleTargetRef: - apiVersion: apps/v1 - kind: Deployment - name: SERVICE_NAME - minReplicas: 3 - maxReplicas: 20 - metrics: - - type: Resource - resource: - name: cpu - target: - type: Utilization - averageUtilization: 70 diff --git a/platform/self-service/environment-provisioner.yaml b/platform/self-service/environment-provisioner.yaml deleted file mode 100644 index 97738d4b5a21e245ce52dbed9b25eefd32aa9e8e..0000000000000000000000000000000000000000 --- a/platform/self-service/environment-provisioner.yaml +++ /dev/null @@ -1,24 +0,0 @@ -# ============================================================================= -# Self-Service Environment Provisioner (Backstage/Terraform) -# ============================================================================= -# Developers can spin up ephemeral environments via GitOps. -# Auto-destroys after TTL (default: 24h for dev, 7d for PR preview). 
-
-apiVersion: v1
-kind: ConfigMap
-metadata:
-  name: env-provisioner-config
-  namespace: platform-system
-data:
-  ENV_TTL_DEV: "24h"
-  ENV_TTL_PR_PREVIEW: "7d"
-  ENV_TTL_STAGING: "infinite"
-  TERRAFORM_WORKSPACE_PREFIX: "dev-ephemeral-"
-  COST_LIMIT_PER_ENV: "50"
-  AUTO_DESTROY_AFTER_TTL: "true"
-  APPROVAL_REQUIRED: "false" # Developers can self-serve
-  ALLOWED_INSTANCE_TYPES: "t3.small,t3.medium"
-  ALLOWED_GPU_TYPES: "g5.xlarge"
-  MAX_ENVS_PER_DEVELOPER: "3"
-  SECRETS_SOURCE: "aws-secrets-manager"
-  NETWORK_POLICY: "default-deny"
diff --git a/scripts/bash/bootstrap.sh b/scripts/bash/bootstrap.sh
deleted file mode 100644
index f31af81ec2324479b6243ed568ba0be9b55e9e63..0000000000000000000000000000000000000000
--- a/scripts/bash/bootstrap.sh
+++ /dev/null
@@ -1,79 +0,0 @@
-#!/usr/bin/env bash
-# =============================================================================
-# DevSecOps Platform — Bootstrap Script
-# =============================================================================
-# Deploys the full platform from scratch
-# =============================================================================
-
-set -euo pipefail
-
-ENV="${1:?Usage: $0 <env>}"
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-PLATFORM_DIR="$(dirname "$(dirname "$SCRIPT_DIR")")"
-
-echo "============================================"
-echo " DevSecOps Platform Bootstrap — ${ENV^^}"
-echo "============================================"
-
-# --- Prerequisites ---
-echo "[1/8] Checking prerequisites..."
-command -v terraform >/dev/null || { echo "ERROR: terraform not found"; exit 1; }
-command -v kubectl >/dev/null || { echo "ERROR: kubectl not found"; exit 1; }
-command -v helm >/dev/null || { echo "ERROR: helm not found"; exit 1; }
-command -v aws >/dev/null || { echo "ERROR: aws CLI not found"; exit 1; }
-command -v trivy >/dev/null || { echo "ERROR: trivy not found"; exit 1; }
-echo "Prerequisites OK"
-
-# --- Terraform Apply ---
-echo "[2/8] Applying Terraform infrastructure..."
-cd "${PLATFORM_DIR}/terraform/environments/${ENV}"
-terraform init -backend-config="key=${ENV}/terraform.tfstate"
-terraform plan -out=tfplan
-terraform apply tfplan
-
-# --- Update kubeconfig ---
-echo "[3/8] Updating kubeconfig..."
-CLUSTER_NAME=$(terraform output -raw cluster_id 2>/dev/null || echo "${ENV}-eks")
-aws eks update-kubeconfig --name "${CLUSTER_NAME}" --region us-east-1
-
-# --- Namespace Setup ---
-echo "[4/8] Creating namespaces and base resources..."
-kubectl apply -f "${PLATFORM_DIR}/k8s/base/namespaces/"
-kubectl apply -f "${PLATFORM_DIR}/k8s/base/rbac/"
-kubectl apply -f "${PLATFORM_DIR}/k8s/base/network-policies/"
-kubectl apply -f "${PLATFORM_DIR}/k8s/base/resource-quotas/"
-kubectl apply -f "${PLATFORM_DIR}/k8s/base/limit-ranges/"
-
-# --- Platform Services ---
-echo "[5/8] Installing platform services..."
-kubectl apply -f "${PLATFORM_DIR}/k8s/manifests/cert-manager/"
-kubectl apply -f "${PLATFORM_DIR}/k8s/manifests/external-secrets/"
-kubectl apply -f "${PLATFORM_DIR}/k8s/manifests/istio/"
-kubectl apply -f "${PLATFORM_DIR}/k8s/manifests/argo-cd/"
-
-# --- Security ---
-echo "[6/8] Installing security tools..."
-kubectl apply -f "${PLATFORM_DIR}/k8s/manifests/trivy-operator/"
-kubectl apply -f "${PLATFORM_DIR}/k8s/manifests/falco/"
-kubectl apply -f "${PLATFORM_DIR}/k8s/manifests/kyverno/"
-
-# --- Monitoring ---
-echo "[7/8] Installing observability stack..."
-kubectl apply -f "${PLATFORM_DIR}/k8s/manifests/prometheus-stack/" -kubectl apply -f "${PLATFORM_DIR}/monitoring/prometheus/" -kubectl apply -f "${PLATFORM_DIR}/monitoring/alertmanager/" -kubectl apply -f "${PLATFORM_DIR}/monitoring/otel/" - -# --- Security Scan --- -echo "[8/8] Running initial security scan..." -trivy k8s --report all --severity CRITICAL,HIGH - -echo "============================================" -echo " Platform ${ENV^^} bootstrap complete!" -echo "============================================" -echo "" -echo "Next steps:" -echo " 1. Configure ArgoCD: kubectl get svc -n platform-system argocd-server" -echo " 2. Access Grafana: kubectl get svc -n monitoring kube-prometheus-stack-grafana" -echo " 3. Check security: kubectl get configauditreports -A" -echo " 4. Deploy workloads: kubectl apply -f k8s/workloads/" diff --git a/k8s/manifests/kyverno/kyverno-policies.yaml b/security/policies/kyverno-policies.yaml similarity index 100% rename from k8s/manifests/kyverno/kyverno-policies.yaml rename to security/policies/kyverno-policies.yaml diff --git a/compliance/policies/opa-policies.yaml b/security/policies/opa-policies.yaml similarity index 100% rename from compliance/policies/opa-policies.yaml rename to security/policies/opa-policies.yaml diff --git a/security/semgrep/.semgrep.yml b/security/scanning/.semgrep.yml similarity index 100% rename from security/semgrep/.semgrep.yml rename to security/scanning/.semgrep.yml diff --git a/security/checkov/checkov.yml b/security/scanning/checkov.yml similarity index 100% rename from security/checkov/checkov.yml rename to security/scanning/checkov.yml diff --git a/docker/sbom-scripts/generate-sbom.sh b/security/scanning/generate-sbom.sh similarity index 100% rename from docker/sbom-scripts/generate-sbom.sh rename to security/scanning/generate-sbom.sh diff --git a/docker/scan-scripts/scan-image.sh b/security/scanning/scan-image.sh similarity index 100% rename from docker/scan-scripts/scan-image.sh rename to security/scanning/scan-image.sh diff --git a/scripts/python/security_audit.py b/security/scanning/security_audit.py similarity index 100% rename from scripts/python/security_audit.py rename to security/scanning/security_audit.py diff --git a/docker/sign-scripts/sign-image.sh b/security/scanning/sign-image.sh similarity index 100% rename from docker/sign-scripts/sign-image.sh rename to security/scanning/sign-image.sh diff --git a/security/trivy/trivy.yaml b/security/scanning/trivy.yaml similarity index 100% rename from security/trivy/trivy.yaml rename to security/scanning/trivy.yaml