--- title: Yaml Bert emoji: ๐Ÿ’ป colorFrom: blue colorTo: green sdk: gradio sdk_version: 6.14.0 python_version: '3.13' app_file: app.py pinned: false license: mit short_description: Tree-aware transformer for structured (YAML) data --- # YAML-BERT A tree-aware BERT-style encoder trained on 276K Kubernetes YAML manifests. Predicts missing structural fields and produces a document-level vector (`doc_vec`) that encodes manifest structure and content. **18.4M params ยท subword BPE vocab (8,192) ยท 20 epochs on an L4 GPU** ## What this Space offers Three tabs: 1. **Missing-field suggester** โ€” paste a YAML manifest; the model walks each parent level, identifies fields it expects to see but that are absent, and ranks them by confidence. 2. **Manifest galaxy** โ€” 10,000 manifests from the training corpus embedded as `doc_vec`s and projected to 2D with UMAP. Kinds cluster spontaneously โ€” the model was never told what `kind` is. 3. **Structural probes** โ€” hand-crafted manifest sets that test specific structural claims: Pod ยฑ initContainers, Service type, namespace clustering, apiVersion sensitivity, Pod vs Deployment wrapping. The 2D layout uses MDS over cosine distances so closer = more similar. You can also paste your own YAML to see where it lands. ## Architecture (briefly) - **Tree positional encoding**: each node's position is `depth` + `sibling_index` + `node_type`, summed into the input vector - **Bottom-up tree aggregator**: combines per-key subtree vectors into `doc_vec` - **Key Head**: predicts atomic keys from `[h_logical ; doc_vec ; s_parent]` โ€” local context plus whole-document context plus immediate parent subtree - **Unified byte-level BPE**: 8,192 subword tokens for both keys and values. Strings like `web-1`, `web-2`, `web-3` decompose into shared subwords so the model can both relate and distinguish them. ## What the model does well - Predicts standard Kubernetes structural fields (`spec`, `replicas`, `containers.image`, etc.) - Distinguishes kind-specific fields (`Deployment.replicas`, `Service.ports`) - Predicts status-side fields (`status.replicas`, `status.conditions`, `status.loadBalancer.ingress`) - Calibrated confidence: strong on common patterns, weaker on ambiguous positions ## Known limitations - Occasionally over-confident on invalid YAML structures - Foreign-key consistency (e.g., `volumeMounts[*].name` must match a defined `volumes[*].name`) is not learned โ€” attention doesn't perform cross-position equality checks - Trained on `substratusai/the-stack-yaml-k8s`; novel CRDs and rare annotations are seen but not deeply understood ## Links - Code & docs: - Results: - Design rationale: - Architecture: