| --- |
| title: Yaml Bert |
| emoji: π» |
| colorFrom: blue |
| colorTo: green |
| sdk: gradio |
| sdk_version: 6.14.0 |
| python_version: '3.13' |
| app_file: app.py |
| pinned: false |
| license: mit |
| short_description: Tree-aware transformer for structured (YAML) data |
| --- |
| |
| # YAML-BERT |
|
|
| A tree-aware BERT-style encoder trained on 276K Kubernetes YAML |
| manifests. Predicts missing structural fields and produces a |
| document-level vector (`doc_vec`) that encodes manifest structure and |
| content. |
|
|
| **18.4M params Β· subword BPE vocab (8,192) Β· 20 epochs on an L4 GPU** |
|
|
| ## What this Space offers |
|
|
| Three tabs: |
|
|
| 1. **Missing-field suggester** β paste a YAML manifest; the model |
| walks each parent level, identifies fields it expects to see but |
| that are absent, and ranks them by confidence. |
|
|
| 2. **Manifest galaxy** β 10,000 manifests from the training corpus |
| embedded as `doc_vec`s and projected to 2D with UMAP. Kinds cluster |
| spontaneously β the model was never told what `kind` is. |
|
|
| 3. **Structural probes** β hand-crafted manifest sets that test |
| specific structural claims: Pod Β± initContainers, Service type, |
| namespace clustering, apiVersion sensitivity, Pod vs Deployment |
| wrapping. The 2D layout uses MDS over cosine distances so closer = |
| more similar. You can also paste your own YAML to see where it |
| lands. |
|
|
| ## Architecture (briefly) |
|
|
| - **Tree positional encoding**: each node's position is `depth` + |
| `sibling_index` + `node_type`, summed into the input vector |
| - **Bottom-up tree aggregator**: combines per-key subtree vectors |
| into `doc_vec` |
| - **Key Head**: predicts atomic keys from |
| `[h_logical ; doc_vec ; s_parent]` β local context plus |
| whole-document context plus immediate parent subtree |
| - **Unified byte-level BPE**: 8,192 subword tokens for both keys and |
| values. Strings like `web-1`, `web-2`, `web-3` decompose into shared |
| subwords so the model can both relate and distinguish them. |
|
|
| ## What the model does well |
|
|
| - Predicts standard Kubernetes structural fields (`spec`, `replicas`, |
| `containers.image`, etc.) |
| - Distinguishes kind-specific fields (`Deployment.replicas`, |
| `Service.ports`) |
| - Predicts status-side fields (`status.replicas`, |
| `status.conditions`, `status.loadBalancer.ingress`) |
| - Calibrated confidence: strong on common patterns, weaker on |
| ambiguous positions |
|
|
| ## Known limitations |
|
|
| - Occasionally over-confident on invalid YAML structures |
| - Foreign-key consistency (e.g., `volumeMounts[*].name` must match a |
| defined `volumes[*].name`) is not learned β attention doesn't |
| perform cross-position equality checks |
| - Trained on `substratusai/the-stack-yaml-k8s`; novel CRDs and rare |
| annotations are seen but not deeply understood |
|
|
| ## Links |
|
|
| - Code & docs: <https://github.com/vimalk78/yaml-bert> |
| - Results: <https://github.com/vimalk78/yaml-bert/blob/main/docs/v9-subword-results.md> |
| - Design rationale: <https://github.com/vimalk78/yaml-bert/blob/main/docs/key-value-design-rationale.md> |
| - Architecture: <https://github.com/vimalk78/yaml-bert/blob/main/docs/architecture.md> |
|
|