A newer version of the Gradio SDK is available: 6.19.0
title: Yaml Bert
emoji: 💻
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.14.0
python_version: '3.13'
app_file: app.py
pinned: false
license: mit
short_description: Tree-aware transformer for structured (YAML) data
YAML-BERT
A tree-aware BERT-style encoder trained on 276K Kubernetes YAML
manifests. Predicts missing structural fields and produces a
document-level vector (doc_vec) that encodes manifest structure and
content.
18.4M params · subword BPE vocab (8,192) · 20 epochs on an L4 GPU
What this Space offers
Three tabs:
Missing-field suggester — paste a YAML manifest; the model walks each parent level, identifies fields it expects to see but that are absent, and ranks them by confidence.
Manifest galaxy — 10,000 manifests from the training corpus embedded as
doc_vecs and projected to 2D with UMAP. Kinds cluster spontaneously — the model was never told whatkindis.Structural probes — hand-crafted manifest sets that test specific structural claims: Pod ± initContainers, Service type, namespace clustering, apiVersion sensitivity, Pod vs Deployment wrapping. The 2D layout uses MDS over cosine distances so closer = more similar. You can also paste your own YAML to see where it lands.
Architecture (briefly)
- Tree positional encoding: each node's position is
depth+sibling_index+node_type, summed into the input vector - Bottom-up tree aggregator: combines per-key subtree vectors
into
doc_vec - Key Head: predicts atomic keys from
[h_logical ; doc_vec ; s_parent]— local context plus whole-document context plus immediate parent subtree - Unified byte-level BPE: 8,192 subword tokens for both keys and
values. Strings like
web-1,web-2,web-3decompose into shared subwords so the model can both relate and distinguish them.
What the model does well
- Predicts standard Kubernetes structural fields (
spec,replicas,containers.image, etc.) - Distinguishes kind-specific fields (
Deployment.replicas,Service.ports) - Predicts status-side fields (
status.replicas,status.conditions,status.loadBalancer.ingress) - Calibrated confidence: strong on common patterns, weaker on ambiguous positions
Known limitations
- Occasionally over-confident on invalid YAML structures
- Foreign-key consistency (e.g.,
volumeMounts[*].namemust match a definedvolumes[*].name) is not learned — attention doesn't perform cross-position equality checks - Trained on
substratusai/the-stack-yaml-k8s; novel CRDs and rare annotations are seen but not deeply understood
Links
- Code & docs: https://github.com/vimalk78/yaml-bert
- Results: https://github.com/vimalk78/yaml-bert/blob/main/docs/v9-subword-results.md
- Design rationale: https://github.com/vimalk78/yaml-bert/blob/main/docs/key-value-design-rationale.md
- Architecture: https://github.com/vimalk78/yaml-bert/blob/main/docs/architecture.md