yaml-bert / README.md
vimalk78's picture
probes: K8s-honest framing + probe-5 rename
c414c6c verified
|
Raw
History Blame Contribute Delete
3.02 kB
---
title: Yaml Bert
emoji: πŸ’»
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.14.0
python_version: '3.13'
app_file: app.py
pinned: false
license: mit
short_description: Tree-aware transformer for structured (YAML) data
---
# YAML-BERT
A tree-aware BERT-style encoder trained on 276K Kubernetes YAML
manifests. Predicts missing structural fields and produces a
document-level vector (`doc_vec`) that encodes manifest structure and
content.
**18.4M params Β· subword BPE vocab (8,192) Β· 20 epochs on an L4 GPU**
## What this Space offers
Three tabs:
1. **Missing-field suggester** β€” paste a YAML manifest; the model
walks each parent level, identifies fields it expects to see but
that are absent, and ranks them by confidence.
2. **Manifest galaxy** β€” 10,000 manifests from the training corpus
embedded as `doc_vec`s and projected to 2D with UMAP. Kinds cluster
spontaneously β€” the model was never told what `kind` is.
3. **Structural probes** β€” hand-crafted manifest sets that test
specific structural claims: Pod Β± initContainers, Service type,
namespace clustering, apiVersion sensitivity, Pod vs Deployment
wrapping. The 2D layout uses MDS over cosine distances so closer =
more similar. You can also paste your own YAML to see where it
lands.
## Architecture (briefly)
- **Tree positional encoding**: each node's position is `depth` +
`sibling_index` + `node_type`, summed into the input vector
- **Bottom-up tree aggregator**: combines per-key subtree vectors
into `doc_vec`
- **Key Head**: predicts atomic keys from
`[h_logical ; doc_vec ; s_parent]` β€” local context plus
whole-document context plus immediate parent subtree
- **Unified byte-level BPE**: 8,192 subword tokens for both keys and
values. Strings like `web-1`, `web-2`, `web-3` decompose into shared
subwords so the model can both relate and distinguish them.
## What the model does well
- Predicts standard Kubernetes structural fields (`spec`, `replicas`,
`containers.image`, etc.)
- Distinguishes kind-specific fields (`Deployment.replicas`,
`Service.ports`)
- Predicts status-side fields (`status.replicas`,
`status.conditions`, `status.loadBalancer.ingress`)
- Calibrated confidence: strong on common patterns, weaker on
ambiguous positions
## Known limitations
- Occasionally over-confident on invalid YAML structures
- Foreign-key consistency (e.g., `volumeMounts[*].name` must match a
defined `volumes[*].name`) is not learned β€” attention doesn't
perform cross-position equality checks
- Trained on `substratusai/the-stack-yaml-k8s`; novel CRDs and rare
annotations are seen but not deeply understood
## Links
- Code & docs: <https://github.com/vimalk78/yaml-bert>
- Results: <https://github.com/vimalk78/yaml-bert/blob/main/docs/v9-subword-results.md>
- Design rationale: <https://github.com/vimalk78/yaml-bert/blob/main/docs/key-value-design-rationale.md>
- Architecture: <https://github.com/vimalk78/yaml-bert/blob/main/docs/architecture.md>