Spaces:

vimalk78
/

yaml-bert

Running

App Files Files Community

yaml-bert / README.md

vimalk78

probes: K8s-honest framing + probe-5 rename

c414c6c verified about 1 month ago

preview code

Raw

History Blame Contribute Delete

3.02 kB

A newer version of the Gradio SDK is available: 6.19.0

Upgrade

metadata

title: Yaml Bert
emoji: 💻
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.14.0
python_version: '3.13'
app_file: app.py
pinned: false
license: mit
short_description: Tree-aware transformer for structured (YAML) data

YAML-BERT

A tree-aware BERT-style encoder trained on 276K Kubernetes YAML manifests. Predicts missing structural fields and produces a document-level vector (doc_vec) that encodes manifest structure and content.

18.4M params · subword BPE vocab (8,192) · 20 epochs on an L4 GPU

What this Space offers

Three tabs:

Missing-field suggester — paste a YAML manifest; the model walks each parent level, identifies fields it expects to see but that are absent, and ranks them by confidence.
Manifest galaxy — 10,000 manifests from the training corpus embedded as doc_vecs and projected to 2D with UMAP. Kinds cluster spontaneously — the model was never told what kind is.
Structural probes — hand-crafted manifest sets that test specific structural claims: Pod ± initContainers, Service type, namespace clustering, apiVersion sensitivity, Pod vs Deployment wrapping. The 2D layout uses MDS over cosine distances so closer = more similar. You can also paste your own YAML to see where it lands.

Architecture (briefly)

Tree positional encoding: each node's position is depth + sibling_index + node_type, summed into the input vector
Bottom-up tree aggregator: combines per-key subtree vectors into doc_vec
Key Head: predicts atomic keys from [h_logical ; doc_vec ; s_parent] — local context plus whole-document context plus immediate parent subtree
Unified byte-level BPE: 8,192 subword tokens for both keys and values. Strings like web-1, web-2, web-3 decompose into shared subwords so the model can both relate and distinguish them.

What the model does well

Predicts standard Kubernetes structural fields (spec, replicas, containers.image, etc.)
Distinguishes kind-specific fields (Deployment.replicas, Service.ports)
Predicts status-side fields (status.replicas, status.conditions, status.loadBalancer.ingress)
Calibrated confidence: strong on common patterns, weaker on ambiguous positions

Known limitations

Occasionally over-confident on invalid YAML structures
Foreign-key consistency (e.g., volumeMounts[*].name must match a defined volumes[*].name) is not learned — attention doesn't perform cross-position equality checks
Trained on substratusai/the-stack-yaml-k8s; novel CRDs and rare annotations are seen but not deeply understood

YAML-BERT

What this Space offers

Architecture (briefly)

What the model does well

Known limitations

Links