yaml-bert / README.md
vimalk78's picture
probes: K8s-honest framing + probe-5 rename
c414c6c verified
|
Raw
History Blame Contribute Delete
3.02 kB

A newer version of the Gradio SDK is available: 6.19.0

Upgrade
metadata
title: Yaml Bert
emoji: 💻
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.14.0
python_version: '3.13'
app_file: app.py
pinned: false
license: mit
short_description: Tree-aware transformer for structured (YAML) data

YAML-BERT

A tree-aware BERT-style encoder trained on 276K Kubernetes YAML manifests. Predicts missing structural fields and produces a document-level vector (doc_vec) that encodes manifest structure and content.

18.4M params · subword BPE vocab (8,192) · 20 epochs on an L4 GPU

What this Space offers

Three tabs:

  1. Missing-field suggester — paste a YAML manifest; the model walks each parent level, identifies fields it expects to see but that are absent, and ranks them by confidence.

  2. Manifest galaxy — 10,000 manifests from the training corpus embedded as doc_vecs and projected to 2D with UMAP. Kinds cluster spontaneously — the model was never told what kind is.

  3. Structural probes — hand-crafted manifest sets that test specific structural claims: Pod ± initContainers, Service type, namespace clustering, apiVersion sensitivity, Pod vs Deployment wrapping. The 2D layout uses MDS over cosine distances so closer = more similar. You can also paste your own YAML to see where it lands.

Architecture (briefly)

  • Tree positional encoding: each node's position is depth + sibling_index + node_type, summed into the input vector
  • Bottom-up tree aggregator: combines per-key subtree vectors into doc_vec
  • Key Head: predicts atomic keys from [h_logical ; doc_vec ; s_parent] — local context plus whole-document context plus immediate parent subtree
  • Unified byte-level BPE: 8,192 subword tokens for both keys and values. Strings like web-1, web-2, web-3 decompose into shared subwords so the model can both relate and distinguish them.

What the model does well

  • Predicts standard Kubernetes structural fields (spec, replicas, containers.image, etc.)
  • Distinguishes kind-specific fields (Deployment.replicas, Service.ports)
  • Predicts status-side fields (status.replicas, status.conditions, status.loadBalancer.ingress)
  • Calibrated confidence: strong on common patterns, weaker on ambiguous positions

Known limitations

  • Occasionally over-confident on invalid YAML structures
  • Foreign-key consistency (e.g., volumeMounts[*].name must match a defined volumes[*].name) is not learned — attention doesn't perform cross-position equality checks
  • Trained on substratusai/the-stack-yaml-k8s; novel CRDs and rare annotations are seen but not deeply understood

Links