Spaces:

vimalk78
/

yaml-bert

Running

App Files Files Community

yaml-bert / README.md

vimalk78

probes: K8s-honest framing + probe-5 rename

c414c6c verified about 1 month ago

preview code

Raw

History Blame Contribute Delete

3.02 kB

	---
	title: Yaml Bert
	emoji: 💻
	colorFrom: blue
	colorTo: green
	sdk: gradio
	sdk_version: 6.14.0
	python_version: '3.13'
	app_file: app.py
	pinned: false
	license: mit
	short_description: Tree-aware transformer for structured (YAML) data
	---

	# YAML-BERT

	A tree-aware BERT-style encoder trained on 276K Kubernetes YAML
	manifests. Predicts missing structural fields and produces a
	document-level vector (`doc_vec`) that encodes manifest structure and
	content.

	18.4M params · subword BPE vocab (8,192) · 20 epochs on an L4 GPU

	## What this Space offers

	Three tabs:

	1. Missing-field suggester — paste a YAML manifest; the model
	walks each parent level, identifies fields it expects to see but
	that are absent, and ranks them by confidence.

	2. Manifest galaxy — 10,000 manifests from the training corpus
	embedded as `doc_vec`s and projected to 2D with UMAP. Kinds cluster
	spontaneously — the model was never told what `kind` is.

	3. Structural probes — hand-crafted manifest sets that test
	specific structural claims: Pod ± initContainers, Service type,
	namespace clustering, apiVersion sensitivity, Pod vs Deployment
	wrapping. The 2D layout uses MDS over cosine distances so closer =
	more similar. You can also paste your own YAML to see where it
	lands.

	## Architecture (briefly)

	- Tree positional encoding: each node's position is `depth` +
	`sibling_index` + `node_type`, summed into the input vector
	- Bottom-up tree aggregator: combines per-key subtree vectors
	into `doc_vec`
	- Key Head: predicts atomic keys from
	`[h_logical ; doc_vec ; s_parent]` — local context plus
	whole-document context plus immediate parent subtree
	- Unified byte-level BPE: 8,192 subword tokens for both keys and
	values. Strings like `web-1`, `web-2`, `web-3` decompose into shared
	subwords so the model can both relate and distinguish them.

	## What the model does well

	- Predicts standard Kubernetes structural fields (`spec`, `replicas`,
	`containers.image`, etc.)
	- Distinguishes kind-specific fields (`Deployment.replicas`,
	`Service.ports`)
	- Predicts status-side fields (`status.replicas`,
	`status.conditions`, `status.loadBalancer.ingress`)
	- Calibrated confidence: strong on common patterns, weaker on
	ambiguous positions

	## Known limitations

	- Occasionally over-confident on invalid YAML structures
	- Foreign-key consistency (e.g., `volumeMounts[*].name` must match a
	defined `volumes[*].name`) is not learned — attention doesn't
	perform cross-position equality checks
	- Trained on `substratusai/the-stack-yaml-k8s`; novel CRDs and rare
	annotations are seen but not deeply understood

	## Links

	- Code & docs: <https://github.com/vimalk78/yaml-bert>
	- Results: <https://github.com/vimalk78/yaml-bert/blob/main/docs/v9-subword-results.md>
	- Design rationale: <https://github.com/vimalk78/yaml-bert/blob/main/docs/key-value-design-rationale.md>
	- Architecture: <https://github.com/vimalk78/yaml-bert/blob/main/docs/architecture.md>