feat: K8s pilot corpus — 8 pages + config entry + JSON rewrite
Land the minimal K8s corpus needed to run the 6-question pilot:
- Fetch 8 pages from kubernetes.io/docs (pods, deployment,
replicaset, configmap, secret, node-pressure-eviction,
network-policies, pod-security-admission) via defuddle. Flat
`k8s_*.md` naming mirrors the existing `fastapi_*.md` precedent.
- Expand `corpora.k8s` in `configs/default.yaml`: flip
`available: true`, add `golden_dataset` pointer, drop
`refusal_threshold` from 0.30 placeholder to 0.02 for the pilot
smoke test. Two-line inline comment preserves the 0.30
launch-intent rationale pending the full tuning sweep.
- Rewrite `k8s_golden_pilot.json` `expected_sources` from
path-style (`concepts/workloads/pods`) to filename stems
(`k8s_pods.md`) so the exact-string match in
`retrieval_precision_at_k` works. `source_pages` stays as the
human-readable path anchor.
- Fix three `source_snippets` that drifted from the live page text:
pilot_002 (deployment rollout sentence paraphrase), pilot_003
(secret snippet now link-free substring), pilot_006 (added
backticks to match fetched markdown).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
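
Review note: the `expected_sources` rewrite matters because the metric compares strings exactly. A minimal sketch of that behaviour, assuming a hypothetical signature for `retrieval_precision_at_k` (the real function in `agent_bench` may differ) and assuming retrieved sources are reported as filenames:

```python
def retrieval_precision_at_k(retrieved, expected, k=5):
    """Fraction of the top-k retrieved sources that exactly match an expected source.

    Hypothetical signature for illustration; not copied from agent_bench.
    """
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    expected_set = set(expected)
    hits = sum(1 for source in top_k if source in expected_set)
    return hits / len(top_k)


# Made-up retrieval result: the retriever reports chunk sources as filenames.
retrieved = ["k8s_pods.md", "k8s_deployment.md"]

# Path-style labels never equal a filename, so the score is stuck at zero.
path_style = retrieval_precision_at_k(retrieved, ["concepts/workloads/pods"])

# Filename labels can match exactly.
stem_style = retrieval_precision_at_k(retrieved, ["k8s_pods.md"])
```

Under these assumptions the path-style label scores 0.0 while the filename label scores 0.5, which is why `source_pages` can keep the human-readable path but `expected_sources` cannot.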
- agent_bench/evaluation/datasets/k8s_golden_pilot.json +11 -11
- configs/default.yaml +5 -10
- data/k8s_docs/k8s_configmap.md +281 -0
- data/k8s_docs/k8s_deployment.md +1092 -0
- data/k8s_docs/k8s_network_policies.md +416 -0
- data/k8s_docs/k8s_node_pressure_eviction.md +339 -0
- data/k8s_docs/k8s_pod_security_admission.md +93 -0
- data/k8s_docs/k8s_pods.md +305 -0
- data/k8s_docs/k8s_replicaset.md +399 -0
- data/k8s_docs/k8s_secret.md +549 -0
|
@@ -12,7 +12,7 @@
|
|
| 12 |
"id": "k8s_pilot_001",
|
| 13 |
"question": "In Kubernetes, does each Pod receive its own IP address, and how do containers inside the same Pod talk to each other?",
|
| 14 |
"expected_answer_keywords": ["unique", "IP address", "shared", "localhost"],
|
| 15 |
-
"expected_sources": ["
|
| 16 |
"category": "retrieval",
|
| 17 |
"difficulty": "easy",
|
| 18 |
"requires_calculator": false,
|
|
@@ -31,8 +31,8 @@
|
|
| 31 |
"question": "When you update a Deployment's pod template, what mechanism does Kubernetes use to transition Pods from the old version to the new one, and what role does the ReplicaSet play?",
|
| 32 |
"expected_answer_keywords": ["ReplicaSet", "new ReplicaSet", "old ReplicaSet", "controlled rate", "replicas", "selector"],
|
| 33 |
"expected_sources": [
|
| 34 |
-
"
|
| 35 |
-
"
|
| 36 |
],
|
| 37 |
"category": "retrieval",
|
| 38 |
"difficulty": "hard",
|
|
@@ -42,7 +42,7 @@
|
|
| 42 |
"is_multi_hop": true,
|
| 43 |
"source_chunk_ids": [],
|
| 44 |
"source_snippets": [
|
| 45 |
-
"A new ReplicaSet is created and the Deployment
|
| 46 |
"A ReplicaSet is defined with fields, including a selector that specifies how to identify Pods it can acquire, a number of replicas indicating how many Pods it should be maintaining"
|
| 47 |
],
|
| 48 |
"source_pages": [
|
|
@@ -56,8 +56,8 @@
|
|
| 56 |
"question": "What is the key difference between a ConfigMap and a Secret when deciding where to store sensitive application data like database passwords?",
|
| 57 |
"expected_answer_keywords": ["non-confidential", "confidential", "Secret", "ConfigMap", "encryption", "etcd"],
|
| 58 |
"expected_sources": [
|
| 59 |
-
"
|
| 60 |
-
"
|
| 61 |
],
|
| 62 |
"category": "retrieval",
|
| 63 |
"difficulty": "medium",
|
|
@@ -68,7 +68,7 @@
|
|
| 68 |
"source_chunk_ids": [],
|
| 69 |
"source_snippets": [
|
| 70 |
"A ConfigMap is an API object used to store non-confidential data in key-value pairs",
|
| 71 |
-
"
|
| 72 |
],
|
| 73 |
"source_pages": [
|
| 74 |
"concepts/configuration/configmap",
|
|
@@ -80,7 +80,7 @@
|
|
| 80 |
"id": "k8s_pilot_004",
|
| 81 |
"question": "If I set a custom value for one hard eviction threshold on the kubelet (e.g., memory.available) but leave the other thresholds unset, what happens to the defaults for the thresholds I didn't override?",
|
| 82 |
"expected_answer_keywords": ["zero", "default", "not inherited", "custom", "all thresholds", "explicit"],
|
| 83 |
-
"expected_sources": ["
|
| 84 |
"category": "retrieval",
|
| 85 |
"difficulty": "hard",
|
| 86 |
"requires_calculator": false,
|
|
@@ -98,7 +98,7 @@
|
|
| 98 |
"id": "k8s_pilot_005",
|
| 99 |
"question": "How do I configure a Kubernetes NetworkPolicy to enforce mutual TLS (mTLS) between Pods in the same namespace?",
|
| 100 |
"expected_answer_keywords": ["not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller"],
|
| 101 |
-
"expected_sources": ["
|
| 102 |
"category": "retrieval",
|
| 103 |
"difficulty": "medium",
|
| 104 |
"requires_calculator": false,
|
|
@@ -116,7 +116,7 @@
|
|
| 116 |
"id": "k8s_pilot_006",
|
| 117 |
"question": "As of the Kubernetes v1.31 snapshot, what is the feature state (alpha, beta, or stable) of the built-in Pod Security admission controller, and in which version did it reach that state?",
|
| 118 |
"expected_answer_keywords": ["stable", "v1.25", "Pod Security", "admission controller"],
|
| 119 |
-
"expected_sources": ["
|
| 120 |
"category": "retrieval",
|
| 121 |
"difficulty": "easy",
|
| 122 |
"requires_calculator": false,
|
|
@@ -125,7 +125,7 @@
|
|
| 125 |
"is_multi_hop": false,
|
| 126 |
"source_chunk_ids": [],
|
| 127 |
"source_snippets": [
|
| 128 |
-
"FEATURE STATE: Kubernetes v1.25 [stable]"
|
| 129 |
],
|
| 130 |
"source_pages": ["concepts/security/pod-security-admission"],
|
| 131 |
"source_sections": [""]
|
|
|
|
| 12 |
"id": "k8s_pilot_001",
|
| 13 |
"question": "In Kubernetes, does each Pod receive its own IP address, and how do containers inside the same Pod talk to each other?",
|
| 14 |
"expected_answer_keywords": ["unique", "IP address", "shared", "localhost"],
|
| 15 |
+
"expected_sources": ["k8s_pods.md"],
|
| 16 |
"category": "retrieval",
|
| 17 |
"difficulty": "easy",
|
| 18 |
"requires_calculator": false,
|
|
|
|
| 31 |
"question": "When you update a Deployment's pod template, what mechanism does Kubernetes use to transition Pods from the old version to the new one, and what role does the ReplicaSet play?",
|
| 32 |
"expected_answer_keywords": ["ReplicaSet", "new ReplicaSet", "old ReplicaSet", "controlled rate", "replicas", "selector"],
|
| 33 |
"expected_sources": [
|
| 34 |
+
"k8s_deployment.md",
|
| 35 |
+
"k8s_replicaset.md"
|
| 36 |
],
|
| 37 |
"category": "retrieval",
|
| 38 |
"difficulty": "hard",
|
|
|
|
| 42 |
"is_multi_hop": true,
|
| 43 |
"source_chunk_ids": [],
|
| 44 |
"source_snippets": [
|
| 45 |
+
"A new ReplicaSet is created, and the Deployment gradually scales it up while scaling down the old ReplicaSet, ensuring Pods are replaced at a controlled rate",
|
| 46 |
"A ReplicaSet is defined with fields, including a selector that specifies how to identify Pods it can acquire, a number of replicas indicating how many Pods it should be maintaining"
|
| 47 |
],
|
| 48 |
"source_pages": [
|
|
|
|
| 56 |
"question": "What is the key difference between a ConfigMap and a Secret when deciding where to store sensitive application data like database passwords?",
|
| 57 |
"expected_answer_keywords": ["non-confidential", "confidential", "Secret", "ConfigMap", "encryption", "etcd"],
|
| 58 |
"expected_sources": [
|
| 59 |
+
"k8s_configmap.md",
|
| 60 |
+
"k8s_secret.md"
|
| 61 |
],
|
| 62 |
"category": "retrieval",
|
| 63 |
"difficulty": "medium",
|
|
|
|
| 68 |
"source_chunk_ids": [],
|
| 69 |
"source_snippets": [
|
| 70 |
"A ConfigMap is an API object used to store non-confidential data in key-value pairs",
|
| 71 |
+
"specifically intended to hold confidential data"
|
| 72 |
],
|
| 73 |
"source_pages": [
|
| 74 |
"concepts/configuration/configmap",
|
|
|
|
| 80 |
"id": "k8s_pilot_004",
|
| 81 |
"question": "If I set a custom value for one hard eviction threshold on the kubelet (e.g., memory.available) but leave the other thresholds unset, what happens to the defaults for the thresholds I didn't override?",
|
| 82 |
"expected_answer_keywords": ["zero", "default", "not inherited", "custom", "all thresholds", "explicit"],
|
| 83 |
+
"expected_sources": ["k8s_node_pressure_eviction.md"],
|
| 84 |
"category": "retrieval",
|
| 85 |
"difficulty": "hard",
|
| 86 |
"requires_calculator": false,
|
|
|
|
| 98 |
"id": "k8s_pilot_005",
|
| 99 |
"question": "How do I configure a Kubernetes NetworkPolicy to enforce mutual TLS (mTLS) between Pods in the same namespace?",
|
| 100 |
"expected_answer_keywords": ["not", "does not", "NetworkPolicy", "service mesh", "TLS", "ingress controller"],
|
| 101 |
+
"expected_sources": ["k8s_network_policies.md"],
|
| 102 |
"category": "retrieval",
|
| 103 |
"difficulty": "medium",
|
| 104 |
"requires_calculator": false,
|
|
|
|
| 116 |
"id": "k8s_pilot_006",
|
| 117 |
"question": "As of the Kubernetes v1.31 snapshot, what is the feature state (alpha, beta, or stable) of the built-in Pod Security admission controller, and in which version did it reach that state?",
|
| 118 |
"expected_answer_keywords": ["stable", "v1.25", "Pod Security", "admission controller"],
|
| 119 |
+
"expected_sources": ["k8s_pod_security_admission.md"],
|
| 120 |
"category": "retrieval",
|
| 121 |
"difficulty": "easy",
|
| 122 |
"requires_calculator": false,
|
|
|
|
| 125 |
"is_multi_hop": false,
|
| 126 |
"source_chunk_ids": [],
|
| 127 |
"source_snippets": [
|
| 128 |
+
"FEATURE STATE: `Kubernetes v1.25 [stable]`"
|
| 129 |
],
|
| 130 |
"source_pages": ["concepts/security/pod-security-admission"],
|
| 131 |
"source_sections": [""]
|
|
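
Review note: the `source_snippets` fixes in this file all come down to verbatim substring matching against the fetched markdown. A small sketch of such a drift check, assuming snippets are meant to occur verbatim in the page text; `missing_snippets` and the one-line `page_text` excerpt are hypothetical stand-ins, not code or data from the repository:

```python
def missing_snippets(page_text, snippets):
    """Return the snippets that do not occur verbatim in the page text."""
    return [snippet for snippet in snippets if snippet not in page_text]


# Hypothetical excerpt standing in for the fetched pod-security-admission page.
page_text = "FEATURE STATE: `Kubernetes v1.25 [stable]`"

old_snippet = "FEATURE STATE: Kubernetes v1.25 [stable]"
new_snippet = "FEATURE STATE: `Kubernetes v1.25 [stable]`"

# Only the backtick-free variant fails the verbatim check.
drifted = missing_snippets(page_text, [old_snippet, new_snippet])
```

Running a check like this over every golden entry after each re-fetch would surface drift before it shows up as a scoring regression.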
@@ -103,15 +103,10 @@ corpora:
|
|
| 103 |
label: "Kubernetes"
|
| 104 |
store_path: .cache/store_k8s
|
| 105 |
data_path: data/k8s_docs
|
| 106 |
-
#
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
refusal_threshold: 0.30
|
| 110 |
top_k: 5
|
| 111 |
max_iterations: 3
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
# startup. Flip to true after data/k8s_docs/ is curated and
|
| 115 |
-
# `make ingest-k8s` has built .cache/store_k8s. See
|
| 116 |
-
# data/k8s_docs/SOURCES.md for the curation policy.
|
| 117 |
-
available: false
|
|
|
|
| 103 |
label: "Kubernetes"
|
| 104 |
store_path: .cache/store_k8s
|
| 105 |
data_path: data/k8s_docs
|
| 106 |
+
refusal_threshold: 0.02 # PILOT: matches fastapi working value for 6-pilot smoke test.
|
| 107 |
+
# 0.30 placeholder remains the launch-intent; full tuning sweep
|
| 108 |
+
# lands with the 25-question golden set (see DECISIONS.md).
|
|
|
|
| 109 |
top_k: 5
|
| 110 |
max_iterations: 3
|
| 111 |
+
golden_dataset: agent_bench/evaluation/datasets/k8s_golden_pilot.json
|
| 112 |
+
available: true
|
|
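
Review note: the diff does not show how `refusal_threshold` is consumed. One plausible reading, sketched below purely as an assumption, is that the agent refuses when the best retrieval score falls below the threshold. `should_refuse` and the scores are hypothetical:

```python
def should_refuse(retrieval_scores, refusal_threshold):
    """Refuse when no retrieved chunk scores at or above the threshold.

    Assumed semantics for illustration; the real gating logic may differ.
    """
    best_score = max(retrieval_scores, default=0.0)
    return best_score < refusal_threshold


# Made-up scores for a question the corpus can answer.
scores = [0.11, 0.07, 0.03]

refuses_at_pilot_value = should_refuse(scores, 0.02)
refuses_at_launch_value = should_refuse(scores, 0.30)
```

If the scores on this corpus sit in a range like the made-up one above, the 0.30 placeholder would refuse every pilot question, which would explain dropping to 0.02 until the tuning sweep lands.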
@@ -0,0 +1,281 @@
|
|
| 1 |
+
A ConfigMap is an API object used to store non-confidential data in key-value pairs. [Pods](https://kubernetes.io/docs/concepts/workloads/pods/ "A Pod represents a set of running containers in your cluster.") can consume ConfigMaps as environment variables, command-line arguments, or as configuration files in a [volume](https://kubernetes.io/docs/concepts/storage/volumes/ "A directory containing data, accessible to the containers in a pod.").
|
| 2 |
+
|
| 3 |
+
A ConfigMap allows you to decouple environment-specific configuration from your [container images](https://kubernetes.io/docs/reference/glossary/?all=true#term-image "Stored instance of a container that holds a set of software needed to run an application."), so that your applications are easily portable.
|
| 4 |
+
|
| 5 |
+
> [!caution] Caution:
|
| 6 |
+
> ConfigMap does not provide secrecy or encryption. If the data you want to store are confidential, use a [Secret](https://kubernetes.io/docs/concepts/configuration/secret/ "Stores sensitive information, such as passwords, OAuth tokens, and ssh keys.") rather than a ConfigMap, or use additional (third party) tools to keep your data private.
|
| 7 |
+
|
| 8 |
+
## Motivation
|
| 9 |
+
|
| 10 |
+
Use a ConfigMap for setting configuration data separately from application code.
|
| 11 |
+
|
| 12 |
+
For example, imagine that you are developing an application that you can run on your own computer (for development) and in the cloud (to handle real traffic). You write the code to look in an environment variable named `DATABASE_HOST`. Locally, you set that variable to `localhost`. In the cloud, you set it to refer to a Kubernetes [Service](https://kubernetes.io/docs/concepts/services-networking/service/ "A way to expose an application running on a set of Pods as a network service.") that exposes the database component to your cluster. This lets you fetch a container image running in the cloud and debug the exact same code locally if needed.
|
| 13 |
+
|
| 14 |
+
> [!info] Note:
|
| 15 |
+
> A ConfigMap is not designed to hold large chunks of data. The data stored in a ConfigMap cannot exceed 1 MiB. If you need to store settings that are larger than this limit, you may want to consider mounting a volume or use a separate database or file service.
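
A rough pre-flight check for the 1 MiB limit can be sketched as follows. It assumes the limit applies to the combined byte length of keys and values; the API server's exact accounting may differ, and the helper names are hypothetical:

```python
MAX_CONFIGMAP_BYTES = 1024 * 1024  # 1 MiB


def configmap_data_size(data):
    """Approximate size in bytes: UTF-8 length of every key plus every value."""
    return sum(
        len(key.encode("utf-8")) + len(value.encode("utf-8"))
        for key, value in data.items()
    )


def fits_in_configmap(data):
    """True when the approximate size is within the 1 MiB limit."""
    return configmap_data_size(data) <= MAX_CONFIGMAP_BYTES


small = {"DATABASE_HOST": "localhost"}
too_big = {"blob": "x" * MAX_CONFIGMAP_BYTES}

small_ok = fits_in_configmap(small)
too_big_ok = fits_in_configmap(too_big)
```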
|
| 16 |
+
|
| 17 |
+
## ConfigMap object
|
| 18 |
+
|
| 19 |
+
A ConfigMap is an [API object](https://kubernetes.io/docs/concepts/overview/working-with-objects/#kubernetes-objects "An entity in the Kubernetes system, representing part of the state of your cluster.") that lets you store configuration for other objects to use. Unlike most Kubernetes objects that have a `spec`, a ConfigMap has `data` and `binaryData` fields. These fields accept key-value pairs as their values. Both the `data` field and the `binaryData` are optional. The `data` field is designed to contain UTF-8 strings while the `binaryData` field is designed to contain binary data as base64-encoded strings.
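
The split between `data` and `binaryData` can be illustrated with a short sketch that builds a manifest as a plain dictionary. `build_configmap` and the example keys are hypothetical; only the field names come from the text above:

```python
import base64
import json


def build_configmap(name, text_items, binary_items):
    """Build a ConfigMap manifest dict: text goes in data, bytes in binaryData."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        # data holds UTF-8 strings as-is.
        "data": dict(text_items),
        # binaryData holds raw bytes encoded as base64 strings.
        "binaryData": {
            key: base64.b64encode(value).decode("ascii")
            for key, value in binary_items.items()
        },
    }


manifest = build_configmap(
    "game-demo",
    text_items={"player_initial_lives": "3"},
    binary_items={"logo.bin": b"\x00\x01\x02"},
)
manifest_json = json.dumps(manifest, indent=2)
```

The resulting JSON could be passed to `kubectl apply -f -`, since kubectl accepts JSON manifests as well as YAML.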
|
| 20 |
+
|
| 21 |
+
The name of a ConfigMap must be a valid [DNS subdomain name](https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-subdomain-names).
|
| 22 |
+
|
| 23 |
+
Each key under the `data` or the `binaryData` field must consist of alphanumeric characters, `-`, `_` or `.`. The keys stored in `data` must not overlap with the keys in the `binaryData` field.
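
The two key rules above (allowed characters, no overlap between `data` and `binaryData`) can be checked client-side with a sketch like this. The regular expression is a direct transcription of the character set in the sentence above; the function name is hypothetical:

```python
import re

# Alphanumeric characters, '-', '_' or '.', as stated in the text.
KEY_PATTERN = re.compile(r"[-._a-zA-Z0-9]+")


def check_keys(data, binary_data):
    """Return (keys with disallowed characters, keys present in both fields)."""
    all_keys = list(data) + list(binary_data)
    bad = [key for key in all_keys if not KEY_PATTERN.fullmatch(key)]
    overlap = sorted(set(data) & set(binary_data))
    return bad, overlap


bad, overlap = check_keys(
    data={"game.properties": "lives=3", "bad key": "oops"},
    binary_data={"game.properties": "eA=="},
)
```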
|
| 24 |
+
|
| 25 |
+
Starting from v1.19, you can add an `immutable` field to a ConfigMap definition to create an [immutable ConfigMap](#configmap-immutable).
|
| 26 |
+
|
| 27 |
+
## ConfigMaps and Pods
|
| 28 |
+
|
| 29 |
+
You can write a Pod `spec` that refers to a ConfigMap and configures the container(s) in that Pod based on the data in the ConfigMap. The Pod and the ConfigMap must be in the same [namespace](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces "An abstraction used by Kubernetes to support isolation of groups of resources within a single cluster.").
|
| 30 |
+
|
| 31 |
+
> [!info] Note:
|
| 32 |
+
> The `spec` of a [static Pod](https://kubernetes.io/docs/tasks/configure-pod-container/static-pod/ "A pod managed directly by the kubelet daemon on a specific node.") cannot refer to a ConfigMap or any other API objects.
|
| 33 |
+
|
| 34 |
+
Here's an example ConfigMap that has some keys with single values, and other keys where the value looks like a fragment of a configuration format.
|
| 35 |
+
|
| 36 |
+
```yaml
|
| 37 |
+
apiVersion: v1
|
| 38 |
+
kind: ConfigMap
|
| 39 |
+
metadata:
|
| 40 |
+
name: game-demo
|
| 41 |
+
data:
|
| 42 |
+
# property-like keys; each key maps to a simple value
|
| 43 |
+
player_initial_lives: "3"
|
| 44 |
+
ui_properties_file_name: "user-interface.properties"
|
| 45 |
+
|
| 46 |
+
# file-like keys
|
| 47 |
+
game.properties: |
|
| 48 |
+
enemy.types=aliens,monsters
|
| 49 |
+
player.maximum-lives=5
|
| 50 |
+
user-interface.properties: |
|
| 51 |
+
color.good=purple
|
| 52 |
+
color.bad=yellow
|
| 53 |
+
allow.textmode=true
|
| 54 |
+
```
|
| 55 |
+
|
| 56 |
+
There are four different ways that you can use a ConfigMap to configure a container inside a Pod:
|
| 57 |
+
|
| 58 |
+
1. Inside a container command and args
|
| 59 |
+
2. Environment variables for a container
|
| 60 |
+
3. Add a file in read-only volume, for the application to read
|
| 61 |
+
4. Write code to run inside the Pod that uses the Kubernetes API to read a ConfigMap
|
| 62 |
+
|
| 63 |
+
These different methods lend themselves to different ways of modeling the data being consumed. For the first three methods, the [kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet "An agent that runs on each node in the cluster. It makes sure that containers are running in a pod.") uses the data from the ConfigMap when it launches container(s) for a Pod.
|
| 64 |
+
|
| 65 |
+
The fourth method means you have to write code to read the ConfigMap and its data. However, because you're using the Kubernetes API directly, your application can subscribe to get updates whenever the ConfigMap changes, and react when that happens. By accessing the Kubernetes API directly, this technique also lets you access a ConfigMap in a different namespace.
|
| 66 |
+
|
| 67 |
+
Here's an example Pod that uses values from `game-demo` to configure a Pod:
|
| 68 |
+
|
| 69 |
+
```yaml
|
| 70 |
+
apiVersion: v1
|
| 71 |
+
kind: Pod
|
| 72 |
+
metadata:
|
| 73 |
+
name: configmap-demo-pod
|
| 74 |
+
spec:
|
| 75 |
+
containers:
|
| 76 |
+
- name: demo
|
| 77 |
+
image: alpine
|
| 78 |
+
command: ["sleep", "3600"]
|
| 79 |
+
env:
|
| 80 |
+
# Define the environment variable
|
| 81 |
+
- name: PLAYER_INITIAL_LIVES # Notice that the case is different here
|
| 82 |
+
# from the key name in the ConfigMap.
|
| 83 |
+
valueFrom:
|
| 84 |
+
configMapKeyRef:
|
| 85 |
+
name: game-demo # The ConfigMap this value comes from.
|
| 86 |
+
key: player_initial_lives # The key to fetch.
|
| 87 |
+
- name: UI_PROPERTIES_FILE_NAME
|
| 88 |
+
valueFrom:
|
| 89 |
+
configMapKeyRef:
|
| 90 |
+
name: game-demo
|
| 91 |
+
key: ui_properties_file_name
|
| 92 |
+
volumeMounts:
|
| 93 |
+
- name: config
|
| 94 |
+
mountPath: "/config"
|
| 95 |
+
readOnly: true
|
| 96 |
+
volumes:
|
| 97 |
+
# You set volumes at the Pod level, then mount them into containers inside that Pod
|
| 98 |
+
- name: config
|
| 99 |
+
configMap:
|
| 100 |
+
# Provide the name of the ConfigMap you want to mount.
|
| 101 |
+
name: game-demo
|
| 102 |
+
# An array of keys from the ConfigMap to create as files
|
| 103 |
+
items:
|
| 104 |
+
- key: "game.properties"
|
| 105 |
+
path: "game.properties"
|
| 106 |
+
- key: "user-interface.properties"
|
| 107 |
+
path: "user-interface.properties"
|
| 108 |
+
```
|
| 109 |
+
|
| 110 |
+
A ConfigMap doesn't differentiate between single line property values and multi-line file-like values. What matters is how Pods and other objects consume those values.
|
| 111 |
+
|
| 112 |
+
For this example, defining a volume and mounting it inside the `demo` container as `/config` creates two files, `/config/game.properties` and `/config/user-interface.properties`, even though there are four keys in the ConfigMap. This is because the Pod definition specifies an `items` array in the `volumes` section. If you omit the `items` array entirely, every key in the ConfigMap becomes a file with the same name as the key, and you get 4 files.
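
The effect of the `items` array can be sketched as a small function over the `game-demo` keys. This is a simplified model of the projection, not kubelet code, and it treats a reference to a missing key as an error:

```python
def projected_files(configmap_keys, mount_path, items=None):
    """List the file paths a ConfigMap volume would expose under mount_path."""
    if items is None:
        # No items array: every key becomes a file named after the key.
        return [f"{mount_path}/{key}" for key in configmap_keys]
    files = []
    for item in items:
        if item["key"] not in configmap_keys:
            # Simplification: a reference to a missing key is an error here.
            raise KeyError(item["key"])
        files.append(f"{mount_path}/{item['path']}")
    return files


game_demo_keys = [
    "player_initial_lives",
    "ui_properties_file_name",
    "game.properties",
    "user-interface.properties",
]

selected = projected_files(
    game_demo_keys,
    "/config",
    items=[
        {"key": "game.properties", "path": "game.properties"},
        {"key": "user-interface.properties", "path": "user-interface.properties"},
    ],
)
everything = projected_files(game_demo_keys, "/config")
```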
|
| 113 |
+
|
| 114 |
+
## Using ConfigMaps
|
| 115 |
+
|
| 116 |
+
ConfigMaps can be mounted as data volumes. ConfigMaps can also be used by other parts of the system, without being directly exposed to the Pod. For example, ConfigMaps can hold data that other parts of the system should use for configuration.
|
| 117 |
+
|
| 118 |
+
The most common way to use ConfigMaps is to configure settings for containers running in a Pod in the same namespace. You can also use a ConfigMap separately.
|
| 119 |
+
|
| 120 |
+
For example, you might encounter [addons](https://kubernetes.io/docs/concepts/cluster-administration/addons/ "Resources that extend the functionality of Kubernetes.") or [operators](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/ "A specialized controller used to manage a custom resource") that adjust their behavior based on a ConfigMap.
|
| 121 |
+
|
| 122 |
+
### Using ConfigMaps as files from a Pod
|
| 123 |
+
|
| 124 |
+
To consume a ConfigMap in a volume in a Pod:
|
| 125 |
+
|
| 126 |
+
1. Create a ConfigMap or use an existing one. Multiple Pods can reference the same ConfigMap.
|
| 127 |
+
2. Modify your Pod definition to add a volume under `.spec.volumes[]`. Name the volume anything, and have a `.spec.volumes[].configMap.name` field set to reference your ConfigMap object.
|
| 128 |
+
3. Add a `.spec.containers[].volumeMounts[]` to each container that needs the ConfigMap. Specify `.spec.containers[].volumeMounts[].readOnly = true` and `.spec.containers[].volumeMounts[].mountPath` to an unused directory name where you would like the ConfigMap to appear.
|
| 129 |
+
4. Modify your image or command line so that the program looks for files in that directory. Each key in the ConfigMap `data` map becomes the filename under `mountPath`.
|
| 130 |
+
|
| 131 |
+
This is an example of a Pod that mounts a ConfigMap in a volume:
|
| 132 |
+
|
| 133 |
+
```yaml
|
| 134 |
+
apiVersion: v1
|
| 135 |
+
kind: Pod
|
| 136 |
+
metadata:
|
| 137 |
+
name: mypod
|
| 138 |
+
spec:
|
| 139 |
+
containers:
|
| 140 |
+
- name: mypod
|
| 141 |
+
image: redis
|
| 142 |
+
volumeMounts:
|
| 143 |
+
- name: foo
|
| 144 |
+
mountPath: "/etc/foo"
|
| 145 |
+
readOnly: true
|
| 146 |
+
volumes:
|
| 147 |
+
- name: foo
|
| 148 |
+
configMap:
|
| 149 |
+
name: myconfigmap
|
| 150 |
+
```
|
| 151 |
+
|
| 152 |
+
Each ConfigMap you want to use needs to be referred to in `.spec.volumes`.
|
| 153 |
+
|
| 154 |
+
If there are multiple containers in the Pod, then each container needs its own `volumeMounts` block, but only one `.spec.volumes` is needed per ConfigMap.
|
| 155 |
+
|
| 156 |
+
#### Mounted ConfigMaps are updated automatically
|
| 157 |
+
|
| 158 |
+
When a ConfigMap currently consumed in a volume is updated, projected keys are eventually updated as well. The kubelet checks whether the mounted ConfigMap is fresh on every periodic sync. However, the kubelet uses its local cache for getting the current value of the ConfigMap. The type of the cache is configurable using the `configMapAndSecretChangeDetectionStrategy` field in the [KubeletConfiguration struct](https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/). A ConfigMap can be either propagated by watch (default), ttl-based, or by redirecting all requests directly to the API server. As a result, the total delay from the moment when the ConfigMap is updated to the moment when new keys are projected to the Pod can be as long as the kubelet sync period + cache propagation delay, where the cache propagation delay depends on the chosen cache type (it equals to watch propagation delay, ttl of cache, or zero correspondingly).
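
The delay bound described above is simple arithmetic, sketched here. The strategy names `Watch`, `Cache`, and `Get` are assumed to be the values accepted by `configMapAndSecretChangeDetectionStrategy`, and every number is made up for illustration:

```python
def worst_case_update_delay(sync_period_s, strategy, watch_delay_s=0.0, cache_ttl_s=0.0):
    """Upper bound in seconds: kubelet sync period plus cache propagation delay."""
    propagation_delay = {
        "Watch": watch_delay_s,  # watch propagation delay
        "Cache": cache_ttl_s,    # TTL of the kubelet's local cache
        "Get": 0.0,              # every request goes straight to the API server
    }[strategy]
    return sync_period_s + propagation_delay


# Made-up values: 60 s sync period, 30 s cache TTL, 1 s watch delay.
ttl_bound = worst_case_update_delay(60.0, "Cache", cache_ttl_s=30.0)
get_bound = worst_case_update_delay(60.0, "Get")
watch_bound = worst_case_update_delay(60.0, "Watch", watch_delay_s=1.0)
```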
|
| 159 |
+
|
| 160 |
+
ConfigMaps consumed as environment variables are not updated automatically and require a pod restart.
|
| 161 |
+
|
| 162 |
+
> [!info] Note:
|
| 163 |
+
> A container using a ConfigMap as a [subPath](https://kubernetes.io/docs/concepts/storage/volumes/#using-subpath) volume mount will not receive ConfigMap updates.
|
| 164 |
+
|
| 165 |
+
### Using Configmaps as environment variables
|
| 166 |
+
|
| 167 |
+
To use a Configmap in an [environment variable](https://kubernetes.io/docs/concepts/containers/container-environment/ "Container environment variables are name=value pairs that provide useful information into containers running in a Pod.") in a Pod:
|
| 168 |
+
|
| 169 |
+
1. For each container in your Pod specification, add an environment variable for each Configmap key that you want to use to the `env[].valueFrom.configMapKeyRef` field.
|
| 170 |
+
2. Modify your image and/or command line so that the program looks for values in the specified environment variables.
|
| 171 |
+
|
| 172 |
+
This is an example of defining a ConfigMap as a pod environment variable:
|
| 173 |
+
|
| 174 |
+
The following ConfigMap (myconfigmap.yaml) stores two properties: username and access\_level:
|
| 175 |
+
|
| 176 |
+
```yaml
|
| 177 |
+
apiVersion: v1
|
| 178 |
+
kind: ConfigMap
|
| 179 |
+
metadata:
|
| 180 |
+
name: myconfigmap
|
| 181 |
+
data:
|
| 182 |
+
username: k8s-admin
|
| 183 |
+
access_level: "1"
|
| 184 |
+
```
|
| 185 |
+
|
| 186 |
+
The following command will create the ConfigMap object:
|
| 187 |
+
|
| 188 |
+
```shell
|
| 189 |
+
kubectl apply -f myconfigmap.yaml
|
| 190 |
+
```
|
| 191 |
+
|
| 192 |
+
The following Pod consumes the content of the ConfigMap as environment variables:
|
| 193 |
+
|
| 194 |
+
```yaml
|
| 195 |
+
apiVersion: v1
|
| 196 |
+
kind: Pod
|
| 197 |
+
metadata:
|
| 198 |
+
name: env-configmap
|
| 199 |
+
spec:
|
| 200 |
+
containers:
|
| 201 |
+
- name: app
|
| 202 |
+
command: ["/bin/sh", "-c", "printenv"]
|
| 203 |
+
image: busybox:latest
|
| 204 |
+
envFrom:
|
| 205 |
+
- configMapRef:
|
| 206 |
+
name: myconfigmap
|
| 207 |
+
```
|
| 208 |
+
|
| 209 |
+
The `envFrom` field instructs Kubernetes to create environment variables from the sources nested within it. The inner `configMapRef` refers to a ConfigMap by its name and selects all its key-value pairs. Add the Pod to your cluster, then retrieve its logs to see the output from the printenv command. This should confirm that the two key-value pairs from the ConfigMap have been set as environment variables:
|
| 210 |
+
|
| 211 |
+
```shell
|
| 212 |
+
kubectl apply -f env-configmap.yaml
|
| 213 |
+
```
|
| 214 |
+
```shell
|
| 215 |
+
kubectl logs pod/env-configmap
|
| 216 |
+
```
|
| 217 |
+
|
| 218 |
+
The output is similar to this:
|
| 219 |
+
|
| 220 |
+
```console
|
| 221 |
+
...
|
| 222 |
+
username: "k8s-admin"
|
| 223 |
+
access_level: "1"
|
| 224 |
+
...
|
| 225 |
+
```
|
| 226 |
+
|
| 227 |
+
Sometimes a Pod won't require access to all the values in a ConfigMap. For example, you could have another Pod which only uses the username value from the ConfigMap. For this use case, you can use the `env.valueFrom` syntax instead, which lets you select individual keys in a ConfigMap. The name of the environment variable can also be different from the key within the ConfigMap. For example:
|
| 228 |
+
|
| 229 |
+
```yaml
|
| 230 |
+
apiVersion: v1
|
| 231 |
+
kind: Pod
|
| 232 |
+
metadata:
|
| 233 |
+
name: env-configmap
|
| 234 |
+
spec:
|
| 235 |
+
containers:
|
| 236 |
+
- name: envars-test-container
|
| 237 |
+
image: nginx
|
| 238 |
+
env:
|
| 239 |
+
- name: CONFIGMAP_USERNAME
|
| 240 |
+
valueFrom:
|
| 241 |
+
configMapKeyRef:
|
| 242 |
+
name: myconfigmap
|
| 243 |
+
key: username
|
| 244 |
+
```
|
| 245 |
+
|
| 246 |
+
In the Pod created from this manifest, you will see that the environment variable `CONFIGMAP_USERNAME` is set to the value of the `username` value from the ConfigMap. Other keys from the ConfigMap data are not copied into the environment.
|
| 247 |
+
|
| 248 |
+
It's important to note that the range of characters allowed for environment variable names in pods is [restricted](https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/#using-environment-variables-inside-of-your-config). If any keys do not meet the rules, those keys are not made available to your container, though the Pod is allowed to start.
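
The skip-but-still-start behaviour can be modelled as a filter over the ConfigMap keys. The name rule below is a conservative C-identifier-style pattern chosen as an assumption; the actual rule is defined on the linked page and may differ between Kubernetes versions:

```python
import re

# Assumed rule for illustration only; check the linked page for the real one.
ASSUMED_NAME_RULE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def env_from_configmap(data, name_rule=ASSUMED_NAME_RULE):
    """Split ConfigMap keys into usable environment variables and skipped keys."""
    env, skipped = {}, []
    for key, value in data.items():
        if name_rule.fullmatch(key):
            env[key] = value
        else:
            # The key is dropped, but the Pod would still start.
            skipped.append(key)
    return env, skipped


env, skipped = env_from_configmap(
    {"username": "k8s-admin", "access_level": "1", "9lives": "yes"}
)
```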
|
| 249 |
+
|
| 250 |
+
## Immutable ConfigMaps
|
| 251 |
+
|
| 252 |
+
FEATURE STATE: `Kubernetes v1.21 [stable]`
|
| 253 |
+
|
| 254 |
+
The Kubernetes feature *Immutable Secrets and ConfigMaps* provides an option to set individual Secrets and ConfigMaps as immutable. For clusters that extensively use ConfigMaps (at least tens of thousands of unique ConfigMap to Pod mounts), preventing changes to their data has the following advantages:
|
| 255 |
+
|
| 256 |
+
- protects you from accidental (or unwanted) updates that could cause applications outages
|
| 257 |
+
- improves performance of your cluster by significantly reducing load on kube-apiserver, by closing watches for ConfigMaps marked as immutable.
|
| 258 |
+
|
| 259 |
+
You can create an immutable ConfigMap by setting the `immutable` field to `true`. For example:
|
| 260 |
+
|
| 261 |
+
```yaml
|
| 262 |
+
apiVersion: v1
|
| 263 |
+
kind: ConfigMap
|
| 264 |
+
metadata:
|
| 265 |
+
...
|
| 266 |
+
data:
|
| 267 |
+
...
|
| 268 |
+
immutable: true
|
| 269 |
+
```
|
| 270 |
+
|
| 271 |
+
Once a ConfigMap is marked as immutable, it is *not* possible to revert this change nor to mutate the contents of the `data` or the `binaryData` field. You can only delete and recreate the ConfigMap. Because existing Pods maintain a mount point to the deleted ConfigMap, it is recommended to recreate these pods.
|
| 272 |
+
|
| 273 |
+
## What's next
|
| 274 |
+
|
| 275 |
+
- Read about [Secrets](https://kubernetes.io/docs/concepts/configuration/secret/).
|
| 276 |
+
- Read [Configure a Pod to Use a ConfigMap](https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/).
|
| 277 |
+
- Read about [changing a ConfigMap (or any other Kubernetes object)](https://kubernetes.io/docs/tasks/manage-kubernetes-objects/update-api-object-kubectl-patch/)
|
| 278 |
+
- Read [The Twelve-Factor App](https://12factor.net/) to understand the motivation for separating code from configuration.
|
| 279 |
+
|
| 280 |
+
|
| 281 |
+
Last modified November 21, 2025 at 2:18 PM PST: [Fix formatting of kubectl logs command (69fb346f79)](https://github.com/kubernetes/website/commit/69fb346f79076561c9e5fdb6e65aed5b927e8ce5)
|
|
@@ -0,0 +1,1092 @@
|
| 1 |
+
A Deployment manages a set of Pods to run an application workload, usually one that doesn't maintain state.
|
| 2 |
+
|
| 3 |
+
A *Deployment* provides declarative updates for [Pods](https://kubernetes.io/docs/concepts/workloads/pods/ "A Pod represents a set of running containers in your cluster.") and [ReplicaSets](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/ "ReplicaSet ensures that a specified number of Pod replicas are running at one time").
|
| 4 |
+
|
| 5 |
+
You describe a *desired state* in a Deployment, and the Deployment [Controller](https://kubernetes.io/docs/concepts/architecture/controller/ "A control loop that watches the shared state of the cluster through the apiserver and makes changes attempting to move the current state towards the desired state.") changes the actual state to the desired state at a controlled rate. You can define Deployments to create new ReplicaSets, or to remove existing Deployments and adopt all their resources with new Deployments.
|
| 6 |
+
|
| 7 |
+
> [!info] Note:
|
| 8 |
+
> Do not manage ReplicaSets owned by a Deployment. Consider opening an issue in the main Kubernetes repository if your use case is not covered below.
|
| 9 |
+
|
| 10 |
+
## Use Case
|
| 11 |
+
|
| 12 |
+
The following are typical use cases for Deployments:
|
| 13 |
+
|
| 14 |
+
- [Create a Deployment to roll out a ReplicaSet](#creating-a-deployment). The ReplicaSet creates Pods in the background. Check the status of the rollout to see if it succeeds or not.
|
| 15 |
+
- [Declare the new state of the Pods](#updating-a-deployment) by updating the PodTemplateSpec of the Deployment. A new ReplicaSet is created, and the Deployment gradually scales it up while scaling down the old ReplicaSet, ensuring Pods are replaced at a controlled rate. Each new ReplicaSet updates the revision of the Deployment.
|
| 16 |
+
- [Roll back to an earlier Deployment revision](#rolling-back-a-deployment) if the current state of the Deployment is not stable. Each rollback updates the revision of the Deployment.
|
| 17 |
+
- [Scale up the Deployment to facilitate more load](#scaling-a-deployment).
|
| 18 |
+
- [Pause the rollout of a Deployment](#pausing-and-resuming-a-deployment) to apply multiple fixes to its PodTemplateSpec and then resume it to start a new rollout.
|
| 19 |
+
- [Use the status of the Deployment](#deployment-status) as an indicator that a rollout has stuck.
|
| 20 |
+
- [Clean up older ReplicaSets](#clean-up-policy) that you don't need anymore.
|
| 21 |
+
|
| 22 |
+
## Creating a Deployment
|
| 23 |
+
|
| 24 |
+
The following is an example of a Deployment. It creates a ReplicaSet to bring up three `nginx` Pods:
|
| 25 |
+
|
| 26 |
+
```yaml
|
| 27 |
+
apiVersion: apps/v1
|
| 28 |
+
kind: Deployment
|
| 29 |
+
metadata:
|
| 30 |
+
  name: nginx-deployment
|
| 31 |
+
  labels:
|
| 32 |
+
    app: nginx
|
| 33 |
+
spec:
|
| 34 |
+
  replicas: 3
|
| 35 |
+
  selector:
|
| 36 |
+
    matchLabels:
|
| 37 |
+
      app: nginx
|
| 38 |
+
  template:
|
| 39 |
+
    metadata:
|
| 40 |
+
      labels:
|
| 41 |
+
        app: nginx
|
| 42 |
+
    spec:
|
| 43 |
+
      containers:
|
| 44 |
+
      - name: nginx
|
| 45 |
+
        image: nginx:1.14.2
|
| 46 |
+
        ports:
|
| 47 |
+
        - containerPort: 80
|
| 48 |
+
```
|
| 49 |
+
|
| 50 |
+
In this example:
|
| 51 |
+
|
| 52 |
+
- A Deployment named `nginx-deployment` is created, indicated by the `.metadata.name` field. This name will become the basis for the ReplicaSets and Pods which are created later. See [Writing a Deployment Spec](#writing-a-deployment-spec) for more details.
|
| 53 |
+
- The Deployment creates a ReplicaSet that creates three replicated Pods, indicated by the `.spec.replicas` field.
|
| 54 |
+
- The `.spec.selector` field defines how the created ReplicaSet finds which Pods to manage. In this case, you select a label that is defined in the Pod template (`app: nginx`). However, more sophisticated selection rules are possible, as long as the Pod template itself satisfies the rule.
|
| 55 |
+
> [!info] Note:
|
| 56 |
+
> The `.spec.selector.matchLabels` field is a map of {key,value} pairs. A single {key,value} in the `matchLabels` map is equivalent to an element of `matchExpressions`, whose `key` field is "key", the `operator` is "In", and the `values` array contains only "value". All of the requirements, from both `matchLabels` and `matchExpressions`, must be satisfied in order to match.
|
| 57 |
+
- The `.spec.template` field contains the following sub-fields:
|
| 58 |
+
- The Pods are labeled `app: nginx` using the `.metadata.labels` field.
|
| 59 |
+
- The Pod template's specification, or `.spec` field, indicates that the Pods run one container, `nginx`, which runs the `nginx` [Docker Hub](https://hub.docker.com/) image at version 1.14.2.
|
| 60 |
+
- Create one container and name it `nginx` using the `.spec.containers[0].name` field.
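
The selector semantics described in the note above can be sketched in a few lines of Python. This is an illustration only, not how Kubernetes implements label selection: the helper names `to_expressions` and `selector_matches` are hypothetical, the `tier` label is made up for the example, and only the `In` operator is modelled.

```python
def to_expressions(match_labels):
    """Rewrite each {key: value} pair as an equivalent matchExpressions entry."""
    return [
        {"key": key, "operator": "In", "values": [value]}
        for key, value in match_labels.items()
    ]


def selector_matches(pod_labels, match_labels=None, match_expressions=None):
    """Return True only if every requirement from both sources is satisfied."""
    requirements = to_expressions(match_labels or {}) + list(match_expressions or [])
    for req in requirements:
        if req["operator"] != "In":
            # Assumption: other operators are out of scope for this sketch.
            raise ValueError("this sketch models only the In operator")
        if pod_labels.get(req["key"]) not in req["values"]:
            return False
    return True


pod = {"app": "nginx", "pod-template-hash": "75675f5897"}

# matchLabels alone: the Pod carries app=nginx, so it matches.
print(selector_matches(pod, match_labels={"app": "nginx"}))  # True

# Adding a requirement the Pod does not satisfy makes the whole selector fail.
extra = [{"key": "tier", "operator": "In", "values": ["web"]}]
print(selector_matches(pod, match_labels={"app": "nginx"}, match_expressions=extra))  # False
```

Because all requirements are ANDed together, adding a requirement can only shrink the set of Pods a selector matches.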
|
| 61 |
+
|
| 62 |
+
Before you begin, make sure your Kubernetes cluster is up and running. Follow the steps given below to create the above Deployment:
|
| 63 |
+
|
| 64 |
+
1. Create the Deployment by running the following command:
|
| 65 |
+
```shell
|
| 66 |
+
kubectl apply -f https://k8s.io/examples/controllers/nginx-deployment.yaml
|
| 67 |
+
```
|
| 68 |
+
2. Run `kubectl get deployments` to check if the Deployment was created.
|
| 69 |
+
If the Deployment is still being created, the output is similar to the following:
|
| 70 |
+
```
|
| 71 |
+
NAME READY UP-TO-DATE AVAILABLE AGE
|
| 72 |
+
nginx-deployment 0/3 0 0 1s
|
| 73 |
+
```
|
| 74 |
+
When you inspect the Deployments in your cluster, the following fields are displayed:
|
| 75 |
+
- `NAME` lists the names of the Deployments in the namespace.
|
| 76 |
+
- `READY` displays how many replicas of the application are available to your users. It follows the pattern ready/desired.
|
| 77 |
+
- `UP-TO-DATE` displays the number of replicas that have been updated to achieve the desired state.
|
| 78 |
+
- `AVAILABLE` displays how many replicas of the application are available to your users.
|
| 79 |
+
- `AGE` displays the amount of time that the application has been running.
|
| 80 |
+
Notice how the number of desired replicas is 3 according to the `.spec.replicas` field.
|
| 81 |
+
3. To see the Deployment rollout status, run `kubectl rollout status deployment/nginx-deployment`.
|
| 82 |
+
The output is similar to:
|
| 83 |
+
```
|
| 84 |
+
Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
|
| 85 |
+
deployment "nginx-deployment" successfully rolled out
|
| 86 |
+
```
|
| 87 |
+
4. Run `kubectl get deployments` again a few seconds later. The output is similar to this:
|
| 88 |
+
```
|
| 89 |
+
NAME READY UP-TO-DATE AVAILABLE AGE
|
| 90 |
+
nginx-deployment 3/3 3 3 18s
|
| 91 |
+
```
|
| 92 |
+
Notice that the Deployment has created all three replicas, and all replicas are up-to-date (they contain the latest Pod template) and available.
|
| 93 |
+
5. To see the ReplicaSet (`rs`) created by the Deployment, run `kubectl get rs`. The output is similar to this:
|
| 94 |
+
```
|
| 95 |
+
NAME DESIRED CURRENT READY AGE
|
| 96 |
+
nginx-deployment-75675f5897 3 3 3 18s
|
| 97 |
+
```
|
| 98 |
+
ReplicaSet output shows the following fields:
|
| 99 |
+
- `NAME` lists the names of the ReplicaSets in the namespace.
|
| 100 |
+
- `DESIRED` displays the desired number of *replicas* of the application, which you define when you create the Deployment. This is the *desired state*.
|
| 101 |
+
- `CURRENT` displays how many replicas are currently running.
|
| 102 |
+
- `READY` displays how many replicas of the application are available to your users.
|
| 103 |
+
- `AGE` displays the amount of time that the application has been running.
|
| 104 |
+
Notice that the name of the ReplicaSet is always formatted as `[DEPLOYMENT-NAME]-[HASH]`. This name will become the basis for the Pods which are created.
|
| 105 |
+
The `HASH` string is the same as the `pod-template-hash` label on the ReplicaSet.
|
| 106 |
+
6. To see the labels automatically generated for each Pod, run `kubectl get pods --show-labels`. The output is similar to:
|
| 107 |
+
```
|
| 108 |
+
NAME READY STATUS RESTARTS AGE LABELS
|
| 109 |
+
nginx-deployment-75675f5897-7ci7o 1/1 Running 0 18s app=nginx,pod-template-hash=75675f5897
|
| 110 |
+
nginx-deployment-75675f5897-kzszj 1/1 Running 0 18s app=nginx,pod-template-hash=75675f5897
|
| 111 |
+
nginx-deployment-75675f5897-qqcnn 1/1 Running 0 18s app=nginx,pod-template-hash=75675f5897
|
| 112 |
+
```
|
| 113 |
+
The created ReplicaSet ensures that there are three `nginx` Pods.
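
The naming pattern visible in the output above can be taken apart with a short Python sketch. It assumes Pod names follow the `[DEPLOYMENT-NAME]-[HASH]-[SUFFIX]` shape shown in the sample output; `split_pod_name` is a hypothetical helper. In a real cluster, reading the `pod-template-hash` label is more dependable than parsing names.

```python
def split_pod_name(pod_name):
    """Split '<deployment>-<hash>-<suffix>' from the right.

    Splitting from the right matters because the Deployment name
    itself may contain hyphens (as 'nginx-deployment' does).
    """
    deployment, template_hash, suffix = pod_name.rsplit("-", 2)
    return deployment, template_hash, suffix


deployment, template_hash, suffix = split_pod_name("nginx-deployment-75675f5897-7ci7o")
replicaset_name = f"{deployment}-{template_hash}"

print(deployment)       # nginx-deployment
print(replicaset_name)  # nginx-deployment-75675f5897
print(suffix)           # 7ci7o
```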
|
| 114 |
+
|
| 115 |
+
> [!info] Note:
|
| 116 |
+
> You must specify an appropriate selector and Pod template labels in a Deployment (in this case, `app: nginx`).
|
| 117 |
+
>
|
| 118 |
+
> Do not overlap labels or selectors with other controllers (including other Deployments and StatefulSets). Kubernetes doesn't stop you from overlapping, and if multiple controllers have overlapping selectors those controllers might conflict and behave unexpectedly.
|
| 119 |
+
|
| 120 |
+
### Pod-template-hash label
|
| 121 |
+
|
| 122 |
+
> [!caution] Caution:
|
| 123 |
+
> Do not change this label.
|
| 124 |
+
|
| 125 |
+
The `pod-template-hash` label is added by the Deployment controller to every ReplicaSet that a Deployment creates or adopts.
|
| 126 |
+
|
| 127 |
+
This label ensures that child ReplicaSets of a Deployment do not overlap. It is generated by hashing the `PodTemplate` of the ReplicaSet and using the resulting hash as the label value that is added to the ReplicaSet selector, Pod template labels, and in any existing Pods that the ReplicaSet might have.
|
| 128 |
+
|
| 129 |
+
## Updating a Deployment
|
| 130 |
+
|
| 131 |
+
> [!info] Note:
|
| 132 |
+
> A Deployment's rollout is triggered if and only if the Deployment's Pod template (that is, `.spec.template`) is changed, for example if the labels or container images of the template are updated. Other updates, such as scaling the Deployment, do not trigger a rollout.
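
The rule in the note above can be expressed as a small check. This is a minimal sketch that models a Deployment as a plain Python dictionary; `triggers_rollout` is a hypothetical helper, not part of any Kubernetes client library.

```python
import copy


def triggers_rollout(old_deployment, new_deployment):
    """A rollout is triggered only when .spec.template differs."""
    return old_deployment["spec"]["template"] != new_deployment["spec"]["template"]


current = {
    "spec": {
        "replicas": 3,
        "template": {
            "metadata": {"labels": {"app": "nginx"}},
            "spec": {"containers": [{"name": "nginx", "image": "nginx:1.14.2"}]},
        },
    }
}

# Scaling changes .spec.replicas only, so the Pod template is untouched.
scaled = copy.deepcopy(current)
scaled["spec"]["replicas"] = 5

# Changing the container image edits the Pod template.
reimaged = copy.deepcopy(current)
reimaged["spec"]["template"]["spec"]["containers"][0]["image"] = "nginx:1.16.1"

print(triggers_rollout(current, scaled))    # False
print(triggers_rollout(current, reimaged))  # True
```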
|
| 133 |
+
|
| 134 |
+
Follow the steps given below to update your Deployment:
|
| 135 |
+
|
| 136 |
+
1. Let's update the nginx Pods to use the `nginx:1.16.1` image instead of the `nginx:1.14.2` image.
|
| 137 |
+
```shell
|
| 138 |
+
kubectl set image deployment.v1.apps/nginx-deployment nginx=nginx:1.16.1
|
| 139 |
+
```
|
| 140 |
+
or use the following command:
|
| 141 |
+
```shell
|
| 142 |
+
kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1
|
| 143 |
+
```
|
| 144 |
+
where `deployment/nginx-deployment` indicates the Deployment, `nginx` indicates the container in which the update will take place, and `nginx:1.16.1` indicates the new image and its tag.
|
| 145 |
+
The output is similar to:
|
| 146 |
+
```
|
| 147 |
+
deployment.apps/nginx-deployment image updated
|
| 148 |
+
```
|
| 149 |
+
Alternatively, you can `edit` the Deployment and change `.spec.template.spec.containers[0].image` from `nginx:1.14.2` to `nginx:1.16.1`:
|
| 150 |
+
```shell
|
| 151 |
+
kubectl edit deployment/nginx-deployment
|
| 152 |
+
```
|
| 153 |
+
The output is similar to:
|
| 154 |
+
```
|
| 155 |
+
deployment.apps/nginx-deployment edited
|
| 156 |
+
```
|
| 157 |
+
2. To see the rollout status, run:
|
| 158 |
+
```shell
|
| 159 |
+
kubectl rollout status deployment/nginx-deployment
|
| 160 |
+
```
|
| 161 |
+
The output is similar to this:
|
| 162 |
+
```
|
| 163 |
+
Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
|
| 164 |
+
```
|
| 165 |
+
or
|
| 166 |
+
```
|
| 167 |
+
deployment "nginx-deployment" successfully rolled out
|
| 168 |
+
```
|
| 169 |
+
|
| 170 |
+
Get more details on your updated Deployment:
|
| 171 |
+
|
| 172 |
+
- After the rollout succeeds, you can view the Deployment by running `kubectl get deployments`. The output is similar to this:
|
| 173 |
+
```
|
| 174 |
+
NAME READY UP-TO-DATE AVAILABLE AGE
|
| 175 |
+
nginx-deployment 3/3 3 3 36s
|
| 176 |
+
```
|
| 177 |
+
- Run `kubectl get rs` to see that the Deployment updated the Pods by creating a new ReplicaSet and scaling it up to 3 replicas, as well as scaling down the old ReplicaSet to 0 replicas.
|
| 178 |
+
```shell
|
| 179 |
+
kubectl get rs
|
| 180 |
+
```
|
| 181 |
+
The output is similar to this:
|
| 182 |
+
```
|
| 183 |
+
NAME DESIRED CURRENT READY AGE
|
| 184 |
+
nginx-deployment-1564180365 3 3 3 6s
|
| 185 |
+
nginx-deployment-2035384211 0 0 0 36s
|
| 186 |
+
```
|
| 187 |
+
- Running `get pods` should now show only the new Pods:
|
| 188 |
+
```shell
|
| 189 |
+
kubectl get pods
|
| 190 |
+
```
|
| 191 |
+
The output is similar to this:
|
| 192 |
+
```
|
| 193 |
+
NAME READY STATUS RESTARTS AGE
|
| 194 |
+
nginx-deployment-1564180365-khku8 1/1 Running 0 14s
|
| 195 |
+
nginx-deployment-1564180365-nacti 1/1 Running 0 14s
|
| 196 |
+
nginx-deployment-1564180365-z9gth 1/1 Running 0 14s
|
| 197 |
+
```
|
| 198 |
+
Next time you want to update these Pods, you only need to update the Deployment's Pod template again.
|
| 199 |
+
Deployment ensures that only a certain number of Pods are down while they are being updated. By default, it ensures that at least 75% of the desired number of Pods are up (25% max unavailable).
|
| 200 |
+
Deployment also ensures that only a certain number of Pods are created above the desired number of Pods. By default, it ensures that at most 125% of the desired number of Pods are up (25% max surge).
|
| 201 |
+
For example, if you look at the above Deployment closely, you will see that it first creates a new Pod, then deletes an old Pod, and creates another new one. It does not kill old Pods until a sufficient number of new Pods have come up, and does not create new Pods until a sufficient number of old Pods have been killed. It makes sure that at least 3 Pods are available and that at most 4 Pods exist in total. In case of a Deployment with 4 replicas, the number of Pods would be between 3 and 5.
|
| 202 |
+
- Get details of your Deployment:
|
| 203 |
+
```shell
|
| 204 |
+
kubectl describe deployments
|
| 205 |
+
```
|
| 206 |
+
The output is similar to this:
|
| 207 |
+
```
|
| 208 |
+
Name: nginx-deployment
|
| 209 |
+
Namespace: default
|
| 210 |
+
CreationTimestamp: Thu, 30 Nov 2017 10:56:25 +0000
|
| 211 |
+
Labels: app=nginx
|
| 212 |
+
Annotations: deployment.kubernetes.io/revision=2
|
| 213 |
+
Selector: app=nginx
|
| 214 |
+
Replicas: 3 desired | 3 updated | 3 total | 3 available | 0 unavailable
|
| 215 |
+
StrategyType: RollingUpdate
|
| 216 |
+
MinReadySeconds: 0
|
| 217 |
+
RollingUpdateStrategy: 25% max unavailable, 25% max surge
|
| 218 |
+
Pod Template:
|
| 219 |
+
Labels: app=nginx
|
| 220 |
+
Containers:
|
| 221 |
+
nginx:
|
| 222 |
+
Image: nginx:1.16.1
|
| 223 |
+
Port: 80/TCP
|
| 224 |
+
Environment: <none>
|
| 225 |
+
Mounts: <none>
|
| 226 |
+
Volumes: <none>
|
| 227 |
+
Conditions:
|
| 228 |
+
Type Status Reason
|
| 229 |
+
---- ------ ------
|
| 230 |
+
Available True MinimumReplicasAvailable
|
| 231 |
+
Progressing True NewReplicaSetAvailable
|
| 232 |
+
OldReplicaSets: <none>
|
| 233 |
+
NewReplicaSet: nginx-deployment-1564180365 (3/3 replicas created)
|
| 234 |
+
Events:
|
| 235 |
+
Type Reason Age From Message
|
| 236 |
+
---- ------ ---- ---- -------
|
| 237 |
+
Normal ScalingReplicaSet 2m deployment-controller Scaled up replica set nginx-deployment-2035384211 to 3
|
| 238 |
+
Normal ScalingReplicaSet 24s deployment-controller Scaled up replica set nginx-deployment-1564180365 to 1
|
| 239 |
+
Normal ScalingReplicaSet 22s deployment-controller Scaled down replica set nginx-deployment-2035384211 to 2
|
| 240 |
+
Normal ScalingReplicaSet 22s deployment-controller Scaled up replica set nginx-deployment-1564180365 to 2
|
| 241 |
+
Normal ScalingReplicaSet 19s deployment-controller Scaled down replica set nginx-deployment-2035384211 to 1
|
| 242 |
+
Normal ScalingReplicaSet 19s deployment-controller Scaled up replica set nginx-deployment-1564180365 to 3
|
| 243 |
+
Normal ScalingReplicaSet 14s deployment-controller Scaled down replica set nginx-deployment-2035384211 to 0
|
| 244 |
+
```
|
| 245 |
+
Here you see that when you first created the Deployment, it created a ReplicaSet (nginx-deployment-2035384211) and scaled it up to 3 replicas directly. When you updated the Deployment, it created a new ReplicaSet (nginx-deployment-1564180365) and scaled it up to 1 and waited for it to come up. Then it scaled down the old ReplicaSet to 2 and scaled up the new ReplicaSet to 2 so that at least 3 Pods were available and at most 4 Pods were created at all times. It then continued scaling up and down the new and the old ReplicaSet, with the same rolling update strategy. Finally, you'll have 3 available replicas in the new ReplicaSet, and the old ReplicaSet is scaled down to 0.
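
The Pod-count limits quoted in this section (3 to 4 Pods for 3 replicas, 3 to 5 Pods for 4 replicas) follow from the default 25% settings. The sketch below reproduces that arithmetic; it assumes percentage values, with `maxUnavailable` rounded down and `maxSurge` rounded up when converted to absolute numbers. `rolling_update_bounds` is a hypothetical helper name.

```python
def rolling_update_bounds(replicas, max_unavailable_pct=25, max_surge_pct=25):
    """Return (minimum available Pods, maximum total Pods) during a rollout.

    Assumption: both settings are percentages; maxUnavailable rounds down
    and maxSurge rounds up when converted to an absolute Pod count.
    """
    max_unavailable = replicas * max_unavailable_pct // 100  # floor division
    max_surge = -(-replicas * max_surge_pct // 100)          # ceiling via negated floor
    return replicas - max_unavailable, replicas + max_surge


for replicas in (3, 4, 10):
    low, high = rolling_update_bounds(replicas)
    print(f"{replicas} replicas: at least {low} available, at most {high} in total")
```

With 3 replicas, 25% is 0.75 Pods: rounding down gives 0 unavailable and rounding up gives 1 surge Pod, which is why the rollout above never drops below 3 available Pods and never exceeds 4.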
|
| 246 |
+
|
| 247 |
+
> [!info] Note:
|
| 248 |
+
> Kubernetes doesn't count terminating Pods when calculating the number of `availableReplicas`, which must be between `replicas - maxUnavailable` and `replicas + maxSurge`. As a result, you might notice that there are more Pods than expected during a rollout, and that the total resources consumed by the Deployment is more than `replicas + maxSurge` until the `terminationGracePeriodSeconds` of the terminating Pods expires.
|
| 249 |
+
|
| 250 |
+
### Rollover (aka multiple updates in-flight)
|
| 251 |
+
|
| 252 |
+
Each time a new Deployment is observed by the Deployment controller, a ReplicaSet is created to bring up the desired Pods. If the Deployment is updated, the existing ReplicaSet that controls Pods whose labels match `.spec.selector` but whose template does not match `.spec.template` is scaled down. Eventually, the new ReplicaSet is scaled to `.spec.replicas` and all old ReplicaSets are scaled to 0.
|
| 253 |
+
|
| 254 |
+
If you update a Deployment while an existing rollout is in progress, the Deployment creates a new ReplicaSet as per the update and starts scaling that up, and rolls over the ReplicaSet that it was scaling up previously: it adds it to its list of old ReplicaSets and starts scaling it down.
|
| 255 |
+
|
| 256 |
+
For example, suppose you create a Deployment to create 5 replicas of `nginx:1.14.2`, but then update the Deployment to create 5 replicas of `nginx:1.16.1`, when only 3 replicas of `nginx:1.14.2` had been created. In that case, the Deployment immediately starts killing the 3 `nginx:1.14.2` Pods that it had created, and starts creating `nginx:1.16.1` Pods. It does not wait for the 5 replicas of `nginx:1.14.2` to be created before changing course.
|
| 257 |
+
|
| 258 |
+
### Label selector updates
|
| 259 |
+
|
| 260 |
+
It is generally discouraged to make label selector updates and it is suggested to plan your selectors up front. A Deployment's label selector is **immutable** after creation; it cannot be updated via `kubectl patch`, `kubectl edit`, `kubectl apply`, or tools like `helm upgrade`.
|
| 261 |
+
|
| 262 |
+
If you must change the selector, you have to delete the Deployment and recreate it. Exercise great caution and ensure you grasp the following implications:
|
| 263 |
+
|
| 264 |
+
- **Additions:** When you create a new Deployment with a narrower selector, the new Deployment **must** also have a suitable Pod template. If you have an existing manifest and you edit the manifest to narrow the selector, you need to edit the metadata of the Pod template inside that Deployment, adding the new labels to match, as otherwise the API server returns a validation error. This is a *non-overlapping* change: the new Deployment will not "see" the old Pods (which lack the new label), causing the old ReplicaSet to be **orphaned** and a brand-new ReplicaSet to be created.
|
| 265 |
+
- **Value Updates:** Changing the existing value in a selector key (e.g., from `v1` to `v2`) results in the same behavior as additions (orphaning and recreation).
|
| 266 |
+
- **Removals:** Removing an existing key from the Deployment selector does not require any changes in the Pod template labels. This is an *overlapping* change: the new, broader selector would match the old Pods. Existing ReplicaSets are not orphaned, and a new ReplicaSet is not created, but note that the removed label still exists in any existing Pods and ReplicaSets. You can clean that up by triggering a rollout for the Deployment.
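
The difference between overlapping and non-overlapping selector changes comes down to whether the new selector still matches the labels on existing Pods. A minimal sketch, assuming equality-based selectors only; the `selects` helper and the `tier` and `track` labels are hypothetical:

```python
def selects(selector, labels):
    """True if every key/value pair in the selector appears in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Labels on Pods created by the old Deployment (selector: app=nginx, tier=web).
old_pod_labels = {"app": "nginx", "tier": "web"}

narrower = {"app": "nginx", "tier": "web", "track": "stable"}  # addition
changed = {"app": "nginx", "tier": "api"}                      # value update
broader = {"app": "nginx"}                                     # removal

print(selects(narrower, old_pod_labels))  # False: old Pods are no longer selected
print(selects(changed, old_pod_labels))   # False: same effect as an addition
print(selects(broader, old_pod_labels))   # True: old Pods are still selected
```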
|
| 267 |
+
|
| 268 |
+
## Rolling Back a Deployment
|
| 269 |
+
|
| 270 |
+
Sometimes, you may want to roll back a Deployment; for example, when the Deployment is not stable, such as crash looping. By default, all of the Deployment's rollout history is kept in the system so that you can roll back anytime you want (you can change that by modifying the revision history limit).
|
| 271 |
+
|
| 272 |
+
> [!info] Note:
|
| 273 |
+
> A Deployment's revision is created when a Deployment's rollout is triggered. This means that the new revision is created if and only if the Deployment's Pod template (`.spec.template`) is changed, for example if you update the labels or container images of the template. Other updates, such as scaling the Deployment, do not create a Deployment revision, so that you can facilitate simultaneous manual- or auto-scaling. This means that when you roll back to an earlier revision, only the Deployment's Pod template part is rolled back.
|
| 274 |
+
|
| 275 |
+
- Suppose that you made a typo while updating the Deployment, by putting the image name as `nginx:1.161` instead of `nginx:1.16.1`:
|
| 276 |
+
```shell
|
| 277 |
+
kubectl set image deployment/nginx-deployment nginx=nginx:1.161
|
| 278 |
+
```
|
| 279 |
+
The output is similar to this:
|
| 280 |
+
```
|
| 281 |
+
deployment.apps/nginx-deployment image updated
|
| 282 |
+
```
|
| 283 |
+
- The rollout gets stuck. You can verify it by checking the rollout status:
|
| 284 |
+
```shell
|
| 285 |
+
kubectl rollout status deployment/nginx-deployment
|
| 286 |
+
```
|
| 287 |
+
The output is similar to this:
|
| 288 |
+
```
|
| 289 |
+
Waiting for rollout to finish: 1 out of 3 new replicas have been updated...
|
| 290 |
+
```
|
| 291 |
+
- Press Ctrl-C to stop the above rollout status watch. For more information on stuck rollouts, [read more here](#deployment-status).
|
| 292 |
+
- You see that the number of old replicas (adding the replica count from `nginx-deployment-1564180365` and `nginx-deployment-2035384211`) is 3, and the number of new replicas (from `nginx-deployment-3066724191`) is 1.
|
| 293 |
+
```shell
|
| 294 |
+
kubectl get rs
|
| 295 |
+
```
|
| 296 |
+
The output is similar to this:
|
| 297 |
+
```
|
| 298 |
+
NAME DESIRED CURRENT READY AGE
|
| 299 |
+
nginx-deployment-1564180365 3 3 3 25s
|
| 300 |
+
nginx-deployment-2035384211 0 0 0 36s
|
| 301 |
+
nginx-deployment-3066724191 1 1 0 6s
|
| 302 |
+
```
|
| 303 |
+
- Looking at the Pods created, you see that 1 Pod created by the new ReplicaSet is stuck in an image pull loop.
|
| 304 |
+
```shell
|
| 305 |
+
kubectl get pods
|
| 306 |
+
```
|
| 307 |
+
The output is similar to this:
|
| 308 |
+
```
|
| 309 |
+
NAME READY STATUS RESTARTS AGE
|
| 310 |
+
nginx-deployment-1564180365-70iae 1/1 Running 0 25s
|
| 311 |
+
nginx-deployment-1564180365-jbqqo 1/1 Running 0 25s
|
| 312 |
+
nginx-deployment-1564180365-hysrc 1/1 Running 0 25s
|
| 313 |
+
nginx-deployment-3066724191-08mng 0/1 ImagePullBackOff 0 6s
|
| 314 |
+
```
|
| 315 |
+
> [!info] Note:
|
| 316 |
+
> The Deployment controller stops the bad rollout automatically, and stops scaling up the new ReplicaSet. This depends on the rollingUpdate parameters (`maxUnavailable` specifically) that you have specified. Kubernetes by default sets the value to 25%.
|
| 317 |
+
- Get the description of the Deployment:
|
| 318 |
+
```shell
|
| 319 |
+
kubectl describe deployment
|
| 320 |
+
```
|
| 321 |
+
The output is similar to this:
|
| 322 |
+
```
|
| 323 |
+
Name: nginx-deployment
|
| 324 |
+
Namespace: default
|
| 325 |
+
CreationTimestamp: Tue, 15 Mar 2016 14:48:04 -0700
|
| 326 |
+
Labels: app=nginx
|
| 327 |
+
Selector: app=nginx
|
| 328 |
+
Replicas: 3 desired | 1 updated | 4 total | 3 available | 1 unavailable
|
| 329 |
+
StrategyType: RollingUpdate
|
| 330 |
+
MinReadySeconds: 0
|
| 331 |
+
RollingUpdateStrategy: 25% max unavailable, 25% max surge
|
| 332 |
+
Pod Template:
|
| 333 |
+
Labels: app=nginx
|
| 334 |
+
Containers:
|
| 335 |
+
nginx:
|
| 336 |
+
Image: nginx:1.161
|
| 337 |
+
Port: 80/TCP
|
| 338 |
+
Host Port: 0/TCP
|
| 339 |
+
Environment: <none>
|
| 340 |
+
Mounts: <none>
|
| 341 |
+
Volumes: <none>
|
| 342 |
+
Conditions:
|
| 343 |
+
Type Status Reason
|
| 344 |
+
---- ------ ------
|
| 345 |
+
Available True MinimumReplicasAvailable
|
| 346 |
+
Progressing True ReplicaSetUpdated
|
| 347 |
+
OldReplicaSets: nginx-deployment-1564180365 (3/3 replicas created)
|
| 348 |
+
NewReplicaSet: nginx-deployment-3066724191 (1/1 replicas created)
|
| 349 |
+
Events:
|
| 350 |
+
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
|
| 351 |
+
--------- -------- ----- ---- ------------- -------- ------ -------
|
| 352 |
+
1m 1m 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-2035384211 to 3
|
| 353 |
+
22s 22s 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-1564180365 to 1
|
| 354 |
+
22s 22s 1 {deployment-controller } Normal ScalingReplicaSet Scaled down replica set nginx-deployment-2035384211 to 2
|
| 355 |
+
22s 22s 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-1564180365 to 2
|
| 356 |
+
21s 21s 1 {deployment-controller } Normal ScalingReplicaSet Scaled down replica set nginx-deployment-2035384211 to 1
|
| 357 |
+
21s 21s 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-1564180365 to 3
|
| 358 |
+
13s 13s 1 {deployment-controller } Normal ScalingReplicaSet Scaled down replica set nginx-deployment-2035384211 to 0
|
| 359 |
+
13s 13s 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-3066724191 to 1
|
| 360 |
+
```
|
| 361 |
+
To fix this, you need to roll back to a previous revision of the Deployment that is stable.
|
| 362 |
+
|
| 363 |
+
### Checking Rollout History of a Deployment
|
| 364 |
+
|
| 365 |
+
Follow the steps given below to check the rollout history:
|
| 366 |
+
|
| 367 |
+
1. First, check the revisions of this Deployment:
|
| 368 |
+
```shell
|
| 369 |
+
kubectl rollout history deployment/nginx-deployment
|
| 370 |
+
```
|
| 371 |
+
The output is similar to this:
|
| 372 |
+
```
|
| 373 |
+
deployments "nginx-deployment"
|
| 374 |
+
REVISION CHANGE-CAUSE
|
| 375 |
+
1 <none>
|
| 376 |
+
2 <none>
|
| 377 |
+
3 <none>
|
| 378 |
+
```
|
| 379 |
+
`CHANGE-CAUSE` is copied from the Deployment annotation `kubernetes.io/change-cause` to its revisions upon creation. You can specify the `CHANGE-CAUSE` message by:
|
| 380 |
+
- Annotating the Deployment with `kubectl annotate deployment/nginx-deployment kubernetes.io/change-cause="image updated to 1.16.1"`
|
| 381 |
+
- Manually editing the manifest of the resource.
|
| 382 |
+
- Using tooling that sets the annotation automatically.
|
| 383 |
+
> [!info] Note:
|
| 384 |
+
> In older versions of Kubernetes, you could use the `--record` flag with kubectl commands to automatically populate the `CHANGE-CAUSE` field. This flag is deprecated and will be removed in a future release.
|
| 385 |
+
2. To see the details of each revision, run:
|
| 386 |
+
```shell
|
| 387 |
+
kubectl rollout history deployment/nginx-deployment --revision=2
|
| 388 |
+
```
|
| 389 |
+
The output is similar to this:
|
| 390 |
+
```
|
| 391 |
+
deployments "nginx-deployment" revision 2
|
| 392 |
+
Labels: app=nginx
|
| 393 |
+
pod-template-hash=1159050644
|
| 394 |
+
Containers:
|
| 395 |
+
nginx:
|
| 396 |
+
Image: nginx:1.16.1
|
| 397 |
+
Port: 80/TCP
|
| 398 |
+
QoS Tier:
|
| 399 |
+
cpu: BestEffort
|
| 400 |
+
memory: BestEffort
|
| 401 |
+
Environment Variables: <none>
|
| 402 |
+
No volumes.
|
| 403 |
+
```
|
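The annotation route for `CHANGE-CAUSE` described in step 1 can be wrapped in a small helper. This is a minimal sketch rather than anything from the upstream page: the function name `set_change_cause` is hypothetical, and it assumes `kubectl` is on your `PATH` and already pointed at the right cluster and namespace. `--overwrite` is passed so that a later call replaces an earlier annotation value.

```shell
# Hypothetical helper: record why a Deployment changed so that
# `kubectl rollout history` can show it in the CHANGE-CAUSE column.
# Usage: set_change_cause <deployment-name> <message>
set_change_cause() {
  name="$1"
  message="$2"
  if [ -z "$name" ] || [ -z "$message" ]; then
    echo "usage: set_change_cause <deployment-name> <message>" >&2
    return 2
  fi
  # --overwrite replaces an existing kubernetes.io/change-cause value.
  kubectl annotate "deployment/$name" \
    "kubernetes.io/change-cause=$message" --overwrite
}
```

For example, `set_change_cause nginx-deployment "image updated to 1.16.1"` runs the same annotate command shown in step 1.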
| 404 |
+
|
| 405 |
+
### Rolling Back to a Previous Revision
|
| 406 |
+
|
| 407 |
+
Follow the steps given below to roll back the Deployment from the current version to the previous version, which is version 2.
|
| 408 |
+
|
| 409 |
+
1. Now you've decided to undo the current rollout and roll back to the previous revision:
|
| 410 |
+
```shell
|
| 411 |
+
kubectl rollout undo deployment/nginx-deployment
|
| 412 |
+
```
|
| 413 |
+
The output is similar to this:
|
| 414 |
+
```
|
| 415 |
+
deployment.apps/nginx-deployment rolled back
|
| 416 |
+
```
|
| 417 |
+
Alternatively, you can roll back to a specific revision by specifying it with `--to-revision`:
|
| 418 |
+
```shell
|
| 419 |
+
kubectl rollout undo deployment/nginx-deployment --to-revision=2
|
| 420 |
+
```
|
| 421 |
+
The output is similar to this:
|
| 422 |
+
```
|
| 423 |
+
deployment.apps/nginx-deployment rolled back
|
| 424 |
+
```
|
| 425 |
+
For more details about rollout related commands, read [`kubectl rollout`](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#rollout).
|
| 426 |
+
The Deployment is now rolled back to a previous stable revision. As you can see, a `DeploymentRollback` event for rolling back to revision 2 is generated from Deployment controller.
|
| 427 |
+
2. To check whether the rollback was successful and the Deployment is running as expected, run:
|
| 428 |
+
```shell
|
| 429 |
+
kubectl get deployment nginx-deployment
|
| 430 |
+
```
|
| 431 |
+
The output is similar to this:
|
| 432 |
+
```
|
| 433 |
+
NAME READY UP-TO-DATE AVAILABLE AGE
|
| 434 |
+
nginx-deployment 3/3 3 3 30m
|
| 435 |
+
```
|
| 436 |
+
3. Get the description of the Deployment:
|
| 437 |
+
```shell
|
| 438 |
+
kubectl describe deployment nginx-deployment
|
| 439 |
+
```
|
| 440 |
+
The output is similar to this:
|
| 441 |
+
```
|
| 442 |
+
Name: nginx-deployment
|
| 443 |
+
Namespace: default
|
| 444 |
+
CreationTimestamp: Sun, 02 Sep 2018 18:17:55 -0500
|
| 445 |
+
Labels: app=nginx
|
| 446 |
+
Annotations: deployment.kubernetes.io/revision=4
|
| 447 |
+
Selector: app=nginx
|
| 448 |
+
Replicas: 3 desired | 3 updated | 3 total | 3 available | 0 unavailable
|
| 449 |
+
StrategyType: RollingUpdate
|
| 450 |
+
MinReadySeconds: 0
|
| 451 |
+
RollingUpdateStrategy: 25% max unavailable, 25% max surge
|
| 452 |
+
Pod Template:
|
| 453 |
+
Labels: app=nginx
|
| 454 |
+
Containers:
|
| 455 |
+
nginx:
|
| 456 |
+
Image: nginx:1.16.1
|
| 457 |
+
Port: 80/TCP
|
| 458 |
+
Host Port: 0/TCP
|
| 459 |
+
Environment: <none>
|
| 460 |
+
Mounts: <none>
|
| 461 |
+
Volumes: <none>
|
| 462 |
+
Conditions:
|
| 463 |
+
Type Status Reason
|
| 464 |
+
---- ------ ------
|
| 465 |
+
Available True MinimumReplicasAvailable
|
| 466 |
+
Progressing True NewReplicaSetAvailable
|
| 467 |
+
OldReplicaSets: <none>
|
| 468 |
+
NewReplicaSet: nginx-deployment-c4747d96c (3/3 replicas created)
|
| 469 |
+
Events:
|
| 470 |
+
Type Reason Age From Message
|
| 471 |
+
---- ------ ---- ---- -------
|
| 472 |
+
Normal ScalingReplicaSet 12m deployment-controller Scaled up replica set nginx-deployment-75675f5897 to 3
|
| 473 |
+
Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-deployment-c4747d96c to 1
|
| 474 |
+
Normal ScalingReplicaSet 11m deployment-controller Scaled down replica set nginx-deployment-75675f5897 to 2
|
| 475 |
+
Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-deployment-c4747d96c to 2
|
| 476 |
+
Normal ScalingReplicaSet 11m deployment-controller Scaled down replica set nginx-deployment-75675f5897 to 1
|
| 477 |
+
Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-deployment-c4747d96c to 3
|
| 478 |
+
Normal ScalingReplicaSet 11m deployment-controller Scaled down replica set nginx-deployment-75675f5897 to 0
|
| 479 |
+
Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-deployment-595696685f to 1
|
| 480 |
+
Normal DeploymentRollback 15s deployment-controller Rolled back deployment "nginx-deployment" to revision 2
|
| 481 |
+
Normal ScalingReplicaSet 15s deployment-controller Scaled down replica set nginx-deployment-595696685f to 0
|
| 482 |
+
```
|
| 483 |
+
|
| 484 |
+
## Scaling a Deployment
|
| 485 |
+
|
| 486 |
+
You can scale a Deployment by using the following command:
|
| 487 |
+
|
| 488 |
+
```shell
|
| 489 |
+
kubectl scale deployment/nginx-deployment --replicas=10
|
| 490 |
+
```
|
| 491 |
+
|
| 492 |
+
The output is similar to this:
|
| 493 |
+
|
| 494 |
+
```
|
| 495 |
+
deployment.apps/nginx-deployment scaled
|
| 496 |
+
```
|
| 497 |
+
|
| 498 |
+
Assuming [horizontal Pod autoscaling](https://kubernetes.io/docs/concepts/workloads/autoscaling/horizontal-pod-autoscale/) is enabled in your cluster, you can set up an autoscaler for your Deployment and choose the minimum and maximum number of Pods you want to run based on the CPU utilization of your existing Pods.
|
| 499 |
+
|
| 500 |
+
```shell
|
| 501 |
+
kubectl autoscale deployment/nginx-deployment --min=10 --max=15 --cpu-percent=80
|
| 502 |
+
```
|
| 503 |
+
|
| 504 |
+
The output is similar to this:
|
| 505 |
+
|
| 506 |
+
```
|
| 507 |
+
horizontalpodautoscaler.autoscaling/nginx-deployment autoscaled
|
| 508 |
+
```
|
| 509 |
+
|
| 510 |
+
### Proportional scaling
|
| 511 |
+
|
| 512 |
+
RollingUpdate Deployments support running multiple versions of an application at the same time. When you or an autoscaler scales a RollingUpdate Deployment that is in the middle of a rollout (either in progress or paused), the Deployment controller balances the additional replicas in the existing active ReplicaSets (ReplicaSets with Pods) in order to mitigate risk. This is called *proportional scaling*.
|
| 513 |
+
|
| 514 |
+
For example, you are running a Deployment with 10 replicas, [maxSurge](#max-surge)=3, and [maxUnavailable](#max-unavailable)=2.
|
| 515 |
+
|
| 516 |
+
- Ensure that the 10 replicas in your Deployment are running.
|
| 517 |
+
```shell
|
| 518 |
+
kubectl get deploy
|
| 519 |
+
```
|
| 520 |
+
The output is similar to this:
|
| 521 |
+
```
|
| 522 |
+
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
|
| 523 |
+
nginx-deployment 10 10 10 10 50s
|
| 524 |
+
```
|
| 525 |
+
- You update to a new image which happens to be unresolvable from inside the cluster.
|
| 526 |
+
```shell
|
| 527 |
+
kubectl set image deployment/nginx-deployment nginx=nginx:sometag
|
| 528 |
+
```
|
| 529 |
+
The output is similar to this:
|
| 530 |
+
```
|
| 531 |
+
deployment.apps/nginx-deployment image updated
|
| 532 |
+
```
|
| 533 |
+
- The image update starts a new rollout with ReplicaSet nginx-deployment-1989198191, but it's blocked due to the `maxUnavailable` requirement that you mentioned above. Check out the rollout status:
|
| 534 |
+
```shell
|
| 535 |
+
kubectl get rs
|
| 536 |
+
```
|
| 537 |
+
The output is similar to this:
|
| 538 |
+
```
|
| 539 |
+
NAME DESIRED CURRENT READY AGE
|
| 540 |
+
nginx-deployment-1989198191 5 5 0 9s
|
| 541 |
+
nginx-deployment-618515232 8 8 8 1m
|
| 542 |
+
```
|
| 543 |
+
- Then a new scaling request for the Deployment comes along. The autoscaler increments the Deployment replicas to 15. The Deployment controller needs to decide where to add these new 5 replicas. If you weren't using proportional scaling, all 5 of them would be added in the new ReplicaSet. With proportional scaling, you spread the additional replicas across all ReplicaSets. Bigger proportions go to the ReplicaSets with the most replicas and lower proportions go to ReplicaSets with fewer replicas. Any leftovers are added to the ReplicaSet with the most replicas. ReplicaSets with zero replicas are not scaled up.
|
| 544 |
+
|
| 545 |
+
In our example above, 3 replicas are added to the old ReplicaSet and 2 replicas are added to the new ReplicaSet. The rollout process should eventually move all replicas to the new ReplicaSet, assuming the new replicas become healthy. To confirm this, run:
|
| 546 |
+
|
| 547 |
+
```shell
|
| 548 |
+
kubectl get deploy
|
| 549 |
+
```
|
| 550 |
+
|
| 551 |
+
The output is similar to this:
|
| 552 |
+
|
| 553 |
+
```
|
| 554 |
+
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
|
| 555 |
+
nginx-deployment 15 18 7 8 7m
|
| 556 |
+
```
|
| 557 |
+
|
| 558 |
+
The rollout status confirms how the replicas were added to each ReplicaSet.
|
| 559 |
+
|
| 560 |
+
```shell
|
| 561 |
+
kubectl get rs
|
| 562 |
+
```
|
| 563 |
+
|
| 564 |
+
The output is similar to this:
|
| 565 |
+
|
| 566 |
+
```
|
| 567 |
+
NAME DESIRED CURRENT READY AGE
|
| 568 |
+
nginx-deployment-1989198191 7 7 0 7m
|
| 569 |
+
nginx-deployment-618515232 11 11 11 7m
|
| 570 |
+
```
|
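The arithmetic behind the 3 + 2 split in this example can be reproduced with integer shell arithmetic. The sketch below is a simplified illustration and not the Deployment controller's actual algorithm: the function name `proportional_split` is hypothetical, and it assumes each share is the extra replica count weighted by the ReplicaSet's current size, rounded to the nearest integer, with any rounding leftover given to the largest ReplicaSet.

```shell
# Simplified illustration of proportional scaling (NOT the controller's code).
# Usage: proportional_split <extra-replicas> <rs-size> [<rs-size> ...]
# Prints how many of the extra replicas each ReplicaSet receives, in order.
proportional_split() {
  extra="$1"; shift
  total=0; largest=0; largest_idx=0; i=0
  for size in "$@"; do
    total=$((total + size))
    if [ "$size" -gt "$largest" ]; then largest="$size"; largest_idx="$i"; fi
    i=$((i + 1))
  done
  if [ "$total" -le 0 ]; then
    echo "no active ReplicaSets" >&2
    return 1
  fi

  # First pass: each share is extra * size / total, rounded to nearest.
  # A ReplicaSet with zero replicas therefore always gets a share of zero.
  assigned=0; shares=""
  for size in "$@"; do
    share=$(( (2 * extra * size + total) / (2 * total) ))
    shares="$shares $share"
    assigned=$((assigned + share))
  done

  # Second pass: any rounding leftover goes to the largest ReplicaSet.
  leftover=$((extra - assigned)); i=0; out=""
  for share in $shares; do
    if [ "$i" -eq "$largest_idx" ]; then share=$((share + leftover)); fi
    out="$out $share"
    i=$((i + 1))
  done
  echo $out
}

# Old ReplicaSet has 8 Pods, new one has 5, and 5 replicas are being added.
proportional_split 5 8 5   # prints: 3 2
```

Adding those shares to the current sizes gives 11 and 7, which matches the `kubectl get rs` output above.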
| 571 |
+
|
| 572 |
+
## Pausing and Resuming a rollout of a Deployment
|
| 573 |
+
|
| 574 |
+
When you update a Deployment, or plan to, you can pause rollouts for that Deployment before you trigger one or more updates. When you're ready to apply those changes, you resume rollouts for the Deployment. This approach allows you to apply multiple fixes in between pausing and resuming without triggering unnecessary rollouts.
|
| 575 |
+
|
| 576 |
+
- For example, with a Deployment that was created:
|
| 577 |
+
Get the Deployment details:
|
| 578 |
+
```shell
|
| 579 |
+
kubectl get deploy
|
| 580 |
+
```
|
| 581 |
+
The output is similar to this:
|
| 582 |
+
```
|
| 583 |
+
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
|
| 584 |
+
nginx 3 3 3 3 1m
|
| 585 |
+
```
|
| 586 |
+
Get the rollout status:
|
| 587 |
+
```shell
|
| 588 |
+
kubectl get rs
|
| 589 |
+
```
|
| 590 |
+
The output is similar to this:
|
| 591 |
+
```
|
| 592 |
+
NAME DESIRED CURRENT READY AGE
|
| 593 |
+
nginx-2142116321 3 3 3 1m
|
| 594 |
+
```
|
| 595 |
+
- Pause by running the following command:
|
| 596 |
+
```shell
|
| 597 |
+
kubectl rollout pause deployment/nginx-deployment
|
| 598 |
+
```
|
| 599 |
+
The output is similar to this:
|
| 600 |
+
```
|
| 601 |
+
deployment.apps/nginx-deployment paused
|
| 602 |
+
```
|
| 603 |
+
- Then update the image of the Deployment:
|
| 604 |
+
```shell
|
| 605 |
+
kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1
|
| 606 |
+
```
|
| 607 |
+
The output is similar to this:
|
| 608 |
+
```
|
| 609 |
+
deployment.apps/nginx-deployment image updated
|
| 610 |
+
```
|
| 611 |
+
- Notice that no new rollout started:
|
| 612 |
+
```shell
|
| 613 |
+
kubectl rollout history deployment/nginx-deployment
|
| 614 |
+
```
|
| 615 |
+
The output is similar to this:
|
| 616 |
+
```
|
| 617 |
+
deployments "nginx"
|
| 618 |
+
REVISION CHANGE-CAUSE
|
| 619 |
+
1 <none>
|
| 620 |
+
```
|
| 621 |
+
- Get the rollout status to verify that the existing ReplicaSet has not changed:
|
| 622 |
+
```shell
|
| 623 |
+
kubectl get rs
|
| 624 |
+
```
|
| 625 |
+
The output is similar to this:
|
| 626 |
+
```
|
| 627 |
+
NAME DESIRED CURRENT READY AGE
|
| 628 |
+
nginx-2142116321 3 3 3 2m
|
| 629 |
+
```
|
| 630 |
+
- You can make as many updates as you wish, for example, update the resources that will be used:
|
| 631 |
+
```shell
|
| 632 |
+
kubectl set resources deployment/nginx-deployment -c=nginx --limits=cpu=200m,memory=512Mi
|
| 633 |
+
```
|
| 634 |
+
The output is similar to this:
|
| 635 |
+
```
|
| 636 |
+
deployment.apps/nginx-deployment resource requirements updated
|
| 637 |
+
```
|
| 638 |
+
The initial state of the Deployment prior to pausing its rollout will continue its function, but new updates to the Deployment will not have any effect as long as the Deployment rollout is paused.
|
| 639 |
+
- Eventually, resume the Deployment rollout and observe a new ReplicaSet coming up with all the new updates:
|
| 640 |
+
```shell
|
| 641 |
+
kubectl rollout resume deployment/nginx-deployment
|
| 642 |
+
```
|
| 643 |
+
The output is similar to this:
|
| 644 |
+
```
|
| 645 |
+
deployment.apps/nginx-deployment resumed
|
| 646 |
+
```
|
| 647 |
+
- [Watch](https://kubernetes.io/docs/reference/using-api/api-concepts/#api-verbs "A verb that is used to track changes to an object in Kubernetes as a stream.") the status of the rollout until it's done.
|
| 648 |
+
```shell
|
| 649 |
+
kubectl get rs --watch
|
| 650 |
+
```
|
| 651 |
+
The output is similar to this:
|
| 652 |
+
```
|
| 653 |
+
NAME DESIRED CURRENT READY AGE
|
| 654 |
+
nginx-2142116321 2 2 2 2m
|
| 655 |
+
nginx-3926361531 2 2 0 6s
|
| 656 |
+
nginx-3926361531 2 2 1 18s
|
| 657 |
+
nginx-2142116321 1 2 2 2m
|
| 658 |
+
nginx-2142116321 1 2 2 2m
|
| 659 |
+
nginx-3926361531 3 2 1 18s
|
| 660 |
+
nginx-3926361531 3 2 1 18s
|
| 661 |
+
nginx-2142116321 1 1 1 2m
|
| 662 |
+
nginx-3926361531 3 3 1 18s
|
| 663 |
+
nginx-3926361531 3 3 2 19s
|
| 664 |
+
nginx-2142116321 0 1 1 2m
|
| 665 |
+
nginx-2142116321 0 1 1 2m
|
| 666 |
+
nginx-2142116321 0 0 0 2m
|
| 667 |
+
nginx-3926361531 3 3 3 20s
|
| 668 |
+
```
|
| 669 |
+
- Get the status of the latest rollout:
|
| 670 |
+
```shell
|
| 671 |
+
kubectl get rs
|
| 672 |
+
```
|
| 673 |
+
The output is similar to this:
|
| 674 |
+
```
|
| 675 |
+
NAME DESIRED CURRENT READY AGE
|
| 676 |
+
nginx-2142116321 0 0 0 2m
|
| 677 |
+
nginx-3926361531 3 3 3 28s
|
| 678 |
+
```
|
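The pause, update, and resume steps above can be collected into one function so that several Pod template changes produce a single rollout. This is a sketch under stated assumptions: the name `batched_update` and its argument order are hypothetical, while the `kubectl` subcommands are the ones used in the steps above. It resumes the rollout even when one of the edits fails, which is a design choice you may not want in every pipeline.

```shell
# Hypothetical wrapper around the pause/update/resume steps above.
# Usage: batched_update <deployment-name> <container> <image> <limits>
batched_update() {
  deploy="deployment/$1"
  container="$2"
  image="$3"
  limits="$4"

  # Stop the controller from rolling out each change individually.
  kubectl rollout pause "$deploy" || return 1

  # Both edits change the Pod template; neither starts a rollout while paused.
  kubectl set image "$deploy" "$container=$image" &&
    kubectl set resources "$deploy" -c="$container" --limits="$limits"
  status=$?

  # Always resume, even if an edit failed, so the Deployment is not left paused.
  kubectl rollout resume "$deploy" || return 1
  return "$status"
}
```

A call such as `batched_update nginx-deployment nginx nginx:1.16.1 cpu=200m,memory=512Mi` issues the same four commands as the walkthrough.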
| 679 |
+
|
| 680 |
+
> [!info] Note:
|
| 681 |
+
> You cannot roll back a paused Deployment until you resume it.
|
| 682 |
+
|
| 683 |
+
## Deployment status
|
| 684 |
+
|
| 685 |
+
A Deployment enters various states during its lifecycle. It can be [progressing](#progressing-deployment) while rolling out a new ReplicaSet, it can be [complete](#complete-deployment), or it can [fail to progress](#failed-deployment).
|
| 686 |
+
|
| 687 |
+
### Progressing Deployment
|
| 688 |
+
|
| 689 |
+
Kubernetes marks a Deployment as *progressing* when one of the following tasks is performed:
|
| 690 |
+
|
| 691 |
+
- The Deployment creates a new ReplicaSet.
|
| 692 |
+
- The Deployment is scaling up its newest ReplicaSet.
|
| 693 |
+
- The Deployment is scaling down its older ReplicaSet(s).
|
| 694 |
+
- New Pods become ready or available (ready for at least [MinReadySeconds](#min-ready-seconds)).
|
| 695 |
+
|
| 696 |
+
When the rollout becomes “progressing”, the Deployment controller adds a condition with the following attributes to the Deployment's `.status.conditions`:
|
| 697 |
+
|
| 698 |
+
- `type: Progressing`
|
| 699 |
+
- `status: "True"`
|
| 700 |
+
- `reason: NewReplicaSetCreated` | `reason: FoundNewReplicaSet` | `reason: ReplicaSetUpdated`
|
| 701 |
+
|
| 702 |
+
You can monitor the progress for a Deployment by using `kubectl rollout status`.
|
| 703 |
+
|
| 704 |
+
### Complete Deployment
|
| 705 |
+
|
| 706 |
+
Kubernetes marks a Deployment as *complete* when it has the following characteristics:
|
| 707 |
+
|
| 708 |
+
- All of the replicas associated with the Deployment have been updated to the latest version you've specified, meaning any updates you've requested have been completed.
|
| 709 |
+
- All of the replicas associated with the Deployment are available.
|
| 710 |
+
- No old replicas for the Deployment are running.
|
| 711 |
+
|
| 712 |
+
When the rollout becomes “complete”, the Deployment controller sets a condition with the following attributes to the Deployment's `.status.conditions`:
|
| 713 |
+
|
| 714 |
+
- `type: Progressing`
|
| 715 |
+
- `status: "True"`
|
| 716 |
+
- `reason: NewReplicaSetAvailable`
|
| 717 |
+
|
| 718 |
+
This `Progressing` condition will retain a status value of `"True"` until a new rollout is initiated. The condition holds even when availability of replicas changes (which does instead affect the `Available` condition).
|
| 719 |
+
|
| 720 |
+
You can check if a Deployment has completed by using `kubectl rollout status`. If the rollout completed successfully, `kubectl rollout status` returns a zero exit code.
|
| 721 |
+
|
| 722 |
+
```shell
|
| 723 |
+
kubectl rollout status deployment/nginx-deployment
|
| 724 |
+
```
|
| 725 |
+
|
| 726 |
+
The output is similar to this:
|
| 727 |
+
|
| 728 |
+
```
|
| 729 |
+
Waiting for rollout to finish: 2 of 3 updated replicas are available...
|
| 730 |
+
deployment "nginx-deployment" successfully rolled out
|
| 731 |
+
```
|
| 732 |
+
|
| 733 |
+
and the exit status from `kubectl rollout` is 0 (success):
|
| 734 |
+
|
| 735 |
+
```shell
|
| 736 |
+
echo $?
|
| 737 |
+
```
|
| 738 |
+
```
|
| 739 |
+
0
|
| 740 |
+
```
|
| 741 |
+
|
| 742 |
+
### Failed Deployment
|
| 743 |
+
|
| 744 |
+
Your Deployment may get stuck trying to deploy its newest ReplicaSet without ever completing. This can occur due to some of the following factors:
|
| 745 |
+
|
| 746 |
+
- Insufficient quota
|
| 747 |
+
- Readiness probe failures
|
| 748 |
+
- Image pull errors
|
| 749 |
+
- Insufficient permissions
|
| 750 |
+
- Limit ranges
|
| 751 |
+
- Application runtime misconfiguration
|
| 752 |
+
|
| 753 |
+
One way you can detect this condition is to specify a deadline parameter in your Deployment spec: ([`.spec.progressDeadlineSeconds`](#progress-deadline-seconds)). `.spec.progressDeadlineSeconds` denotes the number of seconds the Deployment controller waits before indicating (in the Deployment status) that the Deployment progress has stalled.
|
| 754 |
+
|
| 755 |
+
The following `kubectl` command sets the spec with `progressDeadlineSeconds` to make the controller report lack of progress of a rollout for a Deployment after 10 minutes:
|
| 756 |
+
|
| 757 |
+
```shell
|
| 758 |
+
kubectl patch deployment/nginx-deployment -p '{"spec":{"progressDeadlineSeconds":600}}'
|
| 759 |
+
```
|
| 760 |
+
|
| 761 |
+
The output is similar to this:
|
| 762 |
+
|
| 763 |
+
```
|
| 764 |
+
deployment.apps/nginx-deployment patched
|
| 765 |
+
```
|
| 766 |
+
|
| 767 |
+
Once the deadline has been exceeded, the Deployment controller adds a DeploymentCondition with the following attributes to the Deployment's `.status.conditions`:
|
| 768 |
+
|
| 769 |
+
- `type: Progressing`
|
| 770 |
+
- `status: "False"`
|
| 771 |
+
- `reason: ProgressDeadlineExceeded`
|
| 772 |
+
|
| 773 |
+
This condition can also fail early and is then set to a status value of `"False"` due to reasons such as `ReplicaSetCreateError`. Also, the deadline is not taken into account anymore once the Deployment rollout completes.
|
| 774 |
+
|
| 775 |
+
See the [Kubernetes API conventions](https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties) for more information on status conditions.
|
| 776 |
+
|
| 777 |
+
> [!info] Note:
|
| 778 |
+
> Kubernetes takes no action on a stalled Deployment other than to report a status condition with `reason: ProgressDeadlineExceeded`. Higher level orchestrators can take advantage of it and act accordingly, for example, roll back the Deployment to its previous version.
|
| 779 |
+
|
| 780 |
+
> [!info] Note:
|
| 781 |
+
> If you pause a Deployment rollout, Kubernetes does not check progress against your specified deadline. You can safely pause a Deployment rollout in the middle of a rollout and resume without triggering the condition for exceeding the deadline.
|
| 782 |
+
|
| 783 |
+
You may experience transient errors with your Deployments, either due to a low timeout that you have set or due to any other kind of error that can be treated as transient. For example, let's suppose you have insufficient quota. If you describe the Deployment you will notice the following section:
|
| 784 |
+
|
| 785 |
+
```shell
|
| 786 |
+
kubectl describe deployment nginx-deployment
|
| 787 |
+
```
|
| 788 |
+
|
| 789 |
+
The output is similar to this:
|
| 790 |
+
|
| 791 |
+
```
|
| 792 |
+
<...>
|
| 793 |
+
Conditions:
|
| 794 |
+
Type Status Reason
|
| 795 |
+
---- ------ ------
|
| 796 |
+
Available True MinimumReplicasAvailable
|
| 797 |
+
Progressing True ReplicaSetUpdated
|
| 798 |
+
ReplicaFailure True FailedCreate
|
| 799 |
+
<...>
|
| 800 |
+
```
|
| 801 |
+
|
| 802 |
+
If you run `kubectl get deployment nginx-deployment -o yaml`, the Deployment status is similar to this:
|
| 803 |
+
|
| 804 |
+
```
|
| 805 |
+
status:
|
| 806 |
+
availableReplicas: 2
|
| 807 |
+
conditions:
|
| 808 |
+
- lastTransitionTime: 2016-10-04T12:25:39Z
|
| 809 |
+
lastUpdateTime: 2016-10-04T12:25:39Z
|
| 810 |
+
message: Replica set "nginx-deployment-4262182780" is progressing.
|
| 811 |
+
reason: ReplicaSetUpdated
|
| 812 |
+
status: "True"
|
| 813 |
+
type: Progressing
|
| 814 |
+
- lastTransitionTime: 2016-10-04T12:25:42Z
|
| 815 |
+
lastUpdateTime: 2016-10-04T12:25:42Z
|
| 816 |
+
message: Deployment has minimum availability.
|
| 817 |
+
reason: MinimumReplicasAvailable
|
| 818 |
+
status: "True"
|
| 819 |
+
type: Available
|
| 820 |
+
- lastTransitionTime: 2016-10-04T12:25:39Z
|
| 821 |
+
lastUpdateTime: 2016-10-04T12:25:39Z
|
| 822 |
+
message: 'Error creating: pods "nginx-deployment-4262182780-" is forbidden: exceeded quota:
|
| 823 |
+
object-counts, requested: pods=1, used: pods=3, limited: pods=2'
|
| 824 |
+
reason: FailedCreate
|
| 825 |
+
status: "True"
|
| 826 |
+
type: ReplicaFailure
|
| 827 |
+
observedGeneration: 3
|
| 828 |
+
replicas: 2
|
| 829 |
+
unavailableReplicas: 2
|
| 830 |
+
```
|
| 831 |
+
|
| 832 |
+
Eventually, once the Deployment progress deadline is exceeded, Kubernetes updates the status and the reason for the Progressing condition:
|
| 833 |
+
|
| 834 |
+
```
|
| 835 |
+
Conditions:
|
| 836 |
+
Type Status Reason
|
| 837 |
+
---- ------ ------
|
| 838 |
+
Available True MinimumReplicasAvailable
|
| 839 |
+
Progressing False ProgressDeadlineExceeded
|
| 840 |
+
ReplicaFailure True FailedCreate
|
| 841 |
+
```
|
| 842 |
+
|
| 843 |
+
You can address an issue of insufficient quota by scaling down your Deployment, by scaling down other controllers you may be running, or by increasing quota in your namespace. If you satisfy the quota conditions and the Deployment controller then completes the Deployment rollout, you'll see the Deployment's status update with a successful condition (`status: "True"` and `reason: NewReplicaSetAvailable`).
|
| 844 |
+
|
| 845 |
+
```
|
| 846 |
+
Conditions:
|
| 847 |
+
Type Status Reason
|
| 848 |
+
---- ------ ------
|
| 849 |
+
Available True MinimumReplicasAvailable
|
| 850 |
+
Progressing True NewReplicaSetAvailable
|
| 851 |
+
```
|
| 852 |
+
|
| 853 |
+
`type: Available` with `status: "True"` means that your Deployment has minimum availability. Minimum availability is dictated by the parameters specified in the deployment strategy. `type: Progressing` with `status: "True"` means that your Deployment is either in the middle of a rollout and it is progressing or that it has successfully completed its progress and the minimum required new replicas are available (see the Reason of the condition for the particulars - in our case `reason: NewReplicaSetAvailable` means that the Deployment is complete).
|
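The way the `Progressing` condition maps onto the three rollout states can be written down as a small lookup. This is a minimal sketch: the function name `classify_progressing` and the state labels it prints are hypothetical, and it only encodes the status and reason combinations described on this page.

```shell
# Hypothetical helper: map the Progressing condition onto the three
# rollout states described on this page.
# Usage: classify_progressing <status> <reason>
classify_progressing() {
  status="$1"
  reason="$2"
  case "$status:$reason" in
    True:NewReplicaSetAvailable) echo "complete" ;;
    True:*)                      echo "progressing" ;;
    False:*)                     echo "failed" ;;
    *)                           echo "unknown" ;;
  esac
}

# On a live cluster the inputs could be read with kubectl's JSONPath output,
# for example (swap .status for .reason to get the second argument):
#   kubectl get deployment nginx-deployment \
#     -o jsonpath='{.status.conditions[?(@.type=="Progressing")].status}'
classify_progressing True ReplicaSetUpdated          # prints: progressing
classify_progressing False ProgressDeadlineExceeded  # prints: failed
```

Keeping the classification separate from the `kubectl` call makes it easy to check without a cluster.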
| 854 |
+
|
| 855 |
+
You can check if a Deployment has failed to progress by using `kubectl rollout status`. `kubectl rollout status` returns a non-zero exit code if the Deployment has exceeded the progression deadline.
|
| 856 |
+
|
| 857 |
+
```shell
|
| 858 |
+
kubectl rollout status deployment/nginx-deployment
|
| 859 |
+
```
|
| 860 |
+
|
| 861 |
+
The output is similar to this:
|
| 862 |
+
|
| 863 |
+
```
|
| 864 |
+
Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
|
| 865 |
+
error: deployment "nginx" exceeded its progress deadline
|
| 866 |
+
```
|
| 867 |
+
|
| 868 |
+
and the exit status from `kubectl rollout` is 1 (indicating an error):
|
| 869 |
+
|
| 870 |
+
```shell
|
| 871 |
+
echo $?
|
| 872 |
+
```
|
| 873 |
+
```
|
| 874 |
+
1
|
| 875 |
+
```
|
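Because Kubernetes only reports a stalled rollout, a script can use the exit code of `kubectl rollout status` to decide what to do next. The following is a sketch of one possible reaction, assuming you want an automatic rollback: the function name `rollout_or_undo` is hypothetical, and a non-zero exit can also come from other `kubectl` failures (for example, an unreachable API server), so treat it as a starting point rather than a complete policy.

```shell
# Hypothetical helper: wait for a rollout and undo it if it does not complete.
# Usage: rollout_or_undo <deployment-name>
rollout_or_undo() {
  deploy="deployment/$1"
  if kubectl rollout status "$deploy"; then
    echo "rollout of $deploy completed"
    return 0
  fi
  # Non-zero exit: the progress deadline was exceeded (or kubectl itself failed).
  echo "rollout of $deploy did not complete; rolling back" >&2
  kubectl rollout undo "$deploy"
}
```

Running `rollout_or_undo nginx-deployment` after an update blocks until the rollout finishes or the deadline set in `.spec.progressDeadlineSeconds` is exceeded.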
| 876 |
+
|
| 877 |
+
### Operating on a failed deployment
|
| 878 |
+
|
| 879 |
+
All actions that apply to a complete Deployment also apply to a failed Deployment. You can scale it up/down, roll back to a previous revision, or even pause it if you need to apply multiple tweaks in the Deployment Pod template.
|
| 880 |
+
|
| 881 |
+
## Clean up Policy
|
| 882 |
+
|
| 883 |
+
You can set the `.spec.revisionHistoryLimit` field in a Deployment to specify how many old ReplicaSets for this Deployment you want to retain. The rest will be garbage-collected in the background. By default, it is 10.
|
| 884 |
+
|
| 885 |
+
> [!info] Note:
|
| 886 |
+
> Explicitly setting this field to 0 will result in cleaning up all the history of your Deployment; that Deployment will then not be able to roll back.
|
| 887 |
+
|
| 888 |
+
The cleanup only starts **after** a Deployment reaches a [complete state](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#complete-deployment). If you set `.spec.revisionHistoryLimit` to 0, any rollout nonetheless triggers creation of a new ReplicaSet before Kubernetes removes the old one.
|
| 889 |
+
|
| 890 |
+
Even with a non-zero revision history limit, you can have more ReplicaSets than the limit you configure. For example, if pods are crash looping, and there are multiple rolling update events triggered over time, you might end up with more ReplicaSets than the `.spec.revisionHistoryLimit` because the Deployment never reaches a complete state.
|
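The revision history limit can be changed with the same `kubectl patch` pattern used earlier on this page for `progressDeadlineSeconds`. This is a minimal sketch: the function name `set_history_limit` is hypothetical, and the warning for a value of 0 simply restates the note above.

```shell
# Hypothetical helper: set how many old ReplicaSets a Deployment retains.
# Usage: set_history_limit <deployment-name> <limit>
set_history_limit() {
  name="$1"
  limit="$2"
  case "$limit" in
    ''|*[!0-9]*)
      echo "limit must be a non-negative integer" >&2
      return 2 ;;
  esac
  if [ "$limit" -eq 0 ]; then
    echo "warning: a limit of 0 removes all history; rollback will not be possible" >&2
  fi
  kubectl patch "deployment/$name" \
    -p "{\"spec\":{\"revisionHistoryLimit\":$limit}}"
}
```

For example, `set_history_limit nginx-deployment 5` keeps the five most recent old ReplicaSets once the Deployment reaches a complete state.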
| 891 |
+
|
| 892 |
+
## Canary Deployment
|
| 893 |
+
|
| 894 |
+
If you want to roll out releases to a subset of users or servers using the Deployment, you can create multiple Deployments, one for each release, following the canary pattern described in [managing resources](https://kubernetes.io/docs/concepts/workloads/management/#canary-deployments).
|
| 895 |
+
|
| 896 |
+
## Writing a Deployment Spec
|
| 897 |
+
|
| 898 |
+
As with all other Kubernetes configs, a Deployment needs `.apiVersion`, `.kind`, and `.metadata` fields. For general information about working with config files, see [deploying applications](https://kubernetes.io/docs/tasks/run-application/run-stateless-application-deployment/), configuring containers, and [using kubectl to manage resources](https://kubernetes.io/docs/concepts/overview/working-with-objects/object-management/) documents.
|
| 899 |
+
|
| 900 |
+
When the control plane creates new Pods for a Deployment, the `.metadata.name` of the Deployment is part of the basis for naming those Pods. The name of a Deployment must be a valid [DNS subdomain](https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-subdomain-names) value, but this can produce unexpected results for the Pod hostnames. For best compatibility, the name should follow the more restrictive rules for a [DNS label](https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names).
|
| 901 |
+
|
| 902 |
+
A Deployment also needs a [`.spec` section](https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status).
|
| 903 |
+
|
| 904 |
+
### Pod Template
|
| 905 |
+
|
| 906 |
+
The `.spec.template` and `.spec.selector` are the only required fields of the `.spec`.
|
| 907 |
+
|
| 908 |
+
The `.spec.template` is a [Pod template](https://kubernetes.io/docs/concepts/workloads/pods/#pod-templates). It has exactly the same schema as a [Pod](https://kubernetes.io/docs/concepts/workloads/pods/ "A Pod represents a set of running containers in your cluster."), except it is nested and does not have an `apiVersion` or `kind`.
|
| 909 |
+
|
| 910 |
+
In addition to required fields for a Pod, a Pod template in a Deployment must specify appropriate labels and an appropriate restart policy. For labels, make sure not to overlap with other controllers. See [selector](#selector).
|
| 911 |
+
|
| 912 |
+
Only a [`.spec.template.spec.restartPolicy`](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy) equal to `Always` is allowed, which is the default if not specified.
|
| 913 |
+
|
| 914 |
+
### Replicas
|
| 915 |
+
|
| 916 |
+
`.spec.replicas` is an optional field that specifies the number of desired Pods. It defaults to 1.
|
| 917 |
+
|
| 918 |
+
Should you manually scale a Deployment, for example via `kubectl scale deployment deployment --replicas=X`, and then you update that Deployment based on a manifest (for example: by running `kubectl apply -f deployment.yaml`), then applying that manifest overwrites the manual scaling that you previously did.
|
| 919 |
+
|
| 920 |
+
If a [HorizontalPodAutoscaler](https://kubernetes.io/docs/concepts/workloads/autoscaling/horizontal-pod-autoscale/) (or any similar API for horizontal scaling) is managing scaling for a Deployment, don't set `.spec.replicas`.
|
| 921 |
+
|
| 922 |
+
Instead, allow the Kubernetes [control plane](https://kubernetes.io/docs/reference/glossary/?all=true#term-control-plane "The container orchestration layer that exposes the API and interfaces to define, deploy, and manage the lifecycle of containers.") to manage the `.spec.replicas` field automatically.
|
| 923 |
+
|
| 924 |
+
### Selector
|
| 925 |
+
|
| 926 |
+
`.spec.selector` is a required field that specifies a [label selector](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/) for the Pods targeted by this Deployment.
|
| 927 |
+
|
| 928 |
+
`.spec.selector` must match `.spec.template.metadata.labels`, or it will be rejected by the API.
|
| 929 |
+
|
| 930 |
+
In API version `apps/v1`, `.spec.selector` and `.metadata.labels` do not default to `.spec.template.metadata.labels` if not set. So they must be set explicitly. Also note that `.spec.selector` is immutable after creation of the Deployment in `apps/v1`.
|
| 931 |
+
|
| 932 |
+
A Deployment may terminate Pods whose labels match the selector if their template is different from `.spec.template` or if the total number of such Pods exceeds `.spec.replicas`. It brings up new Pods with `.spec.template` if the number of Pods is less than the desired number.
|
| 933 |
+
|
| 934 |
+
> [!info] Note:
|
| 935 |
+
> You should not create other Pods whose labels match this selector, either directly, by creating another Deployment, or by creating another controller such as a ReplicaSet or a ReplicationController. If you do so, the first Deployment thinks that it created these other Pods. Kubernetes does not stop you from doing this.
|
| 936 |
+
|
| 937 |
+
If you have multiple controllers that have overlapping selectors, the controllers will fight with each other and won't behave correctly.
|
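The requirement that `.spec.selector` match `.spec.template.metadata.labels` can be checked before a manifest ever reaches the API server. The sketch below is an illustration only: the function name `selector_matches_labels` is hypothetical, it takes both sides as comma-separated `key=value` lists, and it covers equality-based selectors only (set-based expressions are out of scope).

```shell
# Hypothetical pre-flight check: every selector pair must appear in the labels.
# Usage: selector_matches_labels <selector> <labels>
# Both arguments are comma-separated key=value lists, e.g. "app=nginx,tier=web".
selector_matches_labels() {
  selector="$1"
  labels=",$2,"
  if [ -z "$selector" ]; then
    return 1                           # a Deployment needs a selector
  fi
  old_ifs="$IFS"; IFS=','
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;                  # this selector pair is present
      *) IFS="$old_ifs"; return 1 ;;   # key missing or value differs
    esac
  done
  IFS="$old_ifs"
  return 0
}

if selector_matches_labels "app=nginx" "app=nginx,tier=web"; then
  echo "selector matches the Pod template labels"
fi
```

The Pod template may carry extra labels, as in this call, but every pair in the selector has to be present.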
| 938 |
+
|
| 939 |
+
### Strategy
|
| 940 |
+
|
| 941 |
+
`.spec.strategy` specifies the strategy used to replace old Pods by new ones. `.spec.strategy.type` can be "Recreate" or "RollingUpdate". "RollingUpdate" is the default value.
|
| 942 |
+
|
| 943 |
+
#### Recreate Deployment
|
| 944 |
+
|
| 945 |
+
All existing Pods are killed before new ones are created when `.spec.strategy.type==Recreate`.
|
| 946 |
+
|
| 947 |
+
> [!info] Note:
|
| 948 |
+
> This will only guarantee Pod termination previous to creation for upgrades. If you upgrade a Deployment, all Pods of the old revision will be terminated immediately. Successful removal is awaited before any Pod of the new revision is created. If you manually delete a Pod, the lifecycle is controlled by the ReplicaSet and the replacement will be created immediately (even if the old Pod is still in a Terminating state). If you need an "at most" guarantee for your Pods, you should consider using a [StatefulSet](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/).
|
| 949 |
+
|
| 950 |
+
#### Rolling Update Deployment
|
| 951 |
+
|
| 952 |
+
The Deployment updates Pods in a rolling update fashion (gradually scale down the old ReplicaSets and scale up the new one) when `.spec.strategy.type==RollingUpdate`. You can specify `maxUnavailable` and `maxSurge` to control the rolling update process.
|
| 953 |
+
|
| 954 |
+
##### Max Unavailable
|
| 955 |
+
|
| 956 |
+
`.spec.strategy.rollingUpdate.maxUnavailable` is an optional field that specifies the maximum number of Pods that can be unavailable during the update process. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The absolute number is calculated from the percentage by rounding down. The value cannot be 0 if `.spec.strategy.rollingUpdate.maxSurge` is 0. The default value is 25%.
|
| 957 |
+
|
| 958 |
+
For example, when this value is set to 30%, the old ReplicaSet can be scaled down to 70% of desired Pods immediately when the rolling update starts. Once new Pods are ready, old ReplicaSet can be scaled down further, followed by scaling up the new ReplicaSet, ensuring that the total number of Pods available at all times during the update is at least 70% of the desired Pods.
|
| 959 |
+
|
| 960 |
+
##### Max Surge
|
| 961 |
+
|
| 962 |
+
`.spec.strategy.rollingUpdate.maxSurge` is an optional field that specifies the maximum number of Pods that can be created over the desired number of Pods. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The value cannot be 0 if `maxUnavailable` is 0. The absolute number is calculated from the percentage by rounding up. The default value is 25%.
|
| 963 |
+
|
| 964 |
+
For example, when this value is set to 30%, the new ReplicaSet can be scaled up immediately when the rolling update starts, such that the total number of old and new Pods does not exceed 130% of desired Pods. Once old Pods have been killed, the new ReplicaSet can be scaled up further, ensuring that the total number of Pods running at any time during the update is at most 130% of desired Pods.
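To make the two rounding rules concrete, here is a small sketch with percentage values. The replica count of 10 is an assumption chosen so that the two percentages round in different directions.

```yaml
# Fragment of a Deployment spec (illustrative values, not a complete manifest).
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%   # 25% of 10 = 2.5, rounded down to 2: at least 8 Pods stay available
      maxSurge: 25%         # 25% of 10 = 2.5, rounded up to 3: at most 13 old and new Pods run at once
```

Rounding `maxUnavailable` down and `maxSurge` up means a percentage never makes the rollout more disruptive than the stated limit on availability.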
|
| 965 |
+
|
| 966 |
+
Here are some Rolling Update Deployment examples that use the `maxUnavailable` and `maxSurge`:
|
| 967 |
+
|
| 968 |
+
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
```
|
| 995 |
+
|
| 996 |
+
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
```
|
| 1023 |
+
|
| 1024 |
+
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
```
|
| 1052 |
+
|
| 1053 |
+
### Progress Deadline Seconds
|
| 1054 |
+
|
| 1055 |
+
`.spec.progressDeadlineSeconds` is an optional field that specifies the number of seconds you want to wait for your Deployment to progress before the system reports back that the Deployment has [failed progressing](#failed-deployment). This is surfaced as a condition with `type: Progressing`, `status: "False"`, and `reason: ProgressDeadlineExceeded` in the status of the resource. The Deployment controller will keep retrying the Deployment. This defaults to 600. In the future, once automatic rollback is implemented, the Deployment controller will roll back a Deployment as soon as it observes such a condition.
|
| 1056 |
+
|
| 1057 |
+
If specified, this field needs to be greater than `.spec.minReadySeconds`.
|
| 1058 |
+
|
| 1059 |
+
### Min Ready Seconds
|
| 1060 |
+
|
| 1061 |
+
`.spec.minReadySeconds` is an optional field that specifies the minimum number of seconds for which a newly created Pod should be ready without any of its containers crashing, for it to be considered available. This defaults to 0 (the Pod will be considered available as soon as it is ready). To learn more about when a Pod is considered ready, see [Container Probes](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes).
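The two timing fields are related by the constraint described above. A minimal sketch, with an assumed `minReadySeconds` of 10, might look like this:

```yaml
# Fragment of a Deployment spec (illustrative values, not a complete manifest).
spec:
  minReadySeconds: 10            # a new Pod must stay ready for 10 seconds to count as available
  progressDeadlineSeconds: 600   # the default; if set, must be greater than minReadySeconds
```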
|
| 1062 |
+
|
| 1063 |
+
### Terminating Pods
|
| 1064 |
+
|
| 1065 |
+
FEATURE STATE: `Kubernetes v1.35 [beta]` (enabled by default)
|
| 1066 |
+
|
| 1067 |
+
You can see the terminating pods only if the `DeploymentReplicaSetTerminatingReplicas` [feature gate](https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/) is enabled on the [API server](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/) and on the [kube-controller-manager](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/).
|
| 1068 |
+
|
| 1069 |
+
Pods that become terminating due to deletion or scale down may take a long time to terminate, and may consume additional resources during that period. As a result, the total number of all pods can temporarily exceed `.spec.replicas`. Terminating pods can be tracked using the `.status.terminatingReplicas` field of the Deployment.
|
| 1070 |
+
|
| 1071 |
+
### Revision History Limit
|
| 1072 |
+
|
| 1073 |
+
A Deployment's revision history is stored in the ReplicaSets it controls.
|
| 1074 |
+
|
| 1075 |
+
`.spec.revisionHistoryLimit` is an optional field that specifies the number of old ReplicaSets to retain to allow rollback. These old ReplicaSets consume resources in `etcd` and crowd the output of `kubectl get rs`. The configuration of each Deployment revision is stored in its ReplicaSets; therefore, once an old ReplicaSet is deleted, you lose the ability to roll back to that revision of the Deployment. By default, 10 old ReplicaSets are kept; the ideal value depends on the frequency and stability of new Deployments.
|
| 1076 |
+
|
| 1077 |
+
More specifically, setting this field to zero means that all old ReplicaSets with 0 replicas will be cleaned up. In this case, a new Deployment rollout cannot be undone, since its revision history is cleaned up.
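For illustration, a Deployment that trades rollback depth for a smaller footprint could lower the limit. The value 3 here is an arbitrary assumption, not a recommendation.

```yaml
# Fragment of a Deployment spec (illustrative value, not a complete manifest).
spec:
  revisionHistoryLimit: 3   # retain up to 3 old ReplicaSets for rollback; 0 disables rollback
```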
|
| 1078 |
+
|
| 1079 |
+
### Paused
|
| 1080 |
+
|
| 1081 |
+
`.spec.paused` is an optional boolean field for pausing and resuming a Deployment. The only difference between a paused Deployment and one that is not paused is that any changes to the PodTemplateSpec of the paused Deployment will not trigger new rollouts as long as it is paused. A Deployment is not paused by default when it is created.
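A sketch of the field in a manifest follows. Setting it declaratively like this is one option, shown here only to illustrate where the field lives.

```yaml
# Fragment of a Deployment spec (not a complete manifest).
spec:
  paused: true   # template changes are accepted but do not start a rollout until this is false again
```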
|
| 1082 |
+
|
| 1083 |
+
## What's next
|
| 1084 |
+
|
| 1085 |
+
- Learn more about [Pods](https://kubernetes.io/docs/concepts/workloads/pods/).
|
| 1086 |
+
- [Run a stateless application using a Deployment](https://kubernetes.io/docs/tasks/run-application/run-stateless-application-deployment/).
|
| 1087 |
+
- Read the [Deployment](https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/deployment-v1/) API reference to understand the Deployment API.
|
| 1088 |
+
- Read about [PodDisruptionBudget](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/) and how you can use it to manage application availability during disruptions.
|
| 1089 |
+
- Use kubectl to [create a Deployment](https://kubernetes.io/docs/tutorials/kubernetes-basics/deploy-app/deploy-intro/).
|
| 1090 |
+
|
| 1091 |
+
|
| 1092 |
+
Last modified March 15, 2026 at 3:21 PM PST: [fix: replace deprecated argument \`--cpu-percent\` with \`--cpu\` (af93a0a732)](https://github.com/kubernetes/website/commit/af93a0a732cf3057895c62e615a212a44aa6cec7)
|
|
@@ -0,0 +1,416 @@
|
| 1 |
+
If you want to control traffic flow at the IP address or port level (OSI layer 3 or 4), NetworkPolicies allow you to specify rules for traffic flow within your cluster, and also between Pods and the outside world. Your cluster must use a network plugin that supports NetworkPolicy enforcement.
|
| 2 |
+
|
| 3 |
+
If you want to control traffic flow at the IP address or port level for TCP, UDP, and SCTP protocols, then you might consider using Kubernetes NetworkPolicies for particular applications in your cluster. NetworkPolicies are an application-centric construct which allow you to specify how a [pod](https://kubernetes.io/docs/concepts/workloads/pods/ "A Pod represents a set of running containers in your cluster.") is allowed to communicate with various network "entities" (we use the word "entity" here to avoid overloading the more common terms such as "endpoints" and "services", which have specific Kubernetes connotations) over the network. NetworkPolicies apply to a connection with a pod on one or both ends, and are not relevant to other connections.
|
| 4 |
+
|
| 5 |
+
The entities that a Pod can communicate with are identified through a combination of the following three identifiers:
|
| 6 |
+
|
| 7 |
+
1. Other pods that are allowed (exception: a pod cannot block access to itself)
|
| 8 |
+
2. Namespaces that are allowed
|
| 9 |
+
3. IP blocks (exception: traffic to and from the node where a Pod is running is always allowed, regardless of the IP address of the Pod or the node)
|
| 10 |
+
|
| 11 |
+
When defining a pod- or namespace-based NetworkPolicy, you use a [selector](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/ "Allows users to filter a list of resources based on labels.") to specify what traffic is allowed to and from the Pod(s) that match the selector.
|
| 12 |
+
|
| 13 |
+
Meanwhile, when IP-based NetworkPolicies are created, we define policies based on IP blocks (CIDR ranges).
|
| 14 |
+
|
| 15 |
+
## Prerequisites
|
| 16 |
+
|
| 17 |
+
Network policies are implemented by the [network plugin](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/). To use network policies, you must be using a networking solution which supports NetworkPolicy. Creating a NetworkPolicy resource without a controller that implements it will have no effect.
|
| 18 |
+
|
| 19 |
+
## The two sorts of pod isolation
|
| 20 |
+
|
| 21 |
+
There are two sorts of isolation for a pod: isolation for egress, and isolation for ingress. They concern what connections may be established. "Isolation" here is not absolute, rather it means "some restrictions apply". The alternative, "non-isolated for $direction", means that no restrictions apply in the stated direction. The two sorts of isolation (or not) are declared independently, and are both relevant for a connection from one pod to another.
|
| 22 |
+
|
| 23 |
+
By default, a pod is non-isolated for egress; all outbound connections are allowed. A pod is isolated for egress if there is any NetworkPolicy that both selects the pod and has "Egress" in its `policyTypes`; we say that such a policy applies to the pod for egress. When a pod is isolated for egress, the only allowed connections from the pod are those allowed by the `egress` list of some NetworkPolicy that applies to the pod for egress. Reply traffic for those allowed connections will also be implicitly allowed. The effects of those `egress` lists combine additively.
|
| 24 |
+
|
| 25 |
+
By default, a pod is non-isolated for ingress; all inbound connections are allowed. A pod is isolated for ingress if there is any NetworkPolicy that both selects the pod and has "Ingress" in its `policyTypes`; we say that such a policy applies to the pod for ingress. When a pod is isolated for ingress, the only allowed connections into the pod are those from the pod's node and those allowed by the `ingress` list of some NetworkPolicy that applies to the pod for ingress. Reply traffic for those allowed connections will also be implicitly allowed. The effects of those `ingress` lists combine additively.
|
| 26 |
+
|
| 27 |
+
Network policies do not conflict; they are additive. If any policy or policies apply to a given pod for a given direction, the connections allowed in that direction from that pod are the union of what the applicable policies allow. Thus, order of evaluation does not affect the policy result.
|
| 28 |
+
|
| 29 |
+
For a connection from a source pod to a destination pod to be allowed, both the egress policy on the source pod and the ingress policy on the destination pod need to allow the connection. If either side does not allow the connection, it will not happen.
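The two-sided requirement can be sketched with a pair of policies in one namespace. The policy names, the `role=frontend` and `role=db` labels, and the port are assumptions for illustration. The first policy isolates the source pods for egress and the second isolates the destination pods for ingress, so the connection succeeds only because each side allows it.

```yaml
# Hypothetical names and labels; both policies are assumed to be in the same namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frontend-egress-to-db
spec:
  podSelector:
    matchLabels:
      role: frontend
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          role: db
    ports:
    - protocol: TCP
      port: 6379
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-ingress-from-frontend
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: frontend
    ports:
    - protocol: TCP
      port: 6379
```

Removing either policy's allow rule while its `policyTypes` entry stays in place would block the connection, even though the other side still permits it.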
|
| 30 |
+
|
| 31 |
+
## The NetworkPolicy resource
|
| 32 |
+
|
| 33 |
+
See the [NetworkPolicy](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.35/#networkpolicy-v1-networking-k8s-io) reference for a full definition of the resource.
|
| 34 |
+
|
| 35 |
+
An example NetworkPolicy might look like this:
|
| 36 |
+
|
| 37 |
+
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - ipBlock:
        cidr: 172.17.0.0/16
        except:
        - 172.17.1.0/24
    - namespaceSelector:
        matchLabels:
          project: myproject
    - podSelector:
        matchLabels:
          role: frontend
    ports:
    - protocol: TCP
      port: 6379
  egress:
  - to:
    - ipBlock:
        cidr: 10.0.0.0/24
    ports:
    - protocol: TCP
      port: 5978
```
|
| 73 |
+
|
| 74 |
+
> [!info] Note:
|
| 75 |
+
> POSTing this to the API server for your cluster will have no effect unless your chosen networking solution supports network policy.
|
| 76 |
+
|
| 77 |
+
**Mandatory Fields**: As with all other Kubernetes config, a NetworkPolicy needs `apiVersion`, `kind`, and `metadata` fields. For general information about working with config files, see [Configure a Pod to Use a ConfigMap](https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/), and [Object Management](https://kubernetes.io/docs/concepts/overview/working-with-objects/object-management/).
|
| 78 |
+
|
| 79 |
+
**spec**: NetworkPolicy [spec](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#spec-and-status) has all the information needed to define a particular network policy in the given namespace.
|
| 80 |
+
|
| 81 |
+
**podSelector**: Each NetworkPolicy includes a `podSelector` which selects the grouping of pods to which the policy applies. The example policy selects pods with the label "role=db". An empty `podSelector` selects all pods in the namespace.
|
| 82 |
+
|
| 83 |
+
**policyTypes**: Each NetworkPolicy includes a `policyTypes` list which may include either `Ingress`, `Egress`, or both. The `policyTypes` field indicates whether or not the given policy applies to ingress traffic to selected pods, egress traffic from selected pods, or both. If no `policyTypes` are specified on a NetworkPolicy then by default `Ingress` will always be set and `Egress` will be set if the NetworkPolicy has any egress rules.
|
| 84 |
+
|
| 85 |
+
**ingress**: Each NetworkPolicy may include a list of allowed `ingress` rules. Each rule allows traffic which matches both the `from` and `ports` sections. The example policy contains a single rule, which matches traffic on a single port, from one of three sources, the first specified via an `ipBlock`, the second via a `namespaceSelector` and the third via a `podSelector`.
|
| 86 |
+
|
| 87 |
+
**egress**: Each NetworkPolicy may include a list of allowed `egress` rules. Each rule allows traffic which matches both the `to` and `ports` sections. The example policy contains a single rule, which matches traffic on a single port to any destination in `10.0.0.0/24`.
|
| 88 |
+
|
| 89 |
+
So, the example NetworkPolicy:
|
| 90 |
+
|
| 91 |
+
1. isolates `role=db` pods in the `default` namespace for both ingress and egress traffic (if they weren't already isolated)
2. (Ingress rules) allows connections to all pods in the `default` namespace with the label `role=db` on TCP port 6379 from:
   - any pod in the `default` namespace with the label `role=frontend`
   - any pod in a namespace with the label `project=myproject`
   - IP addresses in the ranges `172.17.0.0` – `172.17.0.255` and `172.17.2.0` – `172.17.255.255` (i.e., all of `172.17.0.0/16` except `172.17.1.0/24`)
3. (Egress rules) allows connections from any pod in the `default` namespace with the label `role=db` to CIDR `10.0.0.0/24` on TCP port 5978
|
| 97 |
+
|
| 98 |
+
See the [Declare Network Policy](https://kubernetes.io/docs/tasks/administer-cluster/declare-network-policy/) walkthrough for further examples.
|
| 99 |
+
|
| 100 |
+
## Behavior of to and from selectors
|
| 101 |
+
|
| 102 |
+
There are four kinds of selectors that can be specified in an `ingress` `from` section or `egress` `to` section:
|
| 103 |
+
|
| 104 |
+
**podSelector**: This selects particular Pods in the same namespace as the NetworkPolicy which should be allowed as ingress sources or egress destinations.
|
| 105 |
+
|
| 106 |
+
**namespaceSelector**: This selects particular namespaces for which all Pods should be allowed as ingress sources or egress destinations.
|
| 107 |
+
|
| 108 |
+
**namespaceSelector** *and* **podSelector**: A single `to` / `from` entry that specifies both `namespaceSelector` and `podSelector` selects particular Pods within particular namespaces. Be careful to use correct YAML syntax. For example:
|
| 109 |
+
|
| 110 |
+
```yaml
  ...
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          user: alice
      podSelector:
        matchLabels:
          role: client
  ...
```
|
| 122 |
+
|
| 123 |
+
This policy contains a single `from` element allowing connections from Pods with the label `role=client` in namespaces with the label `user=alice`. But the following policy is different:
|
| 124 |
+
|
| 125 |
+
```yaml
  ...
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          user: alice
    - podSelector:
        matchLabels:
          role: client
  ...
```
|
| 137 |
+
|
| 138 |
+
It contains two elements in the `from` array, and allows connections from Pods in the local Namespace with the label `role=client`, *or* from any Pod in any namespace with the label `user=alice`.
|
| 139 |
+
|
| 140 |
+
When in doubt, use `kubectl describe` to see how Kubernetes has interpreted the policy.
|
| 141 |
+
|
| 142 |
+
**ipBlock**: This selects particular IP CIDR ranges to allow as ingress sources or egress destinations. These should be cluster-external IPs, since Pod IPs are ephemeral and unpredictable.
|
| 143 |
+
|
| 144 |
+
Cluster ingress and egress mechanisms often require rewriting the source or destination IP of packets. In cases where this happens, it is not defined whether this happens before or after NetworkPolicy processing, and the behavior may be different for different combinations of network plugin, cloud provider, `Service` implementation, etc.
|
| 145 |
+
|
| 146 |
+
In the case of ingress, this means that in some cases you may be able to filter incoming packets based on the actual original source IP, while in other cases, the "source IP" that the NetworkPolicy acts on may be the IP of a `LoadBalancer` or of the Pod's node, etc.
|
| 147 |
+
|
| 148 |
+
For egress, this means that connections from pods to `Service` IPs that get rewritten to cluster-external IPs may or may not be subject to `ipBlock`-based policies.
|
| 149 |
+
|
| 150 |
+
## Default policies
|
| 151 |
+
|
| 152 |
+
By default, if no policies exist in a namespace, then all ingress and egress traffic is allowed to and from pods in that namespace. The following examples let you change the default behavior in that namespace.
|
| 153 |
+
|
| 154 |
+
### Default deny all ingress traffic
|
| 155 |
+
|
| 156 |
+
You can create a "default" ingress isolation policy for a namespace by creating a NetworkPolicy that selects all pods but does not allow any ingress traffic to those pods.
|
| 157 |
+
|
| 158 |
+
```yaml
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes:
  - Ingress
```
|
| 169 |
+
|
| 170 |
+
This ensures that even pods that aren't selected by any other NetworkPolicy will still be isolated for ingress. This policy does not affect isolation for egress from any pod.
|
| 171 |
+
|
| 172 |
+
### Allow all ingress traffic
|
| 173 |
+
|
| 174 |
+
If you want to allow all incoming connections to all pods in a namespace, you can create a policy that explicitly allows that.
|
| 175 |
+
|
| 176 |
+
```yaml
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-ingress
spec:
  podSelector: {}
  ingress:
  - {}
  policyTypes:
  - Ingress
```
|
| 189 |
+
|
| 190 |
+
With this policy in place, no additional policy or policies can cause any incoming connection to those pods to be denied. This policy has no effect on isolation for egress from any pod.
|
| 191 |
+
|
| 192 |
+
### Default deny all egress traffic
|
| 193 |
+
|
| 194 |
+
You can create a "default" egress isolation policy for a namespace by creating a NetworkPolicy that selects all pods but does not allow any egress traffic from those pods.
|
| 195 |
+
|
| 196 |
+
```yaml
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
spec:
  podSelector: {}
  policyTypes:
  - Egress
```
|
| 207 |
+
|
| 208 |
+
This ensures that even pods that aren't selected by any other NetworkPolicy will not be allowed egress traffic. This policy does not change the ingress isolation behavior of any pod.
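A default-deny egress policy is usually paired with narrowly scoped allow rules. The sketch below is one such companion policy, permitting DNS lookups only. It assumes that cluster DNS runs in the `kube-system` namespace with the label `k8s-app=kube-dns`, which is common but should be verified in your cluster; the policy name is hypothetical.

```yaml
# Assumption: cluster DNS Pods live in kube-system and carry the label k8s-app=kube-dns.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress   # hypothetical name
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

Because policies are additive, this policy and the default-deny policy can coexist: the union of what they allow is DNS traffic only.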
|
| 209 |
+
|
| 210 |
+
> [!caution] Caution:
|
| 211 |
+
> A default deny-all egress policy also blocks DNS traffic. If your workloads need DNS resolution, you must add a separate NetworkPolicy that allows egress to your cluster's DNS service.
|
| 212 |
+
|
| 213 |
+
### Allow all egress traffic
|
| 214 |
+
|
| 215 |
+
If you want to allow all connections from all pods in a namespace, you can create a policy that explicitly allows all outgoing connections from pods in that namespace.
|
| 216 |
+
|
| 217 |
+
```yaml
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-egress
spec:
  podSelector: {}
  egress:
  - {}
  policyTypes:
  - Egress
```
|
| 230 |
+
|
| 231 |
+
With this policy in place, no additional policy or policies can cause any outgoing connection from those pods to be denied. This policy has no effect on isolation for ingress to any pod.
|
| 232 |
+
|
| 233 |
+
### Default deny all ingress and all egress traffic
|
| 234 |
+
|
| 235 |
+
You can create a "default" policy for a namespace which prevents all ingress AND egress traffic by creating the following NetworkPolicy in that namespace.
|
| 236 |
+
|
| 237 |
+
```yaml
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
```
|
| 249 |
+
|
| 250 |
+
This ensures that even pods that aren't selected by any other NetworkPolicy will not be allowed ingress or egress traffic.
|
| 251 |
+
|
| 252 |
+
## Network traffic filtering
|
| 253 |
+
|
| 254 |
+
NetworkPolicy is defined for [layer 4](https://en.wikipedia.org/wiki/OSI_model#Layer_4:_Transport_layer) connections (TCP, UDP, and optionally SCTP). For all the other protocols, the behaviour may vary across network plugins.
|
| 255 |
+
|
| 256 |
+
> [!info] Note:
|
| 257 |
+
> You must be using a [CNI](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/ "Container network interface (CNI) plugins are a type of Network plugin that adheres to the appc/CNI specification.") plugin that supports SCTP protocol NetworkPolicies.
|
| 258 |
+
|
| 259 |
+
When a `deny all` network policy is defined, it is only guaranteed to deny TCP, UDP and SCTP connections. For other protocols, such as ARP or ICMP, the behaviour is undefined. The same applies to allow rules: when a specific pod is allowed as ingress source or egress destination, it is undefined what happens with (for example) ICMP packets. Protocols such as ICMP may be allowed by some network plugins and denied by others.
|
| 260 |
+
|
| 261 |
+
## Targeting a range of ports
|
| 262 |
+
|
| 263 |
+
FEATURE STATE: `Kubernetes v1.25 [stable]`
|
| 264 |
+
|
| 265 |
+
When writing a NetworkPolicy, you can target a range of ports instead of a single port.
|
| 266 |
+
|
| 267 |
+
You can do this with the `endPort` field, as in the following example:
|
| 268 |
+
|
| 269 |
+
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: multi-port-egress
  namespace: default
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 10.0.0.0/24
    ports:
    - protocol: TCP
      port: 32000
      endPort: 32768
```
|
| 290 |
+
|
| 291 |
+
The above rule allows any Pod with label `role=db` in the namespace `default` to communicate with any IP within the range `10.0.0.0/24` over TCP, provided that the target port is between 32000 and 32768.
|
| 292 |
+
|
| 293 |
+
The following restrictions apply when using this field:
|
| 294 |
+
|
| 295 |
+
- The `endPort` field must be equal to or greater than the `port` field.
|
| 296 |
+
- `endPort` can only be defined if `port` is also defined.
|
| 297 |
+
- Both ports must be numeric.
|
| 298 |
+
|
| 299 |
+
> [!info] Note:
|
| 300 |
+
> Your cluster must be using a [CNI](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/ "Container network interface (CNI) plugins are a type of Network plugin that adheres to the appc/CNI specification.") plugin that supports the `endPort` field in NetworkPolicy specifications. If your [network plugin](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/) does not support the `endPort` field and you specify a NetworkPolicy with that, the policy will be applied only for the single `port` field.
|
| 301 |
+
|
| 302 |
+
## Targeting multiple namespaces by label
|
| 303 |
+
|
| 304 |
+
In this scenario, your `Egress` NetworkPolicy targets more than one namespace using their label names. For this to work, you need to label the target namespaces. For example:
|
| 305 |
+
|
| 306 |
+
```shell
kubectl label namespace frontend namespace=frontend
kubectl label namespace backend namespace=backend
```
|
| 310 |
+
|
| 311 |
+
Add the labels under `namespaceSelector` in your NetworkPolicy document. For example:
|
| 312 |
+
|
| 313 |
+
```yaml
|
| 314 |
+
apiVersion: networking.k8s.io/v1
|
| 315 |
+
kind: NetworkPolicy
|
| 316 |
+
metadata:
|
| 317 |
+
  name: egress-namespaces
|
| 318 |
+
spec:
|
| 319 |
+
  podSelector:
|
| 320 |
+
    matchLabels:
|
| 321 |
+
      app: myapp
|
| 322 |
+
  policyTypes:
|
| 323 |
+
  - Egress
|
| 324 |
+
  egress:
|
| 325 |
+
  - to:
|
| 326 |
+
    - namespaceSelector:
|
| 327 |
+
        matchExpressions:
|
| 328 |
+
        - key: namespace
|
| 329 |
+
          operator: In
|
| 330 |
+
          values: ["frontend", "backend"]
|
| 331 |
+
```
|
| 332 |
+
|
| 333 |
+
> [!info] Note:
|
| 334 |
+
> It is not possible to directly specify the name of the namespaces in a NetworkPolicy. You must use a `namespaceSelector` with `matchLabels` or `matchExpressions` to select the namespaces based on their labels.
|
| 335 |
+
|
| 336 |
+
## Targeting a Namespace by its name
|
| 337 |
+
|
| 338 |
+
The Kubernetes control plane sets an immutable label `kubernetes.io/metadata.name` on all namespaces; the value of the label is the namespace name.
|
| 339 |
+
|
| 340 |
+
While NetworkPolicy cannot target a namespace by its name through a dedicated object field, you can use the standardized label to target a specific namespace.
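
As a minimal sketch of this approach, the policy below selects Pods labeled `app: myapp` and allows egress only to Pods in the namespace named `backend`, which it matches through the `kubernetes.io/metadata.name` label. The policy name, the `app: myapp` label, and the `backend` namespace are assumptions chosen for illustration.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: egress-to-backend        # hypothetical policy name
spec:
  podSelector:
    matchLabels:
      app: myapp                 # hypothetical Pod label
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              # Label set by the control plane; its value is the namespace name.
              kubernetes.io/metadata.name: backend
```

Because this policy restricts egress for the selected Pods, traffic to any other namespace (including DNS lookups served from elsewhere in the cluster) is not allowed unless another rule permits it.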
|
| 341 |
+
|
| 342 |
+
## Pod lifecycle
|
| 343 |
+
|
| 344 |
+
> [!info] Note:
|
| 345 |
+
> The following applies to clusters with a conformant networking plugin and a conformant implementation of NetworkPolicy.
|
| 346 |
+
|
| 347 |
+
When a new NetworkPolicy object is created, it may take some time for a network plugin to handle the new object. If a pod that is affected by a NetworkPolicy is created before the network plugin has completed NetworkPolicy handling, that pod may be started unprotected, and isolation rules will be applied when the NetworkPolicy handling is completed.
|
| 348 |
+
|
| 349 |
+
Once the NetworkPolicy is handled by a network plugin,
|
| 350 |
+
|
| 351 |
+
1. All newly created pods affected by a given NetworkPolicy will be isolated before they are started. Implementations of NetworkPolicy must ensure that filtering is effective throughout the Pod lifecycle, even from the very first instant that any container in that Pod is started. Because they are applied at Pod level, NetworkPolicies apply equally to init containers, sidecar containers, and regular containers.
|
| 352 |
+
2. Allow rules will be applied eventually after the isolation rules (or may be applied at the same time). In the worst case, a newly created pod may have no network connectivity at all when it is first started, if isolation rules were already applied, but no allow rules were applied yet.
|
| 353 |
+
|
| 354 |
+
Every created NetworkPolicy will be handled by a network plugin eventually, but there is no way to tell from the Kubernetes API when exactly that happens.
|
| 355 |
+
|
| 356 |
+
Therefore, pods must be resilient against being started up with different network connectivity than expected. If you need to make sure the pod can reach certain destinations before being started, you can use an [init container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) to wait for those destinations to be reachable before kubelet starts the app containers.
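
The init container pattern might look like the following sketch. It assumes the destination is an HTTP endpoint that answers with a success status at the hypothetical URL shown, and that the `busybox` image's `wget` supports the `--spider` and `-T` options; the Pod name, image names, and URL are placeholders rather than values from this page.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp                                # hypothetical Pod name
spec:
  initContainers:
    - name: wait-for-backend
      image: busybox:1.36
      # Retry until the destination answers. The kubelet starts the app
      # containers only after this init container exits successfully.
      command:
        - sh
        - -c
        - "until wget -q -T 2 --spider http://backend.default.svc.cluster.local:8080/; do echo waiting for backend; sleep 2; done"
  containers:
    - name: app
      image: registry.example/myapp:1.0      # hypothetical application image
```

A check like this only shows that the destination was reachable at startup. The app still needs to tolerate connectivity changes later in the Pod's life.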
|
| 357 |
+
|
| 358 |
+
Every NetworkPolicy will be applied to all selected pods eventually. Because the network plugin may implement NetworkPolicy in a distributed manner, it is possible that pods may see a slightly inconsistent view of network policies when the pod is first created, or when pods or policies change. For example, a newly-created pod that is supposed to be able to reach both Pod A on Node 1 and Pod B on Node 2 may find that it can reach Pod A immediately, but cannot reach Pod B until a few seconds later.
|
| 359 |
+
|
| 360 |
+
## NetworkPolicy and hostNetwork pods
|
| 361 |
+
|
| 362 |
+
NetworkPolicy behaviour for `hostNetwork` pods is undefined, but it should be limited to 2 possibilities:
|
| 363 |
+
|
| 364 |
+
- The network plugin can distinguish `hostNetwork` pod traffic from all other traffic (including being able to distinguish traffic from different `hostNetwork` pods on the same node), and will apply NetworkPolicy to `hostNetwork` pods just like it does to pod-network pods.
|
| 365 |
+
- The network plugin cannot properly distinguish `hostNetwork` pod traffic, and so it ignores `hostNetwork` pods when matching `podSelector` and `namespaceSelector`. Traffic to/from `hostNetwork` pods is treated the same as all other traffic to/from the node IP. (This is the most common implementation.)
|
| 366 |
+
|
| 367 |
+
This applies when
|
| 368 |
+
|
| 369 |
+
1. a `hostNetwork` pod is selected by `spec.podSelector`.
|
| 370 |
+
```yaml
|
| 371 |
+
...
|
| 372 |
+
spec:
|
| 373 |
+
  podSelector:
|
| 374 |
+
    matchLabels:
|
| 375 |
+
      role: client
|
| 376 |
+
...
|
| 377 |
+
```
|
| 378 |
+
2. a `hostNetwork` pod is selected by a `podSelector` or `namespaceSelector` in an `ingress` or `egress` rule.
|
| 379 |
+
```yaml
|
| 380 |
+
...
|
| 381 |
+
ingress:
|
| 382 |
+
  - from:
|
| 383 |
+
      - podSelector:
|
| 384 |
+
          matchLabels:
|
| 385 |
+
            role: client
|
| 386 |
+
...
|
| 387 |
+
```
|
| 388 |
+
|
| 389 |
+
At the same time, since `hostNetwork` pods have the same IP addresses as the nodes they reside on, their connections will be treated as node connections. For example, you can allow traffic from a `hostNetwork` Pod using an `ipBlock` rule.
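
A sketch of such an `ipBlock` rule is shown below. The policy name and the `192.168.10.0/24` range are assumptions; substitute the address range that your nodes actually use. Whether the source address seen by the network plugin is the node IP depends on the plugin and on how the traffic is routed.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-node-range      # hypothetical policy name
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
    - Ingress
  ingress:
    - from:
        - ipBlock:
            # Assumed node IP range; hostNetwork Pods share their node's IP.
            cidr: 192.168.10.0/24
```

This rule cannot single out one `hostNetwork` Pod: it admits any traffic that arrives from an address in the listed range.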
|
| 390 |
+
|
| 391 |
+
## What you can't do with network policies (at least, not yet)
|
| 392 |
+
|
| 393 |
+
As of Kubernetes 1.35, the following functionality does not exist in the NetworkPolicy API, but you might be able to implement workarounds using Operating System components (such as SELinux, OpenVSwitch, IPTables, and so on) or Layer 7 technologies (Ingress controllers, Service Mesh implementations) or admission controllers. In case you are new to network security in Kubernetes, it's worth noting that the following User Stories cannot (yet) be implemented using the NetworkPolicy API.
|
| 394 |
+
|
| 395 |
+
- Forcing internal cluster traffic to go through a common gateway (this might be best served with a service mesh or other proxy).
|
| 396 |
+
- Anything TLS related (use a service mesh or ingress controller for this).
|
| 397 |
+
- Node specific policies (you can use CIDR notation for these, but you cannot target nodes by their Kubernetes identities specifically).
|
| 398 |
+
- Targeting of services by name (you can, however, target pods or namespaces by their [labels](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels "Tags objects with identifying attributes that are meaningful and relevant to users."), which is often a viable workaround).
|
| 399 |
+
- Creation or management of "Policy requests" that are fulfilled by a third party.
|
| 400 |
+
- Default policies which are applied to all namespaces or pods (there are some third party Kubernetes distributions and projects which can do this).
|
| 401 |
+
- Advanced policy querying and reachability tooling.
|
| 402 |
+
- The ability to log network security events (for example connections that are blocked or accepted).
|
| 403 |
+
- The ability to explicitly deny policies (currently the model for NetworkPolicies is deny by default, with only the ability to add allow rules).
|
| 404 |
+
- The ability to prevent loopback or incoming host traffic (Pods cannot currently block localhost access, nor do they have the ability to block access from their resident node).
|
| 405 |
+
|
| 406 |
+
## NetworkPolicy's impact on existing connections
|
| 407 |
+
|
| 408 |
+
When the set of NetworkPolicies that applies to an existing connection changes, it is implementation-defined whether the change takes effect for that existing connection. Such a change can happen either because the NetworkPolicies themselves change or because the relevant labels of the namespaces or pods selected by the policy (both subject and peers) change in the middle of an existing connection. For example, if a policy is created that denies a previously allowed connection, the underlying network plugin implementation is responsible for defining whether that new policy closes the existing connections. It is recommended not to modify policies, pods, or namespaces in ways that might affect existing connections.
|
| 409 |
+
|
| 410 |
+
## What's next
|
| 411 |
+
|
| 412 |
+
- See the [Declare Network Policy](https://kubernetes.io/docs/tasks/administer-cluster/declare-network-policy/) walkthrough for further examples.
|
| 413 |
+
- See more [recipes](https://github.com/ahmetb/kubernetes-network-policy-recipes) for common scenarios enabled by the NetworkPolicy resource.
|
| 414 |
+
|
| 415 |
+
|
| 416 |
+
Last modified March 28, 2026 at 12:37 PM PST: [docs: add caution about DNS being blocked by deny-all egress (0a474b2b1a)](https://github.com/kubernetes/website/commit/0a474b2b1a8d5ac94d09fd5f4ee109a61e6ff511)
|
|
@@ -0,0 +1,339 @@
|
| 1 |
+
Node-pressure eviction is the process by which the [kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet "An agent that runs on each node in the cluster. It makes sure that containers are running in a pod.") proactively terminates pods to reclaim [resources](https://kubernetes.io/docs/reference/glossary/?all=true#term-infrastructure-resource "A defined amount of infrastructure available for consumption (CPU, memory, etc).") on nodes.
|
| 2 |
+
|
| 3 |
+
The [kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet "An agent that runs on each node in the cluster. It makes sure that containers are running in a pod.") monitors resources like memory, disk space, and filesystem inodes on your cluster's nodes. When one or more of these resources reach specific consumption levels, the kubelet can proactively fail one or more pods on the node to reclaim resources and prevent starvation.
|
| 4 |
+
|
| 5 |
+
During a node-pressure eviction, the kubelet sets the [phase](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase) for the selected pods to `Failed`, and terminates the Pod.
|
| 6 |
+
|
| 7 |
+
Node-pressure eviction is not the same as [API-initiated eviction](https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/).
|
| 8 |
+
|
| 9 |
+
The kubelet does not respect your configured [PodDisruptionBudget](https://kubernetes.io/docs/reference/glossary/?all=true#term-pod-disruption-budget "An object that limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions.") or the pod's `terminationGracePeriodSeconds`. If you use [soft eviction thresholds](#soft-eviction-thresholds), the kubelet respects your configured `eviction-max-pod-grace-period`. If you use [hard eviction thresholds](#hard-eviction-thresholds), the kubelet uses a `0s` grace period (immediate shutdown) for termination.
|
| 10 |
+
|
| 11 |
+
## Self healing behavior
|
| 12 |
+
|
| 13 |
+
The kubelet attempts to [reclaim node-level resources](#reclaim-node-resources) before it terminates end-user pods. For example, it removes unused container images when disk resources are starved.
|
| 14 |
+
|
| 15 |
+
If the pods are managed by a [workload](https://kubernetes.io/docs/concepts/workloads/ "A workload is an application running on Kubernetes.") management object (such as [StatefulSet](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/ "A StatefulSet manages deployment and scaling of a set of Pods, with durable storage and persistent identifiers for each Pod.") or [Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/ "Manages a replicated application on your cluster.")) that replaces failed pods, the control plane (`kube-controller-manager`) creates new pods in place of the evicted pods.
|
| 16 |
+
|
| 17 |
+
### Self healing for static pods
|
| 18 |
+
|
| 19 |
+
If you are running a [static pod](https://kubernetes.io/docs/concepts/workloads/pods/#static-pods) on a node that is under resource pressure, the kubelet may evict that static Pod. The kubelet then tries to create a replacement, because static Pods always represent an intent to run a Pod on that node.
|
| 20 |
+
|
| 21 |
+
The kubelet takes the *priority* of the static pod into account when creating a replacement. If the static pod manifest specifies a low priority, and there are higher-priority Pods defined within the cluster's control plane, and the node is under resource pressure, the kubelet may not be able to make room for that static pod. The kubelet continues to attempt to run all static pods even when there is resource pressure on a node.
|
| 22 |
+
|
| 23 |
+
## Eviction signals and thresholds
|
| 24 |
+
|
| 25 |
+
The kubelet uses various parameters to make eviction decisions, like the following:
|
| 26 |
+
|
| 27 |
+
- Eviction signals
|
| 28 |
+
- Eviction thresholds
|
| 29 |
+
- Monitoring intervals
|
| 30 |
+
|
| 31 |
+
### Eviction signals
|
| 32 |
+
|
| 33 |
+
Eviction signals are the current state of a particular resource at a specific point in time. The kubelet uses eviction signals to make eviction decisions by comparing the signals to eviction thresholds, which are the minimum amount of the resource that should be available on the node.
|
| 34 |
+
|
| 35 |
+
The kubelet uses the following eviction signals:
|
| 36 |
+
|
| 37 |
+
| Eviction Signal | Description | Linux Only |
|
| 38 |
+
| --- | --- | --- |
|
| 39 |
+
| `memory.available` | `memory.available`:= `node.status.capacity[memory]` - `node.stats.memory.workingSet` | |
|
| 40 |
+
| `nodefs.available` | `nodefs.available`:= `node.stats.fs.available` | |
|
| 41 |
+
| `nodefs.inodesFree` | `nodefs.inodesFree`:= `node.stats.fs.inodesFree` | • |
|
| 42 |
+
| `imagefs.available` | `imagefs.available`:= `node.stats.runtime.imagefs.available` | |
|
| 43 |
+
| `imagefs.inodesFree` | `imagefs.inodesFree`:= `node.stats.runtime.imagefs.inodesFree` | • |
|
| 44 |
+
| `containerfs.available` | `containerfs.available`:= `node.stats.runtime.containerfs.available` | |
|
| 45 |
+
| `containerfs.inodesFree` | `containerfs.inodesFree`:= `node.stats.runtime.containerfs.inodesFree` | • |
|
| 46 |
+
| `pid.available` | `pid.available`:= `node.stats.rlimit.maxpid` - `node.stats.rlimit.curproc` | • |
|
| 47 |
+
|
| 48 |
+
In this table, the **Description** column shows how kubelet gets the value of the signal. Each signal supports either a percentage or a literal value. The kubelet calculates the percentage value relative to the total capacity associated with the signal.
|
| 49 |
+
|
| 50 |
+
#### Memory signals
|
| 51 |
+
|
| 52 |
+
On Linux nodes, the value for `memory.available` is derived from the cgroupfs instead of tools like `free -m`. This is important because `free -m` does not work in a container, and if users use the [node allocatable](https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#node-allocatable) feature, out-of-resource decisions are made local to the end-user Pod part of the cgroup hierarchy as well as the root node. This [script](https://kubernetes.io/examples/admin/resource/memory-available.sh) or [cgroupv2 script](https://kubernetes.io/examples/admin/resource/memory-available-cgroupv2.sh) reproduces the same set of steps that the kubelet performs to calculate `memory.available`. The kubelet excludes `inactive_file` (the number of bytes of file-backed memory on the inactive LRU list) from its calculation, as it assumes that memory is reclaimable under pressure.
|
| 53 |
+
|
| 54 |
+
On Windows nodes, the value for `memory.available` is derived from the node's global memory commit levels (queried through the [`GetPerformanceInfo()`](https://learn.microsoft.com/windows/win32/api/psapi/nf-psapi-getperformanceinfo) system call) by subtracting the node's global [`CommitTotal`](https://learn.microsoft.com/windows/win32/api/psapi/ns-psapi-performance_information) from the node's [`CommitLimit`](https://learn.microsoft.com/windows/win32/api/psapi/ns-psapi-performance_information). Please note that `CommitLimit` can change if the node's page-file size changes!
|
| 55 |
+
|
| 56 |
+
#### Filesystem signals
|
| 57 |
+
|
| 58 |
+
The kubelet recognizes three specific filesystem identifiers that can be used with eviction signals (`<identifier>.inodesFree` or `<identifier>.available`):
|
| 59 |
+
|
| 60 |
+
1. `nodefs`: The node's main filesystem, used for local disk volumes, emptyDir volumes not backed by memory, log storage, ephemeral storage, and more. For example, `nodefs` contains `/var/lib/kubelet`.
|
| 61 |
+
2. `imagefs`: An optional filesystem that container runtimes can use to store container images (which are the read-only layers) and container writable layers.
|
| 62 |
+
3. `containerfs`: An optional filesystem that container runtimes can use to store the writable layers. Similar to the main filesystem (see `nodefs`), it's used to store local disk volumes, emptyDir volumes not backed by memory, log storage, and ephemeral storage, except for the container images. When `containerfs` is used, the `imagefs` filesystem can be split to only store images (read-only layers) and nothing else.
|
| 63 |
+
|
| 64 |
+
> [!info] Note:
|
| 65 |
+
> FEATURE STATE: `Kubernetes v1.31 [beta]` (enabled by default)
|
| 66 |
+
>
|
| 67 |
+
> The *split image filesystem* feature, which enables support for the `containerfs` filesystem, adds several new eviction signals, thresholds and metrics. To use `containerfs`, the Kubernetes release v1.35 requires the `KubeletSeparateDiskGC` [feature gate](https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/) to be enabled. Currently, only CRI-O (v1.29 or higher) offers the `containerfs` filesystem support.
|
| 68 |
+
|
| 69 |
+
As such, kubelet generally allows three options for container filesystems:
|
| 70 |
+
|
| 71 |
+
- Everything is on the single `nodefs`, also referred to as "rootfs" or simply "root", and there is no dedicated image filesystem.
|
| 72 |
+
- Container storage (see `nodefs`) is on a dedicated disk, and `imagefs` (writable and read-only layers) is separate from the root filesystem. This is often referred to as "split disk" (or "separate disk") filesystem.
|
| 73 |
+
- Container filesystem `containerfs` (same as `nodefs` plus writable layers) is on root and the container images (read-only layers) are stored on separate `imagefs`. This is often referred to as "split image" filesystem.
|
| 74 |
+
|
| 75 |
+
The kubelet will attempt to auto-discover these filesystems with their current configuration directly from the underlying container runtime and will ignore other local node filesystems.
|
| 76 |
+
|
| 77 |
+
The kubelet does not support other container filesystems or storage configurations, and it does not currently support multiple filesystems for images and containers.
|
| 78 |
+
|
| 79 |
+
### Deprecated kubelet garbage collection features
|
| 80 |
+
|
| 81 |
+
Some kubelet garbage collection features are deprecated in favor of eviction:
|
| 82 |
+
|
| 83 |
+
| Existing Flag | Rationale |
|
| 84 |
+
| --- | --- |
|
| 85 |
+
| `--maximum-dead-containers` | deprecated once old logs are stored outside of container's context |
|
| 86 |
+
| `--maximum-dead-containers-per-container` | deprecated once old logs are stored outside of container's context |
|
| 87 |
+
| `--minimum-container-ttl-duration` | deprecated once old logs are stored outside of container's context |
|
| 88 |
+
|
| 89 |
+
### Eviction thresholds
|
| 90 |
+
|
| 91 |
+
You can specify custom eviction thresholds for the kubelet to use when it makes eviction decisions. You can configure [soft](#soft-eviction-thresholds) and [hard](#hard-eviction-thresholds) eviction thresholds.
|
| 92 |
+
|
| 93 |
+
Eviction thresholds have the form `[eviction-signal][operator][quantity]`, where:
|
| 94 |
+
|
| 95 |
+
- `eviction-signal` is the [eviction signal](#eviction-signals) to use.
|
| 96 |
+
- `operator` is the [relational operator](https://en.wikipedia.org/wiki/Relational_operator#Standard_relational_operators) you want, such as `<` (less than).
|
| 97 |
+
- `quantity` is the eviction threshold amount, such as `1Gi`. The value of `quantity` must match the quantity representation used by Kubernetes. You can use either literal values or percentages (`%`).
|
| 98 |
+
|
| 99 |
+
For example, if a node has 10GiB of total memory and you want to trigger eviction if the available memory falls below 1GiB, you can define the eviction threshold as either `memory.available<10%` or `memory.available<1Gi` (you cannot use both).
|
| 100 |
+
|
| 101 |
+
#### Soft eviction thresholds
|
| 102 |
+
|
| 103 |
+
A soft eviction threshold pairs an eviction threshold with a required administrator-specified grace period. The kubelet does not evict pods until the grace period is exceeded. The kubelet returns an error on startup if you do not specify a grace period.
|
| 104 |
+
|
| 105 |
+
You can specify both a soft eviction threshold grace period and a maximum allowed pod termination grace period for kubelet to use during evictions. If you specify a maximum allowed grace period and the soft eviction threshold is met, the kubelet uses the lesser of the two grace periods. If you do not specify a maximum allowed grace period, the kubelet kills evicted pods immediately without graceful termination.
|
| 106 |
+
|
| 107 |
+
You can use the following flags to configure soft eviction thresholds:
|
| 108 |
+
|
| 109 |
+
- `eviction-soft`: A set of eviction thresholds like `memory.available<1.5Gi` that can trigger pod eviction if held over the specified grace period.
|
| 110 |
+
- `eviction-soft-grace-period`: A set of eviction grace periods like `memory.available=1m30s` that define how long a soft eviction threshold must hold before triggering a Pod eviction.
|
| 111 |
+
- `eviction-max-pod-grace-period`: The maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
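
The same settings can also be written in a kubelet configuration file, where the fields `evictionSoft`, `evictionSoftGracePeriod`, and `evictionMaxPodGracePeriod` correspond to the flags above. The following is a minimal sketch; the threshold and grace-period values are illustrative assumptions, not recommendations.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Soft threshold: eviction can start once available memory stays below 1.5Gi...
evictionSoft:
  memory.available: "1.5Gi"
# ...for at least this long.
evictionSoftGracePeriod:
  memory.available: "1m30s"
# Upper bound, in seconds, on the grace period given to Pods evicted
# because a soft threshold was met.
evictionMaxPodGracePeriod: 60
```

Each signal listed under `evictionSoft` needs a matching entry under `evictionSoftGracePeriod`, since a soft threshold without a grace period is rejected at startup.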
|
| 112 |
+
|
| 113 |
+
#### Hard eviction thresholds
|
| 114 |
+
|
| 115 |
+
A hard eviction threshold has no grace period. When a hard eviction threshold is met, the kubelet kills pods immediately without graceful termination to reclaim the starved resource.
|
| 116 |
+
|
| 117 |
+
You can use the `eviction-hard` flag to configure a set of hard eviction thresholds like `memory.available<1Gi`.
|
| 118 |
+
|
| 119 |
+
The kubelet has the following default hard eviction thresholds:
|
| 120 |
+
|
| 121 |
+
- `memory.available<100Mi` (Linux nodes)
|
| 122 |
+
- `memory.available<500Mi` (Windows nodes)
|
| 123 |
+
- `nodefs.available<10%`
|
| 124 |
+
- `imagefs.available<15%`
|
| 125 |
+
- `nodefs.inodesFree<5%` (Linux nodes)
|
| 126 |
+
- `imagefs.inodesFree<5%` (Linux nodes)
|
| 127 |
+
|
| 128 |
+
These default hard eviction thresholds are only set if none of the parameters is changed. If you change the value of any parameter, the other parameters do not inherit their default values and are set to zero. To provide custom values, you should specify all of the thresholds. Alternatively, you can set `MergeDefaultEvictionSettings` to true in the kubelet configuration file; if it is true and any parameter is changed, the other parameters inherit their default values instead of 0.
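
For instance, to raise only the memory threshold on a Linux node without silently zeroing the rest, a kubelet configuration file could restate every default next to the custom value. This is a sketch under that assumption; the `200Mi` figure is purely illustrative.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"    # custom value (illustrative)
  # The remaining entries restate the Linux defaults listed above so that
  # they are not reset to zero.
  nodefs.available: "10%"
  imagefs.available: "15%"
  nodefs.inodesFree: "5%"
  imagefs.inodesFree: "5%"
```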
|
| 129 |
+
|
| 130 |
+
The `containerfs.available` and `containerfs.inodesFree` (Linux nodes) default eviction thresholds will be set as follows:
|
| 131 |
+
|
| 132 |
+
- If a single filesystem is used for everything, then `containerfs` thresholds are set the same as `nodefs`.
|
| 133 |
+
- If separate filesystems are configured for both images and containers, then `containerfs` thresholds are set the same as `imagefs`.
|
| 134 |
+
|
| 135 |
+
Setting custom overrides for thresholds related to `containerfs` is currently not supported, and a warning will be issued if an attempt to do so is made; any provided custom values will, as such, be ignored.
|
| 136 |
+
|
| 137 |
+
## Eviction monitoring interval
|
| 138 |
+
|
| 139 |
+
The kubelet evaluates eviction thresholds based on its configured `housekeeping-interval`, which defaults to `10s`.
|
| 140 |
+
|
| 141 |
+
## Node conditions
|
| 142 |
+
|
| 143 |
+
The kubelet reports [node conditions](https://kubernetes.io/docs/concepts/architecture/nodes/#condition) to reflect that the node is under pressure because a hard or soft eviction threshold is met, independent of configured grace periods.
|
| 144 |
+
|
| 145 |
+
The kubelet maps eviction signals to node conditions as follows:
|
| 146 |
+
|
| 147 |
+
| Node Condition | Eviction Signal | Description |
|
| 148 |
+
| --- | --- | --- |
|
| 149 |
+
| `MemoryPressure` | `memory.available` | Available memory on the node has satisfied an eviction threshold |
|
| 150 |
+
| `DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, `imagefs.inodesFree`, `containerfs.available`, or `containerfs.inodesFree` | Available disk space and inodes on either the node's root filesystem, image filesystem, or container filesystem have satisfied an eviction threshold |
|
| 151 |
+
| `PIDPressure` | `pid.available` | Available process identifiers on the (Linux) node have fallen below an eviction threshold |
|
| 152 |
+
|
| 153 |
+
The control plane also [maps](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-nodes-by-condition) these node conditions to taints.
|
| 154 |
+
|
| 155 |
+
The kubelet updates the node conditions based on the configured `--node-status-update-frequency`, which defaults to `10s`.
|
| 156 |
+
|
| 157 |
+
### Node condition oscillation
|
| 158 |
+
|
| 159 |
+
In some cases, nodes oscillate above and below soft eviction thresholds without holding for the defined grace periods. This causes the reported node condition to constantly switch between `true` and `false`, leading to bad eviction decisions.
|
| 160 |
+
|
| 161 |
+
To protect against oscillation, you can use the `eviction-pressure-transition-period` flag, which controls how long the kubelet must wait before transitioning a node condition to a different state. The transition period has a default value of `5m`.
|
| 162 |
+
|
| 163 |
+
### Reclaiming node level resources
|
| 164 |
+
|
| 165 |
+
The kubelet tries to reclaim node-level resources before it evicts end-user pods.
|
| 166 |
+
|
| 167 |
+
When a `DiskPressure` node condition is reported, the kubelet reclaims node-level resources based on the filesystems on the node.
|
| 168 |
+
|
| 169 |
+
#### Without imagefs or containerfs
|
| 170 |
+
|
| 171 |
+
If the node only has a `nodefs` filesystem that meets eviction thresholds, the kubelet frees up disk space in the following order:
|
| 172 |
+
|
| 173 |
+
1. Garbage collect dead pods and containers.
|
| 174 |
+
2. Delete unused images.
|
| 175 |
+
|
| 176 |
+
#### With imagefs
|
| 177 |
+
|
| 178 |
+
If the node has a dedicated `imagefs` filesystem for container runtimes to use, the kubelet does the following:
|
| 179 |
+
|
| 180 |
+
- If the `nodefs` filesystem meets the eviction thresholds, the kubelet garbage collects dead pods and containers.
|
| 181 |
+
- If the `imagefs` filesystem meets the eviction thresholds, the kubelet deletes all unused images.
|
| 182 |
+
|
| 183 |
+
#### With imagefs and containerfs
|
| 184 |
+
|
| 185 |
+
If the node has a dedicated `containerfs` alongside the `imagefs` filesystem configured for the container runtimes to use, then kubelet will attempt to reclaim resources as follows:
|
| 186 |
+
|
| 187 |
+
- If the `containerfs` filesystem meets the eviction thresholds, the kubelet garbage collects dead pods and containers.
|
| 188 |
+
- If the `imagefs` filesystem meets the eviction thresholds, the kubelet deletes all unused images.
|
| 189 |
+
|
| 190 |
+
### Pod selection for kubelet eviction
|
| 191 |
+
|
| 192 |
+
If the kubelet's attempts to reclaim node-level resources don't bring the eviction signal below the threshold, the kubelet begins to evict end-user pods.
|
| 193 |
+
|
| 194 |
+
The kubelet uses the following parameters to determine the pod eviction order:
|
| 195 |
+
|
| 196 |
+
1. Whether the pod's resource usage exceeds requests
|
| 197 |
+
2. [Pod Priority](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/)
|
| 198 |
+
3. The pod's resource usage relative to requests
|
| 199 |
+
|
| 200 |
+
As a result, kubelet ranks and evicts pods in the following order:
|
| 201 |
+
|
| 202 |
+
1. `BestEffort` or `Burstable` pods where the usage exceeds requests. These pods are evicted based on their Priority and then by how much their usage level exceeds the request.
|
| 203 |
+
2. `Guaranteed` pods and `Burstable` pods where the usage is less than requests are evicted last, based on their Priority.
|
| 204 |
+
|
| 205 |
+
> [!info] Note:
|
| 206 |
+
> The kubelet does not use the pod's [QoS class](https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/) to determine the eviction order. You can use the QoS class to estimate the most likely pod eviction order when reclaiming resources like memory. QoS classification does not apply to EphemeralStorage requests, so the above scenario will not apply if the node is, for example, under `DiskPressure`.
|
| 207 |
+
|
| 208 |
+
`Guaranteed` pods are guaranteed only when requests and limits are specified for all the containers and they are equal. These pods will never be evicted because of another pod's resource consumption. If a system daemon (such as `kubelet` and `journald`) is consuming more resources than were reserved via `system-reserved` or `kube-reserved` allocations, and the node only has `Guaranteed` or `Burstable` pods using less resources than requests left on it, then the kubelet must choose to evict one of these pods to preserve node stability and to limit the impact of resource starvation on other pods. In this case, it will choose to evict pods of lowest Priority first.
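
As an illustration of the requests-equal-limits condition, the following sketch shows a single-container Pod that would be classified as `Guaranteed`. The Pod name, image, and resource amounts are assumptions chosen for the example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example       # hypothetical Pod name
spec:
  containers:
    - name: app
      image: nginx               # placeholder image
      resources:
        # CPU and memory requests equal their limits for every container,
        # which is what makes the Pod Guaranteed.
        requests:
          cpu: "500m"
          memory: "256Mi"
        limits:
          cpu: "500m"
          memory: "256Mi"
```

If any container in the Pod omitted a limit, or set a request lower than its limit, the Pod would fall into the `Burstable` class instead.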
|
| 209 |
+
|
| 210 |
+
If you are running a [static pod](https://kubernetes.io/docs/concepts/workloads/pods/#static-pods) and want to avoid having it evicted under resource pressure, set the `priority` field for that Pod directly. Static pods do not support the `priorityClassName` field.
|
| 211 |
+
|
| 212 |
+
When the kubelet evicts pods in response to inode or process ID starvation, it uses the Pods' relative priority to determine the eviction order, because inodes and PIDs have no requests.
|
| 213 |
+
|
| 214 |
+
The kubelet sorts pods differently based on whether the node has a dedicated `imagefs` or `containerfs` filesystem:
|
| 215 |
+
|
| 216 |
+
#### Without imagefs or containerfs (nodefs and imagefs use the same filesystem)
|
| 217 |
+
|
| 218 |
+
- If `nodefs` triggers evictions, the kubelet sorts pods based on their total disk usage (`local volumes + logs and a writable layer of all containers`).
|
| 219 |
+
|
| 220 |
+
#### With imagefs (nodefs and imagefs filesystems are separate)
|
| 221 |
+
|
| 222 |
+
- If `nodefs` triggers evictions, the kubelet sorts pods based on `nodefs` usage (`local volumes + logs of all containers`).
|
| 223 |
+
- If `imagefs` triggers evictions, the kubelet sorts pods based on the writable layer usage of all containers.
|
| 224 |
+
|
| 225 |
+
#### With imagefs and containerfs (imagefs and containerfs have been split)
|
| 226 |
+
|
| 227 |
+
- If `containerfs` triggers evictions, the kubelet sorts pods based on `containerfs` usage (`local volumes + logs and a writable layer of all containers`).
|
| 228 |
+
- If `imagefs` triggers evictions, the kubelet sorts pods based on the `storage of images` rank, which represents the disk usage of a given image.
|
| 229 |
+
|
| 230 |
+
### Minimum eviction reclaim
|
| 231 |
+
|
| 232 |
+
> [!info] Note:
|
| 233 |
+
> As of Kubernetes v1.35, you cannot set a custom value for the `containerfs.available` metric. The configuration for this specific metric will be set automatically to reflect values set for either the `nodefs` or `imagefs`, depending on the configuration.
|
| 234 |
+
|
| 235 |
+
In some cases, pod eviction only reclaims a small amount of the starved resource. This can lead to the kubelet repeatedly hitting the configured eviction thresholds and triggering multiple evictions.
|
| 236 |
+
|
| 237 |
+
You can use the `--eviction-minimum-reclaim` flag or a [kubelet config file](https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/) to configure a minimum reclaim amount for each resource. When the kubelet notices that a resource is starved, it continues to reclaim that resource until it reclaims the quantity you specify.
|
| 238 |
+
|
| 239 |
+
For example, the following configuration sets minimum reclaim amounts:
|
| 240 |
+
|
| 241 |
+
```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "1Gi"
  imagefs.available: "100Gi"
evictionMinimumReclaim:
  memory.available: "0Mi"
  nodefs.available: "500Mi"
  imagefs.available: "2Gi"
```
|
| 253 |
+
|
| 254 |
+
In this example, if the `nodefs.available` signal meets the eviction threshold, the kubelet reclaims the resource until the signal reaches the threshold of 1GiB, and then continues to reclaim the minimum amount of 500MiB, until the available nodefs storage value reaches 1.5GiB.
|
| 255 |
+
|
| 256 |
+
Similarly, the kubelet tries to reclaim the `imagefs` resource until the `imagefs.available` value reaches `102Gi`, representing 102 GiB of available container image storage. If the amount of storage that the kubelet could reclaim is less than 2GiB, the kubelet doesn't reclaim anything.
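
The arithmetic in this example can be checked with a short sketch. It assumes that the reclaim target is simply the hard eviction threshold plus the minimum reclaim amount; `parse_quantity` and `reclaim_target` are hypothetical helpers that handle only the `Mi` and `Gi` suffixes used here.

```python
# Binary-suffix multipliers for the two units used in the example.
UNITS = {"Mi": 1024**2, "Gi": 1024**3}


def parse_quantity(text):
    """Parse a quantity such as '500Mi' or '1Gi' into bytes (Mi/Gi only)."""
    number, unit = text[:-2], text[-2:]
    return int(number) * UNITS[unit]


def reclaim_target(threshold, minimum_reclaim):
    """Available bytes at which reclaiming stops, under this sketch's assumption."""
    return parse_quantity(threshold) + parse_quantity(minimum_reclaim)


nodefs_target = reclaim_target("1Gi", "500Mi")
imagefs_target = reclaim_target("100Gi", "2Gi")
print(nodefs_target // UNITS["Mi"], "Mi")
print(imagefs_target // UNITS["Gi"], "Gi")
```

Note that `1Gi` plus `500Mi` is 1524Mi, which the text above rounds to 1.5GiB.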
|
| 257 |
+
|
| 258 |
+
The default `eviction-minimum-reclaim` is `0` for all resources.
|
| 259 |
+
|
| 260 |
+
## Node out of memory behavior
|
| 261 |
+
|
| 262 |
+
If the node experiences an *out of memory* (OOM) event prior to the kubelet being able to reclaim memory, the node depends on the [oom\_killer](https://lwn.net/Articles/391222/) to respond.
|
| 263 |
+
|
| 264 |
+
The kubelet sets an `oom_score_adj` value for each container based on the QoS for the pod.
|
| 265 |
+
|
| 266 |
+
| Quality of Service | `oom_score_adj` |
| --- | --- |
| `Guaranteed` | -997 |
| `BestEffort` | 1000 |
| `Burstable` | *min(max(2, 1000 - (1000 × memoryRequestBytes) / machineMemoryCapacityBytes), 999)* |
|
| 271 |
+
|
| 272 |
+
> [!info] Note:
|
| 273 |
+
> The kubelet also sets an `oom_score_adj` value of `-997` for any containers in Pods that have `system-node-critical` [Priority](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#pod-priority "Pod Priority indicates the importance of a Pod relative to other Pods.").
|
| 274 |
+
|
| 275 |
+
If the kubelet can't reclaim memory before a node experiences OOM, the `oom_killer` calculates an `oom_score` for each container based on the percentage of the node's memory that the container is using, and then adds the `oom_score_adj` to get an effective `oom_score`. It then kills the container with the highest score.
|
| 276 |
+
|
| 277 |
+
This means that containers in low QoS pods that consume a large amount of memory relative to their scheduling requests are killed first.
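
As an illustration, the `oom_score_adj` table above translates into the following function. This is a sketch rather than the kubelet's code: the function name is hypothetical, and the use of integer division in the `Burstable` formula is an assumption.

```python
GIB = 1024**3


def oom_score_adj(qos_class, memory_request_bytes, machine_memory_capacity_bytes):
    """Return the oom_score_adj value for a container, following the table above."""
    if qos_class == "Guaranteed":
        return -997
    if qos_class == "BestEffort":
        return 1000
    if qos_class == "Burstable":
        # Integer division is an assumption of this sketch.
        raw = 1000 - (1000 * memory_request_bytes) // machine_memory_capacity_bytes
        return min(max(2, raw), 999)
    raise ValueError(f"unknown QoS class: {qos_class!r}")


# A Burstable container requesting 1 GiB on a node with 10 GiB of memory.
print(oom_score_adj("Burstable", 1 * GIB, 10 * GIB))
```

The clamp keeps every `Burstable` value strictly between the `Guaranteed` value (-997) and the `BestEffort` value (1000): a request of zero yields 999, and a request equal to the node's capacity yields 2.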
|
| 278 |
+
|
| 279 |
+
Unlike pod eviction, if a container is OOM killed, the kubelet can restart it based on its `restartPolicy`.
|
| 280 |
+
|
| 281 |
+
## Good practices
|
| 282 |
+
|
| 283 |
+
The following sections describe good practice for eviction configuration.
|
| 284 |
+
|
| 285 |
+
### Schedulable resources and eviction policies
|
| 286 |
+
|
| 287 |
+
When you configure the kubelet with an eviction policy, make sure that the scheduler does not place pods that would trigger eviction by immediately inducing memory pressure.
|
| 288 |
+
|
| 289 |
+
Consider the following scenario:
|
| 290 |
+
|
| 291 |
+
- Node memory capacity: 10GiB
|
| 292 |
+
- Operator wants to reserve 10% of memory capacity for system daemons (kernel, `kubelet`, etc.)
|
| 293 |
+
- Operator wants to evict Pods at 95% memory utilization to reduce incidence of system OOM.
|
| 294 |
+
|
| 295 |
+
For this to work, the kubelet is launched as follows:
|
| 296 |
+
|
| 297 |
+
```none
|
| 298 |
+
--eviction-hard=memory.available<500Mi
|
| 299 |
+
--system-reserved=memory=1.5Gi
|
| 300 |
+
```
|
| 301 |
+
|
| 302 |
+
In this configuration, the `--system-reserved` flag reserves 1.5GiB of memory for the system, which is `10% of the total memory + the eviction threshold amount`.
|
| 303 |
+
|
| 304 |
+
The node can reach the eviction threshold if a pod is using more than its request, or if the system is using more than 1GiB of memory, which makes the `memory.available` signal fall below 500MiB and triggers the threshold.
|
| 305 |
+
|
| 306 |
+
### DaemonSets and node-pressure eviction
|
| 307 |
+
|
| 308 |
+
Pod priority is a major factor in making eviction decisions. If you do not want the kubelet to evict pods that belong to a DaemonSet, give those pods a high enough priority by specifying a suitable `priorityClassName` in the pod spec. You can also use a lower priority, or the default, to only allow pods from that DaemonSet to run when there are enough resources.
|
| 309 |
+
|
| 310 |
+
## Known issues
|
| 311 |
+
|
| 312 |
+
The following sections describe known issues related to out of resource handling.
|
| 313 |
+
|
| 314 |
+
### kubelet may not observe memory pressure right away
|
| 315 |
+
|
| 316 |
+
By default, the kubelet polls cAdvisor to collect memory usage stats at a regular interval. If memory usage increases rapidly within that window, the kubelet may not observe `MemoryPressure` fast enough, and the OOM killer will still be invoked.
|
| 317 |
+
|
| 318 |
+
You can use the `--kernel-memcg-notification` flag to enable the `memcg` notification API on the kubelet to get notified immediately when a threshold is crossed.
|
| 319 |
+
|
| 320 |
+
If you are not trying to achieve extreme utilization, but a sensible measure of overcommit, a viable workaround for this issue is to use the `--kube-reserved` and `--system-reserved` flags to allocate memory for the system.
|
| 321 |
+
|
| 322 |
+
### active\_file memory is not considered as available memory
|
| 323 |
+
|
| 324 |
+
On Linux, the kernel tracks the number of bytes of file-backed memory on the active least recently used (LRU) list as the `active_file` statistic. The kubelet treats `active_file` memory areas as not reclaimable. For workloads that make intensive use of block-backed local storage, including ephemeral local storage, kernel-level caches of file and block data mean that many recently accessed cache pages are likely to be counted as `active_file`. If enough of these kernel block buffers are on the active LRU list, the kubelet is liable to observe this as high resource use and taint the node as experiencing memory pressure, triggering pod eviction.
|
| 325 |
+
|
| 326 |
+
For more details, see [https://github.com/kubernetes/kubernetes/issues/43916](https://github.com/kubernetes/kubernetes/issues/43916)
|
| 327 |
+
|
| 328 |
+
You can work around that behavior by setting the memory limit and memory request the same for containers likely to perform intensive I/O activity. You will need to estimate or measure an optimal memory limit value for that container.
|
| 329 |
+
|
| 330 |
+
## What's next
|
| 331 |
+
|
| 332 |
+
- Learn about [API-initiated Eviction](https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/)
|
| 333 |
+
- Learn about [Pod Priority and Preemption](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/)
|
| 334 |
+
- Learn about [PodDisruptionBudgets](https://kubernetes.io/docs/tasks/run-application/configure-pdb/)
|
| 335 |
+
- Learn about [Quality of Service](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/) (QoS)
|
| 336 |
+
- Check out the [Eviction API](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.35/#create-eviction-pod-v1-core)
|
| 337 |
+
|
| 338 |
+
|
| 339 |
+
Last modified September 19, 2025 at 9:38 PM PST: [fix: typos (a5d40c68e0)](https://github.com/kubernetes/website/commit/a5d40c68e0dda7c44cff5c6331747b502eede79a)
|
|
@@ -0,0 +1,93 @@
| 1 |
+
An overview of the Pod Security Admission Controller, which can enforce the Pod Security Standards.
|
| 2 |
+
|
| 3 |
+
FEATURE STATE: `Kubernetes v1.25 [stable]`
|
| 4 |
+
|
| 5 |
+
The Kubernetes [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/) define different isolation levels for Pods. These standards let you define how you want to restrict the behavior of pods in a clear, consistent fashion.
|
| 6 |
+
|
| 7 |
+
Kubernetes offers a built-in *Pod Security* [admission controller](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/ "A piece of code that intercepts requests to the Kubernetes API server prior to persistence of the object.") to enforce the Pod Security Standards. Pod security restrictions are applied at the [namespace](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces "An abstraction used by Kubernetes to support isolation of groups of resources within a single cluster.") level when pods are created.
|
| 8 |
+
|
| 9 |
+
### Built-in Pod Security admission enforcement
|
| 10 |
+
|
| 11 |
+
This page is part of the documentation for Kubernetes v1.35. If you are running a different version of Kubernetes, consult the documentation for that release.
|
| 12 |
+
|
| 13 |
+
## Pod Security levels
|
| 14 |
+
|
| 15 |
+
Pod Security admission places requirements on a Pod's [Security Context](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/) and other related fields according to the three levels defined by the [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/): `privileged`, `baseline`, and `restricted`. Refer to the [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/) page for an in-depth look at those requirements.
|
| 16 |
+
|
| 17 |
+
## Pod Security Admission labels for namespaces
|
| 18 |
+
|
| 19 |
+
Once the feature is enabled or the webhook is installed, you can configure namespaces to define the admission control mode you want to use for pod security in each namespace. Kubernetes defines a set of [labels](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels "Tags objects with identifying attributes that are meaningful and relevant to users.") that you can set to define which of the predefined Pod Security Standard levels you want to use for a namespace. The label you select defines what action the [control plane](https://kubernetes.io/docs/reference/glossary/?all=true#term-control-plane "The container orchestration layer that exposes the API and interfaces to define, deploy, and manage the lifecycle of containers.") takes if a potential violation is detected:
|
| 20 |
+
|
| 21 |
+
| Mode | Description |
| --- | --- |
| **enforce** | Policy violations will cause the pod to be rejected. |
| **audit** | Policy violations will trigger the addition of an audit annotation to the event recorded in the [audit log](https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/), but are otherwise allowed. |
| **warn** | Policy violations will trigger a user-facing warning, but are otherwise allowed. |
|
| 26 |
+
|
| 27 |
+
A namespace can configure any or all modes, or even set a different level for different modes.
|
| 28 |
+
|
| 29 |
+
For each mode, there are two labels that determine the policy used:
|
| 30 |
+
|
| 31 |
+
```yaml
# The per-mode level label indicates which policy level to apply for the mode.
#
# MODE must be one of `enforce`, `audit`, or `warn`.
# LEVEL must be one of `privileged`, `baseline`, or `restricted`.
pod-security.kubernetes.io/<MODE>: <LEVEL>

# Optional: per-mode version label that can be used to pin the policy to the
# version that shipped with a given Kubernetes minor version (for example v1.35).
#
# MODE must be one of `enforce`, `audit`, or `warn`.
# VERSION must be a valid Kubernetes minor version, or `latest`.
pod-security.kubernetes.io/<MODE>-version: <VERSION>
```
|
| 45 |
+
|
| 46 |
+
Check out [Enforce Pod Security Standards with Namespace Labels](https://kubernetes.io/docs/tasks/configure-pod-container/enforce-standards-namespace-labels/) to see example usage.
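
As a rough sketch of how the two label forms combine, the helper below builds the label dictionary for a namespace. `pod_security_labels` is a hypothetical function, not part of any Kubernetes client library, and it does not validate the version string.

```python
MODES = ("enforce", "audit", "warn")
LEVELS = ("privileged", "baseline", "restricted")
LABEL_PREFIX = "pod-security.kubernetes.io/"


def pod_security_labels(mode, level, version=None):
    """Build the namespace labels for one mode (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"MODE must be one of {MODES}, got {mode!r}")
    if level not in LEVELS:
        raise ValueError(f"LEVEL must be one of {LEVELS}, got {level!r}")
    labels = {f"{LABEL_PREFIX}{mode}": level}
    if version is not None:
        # The version is passed through unchecked in this sketch.
        labels[f"{LABEL_PREFIX}{mode}-version"] = version
    return labels


# A namespace can set a different level for different modes.
namespace_labels = {}
namespace_labels.update(pod_security_labels("enforce", "baseline", version="v1.35"))
namespace_labels.update(pod_security_labels("warn", "restricted"))
print(namespace_labels)
```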
|
| 47 |
+
|
| 48 |
+
## Workload resources and Pod templates
|
| 49 |
+
|
| 50 |
+
Pods are often created indirectly, by creating a [workload object](https://kubernetes.io/docs/concepts/workloads/controllers/) such as a [Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/ "Manages a replicated application on your cluster.") or [Job](https://kubernetes.io/docs/concepts/workloads/controllers/job/ "A finite or batch task that runs to completion."). The workload object defines a *Pod template* and a [controller](https://kubernetes.io/docs/concepts/architecture/controller/ "A control loop that watches the shared state of the cluster through the apiserver and makes changes attempting to move the current state towards the desired state.") for the workload resource creates Pods based on that template. To help catch violations early, both the audit and warning modes are applied to the workload resources. However, enforce mode is **not** applied to workload resources, only to the resulting pod objects.
|
| 51 |
+
|
| 52 |
+
## Exemptions
|
| 53 |
+
|
| 54 |
+
You can define *exemptions* from pod security enforcement in order to allow the creation of pods that would have otherwise been prohibited due to the policy associated with a given namespace. Exemptions can be statically configured in the [Admission Controller configuration](https://kubernetes.io/docs/tasks/configure-pod-container/enforce-standards-admission-controller/#configure-the-admission-controller).
|
| 55 |
+
|
| 56 |
+
Exemptions must be explicitly enumerated. Requests meeting exemption criteria are *ignored* by the Admission Controller (all `enforce`, `audit` and `warn` behaviors are skipped). Exemption dimensions include:
|
| 57 |
+
|
| 58 |
+
- **Usernames:** requests from users with an exempt authenticated (or impersonated) username are ignored.
|
| 59 |
+
- **RuntimeClassNames:** pods and [workload resources](#workload-resources-and-pod-templates) specifying an exempt runtime class name are ignored.
|
| 60 |
+
- **Namespaces:** pods and [workload resources](#workload-resources-and-pod-templates) in an exempt namespace are ignored.
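
The three exemption dimensions above amount to a simple membership check. The sketch below illustrates that logic only. `Exemptions` and `is_exempt` are hypothetical names, and the real exemptions are configured statically in the Admission Controller configuration rather than in code like this.

```python
from dataclasses import dataclass, field


@dataclass
class Exemptions:
    """Hypothetical container for explicitly enumerated exemptions."""
    usernames: set = field(default_factory=set)
    runtime_class_names: set = field(default_factory=set)
    namespaces: set = field(default_factory=set)


def is_exempt(exemptions, username, namespace, runtime_class_name=None):
    """Return True if a request matches any exemption dimension."""
    return (
        username in exemptions.usernames
        or namespace in exemptions.namespaces
        or (
            runtime_class_name is not None
            and runtime_class_name in exemptions.runtime_class_names
        )
    )


exemptions = Exemptions(namespaces={"kube-system"})
print(is_exempt(exemptions, username="alice", namespace="kube-system"))
print(is_exempt(exemptions, username="alice", namespace="default"))
```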
|
| 61 |
+
|
| 62 |
+
> [!caution] Caution:
|
| 63 |
+
> Most pods are created by a controller in response to a [workload resource](#workload-resources-and-pod-templates), meaning that exempting an end user will only exempt them from enforcement when creating pods directly, but not when creating a workload resource. Controller service accounts (such as `system:serviceaccount:kube-system:replicaset-controller`) should generally not be exempted, as doing so would implicitly exempt any user that can create the corresponding workload resource.
|
| 64 |
+
|
| 65 |
+
Updates to the following pod fields are exempt from policy checks, meaning that if a pod update request only changes these fields, it will not be denied even if the pod is in violation of the current policy level:
|
| 66 |
+
|
| 67 |
+
- Any metadata updates **except** changes to the seccomp or AppArmor annotations:
|
| 68 |
+
- `seccomp.security.alpha.kubernetes.io/pod` (deprecated)
|
| 69 |
+
- `container.seccomp.security.alpha.kubernetes.io/*` (deprecated)
|
| 70 |
+
- `container.apparmor.security.beta.kubernetes.io/*` (deprecated)
|
| 71 |
+
- Valid updates to `.spec.activeDeadlineSeconds`
|
| 72 |
+
- Valid updates to `.spec.tolerations`
|
| 73 |
+
|
| 74 |
+
## Metrics
|
| 75 |
+
|
| 76 |
+
Here are the Prometheus metrics exposed by kube-apiserver:
|
| 77 |
+
|
| 78 |
+
- `pod_security_errors_total`: This metric indicates the number of errors preventing normal evaluation. Non-fatal errors may result in the latest restricted profile being used for enforcement.
|
| 79 |
+
- `pod_security_evaluations_total`: This metric indicates the number of policy evaluations that have occurred, not counting ignored or exempt requests during exporting.
|
| 80 |
+
- `pod_security_exemptions_total`: This metric indicates the number of exempt requests, not counting ignored or out of scope requests.
|
| 81 |
+
|
| 82 |
+
## What's next
|
| 83 |
+
|
| 84 |
+
- [Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/)
|
| 85 |
+
- [Enforcing Pod Security Standards](https://kubernetes.io/docs/setup/best-practices/enforcing-pod-security-standards/)
|
| 86 |
+
- [Enforce Pod Security Standards by Configuring the Built-in Admission Controller](https://kubernetes.io/docs/tasks/configure-pod-container/enforce-standards-admission-controller/)
|
| 87 |
+
- [Enforce Pod Security Standards with Namespace Labels](https://kubernetes.io/docs/tasks/configure-pod-container/enforce-standards-namespace-labels/)
|
| 88 |
+
|
| 89 |
+
If you are running an older version of Kubernetes and want to upgrade to a version of Kubernetes that does not include PodSecurityPolicies, read [migrate from PodSecurityPolicy to the Built-In PodSecurity Admission Controller](https://kubernetes.io/docs/tasks/configure-pod-container/migrate-from-psp/).
|
| 90 |
+
|
| 91 |
+
|
| 92 |
+
|
| 93 |
+
Last modified March 07, 2024 at 4:54 PM PST: [AppArmor v1.30 docs update (4f11f83a45)](https://github.com/kubernetes/website/commit/4f11f83a451b55d2e79ccd0472058b9f59e562ed)
|
|
@@ -0,0 +1,305 @@
| 1 |
+
*Pods* are the smallest deployable units of computing that you can create and manage in Kubernetes.
|
| 2 |
+
|
| 3 |
+
A *Pod* (as in a pod of whales or pea pod) is a group of one or more [containers](https://kubernetes.io/docs/concepts/containers/ "A lightweight and portable executable image that contains software and all of its dependencies."), with shared storage and network resources, and a specification for how to run the containers. A Pod's contents are always co-located and co-scheduled, and run in a shared context. A Pod models an application-specific "logical host": it contains one or more application containers which are relatively tightly coupled. In non-cloud contexts, applications executed on the same physical or virtual machine are analogous to cloud applications executed on the same logical host.
|
| 4 |
+
|
| 5 |
+
As well as application containers, a Pod can contain [init containers](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/ "One or more initialization containers that must run to completion before any app containers run.") that run during Pod startup. You can also inject [ephemeral containers](https://kubernetes.io/docs/concepts/workloads/pods/ephemeral-containers/ "A type of container that you can temporarily run inside a Pod") for debugging a running Pod.
|
| 6 |
+
|
| 7 |
+
## What is a Pod?
|
| 8 |
+
|
| 9 |
+
> [!info] Note:
|
| 10 |
+
> You need to install a [container runtime](https://kubernetes.io/docs/setup/production-environment/container-runtimes/) into each node in the cluster so that Pods can run there.
|
| 11 |
+
|
| 12 |
+
The shared context of a Pod is a set of Linux namespaces, cgroups, and potentially other facets of isolation - the same things that isolate a [container](https://kubernetes.io/docs/concepts/containers/ "A lightweight and portable executable image that contains software and all of its dependencies."). Within a Pod's context, the individual applications may have further sub-isolations applied.
|
| 13 |
+
|
| 14 |
+
A Pod is similar to a set of containers with shared namespaces and shared filesystem volumes.
|
| 15 |
+
|
| 16 |
+
Pods in a Kubernetes cluster are used in two main ways:
|
| 17 |
+
|
| 18 |
+
- **Pods that run a single container**. The "one-container-per-Pod" model is the most common Kubernetes use case; in this case, you can think of a Pod as a wrapper around a single container; Kubernetes manages Pods rather than managing the containers directly.
|
| 19 |
+
- **Pods that run multiple containers that need to work together**. A Pod can encapsulate an application composed of [multiple co-located containers](#how-pods-manage-multiple-containers) that are tightly coupled and need to share resources. These co-located containers form a single cohesive unit.
|
| 20 |
+
Grouping multiple co-located and co-managed containers in a single Pod is a relatively advanced use case. You should use this pattern only in specific instances in which your containers are tightly coupled.
|
| 21 |
+
You don't need to run multiple containers to provide replication (for resilience or capacity); if you need multiple replicas, see [Workload management](https://kubernetes.io/docs/concepts/workloads/controllers/).
|
| 22 |
+
|
| 23 |
+
## Using Pods
|
| 24 |
+
|
| 25 |
+
The following is an example of a Pod which consists of a container running the image `nginx:1.14.2`.
|
| 26 |
+
|
| 27 |
+
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80
```
|
| 39 |
+
|
| 40 |
+
To create the Pod shown above, run the following command:
|
| 41 |
+
|
| 42 |
+
```shell
|
| 43 |
+
kubectl apply -f https://k8s.io/examples/pods/simple-pod.yaml
|
| 44 |
+
```
|
| 45 |
+
|
| 46 |
+
Pods are generally not created directly; instead, they are created using workload resources. See [Working with Pods](#working-with-pods) for more information on how Pods are used with workload resources.
|
| 47 |
+
|
| 48 |
+
### Workload resources for managing pods
|
| 49 |
+
|
| 50 |
+
Usually you don't need to create Pods directly, even singleton Pods. Instead, create them using workload resources such as [Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/ "Manages a replicated application on your cluster.") or [Job](https://kubernetes.io/docs/concepts/workloads/controllers/job/ "A finite or batch task that runs to completion."). If your Pods need to track state, consider the [StatefulSet](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/ "A StatefulSet manages deployment and scaling of a set of Pods, with durable storage and persistent identifiers for each Pod.") resource.
|
| 51 |
+
|
| 52 |
+
Each Pod is meant to run a single instance of a given application. If you want to scale your application horizontally (to provide more overall resources by running more instances), you should use multiple Pods, one for each instance. In Kubernetes, this is typically referred to as *replication*. Replicated Pods are usually created and managed as a group by a workload resource and its [controller](https://kubernetes.io/docs/concepts/architecture/controller/ "A control loop that watches the shared state of the cluster through the apiserver and makes changes attempting to move the current state towards the desired state.").
|
| 53 |
+
|
| 54 |
+
See [Pods and controllers](#pods-and-controllers) for more information on how Kubernetes uses workload resources, and their controllers, to implement application scaling and auto-healing.
|
| 55 |
+
|
| 56 |
+
Pods natively provide two kinds of shared resources for their constituent containers: [networking](#pod-networking) and [storage](#pod-storage).
|
| 57 |
+
|
| 58 |
+
## Working with Pods
|
| 59 |
+
|
| 60 |
+
You'll rarely create individual Pods directly in Kubernetes—even singleton Pods. This is because Pods are designed as relatively ephemeral, disposable entities. When a Pod gets created (directly by you, or indirectly by a [controller](https://kubernetes.io/docs/concepts/architecture/controller/ "A control loop that watches the shared state of the cluster through the apiserver and makes changes attempting to move the current state towards the desired state.")), the new Pod is scheduled to run on a [Node](https://kubernetes.io/docs/concepts/architecture/nodes/ "A node is a worker machine in Kubernetes.") in your cluster. The Pod remains on that node until the Pod finishes execution, the Pod object is deleted, the Pod is *evicted* for lack of resources, or the node fails.
|
| 61 |
+
|
| 62 |
+
> [!info] Note:
|
| 63 |
+
> Restarting a container in a Pod should not be confused with restarting a Pod. A Pod is not a process, but an environment for running container(s). A Pod persists until it is deleted.
|
| 64 |
+
|
| 65 |
+
The name of a Pod must be a valid [DNS subdomain](https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-subdomain-names) value, but this can produce unexpected results for the Pod hostname. For best compatibility, the name should follow the more restrictive rules for a [DNS label](https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names).
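
To make the difference between the two naming rules concrete, here is a small sketch of both checks. The regular expressions reflect a common reading of the RFC 1123 rules described on the linked pages and should be treated as an approximation; the function names are hypothetical.

```python
import re

# One label: lowercase alphanumerics or '-', starting and ending with an alphanumeric.
_LABEL = r"[a-z0-9]([-a-z0-9]*[a-z0-9])?"
DNS_LABEL_RE = re.compile(_LABEL)
# A subdomain: one or more labels separated by dots.
DNS_SUBDOMAIN_RE = re.compile(rf"{_LABEL}(\.{_LABEL})*")


def is_dns_label(name):
    """Check the more restrictive DNS label rules (at most 63 characters)."""
    return len(name) <= 63 and DNS_LABEL_RE.fullmatch(name) is not None


def is_dns_subdomain(name):
    """Check the DNS subdomain rules (at most 253 characters)."""
    return len(name) <= 253 and DNS_SUBDOMAIN_RE.fullmatch(name) is not None


for candidate in ("nginx", "web.frontend", "Nginx"):
    print(candidate, is_dns_subdomain(candidate), is_dns_label(candidate))
```

A name such as `web.frontend` passes the subdomain check but not the label check, which is the case the compatibility advice above is about.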
|
| 66 |
+
|
| 67 |
+
### Pod OS
|
| 68 |
+
|
| 69 |
+
FEATURE STATE: `Kubernetes v1.25 [stable]`
|
| 70 |
+
|
| 71 |
+
You should set the `.spec.os.name` field to either `windows` or `linux` to indicate the OS on which you want the pod to run. These two are the only operating systems supported for now by Kubernetes. In the future, this list may be expanded.
|
| 72 |
+
|
| 73 |
+
In Kubernetes v1.35, the value of `.spec.os.name` does not affect how the [kube-scheduler](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/ "Control plane component that watches for newly created pods with no assigned node, and selects a node for them to run on.") picks a node for the Pod to run on. In any cluster where there is more than one operating system for running nodes, you should set the [kubernetes.io/os](https://kubernetes.io/docs/reference/labels-annotations-taints/#kubernetes-io-os) label correctly on each node, and define pods with a `nodeSelector` based on the operating system label. The kube-scheduler assigns your pod to a node based on other criteria and may or may not succeed in picking a suitable node placement where the node OS is right for the containers in that Pod. The [Pod security standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/) also use this field to avoid enforcing policies that aren't relevant to the operating system.
|
| 74 |
+
|
| 75 |
+
### Pods and controllers
|
| 76 |
+
|
| 77 |
+
You can use workload resources to create and manage multiple Pods for you. A controller for the resource handles replication and rollout and automatic healing in case of Pod failure. For example, if a Node fails, a controller notices that Pods on that Node have stopped working and creates a replacement Pod. The scheduler places the replacement Pod onto a healthy Node.
|
| 78 |
+
|
| 79 |
+
Here are some examples of workload resources that manage one or more Pods:
|
| 80 |
+
|
| 81 |
+
- [Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/ "Manages a replicated application on your cluster.")
|
| 82 |
+
- [StatefulSet](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/ "A StatefulSet manages deployment and scaling of a set of Pods, with durable storage and persistent identifiers for each Pod.")
|
| 83 |
+
- [DaemonSet](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset "Ensures a copy of a Pod is running across a set of nodes in a cluster.")
|
| 84 |
+
|
| 85 |
+
### Specifying a Workload reference
|
| 86 |
+
|
| 87 |
+
FEATURE STATE: `Kubernetes v1.35 [alpha]` (disabled by default)
|
| 88 |
+
|
| 89 |
+
By default, Kubernetes schedules every Pod individually. However, some tightly-coupled applications need a group of Pods to be scheduled simultaneously to function correctly.
|
| 90 |
+
|
| 91 |
+
You can link a Pod to a [Workload](https://kubernetes.io/docs/concepts/workloads/workload-api/) object using a [Workload reference](https://kubernetes.io/docs/concepts/workloads/pods/workload-reference/). This tells the `kube-scheduler` that the Pod is part of a specific group, enabling it to make coordinated placement decisions for the entire group at once.
|
| 92 |
+
|
| 93 |
+
### Pod templates
|
| 94 |
+
|
| 95 |
+
Controllers for [workload](https://kubernetes.io/docs/concepts/workloads/ "A workload is an application running on Kubernetes.") resources create Pods from a *pod template* and manage those Pods on your behalf.
|
| 96 |
+
|
| 97 |
+
PodTemplates are specifications for creating Pods, and are included in workload resources such as [Deployments](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/), [Jobs](https://kubernetes.io/docs/concepts/workloads/controllers/job/), and [DaemonSets](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/).
|
| 98 |
+
|
| 99 |
+
Each controller for a workload resource uses the `PodTemplate` inside the workload object to make actual Pods. The `PodTemplate` is part of the desired state of whatever workload resource you used to run your app.
|
| 100 |
+
|
| 101 |
+
When you create a Pod, you can include [environment variables](https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/) in the Pod template for the containers that run in the Pod.
|
| 102 |
+
|
| 103 |
+
The sample below is a manifest for a simple Job with a `template` that starts one container. The container in that Pod prints a message then pauses.
|
| 104 |
+
|
| 105 |
+
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hello
spec:
  template:
    # This is the pod template
    spec:
      containers:
      - name: hello
        image: busybox:1.28
        command: ['sh', '-c', 'echo "Hello, Kubernetes!" && sleep 3600']
      restartPolicy: OnFailure
    # The pod template ends here
```
|
| 121 |
+
|
| 122 |
+
Modifying the pod template or switching to a new pod template has no direct effect on the Pods that already exist. If you change the pod template for a workload resource, that resource needs to create replacement Pods that use the updated template.
|
| 123 |
+
|
| 124 |
+
For example, the StatefulSet controller ensures that the running Pods match the current pod template for each StatefulSet object. If you edit the StatefulSet to change its pod template, the StatefulSet starts to create new Pods based on the updated template. Eventually, all of the old Pods are replaced with new Pods, and the update is complete.
|
| 125 |
+
|
| 126 |
+
Each workload resource implements its own rules for handling changes to the Pod template. If you want to read more about StatefulSet specifically, read [Update strategy](https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#updating-statefulsets) in the StatefulSet Basics tutorial.
|
| 127 |
+
|
| 128 |
+
On Nodes, the [kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet "An agent that runs on each node in the cluster. It makes sure that containers are running in a pod.") does not directly observe or manage any of the details around pod templates and updates; those details are abstracted away. That abstraction and separation of concerns simplifies system semantics, and makes it feasible to extend the cluster's behavior without changing existing code.
|
| 129 |
+
|
| 130 |
+
## Pod update and replacement
|
| 131 |
+
|
| 132 |
+
As mentioned in the previous section, when the Pod template for a workload resource is changed, the controller creates new Pods based on the updated template instead of updating or patching the existing Pods.
|
| 133 |
+
|
| 134 |
+
Kubernetes doesn't prevent you from managing Pods directly. It is possible to update some fields of a running Pod, in place. However, Pod update operations like [`patch`](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.35/#patch-pod-v1-core) and [`replace`](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.35/#replace-pod-v1-core) have some limitations:
|
| 135 |
+
|
| 136 |
+
- Most of the metadata about a Pod is immutable. For example, you cannot change the `namespace`, `name`, `uid`, or `creationTimestamp` fields.
|
| 137 |
+
- If the `metadata.deletionTimestamp` is set, no new entry can be added to the `metadata.finalizers` list.
|
| 138 |
+
- Pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds`, `spec.terminationGracePeriodSeconds`, `spec.tolerations` or `spec.schedulingGates`. For `spec.tolerations`, you can only add new entries.
|
| 139 |
+
- When updating the `spec.activeDeadlineSeconds` field, two types of updates are allowed:
|
| 140 |
+
1. setting the unassigned field to a positive number;
|
| 141 |
+
2. updating the field from a positive number to a smaller, non-negative number.
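To illustrate one of the permitted in-place changes, the fragment below is a sketch of a patch body that updates only a container image. The container name `hello` and the tag `busybox:1.36` are assumptions chosen for illustration.

```yaml
# Patch body (illustrative): changes spec.containers[*].image only.
# Containers are matched by name, so other fields are left untouched.
spec:
  containers:
  - name: hello            # assumed name of an existing container in the Pod
    image: busybox:1.36    # assumed replacement image tag
```

Assuming this is saved as `patch.yaml` (a hypothetical filename), it could be applied with `kubectl patch pod <pod-name> --patch-file patch.yaml`. A patch that touches a field outside the list above should be rejected by the API server.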
|
| 142 |
+
|
| 143 |
+
### Pod subresources
|
| 144 |
+
|
| 145 |
+
The above update rules apply to regular pod updates, but other pod fields can be updated through *subresources*.
|
| 146 |
+
|
| 147 |
+
- **Resize:** The `resize` subresource allows container resources (`spec.containers[*].resources`) to be updated. See [Resize Container Resources](https://kubernetes.io/docs/tasks/configure-pod-container/resize-container-resources/) for more details.
|
| 148 |
+
- **Ephemeral Containers:** The `ephemeralContainers` subresource allows [ephemeral containers](https://kubernetes.io/docs/concepts/workloads/pods/ephemeral-containers/ "A type of container that you can temporarily run inside a Pod") to be added to a Pod. See [Ephemeral Containers](https://kubernetes.io/docs/concepts/workloads/pods/ephemeral-containers/) for more details.
|
| 149 |
+
- **Status:** The `status` subresource allows the pod status to be updated. This is typically only used by the Kubelet and other system controllers.
|
| 150 |
+
- **Binding:** The `binding` subresource allows setting the pod's `spec.nodeName` via a `Binding` request. This is typically only used by the [scheduler](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/ "Control plane component that watches for newly created pods with no assigned node, and selects a node for them to run on.").
|
| 151 |
+
|
| 152 |
+
### Pod generation
|
| 153 |
+
|
| 154 |
+
- The `metadata.generation` field is unique. It will be automatically set by the system such that new pods have a `metadata.generation` of 1, and every update to mutable fields in the pod's spec will increment the `metadata.generation` by 1.
|
| 155 |
+
|
| 156 |
+
FEATURE STATE: `Kubernetes v1.35 [stable]` (enabled by default)
|
| 157 |
+
|
| 158 |
+
- `observedGeneration` is a field that is captured in the `status` section of the Pod object. The kubelet sets `status.observedGeneration` so that the reported status can be tied to a specific generation: it reflects the `metadata.generation` of the Pod at the point that the Pod status is being reported.
|
| 159 |
+
|
| 160 |
+
> [!info] Note:
|
| 161 |
+
> The `status.observedGeneration` field is managed by the kubelet and external controllers should **not** modify this field.
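As a rough illustration of how the two fields relate, the excerpt below shows what a Pod object might look like after one update to a mutable spec field. The Pod name and the values are assumptions, and the excerpt is not a complete object.

```yaml
# Illustrative excerpt of a Pod object, not a complete manifest
metadata:
  name: hello              # hypothetical Pod name
  generation: 2            # started at 1, incremented by one spec update
status:
  observedGeneration: 2    # the reported status corresponds to generation 2
```

If `status.observedGeneration` is lower than `metadata.generation`, the reported status has not yet caught up with the latest spec change.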
|
| 162 |
+
|
| 163 |
+
Different status fields may either be associated with the `metadata.generation` of the current sync loop, or with the `metadata.generation` of the previous sync loop. The key distinction is whether a change in the `spec` is reflected directly in the `status` or is an indirect result of a running process.
|
| 164 |
+
|
| 165 |
+
#### Direct Status Updates
|
| 166 |
+
|
| 167 |
+
For status fields where the allocated spec is directly reflected, the `observedGeneration` will be associated with the current `metadata.generation` (Generation N).
|
| 168 |
+
|
| 169 |
+
This behavior applies to:
|
| 170 |
+
|
| 171 |
+
- **Resize Status**: The status of a resource resize operation.
|
| 172 |
+
- **Allocated Resources**: The resources allocated to the Pod after a resize.
|
| 173 |
+
- **Ephemeral Containers**: When a new ephemeral container is added, and it is in `Waiting` state.
|
| 174 |
+
|
| 175 |
+
#### Indirect Status Updates
|
| 176 |
+
|
| 177 |
+
For status fields that are an indirect result of running the spec, the `observedGeneration` will be associated with the `metadata.generation` of the previous sync loop (Generation N-1).
|
| 178 |
+
|
| 179 |
+
This behavior applies to:
|
| 180 |
+
|
| 181 |
+
- **Container Image**: The `ContainerStatus.ImageID` reflects the image from the previous generation until the new image is pulled and the container is updated.
|
| 182 |
+
- **Actual Resources**: During an in-progress resize, the actual resources in use still belong to the previous generation's request.
|
| 183 |
+
- **Container state**: During an in-progress resize whose resize policy requires a restart, the container state reflects the previous generation's request.
|
| 184 |
+
- **activeDeadlineSeconds** & **terminationGracePeriodSeconds** & **deletionTimestamp**: The effects of these fields on the Pod's status are a result of the previously observed specification.
|
| 185 |
+
|
| 186 |
+
## Resource sharing and communication
|
| 187 |
+
|
| 188 |
+
Pods enable data sharing and communication among their constituent containers.
|
| 189 |
+
|
| 190 |
+
### Storage in Pods
|
| 191 |
+
|
| 192 |
+
A Pod can specify a set of shared storage [volumes](https://kubernetes.io/docs/concepts/storage/volumes/ "A directory containing data, accessible to the containers in a pod."). All containers in the Pod can access the shared volumes, allowing those containers to share data. Volumes also allow persistent data in a Pod to survive in case one of the containers within needs to be restarted. See [Storage](https://kubernetes.io/docs/concepts/storage/) for more information on how Kubernetes implements shared storage and makes it available to Pods.
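A minimal sketch of shared storage, assuming an `emptyDir` volume and two illustrative containers (the names `shared-volume-demo`, `writer`, and `reader` are hypothetical): one container writes a file that the other reads through the same volume.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-volume-demo     # hypothetical name
spec:
  volumes:
  - name: shared-data
    emptyDir: {}               # lives as long as the Pod, survives container restarts
  containers:
  - name: writer
    image: busybox:1.28
    command: ['sh', '-c', 'while true; do date > /data/now.txt; sleep 5; done']
    volumeMounts:
    - name: shared-data
      mountPath: /data
  - name: reader
    image: busybox:1.28
    command: ['sh', '-c', 'while true; do cat /data/now.txt; sleep 5; done']
    volumeMounts:
    - name: shared-data
      mountPath: /data
```

An `emptyDir` volume is only one option; other volume types offer different lifetimes and backing storage.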
|
| 193 |
+
|
| 194 |
+
### Pod networking
|
| 195 |
+
|
| 196 |
+
Each Pod is assigned a unique IP address for each address family. Every container in a Pod shares the network namespace, including the IP address and network ports. Inside a Pod (and **only** then), the containers that belong to the Pod can communicate with one another using `localhost`. When containers in a Pod communicate with entities *outside the Pod*, they must coordinate how they use the shared network resources (such as ports). Within a Pod, containers share an IP address and port space, and can find each other via `localhost`. The containers in a Pod can also communicate with each other using standard inter-process communications like SystemV semaphores or POSIX shared memory. Containers in different Pods have distinct IP addresses and cannot communicate by OS-level IPC without special configuration. Containers that want to interact with a container running in a different Pod can use IP networking to communicate.
|
| 197 |
+
|
| 198 |
+
Containers within the Pod see the system hostname as being the same as the configured `name` for the Pod. There's more about this in the [networking](https://kubernetes.io/docs/concepts/cluster-administration/networking/) section.
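The `localhost` behavior described in this section can be sketched with a two-container Pod. The names and images here are illustrative assumptions: the `client` container reaches the `web` container on port 80 without knowing the Pod IP.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: localhost-demo         # hypothetical name
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - containerPort: 80
  - name: client
    image: busybox:1.28
    # Both containers share one network namespace, so localhost:80 is the web container.
    command: ['sh', '-c', 'while true; do wget -q -O- http://localhost:80 > /dev/null && echo ok; sleep 10; done']
```

Because the port space is shared, a second container in this Pod could not also listen on port 80.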
|
| 199 |
+
|
| 200 |
+
## Pod security settings
|
| 201 |
+
|
| 202 |
+
To set security constraints on Pods and containers, you use the `securityContext` field in the Pod specification. This field gives you granular control over what a Pod or individual containers can do. See [Advanced Pod Configuration](https://kubernetes.io/docs/concepts/workloads/pods/advanced-pod-config/) for more details.
|
| 203 |
+
|
| 204 |
+
For basic security configuration, you should meet the Baseline Pod security standard and run containers as non-root. You can set simple security contexts:
|
| 205 |
+
|
| 206 |
+
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: security-context-demo
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
  containers:
  - name: sec-ctx-demo
    image: busybox
    command: ["sh", "-c", "sleep 1h"]
```
|
| 221 |
+
|
| 222 |
+
For advanced security context configuration including capabilities, seccomp profiles, and detailed security options, see the [security concepts](https://kubernetes.io/docs/concepts/security/) section.
|
| 223 |
+
|
| 224 |
+
- To learn about kernel-level security constraints that you can use, see [Linux kernel security constraints for Pods and containers](https://kubernetes.io/docs/concepts/security/linux-kernel-security-constraints/).
|
| 225 |
+
- To learn more about the Pod security context, see [Configure a Security Context for a Pod or Container](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/).
|
| 226 |
+
|
| 227 |
+
## Resource requests and limits
|
| 228 |
+
|
| 229 |
+
When you specify a Pod, you can optionally specify how much of each resource a container needs. The most common resources to specify are CPU and memory (RAM).
|
| 230 |
+
|
| 231 |
+
When you specify the resource *request* for containers in a Pod, the kube-scheduler uses this information to decide which node to place the Pod on. When you specify a resource *limit* for a container, the kubelet enforces those limits so that the running container is not allowed to use more of that resource than the limit you set.
|
| 232 |
+
|
| 233 |
+
CPU limits are enforced by CPU throttling. When a container approaches its CPU limit, the kernel restricts its access to CPU. Memory limits are enforced by the kernel with out-of-memory (OOM) kills when a container exceeds its limit.
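For illustration, a container's requests and limits might be declared as follows. The Pod name and the specific quantities are assumptions chosen only to show the shape of the fields.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resources-demo         # hypothetical name
spec:
  containers:
  - name: app
    image: busybox:1.28
    command: ['sh', '-c', 'sleep 3600']
    resources:
      requests:                # used by the kube-scheduler for placement
        cpu: 250m
        memory: 64Mi
      limits:                  # enforced on the node while the container runs
        cpu: 500m
        memory: 128Mi
```

With these values, the container could be throttled above half a CPU and would be a candidate for an OOM kill if it exceeded 128Mi of memory.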
|
| 234 |
+
|
| 235 |
+
> [!info] Note:
|
| 236 |
+
> Setting CPU limits involves a trade-off. CPU limits help prevent noisy neighbor problems where a single workload starves others on the same node. This is especially important in multi-tenant environments. However, CPU limits can cause throttling even when the node has spare CPU capacity, potentially degrading latency-sensitive workload performance. Whether to set CPU limits depends on your environment, workload characteristics, and isolation requirements.
|
| 237 |
+
|
| 238 |
+
For details on resource units, enforcement behavior, and configuration examples, see [Resource Management for Pods and Containers](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/).
|
| 239 |
+
|
| 240 |
+
## Static Pods
|
| 241 |
+
|
| 242 |
+
*Static Pods* are managed directly by the kubelet daemon on a specific node, without the [API server](https://kubernetes.io/docs/concepts/architecture/#kube-apiserver "Control plane component that serves the Kubernetes API.") observing them. Whereas most Pods are managed by the control plane (for example, a [Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/ "Manages a replicated application on your cluster.")), for static Pods, the kubelet directly supervises each static Pod (and restarts it if it fails).
|
| 243 |
+
|
| 244 |
+
Static Pods are always bound to one [Kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet "An agent that runs on each node in the cluster. It makes sure that containers are running in a pod.") on a specific node. The main use for static Pods is to run a self-hosted control plane: in other words, using the kubelet to supervise the individual [control plane components](https://kubernetes.io/docs/concepts/architecture/#control-plane-components).
|
| 245 |
+
|
| 246 |
+
The kubelet automatically tries to create a [mirror Pod](https://kubernetes.io/docs/reference/glossary/?all=true#term-mirror-pod "An object in the API server that tracks a static pod on a kubelet.") on the Kubernetes API server for each static Pod. This means that the Pods running on a node are visible on the API server, but cannot be controlled from there. See the guide [Create static Pods](https://kubernetes.io/docs/tasks/configure-pod-container/static-pod/) for more information.
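As a sketch of how a node is typically pointed at its static Pod manifests, the kubelet configuration file can name a directory to watch. The path shown is an assumption (it is a common default in kubeadm-based clusters) and may differ in your environment.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Directory the kubelet scans for static Pod manifests (assumed path)
staticPodPath: /etc/kubernetes/manifests
```

Any Pod manifest placed in that directory is run by the kubelet on that node, independently of workload controllers.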
|
| 247 |
+
|
| 248 |
+
> [!info] Note:
|
| 249 |
+
> The `spec` of a static Pod cannot refer to other API objects (e.g., [ServiceAccount](https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/ "Provides an identity for processes that run in a Pod."), [ConfigMap](https://kubernetes.io/docs/concepts/configuration/configmap/ "An API object used to store non-confidential data in key-value pairs. Can be consumed as environment variables, command-line arguments, or configuration files in a volume."), [Secret](https://kubernetes.io/docs/concepts/configuration/secret/ "Stores sensitive information, such as passwords, OAuth tokens, and ssh keys."), etc).
|
| 250 |
+
|
| 251 |
+
## Pods with multiple containers
|
| 252 |
+
|
| 253 |
+
Pods are designed to support multiple cooperating processes (as containers) that form a cohesive unit of service. The containers in a Pod are automatically co-located and co-scheduled on the same physical or virtual machine in the cluster. The containers can share resources and dependencies, communicate with one another, and coordinate when and how they are terminated.
|
| 254 |
+
|
| 255 |
+
Pods in a Kubernetes cluster are used in two main ways:
|
| 256 |
+
|
| 257 |
+
- **Pods that run a single container**. The "one-container-per-Pod" model is the most common Kubernetes use case; in this case, you can think of a Pod as a wrapper around a single container; Kubernetes manages Pods rather than managing the containers directly.
|
| 258 |
+
- **Pods that run multiple containers that need to work together**. A Pod can encapsulate an application composed of multiple co-located containers that are tightly coupled and need to share resources. These co-located containers form a single cohesive unit of service: for example, one container serving data stored in a shared volume to the public, while a separate [sidecar container](https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/ "An auxiliary container that stays running throughout the lifecycle of a Pod.") refreshes or updates those files. The Pod wraps these containers, storage resources, and an ephemeral network identity together as a single unit.
|
| 259 |
+
|
| 260 |
+
For example, you might have a container that acts as a web server for files in a shared volume, and a separate [sidecar container](https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/) that updates those files from a remote source, as in the following diagram:
|
| 261 |
+
|
| 262 |
+

|
| 263 |
+
|
| 264 |
+
Pod creation diagram
|
| 265 |
+
|
| 266 |
+
Some Pods have [init containers](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/ "One or more initialization containers that must run to completion before any app containers run.") as well as [app containers](https://kubernetes.io/docs/reference/glossary/?all=true#term-app-container "A container used to run part of a workload. Compare with init container."). By default, init containers run and complete before the app containers are started.
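A minimal sketch of that ordering, with hypothetical names: the init container must exit successfully before the app container is started.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: init-demo              # hypothetical name
spec:
  initContainers:
  - name: prepare
    image: busybox:1.28
    command: ['sh', '-c', 'echo preparing && sleep 5']
  containers:
  - name: app
    image: busybox:1.28
    command: ['sh', '-c', 'echo app started && sleep 3600']
```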
|
| 267 |
+
|
| 268 |
+
You can also have [sidecar containers](https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/) that provide auxiliary services to the main application Pod (for example: a service mesh).
|
| 269 |
+
|
| 270 |
+
FEATURE STATE: `Kubernetes v1.33 [stable]` (enabled by default)
|
| 271 |
+
|
| 272 |
+
Enabled by default, the `SidecarContainers` [feature gate](https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/) allows you to specify `restartPolicy: Always` for init containers. Setting the `Always` restart policy ensures that the containers where you set it are treated as *sidecars* that are kept running during the entire lifetime of the Pod. Containers that you explicitly define as sidecar containers start up before the main application Pod and remain running until the Pod is shut down.
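The sidecar form can be sketched as an init container with `restartPolicy: Always`. The names, log path, and commands below are illustrative assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sidecar-demo           # hypothetical name
spec:
  volumes:
  - name: logs
    emptyDir: {}
  initContainers:
  - name: log-tailer
    image: busybox:1.28
    restartPolicy: Always      # marks this init container as a sidecar
    command: ['sh', '-c', 'tail -F /var/log/app/app.log']
    volumeMounts:
    - name: logs
      mountPath: /var/log/app
  containers:
  - name: app
    image: busybox:1.28
    command: ['sh', '-c', 'while true; do date >> /var/log/app/app.log; sleep 5; done']
    volumeMounts:
    - name: logs
      mountPath: /var/log/app
```

Unlike a regular init container, `log-tailer` is not expected to run to completion; it starts first and keeps running alongside `app`.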
|
| 273 |
+
|
| 274 |
+
## Container probes
|
| 275 |
+
|
| 276 |
+
A *probe* is a diagnostic performed periodically by the kubelet on a container. To perform a diagnostic, the kubelet can invoke different actions:
|
| 277 |
+
|
| 278 |
+
- `ExecAction` (performed with the help of the container runtime)
|
| 279 |
+
- `TCPSocketAction` (checked directly by the kubelet)
|
| 280 |
+
- `HTTPGetAction` (checked directly by the kubelet)
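In a manifest, these three actions correspond to the `exec`, `tcpSocket`, and `httpGet` probe fields. The sketch below uses one of each; the Pod name, timings, and the file checked by the `exec` command are assumptions for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo             # hypothetical name
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - containerPort: 80
    startupProbe:
      exec:                    # ExecAction: runs a command inside the container
        command: ['sh', '-c', 'test -f /etc/nginx/nginx.conf']
      periodSeconds: 2
      failureThreshold: 30
    readinessProbe:
      tcpSocket:               # TCPSocketAction: succeeds if the port accepts a connection
        port: 80
      periodSeconds: 5
    livenessProbe:
      httpGet:                 # HTTPGetAction: succeeds on a 2xx or 3xx response
        path: /
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 10
```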
|
| 281 |
+
|
| 282 |
+
You can read more about [probes](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes) in the Pod Lifecycle documentation.
|
| 283 |
+
|
| 284 |
+
## What's next
|
| 285 |
+
|
| 286 |
+
- Learn about the [lifecycle of a Pod](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/).
|
| 287 |
+
- Read about [PodDisruptionBudget](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/) and how you can use it to manage application availability during disruptions.
|
| 288 |
+
- Pod is a top-level resource in the Kubernetes REST API. The [Pod](https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/) object definition describes the object in detail.
|
| 289 |
+
- [The Distributed System Toolkit: Patterns for Composite Containers](https://kubernetes.io/blog/2015/06/the-distributed-system-toolkit-patterns/) explains common layouts for Pods with more than one container.
|
| 290 |
+
- Read about [Pod topology spread constraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/)
|
| 291 |
+
- Read [Advanced Pod Configuration](https://kubernetes.io/docs/concepts/workloads/pods/advanced-pod-config/) to learn the topic in detail. That page covers aspects of Pod configuration beyond the essentials, including:
|
| 292 |
+
- PriorityClasses
|
| 293 |
+
- RuntimeClasses
|
| 294 |
+
- advanced ways to configure *scheduling*: the way that Kubernetes decides which node a Pod should run on.
|
| 295 |
+
|
| 296 |
+
To understand the context for why Kubernetes wraps a common Pod API in other resources (such as [StatefulSets](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/ "A StatefulSet manages deployment and scaling of a set of Pods, with durable storage and persistent identifiers for each Pod.") or [Deployments](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/ "Manages a replicated application on your cluster.")), you can read about the prior art, including:
|
| 297 |
+
|
| 298 |
+
- [Aurora](https://aurora.apache.org/documentation/latest/reference/configuration/#job-schema)
|
| 299 |
+
- [Borg](https://research.google/pubs/large-scale-cluster-management-at-google-with-borg/)
|
| 300 |
+
- [Marathon](https://github.com/d2iq-archive/marathon)
|
| 301 |
+
- [Omega](https://research.google/pubs/pub41684/)
|
| 302 |
+
- [Tupperware](https://engineering.fb.com/data-center-engineering/tupperware/).
|
| 303 |
+
|
| 304 |
+
|
| 305 |
+
Last modified February 28, 2026 at 10:29 PM PST: [add resource requests and limits trade-off (79b3410c32)](https://github.com/kubernetes/website/commit/79b3410c328e4225eb7a9384ca2a6cb0a3b7c5ce)
|
|
@@ -0,0 +1,399 @@
|
| 1 |
+
A ReplicaSet's purpose is to maintain a stable set of replica Pods running at any given time. Usually, you define a Deployment and let that Deployment manage ReplicaSets automatically.
|
| 2 |
+
|
| 3 |
+
A ReplicaSet's purpose is to maintain a stable set of replica Pods running at any given time. As such, it is often used to guarantee the availability of a specified number of identical Pods.
|
| 4 |
+
|
| 5 |
+
## How a ReplicaSet works
|
| 6 |
+
|
| 7 |
+
A ReplicaSet is defined with fields, including a selector that specifies how to identify Pods it can acquire, a number of replicas indicating how many Pods it should be maintaining, and a pod template specifying the data of new Pods it should create to meet the number of replicas criteria. A ReplicaSet then fulfills its purpose by creating and deleting Pods as needed to reach the desired number. When a ReplicaSet needs to create new Pods, it uses its Pod template.
|
| 8 |
+
|
| 9 |
+
A ReplicaSet is linked to its Pods via the Pods' [metadata.ownerReferences](https://kubernetes.io/docs/concepts/architecture/garbage-collection/#owners-dependents) field, which specifies what resource the current object is owned by. All Pods acquired by a ReplicaSet have their owning ReplicaSet's identifying information within their ownerReferences field. It's through this link that the ReplicaSet knows of the state of the Pods it is maintaining and plans accordingly.
|
| 10 |
+
|
| 11 |
+
A ReplicaSet identifies new Pods to acquire by using its selector. If there is a Pod that has no OwnerReference or the OwnerReference is not a [Controller](https://kubernetes.io/docs/concepts/architecture/controller/ "A control loop that watches the shared state of the cluster through the apiserver and makes changes attempting to move the current state towards the desired state.") and it matches a ReplicaSet's selector, it will be immediately acquired by said ReplicaSet.
|
| 12 |
+
|
| 13 |
+
## When to use a ReplicaSet
|
| 14 |
+
|
| 15 |
+
A ReplicaSet ensures that a specified number of pod replicas are running at any given time. However, a Deployment is a higher-level concept that manages ReplicaSets and provides declarative updates to Pods along with a lot of other useful features. Therefore, we recommend using Deployments instead of directly using ReplicaSets, unless you require custom update orchestration or don't require updates at all.
|
| 16 |
+
|
| 17 |
+
This actually means that you may never need to manipulate ReplicaSet objects: use a Deployment instead, and define your application in the spec section.
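As a sketch of that recommendation, a Deployment carries the same selector and pod template that a ReplicaSet would, and creates the ReplicaSet for you. The names, labels, and image below are illustrative choices.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend               # illustrative name
  labels:
    app: guestbook
    tier: frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      tier: frontend
  template:
    metadata:
      labels:
        tier: frontend
    spec:
      containers:
      - name: php-redis
        image: us-docker.pkg.dev/google-samples/containers/gke/gb-frontend:v5
```

Changing the pod template of this Deployment triggers a rollout to a new ReplicaSet, which is the update orchestration a bare ReplicaSet does not provide.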
|
| 18 |
+
|
| 19 |
+
## Example
|
| 20 |
+
|
| 21 |
+
```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: frontend
  labels:
    app: guestbook
    tier: frontend
spec:
  # modify replicas according to your case
  replicas: 3
  selector:
    matchLabels:
      tier: frontend
  template:
    metadata:
      labels:
        tier: frontend
    spec:
      containers:
      - name: php-redis
        image: us-docker.pkg.dev/google-samples/containers/gke/gb-frontend:v5
```
|
| 44 |
+
|
| 45 |
+
Saving this manifest into `frontend.yaml` and submitting it to a Kubernetes cluster will create the defined ReplicaSet and the Pods that it manages.
|
| 46 |
+
|
| 47 |
+
```shell
|
| 48 |
+
kubectl apply -f https://kubernetes.io/examples/controllers/frontend.yaml
|
| 49 |
+
```
|
| 50 |
+
|
| 51 |
+
You can then get the current ReplicaSets deployed:
|
| 52 |
+
|
| 53 |
+
```shell
|
| 54 |
+
kubectl get rs
|
| 55 |
+
```
|
| 56 |
+
|
| 57 |
+
And see the frontend one you created:
|
| 58 |
+
|
| 59 |
+
```
NAME       DESIRED   CURRENT   READY   AGE
frontend   3         3         3       6s
```
|
| 63 |
+
|
| 64 |
+
You can also check on the state of the ReplicaSet:
|
| 65 |
+
|
| 66 |
+
```shell
|
| 67 |
+
kubectl describe rs/frontend
|
| 68 |
+
```
|
| 69 |
+
|
| 70 |
+
And you will see output similar to:
|
| 71 |
+
|
| 72 |
+
```
Name:         frontend
Namespace:    default
Selector:     tier=frontend
Labels:       app=guestbook
              tier=frontend
Annotations:  <none>
Replicas:     3 current / 3 desired
Pods Status:  3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  tier=frontend
  Containers:
   php-redis:
    Image:        us-docker.pkg.dev/google-samples/containers/gke/gb-frontend:v5
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type    Reason            Age   From                   Message
  ----    ------            ----  ----                   -------
  Normal  SuccessfulCreate  13s   replicaset-controller  Created pod: frontend-gbgfx
  Normal  SuccessfulCreate  13s   replicaset-controller  Created pod: frontend-rwz57
  Normal  SuccessfulCreate  13s   replicaset-controller  Created pod: frontend-wkl7w
```
|
| 98 |
+
|
| 99 |
+
And lastly you can check for the Pods brought up:
|
| 100 |
+
|
| 101 |
+
```shell
|
| 102 |
+
kubectl get pods
|
| 103 |
+
```
|
| 104 |
+
|
| 105 |
+
You should see Pod information similar to:
|
| 106 |
+
|
| 107 |
+
```
NAME             READY   STATUS    RESTARTS   AGE
frontend-gbgfx   1/1     Running   0          10m
frontend-rwz57   1/1     Running   0          10m
frontend-wkl7w   1/1     Running   0          10m
```
|
| 113 |
+
|
| 114 |
+
You can also verify that the owner reference of these pods is set to the frontend ReplicaSet. To do this, get the yaml of one of the Pods running:
|
| 115 |
+
|
| 116 |
+
```shell
|
| 117 |
+
kubectl get pods frontend-gbgfx -o yaml
|
| 118 |
+
```
|
| 119 |
+
|
| 120 |
+
The output will look similar to this, with the frontend ReplicaSet's info set in the metadata's ownerReferences field:
|
| 121 |
+
|
| 122 |
+
```yaml
|
| 123 |
+
apiVersion: v1
|
| 124 |
+
kind: Pod
|
| 125 |
+
metadata:
|
| 126 |
+
creationTimestamp: "2024-02-28T22:30:44Z"
|
| 127 |
+
generateName: frontend-
|
| 128 |
+
labels:
|
| 129 |
+
tier: frontend
|
| 130 |
+
name: frontend-gbgfx
|
| 131 |
+
namespace: default
|
| 132 |
+
ownerReferences:
|
| 133 |
+
- apiVersion: apps/v1
|
| 134 |
+
blockOwnerDeletion: true
|
| 135 |
+
controller: true
|
| 136 |
+
kind: ReplicaSet
|
| 137 |
+
name: frontend
|
| 138 |
+
uid: e129deca-f864-481b-bb16-b27abfd92292
|
| 139 |
+
...
|
| 140 |
+
```
|
| 141 |
+
|
| 142 |
+
## Non-Template Pod acquisitions
|
| 143 |
+
|
| 144 |
+
While you can create bare Pods with no problems, it is strongly recommended to make sure that the bare Pods do not have labels which match the selector of one of your ReplicaSets. This is because a ReplicaSet is not limited to owning Pods specified by its template; it can acquire other Pods in the manner specified in the previous sections.
|
| 145 |
+
|
| 146 |
+
Take the previous frontend ReplicaSet example, and the Pods specified in the following manifest:
|
| 147 |
+
|
| 148 |
+
```yaml
|
| 149 |
+
apiVersion: v1
|
| 150 |
+
kind: Pod
|
| 151 |
+
metadata:
|
| 152 |
+
name: pod1
|
| 153 |
+
labels:
|
| 154 |
+
tier: frontend
|
| 155 |
+
spec:
|
| 156 |
+
containers:
|
| 157 |
+
- name: hello1
|
| 158 |
+
image: gcr.io/google-samples/hello-app:2.0
|
| 159 |
+
|
| 160 |
+
---
|
| 161 |
+
|
| 162 |
+
apiVersion: v1
|
| 163 |
+
kind: Pod
|
| 164 |
+
metadata:
|
| 165 |
+
name: pod2
|
| 166 |
+
labels:
|
| 167 |
+
tier: frontend
|
| 168 |
+
spec:
|
| 169 |
+
containers:
|
| 170 |
+
- name: hello2
|
| 171 |
+
image: gcr.io/google-samples/hello-app:1.0
|
| 172 |
+
```
|
| 173 |
+
|
| 174 |
+
As those Pods do not have a Controller (or any object) as their owner reference and match the selector of the frontend ReplicaSet, they will immediately be acquired by it.
|
| 175 |
+
|
| 176 |
+
Suppose you create the Pods after the frontend ReplicaSet has been deployed and has set up its initial Pod replicas to fulfill its replica count requirement:
|
| 177 |
+
|
| 178 |
+
```shell
|
| 179 |
+
kubectl apply -f https://kubernetes.io/examples/pods/pod-rs.yaml
|
| 180 |
+
```
|
| 181 |
+
|
| 182 |
+
The new Pods will be acquired by the ReplicaSet, and then immediately terminated as the ReplicaSet would be over its desired count.
|
| 183 |
+
|
| 184 |
+
Fetching the Pods:
|
| 185 |
+
|
| 186 |
+
```shell
|
| 187 |
+
kubectl get pods
|
| 188 |
+
```
|
| 189 |
+
|
| 190 |
+
The output shows that the new Pods are either already terminated, or in the process of being terminated:
|
| 191 |
+
|
| 192 |
+
```
|
| 193 |
+
NAME READY STATUS RESTARTS AGE
|
| 194 |
+
frontend-b2zdv 1/1 Running 0 10m
|
| 195 |
+
frontend-vcmts 1/1 Running 0 10m
|
| 196 |
+
frontend-wtsmm 1/1 Running 0 10m
|
| 197 |
+
pod1 0/1 Terminating 0 1s
|
| 198 |
+
pod2 0/1 Terminating 0 1s
|
| 199 |
+
```
|
| 200 |
+
|
| 201 |
+
If you create the Pods first:
|
| 202 |
+
|
| 203 |
+
```shell
|
| 204 |
+
kubectl apply -f https://kubernetes.io/examples/pods/pod-rs.yaml
|
| 205 |
+
```
|
| 206 |
+
|
| 207 |
+
And then create the ReplicaSet:
|
| 208 |
+
|
| 209 |
+
```shell
|
| 210 |
+
kubectl apply -f https://kubernetes.io/examples/controllers/frontend.yaml
|
| 211 |
+
```
|
| 212 |
+
|
| 213 |
+
You will see that the ReplicaSet has acquired the Pods and has created only as many new ones, according to its spec, as are needed for the total of new and original Pods to match its desired count. Fetch the Pods:
|
| 214 |
+
|
| 215 |
+
```shell
|
| 216 |
+
kubectl get pods
|
| 217 |
+
```
|
| 218 |
+
|
| 219 |
+
The output shows:
|
| 220 |
+
|
| 221 |
+
```
|
| 222 |
+
NAME READY STATUS RESTARTS AGE
|
| 223 |
+
frontend-hmmj2 1/1 Running 0 9s
|
| 224 |
+
pod1 1/1 Running 0 36s
|
| 225 |
+
pod2 1/1 Running 0 36s
|
| 226 |
+
```
|
| 227 |
+
|
| 228 |
+
In this manner, a ReplicaSet can own a non-homogeneous set of Pods.
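
The acquisition rule described in this section can be modelled in a few lines of Python. This is an illustrative sketch only, not the controller's code: the `would_acquire` helper and the dictionary layout are assumptions made for the example, and only `matchLabels`-style selectors are handled.

```python
def has_controller_owner(pod: dict) -> bool:
    """Return True if any ownerReference on the Pod is marked as its controller."""
    owners = pod.get("metadata", {}).get("ownerReferences", [])
    return any(ref.get("controller") for ref in owners)


def would_acquire(pod: dict, match_labels: dict) -> bool:
    """Hypothetical helper: a Pod is a candidate for acquisition when it has no
    controller owner and its labels satisfy every matchLabels entry."""
    labels = pod.get("metadata", {}).get("labels", {})
    selected = all(labels.get(k) == v for k, v in match_labels.items())
    return selected and not has_controller_owner(pod)


# A bare Pod like pod1 from the manifest above.
bare_pod = {"metadata": {"name": "pod1", "labels": {"tier": "frontend"}}}

# A Pod that already has the frontend ReplicaSet as its controller.
owned_pod = {
    "metadata": {
        "name": "frontend-gbgfx",
        "labels": {"tier": "frontend"},
        "ownerReferences": [
            {"kind": "ReplicaSet", "name": "frontend", "controller": True}
        ],
    }
}

selector = {"tier": "frontend"}
print(would_acquire(bare_pod, selector))
print(would_acquire(owned_pod, selector))
```

The first call reports the bare Pod as a candidate because its label matches and it has no controller; the second does not, because the Pod is already controlled.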
|
| 229 |
+
|
| 230 |
+
## Writing a ReplicaSet manifest
|
| 231 |
+
|
| 232 |
+
As with all other Kubernetes API objects, a ReplicaSet needs the `apiVersion`, `kind`, and `metadata` fields. For ReplicaSets, the `kind` is always `ReplicaSet`.
|
| 233 |
+
|
| 234 |
+
When the control plane creates new Pods for a ReplicaSet, the `.metadata.name` of the ReplicaSet is part of the basis for naming those Pods. The name of a ReplicaSet must be a valid [DNS subdomain](https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-subdomain-names) value, but this can produce unexpected results for the Pod hostnames. For best compatibility, the name should follow the more restrictive rules for a [DNS label](https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names).
|
| 235 |
+
|
| 236 |
+
A ReplicaSet also needs a [`.spec` section](https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status).
|
| 237 |
+
|
| 238 |
+
### Pod Template
|
| 239 |
+
|
| 240 |
+
The `.spec.template` is a [pod template](https://kubernetes.io/docs/concepts/workloads/pods/#pod-templates) which is also required to have labels in place. In our `frontend.yaml` example we had one label: `tier: frontend`. Be careful not to overlap with the selectors of other controllers, lest they try to adopt this Pod.
|
| 241 |
+
|
| 242 |
+
For the template's [restart policy](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy) field, `.spec.template.spec.restartPolicy`, the only allowed value is `Always`, which is the default.
|
| 243 |
+
|
| 244 |
+
### Pod Selector
|
| 245 |
+
|
| 246 |
+
The `.spec.selector` field is a [label selector](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/). As discussed [earlier](#how-a-replicaset-works) these are the labels used to identify potential Pods to acquire. In our `frontend.yaml` example, the selector was:
|
| 247 |
+
|
| 248 |
+
```yaml
|
| 249 |
+
matchLabels:
|
| 250 |
+
tier: frontend
|
| 251 |
+
```
|
| 252 |
+
|
| 253 |
+
In the ReplicaSet, `.spec.template.metadata.labels` must match `.spec.selector`, or it will be rejected by the API.
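
As a rough local pre-check of this rule, the sketch below compares a manifest's `matchLabels` with its template labels before you submit it. The function name `template_matches_selector` is hypothetical, the manifest is assumed to be already parsed into a dictionary, and `matchExpressions` are ignored; the API server remains the authority on what is accepted.

```python
def template_matches_selector(replicaset: dict) -> bool:
    """Return True if every matchLabels entry is present in the Pod template labels.

    Assumption: only matchLabels is considered; matchExpressions are not evaluated.
    """
    spec = replicaset["spec"]
    match_labels = spec["selector"].get("matchLabels", {})
    template_labels = spec["template"]["metadata"].get("labels", {})
    return all(template_labels.get(k) == v for k, v in match_labels.items())


# Mirrors the selector and template label of the frontend example.
frontend = {
    "spec": {
        "selector": {"matchLabels": {"tier": "frontend"}},
        "template": {"metadata": {"labels": {"tier": "frontend"}}},
    }
}

# A deliberately broken variant whose template label does not satisfy the selector.
mismatched = {
    "spec": {
        "selector": {"matchLabels": {"tier": "frontend"}},
        "template": {"metadata": {"labels": {"tier": "backend"}}},
    }
}

print(template_matches_selector(frontend))
print(template_matches_selector(mismatched))
```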
|
| 254 |
+
|
| 255 |
+
> [!info] Note:
|
| 256 |
+
> For 2 ReplicaSets specifying the same `.spec.selector` but different `.spec.template.metadata.labels` and `.spec.template.spec` fields, each ReplicaSet ignores the Pods created by the other ReplicaSet.
|
| 257 |
+
|
| 258 |
+
### Replicas
|
| 259 |
+
|
| 260 |
+
You can specify how many Pods should run concurrently by setting `.spec.replicas`. The ReplicaSet will create/delete its Pods to match this number.
|
| 261 |
+
|
| 262 |
+
If you do not specify `.spec.replicas`, then it defaults to 1.
|
| 263 |
+
|
| 264 |
+
## Working with ReplicaSets
|
| 265 |
+
|
| 266 |
+
### Deleting a ReplicaSet and its Pods
|
| 267 |
+
|
| 268 |
+
To delete a ReplicaSet and all of its Pods, use [`kubectl delete`](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#delete). The [Garbage collector](https://kubernetes.io/docs/concepts/architecture/garbage-collection/) automatically deletes all of the dependent Pods by default.
|
| 269 |
+
|
| 270 |
+
When using the REST API or the `client-go` library, you must set `propagationPolicy` to `Background` or `Foreground` in the delete options (the request body, passed with `-d` in the `curl` example). For example:
|
| 271 |
+
|
| 272 |
+
```shell
|
| 273 |
+
kubectl proxy --port=8080
|
| 274 |
+
curl -X DELETE 'localhost:8080/apis/apps/v1/namespaces/default/replicasets/frontend' \
|
| 275 |
+
-d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Foreground"}' \
|
| 276 |
+
-H "Content-Type: application/json"
|
| 277 |
+
```
|
| 278 |
+
|
| 279 |
+
### Deleting just a ReplicaSet
|
| 280 |
+
|
| 281 |
+
You can delete a ReplicaSet without affecting any of its Pods using [`kubectl delete`](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#delete) with the `--cascade=orphan` option. When using the REST API or the `client-go` library, you must set `propagationPolicy` to `Orphan`. For example:
|
| 282 |
+
|
| 283 |
+
```shell
|
| 284 |
+
kubectl proxy --port=8080
|
| 285 |
+
curl -X DELETE 'localhost:8080/apis/apps/v1/namespaces/default/replicasets/frontend' \
|
| 286 |
+
-d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Orphan"}' \
|
| 287 |
+
-H "Content-Type: application/json"
|
| 288 |
+
```
|
| 289 |
+
|
| 290 |
+
Once the original is deleted, you can create a new ReplicaSet to replace it. As long as the old and new `.spec.selector` are the same, then the new one will adopt the old Pods. However, it will not make any effort to make existing Pods match a new, different pod template. To update Pods to a new spec in a controlled way, use a [Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#creating-a-deployment), as ReplicaSets do not support a rolling update directly.
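
The two `curl` examples above differ only in the `propagationPolicy` value of the request body. A minimal sketch of building that body in Python follows; the `delete_options` helper is a hypothetical name, and the set of accepted values is limited to the three policies named in this section.

```python
import json

# The three propagation policies mentioned in this section.
VALID_POLICIES = {"Background", "Foreground", "Orphan"}


def delete_options(propagation_policy: str) -> str:
    """Build the JSON body for a DELETE request (hypothetical helper)."""
    if propagation_policy not in VALID_POLICIES:
        raise ValueError(f"unsupported propagationPolicy: {propagation_policy!r}")
    body = {
        "kind": "DeleteOptions",
        "apiVersion": "v1",
        "propagationPolicy": propagation_policy,
    }
    return json.dumps(body)


# Body for deleting only the ReplicaSet and leaving its Pods running.
orphan_body = delete_options("Orphan")
print(orphan_body)
```

Validating the policy before sending the request catches a misspelled value locally rather than at the API server.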
|
| 291 |
+
|
| 292 |
+
### Terminating Pods
|
| 293 |
+
|
| 294 |
+
FEATURE STATE: `Kubernetes v1.35 [beta]` (enabled by default)
|
| 295 |
+
|
| 296 |
+
You can enable this feature by setting the `DeploymentReplicaSetTerminatingReplicas` [feature gate](https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/) on the [API server](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/) and on the [kube-controller-manager](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/).
|
| 297 |
+
|
| 298 |
+
Pods that become terminating due to deletion or scale down may take a long time to terminate, and may consume additional resources during that period. As a result, the total number of all pods can temporarily exceed `.spec.replicas`. Terminating pods can be tracked using the `.status.terminatingReplicas` field of the ReplicaSet.
|
| 299 |
+
|
| 300 |
+
### Isolating Pods from a ReplicaSet
|
| 301 |
+
|
| 302 |
+
You can remove Pods from a ReplicaSet by changing their labels. This technique may be used to remove Pods from service for debugging, data recovery, etc. Pods that are removed in this way will be replaced automatically (assuming that the number of replicas is not also changed).
|
| 303 |
+
|
| 304 |
+
### Scaling a ReplicaSet
|
| 305 |
+
|
| 306 |
+
A ReplicaSet can be easily scaled up or down by simply updating the `.spec.replicas` field. The ReplicaSet controller ensures that a desired number of Pods with a matching label selector are available and operational.
|
| 307 |
+
|
| 308 |
+
When scaling down, the ReplicaSet controller chooses which Pods to delete by sorting the available Pods according to the following general algorithm:
|
| 309 |
+
|
| 310 |
+
1. Pending (and unschedulable) pods are scaled down first.
|
| 311 |
+
2. If `controller.kubernetes.io/pod-deletion-cost` annotation is set, then the pod with the lower value will come first.
|
| 312 |
+
3. Pods on nodes with more replicas come before pods on nodes with fewer replicas.
|
| 313 |
+
4. If the pods' creation times differ, the pod that was created more recently comes before the older pod (the creation times are bucketed on an integer log scale).
|
| 314 |
+
|
| 315 |
+
If all of the above match, then selection is random.
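
The four rules above can be read as a sort key. The sketch below is a simplified model, not the controller's implementation: the `PodInfo` type is hypothetical, the log-scale bucketing is approximated with an integer base-2 logarithm of the Pod's age (an assumption), and the final random tie-break is omitted, so ties keep their input order.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PodInfo:
    """Hypothetical, trimmed-down view of a Pod for this example."""
    name: str
    phase: str              # "Pending" or "Running"
    node: Optional[str]     # None if the Pod is not scheduled
    deletion_cost: int      # 0 when the annotation is not set
    age_seconds: float


def age_bucket(age_seconds: float) -> int:
    """Assumed bucketing: integer log2 of the age, so similar ages compare equal."""
    return int(math.log2(max(age_seconds, 1.0)))


def scale_down_order(pods: List[PodInfo]) -> List[PodInfo]:
    """Return Pods ordered so that the first entries are deleted first."""
    per_node = Counter(p.node for p in pods if p.node is not None)

    def key(p: PodInfo):
        return (
            0 if p.phase == "Pending" else 1,  # rule 1: pending Pods first
            p.deletion_cost,                   # rule 2: lower cost first
            -per_node.get(p.node, 0),          # rule 3: crowded nodes first
            age_bucket(p.age_seconds),         # rule 4: newer Pods first
        )

    return sorted(pods, key=key)


pods = [
    PodInfo("a", "Running", "node1", 0, 1000),
    PodInfo("b", "Pending", None, 0, 5),
    PodInfo("c", "Running", "node1", -10, 1000),
    PodInfo("d", "Running", "node2", 0, 1000),
]
order = [p.name for p in scale_down_order(pods)]
print(order)
```

In this example the pending Pod `b` comes first, then `c` because of its lower deletion cost, then `a` ahead of `d` because `node1` hosts more replicas than `node2`.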
|
| 316 |
+
|
| 317 |
+
### Pod deletion cost
|
| 318 |
+
|
| 319 |
+
FEATURE STATE: `Kubernetes v1.22 [beta]`
|
| 320 |
+
|
| 321 |
+
Using the [`controller.kubernetes.io/pod-deletion-cost`](https://kubernetes.io/docs/reference/labels-annotations-taints/#pod-deletion-cost) annotation, users can set a preference regarding which pods to remove first when downscaling a ReplicaSet.
|
| 322 |
+
|
| 323 |
+
The annotation should be set on the Pod; the valid range is \[-2147483648, 2147483647\]. It represents the cost of deleting a Pod compared to other Pods belonging to the same ReplicaSet. Pods with a lower deletion cost are preferred for deletion over Pods with a higher deletion cost.
|
| 324 |
+
|
| 325 |
+
The implicit value for this annotation for pods that don't set it is 0; negative values are permitted. Invalid values will be rejected by the API server.
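
A minimal sketch of how a client might read this annotation while respecting the stated range and default. The `deletion_cost` helper is hypothetical, and the strict digits-only format check is an assumption; the API server's own validation may differ in detail.

```python
import re

ANNOTATION = "controller.kubernetes.io/pod-deletion-cost"
INT32_MIN, INT32_MAX = -2147483648, 2147483647


def deletion_cost(annotations: dict) -> int:
    """Return the Pod's deletion cost, or 0 when the annotation is absent."""
    raw = annotations.get(ANNOTATION)
    if raw is None:
        return 0  # implicit value for Pods that do not set the annotation
    # Assumption: accept only an optional minus sign followed by ASCII digits.
    if not re.fullmatch(r"-?[0-9]+", raw):
        raise ValueError(f"not an integer: {raw!r}")
    value = int(raw)
    if not INT32_MIN <= value <= INT32_MAX:
        raise ValueError(f"out of range: {value}")
    return value


print(deletion_cost({}))
print(deletion_cost({ANNOTATION: "-5"}))
```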
|
| 326 |
+
|
| 327 |
+
This feature is beta and enabled by default. You can disable it using the [feature gate](https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/) `PodDeletionCost` in both kube-apiserver and kube-controller-manager.
|
| 328 |
+
|
| 329 |
+
> [!info] Note:
|
| 330 |
+
> - This is honored on a best-effort basis, so it does not offer any guarantees on pod deletion order.
|
| 331 |
+
> - Users should avoid updating the annotation frequently, such as updating it based on a metric value, because doing so will generate a significant number of pod updates on the apiserver.
|
| 332 |
+
|
| 333 |
+
#### Example Use Case
|
| 334 |
+
|
| 335 |
+
The different pods of an application could have different utilization levels. On scale down, the application may prefer to remove the pods with lower utilization. To avoid frequently updating the pods, the application should update `controller.kubernetes.io/pod-deletion-cost` once before issuing a scale down (setting the annotation to a value proportional to pod utilization level). This works if the application itself controls the down scaling; for example, the driver pod of a Spark deployment.
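
As an illustration of this use case, the sketch below derives annotation values from utilization figures once, just before a scale down. The `costs_from_utilization` helper, the Pod names, and the scale factor of 1000 are all assumptions chosen for the example.

```python
from typing import Dict


def costs_from_utilization(utilization: Dict[str, float]) -> Dict[str, str]:
    """Map per-Pod utilization (0.0 to 1.0) to annotation string values.

    Lower utilization yields a lower cost, so those Pods are preferred for removal.
    The factor of 1000 is arbitrary; only the relative order matters.
    """
    return {name: str(round(level * 1000)) for name, level in utilization.items()}


# Hypothetical utilization snapshot taken right before scaling down.
costs = costs_from_utilization({"worker-1": 0.1, "worker-2": 0.9})
print(costs)
```

Each resulting value would then be written to the `controller.kubernetes.io/pod-deletion-cost` annotation of the corresponding Pod in a single update.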
|
| 336 |
+
|
| 337 |
+
### ReplicaSet as a Horizontal Pod Autoscaler Target
|
| 338 |
+
|
| 339 |
+
A ReplicaSet can also be a target for [Horizontal Pod Autoscalers (HPA)](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/). That is, a ReplicaSet can be auto-scaled by an HPA. Here is an example HPA targeting the ReplicaSet we created in the previous example.
|
| 340 |
+
|
| 341 |
+
```yaml
|
| 342 |
+
apiVersion: autoscaling/v1
|
| 343 |
+
kind: HorizontalPodAutoscaler
|
| 344 |
+
metadata:
|
| 345 |
+
name: frontend-scaler
|
| 346 |
+
spec:
|
| 347 |
+
scaleTargetRef:
|
| 348 |
+
apiVersion: apps/v1
|
| 349 |
+
kind: ReplicaSet
|
| 350 |
+
name: frontend
|
| 351 |
+
minReplicas: 3
|
| 352 |
+
maxReplicas: 10
|
| 353 |
+
targetCPUUtilizationPercentage: 50
|
| 354 |
+
```
|
| 355 |
+
|
| 356 |
+
Saving this manifest into `hpa-rs.yaml` and submitting it to a Kubernetes cluster should create the defined HPA that autoscales the target ReplicaSet depending on the CPU usage of the replicated Pods.
|
| 357 |
+
|
| 358 |
+
```shell
|
| 359 |
+
kubectl apply -f https://k8s.io/examples/controllers/hpa-rs.yaml
|
| 360 |
+
```
|
| 361 |
+
|
| 362 |
+
Alternatively, you can use the `kubectl autoscale` command to accomplish the same (and it's easier!).
|
| 363 |
+
|
| 364 |
+
```shell
|
| 365 |
+
kubectl autoscale rs frontend --max=10 --min=3 --cpu=50%
|
| 366 |
+
```
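
To give a feel for what a 50% CPU target means with `minReplicas: 3` and `maxReplicas: 10`, here is a simplified sketch of the scaling calculation described in the Horizontal Pod Autoscaler documentation: the desired count is the current count scaled by the ratio of observed to target utilization, rounded up, then clamped. The function name is hypothetical, and the sketch deliberately ignores the tolerance band, stabilization windows, and Pod readiness handling of the real controller.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_percent: float,
    target_cpu_percent: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Simplified HPA calculation: scale by the utilization ratio, then clamp."""
    raw = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    return max(min_replicas, min(max_replicas, raw))


# Three replicas running at 100% average CPU against a 50% target.
print(desired_replicas(3, 100, 50, min_replicas=3, max_replicas=10))
```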
|
| 367 |
+
|
| 368 |
+
## Alternatives to ReplicaSet
|
| 369 |
+
|
| 370 |
+
### Deployment (recommended)
|
| 371 |
+
|
| 372 |
+
[`Deployment`](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/) is an object which can own ReplicaSets and update them and their Pods via declarative, server-side rolling updates. While ReplicaSets can be used independently, today they're mainly used by Deployments as a mechanism to orchestrate Pod creation, deletion and updates. When you use Deployments you don't have to worry about managing the ReplicaSets that they create. Deployments own and manage their ReplicaSets. As such, it is recommended to use Deployments when you want ReplicaSets.
|
| 373 |
+
|
| 374 |
+
### Bare Pods
|
| 375 |
+
|
| 376 |
+
Unlike the case where a user creates Pods directly, a ReplicaSet replaces Pods that are deleted or terminated for any reason, for example after a node failure or during disruptive node maintenance such as a kernel upgrade. For this reason, we recommend that you use a ReplicaSet even if your application requires only a single Pod. Think of it as similar to a process supervisor, except that it supervises multiple Pods across multiple nodes instead of individual processes on a single node. A ReplicaSet delegates local container restarts to an agent on the node, such as the kubelet.
|
| 377 |
+
|
| 378 |
+
### Job
|
| 379 |
+
|
| 380 |
+
Use a [`Job`](https://kubernetes.io/docs/concepts/workloads/controllers/job/) instead of a ReplicaSet for Pods that are expected to terminate on their own (that is, batch jobs).
|
| 381 |
+
|
| 382 |
+
### DaemonSet
|
| 383 |
+
|
| 384 |
+
Use a [`DaemonSet`](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/) instead of a ReplicaSet for Pods that provide a machine-level function, such as machine monitoring or machine logging. These Pods have a lifetime that is tied to the machine's lifetime: the Pod needs to be running on the machine before other Pods start, and is safe to terminate when the machine is otherwise ready to be rebooted or shut down.
|
| 385 |
+
|
| 386 |
+
### ReplicationController
|
| 387 |
+
|
| 388 |
+
ReplicaSets are the successors to [ReplicationControllers](https://kubernetes.io/docs/concepts/workloads/controllers/replicationcontroller/). The two serve the same purpose, and behave similarly, except that a ReplicationController does not support set-based selector requirements as described in the [labels user guide](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors). As such, ReplicaSets are preferred over ReplicationControllers.
|
| 389 |
+
|
| 390 |
+
## What's next
|
| 391 |
+
|
| 392 |
+
- Learn about [Pods](https://kubernetes.io/docs/concepts/workloads/pods/).
|
| 393 |
+
- Learn about [Deployments](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/).
|
| 394 |
+
- [Run a Stateless Application Using a Deployment](https://kubernetes.io/docs/tasks/run-application/run-stateless-application-deployment/), which relies on ReplicaSets to work.
|
| 395 |
+
- `ReplicaSet` is a top-level resource in the Kubernetes REST API. Read the [ReplicaSet](https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/replica-set-v1/) object definition to understand the API for replica sets.
|
| 396 |
+
- Read about [PodDisruptionBudget](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/) and how you can use it to manage application availability during disruptions.
|
| 397 |
+
|
| 398 |
+
|
| 399 |
+
Last modified September 26, 2025 at 6:20 PM PST: [Fix HPA CLI example in ReplicaSet doc (55add008ed)](https://github.com/kubernetes/website/commit/55add008edd6efd03de533257d4cf79628f58103)
|
|
@@ -0,0 +1,549 @@
|
| 1 |
+
A Secret is an object that contains a small amount of sensitive data such as a password, a token, or a key. Such information might otherwise be put in a [Pod](https://kubernetes.io/docs/concepts/workloads/pods/ "A Pod represents a set of running containers in your cluster.") specification or in a [container image](https://kubernetes.io/docs/reference/glossary/?all=true#term-image "Stored instance of a container that holds a set of software needed to run an application."). Using a Secret means that you don't need to include confidential data in your application code.
|
| 2 |
+
|
| 3 |
+
Because Secrets can be created independently of the Pods that use them, there is less risk of the Secret (and its data) being exposed during the workflow of creating, viewing, and editing Pods. Kubernetes, and applications that run in your cluster, can also take additional precautions with Secrets, such as avoiding writing sensitive data to nonvolatile storage.
|
| 4 |
+
|
| 5 |
+
Secrets are similar to [ConfigMaps](https://kubernetes.io/docs/concepts/configuration/configmap/ "An API object used to store non-confidential data in key-value pairs. Can be consumed as environment variables, command-line arguments, or configuration files in a volume.") but are specifically intended to hold confidential data.
|
| 6 |
+
|
| 7 |
+
> [!caution] Caution:
|
| 8 |
+
> Kubernetes Secrets are, by default, stored unencrypted in the API server's underlying data store (etcd). Anyone with API access can retrieve or modify a Secret, and so can anyone with access to etcd. Additionally, anyone who is authorized to create a Pod in a namespace can use that access to read any Secret in that namespace; this includes indirect access such as the ability to create a Deployment.
|
| 9 |
+
>
|
| 10 |
+
> In order to safely use Secrets, take at least the following steps:
|
| 11 |
+
>
|
| 12 |
+
> 1. [Enable Encryption at Rest](https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/) for Secrets.
|
| 13 |
+
> 2. [Enable or configure RBAC rules](https://kubernetes.io/docs/reference/access-authn-authz/authorization/) with least-privilege access to Secrets.
|
| 14 |
+
> 3. Restrict Secret access to specific containers.
|
| 15 |
+
> 4. [Consider using external Secret store providers](https://secrets-store-csi-driver.sigs.k8s.io/concepts.html#provider-for-the-secrets-store-csi-driver).
|
| 16 |
+
>
|
| 17 |
+
> For more guidelines to manage and improve the security of your Secrets, refer to [Good practices for Kubernetes Secrets](https://kubernetes.io/docs/concepts/security/secrets-good-practices/).
|
| 18 |
+
|
| 19 |
+
See [Information security for Secrets](#information-security-for-secrets) for more details.
|
| 20 |
+
|
| 21 |
+
## Uses for Secrets
|
| 22 |
+
|
| 23 |
+
You can use Secrets for purposes such as the following:
|
| 24 |
+
|
| 25 |
+
- [Set environment variables for a container](https://kubernetes.io/docs/tasks/inject-data-application/distribute-credentials-secure/#define-container-environment-variables-using-secret-data).
|
| 26 |
+
- [Provide credentials such as SSH keys or passwords to Pods](https://kubernetes.io/docs/tasks/inject-data-application/distribute-credentials-secure/#provide-prod-test-creds).
|
| 27 |
+
- [Allow the kubelet to pull container images from private registries](https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/).
|
| 28 |
+
|
| 29 |
+
The Kubernetes control plane also uses Secrets; for example, [bootstrap token Secrets](#bootstrap-token-secrets) are a mechanism to help automate node registration.
|
| 30 |
+
|
| 31 |
+
### Use case: dotfiles in a secret volume
|
| 32 |
+
|
| 33 |
+
You can make your data "hidden" by defining a key that begins with a dot. This key represents a dotfile or "hidden" file. For example, when the following Secret is mounted into a volume, `secret-volume`, the volume will contain a single file, called `.secret-file`, and the `dotfile-test-container` will have this file present at the path `/etc/secret-volume/.secret-file`.
|
| 34 |
+
|
| 35 |
+
> [!info] Note:
|
| 36 |
+
> Files beginning with dot characters are hidden from the output of `ls -l`; you must use `ls -la` to see them when listing directory contents.
|
| 37 |
+
|
| 38 |
+
```yaml
|
| 39 |
+
apiVersion: v1
|
| 40 |
+
kind: Secret
|
| 41 |
+
metadata:
|
| 42 |
+
name: dotfile-secret
|
| 43 |
+
data:
|
| 44 |
+
.secret-file: dmFsdWUtMg0KDQo=
|
| 45 |
+
---
|
| 46 |
+
apiVersion: v1
|
| 47 |
+
kind: Pod
|
| 48 |
+
metadata:
|
| 49 |
+
name: secret-dotfiles-pod
|
| 50 |
+
spec:
|
| 51 |
+
volumes:
|
| 52 |
+
- name: secret-volume
|
| 53 |
+
secret:
|
| 54 |
+
secretName: dotfile-secret
|
| 55 |
+
containers:
|
| 56 |
+
- name: dotfile-test-container
|
| 57 |
+
image: registry.k8s.io/busybox
|
| 58 |
+
command:
|
| 59 |
+
- ls
|
| 60 |
+
- "-l"
|
| 61 |
+
- "/etc/secret-volume"
|
| 62 |
+
volumeMounts:
|
| 63 |
+
- name: secret-volume
|
| 64 |
+
readOnly: true
|
| 65 |
+
mountPath: "/etc/secret-volume"
|
| 66 |
+
```
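
The `data` values of a Secret are base64-encoded. As a quick check of what the container would find in `.secret-file`, the value from the manifest above can be decoded with Python's standard library:

```python
import base64

# The value of the `.secret-file` key, copied from the manifest above.
encoded = "dmFsdWUtMg0KDQo="

decoded = base64.b64decode(encoded)
print(repr(decoded))  # b'value-2\r\n\r\n'
```

Note that the decoded content ends with two carriage return and line feed pairs, which are part of the stored value.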
|
| 67 |
+
|
| 68 |
+
### Use case: Secret visible to one container in a Pod
|
| 69 |
+
|
| 70 |
+
Consider a program that needs to handle HTTP requests, do some complex business logic, and then sign some messages with an HMAC. Because it has complex application logic, there might be an unnoticed remote file reading exploit in the server, which could expose the private key to an attacker.
|
| 71 |
+
|
| 72 |
+
This could be divided into two processes in two containers: a frontend container which handles user interaction and business logic, but which cannot see the private key; and a signer container that can see the private key, and responds to simple signing requests from the frontend (for example, over localhost networking).
|
| 73 |
+
|
| 74 |
+
With this partitioned approach, an attacker now has to trick the application server into doing something rather arbitrary, which may be harder than getting it to read a file.
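
A minimal sketch of the signing logic that could live in the signer container. The function names are hypothetical, SHA-256 is an assumed choice of hash, and the key shown is a placeholder; in the setup described above the key would come from a Secret mounted only into the signer container.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> str:
    """Return a hex-encoded HMAC-SHA256 signature for the message."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(key: bytes, message: bytes, signature: str) -> bool:
    """Check a signature using a constant-time comparison."""
    return hmac.compare_digest(sign(key, message), signature)


# Placeholder key for illustration only; never hard-code real keys.
key = b"example-key-from-mounted-secret"
signature = sign(key, b"order:42")
print(verify(key, b"order:42", signature))
print(verify(key, b"order:43", signature))
```

The frontend container would send only the message to the signer and receive the signature back, so the key never enters the process that handles untrusted input.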
|
| 75 |
+
|
| 76 |
+
### Alternatives to Secrets
|
| 77 |
+
|
| 78 |
+
Rather than using a Secret to protect confidential data, you can pick from alternatives.
|
| 79 |
+
|
| 80 |
+
Here are some of your options:
|
| 81 |
+
|
| 82 |
+
- If your cloud-native component needs to authenticate to another application that you know is running within the same Kubernetes cluster, you can use a [ServiceAccount](https://kubernetes.io/docs/reference/access-authn-authz/authentication/#service-account-tokens) and its tokens to identify your client.
|
| 83 |
+
- There are third-party tools that you can run, either within or outside your cluster, that manage sensitive data. For example, you could run a service that Pods access over HTTPS and that reveals a Secret if the client correctly authenticates (for example, with a ServiceAccount token).
|
| 84 |
+
- For authentication, you can implement a custom signer for X.509 certificates, and use [CertificateSigningRequests](https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/) to let that custom signer issue certificates to Pods that need them.
|
| 85 |
+
- You can use a [device plugin](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/) to expose node-local encryption hardware to a specific Pod. For example, you can schedule trusted Pods onto nodes that provide a Trusted Platform Module, configured out-of-band.
|
| 86 |
+
|
| 87 |
+
You can also combine two or more of those options, including the option to use Secret objects themselves.
|
| 88 |
+
|
| 89 |
+
For example: implement (or deploy) an [operator](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/ "A specialized controller used to manage a custom resource") that fetches short-lived session tokens from an external service, and then creates Secrets based on those short-lived session tokens. Pods running in your cluster can make use of the session tokens, and the operator ensures they are valid. This separation means that you can run Pods that are unaware of the exact mechanisms for issuing and refreshing those session tokens.
|
| 90 |
+
|
| 91 |
+
## Types of Secret
|
| 92 |
+
|
| 93 |
+
When creating a Secret, you can specify its type using the `type` field of the [Secret](https://kubernetes.io/docs/reference/kubernetes-api/config-and-storage-resources/secret-v1/) resource, or certain equivalent `kubectl` command line flags (if available). The Secret type is used to facilitate programmatic handling of the Secret data.
|
| 94 |
+
|
| 95 |
+
Kubernetes provides several built-in types for some common usage scenarios. These types vary in terms of the validations performed and the constraints Kubernetes imposes on them.
|
| 96 |
+
|
| 97 |
+
| Built-in Type | Usage |
|
| 98 |
+
| --- | --- |
|
| 99 |
+
| `Opaque` | arbitrary user-defined data |
|
| 100 |
+
| `kubernetes.io/service-account-token` | ServiceAccount token |
|
| 101 |
+
| `kubernetes.io/dockercfg` | serialized `~/.dockercfg` file |
|
| 102 |
+
| `kubernetes.io/dockerconfigjson` | serialized `~/.docker/config.json` file |
|
| 103 |
+
| `kubernetes.io/basic-auth` | credentials for basic authentication |
|
| 104 |
+
| `kubernetes.io/ssh-auth` | credentials for SSH authentication |
|
| 105 |
+
| `kubernetes.io/tls` | data for a TLS client or server |
|
| 106 |
+
| `bootstrap.kubernetes.io/token` | bootstrap token data |
|
| 107 |
+
|
| 108 |
+
You can define and use your own Secret type by assigning a non-empty string as the `type` value for a Secret object (an empty string is treated as an `Opaque` type).
|
| 109 |
+
|
| 110 |
+
Kubernetes doesn't impose any constraints on the type name. However, if you are using one of the built-in types, you must meet all the requirements defined for that type.
|
| 111 |
+
|
| 112 |
+
If you are defining a type of Secret that's for public use, follow the convention and structure the Secret type to have your domain name before the name, separated by a `/`. For example: `cloud-hosting.example.net/cloud-api-credentials`.
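
The rules in this section (an empty type is treated as `Opaque`, built-in types carry extra requirements, and public custom types are prefixed with a domain name) can be summarized in a short sketch. The helper names are hypothetical, and the check that the prefix contains a dot is an assumed approximation of "domain name".

```python
# Built-in types copied from the table above.
BUILT_IN_TYPES = {
    "Opaque",
    "kubernetes.io/service-account-token",
    "kubernetes.io/dockercfg",
    "kubernetes.io/dockerconfigjson",
    "kubernetes.io/basic-auth",
    "kubernetes.io/ssh-auth",
    "kubernetes.io/tls",
    "bootstrap.kubernetes.io/token",
}


def effective_type(type_field: str) -> str:
    """An empty type string is treated as Opaque."""
    return type_field or "Opaque"


def is_built_in(type_field: str) -> bool:
    """True when the Secret must meet the requirements of a built-in type."""
    return effective_type(type_field) in BUILT_IN_TYPES


def follows_public_convention(type_field: str) -> bool:
    """Assumed check: '<domain>/<name>' where the domain part contains a dot."""
    domain, separator, name = type_field.partition("/")
    return bool(separator) and "." in domain and bool(name)


print(effective_type(""))
print(is_built_in("kubernetes.io/tls"))
print(follows_public_convention("cloud-hosting.example.net/cloud-api-credentials"))
```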
|
| 113 |
+
|
| 114 |
+
### Opaque Secrets
|
| 115 |
+
|
| 116 |
+
`Opaque` is the default Secret type if you don't explicitly specify a type in a Secret manifest. When you create a Secret using `kubectl`, you must use the `generic` subcommand to indicate an `Opaque` Secret type. For example, the following command creates an empty Secret of type `Opaque`:
|
| 117 |
+
|
| 118 |
+
```shell
|
| 119 |
+
kubectl create secret generic empty-secret
|
| 120 |
+
kubectl get secret empty-secret
|
| 121 |
+
```
|
| 122 |
+
|
| 123 |
+
The output looks like:
|
| 124 |
+
|
| 125 |
+
```
|
| 126 |
+
NAME           TYPE     DATA   AGE
|
| 127 |
+
empty-secret   Opaque   0      2m6s
|
| 128 |
+
```
|
| 129 |
+
|
| 130 |
+
The `DATA` column shows the number of data items stored in the Secret. In this case, `0` means you have created an empty Secret.
|
| 131 |
+
|
| 132 |
+
### ServiceAccount token Secrets
|
| 133 |
+
|
| 134 |
+
A `kubernetes.io/service-account-token` type of Secret is used to store a token credential that identifies a [ServiceAccount](https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/ "Provides an identity for processes that run in a Pod."). This is a legacy mechanism that provides long-lived ServiceAccount credentials to Pods.
|
| 135 |
+
|
| 136 |
+
In Kubernetes v1.22 and later, the recommended approach is to obtain a short-lived, automatically rotating ServiceAccount token by using the [`TokenRequest`](https://kubernetes.io/docs/reference/kubernetes-api/authentication-resources/token-request-v1/) API instead. You can get these short-lived tokens using the following methods:
|
| 137 |
+
|
| 138 |
+
- Call the `TokenRequest` API either directly or by using an API client like `kubectl`. For example, you can use the [`kubectl create token`](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#-em-token-em-) command.
|
| 139 |
+
- Request a mounted token in a [projected volume](https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume) in your Pod manifest. Kubernetes creates the token and mounts it in the Pod. The token is automatically invalidated when the Pod that it's mounted in is deleted. For details, see [Launch a Pod using service account token projection](https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#launch-a-pod-using-service-account-token-projection).
|
| 140 |
+
|
| 141 |
+
> [!info] Note:
|
| 142 |
+
> You should only create a ServiceAccount token Secret if you can't use the `TokenRequest` API to obtain a token, and the security exposure of persisting a non-expiring token credential in a readable API object is acceptable to you. For instructions, see [Manually create a long-lived API token for a ServiceAccount](https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#manually-create-an-api-token-for-a-serviceaccount).
|
| 143 |
+
|
| 144 |
+
When using this Secret type, you need to ensure that the `kubernetes.io/service-account.name` annotation is set to an existing ServiceAccount name. If you are creating both the ServiceAccount and the Secret objects, you should create the ServiceAccount object first.
|
| 145 |
+
|
| 146 |
+
After the Secret is created, a Kubernetes [controller](https://kubernetes.io/docs/concepts/architecture/controller/ "A control loop that watches the shared state of the cluster through the apiserver and makes changes attempting to move the current state towards the desired state.") fills in some other fields such as the `kubernetes.io/service-account.uid` annotation, and the `token` key in the `data` field, which is populated with an authentication token.
|
| 147 |
+
|
| 148 |
+
The following example configuration declares a ServiceAccount token Secret:
|
| 149 |
+
|
| 150 |
+
```yaml
|
| 151 |
+
apiVersion: v1
|
| 152 |
+
kind: Secret
|
| 153 |
+
metadata:
|
| 154 |
+
  name: secret-sa-sample
|
| 155 |
+
  annotations:
|
| 156 |
+
    kubernetes.io/service-account.name: "sa-name"
|
| 157 |
+
type: kubernetes.io/service-account-token
|
| 158 |
+
data:
|
| 159 |
+
  extra: YmFyCg==
|
| 160 |
+
```
|
| 161 |
+
|
| 162 |
+
After creating the Secret, wait for Kubernetes to populate the `token` key in the `data` field.
|
| 163 |
+
|
| 164 |
+
See the [ServiceAccount](https://kubernetes.io/docs/concepts/security/service-accounts/) documentation for more information on how ServiceAccounts work. You can also check the `automountServiceAccountToken` field and the `serviceAccountName` field of the [`Pod`](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.35/#pod-v1-core) for information on referencing ServiceAccount credentials from within Pods.
|
| 165 |
+
|
| 166 |
+
### Docker config Secrets
|
| 167 |
+
|
| 168 |
+
If you are creating a Secret to store credentials for accessing a container image registry, you must use one of the following `type` values for that Secret:
|
| 169 |
+
|
| 170 |
+
- `kubernetes.io/dockercfg`: store a serialized `~/.dockercfg`, which is the legacy format for configuring the Docker command line. The Secret `data` field contains a `.dockercfg` key whose value is the content of a base64 encoded `~/.dockercfg` file.
|
| 171 |
+
- `kubernetes.io/dockerconfigjson`: store a serialized JSON that follows the same format rules as the `~/.docker/config.json` file, which is a new format for `~/.dockercfg`. The Secret `data` field must contain a `.dockerconfigjson` key for which the value is the content of a base64 encoded `~/.docker/config.json` file.
|
| 172 |
+
|
| 173 |
+
Below is an example for a `kubernetes.io/dockercfg` type of Secret:
|
| 174 |
+
|
| 175 |
+
```yaml
|
| 176 |
+
apiVersion: v1
|
| 177 |
+
kind: Secret
|
| 178 |
+
metadata:
|
| 179 |
+
  name: secret-dockercfg
|
| 180 |
+
type: kubernetes.io/dockercfg
|
| 181 |
+
data:
|
| 182 |
+
  .dockercfg: |
|
| 183 |
+
    eyJhdXRocyI6eyJodHRwczovL2V4YW1wbGUvdjEvIjp7ImF1dGgiOiJvcGVuc2VzYW1lIn19fQo=
|
| 184 |
+
```
|
| 185 |
+
|
| 186 |
+
> [!info] Note:
|
| 187 |
+
> If you do not want to perform the base64 encoding, you can choose to use the `stringData` field instead.
|
| 188 |
+
|
| 189 |
+
When you create Docker config Secrets using a manifest, the API server checks whether the expected key exists in the `data` field, and it verifies if the value provided can be parsed as a valid JSON. The API server doesn't validate if the JSON actually is a Docker config file.
|
| 190 |
+
|
| 191 |
+
You can also use `kubectl` to create a Secret for accessing a container registry, such as when you don't have a Docker configuration file:
|
| 192 |
+
|
| 193 |
+
```shell
|
| 194 |
+
kubectl create secret docker-registry secret-tiger-docker \
|
| 195 |
+
--docker-email=tiger@acme.example \
|
| 196 |
+
--docker-username=tiger \
|
| 197 |
+
--docker-password=pass1234 \
|
| 198 |
+
--docker-server=my-registry.example:5000
|
| 199 |
+
```
|
| 200 |
+
|
| 201 |
+
This command creates a Secret of type `kubernetes.io/dockerconfigjson`.
|
| 202 |
+
|
| 203 |
+
Retrieve the `.dockerconfigjson` key from the `data` field of that new Secret and decode the data:
|
| 204 |
+
|
| 205 |
+
```shell
|
| 206 |
+
kubectl get secret secret-tiger-docker -o jsonpath='{.data.*}' | base64 -d
|
| 207 |
+
```
|
| 208 |
+
|
| 209 |
+
The output is equivalent to the following JSON document (which is also a valid Docker configuration file):
|
| 210 |
+
|
| 211 |
+
```json
|
| 212 |
+
{
|
| 213 |
+
"auths": {
|
| 214 |
+
"my-registry.example:5000": {
|
| 215 |
+
"username": "tiger",
|
| 216 |
+
"password": "pass1234",
|
| 217 |
+
"email": "tiger@acme.example",
|
| 218 |
+
"auth": "dGlnZXI6cGFzczEyMzQ="
|
| 219 |
+
}
|
| 220 |
+
}
|
| 221 |
+
}
|
| 222 |
+
```
|
| 223 |
+
|
| 224 |
+
> [!caution] Caution:
|
| 225 |
+
> The `auth` value there is base64 encoded; it is obscured but not secret. Anyone who can read that Secret can learn the registry access bearer token.
|
| 226 |
+
>
|
| 227 |
+
> It is suggested to use [credential providers](https://kubernetes.io/docs/tasks/administer-cluster/kubelet-credential-provider/) to dynamically and securely provide pull secrets on-demand.
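To make the caution concrete, the following minimal sketch decodes the `auth` value from the example JSON document above. It assumes a shell with the `base64` utility; the function name `decode_registry_auth` is a hypothetical helper, not a `kubectl` feature.

```shell
# Hypothetical helper: decode a base64 `auth` value from a Docker config.
decode_registry_auth() {
  printf '%s' "$1" | base64 -d
}

# The value below is copied from the example JSON document above.
decode_registry_auth 'dGlnZXI6cGFzczEyMzQ='
```

The decoded text is the username and password joined by a colon, which is why read access to the Secret is equivalent to knowing the registry credentials.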
|
| 228 |
+
|
| 229 |
+
### Basic authentication Secret
|
| 230 |
+
|
| 231 |
+
The `kubernetes.io/basic-auth` type is provided for storing credentials needed for basic authentication. When using this Secret type, the `data` field of the Secret must contain one of the following two keys:
|
| 232 |
+
|
| 233 |
+
- `username`: the user name for authentication
|
| 234 |
+
- `password`: the password or token for authentication
|
| 235 |
+
|
| 236 |
+
Both values for the above two keys are base64 encoded strings. You can alternatively provide the clear text content using the `stringData` field in the Secret manifest.
|
| 237 |
+
|
| 238 |
+
The following manifest is an example of a basic authentication Secret:
|
| 239 |
+
|
| 240 |
+
```yaml
|
| 241 |
+
apiVersion: v1
|
| 242 |
+
kind: Secret
|
| 243 |
+
metadata:
|
| 244 |
+
  name: secret-basic-auth
|
| 245 |
+
type: kubernetes.io/basic-auth
|
| 246 |
+
stringData:
|
| 247 |
+
  username: admin # required field for kubernetes.io/basic-auth
|
| 248 |
+
  password: t0p-Secret # required field for kubernetes.io/basic-auth
|
| 249 |
+
```
|
| 250 |
+
|
| 251 |
+
> [!info] Note:
|
| 252 |
+
> The `stringData` field for a Secret does not work well with server-side apply.
|
| 253 |
+
|
| 254 |
+
The basic authentication Secret type is provided only for convenience. You can create an `Opaque` type for credentials used for basic authentication. However, using the defined and public Secret type (`kubernetes.io/basic-auth`) helps other people to understand the purpose of your Secret, and sets a convention for what key names to expect.
|
| 255 |
+
|
| 256 |
+
### SSH authentication Secrets
|
| 257 |
+
|
| 258 |
+
The built-in type `kubernetes.io/ssh-auth` is provided for storing data used in SSH authentication. When using this Secret type, you will have to specify an `ssh-privatekey` key-value pair in the `data` (or `stringData`) field as the SSH credential to use.
|
| 259 |
+
|
| 260 |
+
The following manifest is an example of a Secret used for SSH public/private key authentication:
|
| 261 |
+
|
| 262 |
+
```yaml
|
| 263 |
+
apiVersion: v1
|
| 264 |
+
kind: Secret
|
| 265 |
+
metadata:
|
| 266 |
+
  name: secret-ssh-auth
|
| 267 |
+
type: kubernetes.io/ssh-auth
|
| 268 |
+
data:
|
| 269 |
+
  # the data is abbreviated in this example
|
| 270 |
+
  ssh-privatekey: |
|
| 271 |
+
    UG91cmluZzYlRW1vdGljb24lU2N1YmE=
|
| 272 |
+
```
|
| 273 |
+
|
| 274 |
+
The SSH authentication Secret type is provided only for convenience. You can create an `Opaque` type for credentials used for SSH authentication. However, using the defined and public Secret type (`kubernetes.io/ssh-auth`) helps other people to understand the purpose of your Secret, and sets a convention for what key names to expect. The Kubernetes API verifies that the required keys are set for a Secret of this type.
|
| 275 |
+
|
| 276 |
+
> [!caution] Caution:
|
| 277 |
+
> SSH private keys do not establish trusted communication between an SSH client and host server on their own. A secondary means of establishing trust is needed to mitigate "man in the middle" attacks, such as a `known_hosts` file added to a ConfigMap.
|
| 278 |
+
|
| 279 |
+
### TLS Secrets
|
| 280 |
+
|
| 281 |
+
The `kubernetes.io/tls` Secret type is for storing a certificate and its associated key that are typically used for TLS.
|
| 282 |
+
|
| 283 |
+
One common use for TLS Secrets is to configure encryption in transit for an [Ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/), but you can also use it with other resources or directly in your workload. When using this type of Secret, the `tls.key` and `tls.crt` keys must be provided in the `data` (or `stringData`) field of the Secret configuration, although the API server doesn't actually validate the values for each key.
|
| 284 |
+
|
| 285 |
+
As an alternative to using `stringData`, you can use the `data` field to provide the base64 encoded certificate and private key. For details, see [Constraints on Secret names and data](#restriction-names-data).
|
| 286 |
+
|
| 287 |
+
The following YAML contains an example config for a TLS Secret:
|
| 288 |
+
|
| 289 |
+
```yaml
|
| 290 |
+
apiVersion: v1
|
| 291 |
+
kind: Secret
|
| 292 |
+
metadata:
|
| 293 |
+
  name: secret-tls
|
| 294 |
+
type: kubernetes.io/tls
|
| 295 |
+
data:
|
| 296 |
+
  # values are base64 encoded, which obscures them but does NOT provide
|
| 297 |
+
  # any useful level of confidentiality
|
| 298 |
+
  # Replace the following values with your own base64-encoded certificate and key.
|
| 299 |
+
  tls.crt: "REPLACE_WITH_BASE64_CERT"
|
| 300 |
+
  tls.key: "REPLACE_WITH_BASE64_KEY"
|
| 301 |
+
```
|
| 302 |
+
|
| 303 |
+
The TLS Secret type is provided only for convenience. You can create an `Opaque` type for credentials used for TLS authentication. However, using the defined and public Secret type (`kubernetes.io/tls`) helps ensure the consistency of Secret format in your project. The API server verifies if the required keys are set for a Secret of this type.
|
| 304 |
+
|
| 305 |
+
To create a TLS Secret using `kubectl`, use the `tls` subcommand:
|
| 306 |
+
|
| 307 |
+
```shell
|
| 308 |
+
kubectl create secret tls my-tls-secret \
|
| 309 |
+
--cert=path/to/cert/file \
|
| 310 |
+
--key=path/to/key/file
|
| 311 |
+
```
|
| 312 |
+
|
| 313 |
+
The public/private key pair must exist beforehand. The public key certificate for `--cert` must be PEM encoded and must match the given private key for `--key`.
|
| 314 |
+
|
| 315 |
+
### Bootstrap token Secrets
|
| 316 |
+
|
| 317 |
+
The `bootstrap.kubernetes.io/token` Secret type is for tokens used during the node bootstrap process. It stores tokens used to sign well-known ConfigMaps.
|
| 318 |
+
|
| 319 |
+
A bootstrap token Secret is usually created in the `kube-system` namespace and named in the form `bootstrap-token-<token-id>` where `<token-id>` is a 6 character string of the token ID.
|
| 320 |
+
|
| 321 |
+
As a Kubernetes manifest, a bootstrap token Secret might look like the following:
|
| 322 |
+
|
| 323 |
+
```yaml
|
| 324 |
+
apiVersion: v1
|
| 325 |
+
kind: Secret
|
| 326 |
+
metadata:
|
| 327 |
+
  name: bootstrap-token-5emitj
|
| 328 |
+
  namespace: kube-system
|
| 329 |
+
type: bootstrap.kubernetes.io/token
|
| 330 |
+
data:
|
| 331 |
+
  auth-extra-groups: c3lzdGVtOmJvb3RzdHJhcHBlcnM6a3ViZWFkbTpkZWZhdWx0LW5vZGUtdG9rZW4=
|
| 332 |
+
  expiration: MjAyMC0wOS0xM1QwNDozOToxMFo=
|
| 333 |
+
  token-id: NWVtaXRq
|
| 334 |
+
  token-secret: a3E0Z2lodnN6emduMXAwcg==
|
| 335 |
+
  usage-bootstrap-authentication: dHJ1ZQ==
|
| 336 |
+
  usage-bootstrap-signing: dHJ1ZQ==
|
| 337 |
+
```
|
| 338 |
+
|
| 339 |
+
A bootstrap token Secret has the following keys specified under `data`:
|
| 340 |
+
|
| 341 |
+
- `token-id`: A random 6 character string as the token identifier. Required.
|
| 342 |
+
- `token-secret`: A random 16 character string as the actual token secret. Required.
|
| 343 |
+
- `description`: A human-readable string that describes what the token is used for. Optional.
|
| 344 |
+
- `expiration`: An absolute UTC time using [RFC3339](https://datatracker.ietf.org/doc/html/rfc3339) specifying when the token should be expired. Optional.
|
| 345 |
+
- `usage-bootstrap-<usage>`: A boolean flag indicating additional usage for the bootstrap token.
|
| 346 |
+
- `auth-extra-groups`: A comma-separated list of group names that will be authenticated as in addition to the `system:bootstrappers` group.
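The naming rule for bootstrap token Secrets can be sketched as a small shell check. This is an illustration only: the helper names `is_valid_token_id` and `bootstrap_secret_name` are hypothetical, and the restriction to lowercase letters and digits is an assumption based on the usual bootstrap token format rather than something stated in this section.

```shell
# Hypothetical helper: succeed if the argument looks like a token ID.
# Assumption: exactly 6 characters, lowercase letters and digits only.
is_valid_token_id() {
  printf '%s' "$1" | grep -Eq '^[a-z0-9]{6}$'
}

# Hypothetical helper: derive the Secret name from a token ID,
# following the bootstrap-token-<token-id> form described above.
bootstrap_secret_name() {
  is_valid_token_id "$1" || return 1
  printf 'bootstrap-token-%s\n' "$1"
}

# Token ID taken from the example manifest above.
bootstrap_secret_name '5emitj'
```

Rejecting malformed IDs before building the name keeps the Secret name and its `token-id` value consistent with each other.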
|
| 347 |
+
|
| 348 |
+
You can alternatively provide the values in the `stringData` field of the Secret without base64 encoding them:
|
| 349 |
+
|
| 350 |
+
```yaml
|
| 351 |
+
apiVersion: v1
|
| 352 |
+
kind: Secret
|
| 353 |
+
metadata:
|
| 354 |
+
  # Note how the Secret is named
|
| 355 |
+
  name: bootstrap-token-5emitj
|
| 356 |
+
  # A bootstrap token Secret usually resides in the kube-system namespace
|
| 357 |
+
  namespace: kube-system
|
| 358 |
+
type: bootstrap.kubernetes.io/token
|
| 359 |
+
stringData:
|
| 360 |
+
  auth-extra-groups: "system:bootstrappers:kubeadm:default-node-token"
|
| 361 |
+
  expiration: "2020-09-13T04:39:10Z"
|
| 362 |
+
  # This token ID is used in the name
|
| 363 |
+
  token-id: "5emitj"
|
| 364 |
+
  token-secret: "kq4gihvszzgn1p0r"
|
| 365 |
+
  # This token can be used for authentication
|
| 366 |
+
  usage-bootstrap-authentication: "true"
|
| 367 |
+
  # and it can be used for signing
|
| 368 |
+
  usage-bootstrap-signing: "true"
|
| 369 |
+
```
|
| 370 |
+
|
| 371 |
+
> [!info] Note:
|
| 372 |
+
> The `stringData` field for a Secret does not work well with server-side apply.
|
| 373 |
+
|
| 374 |
+
## Working with Secrets
|
| 375 |
+
|
| 376 |
+
### Creating a Secret
|
| 377 |
+
|
| 378 |
+
There are several options to create a Secret:
|
| 379 |
+
|
| 380 |
+
- [Use `kubectl`](https://kubernetes.io/docs/tasks/configmap-secret/managing-secret-using-kubectl/)
|
| 381 |
+
- [Use a configuration file](https://kubernetes.io/docs/tasks/configmap-secret/managing-secret-using-config-file/)
|
| 382 |
+
- [Use the Kustomize tool](https://kubernetes.io/docs/tasks/configmap-secret/managing-secret-using-kustomize/)
|
| 383 |
+
|
| 384 |
+
#### Constraints on Secret names and data
|
| 385 |
+
|
| 386 |
+
The name of a Secret object must be a valid [DNS subdomain name](https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-subdomain-names).
|
| 387 |
+
|
| 388 |
+
You can specify the `data` and/or the `stringData` field when creating a configuration file for a Secret. The `data` and the `stringData` fields are optional. The values for all keys in the `data` field have to be base64-encoded strings. If the conversion to base64 string is not desirable, you can choose to specify the `stringData` field instead, which accepts arbitrary strings as values.
|
| 389 |
+
|
| 390 |
+
The keys of `data` and `stringData` must consist of alphanumeric characters, `-`, `_` or `.`. All key-value pairs in the `stringData` field are internally merged into the `data` field. If a key appears in both the `data` and the `stringData` field, the value specified in the `stringData` field takes precedence.
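The encoding and key rules above can be checked locally before writing a manifest. The following is a minimal sketch, assuming a shell with the `base64` utility; the function names `encode_secret_value` and `is_valid_secret_key` are hypothetical helpers, and rejecting an empty key is an extra assumption made here.

```shell
# Hypothetical helper: base64-encode a value for use under `data`.
# printf '%s' avoids encoding a trailing newline into the value.
encode_secret_value() {
  printf '%s' "$1" | base64
}

# Hypothetical helper: succeed only if the key consists of
# alphanumeric characters, '-', '_' or '.' (empty keys are rejected).
is_valid_secret_key() {
  case "$1" in
    ''|*[!A-Za-z0-9._-]*) return 1 ;;
    *) return 0 ;;
  esac
}

encode_secret_value 'admin'
is_valid_secret_key 'tls.crt' && echo 'tls.crt: valid key'
is_valid_secret_key 'bad key' || echo 'bad key: invalid key'
```

Values placed under `stringData` need no such encoding, because they are merged into `data` for you.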
|
| 391 |
+
|
| 392 |
+
#### Size limit
|
| 393 |
+
|
| 394 |
+
Individual Secrets are limited to 1MiB in size. This is to discourage creation of very large Secrets that could exhaust the API server and kubelet memory. However, creation of many smaller Secrets could also exhaust memory. You can use a [resource quota](https://kubernetes.io/docs/concepts/policy/resource-quotas/) to limit the number of Secrets (or other resources) in a namespace.
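As a rough pre-check against the size limit, the sketch below tests whether a single file is at most 1 MiB before you place it in a Secret. The helper name `fits_in_secret` is hypothetical. The limit in the text applies to the Secret as a whole, so treat this as an approximation for one raw value, not a guarantee that the API server accepts the object.

```shell
# 1 MiB expressed in bytes.
SECRET_SIZE_LIMIT_BYTES=$((1024 * 1024))

# Hypothetical helper: succeed if the file is no larger than 1 MiB.
fits_in_secret() {
  # The arithmetic expansion strips any padding that wc may emit.
  size_bytes=$(( $(wc -c < "$1") ))
  [ "$size_bytes" -le "$SECRET_SIZE_LIMIT_BYTES" ]
}

# Example: check a small temporary file.
sample_file=$(mktemp)
printf 'hello' > "$sample_file"
fits_in_secret "$sample_file" && echo 'small enough for a Secret'
rm -f "$sample_file"
```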
|
| 395 |
+
|
| 396 |
+
### Editing a Secret
|
| 397 |
+
|
| 398 |
+
You can edit an existing Secret unless it is [immutable](#secret-immutable). To edit a Secret, use one of the following methods:
|
| 399 |
+
|
| 400 |
+
- [Use `kubectl`](https://kubernetes.io/docs/tasks/configmap-secret/managing-secret-using-kubectl/#edit-secret)
|
| 401 |
+
- [Use a configuration file](https://kubernetes.io/docs/tasks/configmap-secret/managing-secret-using-config-file/#edit-secret)
|
| 402 |
+
|
| 403 |
+
You can also edit the data in a Secret using the [Kustomize tool](https://kubernetes.io/docs/tasks/configmap-secret/managing-secret-using-kustomize/#edit-secret). However, this method creates a new `Secret` object with the edited data.
|
| 404 |
+
|
| 405 |
+
Depending on how you created the Secret, as well as how the Secret is used in your Pods, updates to existing `Secret` objects are propagated automatically to Pods that use the data. For more information, refer to the [Using Secrets as files from a Pod](#using-secrets-as-files-from-a-pod) section.
|
| 406 |
+
|
| 407 |
+
### Using a Secret
|
| 408 |
+
|
| 409 |
+
Secrets can be mounted as data volumes or exposed as [environment variables](https://kubernetes.io/docs/concepts/containers/container-environment/ "Container environment variables are name=value pairs that provide useful information into containers running in a Pod.") to be used by a container in a Pod. Secrets can also be used by other parts of the system, without being directly exposed to the Pod. For example, Secrets can hold credentials that other parts of the system should use to interact with external systems on your behalf.
|
| 410 |
+
|
| 411 |
+
Secret volume sources are validated to ensure that the specified object reference actually points to an object of type Secret. Therefore, a Secret needs to be created before any Pods that depend on it.
|
| 412 |
+
|
| 413 |
+
If the Secret cannot be fetched (perhaps because it does not exist, or due to a temporary lack of connection to the API server) the kubelet periodically retries running that Pod. The kubelet also reports an Event for that Pod, including details of the problem fetching the Secret.
|
| 414 |
+
|
| 415 |
+
#### Optional Secrets
|
| 416 |
+
|
| 417 |
+
When you reference a Secret in a Pod, you can mark the Secret as *optional*, such as in the following example. If an optional Secret doesn't exist, Kubernetes ignores it.
|
| 418 |
+
|
| 419 |
+
```yaml
|
| 420 |
+
apiVersion: v1
|
| 421 |
+
kind: Pod
|
| 422 |
+
metadata:
|
| 423 |
+
  name: mypod
|
| 424 |
+
spec:
|
| 425 |
+
  containers:
|
| 426 |
+
  - name: mypod
|
| 427 |
+
    image: redis
|
| 428 |
+
    volumeMounts:
|
| 429 |
+
    - name: foo
|
| 430 |
+
      mountPath: "/etc/foo"
|
| 431 |
+
      readOnly: true
|
| 432 |
+
  volumes:
|
| 433 |
+
  - name: foo
|
| 434 |
+
    secret:
|
| 435 |
+
      secretName: mysecret
|
| 436 |
+
      optional: true
|
| 437 |
+
```
|
| 438 |
+
|
| 439 |
+
By default, Secrets are required. None of a Pod's containers will start until all non-optional Secrets are available.
|
| 440 |
+
|
| 441 |
+
If a Pod references a specific key in a non-optional Secret and that Secret does exist, but is missing the named key, the Pod fails during startup.
|
| 442 |
+
|
| 443 |
+
### Using Secrets as files from a Pod
|
| 444 |
+
|
| 445 |
+
If you want to access data from a Secret in a Pod, one way to do that is to have Kubernetes make the value of that Secret be available as a file inside the filesystem of one or more of the Pod's containers.
|
| 446 |
+
|
| 447 |
+
For instructions, refer to [Create a Pod that has access to the secret data through a Volume](https://kubernetes.io/docs/tasks/inject-data-application/distribute-credentials-secure/#create-a-pod-that-has-access-to-the-secret-data-through-a-volume).
|
| 448 |
+
|
| 449 |
+
When a volume contains data from a Secret, and that Secret is updated, Kubernetes tracks this and updates the data in the volume, using an eventually-consistent approach.
|
| 450 |
+
|
| 451 |
+
> [!info] Note:
|
| 452 |
+
> A container using a Secret as a [subPath](https://kubernetes.io/docs/concepts/storage/volumes/#using-subpath) volume mount does not receive automated Secret updates.
|
| 453 |
+
|
| 454 |
+
The kubelet keeps a cache of the current keys and values for the Secrets that are used in volumes for pods on that node. You can configure the way that the kubelet detects changes from the cached values. The `configMapAndSecretChangeDetectionStrategy` field in the [kubelet configuration](https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/) controls which strategy the kubelet uses. The default strategy is `Watch`.
|
| 455 |
+
|
| 456 |
+
Updates to Secrets can be either propagated by an API watch mechanism (the default), based on a cache with a defined time-to-live, or polled from the cluster API server on each kubelet synchronisation loop.
|
| 457 |
+
|
| 458 |
+
As a result, the total delay from the moment when the Secret is updated to the moment when new keys are projected to the Pod can be as long as the kubelet sync period + cache propagation delay, where the cache propagation delay depends on the chosen cache type (following the same order listed in the previous paragraph, these are: watch propagation delay, the configured cache TTL, or zero for direct polling).
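The worst-case delay described above is a simple sum, which the following sketch works through. The function name `max_secret_update_delay` is hypothetical, and the 60 second values are illustrative assumptions, not defaults confirmed by this page.

```shell
# Worst-case delay in seconds:
#   kubelet sync period + cache propagation delay
max_secret_update_delay() {
  sync_period_s=$1
  cache_delay_s=$2
  echo $((sync_period_s + cache_delay_s))
}

# Assumed values: a 60 s sync period with a 60 s cache TTL.
max_secret_update_delay 60 60
# Direct polling contributes zero cache propagation delay.
max_secret_update_delay 60 0
```

With these assumed inputs the TTL-based strategy can take up to twice as long as direct polling to surface a change.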
|
| 459 |
+
|
| 460 |
+
### Using Secrets as environment variables
|
| 461 |
+
|
| 462 |
+
To use a Secret in an [environment variable](https://kubernetes.io/docs/concepts/containers/container-environment/ "Container environment variables are name=value pairs that provide useful information into containers running in a Pod.") in a Pod:
|
| 463 |
+
|
| 464 |
+
1. For each container in your Pod specification, add an environment variable for each Secret key that you want to use to the `env[].valueFrom.secretKeyRef` field.
|
| 465 |
+
2. Modify your image and/or command line so that the program looks for values in the specified environment variables.
|
| 466 |
+
|
| 467 |
+
For instructions, refer to [Define container environment variables using Secret data](https://kubernetes.io/docs/tasks/inject-data-application/distribute-credentials-secure/#define-container-environment-variables-using-secret-data).
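The two steps above can be sketched as a manifest that maps one Secret key to one environment variable. The shell function below only prints the manifest. The names `render_env_pod_manifest`, `env-demo`, `app`, `SECRET_USERNAME` and the `username` key are hypothetical, and `mysecret` is assumed to exist already.

```shell
# Hypothetical helper: print a Pod manifest that exposes the
# `username` key of the Secret `mysecret` as SECRET_USERNAME.
render_env_pod_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: env-demo
spec:
  containers:
  - name: app
    image: redis
    env:
    - name: SECRET_USERNAME
      valueFrom:
        secretKeyRef:
          name: mysecret
          key: username
EOF
}

render_env_pod_manifest
```

To try it against a cluster, you could pipe the output to `kubectl apply -f -`.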
|
| 468 |
+
|
| 469 |
+
It's important to note that the range of characters allowed for environment variable names in pods is [restricted](https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/#using-environment-variables-inside-of-your-config). If any keys do not meet the rules, those keys are not made available to your container, though the Pod is allowed to start.
|
| 470 |
+
|
| 471 |
+
### Container image pull Secrets
|
| 472 |
+
|
| 473 |
+
If you want to fetch container images from a private repository, you need a way for the kubelet on each node to authenticate to that repository. You can configure *image pull Secrets* to make this possible. These Secrets are configured at the Pod level.
|
| 474 |
+
|
| 475 |
+
#### Using imagePullSecrets
|
| 476 |
+
|
| 477 |
+
The `imagePullSecrets` field is a list of references to Secrets in the same namespace. You can use an `imagePullSecrets` to pass a Secret that contains a Docker (or other) image registry password to the kubelet. The kubelet uses this information to pull a private image on behalf of your Pod. See the [PodSpec API](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.35/#podspec-v1-core) for more information about the `imagePullSecrets` field.
|
| 478 |
+
|
| 479 |
+
##### Manually specifying an imagePullSecret
|
| 480 |
+
|
| 481 |
+
You can learn how to specify `imagePullSecrets` from the [container images](https://kubernetes.io/docs/concepts/containers/images/#specifying-imagepullsecrets-on-a-pod) documentation.
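As a brief illustration, the sketch below prints a Pod manifest that references the `secret-tiger-docker` Secret created earlier on this page. The function name `render_pull_secret_pod_manifest`, the Pod name and the image path are hypothetical.

```shell
# Hypothetical helper: print a Pod manifest whose imagePullSecrets
# list references a registry Secret in the same namespace.
render_pull_secret_pod_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: private-image-demo
spec:
  containers:
  - name: app
    image: my-registry.example:5000/app:1.0
  imagePullSecrets:
  - name: secret-tiger-docker
EOF
}

render_pull_secret_pod_manifest
```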
|
| 482 |
+
|
| 483 |
+
##### Arranging for imagePullSecrets to be automatically attached
|
| 484 |
+
|
| 485 |
+
You can manually create `imagePullSecrets` and reference them from a ServiceAccount. Any Pods that specify that ServiceAccount, or that use it by default, get their `imagePullSecrets` field set to that of the ServiceAccount. See [Add ImagePullSecrets to a service account](https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#add-imagepullsecrets-to-a-service-account) for a detailed explanation of that process.
|
| 486 |
+
|
| 487 |
+
### Using Secrets with static Pods
|
| 488 |
+
|
| 489 |
+
You cannot use ConfigMaps or Secrets with [static Pods](https://kubernetes.io/docs/tasks/configure-pod-container/static-pod/ "A pod managed directly by the kubelet daemon on a specific node.").
|
| 490 |
+
|
| 491 |
+
## Immutable Secrets
|
| 492 |
+
|
| 493 |
+
FEATURE STATE: `Kubernetes v1.21 [stable]`
|
| 494 |
+
|
| 495 |
+
Kubernetes lets you mark specific Secrets (and ConfigMaps) as *immutable*. Preventing changes to the data of an existing Secret has the following benefits:
|
| 496 |
+
|
| 497 |
+
- protects you from accidental (or unwanted) updates that could cause application outages
|
| 498 |
+
- (for clusters that extensively use Secrets - at least tens of thousands of unique Secret to Pod mounts), switching to immutable Secrets improves the performance of your cluster by significantly reducing load on kube-apiserver. The kubelet does not need to maintain a watch on any Secrets that are marked as immutable.
|
| 499 |
+
|
| 500 |
+
### Marking a Secret as immutable
|
| 501 |
+
|
| 502 |
+
You can create an immutable Secret by setting the `immutable` field to `true`. For example,
|
| 503 |
+
|
| 504 |
+
```yaml
|
| 505 |
+
apiVersion: v1
|
| 506 |
+
kind: Secret
|
| 507 |
+
metadata: ...
|
| 508 |
+
data: ...
|
| 509 |
+
immutable: true
|
| 510 |
+
```
|
| 511 |
+
|
| 512 |
+
You can also update any existing mutable Secret to make it immutable.
|
| 513 |
+
|
| 514 |
+
> [!info] Note:
|
| 515 |
+
> Once a Secret or ConfigMap is marked as immutable, it is *not* possible to revert this change nor to mutate the contents of the `data` field. You can only delete and recreate the Secret. Existing Pods maintain a mount point to the deleted Secret - it is recommended to recreate these pods.
|
| 516 |
+
|
| 517 |
+
## Information security for Secrets
|
| 518 |
+
|
| 519 |
+
Although ConfigMap and Secret work similarly, Kubernetes applies some additional protection for Secret objects.
|
| 520 |
+
|
| 521 |
+
Secrets often hold values that span a spectrum of importance, many of which can cause escalations within Kubernetes (e.g. service account tokens) and to external systems. Even if an individual app can reason about the power of the Secrets it expects to interact with, other apps within the same namespace can render those assumptions invalid.
|
| 522 |
+
|
| 523 |
+
Authorization configuration affects how Secret data can be accessed within a namespace. For example, granting **list** or **watch** permissions on Secrets allows a subject to read all Secret data in that namespace, not only the Secrets explicitly referenced by its Pods. Restrict access to the minimum set of permissions required for a workload to function, and avoid granting broad roles such as `cluster-admin` unless required for administrative purposes.
|
| 524 |
+
|
| 525 |
+
Also see the [Authorization documentation](https://kubernetes.io/docs/reference/access-authn-authz/rbac/).
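One way to follow the least-privilege advice above is an RBAC Role that grants only **get** on a single named Secret. The function below prints such a manifest as a sketch. The names `render_secret_reader_role`, `read-mysecret` and `mysecret`, and the `default` namespace, are hypothetical.

```shell
# Hypothetical helper: print a Role that allows reading one
# named Secret, without list or watch on Secrets in general.
render_secret_reader_role() {
  cat <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: read-mysecret
  namespace: default
rules:
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: ["mysecret"]
  verbs: ["get"]
EOF
}

render_secret_reader_role
```

A Role like this takes effect only once a RoleBinding attaches it to a subject, such as the ServiceAccount that your workload runs as.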
|
| 526 |
+
|
| 527 |
+
A Secret is only sent to a node if a Pod on that node requires it. For mounting Secrets into Pods, the kubelet stores a copy of the data into a `tmpfs` so that the confidential data is not written to durable storage. Once the Pod that depends on the Secret is deleted, the kubelet deletes its local copy of the confidential data from the Secret.
|
| 528 |
+
|
| 529 |
+
There may be several containers in a Pod. By default, containers you define only have access to the default ServiceAccount and its related Secret. You must explicitly define environment variables or map a volume into a container in order to provide access to any other Secret.
|
| 530 |
+
|
| 531 |
+
There may be Secrets for several Pods on the same node. However, only the Secrets that a Pod requests are potentially visible within its containers. Therefore, one Pod does not have access to the Secrets of another Pod.
|
| 532 |
+
|
| 533 |
+
### Configure least-privilege access to Secrets
|
| 534 |
+
|
| 535 |
+
To enhance the security measures around Secrets, use separate namespaces to isolate access to mounted secrets.
|
| 536 |
+
|
| 537 |
+
> [!danger] Warning:
|
| 538 |
+
> Any containers that run with `privileged: true` on a node can access all Secrets used on that node.
|
| 539 |
+
|
| 540 |
+
## What's next
|
| 541 |
+
|
| 542 |
+
- For guidelines to manage and improve the security of your Secrets, refer to [Good practices for Kubernetes Secrets](https://kubernetes.io/docs/concepts/security/secrets-good-practices/).
|
| 543 |
+
- Learn how to [manage Secrets using `kubectl`](https://kubernetes.io/docs/tasks/configmap-secret/managing-secret-using-kubectl/)
|
| 544 |
+
- Learn how to [manage Secrets using config file](https://kubernetes.io/docs/tasks/configmap-secret/managing-secret-using-config-file/)
|
| 545 |
+
- Learn how to [manage Secrets using kustomize](https://kubernetes.io/docs/tasks/configmap-secret/managing-secret-using-kustomize/)
|
| 546 |
+
- Read the [API reference](https://kubernetes.io/docs/reference/kubernetes-api/config-and-storage-resources/secret-v1/) for `Secret`
|
| 547 |
+
|
| 548 |
+
|
| 549 |
+
Last modified March 17, 2026 at 1:33 AM PST: [Improve security clarification for Kubernetes Secrets (#54644) (8af7916eb8)](https://github.com/kubernetes/website/commit/8af7916eb81024c5da7a9b4c4477db18e5fffda2)
|