I have a cluster running on GCP that currently consists entirely of preemptible nodes. We're experiencing issues where kube-dns becomes unavailable (presumably because a node has been preempted). We'd like to improve the resilience of DNS by moving kube-dns pods to more stable nodes.
Is it possible to schedule system-cluster-critical pods like kube-dns (or all pods in the kube-system namespace) on a node pool of only non-preemptible nodes? I'm wary of using affinity, anti-affinity, or taints, since these pods are auto-created at cluster bootstrapping and any changes made could be clobbered by a Kubernetes version upgrade. Is there a way to do this that will persist across upgrades?
You can add the nodeSelector field to your Pod specification and specify the node labels you want the target node to have. Kubernetes only schedules the Pod onto nodes that have each of the labels you specify.
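As a minimal sketch of what that looks like, the Pod below is pinned to a single GKE node pool through the `cloud.google.com/gke-nodepool` label that GKE sets on its nodes. The Pod name, the image, and the pool name `stable-pool` are hypothetical placeholders, not values from the question.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod            # hypothetical name
spec:
  # The Pod is only scheduled onto nodes carrying every label listed here.
  nodeSelector:
    cloud.google.com/gke-nodepool: stable-pool   # hypothetical pool name
  containers:
  - name: app
    image: nginx               # placeholder image
```

Note that this means editing the Pod spec itself, which is exactly what the question wants to avoid for the auto-created kube-system pods.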
CoreDNS runs a single container per instance, whereas kube-dns uses three. kube-dns uses dnsmasq for caching, which is single-threaded C; CoreDNS is multi-threaded Go. CoreDNS also enables negative caching in the default deployment.
Kubernetes expects that a service is running within the pod network mesh that performs name resolution and acts as the primary name server within the cluster.
The solution was to use taints and tolerations in conjunction with node affinity. We created a second node pool, and added a taint to the preemptible pool.
Terraform config:
resource "google_container_node_pool" "preemptible_worker_pool" {
  node_config {
    ...
    preemptible     = true
    labels = {
      preemptible = "true"
      dedicated   = "preemptible-worker-pool"
    }
    taint {
      key    = "dedicated"
      value  = "preemptible-worker-pool"
      effect = "NO_SCHEDULE"
    }
  }
}
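For orientation, here is roughly how that label and taint should appear on a node in the preemptible pool when you inspect the Node object. This is an illustrative excerpt rather than real output, and the node name is a hypothetical placeholder. The GKE API spells the effect `NO_SCHEDULE`, while the Kubernetes Node object shows it as `NoSchedule`.

```yaml
apiVersion: v1
kind: Node
metadata:
  name: example-preemptible-node   # hypothetical node name
  labels:
    preemptible: "true"
    dedicated: preemptible-worker-pool
spec:
  taints:
  # Pods without a matching toleration are not scheduled onto this node.
  - key: dedicated
    value: preemptible-worker-pool
    effect: NoSchedule
```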
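We then added a toleration and nodeAffinity to our existing workloads so that they are allowed onto, and required to run on, the tainted node pool. The cluster-critical pods such as kube-dns carry no such toleration, so the taint keeps them off the preemptible nodes and effectively forces them onto the untainted (non-preemptible) node pool. Nothing in kube-system is modified, so there is nothing for an upgrade to clobber.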
Kubernetes config:
spec:
  template:
    spec:
      # The affinity + tolerations sections together allow and enforce that the workers are
      # run on dedicated nodes tainted with "dedicated=preemptible-worker-pool:NoSchedule".
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: dedicated
                operator: In
                values:
                - preemptible-worker-pool
      tolerations:
      - key: dedicated
        operator: "Equal"
        value: preemptible-worker-pool
        effect: "NoSchedule"