Apr 23, 2023

Extending the Kubernetes Scheduler

Kubernetes is equally loved and loathed for its extreme flexibility, which comes at a high cost in complexity. For a research project, I was looking into ways to extend the Kubernetes Scheduler (kube-scheduler) with awareness of FPGAs. More precisely, I wanted each node to be able to advertise several FPGA “slots” to which a pod could be bound. Each of these slots supports only one attached application at a time and is reconfigured for it. The FPGA architecture details are, though very interesting, not the focus of this post. Instead, we’re going to explore the ways Kubernetes allows us to extend the built-in scheduling process to support additional compute resources.

First, let’s try to understand what happens when you deploy a pod in a Kubernetes cluster from the perspective of the default scheduler.

Scheduling Basics

The goal of scheduling in Kubernetes is to make the best placement decision, i.e. assigning pods to the best nodes to run on. The Kubernetes agent on each node, kubelet, will then make sure the assigned pods are provisioned as expected. The scheduler is activated each time an event for an unassigned pod is received and runs through the following process: First, from all existing nodes, a subset of feasible nodes is selected in the filtering step. Criteria for feasibility include checking for sufficient resources on the node given resource limits/requests from the pod spec. From the list of feasible nodes, the most suitable node is prioritized in the scoring step. After completing both steps, the resulting node is returned to be used for binding the pod.
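
In API terms, that last step boils down to the scheduler posting a Binding object to the pod’s binding subresource, recording the chosen node. A rough sketch (pod and node names are placeholders):

apiVersion: v1
kind: Binding
metadata:
  name: my-pod      # the pod being scheduled (placeholder)
target:
  apiVersion: v1
  kind: Node
  name: worker-1    # the node selected by the scheduler (placeholder)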

Scheduling Profiles

While filter, score, and bind are the central lifecycle stages, Kubernetes exposes several other so-called extension points that scheduling plugins can implement. Default plugins include ImageLocality, which favors nodes that already have the required container images in the scoring step; NodePorts, which filters out nodes that have already allocated the requested ports; and NodeResourcesFit, which rules out nodes with insufficient compute resources.

Scheduling profiles, defined in the scheduler configuration, control which plugins run at which extension points.
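
As a sketch of what this looks like (assuming the kubescheduler.config.k8s.io/v1 API version), the following profile disables the ImageLocality plugin at the score extension point; the file is passed to kube-scheduler via --config:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        # Don't favor nodes that already have the container images cached.
        disabled:
          - name: ImageLocality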

Extenders

Initially, I tried to add FPGA awareness to the scheduler in the least invasive way possible, i.e. without changing any logic within the scheduler itself. One approach I figured could work was to use extenders: webhooks that the scheduler invokes to assist its decisions. This way, every time a new pod needed to be scheduled, our extender webhook would receive a request and could inform the scheduler about feasible nodes in the filtering step, or about the most suitable nodes in the scoring step.
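
Configuration-wise, an extender is declared in the same scheduler configuration as the profiles above. A sketch of what ours could have looked like, with a placeholder webhook URL and an illustrative resource name:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
extenders:
  - urlPrefix: "http://fpga-extender.kube-system.svc:8080/scheduler"  # hypothetical webhook service
    filterVerb: "filter"          # POSTed to <urlPrefix>/filter during filtering
    prioritizeVerb: "prioritize"  # POSTed to <urlPrefix>/prioritize during scoring
    weight: 1
    # Let the extender account for FPGA slots instead of the default scheduler.
    managedResources:
      - name: example.com/fpga-slot
        ignoredByScheduler: true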

Adding an extender is only a matter of updating the scheduler configuration, which would have made it easy to experiment with. A downside of this approach was that we would not have had write access to state within Kubernetes, so we would have needed to track currently used and total available FPGA slots in an external data store (in the simplest case, a key-value map from node to slots). Furthermore, we would have needed to cover cases like a node failing to be provisioned or a pod being evicted, to make sure we neither kept slots marked as occupied after they had already been released on the node nor assigned the same slot twice.

As attractive as the extender approach seemed initially, the downsides quickly started to outweigh the benefits so I kept on researching other ways to support custom resources.

Forking the Scheduler

If using the built-in configuration options wasn’t doing the job, I reasoned that we might have to fork the scheduler itself. This way, we could solve our requirements exactly as needed. The only downside was that we’d have to be able to deploy our customized components (i.e. a custom scheduler image), which could prove problematic in managed Kubernetes offerings.

Some time in, I found myself attempting to run the Kubernetes build process on a slightly modified scheduler codebase. This meant understanding a build system based on lots of Makefiles and shell scripts, which consumed more hours than I’d like to admit. In the end, building the individual components was infuriatingly slow, so I decided to add a custom Dockerfile and focus on building just the kube-scheduler image.

# Build stage: compile kube-scheduler from the Kubernetes source tree
FROM golang:1.20.3-alpine3.17 AS build

COPY . /go/src/github.com/kubernetes/kubernetes
WORKDIR /go/src/github.com/kubernetes/kubernetes

RUN go build -o /usr/local/bin/kube-scheduler ./cmd/kube-scheduler

# Runtime stage: ship only the resulting binary in a minimal base image
FROM alpine:3.17.3

COPY --from=build /usr/local/bin/kube-scheduler /usr/local/bin/kube-scheduler

ENTRYPOINT ["/usr/local/bin/kube-scheduler"]

To test the custom image in a real Kubernetes environment, I created an OCI registry, pulled the latest Kubernetes component images (kube-apiserver, coredns, etc.), and pushed them together with the customized kube-scheduler image. Then, I started a local Minikube cluster on my machine (albeit within a VM since I needed to run all workloads on x86).

minikube start \
--extra-config=scheduler.v="5" \
--image-repository=<my custom repo> \
--driver docker \
--cache-images=false

With the custom image repository configured, Minikube would preload the container images and pass them straight on to kubeadm. After a couple of minutes, my custom scheduler build was running.

Updating the scheduler and other related components was one possible way to go, but it arguably involves a lot of code exploration and understanding of the Kubernetes codebase. I still wasn’t completely convinced, so I kept searching.

Extended Resources

Hidden deep within the Administer a cluster section of the Kubernetes documentation, I found the solution to my problem: Advertising extended resources for a node. This easily overlooked page described exactly what I was looking for, and it was already built into Kubernetes.

Using the Kubernetes API, you can advertise extended resources, which Kubernetes wouldn’t know about otherwise (like our FPGA slots). After proxying the API using kubectl proxy, we could send the following request to advertise our slots on the primary Minikube node:

curl --header "Content-Type: application/json-patch+json" \
  --request PATCH \
  --data '[{"op": "add", "path": "/status/capacity/example.com~1fpga-slot", "value": "2"}]' \
  http://localhost:8001/api/v1/nodes/minikube/status

Running kubectl describe node minikube quickly shows that our new extended resource is successfully recognized.

Capacity:
  cpu: 2
  memory: 2049008Ki
  example.com/fpga-slot: 2

Now that our node tracks FPGA slots, we can simply request resources in the pod spec as usual, just with our extended resource. Note that extended resources only accept integer quantities and cannot be overcommitted, so requests and limits have to match.

apiVersion: v1
kind: Pod
metadata:
  name: extended-resource-demo
spec:
  containers:
  - name: extended-resource-demo-ctr
    image: nginx
    resources:
      requests:
        example.com/fpga-slot: 2
      limits:
        example.com/fpga-slot: 2

Deploying the pod once works as expected; retrieving the node state afterwards shows that the resources were reserved properly. Attempting to deploy another instance of the pod spec will leave it stuck in Pending, as no more FPGA slots are available on the node. As with CPU and memory, the Kubernetes scheduler will reject the node in the filter step due to insufficient capacity.

This way, Kubernetes limits pods to the resources available on the node. It does not, however, assign individual FPGA slots to the deployed applications, as it still doesn’t know anything about the slots themselves. This task is left to the application developer.
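
Purely as an illustration of what that could mean (nothing below exists out of the box), one could imagine a small node-local agent that picks a concrete slot once the pod has been scheduled and exposes the choice to the application, e.g. via a made-up annotation that the application reads at startup:

apiVersion: v1
kind: Pod
metadata:
  name: extended-resource-demo
  annotations:
    # Hypothetical annotation set by a node-local agent after it has
    # reconfigured a free slot for this pod.
    example.com/assigned-fpga-slot: "1"
spec:
  containers:
  - name: extended-resource-demo-ctr
    image: nginx
    resources:
      limits:
        example.com/fpga-slot: 1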


Kubernetes is an incredibly powerful system that is complete overkill for most use cases. However, diving into the codebase and the fundamentals of the scheduler implementation has been very rewarding, and it gave me a much deeper understanding of the deployment lifecycle and of how it can be adapted to different requirements.