Learn to Run - Platform Configuration - Set Node Roles

Note - Always refer to the official documentation - this is just a student's guide

The following node roles can be configured on the cluster:

  1. System node: Reserved for Run:ai system-level services.

  2. GPU Worker node: Dedicated for GPU-based workloads.

  3. CPU Worker node: Used for CPU-only workloads.
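
Each role is applied to a node as a Kubernetes label. As a rough sketch, the helper below maps a role name to a label. Only the runai-cpu-worker key appears in this guide; the runai-system and runai-gpu-worker keys are assumed to follow the same pattern and should be checked against the official documentation. The function name role_label is made up for illustration.

```shell
#!/usr/bin/env bash
# role_label: hypothetical helper that prints the node label for a role.
# Only the cpu-worker key is taken from this guide; the other two are
# assumed to follow the same naming pattern.
role_label() {
  case "$1" in
    system)     echo "node-role.kubernetes.io/runai-system=true" ;;
    gpu-worker) echo "node-role.kubernetes.io/runai-gpu-worker=true" ;;
    cpu-worker) echo "node-role.kubernetes.io/runai-cpu-worker=true" ;;
    *)          echo "unknown role: $1" >&2; return 1 ;;
  esac
}

# Print the labelling command instead of running it against a cluster.
echo "kubectl label nodes <node-name> $(role_label cpu-worker)"
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; replace <node-name> and drop the echo to apply the label for real.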

Pre-reqs

  1. Ensure that scheduling restrictions are enabled in the cluster.

Edit the runaiconfig resource and set global.nodeAffinity.restrictScheduling to true.

kubectl edit runaiconfig runai -n runai
# Add the following field (written here as a dotted path; in the YAML each part is a nested key):
#     global.nodeAffinity.restrictScheduling: true
  2. Label the node to reflect the role:
# List the nodes
kubectl get nodes
# Choose a node and restrict it to CPU-only workloads
kubectl label nodes <node-name> node-role.kubernetes.io/runai-cpu-worker=true
  3. Check that the label has been applied:
kubectl get nodes <node-name> --show-labels
  4. Remove the label (the trailing "-" deletes it):
kubectl label nodes <node-name> node-role.kubernetes.io/runai-cpu-worker-
  5. Edit the runaiconfig resource again to disable global.nodeAffinity.restrictScheduling:
kubectl edit runaiconfig runai -n runai
# Set the following:
#     global.nodeAffinity.restrictScheduling: false

If you skip this last step and leave restrictScheduling enabled after removing the label, jobs requiring CPU resources will fail to schedule, because no node is labeled as a CPU worker any more. This shows up as an event like the following:

4m33s (x25 over 6m33s)   Normal    Unschedulable       PodGroup/pg-jupyter-cpu-only-0-e592844a-618f-4e9d-9160-ef9c14895b6f   PodSchedulingErrors: Resources were not found for pod runai-test-project/jupyter-cpu-only-0-0 due to: Scheduling conditions were not met for pod runai-test-project/jupyter-cpu-only-0-0:
MaxNodePoolResources: The pod runai-test-project/jupyter-cpu-only-0-0 requires GPU: 0, CPU: 0.1 (cores), memory: 0.1 (GB). No node in the default node-pool has CPU resources.
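
The first and last steps above both use an interactive kubectl edit. A non-interactive alternative is a merge patch, sketched below. It assumes the field sits under .spec in the runaiconfig resource, which this guide does not state, so confirm the path with kubectl get runaiconfig runai -n runai -o yaml before relying on it. The function name restrict_scheduling_patch and the APPLY variable are made up for illustration.

```shell
#!/usr/bin/env bash
# restrict_scheduling_patch: hypothetical helper that prints a JSON
# merge-patch body for the restrictScheduling flag.
# Assumption: the field lives under .spec in the runaiconfig resource.
restrict_scheduling_patch() {
  local value="$1"   # "true" or "false"
  printf '{"spec":{"global":{"nodeAffinity":{"restrictScheduling":%s}}}}' "$value"
}

# Only touch the cluster when explicitly asked to (APPLY=1),
# so the script can be run or sourced without side effects.
if [ "${APPLY:-0}" = "1" ]; then
  kubectl patch runaiconfig runai -n runai --type merge \
    -p "$(restrict_scheduling_patch "${1:-false}")"
fi
```

For example, if the block is saved as a script, running it with APPLY=1 and the argument false would disable the restriction again, which is the same change as the final step of the list.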