Skip to content

Installing a Self-Hosted Cluster

Docs

Note - Always refer to documentation - this is just a students' guide

Pre-Reqs

Make Sure you're in the right Directory

  • Change to the Artifacts directory
cd /home/ubuntu/Artifacts
  • Set your cluster ENV on the instance
# Refer to the Students page in Course Details
export MY_DOMAIN="[your_name].learn-to.run"

Create Namespaces

  • Create runai-backend and runai namespaces
kubectl create ns runai
kubectl create ns runai-backend

Enable registry secret

  • We'll need to provide this secret to access the Run AI registry to pull all artifacts
kubectl create -f ./install/runai-gcr-secret.yaml

Certificates

Domain Certificate

  • You have been supplied with some valid certificates for your domain in the certificates directory. Apply them as secrets.
kubectl create secret tls runai-backend-tls -n runai-backend \
--cert ./certificates/fullchain.pem --key ./certificates/private.pem

kubectl create secret tls runai-cluster-domain-tls-secret -n runai \
--cert ./certificates/fullchain.pem --key ./certificates/private.pem

Cluster Installation when running with a local certificate authority

  • Create a secret for the CA certificate.
kubectl -n runai create secret generic runai-ca-cert \
--from-file=runai-ca.pem=./certificates/fullchain.pem

Install GPU Operator

  • We will install a fake GPU operator rather than the official NVIDIA one as we have no real GPUs in the nodes
helm upgrade -i gpu-operator oci://ghcr.io/run-ai/fake-gpu-operator/fake-gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --version 0.0.53 \
  -f ./gpu-driver/values.yaml

Install Prometheus

  • We'll need prometheus for our metrics
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
    -n monitoring --create-namespace --set grafana.enabled=false

Install Ingress Controller

  • We'll use nginx. As it's a KIND cluster, there's a manifest available for that. First we'll label the nodes appropriately for this particular manifest
kubectl apply -f ./ingress/nginx-ingress.yaml

Run the pre-flight check script

  • Download and run the script. Correct any errors before proceeding:
wget -O preflight https://github.com/run-ai/preinstall-diagnostics/releases/download/v2.16.19/preinstall-diagnostics-linux-amd64
chmod +x preflight
./preflight --cluster-domain $MY_DOMAIN --domain $MY_DOMAIN

Install the Backend

  • Helm install the backend
helm repo add runai-backend https://runai.jfrog.io/artifactory/cp-charts-prod
helm repo update
helm upgrade -i runai-backend -n runai-backend runai-backend/control-plane --version "~2.20" --set global.domain=$MY_DOMAIN

Observe the Backend Pods

  • In your terminal, launch K9S
k9s
  • Set the namespace
:ns # [enter]
# use arrow keys to select runai-backend [enter]

Open the Web UI

  • Once all pods in the runai-backend namespace are healthy, open a web browser and navigate to $MY_DOMAIN. Use credentials user: test@run.ai / password: Abcd!234

Install the Cluster

  • Follow the instructions after you log in
  • Deploy to the same cluster as the control plane and set relevant matching version
  • Copy the installation instructions to deploy
  • Click "done"

Observe the Cluster Pods

  • In your terminal, launch K9S
k9s
  • Set the namespace
:ns # [enter]
# use arrow keys to select runai [enter]

Check for the correct nodes and GPUs

  • Visit the nodes and page to ensure that all expected GPUs are registered
  • View the dashboards to ensure that metrics are working as expected