Installing a Self-Hosted Cluster¶

Note: Always refer to the official documentation. This is only a students' guide.

Pre-Reqs¶

Make Sure You're in the Right Directory¶

Change to the Artifacts directory

cd /home/ubuntu/Artifacts

Set your cluster domain as an environment variable on the instance

# Refer to the Students page in Course Details
export MY_DOMAIN="[your_name].learn-to.run"
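
Every later step reads MY_DOMAIN, so a leftover placeholder tends to surface much later as a confusing failure. The following is a minimal sketch of a sanity check, assuming your domain ends in .learn-to.run as in the example above. The function name check_domain is hypothetical and not part of the course material.

```shell
# Hypothetical helper: succeed only if the given domain looks usable.
check_domain() {
  case "$1" in
    "")
      echo "MY_DOMAIN is empty" >&2; return 1 ;;
    *"["*|*"]"*)
      echo "MY_DOMAIN still contains the [your_name] placeholder" >&2; return 1 ;;
    *.learn-to.run)
      echo "MY_DOMAIN looks fine: $1" ;;
    *)
      echo "MY_DOMAIN does not end in .learn-to.run: $1" >&2; return 1 ;;
  esac
}
```

Run it right after the export as check_domain "$MY_DOMAIN" and fix the variable if it reports a problem.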

Create Namespaces¶

Create the runai and runai-backend namespaces

kubectl create ns runai
kubectl create ns runai-backend
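
If you need to repeat this step, kubectl create ns reports an error for a namespace that already exists. As a sketch of a re-runnable variant, the hypothetical helper below (ensure_ns is not part of the course material) renders the namespace manifest client-side and pipes it to kubectl apply.

```shell
# Hypothetical helper: create a namespace if it is missing, leave it alone otherwise.
ensure_ns() {
  # --dry-run=client only prints the manifest; kubectl apply then creates or updates it.
  kubectl create ns "$1" --dry-run=client -o yaml | kubectl apply -f -
}
```

For this guide the calls would be ensure_ns runai and ensure_ns runai-backend.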

Enable registry secret¶

This secret grants access to the Run AI registry, from which all artifacts are pulled.

kubectl create -f ./install/runai-gcr-secret.yaml

Certificates¶

Domain Certificate¶

You have been supplied with some valid certificates for your domain in the certificates directory. Apply them as secrets.

kubectl create secret tls runai-backend-tls -n runai-backend \
--cert ./certificates/fullchain.pem --key ./certificates/private.pem

kubectl create secret tls runai-cluster-domain-tls-secret -n runai \
--cert ./certificates/fullchain.pem --key ./certificates/private.pem
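
A mismatched certificate and key pair is easier to diagnose before the secrets are created than after. The helpers below are a sketch using standard openssl subcommands. The function names are hypothetical, and the check assumes the leaf certificate is the first entry in fullchain.pem.

```shell
# Hypothetical helper: succeed if the private key belongs to the first certificate in the file.
cert_matches_key() {
  cert_pub=$(openssl x509 -in "$1" -noout -pubkey) || return 1
  key_pub=$(openssl pkey -in "$2" -pubout) || return 1
  [ "$cert_pub" = "$key_pub" ]
}

# Hypothetical helper: print the subject and expiry date of the first certificate in the file.
cert_summary() {
  openssl x509 -in "$1" -noout -subject -enddate
}
```

For example, cert_matches_key ./certificates/fullchain.pem ./certificates/private.pem should succeed, and cert_summary ./certificates/fullchain.pem should show a subject that matches your domain.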

Cluster Installation when running with a local certificate authority¶

Create a secret for the CA certificate.

kubectl -n runai create secret generic runai-ca-cert \
--from-file=runai-ca.pem=./certificates/fullchain.pem

Install GPU Operator¶

Because the nodes have no real GPUs, we will install a fake GPU operator rather than the official NVIDIA one.

helm upgrade -i gpu-operator oci://ghcr.io/run-ai/fake-gpu-operator/fake-gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --version 0.0.53 \
  -f ./gpu-driver/values.yaml

Install Prometheus¶

We'll need Prometheus to collect our metrics.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
    -n monitoring --create-namespace --set grafana.enabled=false

Install Ingress Controller¶

We'll use NGINX. Because this is a kind cluster, a dedicated manifest is available. Make sure the nodes are labelled appropriately for this particular manifest, then apply it.

kubectl apply -f ./ingress/nginx-ingress.yaml
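
The guide does not show a command for the node labelling step. As a sketch, assuming ./ingress/nginx-ingress.yaml follows the upstream kind ingress-nginx manifest (which schedules the controller on nodes labelled ingress-ready=true and installs into the ingress-nginx namespace), the hypothetical helpers below label the nodes and wait for the controller. Check the nodeSelector in your copy of the manifest before relying on either assumption.

```shell
# Assumption: the manifest's nodeSelector is ingress-ready=true, as in the upstream kind manifest.
label_ingress_nodes() {
  kubectl label nodes --all ingress-ready=true --overwrite
}

# Assumption: the controller runs in the ingress-nginx namespace with the upstream component label.
wait_for_ingress() {
  kubectl wait --namespace ingress-nginx \
    --for=condition=ready pod \
    --selector=app.kubernetes.io/component=controller \
    --timeout=90s
}
```

Run label_ingress_nodes before applying the manifest and wait_for_ingress afterwards.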

Run the pre-flight check script¶

Download and run the script. Correct any errors before proceeding:

wget -O preflight https://github.com/run-ai/preinstall-diagnostics/releases/download/v2.16.19/preinstall-diagnostics-linux-amd64
chmod +x preflight
./preflight --cluster-domain "$MY_DOMAIN" --domain "$MY_DOMAIN"

Install the Backend¶

Install the backend with Helm

helm repo add runai-backend https://runai.jfrog.io/artifactory/cp-charts-prod
helm repo update
helm upgrade -i runai-backend -n runai-backend runai-backend/control-plane \
  --version "~2.20" \
  --set global.domain="$MY_DOMAIN"

Observe the Backend Pods¶

In your terminal, launch k9s

k9s

Set the namespace

:ns # [enter]
# use arrow keys to select runai-backend [enter]

Open the Web UI¶

Once all pods in the runai-backend namespace are healthy, open a web browser and navigate to the domain stored in $MY_DOMAIN. Log in with user test@run.ai and password Abcd!234
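
If you prefer a non-interactive check to watching k9s, the sketch below waits for the workloads and then probes the UI. It rests on two assumptions: that the backend workloads you care about are Deployments (StatefulSets and Jobs are not covered by this wait), and that the UI is served over HTTPS on your domain. Both function names are hypothetical.

```shell
# Hypothetical helper: block until every Deployment in the namespace reports Available.
# $1 = namespace, $2 = optional timeout (defaults to 600s).
wait_for_deployments() {
  kubectl wait deployment --all -n "$1" --for=condition=Available --timeout="${2:-600s}"
}

# Hypothetical helper: print the HTTP status code returned by the UI (assumes HTTPS).
ui_status() {
  curl -sS -o /dev/null -w '%{http_code}\n' "https://$1"
}
```

A typical call would be wait_for_deployments runai-backend followed by ui_status "$MY_DOMAIN". The same wait can be reused for the runai namespace after the cluster is installed.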

Install the Cluster¶

Follow the instructions shown after you log in
Choose to deploy to the same cluster as the control plane and select the version that matches the control plane
Copy the installation instructions and run them to deploy the cluster
Click "done"

Observe the Cluster Pods¶

In your terminal, launch k9s

k9s

Set the namespace

:ns # [enter]
# use arrow keys to select runai [enter]

Check for the correct nodes and GPUs¶

Visit the Nodes page to ensure that all expected GPUs are registered
View the dashboards to ensure that metrics are working as expected
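
The GPU count can also be cross-checked from the command line. This is a sketch that assumes the fake GPU operator advertises GPUs under the same nvidia.com/gpu resource name as the official operator. The function name list_node_gpus is hypothetical.

```shell
# Hypothetical helper: list each node with the number of GPUs it reports as allocatable.
list_node_gpus() {
  # The dot in nvidia.com is escaped so that it is read as part of the resource name.
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

Nodes that advertise no GPUs show an empty value in the GPUS column, which makes a missing registration easy to spot against what the UI reports.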