Installing a Self-Hosted Cluster
Note: Always refer to the official documentation. This is only a student guide.
Pre-Reqs
Make sure you're in the right directory
- Change to the Artifacts directory
- Set your cluster ENV on the instance
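A minimal sketch of these two steps. It assumes the artifacts live in `~/Artifacts` and that the cluster domain is kept in `MY_DOMAIN`, the variable the later commands in this guide read. The directory path and the example domain are placeholders, not values supplied by this guide.

```shell
# Hypothetical location of the course artifacts; adjust to your instance.
ARTIFACTS_DIR="${ARTIFACTS_DIR:-$HOME/Artifacts}"
if [ -d "$ARTIFACTS_DIR" ]; then
  cd "$ARTIFACTS_DIR"
else
  echo "Artifacts directory not found: $ARTIFACTS_DIR" >&2
fi

# Placeholder domain; replace with the domain assigned to your cluster.
# An already-exported MY_DOMAIN is left untouched.
export MY_DOMAIN="${MY_DOMAIN:-cluster.example.com}"
echo "Using cluster domain: $MY_DOMAIN"
```

The export only lasts for the current shell session, so repeat it if you open a new terminal.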
Create Namespaces
- Create runai-backend and runai namespaces
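One way to do this with plain `kubectl`, assuming your kubeconfig already points at the target cluster:

```shell
# Namespace for the control plane (backend)
kubectl create namespace runai-backend
# Namespace for the cluster components
kubectl create namespace runai

# Confirm both namespaces exist
kubectl get namespaces runai-backend runai
```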
Enable registry secret
- We'll need to provide this secret so the cluster can pull all artifacts from the Run:ai registry
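A hedged sketch of what creating such a secret can look like. The secret name `runai-reg-creds`, the registry server, and the credential placeholders are assumptions for illustration. Use the exact values or manifest issued to you, for example a file in the artifacts directory.

```shell
# Assumed secret name and registry host; the user and token are placeholders
# that must be replaced with the credentials you were given.
kubectl create secret docker-registry runai-reg-creds \
  --namespace runai-backend \
  --docker-server="https://runai.jfrog.io" \
  --docker-username="<REGISTRY_USER>" \
  --docker-password="<REGISTRY_TOKEN>"
```

If you were handed a ready-made secret manifest instead, applying it with `kubectl apply -f <file>` achieves the same result.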
Certificates
Domain Certificate
- You have been supplied with some valid certificates for your domain in the certificates directory. Apply them as secrets.
kubectl create secret tls runai-backend-tls -n runai-backend \
--cert ./certificates/fullchain.pem --key ./certificates/private.pem
kubectl create secret tls runai-cluster-domain-tls-secret -n runai \
--cert ./certificates/fullchain.pem --key ./certificates/private.pem
Cluster installation when running with a local certificate authority
- Create a secret for the CA certificate.
kubectl -n runai create secret generic runai-ca-cert \
--from-file=runai-ca.pem=./certificates/fullchain.pem
Install GPU Operator
- We will install a fake GPU operator rather than the official NVIDIA one, because the nodes have no real GPUs
helm upgrade -i gpu-operator oci://ghcr.io/run-ai/fake-gpu-operator/fake-gpu-operator \
--namespace gpu-operator \
--create-namespace \
--version 0.0.53 \
-f ./gpu-driver/values.yaml
Install Prometheus
- We'll need Prometheus to collect our metrics
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
-n monitoring --create-namespace --set grafana.enabled=false
Install Ingress Controller
- We'll use NGINX. Because this is a KIND cluster, a KIND-specific manifest is available. First label the nodes as that manifest expects, then apply it.
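A sketch of the labelling and install, assuming the upstream ingress-nginx KIND manifest, which schedules the controller only on nodes labelled `ingress-ready=true`. Labelling every node and tracking the `main` branch are choices made for this example; pinning a release tag is safer for repeatable installs.

```shell
# Label all nodes so the controller pod can be scheduled on any of them.
kubectl label nodes --all ingress-ready=true --overwrite

# Apply the KIND-specific ingress-nginx manifest (upstream URL, main branch).
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml

# Block until the controller pod reports Ready, or fail after 90 seconds.
kubectl wait --namespace ingress-nginx \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/component=controller \
  --timeout=90s
```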
Run the pre-flight check script
- Download and run the script. Correct any errors before proceeding:
wget -O preflight https://github.com/run-ai/preinstall-diagnostics/releases/download/v2.16.19/preinstall-diagnostics-linux-amd64
chmod +x preflight
./preflight --cluster-domain "$MY_DOMAIN" --domain "$MY_DOMAIN"
Install the Backend
- Helm install the backend
helm repo add runai-backend https://runai.jfrog.io/artifactory/cp-charts-prod
helm repo update
helm upgrade -i runai-backend runai-backend/control-plane \
-n runai-backend \
--version "~2.20" \
--set global.domain="$MY_DOMAIN"
Observe the Backend Pods
- In your terminal, launch k9s
- Set the namespace to runai-backend
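For example, k9s can be opened directly in the backend namespace; the `kubectl` line is an alternative for terminals where k9s is not installed:

```shell
# Open k9s scoped to the control-plane namespace
k9s -n runai-backend

# Alternative without k9s: stream pod status changes (Ctrl+C to stop)
kubectl get pods -n runai-backend --watch
```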
Open the Web UI
- Once all pods in the runai-backend namespace are healthy, open a web browser and navigate to https://$MY_DOMAIN. Log in with user test@run.ai and password Abcd!234
Install the Cluster
- Follow the instructions shown after you log in
- Choose to deploy to the same cluster as the control plane, and select the cluster version that matches the control plane
- Copy the installation instructions to deploy
- Click "done"
Observe the Cluster Pods
- In your terminal, launch k9s
- Set the namespace to runai
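As a quick command-line complement to k9s, the following sketch lists only the pods in the `runai` namespace that are neither running nor completed. A pod in the Running phase is not necessarily Ready, so treat this as a first filter rather than a full health check.

```shell
# Show pods whose phase is anything other than Running or Succeeded.
# "No resources found" means nothing is pending or failed.
kubectl get pods -n runai \
  --field-selector='status.phase!=Running,status.phase!=Succeeded'
```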
Check for the correct nodes and GPUs
- Visit the Nodes page to ensure that all expected GPUs are registered
- View the dashboards to ensure that metrics are working as expected
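The GPU count can also be cross-checked from the terminal. This sketch assumes the fake GPU operator advertises its devices under the standard `nvidia.com/gpu` resource name:

```shell
# Print each node with the number of allocatable GPUs it advertises.
# Nodes without GPUs show <none> in the GPUS column.
kubectl get nodes \
  -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```

If the numbers here differ from the UI, check the pods in the gpu-operator and monitoring namespaces before digging further.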