Understanding Preemption

This exercise demonstrates how Run:AI's scheduler handles preemption - the ability to pause lower-priority workloads to make room for higher-priority ones or ensure fair resource sharing.

What is Preemption?

Preemption occurs when:

  • A workload is running over-quota (using more resources than its allocated quota)
  • Another project needs resources within its guaranteed quota
  • The scheduler redistributes resources fairly based on priority and quotas

Key Concepts

Priority Classes

Run:AI uses these default priority levels:

Inference: 125 (Non-preemptible)
Build Workspace: 100 (Non-preemptible)
Interactive Workspace: 75 (Preemptible)
Training: 50 (Preemptible)
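
Run:AI typically implements these priorities as Kubernetes PriorityClass objects. As a quick sanity check (a sketch; the exact class names, such as train, are an assumption and can vary between Run:AI versions), you can list them with kubectl:

# List the priority classes installed on the cluster and their values
kubectl get priorityclasses

# Show the value of a specific class, e.g. the training class
kubectl get priorityclass train -o jsonpath='{.value}'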

Scheduling Behavior

  • Over-Quota: Projects can use unused resources from other projects
  • Fair Share: Resources are redistributed when rightful owners need them
  • Gang Scheduling: Multi-GPU workloads are scheduled as a unit
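
For example, with gang scheduling a distributed workload that needs more GPUs than are currently free stays entirely Pending rather than starting partially. You can observe this by watching the project's pods (a sketch, assuming the default convention in which Run:AI creates a runai-<project> namespace per project):

# All pods of a gang-scheduled workload move between Pending and Running together
kubectl get pods -n runai-<project> --watch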

Prerequisites

Before starting this exercise:

  • Admin access to create projects and configure quotas
  • Clean environment with no existing workloads
  • At least 8 GPUs available on a single node
  • Understanding of project and quota concepts

Exercise Setup

1. Prepare the Environment

Clear Existing Workloads:

# List and delete all workloads
runai list --all-projects
runai delete <workload-name> --project <project-name>

Verify Clean State:

# Ensure no workloads are running
runai list --all-projects

# Check GPU availability
runai cluster-info

2. Create Node Pool

  1. Navigate to Platform Admin → Cluster Management → Node Pools
  2. Click "+ NEW NODE POOL"
  3. Configure the node pool:
Name: single-node-pool
Nodes: Select 1 node with 8 GPUs
GPU Allocation: All 8 GPUs to this pool
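
To confirm the pool from the CLI, you can reuse the node pool listing command that also appears later in the Troubleshooting section:

# Confirm the new node pool is registered
runai get node-pools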

3. Create Projects with Quotas

Create Project A:

  1. Go to Platform Admin → Projects
  2. Click "+ NEW PROJECT"
  3. Configure:
     Name: project-a
     Description: "First project for preemption demo"
  4. Set quota in the single-node-pool:
     GPU Quota: 4 GPUs

Create Project B:

  1. Repeat the process for a second project:
     Name: project-b
     Description: "Second project for preemption demo"
     GPU Quota: 4 GPUs (in single-node-pool)
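
With both projects in place, a quick CLI check (assuming your runai CLI is logged in to this cluster) confirms the quotas before any workloads are submitted:

# List projects and their GPU quotas
runai list projects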

Exercise Steps

Step 1: Create Over-Quota Training Workload

Submit Training Job in Project A:

runai submit "large-training-job" \
    --project project-a \
    --image pytorch/pytorch:latest \
    --gpu 8 \
    --command "python -c 'import time; import torch; print(f\"Using {torch.cuda.device_count()} GPUs\"); time.sleep(3600)'"

Verify Over-Quota Status:

  1. Navigate to Workload manager → Workloads
  2. Check that the training job is Running with 8 GPUs
  3. Note in the dashboard that project-a is over-quota (using 8/4 GPUs)

Step 2: Observe Over-Quota Behavior

Check Resource Allocation:

# View detailed workload information
runai describe large-training-job --project project-a

# Monitor cluster resources
runai cluster-info

Key Observations:

  • The training workload gets all 8 GPUs despite a 4-GPU quota
  • This is allowed because project-b isn't using its quota
  • project-a is now running over-quota
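
You can cross-check the allocation at the node level with kubectl (a quick sketch; substitute the node that backs single-node-pool):

# Confirm that all 8 GPUs are requested on the node
kubectl describe node <node-name> | grep -A 10 'Allocated resources'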

Step 3: Trigger Preemption

Create Interactive Workload in Project B:

runai submit "jupyter-workspace" \
    --project project-b \
    --image jupyter/tensorflow-notebook \
    --gpu 4 \
    --interactive \
    --port 8888:8888

Observe the Preemption Process: Watch the workload status in real time:

watch -n 2 'runai list --all-projects'

Expected behavior:

  • The project-b Jupyter workspace starts scheduling
  • The project-a training job is preempted and transitions to Pending
  • Resources are redistributed fairly: 4 GPUs per project
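
To follow the preemption from the Kubernetes side as well, you can watch the events in the project's namespace (a sketch; it assumes the default convention in which Run:AI creates a runai-<project> namespace per project):

# Follow scheduling and preemption events for project-a's pods
watch -n 2 'kubectl get events -n runai-project-a --sort-by=.lastTimestamp | tail -20'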

Step 4: Monitor Preemption Events

Check Workload Events:

# View events for the preempted training job
runai describe large-training-job --project project-a --show-events

# Check events for the new interactive job
runai describe jupyter-workspace --project project-b --show-events

Dashboard Monitoring:

  1. Navigate to Platform Admin → Analytics
  2. Observe:
     • GPU Utilization changes
     • Project Resource Usage rebalancing
     • Quota vs. Usage metrics

Step 5: Test Resource Return

Stop the Interactive Workload:

runai delete jupyter-workspace --project project-b

Observe Automatic Rescheduling:

# Watch the training job return to running state
watch -n 2 'runai describe large-training-job --project project-a'

Expected Behavior:

  • The training workload automatically resumes with 8 GPUs
  • No manual intervention is required
  • Resources are automatically redistributed

Understanding the Results

Priority-Based Scheduling

The preemption occurred because:

  1. Interactive workspaces (priority 75) outrank training jobs (priority 50)
  2. project-b had a guaranteed quota of 4 GPUs
  3. The fair-share algorithm redistributed resources accordingly

Quota vs. Over-Quota

Initial State:
  project-a: 8/4 GPUs (over-quota)
  project-b: 0/4 GPUs (under-quota)

After Preemption:
  project-a: 4/4 GPUs (at quota)
  project-b: 4/4 GPUs (at quota)

After project-b Cleanup:
  project-a: 8/4 GPUs (over-quota again)
  project-b: 0/4 GPUs (under-quota)

Advanced Monitoring

Real-Time Resource Tracking

Monitor GPU Allocation:

# Check node-level GPU allocation
kubectl describe node <node-name> | grep nvidia.com/gpu

# View Run:AI resource allocation
runai get resources --project project-a
runai get resources --project project-b

Performance Impact Analysis:

# Monitor GPU utilization during preemption
watch -n 1 'nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv'

Event Timeline Analysis

Create a timeline of preemption events:

# Export events for analysis
runai describe large-training-job --project project-a --show-events > preemption-events.log

# Look for key events
grep -E "(Preempted|Scheduled|Started)" preemption-events.log

Best Practices

Designing for Preemption

  1. Checkpoint Frequently: Enable your training jobs to resume from checkpoints

    # Save model checkpoints regularly
    if epoch % 10 == 0:
        torch.save(model.state_dict(), f'/shared/checkpoint-{epoch}.pth')
    

  2. Handle Interruptions Gracefully: Use signal handlers for cleanup

    import signal
    import sys

    # Assumes `model` and `torch` are already defined by the surrounding training script
    def signal_handler(sig, frame):
        print('Saving final checkpoint before preemption...')
        torch.save(model.state_dict(), '/shared/final-checkpoint.pth')
        sys.exit(0)

    signal.signal(signal.SIGTERM, signal_handler)
    

  3. Use Appropriate Priorities: Match workload types to their intended use

     • Training: Use train priority for batch jobs
     • Development: Use interactive priority for Jupyter notebooks
     • Production: Use inference priority for serving models

Resource Planning

  1. Right-Size Quotas: Set quotas based on actual needs, not maximum usage
  2. Plan for Over-Quota: Design workloads to benefit from unused resources
  3. Monitor Patterns: Use analytics to understand usage patterns and optimize

Troubleshooting

Common Issues

Workload Won't Preempt:

# Check workload priorities
runai describe <workload-name> | grep Priority

# Verify quota settings
runai describe project <project-name>

Preemption Takes Too Long:

  • Check whether workloads have proper signal handling
  • Verify graceful termination timeouts
  • Look for hung processes preventing cleanup
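
One quick check (a sketch, assuming you know the underlying pod's name and namespace) is to inspect the pod's termination grace period and watch how long it stays in Terminating:

# Show the pod's termination grace period in seconds
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.terminationGracePeriodSeconds}'

# Watch the pod while it terminates
kubectl get pod <pod-name> -n <namespace> --watch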

Unexpected Preemption Behavior:

# Check scheduler logs
kubectl logs -n runai-system <scheduler-pod-name>

# Verify node pool configurations
runai get node-pools

Key Takeaways

After completing this exercise, you should understand:

  1. Fair Share Scheduling: How Run:AI balances resources across projects
  2. Priority-Based Preemption: How different workload types are prioritized
  3. Over-Quota Benefits: How projects can utilize unused cluster resources
  4. Automatic Recovery: How workloads resume when resources become available

Next Steps

Now that you understand preemption:

  1. Implement Checkpointing: Add checkpoint/resume capability to your training jobs
  2. Optimize Priorities: Use appropriate priority classes for different workload types
  3. Monitor Usage: Set up alerting for quota utilization and preemption events
  4. Plan Capacity: Use preemption patterns to inform resource planning