Understanding Preemption

This exercise demonstrates how Run:AI's scheduler handles preemption - the ability to pause lower-priority workloads to make room for higher-priority ones or ensure fair resource sharing.

What is Preemption?

Preemption occurs when:

  • A workload is running over-quota (using more resources than its allocated quota)
  • Another project needs resources within its guaranteed quota
  • The scheduler redistributes resources fairly based on priority and quotas

Key Concepts

Priority Classes

Run:AI uses these default priority levels:

Inference: 125 (Non-preemptible)
Build Workspace: 100 (Non-preemptible)
Interactive Workspace: 75 (Preemptible)
Training: 50 (Preemptible)
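
Run:AI typically implements these priorities as Kubernetes PriorityClass objects. As a quick sanity check (a sketch; the exact class names, such as train, are an assumption and can vary between Run:AI versions), you can list them with kubectl:

# List the priority classes installed on the cluster and their values
kubectl get priorityclasses

# Show the value of a specific class, e.g. the training class
kubectl get priorityclass train -o jsonpath='{.value}'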

Scheduling Behavior

  • Over-Quota: Projects can use unused resources from other projects
  • Fair Share: Resources are redistributed when rightful owners need them
  • Gang Scheduling: Multi-GPU workloads are scheduled as a unit
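
For example, with gang scheduling a distributed workload that needs more GPUs than are currently free stays entirely Pending rather than starting partially. You can observe this by watching the project's pods (a sketch, assuming the default convention in which Run:AI creates a runai-<project> namespace per project):

# All pods of a gang-scheduled workload move between Pending and Running together
kubectl get pods -n runai-<project> --watch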

Prerequisites

Before starting this exercise:

  • Admin access to create projects and configure quotas
  • Clean environment with no existing workloads
  • At least 8 GPUs available on a single node
  • Understanding of project and quota concepts

Exercise Setup

1. Prepare the Environment

Clear Existing Workloads:

# List and delete all workloads
runai list --all-projects
runai delete <workload-name> --project <project-name>

Verify Clean State:

# Ensure no workloads are running
runai list --all-projects

# Check GPU availability
runai cluster-info

2. Create Node Pool

  1. Navigate to Platform Admin → Cluster Management → Node Pools
  2. Click "+ NEW NODE POOL"
  3. Configure the node pool:
Name: single-node-pool
Nodes: Select 1 node with 8 GPUs
GPU Allocation: All 8 GPUs to this pool
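
To confirm the pool from the CLI, you can reuse the node pool listing command that also appears later in the Troubleshooting section:

# Confirm the new node pool is registered
runai get node-pools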

3. Create Projects with Quotas

Create Project A:

  1. Go to Platform Admin → Projects
  2. Click "+ NEW PROJECT"
  3. Configure:
     Name: project-a
     Description: "First project for preemption demo"
  4. Set quota in the single-node-pool:
     GPU Quota: 4 GPUs

Create Project B:

  1. Repeat the process for a second project:
     Name: project-b
     Description: "Second project for preemption demo"
     GPU Quota: 4 GPUs (in single-node-pool)
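
With both projects in place, a quick CLI check (assuming your runai CLI is logged in to this cluster) confirms the quotas before any workloads are submitted:

# List projects and their GPU quotas
runai list projects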

Exercise Steps

Step 1: Create Over-Quota Training Workload

Submit Training Job in Project A:

runai submit "large-training-job" \
    --project project-a \
    --image pytorch/pytorch:latest \
    --gpu 8 \
    --command "python -c 'import time; import torch; print(f\"Using {torch.cuda.device_count()} GPUs\"); time.sleep(3600)'"

Verify Over-Quota Status:

  1. Navigate to Workload manager → Workloads
  2. Check that the training job is Running with 8 GPUs
  3. Note in the dashboard that project-a is over-quota (using 8/4 GPUs)

Step 2: Observe Over-Quota Behavior

Check Resource Allocation:

# View detailed workload information
runai describe large-training-job --project project-a

# Monitor cluster resources
runai cluster-info

Key Observations:

  • The training workload gets all 8 GPUs despite a 4-GPU quota
  • This is allowed because project-b isn't using its quota
  • project-a is now running over-quota
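
You can cross-check the allocation at the node level with kubectl (a quick sketch; substitute the node that backs single-node-pool):

# Confirm that all 8 GPUs are requested on the node
kubectl describe node <node-name> | grep -A 10 'Allocated resources'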

Step 3: Trigger Preemption

Create Interactive Workload in Project B:

runai submit "jupyter-workspace" \
    --project project-b \
    --image jupyter/tensorflow-notebook \
    --gpu 4 \
    --interactive \
    --port 8888:8888

Observe the Preemption Process: Watch the workload status in real time:

watch -n 2 'runai list --all-projects'

Expected behavior:

  • The project-b Jupyter workspace starts scheduling
  • The project-a training job is preempted and transitions to Pending
  • Resources are redistributed fairly: 4 GPUs per project
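
To follow the preemption from the Kubernetes side as well, you can watch the events in the project's namespace (a sketch; it assumes the default convention in which Run:AI creates a runai-<project> namespace per project):

# Follow scheduling and preemption events for project-a's pods
watch -n 2 'kubectl get events -n runai-project-a --sort-by=.lastTimestamp | tail -20'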

Step 4: Monitor Preemption Events

Check Workload Events:

# View events for the preempted training job
runai describe large-training-job --project project-a --show-events

# Check events for the new interactive job
runai describe jupyter-workspace --project project-b --show-events

Dashboard Monitoring:

  1. Navigate to Platform Admin → Analytics
  2. Observe:
     • GPU Utilization changes
     • Project Resource Usage rebalancing
     • Quota vs. Usage metrics

Step 5: Test Resource Return

Stop the Interactive Workload:

runai delete jupyter-workspace --project project-b

Observe Automatic Rescheduling:

# Watch the training job return to running state
watch -n 2 'runai describe large-training-job --project project-a'

Expected Behavior:

  • The training workload automatically resumes with 8 GPUs
  • No manual intervention is required
  • Resources are automatically redistributed

Understanding the Results

Priority-Based Scheduling

The preemption occurred because:

  1. Interactive workspaces (priority 75) outrank training jobs (priority 50)
  2. project-b had a guaranteed quota of 4 GPUs
  3. The fair-share algorithm redistributed resources accordingly

Quota vs. Over-Quota

Initial State:
  project-a: 8/4 GPUs (over-quota)
  project-b: 0/4 GPUs (under-quota)

After Preemption:
  project-a: 4/4 GPUs (at quota)
  project-b: 4/4 GPUs (at quota)

After project-b Cleanup:
  project-a: 8/4 GPUs (over-quota again)
  project-b: 0/4 GPUs (under-quota)

Advanced Monitoring

Real-Time Resource Tracking

Monitor GPU Allocation:

# Check node-level GPU allocation
kubectl describe node <node-name> | grep nvidia.com/gpu

# View Run:AI resource allocation
runai get resources --project project-a
runai get resources --project project-b

Performance Impact Analysis:

# Monitor GPU utilization during preemption
watch -n 1 'nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv'

Event Timeline Analysis

Create a timeline of preemption events:

# Export events for analysis
runai describe large-training-job --project project-a --show-events > preemption-events.log

# Look for key events
grep -E "(Preempted|Scheduled|Started)" preemption-events.log

Best Practices

Designing for Preemption

  1. Checkpoint Frequently: Enable your training jobs to resume from checkpoints

    # Save model checkpoints regularly
    if epoch % 10 == 0:
        torch.save(model.state_dict(), f'/shared/checkpoint-{epoch}.pth')
    

  2. Handle Interruptions Gracefully: Use signal handlers for cleanup

    import signal
    import sys

    # Assumes `model` and `torch` are already defined by the surrounding training script
    def signal_handler(sig, frame):
        print('Saving final checkpoint before preemption...')
        torch.save(model.state_dict(), '/shared/final-checkpoint.pth')
        sys.exit(0)

    signal.signal(signal.SIGTERM, signal_handler)
    

  3. Use Appropriate Priorities: Match workload types to their intended use

     • Training: Use train priority for batch jobs
     • Development: Use interactive priority for Jupyter notebooks
     • Production: Use inference priority for serving models

Resource Planning

  1. Right-Size Quotas: Set quotas based on actual needs, not maximum usage
  2. Plan for Over-Quota: Design workloads to benefit from unused resources
  3. Monitor Patterns: Use analytics to understand usage patterns and optimize

Troubleshooting

Common Issues

Workload Won't Preempt:

# Check workload priorities
runai describe <workload-name> | grep Priority

# Verify quota settings
runai describe project <project-name>

Preemption Takes Too Long:

  • Check whether workloads have proper signal handling
  • Verify graceful termination timeouts
  • Look for hung processes preventing cleanup
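
One quick check (a sketch, assuming you know the underlying pod's name and namespace) is to inspect the pod's termination grace period and watch how long it stays in Terminating:

# Show the pod's termination grace period in seconds
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.terminationGracePeriodSeconds}'

# Watch the pod while it terminates
kubectl get pod <pod-name> -n <namespace> --watch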

Unexpected Preemption Behavior:

# Check scheduler logs
kubectl logs -n runai-system <scheduler-pod-name>

# Verify node pool configurations
runai get node-pools

Key Takeaways

After completing this exercise, you should understand:

  1. Fair Share Scheduling: How Run:AI balances resources across projects
  2. Priority-Based Preemption: How different workload types are prioritized
  3. Over-Quota Benefits: How projects can utilize unused cluster resources
  4. Automatic Recovery: How workloads resume when resources become available

Next Steps

Now that you understand preemption:

  1. Implement Checkpointing: Add checkpoint/resume capability to your training jobs
  2. Optimize Priorities: Use appropriate priority classes for different workload types
  3. Monitor Usage: Set up alerting for quota utilization and preemption events
  4. Plan Capacity: Use preemption patterns to inform resource planning