Understanding Preemption
This exercise demonstrates how Run:AI's scheduler handles preemption - the ability to pause lower-priority workloads to make room for higher-priority ones or ensure fair resource sharing.
What is Preemption?
Preemption occurs when:
- A workload is running over-quota (using more resources than allocated)
- Another project needs resources within its guaranteed quota
- The scheduler fairly redistributes resources based on priority and quotas
Key Concepts
Priority Classes
Run:AI uses these default priority levels:
- Inference: 125 (Non-preemptible)
- Build Workspace: 100 (Non-preemptible)
- Interactive Workspace: 75 (Preemptible)
- Training: 50 (Preemptible)
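These levels generally correspond to Kubernetes PriorityClass objects that Run:AI registers in the cluster. To see what is actually configured in your environment (class names and values can differ between Run:AI versions), you can list them:
# List the priority classes registered in the cluster
kubectl get priorityclasses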
Scheduling Behavior
- Over-Quota: Projects can use unused resources from other projects
- Fair Share: Resources are redistributed when rightful owners need them
- Gang Scheduling: Multi-GPU workloads are scheduled as a unit
Prerequisites
Before starting this exercise, you need:
- Admin access to create projects and configure quotas
- A clean environment with no existing workloads
- At least 8 GPUs available on a single node
- An understanding of project and quota concepts
Exercise Setup
1. Prepare the Environment
Clear Existing Workloads:
# List and delete all workloads
runai list --all-projects
runai delete <workload-name> --project <project-name>
Verify Clean State:
# Ensure no workloads are running
runai list --all-projects
# Check GPU availability
runai cluster-info
2. Create Node Pool
- Navigate to Platform Admin → Cluster Management → Node Pools
- Click "+ NEW NODE POOL"
- Configure the node pool:
  - Name: single-node-pool (this name is referenced by the project quotas below)
  - Node selection: the label on the 8-GPU node (see the labeling example below)
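Node pools select nodes by a Kubernetes label. If you need to label the 8-GPU node yourself, the command looks like the following; the label key and value here are placeholders and must match whatever you enter in the node pool dialog, not a Run:AI default:
# Label the GPU node so the node pool can select it (key/value are examples)
kubectl label node <gpu-node-name> node-pool=single-node-pool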
3. Create Projects with Quotas
Create Project A:
1. Go to Platform Admin → Projects
2. Click "+ NEW PROJECT"
3. Configure:
   - Name: project-a
4. Set the quota in single-node-pool to 4 GPUs
Create Project B:
1. Repeat the process for a second project:
   - Name: project-b
   - Description: "Second project for preemption demo"
   - GPU Quota: 4 GPUs (in single-node-pool)
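To sanity-check both projects and their quotas from the CLI (treat this as a sketch; the exact command differs between Run:AI CLI versions):
# List projects with their GPU quotas and current allocation
runai list projects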
Exercise Steps
Step 1: Create Over-Quota Training Workload
Submit Training Job in Project A:
runai submit "large-training-job" \
--project project-a \
--image pytorch/pytorch:latest \
--gpu 8 \
--command "python -c 'import time; import torch; print(f\"Using {torch.cuda.device_count()} GPUs\"); time.sleep(3600)'"
Verify Over-Quota Status:
1. Navigate to Workload manager → Workloads
2. Check that the training job is Running with 8 GPUs
3. Note in the dashboard that project-a is over-quota (using 8/4 GPUs)
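The same information is visible from the CLI using the listing command from the setup step:
# Confirm large-training-job is Running with 8 allocated GPUs
runai list --all-projects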
Step 2: Observe Over-Quota Behavior
Check Resource Allocation:
# View detailed workload information
runai describe large-training-job --project project-a
# Monitor cluster resources
runai cluster-info
Key Observations:
- The training workload gets all 8 GPUs despite a 4-GPU quota
- This is allowed because project-b isn't using its quota
- project-a is now running over-quota
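You can see the same picture at the Kubernetes level. Run:AI conventionally creates one namespace per project named runai-<project-name>; if your cluster uses a different convention, adjust the namespace accordingly:
# Inspect the pods backing the over-quota workload
kubectl get pods -n runai-project-a -o wide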
Step 3: Trigger Preemption
Create Interactive Workload in Project B:
runai submit "jupyter-workspace" \
--project project-b \
--image jupyter/tensorflow-notebook \
--gpu 4 \
--interactive \
--port 8888:8888
Observe Preemption Process:
1. Watch the workload status in real-time (see the example below).
2. Expected behavior:
   - The project-b Jupyter workspace starts scheduling
   - The project-a training job transitions to Pending
   - Resources are redistributed fairly
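For the real-time view referenced above, a simple polling loop over the listing command already used in this exercise works well:
# Refresh the cross-project workload list every 2 seconds while preemption happens
watch -n 2 'runai list --all-projects'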
Step 4: Monitor Preemption Events
Check Workload Events:
# View events for the preempted training job
runai describe large-training-job --project project-a --show-events
# Check events for the new interactive job
runai describe jupyter-workspace --project project-b --show-events
Dashboard Monitoring:
1. Navigate to Platform Admin → Analytics
2. Observe:
   - GPU Utilization changes
   - Project Resource Usage rebalancing
   - Quota vs. Usage metrics
Step 5: Test Resource Return
Stop the Interactive Workload:
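Use the same delete pattern as in the setup step, for example:
# Remove the Jupyter workspace so its 4 GPUs are released
runai delete jupyter-workspace --project project-b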
Observe Automatic Rescheduling:
# Watch the training job return to running state
watch -n 2 'runai describe large-training-job --project project-a'
Expected Behavior:
- The training workload automatically resumes with 8 GPUs
- No manual intervention is required
- Resources are automatically redistributed
Understanding the Results
Priority-Based Scheduling
The preemption occurred because:
1. Interactive workspaces (priority 75) outrank training jobs (priority 50)
2. project-b had a guaranteed quota of 4 GPUs
3. The fair-share algorithm redistributed resources accordingly
Quota vs. Over-Quota
Initial State:
project-a: 8/4 GPUs (over-quota)
project-b: 0/4 GPUs (under-quota)
After Preemption:
project-a: 4/4 GPUs (at quota)
project-b: 4/4 GPUs (at quota)
After project-b Cleanup:
project-a: 8/4 GPUs (over-quota again)
project-b: 0/4 GPUs (under-quota)
Advanced Monitoring
Real-Time Resource Tracking
Monitor GPU Allocation:
# Check node-level GPU allocation
kubectl describe node <node-name> | grep nvidia.com/gpu
# View Run:AI resource allocation
runai get resources --project project-a
runai get resources --project project-b
Performance Impact Analysis:
# Monitor GPU utilization during preemption
watch -n 1 'nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv'
Event Timeline Analysis
Create a timeline of preemption events:
# Export events for analysis
runai describe large-training-job --project project-a --show-events > preemption-events.log
# Look for key events
grep -E "(Preempted|Scheduled|Started)" preemption-events.log
Best Practices
Designing for Preemption
- Checkpoint Frequently: Enable your training jobs to resume from checkpoints
- Handle Interruptions Gracefully: Use signal handlers for cleanup (see the sketch after this list)
- Use Appropriate Priorities: Match workload types to their intended use
  - Training: Use the train priority for batch jobs
  - Development: Use the interactive priority for Jupyter notebooks
  - Production: Use the inference priority for serving models
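As a minimal sketch of the graceful-interruption point: a container entrypoint can trap the SIGTERM that Kubernetes sends when the pod is preempted, forward it to the training process, and flush a checkpoint before exiting. The script name, paths, and checkpoint location below are illustrative assumptions, not Run:AI conventions:
#!/bin/bash
# entrypoint.sh - hypothetical wrapper around a training script
cleanup() {
  echo "SIGTERM received - stopping training and saving state"
  kill -TERM "$TRAIN_PID" 2>/dev/null     # forward the signal to the trainer
  wait "$TRAIN_PID" 2>/dev/null           # let it finish writing its checkpoint
  cp -r /workspace/checkpoints /mnt/pvc/checkpoints || true   # illustrative paths
  exit 0
}
trap cleanup SIGTERM

# Run training in the background so the trap can fire while it runs,
# and resume from the last checkpoint if one exists.
python train.py --resume-from /mnt/pvc/checkpoints &
TRAIN_PID=$!
wait "$TRAIN_PID"
Combined with frequent checkpointing, a preempted job loses at most one checkpoint interval of work.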
Resource Planning
- Right-Size Quotas: Set quotas based on actual needs, not maximum usage
- Plan for Over-Quota: Design workloads to benefit from unused resources
- Monitor Patterns: Use analytics to understand usage patterns and optimize
Troubleshooting
Common Issues
Workload Won't Preempt:
# Check workload priorities
runai describe <workload-name> | grep Priority
# Verify quota settings
runai describe project <project-name>
Preemption Takes Too Long:
- Check whether workloads have proper signal handling
- Verify graceful termination timeouts (see the example below)
- Look for hung processes preventing cleanup
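One concrete check for the termination window, assuming the usual runai-<project> namespace and using a placeholder pod name:
# How long Kubernetes waits between SIGTERM and SIGKILL for the training pod
kubectl get pod <training-pod-name> -n runai-project-a -o jsonpath='{.spec.terminationGracePeriodSeconds}'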
Unexpected Preemption Behavior:
# Check scheduler logs
kubectl logs -n runai-system <scheduler-pod-name>
# Verify node pool configurations
runai get node-pools
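If node pool membership looks wrong, also confirm the labels on the nodes themselves (filter for whichever label key you chose when creating the pool):
# Show node labels and filter for the node-pool label used in this exercise
kubectl get nodes --show-labels | grep -i node-pool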
Key Takeaways
After completing this exercise, you should understand:
- Fair Share Scheduling: How Run:AI balances resources across projects
- Priority-Based Preemption: How different workload types are prioritized
- Over-Quota Benefits: How projects can utilize unused cluster resources
- Automatic Recovery: How workloads resume when resources become available
Next Steps
Now that you understand preemption:
- Implement Checkpointing: Add checkpoint/resume capability to your training jobs
- Optimize Priorities: Use appropriate priority classes for different workload types
- Monitor Usage: Set up alerting for quota utilization and preemption events
- Plan Capacity: Use preemption patterns to inform resource planning
Related Guides
- Training Workload - Implement checkpoint-friendly training
- Interactive Workload - Understand interactive priorities
- GPU Fractions - Optimize resource sharing