Compute Resource Assets¶
Compute resource assets define standardized hardware configurations including GPU, CPU, and memory specifications. They simplify workload creation and ensure consistent resource allocation.
What are Compute Resource Assets?¶
Compute resources specify:
- GPU Requirements: Whole GPUs or fractional GPU portions
- CPU Allocation: Core count and limits
- Memory Limits: RAM allocation and constraints
- Node Selection: Specific node pools or constraints
Creating Compute Resource Assets¶
Method 1: Using the UI¶
- Navigate to Assets → Compute Resources
- Click "+ NEW COMPUTE RESOURCE"
- Configure the resource:
Basic Configuration:
Resource Specifications:
GPU Type: Whole GPU
GPU Count: 1
CPU Request: 4 cores
CPU Limit: 8 cores
Memory Request: 8Gi
Memory Limit: 16Gi
Method 2: Using YAML¶
Create `large-training-resource.yaml`:
apiVersion: run.ai/v1
kind: ComputeResource
metadata:
name: gpu-training-large
namespace: runai-<project-name>
spec:
gpu:
type: "gpu"
count: 1
cpu:
request: "4"
limit: "8"
memory:
request: "8Gi"
limit: "16Gi"
nodeSelector:
gpu-type: "a100"
Apply it with `kubectl apply -f large-training-resource.yaml`.
Common Resource Configurations¶
1. Development Resources¶
Small Development:
Name: dev-small
GPU: 0.25 (25% fraction)
CPU: 1-2 cores
Memory: 2-4Gi
Use Case: Code development, small experiments
Interactive Development:
Name: jupyter-dev
GPU: 0.5 (50% fraction)
CPU: 2-4 cores
Memory: 4-8Gi
Use Case: Jupyter notebooks, data exploration
2. Training Resources¶
Medium Training:
Name: training-medium
GPU: 1 whole GPU
CPU: 4-8 cores
Memory: 16-32Gi
Use Case: Single GPU model training
Large Training:
Name: training-large
GPU: 2-4 whole GPUs
CPU: 8-16 cores
Memory: 32-64Gi
Use Case: Multi-GPU distributed training
3. Inference Resources¶
Lightweight Inference:
Name: inference-light
GPU: 0.1 (10% fraction)
CPU: 1-2 cores
Memory: 1-2Gi
Use Case: Model serving, API endpoints
High-throughput Inference:
Name: inference-heavy
GPU: 1 whole GPU
CPU: 4-8 cores
Memory: 8-16Gi
Use Case: Batch inference, real-time serving
GPU Configuration Options¶
1. Whole GPU Resources¶
Single GPU:
apiVersion: run.ai/v1
kind: ComputeResource
metadata:
name: single-gpu-training
spec:
gpu:
type: "gpu"
count: 1
cpu:
request: "4"
limit: "8"
memory:
request: "16Gi"
limit: "32Gi"
Multi-GPU:
apiVersion: run.ai/v1
kind: ComputeResource
metadata:
name: multi-gpu-training
spec:
gpu:
type: "gpu"
count: 4
cpu:
request: "16"
limit: "32"
memory:
request: "64Gi"
limit: "128Gi"
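Note that the two examples above scale CPU and memory linearly with GPU count (4 cores and 16Gi requested per GPU, with limits at twice the request). A hypothetical Python helper sketching that pattern — the function name and per-GPU defaults are illustrative, while the field names follow the YAML shown above:

```python
def build_compute_resource(name, gpus, cpu_per_gpu=4, mem_per_gpu_gi=16):
    """Build a ComputeResource spec dict, scaling CPU/memory linearly with GPU count.

    Illustrative sketch only; defaults mirror the single- and multi-GPU
    examples above (limits set to 2x requests).
    """
    return {
        "apiVersion": "run.ai/v1",
        "kind": "ComputeResource",
        "metadata": {"name": name},
        "spec": {
            "gpu": {"type": "gpu", "count": gpus},
            "cpu": {
                "request": str(cpu_per_gpu * gpus),
                "limit": str(cpu_per_gpu * gpus * 2),
            },
            "memory": {
                "request": f"{mem_per_gpu_gi * gpus}Gi",
                "limit": f"{mem_per_gpu_gi * gpus * 2}Gi",
            },
        },
    }

spec = build_compute_resource("multi-gpu-training", gpus=4)
```

With `gpus=4` this reproduces the multi-GPU example: CPU request "16", limit "32", memory request "64Gi", limit "128Gi".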
2. Fractional GPU Resources¶
Small Fraction:
apiVersion: run.ai/v1
kind: ComputeResource
metadata:
name: gpu-fraction-small
spec:
gpu:
type: "portion"
portion: 0.25 # 25% of GPU
cpu:
request: "2"
limit: "4"
memory:
request: "4Gi"
limit: "8Gi"
Dynamic Fractions:
apiVersion: run.ai/v1
kind: ComputeResource
metadata:
name: gpu-fraction-dynamic
spec:
gpu:
type: "portion"
portion: 0.25 # Minimum 25%
portionLimit: 0.75 # Maximum 75%
cpu:
request: "2"
limit: "8"
memory:
request: "4Gi"
limit: "16Gi"
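The dynamic-fraction fields only make sense when `portion` and `portionLimit` are valid fractions and the limit is at least the minimum. A minimal client-side validator sketching those constraints (illustrative only; the actual server-side validation may differ):

```python
def validate_gpu_portion(portion, portion_limit=None):
    """Check fractional-GPU settings analogous to `portion`/`portionLimit`.

    Illustrative sketch: portions must lie in (0, 1], and the limit,
    if set, must not be smaller than the minimum portion.
    """
    if not 0 < portion <= 1:
        raise ValueError(f"portion must be in (0, 1], got {portion}")
    if portion_limit is not None:
        if not 0 < portion_limit <= 1:
            raise ValueError(f"portionLimit must be in (0, 1], got {portion_limit}")
        if portion_limit < portion:
            raise ValueError("portionLimit must be >= portion")
    return True

validate_gpu_portion(0.25, 0.75)  # matches the dynamic-fraction example above
```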
Node Selection and Constraints¶
1. Node Pool Selection¶
Specific GPU Types:
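The example for this subsection appears to be missing; a minimal sketch using the `gpu-type` node label from the earlier YAML (the label key and value are cluster-specific assumptions):

```yaml
spec:
  nodeSelector:
    gpu-type: "a100"   # label key/value depend on how your nodes are labeled
```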
Memory Requirements:
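This example also appears to be missing; a sketch assuming a hypothetical `gpu-memory` node label (the same label is used in the "Required Constraints" example below):

```yaml
spec:
  nodeSelector:
    gpu-memory: "80Gi"   # hypothetical label; match whatever your cluster publishes
```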
2. Node Affinity Rules¶
Preferred Nodes:
spec:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: gpu-generation
operator: In
values: ["ampere", "ada"]
Required Constraints:
spec:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: gpu-memory
  operator: Gt
  values: ["20480"]  # Gt compares integer strings; use a plain number (here MiB), not "20Gi"
Using Compute Resource Assets¶
In Workload Creation¶
Via UI:
1. Create a new workload
2. In the Compute Resources section, select your compute resource asset
3. Optionally override specific settings
Via CLI:
runai submit "training-job" \
--compute-resource gpu-training-large \
--image pytorch/pytorch:latest
Overriding Resource Settings¶
Increase Resources Temporarily:
runai submit "intensive-training" \
--compute-resource training-medium \
--cpu-limit 16 \
--memory-limit 64Gi \
--gpu 2
Scale Down for Testing:
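A sketch of the inverse override, reusing flags that appear elsewhere in this guide (`quick-test` and the specific values are illustrative):

```shell
runai submit "quick-test" \
  --compute-resource training-medium \
  --cpu-limit 2 \
  --memory-limit 4Gi \
  --gpu-portion 0.25
```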
Resource Planning Guidelines¶
1. Sizing Recommendations¶
Model Size-Based Planning:
def estimate_gpu_memory(model_params, batch_size=32):
    """Estimate GPU training memory requirements in GiB."""
    # Model weights: 4 bytes per float32 parameter
    model_memory = model_params * 4 / 1024**3
    # Training overhead (gradients, optimizer states, activations): roughly 4x the weights
    training_overhead = 4
    # Batch size impact: rough estimate of ~0.1 GiB per sample
    batch_memory = batch_size * 0.1
    total_memory = model_memory * training_overhead + batch_memory
    return total_memory

# Examples (with the default batch_size=32)
bert_base = estimate_gpu_memory(110_000_000)    # ~4.8 GiB
bert_large = estimate_gpu_memory(340_000_000)   # ~8.3 GiB
gpt2_medium = estimate_gpu_memory(345_000_000)  # ~8.3 GiB
CPU Requirements:
def estimate_cpu_cores(data_workers=4, model_complexity="medium"):
"""Estimate CPU core requirements"""
base_cores = 2 # Minimum for system processes
data_cores = data_workers # One core per data worker
model_cores = {
"simple": 1,
"medium": 2,
"complex": 4
}
return base_cores + data_cores + model_cores[model_complexity]
2. Resource Templates¶
Create Resource Matrix:
# Small workloads
dev-xs: GPU: 0.1, CPU: 1, Memory: 1Gi
dev-small: GPU: 0.25, CPU: 2, Memory: 4Gi
dev-med: GPU: 0.5, CPU: 4, Memory: 8Gi
# Training workloads
train-sm: GPU: 1, CPU: 4, Memory: 16Gi
train-med: GPU: 1, CPU: 8, Memory: 32Gi
train-lg: GPU: 2, CPU: 16, Memory: 64Gi
# Inference workloads
infer-sm: GPU: 0.1, CPU: 2, Memory: 2Gi
infer-med: GPU: 0.25, CPU: 4, Memory: 8Gi
infer-lg: GPU: 0.5, CPU: 8, Memory: 16Gi
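One way to use such a matrix programmatically — a sketch with the values transcribed from the list above; the selection rule and the crude size ordering are illustrative assumptions:

```python
# Resource matrix transcribed from the template list above: (gpu, cpu_cores, memory_gi)
TEMPLATES = {
    "dev-xs":    (0.1,  1, 1),
    "dev-small": (0.25, 2, 4),
    "dev-med":   (0.5,  4, 8),
    "train-sm":  (1,    4, 16),
    "train-med": (1,    8, 32),
    "train-lg":  (2,   16, 64),
    "infer-sm":  (0.1,  2, 2),
    "infer-med": (0.25, 4, 8),
    "infer-lg":  (0.5,  8, 16),
}

def pick_template(gpu, cpu, memory_gi):
    """Return the name of the smallest template satisfying all three requirements."""
    candidates = [
        (g * 1000 + c * 10 + m, name)  # crude combined "size" score for ordering
        for name, (g, c, m) in TEMPLATES.items()
        if g >= gpu and c >= cpu and m >= memory_gi
    ]
    if not candidates:
        raise ValueError("no template is large enough; define a custom compute resource")
    return min(candidates)[1]
```

For example, a job needing 0.2 GPU, 2 cores, and 3Gi lands on `dev-small`, while 1 GPU, 8 cores, and 20Gi lands on `train-med`.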
Best Practices¶
1. Resource Naming¶
Descriptive Names:
# Good examples
gpu-training-small-v100
inference-cpu-optimized
jupyter-interactive-a100
batch-processing-large
# Avoid
resource1
temp-config
gpu-thing
2. Resource Optimization¶
Right-sizing Guidelines:
# Start conservative, scale up as needed
Initial: GPU: 0.25, CPU: 2, Memory: 4Gi
Optimized: GPU: 0.5, CPU: 4, Memory: 8Gi
Production: GPU: 1, CPU: 8, Memory: 16Gi
Monitor and Adjust:
# Check actual usage
runai top <workload-name>
# View resource utilization
runai describe <workload-name> | grep -A 5 "Resource Usage"
3. Cost Management¶
Efficiency Targets:
Resource Sharing:
# Use fractional GPUs for development
--gpu-portion 0.25
# Share larger resources during off-hours
--gpu-portion-limit 1.0
Monitoring Resource Usage¶
1. Real-time Monitoring¶
Via CLI:
# Monitor specific workload
runai top <workload-name>
# Monitor all workloads
runai list --show-resources
# Detailed resource view
runai describe <workload-name>
Via Dashboard:
- Navigate to the workload in the UI
- View real-time resource graphs
- Check utilization percentages
2. Resource Alerts¶
Set Up Monitoring:
runai submit "monitored-job" \
--compute-resource training-medium \
--alert-on-gpu-util-low 70 \
--alert-on-memory-high 90
Troubleshooting¶
Common Issues¶
Insufficient Resources:
# Check cluster availability
runai cluster-info
# View node resources
kubectl describe node <node-name>
# Check project quota
runai describe project <project-name>
Resource Constraints:
# Workload pending due to resources
kubectl describe pod <pod-name> | grep Events
# Check resource requests vs limits
runai describe <workload-name> | grep -A 10 "Resource Spec"
Performance Issues:
# Monitor resource bottlenecks
runai top <workload-name>
# Check if resources are underutilized
watch -n 5 'runai top <workload-name>'
Debugging Commands¶
Test Resource Allocation:
# Submit test workload
runai submit "resource-test" \
--compute-resource training-medium \
--command "nvidia-smi && free -h && nproc" \
--image nvidia/cuda:11.8.0-base-ubuntu20.04
# Check assigned resources
runai logs resource-test
Next Steps¶
- Start with Templates: Use predefined resource configurations
- Monitor Usage: Track actual vs allocated resources
- Optimize Gradually: Adjust based on real usage patterns
- Create Standards: Establish resource guidelines for teams
Related Assets¶
- Environments - Pair resources with appropriate software stacks
- Data Sources - Consider data transfer requirements
- Credentials - Secure access for resource-intensive workloads