Compute Resource Assets

Compute resource assets define standardized hardware configurations including GPU, CPU, and memory specifications. They simplify workload creation and ensure consistent resource allocation.

What are Compute Resource Assets?

Compute resources specify:

  • GPU Requirements: whole GPUs or fractional GPU portions
  • CPU Allocation: core count and limits
  • Memory Limits: RAM allocation and constraints
  • Node Selection: specific node pools or constraints

Creating Compute Resource Assets

Method 1: Using the UI

  1. Navigate to Assets → Compute Resources
  2. Click "+ NEW COMPUTE RESOURCE"
  3. Configure the resource:

Basic Configuration:

Name: gpu-training-large
Description: Large GPU training configuration
Scope: Project

Resource Specifications:

GPU Type: Whole GPU
GPU Count: 1
CPU Request: 4 cores
CPU Limit: 8 cores
Memory Request: 8Gi
Memory Limit: 16Gi

Method 2: Using YAML

Create large-training-resource.yaml:

apiVersion: run.ai/v1
kind: ComputeResource
metadata:
  name: gpu-training-large
  namespace: runai-<project-name>
spec:
  gpu:
    type: "gpu"
    count: 1
  cpu:
    request: "4"
    limit: "8"  
  memory:
    request: "8Gi"
    limit: "16Gi"
  nodeSelector:
    gpu-type: "a100"

Apply with:

kubectl apply -f large-training-resource.yaml
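
If you maintain several similar assets, you can also generate these manifests programmatically instead of hand-editing YAML. A minimal Python sketch using PyYAML; the spec fields mirror the example above, and the function name and the "ml-team" project are illustrative:

import yaml  # PyYAML

def compute_resource_manifest(name, project, gpu_count=1,
                              cpu=("4", "8"), memory=("8Gi", "16Gi"),
                              node_selector=None):
    """Build a run.ai/v1 ComputeResource manifest as a YAML string."""
    spec = {
        "gpu": {"type": "gpu", "count": gpu_count},
        "cpu": {"request": cpu[0], "limit": cpu[1]},
        "memory": {"request": memory[0], "limit": memory[1]},
    }
    if node_selector:
        spec["nodeSelector"] = node_selector
    manifest = {
        "apiVersion": "run.ai/v1",
        "kind": "ComputeResource",
        "metadata": {"name": name, "namespace": f"runai-{project}"},
        "spec": spec,
    }
    return yaml.safe_dump(manifest, sort_keys=False)

# Reproduces large-training-resource.yaml for a hypothetical "ml-team" project
print(compute_resource_manifest("gpu-training-large", "ml-team",
                                node_selector={"gpu-type": "a100"}))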

Common Resource Configurations

1. Development Resources

Small Development:

Name: dev-small
GPU: 0.25 (25% fraction)
CPU: 1-2 cores
Memory: 2-4Gi
Use Case: Code development, small experiments

Interactive Development:

Name: jupyter-dev
GPU: 0.5 (50% fraction)  
CPU: 2-4 cores
Memory: 4-8Gi
Use Case: Jupyter notebooks, data exploration

2. Training Resources

Medium Training:

Name: training-medium
GPU: 1 whole GPU
CPU: 4-8 cores
Memory: 16-32Gi
Use Case: Single GPU model training

Large Training:

Name: training-large
GPU: 2-4 whole GPUs
CPU: 8-16 cores
Memory: 32-64Gi
Use Case: Multi-GPU distributed training

3. Inference Resources

Lightweight Inference:

Name: inference-light
GPU: 0.1 (10% fraction)
CPU: 1-2 cores
Memory: 1-2Gi
Use Case: Model serving, API endpoints

High-throughput Inference:

Name: inference-heavy
GPU: 1 whole GPU
CPU: 4-8 cores
Memory: 8-16Gi
Use Case: Batch inference, real-time serving

GPU Configuration Options

1. Whole GPU Resources

Single GPU:

apiVersion: run.ai/v1
kind: ComputeResource
metadata:
  name: single-gpu-training
spec:
  gpu:
    type: "gpu"
    count: 1
  cpu:
    request: "4"
    limit: "8"
  memory:
    request: "16Gi"
    limit: "32Gi"

Multi-GPU:

apiVersion: run.ai/v1
kind: ComputeResource
metadata:
  name: multi-gpu-training
spec:
  gpu:
    type: "gpu"
    count: 4
  cpu:
    request: "16"
    limit: "32"
  memory:
    request: "64Gi"
    limit: "128Gi"

2. Fractional GPU Resources

Small Fraction:

apiVersion: run.ai/v1
kind: ComputeResource
metadata:
  name: gpu-fraction-small
spec:
  gpu:
    type: "portion"
    portion: 0.25  # 25% of GPU
  cpu:
    request: "2"
    limit: "4"
  memory:
    request: "4Gi"
    limit: "8Gi"

Dynamic Fractions:

apiVersion: run.ai/v1
kind: ComputeResource
metadata:
  name: gpu-fraction-dynamic
spec:
  gpu:
    type: "portion"
    portion: 0.25      # Minimum 25%
    portionLimit: 0.75 # Maximum 75%
  cpu:
    request: "2"
    limit: "8"
  memory:
    request: "4Gi"
    limit: "16Gi"

Node Selection and Constraints

1. Node Pool Selection

Specific GPU Types:

spec:
  nodeSelector:
    gpu-type: "a100"
    # Example values: a100, v100, rtx3090; actual options depend on your cluster's node labels

Memory Requirements:

spec:
  nodeSelector:
    memory-class: "high-memory"
    # Ensure nodes have sufficient RAM

2. Node Affinity Rules

Preferred Nodes:

spec:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: gpu-generation
          operator: In
          values: ["ampere", "ada"]

Required Constraints:

spec:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: gpu-memory
          operator: Gt
          values: ["20"]  # Gt compares plain integers; assumes nodes are labeled with GPU memory in GiB

Using Compute Resource Assets

In Workload Creation

Via UI:

  1. Create a new workload
  2. In the Compute Resources section, select your compute resource asset
  3. Optionally override specific settings

Via CLI:

runai submit "training-job" \
    --compute-resource gpu-training-large \
    --image pytorch/pytorch:latest

Overriding Resource Settings

Increase Resources Temporarily:

runai submit "intensive-training" \
    --compute-resource training-medium \
    --cpu-limit 16 \
    --memory-limit 64Gi \
    --gpu 2

Scale Down for Testing:

runai submit "quick-test" \
    --compute-resource training-large \
    --gpu-portion 0.1 \
    --cpu-request 1

Resource Planning Guidelines

1. Sizing Recommendations

Model Size-Based Planning:

def estimate_gpu_memory(model_params, batch_size=32):
    """Rough GPU memory estimate for training, in GiB"""

    # Model weights: 4 bytes per float32 parameter, converted to GiB
    model_memory = model_params * 4 / 1024**3

    # Training overhead (gradients, optimizer state, activations): ~4x weights
    training_overhead = 4

    # Activation memory grows with batch size (~0.1 GiB per sample, very rough)
    batch_memory = batch_size * 0.1

    return model_memory * training_overhead + batch_memory

# Examples (at the default batch size of 32)
bert_base = estimate_gpu_memory(110_000_000)      # ~4.8 GiB
bert_large = estimate_gpu_memory(340_000_000)     # ~8.3 GiB
gpt2_medium = estimate_gpu_memory(345_000_000)    # ~8.3 GiB

CPU Requirements:

def estimate_cpu_cores(data_workers=4, model_complexity="medium"):
    """Estimate CPU core requirements"""

    base_cores = 2  # Minimum for system processes
    data_cores = data_workers  # One core per data worker

    model_cores = {
        "simple": 1,
        "medium": 2, 
        "complex": 4
    }

    return base_cores + data_cores + model_cores[model_complexity]
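
As a worked example, here is what these heuristics suggest for a hypothetical 1.3B-parameter model, assuming both estimator functions above are defined in the same module:

# Hypothetical 1.3B-parameter model, batch size 16, 8 data-loading workers
gpu_gib = estimate_gpu_memory(1_300_000_000, batch_size=16)
cores = estimate_cpu_cores(data_workers=8, model_complexity="complex")

print(f"Estimated GPU memory: {gpu_gib:.1f} GiB")  # ~21.0 GiB
print(f"Estimated CPU cores:  {cores}")            # 14

An estimate in this range points at a whole-GPU training template with headroom rather than a fractional resource.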

2. Resource Templates

Create Resource Matrix:

# Small workloads
dev-xs:    GPU: 0.1,  CPU: 1,   Memory: 1Gi
dev-small: GPU: 0.25, CPU: 2,   Memory: 4Gi
dev-med:   GPU: 0.5,  CPU: 4,   Memory: 8Gi

# Training workloads  
train-sm:  GPU: 1,    CPU: 4,   Memory: 16Gi
train-med: GPU: 1,    CPU: 8,   Memory: 32Gi
train-lg:  GPU: 2,    CPU: 16,  Memory: 64Gi

# Inference workloads
infer-sm:  GPU: 0.1,  CPU: 2,   Memory: 2Gi
infer-med: GPU: 0.25, CPU: 4,   Memory: 8Gi
infer-lg:  GPU: 0.5,  CPU: 8,   Memory: 16Gi
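
Encoding a matrix like this as data lets you pick the smallest template that satisfies an estimate. A sketch using the development and training rows above (the selection logic is illustrative; extend the list with the inference rows as needed):

# Columns: name, GPU, CPU cores, memory in GiB (ordered smallest first)
TEMPLATES = [
    ("dev-xs",    0.10,  1,  1),
    ("dev-small", 0.25,  2,  4),
    ("dev-med",   0.50,  4,  8),
    ("train-sm",  1.00,  4, 16),
    ("train-med", 1.00,  8, 32),
    ("train-lg",  2.00, 16, 64),
]

def smallest_fitting_template(gpu, cpu, memory_gib):
    """Return the first (smallest) template meeting all three requirements."""
    for name, t_gpu, t_cpu, t_mem in TEMPLATES:
        if t_gpu >= gpu and t_cpu >= cpu and t_mem >= memory_gib:
            return name
    return None  # nothing fits; define a larger resource

print(smallest_fitting_template(gpu=0.5, cpu=3, memory_gib=6))  # dev-med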

Best Practices

1. Resource Naming

Descriptive Names:

# Good examples
gpu-training-small-v100
inference-cpu-optimized  
jupyter-interactive-a100
batch-processing-large

# Avoid
resource1
temp-config
gpu-thing
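
A naming convention is easier to keep if it is checked mechanically. A minimal sketch that rejects single-token names like resource1; the pattern is an assumption, adapt it to your own convention (semantic quality, such as avoiding gpu-thing, still needs human review):

import re

# At least two lowercase alphanumeric tokens separated by hyphens
NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)+$")

def is_valid_resource_name(name):
    return NAME_PATTERN.fullmatch(name) is not None

assert is_valid_resource_name("gpu-training-small-v100")
assert is_valid_resource_name("inference-cpu-optimized")
assert not is_valid_resource_name("resource1")  # single token, no descriptor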

2. Resource Optimization

Right-sizing Guidelines:

# Start conservative, scale up as needed
Initial: GPU: 0.25, CPU: 2, Memory: 4Gi
Optimized: GPU: 0.5, CPU: 4, Memory: 8Gi
Production: GPU: 1, CPU: 8, Memory: 16Gi

Monitor and Adjust:

# Check actual usage
runai top <workload-name>

# View resource utilization  
runai describe <workload-name> | grep -A 5 "Resource Usage"

3. Cost Management

Efficiency Targets:

GPU Utilization: >80%
CPU Utilization: 60-80%  
Memory Utilization: 70-90%
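
These targets can be turned into a simple automated check against whatever utilization metrics you export. In the sketch below, the bands mirror the targets above and the dictionary of observed values is a stand-in:

# Acceptable utilization bands (min %, max %) from the targets above
TARGETS = {"gpu": (80, 100), "cpu": (60, 80), "memory": (70, 90)}

def utilization_findings(observed):
    """Flag resources whose observed utilization (%) falls outside its band."""
    findings = []
    for resource, (low, high) in TARGETS.items():
        value = observed.get(resource)
        if value is None:
            continue
        if value < low:
            findings.append(f"{resource} at {value:.0f}%: over-provisioned, consider scaling down")
        elif value > high:
            findings.append(f"{resource} at {value:.0f}%: near the limit, consider scaling up")
    return findings

# Example: idle GPU, healthy CPU, memory close to its limit
for finding in utilization_findings({"gpu": 45, "cpu": 72, "memory": 95}):
    print(finding)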

Resource Sharing:

# Use fractional GPUs for development
--gpu-portion 0.25

# Share larger resources during off-hours
--gpu-portion-limit 1.0

Monitoring Resource Usage

1. Real-time Monitoring

Via CLI:

# Monitor specific workload
runai top <workload-name>

# Monitor all workloads
runai list --show-resources

# Detailed resource view
runai describe <workload-name>

Via Dashboard:

  • Navigate to the workload in the UI
  • View real-time resource graphs
  • Check utilization percentages

2. Resource Alerts

Set Up Monitoring (alert flag availability depends on your Run:ai version and alerting configuration):

runai submit "monitored-job" \
    --compute-resource training-medium \
    --alert-on-gpu-util-low 70 \
    --alert-on-memory-high 90

Troubleshooting

Common Issues

Insufficient Resources:

# Check cluster availability
runai cluster-info

# View node resources
kubectl describe node <node-name>

# Check project quota
runai describe project <project-name>

Resource Constraints:

# Workload pending due to resources
kubectl describe pod <pod-name> | grep Events

# Check resource requests vs limits
runai describe <workload-name> | grep -A 10 "Resource Spec"

Performance Issues:

# Monitor resource bottlenecks
runai top <workload-name>

# Check if resources are underutilized
watch -n 5 'runai top <workload-name>'

Debugging Commands

Test Resource Allocation:

# Submit test workload
runai submit "resource-test" \
    --compute-resource training-medium \
    --command "nvidia-smi && free -h && nproc" \
    --image nvidia/cuda:11.8.0-base-ubuntu20.04

# Check assigned resources
runai logs resource-test
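
Complementing the node-level commands, a short Python script run inside the container shows what your process actually sees, including cgroup limits. The cgroup paths differ between v1 and v2, hence the fallback (a rough sketch, Linux only):

import os
import subprocess

# GPUs visible to the container (one line per device)
subprocess.run(["nvidia-smi", "-L"], check=False)

# CPUs in this process's affinity mask (reflects cpusets; CFS quota may differ)
print("CPUs visible:", len(os.sched_getaffinity(0)))

# Memory limit from the cgroup: v2 path first, then v1
for path in ("/sys/fs/cgroup/memory.max",
             "/sys/fs/cgroup/memory/memory.limit_in_bytes"):
    try:
        with open(path) as f:
            print("Memory limit (bytes):", f.read().strip())
        break
    except FileNotFoundError:
        continue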

Next Steps

  1. Start with Templates: Use predefined resource configurations
  2. Monitor Usage: Track actual vs allocated resources
  3. Optimize Gradually: Adjust based on real usage patterns
  4. Create Standards: Establish resource guidelines for teams
Related topics:

  • Environments - Pair resources with appropriate software stacks
  • Data Sources - Consider data transfer requirements
  • Credentials - Secure access for resource-intensive workloads