Compute Resource Assets¶
Compute resource assets define standardized hardware configurations including GPU, CPU, and memory specifications. They simplify workload creation and ensure consistent resource allocation.
What are Compute Resource Assets?¶
Compute resources specify:
- GPU Requirements: Whole GPUs or fractional GPU portions
- CPU Allocation: Core count and limits
- Memory Limits: RAM allocation and constraints
- Node Selection: Specific node pools or constraints
Creating Compute Resource Assets¶
Method 1: Using the UI¶
- Navigate to Assets → Compute Resources
- Click "+ NEW COMPUTE RESOURCE"
- Configure the resource:
Basic Configuration:
Resource Specifications:
GPU Type: Whole GPU
GPU Count: 1
CPU Request: 4 cores
CPU Limit: 8 cores
Memory Request: 8Gi
Memory Limit: 16Gi
Method 2: Using YAML¶
Create `large-training-resource.yaml`:
apiVersion: run.ai/v1
kind: ComputeResource
metadata:
name: gpu-training-large
namespace: runai-<project-name>
spec:
gpu:
type: "gpu"
count: 1
cpu:
request: "4"
limit: "8"
memory:
request: "8Gi"
limit: "16Gi"
nodeSelector:
gpu-type: "a100"
Apply it with `kubectl apply -f large-training-resource.yaml`.
Common Resource Configurations¶
1. Development Resources¶
Small Development:
Name: dev-small
GPU: 0.25 (25% fraction)
CPU: 1-2 cores
Memory: 2-4Gi
Use Case: Code development, small experiments
Interactive Development:
Name: jupyter-dev
GPU: 0.5 (50% fraction)
CPU: 2-4 cores
Memory: 4-8Gi
Use Case: Jupyter notebooks, data exploration
2. Training Resources¶
Medium Training:
Name: training-medium
GPU: 1 whole GPU
CPU: 4-8 cores
Memory: 16-32Gi
Use Case: Single GPU model training
Large Training:
Name: training-large
GPU: 2-4 whole GPUs
CPU: 8-16 cores
Memory: 32-64Gi
Use Case: Multi-GPU distributed training
3. Inference Resources¶
Lightweight Inference:
Name: inference-light
GPU: 0.1 (10% fraction)
CPU: 1-2 cores
Memory: 1-2Gi
Use Case: Model serving, API endpoints
High-throughput Inference:
Name: inference-heavy
GPU: 1 whole GPU
CPU: 4-8 cores
Memory: 8-16Gi
Use Case: Batch inference, real-time serving
GPU Configuration Options¶
1. Whole GPU Resources¶
Single GPU:
apiVersion: run.ai/v1
kind: ComputeResource
metadata:
name: single-gpu-training
spec:
gpu:
type: "gpu"
count: 1
cpu:
request: "4"
limit: "8"
memory:
request: "16Gi"
limit: "32Gi"
Multi-GPU:
apiVersion: run.ai/v1
kind: ComputeResource
metadata:
name: multi-gpu-training
spec:
gpu:
type: "gpu"
count: 4
cpu:
request: "16"
limit: "32"
memory:
request: "64Gi"
limit: "128Gi"
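Note that the two examples above scale CPU and memory linearly with GPU count (4 cores and 16Gi requested per GPU, with limits at twice the request). A hypothetical Python helper sketching that pattern — the function name and per-GPU defaults are illustrative, while the field names follow the YAML shown above:

```python
def build_compute_resource(name, gpus, cpu_per_gpu=4, mem_per_gpu_gi=16):
    """Build a ComputeResource spec dict, scaling CPU/memory linearly with GPU count.

    Illustrative sketch only; defaults mirror the single- and multi-GPU
    examples above (limits set to 2x requests).
    """
    return {
        "apiVersion": "run.ai/v1",
        "kind": "ComputeResource",
        "metadata": {"name": name},
        "spec": {
            "gpu": {"type": "gpu", "count": gpus},
            "cpu": {
                "request": str(cpu_per_gpu * gpus),
                "limit": str(cpu_per_gpu * gpus * 2),
            },
            "memory": {
                "request": f"{mem_per_gpu_gi * gpus}Gi",
                "limit": f"{mem_per_gpu_gi * gpus * 2}Gi",
            },
        },
    }

spec = build_compute_resource("multi-gpu-training", gpus=4)
```

With `gpus=4` this reproduces the multi-GPU example: CPU request "16", limit "32", memory request "64Gi", limit "128Gi".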
2. Fractional GPU Resources¶
Small Fraction:
apiVersion: run.ai/v1
kind: ComputeResource
metadata:
name: gpu-fraction-small
spec:
gpu:
type: "portion"
portion: 0.25 # 25% of GPU
cpu:
request: "2"
limit: "4"
memory:
request: "4Gi"
limit: "8Gi"
Dynamic Fractions:
apiVersion: run.ai/v1
kind: ComputeResource
metadata:
name: gpu-fraction-dynamic
spec:
gpu:
type: "portion"
portion: 0.25 # Minimum 25%
portionLimit: 0.75 # Maximum 75%
cpu:
request: "2"
limit: "8"
memory:
request: "4Gi"
limit: "16Gi"
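The dynamic-fraction fields only make sense when `portion` and `portionLimit` are valid fractions and the limit is at least the minimum. A minimal client-side validator sketching those constraints (illustrative only; the actual server-side validation may differ):

```python
def validate_gpu_portion(portion, portion_limit=None):
    """Check fractional-GPU settings analogous to `portion`/`portionLimit`.

    Illustrative sketch: portions must lie in (0, 1], and the limit,
    if set, must not be smaller than the minimum portion.
    """
    if not 0 < portion <= 1:
        raise ValueError(f"portion must be in (0, 1], got {portion}")
    if portion_limit is not None:
        if not 0 < portion_limit <= 1:
            raise ValueError(f"portionLimit must be in (0, 1], got {portion_limit}")
        if portion_limit < portion:
            raise ValueError("portionLimit must be >= portion")
    return True

validate_gpu_portion(0.25, 0.75)  # matches the dynamic-fraction example above
```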
Node Selection and Constraints¶
1. Node Pool Selection¶
Specific GPU Types:
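The example for this subsection appears to be missing; a minimal sketch using the `gpu-type` node label from the earlier YAML (the label key and value are cluster-specific assumptions):

```yaml
spec:
  nodeSelector:
    gpu-type: "a100"   # label key/value depend on how your nodes are labeled
```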
Memory Requirements:
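This example also appears to be missing; a sketch assuming a hypothetical `gpu-memory` node label (the same label is used in the "Required Constraints" example below):

```yaml
spec:
  nodeSelector:
    gpu-memory: "80Gi"   # hypothetical label; match whatever your cluster publishes
```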
2. Node Affinity Rules¶
Preferred Nodes:
spec:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
preference:
matchExpressions:
- key: gpu-generation
operator: In
values: ["ampere", "ada"]
Required Constraints:
spec:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: gpu-memory
  operator: Gt
  values: ["20480"]  # Gt compares integer strings; use a plain number (here MiB), not "20Gi"
Using Compute Resource Assets¶
In Workload Creation¶
Via UI:
1. Create a new workload
2. In the Compute Resources section, select your compute resource asset
3. Optionally override specific settings
Via CLI:
runai submit "training-job" \
--compute-resource gpu-training-large \
--image pytorch/pytorch:latest
Overriding Resource Settings¶
Increase Resources Temporarily:
runai submit "intensive-training" \
--compute-resource training-medium \
--cpu-limit 16 \
--memory-limit 64Gi \
--gpu 2
Scale Down for Testing:
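A sketch of the inverse override, reusing flags that appear elsewhere in this guide (`quick-test` and the specific values are illustrative):

```shell
runai submit "quick-test" \
  --compute-resource training-medium \
  --cpu-limit 2 \
  --memory-limit 4Gi \
  --gpu-portion 0.25
```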
Resource Planning Guidelines¶
1. Sizing Recommendations¶
Model Size-Based Planning:
def estimate_gpu_memory(model_params, batch_size=32):
    """Estimate GPU training memory requirements in GiB."""
    # Model weights: 4 bytes per float32 parameter
    model_memory = model_params * 4 / 1024**3
    # Training overhead (gradients, optimizer states, activations): roughly 4x the weights
    training_overhead = 4
    # Batch size impact: rough estimate of ~0.1 GiB per sample
    batch_memory = batch_size * 0.1
    total_memory = model_memory * training_overhead + batch_memory
    return total_memory

# Examples (with the default batch_size=32)
bert_base = estimate_gpu_memory(110_000_000)    # ~4.8 GiB
bert_large = estimate_gpu_memory(340_000_000)   # ~8.3 GiB
gpt2_medium = estimate_gpu_memory(345_000_000)  # ~8.3 GiB
CPU Requirements:
def estimate_cpu_cores(data_workers=4, model_complexity="medium"):
"""Estimate CPU core requirements"""
base_cores = 2 # Minimum for system processes
data_cores = data_workers # One core per data worker
model_cores = {
"simple": 1,
"medium": 2,
"complex": 4
}
return base_cores + data_cores + model_cores[model_complexity]
2. Resource Templates¶
Create Resource Matrix:
# Small workloads
dev-xs: GPU: 0.1, CPU: 1, Memory: 1Gi
dev-small: GPU: 0.25, CPU: 2, Memory: 4Gi
dev-med: GPU: 0.5, CPU: 4, Memory: 8Gi
# Training workloads
train-sm: GPU: 1, CPU: 4, Memory: 16Gi
train-med: GPU: 1, CPU: 8, Memory: 32Gi
train-lg: GPU: 2, CPU: 16, Memory: 64Gi
# Inference workloads
infer-sm: GPU: 0.1, CPU: 2, Memory: 2Gi
infer-med: GPU: 0.25, CPU: 4, Memory: 8Gi
infer-lg: GPU: 0.5, CPU: 8, Memory: 16Gi
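One way to use such a matrix programmatically — a sketch with the values transcribed from the list above; the selection rule and the crude size ordering are illustrative assumptions:

```python
# Resource matrix transcribed from the template list above: (gpu, cpu_cores, memory_gi)
TEMPLATES = {
    "dev-xs":    (0.1,  1, 1),
    "dev-small": (0.25, 2, 4),
    "dev-med":   (0.5,  4, 8),
    "train-sm":  (1,    4, 16),
    "train-med": (1,    8, 32),
    "train-lg":  (2,   16, 64),
    "infer-sm":  (0.1,  2, 2),
    "infer-med": (0.25, 4, 8),
    "infer-lg":  (0.5,  8, 16),
}

def pick_template(gpu, cpu, memory_gi):
    """Return the name of the smallest template satisfying all three requirements."""
    candidates = [
        (g * 1000 + c * 10 + m, name)  # crude combined "size" score for ordering
        for name, (g, c, m) in TEMPLATES.items()
        if g >= gpu and c >= cpu and m >= memory_gi
    ]
    if not candidates:
        raise ValueError("no template is large enough; define a custom compute resource")
    return min(candidates)[1]
```

For example, a job needing 0.2 GPU, 2 cores, and 3Gi lands on `dev-small`, while 1 GPU, 8 cores, and 20Gi lands on `train-med`.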
Best Practices¶
1. Resource Naming¶
Descriptive Names:
# Good examples
gpu-training-small-v100
inference-cpu-optimized
jupyter-interactive-a100
batch-processing-large
# Avoid
resource1
temp-config
gpu-thing
2. Resource Optimization¶
Right-sizing Guidelines:
# Start conservative, scale up as needed
Initial: GPU: 0.25, CPU: 2, Memory: 4Gi
Optimized: GPU: 0.5, CPU: 4, Memory: 8Gi
Production: GPU: 1, CPU: 8, Memory: 16Gi
Monitor and Adjust:
# Check actual usage
runai top <workload-name>
# View resource utilization
runai describe <workload-name> | grep -A 5 "Resource Usage"
3. Cost Management¶
Efficiency Targets:
Resource Sharing:
# Use fractional GPUs for development
--gpu-portion 0.25
# Share larger resources during off-hours
--gpu-portion-limit 1.0
Monitoring Resource Usage¶
1. Real-time Monitoring¶
Via CLI:
# Monitor specific workload
runai top <workload-name>
# Monitor all workloads
runai list --show-resources
# Detailed resource view
runai describe <workload-name>
Via Dashboard:
- Navigate to the workload in the UI
- View real-time resource graphs
- Check utilization percentages
2. Resource Alerts¶
Set Up Monitoring:
runai submit "monitored-job" \
--compute-resource training-medium \
--alert-on-gpu-util-low 70 \
--alert-on-memory-high 90
Troubleshooting¶
Common Issues¶
Insufficient Resources:
# Check cluster availability
runai cluster-info
# View node resources
kubectl describe node <node-name>
# Check project quota
runai describe project <project-name>
Resource Constraints:
# Workload pending due to resources
kubectl describe pod <pod-name> | grep Events
# Check resource requests vs limits
runai describe <workload-name> | grep -A 10 "Resource Spec"
Performance Issues:
# Monitor resource bottlenecks
runai top <workload-name>
# Check if resources are underutilized
watch -n 5 'runai top <workload-name>'
Debugging Commands¶
Test Resource Allocation:
# Submit test workload
runai submit "resource-test" \
--compute-resource training-medium \
--command "nvidia-smi && free -h && nproc" \
--image nvidia/cuda:11.8.0-base-ubuntu20.04
# Check assigned resources
runai logs resource-test
Next Steps¶
- Start with Templates: Use predefined resource configurations
- Monitor Usage: Track actual vs allocated resources
- Optimize Gradually: Adjust based on real usage patterns
- Create Standards: Establish resource guidelines for teams
Related Assets¶
- Environments - Pair resources with appropriate software stacks
- Data Sources - Consider data transfer requirements
- Credentials - Secure access for resource-intensive workloads