Launching Workloads with GPU Fractions

This tutorial shows you how to use GPU fractions to optimize resource utilization in Run:AI. GPU fractions allow multiple workloads to share a single GPU, maximizing efficiency and reducing costs.

What are GPU Fractions?

GPU fractions allow you to:
- Share GPUs among multiple workloads
- Optimize Resource Usage by allocating only what you need
- Reduce Costs by avoiding over-provisioning
- Increase Throughput by running more workloads simultaneously

When to Use GPU Fractions

Ideal Use Cases:
- Development and experimentation
- Small model training
- Inference workloads
- Data preprocessing tasks
- Jupyter notebooks for exploration

Avoid for:
- Large model training requiring full GPU memory
- High-performance computing workloads
- Production inference with strict latency requirements

Prerequisites

Before starting, ensure:
- Your cluster has GPU fractions enabled
- You have a project with GPU quota
- Your workload can benefit from fractional GPU resources

Method 1: Using the UI

1. Create a New Workload

  1. Navigate to Workload manager → Workloads
  2. Click "+NEW WORKLOAD"
  3. Choose your workload type (Training, Workspace, etc.)

2. Configure Compute Resources

  1. Resource Type: Select "GPU Portion" instead of "Whole GPU"
  2. Fraction Size: Choose from available options:
    0.1 (10% of GPU)
    0.25 (25% of GPU)  
    0.5 (50% of GPU)
    0.75 (75% of GPU)
    
  3. Memory Allocation: Automatically calculated based on fraction size (see the sketch below)

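As a quick illustration of that calculation, here is a minimal Python sketch; it assumes a 24GB GPU, matching the example configurations below:

# Approximate GPU memory implied by a fraction (assumes a 24GB card; adjust for your hardware)
GPU_MEMORY_GB = 24

def memory_for_fraction(fraction: float) -> float:
    """Return the approximate GPU memory (in GB) that a fraction maps to."""
    return GPU_MEMORY_GB * fraction

for fraction in (0.1, 0.25, 0.5, 0.75):
    print(f"{fraction:>4}: ~{memory_for_fraction(fraction):.1f} GB")
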
3. Example Configuration

Small Development Workload:

Compute Resource: GPU Portion
GPU Fraction: 0.25 (25%)
GPU Memory: ~6GB (on 24GB GPU)
CPU: 2 cores
Memory: 4Gi

Medium Training Job:

Compute Resource: GPU Portion  
GPU Fraction: 0.5 (50%)
GPU Memory: ~12GB (on 24GB GPU)
CPU: 4 cores
Memory: 8Gi

Method 2: Using the CLI

1. Basic GPU Fraction Submission

# Submit workload with 25% GPU
runai submit "fractional-training" \
    --image pytorch/pytorch:latest \
    --gpu-request-type portion \
    --gpu-portion-request 0.25 \
    --cpu-request 2 \
    --memory-request 4Gi

2. Advanced Fractional Configurations

Jupyter Notebook with Small Fraction:

runai submit "jupyter-dev" \
    --image jupyter/tensorflow-notebook \
    --gpu-request-type portion \
    --gpu-portion-request 0.1 \
    --cpu-request 1 \
    --memory-request 2Gi \
    --port 8888:8888 \
    --interactive

Model Training with Medium Fraction:

runai submit "bert-fine-tune" \
    --image huggingface/transformers-pytorch-gpu \
    --gpu-request-type portion \
    --gpu-portion-request 0.5 \
    --cpu-request 4 \
    --memory-request 8Gi \
    --volume /data/models:/workspace/models \
    --command "python fine_tune.py --batch_size 16"

Inference Service with Dynamic Scaling:

runai submit "inference-api" \
    --image tensorflow/serving:latest-gpu \
    --gpu-request-type portion \
    --gpu-portion-request 0.25 \
    --cpu-request 2 \
    --memory-request 4Gi \
    --port 8501:8501 \
    --service-type LoadBalancer

Method 3: Dynamic GPU Fractions

Dynamic fractions let a workload use more GPU than its guaranteed request when spare capacity is available, and fall back toward that request when other workloads need their share.

1. Configure Dynamic Fractions

runai submit "dynamic-training" \
    --image pytorch/pytorch:latest \
    --gpu-request-type portion \
    --gpu-portion-request 0.25 \
    --gpu-portion-limit 0.75 \
    --cpu-request 2 \
    --memory-request 4Gi

2. Dynamic Fraction Policies

Elastic Training:

# workload-config.yaml
apiVersion: run.ai/v1
kind: Workload
metadata:
  name: elastic-training
spec:
  resources:
    gpu:
      type: portion
      request: 0.25  # Minimum GPU needed
      limit: 1.0     # Maximum GPU that can be used
      scaling: elastic # Allow dynamic scaling
    cpu:
      request: 2
      limit: 8
    memory:
      request: 4Gi
      limit: 16Gi
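
One way a workload can take advantage of dynamic fractions is to check how much GPU memory is actually visible at runtime and size its work accordingly. A minimal PyTorch sketch; the thresholds below are illustrative, not Run:AI defaults:

import torch

# Query the free/total memory PyTorch can see on device 0 and pick a batch size accordingly
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
free_gb = free_bytes / 1024**3
batch_size = 64 if free_gb > 12 else 32 if free_gb > 6 else 16
print(f"Free GPU memory: {free_gb:.1f} GB of {total_bytes / 1024**3:.1f} GB -> batch size {batch_size}")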

Monitoring GPU Fraction Usage

1. Check Utilization

Via CLI:

# Monitor workload resource usage
runai top <workload-name>

# Get detailed resource information
runai describe <workload-name>

# List all workloads with resource usage
runai list --show-resources

Via UI:
- Navigate to your workload in the dashboard
- View real-time GPU utilization metrics
- Monitor memory usage and CPU consumption

2. GPU Sharing Visibility

# See which workloads share the same GPU
runai get node <node-name> --show-workloads

# Check GPU allocation across cluster
runai cluster-info --show-gpu-allocation
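
To sample utilization programmatically from inside (or alongside) a workload, here is a minimal sketch using the NVML Python bindings (pynvml). Note that NVML reports whole-device numbers, so on a shared GPU you see the combined usage of every workload on that device:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)          # first visible GPU
util = pynvml.nvmlDeviceGetUtilizationRates(handle)    # whole-device utilization
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)           # whole-device memory
print(f"GPU util: {util.gpu}% | memory: {mem.used / 1024**3:.1f}/{mem.total / 1024**3:.1f} GB")
pynvml.nvmlShutdown()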

Optimizing Code for GPU Fractions

1. Memory Management

Enable TensorFlow GPU Memory Growth (so TensorFlow allocates memory on demand instead of claiming the whole device):

import tensorflow as tf

# Configure GPU memory growth
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)

PyTorch Memory Management:

import torch

# Set memory fraction (for 25% GPU)
torch.cuda.set_per_process_memory_fraction(0.25)

# Release unused cached memory held by PyTorch's allocator
torch.cuda.empty_cache()

# Monitor memory usage
def check_gpu_memory():
    print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    print(f"Reserved: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")

2. Batch Size Optimization

# Adjust batch size based on GPU fraction
import os

def get_optimal_batch_size():
    gpu_fraction = float(os.environ.get('GPU_FRACTION', '1.0'))
    base_batch_size = 32

    # Scale batch size with GPU fraction
    optimal_batch_size = int(base_batch_size * gpu_fraction)
    return max(optimal_batch_size, 1)  # Minimum batch size of 1

batch_size = get_optimal_batch_size()
print(f"Using batch size: {batch_size}")
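
Note that GPU_FRACTION here is an environment variable this sketch assumes you set yourself; it is not guaranteed to be injected automatically. One way to provide it, assuming your CLI version supports the -e/--environment flag, is at submission time:

# Pass the fraction to the container so the code above can read it (flag name may vary by CLI version)
runai submit "fractional-training" \
    --image pytorch/pytorch:latest \
    --gpu-request-type portion \
    --gpu-portion-request 0.25 \
    -e GPU_FRACTION=0.25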

3. Model Size Considerations

# Check if model fits in fractional GPU memory
def check_model_size(model, gpu_fraction=0.25):
    """Estimate if model fits in fractional GPU memory"""
    param_size = sum(p.numel() * p.element_size() for p in model.parameters())
    buffer_size = sum(b.numel() * b.element_size() for b in model.buffers())

    total_size_gb = (param_size + buffer_size) / 1024**3
    available_memory = 24 * gpu_fraction  # Assuming 24GB GPU

    print(f"Model size: {total_size_gb:.2f} GB")
    print(f"Available memory: {available_memory:.2f} GB")

    return total_size_gb < available_memory * 0.8  # 80% utilization threshold
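
For example, to sanity-check a model before choosing a fraction (the model below is a small hypothetical stand-in; any torch.nn.Module works):

import torch.nn as nn

# A small illustrative model, not part of the tutorial's workloads
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

if check_model_size(model, gpu_fraction=0.25):
    print("Model should fit in a 0.25 fraction")
else:
    print("Request a larger fraction or shrink the model")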

Best Practices

1. Choosing the Right Fraction

Development Work:

# Use small fractions for experimentation
--gpu-portion-request 0.1  # 10% for quick tests
--gpu-portion-request 0.25 # 25% for development

Production Inference:

# Balance performance and resource sharing
--gpu-portion-request 0.5  # 50% for stable inference

Training Jobs:

# Use larger fractions or whole GPUs
--gpu-portion-request 0.75 # 75% for medium models
--gpu-request 1            # Whole GPU for large models

2. Resource Planning

Calculate GPU Requirements:

def calculate_gpu_needs():
    """Calculate GPU fraction needed based on model requirements"""

    # Model parameters
    model_params = 110_000_000  # 110M parameters (BERT-base)
    bytes_per_param = 4  # float32

    # Training overhead (gradients, optimizer states, activations)
    training_overhead = 4  # 4x model size

    # Total memory needed (in GB)
    memory_needed = (model_params * bytes_per_param * training_overhead) / 1024**3

    # GPU memory available (24GB GPU)
    gpu_memory = 24

    # Required fraction
    fraction_needed = memory_needed / gpu_memory

    # Add safety margin
    recommended_fraction = min(fraction_needed * 1.2, 1.0)

    return recommended_fraction
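
Calling the helper with the BERT-base numbers above gives a concrete feel for the result:

fraction = calculate_gpu_needs()
print(f"Recommended GPU fraction: {fraction:.2f}")  # roughly 0.08 for the values above

That works out to about 1.6GB of memory, so by this heuristic even a 0.1 fraction covers BERT-base; actual usage also depends on batch size and sequence length, so verify with check_gpu_memory() during a short test run.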

3. Monitoring and Alerts

# Set up resource monitoring
runai submit "monitored-workload" \
    --gpu-portion-request 0.5 \
    --alert-on-gpu-utilization-low 50 \
    --alert-on-memory-usage-high 90

Troubleshooting

Common Issues

Out of Memory Errors:

# Reduce batch size
batch_size = batch_size // 2

# Enable gradient checkpointing (checkpoint_sequential wraps the forward pass of an
# nn.Sequential model, so call it on each batch rather than reassigning the model)
output = torch.utils.checkpoint.checkpoint_sequential(model, 2, inputs)

# Use mixed precision training
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
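
To complete the mixed-precision snippet, a minimal training-step sketch (assumes model, optimizer, loss_fn, and dataloader already exist):

for inputs, targets in dataloader:
    optimizer.zero_grad()
    with autocast():                       # run the forward pass in mixed precision
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()          # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()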

Poor Performance:

# Check if you need more GPU resources
runai describe <workload-name> | grep "GPU Utilization"

# Consider upgrading to larger fraction
runai update <workload-name> --gpu-portion-request 0.5

GPU Not Allocated:

# Check cluster availability
runai cluster-info

# Verify project quota
runai describe project <project-name>

Debugging Commands

# Check GPU sharing on node
kubectl describe node <node-name> | grep nvidia.com/gpu

# View detailed resource allocation
runai get workload <workload-name> -o yaml | grep resources -A 10

# Monitor real-time GPU usage
watch -n 1 'runai top <workload-name>'

Next Steps

Now that you understand GPU fractions:

  1. Experiment with Different Sizes: Test various fractions for your workloads
  2. Implement Dynamic Scaling: Use dynamic fractions for elastic workloads
  3. Optimize Your Code: Adapt applications for fractional GPU resources
  4. Monitor Resource Usage: Set up comprehensive monitoring