Create a Training Workload

This guide walks you through creating and running your first standard training workload in Run:AI. Standard training is suited to single-node (non-distributed) model training, experimentation, and development workflows.

What is Standard Training?

Standard training allows you to:

  - Train models on single or multiple GPUs (non-distributed)
  - Experiment rapidly with different model architectures
  - Develop and test ML pipelines
  - Fine-tune models with full control over the training process

Prerequisites

Before starting, ensure you have:

  - A project with at least 1 GPU of quota assigned
  - Access to the Run:AI user interface or CLI
  - A container image with your training code, or a pre-built framework image
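
If you use the CLI, a quick check before submitting anything confirms that you are logged in and that the project has free GPU quota. This is a sketch using commands that also appear later in this guide; replace <project-name> with your project, and note that exact commands can vary by CLI version:

# Verify CLI access and project quota
runai login
runai describe project <project-name>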

Method 1: Using the UI

1. Create a New Training Workload

  1. Navigate to Workload manager → Workloads
  2. Click "+NEW WORKLOAD"
  3. Select "Training" from the workload types
  4. Choose your cluster and project

2. Configure Basic Settings

  1. Architecture Type: Select "Standard" (not distributed)
  2. Template: Choose "Start from Scratch" for full control
  3. Workload Name: Enter a unique, descriptive name like bert-fine-tuning-v1

3. Environment Configuration

Choose Training Image:

# Popular ML frameworks
pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel
tensorflow/tensorflow:2.13-gpu
huggingface/transformers-pytorch-gpu:4.21.0
nvcr.io/nvidia/pytorch:23.10-py3

Set Environment Variables (optional):

PYTHONPATH=/workspace
WANDB_PROJECT=my-experiments
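
These variables are visible to your training process at runtime. A minimal sketch of reading them from Python (the variable names match the example above; the fallback values are placeholders):

# Read the environment variables set for the workload
import os

python_path = os.environ.get('PYTHONPATH', '')               # set to /workspace above
wandb_project = os.environ.get('WANDB_PROJECT', 'default')   # used by experiment tracking tools
print(f'PYTHONPATH={python_path}, WANDB_PROJECT={wandb_project}')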

4. Compute Resources

GPU Allocation:

# Full GPU (recommended for training)
GPU Request: 1 whole GPU
GPU Memory: 24GB (typical)

# Or fractional GPU for smaller models
GPU Portion: 0.5 (50% of GPU)

CPU and Memory:

CPU Request: 4 cores
Memory Request: 8Gi
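
Once the workload is running, you can confirm from inside the container that the requested GPU is actually visible to your framework. A minimal PyTorch sketch (assumes one of the PyTorch images listed above):

# Confirm the allocated GPU and its memory are visible to PyTorch
import torch

print('CUDA available:', torch.cuda.is_available())
print('Visible GPUs:', torch.cuda.device_count())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f'Device: {props.name}, memory: {props.total_memory / 1e9:.1f} GB')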

5. Data Sources and Storage

Mount Datasets:

# Mount datasets from PVC
Source: /data/imagenet
Mount Path: /workspace/data

# Mount model checkpoints
Source: /models/checkpoints  
Mount Path: /workspace/models
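
Inside the container, your data appears at the mount paths, not at the original source paths. A small sanity check at the start of a run fails fast if a mount is missing (paths match the examples above):

# Fail fast if the expected mounts are not present
import os
import sys

for path in ('/workspace/data', '/workspace/models'):
    if not os.path.isdir(path):
        sys.exit(f'Expected mount {path} not found - check the workload data sources')
    print(f'{path}: {len(os.listdir(path))} entries')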

6. Training Command

Set Training Command:

# Simple training script
python train.py --epochs 50 --batch_size 32

# More complex training
python train.py --model resnet50 --dataset imagenet --epochs 100 --lr 0.1

Working Directory: /workspace

7. Submit the Training Job

  1. Review all configurations
  2. Click "CREATE WORKLOAD"
  3. Monitor progress in the workload manager

Method 2: Using the CLI

Basic Training Submission

# Simple training job
runai submit "pytorch-training" \
    --image pytorch/pytorch:latest \
    --gpu 1 \
    --cpu 4 \
    --memory 8Gi \
    --volume /data:/workspace/data \
    --command "python train.py --epochs 50"

Advanced Training Configuration

# Comprehensive training setup
runai submit "bert-fine-tuning" \
    --image huggingface/transformers-pytorch-gpu:latest \
    --gpu 1 \
    --cpu 8 \
    --memory 16Gi \
    --volume /datasets:/workspace/data \
    --volume /models:/workspace/checkpoints \
    --env WANDB_PROJECT=bert-experiments \
    --working-dir /workspace \
    --command "python run_training.py --model bert-base --epochs 10"

Example Training Script

Here's a simple PyTorch training script to get you started:

# train.py
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import argparse
import os

def main():
    parser = argparse.ArgumentParser(description='PyTorch Training')
    parser.add_argument('--epochs', default=10, type=int)
    parser.add_argument('--batch_size', default=128, type=int)
    parser.add_argument('--lr', default=0.1, type=float)
    args = parser.parse_args()

    # Setup device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f'Using device: {device}')

    # Data loading
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    trainset = torchvision.datasets.CIFAR10(
        root='/workspace/data', train=True, download=True, transform=transform
    )
    trainloader = torch.utils.data.DataLoader(
        trainset, batch_size=args.batch_size, shuffle=True
    )

    # Simple model
    model = torchvision.models.resnet18(num_classes=10).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=0.9)

    # Training loop
    for epoch in range(args.epochs):
        running_loss = 0.0
        for i, (inputs, labels) in enumerate(trainloader):
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            if i % 100 == 0:
                print(f'Epoch {epoch+1}, Batch {i}, Loss: {loss.item():.3f}')

        print(f'Epoch {epoch+1} completed, Average Loss: {running_loss/len(trainloader):.3f}')

    # Save model (create the output directory if it is not already mounted)
    os.makedirs('/workspace/models', exist_ok=True)
    torch.save(model.state_dict(), '/workspace/models/final_model.pth')
    print('Training completed!')

if __name__ == '__main__':
    main()

Monitoring Training Progress

Using the UI

  1. Navigate to your training workload in the dashboard
  2. Click "SHOW DETAILS" to view:
    - Real-time logs showing training progress
    - Resource utilization (GPU, CPU, Memory)
    - Training events and status updates

Using CLI Commands

# Monitor training status
runai list

# View detailed workload information
runai describe <workload-name>

# Stream training logs in real-time
runai logs <workload-name> --follow

# Check resource usage
runai top <workload-name>

Best Practices

Resource Planning

  1. Start with appropriate GPU allocation:

    # For small models and experimentation
    --gpu-portion-request 0.25
    
    # For medium models (BERT, ResNet)
    --gpu 1
    
    # For large models
    --gpu 2  # or more
    

  2. Use checkpointing for long-running training (a resume sketch follows this list):

    # Save checkpoints regularly
    if epoch % 10 == 0:
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss,
        }, f'/workspace/models/checkpoint_epoch_{epoch}.pth')
    

  3. Monitor GPU utilization:

    # Check if you need more/less GPU resources
    runai top <workload-name>
    nvidia-smi  # inside the container
    
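To continue a run from one of those checkpoints, load the same fields back before the training loop. A minimal sketch matching the checkpoint format saved above (model and optimizer mirror the example script; the checkpoint path is illustrative):

# Resume training from a saved checkpoint
import torch
import torch.optim as optim
import torchvision

model = torchvision.models.resnet18(num_classes=10).cuda()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

checkpoint = torch.load('/workspace/models/checkpoint_epoch_10.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1  # continue from the next epoch
print(f'Resuming from epoch {start_epoch}')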

Troubleshooting

Common Issues

Out of Memory Errors:

# Reduce batch size
batch_size = 64  # instead of 128

# Use gradient accumulation (effective batch = batch_size * accumulation_steps)
accumulation_steps = 2
optimizer.zero_grad()
for i, (inputs, labels) in enumerate(trainloader):
    inputs, labels = inputs.to(device), labels.to(device)
    loss = criterion(model(inputs), labels) / accumulation_steps
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Workload Not Starting:

# Check project quota
runai describe project <project-name>

# Check resource availability
runai cluster-info

# Verify image accessibility
runai submit test --image <your-image> --command "echo 'Image works'"

Slow Training Performance:

  - Check GPU utilization with nvidia-smi
  - Verify that data loading isn't a bottleneck
  - Consider faster data formats or more DataLoader workers (see the sketch below)
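
If data loading is the bottleneck, more loader workers and pinned memory usually help. A minimal sketch reusing the trainset from the example script above (the worker count is a starting point to tune against your CPU request, not a recommendation):

# Parallel data loading: more workers plus pinned memory for faster host-to-GPU copies
trainloader = torch.utils.data.DataLoader(
    trainset,
    batch_size=128,
    shuffle=True,
    num_workers=4,            # tune to the CPU cores requested for the workload
    pin_memory=True,          # speeds up transfers to the GPU
    persistent_workers=True,  # keep workers alive between epochs
)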

Next Steps

Now that you can run training workloads:

  1. Try larger models with distributed training
  2. Experiment with GPU fractions for resource optimization
  3. Set up monitoring with TensorBoard or Weights & Biases (see the sketch below)
  4. Implement proper checkpointing for production workloads
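
For step 3, a minimal TensorBoard sketch (assumes the tensorboard package is available in your image; the log directory and values are placeholders, so point the writer at persistent storage you have mounted):

# Log scalar metrics so TensorBoard can visualize training progress
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='/workspace/models/tb_logs')
for epoch, loss in enumerate([2.1, 1.7, 1.4]):  # placeholder values; log real metrics once per epoch
    writer.add_scalar('train/loss', loss, epoch)
writer.close()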