Create a Training Workload¶
This guide walks you through creating and running your first standard training workload in Run:AI. Standard training is well suited to single-node (non-distributed) model training, experimentation, and development workflows.
What is Standard Training?¶
Standard training allows you to:
- Train models on single or multiple GPUs (non-distributed)
- Experiment rapidly with different model architectures
- Develop and test ML pipelines
- Fine-tune models with full control over the training process
Prerequisites¶
Before starting, ensure you have:
- A project with a quota of at least 1 GPU assigned
- Access to the Run:AI user interface or CLI
- A container image with your training code, or one of the pre-built framework images
Method 1: Using the UI¶
1. Create a New Training Workload¶
- Navigate to Workload manager → Workloads
- Click "+NEW WORKLOAD"
- Select "Training" from the workload types
- Choose your cluster and project
2. Configure Basic Settings¶
- Architecture Type: Select "Standard" (not distributed)
- Template: Choose "Start from Scratch" for full control
- Workload Name: Enter a unique, descriptive name, such as bert-fine-tuning-v1
3. Environment Configuration¶
Choose Training Image:
# Popular ML frameworks
pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel
tensorflow/tensorflow:2.13-gpu
huggingface/transformers-pytorch-gpu:4.21.0
nvcr.io/nvidia/pytorch:23.10-py3
Set Environment Variables (optional):
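For example, you might pass experiment-tracking settings or cache locations into the container. The variable names below are illustrative; use whatever your training code actually reads:
# Example environment variables (illustrative)
WANDB_PROJECT=bert-experiments
WANDB_API_KEY=<your-api-key>
HF_HOME=/workspace/.cache/huggingface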
4. Compute Resources¶
GPU Allocation:
# Full GPU (recommended for training)
GPU Request: 1 whole GPU
GPU Memory: 24GB (typical)
# Or fractional GPU for smaller models
GPU Portion: 0.5 (50% of GPU)
CPU and Memory:
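Request enough CPU cores and system memory to keep the GPU fed; data loading and preprocessing are often CPU-bound. A reasonable starting point for a single-GPU job (illustrative values, adjust to your model and data pipeline):
# Typical starting values for one GPU
CPU Request: 4 cores
Memory Request: 16Gi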
5. Data Sources and Storage¶
Mount Datasets:
# Mount datasets from PVC
Source: /data/imagenet
Mount Path: /workspace/data
# Mount model checkpoints
Source: /models/checkpoints
Mount Path: /workspace/models
6. Training Command¶
Set Training Command:
# Simple training script
python train.py --epochs 50 --batch_size 32
# More complex training
python train.py --model resnet50 --dataset imagenet --epochs 100 --lr 0.1
Working Directory: /workspace
7. Submit the Training Job¶
- Review all configurations
- Click "CREATE WORKLOAD"
- Monitor progress in the workload manager
Method 2: Using the CLI¶
Basic Training Submission¶
# Simple training job
runai submit "pytorch-training" \
--image pytorch/pytorch:latest \
--gpu 1 \
--cpu 4 \
--memory 8Gi \
--volume /data:/workspace/data \
--command "python train.py --epochs 50"
Advanced Training Configuration¶
# Comprehensive training setup
runai submit "bert-fine-tuning" \
--image huggingface/transformers-pytorch-gpu:latest \
--gpu 1 \
--cpu 8 \
--memory 16Gi \
--volume /datasets:/workspace/data \
--volume /models:/workspace/checkpoints \
--env WANDB_PROJECT=bert-experiments \
--working-dir /workspace \
--command "python run_training.py --model bert-base --epochs 10"
Example Training Script¶
Here's a simple PyTorch training script to get you started:
# train.py
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import argparse
import os
def main():
parser = argparse.ArgumentParser(description='PyTorch Training')
parser.add_argument('--epochs', default=10, type=int)
parser.add_argument('--batch_size', default=128, type=int)
parser.add_argument('--lr', default=0.1, type=float)
args = parser.parse_args()
# Setup device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
# Data loading
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
trainset = torchvision.datasets.CIFAR10(
root='/workspace/data', train=True, download=True, transform=transform
)
trainloader = torch.utils.data.DataLoader(
trainset, batch_size=args.batch_size, shuffle=True
)
# Simple model
model = torchvision.models.resnet18(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=0.9)
# Training loop
for epoch in range(args.epochs):
running_loss = 0.0
for i, (inputs, labels) in enumerate(trainloader):
inputs, labels = inputs.to(device), labels.to(device)
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
running_loss += loss.item()
if i % 100 == 0:
print(f'Epoch {epoch+1}, Batch {i}, Loss: {loss.item():.3f}')
print(f'Epoch {epoch+1} completed, Average Loss: {running_loss/len(trainloader):.3f}')
    # Save the trained model (make sure the output directory exists)
    os.makedirs('/workspace/models', exist_ok=True)
    torch.save(model.state_dict(), '/workspace/models/final_model.pth')
print('Training completed!')
if __name__ == '__main__':
main()
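If this script is baked into your image (or available under /workspace via a mounted volume), you could submit it with the same CLI pattern shown above. The workload name, image, and volume paths here are illustrative:
# Submit the example script (illustrative names and paths)
runai submit "cifar10-resnet18" \
  --image pytorch/pytorch:latest \
  --gpu 1 \
  --cpu 4 \
  --memory 8Gi \
  --volume /data:/workspace/data \
  --volume /models:/workspace/models \
  --working-dir /workspace \
  --command "python train.py --epochs 20 --batch_size 128"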
Monitoring Training Progress¶
Using the UI¶
- Navigate to your training workload in the dashboard
- Click "SHOW DETAILS" to view:
- Real-time logs showing training progress
- Resource utilization (GPU, CPU, Memory)
- Training events and status updates
Using CLI Commands¶
# Monitor training status
runai list
# View detailed workload information
runai describe <workload-name>
# Stream training logs in real-time
runai logs <workload-name> --follow
# Check resource usage
runai top <workload-name>
Best Practices¶
Resource Planning¶
- Start with appropriate GPU allocation: begin with a single whole GPU (or a fractional GPU for small models) and scale up only when profiling shows you need more.
- Use checkpointing for long-running training so interrupted or preempted jobs can resume where they left off instead of starting over (see the sketch below).
- Monitor GPU utilization (for example with runai top or nvidia-smi) to confirm the GPU stays busy and the input pipeline is not the bottleneck.
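A minimal checkpointing sketch in PyTorch, assuming the model and optimizer variables from the example script above and a persistent volume mounted at /workspace/models (path and resume logic are illustrative):
# Minimal periodic checkpointing helpers (illustrative)
import os
import torch

CKPT_PATH = '/workspace/models/checkpoint.pth'  # assumes a persistent volume is mounted here

def save_checkpoint(model, optimizer, epoch, path=CKPT_PATH):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    torch.save({
        'epoch': epoch,
        'model_state': model.state_dict(),
        'optimizer_state': optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path=CKPT_PATH):
    """Return the epoch to resume from (0 if no checkpoint exists)."""
    if not os.path.exists(path):
        return 0
    ckpt = torch.load(path, map_location='cpu')
    model.load_state_dict(ckpt['model_state'])
    optimizer.load_state_dict(ckpt['optimizer_state'])
    return ckpt['epoch'] + 1

# In the training loop: resume first, then save at the end of every epoch
# start_epoch = load_checkpoint(model, optimizer)
# for epoch in range(start_epoch, args.epochs):
#     ... train one epoch ...
#     save_checkpoint(model, optimizer, epoch)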
Troubleshooting¶
Common Issues¶
Out of Memory Errors:
# Reduce batch size
batch_size = 64  # instead of 128

# Use gradient accumulation (this sketch assumes the forward pass returns the loss)
accumulation_steps = 2
optimizer.zero_grad()
for i, batch in enumerate(dataloader):
    loss = model(batch) / accumulation_steps  # scale so accumulated gradients average out
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
Workload Not Starting:
# Check project quota
runai describe project <project-name>
# Check resource availability
runai cluster-info
# Verify image accessibility
runai submit test --image <your-image> --command "echo 'Image works'"
Slow Training Performance:
- Check GPU utilization with nvidia-smi
- Verify data loading isn't a bottleneck
- Consider using faster data formats or more DataLoader workers (see the sketch below)
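If data loading is the bottleneck, adding DataLoader workers and enabling pinned memory often helps. A small sketch using the trainset from the example script above (the worker count is illustrative; roughly match it to the CPU cores you requested):
# Faster input pipeline: parallel workers and pinned host memory (illustrative values)
trainloader = torch.utils.data.DataLoader(
    trainset,
    batch_size=128,
    shuffle=True,
    num_workers=4,            # parallel data-loading processes
    pin_memory=True,          # speeds up host-to-GPU transfers
    persistent_workers=True,  # avoid re-spawning workers every epoch
)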
Next Steps¶
Now that you can run training workloads:
- Try larger models with distributed training
- Experiment with GPU fractions for resource optimization
- Set up monitoring with TensorBoard or Weights & Biases
- Implement proper checkpointing for production workloads
Related Guides¶
- Distributed Training - Scale beyond single GPU
- GPU Fractions - Optimize resource usage
- Preemption Exercise - Understand workload priorities