Environment Assets

Environment assets define the container image and runtime configuration that your workloads run with. They standardize the software stack so that environments stay consistent across teams.

What are Environment Assets?

Environment assets specify:

  - Container Image: The base Docker image with your tools and frameworks
  - Environment Variables: Runtime configuration and secrets
  - Working Directory: Default execution path
  - Commands: Startup commands and entry points

Creating Environment Assets

Method 1: Using the UI

  1. Navigate to Assets → Environments
  2. Click "+ NEW ENVIRONMENT"
  3. Configure the environment:

Basic Configuration:

Name: pytorch-gpu-training
Description: PyTorch environment for GPU training
Scope: Project

Container Settings:

Image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
Working Directory: /workspace

Environment Variables:

NVIDIA_VISIBLE_DEVICES: all
PYTHONPATH: /workspace
OMP_NUM_THREADS: 4

Method 2: Using YAML

Create pytorch-env.yaml:

apiVersion: run.ai/v1
kind: Environment
metadata:
  name: pytorch-gpu-training
  namespace: runai-<project-name>
spec:
  image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
  workingDir: /workspace
  env:
    - name: PYTHONPATH
      value: /workspace
    - name: OMP_NUM_THREADS
      value: "4"
  command: ["/bin/bash"]
  args: ["-c", "jupyter lab --allow-root --ip=0.0.0.0"]

Apply with:

kubectl apply -f pytorch-env.yaml
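
Before applying the manifest, it can help to sanity-check the spec. The following is a minimal sketch, not part of any Run:ai tooling: `check_environment_spec` is a hypothetical helper, and the rules it enforces (an image is required, `workingDir` is absolute, env values are strings as in standard Kubernetes container specs) are assumptions about what this resource expects. It shows why `OMP_NUM_THREADS` is quoted as "4" in the YAML above: an unquoted 4 is parsed as an integer.

```python
def check_environment_spec(spec: dict) -> list[str]:
    """Return a list of human-readable problems found in an environment spec.

    The rules below are assumptions for illustration, not a published schema.
    """
    problems = []

    image = spec.get("image")
    if not isinstance(image, str) or not image:
        problems.append("spec.image is required")

    working_dir = spec.get("workingDir")
    if working_dir is not None and not str(working_dir).startswith("/"):
        problems.append("spec.workingDir should be an absolute path")

    for entry in spec.get("env", []):
        name = entry.get("name")
        value = entry.get("value")
        if not name:
            problems.append("every env entry needs a name")
        elif not isinstance(value, str):
            problems.append(
                f"env {name}: value must be a string, got {type(value).__name__}"
            )
    return problems


# The spec from pytorch-env.yaml, with OMP_NUM_THREADS left unquoted on purpose.
spec = {
    "image": "pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel",
    "workingDir": "/workspace",
    "env": [
        {"name": "PYTHONPATH", "value": "/workspace"},
        {"name": "OMP_NUM_THREADS", "value": 4},  # unquoted YAML 4 -> int
    ],
}
print(check_environment_spec(spec))
```

With the value quoted as "4", the function returns an empty list.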

Common Environment Examples

1. Data Science Environment

Jupyter with TensorFlow:

Name: tensorflow-jupyter
Image: tensorflow/tensorflow:2.13.0-gpu-jupyter
Working Directory: /workspace
Environment Variables:
  JUPYTER_ENABLE_LAB: yes
  JUPYTER_TOKEN: ""  # An empty token disables authentication; use only on trusted networks
Command: jupyter lab --allow-root --ip=0.0.0.0 --port=8888

2. PyTorch Training Environment

Research and Development:

Name: pytorch-research
Image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
Working Directory: /workspace
Environment Variables:
  NVIDIA_VISIBLE_DEVICES: all
  PYTHONPATH: /workspace:/workspace/src
  WANDB_PROJECT: research-experiments
Command: python -m pytest --version && python --version

3. Custom Environment with Dependencies

MLflow Tracking:

Name: mlflow-training
Image: python:3.9-slim
Working Directory: /workspace
Environment Variables:
  MLFLOW_TRACKING_URI: http://mlflow-server:5000
  EXPERIMENT_NAME: model-training
Setup Commands:
  - pip install torch torchvision mlflow scikit-learn
  - pip install transformers datasets

Using Environment Assets

In Workload Creation

Via UI:

  1. Create new workload
  2. Environment section → Select your environment asset
  3. Optionally override environment variables

Via CLI:

runai submit "training-job" \
    --environment pytorch-gpu-training \
    --gpu 1 \
    --volume /data:/workspace/data

Override Environment Settings

Add Extra Environment Variables:

runai submit "custom-training" \
    --environment pytorch-gpu-training \
    --env BATCH_SIZE=32 \
    --env LEARNING_RATE=0.001 \
    --gpu 1
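
On the workload side, the training script has to pick these overrides up. A minimal sketch of one way to do that, assuming the script reads `BATCH_SIZE` and `LEARNING_RATE` itself; the function name `read_hyperparameters` and the fallback defaults are illustrative choices, not something the platform provides:

```python
import os


def read_hyperparameters(environ=os.environ) -> dict:
    """Read training hyperparameters from environment variables.

    Environment variables are always strings, so each value is converted
    explicitly. The defaults are placeholders for illustration.
    """
    return {
        "batch_size": int(environ.get("BATCH_SIZE", "32")),
        "learning_rate": float(environ.get("LEARNING_RATE", "0.001")),
    }


if __name__ == "__main__":
    params = read_hyperparameters()
    print(f"batch_size={params['batch_size']} learning_rate={params['learning_rate']}")
```

Passing the mapping as a parameter keeps the function easy to test without touching the real process environment.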

Override Working Directory:

runai submit "notebook-session" \
    --environment tensorflow-jupyter \
    --working-dir /workspace/notebooks \
    --interactive

Environment Variable Management

1. Common Variables

GPU Configuration:

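# CUDA_VISIBLE_DEVICES takes device indices or UUIDs; "all" is only valid for NVIDIA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0,1
NVIDIA_VISIBLE_DEVICES=all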

Python Environment:

PYTHONPATH=/workspace:/workspace/src
PYTHONUNBUFFERED=1

Framework Specific:

# TensorFlow
TF_CPP_MIN_LOG_LEVEL=2
TF_GPU_ALLOCATOR=cuda_malloc_async

# PyTorch  
TORCH_HOME=/workspace/.torch
OMP_NUM_THREADS=4

2. Experiment Tracking

Weights & Biases:

WANDB_PROJECT=my-experiments
WANDB_ENTITY=my-team
WANDB_RUN_GROUP=experiment-1

MLflow:

MLFLOW_TRACKING_URI=http://mlflow:5000
MLFLOW_EXPERIMENT_NAME=training-runs
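
A workload can fail fast when the tracking variables it expects are absent, rather than failing partway through a run. The sketch below uses only the standard library; `missing_tracking_vars` and the choice of which variables count as required are assumptions made for illustration.

```python
import os

# Variables each backend is expected to have set (an assumption for this sketch).
TRACKING_VARS = {
    "wandb": ["WANDB_PROJECT", "WANDB_ENTITY"],
    "mlflow": ["MLFLOW_TRACKING_URI", "MLFLOW_EXPERIMENT_NAME"],
}


def missing_tracking_vars(backend: str, environ=os.environ) -> list[str]:
    """Return the expected variables for `backend` that are unset or empty."""
    return [name for name in TRACKING_VARS[backend] if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_tracking_vars("mlflow")
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```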

Best Practices

1. Image Selection

Use Official Images:

# Good choices
pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
tensorflow/tensorflow:2.13.0-gpu
nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04

Version Pinning:

# Pin specific versions for reproducibility
Image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
# Avoid: pytorch/pytorch:latest
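
The pinning rule can be checked mechanically, for example in a review script. This is a minimal sketch; `is_pinned` is a hypothetical helper, and it treats a reference as pinned when it carries a digest or an explicit tag other than latest.

```python
def is_pinned(image: str) -> bool:
    """True if the image reference has a digest or an explicit non-latest tag."""
    if "@" in image:
        # Digest reference such as name@sha256:...
        return True
    # Only look at the final path segment so a registry host:port is not
    # mistaken for a tag separator.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        # No tag: container runtimes default to "latest".
        return False
    tag = last_segment.rsplit(":", 1)[1]
    return tag != "latest"


for ref in (
    "pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel",
    "pytorch/pytorch:latest",
    "pytorch/pytorch",
):
    print(ref, "->", "pinned" if is_pinned(ref) else "NOT pinned")
```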

2. Environment Organization

Naming Convention:

# Framework-purpose-version pattern
pytorch-training-v2.0
tensorflow-inference-v2.13
jupyter-datascience-v3.9
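
A naming convention is easier to keep when it is checked automatically. The sketch below encodes the framework-purpose-version pattern as a regular expression; the exact pattern (lowercase alphanumeric framework and purpose, then a `v`-prefixed dotted version) is an assumption inferred from the three examples above.

```python
import re

# framework-purpose-vMAJOR[.MINOR...]; inferred from the example names.
NAME_PATTERN = re.compile(
    r"^(?P<framework>[a-z0-9]+)-(?P<purpose>[a-z0-9]+)-v(?P<version>\d+(?:\.\d+)*)$"
)


def parse_environment_name(name: str):
    """Return the name's parts as a dict, or None if it breaks the convention."""
    match = NAME_PATTERN.match(name)
    return match.groupdict() if match else None


print(parse_environment_name("pytorch-training-v2.0"))
print(parse_environment_name("pytorch-gpu-training"))  # no version suffix
```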

Scope Management:

  - Project scope: Team-specific configurations
  - Cluster scope: Organization-wide base images

3. Security Considerations

Avoid Hardcoded Secrets:

# Don't do this
Environment Variables:
  API_KEY: sk-1234567890abcdef  # Never hardcode secrets

# Do this instead
Environment Variables:
  API_KEY_FILE: /workspace/secrets/api-key
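
With the file-based approach, the application reads the key at startup from the path named by `API_KEY_FILE`. A minimal sketch, assuming the secret is mounted as a plain text file at that path; `load_api_key` is a hypothetical helper name.

```python
import os
from pathlib import Path


def load_api_key(environ=os.environ) -> str:
    """Read the API key from the file named by API_KEY_FILE.

    Raises RuntimeError if the variable is unset or the file is empty,
    so a misconfigured workload fails at startup instead of mid-run.
    """
    path = environ.get("API_KEY_FILE")
    if not path:
        raise RuntimeError("API_KEY_FILE is not set")
    # Strip the trailing newline that editors and `echo` usually add.
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise RuntimeError(f"{path} is empty")
    return key
```

The key then never appears in the environment definition, in `env` output, or in the workload's submitted configuration.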

Use Minimal Base Images:

# More secure
Image: python:3.9-slim

# Less secure (larger attack surface)
Image: ubuntu:latest

Troubleshooting

Common Issues

Image Pull Failures:

# Check image exists
docker pull pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel

# Verify registry access
kubectl get secrets -n runai-<project>

Environment Variable Issues:

# Debug inside running workload
runai exec <workload-name> -- env | grep CUDA
runai exec <workload-name> -- printenv PYTHONPATH

Permission Problems:

# Check user context in container
runai exec <workload-name> -- whoami
runai exec <workload-name> -- ls -la /workspace

Debugging Commands

Test Environment:

# Submit test workload
runai submit "env-test" \
    --environment pytorch-gpu-training \
    --command "python --version && pip list | head -10" \
    --gpu 0.1

# Check the output of the test command
runai logs env-test

Validate GPU Access:

runai submit "gpu-test" \
    --environment pytorch-gpu-training \
    --command "python -c 'import torch; print(torch.cuda.is_available())'" \
    --gpu 0.1
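
When the one-liner prints False, a slightly longer script can narrow down the cause. The following sketch could be saved in the workspace and run inside the workload; `gpu_report` is a hypothetical helper, and it deliberately tolerates images where PyTorch is not installed.

```python
import importlib.util
import os


def gpu_report() -> dict:
    """Collect basic GPU visibility facts without failing if torch is missing."""
    report = {
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
        "devices": [],
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the script also runs without it

        report["cuda_available"] = torch.cuda.is_available()
        if report["cuda_available"]:
            report["devices"] = [
                torch.cuda.get_device_name(i)
                for i in range(torch.cuda.device_count())
            ]
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If `torch_installed` is True but `cuda_available` is False, the usual suspects are a CPU-only image, a workload submitted without a GPU request, or a restrictive `CUDA_VISIBLE_DEVICES` value.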

Next Steps

  1. Create Base Environments: Start with framework-specific environments
  2. Customize for Teams: Add team-specific tools and configurations
  3. Version Control: Maintain multiple versions for different use cases
  4. Test Thoroughly: Validate environments before team deployment