Environment Assets

Environment assets define container images and runtime configurations that your workloads will use. They standardize the software stack and ensure consistent environments across teams.

What are Environment Assets?

Environment assets specify:

- Container Image: The base Docker image with your tools and frameworks
- Environment Variables: Runtime configuration and references to secrets
- Working Directory: Default execution path
- Commands: Startup commands and entry points

Creating Environment Assets

Method 1: Using the UI

1. Navigate to Assets → Environments
2. Click "+ NEW ENVIRONMENT"
3. Configure the environment:

Basic Configuration:

Name: pytorch-gpu-training
Description: PyTorch environment for GPU training
Scope: Project

Container Settings:

Image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
Working Directory: /workspace

Environment Variables:

NVIDIA_VISIBLE_DEVICES: all
PYTHONPATH: /workspace
OMP_NUM_THREADS: 4

Method 2: Using YAML

Create pytorch-env.yaml:

apiVersion: run.ai/v1
kind: Environment
metadata:
  name: pytorch-gpu-training
  namespace: runai-<project-name>
spec:
  image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
  workingDir: /workspace
  env:
    - name: PYTHONPATH
      value: /workspace
    - name: OMP_NUM_THREADS
      value: "4"
  command: ["/bin/bash"]
  args: ["-c", "jupyter lab --allow-root --ip=0.0.0.0"]

Apply with:

kubectl apply -f pytorch-env.yaml
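
The manifest quotes `OMP_NUM_THREADS` as `"4"` because Kubernetes-style `env` entries expect string values. Assuming the `env` list here follows that convention, the sketch below shows one way to turn a plain mapping into `name`/`value` entries with every value stringified. The helper name `to_env_list` is hypothetical and not part of any Run:ai tooling.

```python
from typing import Dict, List, Union

Scalar = Union[str, int, float, bool]


def to_env_list(variables: Dict[str, Scalar]) -> List[Dict[str, str]]:
    """Convert a mapping into a list of {"name": ..., "value": ...} entries.

    Every value is converted to a string, mirroring the quoted "4"
    in the manifest above.
    """
    entries = []
    for name, value in variables.items():
        if isinstance(value, bool):
            # Render booleans in lowercase instead of Python's "True"/"False".
            text = "true" if value else "false"
        else:
            text = str(value)
        entries.append({"name": name, "value": text})
    return entries


if __name__ == "__main__":
    for entry in to_env_list({"PYTHONPATH": "/workspace", "OMP_NUM_THREADS": 4}):
        print(entry)
```

The resulting list can be serialized with whichever YAML or JSON library you already use to generate manifests.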

Common Environment Examples

1. Data Science Environment

Jupyter with TensorFlow:

Name: tensorflow-jupyter
Image: tensorflow/tensorflow:2.13.0-gpu-jupyter
Working Directory: /workspace
Environment Variables:
  JUPYTER_ENABLE_LAB: "yes"
  JUPYTER_TOKEN: ""
Command: jupyter lab --allow-root --ip=0.0.0.0 --port=8888

2. PyTorch Training Environment

Research and Development:

Name: pytorch-research
Image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
Working Directory: /workspace
Environment Variables:
  NVIDIA_VISIBLE_DEVICES: all
  PYTHONPATH: /workspace:/workspace/src
  WANDB_PROJECT: research-experiments
Command: python -m pytest --version && python --version

3. Custom Environment with Dependencies

MLflow Tracking:

Name: mlflow-training
Image: python:3.9-slim
Working Directory: /workspace
Environment Variables:
  MLFLOW_TRACKING_URI: http://mlflow-server:5000
  EXPERIMENT_NAME: model-training
Setup Commands:
  - pip install torch torchvision mlflow scikit-learn
  - pip install transformers datasets

Using Environment Assets

In Workload Creation

Via UI:

1. Create a new workload
2. In the Environment section, select your environment asset
3. Optionally override environment variables

Via CLI:

runai submit "training-job" \
    --environment pytorch-gpu-training \
    --gpu 1 \
    --volume /data:/workspace/data

Override Environment Settings

Add Extra Environment Variables:

runai submit "custom-training" \
    --environment pytorch-gpu-training \
    --env BATCH_SIZE=32 \
    --env LEARNING_RATE=0.001 \
    --gpu 1
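
Variables passed this way reach the container as strings, so the training script has to convert them itself. A minimal sketch of how a script might read `BATCH_SIZE` and `LEARNING_RATE` follows; the function name and the fallback defaults (16 and 0.01) are assumptions chosen for illustration.

```python
import os
from typing import Mapping


def read_hyperparameters(environ: Mapping[str, str] = os.environ) -> dict:
    """Read BATCH_SIZE and LEARNING_RATE, falling back to illustrative defaults."""
    return {
        # Environment values are always strings, so convert them explicitly.
        "batch_size": int(environ.get("BATCH_SIZE", "16")),
        "learning_rate": float(environ.get("LEARNING_RATE", "0.01")),
    }


if __name__ == "__main__":
    print(read_hyperparameters())
```

Accepting the mapping as a parameter keeps the function easy to test without touching the real process environment.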

Override Working Directory:

runai submit "notebook-session" \
    --environment tensorflow-jupyter \
    --working-dir /workspace/notebooks \
    --interactive

Environment Variable Management

1. Common Variables

GPU Configuration:

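# CUDA_VISIBLE_DEVICES takes device indices or UUIDs; "all" is only valid for NVIDIA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0,1
NVIDIA_VISIBLE_DEVICES=all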

Python Environment:

PYTHONPATH=/workspace:/workspace/src
PYTHONUNBUFFERED=1

Framework Specific:

# TensorFlow
TF_CPP_MIN_LOG_LEVEL=2
TF_GPU_ALLOCATOR=cuda_malloc_async

# PyTorch  
TORCH_HOME=/workspace/.torch
OMP_NUM_THREADS=4

2. Experiment Tracking

Weights & Biases:

WANDB_PROJECT=my-experiments
WANDB_ENTITY=my-team
WANDB_RUN_GROUP=experiment-1

MLflow:

MLFLOW_TRACKING_URI=http://mlflow:5000
MLFLOW_EXPERIMENT_NAME=training-runs
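
A workload that depends on these variables can check for them at startup, so a misconfigured environment fails immediately instead of partway through a run. The sketch below is one possible check; the helper name `missing_variables` is hypothetical, and the required list is taken from the examples above.

```python
import os
from typing import Iterable, List, Mapping

# Variables from the tracking examples above; adjust to the tools you use.
REQUIRED_TRACKING_VARS = ("MLFLOW_TRACKING_URI", "MLFLOW_EXPERIMENT_NAME")


def missing_variables(
    required: Iterable[str],
    environ: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names that are unset or set to an empty string."""
    return [name for name in required if not environ.get(name)]
```

A training script could call `missing_variables(REQUIRED_TRACKING_VARS)` before doing any work and exit with a clear message if the returned list is not empty.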

Best Practices

1. Image Selection

Use Official Images:

# Good choices
pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
tensorflow/tensorflow:2.13.0-gpu
nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04

Version Pinning:

# Pin specific versions for reproducibility
Image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
# Avoid: pytorch/pytorch:latest
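
The pinning rule can be checked automatically, for example in a review script that scans environment definitions. The sketch below treats an image as pinned if it carries a digest or an explicit tag other than `latest`. The function name `is_pinned` is hypothetical, and the parsing is deliberately simple rather than a full implementation of the image reference grammar.

```python
def is_pinned(image: str) -> bool:
    """Return True if the reference has a digest or a tag other than 'latest'."""
    if "@" in image:
        # Digest references (name@sha256:...) identify exact image content.
        return True
    # Look for a tag only in the last path segment, so that a registry
    # port such as localhost:5000/app is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag: container runtimes default to "latest"
    tag = last_segment.rsplit(":", 1)[1]
    return tag != "" and tag != "latest"


if __name__ == "__main__":
    for ref in ("pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel", "pytorch/pytorch:latest"):
        print(ref, is_pinned(ref))
```

Note that a version tag can still be repointed by the publisher; only a digest reference guarantees identical content.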

2. Environment Organization

Naming Convention:

# Framework-purpose-version pattern
pytorch-training-v2.0
tensorflow-inference-v2.13
jupyter-datascience-v3.9
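
A convention like this is easier to keep when it is checked mechanically. The sketch below validates names against one possible reading of the framework-purpose-version pattern (lowercase alphanumeric framework and purpose, then `v` and a dotted version). The regular expression and the function name are assumptions, not a rule enforced by the platform.

```python
import re
from typing import Dict, Optional

# One interpretation of the framework-purpose-version pattern shown above.
NAME_PATTERN = re.compile(
    r"^(?P<framework>[a-z0-9]+)-(?P<purpose>[a-z0-9]+)-v(?P<version>\d+(?:\.\d+)*)$"
)


def parse_environment_name(name: str) -> Optional[Dict[str, str]]:
    """Split a conforming name into its parts, or return None if it does not match."""
    match = NAME_PATTERN.match(name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_environment_name("pytorch-training-v2.0"))
    print(parse_environment_name("pytorch-gpu-training"))
```

Names such as `pytorch-gpu-training` from the earlier examples do not match this pattern, which shows where existing assets would need renaming if the convention were adopted strictly.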

Scope Management:

- Project scope: Team-specific configurations
- Cluster scope: Organization-wide base images

3. Security Considerations

Avoid Hardcoded Secrets:

# Don't do this
Environment Variables:
  API_KEY: sk-1234567890abcdef  # Never hardcode secrets

# Do this instead
Environment Variables:
  API_KEY_FILE: /workspace/secrets/api-key
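
With the file-based approach, the application reads the secret from the path held in `API_KEY_FILE` rather than from the variable itself. A minimal sketch, assuming the secret is mounted as a plain UTF-8 text file; the function name `load_api_key` is hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping


def load_api_key(environ: Mapping[str, str] = os.environ) -> str:
    """Return the API key stored in the file named by API_KEY_FILE."""
    key_path = environ.get("API_KEY_FILE")
    if not key_path:
        raise RuntimeError("API_KEY_FILE is not set")
    # Strip surrounding whitespace, such as a trailing newline in the file.
    key = Path(key_path).read_text(encoding="utf-8").strip()
    if not key:
        raise RuntimeError(f"{key_path} is empty")
    return key
```

Because only the path appears in the environment definition, the key itself stays out of the asset configuration and out of `env` output.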

Use Minimal Base Images:

# More secure
Image: python:3.9-slim

# Less secure (larger attack surface)
Image: ubuntu:latest

Troubleshooting

Common Issues

Image Pull Failures:

# Check image exists
docker pull pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel

# Verify registry access
kubectl get secrets -n runai-<project>

Environment Variable Issues:

# Debug inside running workload
runai exec <workload-name> -- env | grep CUDA
# Use printenv so the variable is resolved inside the container, not by your local shell
runai exec <workload-name> -- printenv PYTHONPATH

Permission Problems:

# Check user context in container
runai exec <workload-name> -- whoami
runai exec <workload-name> -- ls -la /workspace

Debugging Commands

Test Environment:

# Submit test workload
runai submit "env-test" \
    --environment pytorch-gpu-training \
    --command "python --version && pip list | head -10" \
    --gpu 0.1

# Inspect the output of the test workload
runai logs env-test

Validate GPU Access:

runai submit "gpu-test" \
    --environment pytorch-gpu-training \
    --command "python -c 'import torch; print(torch.cuda.is_available())'" \
    --gpu 0.1
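
When the one-liner prints `False`, a slightly longer script can help narrow down the cause. The sketch below reports whether PyTorch is importable, whether CUDA is available, and which devices are visible. The function name `gpu_report` is hypothetical; the `torch.cuda` calls are standard PyTorch functions.

```python
def gpu_report() -> dict:
    """Summarize what PyTorch can see; degrade gracefully if torch is absent."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False, "devices": []}
    available = torch.cuda.is_available()
    # Only enumerate devices when CUDA is usable.
    count = torch.cuda.device_count() if available else 0
    return {
        "torch_installed": True,
        "cuda_available": available,
        "devices": [torch.cuda.get_device_name(i) for i in range(count)],
    }


if __name__ == "__main__":
    print(gpu_report())
```

An empty device list with `torch_installed` set to `True` suggests looking at the GPU request and the visibility variables rather than at the image.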

Next Steps

1. Create Base Environments: Start with framework-specific environments
2. Customize for Teams: Add team-specific tools and configurations
3. Version Control: Maintain multiple versions for different use cases
4. Test Thoroughly: Validate environments before team deployment

Compute Resources - Pair environments with appropriate hardware
Data Sources - Mount data into your environments
Credentials - Secure access to external services