Data Source Assets

Data source assets provide standardized access to datasets and storage systems. They simplify data mounting and ensure consistent data access patterns across workloads.

What are Data Source Assets?

Data sources define:

  - Storage Connections: PVCs, NFS, Git repositories, S3 buckets
  - Mount Paths: Where data appears in your containers
  - Access Modes: Read-only, read-write permissions
  - Data Organization: Directory structure and naming conventions

Types of Data Sources

1. Persistent Volume Claims (PVCs)

Dataset Storage:

Name: training-datasets
Type: PVC
PVC Name: shared-datasets-pvc
Mount Path: /data/datasets
Access Mode: ReadOnlyMany

Model Storage:

Name: model-artifacts
Type: PVC  
PVC Name: model-storage-pvc
Mount Path: /data/models
Access Mode: ReadWriteMany

2. Git Repositories

Code Repositories:

Name: training-code
Type: Git
Repository: https://github.com/company/ml-training
Branch: main
Mount Path: /workspace/code

Configuration Repositories:

Name: experiment-configs
Type: Git
Repository: https://github.com/company/ml-configs
Branch: production
Mount Path: /workspace/configs

3. Cloud Storage

S3 Buckets:

Name: s3-datasets
Type: S3
Bucket: company-ml-data
Prefix: datasets/
Mount Path: /data/cloud
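
As a quick sanity check from inside a workload, a minimal Python sketch like the following can confirm that the mounted bucket data is readable. The mount path /data/cloud is taken from the example above, and the script name is illustrative; adjust both to your own data source.

# check_cloud_mount.py -- illustrative sketch; mount path comes from the example above
import os

MOUNT_PATH = "/data/cloud"  # where the S3 data source is mounted (assumption)

def summarize_mount(path):
    """Walk the mounted prefix and report file count and total size."""
    total_files, total_bytes = 0, 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total_files += 1
            total_bytes += os.path.getsize(os.path.join(root, name))
    print(f"{path}: {total_files} files, {total_bytes / 1e9:.2f} GB")

if __name__ == "__main__":
    if os.path.isdir(MOUNT_PATH):
        summarize_mount(MOUNT_PATH)
    else:
        print(f"Mount path {MOUNT_PATH} not found -- check the data source configuration")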

Creating Data Source Assets

Method 1: Using the UI

  1. Navigate to Assets → Data Sources
  2. Click "+ NEW DATA SOURCE"
  3. Configure the data source:

PVC Data Source:

Name: shared-training-data
Description: Shared training datasets for all projects
Type: PVC
PVC Name: training-data-pvc
Mount Path: /data/training
Access Mode: ReadOnlyMany
Scope: Project

Method 2: Using YAML

Create training-data-source.yaml:

apiVersion: run.ai/v1
kind: DataSource
metadata:
  name: shared-training-data
  namespace: runai-<project-name>
spec:
  pvc:
    name: training-data-pvc
    mountPath: /data/training
    accessMode: ReadOnlyMany
    subPath: ""

Apply with:

kubectl apply -f training-data-source.yaml

Common Data Source Examples

1. Dataset Collections

Image Datasets:

Name: vision-datasets
Type: PVC
PVC: vision-data-pvc
Mount Path: /data/images
Structure:
  - /data/images/imagenet/
  - /data/images/cifar10/
  - /data/images/coco/

Text Datasets:

Name: nlp-datasets  
Type: PVC
PVC: text-data-pvc
Mount Path: /data/text
Structure:
  - /data/text/wikipedia/
  - /data/text/books/
  - /data/text/news/

2. Model Repositories

Pretrained Models:

Name: pretrained-models
Type: PVC
PVC: model-hub-pvc
Mount Path: /models/pretrained
Access: ReadOnlyMany
Structure:
  - /models/pretrained/bert/
  - /models/pretrained/resnet/
  - /models/pretrained/gpt/

Experiment Outputs:

Name: experiment-results
Type: PVC
PVC: results-pvc
Mount Path: /results
Access: ReadWriteMany
Structure:
  - /results/experiments/
  - /results/checkpoints/
  - /results/logs/

3. Code and Configuration

Training Scripts:

Name: ml-training-repo
Type: Git
Repository: https://github.com/company/ml-training
Branch: main
Mount Path: /workspace/training
Files:
  - train.py
  - evaluate.py
  - utils/

Experiment Configurations:

Name: hyperparameter-configs
Type: Git
Repository: https://github.com/company/ml-configs
Branch: production
Mount Path: /workspace/configs
Files:
  - experiments/
  - hyperparams/
  - pipelines/

Setting Up PVC Data Sources

1. Create Persistent Volume Claims

Large Dataset Storage:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: large-datasets-pvc
  namespace: runai-<project-name>
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Ti
  storageClassName: nfs-storage

Fast SSD Storage:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fast-cache-pvc
  namespace: runai-<project-name>
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: ssd-storage

2. Data Organization

Recommended Structure:

/data/
├── datasets/
│   ├── raw/           # Original, unprocessed data
│   ├── processed/     # Cleaned and preprocessed
│   └── splits/        # Train/val/test splits
├── models/
│   ├── pretrained/    # Downloaded base models
│   ├── checkpoints/   # Training checkpoints
│   └── final/         # Production-ready models
└── experiments/
    ├── logs/          # Training logs
    ├── metrics/       # Evaluation results
    └── artifacts/     # Model artifacts
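
If you are starting from an empty volume, a short Python sketch can scaffold this layout. The paths below mirror the recommended structure and assume the volume is mounted read-write at /data; the script name is illustrative.

# scaffold_layout.py -- illustrative helper for creating the recommended layout
import os

LAYOUT = [
    "datasets/raw", "datasets/processed", "datasets/splits",
    "models/pretrained", "models/checkpoints", "models/final",
    "experiments/logs", "experiments/metrics", "experiments/artifacts",
]

def scaffold(base="/data"):
    """Create the recommended directory tree under the mounted volume."""
    for rel in LAYOUT:
        path = os.path.join(base, rel)
        os.makedirs(path, exist_ok=True)  # no-op if the directory already exists
        print(f"ensured {path}")

if __name__ == "__main__":
    scaffold()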

Using Data Source Assets

In Workload Creation

Via UI:

  1. Create new workload
  2. Data Sources section → Add data source assets
  3. Verify mount paths don't conflict

Via CLI:

runai submit "training-job" \
    --data-source shared-training-data \
    --data-source model-artifacts \
    --image pytorch/pytorch:latest

Multiple Data Sources

Mount Multiple Sources:

runai submit "comprehensive-training" \
    --data-source training-datasets:/data/train \
    --data-source validation-datasets:/data/val \
    --data-source pretrained-models:/models \
    --data-source experiment-configs:/configs \
    --gpu 1
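
Before training starts, it is worth failing fast if any expected mount is missing. The following Python sketch checks the mount paths used in the example above; the script name is illustrative, and the list should be adjusted to match your own data sources.

# preflight_mounts.py -- illustrative pre-flight check for the mounts above
import os
import sys

EXPECTED_MOUNTS = ["/data/train", "/data/val", "/models", "/configs"]

missing = [p for p in EXPECTED_MOUNTS if not os.path.isdir(p)]
if missing:
    sys.exit(f"Missing expected mounts: {', '.join(missing)}")
print("All expected data sources are mounted")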

Data Source with Subpaths

Access Specific Subdirectories:

runai submit "specific-dataset" \
    --pvc training-data-pvc:/data/datasets/imagenet \
    --pvc model-storage-pvc:/models/resnet \
    --gpu 1

Git Repository Integration

1. Public Repositories

Clone Public Repo:

apiVersion: run.ai/v1
kind: DataSource
metadata:
  name: public-ml-repo
spec:
  git:
    repository: https://github.com/pytorch/examples
    branch: main
    mountPath: /workspace/examples

2. Private Repositories

Using SSH Keys:

apiVersion: run.ai/v1
kind: DataSource
metadata:
  name: private-ml-repo
spec:
  git:
    repository: git@github.com:company/private-ml.git
    branch: main
    mountPath: /workspace/private
    credentials:
      secretName: git-ssh-credentials

Using Access Tokens:

apiVersion: run.ai/v1
kind: DataSource
metadata:
  name: token-auth-repo
spec:
  git:
    repository: https://github.com/company/ml-project.git
    branch: development
    mountPath: /workspace/dev
    credentials:
      secretName: git-token-secret

Data Preprocessing Workflows

1. Data Pipeline Setup

Raw to Processed Pipeline:

# Submit preprocessing job
runai submit "data-preprocessing" \
    --data-source raw-datasets:/data/raw \
    --data-source processed-datasets:/data/processed \
    --image python:3.9 \
    --command "python preprocess.py --input /data/raw --output /data/processed" \
    --cpu 4 \
    --memory 16Gi
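
The preprocess.py referenced above is your own script. A minimal sketch of what it might look like, assuming the raw data is a flat directory of files and the copy step stands in for real cleaning logic:

# preprocess.py -- illustrative sketch of the script referenced in the command above
import argparse
import os
import shutil

def preprocess(input_dir, output_dir):
    """Copy raw files into the processed directory; replace the copy with real preprocessing."""
    os.makedirs(output_dir, exist_ok=True)
    for name in os.listdir(input_dir):
        src = os.path.join(input_dir, name)
        dst = os.path.join(output_dir, name)
        if os.path.isfile(src):
            shutil.copy2(src, dst)  # placeholder for actual cleaning/transformation
    print(f"Processed {len(os.listdir(output_dir))} files into {output_dir}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args()
    preprocess(args.input, args.output)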

2. Data Validation

Validate Data Quality:

# data_validation.py
import os

def validate_dataset(data_path):
    """Validate dataset structure and content"""

    required_dirs = ['train', 'val', 'test']

    for split in required_dirs:
        split_path = os.path.join(data_path, split)
        if not os.path.exists(split_path):
            print(f"Missing required directory: {split}")
            return False

        file_count = len(os.listdir(split_path))
        print(f"{split}: {file_count} files")

    return True

# Usage in workload
if __name__ == "__main__":
    data_path = "/data/datasets/imagenet"
    is_valid = validate_dataset(data_path)
    print(f"Dataset valid: {is_valid}")

Best Practices

1. Data Organization

Consistent Naming:

# Good structure
/data/datasets/imagenet-2012/train/
/data/datasets/imagenet-2012/val/
/data/models/resnet50/pretrained/
/data/models/resnet50/checkpoints/

# Avoid inconsistent naming
/data/ImageNet/Training/
/data/models/ResNet-50-pretrained/

2. Access Patterns

Read-Only for Datasets:

# Shared datasets should be read-only
datasets-pvc:
  accessMode: ReadOnlyMany
  mountPath: /data/datasets

Read-Write for Outputs:

# Results and checkpoints need write access
results-pvc:
  accessMode: ReadWriteMany  
  mountPath: /results

3. Performance Optimization

Storage Class Selection:

# For frequently accessed data
storageClassName: ssd-fast

# For archive/backup data  
storageClassName: hdd-bulk

# For shared datasets
storageClassName: nfs-shared

Data Locality:

# Prefer local storage for intensive I/O
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
    - matchExpressions:
      - key: storage-type
        operator: In
        values: ["local-ssd"]

Monitoring Data Usage

1. Storage Monitoring

Check PVC Usage:

# View PVC status
kubectl get pvc -n runai-<project>

# Check storage consumption
kubectl describe pvc <pvc-name> -n runai-<project>

Monitor Data Transfer:

# Check mount status in workload
runai exec <workload-name> -- df -h

# Verify data accessibility
runai exec <workload-name> -- ls -la /data/datasets

2. Performance Metrics

I/O Performance:

# Test read performance
runai exec <workload-name> -- dd if=/data/datasets/large_file of=/dev/null bs=1M count=1000

# Test write performance  
runai exec <workload-name> -- dd if=/dev/zero of=/results/test_file bs=1M count=100
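
For a language-level view of the same numbers, a small Python sketch can time sequential reads from a mounted dataset. The file path below is a placeholder and should point at a real file on your volume.

# read_benchmark.py -- illustrative sequential-read timing; the path is a placeholder
import time

PATH = "/data/datasets/large_file"  # replace with a real file on the mounted volume
CHUNK = 1024 * 1024                 # read in 1 MiB chunks

start = time.time()
total = 0
with open(PATH, "rb") as f:
    while True:
        chunk = f.read(CHUNK)
        if not chunk:
            break
        total += len(chunk)
elapsed = max(time.time() - start, 1e-9)  # guard against a zero-length timing window
print(f"Read {total / 1e6:.0f} MB in {elapsed:.1f}s ({total / 1e6 / elapsed:.0f} MB/s)")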

Troubleshooting

Common Issues

Mount Failures:

# Check PVC status
kubectl get pvc -n runai-<project>

# Verify storage class
kubectl get storageclass

# Check pod events
kubectl describe pod <pod-name> -n runai-<project>

Permission Errors:

# Check mount permissions
runai exec <workload-name> -- ls -la /data

# Fix permissions (if needed)
runai exec <workload-name> -- chmod -R 755 /data/writable

Git Clone Issues:

# Test git access
runai exec <workload-name> -- git ls-remote <repository-url>

# Check credentials
kubectl get secret <git-secret> -n runai-<project>

Debugging Commands

Verify Data Sources:

# List mounted volumes
runai exec <workload-name> -- mount | grep /data

# Check data availability
runai exec <workload-name> -- find /data -type f | head -10

# Test data access
runai exec <workload-name> -- python -c "import os; print(os.listdir('/data/datasets'))"

Next Steps

  1. Plan Data Architecture: Design storage layout for your workflows
  2. Create Base Data Sources: Set up commonly used datasets
  3. Implement Data Pipelines: Automate preprocessing and validation
  4. Monitor Usage: Track storage consumption and performance