Data Source Assets¶
Data source assets provide standardized access to datasets and storage systems. They simplify data mounting and ensure consistent data access patterns across workloads.
What are Data Source Assets?¶
Data sources define:

- Storage Connections: PVCs, NFS, Git repositories, S3 buckets
- Mount Paths: Where data appears in your containers
- Access Modes: Read-only or read-write permissions
- Data Organization: Directory structure and naming conventions
Types of Data Sources¶
1. Persistent Volume Claims (PVCs)¶
Dataset Storage:
Name: training-datasets
Type: PVC
PVC Name: shared-datasets-pvc
Mount Path: /data/datasets
Access Mode: ReadOnlyMany
Model Storage:
Name: model-artifacts
Type: PVC
PVC Name: model-storage-pvc
Mount Path: /data/models
Access Mode: ReadWriteMany
2. Git Repositories¶
Code Repositories:
Name: training-code
Type: Git
Repository: https://github.com/company/ml-training
Branch: main
Mount Path: /workspace/code
Configuration Repositories:
Name: experiment-configs
Type: Git
Repository: https://github.com/company/ml-configs
Branch: production
Mount Path: /workspace/configs
3. Cloud Storage¶
S3 Buckets:
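The exact fields for S3-backed data sources depend on your Run:ai version, so the summary below is illustrative only; the bucket name, mount path, and credential reference are placeholders to replace with your own values:

Name: public-datasets-s3
Type: S3
Bucket: company-ml-datasets
Mount Path: /data/s3
Credentials: s3-read-credentials
Access Mode: Read-only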
Creating Data Source Assets¶
Method 1: Using the UI¶
- Navigate to Assets → Data Sources
- Click "+ NEW DATA SOURCE"
- Configure the data source:
PVC Data Source:
Name: shared-training-data
Description: Shared training datasets for all projects
Type: PVC
PVC Name: training-data-pvc
Mount Path: /data/training
Access Mode: ReadOnlyMany
Scope: Project
Method 2: Using YAML¶
Create training-data-source.yaml:
apiVersion: run.ai/v1
kind: DataSource
metadata:
  name: shared-training-data
  namespace: runai-<project-name>
spec:
  pvc:
    name: training-data-pvc
    mountPath: /data/training
    accessMode: ReadOnlyMany
    subPath: ""
Apply with:
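kubectl apply -f training-data-source.yaml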
Common Data Source Examples¶
1. Dataset Collections¶
Image Datasets:
Name: vision-datasets
Type: PVC
PVC: vision-data-pvc
Mount Path: /data/images
Structure:
- /data/images/imagenet/
- /data/images/cifar10/
- /data/images/coco/
Text Datasets:
Name: nlp-datasets
Type: PVC
PVC: text-data-pvc
Mount Path: /data/text
Structure:
- /data/text/wikipedia/
- /data/text/books/
- /data/text/news/
2. Model Repositories¶
Pretrained Models:
Name: pretrained-models
Type: PVC
PVC: model-hub-pvc
Mount Path: /models/pretrained
Access: ReadOnlyMany
Structure:
- /models/pretrained/bert/
- /models/pretrained/resnet/
- /models/pretrained/gpt/
Experiment Outputs:
Name: experiment-results
Type: PVC
PVC: results-pvc
Mount Path: /results
Access: ReadWriteMany
Structure:
- /results/experiments/
- /results/checkpoints/
- /results/logs/
3. Code and Configuration¶
Training Scripts:
Name: ml-training-repo
Type: Git
Repository: https://github.com/company/ml-training
Branch: main
Mount Path: /workspace/training
Files:
- train.py
- evaluate.py
- utils/
Experiment Configurations:
Name: hyperparameter-configs
Type: Git
Repository: https://github.com/company/ml-configs
Branch: production
Mount Path: /workspace/configs
Files:
- experiments/
- hyperparams/
- pipelines/
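Once these Git-backed assets exist, attach them to a workload by name just as you would a PVC-backed source. A minimal sketch (the workload name and image are illustrative):

runai submit "config-driven-training" \
  --data-source ml-training-repo \
  --data-source hyperparameter-configs \
  --image pytorch/pytorch:latest \
  --gpu 1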
Setting Up PVC Data Sources¶
1. Create Persistent Volume Claims¶
Large Dataset Storage:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: large-datasets-pvc
  namespace: runai-<project-name>
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Ti
  storageClassName: nfs-storage
Fast SSD Storage:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fast-cache-pvc
  namespace: runai-<project-name>
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: ssd-storage
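Save each manifest to its own file and create the claims before defining the data source assets that reference them (the file names here are illustrative):

kubectl apply -f large-datasets-pvc.yaml
kubectl apply -f fast-cache-pvc.yaml
kubectl get pvc -n runai-<project-name>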
2. Data Organization¶
Recommended Structure:
/data/
├── datasets/
│   ├── raw/          # Original, unprocessed data
│   ├── processed/    # Cleaned and preprocessed
│   └── splits/       # Train/val/test splits
├── models/
│   ├── pretrained/   # Downloaded base models
│   ├── checkpoints/  # Training checkpoints
│   └── final/        # Production-ready models
└── experiments/
    ├── logs/         # Training logs
    ├── metrics/      # Evaluation results
    └── artifacts/    # Model artifacts
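To bootstrap this layout on a newly provisioned volume, create the directories once from any workload that mounts the volume with write access; a minimal sketch, assuming it is mounted at /data:

mkdir -p /data/datasets/{raw,processed,splits} \
         /data/models/{pretrained,checkpoints,final} \
         /data/experiments/{logs,metrics,artifacts}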
Using Data Source Assets¶
In Workload Creation¶
Via UI:

1. Create new workload
2. Data Sources section → Add data source assets
3. Verify mount paths don't conflict
Via CLI:
runai submit "training-job" \
  --data-source shared-training-data \
  --data-source model-artifacts \
  --image pytorch/pytorch:latest
Multiple Data Sources¶
Mount Multiple Sources:
runai submit "comprehensive-training" \
  --data-source training-datasets:/data/train \
  --data-source validation-datasets:/data/val \
  --data-source pretrained-models:/models \
  --data-source experiment-configs:/configs \
  --gpu 1
Data Source with Subpaths¶
Access Specific Subdirectories:
runai submit "specific-dataset" \
  --pvc training-data-pvc:/data/datasets/imagenet \
  --pvc model-storage-pvc:/models/resnet \
  --gpu 1
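The same effect can be captured in the data source asset itself through the subPath field shown earlier, so workloads only ever see the slice of the volume they need. A sketch reusing this page's PVC spec format (the asset name is illustrative):

apiVersion: run.ai/v1
kind: DataSource
metadata:
  name: imagenet-only
  namespace: runai-<project-name>
spec:
  pvc:
    name: training-data-pvc
    mountPath: /data/imagenet
    accessMode: ReadOnlyMany
    subPath: datasets/imagenet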
Git Repository Integration¶
1. Public Repositories¶
Clone Public Repo:
apiVersion: run.ai/v1
kind: DataSource
metadata:
  name: public-ml-repo
spec:
  git:
    repository: https://github.com/pytorch/examples
    branch: main
    mountPath: /workspace/examples
2. Private Repositories¶
Using SSH Keys:
apiVersion: run.ai/v1
kind: DataSource
metadata:
  name: private-ml-repo
spec:
  git:
    repository: git@github.com:company/private-ml.git
    branch: main
    mountPath: /workspace/private
    credentials:
      secretName: git-ssh-credentials
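The referenced secret must already exist in the project namespace. A sketch using the standard Kubernetes SSH-auth secret type (the key name Run:ai expects may differ, and the key path is illustrative):

kubectl create secret generic git-ssh-credentials \
  --namespace runai-<project-name> \
  --type=kubernetes.io/ssh-auth \
  --from-file=ssh-privatekey=$HOME/.ssh/id_ed25519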
Using Access Tokens:
apiVersion: run.ai/v1
kind: DataSource
metadata:
  name: token-auth-repo
spec:
  git:
    repository: https://github.com/company/ml-project.git
    branch: development
    mountPath: /workspace/dev
    credentials:
      secretName: git-token-secret
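Token-based secrets can be created from literals; the key names below (user, password) are assumptions, so align them with what your Run:ai version expects:

kubectl create secret generic git-token-secret \
  --namespace runai-<project-name> \
  --from-literal=user=<git-username> \
  --from-literal=password=<personal-access-token>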
Data Preprocessing Workflows¶
1. Data Pipeline Setup¶
Raw to Processed Pipeline:
# Submit preprocessing job
runai submit "data-preprocessing" \
  --data-source raw-datasets:/data/raw \
  --data-source processed-datasets:/data/processed \
  --image python:3.9 \
  --command "python preprocess.py --input /data/raw --output /data/processed" \
  --cpu 4 \
  --memory 16Gi
2. Data Validation¶
Validate Data Quality:
# data_validation.py
import os

def validate_dataset(data_path):
    """Validate dataset structure and content"""
    required_dirs = ['train', 'val', 'test']
    for split in required_dirs:
        split_path = os.path.join(data_path, split)
        if not os.path.exists(split_path):
            print(f"Missing required directory: {split}")
            return False
        file_count = len(os.listdir(split_path))
        print(f"{split}: {file_count} files")
    return True

# Usage in workload
if __name__ == "__main__":
    data_path = "/data/datasets/imagenet"
    is_valid = validate_dataset(data_path)
    print(f"Dataset valid: {is_valid}")
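The check can run as a lightweight CPU-only workload against the mounted dataset; a sketch, assuming data_validation.py has been added to the training code repository mounted at /workspace/training:

runai submit "validate-imagenet" \
  --data-source training-datasets \
  --data-source ml-training-repo \
  --image python:3.9 \
  --command "python /workspace/training/data_validation.py" \
  --cpu 2 \
  --memory 4Gi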
Best Practices¶
1. Data Organization¶
Consistent Naming:
# Good structure
/data/datasets/imagenet-2012/train/
/data/datasets/imagenet-2012/val/
/data/models/resnet50/pretrained/
/data/models/resnet50/checkpoints/
# Avoid inconsistent naming
/data/ImageNet/Training/
/data/models/ResNet-50-pretrained/
2. Access Patterns¶
Read-Only for Datasets:
# Shared datasets should be read-only
datasets-pvc:
  accessMode: ReadOnlyMany
  mountPath: /data/datasets
Read-Write for Outputs:
# Results and checkpoints need write access
results-pvc:
  accessMode: ReadWriteMany
  mountPath: /results
3. Performance Optimization¶
Storage Class Selection:
# For frequently accessed data
storageClassName: ssd-fast
# For archive/backup data
storageClassName: hdd-bulk
# For shared datasets
storageClassName: nfs-shared
Data Locality:
# Prefer local storage for intensive I/O
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
      - matchExpressions:
          - key: storage-type
            operator: In
            values: ["local-ssd"]
Monitoring Data Usage¶
1. Storage Monitoring¶
Check PVC Usage:
# View PVC status
kubectl get pvc -n runai-<project>
# Check storage consumption
kubectl describe pvc <pvc-name> -n runai-<project>
Monitor Data Transfer:
# Check mount status in workload
runai exec <workload-name> -- df -h
# Verify data accessibility
runai exec <workload-name> -- ls -la /data/datasets
2. Performance Metrics¶
I/O Performance:
# Test read performance
runai exec <workload-name> -- dd if=/data/datasets/large_file of=/dev/null bs=1M count=1000
# Test write performance
runai exec <workload-name> -- dd if=/dev/zero of=/results/test_file bs=1M count=100
Troubleshooting¶
Common Issues¶
Mount Failures:
# Check PVC status
kubectl get pvc -n runai-<project>
# Verify storage class
kubectl get storageclass
# Check pod events
kubectl describe pod <pod-name> -n runai-<project>
Permission Errors:
# Check mount permissions
runai exec <workload-name> -- ls -la /data
# Fix permissions (if needed)
runai exec <workload-name> -- chmod -R 755 /data/writable
Git Clone Issues:
# Test git access
runai exec <workload-name> -- git ls-remote <repository-url>
# Check credentials
kubectl get secret <git-secret> -n runai-<project>
Debugging Commands¶
Verify Data Sources:
# List mounted volumes
runai exec <workload-name> -- mount | grep /data
# Check data availability
runai exec <workload-name> -- find /data -type f | head -10
# Test data access
runai exec <workload-name> -- python -c "import os; print(os.listdir('/data/datasets'))"
Next Steps¶
- Plan Data Architecture: Design storage layout for your workflows
- Create Base Data Sources: Set up commonly used datasets
- Implement Data Pipelines: Automate preprocessing and validation
- Monitor Usage: Track storage consumption and performance
Related Assets¶
- Environments - Include data processing tools
- Compute Resources - Match I/O requirements with resources
- Credentials - Secure access to private data sources