Tags: dstackai/dstack
Add Crusoe Cloud backend (#3602)

* Add Crusoe Cloud backend

Add a VM-based Crusoe Cloud backend supporting single-node and multi-node (cluster) provisioning with InfiniBand.

Key features:
- gpuhunt online provider for offers with project quota filtering
- HMAC-SHA256 authenticated REST API client
- Image selection based on GPU type (SXM/PCIe/ROCm/CPU)
- Storage: persistent data disk for types without ephemeral NVMe; auto-detects and RAID-0s NVMe for types with ephemeral storage; moves containerd storage so containers get the full disk space
- Cluster support via IB partitions
- Two-phase termination with data disk cleanup

Tested end-to-end:
- L40S: fleet, dev env, GPU, configurable disk (200GB), clean termination
- A100-PCIe: fleet, dev env, GPU, NVMe auto-mount (880GB), clean termination
- A100-SXM-IB cluster: IB partition created, 1 node provisioned with IB and 8x NVMe RAID-0 (7TB); 2nd node failed on capacity (out_of_stock)
- Offers: quota enforcement, disk sizes correct per instance type

Not tested (no capacity/quota):
- H100-SXM-IB, MI300X-IB, MI355X-RoCE (no hardware available)
- CPU-only instances c1a/s1a (no quota)
- Spot provisioning (disabled in gpuhunt, see TODO)
- Full 2-node cluster with IB connectivity test

TODOs:
- Spot: disabled until Crusoe confirms how to request spot billing via the VM create API endpoint
- gpuhunt dependency: currently installed from PR branch; switch to pinned version after gpuhunt PR #211 is merged and released

AI Assistance: This implementation was developed with AI assistance.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fetch Crusoe locations dynamically instead of hardcoding

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix VM image selection for SXM instance types

The _get_image function checked gpu_type (e.g. 'A100') for 'SXM', but gpuhunt normalizes GPU names and strips the SXM qualifier. Check the instance type name instead (e.g. 'a100-80gb-sxm-ib.8x') which preserves the '-sxm' indicator. Without this fix, SXM-IB instances used the PCIe docker image which lacks IB drivers, HPC-X, and NCCL topology files.

Verified with a 2-node A100-SXM-IB NCCL all_reduce test: 193 GB/s bus bandwidth.

Made-with: Cursor

* Switch gpuhunt dependency from PR branch to main

Made-with: Cursor

* Add TODOs to pin gpuhunt and remove allow-direct-references before merging

Made-with: Cursor

* Pin gpuhunt==0.1.17 (matches master)

Made-with: Cursor

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
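
The SXM image-selection fix described above can be illustrated with a minimal sketch. It is not the backend's actual `_get_image` code: the function names, the image labels, and the PCIe instance type name used in the usage note are hypothetical, and only the SXM/PCIe distinction is shown. It contrasts matching on the normalized GPU name (which no longer contains "SXM") with matching on the instance type name.

```python
# Hypothetical image labels, used only for illustration.
SXM_IMAGE = "example-sxm-image"
PCIE_IMAGE = "example-pcie-image"


def pick_image_by_gpu_name(gpu_name: str) -> str:
    """Variant with the bug: a normalized GPU name such as 'A100' has no 'SXM'."""
    return SXM_IMAGE if "SXM" in gpu_name.upper() else PCIE_IMAGE


def pick_image_by_instance_type(instance_type: str) -> str:
    """Variant after the fix: the instance type name keeps the '-sxm' marker."""
    return SXM_IMAGE if "-sxm" in instance_type.lower() else PCIE_IMAGE
```

With these definitions, `pick_image_by_gpu_name("A100")` falls through to the PCIe label, while `pick_image_by_instance_type("a100-80gb-sxm-ib.8x")` selects the SXM label.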
[runner] Check if repo dir exists before chown (#3589)

The check is added to avoid the following log message when no repo is specified or the repo is empty:

> Error while walking repo dir path=/workflow err=lstat /workflow: no such file or directory

In addition, the log level of walk/chown errors is changed to warning to highlight possible issues.
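
The pattern described above (skip the walk when the directory is missing, and log chown failures as warnings) might look roughly like the following. This is a Python sketch of the idea only, not the runner's own code; the function name `chown_repo_dir` and the logger name are assumptions.

```python
import logging
import os

logger = logging.getLogger("runner_sketch")  # hypothetical logger name


def chown_repo_dir(repo_dir: str, uid: int, gid: int) -> int:
    """Chown repo_dir recursively; return the number of paths changed."""
    # No repo specified or empty repo: the directory may not exist at all,
    # so return early instead of producing a walk error.
    if not os.path.isdir(repo_dir):
        return 0
    changed = 0
    for root, _dirs, files in os.walk(repo_dir):
        # Each directory is visited once as `root`, so chown it plus its files.
        for path in [root] + [os.path.join(root, name) for name in files]:
            try:
                os.chown(path, uid, gid)
                changed += 1
            except OSError as err:
                # Warning level keeps genuine permission problems visible.
                logger.warning("chown failed path=%s err=%s", path, err)
    return changed
```

Calling the function on a path that does not exist returns 0 and logs nothing, which corresponds to the silenced message in the commit.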
Fix `probes=None` server incompatibility (#3543)

This fixes server compatibility with clients prior to 0.20.8 that don't support `probes=None`, by replacing `None` with `[]` in responses for older clients. The incompatibility appeared when new and old clients shared a project: old clients failed when viewing runs submitted by new clients.
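
A minimal sketch of this kind of response patching follows. The function name, the plain-dict shape of the run spec, the version tuple, and the choice to leave responses untouched when the client version is unknown are all assumptions made for illustration, not the server's actual code.

```python
from typing import Optional, Tuple

Version = Tuple[int, int, int]

# Clients before this version are described as not supporting probes=None.
FIRST_VERSION_WITH_PROBES_NONE: Version = (0, 20, 8)


def patch_probes_for_client(run_spec: dict, client_version: Optional[Version]) -> dict:
    """Return a copy of run_spec that an older client can parse."""
    patched = dict(run_spec)  # shallow copy; the stored spec is not mutated
    # Assumption: an unknown client version (None) is treated as up to date.
    is_old = (
        client_version is not None
        and client_version < FIRST_VERSION_WITH_PROBES_NONE
    )
    if is_old and "probes" in patched and patched["probes"] is None:
        patched["probes"] = []
    return patched
```

Patching a copy at response time keeps the stored value as `None`, so newer clients still see what was originally submitted.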