Tags: dstackai/dstack

0.20.14

[Azure] Add support for H100 NVL and H200 VM series; refactor instance creation methods to cleanup failed instances (#3699)

Co-authored-by: Andrey Cheptsov <andrey.cheptsov@github.com>

0.20.13

Fix CLI compatibility with older servers (#3664)

0.20.12

Add Crusoe Cloud backend (#3602)

* Add Crusoe Cloud backend

Add a VM-based Crusoe Cloud backend supporting single-node and
multi-node (cluster) provisioning with InfiniBand.

Key features:
- gpuhunt online provider for offers with project quota filtering
- HMAC-SHA256 authenticated REST API client
- Image selection based on GPU type (SXM/PCIe/ROCm/CPU)
- Storage: persistent data disk for types without ephemeral NVMe;
  auto-detects and RAID-0s NVMe for types with ephemeral storage;
  moves containerd storage so containers get the full disk space
- Cluster support via IB partitions
- Two-phase termination with data disk cleanup
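
The HMAC-SHA256 request authentication listed above can be illustrated with a minimal, generic sketch. The canonical string layout (method, path, timestamp) and the names `sign_request` and `verify` are assumptions made for illustration; they are not taken from the Crusoe API or from the dstack client.

```python
import base64
import hashlib
import hmac


def sign_request(secret: bytes, method: str, path: str, timestamp: str) -> str:
    """Return a base64-encoded HMAC-SHA256 signature for a request."""
    # Hypothetical canonical form: a real API defines its own fields and order.
    payload = "\n".join([method.upper(), path, timestamp]).encode()
    digest = hmac.new(secret, payload, hashlib.sha256).digest()
    return base64.b64encode(digest).decode()


def verify(secret: bytes, method: str, path: str, timestamp: str, signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_request(secret, method, path, timestamp)
    return hmac.compare_digest(expected, signature)
```

Including a timestamp in the signed payload is a common way to limit replay of captured requests; whether the actual client does this is not stated in the commit message.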

Tested end-to-end:
- L40S: fleet, dev env, GPU, configurable disk (200GB), clean termination
- A100-PCIe: fleet, dev env, GPU, NVMe auto-mount (880GB), clean termination
- A100-SXM-IB cluster: IB partition created, 1 node provisioned with IB
  and 8x NVMe RAID-0 (7TB); 2nd node failed on capacity (out_of_stock)
- Offers: quota enforcement, disk sizes correct per instance type

Not tested (no capacity/quota):
- H100-SXM-IB, MI300X-IB, MI355X-RoCE (no hardware available)
- CPU-only instances c1a/s1a (no quota)
- Spot provisioning (disabled in gpuhunt, see TODO)
- Full 2-node cluster with IB connectivity test

TODOs:
- Spot: disabled until Crusoe confirms how to request spot billing
  via the VM create API endpoint
- gpuhunt dependency: currently installed from PR branch; switch to
  pinned version after gpuhunt PR #211 is merged and released

AI Assistance: This implementation was developed with AI assistance.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fetch Crusoe locations dynamically instead of hardcoding

Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix VM image selection for SXM instance types

The _get_image function checked gpu_type (e.g. 'A100') for 'SXM', but
gpuhunt normalizes GPU names and strips the SXM qualifier. Check the
instance type name instead (e.g. 'a100-80gb-sxm-ib.8x') which
preserves the '-sxm' indicator.

Without this fix, SXM-IB instances used the PCIe docker image which
lacks IB drivers, HPC-X, and NCCL topology files. Verified with a
2-node A100-SXM-IB NCCL all_reduce test: 193 GB/s bus bandwidth.
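
A minimal sketch of the corrected check, assuming a simplified helper: `select_image_kind` and its return values are hypothetical and do not reproduce the real `_get_image` signature. The rule that GPU names starting with "MI" map to a ROCm image is also an assumption made for illustration.

```python
from typing import Optional


def select_image_kind(instance_type: str, gpu_name: Optional[str]) -> str:
    """Pick an image flavor from the instance type name, not the GPU name."""
    if gpu_name is None:
        return "cpu"
    # Assumption: AMD Instinct GPUs (e.g. MI300X) need a ROCm image.
    if gpu_name.upper().startswith("MI"):
        return "rocm"
    # The normalized GPU name (e.g. 'A100') has lost the SXM qualifier,
    # so look for '-sxm' in the instance type name instead.
    if "-sxm" in instance_type.lower():
        return "sxm"
    return "pcie"
```

With this check, `select_image_kind("a100-80gb-sxm-ib.8x", "A100")` selects the SXM flavor, whereas testing `"SXM" in "A100"` would always fall through to PCIe.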

Made-with: Cursor

* Switch gpuhunt dependency from PR branch to main

Made-with: Cursor

* Add TODOs to pin gpuhunt and remove allow-direct-references before merging

Made-with: Cursor

* Pin gpuhunt==0.1.17 (matches master)

Made-with: Cursor

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

0.20.11

Fix concurrent indexes migration (#3591)

0.20.10

[runner] Check if repo dir exists before chown (#3589)

The check avoids the following log message when
no repo is specified or the repo is empty:

> Error while walking repo dir path=/workflow err=lstat /workflow:
> no such file or directory

In addition, the log level of walk/chown errors is changed to
warning to highlight possible issues.
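
The runner itself is not written in Python; the following is only a Python sketch of the logic described above, assuming a Unix system. The function name `chown_repo_dir`, the logger name, and the injectable `chown` parameter are hypothetical and exist to keep the example self-contained.

```python
import logging
import os
from typing import Callable

logger = logging.getLogger("runner_sketch")


def chown_repo_dir(
    path: str,
    uid: int,
    gid: int,
    chown: Callable[[str, int, int], None] = os.chown,
) -> bool:
    """Chown a repo dir recursively; return False if the dir does not exist."""
    # Skip silently when there is no repo dir, instead of logging an lstat error.
    if not os.path.isdir(path):
        return False

    def on_walk_error(err: OSError) -> None:
        logger.warning("Error while walking repo dir path=%s err=%s", path, err)

    for root, _dirs, files in os.walk(path, onerror=on_walk_error):
        # Each directory is visited once as `root`, so chown it there.
        for target in [root, *(os.path.join(root, f) for f in files)]:
            try:
                chown(target, uid, gid)
            except OSError as err:
                # Log at warning level and continue with the remaining entries.
                logger.warning("Error while chowning path=%s err=%s", target, err)
    return True
```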

0.20.9

Don't terminate unreachable SSH instances (#3568)

Fixes: #2531

0.20.8

Fix `probes=None` server incompatibility (#3543)

This fixes server compatibility with clients older
than 0.20.8, which don't support `probes=None`, by
replacing `None` with `[]` in responses sent to
those clients. The incompatibility appeared when
new and old clients shared a project: old clients
failed when viewing runs submitted by new clients.

0.20.7

Fix scaling during update to replica groups (#3510)

This fix prevents the number of replicas from
dropping to `replicas.min` for all existing
services during a server update to 0.20.7.

0.20.6

Hotfix. Fixed generation of fleet fields in project forms (#3486)

0.20.5

Add missing Box imports (#3485)