Add per-instance data transfer metering #3685

Open
peterschmidt85 wants to merge 8 commits into master from feature/data-transfer-quota
peterschmidt85 (Contributor) commented Mar 23, 2026

Summary

Adds per-instance outbound data transfer metering to track billable network traffic.

  • The shim starts an iptables-based meter (dstack-nm chain) at boot that counts outbound bytes to external IPs, excluding private/VPC traffic (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 169.254.0.0/16)
  • Cumulative bytes are reported via GET /api/instance/health and stored on InstanceModel.data_transfer_bytes
  • Server reads bytes during periodic health checks (~60s) and captures a final reading before instance termination
  • Exposed via the Instance API for downstream billing integration
  • Metering is best-effort: if iptables is unavailable, the shim logs a warning and continues without metering
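
The chain setup described above could look roughly like the following. This is a minimal sketch, not the PR's code: the chain name and CIDR list come from the description, while `build_meter_rules`, the rule order, and the `OUTPUT` hook point are assumptions made for illustration.

```python
from typing import List

# CIDRs named in the PR description as excluded from metering.
PRIVATE_CIDRS = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "169.254.0.0/16"]
CHAIN = "dstack-nm"  # chain name from the PR description


def build_meter_rules(chain: str = CHAIN) -> List[List[str]]:
    """Return iptables argument lists for a hypothetical metering chain."""
    cmds = [["iptables", "-N", chain]]
    # Private/VPC destinations leave the chain before reaching the counting rule.
    for cidr in PRIVATE_CIDRS:
        cmds.append(["iptables", "-A", chain, "-d", cidr, "-j", "RETURN"])
    # A rule without a target does not affect the packet; it only updates counters.
    cmds.append(["iptables", "-A", chain])
    # Route outbound traffic through the chain (the hook point is an assumption).
    cmds.append(["iptables", "-I", "OUTPUT", "-j", chain])
    return cmds


if __name__ == "__main__":
    for cmd in build_meter_rules():
        print(" ".join(cmd))
```

The function only builds the commands, so a caller can log them, run them with `subprocess.run`, or skip metering with a warning if `iptables` is missing, in line with the best-effort behavior described above.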

Files changed

Shim: netmeter/ package (iptables chain setup, 10s polling, atomic Bytes() read), started at shim boot in main.go, reported via InstanceHealthResponse.data_transfer_bytes
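
To illustrate the polling and atomic read, here is a hedged Python sketch (the actual netmeter package is Go and is not shown on this page). It assumes the counters are read from `iptables -L dstack-nm -v -x -n` style output and that the counting rule is the one whose target is not `RETURN`; `parse_counted_bytes`, `NetMeter`, and the sample listing are hypothetical.

```python
import threading
from typing import Callable

# Hypothetical listing in the layout printed by `iptables -L <chain> -v -x -n`.
SAMPLE = """\
Chain dstack-nm (1 references)
    pkts      bytes target     prot opt in     out     source               destination
       3      180 RETURN     all  --  *      *       0.0.0.0/0            10.0.0.0/8
      12     3456            all  --  *      *       0.0.0.0/0            0.0.0.0/0
"""


def parse_counted_bytes(listing: str) -> int:
    """Sum the byte counters of every rule whose target is not RETURN."""
    total = 0
    for line in listing.splitlines()[2:]:  # skip the chain title and column header
        fields = line.split()
        if len(fields) < 3 or not fields[1].isdigit():
            continue
        # For a rule without a target, the third field is the protocol column.
        if fields[2] != "RETURN":
            total += int(fields[1])
    return total


class NetMeter:
    """Polls a listing source and exposes the latest value via bytes()."""

    def __init__(self, read_listing: Callable[[], str], interval: float = 10.0):
        self._read_listing = read_listing
        self._interval = interval
        self._lock = threading.Lock()
        self._bytes = 0
        self._stop = threading.Event()

    def poll_once(self) -> None:
        value = parse_counted_bytes(self._read_listing())
        with self._lock:
            self._bytes = value

    def run(self) -> None:
        # Event.wait returns False on timeout, so this polls until stop() is called.
        while not self._stop.wait(self._interval):
            self.poll_once()

    def stop(self) -> None:
        self._stop.set()

    def bytes(self) -> int:
        with self._lock:
            return self._bytes
```

In use, `run` would be started on a background thread (for example `threading.Thread(target=meter.run, daemon=True).start()`), and the health handler would call `meter.bytes()` without touching iptables itself.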

Server: InstanceModel.data_transfer_bytes column + Alembic migration, extraction in instance health check (both pipeline_tasks and scheduled_tasks), final read in termination path, Instance API model
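
The final read in the termination path can be pictured as follows. This is a sketch under stated assumptions: `Instance`, `fetch_bytes`, and `terminate` are hypothetical stand-ins, and treating a failed read as "keep the last stored value" is an assumption that matches the best-effort framing rather than something the PR text states.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Instance:
    """Hypothetical stand-in for InstanceModel, reduced to the relevant field."""
    name: str
    data_transfer_bytes: Optional[int] = None


def terminate_with_final_reading(
    instance: Instance,
    fetch_bytes: Callable[[], Optional[int]],
    terminate: Callable[[], None],
) -> None:
    """Try one last counter read, then terminate regardless of the outcome."""
    try:
        latest = fetch_bytes()
    except Exception:
        latest = None  # best-effort: an unreachable shim must not block termination
    if latest is not None:
        instance.data_transfer_bytes = latest
    terminate()
```

Keeping the previously stored value on failure means the record is at most one health-check interval stale when the shim cannot be reached.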

Test plan

  • Unit tests: iptables output parsing, Bytes() read
  • golangci-lint, ruff: 0 issues
  • Go + Python tests: all pass
  • E2E on AWS: task uploaded ~20 MB, data_transfer_bytes = 22.1 MB (includes apt-get, Docker pull overhead)
  • Final read at termination: value captured before instance destroyed
  • Backward compatible: old shims → field is None/0, old servers → ignore extra field
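
The backward-compatibility point can be sketched as a tolerant decoder. The field name comes from the PR description; the function name, the JSON shape, and the validation rules below are assumptions for illustration only.

```python
import json
from typing import Optional


def extract_data_transfer_bytes(payload: str) -> Optional[int]:
    """Return the metered byte count, or None if the shim did not report one."""
    body = json.loads(payload)
    # Old shims omit the field entirely; dict.get then yields None.
    value = body.get("data_transfer_bytes")
    # Reject booleans (a subclass of int), non-integers, and negative values.
    if isinstance(value, bool) or not isinstance(value, int) or value < 0:
        return None
    return value


if __name__ == "__main__":
    print(extract_data_transfer_bytes('{"dcgm": null}'))
    print(extract_data_transfer_bytes('{"data_transfer_bytes": 22100000}'))
```

The opposite direction (an old server talking to a new shim) needs no code in this sketch: a decoder that reads only the keys it knows simply never looks at `data_transfer_bytes`.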

🤖 Generated with Claude Code

peterschmidt85 changed the title from "Add per-job data transfer quota (AWS)" to "Add per-instance data transfer metering" on Mar 25, 2026
Andrey Cheptsov and others added 7 commits March 25, 2026 14:32
…limits

Adds a configurable per-job outbound data transfer quota (AWS only) that
terminates jobs when the total external traffic exceeds the threshold.
Metering uses iptables byte counters on the shim (host-level), excluding
private/VPC traffic. The shim notifies the runner via a new /api/terminate
endpoint so the server reads the termination reason through the existing
/api/pull flow — same pattern as log quota.

Configured via DSTACK_SERVER_DATA_TRANSFER_QUOTA_PER_JOB_AWS (bytes, 0=unlimited).
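
For reference, the quota semantics in this commit message (an approach that a later commit in this PR replaces with passive metering) amount to a check like the one below. The environment variable name is taken from the message; the helper names and the default of 0 when the variable is unset are assumptions.

```python
import os
from typing import Mapping

ENV_VAR = "DSTACK_SERVER_DATA_TRANSFER_QUOTA_PER_JOB_AWS"


def read_quota(env: Mapping[str, str] = os.environ) -> int:
    """Read the quota in bytes; an unset variable is assumed to mean 0 (unlimited)."""
    return int(env.get(ENV_VAR, "0"))


def quota_exceeded(transferred_bytes: int, quota_bytes: int) -> bool:
    """A quota of 0 disables the check; otherwise compare against the threshold."""
    return quota_bytes > 0 and transferred_bytes > quota_bytes
```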

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uota

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…orting

Replace the per-job quota termination approach with per-instance passive
metering. The shim starts an iptables-based netmeter at startup that
continuously tracks outbound external bytes. The server reads this via
the existing /api/instance/health endpoint during periodic health checks
(~60s) and captures a final reading before instance termination.

Changes:
- Netmeter: per-instance chain (dstack-nm), no quota, exposes Bytes()
- Shim: starts netmeter at boot, reports via InstanceHealthResponse
- Server: stores data_transfer_bytes on InstanceModel, final read at termination
- Removed: quota enforcement, /api/terminate endpoint, DATA_TRANSFER_QUOTA_EXCEEDED

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…om SSH tunnel

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
peterschmidt85 force-pushed the feature/data-transfer-quota branch from 7ff682d to 8ec5b15 on March 25, 2026 13:36
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
-    return self.health_response.dcgm is not None
+    return (
+        self.health_response.dcgm is not None
+        or self.health_response.data_transfer_bytes is not None
Collaborator

I don't think having data_transfer_bytes should mean that health checks are present: we wouldn't want to create health checks only because of data_transfer_bytes. Overall, having data_transfer_bytes as part of health checks is confusing.
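
One possible reading of this suggestion, sketched with hypothetical names (`HealthResponse`, `has_health_checks`, and `metered_bytes` are illustrative and not taken from the codebase): keep the health-check predicate tied to dcgm alone and read the byte counter through a separate accessor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthResponse:
    """Hypothetical response model reduced to the two fields under discussion."""
    dcgm: Optional[dict] = None
    data_transfer_bytes: Optional[int] = None


def has_health_checks(resp: HealthResponse) -> bool:
    # Only dcgm data counts as a health check; metering does not.
    return resp.dcgm is not None


def metered_bytes(resp: HealthResponse) -> Optional[int]:
    # Metering is read independently, so it never triggers health check creation.
    return resp.data_transfer_bytes
```

With this split, a response carrying only `data_transfer_bytes` updates the stored counter without producing a health check record.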
