
fix: restart video service when first frame stalls#13944

Open
godnight10061 wants to merge 5 commits into rustdesk:master from godnight10061:fix-waiting-for-image

Conversation


@godnight10061 godnight10061 commented Jan 2, 2026

Problem

In some cases the capturer can return Ok(frame) while the frame is invalid (!frame.valid()). Previously this was treated as a successful capture, so the normal WouldBlock ("no image yet") handling didn't run and peers could remain stuck waiting for the first image.

Approach

  • Treat invalid frames as WouldBlock so existing retry/fallback paths run (including Windows try_gdi).
  • Add a first-frame watchdog: if no frame is actually sent to any subscriber within ~3s, restart the video service via the existing SWITCH mechanism.

Tests

  • cargo test

Owner

rustdesk commented Jan 2, 2026

@21pages @fufesou both review this.

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes an issue where the video service could stall when the capturer returns invalid frames, preventing peers from receiving the first video frame. The solution treats invalid frames as WouldBlock errors (enabling existing retry/fallback mechanisms) and adds a watchdog that restarts the service if no frame is sent to subscribers within 3 seconds.

  • Converts invalid frames to WouldBlock errors to trigger retry/fallback paths
  • Implements a first-frame watchdog with a 3-second timeout
  • Tracks whether frames are successfully sent to subscribers to determine if the watchdog should trigger


if !sent_first_frame {
    if produced_frame {
        sent_first_frame = true;
        first_frame_watchdog.reset();

Copilot AI Jan 2, 2026


The reset call here is unnecessary because once sent_first_frame is set to true on line 907, the outer condition if !sent_first_frame on line 905 will prevent this code block from executing again. The watchdog state will never be checked again, so resetting it has no effect. Consider removing this line to simplify the code.

Suggested change
first_frame_watchdog.reset();

Collaborator

fufesou commented Jan 2, 2026

@godnight10061 Hi, this PR looks good.

I'll test more.

But I have two questions at the moment:

  1. Why did you use waited: Duration::ZERO instead of waited: Instant::now()?

The current implementation:

fn on_no_frame(&mut self, tick: Duration) -> bool {
    self.waited = self.waited.saturating_add(tick);
    self.waited >= self.timeout
}

seems more complex than:

fn on_no_frame(&mut self) -> bool {
    self.waited.elapsed() >= self.timeout
}
  2. Why did you add the FirstFrameWatchdog guard in video_service.rs rather than on the controlling side? Have you been able to reproduce the issue or the following log entry?
log::warn!(
    "No first video frame for {first_frame_timeout:?}, restarting video service"
);

Author

godnight10061 commented Jan 2, 2026

> 1. Why did you use waited: Duration::ZERO instead of waited: Instant::now()?
> 2. Why did you add the FirstFrameWatchdog guard in video_service.rs rather than on the controlling side? Have you been able to reproduce the issue or the following log entry?

@fufesou
1. Good catch. Switched to Instant + elapsed() for simpler logic; kept the tests deterministic by injecting instants (no real sleeps).
2. Kept the guard in video_service.rs because only the capture/encode loop knows whether a frame was actually sent, and it can reuse the existing bail!("SWITCH") restart path; doing this from the controlling side would require extra signaling/state.
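A minimal sketch of the Instant-based version with injectable instants for deterministic testing, as described in point 1. Names are illustrative and may differ from the code in the PR:

```rust
use std::time::{Duration, Instant};

// Illustrative first-frame watchdog using Instant + elapsed-style checks.
struct FirstFrameWatchdog {
    started: Instant,
    timeout: Duration,
}

impl FirstFrameWatchdog {
    fn new(timeout: Duration) -> Self {
        Self { started: Instant::now(), timeout }
    }

    // Deterministic variant for tests: inject "now" instead of sleeping.
    fn expired_at(&self, now: Instant) -> bool {
        now.duration_since(self.started) >= self.timeout
    }

    // Production path: check against the real clock.
    fn expired(&self) -> bool {
        self.expired_at(Instant::now())
    }
}
```

Tests can then assert expiry by passing `started + some_duration` as the injected instant, with no real 3-second waits.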

Collaborator

fufesou commented Jan 2, 2026

> Kept the guard in video_service.rs because only the capture/encode loop knows whether a frame was actually sent and it can reuse the existing bail!("SWITCH") restart path; doing this from the controlling side would require extra signaling/state.

@godnight10061 Thanks for the update.

For point 2, I still feel this guard is more robust on the controlling side.

On the controlled side, we only know that a frame was produced and scheduled to be sent.

We don't know whether the controlling side actually received and rendered the first frame. Decode/render problems (or other issues) would still leave the user stuck on "waiting for image…", yet the controlled-side watchdog would consider the first frame "done".

If we have clear evidence that in some cases the controlled side truly never sends a first frame, then adding a watchdog in video_service.rs makes sense as a targeted fix; the same goes if you have reproduced the "first frame stalls" issue and confirmed it is fixed after applying this commit.

And in that case, I think we should also look for the underlying reason why the video service failed to send (or capture) the first frame.

godnight10061 added 3 commits January 2, 2026 13:38
This change addresses robustness issues by moving the responsibility of detecting missing first video frames to the controlling side (client). A new ClientFirstFrameWatchdog triggers a refresh if the first frame is not received within 3 seconds of connection, ensuring recovery even if the server thinks it sent a frame but it was lost or not rendered.
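The client-side guard described in the commit message could look roughly like the following. This is a sketch under the assumption that the client marks the watchdog on the first decoded frame and polls it on a timer; the names (`ClientFirstFrameWatchdog`, `on_frame`, `should_refresh`) are illustrative, not necessarily the PR's exact API:

```rust
use std::time::{Duration, Instant};

// Illustrative client-side watchdog: on_frame() is called when a video frame
// is decoded; should_refresh() is polled periodically and returns true (at
// most once) when the timeout elapses with no frame received.
struct ClientFirstFrameWatchdog {
    connected_at: Instant,
    timeout: Duration,
    got_frame: bool,
    fired: bool,
}

impl ClientFirstFrameWatchdog {
    fn new(timeout: Duration) -> Self {
        Self {
            connected_at: Instant::now(),
            timeout,
            got_frame: false,
            fired: false,
        }
    }

    fn on_frame(&mut self) {
        self.got_frame = true;
    }

    // "now" is injected so tests stay deterministic.
    fn should_refresh(&mut self, now: Instant) -> bool {
        if self.got_frame || self.fired {
            return false;
        }
        if now.duration_since(self.connected_at) >= self.timeout {
            self.fired = true; // trigger the refresh only once
            return true;
        }
        false
    }
}
```

Because it watches for the frame actually arriving at the client, this guard also covers the lost-in-transit and not-rendered cases that a controlled-side watchdog cannot see.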
Collaborator

fufesou commented Jan 2, 2026

@godnight10061 Hi, thanks for the update.

This guard appears to be helpful.

Additionally, could you confirm whether you've experienced the "Waiting for image" issue and if this commit addresses it?

Author

godnight10061 commented Jan 3, 2026

> This guard appears to be helpful. Additionally, could you confirm whether you've experienced the "Waiting for image" issue and if this commit addresses it?

@fufesou Thanks for the feedback.

You're right: I haven't personally reproduced the "first frame stalls" issue yet; I based the fix on user reports and existing discussions like #5609 and #12574. The logic assumes the capturer occasionally returns Ok(frame) with !frame.valid(), which prevents the normal WouldBlock retry path from running. I'll try to reproduce it consistently (e.g., specific OS/driver/capturer mode combinations) and verify whether this fix resolves it.

@godnight10061
Author

I still don't have a stable local repro for the full "Waiting for image…" flow, but there is a concrete DXGI code path that can yield "Ok but invalid" textures: in libs/scrap/src/dxgi/mod.rs, get_texture() calls QueryInterface(IID_ID3D11Texture2D, …) without checking the HRESULT, so the out texture pointer can remain null and propagate as a Frame::Texture that fails valid().

If you (or anyone) can reproduce, could you share logs from both sides for ~10s around connect, plus the Windows version, GPU/driver, monitor topology, and what was happening at the time (Win+L/UAC/monitor hotplug/Win+Ctrl+Shift+B)? That will help confirm whether we're hitting invalid frames or a decode/render issue.
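A platform-neutral illustration of the bug class described above (this is not the actual libs/scrap COM code): a QueryInterface-style call hands back an HRESULT plus an out pointer, and the pointer is only trustworthy when the HRESULT is a success code (non-negative). Skipping that check is how a null texture can surface as an Ok(frame) that later fails valid():

```rust
use std::ptr;

// Illustrative guard: accept the out pointer only when the HRESULT signals
// success (>= 0) AND the pointer is non-null; otherwise report "no texture"
// instead of letting a null pointer leak through as a seemingly valid frame.
fn texture_from_query(hr: i32, out: *mut u8) -> Option<*mut u8> {
    if hr >= 0 && !out.is_null() {
        Some(out)
    } else {
        None
    }
}
```

In the real code the fix would be to check the HRESULT of QueryInterface and surface the failure (or a WouldBlock-style retry) rather than constructing a Frame::Texture around a null pointer.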

Owner

rustdesk commented Jan 7, 2026

We will hold this PR until we get a reproduction.

