fix: restart video service when first frame stalls#13944
fix: restart video service when first frame stalls#13944godnight10061 wants to merge 5 commits intorustdesk:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes an issue where the video service could stall when the capturer returns invalid frames, preventing peers from receiving the first video frame. The solution treats invalid frames as WouldBlock errors (enabling existing retry/fallback mechanisms) and adds a watchdog that restarts the service if no frame is sent to subscribers within 3 seconds.
- Converts invalid frames to
WouldBlockerrors to trigger retry/fallback paths - Implements a first-frame watchdog with a 3-second timeout
- Tracks whether frames are successfully sent to subscribers to determine if the watchdog should trigger
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/server/video_service.rs
Outdated
| if !sent_first_frame { | ||
| if produced_frame { | ||
| sent_first_frame = true; | ||
| first_frame_watchdog.reset(); |
There was a problem hiding this comment.
The reset call here is unnecessary because once sent_first_frame is set to true on line 907, the outer condition if !sent_first_frame on line 905 will prevent this code block from executing again. The watchdog state will never be checked again, so resetting it has no effect. Consider removing this line to simplify the code.
| first_frame_watchdog.reset(); |
|
@godnight10061 Hi, this PR looks good. I'll test more. But I have two questions at the moment:
The current implementation: fn on_no_frame(&mut self, tick: Duration) -> bool {
self.waited = self.waited.saturating_add(tick);
self.waited >= self.timeout
}seems more complex than: fn on_no_frame(&mut self) -> bool {
self.waited.elapsed() >= self.timeout
}
log::warn!(
"No first video frame for {first_frame_timeout:?}, restarting video service"
); |
@fufesou 1. Good catch. switched to Instant + elapsed() for simpler logic; kept deterministic tests by injecting instants (no real sleeps). |
@godnight10061 Thanks for the update. For point 2, I still feel this guard is more robust on the controlling side. On the controlled side, we only know that a frame was produced and scheduled to be sent. We don’t know whether the controlling side actually received and rendered the first frame. Decode/render issues or some other issues would still leave the user stuck on “waiting for image…”, but the controlled-side watchdog would consider the first frame “done”. If we have clear evidence that in some cases the controlled side truly never sends a first frame, then adding a watchdog in And in that case, I think we should also look for the underlying reason why the video service failed to send (or capture) the first frame. |
This change addresses robustness issues by moving the responsibility of detecting missing first video frames to the controlling side (client). A new ClientFirstFrameWatchdog triggers a refresh if the first frame is not received within 3 seconds of connection, ensuring recovery even if the server thinks it sent a frame but it was lost or not rendered.
|
@godnight10061 Hi, thanks for the update. This guard appears to be helpful. Additionally, could you confirm whether you've experienced the "Waiting for image" issue and if this commit addresses it? |
@fufesou Thanks for the feedback. You're right - I haven't personally reproduced the "first frame stalls" issue yet; I replicated it based on user reports and existing discussions like #5609 and #12574. The logic assumes the capturer occasionally returns |
|
I still don’t have a stable local repro for the full “Waiting for image…” flow, but there’s a concrete DXGI code-path that can yield “Ok but invalid” textures: in libs/scrap/src/dxgi/mod.rs get_texture() calls If you (or anyone) can reproduce, could you share logs from both sides for ~10s around connect + Windows version, |
|
We will hode this PR until we get a reproduction. |
Problem
In some cases the capturer can return
Ok(frame)while the frame is invalid (!frame.valid()). Previously this was treated as a successful capture, so the normalWouldBlock("no image yet") handling didn't run and peers could remain stuck waiting for the first image.Approach
WouldBlockso existing retry/fallback paths run (including Windowstry_gdi).SWITCHmechanism.Tests
cargo test