bugfix(network): Add command id overflow check to keep network commands in chronological order#2736
Conversation
ee6bcd8 to
1f31061
Compare
|
| Filename | Overview |
|---|---|
| Core/GameEngine/Source/GameNetwork/NetCommandList.cpp | Adds isCommandIdNewer() using the classic half-range (RFC 1982-style) uint16 serial number comparison behind a RETAIL_COMPATIBLE_NETWORKING guard; all three call sites in addMessage() updated correctly |
| Core/GameEngine/Source/GameNetwork/NetworkUtil.cpp | Moves commandID to file-scope static, changes start from 64001 (first returned) to 0 (post-increment); first returned value is now 0 which may need validation as a safe non-sentinel ID |
| Core/GameEngine/Source/GameNetwork/Network.cpp | Guards NET_CRC_INTERVAL=1 behind both DEBUG_CRC and !RETAIL_COMPATIBLE_NETWORKING to prevent per-frame CRC checks from triggering the overflow bug in retail-compatible builds |
| Generals/Code/GameEngine/Source/Common/CommandLine.cpp | Mirrors Network.cpp change: parseNetCRCInterval is now gated by both DEBUG_CRC and !RETAIL_COMPATIBLE_NETWORKING |
| GeneralsMD/Code/GameEngine/Source/Common/CommandLine.cpp | Identical change to Generals/Code counterpart for the Zero Hour build |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["addMessage(msg)"] --> B{"Fast-path eligible?\nsame type and playerID\nas lastMessageInserted"}
B -- Yes --> C{"isCommandIdNewer\nmsg.ID vs last.ID"}
C -- No --> F["Full linear search"]
C -- Yes --> D{"theNext is null\nor theNext comes after msg"}
D -- Yes --> E["Insert after lastMessageInserted\nfast path"]
D -- No --> F
F --> G["Scan forward by type"]
G --> H["Scan forward by playerID"]
H --> I{"isCommandIdNewer\nmsg.sortNum vs tempmsg.sortNum"}
I -- Yes --> J["Advance tempmsg"]
J --> I
I -- No --> K["Insert before tempmsg"]
subgraph compare["isCommandIdNewer logic"]
direction LR
P["diff = uint16(newVal - oldVal)"] --> Q{"diff == 0"}
Q -- Yes --> R["false - equal IDs"]
Q -- No --> S{"diff < 0x8000"}
S -- Yes --> T["true - newVal is newer"]
S -- No --> U["false - newVal is older"]
end
Reviews (8): Last reviewed commit: "Tweaked comment." | Re-trigger Greptile
1f31061 to
ebf1fa2
Compare
ebf1fa2 to
a7ff356
Compare
|
Resetting the command id just delays the potential mismatch, right? Wouldn't it be better to sort the command ids properly, even if that rarely causes mismatches with retail clients? |
It's reset every N frames, so it's not just delayed. |
a7ff356 to
3ae2622
Compare
xezon
left a comment
There was a problem hiding this comment.
The id needs to be unique per execution frame.
Can we just reset it to 0 or 1 at the beginning of every new frame? Or does that cause issues? What is the reason for it to reset every 128 frames now?
3ae2622 to
93c5ce7
Compare
|
When it's reset every 128 frames, isn't it possible that a packet from frame 127 comes in after it's already reset and then gets sorted wrong? |
|
My current understanding is that the ids only matter per frame, not across different frames. But then I do not understand why it needs to be reset every 128 frames instead of every frame. |
93c5ce7 to
1bccb3a
Compare
|
I noticed that at extremely high latency (beyond 64 frames runahead) the disconnect screen could be triggered by the reset somehow. I tried a ton of things, but I was unable to find out why that happens. That's why I made some changes to the implementation and only reset the command id when it gets close to overflowing.
This check should guard against that. If the check is inverted, you'll see wrong sorting once in a while.
If the runahead decreases the ids matter across multiple frames. |
xezon
left a comment
There was a problem hiding this comment.
I noticed that at extremely high latency (beyond 64 frames runahead) the disconnect screen could be triggered by the reset somehow. I tried a ton of things, but I was unable to find out why that happens. That's why I made some changes to the implementation and only reset the command id when it gets close to overflowing.
Ok so this means every ~16 minutes there is a miniscule chance of network breakdown? It would be good to find a definite fix for it so we never need to worry about this again.
If the runahead decreases the ids matter across multiple frames.
This is a big problem for resetting. Can you gather more intel about this? Can we avoid it?
Maybe a good command reset opportunity is when there is a time frame where little to no new network commands were sent to peers?
|
Also inquired with Chad Gippy: If you have a monotonically increasing if (a > b)fails across the wrap boundary. A common solution is to compare the difference modulo 2¹⁶ and interpret it as a signed value: bool is_newer(uint16_t a, uint16_t b)
{
return static_cast<int16_t>(a - b) > 0;
}Examples: is_newer(100, 50); // true
is_newer(50, 100); // false
is_newer(0, 65535); // true (wrapped by 1)
is_newer(65535, 0); // falseThe subtraction is performed modulo 65536: 0 - 65535 == 1which becomes LimitationThis only works if the counters are never more than half the range apart. For a 16-bit counter: The value exactly half a range away ( a = 0;
b = 32768;You cannot tell whether A robust version is: bool is_newer(uint16_t a, uint16_t b)
{
uint16_t diff = a - b;
return diff != 0 && diff < 0x8000;
}This is essentially the same technique used for sequence numbers in networking protocols such as TCP, where wraparound is expected. |
|
So what you could do for the non retail compatible sorting: bool isCommandIdNewer(uint16_t newVal, uint16_t oldVal)
{
#if RETAIL_COMPATIBLE_NETWORKING
return newVal > oldVal;
#else
uint16_t diff = newVal - oldVal;
return diff != 0 && diff < 0x8000;
#endif
}This would require no manual ID resetting. Needs testing. |
This reverts commit f386e5f.
9f1d368 to
60e1229
Compare
Yeah, this is probably the simplest solution after all (but not 100% retail compatible). Code and PR description updated. |
Maybe still use the command id reset approach for the retail compatible network, if it reduces the chance of mismatch? |
It's only a potential mismatch issue if The only reason I found out about this issue at all is because Legionnaire was using a special build for mismatch analysis that had the interval set to 1, which resulted in spurious mismatches that coincided with the command id overflow. |
Synchronized network commands are given a unique id so that they can be sorted chronologically. The id needs to be unique per execution frame. It's a uint16 value so it'll overflow quite fast, especially because it's never reset and the starting value is 64000 by default.
When the network runahead decreases it's expected that multiple frames are executed right after each other. I noticed an issue with the overflow when I set the CRC interval to once per frame. When the overflow happened for a client combined with a decrease of the runahead, the order of the CRC messages was no longer guaranteed to be chronological for that execution frame. That would cause a mismatch if the overflow didn't happen for all clients at the same time.
With a starting value of 64000 and a CRC message every frame the first overflow happens after ~25 seconds. You can see it mismatch here because of the overflow (this was with a special test build):
https://www.youtube.com/live/V_l-q9Y-DmA?t=621s
https://www.youtube.com/live/V_l-q9Y-DmA?t=9499s
I've added some test code in the first commit. If that's used without this fix, it becomes very apparent that the messages are not in the right order when the command id overflows.