|Update: Client Lag Fix|
February 10, 2019
What made this bug so hard to find and fix was the fact that it affected so few people, relatively speaking. However, for the affected people, it affected them all the time, and pretty much ruined the game for them.
The symptom: in busy areas, apparent network lag would grow and grow, resulting in up to twenty seconds of delay between trying to do something (like picking a berry) and having the action resolve (like the berry appearing in your hand). On its face, this sounds like classic network lag. The first thought is that the server isn't keeping up with demand. However, other people playing in the same area were not experiencing lag. In fact, the affected player would often ask about lag, in-game, and be told by others that there was no lag for them. Also, if the server was being bogged down, the lag would be experienced everywhere in the game world, not just in busy areas, because all areas are processed in the same loop.
So the next hypothesis was a problem on the affected player's end. Maybe they were in a remote part of the real world. Maybe they were on spotty WiFi. The problem would often clear itself up instantly if they walked out of the busy areas. And certainly, the server is sending them fewer messages out there, because it filters the messages based on what is relevant to your location. In a busy city, you need to receive a lot of information, because so many people are walking around. In the wilderness, there's much less change happening. So this symptom was generally consistent with network lag.
A while back, I built a /PING and /FPS command into the game, so that people could run tests if they were experiencing lag. Sure enough, during these lag situations, ping times would balloon. Normal ping times in the US are below 100ms, and not more than 400ms anywhere in the world. But during lag, the ping would grow to five, ten, or even twenty seconds. That's really bad, and probably transcends any normal network lag.
And for these people, things only got worse when we moved everyone to bigserver2. Big cities are much more common, so many of the affected people were experiencing unplayable lag almost every life. Of course, for everyone else---those who never experienced lag---bigserver2 was great.
But finally, almost miraculously, I experienced this issue myself for the first time this week. A unicorn! I was playing in a busy city, on my slow dev laptop with a weak GPU, and sure enough lag. Bad lag. Really bad lag. My in-game ping time grew to more than 14 seconds. The game was totally unplayable.
During this time, I also noticed that my FPS dropped from around 60 down to 40 or so. Frame rate and network lag aren't necessarily related, but my lag was very hard to reproduce---it would come and go seemingly at random, even in the big city, depending on where I walked---and it seemed to be correlated with this drop in FPS.
I set up a chaotic Eve-only city on bigserver2 on Friday to conduct a real stress test. 120 players all spawning in the same spot (0,0) is no joke, and I could very consistently trigger lag on my slow dev laptop.
I also found that my gaming rig would not see lag in the same area, but it was running at a solid 85 FPS (odd, I know, but it's a CRT). So, same network, different CPU and GPU, higher FPS, no lag. So yeah, with proper hardware, the client can easily handle 120 players all in the same area. It was chaos, but buttery smooth chaos.
Someone pointed out that outside-of-game pings (using the command-line ping tool) aren't necessarily slow during an in-game lag, and I was able to confirm this. Someone else suggested that I sniff the raw network packets and figure out exactly how quickly the server was responding to my PING with a PONG---just to rule out server-side lag. Sure enough, while my client took 14 seconds to register the PONG, the PONG arrived on the network within the normal 70 ms, even on the slow dev laptop. There was some kind of networking issue inside the client.
I spent quite a bit of time testing my underlying networking code and looking for reasons that network messages might get backed up, but found no issue in isolated network tests. I also considered some kind of kernel networking issue (my laptop is running Linux, while my gaming rig tests were on Windows 7). No dice.
Meanwhile, someone else had been able to pinpoint the exact problem in the client, and they posted their fix in an old, lingering GitHub issue. Finally, someone drew my attention to this fix, which was rather hidden on the GitHub side.
JRuldolf, we all owe you one!
Turns out that this problem has been with us since an update back in October, before the Steam release, when message frames were added. A frame groups messages together that occur during the same server processing step, forcing the client to wait to react to any of these messages until all the messages in the frame arrive. This prevents, for example, a message about a map change from being processed before the matching player update is received (if a player dumps a bowl of water into a bucket, the bucket on the map changes, and so does the bowl in their hand, and there are two separate messages, but they only make sense if they occur client-side at the same time).
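The framing idea can be sketched roughly like this (a minimal Python illustration with hypothetical names; the actual client is C++ and its protocol differs): messages accumulate in a holding queue until an end-of-frame marker arrives, and only then are they all applied together, so the bucket change and the bowl change land client-side at the same moment.

```python
# Hypothetical sketch of message framing: messages from one server
# processing step are held back until the end-of-frame marker arrives,
# then applied together as a unit.
END_OF_FRAME = "FRAME_END"

def apply_frames(incoming, apply):
    holding = []
    for msg in incoming:
        if msg == END_OF_FRAME:
            # Whole server frame received: apply every held message at once.
            for held in holding:
                apply(held)
            holding.clear()
        else:
            holding.append(msg)

applied = []
apply_frames(["MAP_CHANGE bucket", "PLAYER_UPDATE bowl", END_OF_FRAME],
             applied.append)
# Both messages become visible to the rest of the client together.
```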
This frame system was great, and fixed a heap of potential inconsistencies in client behavior.
However, there was also a bug in the way that frames were processed. Each client step (each rendering frame), the client would read the next message and check if it was an end-of-frame message. If not, it would put the message in the holding queue and go on to the next rendering step.
You can see how this can cause trouble when message frames contain more and more messages (which they do in busy areas): a frame with five messages takes at least five client frames to fully receive, even if all five messages have arrived, because we only check one message per frame. Once the 6th message is checked, the end of frame message, we call the frame ready, and process all five messages together.
What we need to do, instead, is loop as long as any messages are available, checking for the end-of-frame message, but if it's not there, continuing on to the next message, until no more received messages are available. Thus, we process all received messages every client frame, regardless of how long the message frame is. This even allows us to process multiple server message frames on a single client rendering frame, if several server frames are waiting client-side.
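In the same hypothetical sketch, the fix is to drain the entire received queue each render frame, which also lets several complete server frames resolve in one rendering step:

```python
# Sketch of the FIX: loop over ALL received messages each render frame,
# so any number of complete server frames resolve in a single step.
END_OF_FRAME = "FRAME_END"

def render_frame_fixed(received, holding, ready):
    """One client rendering step: drain everything that has arrived."""
    while received:
        msg = received.pop(0)
        if msg == END_OF_FRAME:
            ready.extend(holding)
            holding.clear()
        else:
            holding.append(msg)

# Two complete server frames waiting client-side...
received = ["m1", "m2", END_OF_FRAME, "m3", "m4", END_OF_FRAME]
holding, ready = [], []
render_frame_fixed(received, holding, ready)
# ...both are fully processed in one client rendering frame.
```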
If we don't do this, during times with high message rates and large, multi-message frames, we can see how a message backlog would build up. Assuming, of course, that more than 60 messages were arriving per second.
And if the FPS drops on top of that, you can see how it would get even worse, because we are running even fewer processing steps per second. So players with weaker GPUs were pretty much experiencing the perfect storm in busy areas. Lots more messages, and a slower client-side rendering loop that was effectively only processing one message per rendering frame.
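The arithmetic of the perfect storm is simple (the numbers here are made up for illustration): the buggy client's drain rate equals its FPS, so whenever the incoming message rate exceeds the frame rate, the queue grows without bound.

```python
# Back-of-the-envelope backlog math (all numbers hypothetical).
# The buggy client examines one message per render frame, so its
# drain rate equals its frame rate.
fps = 40                   # render frames per second on a weak GPU
incoming_rate = 200        # messages per second in a busy city

backlog_growth = incoming_rate - fps   # queue grows by 160 msgs/sec
seconds = 10
backlog = backlog_growth * seconds     # 1600 messages behind after 10 s

# A newly arrived message (like a PONG) sits behind the whole backlog:
lag = backlog / fps                    # seconds before the client sees it
```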
The fix was literally two lines, putting a loop in there where it should be.
And suddenly, the client could handle the very busiest areas with absolutely no network lag. Even if I artificially reduced the frame rate to 5 FPS, the game was completely playable in busy areas (yes, it was choppy, but each action was executed instantly, with no lag). Before the fix, such a low frame rate would spell disaster in a busy area.
Now, how did such a devastating yet simple bug go unnoticed for so long? Well, as long as the frame rate is high enough, and the incoming message rate is low enough, it generally doesn't matter. We're processing at least one message every frame, and 60 messages a second is a lot, so we usually keep up, even if we don't process all available messages as soon as we have them. I didn't write the code this way on purpose---the original code, before message frames were added, intentionally processed all available messages every rendering frame. But the implementation of message frames quietly subverted this intention.
The move to bigserver2 made this very rare bug less rare, because the cities got bigger, and the message rate higher, causing slightly more people to experience the issue. Including, finally and thankfully, me.
Bug fixes take a long time, but they are worth it. More bug fixes next week. The plan is to get clothing and heating working in a more sensible way.