Update: Jungle Flora
November 19, 2018
Wow, what a week! Steam, of course, had the expected effect: an absolute flood of new players, and the resulting chaos.
This is the first weekly update post-Steam, and as you can tell, it's a bit later than usual. And the release of the update, for any of you playing on Sunday evening when it came out, was quite a bit bumpier than usual. I'll explain more about that in a bit, but first, the update itself.
As you can see from the GIF, there's a brand new biome, the first new one in a while. If you spend some time there, you'll find that it has a pretty good balance between benefits and drawbacks, though that balance will be tweaked over time. It also contains raw ingredients that will help you push transportation tech into even more efficient and durable realms.
Beyond the new content, the biggest change client-side is that the crafting hint filter system has been overhauled to align better with people's expectations. In general, the crafting hint system shows you hints for what you can do with what you are holding, or what you just touched. It does not show you specifically how to make an end product, but steps along the way that you might want to explore. The filter system (/hatchet) is meant to cull down the list for what you are holding to show you only steps along the way that lead to a hatchet.
First, an easy fix. The old filter system would show you any steps along the way AND anything you could do with the hatchet once you got it. This isn't very helpful---you're only filtering by /hatchet because you don't have one yet. Who cares what you can do with it? So those are hidden now, when filtering.
One point of confusion: What happens when a hatchet itself can be used to make some of the ingredients for a hatchet? This may not be true for a hatchet, but it's true for an axe, which requires kindling to make, but can also make kindling. Still, it's best not to show this when filtering for /axe. There have to be other ways to make kindling, after all, because you don't have the axe yet.
Finally, what happens when none of the steps for what you're holding are relevant to /hatchet? Let's say you grab a piece of flint. In the old system, the full, unfiltered list for flint would be shown, and "NONE RELEVANT" would be shown next to the filter. But a lot of people didn't see this, and it was confusing. Why is it showing that I can skin a wolf with flint? I'm trying to make a hatchet! Now it shows no steps at all, and spells it out for you. "MAKING HATCHET (FLINT NOT RELEVANT)"
Also, I found and fixed some mistakes in the filtering logic that led to the display of extra, irrelevant steps. To summarize, the whole filtering thing works much better than it used to. Try it. You'll like it.
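To make the filtering rules above concrete, here is a rough sketch in Python. It is not the game's actual code: the `Step` type, the recipe list, and the function names are all made up for illustration, and the real crafting data is far richer. It only demonstrates the three rules described above: hide steps that use the target itself, ignore recipes that need the target when tracing backward, and show no steps at all (with an explicit message) when the held object isn't relevant.

```python
from typing import NamedTuple

class Step(NamedTuple):
    inputs: frozenset   # objects used in this step (hypothetical model)
    output: str         # object the step produces

# Hypothetical recipes, loosely based on the axe/kindling example.
RECIPES = [
    Step(frozenset({"branch", "sharp stone"}), "kindling"),
    Step(frozenset({"axe", "branch"}), "kindling"),       # needs the axe itself
    Step(frozenset({"kindling", "axe head"}), "axe"),
    Step(frozenset({"flint", "dead wolf"}), "wolf skin"),
]

def objects_leading_to(target, steps):
    """Collect every object on some path to target, skipping steps that use target."""
    needed, frontier = set(), [target]
    while frontier:
        obj = frontier.pop()
        for s in steps:
            if s.output == obj and target not in s.inputs:
                for i in s.inputs - needed:
                    needed.add(i)
                    frontier.append(i)
    return needed

def filter_hints(held, target, steps):
    """Return (hints, label) for the held object when filtering by /target."""
    relevant = objects_leading_to(target, steps) | {target}
    hints = [s for s in steps
             if held in s.inputs
             and target not in s.inputs      # never show uses of the target
             and s.output in relevant]       # only steps on the way to target
    if not hints:
        return [], f"MAKING {target.upper()} ({held.upper()} NOT RELEVANT)"
    return hints, f"MAKING {target.upper()}"
```

With these made-up recipes, holding a branch while filtering for /axe yields only the sharp-stone kindling step, and holding flint yields an empty list plus the "NOT RELEVANT" label.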
Okay, so what happened during the update process?
First, an old version mismatch issue which had bitten me before reared its head. The client makes sure that it is at or ahead of the server's version, and if it is ahead, it makes sure that its data version at least matches the server's version. This works fine most of the time. For the weekly update, I can decide to either update the client binary first, or update the data and server first.
But during weeks where the client and server receive several updates, while the data version remains untouched (like this week, when I was fixing all kinds of Steam issues), we can get into a dangerous situation. The server and data version MUST be updated together, first, before the client is updated.
But this week, there were protocol changes requiring that the client be updated first. The server was going to send newly-formatted messages that the client wouldn't be able to parse. So, I blindly updated the client first, without realizing that we were already several versions ahead of the data version.
On next startup, the client checks if its version matches the server. Nope. Server is v167. Client is v168. Okay, well, is the client data version at least matching the server's version? Nope. Data is v164. Client displays a version mismatch error.
This was put in place to make sure that people connecting to private servers wouldn't experience content-mismatch crashes if their private server lagged behind the official update schedule. But actually, the above logic is too strict. If the client is really at v168, we are ahead of the server, but we can see that our data version number is behind the server, so we're actually okay. What we really need to worry about, in the case of private servers, is if both the data version and binary version number are ahead of the server's version number. So, the client logic has been fixed to prevent this kind of version mismatch in future updates.
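The two rules can be compared side by side in a small Python sketch. This is one reading of the description above, not the client's real code: the function names are invented, "at least matches" is assumed to mean greater than or equal, and the new rule is assumed to keep rejecting a client binary that is behind the server.

```python
def old_check(client, data, server):
    """Old rule: client at or ahead of server; if ahead, data must be >= server."""
    if client < server:
        return False
    if client > server:
        return data >= server
    return True

def new_check(client, data, server):
    """New rule: reject only when binary AND data are both ahead of the server."""
    if client < server:          # assumption: a stale binary is still rejected
        return False
    return not (client > server and data > server)

# The situation from this week's update: client v168, data v164, server v167.
print(old_check(168, 164, 167))   # False: the version mismatch error
print(new_check(168, 164, 167))   # True: allowed to connect
```

Under the same assumptions, a private server stuck at v167 would still turn away a client whose binary and data are both at v168, which is the case the check was meant to catch in the first place.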
So that was fire #1. It caused about 34 minutes of downtime for people who were downloading the update and trying to reconnect. You can see the first huge downward spike in the above graph.
After that got sorted, and everyone had a working client again, it was time to push out the content update. That all went according to plan, and the Steam clients got the update correctly. Everything went fine, until server2 tried to start up. It got through its init process, and started accepting connections, but then got bogged down under a huge processing load. 99% CPU, even with 0 players connected. And still accepting connections.
The reflector, the bit of server code that tells you which server you should go to, had a 3-second timeout in place for a server. So, when checking on server2, it would wait 3 seconds, and then time out. But with hundreds of client requests coming in, 3 seconds of waiting for each one was too much. And server2 was at the top of the list, meaning that the reflector would consider it first, for everyone, before considering other servers. Hundreds of stalled web requests means that nginx starts running out of processes, eventually triggering the dreaded 504 gateway timeout. This results in no one getting successfully assigned to a server, and error messages getting displayed in the client.
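Some back-of-the-envelope arithmetic shows why this falls over so fast. The sketch below assumes a fixed pool of web worker processes, each stuck waiting out the full timeout on the stalled server before it can do anything else. The worker count is a made-up number, not the real configuration.

```python
def max_assignments_per_second(workers, stall_seconds):
    """Upper bound on requests served when every request first waits out a stall."""
    return workers / stall_seconds

WORKERS = 32  # hypothetical pool size, not the real config

rate = max_assignments_per_second(WORKERS, 3.0)
print(f"at most {rate:.1f} requests per second")
```

With 32 workers and a 3-second stall, that ceiling is under 11 requests per second, so a burst of a few hundred reconnecting clients piles up into a backlog almost immediately.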
Fortunately, I was able to track the problem to server2 very quickly, and take it off the list. I think there was something like 15 minutes of downtime from this, fire #2.
But I also realized that server1, which was in the queue to update next, would likely have a similar problem. Other servers didn't seem to be affected.
Initial debugging showed that server2 was bogged down with 200,000 moving animals. Hmm.... maybe I made mosquitoes too common in this update. That must be it.... the Mosquito Meltdown.
But further investigation showed that it wasn't mosquitoes at all, but wild sheep and boars. 200,000 of them, all trying to move at the same time, right after server startup.
Well, thank you Steam, for exploring more of the map on server2 than had ever been explored, because you unearthed a horrible, lurking issue. Animals aren't supposed to move unless they are seen by a player. So, even if there are 200,000 of them known to be in the map (seen by some player at some time in the past), at server startup, this should be no different than a map containing 200,000 trees or rocks.
However, a particular bit of server map processing code was inadvertently "looking" at these moving animals, causing them all to get put in the queue of moving things that need active updating.
This has apparently been a lurking problem for a while, but historically small enough that the server could simply brute-force its way through it at startup, and get back to processing requests. Not this week. 200,000 moving animals.
The first step was to fix the reflector so that it would time out more quickly. These servers are all connected by high-speed networks. 3 seconds is overkill. I was actually able to bring the timeout down to 1/8 second (125 ms). If a server is too bogged to respond in that time, better think of it as offline anyway, for now, right?
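Here is a minimal Python sketch of that idea, not the reflector's real code. It assumes a game server sends at least a few bytes right after a client connects, and the function names and server list format are invented for illustration. Note that a bogged server may still accept the connection, so the check has to wait for an actual response rather than just a successful connect.

```python
import socket

TIMEOUT_SECONDS = 0.125  # the 1/8 second mentioned above

def responds_in_time(host, port, timeout=TIMEOUT_SECONDS):
    """True if the server sends any bytes within `timeout` of connecting.

    The timeout applies to the connect and to the read separately.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            return len(conn.recv(64)) > 0
    except OSError:  # refused, unreachable, or timed out
        return False

def pick_server(servers, probe=responds_in_time):
    """Return the first (host, port) that answers in time, or None."""
    for host, port in servers:
        if probe(host, port):
            return (host, port)
    return None
```

A server that accepts connections but never gets around to replying is simply skipped, and the next one on the list gets the player.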
And with that fix in place, when server1 restarted and had exactly the same issue as server2, there was no widespread outage triggered.
With everyone back online and happy, I spent the rest of a very long night refining the server startup code to avoid "looking at" any of the map objects that are being processed. I got server1 down to the point where it could start up and process the entire map, including all 200,000 animals, without a single animal waking up and moving, at least not until a player spawned in to see them.
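The shape of the bug and the fix can be illustrated with a toy map in Python. This is only a sketch of the general pattern (a read that has a wake-up side effect), not the server's actual data structures. The class, the method names, and the `wake` flag are all hypothetical.

```python
MOVING_KINDS = {"wild sheep", "wild boar"}

class GameMap:
    def __init__(self, cells):
        self.cells = dict(cells)   # (x, y) -> object name
        self.moving = set()        # positions that need active updating

    def look(self, pos, wake=True):
        """Read a cell; by default, seeing an animal wakes it up."""
        obj = self.cells.get(pos)
        if wake and obj in MOVING_KINDS:
            self.moving.add(pos)
        return obj

    def startup_scan(self, wake=False):
        """Walk the whole map at startup, counting objects without waking them."""
        counts = {}
        for pos in self.cells:
            obj = self.look(pos, wake=wake)
            counts[obj] = counts.get(obj, 0) + 1
        return counts

cells = {(i, 0): "wild sheep" for i in range(1000)}
buggy = GameMap(cells)
buggy.startup_scan(wake=True)    # old behavior: scanning "sees" every animal
fixed = GameMap(cells)
fixed.startup_scan()             # new behavior: nothing wakes up
print(len(buggy.moving), len(fixed.moving))
```

In the fixed version, an animal only joins the moving set when something calls `look` on its cell with waking enabled, which stands in for a player actually seeing it.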
This also dramatically sped up the server startup and shutdown process. The other servers will get this improved code during the next update, later this week.