|Update: Temperature Overhaul|
February 17, 2019
The problem of temperature in the game was much harder to solve than you might think. The old model was based on a thermodynamic cellular simulation, which would supposedly allow for heat from fires to be captured in rooms and flow out open doors. The model was accurate, but it was based on thermal conduction, not convection (which is much harder to simulate), and the result was hot areas right around heat sources, and cold areas everywhere else, even in enclosed buildings. In other words, buildings were pretty useless for keeping warm.
Clothing also fit into this simulation, but in a bit of a strange way (it served as extra insulation in the tile that you were standing on). Clothing would amplify any heat source in your tile, turning fires into extreme heat death traps. Finally, biomes were also part of the simulation, adding small heat sources (or sinks for cold biomes) at every cell in the simulation grid. Again, clothing, which insulated the center cell of the simulation grid (where you were standing), would also amplify biome heat. And biome heat effects would blend at biome boundaries (a thermal grid simulation is actually a form of blurring between the grid cells). This meant that there were near perfect areas at the boundaries between hot and cold biomes.
Players, being the rational folks that they are, reacted to the peculiarities of this thermal simulation by avoiding buildings, founding towns along desert boundaries, wearing minimal clothing, and generally not depending on heat sources for warmth. This was never my intention for the game, of course, but that's where things stood. I envisioned a game were buildings, clothing, and heat sources brought crucial advantages to a civilization, and all of the more advanced civilizations would depend on all three.
So, how could I fix this? A different thermal model of course, but what model? And if I wanted both hot and cold biomes (which make a lot of sense), how could I prevent exploitation of the boundaries? I really wanted there to be no "perfect" spot on the map that would make temperature regulation technology irrelevant. If such a spot existed, the smart players would find that, and settle there, always. Cold biomes should be too cold. Hot biomes should be too hot. There should be no "middle ground" in between.
First of all, many thanks to all of the players who engaged in a lengthy discussion in the forums. Also thanks go to my local designer friend Casey, who stuck with me through at least three hours of in-depth discussion about this topic (at the end of our first two-hour discussion, we had pages full of notes, diagrams, and graphs, but still no workable solution to the biome boundary problem).
Okay, now the solutions.
I should mention that what I'm calling "R value" here is different than the standard term as used in the insulation industry. My R value is a fractional heat retention value between 0 (no insulation that loses all heat) and 1 (perfect insulation). This makes it easier to reason about and program for. I suppose I should call it something else, but I don't know what to call it, so I've been calling it R.
First, for walls, I really want to simulate some kind of convection, so that heat spreads more evenly in indoor spaces. Instead of a cellular simulation, I'm now walking through the entire airspace around the player, flood-fill style, until I hit a boundary of insulating walls (or the edge of the 8x8 simulation grid). After that, I find the insulating boundaries, and compute an average R value for those boundaries. The heat sources inside that airspace (which may be the entire 8x8 grid, if there are no walls) produce heat which is spread evenly throughout the tiles of the airspace. That heat is modulated by the R-value of the boundaries of the airspace (if the average R value is 0.5, then half the heat is lost, and the rest is spread evenly in the enclosed space). Floors themselves count as part of the boundary of the space (if there's no floor in a tile, that tile counts as one of the air boundaries, thus reducing the average R value).
So what happens in this new model when you open a door? Suddenly, your airspace gets much bigger (the inside of your house plus the area outside your house), and your airspace boundary also gets bigger---and likely includes some air boundaries at the edge of the 8x8 simulation grid---so the average R value of the boundary decreases. Thus, opening a door, if a fire is running inside, will cause the house to get colder. Closing the door causes it to warm up again.
Thus, we're essentially modeling perfectly even convection throughout the entire enclosed airspace.
But shouldn't standing next to a fire also warm you up, even if there are no walls at all? Yes, but that's not due to convection. There's also a radiant component in the new model, which is based on your distance from each heat source that is in your airspace (which might included everything in the 8x8 simulation grid, if you are outside). So, getting close to a heat source, indoors or out, warms you intensely (perhaps too intensely, depending on the heat source). In other words, up close, radiant. Further away in a house, convection. The effect of radiant heat becomes negligible beyond a few tiles away.
Next, the biome effect is based only on the tile that you're currently standing on, and it's added into the heat calculation after the heat at your tile is computed based on heat sources and walls. If you're in an enclosed airspace, the biome heat contribution is modulated by the average R value of the airspace boundary, but only if the entire airspace also has floors. This means that an enclosed house with a floor can make a hot biome cooler, and a colder-than-normal biome, like the polar biome, warmer.
Next, clothes are applied in a separate part of the code, and they slow the transition from your body heat level to the environmental heat level (as computed based on walls, heat sources, and biome). If you're naked, you change temperatures pretty quickly. If you're fully clothed, you change temperatures very slowly. Thus, you can warm up in a house, near a fire, until you are just right, and then put on clothes before a journey to "hold it in" for a long time, and keep yourself close to perfect along the way.
And finally, the hard part: biome boundaries. As the new system is described so far, the old boundary-blending issue is fixed (because only your current biome tile contributes to your heat equation, without blending), but an exploit is still possible: by jumping back and forth across a boundary, between a hot and cold biome, you could warm yourself up to perfect temperature without fire, clothes, or walls.
So, I added a system for thermal shocks. This occurs whenever you go from a too-cold biome into a too-hot biome, or vice versa. Your temperature instantly jumps from the cold side of the scale to the hot side, right to the new biome's target temperature (or from hot to cold, if crossing the other way). This shock effect is also modulated by clothing. More and better clothing reduces the magnitude of this shock. Furthermore, the shock is never allowed to bring you closer to perfect on the other side of the temperature scale than you were before crossing. So if perfect is 0.5, and you were at 0.3, you will jump to at least 0.7 when you cross into a hot biome, no matter what clothes you are wearing (if you're naked, you might jump all the way up to 0.9, though, so clothing still helps).
This means that you can never improve your food consumption rate by crossing between hot and cold biomes. In the very best case, your consumption rate will remain the same, but it will usually get a bit worse (and if you're naked, it might get a lot worse).
There is also still a small body heat effect inside clothing, so in a cold biome, clothing will gradually warm you up over time. This effect is somewhat larger than it was before. The general idea is that, in cold biomes, clothing gets you 1/3 of the way to perfect, while fire and walls take you the rest of the way there. If you actually want to work in one area and remain at a perfect temperature the entire time, you're going to need all three bits of technology.
One other problem in the old system was that the desert, while hot, was not as hot as the other biomes were cold. The jungle was too close to perfect, and the mosquitoes didn't offer enough of a trade-off. So the jungle is now as hot as the other biomes were cold (moving between prairie and jungle now results in no change to your hunger rate), while desert is now as hot as the polar biome is cold. You've always been freezing to death in the snow, and you are now cooking in the desert. Think of it like hot snow.
The other biomes remain unchanged for the naked player. Thus, the game isn't really any harder now than it was before, unless you count the loss of the desert-boundary exploit as making the game harder (yes, that was easy, but the game was never supposed to be easy like that). Clothing and walls are so much more helpful now, that the game might even be easier, ignoring the old exploit.
Here's hoping that the new system leads players toward advanced civilizations full of heated buildings and clothed residents.
|Update: Client Lag Fix|
February 10, 2019
What made this bug so hard to find and fix was the fact that it affected so few people, relatively speaking. However, for the affected people, it affected them all the time, and pretty much ruined the game for them.
The symptom: in busy areas, apparent network lag would grow and grow, resulting in up to twenty seconds of delay between trying to do something (like pick a berry) and have the action resolve (like have the berry in your hand). On its face, this sounds like classic network lag. The first thought is that the server isn't keeping up with demand. However, other people playing in the same area were not experiencing lag. In fact, the affected player would often ask about lag, in-game, and be told by others that there was no lag for them. Also, if the server was being bogged down, the lag would be experienced everywhere in the game world, not just in busy areas, because all areas are processed in the same loop.
Maybe they were in a remote part of the real world. Maybe they were on spotty WiFi. The problem would often clear itself up instantly if they walked out of the busy areas. And certainly, the server is sending them fewer messages out there, because it filters the messages based on what is relevant to your location. In a busy city, you need to receive a lot of information, because so many people are walking around. In the wilderness, there's much less change happening. So this symptom was generally consistent with network lag.
A while back, I built a /PING and /FPS command into the game, so that people could run tests if they were experiencing lag. Sure enough, during these lag situations, ping times would balloon. Normal ping times in the US are below 100ms, and not more than 400ms anywhere in the world. But during lag, the ping would grow to five, ten, or even twenty seconds. That's really bad, and probably transcends any normal network lag.
And for these people, things have only gotten worse when we moved everyone to bigserver2. Big cities are much more common, so many of the affected people were experiencing unplayable lag almost every life. Of course, for everyone else---those who never experienced lag---bigserver2 was great.
But finally, almost miraculously, I experienced this issue myself for the first time this week. A unicorn! I was playing in a busy city, on my slow dev laptop with a weak GPU, and sure enough lag. Bad lag. Really bad lag. My in-game ping time grew to more than 14 seconds. The game was totally unplayable.
During this time, I also noticed that my FPS dropped from around 60 down to 40 or so. Frame rate and network lag aren't necessarily related, but my lag was very hard to reproduce---it would come and go seemingly at random, even in the big city, depending on where I walked---and it seemed to be correlated with this drop in FPS.
I set up a chaotic Eve-only city on bigserver2 on Friday to conduct a real stress test. 120 players all spawning in the same spot (0,0) is no joke, and I could very consistently trigger lag on my slow dev laptop.
I also found that my gaming rig would not see lag in the same area, but it is running at a solid 85 FPS (odd, I know, but it's a CRT). So, same network, different CPU and GPU, higher FPS, no lag. So yeah, with proper hardware, the client can easily handle 120 players all in the same area. It was chaos, but buttery smooth chaos.
Someone pointed out that outside-game-pings (using the command line) aren't necessarily slow during an in-game lag, and I was able to confirm this. Someone else suggested that I sniff the raw network packets and figure out exactly how quickly the server was responding to my PING with a PONG---just to rule out server-side lag. Sure enough, while my client took 14 seconds to register the PONG, the PONG arrived on the network within the normal 70 ms, even on the slow dev laptop. There was some kind of networking issue inside the client.
I spent quite a bit of time testing my underlying networking code and looking for reasons that network messages might get backed up, but found no issue in isolated network tests. I also considered some kind of kernel networking issue (my laptop is running Linux, while my gaming rig tests were on Windows7). No dice.
Meanwhile, someone else had been able to pinpoint the exact problem in the client, and they posted their fix in an old, lingering Github issue. Finally, someone drew my attention to this fix, which was rather hidden on the Github side.
JRuldolf, we all owe you one!
Turns out that this problem has been with us since an update back in October, before the Steam release, when message frames were added. A frame groups messages together that occur during the same server processing step, forcing the client to wait to react to any of these messages until all the messages in the frame arrive. This prevents, for example, a message about a map change from being processed before the matching player update is received (if a player dumps a bowl of water into a bucket, the bucket on the map changes, and so does the bowl in their hand, and there are two separate messages, but they only make sense if they occur client-side at the same time).
This frame system was great, and fixed a heap of potential inconsistencies in client behavior.
However, there was also a bug in the way that frames were processed. Each client step (each rendering frame), the client would read the next message and check if it was an end-of-frame message. If not, it would put the message in the holding queue and go on to the next rendering step.
You can see how this can cause trouble when message frames contain more and more messages (which they do in busy areas): a frame with five messages takes at least five client frames to fully receive, even if all five messages have arrived, because we only check one message per frame. Once the 6th message is checked, the end of frame message, we call the frame ready, and process all five messages together.
What we need to do, instead, is loop as long as any messages are available, checking for the end-of-frame message, but if it's not there, continuing on to the next message, until no more received messages are available. Thus, we process all received messages every client frame, regardless of how long the message frame is. This even allows us to process multiple server message frames on a single client rendering frame, if several server frames are waiting client-side.
If we don't do this, during times with high message rates and large, multi-message frames, we can see how a message backlog would build up. Assuming, of course, that more than 60 messages were arriving per second.
And if the FPS drops on top of that, you can see how it would get even worse, because we are running even fewer processing steps per second. So players with weaker GPUs were pretty much experiencing the perfect storm in busy areas. Lots more messages, and a slower client-side rendering loop that was effectively only processing one message per rendering frame.
The fix was literally two lines, putting a loop in there where it should be.
And suddenly, the client could handle the very busiest areas with absolutely no network lag. Even if I artificially reduced the frame rate to 5 FPS, the game was completely playable in busy areas (yes, it was choppy, but each action was executed instantly, with no lag). Before the fix, such a low frame rate would spell disaster in a busy area.
Now, how did such a devastating yet simple bug go unnoticed for so long? Well, as long as the frame rate is high enough, and the incoming message rate is low enough, it generally doesn't matter. We're processing at least one message every frame, and 60 messages a second is a lot, so we usually keep up, even if we don't process all available messages as soon as we have them. I didn't write the code this way on purpose---the original code, before message frames were added, intentionally processed all available messages every rendering frame. But the implementation of message frames quietly subverted this intention.
The move to bigserver2 made this very rare bug less rare, because the cities got bigger, and the message rate higher, causing slightly more people to experience the issue. Including, finally and thankfully, me.
Bug fixes take a long time, but they are worth it. More bug fixes next week. The plan is to get clothing and heating working in a more sensible way.
|Update: The Miracle of Flight|
February 1, 2019
The early days of flight were fraught with uncertainty and peril. Instruments? Who needs instruments? We're talking VFR, folks. Pick a direction, and hope you can find a safe spot to land.
My father is a pilot of small planes. When I was growing up, he used to take me to the Wadsworth Municipal Airport on Saturdays to pal around with his pilot buddies at their hangars. Sometimes we'd take short trips, just for fun, to some other municipal airport nearby. The diner at Carrol County Airport served great pies. But forget about the pies---I got to fly! As a little kid, he'd stick me in the copilot seat and let me take the yoke from time to time.
There were a number of pilot sayings from that era that stuck with me. My father had a few of these on placards in his hangar.
"There are old pilots, and there are bold pilots, but there are no old, bold pilots."
"The air, like the sea, is very unforgiving of an error."
And one of the nearby airports had "Steve's Weather Rock," a 20-pound hunk of granite on a chain, hanging outside of its administration building, with a sign that read:
"If it's wet, it's raining."
"If it's white, it's snowing."
"If it's swinging, it's windy."
"If it's hanging straight out, it's too windy to fly."
Keep that in mind as you take to the sky.
|Update: Big Server|
January 26, 2019
What we had: players spread out onto three or four servers for load-balancing purposes. During peak times, this was necessary to prevent any individual server from becoming too overloaded. During off-peak times, we kept sending players to all the previously-active servers to avoid any one server dying out unfairly (see the earlier Population Stabilization update). But this meant that during off-peak times, even with plenty of people still playing, the population on each server got a little thin.
What we want: everyone playing on one server, together, all the time.
The problem: CPU overload when populations get high results in lag for players, not to mention Linode sending me warning emails (these server nodes are virtual servers co-hosted on multi-core machines---I don't want to be a bad neighbor to other users who have virtual servers on the same host machine).
It has been a long time since I examined this problem in detail, so I wasn't really sure where the issue was, or if there even was an issue anymore. I was keeping the server population caps relatively low to avoid lag at all costs while I worked on other things.
So, I needed to do some stress-testing and some profiling. Server1, with its ancient, gigantic map that has maybe only been wiped once in the past eight months, was historically the biggest offender in this department, so it made the perfect candidate for a stress test. How many people can we put on there before it chokes?
Does the database engine need another overhaul?
Well, it turns out that with the existing database engine (which was written from scratch for our purposes and heavily optimized by me many months ago), we could pretty much house all the active players on server1 with no player lag. CPU usage, however, was going above and beyond what keeps Linode happy, though. At one point, our externally-monitored CPU usage was over 120%.
How is that possible? Well, it turns out that a virtual CPU consumes additional CPU resources on its host CPU, apparently overhead from the virtualization process itself. So, while I was seeing server1 sitting happily at 60% internally, it was well over 100% as far as Linode was concerned.
By running a busy-wait test program in parallel with server1 on the same node, I was able to push my internal CPU (viewed through top) up to 100%, and that brought Linode's CPU measurement up to 140%. Yikes. This likely means that my virtual server is so resource-hungry that the virtualization process is itself consuming resources from more than one physical core. I'm not sure of the details here, but that's my best guess.
Regardless, we want to steer WAY clear of 140%.
But the lack of lag when 170 players were together on the usually-bedraggled server1 was promising.
Were there any unnecessary hot spots left in the code that could be eliminated? Maybe the database engine needs to be rewritten again. Keeping the database in RAM is one idea that might speed things up, but who knows?
This is where profiling is supposed to help.
But existing profilers do a notoriously poor job at measuring actual performance issues in I/O-bound processes. My server is likely spending a lot of time waiting for data from the disk. Asleep, essentially. Not running code, in the way that a profiler might measure, but still slow.
After testing every profiling tool under the sun, and finding nothing that worked for this purpose, I ended up writing my own. More details about that, and proof that it works, and examples of why other profilers don't work, can be found here:
Profiling a toy program with a toy profiler is one thing, but profiling an extremely complex, multi-faceted server process is quite another. This made an excellent test case that helped me actually turn my toy profiler into a working, useable tool. At some point along the line, I realized that the text data that the profiler was outputting (essentially annotated stack traces) was too tedious to read through by hand, so I even wrote a conversion program that allows the resulting profile to be viewed in the Kcachegrind profile visualizer.
With all that working, here is a rough visualization of where server1 was spending its time while hosting 155 simultaneous players:
Now, before you tell me that I've lost my mind, let me reassure you that such an image isn't all that useful in practice. It's just the best way to quickly represent the complexity of the profile visually. In reality, I'm looking at sorted lists of functions and the amount of samples that hit each function. But a screen shot of that doesn't make for a very interesting picture.
Anyway, from that image, we can see what looks like a pretty "clean room." That big "empty space" in the middle is indeed empty space: time the server spent waiting on epoll for incoming client messages. We're doing that 54% of the time. The rest of the clutter around the edges of the room is actual work being done.
The biggest forehead-slapper in the profile, which can actually be seen here in this image, is the 12% of our running time spent on recomputeHeatMap. This is the bit of code that examines the environment around you to determine how cold you are (the thermal propagation simulation). This is an expensive bit of code to run, but it's only supposed to be updated for two players every server step (thus spreading the load), so what's going on here?
It turns out that the wall-clock duration of a "server step" varies depending on the rate at which messages are arriving. Big gaps between messages means the server sleeps longer before executing the next step. Short gaps mean many steps happen in a short time. The server is intentionally player-reactive in this way, actually using almost no resources at all if no on is logged in.
Checking the logs, I found that with such a huge population of players, with such a high inbound player message rate, the server step was being run something like 65 times per second. Yikes. Not only did this result in excessive calls to recomputeHeatMap (recomputing maps for something like 130 players every second, which isn't even useful), there were a bunch of other regular-interval parts of the server step that were being triggered 65 times per second as well. We don't need to check whether a player's curse score is decremented 65 times a second, for example.
After finding the parts of the server step that weren't necessarily reactive, I put them on fixed timesteps so that they would only run if enough time has passed, not every single step. Heat maps are now limited to 20 players per second, max, for example, regardless of how quickly messages are coming in.
The results are pretty dramatic. Here's the new profile picture, after these changes, with about 150 players on server 1:
And here's a 30-minute monitor graph of both old and new (sampled every 5 seconds, for 360 samples total):
Yes, that's around half the CPU used per player now. This should allow us to double the number of players that occupy a given server.
But even so, when we start getting above 60% internal CPU, external resource consumption can get up into the 90% range, which does not make Linode happy.
However, they did inform me that 2-core nodes (which are more expensive) are allowed to go up to 160% utilization, and 4-core nodes are allowed to go up to 320% utilization.
The server code is single-threaded, so it can't take advantage of more than one physical core directly, but the external resource consumption from virutalization, including disk access and so on, apparently can.
So, today I introduce a brand new server on the front line: bigserver1.onehouronelife.com
2 cores, 4x the RAM, a bigger disk, and a bigger upstream network pipe. Most of these extra resources aren't needed, but the extra core may help with external resource usage. Four times the cost, though. Is it worth it? How many players can we put on this sucker before it starts to choke?
To give you a taste of the difference between internal and external resource consumption on a virtual server, bigserver1 currently has 155 players on it. Internally, in top, it is using less than 1% of its CPU. Something around 0.3%, to be exact. Hard to believe, but true. A fresh---and tiny---map database likely helps with this, for sure.
But externally, as far as Linod is concerned? 50% CPU. Granted, I can safely go up to 160%, but still, 50% is way different than 0.3%. My external networking and disk access graphs are relatively high, though, and my guess is that some of those aspects contribute to external CPU usage. Again, my guess is that the process of virtualizing networking and disk involves extra host CPU operations that wouldn't be necessary on non-virtual hardware.
As another example, if I run a pure-CPU test process that busy loops, I see both 100% internally and externally, but that's a process that isn't touching the disk or network at all.
So, over the next few weeks, we'll see where bigserver1 can take us, in terms of a large population of players all in one cohesive world.
|Update: Romeo Charlie|
January 18, 2019
I will let this week's GIF speak for itself.
This is Jason Rohrer, signing off for now.