Update: Big Server

jasonrohrer · 2019-01-26 07:12:57

What we had: players spread out onto three or four servers for load-balancing purposes. During peak times, this was necessary to prevent any individual server from becoming too overloaded. During off-peak times, we kept sending players to all the previously-active servers to avoid any one server dying out unfairly (see the earlier Population Stabilization update). But this meant that during off-peak times, even with plenty of people still playing, the population on each server got a little thin.

What we want: everyone playing on one server, together, all the time.

The problem: CPU overload when populations get high results in lag for players, not to mention Linode sending me warning emails (these server nodes are virtual servers co-hosted on multi-core machines---I don't want to be a bad neighbor to other users who have virtual servers on the same host machine).

It has been a long time since I examined this problem in detail, so I wasn't really sure where the issue was, or if there even was an issue anymore. I was keeping the server population caps relatively low to avoid lag at all costs while I worked on other things.

So, I needed to do some stress-testing and some profiling. Server1, with its ancient, gigantic map that has maybe only been wiped once in the past eight months, was historically the biggest offender in this department, so it made the perfect candidate for a stress test. How many people can we put on there before it chokes?

Does the database engine need another overhaul?

Well, it turns out that with the existing database engine (which was written from scratch for our purposes and heavily optimized by me many months ago), we could pretty much house all the active players on server1 with no player lag. CPU usage, however, was going above and beyond what keeps Linode happy, though. At one point, our externally-monitored CPU usage was over 120%.

How is that possible? Well, it turns out that a virtual CPU consumes additional CPU resources on its host CPU, apparently overhead from the virtualization process itself. So, while I was seeing server1 sitting happily at 60% internally, it was well over 100% as far as Linode was concerned.

By running a busy-wait test program in parallel with server1 on the same node, I was able to push my internal CPU (viewed through top) up to 100%, and that brought Linode's CPU measurement up to 140%. Yikes. This likely means that my virtual server is so resource-hungry that the virtualization process is itself consuming resources from more than one physical core. I'm not sure of the details here, but that's my best guess.

Regardless, we want to steer WAY clear of 140%.

But the lack of lag when 170 players were together on the usually-bedraggled server1 was promising.

Were there any unnecessary hot spots left in the code that could be eliminated? Maybe the database engine needs to be rewritten again. Keeping the database in RAM is one idea that might speed things up, but who knows?

This is where profiling is supposed to help.

But existing profilers do a notoriously poor job at measuring actual performance issues in I/O-bound processes. My server is likely spending a lot of time waiting for data from the disk. Asleep, essentially. Not running code, in the way that a profiler might measure, but still slow.

After testing every profiling tool under the sun, and finding nothing that worked for this purpose, I ended up writing my own. More details about that, and proof that it works, and examples of why other profilers don't work, can be found here:

https://github.com/jasonrohrer/wallClockProfiler

Profiling a toy program with a toy profiler is one thing, but profiling an extremely complex, multi-faceted server process is quite another. This made an excellent test case that helped me actually turn my toy profiler into a working, useable tool. At some point along the line, I realized that the text data that the profiler was outputting (essentially annotated stack traces) was too tedious to read through by hand, so I even wrote a conversion program that allows the resulting profile to be viewed in the Kcachegrind profile visualizer.

With all that working, here is a rough visualization of where server1 was spending its time while hosting 155 simultaneous players:

Now, before you tell me that I've lost my mind, let me reassure you that such an image isn't all that useful in practice. It's just the best way to quickly represent the complexity of the profile visually. In reality, I'm looking at sorted lists of functions and the amount of samples that hit each function. But a screen shot of that doesn't make for a very interesting picture.

Anyway, from that image, we can see what looks like a pretty "clean room." That big "empty space" in the middle is indeed empty space: time the server spent waiting on epoll for incoming client messages. We're doing that 54% of the time. The rest of the clutter around the edges of the room is actual work being done.

The biggest forehead-slapper in the profile, which can actually be seen here in this image, is the 12% of our running time spent on recomputeHeatMap. This is the bit of code that examines the environment around you to determine how cold you are (the thermal propagation simulation). This is an expensive bit of code to run, but it's only supposed to be updated for two players every server step (thus spreading the load), so what's going on here?

It turns out that the wall-clock duration of a "server step" varies depending on the rate at which messages are arriving. Big gaps between messages means the server sleeps longer before executing the next step. Short gaps mean many steps happen in a short time. The server is intentionally player-reactive in this way, actually using almost no resources at all if no on is logged in.

Checking the logs, I found that with such a huge population of players, with such a high inbound player message rate, the server step was being run something like 65 times per second. Yikes. Not only did this result in excessive calls to recomputeHeatMap (recomputing maps for something like 130 players every second, which isn't even useful), there were a bunch of other regular-interval parts of the server step that were being triggered 65 times per second as well. We don't need to check whether a player's curse score is decremented 65 times a second, for example.

After finding the parts of the server step that weren't necessarily reactive, I put them on fixed timesteps so that they would only run if enough time has passed, not every single step. Heat maps are now limited to 20 players per second, max, for example, regardless of how quickly messages are coming in.

The results are pretty dramatic. Here's the new profile picture, after these changes, with about 150 players on server 1:

And here's a 30-minute monitor graph of both old and new (sampled every 5 seconds, for 360 samples total):

Yes, that's around half the CPU used per player now. This should allow us to double the number of players that occupy a given server.

But even so, when we start getting above 60% internal CPU, external resource consumption can get up into the 90% range, which does not make Linode happy.

However, they did inform me that 2-core nodes (which are more expensive) are allowed to go up to 160% utilization, and 4-core nodes are allowed to go up to 320% utilization.

The server code is single-threaded, so it can't take advantage of more than one physical core directly, but the external resource consumption from virutalization, including disk access and so on, apparently can.

So, today I introduce a brand new server on the front line: bigserver1.onehouronelife.com

2 cores, 4x the RAM, a bigger disk, and a bigger upstream network pipe. Most of these extra resources aren't needed, but the extra core may help with external resource usage. Four times the cost, though. Is it worth it? How many players can we put on this sucker before it starts to choke?

To give you a taste of the difference between internal and external resource consumption on a virtual server, bigserver1 currently has 155 players on it. Internally, in top, it is using less than 1% of its CPU. Something around 0.3%, to be exact. Hard to believe, but true. A fresh---and tiny---map database likely helps with this, for sure.

But externally, as far as Linod is concerned? 50% CPU. Granted, I can safely go up to 160%, but still, 50% is way different than 0.3%. My external networking and disk access graphs are relatively high, though, and my guess is that some of those aspects contribute to external CPU usage. Again, my guess is that the process of virtualizing networking and disk involves extra host CPU operations that wouldn't be necessary on non-virtual hardware.

As another example, if I run a pure-CPU test process that busy loops, I see both 100% internally and externally, but that's a process that isn't touching the disk or network at all.

So, over the next few weeks, we'll see where bigserver1 can take us, in terms of a large population of players all in one cohesive world.

lionon · 2019-01-26 07:51:30

Awesome, one question about virtualization. If virtualization is an issue, can't you afford a dedicated server given the scale of the game by now?

They are about 100-200 bucks a month. I don't know how much the 15 VPS are in comparison, but maybe this checks out to switch?

Last edited by lionon (2019-01-26 07:52:42)

Alias · 2019-01-26 08:16:28

50% virtualization overhead on CPU usage? Even when usage within your VM is below 1%??
Jason, something's seriously wrong on the Linode's part and it would be good to ask them for details where this monster overhead comes from. I have never seen anything like this while managing VMware virtualization. Sure, intensive disk usage or even network (even memory) creates CPU overhead, but if your case was anywhere near "normal", virtualization would never be such a great tech.

Those stats show most certainly problem on their part and they should be more eager to work on their side instead of throwing it on you. I don't want to jump to any conclusions but it would be embarrassing to show clients that their 1% CPU usage in VM generates 50% usage by hypervisor - without any cause presented.

I suppose warning emails are courtesy on their part not to to cut you down to contract quotas immediately, right? Because you can set quotas, limits etc to ensure noone gets to little or too much.

There are so many things that can be wrong here. But one thing good for single thread application (when it runs efficiently) is that it's single core VM can be more easily squized into hosts CPU, while e.g. 4 vCPU VM needs to wait for 4 cores available - when there is resource contention between VMs. While going from 1 to 2 cores might be not big change in this, now your VM would be subject to (in VMware terms at least) CPU co-stop - situation when one vCPU is ready to go, but another is pending resources from the host. You don't want to do that on single thread application. Even when one vCPU does heavy lifting, both need to be scheduled on physical cores for VM to run.
That's why increasing vCPU count on VMs that don't efficiently use it usually, counterintuitively, decreases performance, if there is any contention for resources between VM (and there always is unless someone pays extra )

Sure, they use different technology, so I don't claim it's all applicable. But from both business and virtualization tech perspective this amount of resource waste/overhead doesn't seem viable.

Last edited by Alias (2019-01-26 08:23:40)

Tarr · 2019-01-26 14:20:32

Just in your testing alone I was able to find a second living village within range of the camp I was born into. Having everyone together means that we can finally have lasting towns, strangers meeting in the wild, and both types of play can be supported (early game + late game.) I'm not sure how often I'll run into another alive camp but before the update I could pretty reliably find Eve bones from my spawn points which became a game of find the suicider.

Overall this is a great update and I'm glad you've solved the issue of too many live servers without enough live players to make them usable.

CrazyEddie · 2019-01-26 15:07:10

Congratulations and thanks for the hard work on optimization! Well done.

I second everything that Alias said above. Please engage your Linode account rep and/or their tech support staff, and don't be reluctant to escalate until you find someone who really understands the situation and can help you.

SSDarkMoon · 2019-01-26 17:29:55

you said the game server code is single-threaded,
is it very hard to transform the code to 4 threads?

lionon · 2019-01-26 20:03:12

SSDarkMoon wrote:

you said the game server code is single-threaded,
is it very hard to transform the code to 4 threads?

Without knowing the OHOL code too detailed, generally speaking. Yes!

True native multi-threading I'd really avoid unless you want to get crazy about flow control etc. Especially race condition finding is one of the most horror and time consuming and sometimes close to futile things in programming.

Personally I think the server would greatly improve from using non-blocking I/O already without the need of actual native threading and thus have "pseudo-threads" / event based, like node.js without the need to go actually multi-thread.

fragilityh14 · 2019-01-26 23:17:51

i'm having the game freeze upon starting, it did it once last night and now three times today. I get born then everyone else stops and it disconnects. Some of the time when it reconnects i'm in the same game and my mom has kept me alive.

This happening to anyone else? It never happened to me before last night.

Keks · 2019-01-27 09:13:58

fragilityh14 wrote:

i'm having the game freeze upon starting, it did it once last night and now three times today. I get born then everyone else stops and it disconnects. Some of the time when it reconnects i'm in the same game and my mom has kept me alive.
This happening to anyone else? It never happened to me before last night.

Yeah, I have that problem as well

AmyJ · 2019-01-27 19:52:14

fragilityh14 wrote:

i'm having the game freeze upon starting, it did it once last night and now three times today. I get born then everyone else stops and it disconnects. Some of the time when it reconnects i'm in the same game and my mom has kept me alive.
This happening to anyone else? It never happened to me before last night.

Same with me. Glad I'm not alone! Things are usually fine for about 10 seconds, and then everyone freezes and I eventually disconnect.

Alias · 2019-01-28 08:16:52

Same. Plus picking up, eating, speaking had 5-7 seconds delayed. Literally impossible to play.

Spoonwood · 2019-12-11 21:34:53

The bigserver system is a flawed concept. You want to grow the game signifanctly? Then you can't have all players on the same server, because the technology isn't there to handle it, as the recent sale demonstrated. So, load balancing has to happen in significant measure when it's necessary. But with the bigserver system, players get put onto server1 by load balancing, and then lack descendants, because those not checking a custom server join or re-join the game end up on the bigserver after load balancing is no longer necessary. But, that contradicts the promise implicit in

jasonrohrer wrote:

Eternal survival of one family will be possible through distant resource outposts and migration.

StrongForce · 2019-12-12 08:37:54

FFS stop necros in news!!!

Spoonwood · 2019-12-12 12:58:06

I think it's news that there may exist or does exist reason to believe that the bigserver system is a flawed concept.

StrongForce · 2019-12-12 17:20:50

Attention grabbing is what it is.

Saolin · 2019-12-12 17:25:28

I thought this was a new update and just read all that mumbo-jumbo before noticing the next post was almost a year old. C'mon spoonwood! >_<

Last edited by Saolin (2019-12-12 17:30:17)

Spoonwood · 2019-12-12 17:46:29

Again, the bigserver system is not compatible with eternal survival of one family.

One Hour One Life Forums

#1 2019-01-26 07:12:57

Update: Big Server

#2 2019-01-26 07:51:30

Re: Update: Big Server

#3 2019-01-26 08:16:28

Re: Update: Big Server

#4 2019-01-26 14:20:32

Re: Update: Big Server

#5 2019-01-26 15:07:10

Re: Update: Big Server

#6 2019-01-26 17:29:55

Re: Update: Big Server

#7 2019-01-26 20:03:12

Re: Update: Big Server

#8 2019-01-26 23:17:51

Re: Update: Big Server

#9 2019-01-27 09:13:58

Re: Update: Big Server

#10 2019-01-27 19:52:14

Re: Update: Big Server

#11 2019-01-28 08:16:52

Re: Update: Big Server

#12 2019-12-11 21:34:53

Re: Update: Big Server

#13 2019-12-12 08:37:54

Re: Update: Big Server

#14 2019-12-12 12:58:06

Re: Update: Big Server

#15 2019-12-12 17:20:50

Re: Update: Big Server

#16 2019-12-12 17:25:28

Re: Update: Big Server

#17 2019-12-12 17:46:29

Re: Update: Big Server

Board footer