|Update: Bison and Butter|
August 11, 2018
This is the first substantial content update in a while.
The past few weeks have had me focused on rooting out the causes of server lag issues. After I tested various solutions on Server1, the issues were finally resolved. Earlier this week, the final replacement database engine was rolled out to all fifteen servers, allowing us to push the player population cap up to 200 per server.
After that, I returned my focus to content. I've been wanting to add more animal husbandry for a while, and this update gives you a new animal to domesticate. And what good is butter if you don't have anything to spread it on?
|Update: All servers are now running the new, low-latency database|
August 8, 2018
After testing for a bit on Server1, the new database engine has been rolled out to all fifteen servers, and the player cap on each server has been pushed up to 200. CPU and disk usage on the servers is looking good.
Server lag should be a thing of the past.
I'm finally returning to working on new content for the game. There will be some kind of content update at the end of this week.
|Update: Even newer, even-lower-latency database engine is live|
August 4, 2018
You probably feel like you've heard about this issue too much already.
But making a lag-free server is a top priority, and getting the database engine right is the most important part.
Last week, I introduced a custom-coded database engine that made null-lookups (when you find out that nothing is there on the map) extremely fast. These are the most common database actions---imagine someone walking around in the wilderness, and we need to find out that nothing human-made is on each and every map cell that they're exploring. That's a lot of null-lookups.
However, the architecture of this new database engine also made inserts---a much less common operation---quite a bit slower. The hash table is spread across the data file, and that means that newly inserted data is spread randomly across the data file. KissDB and my previous replacement StackDB did not distribute the data across the disk in hash table order, but I never thought about why.
It turns out that writing in a bunch of random locations in a file is really slow, because it causes cache misses constantly. So it's best to write all new data at the end of the file, in an arbitrary order, instead of treating the file like one big hash table.
Of course, this doesn't matter most of the time.
Except when loading the tutorial map into the world for a player. At that moment, we insert thousands of new records into the database. This was taking something like 8 seconds on Server1 with the new database engine. That's an 8-second lag for EVERYONE every time any player loads the tutorial map.
The latest database engine does things quite a bit differently, keeping the entire hash table in RAM and keeping data on the disk in the order that they are inserted. Thus, when a big sequence of inserts happens, like when the tutorial is loaded into the world for someone, all of those inserts happen in order at the end of the file, without a single file seek along the way.
And, given that the entire table is kept in RAM, null-lookups, and all other operations, are substantially faster than in any previous database engine.
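If you want a feel for the design, here's a toy sketch in Python. The real engine is custom C code, and the class and names below are purely illustrative, but the two key ideas are the same: the whole key-to-offset table lives in RAM, and every insert lands at the end of the data file.

```python
import os

class AppendOnlyDB:
    """Toy sketch of an append-only store with an in-RAM index.
    Illustrative only -- not the game's actual engine."""

    def __init__(self, path):
        self.index = {}             # the entire "hash table" lives in RAM
        self.f = open(path, "wb+")  # append-only data file

    def put(self, key, value):
        # every insert lands at the current end of the file, so a
        # burst of inserts (like loading the tutorial map) writes
        # sequentially, never seeking back into the file
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(len(value).to_bytes(4, "big") + value)
        self.index[key] = offset

    def get(self, key):
        # a null-lookup is answered from RAM without touching disk
        if key not in self.index:
            return None
        self.f.seek(self.index[key])
        n = int.from_bytes(self.f.read(4), "big")
        return self.f.read(n)
```

A lookup for an empty map cell returns immediately from the RAM index, and a thousand-record tutorial load is just a thousand sequential appends.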
The result of all this is that Server1 is back online and hopefully more lag-free than ever. 43 players are currently on it, and it's only using 7% CPU for that load.
There's still a small bit of optimization work to be done. When the map database is huge, as it is on Server1, tutorial map inserts are still a bit slower than I would like them to be (a little over 1 second), but given that it's 9pm on Friday night, and my kids are waiting to play League of Legends with me, that sounds like something I will tackle on Monday morning.
And yes, the bison is coming soon... promise!
|Update: New low-latency database engine is live|
July 28, 2018
With weary coding fingertips I type to let you know that a very long week has paid off.
More profiling of the bedraggled Server1, which has the largest map data set of any server, revealed that file IO inefficiencies in the custom-coded StackDB were to blame. StackDB was designed to quickly answer questions about recently-accessed map cells---assuming that people in cities are often looking at the same stuff, so that stuff should be kept near the top of the stack. The old off-the-shelf KissDB did not do that, meaning that the newest stuff was the slowest to access as the data set grew.
However, none of these optimizations addressed what is actually the most common case: asking about a map cell that isn't in the database at all. When you wander around in the wilderness and look at the empty ground, we have to ask the database to confirm that that patch of ground is indeed empty. Maybe someone visited that spot earlier and dropped something there that you should see.
It turns out that in both KissDB and StackDB, this is the slowest operation of all. A non-existent cell can never be at the top of the stack, because it doesn't exist, which means that we need to walk to the bottom of the stack to find out for sure that it doesn't exist.
Finally, KissDB and StackDB are both hash table systems, but both of them use fixed size hash tables. In the case of Server1, there were 15 million data records crammed into an 80,000 slot hash table. This means lots of pages to look through in each slot (KissDB) or deep stacks to wade to the bottom of (StackDB) to find out that a given map cell really isn't there, and therefore is empty.
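The back-of-envelope arithmetic on those numbers makes the problem obvious. Assuming records hash uniformly across the fixed table:

```python
records = 15_000_000   # map records on Server1
slots   = 80_000       # fixed hash table size

depth = records / slots
# ~187 records chained behind every slot, on average
```

So even a lookup that ends in "nothing there" has to wade through nearly 190 entries on average, and with the data scattered across the file, many of those steps are separate disk seeks.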
Even worse, the architecture of both engines requires loads of random-access disk seeks to move through the pages or the stack. And disk seeks are extremely slow, relatively speaking, especially when they are jumping around a huge file and missing the cache over and over.
LinearDB, my latest custom-coded database engine, has an ever-expanding hash table based on a very clever algorithm---which I did not invent---called Linear Hashing. The hash table grows gradually along with the data, essentially never letting it get too many layers deep. In addition, a kind of "mini map" of data fingerprints is kept in RAM, allowing us to ask questions about map cells that don't exist without touching the disk at all.
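Here's a minimal in-RAM sketch of the Linear Hashing idea in Python. LinearDB itself is custom C code with the table fingerprints and disk layout described above; the names and thresholds below are just for illustration. The point is the growth rule: when the table gets too loaded, exactly ONE bucket is split, so no insert ever pays for a full-table rehash.

```python
class LinearHashTable:
    """Toy sketch of Linear Hashing: the table doubles gradually,
    one bucket split at a time, as data is inserted."""

    def __init__(self, initial_buckets=4, max_load=2.0):
        self.n0 = initial_buckets   # buckets at level 0
        self.level = 0              # current doubling round
        self.split = 0              # next bucket to split
        self.buckets = [[] for _ in range(initial_buckets)]
        self.count = 0
        self.max_load = max_load

    def _addr(self, key):
        h = hash(key) % (self.n0 * 2 ** self.level)
        # buckets before the split pointer have already been split
        # this round, so they use the next round's hash function
        if h < self.split:
            h = hash(key) % (self.n0 * 2 ** (self.level + 1))
        return h

    def put(self, key, value):
        self.buckets[self._addr(key)].append((key, value))
        self.count += 1
        # grow gradually: split one bucket when load gets too high
        if self.count / len(self.buckets) > self.max_load:
            self._split_one()

    def _split_one(self):
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.buckets.append([])     # the split bucket's new sibling
        self.split += 1
        if self.split == self.n0 * 2 ** self.level:
            self.split = 0          # round complete: table has doubled
            self.level += 1
        for k, v in old:            # re-home only the split bucket's entries
            self.buckets[self._addr(k)].append((k, v))

    def get(self, key):
        for k, v in self.buckets[self._addr(key)]:
            if k == key:
                return v
        return None                 # a shallow, fast null-lookup
```

Because the bucket count tracks the record count, the chains stay short no matter how big the map database grows, instead of 15 million records piling up behind 80,000 fixed slots.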
The performance gains here are pretty astounding.
On a simple benchmark where a single player walks in a straight line through the wilderness for a minute, the old database engine performed 1.9 million disk seeks and 3.8 million disk reads.
During the same single-player journey, the new database engine performed fewer than 4700 seeks and 1600 reads.
Yes, that's roughly a 427x reduction in disk seeks and a 2400x reduction in disk reads.
According to a system call time profiler, this results in approximately 180x less time spent waiting for the disk. In other words, this part of the server is now one hundred and eighty times faster than it used to be.
Server1 is the only server that has this new engine installed so far. It has the biggest data sets and was seeing the most lag with the old database, so it's the best server to stress test with. It's now back in circulation at the top of the list and seems to be lag-free. I'll be incrementally increasing the player cap over the next few days and seeing how it handles the load.
Assuming that all goes well, I will be rolling the new database engine out to the other servers next week. The end of server lag is almost in sight.
A big thanks goes out to sc0rp, who spent many hours discussing the intricacies of these systems with me in the forums, and filled my head with all sorts of great ideas. I had never even heard of linear hashing until sc0rp told me about it.
|Update: Server lag optimization and client improvements|
July 21, 2018
With the recent influx of new players to the game, the servers have been struggling to keep up.
Several months ago, I spent a lot of time on server database optimization, which made the servers around seven times more efficient than they were originally. But as maps fill up with player content, there's more and more information to process, and the load generally grows over time. In the past week, Server1, which has the most extensively settled map, had gotten very laggy. It was time to take another look.
This round of profiling revealed a bunch of hot spots that weren't database-related. The optimization process involves running a server with a profiler (I use the amazing Callgrind profiler from the Valgrind project), finding the most obvious hotspot, figuring out if there's a way to speed it up or---even better---skip that operation entirely, and then repeating with a new build to find the next biggest hotspot. I ended up going through this process nine times, fixing nine hotspots along the way. Some of these changes resulted in 7x speedups to certain parts of the server code.
However, even after several days of intensive work that showed huge performance gains in the profiler, when I finally brought Server1 back online for public use under a load of 38 players, the lag returned. There are more issues afoot with Server1 than just slow code. CPU usage jumped back up to 70%, while Server3 sits happily at around 20% with exactly the same player load.
So, Server1 and Server2 will remain "on ice" over the weekend, at the end of the server list where no one will use them by default, until Monday when I will resume diagnosing Server1's lag issues.
You may have also noticed that the connection management features of the client have been greatly improved. You can now specify a custom server from the SETTINGS screen, and copy/paste server addresses to share with friends. Bugs in twin matching have also been resolved, so joining twin games is now reliable. Some issues that caused logins to fail have also been fixed.