LaserBlue.org



23 June 2013

Diablos: Multiplayer synchronization

Posted by Matt

For the past few years I've been working on and off on a project dubbed Diablos. Diablos is, for all intents and purposes, a clone of Blizzard Entertainment's Diablo II and is being developed for three main reasons:

One of the most interesting areas of game programming is the problem of client synchronization. Diablo II's approach, for instance, is what's often referred to as the dumb client model: each client sends its input to the server, the server integrates the input, then sends the updated game state back to each player. The problem with this model is that it is considerably difficult to extend and also requires an unholy amount of bandwidth due to the amount of state information the server must transmit. The problem is exacerbated in games of this type, where there are often large numbers of densely packed mobile units (e.g., monsters), each resulting in several discrete updates sent per second per player in range.

To solve these issues I've been exploring a method called simultaneous simulation, which I originally discovered while reverse engineering another game. The premise of this model is that if a game simulation is fundamentally deterministic then you can achieve perfect synchronization by simply feeding each simulation the same input (which is quite obvious when you think about it; games are good examples of finite state machines).

Using this model addresses both issues concisely. Firstly, it requires only the bare minimum of shared information to achieve synchronization, as the server operates as a simple broadcasting system for user input. Secondly, as only user input needs to be communicated, we are able to create an arbitrarily complex game engine without having to concern ourselves with how to encode its delta state for propagation to the clients.

Simultaneous Simulation Implementation

The theoretical requirement of this model is that each simulation must integrate the exact same input at the exact same point in time. The first part is relatively simple to deal with: have the server buffer client input and send it to clients at a frequency equal to a multiple of the simulation's [fixed] time-step. To overcome the second, we decouple the simulation from real time and explicitly advance the simulation only when frames of input (representing periods of time) are received from the server -- a sort of asynchronous lock-step.
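
A minimal sketch of the client-side loop under this scheme. All of the names here (input_frame, net_wait_for_frame, sim_integrate_input, sim_advance, the time-step value) are illustrative placeholders of mine, not anything from Diablos:

#define TIME_STEP_MS 50                    /* fixed simulation time-step, assumed */
#define MAX_PLAYERS  8

struct player_input { int player_id; int command; int target_x, target_y; };

struct input_frame {
    int input_count;
    struct player_input inputs[MAX_PLAYERS];   /* may be empty for an idle frame */
};

struct game_state;                             /* opaque; the whole simulation */

/* Placeholders for the networking and simulation layers. */
extern int  net_wait_for_frame(struct input_frame *out);
extern void sim_integrate_input(struct game_state *s, const struct input_frame *f);
extern void sim_advance(struct game_state *s, unsigned ms);

void client_loop(struct game_state *state)
{
    struct input_frame frame;
    for (;;) {
        /* Block until the server broadcasts the next frame of buffered input;
           virtual time only moves when a frame arrives, so every client
           integrates identical input at identical points in virtual time. */
        if (!net_wait_for_frame(&frame))
            break;
        sim_integrate_input(state, &frame);
        sim_advance(state, TIME_STEP_MS);
    }
}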

The result of this design is that, although the integrity of the simulation is preserved against the effects of latency, jitter in the arrival of input frames has a proportional influence on jitter in the simulation itself, causing a discrepancy of sorts between real time and our notion of virtual time.

There are multiple methods of addressing this issue, however the one I'm currently leaning towards involves maintaining a parallel view of the simulation. The purpose of this additional view is to let us hold an extrapolated version of the current game state (the same way you would for any latency-compensation technique) to predict the state of the server. The major difference here is in the actual implementation; we must preserve the information regarding the authoritative state.

Of course, this additional view is merely to provide the player with a more responsive experience and will still have to be periodically aligned with the authoritative state.
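
Continuing the sketch above, the two-view idea might look something like this (again, every name is an illustrative placeholder of mine): the authoritative state advances strictly on server frames, while a disposable predicted copy is extrapolated ahead for rendering and rebuilt from the authoritative state whenever a frame arrives.

extern struct game_state *authoritative;   /* advanced only by server input frames */
extern struct game_state *predicted;       /* throwaway extrapolation used for display */

extern void sim_copy(struct game_state *dst, const struct game_state *src);   /* placeholder */
extern void replay_unacknowledged_local_input(struct game_state *s);          /* placeholder */
extern void render(const struct game_state *s);                               /* placeholder */

void on_server_frame(const struct input_frame *frame)
{
    sim_integrate_input(authoritative, frame);
    sim_advance(authoritative, TIME_STEP_MS);

    /* Re-align the predicted view with the authority, then re-apply any
       local input the server hasn't echoed back yet. */
    sim_copy(predicted, authoritative);
    replay_unacknowledged_local_input(predicted);
}

void on_render_tick(unsigned elapsed_ms)
{
    /* Between server frames only the predicted copy moves forward. */
    sim_advance(predicted, elapsed_ms);
    render(predicted);
}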

Another pending issue is that of performance. For this model to work the client must run the entire simulation, even the parts that are of no relevance to it (such as areas of the map the client is nowhere near). The obvious solution is to partition the world-space (e.g., into a regular grid) and have the client remain synchronized only with the regions within reasonable proximity. It's a nice approach in theory, however there's one major problem: unless the game logic is constrained by these boundaries, events in inactive areas of the simulation can still influence the regions which are active (e.g., a unit provoked by another player in an adjacent region wandering into the region of another client). If you've played Diablo II you might've witnessed this phenomenon yourself, wherein your character takes damage from an "invisible" missile (often arrows). In this case it's caused by each player maintaining a 3x3-sector synchronized view of the world-space (roughly 2x2 screens worth); when a missile is generated by a unit in a sector outside this range it's capable of entering your view silently (missiles are not subject to synchronization).
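
To make the partitioning idea concrete, here's a rough sketch; the sector size and names are assumptions of mine, not Diablo II's actual values:

#include <stdlib.h>

#define SECTOR_SIZE 64                 /* world units per sector, assumed */

struct sector_coord { int x, y; };

static struct sector_coord sector_of(float world_x, float world_y)
{
    struct sector_coord c;
    c.x = (int)(world_x / SECTOR_SIZE);
    c.y = (int)(world_y / SECTOR_SIZE);
    return c;
}

/* A sector is "active" for a client if it lies within the 3x3 window centred
   on the client's own sector; only units inside active sectors are
   synchronized, which is exactly how the invisible-missile case described
   above slips through. */
static int sector_is_active(struct sector_coord client, struct sector_coord s)
{
    return abs(s.x - client.x) <= 1 && abs(s.y - client.y) <= 1;
}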

14 June 2013

A fixed-size memory allocator

Posted by Matt

Memory allocators have been a particular interest of mine for as long as I can remember. During my early adventures into this domain as a young teen I devised an allocator that has continued to serve me to this day (albeit not as exclusively as it once did). Although I would later understand it for what it really was (fixed-size, external, bitmap/tree based, with an arbitrary placement policy), at the time it was a crazy experiment.

The data structure is quite simple: given an n-sized pool of blocks there exists an n-sized array of bits, one per block. Each bit indicates the state of the associated block: set = free, clear = used. Each word-sized group of bits is then indexed within a separate array of bits, wherein each bit is set if its associated child word contains one or more set bits. This process is repeated for each word's worth of bits, figuratively going up the tree.
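
As a concrete (and assumed) layout for 32-bit words, the hierarchy might look like this; level 0 holds one bit per block and each higher level summarizes one word of the level below:

#include <stdint.h>

#define BLOCK_COUNT (32 * 32 * 32)      /* three 32-ary levels, 32768 blocks */

struct bitmap_tree {
    uint32_t level2;                    /* 1 bit per level-1 word */
    uint32_t level1[32];                /* 1 bit per level-0 word */
    uint32_t level0[BLOCK_COUNT / 32];  /* 1 bit per block: set = free */
};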

The key to using this data structure is the bsf instruction from the x86 instruction set. If you're unfamiliar with bsf, it's the bit-scan-forward instruction: it computes the index of the first set bit of the given operand, e.g., for 0x80000001 the result would be 0 (if the operand is 0 the result is undefined, however the ZF flag is set). Hopefully this clears up the reason for such a seemingly arbitrary unit size as a word.

So now, to allocate a block of memory you start at the top of the tree and recursively search for the next set bit at each level, and once you find the final child bit, clear it. Importantly, if the word containing the cleared bit is now zero you must propagate this information up the tree as far as necessary. The actual pool index calculation can be done in one of two ways: either you calculate it relative to the base of the bottom-level bit array, or you derive it from your traversal through the tree; I'm pretty sure I did it the latter way but I can't remember why (probably because I was being a smartass).
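
Continuing the layout sketched above, a compilable reconstruction of the allocation path (this is my sketch, not the original source; the GCC/Clang __builtin_ctz intrinsic stands in for bsf):

static int bitmap_tree_alloc(struct bitmap_tree *t)
{
    if (t->level2 == 0)
        return -1;                               /* pool exhausted */

    unsigned i1 = __builtin_ctz(t->level2);      /* which level-1 word has a free child */
    unsigned i0 = __builtin_ctz(t->level1[i1]);  /* which level-0 word has a free block */
    unsigned w  = i1 * 32 + i0;                  /* index of that level-0 word */
    unsigned b  = __builtin_ctz(t->level0[w]);   /* which block within the word */

    t->level0[w] &= ~(1u << b);                  /* mark the block used */
    if (t->level0[w] == 0) {                     /* word now empty: propagate upwards */
        t->level1[i1] &= ~(1u << i0);
        if (t->level1[i1] == 0)
            t->level2 &= ~(1u << i1);
    }
    return (int)(w * 32 + b);                    /* pool index relative to the base array */
}

static void bitmap_tree_free(struct bitmap_tree *t, int index)
{
    unsigned w = (unsigned)index / 32, b = (unsigned)index % 32;
    t->level0[w]      |= 1u << b;                /* mark free and re-set the parent bits */
    t->level1[w / 32] |= 1u << (w % 32);
    t->level2         |= 1u << (w / 32);
}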

There's not really much else to say about it apart from it being quite space efficient and reasonably computationally efficient: several somewhat-local reads when allocating, with a bit more overhead when a word changes state. One good optimization would be to index a cache-line's worth of bits rather than a single word. I was originally going to draw a diagram but I haven't left myself the time, so if you're interested here's the original source, which I have yet to update -- I was a pretty pedantic commenter when I was 14. Also, although I discussed the implementation generally, what you will find within the implementation is that the number of tree levels is static (it was designed to fit precisely within one page (4KB)).

6 June 2013

Effective multithreading

Posted by Matt

Leveraging multithreading can yield tremendous performance gains when applied correctly; however, the impression I've received from many programmers is that the mere practice is some sort of arcane art that should be avoided for the sake of your sanity.

There is of course merit to this aversion; concurrent execution can be difficult to integrate cleanly into many common design paradigms and introduces problems that otherwise would not exist in a single-threaded model. I won't be discussing these issues directly (that could fill a book); rather, I will be focusing on how to exploit SMP machines and cover some of the common design pitfalls.

Although I'm not a fan of the x86 instruction set, I will be using Intel's architecture in any discussion that is implementation-specific, simply because of its ubiquity.

Preface

If there's any infallible rule in concurrent programming it would be this: make sure the problem is suitable. Some workloads inherently do not benefit from being parallelized, the simplest being those which are I/O bound; not all are this obvious, however.

Evaluating suitability for multithreading first involves determining where your application is going to spend the majority of its time. The emphasis here is on time -- real time -- which I find easiest to think of as a resource, and just like any other resource it can be utilized efficiently (or not). This paradigm is not unique to the threading level; rather, it's an all-encompassing view that manifests in various ways, from asynchronous I/O to instruction-level parallelism.

To give a [specially contrived] example, consider the following snippet of x86-esque assembly:

mov eax, [0x1000] ; x
add eax, 1
mov edx, [0x1004] ; y
add edx, 1

Let's assume for a moment that we're on a dumb architecture; the above code loads two words from memory, stores them in registers, and adds the value 1 to each respectively. The first characteristic of this workload is that there are two sequences of operations that can be evaluated without consideration of the other, i.e., they're not dependent. The second is that loads take considerably more time than additions. Now, consider the revised version:

mov eax, [0x1000] ; x
mov edx, [0x1004] ; y
add eax, 1
add edx, 1

In the prior example the addition to x could not be performed until its load completed, stalling execution; this remains true here, however the reordered sequence enables y's load to be dispatched while the operation on x is processed, resulting in greater efficiency (albeit minor in the real world).

The purpose of this example is to demonstrate how, even at a micro level, managing time is important. It also demonstrates the identification of dependent and independent operation sequences, which is key when you are interested in exploiting parallel processing and is also the basis of Amdahl's law, which states (I'm paraphrasing here): the potential speedup is inversely proportional to the amount of serial processing involved.
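
For reference, the usual statement of Amdahl's law: if P is the fraction of the work that can be parallelized and N is the number of processors, the maximum speedup is

speedup(N) = 1 / ((1 - P) + P / N)

which approaches 1 / (1 - P) as N grows, so the serial fraction alone caps whatever you can gain.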

The Broken Client:Thread Model

This is a good example of a common mistake made by programmers when applying threading: partitioning threads based on program logic rather than the actual characteristics of the workload. To understand why this model is broken, the first thing to realize is that utilizing more threads than there are processors is entirely redundant; in fact it's detrimental to performance due to the inherent overhead of threading: context switches.

Context switches are one of the most important factors when deciding how to schedule the processing of your workload. Although the up-front cost of a context switch is considerable (usually around 8000 clocks), the major impact it has is on caching. Recall that processor caching exploits locality of reference; a context switch that involves a migration -- wherein a thread is scheduled on a different core than last time -- will undermine the usefulness of the cache by forcing the invocation of various cache-coherency mechanisms to rebuild the previous core's state, and potentially incur other penalties.

The second point is that such a design suggests that processing client input constitutes the basic unit of work; this is highly unlikely, as most of the work that can be scaled processing-wise usually resides in some back-end area of the server. Additionally, employing such a large number of threads typically results in a very irregular pattern of memory access, which leads to cache thrashing, wherein cached memory has no time to be effectively utilized before it's evicted to make way for the next thread's working set.
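
For contrast, the shape of the alternative: a small worker pool sized to the machine, with clients treated as work items rather than threads. The Win32 calls are real; worker_main and the work queue are placeholders of mine, not a prescribed design.

#include <windows.h>

extern DWORD WINAPI worker_main(LPVOID work_queue);   /* pulls work items in a loop */

void start_worker_pool(void *work_queue)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);

    /* One compute thread per logical processor; any more would only add
       context switches and cache churn, as discussed above. */
    for (DWORD i = 0; i < si.dwNumberOfProcessors; ++i)
        CreateThread(NULL, 0, worker_main, work_queue, 0, NULL);
}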

There are many other environmental (namely OS) issues surrounding context switches and these designs, however I will leave them for a separate post.

Why Synchronize?

Synchronizing access to resources is easily the most daunting design overhead of multithreaded software and can often be a deal-breaker when time is money. However, what is often overlooked is that the mere notion of synchronizing runs against the principle of parallel processing because, as previously mentioned, potential gains are a function of the amount of work that can be executed concurrently, and synchronization is an inherently serial affair. Therefore, if you're faced with a myriad of synchronization issues it suggests one of two things: 1) the problem itself is inherently serial (and thus a poor candidate), or 2) your program is poorly structured to take advantage of multiple threads.

This leads us to a key technique: partitioning workloads based on data access patterns. The objective is to structure your program so that there are independent sets of memory accesses which can be processed concurrently with little to no synchronization required. Ensuring that the memory is not subject to being accessed by another processor is key here, for the reasons discussed earlier. It should also be noted that what is considered "shared" is highly architecture-dependent; on modern processors the unit of sharing is a cache line, which is 64 bytes (aligned). This means that if you have one word at address 0x1000 and another at 0x1010, accessing both from different threads counts as sharing.
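
The 0x1000/0x1010 case above is the classic false-sharing setup; padding each field out to its own cache line removes it. A minimal illustration (the names are mine):

#include <stdint.h>

#define CACHE_LINE 64

/* Both counters live in the same 64-byte line, so two threads incrementing
   "their own" counter still contend for that line. */
struct counters_shared {
    uint64_t thread_a_count;
    uint64_t thread_b_count;
};

/* One line per counter: each thread's writes stay within its own line. */
struct counters_padded {
    _Alignas(CACHE_LINE) uint64_t thread_a_count;
    _Alignas(CACHE_LINE) uint64_t thread_b_count;
};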

So, to summarize: an optimal software architecture has the following properties:

This is obviously unrealistic for real-world software, however it is something to strive for.

And the recommended tips/techniques:


-----------

In a follow-up post I will cover the unique scenario of SMT (HyperThreading) and how it changes the game.

5 June 2013

Slowly into the Path of Exile

Posted by Matt
[ Since this post is targeted towards non-programmers it will be kept simple, with concepts explained briefly where necessary. Follow-up posts will delve into the technical aspects of some of the topics I discuss here ]

Being of the Diablo II persuasion, I felt compelled to experience Path of Exile, a game promoted as a homage to the original Diablo series. However, shortly after slapping a few goatman-esque monsters around, my attention was drawn to the frequent in-game chatter about the seemingly popular topic of the game's technical issues and, being a programmer first (gamer second), this began to sound like something interesting to investigate.

After a brief scan of the support forum, several of the frequently reported issues stood out:

"Unable to load resource.x: E_OUTOFMEMORY"

This was interpreted by frustrated players as reflecting their system's lack of physical memory, which, although possible, I considered unlikely given the ridiculous amount of memory gamers seem to have installed these days. So what could it mean, then? My initial hypothesis, which has explained similar issues in the past, was that the application had reached its working-set threshold, which sets an upper (and lower) bound on the amount of physical memory a process can consume. This was supported by the fact that it's a game, and a simple optimization technique often used in games is to request that the operating system keep as much of the program in physical memory as possible, to avoid having to read from the hard drive when some portion of the program's memory is accessed.
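
For the curious, this is the kind of thing I went looking for in the disassembly; a generic illustration of the technique using the relevant Win32 calls, not code from Path of Exile:

#include <windows.h>

BOOL pin_region(void *base, SIZE_T size)
{
    /* Raise the working-set bounds first; VirtualLock will fail if the locked
       pages don't fit within the process's maximum working set. The headroom
       values here are arbitrary. */
    if (!SetProcessWorkingSetSize(GetCurrentProcess(),
                                  size + (16 << 20),
                                  size + (64 << 20)))
        return FALSE;

    /* Ask the OS to keep these pages resident in physical memory. */
    return VirtualLock(base, size);
}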

Since I was not personally experiencing this issue I could not immediately test my hypothesis, so the next step was to find evidence for it by disassembling the game and determining whether or not the developers were locking portions of the game into memory and, if so, whether they were ensuring the working set was keeping up. The result: neither. They weren't locking memory or modifying the working set in any way, which ruled out the working-set theory completely. Thus what remained was an application that barely consumed 1.5GB of memory being denied requests for more on systems with up to 16GB (based on user reports).

This setback was short-lived, however, as key information came in with an error I suspected was related...

"Unable to MapViewOfFile"

Path of Exile uses a single (last I checked, 12GB) file to store the game assets (models, textures, etc.), internally structured as a file system (hence it's a VFS). A method developers often use to interact with such large files, and the one PoE uses, is file mapping; this enables the game to treat portions of the file as if they existed in memory, which often simplifies development by providing a unified model for accessing information and can [in some limited cases] provide a performance increase.
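
A generic illustration of the approach (this is not PoE's code, just the Win32 calls involved): map a window of a large archive file into the address space and read it through a pointer.

#include <windows.h>

const void *map_archive_window(const wchar_t *path, DWORD offset, SIZE_T size)
{
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return NULL;

    HANDLE mapping = CreateFileMappingW(file, NULL, PAGE_READONLY, 0, 0, NULL);
    CloseHandle(file);                 /* the mapping object keeps the file open */
    if (!mapping)
        return NULL;

    /* The offset must be a multiple of the system allocation granularity (64KB);
       this is the call that fails with "Unable to MapViewOfFile" when no
       sufficiently large contiguous range of address space is left. */
    const void *view = MapViewOfFile(mapping, FILE_MAP_READ, 0, offset, size);
    CloseHandle(mapping);              /* the view keeps the mapping alive */
    return view;
}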

Although there are numerous possibilities for this error alone (alignment comes to mind), when considered alongside the other error (E_OUTOFMEMORY) and the reports specifying 32-bit builds of Windows, it suggested that the problem was not insufficient memory but insufficient contiguous address space. Consider address space to be a finite sequence of numbers an application uses to identify discrete segments of memory; the process of associating these numbers with segments of memory is called allocation, and it is required because applications only establish this association on demand.

Unfortunately there's a problem with the address-space theory: although PoE exclusively targets an x86 (32-bit) environment, which [under Windows] provides it with 2GB of address space, the game is observed to use only around 1.5GB of memory. However, this can be explained by the phenomenon of fragmentation, wherein, although there is a sufficient total quantity of address space, there is no sufficiently sized contiguous area of it capable of being allocated.

If true, this would make for quite an interesting turn of events; in my 15 years of programming I've never come across such a situation. Confident I had found the answer, I powered up VMMap to analyze the address-space allocation pattern; sure enough, PoE's data was sprayed all over the place. Except there's one problem: why wasn't I experiencing the same issue? The answer lies in the fact that my machine is 64-bit, and although PoE is a 32-bit executable, the Windows developers decided to free up the high half of the 32-bit address space for WoW64 processes and allow 32-bit processes to utilize the entire 32-bit address space if the image is marked as such (i.e., /LARGEADDRESSAWARE). Finally there's an explanation, but what about a solution? Luckily, Windows has a boot flag that enables processes (on 32-bit systems) to access up to 3GB of address space by shifting the user/kernel address-space split around a bit. This doesn't solve the problem, but it will delay its onset.
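
Roughly what VMMap shows can also be measured directly: walk the address space with VirtualQuery and record the largest free region. If that comes back smaller than the allocation being attempted despite plenty of free space in total, fragmentation is the culprit (a sketch of mine, not PoE's code):

#include <windows.h>

SIZE_T largest_free_region(void)
{
    MEMORY_BASIC_INFORMATION mbi;
    SIZE_T largest = 0;
    unsigned char *p = NULL;

    while (VirtualQuery(p, &mbi, sizeof(mbi)) != 0) {
        if (mbi.State == MEM_FREE && mbi.RegionSize > largest)
            largest = mbi.RegionSize;
        p = (unsigned char *)mbi.BaseAddress + mbi.RegionSize;   /* next region */
    }
    return largest;
}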

There is one remaining question: if the PoE developers were aware (as evidenced by the image being marked with the aforementioned flag) that the game could potentially require more address space than is available on 32-bit Windows, then why would they not mention it to the players?



--------

Originally I was going to cover a performance issue as well (which is what the title is referring to), however I think this post has gone on long enough.
