About a year ago, while working on the XNA TPS game, I was surprised that my old single-core Athlon 64 2.0 GHz processor could run my game at 100+ fps (frames per second), while I was getting less than 40 fps on the Xbox 360’s 3.2 GHz Xenon processor. (Note that my game was CPU bound and not optimized!)
Performance issues are common for XNA developers on the Xbox 360 – especially when the entire game was developed on Windows – and even professional developers, like the Torpex team, had to face them. In general, these performance issues have two causes: the .NET Compact Framework (which runs on the Xbox) and the PowerPC architecture of the Xbox 360’s Xenon processor. In this post, I will discuss Xenon’s architecture and why it might run your game slower than your desktop Intel/AMD x86 processor. Let’s start by looking at the Xenon specs at Xbox.com:
- Three symmetrical cores running at 3.2 GHz each.
- Two hardware threads per core; six hardware threads total.
Looking at these specs you may wonder how powerful this processor is: “Wow, six hardware threads running on 3.2 GHz cores!” In fact it is really powerful, if you know how to take advantage of its architecture effectively! Do not expect great performance from ordinary (not optimized) single-threaded code.
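To make the point about single-threaded code concrete, here is a minimal sketch of splitting one job across several threads, in portable C++ rather than XNA code. The function name `parallelSum` and the chunking scheme are my own illustrative assumptions; on the Xbox you would use .NET threads instead, and real game work (physics, AI, animation) rarely splits this cleanly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` by giving each worker thread one
// contiguous chunk, then combining the per-thread partial sums.
long long parallelSum(const std::vector<int>& data, unsigned workerCount) {
    assert(workerCount > 0);
    std::vector<long long> partial(workerCount, 0);
    std::vector<std::thread> workers;

    // Ceiling division so every element lands in exactly one chunk.
    const std::size_t chunk = (data.size() + workerCount - 1) / workerCount;

    for (unsigned i = 0; i < workerCount; ++i) {
        const std::size_t begin = std::min(data.size(), i * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, i, begin, end] {
            // Each thread writes only its own slot, so no locking is needed.
            partial[i] = std::accumulate(
                data.begin() + static_cast<std::ptrdiff_t>(begin),
                data.begin() + static_cast<std::ptrdiff_t>(end), 0LL);
        });
    }

    for (auto& worker : workers) {
        worker.join();
    }
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Calling `parallelSum(data, 6)` would map one chunk to each of the six hardware threads, at least in principle. Whether that actually helps depends on how much work each chunk holds compared to the cost of starting and joining the threads.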
When a manufacturer like IBM designs a chip, it is concerned about the chip’s surface area (or die size). The die is the space available for everything the chip needs: logic units, control units, registers, cache and so on. But bigger chips mean more heat, higher power consumption and higher prices (manufacturers make money by getting as many chips out of a silicon wafer as possible). In order to fit a triple-core CPU on a small chip in early 2005, IBM had to remove some hardware optimizations from its cores, relying more on software optimizations. So, what makes the Xenon processor different?
- In-order execution: When one of the operands of an instruction is not available, the processor waits until it becomes available (which can take much longer if the data is not in cache). In this case, out-of-order execution would allow other, independent instructions to execute in the meantime, hiding the latency.
- Poor branch prediction: When a conditional branch is found, branch prediction tries to guess whether it will be taken or not, before its condition is evaluated. The next program instructions can then be pre-fetched, but a branch misprediction causes bubbles in the pipeline. Some compilers allow developers to label branches that are “likely” to be taken, generating more optimized code. Branch-intensive code like AI and game logic may suffer here!
- Small cache size: Xenon features a 32KB/32KB (instructions/data) L1 cache and a 1MB L2 cache. That is indeed a small cache (my laptop’s Intel Core 2 Duo Mobile processor has a 3MB L2 cache), and if you do not manage it carefully your program might suffer from lots of cache misses.
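The branch and cache points above can be sketched in a few lines of C++ (not XNA code). Everything here is illustrative: the `LIKELY`/`UNLIKELY` macros wrap `__builtin_expect`, which is a GCC/Clang extension, and the function names and the health/grid data are hypothetical examples of game-style loops, not code from my game.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Branch hints: __builtin_expect exists on GCC and Clang only, so other
// compilers fall back to the plain condition.
#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x)   (__builtin_expect(!!(x), 1))
#define UNLIKELY(x) (__builtin_expect(!!(x), 0))
#else
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#endif

// Assumes most entities are alive, so the "dead" case is marked as rare.
int countAlive(const std::vector<int>& health) {
    int alive = 0;
    for (int h : health) {
        if (UNLIKELY(h <= 0)) {
            continue;
        }
        ++alive;
    }
    return alive;
}

// Same result with no conditional jump in the source: the comparison
// yields 0 or 1 and is simply added.
int countAliveBranchless(const std::vector<int>& health) {
    int alive = 0;
    for (int h : health) {
        alive += (h > 0);
    }
    return alive;
}

// Cache-friendly: walks a row-major grid in the order it sits in memory.
long long sumRowMajor(const std::vector<int>& grid,
                      std::size_t rows, std::size_t cols) {
    assert(grid.size() == rows * cols);
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            total += grid[r * cols + c];
        }
    }
    return total;
}

// Same sum, but each step jumps `cols` elements ahead, so on a large grid
// almost every access can touch a different cache line.
long long sumColumnMajor(const std::vector<int>& grid,
                         std::size_t rows, std::size_t cols) {
    assert(grid.size() == rows * cols);
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            total += grid[r * cols + c];
        }
    }
    return total;
}
```

Both pairs of functions return identical results; only the way they treat the pipeline and the cache differs. How much the hinted, branchless, or row-major versions actually gain depends on the compiler and the data, so measure before relying on them.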
Furthermore, the .NET Compact Framework JIT (just-in-time) compiler makes things more difficult for XNA developers by generating very unoptimized code! But I will leave that discussion for my next post! =) Finally, I hope that in the future (maybe in XNA 3.0) we can get access to the VMX128 units on the Xbox. These units are a great help for math-intensive code, and the Xbox 360 features three of them.