libavg’s rendering has been fast enough for many applications for a while. A decent desktop computer could render between 2000 and 5000 nodes at 60 frames per second in version 1.7. That is probably already more than most frameworks manage, but for big applications, it’s not enough. For instance, someone tried to build a Game of Life application with one node per grid point – and ran into performance issues. SimMed spends an inordinate amount of time rendering 2D as well. Particle animations and similar effects also need lots of nodes.
So, I went and optimized the rendering pipeline. As a bonus, I was able to remove a lot of deprecated OpenGL function calls, getting us much closer to mobile device support.
tl;dr: On a desktop system with a good graphics card, the benchmarks now show libavg rendering two or three times as many nodes as before.
The new rendering pipeline
One mantra that’s often repeated when optimizing graphics pipelines is “minimize state changes” (see Tom Forsyth’s blog entry on Renderstate change costs and NVidia’s GDC talk slides). Pavel Mayer once (over-)simplified this to “minimize the number of GL calls”, and my experience has been that that’s actually a very good starting point.
Today’s graphics cards are optimized for large, complex 3D models with comparatively few textures. 2D applications running on these cards, in contrast, render lots of small primitives – mostly rectangles – with many different textures. A naive implementation uses one vertex buffer per primitive. That results in a huge number of state changes and is about the worst way to use current graphics cards.
The new rendering pipeline makes the most of the situation by:
- Putting all vertex coordinates into one big vertex buffer. This vertex buffer is uploaded once per frame, then bound and used for all rendering. The single big upload takes less time than figuring out which parts changed and uploading them piecewise (see the sketch after this list).
- Using one standard shader for all nodes. This shader handles color space transforms, brightness/contrast/gamma and masks, meaning it does a lot more work than is necessary for most nodes. However, the shader never changes during the main rendering pass. It turns out that the increased per-pixel processing is no problem for all but the slowest GPUs, while the state changes that would otherwise be needed cost significant time on the CPU side.
- FX nodes are rendered to textures in a prerender pass with their own shaders.
- Generally moving GL state changes outside of the render loop if possible and substituting shader parameters for old-style GL state.
- Caching all other GL state changes. Only a few GL state variables still change during rendering (to be precise: glBlendColor, the active blend function, and the parameters of the standard shader). Setting a shader parameter to the same value repeatedly no longer causes redundant GL calls (see the state-cache sketch after this list).
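To illustrate the vertex-buffer point, here is a minimal sketch of per-frame batching with a single streamed vertex buffer. This is not libavg’s actual code – the Vertex struct and the node-vertex collection step are made up for the example – but the GL calls are the standard ones.

```cpp
// Sketch: collect the vertices of all nodes into one CPU-side array each
// frame and upload them with a single glBufferData call.
#include <GL/glew.h>
#include <vector>

struct Vertex {          // hypothetical vertex layout
    float pos[2];
    float texCoord[2];
};

static GLuint s_vbo = 0;
static std::vector<Vertex> s_vertices;

void renderFrame()
{
    if (s_vbo == 0) {
        glGenBuffers(1, &s_vbo);
    }
    s_vertices.clear();
    // ... every visible node appends its (usually four) vertices here ...

    glBindBuffer(GL_ARRAY_BUFFER, s_vbo);
    // One big upload per frame; GL_STREAM_DRAW hints that the data is
    // rewritten every frame.
    glBufferData(GL_ARRAY_BUFFER, s_vertices.size() * sizeof(Vertex),
            s_vertices.data(), GL_STREAM_DRAW);

    // The buffer stays bound for the whole render pass; each node only
    // issues a draw call with its offset into the shared buffer.
}
```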
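And here is a rough sketch of the kind of state caching the last point describes: redundant glBlendColor calls and redundant uniform updates are filtered out on the CPU side. The class and member names are invented for illustration; only the GL calls are real.

```cpp
// Sketch: avoid redundant GL calls by remembering the last value set.
#include <GL/glew.h>
#include <algorithm>
#include <map>

class GLStateCache {
public:
    // Only issue glBlendColor if the color actually changed.
    void setBlendColor(float r, float g, float b, float a) {
        float c[4] = {r, g, b, a};
        if (!m_haveBlendColor || !std::equal(c, c + 4, m_blendColor)) {
            glBlendColor(r, g, b, a);
            std::copy(c, c + 4, m_blendColor);
            m_haveBlendColor = true;
        }
    }

    // Only issue glUniform1f if this uniform's value actually changed.
    void setUniform1f(GLint loc, float val) {
        auto it = m_uniforms.find(loc);
        if (it == m_uniforms.end() || it->second != val) {
            glUniform1f(loc, val);
            m_uniforms[loc] = val;
        }
    }

private:
    bool m_haveBlendColor = false;
    float m_blendColor[4];
    std::map<GLint, float> m_uniforms;   // uniform location -> last value
};
```

Caching by uniform location works here because the standard shader stays active for the whole main render pass.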
There were also a few non-graphics related optimizations – profiling information is now only collected if profiling is turned on, for example.
Results
Without further ado, here are some benchmarks using avg_checkspeed and avg_checkpolygonspeed. They show nodes per frame at 60 FPS on a typical desktop system (Core i7 920 Bloomfield, 2.66 GHz, NVidia GF260):
Desktop, Linux (Ubuntu 12.04, Kernel 3.2)

| libavg Version | Images | Polygons |
|----------------|--------|----------|
| 1.7            | 2200   | 3500     |
| Current        | 7000   | 7000     |
Desktop, Win 7

| libavg Version | Images | Polygons |
|----------------|--------|----------|
| 1.7            | 2700   | 5000     |
| Current        | 10000  | 9500     |
On my MacBook Pro (Mid-2010, Core i7 Penryn, 2.66 GHz, NVidia GF330M graphics, Snow Leopard), the maximum number of nodes rendered did not increase. However, the CPU load while rendering went down – so we have a GPU bottleneck here:
MacBook Pro

| libavg Version | Images              | Polygons            |
|----------------|---------------------|---------------------|
| 1.7            | 1000, 100% CPU load | 1600, 100% CPU load |
| Current        | 1000, 80% CPU load  | 1600, 40% CPU load  |
More precisely, fragment processing is the bottleneck, since changing the multisampling settings has an effect on speed. Switching to minimal shaders doesn’t change speed, however, so my current guess is texture fetches. But that’s for the next iteration of optimizations.