Speed

libavg just got another major optimization.

I implemented an image registry and cache for libavg. ImageNodes that reference the same image file now share a single bitmap in CPU memory and a single texture in GPU memory. This is completely hidden from the app developer, who just specifies the file location for all instances. The obvious benefit is that this saves a lot of memory if an application re-uses lots of bitmaps. The less obvious benefit is that it speeds things up as well: avg_checkspeed, which tests with thousands of identical ImageNodes, can now handle around 15000 Nodes at 60 FPS on my old i7 (still a Core i7 920 Bloomfield at 2.66 GHz with an NVidia GF260, like in the old benchmarks). This is twice as many as before.
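
To make the idea concrete, here’s a minimal sketch of such a registry, assuming a shared-pointer-based design. The names (ImageRegistry, getBitmap, Bitmap) are illustrative rather than libavg’s actual API, and the real registry also has to manage the corresponding GPU textures and be safe to use from several threads:

    #include <map>
    #include <memory>
    #include <string>

    // Hypothetical stand-in for libavg's internal bitmap class.
    class Bitmap {
    public:
        explicit Bitmap(const std::string& fileName) { /* decode the file into pixels */ }
    };

    // Every node that asks for the same file gets a shared pointer to the same
    // bitmap instead of loading its own copy.
    class ImageRegistry {
    public:
        std::shared_ptr<Bitmap> getBitmap(const std::string& fileName)
        {
            auto it = m_Cache.find(fileName);
            if (it != m_Cache.end()) {
                if (std::shared_ptr<Bitmap> pBmp = it->second.lock()) {
                    return pBmp;                    // Cache hit: reuse the loaded bitmap.
                }
            }
            auto pBmp = std::make_shared<Bitmap>(fileName);
            m_Cache[fileName] = pBmp;               // Remember it for the next node.
            return pBmp;
        }

    private:
        // weak_ptr, so a bitmap is freed once no node references it anymore.
        std::map<std::string, std::weak_ptr<Bitmap>> m_Cache;
    };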

Multithreading, Realtime Graphics and Process Affinity Masks

In libavg, we try to make it as easy as possible to have a consistent framerate that matches the screen refresh rate. For almost all current systems (and ignoring new developments such as NVidia G-Sync), that means delivering a new frame every 16.67 milliseconds.

To make this possible, libavg is designed as a multi-threaded system and long-running tasks are moved to separate threads. So, for instance, the BitmapManager class loads image files in one or more background threads (the number of background threads is configurable using BitmapManager.setNumThreads()), the VideoWriter uses a background thread to encode and write video files, and all videos are decoded in background threads as well. Besides enabling quick screen updates in the main thread, this also allows libavg-based programs to utilize more than one core in a multi-core computer. The threads are distributed among the cores by the operating system according to the load, and in general, this works pretty well.
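
The pattern behind all of these is the same: a request queue serviced by a configurable number of worker threads. Here’s a generic sketch of that pattern in C++11 – illustrative only, since libavg’s own thread classes look different:

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // Generic sketch (not libavg's actual classes): long-running jobs such as
    // decoding an image file are pushed onto a queue and processed by a
    // configurable number of worker threads, so the main thread never blocks.
    class WorkerPool {
    public:
        explicit WorkerPool(int numThreads)
        {
            for (int i = 0; i < numThreads; ++i) {
                m_Workers.emplace_back([this] { workerLoop(); });
            }
        }

        ~WorkerPool()
        {
            {
                std::lock_guard<std::mutex> lock(m_Mutex);
                m_bStopped = true;
            }
            m_CondVar.notify_all();
            for (auto& worker : m_Workers) {
                worker.join();
            }
        }

        // Called from the main thread; returns immediately.
        void schedule(std::function<void()> job)
        {
            {
                std::lock_guard<std::mutex> lock(m_Mutex);
                m_Jobs.push(std::move(job));
            }
            m_CondVar.notify_one();
        }

    private:
        void workerLoop()
        {
            for (;;) {
                std::function<void()> job;
                {
                    std::unique_lock<std::mutex> lock(m_Mutex);
                    m_CondVar.wait(lock, [this] { return m_bStopped || !m_Jobs.empty(); });
                    if (m_bStopped && m_Jobs.empty()) {
                        return;
                    }
                    job = std::move(m_Jobs.front());
                    m_Jobs.pop();
                }
                job();  // e.g. decode a file, then hand the result back to the main thread
            }
        }

        std::vector<std::thread> m_Workers;
        std::queue<std::function<void()>> m_Jobs;
        std::mutex m_Mutex;
        std::condition_variable m_CondVar;
        bool m_bStopped = false;
    };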

However, the operating system has no way of knowing that one of the libavg threads is special and should be able to churn out frames at 60 fps. So, if the background threads cause too much load, some of them will run on the same core that the main thread is running on, and framerate can become irregular.

Happily, there’s a cure for the issue: using thread affinity functions, we lock the screen update thread to one specific core and keep all other threads off that core.
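
A rough sketch of the idea on Linux, using the pthread affinity API (illustrative code, not what libavg actually ships; on Windows, SetThreadAffinityMask plays the same role):

    #define _GNU_SOURCE          // for pthread_setaffinity_np on glibc
    #include <pthread.h>
    #include <sched.h>

    // The screen update thread gets core 0 for itself,
    // worker threads get all the other cores.
    void pinScreenUpdateThreadToCore0()
    {
        cpu_set_t cpuSet;
        CPU_ZERO(&cpuSet);
        CPU_SET(0, &cpuSet);                         // only core 0 allowed
        pthread_setaffinity_np(pthread_self(), sizeof(cpuSet), &cpuSet);
    }

    void keepWorkerOffCore0(pthread_t workerThread, int numCores)
    {
        cpu_set_t cpuSet;
        CPU_ZERO(&cpuSet);
        for (int i = 1; i < numCores; ++i) {         // every core except core 0
            CPU_SET(i, &cpuSet);
        }
        pthread_setaffinity_np(workerThread, sizeof(cpuSet), &cpuSet);
    }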

Supporting Twelve Screens at Once

Our latest and biggest (not to mention coolest) toy at the Interactive Media Lab Dresden is a ten-square-meter interactive wall that’s fully touch-sensitive and supports markers and pens as well. It consists of twelve Full HD monitors hooked up to two Radeon 7970 graphics cards in a single dual-Xeon workstation. Because one workstation powers everything, we can drive the complete wall with a single application, which is really cool and sets it apart from most similar setups. However, the dual-graphics-card setup causes issues: under Linux, we get two separate desktops, and under Windows, applications that span the graphics card boundary are extremely slow.

To get full-screen rendering at interactive speeds, you basically have to open two borderless windows – each spanning 6 screens and pinned to one of the GPUs. Then you render the same scene with different viewports in each of the windows. That means that all context-specific data – textures, vertex buffers, shaders, framebuffer objects, and even caching of shader parameters – needs to be replicated across both contexts. Also, we can’t switch contexts too often, because that would make things slow.
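
This implies a fair amount of bookkeeping. Here’s a simplified sketch of the pattern with made-up names, not libavg’s real classes: each texture-like resource keeps one GL object per context and creates it lazily when a context first needs it:

    #include <map>
    #include <GL/gl.h>

    // GL objects can't be shared between the two GPU-pinned contexts, so every
    // texture keeps one GL object id per rendering context.
    class MultiContextTexture {
    public:
        GLuint getTexID(int contextID)
        {
            auto it = m_TexIDs.find(contextID);
            if (it == m_TexIDs.end()) {
                GLuint texID;
                glGenTextures(1, &texID);            // created in the current context
                // ...upload pixels, set filtering parameters, etc.
                it = m_TexIDs.insert(std::make_pair(contextID, texID)).first;
            }
            return it->second;                       // reused on later frames
        }

    private:
        std::map<int, GLuint> m_TexIDs;              // one GL texture id per context
    };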

libavg renders in two passes: The first (implemented in the Node.preRender() functions) prepares textures and vertex data; it also renders FX nodes. The second pass (implemented in Node.render()) actually sends render commands to the graphics card. The multi-context code changes a few things: While preRender() is still executed only once per frame, render() is executed once per GPU. Uploads of data, as well as effects that need to be rendered, are scheduled in preRender() and actually executed at the beginning of each render() call. In total, refactoring everything accordingly was (obviously) a lot of work that touched code all over the graphics engine, but the result is good rendering performance at 24 megapixels of resolution.
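
A sketch of what this scheduling could look like – again with illustrative names, and leaving out the per-context tracking the real code needs:

    #include <functional>
    #include <vector>

    // preRender() only queues context-specific work; render() replays the queue
    // once per GPU context before issuing the draw calls.
    class UploadQueue {
    public:
        // Called from preRender(), which runs exactly once per frame.
        void schedule(std::function<void()> upload)
        {
            m_PendingUploads.push_back(upload);
        }

        // Called at the start of render() for each context, i.e. once per GPU,
        // with that context's GL state bound.
        void execute()
        {
            for (size_t i = 0; i < m_PendingUploads.size(); ++i) {
                m_PendingUploads[i]();
            }
        }

        // Called once the last context has finished rendering the frame.
        void clear()
        {
            m_PendingUploads.clear();
        }

    private:
        std::vector<std::function<void()>> m_PendingUploads;
    };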

The code is still on a branch (The svn repository is at https://www.libavg.de/svn/branches/experiments/libavg_multicontext/), but it passes all tests, and I’ll merge it to trunk after we’ve used it a bit.

Raspberry Pi Support

I’m sure most of you have heard of the Raspberry Pi, a $25 ARM computer that runs Linux. We’ve spent quite a bit of time in the last weeks getting libavg to run on this machine, and I’m happy to say that we have a working beta. We render to a hardware-accelerated OpenGL ES surface and almost all tests succeed. Besides full image, text and software video support, that includes all compositing and even offscreen rendering and general support for shader-based FX. We have brief setup instructions at https://www.libavg.de/site/projects/libavg/wiki/RPI. Update: The setup instructions have been updated for cross-compiling (much faster!) and moved to https://www.libavg.de/site/projects/libavg/wiki/RaspberryPISourceInstall.

Most of the work was getting libavg to work with OpenGL ES. We now decide whether to use desktop or mobile OpenGL depending on a configure switch, an avgrc entry and the hardware capabilities. Along the way, we implemented mobile context support under Linux for NVidia and Intel graphics systems, so we can now test most things without actually running (or compiling!) anything on the Raspberry. Speaking of which – compiling for the Raspberry takes a long time. Compiling on it is impossible because there just isn’t enough memory. We currently chroot into a Raspberry file system and compile there (see the notes linked above).
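
The decision itself boils down to something like the following sketch (a hypothetical function, not libavg’s actual code):

    // The configure switch, the avgrc entry and the hardware each get a say in
    // whether a desktop GL or an OpenGL ES context is created.
    enum GLFlavor { FLAVOR_DESKTOP, FLAVOR_GLES };

    GLFlavor chooseGLFlavor(bool builtWithGLES, bool avgrcWantsGLES, bool hwHasDesktopGL)
    {
        if (!builtWithGLES) {
            return FLAVOR_DESKTOP;    // ES support not compiled in: no choice.
        }
        if (avgrcWantsGLES || !hwHasDesktopGL) {
            return FLAVOR_GLES;       // Requested in avgrc, or (as on the Pi) the only option.
        }
        return FLAVOR_DESKTOP;
    }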

A lot of things are already implemented the way they should be for a mobile system. That means, for example, that bitmaps are loaded (and generated, and read back from texture memory…) in either RGB or BGR pixel format depending on the flavor of OpenGL used, and that the vertex arrays are smaller now, which saves bandwidth. Still, there’s a lot of optimization to do. Our next step is getting things stable and fast. We want hardware video decoding, compressed textures – and in general, we’ll be profiling to find spots that take more time than they should.
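
For the pixel formats, the gist is a choice along these lines (an illustrative sketch, not libavg’s actual code): desktop GL happily accepts BGRA uploads, while core OpenGL ES 2.0 only takes RGBA, so under GL ES bitmaps are kept in RGB(A) order from the start.

    #include <GL/gl.h>

    // Pick the pixel layout the current GL flavor can consume directly, so
    // bitmaps don't need to be swizzled on the CPU before upload.
    GLenum uploadFormatForFlavor(bool usingGLES)
    {
        return usingGLES ? GL_RGBA : GL_BGRA;
    }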

Intel Graphics

After the rendering optimization I described in my last post, tests with Intel Atom graphics (N10 chipset) uncovered a problem. The system was running in software rendering mode, which slows things down by a factor of about a thousand. It turns out that more than two texture accesses in a shader are too much for the hardware. Additionally, lots of Intel chips run all vertex shaders in software, which also causes a tenfold slowdown if libavg’s 3-line vertex shader is in use.

So now, there’s a second rendering path with minimal shaders that does vertex processing the old-fashioned way (glMatrixMode etc.) and uses a different shader for those nodes that don’t need any special processing. Still, I recommend staying away from Intel Atom graphics. There is way better hardware out there at the same price point.
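
For completeness, the choice between the two rendering paths boils down to something like this sketch (made-up names and simplified conditions, not libavg’s actual code):

    // Use the minimal-shader path if the driver runs vertex shaders in software
    // or the GPU can't handle the standard shader's number of texture accesses.
    enum RenderPath { PATH_STANDARD_SHADER, PATH_MINIMAL_SHADER };

    RenderPath chooseRenderPath(bool hwVertexShaders, int maxTextureAccesses)
    {
        if (!hwVertexShaders || maxTextureAccesses <= 2) {
            // Fixed-function transforms (glMatrixMode & friends) plus a trivial
            // fragment shader keep weak chipsets out of software rendering.
            return PATH_MINIMAL_SHADER;
        }
        return PATH_STANDARD_SHADER;
    }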

Speeding up Rendering

libavg’s rendering has been fast enough for many applications for a while. A decent desktop computer could render between 2000 and 5000 nodes at a framerate of 60 in version 1.7. This is probably already more than most frameworks can handle, but for big applications, it’s not enough. For instance, someone tried to build a Game of Life application with one node per grid point – and ran into performance issues. SimMed spends an inordinate amount of time rendering 2D as well. Also, particle animations and similar effects need lots of nodes.

So, I went and optimized the rendering pipeline. As a bonus, I was able to remove lots of deprecated OpenGL function usage, thus getting us a lot closer to mobile device support.

tl;dr: On a desktop system with a good graphics card, the benchmarks now show libavg rendering two or three times as many nodes as before.

The new rendering pipeline

One mantra that’s often repeated when optimizing graphics pipelines is “minimize state changes” (see Tom Forsyth’s blog entry on Renderstate change costs and NVidia’s GDC talk slides). Pavel Mayer once (over-)simplified this to “minimize the number of GL calls”, and my experience has been that that’s actually a very good starting point.

Today’s graphics cards are optimized for large, complex 3D models with comparatively few textures. 2D applications running on 3D graphics cards, in contrast, draw lots of small primitives – mostly rectangles – with different textures. A naive implementation uses one vertex buffer per primitive. That results in a huge number of state changes and is about the worst way to use current graphics cards.

The new rendering pipeline makes the most of the situation by:

  • Putting all vertex coordinates into one big vertex buffer. This vertex buffer is uploaded once per frame, activated and used for all rendering. The one big upload takes less time than actually figuring out what needs to be uploaded and doing the work piecewise.
  • Using one standard shader for all nodes. This shader handles color space transforms, brightness/contrast/gamma and masks, meaning it does a lot more work than is necessary for most nodes. However, the shader never changes during the main rendering pass. It turns out that the increased per-pixel processing is no problem for all but the slowest GPUs, while the state changes that would otherwise be needed cost significant time on the CPU side.
  • Rendering FX nodes to textures in a prerender pass, using their own shaders.
  • Generally moving GL state changes outside of the render loop if possible and substituting shader parameters for old-style GL state.
  • Caching all other GL state changes. Only a few GL state variables still change during rendering (to be precise: glBlendColor, the active blend function, and parameters to the standard shader). Setting a shader parameter to the same value repeatedly no longer causes redundant GL calls – see the sketch after this list.
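
To illustrate the last point, here’s a minimal sketch of state-change caching with a made-up class name; libavg’s real code caches more state, including the standard shader’s parameters:

    #include <GL/gl.h>

    // The setter remembers the last value it sent to GL and only issues a new
    // GL call when the value actually changes.
    class GLStateCache {
    public:
        GLStateCache()
            : m_bBlendFuncSet(false),
              m_SrcFactor(GL_ONE),
              m_DestFactor(GL_ZERO)
        {
        }

        void setBlendFunc(GLenum srcFactor, GLenum destFactor)
        {
            if (m_bBlendFuncSet && srcFactor == m_SrcFactor && destFactor == m_DestFactor) {
                return;                          // Same state as before: no GL call needed.
            }
            glBlendFunc(srcFactor, destFactor);  // Only reached when the state really changes.
            m_SrcFactor = srcFactor;
            m_DestFactor = destFactor;
            m_bBlendFuncSet = true;
        }

    private:
        bool m_bBlendFuncSet;
        GLenum m_SrcFactor;
        GLenum m_DestFactor;
    };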

There were also a few non-graphics related optimizations – profiling information is now only collected if profiling is turned on, for example.

Results

Without further ado, here are some benchmarks using avg_checkspeed and avg_checkpolygonspeed. They show nodes per frame at 60 FPS on a typical desktop system (Core i7 920 Bloomfield, 2.66 GHz, NVidia GF260):

Desktop, Linux (Ubuntu 12.04, Kernel 3.2)

libavg Version    Images    Polygons
1.7               2200      3500
Current           7000      7000

Desktop, Win 7

libavg Version    Images    Polygons
1.7               2700      5000
Current           10000     9500

On my MacBook Pro (Mid-2010, Core i7 Penryn, 2.66 GHz, NVidia GF330M graphics, Snow Leopard), the maximum number of nodes rendered did not increase. However, the CPU load while rendering went down – so we have a GPU bottleneck here:

MacBook Pro

libavg Version    Images                 Polygons
1.7               1000, 100% CPU load    1600, 100% CPU load
Current           1000, 80% CPU load     1600, 40% CPU load

More precisely, fragment processing is the bottleneck, since changing the multisampling settings has an effect on speed. Switching to minimal shaders, on the other hand, has no effect on speed, so my current guess is texture fetches. But that’s for the next iteration of optimizations.