My own little DirectX FAQ


Wednesday, April 12, 2006
Why is Present so slow?

Well - it isn't! On its own, Present does very little except tell the GPU that the current frame is done, and it should display that to the screen. It might also do a few blits and clears, but those are very quick operations on today's cards.

The reason you're seeing Present on your profile is that Present is also when the CPU and GPU "sync up" with each other. They are two separate units operating asynchronously in parallel (well, ideally :-), and you need to be aware that at certain times, one may need to wait for the other. Obviously the GPU can't get ahead of the CPU (because the CPU generates commands for the GPU), but it is fairly easy to give the GPU so much work to do that you can generate rendering commands with the CPU faster than the GPU can complete those commands.

Let's say the GPU is managing to render a frame every 30ms. But what if the CPU only takes 10ms to generate the data for those frames? Obviously you can put some buffering in there to ensure smooth progress (and DirectX does this for you, unless you break it), but at some point the CPU is going to have a lot of frames queued up and the GPU can't finish them fast enough.

You could let this happen indefinitely, but then your CPU is generating data that won't be rendered and displayed for seconds - or even minutes. So you move the mouse left, and seconds later your view actually goes left. That's not very good for games. So DirectX limits the CPU to getting at most two frames ahead of the GPU (in practice, because of the way screen updates are handled, the time until you actually see the image on your screen might be three frames). This is considered to be a good balance between responsiveness and keeping the pipeline flowing to ensure a high framerate - the more buffering in the system, the less the GPU and CPU have to wait for each other.

So after a few frames of rendering, the CPU is now two frames ahead, and calls Present. At this point, DirectX notices that the CPU is too far ahead, and deliberately waits for the GPU to catch up and finish rendering its current frame - which will take 20ms (30ms - 10ms). This is why you're seeing such a long time spent in Present - the call itself is very simple, and doesn't do all that much. The thing that is taking the time is that the CPU is waiting for the GPU to catch up.

If you want to use this spare CPU time for something else, then you can use the D3DPRESENT_DONOTWAIT flag. What this means is that if Present would have stalled and waited for the GPU, it instead returns with the error D3DERR_WASSTILLDRAWING to say "the GPU is still busy". You can then do some work and try the Present again later.
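If you want to try it, here is a rough DX9-style sketch of the pattern (the flag is passed via the swap chain's Present call; pDevice is your device, DoSomeUsefulCpuWork() is a hypothetical stand-in for whatever work you have spare, and not every driver honours the flag):

IDirect3DSwapChain9 *pSwapChain = NULL;
pDevice->GetSwapChain ( 0, &pSwapChain );

HRESULT hr;
do
{
    hr = pSwapChain->Present ( NULL, NULL, NULL, NULL, D3DPRESENT_DONOTWAIT );
    if ( hr == D3DERR_WASSTILLDRAWING )
    {
        DoSomeUsefulCpuWork();   // hypothetical - a bit of AI, streaming, whatever
    }
} while ( hr == D3DERR_WASSTILLDRAWING );

pSwapChain->Release();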

Note that in general it is quite tricky to do any useful work in this period. You have to remember that on some systems with slower CPUs and faster GPUs, or in other parts of the game where the CPU has a lot to do and the GPU doesn't have as much, the CPU may be the slower of the two and you will never get a D3DERR_WASSTILLDRAWING result - the GPU will always be waiting for the CPU. Which means that any time reclaimed using this method is completely unreliable - you can't count on it at all. So be careful with this functionality!




Why is DrawPrimitive so slow?

Well - it isn't! If you simply do a whole bunch of DrawPrim calls in a row and time them, you will find they run very fast indeed. However, if you insert some rendering state changes (almost any of the Set* calls) in between those calls, that's when they start to get slow. Note that your profiler will not show the extra time being taken in the Set* calls themselves - they are always very fast.

What is happening is that the Set* calls don't actually do anything. For example, when you call SetTexture, it doesn't actually change the texture, it just remembers that you wanted the texture changed (it just writes a pointer into a memory location). Then when you actually call DrawPrim, it looks through all the changes you made since the last DrawPrim call and actually performs them. In the case of a texture change, it makes sure all the mipmap levels are loaded from disk, have been sent to video memory, are in the right format, and then tells the video card to update its internal pointers (in reality it's even more complex - but you get the idea). This is often called "lazy state changes" - the change is not made when you tell the change to happen, it happens when it absolutely finally has to happen (just before you draw triangles).
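Here is a toy sketch of the lazy idea - this is not the actual runtime or driver code, just an illustration of "remember now, apply at draw time". ReallyChangeTextureOnCard() and ReallyDrawTriangles() are made-up names standing in for the real work (needs d3d9.h for the texture type):

void ReallyChangeTextureOnCard ( IDirect3DTexture9 *pTex );   // made-up - the expensive real work
void ReallyDrawTriangles();                                   // made-up - the actual draw

struct LazyDevice
{
    IDirect3DTexture9 *m_WantedTexture;      // what the app last asked for
    IDirect3DTexture9 *m_CurrentTexture;     // what the card is actually using

    void SetTexture ( IDirect3DTexture9 *pTex )
    {
        m_WantedTexture = pTex;              // just remember it - very cheap
    }

    void DrawPrim()
    {
        if ( m_WantedTexture != m_CurrentTexture )
        {
            ReallyChangeTextureOnCard ( m_WantedTexture );   // only now does the change happen
            m_CurrentTexture = m_WantedTexture;
        }
        ReallyDrawTriangles();
    }
};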

By the way, when people talk about "DrawPrim" calls, they usually mean any of DrawPrimitive, DrawIndexedPrimitive, DrawPrimitiveUP and DrawIndexedPrimitiveUP. Note however that the *UP calls can be a bit slower than the other two - in general you should try to avoid them and use Vertex Buffers. However, if you spend too much CPU time avoiding the *UP calls, it becomes counter-productive, so don't avoid them like the plague - just be aware that they are slightly more expensive - they are still appropriate for some operations.

People always ask "which state changes are most expensive?" Well, it depends. A lot. On all sorts of things - mainly what video card is in the system. So any list is bound to be a bit fuzzy and imprecise, but here's a general guide, listed most expensive to least expensive.

-Pixel shader changes.
-Pixel shader constant changes.
-Vertex shader changes.
-Vertex shader constant changes.
-Render target changes.
-Vertex format changes (SetVertexDeclaration).
-Sampler state changes.
-Vertex and index buffer changes (without changing the format).
-Texture changes.
-Misc. render state changes (alpha-blend mode, etc)
-DrawPrim calls.

I haven't talked about any of the older DX8-style states like SetTextureStageState - they're emulated by shaders on most modern cards, and so count as the equivalent shader call, for the purposes of profiling.

Why are cards so bad at shader changes? It's because many of them can only be running one set of shaders and shader constants at a time. Remember that video cards are not like a CPU - they are a very long pipeline - it may take hundreds of thousands of clock cycles for a triangle to get from being given to the card to being rendered. GPUs are so fast because they can have huge numbers of triangles and pixels being processed at once, so although they have very poor "latency" (the time taken from an input to go all the way through to the end and finish processing), they have extremely impressive "throughput" (how many things you can finish processing in a second).

So when you change shader, many cards have to wait for their entire pipeline to drain fully empty, then upload the new shader or set of shader constants into a completely idle pipeline, and then start feeding work into the start of the pipeline again. This draining is often called a pipeline "bubble" (though it's a slightly inaccurate term from the point of view of a hardware engineer), and in general you want to avoid them.

Newer cards can change shaders in a pipelined way, so changing shader or shader state is much cheaper - at least as long as you don't do it too often (they have a certain small number of active shaders at any one time - so you should still try to avoid changing frequently).

Be aware that some seemingly innocent changes can cause "hidden" shader changes. For example, a lot of hardware has to do special things inside the shader to deal with cube maps as opposed to standard 2D texture maps. So if you change from a cube map to a 2D map or vice versa, even though you think you didn't change the shader, the driver has to upload a new version anyway. The same applies to SetVertexDeclaration and SetSamplerState - for some cards, these also require a new shader. And lastly, changing some states such as stencil mode, Z mode or alpha-test mode can cause pipeline flushes and/or shader changes as well. So it's all very tricky - as I said, the list above is only for very general guidance.

Just to complicate matters further, in most cases DrawPrim calls are also put into a pipeline (a software one this time, not a hardware one), and only after a certain number of them are batched does the card's driver actually get called. This means that timing individual DrawPrim calls is basically pointless - you won't see how expensive that call was, you'll see some fairly arbitrary chunk of time being taken. In some cases, the driver was not even called - all that happened was the command was added to a queue. In other cases, the driver was called with a big queue of lots of DrawPrim calls and render state changes at once. In general, trying to profile your GPU by timing the CPU is going to be confusing and misleading. The only way to diagnose a GPU's true performance is to use something like PIX or some of the hardware vendors' own tools - they are the only things that can tell you what the graphics card is really doing.

Another general hint - when profiling, never time anything that renders in less than 10ms per frame (i.e. faster than 100fps). Always do more work to get the speed below 100fps (e.g. render the scene multiple times, or render more objects), and then measure how much extra work you did. For example, saying one object rendered in 2ms is not very useful - because you might infer from that that you could render only ten in 20ms. But because of overhead per frame and suchlike, you may actually be able to render a hundred or a thousand objects in 20ms.

If you get the framerate down below 100fps, i.e. your frame takes more than 10ms to render, then you have mostly got rid of the overhead and you are using the full power of pipelining. The lower you get the speed, the more reliable the timings (usually!). 10ms per frame is about when you can start to draw straight lines on graphs. When I see people saying things like "I did XYZ and my framerate dropped from 2378fps to 1287fps - wow, that's a really expensive operation!" - it makes me cringe. They're just chasing phantoms - it's not useful data at all.

Another error is to measure frames per second. Always convert to microseconds or milliseconds. If I render ten object As and it takes me 10ms, and ten object Bs and it takes me 20ms, you can be fairly certain that rendering ten of each will take me 30ms - it's simple addition. Whereas if you say ten As runs at 100fps and ten Bs runs at 50fps, then what does ten of each take - 0fps? 25fps? 150fps? It's all very unintuitive (the correct answer would be 33fps). Think in seconds per frame, not frames per second.
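A tiny worked version of that arithmetic, using the numbers above:

float msA   = 10.0f;              // ten object As
float msB   = 20.0f;              // ten object Bs
float msAB  = msA + msB;          // ten of each = 30ms - simple addition
float fpsAB = 1000.0f / msAB;     // = 33.3fps - not at all obvious from "100fps and 50fps"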

Questions prompted by this post:

"Why do you say pixel shaders are more expensive than vertex shaders". Well, because they are very slightly more expensive to change on some cards, simply because the pipeline has to drain further before the upload can start. But the difference is slight, and on other cards, the two are basically the same cost.

"Isn't changing textures a lot more expensive? Especially if you have to transfer them to video memory?" Certainly if the texture is not in video memory and has to uploaded to the card, that is going to be a huge cost. I'm not talking about that here - these priorities assume that the texture is already in video memory (because you used it in a previous frame and you're not low on video memory). Also note that it really only applies to changing from one "simple" texture format to another. An example of a simple texture format is a 2D mipmapped DXT1 or 8888 texture - cards are pretty good about changing between these fairly quickly. More complex texture formats such as cube maps or floating-point formats may need some shader assistance, and changing those can be more expensive.




Saturday, December 24, 2005
Vertex cache optimisation

Most graphics cards have a vertex cache that sits after the Vertex Shader or Fixed-Function Transform & Light stage and caches the results. (They also have a pre-VS cache; I'll discuss that below.) The idea of this cache is to make sure the card never processes any vertex twice. However, getting it to work perfectly is impossible - the best you can hope for is that the number of vertices that are re-processed is kept to a minimum. By sending triangles in the correct order to the graphics card, you can keep the vertex cache nice and warm, and the card won't re-process a significant number of vertices. The crucial thing is to choose a triangle ordering that hits the cache a lot. It is so important that, when finding an ordering, you should ignore whether your eventual primitive type is lists, strips or fans (see the post below) - the only important thing is the order you draw the triangles in.

The efficiency of a vertex cache is usually measured in "number of vertices processed for each triangle drawn", or verts/tri. A ratio of 0.5 verts/tri is the theoretical best, and even that is only achievable on a totally regular mesh with an infinitely-sized cache. In practice, we don't have perfectly regular meshes or infinite caches, so this is nearly unobtainable. Getting somewhere between 0.6 and 0.8 verts/tri is far more usual.

There are many ways to choose a good ordering. Here are five common ones, roughly ordered from worst to best:

1 - "Stripping" your mesh - making it into triangle-strip primitives as long as possible. This is what people have done since the 80s with programs such as "Stripe" and a paper on "Tunneling for Triangle Strips" (by A. James Stewart), and it's not too bad. However, stripping was designed for hardware that did not have indexed primitives, just strips, so it has rather different goals. While still acceptable on indexed hardware, it only gets a ratio of around 1 vert/tri, which is a long way off the practical best of 0.6. However, it is still better than a totally random order!

Let me say this just once, because academics are still spending time and money researching this subject. You're wasting your time. Strips are obsolete - they are optimising for hardware that no longer exists. Indexed vertex caches are much better at this than non-indexed strips, and the hardware is totally ubiquitous. Please go and spend your time and money researching something more interesting.

I'll get off my soapbox now.

2 - Hugues Hoppe's paper "Optimization of mesh locality for transparent vertex caching" (available at http://research.microsoft.com/~hoppe/). This has an interesting discussion of the theoretical limits of vertex caching, and also has a nice algorithm to optimise for a given size cache. However, there is a hidden problem with it - it assumes you know how big your cache is, and how it works. On the PC, you usually don't, because each user will have a different sort of graphics card, and they all have different types and sizes. Some even change their size depending on the vertex format. The problem with Hoppe's method is that if you optimise for a cache of 16 vertices, and the real one is only 12 vertices, you will "thrash" the cache badly, and actually get very bad performance. Conversely, if you optimise for a cache of 8 vertices and the real one has 24 entries, you will get very little benefit from the extra space. So this method is useful for known targets (e.g. consoles), but not that useful for PCs, or for cross-platform optimisation.

3 - Bogomjakov and Gotsman's paper "Universal Rendering Sequences for Transparent Vertex Caching of Progressive Meshes" (http://www.cs.technion.ac.il/~gotsman/caching/). This is the best paper I have seen about finding an ordering that works well on all caches. I didn't really understand the second part of the paper about Progressive Meshes, because that bit relies on a very specific implementation of Progressive Meshes that I don't like much (very CPU-hungry). So just read the first bit. Essentially, what you want is an ordering that continually twists around on itself, rather than the long zig-zag patterns that Hoppe's version tends to generate. They start with the ideal of a Hilbert curve (which constantly curves in on itself), and try to generate that on more arbitrary meshes. Unfortunately, while their results look good, the algorithm is very complex (i.e. I don't actually understand how to implement it :-), and rather slow.

4 - The D3DXOptimizeFaces function. This is an implementation of Hoppe's algorithm with several improvements that you can just use directly, which is handy. The previous weaknesses apply though - this function gets around them by picking a fairly small vertex cache and optimising for that. Keeping the target cache small means you are unlikely to start thrashing the real cache. While not completely optimal, it is better than nothing, and much better than a stripper.

5 - The optimiser in the Granny Preprocessor (included in Granny: http://www.radgametools.com/gramain.htm). This is one I wrote recently (late 2005). It generates Hilbert-ish patterns that give good measured coherency for caches from 4 to 64 vertices on a wide range of meshes, so it's very general-purpose. However, it is a much simpler algorithm than the above ones, it's just been tweaked a lot. The advantage is that (a) you get source, (b) you can probably understand the source, if you want to modify it for your own purposes or platform oddities, and (c) it's very fast, which means you can run it at load-time rather than actually at export time. Speed doesn't sound like a big deal, because most of the time you will be doing offline processing, but it does solve one tricky problem. When exporting an animated mesh, you frequently need to chop it into sub-meshes so that each one only uses a certain number of bones (because they must all fit in the Vertex Shader constant space at once). But because different graphics cards have different-sized constant spaces, it would be nice to be able to do this splitting at load time, when you actually know how big the constant space of the user's card is (the Granny preprocessor does this for you as well!). Once you've split the mesh up into submeshes, you'd like to re-order each of the submeshes for best vertex cache use. This is where having a really fast vertex-cache optimiser comes in handy - you can do this when the user loads the game, rather than having to do it at export time.


So if you're buying Granny, the answer is obvious! If you're not, then I would use D3DXOptimizeFaces, because it's pretty good. If you can't do that, maybe because your asset pipeline doesn't like using D3D functions, then things like Stripe are at least better than nothing - though personally I would be tempted to write my own. There is a discussion on the DirectXDev list about writing good strippers (titled simply "Stripping", started on 19th December 2005) - certainly worth a read.


One final point - there is another cache, the pre-Vertex Shader cache. This is usually just a simple cache of the Vertex Buffer memory. Fortunately, the way to optimise for this is easy. When using indexed primitives, your vertices can be in any order in the Vertex Buffer you wish - you simply renumber the indices. So once you have found the right triangle ordering that makes the post-VS cache happy, scan the list of tris and as you get a vertex that has not been referenced before, add it to the vertex buffer. This way, as you draw the mesh, you will be picking up vertices in a basically linear memory order, start to finish. This keeps the pre-VS cache happy (and also the AGP/PCIe bus happy). It's a simple reordering, and unlike the optimisations above, works basically the same on all hardware. D3DXOptimizeVertices will do this for you, but it's not difficult to write your own.
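Here is a minimal sketch of that reordering, assuming 16-bit indices, an arbitrary Vertex struct of your own, and that every vertex is referenced by at least one index (needs #include <vector>):

void ReorderForPreVSCache ( std::vector<Vertex> &verts, std::vector<unsigned short> &indices )
{
    const unsigned short NotSeenYet = 0xFFFF;
    std::vector<unsigned short> remap ( verts.size(), NotSeenYet );
    std::vector<Vertex> newVerts;
    newVerts.reserve ( verts.size() );

    for ( size_t i = 0; i < indices.size(); i++ )
    {
        unsigned short oldIndex = indices[i];
        if ( remap[oldIndex] == NotSeenYet )
        {
            // First time this vertex is referenced - it goes next in the new VB.
            remap[oldIndex] = (unsigned short)newVerts.size();
            newVerts.push_back ( verts[oldIndex] );
        }
        indices[i] = remap[oldIndex];
    }
    verts.swap ( newVerts );    // note: vertices never referenced by any index get dropped
}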




Strips, lists, fans, indexed or not?

D3D has a wide array of primitive types, and they each have different performance characteristics. So which one should you use? To lay down some ground rules, these are the things you need to optimise for when making this choice - most important first:

1: Number of DrawPrimitive/DrawIndexedPrimitive calls.
2: Number of vertices processed by the vertex shader (or transformed and lit by the fixed-function pipeline).
3: Total bandwidth of data sent to the chip.

Generally, each point is far more important than the ones after it - often by a factor of 10. This means that we rarely if ever have to make any trade-offs between different points - the earlier-listed ones are always more important, even if it affects the ones further down.

In D3D, there are six types of triangle primitive. There are strips, fans and lists, and each comes in indexed and non-indexed flavours. First of all - ignore fans. They're not even worth thinking about in D3D, because you have to keep restarting them, which means an extra DrawPrimitive call each time, and as this is factor number 1, that's never ever a good idea - don't even think it!

For similar reasons, you can ignore non-indexed strips, because either you have to keep restarting them with a DrawPrim call, or you have to use duplicate vertices to create degenerate triangles. But unlike the indexed case, those duplicate vertices are actual fat 20-40 byte vertices, and each one is another vertex to be vertex shaded, violating factors 2 and 3. The same argument can be used against non-indexed lists, but even more so - far too many vertices to be sensible.

So now we're down to two choices - indexed strips and indexed lists. If you don't know what indices are, look them up, because they're essential. There's very few cases where you don't need to use indices (e.g. particle systems), and even then I tend to use indices anyway just because it means all of my code uses the same code-path, and because in the case of particles, whether you use indices or not is irrelevant to performance in practice (it's fillrate that kills you on particles).

OK - so you've read up on indices? Good. Major fact to know about indices - they're the only way the vertex cache can do its job. If you don't have indices, the vertex cache is switched off, and this is a Very Bad Thing, because by using the vertex cache, you can dramatically reduce factor #2 - how many vertices actually get shaded. The vertex cache is typically a small cache on the GPU that avoids re-shading a vertex if it has already shaded it recently. It is usually between 8 and 32 entries, which isn't large, but it turns out that 32 is about the maximum useful size anyway.

Whichever primitive type you use, your priority is to first optimise the order the tris are drawn in to keep the vertex cache happy. For this, you need to just ignore what primitive type you're going to use - the order of the triangles is all that matters to the vertex cache. See the entry above (Vertex Cache Optimisation) for a discussion of this.

Once you've optimised your triangle ordering for good vertex-cache coherency, whether you then choose to use indexed lists, strips or fans is a fairly minor point. Some hardware prefers one type of primitive slightly over another, but in general you can use whatever method you like. The best one to use is the one that minimises the number of indices you generate, simply because more indices = more memory. On the face of it, strips should be better than lists, right? Well, not really - if you need to keep making degenerate tris to join "striplets" together, then you'll generate more indices than lists (which don't need degenerates). Although degenerate triangles are cheap with indexed primitives, because the hardware simply checks whether two indices are the same and throws the triangle away, they are not completely free - this culling does take a bit of time.

Some APIs (not PC DX5-DX9, but some consoles, and possibly future PC APIs) also allow a "restart" index, which means you don't need to use degenerate triangles, and they can make strips better than lists, and also make fans a reasonable option as well. My personal choice is to always use indexed lists, because the difference between them and indexed strips is tiny, they are far easier to construct on the fly (e.g. for particle effects or text rendering or HUDs), and it makes the code far simpler to use a single primitive type everywhere. But using indexed strips is a perfectly valid choice as well.

The important thing is - when choosing a primitive type, DO NOT change the order of the triangles, or you'll start hurting vertex cache efficiency, and that is never a good thing.


NOTE - the above advice may change in DX10, because I hear the primitive types might get some tweaking there. If you're reading this and DX10 has been released, then email me and remind me to update this entry for DX10!




Friday, December 16, 2005
Cunning ways to avoid clearing the framebuffer.

Or rather, why you shouldn't try to do this! In ye olden tymes of the Voodoo2, filling pixels was expensive, and you could avoid having to do a clear by using a convoluted method (use only half the Z-range, flip it every even/odd frame, never actually clear any pixels).

But that is all silly nowadays, and indeed it's usually slower. There are two big things to remember with modern cards:

(1) Clearing pixels to constant colours is fast. Amazingly fast. They can set constant colours or Z values at up to 256 times the speed of the fastest shaders! But you only get this if you're clearing an entire surface to a single value. You get this speed whether you're clearing Z/stencil or a colour.

(2) Z and framebuffer compression. Almost all modern cards can have pixels in either "compressed" or "uncompressed" states. Obviously, compressed pixels are faster to render to. Pixels can change from compressed to uncompressed by rendering triangles to them, but on many DX8 and DX9 cards, they can only change back to being compressed if you do a clear. If you never clear the framebuffer or Z/stencil buffers, then pixels can never become compressed again, and after a few frames of rendering, all your pixels will be uncompressed, and rendering speed will suffer.


So, taking these two facts, you can see that it is important to always clear both framebuffer and Z/stencil every frame if you can. Obviously there are some funky effects that involve not clearing them, but unless that's deliberately what you intend, always do clears. Note that this applies to both the framebuffer and any auxiliary buffers you render to - shadowbuffers, environment cubemaps, etc.

Also, if you are only rendering to a portion of a buffer (e.g. you have a 1024x1024 texture handy, but you only want a 512x512 shadowbuffer), make sure you clear the whole thing anyway, not just the region you're going to use. Cards want to do the equivalent of a C memset() - a single contiguous memory wipe. If you specify a subrectangle, sometimes they can't do this. Some schemes (such as rendering impostors) require you to only do subrect clears. If that's what you need, then do it - it's still quite fast. Sometimes a renderer's gotta do what a renderer's gotta do. But if you can clear the whole buffer, do so.

Also remember about "SLI" setups - multiple cards in one machine. Ideally, you want to stop the cards having to exchange data. To do this, these cards frequently run "Alternate Frame Rendering", so card 1 does even frames while card 2 does odd ones. But if you don't clear the framebuffer&Z/stencil, they might think you need the data from one frame to be used in the next, and card 1 has to send its FB/Z to card 2, and this is slow. By doing a full-screen clear of both buffers, you are telling the cards they don't need to transfer the data, because you're just going to set it all to black anyway. Drivers are smart enough to spot this, and your framerates will improve.

Also, tile-based renderers (PowerVR and Intel cards) will thank you for doing clears, for lots of reasons.

Finally, remember to always clear Z and stencil, even if you don't use stencil. Because they're packed together, if you only clear Z, the card has to carefully preserve the stencil buffer, and you won't get the fast clears. Obviously if that's what you want to happen, because you deliberately want to preserve the stencil value, then that's what has to happen. But if you don't use or care about the stencil, clear it, and then the card can do its fast clears.

So, to summarise - if you have an indoor scene with a skybox outside, the best sequence of rendering is this:

1 - Clear the framebuffer and Z/stencil.
......This is very fast and sets all pixels to their compressed states.
2 - Enable Z write & Z test.
3 - Render the indoors scene, trying to do so mostly front to back.
...... This makes sure you lay down a Z buffer that will reject as many hidden pixels as possible.
4 - Disable Z writes, keep Z tests enabled.
5 - Render the skybox.
...... This rejects as many skybox pixels as possible. Since the skybox can't have anything behind it, there's no point in writing Z.
6 - Render any translucent objects (glass, particles, etc).

There are many other orders to do this in, but this is generally the most efficient on all cards.
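In DX9-ish code, the skeleton of that sequence looks roughly like this (pDevice is your D3D9 device; RenderIndoors(), RenderSkybox() and RenderTranslucentStuff() are your own functions; the stencil clear assumes your depth format actually has a stencil channel):

pDevice->Clear ( 0, NULL, D3DCLEAR_TARGET | D3DCLEAR_ZBUFFER | D3DCLEAR_STENCIL,
                 0x00000000, 1.0f, 0 );

pDevice->SetRenderState ( D3DRS_ZENABLE, D3DZB_TRUE );
pDevice->SetRenderState ( D3DRS_ZWRITEENABLE, TRUE );
RenderIndoors();                     // roughly front to back

pDevice->SetRenderState ( D3DRS_ZWRITEENABLE, FALSE );    // still testing, not writing
RenderSkybox();

RenderTranslucentStuff();            // back to front, as usual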




Sunday, August 31, 2003
State blocks and the D3DX Effects framework

Dan Baker from Microsoft had a handy hint when using the Effects framework on older drivers. This is related to another FAQ article What's the best way to do state changes?, so go read that one too.

Effects make extensive use of stateblocks. Generally, performance and stability issues with D3DXEffects are actually issues with a driver's stateblock handling. Unfortunately, with a pure device, there is no way to save/restore state without using stateblocks.

You can massively decrease Effects' use of StateBlocks by telling D3DXEffect not to preserve and restore state, by using the D3DX_DONOTSAVESTATE parameter in the Begin method. This will mean that Effects will no longer maintain state.
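For reference, in the DX9 headers the flag is spelled D3DXFX_DONOTSAVESTATE, and a render loop using it looks roughly like this (using the BeginPass/EndPass flavour of the interface from later DX9 SDKs; DrawTheGeometry() is your own code, and remember you are now responsible for leaving the device in a sensible state afterwards):

UINT numPasses = 0;
pEffect->Begin ( &numPasses, D3DXFX_DONOTSAVESTATE );
for ( UINT pass = 0; pass < numPasses; pass++ )
{
    pEffect->BeginPass ( pass );
    DrawTheGeometry();               // your own draw calls
    pEffect->EndPass();
}
pEffect->End();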

If you think it is stateblocks that are causing the problem, set the following flag in the registry key under D3D

EmulateStateBlocks

Make it a DWORD value and set it to 1. This will cause D3D to emulate state blocks rather than passing them on to the driver. This will only work for non-PURE devices, so you will need to create your device differently.


The full registry address he mentioned is HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Direct3D

One thing that was mentioned is that this is true even in release builds. And some published games use the D3DXEffects stuff, because it's rather nifty. So if you're having strange slowdowns with your older nVidia driver, maybe it's worth trying this tweak. At least then you can moan at them with a higher precision than before.




Monday, February 03, 2003
What happens when I set a texture to NULL?

Well, there's a number of different options. There's what should happen, there's what often happens, there's what more frequently happens, and there's what you can rely on happening. So let's address these one by one.

What should happen

According to the refrast source, if you read any texels from a NULL texture, you get opaque black. That is, RGB = 0.0 and alpha = 1.0.

However, if you are using the TextureStageState pipeline (i.e. not a Pixel Shader), and ARG1 of a stage is set to D3DTA_TEXTURE, then that TSS stage is effectively set to D3DTOP_DISABLE instead, and this means all the subsequent stages will be ignored. Note this does not happen if it's ARG2 that is set to TEXTURE - then you read opaque black and the stage does what it usually does. Which is a bit arbitrary, but there it is.

Remember - that's according to the refrast source, and in theory that's the official spec. But read on...

What often happens

Some cards get the bit about reading opaque black texels right, but many don't get the bit about disabling that TSStage and all subsequent ones. So you just get normal TSS processing.

What more frequently happens

Most cards actually follow the OpenGL standard which is to read opaque white. And also ignore the thing about disabling stages.

The other common behaviour is to give you random texture data.

What you can rely on

Nothing. Absolutely nothing at all. The D3D spec has changed a few times, the drivers have changed many times, different IHVs implement different things, and frankly anything could happen - you're doing well if it doesn't simply BSOD your machine, call you names and spit in your beer. Well OK - I've never actually had a machine die on me for setting a NULL texture, but you do get fairly unpredictable results on your screen. The best advice is to avoid doing this. Write your Pixel Shaders or TextureStageState setups to do what you want to do, and no more. If that means you need a different shader/TSS for the "NULL texture" case, then do it. Usually that shader/TSS will run a lot faster than using a NULL texture, and it will be a lot more predictable. This is one of those odd little corners of the D3D spec that are very badly supported - best to avoid it.

Warning - some or all of the above may be inaccurate :-)




Friday, August 23, 2002
What TextureStageState stuff can different cards do?

It's a difficult question to answer fully. Drivers change, bugs get fixed (or created!), etc. The short answer is "use ValidateDevice()". The longer answer is "know roughly what most cards do, know what the gotchas are, write multiple versions of your TSS shaders, and use ValidateDevice()". The addendum is "sometimes ValidateDevice() lies". For StarTopia, I ended up just having a huge list of cards, identified by DeviceID, with various messages to the game telling them which things the card did wrong. Sometimes cards had sections on various driver versions, because older drivers got more stuff wrong. Note that this file did not tell the game which TSS setup to use - that is driven by ValidateDevice(). What it does say is when ValidateDevice() gets it wrong, which is a lot less info to gather on all known cards.

Naturally, cards released after the game ships will have unknown DeviceIDs, so you need sensible defaults. So far StarTopia has done sensible and fast things on all of them, which is fairly pleasing.

Anyway, to be able to code a sensible set of three or four TSS setups, you need to know roughly what the different cards can and can't do. Here's a brief summary of what I know. Note that most of my intensive card-testing experience was done just before the ship date of StarTopia, so mainly around June 2001. This means a lot of this material may be out of date. So watch yourselves. The newer fancier cards have obviously been tested more recently, though mainly with pixel shaders rather than big TSS blends. The basic two and three stage TSS blends seem to work on all the pixel shader cards, which is what you expect.

Any IHVs reading - feel free to correct or expand any of this info - I'll keep it updated.


-Single-texture cards - a real mixed bag. All sorts of limitations and legacy nastiness. Mal Lansell has good info on most of these, and this covers most of what I know as well.

-Rage 128 - see Mal's post above. Older drivers have trouble with TFACTOR (see below), now fixed.

-Radeon 1,7x00: Three-stage TSS implementation. Supports TEMP and triadic blends. Everything seems to work, no surprises. Older drivers have trouble with TFACTOR (see below), now fixed.

-Radeon 8x00: PS1.4, but also exposes a 6-texture TSS implementation with TEMP and triadics. I haven't tried more than about four stages (anything more and I start to use pixel shaders instead), but so far no surprises.

-Radeon 9x00: haven't tried one yet. Unlikely to be any gotchas when using TSS, seeing as it exposes PS2.0.

-TNT1,2,Ultra,etc - see Mal's post above.

-GeForce 1,2,4MX,4Go: basically two general-purpose TSS stages. An additional first stage can do SELECTARG1(TEXTURE) as well. An additional last stage can sometimes do MODULATE(CURRENT,DIFFUSE), though it seems temperamental whether it will let you or not. Always check with ValidateDevice(). Recent drivers expose triadic blends and the TEMP register, but I've not played with that aspect yet. They also have the "8-stage blend" hack that drives the register combiners directly rather than using the TSS model - you'll need to contact nVidia devrel for more info. Drivers have trouble with TFACTOR (see below).

-GeForce 3,4Ti: PS1.1/1.3. Fairly sound 4-stage TSS implementation - not found any surprises yet, but again, if I'm doing anything complex, I usually go for pixel shaders, so the TSS stuff has not had a great deal of testing. Triadics and TEMP supported, and work. Drivers may have trouble with TFACTOR (see below).

-G4xx and G5xx series: Two-stage TSS implementation, though as mentioned recently on the list, ARG1 must be TEXTURE (and no triadics or TEMP). I vaguely remember vague vagueness of a third stage being available to do MODULATE(DIFFUSE,CURRENT), but people seem to have trouble getting it working.

-Parhelia: not played with one yet. Exposes PS1.3. I would expect a fully-featured 4-stage TSS implementation.

-PowerVR Kyro series: 8-stage TSS exposed. Yes, it really works. Not sure if triadics and/or TEMP are supported - from what I know of the hardware I don't _think_ so, but a PowerVR devrel person should rush to correct me if I'm wrong.

-3Dlabs P10: not played with one. Again, lots of pixel-shader goodness, so TSS implementations is probably fully-featured.


I don't know anything about any of the other chips.


The "trouble with TFACTOR" mentioned above is when doing multiple renders, and only updating the TFACTOR register between them, e.g. to change colour or opacity. A suprising number of drivers get this very wrong. Bugs seen:

-TFACTOR colour is effectively random. Presumably some timing problem, but I couldn't fix it in any sensible way.

-TFACTOR colour is the first-rendered thing. Subsequent changes are ignored, and everything gets rendered with the same colour.

-TFACTOR colour is the last-rendered thing. Everything gets rendered with the same colour.

-TFACTOR changes are ignored unless you change something else in the TSS pipeline (to force the driver to do a "rethink"). I usually found something simple like changing an alpha-channel blend from SELECTARG1(TEXTURE) to SELECTARG2(TEXTURE) or whatever.

This is the sort of annoying thing that needs testing for and overriding according to DeviceID and driver version - you can usually find a way to do things without TFACTOR, notably by using the D3DMATERIAL stuff. Changing material is fairly quick, and so far hasn't failed on any cards.


Hope that gives you a feel for the cards. When in doubt, always check with the IHV concerned. If you can't get a blend to work, they will usually be able to tell you the most efficient way to do the blend in a slightly different way, or give you the available options for multi-pass versions.




Wednesday, July 24, 2002
Why ZBIAS is not a good thing

ZBIAS looks great at first sight. It's a little tweak to your Z values and stops Z-fighting when things are coplanar but don't actually share vertices (e.g. posters stuck on walls). However, it's a problem. The problem is that the actual behaviour of ZBIAS is not well defined. The number you feed in has no absolute meaning - each card is free to interpret it any way it likes. So a ZBIAS of 3 may work in a particular scene on card A. Card B only need a ZBIAS of 1 to work. However, card C still Z-fights, so you need to increase the number to 7. OK, so you set it to be 7 on everything. Except now that's far too high for card B, and now when a person walks past the poster on the wall, the poster is drawn in front of the person! Argh. This is not a mythical example, this actually happens on two very common cards - the value that works for one is far too high for the other, and vice versa.

For this reason, many cards simply do not support ZBIAS. This is actually a Good Thing :-)

The safest option is not to use ZBIAS. The easy alternative is to push your near and far clip planes away from the camera a bit when rendering things that need biasing towards the camera. This will not change where on the screen the points are rendered, it will simply move those objects slightly nearer the camera in the Z buffer. The advantage of this method is that it works the same way on all cards, and the same bias works consistently (within a small factor because of slightly variable Z-precision). It is not very expensive to change the projection matrix (no more than any other matrix change), and if you store two projection matrices (one biased, one normal) every time the viewpoint changes (which isn't very often), then there's no recalculation needed, you just send the correct matrix to the card.
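As a sketch, you can build the two matrices with D3DX something like this (fovY and aspect are whatever you normally use, RenderWalls()/RenderPosters() are stand-ins for your own rendering, and the 1.02f scale is just an example number to tune, not a magic constant):

D3DXMATRIX matProjNormal, matProjBiased;
float zNear = 1.0f, zFar = 1000.0f;

D3DXMatrixPerspectiveFovLH ( &matProjNormal, fovY, aspect, zNear, zFar );
D3DXMatrixPerspectiveFovLH ( &matProjBiased, fovY, aspect, zNear * 1.02f, zFar * 1.02f );

pDevice->SetTransform ( D3DTS_PROJECTION, &matProjNormal );
RenderWalls();                       // the normal scene

pDevice->SetTransform ( D3DTS_PROJECTION, &matProjBiased );
RenderPosters();                     // the coplanar stuff that needs pulling towards the camera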

ZBIAS is not implemented the same way on every card, and is not supported at all by some. Moving the clip planes is, and is.




Wednesday, July 03, 2002
What's the best way to do state changes?

Remember - there are three sides to this question. (1) what the app does, (2) what the D3D layer does and (3) what the videocard driver does. All affect performance in different ways. A few facts:

  • On a PURE device, SRS and STSS calls go straight to the driver (or rather, they go into the D3D command-buffer that goes to the driver).

  • On a non-PURE device, they are filtered by D3D. Redundant calls (i.e. changes to the same value) get filtered. Non-redundant values are added to the command-buffer.

  • On almost all drivers, when they get a state change command, they simply remember it. When they get a DIP-style command, then they think about setting states and so on.

  • I believe all IHVs now handle state blocks well. Older nVidia drivers do support state blocks natively, but with an eccentric and slow implementation. Careful about them.

  • On drivers that don't handle state blocks natively, the D3D runtime expands them internally into multiple SRS and STSS calls, filters redundant ones, and adds the others to the command-stream.

  • On drivers that handle state blocks natively, on a PURE device, the D3D runtime just puts the "SetStateBlock" command into the command buffer. It does not snoop it for redundant states.

  • On drivers that handle state blocks natively, on a non-PURE device, I think the D3D runtime snoops the state block command for the renderstates it contains, updates its internal settings (so it can filter out future redundant states) and passes the SetStateBlock command into the command buffer.

  • Drivers that handle state blocks natively will either expand them into SRS and STSS states internally (only slightly more efficient than reading them from the command-stream), or will do UltraCunningThings that minimise bus traffic and CPU time and all that malarky. Note that redundant states stored in the state block will be vaped very very quickly in either case.


    From these facts we can see that:

  • Redundant state sets get vaped by the D3D layer, except on a PURE device. But you cost yourself an app->D3D call.

  • Redundant state sets and multiple state sets (i.e. the same state, set several times, to different values) get vaped by the driver. But you cost yourself an app->D3D call and an entry in the command-buffer (a tiny cost).

  • All states are set at the same time in the driver. For this reason, most IHVs say that changing a single state is expensive, but changing lots of states at the same time is not much more expensive.

  • Using state blocks at worst saves you multiple SRS and STSS calls. At the best, it goes really fast and stuff.


    So, personally I use slightly-wrappered state blocks. If I detect that the driver is shockingly slow at state blocks (some old nVidia drivers are), then I switch to emulating them myself. Essentially, I do what the D3D runtime does for drivers that don't support them - I expand them out into SRS and STSS calls (culling redundant calls of course). This is extremely fast to do - it is faster than the conventional wrapping of SRS and STSS calls that people do, because all the setting/culling is done in one single tight loop, not spread all over the code.

    I have a big chunk of code that shows how to effectively capture and replay state to/from state blocks. So you can set your states up conventionally using SRS/STSS calls, then capture them into state blocks (done at start of day). This code can also serve as the replay of state blocks when the driver doesn't like you just calling SetStateBlock, although in practice I use a much tighter loop with redundancy checking, but it is a very similar structure. At the moment you can find it in the DirectX mailing list archives here (if that doesn't work, just search for the word "ADDRENDERSTATE" in the archives). I'll put it somewhere more sensible in time.

    (SRS = SetRenderState, STSS=SetTextureStageState. There are other state-setting calls as well such as SetTransform - include them in the above comments).
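    Here is a bare-bones DX9-style sketch of the basic record/replay idea (DX8 does the same thing with DWORD handles and ApplyStateBlock; the code referred to above does rather more - redundancy filtering, emulation when the driver is slow, and so on):

    IDirect3DStateBlock9 *pBlock = NULL;

    // Record: everything set between BeginStateBlock and EndStateBlock goes into the block.
    pDevice->BeginStateBlock();
    pDevice->SetRenderState ( D3DRS_ALPHABLENDENABLE, TRUE );
    pDevice->SetRenderState ( D3DRS_SRCBLEND, D3DBLEND_SRCALPHA );
    pDevice->SetRenderState ( D3DRS_DESTBLEND, D3DBLEND_INVSRCALPHA );
    pDevice->SetTextureStageState ( 0, D3DTSS_COLOROP, D3DTOP_MODULATE );
    pDevice->EndStateBlock ( &pBlock );

    // Replay: one call instead of a pile of SRS/STSS calls.
    pBlock->Apply();

    // When you're done with it:
    pBlock->Release();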




    Tuesday, June 18, 2002
    Why is software T&L so slow?

    Well, it's slower than hardware, but not that slow. If you're finding massive speed differences (i.e. more than 10x), you may be doing something wrong. You need to be aware of the fundamental difference between software T&L and hardware T&L.

    Hardware T&L only transforms vertices as-and-when you actually use one by referencing it with an index. To prevent it from transforming vertices multiple times (when used in different triangles), hardware has a vertex cache. Keeping this vertex cache happy is the key to getting good HWT&L speed.

    Software T&L doesn't do this - it's inefficient. What is far more efficient is for the T&L engine to start at one vertex in a VB and T&L the next n vertices, ignoring which ones are actually referenced by indices. So to keep it happy, you need to make sure that when drawing a sub-section of your VB, all the vertices that are used lie in a single contiguous chunk. You tell it where this chunk is using the MinIndex and NumVertices arguments to the DrawIndexedPrimitive call (see the DrawIndexedPrimitive FAQ entry below for details).

    For software T&L, an index list of (1,2,99, 99, 2,100) is terrible. You're drawing two triangles using four vertices, but the software T&L engine is transforming vertices 1-100. Terrible efficiency (it is worth noting that hardware T&L doesn't like it much either because of memory-cache efficiency, but it's nowhere near as lethal as the software case). You need to move vertices 99 and 100 down to slots 3 and 4, then draw (1,2,3, 3,2,4) instead.

    So you need to make sure that (a) all vertices used in a single draw command are near each other in the VB and (b) that you set the MinIndex and NumVertices values to accurately tell the software T&L engine which ones you are using - no more, no less. The DX debug runtime will warn you if you try to use MinIndex and NumVertices values that don't cover all the vertices you are using, but it cannot usually spot that you are T&Ling too many vertices. So you need to check this yourself.

    Software T&L can be fast, and with a fast CPU and a slower video card, it can be faster than hardware T&L (depending on what you are doing). If your video card does not support Vertex Shaders, it is certainly worth trying out software Vertex Shaders, rather than throwing up your hands in despair - they are pretty quick. But they work in a very different way to hardware T&L, and you need to make sure you are driving both efficiently.




    Thursday, May 30, 2002
    How do Texture Transform matrices work?

    The answer is - slightly oddly. If you are familiar with the OpenGL ones, forget them. These are subtly and confusingly different. So best just to forget them and start again. I am using the DirectX8 terminology here, but it translates fairly easily to DX7 or DX9.

    The basic operation is in four steps:

    (1) Coordinates (1-4 components) are fed into the texture transformation unit.
    (2) They are transformed by the matrix (i.e. a matrix multiply operation).
    (3) The components of the result may be divided by one of the components (used for projective texturing), and that component is discarded.
    (4) The results are fed to the texture units as texture coordinates.

    Step (1) is controlled by SetTextureStageState ( n, D3DTSS_TEXCOORDINDEX, D3DTSS_TCI_xxx ), where xxx is either:
    - PASSTHRU - you supply 1 to 4 values in the vertex itself. The number of components is determined by the FVF of the vertex, and the coordinate set is determined by the number ORed with D3DTSS_TCI_PASSTHRU. So passing in D3DTSS_TCI_PASSTHRU | 2 means that the coordinates should be taken from the third set (they are numbered from 0) of coordinates specified in the vertex/FVF.
    - One of the CAMERASPACEyyy enumerations - 3 values are always used, and they are the result of calculations using other components of the vertex such as position or normal.

    However they are generated, these values are padded to a 4-component vector in the following ways:
    -1 component (x) -> (x,1,0,0)
    -2 components (x,y) -> (x,y,1,0)
    -3 components (x,y,z) -> (x,y,z,1)
    -4 components (x,y,z,w) -> (x,y,z,w)

    Also note that TEXCOORDINDEX requires an FVF input field, even when not using D3DTSS_TCI_PASSTHRU. This is because the behaviour of the D3DRS_WRAPzzz renderstates operates on vertex inputs, _not_ texture coordinates. So the value of zzz corresponds to the number ORed with CAMERASPACEyyy. This behaviour is slightly more rational under DX9 fortunately.

    Step (2) is controlled by the texture transform matrix set using SetTransform ( D3DTS_TEXTUREn, &matrix ), where n is the appropriate Texture Stage State stage. The 4-component input defined by step (1) is transformed by the 4x4 matrix to produce a 4-component output vector.

    Step (3) is controlled by SetTextureStageState ( n, D3DTSS_TEXTURETRANSFORMFLAGS, D3DTTFF_COUNTxxx ). The value of xxx determines how many components are required by the rasteriser. Also, the argument may be ORed with D3DTTFF_PROJECTED. This causes the result of step (2) to have three of its components divided by the other one. Which component is the divisor is determined by the value of D3DTTFF_COUNTxxx:
    -D3DTTFF_COUNT1 | D3DTTFF_PROJECTED: invalid combination
    -D3DTTFF_COUNT2 | D3DTTFF_PROJECTED: (x,y,z,w) -> (x/y,*,*,*)
    -D3DTTFF_COUNT3 | D3DTTFF_PROJECTED: (x,y,z,w) -> (x/z,y/z,*,*)
    -D3DTTFF_COUNT4 | D3DTTFF_PROJECTED: (x,y,z,w) -> (x/w,y/w,z/w,*)

    An asterisk(*) means that this value is not well-defined by the docs, and trying to actually use the value generated here is unwise, as it may vary between different bits of hardware. For example, do not use D3DTTFF_COUNT2 | D3DTTFF_PROJECTED when the texture used is a 2D one.

    For more details, look at the DX8.1 docs section entitled "Texture Coordinate Processing", and all its sub-sections. Especially note the one called "Special Effects", as it shows the correct matrix for scrolling a 2D texture around.
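    A concrete sketch of that scrolling matrix: because a 2-component input is padded to (x,y,1,0), the translation has to live in the third row of the matrix (_31/_32), not the fourth row you would use for positions. Rough example code (scrollU/scrollV are your per-frame offsets):

    D3DXMATRIX mat;
    D3DXMatrixIdentity ( &mat );
    mat._31 = scrollU;               // translation comes from row 3, because of the (x,y,1,0) padding
    mat._32 = scrollV;

    pDevice->SetTransform ( D3DTS_TEXTURE0, &mat );
    pDevice->SetTextureStageState ( 0, D3DTSS_TEXCOORDINDEX, 0 );                      // first UV set, passed through
    pDevice->SetTextureStageState ( 0, D3DTSS_TEXTURETRANSFORMFLAGS, D3DTTFF_COUNT2 );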




    Tuesday, May 28, 2002
    What is the difference between D3DCAPS8::MaxTextureBlendStages and D3DCAPS8::MaxSimultaneousTextures?

    The first is how many TSS blend stages the driver understands. The second is how many different textures you can use at a time. These two are not the same things - you may be able to use extra blend stages without using extra textures. In fact one of the commonest blends on 2-texture hardware (e.g. Radeon 1/7x00, GeForce1/2, Permedia3) is this one for bump mapping:

    Stage0: DotProduct3 ( Texture, Diffuse )
    Stage1: Modulate ( Texture, Current )
    Stage2: Modulate ( Tfactor, Current )
    Stage3: Disable

    Here, the Diffuse register holds the light vector, and the Tfactor register holds the light's colour. As you can see, we are using only two textures, but three blend stages. So something like the GeForce 1 and 2 usually have MaxSimultaneousTextures=2, but MaxTextureBlendStages=8. Note that this does not mean you can always use 8 stages - in practice you can rarely use more than 3 on the GF1&2 - you must always use ValidateDevice() at the start of day to check which of your blend modes actually works on the current hardware.

    (note that the alpha channel pipeline is not shown above for clarity - but if you want to use this blend on real hardware, you need to put a sensible one in - see the FAQ about disabling alpha before colour below)
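    Written out as actual SetTextureStageState calls, with a simple pass-through alpha pipeline added so the whole thing validates, the blend looks roughly like this (pDevice is your device; as always, run it past ValidateDevice() before trusting it on any particular card):

    pDevice->SetTextureStageState ( 0, D3DTSS_COLOROP,   D3DTOP_DOTPRODUCT3 );
    pDevice->SetTextureStageState ( 0, D3DTSS_COLORARG1, D3DTA_TEXTURE );
    pDevice->SetTextureStageState ( 0, D3DTSS_COLORARG2, D3DTA_DIFFUSE );
    pDevice->SetTextureStageState ( 0, D3DTSS_ALPHAOP,   D3DTOP_SELECTARG1 );
    pDevice->SetTextureStageState ( 0, D3DTSS_ALPHAARG1, D3DTA_TEXTURE );

    pDevice->SetTextureStageState ( 1, D3DTSS_COLOROP,   D3DTOP_MODULATE );
    pDevice->SetTextureStageState ( 1, D3DTSS_COLORARG1, D3DTA_TEXTURE );
    pDevice->SetTextureStageState ( 1, D3DTSS_COLORARG2, D3DTA_CURRENT );
    pDevice->SetTextureStageState ( 1, D3DTSS_ALPHAOP,   D3DTOP_SELECTARG2 );
    pDevice->SetTextureStageState ( 1, D3DTSS_ALPHAARG2, D3DTA_CURRENT );

    pDevice->SetTextureStageState ( 2, D3DTSS_COLOROP,   D3DTOP_MODULATE );
    pDevice->SetTextureStageState ( 2, D3DTSS_COLORARG1, D3DTA_TFACTOR );
    pDevice->SetTextureStageState ( 2, D3DTSS_COLORARG2, D3DTA_CURRENT );
    pDevice->SetTextureStageState ( 2, D3DTSS_ALPHAOP,   D3DTOP_SELECTARG2 );
    pDevice->SetTextureStageState ( 2, D3DTSS_ALPHAARG2, D3DTA_CURRENT );

    pDevice->SetTextureStageState ( 3, D3DTSS_COLOROP,   D3DTOP_DISABLE );
    pDevice->SetTextureStageState ( 3, D3DTSS_ALPHAOP,   D3DTOP_DISABLE );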




    Monday, May 27, 2002
    DrawIndexedPrimitive

    A common question is - what do all those arguments actually mean?

    The DX9 version of DrawIndexedPrimitive is probably the best model. The DX8 one does not have a BaseVertexIndex argument in the DIP call itself - it is supplied when you call SetIndices. But both do the same thing, the argument is just in a different call on DX8.

    HRESULT DrawIndexedPrimitive(
    
    D3DPRIMITIVETYPE Type,
    INT BaseVertexIndex,
    UINT MinIndex,
    UINT NumVertices,
    UINT StartIndex,
    UINT PrimitiveCount
    );

    StartIndex - the number of the first index used to draw primitives, numbered from the start of the index buffer. So if this is 20, then the first twenty entries in the index buffer are skipped, and the first primitive starts rendering from the 21st index.

    PrimitiveCount - the number of primitives drawn.

    BaseVertexIndex - this is added to all indices before being used to look up a vertex in a vertex buffer. So if this is 100, then the indexed triangle 1,2,4 will actually use vertices 101, 102, 104 from the VB.

    MinIndex - this is the lowest-numbered vertex used by the call. Note that BaseVertexIndex is added before this is used. So the first actual vertex used will be (MinIndex+BaseVertexIndex).

    NumVertices - the number of vertices in the VB after MinIndex used by this call. So the last vertex in the VB used by the call must be less than (MinIndex+NumVertices+BaseVertexIndex).

    Note that MinIndex and NumVertices are not used by hardware pipelines - they transform and light vertices only when referenced by an index. But software pipelines are far more efficient when they can start at a place in the VB and T&L the next n vertices. That is what these numbers are for - they are to help the software pipeline. For this reason, you should avoid spreading vertices all over your VB - try to keep all the vertices for a single model in one contiguous block within VBs. This will also help the hardware T&L pipe to some extent, simply because most hardware has a pre-T&L vertex cache, and this will work best if vertices are close to each other.

    So, to recap. This call:

    HRESULT DrawIndexedPrimitive(
    
    D3DPT_TRIANGLELIST, // D3DPRIMITIVETYPE Type,
    20, // INT BaseVertexIndex,
    3, // UINT MinIndex,
    5, // UINT NumVertices,
    6, // UINT StartIndex,
    3 // UINT PrimitiveCount
    );

    With this index buffer:

    1,2,3, 3,2,4, 3,4,5, 4,5,6, 6,5,7, 5,7,8

    It will draw PrimitiveCount=3 triangles, starting with index number StartIndex=6. So it ignores the first two triangles and the last triangle in the index buffer and only uses these:

    3,4,5, 4,5,6, 6,5,7

    You have told it that you are only using vertices from MinIndex=3 to (MinIndex+NumVertices)=8, not including the last one, so you say you are only using vertices 3 to 7. Which is true - only the numbers 3 to 7 appear in this draw, even though other vertices are used elsewhere in the index buffer.

    Then BaseVertexIndex is added to the indices before the vertices are fetched from the vertex buffer. So the vertex numbers actually used will be:

    23,24,25, 24,25,26, 26,25,27

    ...and the software T&L pipe will know it only needs to transform vertices 23 to 27 inclusive.

    Many ask what the point of the BaseVertexIndex value is. This is useful when you have dynamic data that you are streaming through a dynamic vertex buffer. Because you are using the NOOVERWRITE/DISCARD model (you are, aren't you? Good, just checking), you will get a portion of vertex buffer to write vertices to that will probably change every frame. So you don't know ahead of time where the first vertex is going to be. But if you simply change BaseVertexIndex to be the number of this vertex, then you can precompute the index list at start of day, and then you don't need to change the indices each frame. Which can be a useful speedup.
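    A sketch of that streaming pattern in DX9 terms (Vertex, VB_SIZE_IN_VERTS, firstFreeVertex, numNewVerts and pNewVerts are all made-up names; the VB is assumed to be created with D3DUSAGE_DYNAMIC, the index buffer was built once at start of day for quads - four vertices and two triangles each - and only BaseVertexIndex changes per batch):

    void *pData = NULL;
    DWORD lockFlags = D3DLOCK_NOOVERWRITE;       // append - promise not to touch in-flight data
    if ( firstFreeVertex + numNewVerts > VB_SIZE_IN_VERTS )
    {
        lockFlags = D3DLOCK_DISCARD;             // buffer full - wrap round and start a fresh one
        firstFreeVertex = 0;
    }

    pVB->Lock ( firstFreeVertex * sizeof(Vertex), numNewVerts * sizeof(Vertex), &pData, lockFlags );
    memcpy ( pData, pNewVerts, numNewVerts * sizeof(Vertex) );    // needs <string.h>
    pVB->Unlock();

    pDevice->DrawIndexedPrimitive ( D3DPT_TRIANGLELIST,
                                    firstFreeVertex,     // BaseVertexIndex - the only per-batch change
                                    0,                   // MinIndex
                                    numNewVerts,         // NumVertices
                                    0,                   // StartIndex - same precomputed indices every time
                                    numNewVerts / 2 );   // PrimitiveCount - two tris per four-vertex quad

    firstFreeVertex += numNewVerts;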




    Thursday, May 23, 2002
    When to disable the TextureStageState pipeline.

    Drivers can get very confused if you don't set D3DTOP_DISABLE in both the colour and alpha pipes at the same time. In fact, according to the docs in DX8 (this was not true in DX7), it's just plain Wrong:

    D3DTOP_DISABLE
    Disables output from this texture stage and all stages with a higher index. To disable texture mapping, set this as the color operation for the first texture stage (stage 0). Alpha operations cannot be disabled when color operations are enabled. Setting the alpha operation to D3DTOP_DISABLE when color blending is enabled causes undefined behavior.

    In theory the driver should know what to do. In practice it mostly works, but every now and then drivers get confused and do the wrong thing. Which is probably why the DX8 docs were changed to forbid it. It's evil because it works 90% of the time (i.e. when you're coding it), and then falls over horribly and without warning on 10% of cards/drivers (i.e. when you're trying to ship the game).




    The first FAQ is of course, "how do I unsubscribe". The answer is at the bottom of every post: http://DISCUSS.MICROSOFT.COM/archives/DIRECTXDEV.html




    This is my own little version of the DX FAQ. OK, there might be some non-DX FAQ entries in here (especially on VIPM or something), but it's mainly for the benefit of people on the DXDev mailing list:

    -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
    MSDN DirectX Developer Centre: http://msdn.microsoft.com/DirectX
    -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
    Web Interface: http://DISCUSS.MICROSOFT.COM/archives/DIRECTXDEV.html
    Problems/Suggestions: DIRECTXDEV-request@discuss.microsoft.com
    Use the Web Interface (above) to unsubscribe from the list.
    Use plain-text only. HTML is not accepted. Attachments are removed
    -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

    Note that this is not the official FAQ, that is here. And it's pretty good, has quite a bit of stuff in it, and gets updated fairly frequently. Annoyingly, it also moves around fairly frequently, so the above link might not work, and you may need to go to MSDN and search for it.

    TomF - tomf AT radgametools DOT com