I have managed to compile vdrift on VC9. So I used the opportunity to test AMD CodeAnalyst on VDrift. ( I actually was interested in physics code profiling
)
Code:
VDrift, Windowed, 800x600, Shaders disabled, Car: 360, Track: Paul Ricard
Top 20 functions, 1215 instructions, 89.35% of samples in the module, 25.92% of total session samples
89.35% of samples in the module
70528 SCENENODE::GetCollapsedDrawList
19959 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::FrustumCull
10908 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::DrawList
5826 std::_Vb_const_iterator<unsigned int,int,std::vector<bool,std::allocator<bool> > >::operator+=
5093 SOUNDSOURCE::SampleAndAdvanceWithPitch16bit
5035 std::vector<bool,std::allocator<bool> >::operator[]
4333 std::_Vb_reference<unsigned int,int,std::vector<bool,std::allocator<bool> > >::_Getptr
4275 GRAPHICS_SDLGL::GLSTATEMANAGER::Set
4218 MATRIX4<float>::TransformVectorOut
3661 std::_Tree<std::_Tmap_traits<DRAWABLE_FILTER *, ...
2823 TEXTURE_GL::Activate
2227 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectTexturing
1894 SCENETRANSFORM::IsIdentityTransform
1829 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectFlags
1268 _ftol2_pentium4
1192 std::_Transform<SCENEDRAW const *, ...
1138 MATRIX4<float>::Equals
997 QUATERNION<double>::operator*
959 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectTransformStart
891 MATRIX4<float>::Multiply
A closer look at GetCollapsedDrawList:
Code:
36.36% of module samples
14122 for (list <DRAWABLE>::const_iterator i = drawlist.begin(); i != drawlist.end(); ++i)
0 {
32113 if (mi->first->Matches(*i))
0 {
15093 if (i->GetDrawEnable())
0 {
175 mi->second.push_back(SCENEDRAW(*i, this_transform));
0 }
0 }
0 }
So it is run on all drawables(also the static geometry) per frame?
50% of cpu time was spent in the gpu driver (169721 samples in atioglxx.dll).
The OpenGL API Trace looks pretty slick (Paul Ricard has 1309 objects + 133 textures):
Code:
408 glCallList
21 glDrawElements
Here an excerpt:
Code:
glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65
glCallList,21
glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65
glCallList,25
glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65
glCallList,29
glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65,
glCallList,36
glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65
glCallList,67
glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65
glCallList,83
glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65
glCallList,89
glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65
glCallList,96
glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65
glCallList,112
glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65
glCallList,133
glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65
glCallList,137
So we are using glCallList for static geometry. One interesting thing is that we sort by the texture, but it is still bound per list.
I wonder if it's possible to set the texture once for a number of lists. Or even batch the meshes sharing the same texture?
Hum, I know the scenegraph is a little heavier than it needs to be, but I'm surprised it's seeing so many samples in GetCollapsedDrawList. The only scenegraph nodes we should be collapsing per frame are the non-static objects. The static track objects only get collapsed once: at the end of GAME::LoadTrack. You're profiling the release optimized version, right? I wonder if the VC9 compiler is missing some optimization opportunities. You might want to try compiling with GCC and using the GNU profiler, then comparing the results. If you do "scons profiling=1" it'll compile with release optimizations, debugging symbols, and profiling all enabled (the framerate is lower with profiling enabled but it's still very playable). On my computer the frustum culling seems to take the longest. Moving to a hierarchical culling system would speed that up. I've also been thinking about how to make the scenegraph a little lighter weight.
About the texture stuff, I've always heard that OpenGL render state calls are pretty lightweight. Also, I assumed that the driver would take care of doing a NOOP instead of switching from a texture to an identical texture, but I could add that to VDrift pretty easily if it'd improve framerate (I'm not so sure it would). I've been meaning to get VDrift under perfHUD, but decided not to worry about it just yet since I'm planning on rewriting the rendering engine anyway.
Here's what the profile looks like on my machine after 30 seconds in the menus and 5 minutes on Paul Ricard:
Code:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
23.33 6.63 6.63 156017736 0.00 0.00 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::FrustumCull(SCENEDRAW&) const
9.41 9.31 2.67 90222322 0.00 0.00 QUATERNION<double>::operator*(QUATERNION<double> const&) const
4.85 10.69 1.38 58049499 0.00 0.00 AABB<float>::Intersect(AABB<float>::RAY const&) const
4.15 11.87 1.18 3331646 0.00 0.00 btQuantizedBvh::walkStacklessQuantizedTreeAgainstRay(btNodeOverlapCallback*, btVector3 const&, btVector3 const&, btVector3 const&, btVector3 const&, int, int) const
4.08 13.03 1.16 14907 0.00 0.00 SCENENODE::GetCollapsedDrawList(std::map<DRAWABLE_FILTER*, std::vector<SCENEDRAW, std::allocator<SCENEDRAW> >, std::less<DRAWABLE_FILTER*>, std::allocator<std::pair<DRAWABLE_FILTER* const, std::vector<SCENEDRAW, std::allocator<SCENEDRAW> > > > >&, MATRIX4<float> const&) const
3.50 14.03 0.99 6517888 0.00 0.00 CARDYNAMICS::GetWheelPositionAtDisplacement(WHEEL_POSITION, double) const
3.20 14.94 0.91 223250134 0.00 0.00 DRAWABLE_FILTER::Matches(DRAWABLE const&) const
3.13 15.82 0.89 94649152 0.00 0.00 GRAPHICS_SDLGL::GLSTATEMANAGER::Set(int, bool)
2.99 16.68 0.85 432324 0.00 0.00 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::DrawList(GRAPHICS_SDLGL::GLSTATEMANAGER&)
2.76 17.46 0.79 11867239 0.00 0.00 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectTransformStart(SCENEDRAW&, GRAPHICS_SDLGL::GLSTATEMANAGER&)
2.18 18.08 0.62 13613592 0.00 0.00 TEXTURE_GL::Activate() const
1.90 18.62 0.54 1598723 0.00 0.00 CARDYNAMICS::GetWheelSteeringAndSuspensionOrientation(WHEEL_POSITION) const
1.55 19.06 0.44 11866739 0.00 0.00 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectFlags(SCENEDRAW&, GRAPHICS_SDLGL::GLSTATEMANAGER&)
1.44 19.47 0.41 13613938 0.00 0.00 std::basic_string<char, std::char_traits<char>, std::allocator<char> > std::operator+<char, std::char_traits<char>, std::allocator<char> >(char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
1.37 19.86 0.39 35426224 0.00 0.00 void QUATERNION<double>::RotateVector<MATHVECTOR<double, 3u> >(MATHVECTOR<double, 3u>&) const
1.34 20.24 0.38 1229774 0.00 0.00 CARDYNAMICS::ApplyWheelForces(double, double, int, MATHVECTOR<double, 3u> const&, MATHVECTOR<double, 3u>&, MATHVECTOR<double, 3u>&)
1.23 20.59 0.35 19758101 0.00 0.00 BEZIER::Bernstein(float, MATHVECTOR<float, 3u>*) const
1.23 20.94 0.35 1844588 0.00 0.00 ROTATIONALFRAME<double>::RecalculateSecondary()
... etc
The hierarchical view is more interesting but it's too long for here so I put the top 100 on pastebin:
http://pastebin.com/r5RfggmV
Oh yeah, I almost forgot, if you run VDrift with the -profiling argument it will display some rough data about which subsystems are taking the longest.
Oh, I forgot to mention. The data is from the bullet branch. But it shouldn't matter as I didn't modify the scenegraph code.
I am testing it on a Windows 7 x64 machine(Athlon 64 X2). The VC9 build is compiled with /Ox. It is in fact about 20-30% slower than the GCC4.4.3 -O3 build.
Here the GCC build profile:
Code:
time seconds seconds calls ms/call ms/call name
26.43 23.14 23.14 SCENENODE::GetCollapsedDrawList(std::map<DRAWABLE_FILTER*, std::vector<SCENEDRAW, std::allocator<SCENEDRAW> >, std::less<DRAWABLE_FILTER*>, std::allocator<std::pair<DRAWABLE_FILTER* const, std::vector<SCENEDRAW, std::allocator<SCENEDRAW> > > > >&, MATRIX4<float> const&) const
18.22 39.09 15.95 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::FrustumCull(SCENEDRAW&) const
3.16 41.86 2.77 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::DrawList(GRAPHICS_SDLGL::GLSTATEMANAGER&)
2.86 44.36 2.50 GRAPHICS_SDLGL::GLSTATEMANAGER::Set(int, bool)
2.79 46.80 2.44 __gnu_cxx::__exchange_and_add(int volatile*, int)
2.78 49.23 2.43 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectTransformStart(SCENEDRAW&, GRAPHICS_SDLGL::GLSTATEMANAGER&)
2.46 51.38 2.15 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectFlags(SCENEDRAW&, GRAPHICS_SDLGL::GLSTATEMANAGER&)
2.35 53.44 2.06 std::_Rb_tree_increment(std::_Rb_tree_node_base*)
2.27 55.43 1.99 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectTexturing(SCENEDRAW&, GRAPHICS_SDLGL::GLSTATEMANAGER&)
1.90 57.09 1.66 std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned int)
1.85 58.71 1.62 std::string::append(std::string const&)
1.60 60.11 1.40 TEXTURE_GL::Activate() const
1.31 61.26 1.15 std::string::append(char const*, unsigned int)
1.27 62.37 1.11 std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)
1.23 63.45 1.08 std::basic_string<char, std::char_traits<char>, std::allocator<char> > std::operator+<char, std::char_traits<char>, std::allocator<char> >(char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
1.20 64.50 1.05 __gnu_cxx::__atomic_add(int volatile*, int)
1.07 65.44 0.94 std::string::reserve(unsigned int)
Funny huh? Will profile vdrift trunk asap.
Quote:About the texture stuff ...
Yeah, I think it's just me not trusting the OpenGL drivers on Windows. :lol:
Here the vdrift trunk profile(MinGW GCC4.4.3):
Code:
% cumulative self self total
time seconds seconds calls s/call s/call name
22.75 15.90 15.90 75346 0.00 0.00 SCENENODE::GetCollapsedDrawList(std::map<DRAWABLE_FILTER*, std::vector<SCENEDRAW, std::allocator<SCENEDRAW> >, std::less<DRAWABLE_FILTER*>, std::allocator<std::pair<DRAWABLE_FILTER* const, std::vector<SCENEDRAW, std::allocator<SCENEDRAW> > > > >&, MATRIX4<float> const&) const
14.74 26.20 10.30 95579158 0.00 0.00 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::FrustumCull(SCENEDRAW&) const
3.56 28.69 2.49 40361028 0.00 0.00 AABB<float>::Intersect(AABB<float>::RAY const&) const
2.52 30.45 1.76 527415 0.00 0.00 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::DrawList(GRAPHICS_SDLGL::GLSTATEMANAGER&)
2.48 32.18 1.73 62275440 0.00 0.00 QUATERNION<double>::operator*(QUATERNION<double> const&) const
2.40 33.86 1.68 122087853 0.00 0.00 GRAPHICS_SDLGL::GLSTATEMANAGER::Set(int, bool)
1.99 35.25 1.39 17632958 0.00 0.00 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectFlags(SCENEDRAW&, GRAPHICS_SDLGL::GLSTATEMANAGER&)
1.90 36.58 1.33 __gnu_cxx::__exchange_and_add(int volatile*, int)
1.85 37.87 1.29 17633034 0.00 0.00 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectTexturing(SCENEDRAW&, GRAPHICS_SDLGL::GLSTATEMANAGER&)
1.77 39.11 1.24 17632896 0.00 0.00 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectTransformStart(SCENEDRAW&, GRAPHICS_SDLGL::GLSTATEMANAGER&)
1.76 40.34 1.23 std::_Rb_tree_increment(std::_Rb_tree_node_base*)
1.59 41.45 1.11 T.382
1.59 42.56 1.11 std::string::append(std::string const&)
1.46 43.58 1.02 std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned int)
1.20 44.42 0.84 std::string::reserve(unsigned int)
And I get similar results on my intel machine. So I guess it has to be OS or compiler related.
Interesting! It's starting to sound like it might be operating system related. I'm using GCC 4.4.1 and an AMD X4 on Linux.
Does codeanalyst's line-by-line profiling information give you any ideas about optimizing this function? It's pretty cache unfriendly at the moment due to iterating over maps and lists, and it does a fair amount of data copying, but maybe there's something else that's an obvious reason for its slowness that I'm missing.
I'm thinking about how to rewrite the scenegraph system and I'm undecided about where to go with it. Right now it's a sort of scenegraph lite, with only two types of nodes. Should I make it a more conventional scenegraph system with nodes for texture changes, shader changes, etc? Should I try to make it only re-traverse branches of the tree that have changed since last time (but increasing the overhead of changing a node)? Should I add a hierarchical space partitioning system so culling is faster? If so, how to integrate that into the scenegraph and minimize render state changes while keeping the culling process as fast as possible?
The profiling counters are inconspicuous. GetCollapsedDrawList has a CPI of two. It is the amount of retired instructions that bothers me. It looks like it is doing a lot of work!?
Code:
CS:EIP Symbol + Offset 64-bit CPU clocks Ret inst
0x18b6e0 SCENENODE::GetCollapsedDrawList 10000 5542
0x122080 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::FrustumCull 3016 1799
2 functions, 215 instructions, Total: 20357 samples, 51.96% of samples in the module, 13.95% of total session samples
The bad code segment:
Code:
CPU clocks Ret inst
for (map <DRAWABLE_FILTER *, vector <SCENEDRAW> >::iterator mi = drawlist_output_map.begin(); mi != drawlist_output_map.end(); ++mi) 241 168
{ 0 0
for (list <DRAWABLE>::const_iterator i = drawlist.begin(); i != drawlist.end(); ++i) 2343 846
{ 0 0
if (mi->first->Matches(*i)) 4608 3814
{ 0 0
if (i->GetDrawEnable()) 1622 433
{ 0 0
mi->second.push_back(SCENEDRAW(*i,this_transform)); 30 8
} 0 0
} 0 0
} 0 0
} 0 0
The Matches() function:
Code:
bool DRAWABLE_FILTER::Matches(const DRAWABLE & drawable) const
{
return (drawable.filterspeedup.filtervalue & filtermask) == (filtervalue & filtermask);
}
I added a counter to see how often the sort in GetCollapsedDrawList is executed. It is run if drawlist or childlist of the scenenode are not empty:
245 times/frame in GUI
251 times/frame in car selection
253 times/frame in game
Edit:
It won't solve the problem with the sort loop, but maybe it's worth to move GUI into an own scenenode like the track. I don't understand why it is faster on linux. Will install a distro asap.
Edit2:
After rearranging the loop body:
Code:
CPU clocks Ret inst
for (map <DRAWABLE_FILTER *, vector <SCENEDRAW> >::iterator mi = drawlist_output_map.begin(); mi != drawlist_output_map.end(); ++mi) 252 240
{ 0 0
for (list <DRAWABLE>::const_iterator i = drawlist.begin(); i != drawlist.end(); ++i) 2879 2586
{ 0 0
if (i->GetDrawEnable()) 5389 1661
{ 0 0
if (mi->first->Matches(*i)) 68 75
{ 0 0
mi->second.push_back(SCENEDRAW(*i, this_transform)); 25 15
} 0 0
} 0 0
} 0 0
}
It looks like there is a serious penalty accessing the drawlist elemets.
Interesting. I knew lists were slower to iterate through than vectors, but that's the first time I've seen them actually be the bottleneck.
I did another profiling run with focus on physics(360/ring2007).
8.6% in physics in total for a single car
4.2% in cardynamics
2.0% in bezier patch raytest
1.6% in bullets raytest
We spend half of our physics time doing ray tests. I hope to be able to switch to cylinder collision shapes for wheels. It should be faster than rays and more realistic. But we still will be running the bezier patch tests. Will see if I can optimize it a bit.
NaN Wrote:I hope to be able to switch to cylinder collision shapes for wheels. It should be faster than rays and more realistic.
Cylinder collision faster than a raycast? That's surprising to me.
I don't have any numbers yet. But that is what I gathered from bullet forum. The ray test is not optimized. It is traversing the whole scene. Collision objects are exploiting frame coherence to accelerate collision detection. If the object is not too fast and not too big the collision detection should be very cheap. I am not sure if the frame coherence is considered in the narrow phase(object vs object) but they are caching/reusing collision contacts from previous frames.
Ah cool, very interesting. Their raycast is actually way faster than the one I was doing, so getting it faster still would be really cool.
We could probably use some frame coherence to speed up the bezier raycasts. Right now they are accelerated using an r-tree sort of data structure (aabb_space_partitioning.h) and we could probably start from the last frame leaf and traverse up the tree and then back down to the new leaf. Also, this data structure needs to be cache-optimized. I never dynamically add objects to it, and each level has only 2 branches, so it could easily be stored as a vector (like is commonly done with binary heaps). This would also speed up the object culling, which uses the same data structure.
The current OpenGL api trace(shaders disabled) looks like:
Code:
614,OpenGL_1.0,glCallList,984,
615,OpenGL_1.3,glActiveTexture,GL_TEXTURE0,
616,OpenGL_1.1,glBindTexture,GL_TEXTURE_2D 177,
617,OpenGL_1.0,glTexEnvi,GL_TEXTURE_ENV GL_TEXTURE_ENV_MODE 8448,
618,OpenGL_1.0,glCallList,983,
619,OpenGL_1.3,glActiveTexture,GL_TEXTURE0,
620,OpenGL_1.1,glBindTexture,GL_TEXTURE_2D 124,
621,OpenGL_1.0,glTexEnvi,GL_TEXTURE_ENV GL_TEXTURE_ENV_MODE 8448,
622,OpenGL_1.0,glCallList,977,
623,OpenGL_1.3,glActiveTexture,GL_TEXTURE0,
624,OpenGL_1.1,glBindTexture,GL_TEXTURE_2D 170,
625,OpenGL_1.0,glTexEnvi,GL_TEXTURE_ENV GL_TEXTURE_ENV_MODE 8448,
The whole trace:
http://pastebin.com/JW7Wr32z
8448 means GL_TEXTURE_ENV_MODE is GL_MODULATE. This is the texture color blending, right?