03-25-2010, 08:47 PM,
|
|
NaN
Posting Freak
|
Posts: 2,024
Threads: 120
Joined: Jan 2010
|
|
profiling
I have managed to compile vdrift on VC9. So I used the opportunity to test AMD CodeAnalyst on VDrift. ( I actually was interested in physics code profiling )
Code: VDrift, Windowed, 800x600, Shaders disabled, Car: 360, Track: Paul Ricard
Top 20 functions, 1215 instructions, 89.35% of samples in the module, 25.92% of total session samples
89.35% of samples in the module
70528 SCENENODE::GetCollapsedDrawList
19959 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::FrustumCull
10908 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::DrawList
5826 std::_Vb_const_iterator<unsigned int,int,std::vector<bool,std::allocator<bool> > >::operator+=
5093 SOUNDSOURCE::SampleAndAdvanceWithPitch16bit
5035 std::vector<bool,std::allocator<bool> >::operator[]
4333 std::_Vb_reference<unsigned int,int,std::vector<bool,std::allocator<bool> > >::_Getptr
4275 GRAPHICS_SDLGL::GLSTATEMANAGER::Set
4218 MATRIX4<float>::TransformVectorOut
3661 std::_Tree<std::_Tmap_traits<DRAWABLE_FILTER *, ...
2823 TEXTURE_GL::Activate
2227 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectTexturing
1894 SCENETRANSFORM::IsIdentityTransform
1829 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectFlags
1268 _ftol2_pentium4
1192 std::_Transform<SCENEDRAW const *, ...
1138 MATRIX4<float>::Equals
997 QUATERNION<double>::operator*
959 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectTransformStart
891 MATRIX4<float>::Multiply
A closer look at GetCollapsedDrawList:
Code: 36.36% of module samples
14122 for (list <DRAWABLE>::const_iterator i = drawlist.begin(); i != drawlist.end(); ++i)
0 {
32113 if (mi->first->Matches(*i))
0 {
15093 if (i->GetDrawEnable())
0 {
175 mi->second.push_back(SCENEDRAW(*i, this_transform));
0 }
0 }
0 }
So it is run on all drawables(also the static geometry) per frame?
|
|
03-25-2010, 08:50 PM,
|
|
NaN
Posting Freak
|
Posts: 2,024
Threads: 120
Joined: Jan 2010
|
|
50% of cpu time was spent in the gpu driver (169721 samples in atioglxx.dll).
The OpenGL API Trace looks pretty slick (Paul Ricard has 1309 objects + 133 textures): Code: 408 glCallList
21 glDrawElements
Here an excerpt:
Code: glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65
glCallList,21
glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65
glCallList,25
glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65
glCallList,29
glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65,
glCallList,36
glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65
glCallList,67
glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65
glCallList,83
glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65
glCallList,89
glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65
glCallList,96
glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65
glCallList,112
glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65
glCallList,133
glActiveTexture,GL_TEXTURE0
glBindTexture,GL_TEXTURE_2D 65
glCallList,137
So we are using glCallList for static geometry. One interesting thing is that we sort by the texture, but it is still bound per list.
I wonder if it's possible to set the texture once for a number of lists. Or even batch the meshes sharing the same texture?
|
|
03-25-2010, 10:20 PM,
|
|
joevenzon
Administrator
|
Posts: 2,679
Threads: 52
Joined: Jun 2005
|
|
Hum, I know the scenegraph is a little heavier than it needs to be, but I'm surprised it's seeing so many samples in GetCollapsedDrawList. The only scenegraph nodes we should be collapsing per frame are the non-static objects. The static track objects only get collapsed once: at the end of GAME::LoadTrack. You're profiling the release optimized version, right? I wonder if the VC9 compiler is missing some optimization opportunities. You might want to try compiling with GCC and using the GNU profiler, then comparing the results. If you do "scons profiling=1" it'll compile with release optimizations, debugging symbols, and profiling all enabled (the framerate is lower with profiling enabled but it's still very playable). On my computer the frustum culling seems to take the longest. Moving to a hierarchical culling system would speed that up. I've also been thinking about how to make the scenegraph a little lighter weight.
About the texture stuff, I've always heard that OpenGL render state calls are pretty lightweight. Also, I assumed that the driver would take care of doing a NOOP instead of switching from a texture to an identical texture, but I could add that to VDrift pretty easily if it'd improve framerate (I'm not so sure it would). I've been meaning to get VDrift under perfHUD, but decided not to worry about it just yet since I'm planning on rewriting the rendering engine anyway.
|
|
03-25-2010, 10:37 PM,
|
|
joevenzon
Administrator
|
Posts: 2,679
Threads: 52
Joined: Jun 2005
|
|
Here's what the profile looks like on my machine after 30 seconds in the menus and 5 minutes on Paul Ricard:
Code: Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
23.33 6.63 6.63 156017736 0.00 0.00 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::FrustumCull(SCENEDRAW&) const
9.41 9.31 2.67 90222322 0.00 0.00 QUATERNION<double>::operator*(QUATERNION<double> const&) const
4.85 10.69 1.38 58049499 0.00 0.00 AABB<float>::Intersect(AABB<float>::RAY const&) const
4.15 11.87 1.18 3331646 0.00 0.00 btQuantizedBvh::walkStacklessQuantizedTreeAgainstRay(btNodeOverlapCallback*, btVector3 const&, btVector3 const&, btVector3 const&, btVector3 const&, int, int) const
4.08 13.03 1.16 14907 0.00 0.00 SCENENODE::GetCollapsedDrawList(std::map<DRAWABLE_FILTER*, std::vector<SCENEDRAW, std::allocator<SCENEDRAW> >, std::less<DRAWABLE_FILTER*>, std::allocator<std::pair<DRAWABLE_FILTER* const, std::vector<SCENEDRAW, std::allocator<SCENEDRAW> > > > >&, MATRIX4<float> const&) const
3.50 14.03 0.99 6517888 0.00 0.00 CARDYNAMICS::GetWheelPositionAtDisplacement(WHEEL_POSITION, double) const
3.20 14.94 0.91 223250134 0.00 0.00 DRAWABLE_FILTER::Matches(DRAWABLE const&) const
3.13 15.82 0.89 94649152 0.00 0.00 GRAPHICS_SDLGL::GLSTATEMANAGER::Set(int, bool)
2.99 16.68 0.85 432324 0.00 0.00 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::DrawList(GRAPHICS_SDLGL::GLSTATEMANAGER&)
2.76 17.46 0.79 11867239 0.00 0.00 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectTransformStart(SCENEDRAW&, GRAPHICS_SDLGL::GLSTATEMANAGER&)
2.18 18.08 0.62 13613592 0.00 0.00 TEXTURE_GL::Activate() const
1.90 18.62 0.54 1598723 0.00 0.00 CARDYNAMICS::GetWheelSteeringAndSuspensionOrientation(WHEEL_POSITION) const
1.55 19.06 0.44 11866739 0.00 0.00 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectFlags(SCENEDRAW&, GRAPHICS_SDLGL::GLSTATEMANAGER&)
1.44 19.47 0.41 13613938 0.00 0.00 std::basic_string<char, std::char_traits<char>, std::allocator<char> > std::operator+<char, std::char_traits<char>, std::allocator<char> >(char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
1.37 19.86 0.39 35426224 0.00 0.00 void QUATERNION<double>::RotateVector<MATHVECTOR<double, 3u> >(MATHVECTOR<double, 3u>&) const
1.34 20.24 0.38 1229774 0.00 0.00 CARDYNAMICS::ApplyWheelForces(double, double, int, MATHVECTOR<double, 3u> const&, MATHVECTOR<double, 3u>&, MATHVECTOR<double, 3u>&)
1.23 20.59 0.35 19758101 0.00 0.00 BEZIER::Bernstein(float, MATHVECTOR<float, 3u>*) const
1.23 20.94 0.35 1844588 0.00 0.00 ROTATIONALFRAME<double>::RecalculateSecondary()
... etc
The hierarchical view is more interesting but it's too long for here so I put the top 100 on pastebin:
http://pastebin.com/r5RfggmV
|
|
03-26-2010, 03:33 AM,
|
|
NaN
Posting Freak
|
Posts: 2,024
Threads: 120
Joined: Jan 2010
|
|
Oh, I forgot to mention. The data is from the bullet branch. But it shouldn't matter as I didn't modify the scenegraph code.
I am testing it on a Windows 7 x64 machine(Athlon 64 X2). The VC9 build is compiled with /Ox. It is in fact about 20-30% slower than the GCC4.4.3 -O3 build.
Here the GCC build profile:
Code: time seconds seconds calls ms/call ms/call name
26.43 23.14 23.14 SCENENODE::GetCollapsedDrawList(std::map<DRAWABLE_FILTER*, std::vector<SCENEDRAW, std::allocator<SCENEDRAW> >, std::less<DRAWABLE_FILTER*>, std::allocator<std::pair<DRAWABLE_FILTER* const, std::vector<SCENEDRAW, std::allocator<SCENEDRAW> > > > >&, MATRIX4<float> const&) const
18.22 39.09 15.95 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::FrustumCull(SCENEDRAW&) const
3.16 41.86 2.77 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::DrawList(GRAPHICS_SDLGL::GLSTATEMANAGER&)
2.86 44.36 2.50 GRAPHICS_SDLGL::GLSTATEMANAGER::Set(int, bool)
2.79 46.80 2.44 __gnu_cxx::__exchange_and_add(int volatile*, int)
2.78 49.23 2.43 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectTransformStart(SCENEDRAW&, GRAPHICS_SDLGL::GLSTATEMANAGER&)
2.46 51.38 2.15 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectFlags(SCENEDRAW&, GRAPHICS_SDLGL::GLSTATEMANAGER&)
2.35 53.44 2.06 std::_Rb_tree_increment(std::_Rb_tree_node_base*)
2.27 55.43 1.99 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectTexturing(SCENEDRAW&, GRAPHICS_SDLGL::GLSTATEMANAGER&)
1.90 57.09 1.66 std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned int)
1.85 58.71 1.62 std::string::append(std::string const&)
1.60 60.11 1.40 TEXTURE_GL::Activate() const
1.31 61.26 1.15 std::string::append(char const*, unsigned int)
1.27 62.37 1.11 std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)
1.23 63.45 1.08 std::basic_string<char, std::char_traits<char>, std::allocator<char> > std::operator+<char, std::char_traits<char>, std::allocator<char> >(char const*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
1.20 64.50 1.05 __gnu_cxx::__atomic_add(int volatile*, int)
1.07 65.44 0.94 std::string::reserve(unsigned int)
Funny huh? Will profile vdrift trunk asap.
Quote:About the texture stuff ...
Yeah, I think it's just me not trusting the OpenGL drivers on Windows. :lol:
|
|
03-28-2010, 06:18 AM,
|
|
NaN
Posting Freak
|
Posts: 2,024
Threads: 120
Joined: Jan 2010
|
|
Here the vdrift trunk profile(MinGW GCC4.4.3):
Code: % cumulative self self total
time seconds seconds calls s/call s/call name
22.75 15.90 15.90 75346 0.00 0.00 SCENENODE::GetCollapsedDrawList(std::map<DRAWABLE_FILTER*, std::vector<SCENEDRAW, std::allocator<SCENEDRAW> >, std::less<DRAWABLE_FILTER*>, std::allocator<std::pair<DRAWABLE_FILTER* const, std::vector<SCENEDRAW, std::allocator<SCENEDRAW> > > > >&, MATRIX4<float> const&) const
14.74 26.20 10.30 95579158 0.00 0.00 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::FrustumCull(SCENEDRAW&) const
3.56 28.69 2.49 40361028 0.00 0.00 AABB<float>::Intersect(AABB<float>::RAY const&) const
2.52 30.45 1.76 527415 0.00 0.00 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::DrawList(GRAPHICS_SDLGL::GLSTATEMANAGER&)
2.48 32.18 1.73 62275440 0.00 0.00 QUATERNION<double>::operator*(QUATERNION<double> const&) const
2.40 33.86 1.68 122087853 0.00 0.00 GRAPHICS_SDLGL::GLSTATEMANAGER::Set(int, bool)
1.99 35.25 1.39 17632958 0.00 0.00 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectFlags(SCENEDRAW&, GRAPHICS_SDLGL::GLSTATEMANAGER&)
1.90 36.58 1.33 __gnu_cxx::__exchange_and_add(int volatile*, int)
1.85 37.87 1.29 17633034 0.00 0.00 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectTexturing(SCENEDRAW&, GRAPHICS_SDLGL::GLSTATEMANAGER&)
1.77 39.11 1.24 17632896 0.00 0.00 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::SelectTransformStart(SCENEDRAW&, GRAPHICS_SDLGL::GLSTATEMANAGER&)
1.76 40.34 1.23 std::_Rb_tree_increment(std::_Rb_tree_node_base*)
1.59 41.45 1.11 T.382
1.59 42.56 1.11 std::string::append(std::string const&)
1.46 43.58 1.02 std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned int)
1.20 44.42 0.84 std::string::reserve(unsigned int)
And I get similar results on my intel machine. So I guess it has to be OS or compiler related.
|
|
03-28-2010, 07:28 PM,
|
|
NaN
Posting Freak
|
Posts: 2,024
Threads: 120
Joined: Jan 2010
|
|
The profiling counters are inconspicuous. GetCollapsedDrawList has a CPI of two. It is the amount of retired instructions that bothers me. It looks like it is doing a lot of work!?
Code: CS:EIP Symbol + Offset 64-bit CPU clocks Ret inst
0x18b6e0 SCENENODE::GetCollapsedDrawList 10000 5542
0x122080 GRAPHICS_SDLGL::RENDER_INPUT_SCENE::FrustumCull 3016 1799
2 functions, 215 instructions, Total: 20357 samples, 51.96% of samples in the module, 13.95% of total session samples
The bad code segment:
Code: CPU clocks Ret inst
for (map <DRAWABLE_FILTER *, vector <SCENEDRAW> >::iterator mi = drawlist_output_map.begin(); mi != drawlist_output_map.end(); ++mi) 241 168
{ 0 0
for (list <DRAWABLE>::const_iterator i = drawlist.begin(); i != drawlist.end(); ++i) 2343 846
{ 0 0
if (mi->first->Matches(*i)) 4608 3814
{ 0 0
if (i->GetDrawEnable()) 1622 433
{ 0 0
mi->second.push_back(SCENEDRAW(*i,this_transform)); 30 8
} 0 0
} 0 0
} 0 0
} 0 0
The Matches() function:
Code: bool DRAWABLE_FILTER::Matches(const DRAWABLE & drawable) const
{
return (drawable.filterspeedup.filtervalue & filtermask) == (filtervalue & filtermask);
}
I added a counter to see how often the sort in GetCollapsedDrawList is executed. It is run if drawlist or childlist of the scenenode are not empty:
245 times/frame in GUI
251 times/frame in car selection
253 times/frame in game
Edit:
It won't solve the problem with the sort loop, but maybe it's worth to move GUI into an own scenenode like the track. I don't understand why it is faster on linux. Will install a distro asap.
Edit2:
After rearranging the loop body:
Code: CPU clocks Ret inst
for (map <DRAWABLE_FILTER *, vector <SCENEDRAW> >::iterator mi = drawlist_output_map.begin(); mi != drawlist_output_map.end(); ++mi) 252 240
{ 0 0
for (list <DRAWABLE>::const_iterator i = drawlist.begin(); i != drawlist.end(); ++i) 2879 2586
{ 0 0
if (i->GetDrawEnable()) 5389 1661
{ 0 0
if (mi->first->Matches(*i)) 68 75
{ 0 0
mi->second.push_back(SCENEDRAW(*i, this_transform)); 25 15
} 0 0
} 0 0
} 0 0
}
It looks like there is a serious penalty accessing the drawlist elemets.
|
|
07-19-2010, 04:25 AM,
|
|
NaN
Posting Freak
|
Posts: 2,024
Threads: 120
Joined: Jan 2010
|
|
The current OpenGL api trace(shaders disabled) looks like:
Code: 614,OpenGL_1.0,glCallList,984,
615,OpenGL_1.3,glActiveTexture,GL_TEXTURE0,
616,OpenGL_1.1,glBindTexture,GL_TEXTURE_2D 177,
617,OpenGL_1.0,glTexEnvi,GL_TEXTURE_ENV GL_TEXTURE_ENV_MODE 8448,
618,OpenGL_1.0,glCallList,983,
619,OpenGL_1.3,glActiveTexture,GL_TEXTURE0,
620,OpenGL_1.1,glBindTexture,GL_TEXTURE_2D 124,
621,OpenGL_1.0,glTexEnvi,GL_TEXTURE_ENV GL_TEXTURE_ENV_MODE 8448,
622,OpenGL_1.0,glCallList,977,
623,OpenGL_1.3,glActiveTexture,GL_TEXTURE0,
624,OpenGL_1.1,glBindTexture,GL_TEXTURE_2D 170,
625,OpenGL_1.0,glTexEnvi,GL_TEXTURE_ENV GL_TEXTURE_ENV_MODE 8448,
The whole trace: http://pastebin.com/JW7Wr32z
8448 means GL_TEXTURE_ENV_MODE is GL_MODULATE. This is the texture color blending, right?
|
|
|