@Redrobes: True, there is a huge number of variables that come into play. We probably shouldn't start talking about the effects of multi-core cache misses or about dealing with the standard thread-based processing model. And we really shouldn't get into what you can do with floats on a graphics card, with its insane hardware optimization for exactly these things.

In the end, the data has to be fetched either way, so it becomes a question of whether the time saved by processing the data in blocks outweighs the speed of plain in-processor computation.
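For anyone who hasn't seen "blocked" processing before, here is a rough sketch of the idea, assuming a matrix multiply as the workload (my choice of example, not something from the thread). The loops walk over small square tiles so that the rows being reused stay in cache instead of being re-fetched from main memory. The `block` size of 32 is an arbitrary assumption; the best value depends on the cache sizes of the actual machine. In pure Python the interpreter overhead swamps any cache effect, so treat this as an illustration of the loop structure, not a benchmark. The payoff only shows up in a compiled language.

```python
def blocked_matmul(a, b, block=32):
    """Multiply an n x m matrix `a` by an m x p matrix `b` tile by tile."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    # Outer loops pick a tile; inner loops do all the work for that tile
    # before moving on, so the same rows of a, b and c are reused while hot.
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                for i in range(ii, min(ii + block, n)):
                    row_a = a[i]
                    row_c = c[i]
                    for k in range(kk, min(kk + block, m)):
                        aik = row_a[k]
                        row_b = b[k]
                        for j in range(jj, min(jj + block, p)):
                            row_c[j] += aik * row_b[j]
    return c


def naive_matmul(a, b):
    """Straightforward triple loop, for comparison."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]
```

Both functions give the same result; the blocked version just changes the order in which memory is touched. Whether that reordering pays for its extra loop bookkeeping is exactly the trade-off described above.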