The main bottleneck on modern CPUs (shared-memory architecture) is often memory access, because main memory is much slower than the processor itself. To mitigate this, modern CPUs use a hierarchy of caches (L1, L2, L3). If your code minimizes cache misses, it can run much faster, so it helps to understand how the cache mechanism works. The simplest approach is to store your data contiguously and access it sequentially: each cache line that is fetched then holds several of the elements you are about to use.
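As a rough illustration of the contiguous-allocation point, here is a minimal C++ sketch that sums the same values from a `std::vector` (contiguous storage) and from a `std::list` (one separately allocated node per element). The helper names `sum_contiguous`, `sum_linked`, and `time_ms` are made up for this example, and the timing helper is only a crude measurement, not a proper benchmark.

```cpp
#include <chrono>
#include <list>
#include <numeric>
#include <vector>

// Sum values stored in one contiguous block. Neighbouring elements
// share cache lines, so a sequential scan is cache friendly.
long long sum_contiguous(const std::vector<int>& values) {
    return std::accumulate(values.begin(), values.end(), 0LL);
}

// Sum the same values stored as separately allocated list nodes.
// Each step follows a pointer to a node that may live anywhere in memory.
long long sum_linked(const std::list<int>& values) {
    return std::accumulate(values.begin(), values.end(), 0LL);
}

// Hypothetical helper: run f once, store its return value in result,
// and return the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F&& f, long long& result) {
    auto start = std::chrono::steady_clock::now();
    result = f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

To try it, fill a `std::vector<int>` with a few million values, copy it into a `std::list<int>`, and pass each sum to `time_ms` through a lambda. On typical hardware the contiguous version tends to be faster, but the size of the gap depends on the allocator, the data size, and the CPU, so measure on your own machine with optimizations enabled.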
You can find a good example of this in the following article.