c++ - work - unified cache
How many bytes does a Xeon bring into the cache per memory access? (4)
A cache line for any current Xeon processor is 64 bytes. One other thing that you might want to think about is the TLB. If you are really doing random accesses across 10GB of memory then you are likely to have a lot of TLB misses which can potentially be as costly as cache misses. You can get work around with with large pages, but it's something to keep in mind.
I am working on a system, written in C++, running on a Xeon on Linux, that needs to run as fast as possible. There is a large data structure (basically an array of structs) held in RAM, over 10 GB, and elements of it need to be accessed periodically. I want to revise the data structure to work with the system's caching mechanism as much as possible.
Currently, accesses are done mostly randomly across the structure, and each time 1-4 32-bit ints are read. It is a long time before another read occurs in the same place, so there is no benefit from the cache.
Now I know that when you read a byte from a random location in RAM, more than just that byte is brought into the cache. My question is how many bytes are brought in? Is it 16, 32, 64, 4096? Is this called a cache line?
I am looking to redesign the data structure to minimize random RAM accesses and work with the cache instead of against it. Knowing how many bytes are pulled into the cache on a random access will inform the design choices I make.
Update (October 2014): Shortly after I posed the question above the project was put on hold. It has since resumed and based on suggestions in answers below, I performed some experiments around RAM access, because it seemed likely that TLB thrash was happening. I revised the program to run with huge pages (2MB instead of the standard 4KB), and observed a small speedup, about 2.5%. I found great information about setting up for huge pages here and here.
Today’s CPUs fetch memory in chunks of (typically) 64 bytes, called cache lines. When you read a particular memory location, the entire cache line is fetched from the main memory into the cache.
You might want to head over to http://agner.org/optimize/ and grab the optimization PDFs available there - there's a lot of good (low-level) information in there. Pretty focused on assembly language level, but there's lessons to be learned for C/C++ programmers as well.
Volume 3, "The microarchitecture of Intel, AMD and VIA CPUs" should be of interest :-)