Over at the Parallel for All blog, Mark Harris writes that shared memory is a powerful feature for writing well-optimized CUDA code. Access to shared memory is much faster than global memory access ...
Over at the NVIDIA Developer Zone, Mark Harris looks at how to access device memory efficiently, in particular global memory, from within kernels. Global memory access on the device shares performance ...