Shared Memory
Fast, on-chip memory shared among threads in a block
Shared memory is a small, fast, on-chip memory space that is shared among all threads within a single block. It is often used as a programmer-managed cache.
Example
__global__ void reduce(float *input, float *output, int n) {
    __shared__ float tile[256];                 // declare shared memory (one slot per thread)
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;
    tile[tid] = (idx < n) ? input[idx] : 0.0f;  // load, guarding against out-of-range reads
    __syncthreads();                            // wait for all threads to load
    // Parallel reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }
    if (tid == 0) output[blockIdx.x] = tile[0]; // one partial sum per block
}
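For context, a minimal sketch of how this kernel might be launched from the host is shown below. The input length, the block size of 256 (chosen to match the static tile[256] allocation), and the omission of error checking are illustrative assumptions, not part of the example above.

int n = 1 << 20;                              // example input length (assumed)
int threads = 256;                            // must match tile[256] in the kernel
int blocks = (n + threads - 1) / threads;     // one partial sum per block

float *d_input, *d_output;
cudaMalloc(&d_input, n * sizeof(float));
cudaMalloc(&d_output, blocks * sizeof(float));
// ... copy input data into d_input with cudaMemcpy ...

reduce<<<blocks, threads>>>(d_input, d_output, n);
// d_output now holds one partial sum per block; a second reduction pass
// (or a host-side loop over the blocks array) produces the final total.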
Key Characteristics
- Size: 48 KB per block by default; configurable up to roughly 100-164 KB per block on recent architectures
- Latency: ~20-30 cycles (vs. ~400+ for global memory)
- Scope: Visible only to threads in the same block
- Synchronization: requires an explicit __syncthreads() barrier to coordinate reads and writes between threads
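As an illustration of the configurable size, shared memory can also be sized at launch time with an extern __shared__ declaration instead of a fixed-size array. The kernel name, scaling operation, and launch values below are assumed for the sketch; the third <<<>>> argument and cudaFuncSetAttribute are standard CUDA runtime mechanisms.

// Dynamic shared memory: the array is sized by the third <<<>>> launch argument.
__global__ void scaleDyn(float *data, float factor) {
    extern __shared__ float tile[];             // size chosen at launch, not at compile time
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[idx] * factor;     // stage a scaled copy in shared memory
    __syncthreads();
    data[idx] = tile[threadIdx.x];              // write the staged value back
}

// Host side (illustrative; assumes d_data and blocks are set up as in the
// launch sketch above). Sizes above the 48 KB default additionally require an
// opt-in: cudaFuncSetAttribute(scaleDyn,
//     cudaFuncAttributeMaxDynamicSharedMemorySize, requestedBytes);
int threads = 256;
scaleDyn<<<blocks, threads, threads * sizeof(float)>>>(d_data, 2.0f);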