Shared Memory

Fast, on-chip memory shared among threads in a block

Shared memory is a small, fast, on-chip memory space that is shared among all threads within a single block. It is often used as a programmer-managed cache.

Example

__global__ void reduce(float *input, float *output, int n) {
    __shared__ float tile[256];  // one element per thread; assumes blockDim.x <= 256

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? input[i] : 0.0f;  // guard out-of-bounds reads when n isn't a multiple of blockDim.x
    __syncthreads();  // wait for all threads to finish loading

    // Tree-based parallel reduction in shared memory
    // (assumes blockDim.x is a power of two)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();  // ensure all adds at this stride are done
    }

    if (tid == 0) output[blockIdx.x] = tile[0];  // one partial sum per block
}
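Each block writes one partial sum, so a second pass (or a host-side loop) is needed to combine them. A minimal host-side launch sketch, assuming the kernel above; the buffer names are hypothetical and error checking is omitted:

```cuda
int n = 1 << 20;
int threads = 256;                        // must match the tile[256] declaration
int blocks = (n + threads - 1) / threads; // round up so every element is covered

float *d_in, *d_partial;
cudaMalloc(&d_in, n * sizeof(float));
cudaMalloc(&d_partial, blocks * sizeof(float));
// ... copy input data into d_in ...

reduce<<<blocks, threads>>>(d_in, d_partial, n);

// d_partial now holds one partial sum per block; run reduce again on it,
// or copy it back and finish the sum on the CPU.
```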

Key Characteristics

  • Size: 48 KB per block by default; recent architectures allow roughly 100-227 KB per block (e.g. A100, H100) via an opt-in kernel attribute
  • Latency: roughly 20-30 cycles, versus 400+ cycles for global memory
  • Scope: visible only to threads in the same block; contents are discarded when the block finishes
  • Synchronization: threads must call __syncthreads() before reading values written by other threads in the block
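A statically declared array like tile[256] fixes the size at compile time. To choose the size at launch instead, declare the array extern and pass the byte count as the third launch parameter. A sketch of the same reduction with dynamic shared memory (kernel name is illustrative):

```cuda
__global__ void reduceDyn(float *input, float *output, int n) {
    extern __shared__ float tile[];  // size supplied at launch time

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? input[i] : 0.0f;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }

    if (tid == 0) output[blockIdx.x] = tile[0];
}

// Launch with the shared-memory size in bytes as the third parameter:
// reduceDyn<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_partial, n);
```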

See Also