---
title: "Shared Memory"
canonical: "https://www.thundercompute.com/glossary/memory/shared-memory"
description: "Fast, on-chip memory shared among threads in a block"
sidebarTitle: "Shared Memory"
icon: "share-nodes"
iconType: "solid"
---

**Shared memory** is a small, fast, on-chip memory space shared by all threads within a single thread block. Because it is far faster than global memory, it is often used as a programmer-managed cache for data that threads in a block reuse or exchange.

## Example

```cpp
__global__ void reduce(const float *input, float *output, int n) {
    __shared__ float tile[256];  // one element per thread; assumes blockDim.x == 256

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? input[i] : 0.0f;  // guard against reading past the end
    __syncthreads();  // wait for all threads to finish loading

    // Tree-based parallel reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();  // make this round's sums visible before the next
    }

    // Thread 0 writes the block's partial sum to global memory
    if (tid == 0) output[blockIdx.x] = tile[0];
}
```
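A host-side launch for this kernel might look like the sketch below (the input size and device-pointer names are illustrative). Each block writes one partial sum, so a second reduction pass, or a short host-side loop, is still needed to combine them:

```cpp
int n = 1 << 20;                        // hypothetical input size
int threads = 256;                      // must match the tile[] size in the kernel
int blocks = (n + threads - 1) / threads;

// d_input and d_partial are assumed device allocations (cudaMalloc'd elsewhere)
reduce<<<blocks, threads>>>(d_input, d_partial, n);
// d_partial now holds one partial sum per block; reduce again to get the total
```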

## Key Characteristics

- **Size**: 48 KB per block by default; recent GPUs allow opt-in allocations up to roughly 100-228 KB (capacity is per SM and typically shared with the L1 cache)
- **Latency**: ~20-30 cycles (vs. ~400+ for global memory)
- **Scope**: Visible only to threads in the same block
- Requires explicit `__syncthreads()` barriers so threads see each other's writes
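
When the needed size isn't known at compile time, shared memory can be declared `extern` and sized per launch via the third launch-configuration argument. A minimal sketch (the kernel name and sizes here are illustrative, not from the example above):

```cpp
__global__ void scale(float *data) {
    extern __shared__ float buf[];  // size supplied at launch, not declared here
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = data[i] * 2.0f;
    __syncthreads();
    data[i] = buf[threadIdx.x];
}

// Launch: the third <<<>>> argument is shared-memory bytes per block
// scale<<<blocks, threads, threads * sizeof(float)>>>(d_data);
```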

## See Also

- [Global Memory](/memory/global-memory)
- [Thread, Block, Grid](/cuda-programming/thread-block-grid)
