Tensor Cores
Specialized GPU cores for matrix multiply-accumulate operations
Tensor Cores are specialized processing units on NVIDIA GPUs (Volta and later) that accelerate matrix multiply-accumulate operations — the core computation in deep learning.
Example
import torch

# Tensor Cores are used automatically with mixed precision
with torch.autocast(device_type="cuda", dtype=torch.float16):
    x = torch.randn(512, 512, device="cuda")
    w = torch.randn(512, 512, device="cuda")
    y = x @ w  # matrix multiply runs in FP16 on Tensor Cores
Key Facts
- Each Tensor Core performs a small matrix multiply-accumulate per clock cycle (4x4 matrices on Volta; later architectures support larger tiles)
- Require reduced-precision data types such as FP16, BF16, TF32, and INT8 (supported types vary by GPU architecture; see the TF32 sketch after this list)
- Dramatically accelerate training and inference when used with mixed precision
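TF32 lets ordinary FP32 matrix multiplies run on Tensor Cores on Ampere and newer GPUs. The snippet below is a minimal sketch of enabling this in PyTorch via torch.set_float32_matmul_precision, an API available in recent PyTorch releases; the tensor shapes are illustrative.

import torch

# Allow float32 matmuls to use TF32 Tensor Cores ("high" enables TF32,
# "highest" keeps full FP32 precision; small accuracy trade-off for speed)
torch.set_float32_matmul_precision("high")

a = torch.randn(1024, 1024, device="cuda")  # plain float32 tensors
b = torch.randn(1024, 1024, device="cuda")
c = a @ b  # on Ampere or newer GPUs, this matmul can run on TF32 Tensor Cores

The older flag torch.backends.cuda.matmul.allow_tf32 = True has the same effect for matrix multiplies.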