Introduction to the Three Key Processing Cores Inside NVIDIA GPUs

In the evolution of modern NVIDIA GPU architecture, a single general-purpose computing core can no longer meet the demands of diverse workloads. To address this challenge, NVIDIA introduced a highly heterogeneous hardware design within its Streaming Multiprocessors (SM). GPUs are no longer composed solely of general-purpose arithmetic logic units, but instead integrate three types of specialized cores: CUDA cores handle general-purpose vector parallel computing, Tensor cores focus on deep learning matrix acceleration, and RT cores are dedicated to hardware acceleration for ray tracing. This triadic architecture signifies the evolution of GPUs from traditional ‘general-purpose parallel processors’ to accelerated computing platforms supporting the fusion of AI and graphics computing.

The Architectural Foundation of GPU Processing Cores

In the NVIDIA GPU architecture, the Streaming Multiprocessor (SM) is the basic processing unit of the GPU, responsible for thread management, instruction scheduling, and unified coordination of the underlying execution resources.

Unlike traditional "core stacking" designs, NVIDIA does not scatter different types of computing resources across separate areas of the chip. Instead, it integrates CUDA cores, Tensor cores, and RT cores in a fixed ratio inside each SM. The design logic of the SM directly determines how these processing cores are scheduled and how they collaborate.

This SM-centric organization allows different processing cores to share scheduling logic, cache levels, and execution context, enabling efficient collaboration within the same workload, rather than operating as isolated acceleration modules.

Three Types of GPU Processing Cores

CUDA Core: The Processing Core of General-purpose Parallel Computing

As general-purpose parallel computing cores, CUDA cores support single-precision (FP32), double-precision (FP64), and integer operations, making them suitable for a wide range of parallel computing tasks, such as graphics rendering, general AI computing, and scientific computing requiring high precision. In AI training and inference, they primarily handle non-matrix operations.

From an architectural perspective, CUDA cores are the core support for the GPU’s parallel execution model. Each SM contains a large number of CUDA cores for executing instructions within warps. This design enables the GPU to maintain high throughput in highly parallel workloads.
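The warp execution model can be pictured with a tiny pure-Python sketch (purely didactic; `warp_execute` is our stand-in, not a CUDA API): each of the 32 lanes of a warp applies the same instruction to its own data.

```python
WARP_SIZE = 32  # threads per warp on current NVIDIA GPUs

def warp_execute(instruction, lanes):
    """Apply one instruction across every lane of a warp (SIMT:
    one instruction stream, many threads). `instruction` is a
    Python callable standing in for a hardware operation."""
    assert len(lanes) == WARP_SIZE
    return [instruction(x) for x in lanes]

# Each lane holds its own data; the warp runs the same op on all of it.
lanes = list(range(WARP_SIZE))
result = warp_execute(lambda x: x * 2.0, lanes)
```

A real SM hides latency by keeping many such warps resident and switching between them, which is why high occupancy matters for throughput.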

Even in AI workloads, CUDA cores remain irreplaceable. They handle not only non-matrix computation but also control logic, data preparation, and the scheduling and coordination of the dedicated acceleration units. For example, the Hopper architecture released in 2022 contains up to 144 SMs in the full GH100 chip, each integrating 128 FP32 CUDA cores and 4 fourth-generation Tensor cores, with each SM able to keep up to 2,048 threads resident, roughly 295,000 concurrent threads across the chip. For a more comprehensive understanding of the core architecture of CUDA cores and their overall role and application characteristics in NVIDIA GPUs, please refer to our in-depth guide to NVIDIA CUDA cores.

Tensor Core: Dedicated Processing Cores for AI

As the scale and complexity of deep learning models continue to increase, relying solely on general-purpose computing units to perform matrix operations is no longer sufficient to achieve ideal performance and energy efficiency. To address this, NVIDIA introduced Tensor cores for the first time in its Volta architecture, serving as specialized cores specifically designed for AI workloads.

The core design goal of Tensor cores is to efficiently perform matrix multiply-accumulate (MMA) operations, that is, D = A × B + C, aiming to accelerate the extremely computation-intensive matrix multiplications in deep learning.

These operations are the core computational patterns in neural network training and inference. By implementing high-level optimizations for matrix operations at the hardware level, Tensor cores can complete a large number of multiply-accumulate operations within a single clock cycle, with throughput far exceeding that of general-purpose CUDA cores. In AI model training and inference, matrix computations are the most demanding; in scenarios closely matching matrix computation patterns, Tensor cores can achieve orders of magnitude performance improvements compared to using only CUDA cores.
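The MMA pattern described above can be sketched in plain Python (a didactic model of the arithmetic only; real Tensor cores execute this on small fixed-size tiles in a single hardware instruction):

```python
def mma(A, B, C):
    """Matrix multiply-accumulate, D = A @ B + C, the operation a
    Tensor core performs on small matrix tiles in hardware."""
    n, k = len(A), len(B)
    m = len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) + C[i][j]
             for j in range(m)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 0], [0, 1]]
D = mma(A, B, C)  # [[20, 22], [43, 51]]
```

On hardware, an entire tile-sized version of this loop nest completes with far higher throughput than issuing the individual multiplies and adds through CUDA cores.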

Furthermore, Tensor cores support various mixed-precision computation formats (such as FP16, BF16, TF32, and INT8), significantly improving computational efficiency and energy efficiency while keeping model accuracy under control. This makes them an indispensable processing core in modern AI training and inference. For more details on the design background, operating principles, and development of Tensor cores across NVIDIA GPU architectures, please refer to the systematic introduction to Tensor cores.
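The mixed-precision pattern can be illustrated in a few lines of Python: operands are rounded to FP16 (via `struct`'s binary16 `'e'` format) while their products are accumulated at higher precision. This is only a sketch; `to_fp16` is our helper, and a Python double stands in for the hardware's FP32 accumulator.

```python
import struct

def to_fp16(x):
    """Round a float to IEEE half precision (binary16), as Tensor
    cores do for FP16 operands."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# Operands stored at low precision, products accumulated at higher
# precision -- the mixed-precision scheme described above.
a = [to_fp16(0.1 * i) for i in range(8)]
b = [to_fp16(0.2 * i) for i in range(8)]
acc = 0.0  # higher-precision accumulator (FP32 on the hardware)
for x, y in zip(a, b):
    acc += x * y
```

The accumulated result stays close to the exact value (about 2.8 here) because the rounding error lives only in the low-precision operands, not in the running sum.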

RT Core: Dedicated Acceleration Core for Ray Tracing

Beyond AI computing, real-time ray tracing places entirely new demands on GPU computing power. To address this, NVIDIA introduced RT cores in its Turing architecture as dedicated hardware acceleration cores for ray tracing computations.

RT cores primarily accelerate critical steps in ray tracing, such as BVH traversal and ray-geometry interaction testing. Performing these computations entirely by general-purpose CUDA cores would incur significant performance overhead. By offloading these tasks to RT cores, the GPU can achieve higher-quality graphics rendering while maintaining real-time performance.
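One of the operations named above, testing a ray against a BVH node's axis-aligned bounding box, can be sketched on the CPU with the standard slab method (an illustrative model, not the RT core's hardware implementation; the function name is ours):

```python
def ray_intersects_aabb(origin, inv_dir, box_min, box_max):
    """Slab test: does a ray hit an axis-aligned bounding box?
    BVH traversal repeats this test at every node it visits.
    `inv_dir` is the precomputed component-wise inverse of the
    ray direction, avoiding a division at every node."""
    t_near, t_far = 0.0, float('inf')
    for o, inv, lo, hi in zip(origin, inv_dir, box_min, box_max):
        t0 = (lo - o) * inv
        t1 = (hi - o) * inv
        if t0 > t1:
            t0, t1 = t1, t0
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
        if t_near > t_far:
            return False  # the slabs do not overlap: miss
    return True

# Ray from the origin along +x, unit box centred at (5, 0, 0).
hit = ray_intersects_aabb((0, 0, 0), (1.0, float('inf'), float('inf')),
                          (4.5, -0.5, -0.5), (5.5, 0.5, 0.5))
```

A single frame can require this test billions of times, which is why dedicating fixed-function hardware to it pays off.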

In the overall architecture, RT cores work in conjunction with CUDA cores and Tensor cores, enabling the GPU to handle graphics rendering, AI inference, and general-purpose computing simultaneously.

RT cores are primarily found in GPUs designed for graphics and visualization scenarios (such as the RTX series and some data center/workstation GPUs). Data center GPUs focused on computing and AI (such as the A100 and H100) typically do not integrate RT cores, concentrating chip resources on Tensor cores and FP64 computing capabilities.

Key Differences in GPU Processing Cores

Both CUDA cores and Tensor cores participate in the execution paths of general-purpose computing and AI workloads, with a clear division of labor and a collaborative relationship. In contrast, RT cores are dedicated acceleration cores for ray tracing, primarily serving graphics rendering scenarios; their computational model and usage differ significantly from the other two, so they are generally left out of the comparison.

CUDA cores and Tensor cores are designed for different types of workloads. The following are their main differences:

Purpose and Workload Optimization

CUDA cores are general-purpose parallel processing cores designed to cover a wide range of applications, from graphics rendering to general-purpose computing (GPGPU). They excel in highly parallelizable tasks, such as rasterization, physics simulation, and various traditional computationally intensive workloads, emphasizing flexibility and versatility.

Tensor cores, on the other hand, are dedicated processing cores optimized for specific scenarios. Their core goal is to accelerate matrix multiplication, a fundamental operation in deep learning and artificial intelligence computing. By supporting tensor operations directly in hardware, Tensor cores achieve far higher throughput than CUDA cores on these operations, substantially improving the performance of AI workloads such as deep learning training and inference.

Precision and Data Types

CUDA cores are primarily designed for single-precision (FP32) and double-precision (FP64) floating-point computation, balancing accuracy and versatility, and are suitable for various scientific computing and general-purpose parallel workloads.
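The FP32/FP64 precision gap can be made concrete in a few lines of Python, which round a double-precision value down to single precision via `struct` and measure what is lost (`to_fp32` is our illustrative helper, not a GPU API):

```python
import struct

def to_fp32(x):
    """Round a Python float (binary64) to IEEE single precision."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

third_fp64 = 1.0 / 3.0          # ~16 decimal digits of precision
third_fp32 = to_fp32(third_fp64)  # ~7 decimal digits of precision
err = abs(third_fp32 - third_fp64)  # on the order of 1e-8
```

This is why workloads that are numerically sensitive, such as many scientific simulations, rely on the FP64 paths of CUDA cores rather than low-precision formats.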

Tensor cores natively support mixed-precision computation, including low-precision formats such as FP16, BF16, INT8, and INT4, and perform accumulation using FP32. While ensuring controllable model accuracy, they can perform computations more efficiently, which is key to improving the performance of deep learning training and inference.
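A minimal sketch of how an INT8 format like those listed above keeps accuracy controllable: symmetric linear quantization maps FP32 values to 8-bit integers plus one scale factor (illustrative only; production frameworks add calibration, per-channel scales, and zero points):

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats to the INT8 range.
    Returns (quantized_ints, scale); dequantize with q * scale."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(v * 127.0 / max_abs))) for v in values]
    return q, scale

weights = [0.5, -1.0, 0.25, 0.75]
q, scale = quantize_int8(weights)
restored = [v * scale for v in q]  # close to the originals
```

Each value now occupies one byte instead of four, and the reconstruction error stays bounded by half a quantization step, which is why INT8 inference can preserve model accuracy.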

Computational Efficiency and Speed

CUDA cores process work in a massively parallel fashion: instructions are issued at the warp level under the SIMT model, with many cores executing the same instruction on different data, emphasizing versatility and flexible scheduling.

Tensor cores integrate fused multiply-accumulate (FMA) units at the hardware level, enabling them to complete multiple operations within a single clock cycle. This allows them to achieve throughput far exceeding that of CUDA cores in computationally intensive tasks such as matrix multiplication, making them particularly suitable for artificial intelligence and high-performance scientific computing scenarios.

Use Cases

CUDA cores primarily serve traditional parallel computing scenarios, including game graphics rendering, physics simulation, video processing, and various general-purpose GPU computing tasks.

Tensor cores, on the other hand, are specifically optimized for AI, deep learning, and high-performance scientific computing. Addressing the core need for large-scale matrix operations, they significantly improve the efficiency of neural network training and model inference.

Conclusion

The computing power of modern NVIDIA GPUs no longer stems from a single type of processing core, but rather from the deep collaboration of multiple core types at the SM (Streaming Multiprocessor) level. CUDA cores provide the foundation for general-purpose parallel computing, Tensor cores focus on AI acceleration, and RT cores are specifically optimized for ray tracing. It is this heterogeneous computing architecture built around the SM that enables NVIDIA GPUs to maintain their leading position in multiple fields such as AI, graphics, and high-performance computing, and lays the foundation for more complex computing workloads in the future.
