
Nvidia Changes How Its Accelerators Perform Double-Precision Calculations
Nvidia has changed how its supercomputing accelerators handle 64-bit floating-point (FP64) operations. According to reports, the company has stopped scaling dedicated FP64 hardware units in its newest generations and instead emulates these operations algorithmically through its CUDA libraries. In certain workloads, this approach matches or even exceeds the theoretical performance of native hardware without spending dedicated silicon area on it. 🔄
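Nvidia has not published the exact algorithm its CUDA libraries use here, but the general idea behind this family of techniques (the Ozaki scheme is the best-known example) is to split each FP64 value into several narrow-precision slices, multiply every pair of slices on the low-precision matrix units, and recombine the exact partial products in a wide accumulator. The NumPy sketch below is only a toy illustration of that idea under simplifying assumptions: the names (split_fp64, emulated_fp64_matmul) and parameters (num_slices, bits_per_slice) are invented for this example, the slices are kept in ordinary FP64 arrays rather than in tensor-core operand formats, and nothing here reflects the actual cuBLAS implementation.

```python
import numpy as np

def split_fp64(a, num_slices=3, bits_per_slice=18):
    """Split an FP64 matrix into narrow-precision slices so that
    a ~= sum(slices). Each slice carries only ~bits_per_slice significand
    bits, mimicking the narrow operands a tensor core consumes.
    (Illustrative helper, not an Nvidia API.)"""
    slices = []
    residual = a.astype(np.float64)
    tiny = np.finfo(np.float64).tiny
    for _ in range(num_slices):
        # Scale each element so its leading bits fit in one slice,
        # round to that grid, then keep the remainder for later slices.
        exponent = np.floor(np.log2(np.abs(residual) + tiny))
        scale = np.power(2.0, exponent - bits_per_slice + 1)
        s = np.round(residual / scale) * scale
        slices.append(s)
        residual = residual - s
    return slices

def emulated_fp64_matmul(a, b, num_slices=3):
    """Emulate an FP64 GEMM as a sum of partial GEMMs over narrow slices.
    Products of two ~18-bit slices are exact, and the FP64 accumulation
    stands in for a tensor core's wide accumulator."""
    a_slices = split_fp64(a, num_slices)
    b_slices = split_fp64(b, num_slices)
    c = np.zeros((a.shape[0], b.shape[1]), dtype=np.float64)
    for ai in a_slices:
        for bj in b_slices:
            c += ai @ bj   # one "low-precision" partial product per slice pair
    return c

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((256, 256))
    b = rng.standard_normal((256, 256))
    ref = a @ b                                                  # native FP64
    fp32 = (a.astype(np.float32) @ b.astype(np.float32)).astype(np.float64)
    emu = emulated_fp64_matmul(a, b)
    scale = np.max(np.abs(ref))
    print("plain FP32 max relative error:", np.max(np.abs(fp32 - ref)) / scale)
    print("emulated  max relative error:", np.max(np.abs(emu - ref)) / scale)
```

On random 256×256 matrices, the emulated result typically tracks native FP64 far more closely than a plain FP32 multiply does, which is the property Nvidia is leaning on when it moves FP64 work onto its low-precision matrix engines.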
Performance Figures Reveal the New Direction
Nvidia's own figures make the shift clear. Its latest accelerator, Rubin, is rated at 33 teraflops of vector FP64 in hardware, roughly on par with the years-old H100. With software emulation enabled, however, Nvidia claims Rubin can reach up to 200 teraflops in matrix FP64 calculations. Even the Blackwell generation can hit around 150 teraflops with this technique, more than double what its predecessor Hopper delivers natively. 📊
Key Performance Comparison:
- Rubin (Hardware): 33 TFLOPS in vector FP64.
- Rubin (Software): Up to 200 TFLOPS in emulated matrix FP64.
- Blackwell (Software): Around 150 TFLOPS, far surpassing Hopper.
"In numerous studies with partners and in internal research, we found that the precision achieved through emulation is, at a minimum, equal to the precision obtained from hardware tensor cores."
Validated Precision Drives the Change
Dan Ernst, an Nvidia executive focused on supercomputing, explained the reasoning behind this strategic shift. Validation carried out internally and with partners confirmed that the accuracy achieved by emulating FP64 is at least equivalent to running on dedicated hardware cores. This finding lets Nvidia optimize its chip designs for domains such as artificial intelligence, where lower precisions (FP32, FP16) predominate, without neglecting the demands of the high-performance computing (HPC) sector, which still needs FP64. ⚖️
Advantages of Software Emulation:
- Frees up transistors and chip area for other functions.
- Allows achieving higher peak performance in specific workloads.
- Maintains the necessary precision for scientific and engineering applications.
A New Software-Defined Architecture
In the race to lead artificial intelligence, emulating in software rather than executing natively in silicon seems to have become the new paradigm of architectural efficiency. This is a shift in which software no longer merely supports the hardware but redefines what the hardware needs to be, blurring the boundary between the two to create more versatile solutions. 🚀