Today's AI/ML headlines are brought to you by ThreatPerspective

Digital Event Horizon

NVIDIA Blackwell Revolutionizes AI Inference Performance: A New Era for AI Factories


NVIDIA Corporation has announced its latest achievement in artificial intelligence (AI) inference performance with the NVIDIA Blackwell platform, delivering up to 30x higher throughput on the Llama 3.1 405B benchmark over the NVIDIA H200 NVL8 submission this round.

  • NVIDIA Corporation has announced its latest achievement in artificial intelligence (AI) inference performance with the NVIDIA Blackwell platform.
  • The platform delivers up to 30x higher throughput on the Llama 3.1 405B benchmark over previous submissions.
  • The GB200 NVL72 system powered by the NVIDIA Blackwell platform demonstrated exceptional performance across the board.
  • The new Llama 2 70B Interactive benchmark features stricter latency requirements, reflecting production deployments' need for better user experiences.
  • Production inference deployments often have latency constraints on two key metrics: time to first token (TTFT) and time per output token (TPOT).
  • NVIDIA's Hopper architecture continues to power AI inference factories with increasing throughput through software optimization.
  • The NVIDIA Blackwell platform is available across all cloud service providers and server makers worldwide, reflecting its reach in the ecosystem.


  • NVIDIA Corporation, a global leader in graphics processing units (GPUs) and high-performance computing systems, has announced its latest achievement in artificial intelligence (AI) inference performance with the NVIDIA Blackwell platform. This groundbreaking development marks a significant milestone in the company's pursuit of delivering accurate answers to queries quickly, at the lowest cost, and to as many users as possible.

    In the realm of AI factories, which are designed to manufacture intelligence at scale by transforming raw data into real-time insights, the complexity of pulling off this feat is substantial. As AI models grow to billions and trillions of parameters to deliver smarter replies, the compute required to generate each token increases. This requirement reduces the number of tokens that an AI factory can generate and increases cost per token. Keeping inference throughput high and cost per token low requires rapid innovation across every layer of the technology stack, spanning silicon, network systems, and software.
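    The relationship described above can be made concrete with a small back-of-the-envelope sketch. The snippet below is illustrative only, with hypothetical throughput and cost figures that are not drawn from the benchmark results: it simply shows how sustained token throughput and hourly system cost combine into cost per token, and why higher throughput at fixed cost drives the per-token price down.

```python
# Hypothetical illustration of AI-factory economics: cost per token
# as a function of sustained throughput and hourly infrastructure cost.

def cost_per_token(tokens_per_second: float, cost_per_hour: float) -> float:
    """Dollars per generated token, given sustained throughput and hourly system cost."""
    tokens_per_hour = tokens_per_second * 3600
    return cost_per_hour / tokens_per_hour

# Hypothetical numbers (not from any MLPerf submission):
baseline = cost_per_token(tokens_per_second=1_000, cost_per_hour=50.0)
faster = cost_per_token(tokens_per_second=30_000, cost_per_hour=50.0)  # 30x throughput

print(f"baseline: ${baseline:.8f}/token")
print(f"30x throughput: ${faster:.8f}/token")
```

    At fixed hourly cost, a 30x throughput gain cuts cost per token by the same 30x factor, which is why the article frames throughput and cost per token as two sides of the same metric.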

    The latest updates to MLPerf Inference, a peer-reviewed industry benchmark of inference performance, include the addition of Llama 3.1 405B, one of the largest and most challenging-to-run open-weight models. The new Llama 2 70B Interactive benchmark features much stricter latency requirements compared with the original Llama 2 70B benchmark, better reflecting the constraints of production deployments in delivering the best possible user experiences.

    The NVIDIA GB200 NVL72 system, which connects 72 NVIDIA Blackwell GPUs to act as a single, massive GPU, delivered up to 30x higher throughput on the Llama 3.1 405B benchmark over the NVIDIA H200 NVL8 submission this round. This feat was achieved through more than triple the performance per GPU and a 9x larger NVIDIA NVLink interconnect domain.

    The GB200 NVL72 system, powered by the NVIDIA Blackwell platform, demonstrated exceptional performance across the board, with performance increasing significantly over the last year on Llama 2 70B thanks to full-stack optimizations. This milestone represents a significant achievement for AI factories and paves the way for delivering higher intelligence, increased throughput, and faster token rates.

    Production inference deployments often have latency constraints on two key metrics: time to first token (TTFT), or how long it takes for a user to begin seeing a response to a query given to a large language model; and time per output token (TPOT), or how quickly tokens are delivered to the user. The new Llama 2 70B Interactive benchmark has a 5x shorter TPOT and 4.4x lower TTFT — modeling a more responsive user experience.
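    The two metrics are straightforward to compute from a token stream's arrival timestamps. The sketch below is a minimal illustration, not any benchmark's measurement harness; the function name and the sample timestamps are hypothetical: TTFT is the delay from request to first token, and TPOT is the mean gap between subsequent tokens.

```python
# Minimal sketch: deriving TTFT and TPOT from token arrival timestamps.

def ttft_and_tpot(request_time: float, token_times: list[float]) -> tuple[float, float]:
    """TTFT = delay until the first token arrives;
    TPOT = mean interval between consecutive subsequent tokens."""
    ttft = token_times[0] - request_time
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, tpot

# Hypothetical stream: request at t=0 s, first token after 0.4 s,
# then one token every 50 ms.
token_times = [0.4 + 0.05 * i for i in range(5)]
ttft, tpot = ttft_and_tpot(0.0, token_times)
print(f"TTFT = {ttft:.3f} s, TPOT = {tpot:.3f} s")
```

    Tightening the benchmark's TTFT and TPOT limits, as the Llama 2 70B Interactive scenario does, forces submissions to hold these per-request numbers low even under load, rather than maximizing raw throughput alone.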

    The NVIDIA Hopper architecture, introduced in 2022, powers many of today's AI inference factories and continues to power model training. Through ongoing software optimization, NVIDIA increases the throughput of Hopper-based AI factories, leading to greater value. On the Llama 2 70B benchmark, first introduced a year ago in MLPerf Inference v4.0, H100 GPU throughput has increased by 1.5x, while the H200 GPU extends that increase to 1.6x.

    The breadth of submissions from 15 partners, including ASUS, Cisco, CoreWeave, Dell Technologies, Fujitsu, Giga Computing, Google Cloud, Hewlett Packard Enterprise, Lambda, Lenovo, Oracle Cloud Infrastructure, Quanta Cloud Technology, Supermicro, Sustainable Metal Cloud, and VMware, reflects the reach of the NVIDIA platform, which is available across all cloud service providers and server makers worldwide.

    The work of MLCommons to continuously evolve the MLPerf Inference benchmark suite to keep pace with the latest AI developments and provide the ecosystem with rigorous, peer-reviewed performance data is vital to helping IT decision-makers select optimal AI infrastructure. As the landscape of AI continues to evolve at an unprecedented pace, the importance of benchmarking and validation cannot be overstated.

    The NVIDIA Blackwell platform represents a significant breakthrough in AI inference performance, marking a new era for AI factories and paving the way for higher intelligence, increased throughput, and faster token rates. This achievement serves as a testament to NVIDIA's commitment to pushing the boundaries of innovation and its dedication to empowering organizations to harness the full potential of AI.



    Related Information:
  • https://www.digitaleventhorizon.com/articles/NVIDIA-Blackwell-Revolutionizes-AI-Inference-Performance-A-New-Era-for-AI-Factories-deh.shtml

  • https://blogs.nvidia.com/blog/blackwell-mlperf-inference/


  • Published: Wed Apr 2 12:26:33 2025 by llama3.2 3B Q4_K_M











    © Digital Event Horizon . All rights reserved.

    Privacy | Terms of Use | Contact Us