Digital Event Horizon
Google has demonstrated that large language models can run acceptably on Central Processing Units (CPUs), challenging the long-held assumption that Graphics Processing Units (GPUs) are essential for machine learning. Using the Advanced Matrix Extensions (AMX) in Intel's 4th-Gen Xeon cores, the company reached second-token latencies of 55 milliseconds per output token on Llama 2 7B at a batch size of six, with hyperthreading disabled. The results suggest CPUs are viable for certain AI workloads, particularly natural language processing, though their performance per dollar still trails GPUs for large-scale deployments.
In a groundbreaking development that challenges the long-held assumption that Graphics Processing Units (GPUs) are essential for machine learning (ML), Google has successfully demonstrated the feasibility of running large language models on Central Processing Units (CPUs). The revelation has sparked a heated debate in the ML community, with some experts hailing it as a game-changer and others cautioning that the economics still favor GPUs.
According to a recent report by The Register, Google's experiments using Intel's 4th-Gen Xeon cores have yielded impressive results. By leveraging the Advanced Matrix Extensions (AMX) baked into these processors, the search giant was able to achieve acceptable second-token latencies when running Meta's Llama 2 7B model at a batch size of six, with a time per output token (TPOT) of 55 milliseconds.
Notably, these figures came from a pair of 4th-Gen Xeons with hyperthreading disabled, limiting the active threads to just 88 of the 176 available.
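For a rough sense of what those latency figures mean in throughput terms, the short Python sketch below converts the reported 55 ms TPOT at a batch size of six into tokens per second. It assumes each sequence in the batch emits a token every 55 ms, so the aggregate figure is a derived estimate rather than a number from the report.

    # Converts the reported latency figures into rough throughput numbers.
    # Assumes the 55 ms time-per-output-token applies to each of the six
    # concurrent sequences; the aggregate figure is derived, not reported.
    tpot_seconds = 0.055   # time per output token, per sequence
    batch_size = 6         # concurrent sequences in Google's test

    per_sequence_tok_s = 1 / tpot_seconds        # ~18 tokens/s per sequence
    aggregate_tok_s = batch_size / tpot_seconds  # ~109 tokens/s across the batch

    print(f"per sequence: {per_sequence_tok_s:.0f} tok/s, "
          f"aggregate: {aggregate_tok_s:.0f} tok/s")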
The company also experimented with fine-tuning Meta's 125-million-parameter RoBERTa model on the Stanford Question Answering Dataset (SQuAD), with runs completing in under 25 minutes regardless of whether Intel's TDX (Trust Domain Extensions) security functionality was enabled. This suggests that CPUs can be a viable option for certain AI workloads, particularly those involving natural language processing.
However, while Google's results are encouraging, they do not necessarily mean CPUs will become the preferred choice for machine learning. The major hurdle is cost-effectiveness: Google estimates that its 176 vCPU C3 instance costs $5,464 a month, which works out to approximately $9 per million tokens generated.
By comparison, renting an Nvidia L40S GPU for around $600-$1,200 a month yields a significantly better cost per token, making GPUs the more attractive option for large-scale ML deployments. Still, as The Register notes, raw price per token is not the only factor in the decision.
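To make the economics concrete, here is a minimal back-of-the-envelope sketch in Python. The $5,464-a-month C3 price and the roughly $9-per-million-token estimate are the article's figures; the sustained throughput is backed out from those two numbers, and the $900-a-month L40S rental price is a hypothetical midpoint of the quoted $600-$1,200 range, so both derived values are assumptions rather than measurements.

    # Back-of-the-envelope cost model. The C3 monthly price and the ~$9 per
    # million tokens figure come from the article; everything derived from
    # them (and the assumed L40S rental price) is an estimate.
    SECONDS_PER_MONTH = 730 * 3600  # ~730 billable hours in a month

    def implied_throughput(monthly_cost_usd: float, usd_per_million_tokens: float) -> float:
        """Sustained tokens/sec needed to reach a given cost per million tokens."""
        tokens_per_month = monthly_cost_usd / usd_per_million_tokens * 1e6
        return tokens_per_month / SECONDS_PER_MONTH

    c3_monthly = 5464.0   # Google's 176 vCPU C3 instance
    l40s_monthly = 900.0  # hypothetical midpoint of the $600-$1,200 rental range

    # ~231 tok/s sustained gets the C3 instance to roughly $9 per million tokens.
    print(f"C3 implied throughput: {implied_throughput(c3_monthly, 9.0):.0f} tok/s")
    # An L40S at $900/month needs only ~38 tok/s to match that $9/M figure;
    # anything above that beats the CPU instance on a per-token basis.
    print(f"L40S break-even throughput: {implied_throughput(l40s_monthly, 9.0):.0f} tok/s")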
For some organizations, particularly those with existing infrastructure or underutilized cores, CPUs may offer an attractive alternative to GPUs. Intel's 6900P-series Granite Rapids parts, for example, will set you back between $11,400 and $17,800 before factoring in memory costs. While this is indeed a premium, the flexibility of an AI-capable CPU might be worth the additional expense.
On the other hand, for workloads that benefit from high-bandwidth, low-latency memory, GPU accelerators remain the preferred choice. Nvidia's H100 and AMD's MI300X, with multiple terabytes per second of memory bandwidth, are still unmatched in this regard.
In conclusion, Google's experiments demonstrate that CPUs can be viable for certain AI workloads, particularly those involving language models. However, it is essential to consider the economics and assess whether the benefits of using a CPU outweigh the costs. As the ML landscape continues to evolve, one thing is clear: the distinction between CPUs and GPUs will become increasingly nuanced.
Related Information:
https://go.theregister.com/feed/www.theregister.com/2024/10/29/cpu_gen_ai_gpu/
Published: Tue Oct 29 11:43:12 2024 by llama3.2 3B Q4_K_M