
AWS Machine Learning Blog

AWS AI chips deliver high performance and low cost for Llama 3.1 models on AWS

Today, we are excited to announce AWS Trainium and AWS Inferentia support for fine-tuning and inference of the Llama 3.1 models. The Llama 3.1 family of multilingual large language models (LLMs) is a collection of pre-trained and instruction-tuned generative models in 8B, 70B, and 405B sizes. In a previous post, we covered how to deploy Llama 3 models on AWS Trainium and Inferentia based instances in Amazon SageMaker JumpStart. In this post, we outline how to get started with fine-tuning and deploying the Llama 3.1 family of models on AWS AI chips to realize their price-performance benefits. A minimal SageMaker JumpStart deployment sketch for Llama 3.1 is also included at the end of this post.

Additionally, if you want to use vLLM to deploy the models, you can refer to the continuous batching guide to create the environment. After you create the environment, you can use vLLM to deploy Llama 3.1 8B and 70B models on AWS Trainium or Inferentia. The following is an example of deploying Llama 3.1 8B:

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B",
    max_num_seqs=8,
    # The max_model_len and block_size arguments are required to be the same as the
    # max sequence length when targeting the Neuron device. Currently, this is a known
    # limitation in continuous batching support in transformers-neuronx.
    max_model_len=128,
    block_size=128,
    # The device can be automatically detected when the AWS Neuron SDK is installed.
    # The device argument can be either unspecified for automated detection, or explicitly assigned.
    device="neuron",
    tensor_parallel_size=8)

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Conclusion

AWS Trainium and Inferentia deliver high performance and low cost for fine-tuning and deploying Llama 3.1 models. We are excited to see how you will use these powerful models and our purpose-built AI infrastructure to build differentiated AI applications. To learn more about how to get started with AWS AI chips, refer to Model Samples and Tutorials in the AWS Neuron Documentation.

About the Authors

John Gray is a Sr. Solutions Architect in Annapurna Labs, AWS, based out of Seattle. In this role, John works with customers on their AI and machine learning use cases, architects solutions to cost-effectively solve their business problems, and helps them build scalable prototypes using AWS AI chips.

Pinak Panigrahi works with customers to build ML-driven solutions that solve strategic business problems on AWS. In his current role, he works on optimizing training and inference of generative AI models on AWS AI chips.

Kamran Khan is Head of Business Development for AWS Inferentia/Trainium at AWS. He has over a decade of experience helping customers deploy and optimize deep learning training and inference workloads using AWS Inferentia and AWS Trainium.

Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.
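Deployment sketch: Llama 3.1 on AWS AI chips with SageMaker JumpStart

As a complement to the self-managed vLLM example above, the following is a minimal sketch of deploying a Llama 3.1 Neuron model to a SageMaker endpoint through Amazon SageMaker JumpStart, the managed path we covered for Llama 3 in the previous post. The model ID, instance type, and generation parameters shown here are illustrative assumptions rather than values from this post; check the JumpStart model catalog for the exact Llama 3.1 Neuron identifiers and supported instance types in your Region.

# A minimal deployment sketch using the SageMaker Python SDK. The model_id below is an
# illustrative placeholder, not a confirmed identifier; look up the actual Llama 3.1 Neuron
# model ID in the SageMaker JumpStart catalog. Running this requires AWS credentials and a
# SageMaker execution role with permission to create endpoints.
from sagemaker.jumpstart.model import JumpStartModel

# Reference the JumpStart model (assumed ID shown for illustration only).
model = JumpStartModel(model_id="meta-textgenerationneuron-llama-3-1-8b")

# Deploy to an Inferentia2-backed endpoint; accepting the Llama EULA is required for gated models.
predictor = model.deploy(accept_eula=True, instance_type="ml.inf2.24xlarge")

# Send a generation request to the hosted endpoint.
payload = {"inputs": "The future of AI is", "parameters": {"max_new_tokens": 64}}
print(predictor.predict(payload))

# Delete the endpoint when you are done to stop incurring charges.
predictor.delete_endpoint()

Compared with the self-managed vLLM path, this sketch trades fine-grained control of the serving stack for an endpoint that SageMaker provisions and hosts for you.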

Published: 2024-07-23T16:18:57