Digital Event Horizon

Amazon SageMaker inference launches faster auto scaling for generative AI models

Today, we are excited to announce a new capability in Amazon SageMaker inference that can help you reduce the time it takes for your generative artificial intelligence (AI) models to scale automatically. You can now use sub-minute metrics and significantly reduce overall scaling latency for generative AI models. With this enhancement, you can improve the [ ] You can use the new concurrency-based target tracking auto scaling policies in tandem with existing invocation-based target tracking policies. When a container experiences a crash or failure, the resulting requests are typically short-lived and may be responded to with error messages. In such scenarios, the concurrency-based auto scaling policy can detect the sudden drop in concurrent requests, potentially causing an unintentional scale-in of the container fleet. However, the invocation-based policy can act as a safeguard, avoiding the scale-in if there is still sufficient traffic being directed to the remaining containers. With this hybrid approach, container-based applications can achieve a more efficient and adaptive scaling behavior. The balance between concurrency-based and invocation-based policies allows the system to respond appropriately to various operational conditions, such as container failures, sudden spikes in traffic, or gradual changes in workload patterns. This enables the container infrastructure to scale up and down more effectively, optimizing resource utilization and providing reliable application performance. Sample runs and results With the new metrics, we have observed improvements in the time required to invoke scale-out events. To test the effectiveness of this solution, we completed some sample runs with Meta Llama models (Llama 2 7B and Llama 3 8B). Prior to this feature, detecting the need for auto scaling could take over 6 minutes, but with this new feature, we were able to reduce that time to less than 45 seconds. For generative AI models such as Meta Llama 2 7B and Llama 3 8B, we have been able to reduce the overall end-to-end scale-out time by approximately 40%. The following figures illustrate the results of sample runs for Meta Llama 3 8B. The following figures illustrate the results of sample runs for Meta Llama 2 7B. As a best practice, it’s important to optimize your container, model artifacts, and bootstrapping processes to be as efficient as possible. Doing so can help minimize deployment times and improve the responsiveness of AI services. Conclusion In this post, we detailed how the ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy metrics work, explained why you should use them, and walked you through the process of implementing them for your workloads. We encourage you to try out these new metrics and evaluate whether they improve your FM and LLM workloads on SageMaker endpoints. You can find the notebooks on GitHub. Special thanks to our partners from Application Auto Scaling for making this launch happen: Ankur Sethi, Vasanth Kumararajan, Jaysinh Parmar Mona Zhao, Miranda Liu, Fatih Tekin, and Martin Wang. About the Authors James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In h is spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn. Praveen Chamarthi is a Senior AI/ML Specialist with Amazon Web Services. He is passionate about AI/ML and all things AWS. He helps customers across the Americas scale, innovate, and operate ML workloads efficiently on AWS. In his spare time, Praveen loves to read and enjoys sci-fi movies. Dr. Changsha Ma is an AI/ML Specialist at AWS. She is a technologist with a PhD in Computer Science, a master’s degree in Education Psychology, and years of experience in data science and independent consulting in AI/ML. She is passionate about researching methodological approaches for machine and human intelligence. Outside of work, she loves hiking, cooking, hunting food, and spending time with friends and families. Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family. Kunal Shah is a software development engineer at Amazon Web Services (AWS) with 7+ years of industry experience. His passion lies in deploying machine learning (ML) models for inference, and he is driven by a strong desire to learn and contribute to the development of AI-powered tools that can create real-world impact. Beyond his professional pursuits, he enjoys watching historical movies, traveling and adventure sports. Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Published: 2024-07-25T21:13:21

Today's AI/ML headlines are brought to you by ThreatPerspective

Amazon SageMaker inference launches faster auto scaling for generative AI models