Digital Event Horizon
Intel Labs has developed Universal Assisted Generation (UAG), a method that enables pairing any target model with an assistant model from a different tokenizer family, overcoming a major limitation of assisted generation. With reported speedups ranging from 1.52x to 1.91x over running the target model alone, UAG offers a promising solution for accelerating inference times in the field of NLP.
Key points:

- UAG introduces 2-way tokenizer translations, allowing any target model to be paired with an assistant model from a different tokenizer family.
- As in standard assisted generation, the assistant model generates tokens autoregressively and the target model verifies all of the assistant's tokens in a single forward pass.
- To handle discrepancies between different tokenizers, UAG prepends a context window of previous tokens when re-encoding newly generated assistant tokens.
- Benchmark results show significant latency improvements across a range of model pairings.
- Intel Labs plans to extend UAG with support for speculative sampling and integration into Transformers pipelines; the method is already available in the Hugging Face Transformers library.
Intel Labs, in collaboration with Hugging Face, has achieved a significant breakthrough in the field of natural language processing (NLP) with the development of Universal Assisted Generation (UAG). This innovative method enables users to pair any target model with an assistant model from a different tokenizer family, thereby overcoming one of the major limitations of assisted generation.
The concept of assisted generation involves a pair of models, known as the target and assistant models. The assistant model is typically much smaller and faster than the target, which allows for faster inference times without compromising accuracy, since the target model verifies every token. However, the standard approach requires both models to share the same tokenizer, restricting it to model pairs with compatible vocabularies.
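For reference, this is what standard, same-tokenizer assisted generation looks like in Transformers; the Pythia pair below is illustrative, and any two models sharing a tokenizer would work:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "EleutherAI/pythia-1.4b-deduped"            # target model
assistant_checkpoint = "EleutherAI/pythia-160m-deduped"  # smaller assistant

# Both models come from the same family, so one tokenizer serves both.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)

inputs = tokenizer("Alice and Bob", return_tensors="pt")
outputs = model.generate(**inputs, assistant_model=assistant_model)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```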
UAG, on the other hand, introduces 2-way tokenizer translations, allowing any target and assistant model to be paired regardless of their tokenizers. For instance, a user can utilize google/gemma-2-9b as the target model with the tiny double7/vicuna-68m as the assistant model. This flexibility opens up new possibilities for accelerating inference across a wide range of decoder-only and Mixture of Experts models.
The core idea behind UAG is an iterative process: the assistant model generates a sequence of tokens autoregressively, one at a time, and the target model then verifies all of the assistant's tokens in a single forward pass. Because the target model can confirm multiple tokens per forward pass, UAG achieves speedups comparable to standard assisted generation.
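The sketch below illustrates this draft-and-verify loop under greedy decoding; `assistant_next` and `target_next` are hypothetical stand-ins for a model forward pass, and in the real implementation the target scores all drafted positions in one batched forward pass rather than one at a time:

```python
from typing import Callable, List

def assisted_step(target_next: Callable[[List[int]], int],
                  assistant_next: Callable[[List[int]], int],
                  tokens: List[int], k: int = 5) -> List[int]:
    # 1. The assistant drafts k candidate tokens autoregressively (cheap per token).
    draft: List[int] = []
    for _ in range(k):
        draft.append(assistant_next(tokens + draft))

    # 2. The target keeps the longest prefix of the draft it agrees with;
    #    on the first mismatch it substitutes its own token instead.
    accepted: List[int] = []
    for drafted in draft:
        expected = target_next(tokens + accepted)
        if drafted == expected:
            accepted.append(drafted)   # draft confirmed
        else:
            accepted.append(expected)  # mismatch: take the target's token
            break
    else:
        # All k drafts accepted: the target's verification also yields
        # one extra token for free.
        accepted.append(target_next(tokens + accepted))

    return tokens + accepted
```

Each iteration advances the output by at least one and at most k+1 tokens, which is where the speedup comes from.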
However, UAG requires a more sophisticated approach to handle discrepancies between different tokenizers. To accurately re-encode newly generated assistant tokens, it is essential to prepend a context window consisting of several previous tokens. This entire sequence is then re-encoded into the target token format and aligned with the most recent target tokens to pinpoint the exact location where the newly generated tokens should be appended.
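A rough sketch of this translation step is shown below; the `window` size and the suffix-matching loop are illustrative assumptions rather than the exact algorithm used in Transformers:

```python
from typing import List

def translate_new_tokens(assistant_ids: List[int], num_new: int,
                         assistant_tok, target_tok,
                         target_ids: List[int], window: int = 10) -> List[int]:
    """Map `num_new` freshly drafted assistant tokens into target-tokenizer ids."""
    # Decode the new tokens together with a window of preceding context,
    # since token boundaries rarely line up one-to-one across tokenizers.
    span = assistant_ids[-(num_new + window):]
    text = assistant_tok.decode(span, skip_special_tokens=True)
    reencoded = target_tok.encode(text, add_special_tokens=False)

    # Align the re-encoded sequence against the most recent target tokens
    # to pinpoint where the genuinely new tokens begin.
    max_overlap = min(len(reencoded), len(target_ids))
    for overlap in range(max_overlap, 0, -1):
        if target_ids[-overlap:] == reencoded[:overlap]:
            return target_ids + reencoded[overlap:]
    return target_ids + reencoded  # no overlap found: append everything
```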
The benefits of UAG are demonstrated in a series of benchmarks pairing target and assistant models that use different tokenizers. The table below presents the results, showing speedups ranging from 1.52x to 1.91x over running each target model on its own.
| Target Model | Assistant Model | Dataset | Task | Speedup |
| --- | --- | --- | --- | --- |
| codellama/CodeLlama-13b-Instruct-hf | bigcode/tiny_starcoder_py | openai/humaneval | code generation | 1.90x |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | double7/vicuna-68m | cnn_dailymail | summarization | 1.52x |
| google/gemma-2-9b | double7/vicuna-68m | cnn_dailymail | summarization | 1.76x |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | Qwen/Qwen2-0.5B-Instruct | tau/scrolls | long-context summarization | 1.78x |
| meta-llama/Llama-3.1-70B | Qwen/Qwen2-0.5B-Instruct | tau/scrolls | long-context summarization | 1.78x |
| microsoft/Phi-3-medium-128k-instruct | Qwen/Qwen2-0.5B-Instruct | tau/scrolls | long-context summarization | 1.91x |
The integration of UAG into the Hugging Face Transformers library allows users to leverage its benefits with minimal effort: passing the `tokenizer` and `assistant_tokenizer` arguments to `generate()` is enough to activate Universal Assisted Generation, as shown below.
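A minimal usage sketch, mirroring the gemma-2 / vicuna-68m pairing from the benchmarks above (a recent Transformers release with UAG support is assumed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "google/gemma-2-9b"             # target model
assistant_checkpoint = "double7/vicuna-68m"  # assistant from another family

# Each model keeps its own tokenizer; UAG translates between them.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
assistant_tokenizer = AutoTokenizer.from_pretrained(assistant_checkpoint)

model = AutoModelForCausalLM.from_pretrained(checkpoint)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)

inputs = tokenizer("Alice and Bob", return_tensors="pt")
# Passing both tokenizers activates Universal Assisted Generation.
outputs = model.generate(**inputs,
                         assistant_model=assistant_model,
                         tokenizer=tokenizer,
                         assistant_tokenizer=assistant_tokenizer)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```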
Looking ahead, Intel Labs plans to extend UAG by adding support for speculative sampling, which would enable users to achieve even greater speedups compared to standard assisted generation methods. Furthermore, integration into Transformers pipelines is also on the horizon, providing a more streamlined and concise experience for users.
In conclusion, Universal Assisted Generation represents a significant breakthrough in the field of NLP, offering far greater flexibility in choosing model pairs along with substantial reductions in inference latency.
Related Information:
https://huggingface.co/blog/universal_assisted_generation
Published: Tue Oct 29 07:40:14 2024 by llama3.2 3B Q4_K_M