Uninterrupted Conversations: Revolutionizing Chatbot Performance

Advances in the large language models that power chatbots like ChatGPT have revolutionized what these systems can do. During prolonged conversations, however, even these powerful AI systems often slow down or crash.

Researchers from MIT and other institutions have successfully identified the underlying cause of this problem and developed a simple yet effective solution. By making a slight adjustment to the key-value cache, which acts as a conversation memory for these language models, the researchers have enabled chatbots to maintain uninterrupted conversations without experiencing crashes or slowdowns.

Their method, known as StreamingLLM, ensures that the first few data points remain in the chatbot’s memory even when the cache reaches its capacity. This allows the chatbot to continue functioning seamlessly, regardless of the conversation length.

Compared to other methods that involve constant recomputation of past conversations, the StreamingLLM approach is more than 22 times faster. With this breakthrough, chatbots can now conduct long conversations throughout the workday without the need for continuous reboots, making them highly efficient AI assistants for various tasks like copywriting, editing, and code generation.

Guangxuan Xiao, an electrical engineering and computer science (EECS) graduate student at MIT and lead author of the paper on StreamingLLM, explains the significance of this development: “Now, with this method, we can persistently deploy these large language models. By making a chatbot that we can always chat with, and that can always respond to us based on our recent conversations, we could use these chatbots in some new applications.”

The researchers involved in this breakthrough also include Song Han, an associate professor in EECS and a distinguished scientist at NVIDIA; Yuandong Tian, a research scientist at Meta AI; Beidi Chen, an assistant professor at Carnegie Mellon University; and senior author Mike Lewis, a research scientist at Meta AI. The findings of their research will be presented at the International Conference on Learning Representations.

The researchers uncovered a peculiar phenomenon associated with large language models. These models encode data into representations called tokens and use an attention mechanism to generate new text from the tokens stored in the key-value cache. But when the cache overflows and tokens must be evicted to make room, the model’s performance drops sharply.

To overcome this issue, the researchers introduced the idea of an “attention sink”: the first token in the sequence acts as a reference point that the model keeps attending to while generating text, so it must stay in the cache to preserve the model’s dynamics. Keeping four attention-sink tokens at the beginning of the cache gave the best performance.
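To make the mechanism concrete, here is a minimal Python sketch of a rolling key-value cache that pins a few attention-sink tokens at the start and keeps a sliding window of the most recent tokens. The class and parameter names (AttentionSinkCache, num_sinks, capacity) are illustrative assumptions, not the paper’s actual implementation.

```python
from collections import deque


class AttentionSinkCache:
    """Toy rolling key-value cache that pins the first few tokens
    (the attention sinks) and keeps a sliding window of recent tokens.
    Names and structure are illustrative, not the paper's code."""

    def __init__(self, capacity=8, num_sinks=4):
        self.capacity = capacity      # maximum tokens held in the cache
        self.num_sinks = num_sinks    # tokens at the start that are never evicted
        self.sinks = []               # the attention-sink tokens
        self.recent = deque()         # rolling window of the most recent tokens

    def add(self, token):
        # The very first tokens of the conversation become the attention sinks.
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(token)
            return
        self.recent.append(token)
        # When the cache is full, evict the oldest non-sink token,
        # never the sinks themselves.
        while len(self.sinks) + len(self.recent) > self.capacity:
            self.recent.popleft()

    def contents(self):
        return self.sinks + list(self.recent)


cache = AttentionSinkCache(capacity=8, num_sinks=4)
for token in range(20):        # stream 20 token ids through the cache
    cache.add(token)
print(cache.contents())        # -> [0, 1, 2, 3, 16, 17, 18, 19]
```

Because the sink tokens are never evicted, the cache can roll forward indefinitely while its memory footprint stays fixed.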

Furthermore, the researchers found that positional encodings must be handled consistently as tokens are added to and removed from the cache: positions are assigned according to where a token sits within the cache rather than where it appeared in the original text, so the encodings the model sees stay stable no matter how long the conversation runs. This crucial insight, along with the attention sink mechanism, allowed the StreamingLLM model to outperform existing methods.
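As a rough illustration of that positional scheme (a sketch based on the description in the StreamingLLM paper; the helper name is hypothetical), positions can simply be re-derived from each token’s index inside the cache:

```python
def cache_positions(cached_tokens):
    """Assign positional indices by each token's place inside the cache,
    not by its position in the original text (hypothetical helper; a
    sketch of the scheme described in the StreamingLLM paper)."""
    return list(range(len(cached_tokens)))


# Cache after streaming: 4 attention sinks plus the most recent tokens,
# listed here by their original-text indices.
cached = [0, 1, 2, 3, 16, 17, 18, 19]
print(cache_positions(cached))   # -> [0, 1, 2, 3, 4, 5, 6, 7]
# The model therefore never sees positions larger than the cache size,
# no matter how long the conversation grows.
```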

For example, when the cache size is 256 tokens, the recomputation method takes 63 milliseconds to decode a new token, whereas StreamingLLM only requires 31 milliseconds. As the cache size increases to 4,096 tokens, recomputation takes a staggering 1,411 milliseconds for a new token, while StreamingLLM completes the task in just 65 milliseconds.

Impressed by the innovative approach taken by StreamingLLM, Yang You, a Presidential Young Professor of computer science at the National University of Singapore, explains its transformative potential: “The innovative approach of StreamingLLM, centered around the attention sink mechanism, ensures stable memory usage and performance, even when processing texts up to 4 million tokens in length. This capability is not just impressive; it’s transformative, enabling StreamingLLM to be applied across a wide array of AI applications. The performance and versatility of StreamingLLM mark it as a highly promising technology, poised to revolutionize how we approach AI-driven generation applications.”

Another expert, Tianqi Chen, an assistant professor at Carnegie Mellon University, supports this sentiment, stating, “Streaming LLM enables the smooth extension of the conversation length of large language models. We have been using it to enable the deployment of Mistral models on iPhones with great success.”

While StreamingLLM allows AI models to conduct continuous conversations, it does have limitations. The model cannot remember words that are not stored in the cache. However, the researchers intend to address this limitation in future research by exploring methods to retrieve evicted tokens or enable the model to memorize previous conversations.

StreamingLLM has already been incorporated into NVIDIA’s large language model optimization library, TensorRT-LLM, further demonstrating its practicality and applicability. This groundbreaking work has received funding from various sources, including the MIT-IBM Watson AI Lab, the MIT Science Hub, and the U.S. National Science Foundation.
