Top Important LLM Papers for the Week from 06/05 to 12/05
Stay Updated with Recent Large Language Models Research
Large language models (LLMs) have advanced rapidly in recent years. As new generations of models are developed, researchers and engineers need to stay informed on the latest progress.
This article summarizes some of the most important LLM papers published during the second week of May 2024.
The papers cover various topics shaping the next generation of language models, from model optimization and scaling to reasoning, benchmarking, and enhancing performance.
Keeping up with novel LLM research across these domains will help guide continued progress toward models that are more capable, robust, and aligned with human values.
Table of Contents:
LLM Progress & Benchmarking
LLM Reasoning
LLM Training, Evaluation & Inference
Attention Models
My E-book: Data Science Portfolio for Success Is Out!
I recently published my first e-book, Data Science Portfolio for Success, a practical guide to building your data science portfolio. The book covers the following topics: the importance of having a portfolio as a data scientist, and how to build a data science portfolio that will land you a job.
1. LLM Progress & Benchmarking
1.1. AlphaMath Almost Zero: Process Supervision without Process
Recent advancements in large language models (LLMs) have substantially enhanced their mathematical reasoning abilities. However, these models still struggle with complex problems that require multiple reasoning steps, frequently leading to logical or numerical errors.
While numerical mistakes can largely be addressed by integrating a code interpreter, identifying logical errors within intermediate steps is more challenging. Moreover, manually annotating these steps for training is not only expensive but also demands specialized expertise.
In this study, we introduce an innovative approach that eliminates the need for manual annotation by leveraging the Monte Carlo Tree Search (MCTS) framework to generate both the process supervision and evaluation signals automatically.
Essentially, when an LLM is well pre-trained, only the mathematical questions and their final answers are required to generate our training data, without requiring the solutions.
We proceed to train a step-level value model designed to improve the LLM’s inference process in mathematical domains. Our experiments indicate that using automatically generated solutions by LLMs enhanced with MCTS significantly improves the model’s proficiency in dealing with intricate mathematical reasoning tasks.
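To make the idea concrete, below is a minimal sketch of how step-level value labels can be derived from final answers alone via Monte Carlo rollouts. The `sample_next_step` and `extract_answer` functions are hypothetical placeholders for the LLM and answer parser; this illustrates the general MCTS-style value estimation, not the authors' implementation.

```python
import random

# Hypothetical placeholders for the LLM step sampler and answer parser.
def sample_next_step(question, partial_steps):
    """An LLM would propose the next reasoning step here."""
    return f"step-{len(partial_steps) + 1}"

def extract_answer(steps):
    """An answer parser would read the final result from the last step."""
    return random.choice(["42", "7"])

def rollout_value(question, partial_steps, gold_answer, n_rollouts=8, max_depth=6):
    """Step-level value estimate: the fraction of sampled completions from this
    partial solution that reach the known final answer. Only the question and
    gold answer are needed -- no human-annotated intermediate steps."""
    hits = 0
    for _ in range(n_rollouts):
        steps = list(partial_steps)
        while len(steps) < max_depth:
            steps.append(sample_next_step(question, steps))
        hits += extract_answer(steps) == gold_answer
    return hits / n_rollouts

def build_training_data(question, gold_answer, depth=4):
    """Label each prefix of one sampled trajectory with its rollout value,
    yielding (partial_solution, value) pairs for a step-level value model."""
    steps, data = [], []
    for _ in range(depth):
        steps.append(sample_next_step(question, steps))
        data.append((list(steps), rollout_value(question, steps, gold_answer)))
    return data

print(build_training_data("What is 6 * 7?", "42"))
```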
1.2. Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond
General world models represent a crucial pathway toward achieving Artificial General Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual environments to decision-making systems.
Recently, the emergence of the Sora model has attracted significant attention due to its remarkable simulation capabilities, which exhibit an incipient comprehension of physical laws. In this survey, we embark on a comprehensive exploration of the latest advancements in world models.
Our analysis navigates through the forefront of generative methodologies in video generation, where world models stand as pivotal constructs facilitating the synthesis of highly realistic visual content.
Additionally, we scrutinize the burgeoning field of autonomous-driving world models, meticulously delineating their indispensable role in reshaping transportation and urban mobility.
Furthermore, we delve into the intricacies inherent in world models deployed within autonomous agents, shedding light on their profound significance in enabling intelligent interactions within dynamic environmental contexts.
At last, we examine the challenges and limitations of world models and discuss their potential future directions. We hope this survey can serve as a foundational reference for the research community and inspire continued innovation.
1.3. CLLMs: Consistency Large Language Models
Parallel decoding methods such as Jacobi decoding show promise for more efficient LLM inference, as they break the sequential nature of the LLM decoding process and transform it into parallelizable computation.
However, in practice, Jacobi decoding achieves little speedup compared to traditional autoregressive (AR) decoding, primarily because it seldom accurately predicts more than one token in a single fixed-point iteration step.
To address this, we develop a new approach aimed at realizing fast convergence from any state to the fixed point on a Jacobi trajectory. This is accomplished by refining the target LLM to consistently predict the fixed point given any state as input.
Extensive experiments demonstrate the effectiveness of our method, showing 2.4× to 3.4× improvements in generation speed while preserving generation quality across both domain-specific and open-domain benchmarks.
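For intuition, here is a minimal greedy Jacobi decoding loop. It assumes a HuggingFace-style causal LM that returns `.logits`; the fixed-point iteration shown is the standard Jacobi scheme, while CLLM training (not shown) fine-tunes the model so this loop converges in far fewer iterations.

```python
import torch

def jacobi_decode(model, prefix_ids, n_new, max_iters=32, pad_id=0):
    """Greedy Jacobi decoding: guess all n_new tokens at once, then refine
    the whole guess in parallel until it stops changing (a fixed point)."""
    guess = torch.full((1, n_new), pad_id, dtype=torch.long)
    for _ in range(max_iters):
        inp = torch.cat([prefix_ids, guess], dim=1)
        logits = model(inp).logits          # one parallel forward pass
        # Each guessed position is re-predicted from everything before it.
        new_guess = logits[:, prefix_ids.shape[1] - 1 : -1, :].argmax(dim=-1)
        if torch.equal(new_guess, guess):   # fixed point reached
            return guess
        guess = new_guess
    return guess
```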
1.4. MAmmoTH: Building Math Generalist Models Through Hybrid Instruction Tuning
We introduce MAmmoTH, a series of open-source large language models (LLMs) specifically tailored for general math problem-solving. The MAmmoTH models are trained on MathInstruct, our meticulously curated instruction tuning dataset.
MathInstruct is compiled from 13 math datasets with intermediate rationales, six of which have rationales newly curated by us. It presents a unique hybrid of chain-of-thought (CoT) and program-of-thought (PoT) rationales and also ensures extensive coverage of diverse fields in math. The hybrid of CoT and PoT not only unleashes the potential of tool use but also allows different thought processes for different math problems.
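As a toy illustration of the two rationale styles mixed in MathInstruct (this example is invented, not drawn from the dataset): a CoT rationale reasons in natural language, while a PoT rationale is executable code whose output supplies the answer.

```python
# Invented example contrasting CoT and PoT rationale styles.
question = "A train travels 60 km/h for 2.5 hours. How far does it go?"

# Chain-of-thought (CoT): natural-language reasoning, answered in text.
cot_rationale = (
    "Distance equals speed times time. "
    "60 km/h * 2.5 h = 150 km. The answer is 150 km."
)

# Program-of-thought (PoT): the rationale is executable code, so the final
# answer comes from running it rather than from the model's own arithmetic.
pot_rationale = """
speed = 60        # km/h
time = 2.5        # hours
answer = speed * time
"""
scope = {}
exec(pot_rationale, scope)
print(cot_rationale)
print("PoT answer:", scope["answer"])  # -> 150.0
```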
As a result, the MAmmoTH series substantially outperforms existing open-source models on nine mathematical reasoning datasets across all scales with an average accuracy gain between 16% and 32%.
Remarkably, our MAmmoTH-7B model reaches 33% on MATH (a competition-level dataset), which exceeds the best open-source 7B model (WizardMath) by 23%, and the MAmmoTH-34B model achieves 44% accuracy on MATH, even surpassing GPT-4’s CoT result.
Our work underscores the importance of diverse problem coverage and the use of hybrid rationales in developing superior math generalist models.
1.5. DrEureka: Language Model Guided Sim-To-Real Transfer
Transferring policies learned in simulation to the real world is a promising strategy for acquiring robot skills at scale. However, sim-to-real approaches typically rely on manual design and tuning of the task reward function as well as the simulation physics parameters, rendering the process slow and human labor-intensive.
In this paper, we investigate using Large Language Models (LLMs) to automate and accelerate sim-to-real design. Our LLM-guided sim-to-real approach requires only the physics simulation for the target task and automatically constructs suitable reward functions and domain randomization distributions to support real-world transfer.
We first demonstrate our approach can discover sim-to-real configurations that are competitive with existing human-designed ones on quadruped locomotion and dexterous manipulation tasks.
Then, we showcase that our approach is capable of solving novel robot tasks, such as quadruped balancing and walking atop a yoga ball, without iterative manual design.
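As a rough sketch of what LLM-proposed domain randomization can look like: the model outputs parameter ranges, and each simulated episode samples physics parameters from them. The parameter names and schema below are hypothetical, not DrEureka's actual output format.

```python
import random

# Hypothetical shape of an LLM-proposed domain-randomization config for a
# quadruped task: parameter ranges that each simulation episode samples from.
llm_proposed_ranges = {
    "friction":             (0.4, 1.2),
    "ground_restitution":   (0.0, 0.3),
    "payload_mass_kg":      (0.0, 3.0),
    "motor_strength_scale": (0.8, 1.2),
}

def sample_physics(ranges):
    """Draw one randomized simulation configuration for sim-to-real training."""
    return {name: random.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

for episode in range(3):
    print(f"episode {episode}:", sample_physics(llm_proposed_ranges))
```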
2. LLM Reasoning
2.1. THOUGHTSCULPT: Reasoning with Intermediate Revision and Search
We present THOUGHTSCULPT, a general reasoning and search method for tasks with outputs that can be decomposed into components. THOUGHTSCULPT explores a search tree of potential solutions using Monte Carlo Tree Search (MCTS), building solutions one action at a time and evaluating according to any domain-specific heuristic, which in practice is often simply an LLM evaluator.
Critically, our action space includes revision actions: THOUGHTSCULPT may choose to revise part of its previous output rather than continue to build the rest of its output.
Empirically, THOUGHTSCULPT outperforms state-of-the-art reasoning methods across three challenging tasks: Story Outline Improvement (up to +30% interestingness), mini-crossword solving (up to +16% word success rate), and Constrained Generation (up to +10% concept coverage).
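To illustrate the key idea of an action space that includes revision, here is a toy search loop where each expansion proposes either appending a new component or revising an earlier one. The proposal and heuristic functions are placeholders for the LLM calls the paper uses, and a greedy loop stands in for full MCTS.

```python
import random

def propose_component(solution):
    """Placeholder for an LLM proposing the next (or a revised) component."""
    return random.randint(0, 9)

def heuristic(solution):
    """Placeholder for the LLM evaluator scoring a candidate solution."""
    return sum(solution)

def expand(solution):
    """Generate children: one CONTINUE action plus one REVISE per position."""
    children = [solution + [propose_component(solution)]]  # continue building
    for i in range(len(solution)):                          # revise step i
        revised = list(solution)
        revised[i] = propose_component(solution)
        children.append(revised)
    return children

# Greedy best-first loop standing in for the full MCTS search.
best = []
for _ in range(50):
    best = max(expand(best), key=heuristic)
print("best solution found:", best, "score:", heuristic(best))
```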
3. LLM Training, Evaluation & Inference
3.1. Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key-value representations for past tokens, whose size scales linearly with the input sequence length and batch size.
As a solution, we propose Dynamic Memory Compression (DMC), a method for online key-value cache compression at inference time. Most importantly, the model learns to apply different compression rates in different heads and layers.
We retrofit pre-trained LLMs such as Llama 2 (7B, 13B, and 70B) into DMC Transformers, achieving up to ~3.7x throughput increase in auto-regressive inference on an NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters.
We find that DMC preserves the original downstream performance with up to 4x cache compression, outperforming up-trained grouped-query attention (GQA). GQA and DMC can even be combined to obtain compounded gains. As a result, DMC fits longer contexts and larger batches within any given memory budget.
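A toy sketch of the cache-update decision at the heart of DMC: each new key/value pair is either appended to the cache or merged into the most recent slot, so the cache grows sub-linearly. In DMC the append-vs-merge decision is learned per head and layer; here it is passed in as a score, and simple averaging stands in for the paper's weighted accumulation.

```python
import torch

def dmc_cache_update(k_cache, v_cache, k_new, v_new, append_score):
    """Toy DMC-style cache update: append the new key/value pair, or merge
    it into the last slot so the cache does not grow for this token."""
    if append_score > 0.5 or k_cache.shape[0] == 0:
        k_cache = torch.cat([k_cache, k_new[None]], dim=0)
        v_cache = torch.cat([v_cache, v_new[None]], dim=0)
    else:
        # Merge into the most recent slot instead of growing the cache.
        k_cache[-1] = 0.5 * (k_cache[-1] + k_new)
        v_cache[-1] = 0.5 * (v_cache[-1] + v_new)
    return k_cache, v_cache

d = 8
k_cache, v_cache = torch.empty(0, d), torch.empty(0, d)
for t in range(10):
    k_cache, v_cache = dmc_cache_update(
        k_cache, v_cache, torch.randn(d), torch.randn(d),
        append_score=0.9 if t % 2 == 0 else 0.3)
print("tokens seen: 10, cache slots:", k_cache.shape[0])  # fewer than 10
```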
4. Attention Models
4.1. xLSTM: Extended Long Short-Term Memory
In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories, in particular, they constituted the first Large Language Models (LLMs).
However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs?
Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, and (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule.
Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling.
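For readers who want the mechanics, below is a simplified single step of the mLSTM cell: exponential gates, a matrix memory updated with an outer-product (covariance) rule, and a normalized retrieval. The stabilization terms from the paper are omitted for brevity, so treat this as a sketch rather than the full formulation.

```python
import torch

def mlstm_step(C, n, q, k, v, f_gate, i_gate):
    """One simplified mLSTM step: exponential gating, a matrix memory C
    updated with a covariance (outer-product) rule, and a retrieval
    normalized by the state n (stabilization terms omitted)."""
    f = torch.exp(f_gate)                 # exponential forget gate
    i = torch.exp(i_gate)                 # exponential input gate
    C = f * C + i * torch.outer(v, k)     # covariance update rule
    n = f * n + i * k                     # normalizer state
    h = C @ q / torch.clamp(torch.abs(n @ q), min=1.0)  # normalized retrieval
    return C, n, h

d = 4
C, n = torch.zeros(d, d), torch.zeros(d)
for _ in range(3):
    q, k, v = torch.randn(d), torch.randn(d), torch.randn(d)
    C, n, h = mlstm_step(C, n, q, k, v,
                         f_gate=torch.tensor(0.0), i_gate=torch.tensor(0.0))
print("hidden state:", h)
```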
4.2. Is Flash Attention Stable?
Training large-scale machine learning models poses distinct system challenges, given both the size and complexity of today’s workloads. Recently, many organizations training state-of-the-art Generative AI models have reported cases of instability during training, often taking the form of loss spikes.
Numeric deviation has emerged as a potential cause of this training instability, although quantifying this is especially challenging given the costly nature of training runs. In this work, we develop a principled approach to understanding the effects of numeric deviation and construct proxies to put observations into context when downstream effects are difficult to quantify.
As a case study, we apply this framework to analyze the widely adopted Flash Attention optimization. We find that Flash Attention sees roughly an order of magnitude more numeric deviation than Baseline Attention at BF16 when measured during an isolated forward pass.
We then use a data-driven analysis based on the Wasserstein Distance to provide upper bounds on how this numeric deviation impacts model weights during training, finding that the numerical deviation present in Flash Attention is 2–5 times less significant than in low-precision training.
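The measurement idea is easy to reproduce in miniature: run the same attention computation in low precision and in FP64, then compare outputs. The snippet below applies this to baseline scaled-dot-product attention at BF16 (not the Flash Attention kernel itself), in the spirit of the paper's isolated forward-pass test.

```python
import torch

def attention(q, k, v):
    """Plain scaled dot-product attention (the paper's 'Baseline')."""
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
q, k, v = (torch.randn(8, 128, 64, dtype=torch.float64) for _ in range(3))

ref = attention(q, k, v)                                 # FP64 "golden" output
low = attention(q.bfloat16(), k.bfloat16(), v.bfloat16()).double()

print("max abs deviation vs FP64:", (ref - low).abs().max().item())
```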
Are you looking to start a career in data science and AI but do not know how? I offer data science mentoring sessions and long-term career mentoring:
Mentoring sessions: https://lnkd.in/dXeg3KPW
Long-term mentoring: https://lnkd.in/dtdUYBrM