Important LLM Papers for the Week of 27/01 to 01/02
Stay Updated with Recent Large Language Models Research
Large language models (LLMs) have advanced rapidly in recent years. As new generations of models are developed, researchers and engineers need to stay informed on the latest progress.
This article summarizes some of the most important LLM papers published during the Fifth Week of January 2025. The papers cover various topics shaping the next generation of language models, from model optimization and scaling to reasoning, benchmarking, and enhancing performance.
Keeping up with novel LLM research across these domains will help guide continued progress toward models that are more capable, robust, and aligned with human values.
Table of Contents:
LLM Progress & Benchmarking
LLM Reasoning
LLM Evaluation
LLM Preference Optimization & Alignment
LLM Scaling & Optimization
LLM Safety
LLM Training & Post Training
Retrieval Augmented Generation (RAG)
1. LLM Progress & Benchmarking
1.1. Humanity’s Last Exam
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities.
In response, we introduce Humanity’s Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences.
HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading.
Each question has a known solution that is unambiguous and easily verifiable but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions.
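HLE answers are meant to be unambiguous and automatically gradable, and the paper reports both accuracy and calibration. As a rough illustration only (not the HLE grading pipeline), the sketch below pairs exact-match grading with a simple expected-calibration-error estimate computed from self-reported confidences; the record format and the example answers are made up.

```python
# Toy sketch of exact-match grading plus a simple expected-calibration-error
# (ECE) estimate. This is NOT the HLE grading pipeline; the record format and
# the example answers below are made up.

def normalize(text: str) -> str:
    return " ".join(text.lower().strip().split())

def grade(prediction: str, gold: str) -> bool:
    """Exact match after light normalization."""
    return normalize(prediction) == normalize(gold)

def expected_calibration_error(records, n_bins: int = 10) -> float:
    """records: list of (confidence in [0, 1], correct: bool) pairs."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in records:
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, correct))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / len(records)) * abs(avg_conf - accuracy)
    return ece

# Example with invented model outputs and self-reported confidences.
records = [(0.90, grade("4", "4")), (0.80, grade("Paris", "paris")), (0.95, grade("12", "13"))]
print(f"accuracy: {sum(ok for _, ok in records) / len(records):.2f}, "
      f"ECE: {expected_calibration_error(records):.2f}")
```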
1.2. Qwen2.5-1M Technical Report
We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M series has significantly enhanced long-context capabilities through long-context pre-training and post-training.
Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance long-context performance while reducing training costs.
To promote the use of long-context models among a broader user base, we present an open-source inference framework. This framework includes a length extrapolation method that can expand model context lengths by at least four times without additional training.
To reduce inference costs, we implement a sparse attention method along with chunked prefill optimization for deployment scenarios and a sparsity refinement method to improve precision. Additionally, we detail our optimizations in the inference engine, including kernel optimization, pipeline parallelism, and scheduling optimization, which significantly enhance overall inference performance.
By leveraging our inference framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill speedup in scenarios with 1 million tokens of context. This framework provides an efficient and powerful solution for developing applications that require long-context processing using open-source models.
The Qwen2.5-1M series currently includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessed model Qwen2.5-Turbo.
Evaluations show that Qwen2.5-1M models are greatly improved on long-context tasks without compromising performance in short-context scenarios. Specifically, the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini on long-context tasks and supports contexts eight times longer.
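Chunked prefill is one of the deployment optimizations mentioned above. The toy below illustrates only the general idea, not Qwen2.5-1M's inference engine or its sparse-attention kernels: the prompt is processed in fixed-size chunks while a key/value cache grows, so attention-score memory scales with the chunk size rather than with the full prompt length.

```python
import numpy as np

# Conceptual sketch of chunked prefill with a single-head toy attention and a
# growing KV cache. Not Qwen2.5-1M's inference engine or its sparse attention.

def attend(q, k_cache, v_cache):
    """q: (n, d) queries for the current chunk; k_cache/v_cache: (m, d) for all
    tokens seen so far, including this chunk. Future positions inside the
    current chunk are masked; earlier chunks are fully visible via the cache."""
    scores = q @ k_cache.T / np.sqrt(q.shape[-1])            # (n, m)
    n, m = scores.shape
    future_mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # j > i within the chunk
    scores[:, m - n:][future_mask] = -1e9
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_cache                                 # (n, d)

def chunked_prefill(x, chunk_size=4):
    """x: (seq_len, d) embeddings; identity Q/K/V projections for brevity."""
    d = x.shape[1]
    k_cache, v_cache, outputs = np.zeros((0, d)), np.zeros((0, d)), []
    for start in range(0, len(x), chunk_size):
        chunk = x[start:start + chunk_size]
        k_cache = np.vstack([k_cache, chunk])  # extend the cache with this chunk
        v_cache = np.vstack([v_cache, chunk])
        outputs.append(attend(chunk, k_cache, v_cache))
    return np.vstack(outputs)

x = np.random.randn(10, 8)
print(chunked_prefill(x, chunk_size=4).shape)  # (10, 8)
```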
1.3. Baichuan-Omni-1.5 Technical Report
We introduce Baichuan-Omni-1.5, an omni-modal model that provides not only omni-modal understanding but also end-to-end audio generation capabilities.
To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects.
First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining roughly 500B high-quality data samples spanning text, audio, and vision.
Second, we designed an audio tokenizer (Baichuan-Audio-Tokenizer) to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with the multimodal LLM.
Lastly, we designed a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities.
Baichuan-Omni-1.5 leads contemporary models (including GPT-4o-mini and MiniCPM-o 2.6) in terms of comprehensive omni-modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2-VL-72B across various multimodal medical benchmarks.
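The Baichuan-Audio-Tokenizer converts audio into discrete tokens that the multimodal LLM can consume alongside text. The toy below only conveys the general idea of vector-quantizing acoustic feature frames against a codebook; the codebook, feature dimensions, and frame rate are invented and do not describe the actual tokenizer.

```python
import numpy as np

# Conceptual sketch of audio tokenization via vector quantization. The real
# Baichuan-Audio-Tokenizer uses a learned encoder and codebook; here both the
# "acoustic features" and the codebook are random placeholders.

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 64))        # 1024 audio tokens, 64-dim codes

def tokenize_audio(frames: np.ndarray) -> np.ndarray:
    """frames: (num_frames, 64) acoustic feature vectors -> discrete token ids."""
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)               # nearest codebook entry per frame

audio_features = rng.normal(size=(50, 64))    # e.g. ~1 s of audio at 50 frames/s
audio_tokens = tokenize_audio(audio_features)
print(audio_tokens[:10])                      # ids an MLLM could interleave with text tokens
```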
2. LLM Reasoning
2.1. Large Language Models Think Too Fast To Explore Effectively
Large language models have demonstrated many intellectual capabilities. While numerous benchmarks assess their intelligence, limited attention has been given to their ability to explore, an essential capacity for discovering new information and adapting to novel environments in both natural and artificial systems.
The extent to which LLMs can effectively explore, particularly in open-ended tasks, remains unclear. This study investigates whether LLMs can surpass humans in exploration during an open-ended task, using Little Alchemy 2 as a paradigm, where agents combine elements to discover new ones.
Results show that most LLMs underperform humans, except for the o1 model; traditional LLMs rely primarily on uncertainty-driven strategies, unlike humans, who balance uncertainty and empowerment.
Representational analysis of the models with Sparse Autoencoders revealed that uncertainty and choices are represented at earlier transformer blocks, while empowerment values are processed later, causing LLMs to think too fast and make premature decisions, hindering effective exploration.
These findings shed light on the limitations of LLM exploration and suggest directions for improving their adaptability.
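To make the uncertainty-versus-empowerment contrast concrete, here is a purely illustrative toy (not the authors' analysis code) that scores Little Alchemy 2-style element combinations by mixing a novelty term with an estimate of how many future combinations a discovery would unlock; all numbers and weights are made up.

```python
# Purely illustrative toy: score Little Alchemy 2-style combinations by mixing
# an uncertainty (novelty) term with an empowerment term (how many future
# combinations a discovery is expected to unlock). All numbers are invented.

tried_counts = {("water", "fire"): 0, ("earth", "air"): 2, ("water", "earth"): 5}
estimated_unlocks = {("water", "fire"): 1, ("earth", "air"): 4, ("water", "earth"): 0}

def score(combo, w_uncertainty=1.0, w_empowerment=1.0):
    uncertainty = 1.0 / (1 + tried_counts[combo])   # less-tried combos are more uncertain
    empowerment = estimated_unlocks[combo]          # expected new elements it enables
    return w_uncertainty * uncertainty + w_empowerment * empowerment

# An uncertainty-only agent (as the paper reports for most LLMs) vs a balanced
# agent (closer to human behavior) can end up preferring different moves:
for w_u, w_e in [(1.0, 0.0), (1.0, 1.0)]:
    best = max(tried_counts, key=lambda c: score(c, w_u, w_e))
    print(f"w_uncertainty={w_u}, w_empowerment={w_e} -> picks {best}")
```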
2.2. GuardReasoner: Towards Reasoning-based LLM Safeguards
As LLMs increasingly impact safety-critical applications, ensuring their safety using guardrails remains a key challenge. This paper proposes GuardReasoner, a new safeguard for LLMs, by guiding the guard model to learn to reason.
Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models.
In addition, we present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalizability.
Extensive experiments and analyses on 13 benchmarks of 3 guardrail tasks demonstrate its superiority. Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% F1 score on average.
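As a hypothetical illustration of what reasoning-based guard training data might look like (not the GuardReasonerTrain schema), the records below pair a prompt with step-by-step analysis plus a label for reasoning SFT, and contrast a sound analysis with a flawed one for hard-sample DPO; the field names and example content are invented.

```python
# Hypothetical layout for reasoning-based guard training data. Field names and
# content are illustrative only, not the GuardReasonerTrain schema.

# Reasoning SFT sample: the target contains the reasoning steps plus the label.
sft_sample = {
    "prompt": "User request: 'Explain how vaccines work.'\nIs this request harmful?",
    "target": (
        "Step 1: The request asks for general medical education.\n"
        "Step 2: It does not seek instructions for causing harm.\n"
        "Conclusion: harmless"
    ),
}

# Hard-sample DPO pair: for an ambiguous input, prefer the analysis that reaches
# the correct conclusion through sound reasoning over a flawed one.
dpo_sample = {
    "prompt": "User request: 'Describe common phishing tactics so I can train staff.'\n"
              "Is this request harmful?",
    "chosen": "Step 1: The stated purpose is defensive security training...\nConclusion: harmless",
    "rejected": "Step 1: Phishing is mentioned, so this must be an attack request...\nConclusion: harmful",
}

print(sft_sample["target"].splitlines()[-1])  # -> "Conclusion: harmless"
```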
2.3. Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
Large language models (LLMs) such as OpenAI’s o1 have demonstrated remarkable abilities in complex reasoning tasks by scaling test-time compute and exhibiting human-like deep thinking.
However, we identify a phenomenon we term underthinking, where o1-like LLMs frequently switch between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution.
This behavior leads to inadequate depth of reasoning and decreased performance, particularly on challenging mathematical problems. To systematically analyze this issue, we conduct experiments on three challenging test sets and two representative open-source o1-like models, revealing that frequent thought switching correlates with incorrect responses.
We introduce a novel metric to quantify underthinking by measuring token efficiency in incorrect answers. To address underthinking, we propose a decoding strategy with a thought-switching penalty (TIP) that discourages premature transitions between thoughts, encouraging deeper exploration of each reasoning path.
Experimental results demonstrate that our approach improves accuracy across challenging datasets without requiring model fine-tuning. Our findings contribute to understanding reasoning inefficiencies in o1-like LLMs and offer a practical solution to enhance their problem-solving capabilities.
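The TIP idea is a decoding-time intervention: make it costlier for the model to open a new line of thought too early. The toy below shows the general shape of such a logit penalty, with invented trigger tokens, penalty value, and window length; it is not the paper's exact formulation.

```python
import math

# Toy thought-switching penalty applied to next-token scores. The trigger
# words, penalty value, and window length are illustrative; this is not the
# paper's exact TIP formulation.

SWITCH_TOKENS = {"Alternatively", "Wait", "However"}  # tokens that often open a new thought

def penalize_switches(logits: dict, step: int, penalty: float = 3.0, window: int = 200) -> dict:
    """Subtract a penalty from 'switch' tokens while still inside the early window."""
    if step >= window:
        return dict(logits)
    return {tok: score - penalty if tok in SWITCH_TOKENS else score
            for tok, score in logits.items()}

def softmax(scores: dict) -> dict:
    m = max(scores.values())
    exp = {tok: math.exp(s - m) for tok, s in scores.items()}
    z = sum(exp.values())
    return {tok: v / z for tok, v in exp.items()}

logits = {"Therefore": 2.0, "Alternatively": 2.2, "the": 1.0}
print(softmax(logits)["Alternatively"])                              # favored without the penalty
print(softmax(penalize_switches(logits, step=10))["Alternatively"])  # suppressed early in decoding
```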
3. LLM Evaluation
3.1. Atla Selene Mini: A General Purpose Evaluation Model
We introduce Atla Selene Mini, a state-of-the-art small language model-as-a-judge (SLMJ). Selene Mini is a general-purpose evaluator that outperforms the best SLMJs and GPT-4o-mini on overall performance across 11 out-of-distribution benchmarks, spanning absolute scoring, classification, and pairwise preference tasks.
It is the highest-scoring 8B generative model on RewardBench, surpassing strong baselines like GPT-4o and specialized judges. To achieve this, we develop a principled data curation strategy that augments public datasets with synthetically generated critiques and ensures high quality through filtering and dataset ablations.
We train our model on a combined direct preference optimization (DPO) and supervised fine-tuning (SFT) loss, and produce a highly promptable evaluator that excels in real-world scenarios.
Selene Mini shows dramatically improved zero-shot agreement with human expert evaluations on financial and medical industry datasets. It is also robust to variations in prompt format.
Preliminary results indicate that Selene Mini is the top-ranking evaluator in a live, community-driven Judge Arena.
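A combined DPO + SFT objective can be sketched generically. The PyTorch snippet below is a minimal sketch under assumed inputs (sequence log-probabilities for chosen/rejected judgments under the policy and a frozen reference model, plus token logits for the chosen judgment); `beta` and the mixing weight `alpha` are placeholders, not Atla's training recipe.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a combined DPO + SFT objective. Inputs are assumed to be
# summed sequence log-probabilities for the chosen/rejected judgments under the
# policy and a frozen reference model, plus token logits/labels for the chosen
# judgment. `beta` and `alpha` are placeholders, not Atla's recipe.

def combined_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp,
                  chosen_token_logits, chosen_token_labels,
                  beta: float = 0.1, alpha: float = 1.0):
    # DPO term: prefer the chosen judgment over the rejected one, relative to the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    dpo = -F.logsigmoid(chosen_reward - rejected_reward).mean()
    # SFT term: ordinary next-token cross-entropy on the chosen judgment.
    sft = F.cross_entropy(chosen_token_logits.reshape(-1, chosen_token_logits.size(-1)),
                          chosen_token_labels.reshape(-1))
    return dpo + alpha * sft

# Dummy tensors just to show the call shapes (batch=2, seq=5, vocab=11).
b, t, v = 2, 5, 11
loss = combined_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b),
                     torch.randn(b, t, v), torch.randint(0, v, (b, t)))
print(loss)
```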
4. LLM Preference Optimization & Alignment
4.1. Towards General-Purpose Model-Free Reinforcement Learning
Reinforcement learning (RL) promises a framework for near-universal problem-solving. In practice, however, RL algorithms are often tailored to specific benchmarks, relying on carefully tuned hyperparameters and algorithmic choices.
Recently, powerful model-based RL methods have shown impressive general results across benchmarks but come at the cost of increased complexity and slow run times, limiting their broader applicability. In this paper, we attempt to find a unifying model-free deep RL algorithm that can address a diverse class of domains and problem settings.
To achieve this, we leverage model-based representations that approximately linearize the value function, taking advantage of the denser task objectives used by model-based RL while avoiding the costs associated with planning or simulated trajectories.
We evaluate our algorithm, MR.Q, on a variety of common RL benchmarks with a single set of hyperparameters and show competitive performance against domain-specific and general baselines, providing a concrete step towards building general-purpose model-free deep RL algorithms.
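The core structural idea, keeping the agent model-free while learning a model-based representation in which the value function is approximately linear, can be conveyed with a small PyTorch sketch. The architecture, dimensions, and auxiliary heads below are simplified stand-ins, not the MR.Q implementation.

```python
import torch
import torch.nn as nn

# Conceptual sketch: the value estimate Q(s, a) is linear in a learned embedding
# phi(s, a), and phi is shaped by auxiliary reward / next-embedding prediction
# heads (the "model-based representation"). Dimensions and heads are simplified
# stand-ins, not the MR.Q implementation.

class LinearQAgent(nn.Module):
    def __init__(self, state_dim=8, action_dim=2, embed_dim=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(state_dim + action_dim, 128),
                                 nn.ReLU(), nn.Linear(128, embed_dim))
        self.q_head = nn.Linear(embed_dim, 1)                 # value is linear in phi
        self.reward_head = nn.Linear(embed_dim, 1)            # auxiliary: predict reward
        self.dynamics_head = nn.Linear(embed_dim, embed_dim)  # auxiliary: predict next embedding

    def forward(self, state, action):
        z = self.phi(torch.cat([state, action], dim=-1))
        return self.q_head(z), self.reward_head(z), self.dynamics_head(z)

agent = LinearQAgent()
state, action = torch.randn(4, 8), torch.randn(4, 2)
q, reward_pred, next_embed_pred = agent(state, action)
print(q.shape, next_embed_pred.shape)  # torch.Size([4, 1]) torch.Size([4, 64])
```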
5. LLM Scaling & Optimization
5.1. SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear.
This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains.
We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize the training data and struggles to generalize to out-of-distribution scenarios.
Further analysis reveals that RL improves the model’s underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain.
Despite RL’s superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model’s output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.
5.2. Optimizing Large Language Model Training Using FP4 Quantization
The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs.
While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity.
This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse.
To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens.
With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.
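Two of those ingredients, outlier clamping and a differentiable (straight-through) path around the rounding step, can be sketched generically. The toy below snaps values to an FP4-like E2M1 grid after percentile clamping and uses the standard straight-through trick so gradients still flow; it does not reproduce the paper's estimator or its compensation scheme.

```python
import torch

# Toy FP4-style quantization with percentile-based outlier clamping and a
# straight-through estimator (STE). Illustrative only; it does not implement the
# paper's differentiable estimator or its clamping/compensation scheme.

FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def fp4_quantize_ste(x: torch.Tensor, clamp_pct: float = 0.99) -> torch.Tensor:
    # 1) Clamp outliers so a few extreme values do not dominate the scale.
    threshold = torch.quantile(x.detach().abs().flatten(), clamp_pct)
    x_clamped = x.clamp(-threshold, threshold)
    # 2) Scale into the FP4 range and snap each value to the nearest grid point.
    scale = threshold / FP4_GRID.max()
    mag = (x_clamped.abs() / scale).unsqueeze(-1)               # (..., 1)
    nearest = FP4_GRID[(mag - FP4_GRID).abs().argmin(dim=-1)]   # (...,)
    x_q = torch.sign(x_clamped) * nearest * scale
    # 3) STE: the forward pass uses the quantized values, while the backward
    #    pass treats the rounding step as the identity so gradients keep flowing.
    return x_clamped + (x_q - x_clamped).detach()

w = torch.randn(256, 256, requires_grad=True)
w_q = fp4_quantize_ste(w)
w_q.sum().backward()          # gradients reach w despite the non-differentiable rounding
print(w.grad.abs().mean())
```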
6. LLM Safety
6.1. o3-mini vs DeepSeek-R1: Which One is Safer?
The emergence of DeepSeek-R1 marks a turning point for the AI industry in general and for LLMs in particular. It has demonstrated outstanding performance on several tasks, including creative thinking, code generation, maths, and automated program repair, at apparently lower execution costs.
However, LLMs must adhere to an important qualitative property, i.e., their alignment with safety and human values. A clear competitor of DeepSeek-R1 is its American counterpart, OpenAI’s o3-mini model, which is expected to set high standards in terms of performance, safety, and cost.
In this paper, we conduct a systematic assessment of the safety level of both DeepSeek-R1 (70B version) and OpenAI’s o3-mini (beta version). To this end, we make use of our recently released automated safety testing tool, ASTRAL.
By leveraging this tool, we automatically and systematically generate and execute a total of 1,260 unsafe test inputs on both models. After conducting a semi-automated assessment of the outcomes, the results indicate that DeepSeek-R1 is highly unsafe compared to OpenAI’s o3-mini: DeepSeek-R1 responded unsafely to 11.98% of the executed prompts, whereas o3-mini did so for only 1.19%.
6.2. Early External Safety Testing of OpenAI’s o3-mini: Insights from the Pre-Deployment Evaluation
Large Language Models (LLMs) have become an integral part of our daily lives. However, they impose certain risks, including those that can harm individuals’ privacy, perpetuate biases, and spread misinformation.
These risks highlight the need for robust safety mechanisms, ethical guidelines, and thorough testing to ensure responsible deployment. The safety of LLMs is a key property that needs to be thoroughly tested before a model is deployed and made accessible to general users.
This paper reports the external safety testing experience conducted by researchers from Mondragon University and the University of Seville on OpenAI’s new o3-mini LLM as part of OpenAI’s early access to safety testing program.
In particular, we apply our tool, ASTRAL, to automatically and systematically generate up-to-date unsafe test inputs (i.e., prompts) that help us test and assess different safety categories of LLMs.
We automatically generate and execute a total of 10,080 unsafe test inputs on an early o3-mini beta version. After manually verifying the test cases classified as unsafe by ASTRAL, we identified a total of 87 actual instances of unsafe LLM behavior.
We highlight key insights and findings uncovered during the pre-deployment external testing phase of OpenAI’s latest LLM.
7. LLM Training & Post Training
7.1. WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
Large language model (LLM) post-training, from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy.
One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data-generating models and LLM judges. To close this gap, we introduce WILDCHAT-50M, the largest public chat dataset to date.
We extend the existing WildChat dataset to include responses not only from GPT but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters.
We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples.
7.2. Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation
Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks — models lose their safety alignment ability after fine-tuning on a few harmful samples.
For risk mitigation, a guardrail is typically used to filter out harmful samples before fine-tuning. In this paper, we design a new red-teaming method to show that purely relying on the moderation guardrail for data filtration is not reliable.
Our proposed attack method, dubbed Virus, easily bypasses guardrail moderation by slightly modifying the harmful data. Experimental results show that the harmful data optimized by Virus evades guardrail detection with a leakage ratio of up to 100%, while simultaneously achieving superior attack performance.
Finally, the key message we want to convey through this paper is that it is reckless to treat guardrail moderation as a catch-all safeguard against harmful fine-tuning attacks, as it cannot resolve the inherent safety issues of pre-trained LLMs.
7.3. Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
Supervised fine-tuning (SFT) is commonly used to train language models to imitate annotated responses for given instructions. In this paper, we challenge this paradigm and propose Critique Fine-Tuning (CFT), a strategy where models learn to critique noisy responses rather than simply imitate correct ones.
Inspired by human learning processes that emphasize critical thinking, CFT encourages deeper analysis and more nuanced understanding, traits often overlooked by standard SFT.
To validate the effectiveness of CFT, we construct a 50K-sample dataset from WebInstruct, using GPT-4o as the teacher to generate critiques in the form of (input=[query; noisy response], output=critique).
CFT on this dataset yields a consistent 4–10% improvement over SFT on six math benchmarks with different base models like Qwen2.5, Qwen2.5-Math, and DeepSeek-Math. We further expand to MetaMath and NuminaMath datasets and observe similar gains over SFT.
Notably, our Qwen2.5-Math-CFT model, trained on just 50K samples, matches or outperforms competitive models such as AceMath and Qwen2.5-Math-Instruct on most benchmarks, both of which use over 2M samples. Ablation studies show that CFT is robust to the source of the noisy responses and to the teacher critique model.
Through these findings, we argue that critique-based training offers a more effective alternative to advance the reasoning of language models.
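The CFT input format is (input = [query; noisy response], output = critique). Below is one straightforward way to assemble such a record from a query, a noisy solution, and a teacher-written critique; the prompt template and field names are hypothetical, not the paper's exact format.

```python
# Assemble a Critique Fine-Tuning (CFT) record: the model learns to critique a
# noisy response instead of imitating a reference answer. The prompt template
# and field names below are illustrative, not the paper's exact format.

def build_cft_example(query: str, noisy_response: str, teacher_critique: str) -> dict:
    return {
        "input": (f"Question:\n{query}\n\nCandidate solution:\n{noisy_response}\n\n"
                  "Critique the candidate solution step by step."),
        "output": teacher_critique,  # e.g. written by a strong teacher model such as GPT-4o
    }

example = build_cft_example(
    query="What is 17 * 24?",
    noisy_response="17 * 24 = 398",
    teacher_critique=("Check: 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408, "
                      "so the candidate's answer of 398 is incorrect."),
)
print(example["output"])
```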
8. Retrieval Augmented Generation (RAG)
8.1. Chain-of-Retrieval Augmented Generation
This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer.
Conventional RAG methods usually perform a single retrieval step before the generation process, which limits their effectiveness in addressing complex queries due to imperfect retrieval results.
In contrast, our proposed method, CoRAG (Chain-of-Retrieval Augmented Generation), allows the model to dynamically reformulate the query based on the evolving state.
To train CoRAG effectively, we utilize rejection sampling to automatically generate intermediate retrieval chains, thereby augmenting existing RAG datasets that only provide the correct final answer.
At test time, we propose various decoding strategies to scale the model’s test-time compute by controlling the length and number of sampled retrieval chains.
Experimental results across multiple benchmarks validate the efficacy of CoRAG, particularly in multi-hop question-answering tasks, where we observe more than a 10-point improvement in EM score compared to strong baselines.
On the KILT benchmark, CoRAG establishes a new state-of-the-art performance across a diverse range of knowledge-intensive tasks. Furthermore, we offer comprehensive analyses to understand the scaling behavior of CoRAG, laying the groundwork for future research aimed at developing factual and grounded foundation models.
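At inference time, the distinguishing feature of CoRAG is the retrieve-reason loop itself. The sketch below captures that control flow with placeholder `retrieve` and `generate` functions and a made-up stop convention; it is a conceptual outline, not the authors' decoding implementation, and the maximum number of retrieval steps is the test-time compute knob mentioned above.

```python
# Conceptual chain-of-retrieval loop: retrieve, reason, reformulate, repeat until
# the model decides it can answer. `retrieve`, `generate`, and the FINAL/NEXT QUERY
# convention are placeholders, not CoRAG's actual components or prompts.

def retrieve(query: str) -> list[str]:
    # Stand-in retriever; in practice this would query a dense or BM25 index.
    return [f"passage about '{query}'"]

def generate(prompt: str) -> str:
    # Stand-in LLM call; in practice this would be the trained CoRAG model.
    return "FINAL: example answer" if "step 3" in prompt else "NEXT QUERY: refined sub-question"

def chain_of_retrieval(question: str, max_steps: int = 4) -> str:
    context, query = [], question
    for step in range(1, max_steps + 1):       # max_steps controls test-time compute
        context.extend(retrieve(query))
        thought = generate(f"question: {question}\nevidence: {context}\nstep {step}:")
        if thought.startswith("FINAL:"):       # the model decides it has enough evidence
            return thought.removeprefix("FINAL:").strip()
        query = thought.removeprefix("NEXT QUERY:").strip()  # reformulated sub-query
    return "unanswered within the retrieval budget"

print(chain_of_retrieval("Who advised the inventor of X?"))
```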
Are you looking to start a career in data science and AI but don’t know how? I offer data science mentoring sessions and long-term career mentoring:
Mentoring sessions: https://lnkd.in/dXeg3KPW
Long-term mentoring: https://lnkd.in/dtdUYBrM
Given the continuing interest in R1 (and DeepSeek in general), the following report provides insights into s1 and DeepSeek-R1 that you may find valuable:
From Brute Force to Brain Power: How Stanford's s1 Surpasses DeepSeek-R1
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5130864