<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="https://yapwh1208.github.io/blog/feed.xml" rel="self" type="application/atom+xml" /><link href="https://yapwh1208.github.io/blog/" rel="alternate" type="text/html" /><updated>2026-04-10T04:44:44+08:00</updated><id>https://yapwh1208.github.io/blog/feed.xml</id><title type="html">Research &amp;amp; Development Blog</title><subtitle>A professional blog focused on programming, AI research, and technical insights. Exploring the intersection of software engineering and artificial intelligence.</subtitle><author><name>Yap Wei Herng</name></author><entry><title type="html">Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting</title><link href="https://yapwh1208.github.io/blog/posts/2026/02/22/Retaining-by-Doing-The-Role-of-On-Policy-Data-in-Mitigating-Forgetting/" rel="alternate" type="text/html" title="Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting" /><published>2026-02-22T00:00:00+08:00</published><updated>2026-02-22T00:00:00+08:00</updated><id>https://yapwh1208.github.io/blog/posts/2026/02/22/Retaining%20by%20Doing%20The%20Role%20of%20On-Policy%20Data%20in%20Mitigating%20Forgetting</id><content type="html" xml:base="https://yapwh1208.github.io/blog/posts/2026/02/22/Retaining-by-Doing-The-Role-of-On-Policy-Data-in-Mitigating-Forgetting/"><![CDATA[<blockquote>
  <p>Source: “Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting,” <em>arXiv</em>: arXiv:2510.18874.</p>
</blockquote>

<h2 id="introduction">Introduction</h2>
<h3 id="background">Background</h3>
<p>Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities, known as catastrophic forgetting. This phenomenon has been observed in both supervised fine-tuning (SFT) for instruction following and reinforcement learning (RL) for preference alignment. However, the comparative susceptibility of SFT and RL to forgetting remains underexplored. The paper addresses this by comparing their forgetting patterns across diverse tasks and models.</p>
<h2 id="research-gap">Research Gap</h2>
<p>While prior work has noted forgetting in LM post-training, there is limited understanding of how SFT and RL differ in their forgetting behaviors. Existing studies often focus on one method or lack systematic comparisons across tasks and models. The role of on-policy data in mitigating forgetting has not been thoroughly investigated.</p>
<h2 id="hypothesis">Hypothesis</h2>
<p>RL is more robust to forgetting than SFT due to its mode-seeking nature and use of on-policy data, allowing it to maintain prior knowledge while adapting to new tasks.</p>
<h2 id="conclusion">Conclusion</h2>
<p>The paper demonstrates that RL consistently forgets less than SFT across various LM families and tasks, attributing this to on-policy data usage. It highlights the potential of approximately on-policy data for efficient forgetting mitigation.</p>

<hr />
<h2 id="literature-review">Literature Review</h2>
<p>Prior work on catastrophic forgetting in LMs includes studies of both SFT and RL, but direct comparisons between the two are scarce. Related efforts explore continual learning techniques, though with less focus on post-training specifics. Concurrent work by Lai et al. and Shenfeld et al. also finds that RL forgets less, but attributes the effect to different mechanisms.</p>

<hr />
<h2 id="methodology">Methodology</h2>
<h3 id="baseline-methods">Baseline Methods</h3>
<p>The baselines include SFT using responses from Llama-3.3-70B-Instruct, Self-SFT using filtered responses from the initial model, and RL via GRPO with KL regularization.</p>
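<p>The core computation behind the GRPO baseline can be sketched as group-relative advantage estimation: each sampled response's reward is standardized against the other responses drawn for the same prompt, so no learned value function is needed. This is an illustrative reimplementation, not the paper's code.</p>

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages in the style of GRPO (sketch):
    standardize each response's reward against the group sampled
    for the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Two correct and two incorrect samples in a group of four:
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # approximately [1, -1, 1, -1]
```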
<h3 id="proposed-modifications">Proposed Modifications</h3>
<p>The paper proposes using on-policy data in SFT, such as Iterative-SFT (updating data at each epoch) and SFT on RL-generated data, to reduce forgetting.</p>
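<p>The Iterative-SFT idea can be sketched as a data-regeneration loop: each epoch, training data is re-sampled from the <em>current</em> model and filtered for correctness, rather than reusing a dataset frozen at the start as in Self-SFT. The generator and correctness checker below are toy stand-ins, not the paper's setup.</p>

```python
def iterative_sft_epoch(model_generate, is_correct, prompts):
    """One epoch of Iterative-SFT (sketch): regenerate the training data
    on-policy from the current model and keep only correct responses."""
    data = []
    for p in prompts:
        resp = model_generate(p)      # sample from the current policy
        if is_correct(p, resp):       # correctness filter, as in Self-SFT
            data.append((p, resp))
    return data                        # fine-tune on `data`, then repeat

# Toy usage: the "model" answers sums but gets the second prompt wrong.
gen = lambda p: p[0] + p[1] + (1 if p == (4, 5) else 0)
ok = lambda p, r: r == p[0] + p[1]
print(iterative_sft_epoch(gen, ok, [(2, 3), (4, 5)]))  # [((2, 3), 5)]
```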
<h3 id="experimental-setup">Experimental Setup</h3>
<p>Experiments use Llama 3 and Qwen 2.5 models (1B to 8B parameters) on IFEval, MMLU, and Countdown tasks. Models are trained for 2 epochs with AdamW, a batch size of 128 or 64, and learning rates tuned per setting. Evaluation measures the gain on the target task and the drop on non-target tasks (MATH, WildJailbreak, WildGuardTest).</p>

<hr />
<h2 id="experiment">Experiment</h2>
<div style="text-align: center;"><img src="/blog/assets/img/Pasted%20image%2020260222171241.png" alt="Pasted image 20260222171241.png" /></div>
<h3 id="main-findings">Main Findings</h3>
<p>RL consistently shows less forgetting than SFT and Self-SFT across all tasks and models, while achieving comparable or higher target performance. For example, on IFEval, RL gains 17-18% with drops of only 0.2-3.4%, while SFT drops 15-27%. Simulations reveal that when the target distribution has multiple modes, mode-seeking RL preserves old modes better than mode-covering SFT.</p>
<h3 id="comparative-analysis">Comparative Analysis</h3>
<p>Compared to SFT, RL achieves similar gains but with much lower drops (e.g., RL drop 0.2% vs SFT 27% on IFEval). Ablations show on-policy data is key, not KL regularization or advantage estimation. Iterative-SFT reduces forgetting compared to Self-SFT, approaching RL’s performance.</p>
<h3 id="statistical-significance">Statistical Significance</h3>
<p>Results are consistent across multiple runs and models, with clear trends in figures showing RL’s superiority. Ablations confirm on-policy data’s role through controlled experiments.</p>

<div style="text-align: center;"><img src="/blog/assets/img/Pasted%20image%2020260222171328.png" alt="Pasted image 20260222171328.png" /></div>

<hr />

<h2 id="discussion">Discussion</h2>
<h3 id="interpretation-of-results">Interpretation of Results</h3>
<p>The mode-seeking behavior of RL, enabled by on-policy data, allows the policy to shift toward new modes without erasing old ones when its output distribution has multiple modes. This contrasts with SFT’s mode-covering approach, which redistributes probability mass across modes and causes forgetting. Approximately on-policy data in SFT can mimic this effect.</p>
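<p>The mode-covering versus mode-seeking contrast comes down to which direction of the KL divergence is minimized. A tiny discrete example with illustrative numbers: forward KL (the direction behind SFT's cross-entropy) prefers a fit that spreads mass over both modes, while reverse KL (the mode-seeking direction associated with on-policy RL) prefers committing to a single sharp mode.</p>

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A toy two-mode target: mass on an old skill and a new skill.
p = [0.49, 0.02, 0.49]
q_cover = [0.30, 0.40, 0.30]   # spreads mass between the modes (SFT-like)
q_seek = [0.96, 0.02, 0.02]    # commits to a single mode (RL-like)

# Forward KL favors the covering fit:
print(kl(p, q_cover) < kl(p, q_seek))   # True
# Reverse KL favors the single sharp mode:
print(kl(q_seek, p) < kl(q_cover, p))   # True
```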
<h3 id="implications">Implications</h3>
<p>For LM post-training, using RL or incorporating on-policy data in SFT can preserve existing capabilities. This has practical implications for efficient continual learning and safer model updates, especially in safety-critical applications.</p>
<h3 id="limitations">Limitations</h3>
<p>Experiments are limited to specific tasks and models; scaling to larger models or more diverse tasks may reveal different patterns. The theoretical analysis uses simplified Gaussian mixtures, which may not fully capture LM complexity.</p>

<hr />
<h2 id="reference">Reference</h2>
<p>[1] H. Chen, N. Razin, K. Narasimhan, and D. Chen, “Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting,” Dec. 03, 2025, <em>arXiv</em>: arXiv:2510.18874. doi: <a href="https://doi.org/10.48550/arXiv.2510.18874">10.48550/arXiv.2510.18874</a>.</p>]]></content><author><name>Yap Wei Herng, Grok</name></author><category term="AI" /><category term="NLP" /><category term="Novel Research" /><category term="Catastrophic-Forgetting" /><category term="LLM" /><category term="Reinforcement-Learning" /><summary type="html"><![CDATA[Source: “Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting,” arXiv: arXiv:2510.18874.]]></summary></entry><entry><title type="html">Agent-based Automated Claim Matching with Instruction-following LLMs</title><link href="https://yapwh1208.github.io/blog/posts/2026/02/14/Agent-based-Automated-Claim-Matching-with-Instruction-following-LLMs/" rel="alternate" type="text/html" title="Agent-based Automated Claim Matching with Instruction-following LLMs" /><published>2026-02-14T00:00:00+08:00</published><updated>2026-02-14T00:00:00+08:00</updated><id>https://yapwh1208.github.io/blog/posts/2026/02/14/Agent-based%20Automated%20Claim%20Matching%20with%20Instruction-following%20LLMs</id><content type="html" xml:base="https://yapwh1208.github.io/blog/posts/2026/02/14/Agent-based-Automated-Claim-Matching-with-Instruction-following-LLMs/"><![CDATA[<blockquote>
  <p>Source: “Agent-based Automated Claim Matching with Instruction-following LLMs,” <em>arXiv</em>: arXiv:2510.23924.</p>
</blockquote>

<h2 id="introduction">Introduction</h2>
<h3 id="background">Background</h3>
<p>Automated fact-checking pipelines rely on claim matching to identify claims that can be verified using the same evidence or fact-check. This task is crucial for scaling fact-checking efforts, as it helps in grouping related claims for efficient verification. Previous work has framed claim matching as a ranking problem or a binary classification task, with state-of-the-art results achieved using few-shot learning with instruction-following large language models (LLMs) based on manually crafted prompts. However, these approaches require human effort in prompt engineering, limiting automation. This paper addresses this gap by proposing an agent-based approach that automates prompt generation using LLMs themselves, aiming to reduce manual intervention while maintaining or improving performance.</p>

<p>The motivation stems from the need to overcome limitations in existing LLM-based claim matching methods. Earlier studies, such as Pisarevskaya and Zubiaga (2025), demonstrated strong results with hand-crafted prompts but highlighted the potential for automated prompt engineering to enhance scalability. Inspired by advancements in LLM agents and multi-agent systems, the authors draw from related fields like paraphrase detection and natural language inference to adapt LLMs for this task. By leveraging LLMs to generate prompts, the approach seeks to capture nuanced understandings of claim matching without predefined templates.</p>

<h3 id="objective">Objective</h3>
<p>The primary objective of this work is to develop a novel agent-based pipeline for automated claim matching using instruction-following LLMs, specifically designed to outperform existing state-of-the-art methods that rely on manual prompt engineering. The authors aim to investigate automated prompt engineering techniques, including the selection of few-shot examples and the generation of prompts by LLMs, to reduce human effort and enhance performance.</p>

<p>Additionally, the study seeks to compare the proposed agent-based approach against baseline methods, such as random or similarity-based few-shot selection and prompt-tuning, while exploring the use of different LLMs for prompt generation and classification. A key goal is to gain insights into how LLMs understand the claim matching task, revealing effective prompt structures and potential limitations.</p>

<h3 id="conclusion">Conclusion</h3>
<p>This paper introduces a pioneering agent-based pipeline for automated claim matching, leveraging instruction-following LLMs to generate prompts and perform binary classification. The key contributions include the development of automated prompt engineering methods that select optimal few-shot examples and generate effective prompts, outperforming state-of-the-art results from manual prompt designs and prompt-tuning approaches. By demonstrating that LLM-generated prompts can achieve superior performance, the work advances the automation of fact-checking pipelines.</p>

<p>The experimental results show that prompts generated by smaller LLMs, such as Llama, can effectively guide larger or different models like Mistral, achieving F1 scores up to 96.9%. This highlights the potential for resource-efficient implementations and the transferability of prompts across models. Furthermore, the study provides insights into LLMs’ understanding of claim matching, emphasizing the importance of concepts like ‘same event’ or ‘topic’ in prompt design, while identifying limitations such as over-reliance on consistency markers.</p>

<p>Overall, the agent-based approach not only improves performance but also opens avenues for further research, including multi-iteration prompt refinement, multilingual applications, and integration with other NLP tasks in fact-checking.</p>

<hr />
<h2 id="literature-review">Literature Review</h2>
<p>Claim matching has been addressed in prior research as both a ranking task and a classification problem. Ranking-based approaches, such as those by Shaar et al. (2020) and Kazemi et al. (2021, 2022), focus on retrieving relevant claims from large corpora. More recently, classification frameworks have emerged, including Choi and Ferrara (2024a,b), who treat it as a textual entailment task with three classes. Pisarevskaya and Zubiaga (2025) framed claim matching as a binary classification using paraphrase detection, natural language inference, or direct matching prompts, achieving state-of-the-art few-shot results with LLMs but relying on manual prompt templates.</p>

<p>The rise of LLM agents has enabled complex task automation. Works like Zhao et al. (2024) and Wang et al. (2024) demonstrate agents that interact and complete tasks, while multi-agent systems (Guo et al., 2024; Liang et al., 2024) enhance collaboration. This paper adapts a pipeline interaction inspired by Chan et al. (2023) and Fang et al. (2025), where one agent generates prompts and another performs classification.</p>

<p>Automated prompt engineering aims to optimize prompts without manual effort. Techniques from Liu et al. (2021) and Schulhoff et al. (2025) include generating prompts with LLMs (Reynolds and McDonell, 2021; Zhou et al., 2023; Ye et al., 2024). Prompt-tuning, as in Lester et al. (2021) and Liu et al. (2022), fine-tunes model parameters for efficiency. The authors compare their agent-based method to prompt-tuning on the same templates, showing superior performance.</p>

<p>Overall, while prior work has advanced claim matching and prompt automation separately, this study uniquely combines them in an agent-based pipeline for this specific task.</p>

<hr />
<h2 id="methodology">Methodology</h2>
<p>The study uses the ClaimMatch dataset from Pisarevskaya and Zubiaga (2025), based on Nakov et al. (2022), with 500 matching and 500 non-matching claim pairs in the test set. Few-shot examples are drawn from the original work, and additional pairs are used for prompt-tuning.</p>

<p>Experiments begin with automated selection of few-shot examples, comparing random selection, similarity-based sorting (using All-MiniLM-L6-v2 for semantic similarity), and borderline approaches against the original manual selection. This is tested on three hand-crafted prompt templates: CM-t (direct matching), PD-t (paraphrase detection), and NLI-t (natural language inference).</p>
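<p>The similarity-based selection step can be sketched as ranking candidate example pairs by cosine similarity to the test pair. The paper embeds claims with All-MiniLM-L6-v2; here a simple bag-of-words vector stands in for those embeddings so the sketch stays self-contained.</p>

```python
from collections import Counter
import math

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def select_fewshot(test_pair, pool, k=2):
    """Rank candidate few-shot pairs by similarity to the test pair.
    Bag-of-words stands in for the All-MiniLM-L6-v2 embeddings."""
    embed = lambda a, b: Counter((a + " " + b).lower().split())
    q = embed(*test_pair)
    return sorted(pool, key=lambda ex: cosine(q, embed(ex[0], ex[1])), reverse=True)[:k]

pool = [("vaccines cause autism", "mmr vaccine linked to autism", "yes"),
        ("5g spreads covid", "covid is caused by 5g towers", "yes"),
        ("earth is flat", "the earth is not round", "yes")]
test = ("autism is caused by vaccines", "vaccine autism link")
print(select_fewshot(test, pool, k=1)[0][0])  # vaccines cause autism
```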

<p>The core agent-based pipeline consists of two steps: (1) Prompt generation, where LLMs (Mistral-7B, Llama-3-8B, and larger variants Mistral-Small-24B, Llama-3.3-70B) are prompted to create new prompts based on few-shot examples, without explicit task definitions to assess LLM understanding. (2) Binary classification, where generated prompts are used for claim matching with the same or different LLMs, evaluating combinations like Mistral with Llama prompts.</p>
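<p>The two-step pipeline can be sketched as two cooperating callables; <code>prompt_llm</code> and <code>classify_llm</code> below are hypothetical text-in/text-out stand-ins for the Mistral/Llama calls, which also shows why prompts transfer across models.</p>

```python
def agent_claim_matching(prompt_llm, classify_llm, fewshot_pairs, claim_a, claim_b):
    """Agent pipeline sketch: one LLM writes the prompt, a second LLM
    uses it for binary claim-matching classification."""
    # Step 1: the prompt-generation agent sees only labelled examples,
    # not a task definition, and writes the task prompt itself.
    examples = "\n".join(f"A: {a}\nB: {b}\nMatch: {y}" for a, b, y in fewshot_pairs)
    task_prompt = prompt_llm(
        "Here are labelled pairs of claims:\n" + examples +
        "\nWrite an instruction prompt for deciding whether two claims match.")
    # Step 2: the classification agent applies the generated prompt.
    answer = classify_llm(f"{task_prompt}\nA: {claim_a}\nB: {claim_b}\nAnswer yes or no:")
    return answer.strip().lower().startswith("yes")

# Stub usage: any callables work, e.g. a Llama prompt-writer feeding Mistral.
gen_stub = lambda _: "Decide whether the two claims describe the same event."
cls_stub = lambda _: "Yes, they match."
print(agent_claim_matching(gen_stub, cls_stub, [("a", "b", "yes")], "c1", "c2"))  # True
```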

<p>Models are quantized for efficiency, and experiments compare performance against SOTA few-shot and prompt-tuning baselines. Prompt-tuning uses PEFT with 5 epochs on the same templates.</p>

<p>This setup allows investigation of cross-model prompt transferability and insights into LLM interpretations of claim matching.</p>

<hr />
<h2 id="experiment">Experiment</h2>

<div style="text-align: center;"><img src="/blog/assets/img/Pasted%20image%2020260214221435.png" alt="Pasted image 20260214221435.png" /></div>

<p>Few-shot example selection shows model-dependent performance: Mistral benefits from sorted examples on CM-t and PD-t, while Llama improves with random or borderline on all templates, outperforming SOTA. However, the original manual selection remains robust for Mistral.</p>

<p>LLM-generated prompts significantly outperform SOTA, with Llama prompts yielding the best results (e.g., L2 prompt achieving 96.9% F1 and accuracy for Mistral). Cross-model usage (e.g., Llama prompts for Mistral) proves effective, saving resources, and even surpasses prompt-tuning baselines. Larger models do not necessarily generate better prompts, as smaller Llama prompts outperform those from larger variants.</p>

<p>Error analysis reveals that effective prompts emphasize ‘same event’ or ‘topic’, but limitations include false negatives from minor variations and over-strict consistency checks. Mistral prompts sometimes lead to worse performance due to misinterpretations, while Llama’s broader markers reduce false negatives.</p>

<p>Overall, the agent-based pipeline demonstrates superior automation and performance, with insights into LLM task understanding, though further refinements like step-by-step reasoning are suggested for handling edge cases.</p>

<hr />
<h2 id="reference">Reference</h2>
<p>[1] Pisarevskaya, Dina and Zubiaga, Arkaitz. Zero-shot and few-shot learning with instruction-following LLMs for claim matching in automated fact-checking. In Proceedings of the 31st International Conference on Computational Linguistics, pages 9721–9736, Abu Dhabi, UAE. Association for Computational Linguistics, 2025.
[2] Barrón-Cedeño, Alberto et al. Overview of checkthat!2020: Automatic identification and verification of claims in social media. Preprint, arXiv:2007.07997v1, 2020.
[3] Shaar, Shaden et al. That is a known lie: Detecting previously fact-checked claims. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3607–3618, Online, 2020.
[4] Choi, Eun Cheol and Ferrara, Emilio. Automated claim matching with large language models: Empowering fact-checkers in the fight against misinformation. In Companion Proceedings of the ACM Web Conference 2024, WWW ’24, page 1441–1449, New York, NY, USA. Association for Computing Machinery, 2024.
[5] Zhao, Yongchao et al. Expel: LLM agents are experiential learners. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence, AAAI’24/IAAI’24/EAAI’24. AAAI Press, 2024.
[6] Liu, Pengfei et al. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55:1–35, 2021.
[7] Lester, Brian et al. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic, 2021.</p>]]></content><author><name>Yap Wei Herng, Grok</name></author><category term="AI" /><category term="NLP" /><category term="Novel Research" /><category term="LLM" /><category term="Agent" /><summary type="html"><![CDATA[Source: “Agent-based Automated Claim Matching with Instruction-following LLMs,” arXiv: arXiv:2510.23924.]]></summary></entry><entry><title type="html">The 10,000x Explosion: Reproducing DeepSeek’s mHC at Scale</title><link href="https://yapwh1208.github.io/blog/posts/2026/02/12/The-10,000x-Explosion-Reproducing-DeepSeek-s-mHC-at-Scale/" rel="alternate" type="text/html" title="The 10,000x Explosion: Reproducing DeepSeek’s mHC at Scale" /><published>2026-02-12T00:00:00+08:00</published><updated>2026-02-12T00:00:00+08:00</updated><id>https://yapwh1208.github.io/blog/posts/2026/02/12/The%2010,000x%20Explosion%20Reproducing%20DeepSeek%E2%80%99s%20mHC%20at%20Scale</id><content type="html" xml:base="https://yapwh1208.github.io/blog/posts/2026/02/12/The-10,000x-Explosion-Reproducing-DeepSeek-s-mHC-at-Scale/"><![CDATA[<h3 id="the-10000x-explosion-reproducing-deepseeks-mhc-at-scale">The 10,000x Explosion: Reproducing DeepSeek’s mHC at Scale</h3>

<p>In the world of Large Language Models, “stability” is often the difference between a state-of-the-art model and an expensive pile of NaNs. When DeepSeek released their paper on <strong>Manifold Hyper-Connections (mHC)</strong>, they proposed a solution to a hidden rot in transformer architectures: the explosion of signal through residual connections.</p>

<p>Developer Taylor Kolasinski recently took these claims to the lab, and the results were nothing short of explosive. Here is the breakdown of how he found “the bomb” hidden in transformer architectures.</p>

<h4 id="1-the-stream-persistence-breakthrough">1. The “Stream Persistence” Breakthrough</h4>

<p>Reproduction started small—a 10M parameter sandbox. Initially, everything looked identical. The math was right, but the results were flat. The culprit? A “collapsing stream” bug.</p>

<p>In Hyper-Connections, the model maintains multiple parallel streams of information. Taylor realized he was accidentally merging these streams back into one at every layer, effectively neutering the architecture. Once he allowed the streams to stay independent (“persistent”), the beast woke up. The signal amplification jumped 9.2× almost immediately.</p>
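<p>A minimal sketch of the distinction, with toy shapes rather than the actual mHC code: each layer must return all n residual streams, not an average collapsed back into one.</p>

```python
import numpy as np

def hc_layer(streams, mix, block):
    """One Hyper-Connections-style layer (toy shapes): route information
    between the n parallel residual streams, apply the block to an
    aggregate, and return n streams -- never collapse them to one."""
    mixed = mix @ streams                  # (n, d): inter-stream mixing
    out = block(mixed.sum(axis=0))         # the block sees an aggregate
    return mixed + out                     # broadcast add -> still (n, d)

rng = np.random.default_rng(0)
n, d = 4, 8
streams = rng.normal(size=(n, d))          # n persistent residual streams
mix = np.eye(n) + 0.1 * rng.normal(size=(n, n))
block = lambda x: 0.5 * np.tanh(x)
for _ in range(6):                         # six layers deep
    streams = hc_layer(streams, mix, block)
# The "collapsing stream" bug amounts to returning
# np.tile(merged_stream, (n, 1)) instead: every stream becomes identical
# and the extra residual width is silently lost.
print(streams.shape)  # (4, 8)
```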

<h4 id="2-scaling-until-it-breaks">2. Scaling Until it Breaks</h4>

<p>Small models are one thing; 1.7 billion parameters is another. Moving to the “big leagues” meant swapping Colab for a $20/hour 8× H100 cluster from Lambda Labs.</p>

<p>Engineering at this scale is a war of attrition. Taylor documented a gauntlet of OOM (Out of Memory) errors, “thundering herd” problems where multiple GPUs tried to download the 300GB C4 dataset simultaneously, and even a physical hardware failure on “GPU 3.”</p>

<p>The lesson? When training LLMs, <strong>assume everything will break.</strong> Memory math matters more than your code, and cloud GPUs are as fallible as any other hardware.</p>

<h4 id="3-detonating-the-bomb">3. Detonating “The Bomb”</h4>

<p>The climax of the experiment came at the 1.7B parameter mark. As the training runs progressed, the standard Hyper-Connections (HC) model began to exhibit what Taylor calls “quiet violence.”</p>

<p>The signal didn’t just drift; it amplified by over <strong>10,000×</strong>. In a startling discovery, he found that this instability doesn’t start deep in the network—it starts at <strong>Layer 0</strong>, right at the input embedding.</p>
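<p>Why amplification compounds from Layer 0: if each layer's residual mixing amplifies the signal even slightly, the gain multiplies across depth, while doubly stochastic mixing (the constraint behind the mHC fix) preserves it. A toy numeric illustration, not Taylor's actual measurements:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 4, 60
x = np.ones(n)

# Unconstrained HC-style mixing: each layer amplifies slightly.
layers = np.eye(n) + 0.05 * np.abs(rng.normal(size=(depth, n, n)))
# mHC-style mixing: a doubly stochastic matrix (rows and columns sum to 1).
ds = np.full((n, n), 1.0 / n)

xu, xd = x.copy(), x.copy()
for layer in range(depth):
    xu = layers[layer] @ xu
    xd = ds @ xd

gain_hc = np.linalg.norm(xu) / np.linalg.norm(x)   # compounds into the thousands
gain_mhc = np.linalg.norm(xd) / np.linalg.norm(x)  # pinned at 1.0 for this input
print(gain_hc, gain_mhc)
```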

<p>The only reason the model didn’t instantly vaporize into “Not a Number” (NaN) errors was gradient clipping—a safety net that was working overtime to mask the structural instability of the architecture.</p>

<h4 id="the-verdict">The Verdict</h4>

<p>The experiment was a total success for the DeepSeek thesis. While the unconstrained Hyper-Connections created a signal bomb, the <strong>Manifold Hyper-Connections (mHC)</strong> fix kept the signal pinned at a perfect 1.0 across every seed and every depth.</p>

<p>Taylor’s journey proves that as we push toward 10B and 100B parameter models, we can no longer rely on “heroic” gradient clipping to save us. We need architectures that are stable by design.</p>

<hr />
<h2 id="reference">Reference</h2>
<p>[1] <a href="https://taylorkolasinski.com/devlog/mhc-reproduction-session-4/">taylorkolasinski.com/devlog/mhc-reproduction-session-4</a><br />
[2] <a href="https://taylorkolasinski.com/devlog/mhc-reproduction-session-5/">taylorkolasinski.com/devlog/mhc-reproduction-session-5</a><br />
[3] <a href="https://taylorkolasinski.com/devlog/mhc-reproduction-session-6/">taylorkolasinski.com/devlog/mhc-reproduction-session-6</a></p>]]></content><author><name>Yap Wei Herng, Gemini</name></author><category term="AI" /><category term="NLP" /><category term="Technical" /><category term="Architecture" /><summary type="html"><![CDATA[The 10,000x Explosion: Reproducing DeepSeek’s mHC at Scale]]></summary></entry><entry><title type="html">mHC: Manifold-Constrained Hyper-Connections</title><link href="https://yapwh1208.github.io/blog/posts/2026/02/07/mHC-Manifold-Constrained-Hyper-Connections/" rel="alternate" type="text/html" title="mHC: Manifold-Constrained Hyper-Connections" /><published>2026-02-07T00:00:00+08:00</published><updated>2026-02-07T00:00:00+08:00</updated><id>https://yapwh1208.github.io/blog/posts/2026/02/07/mHC%20Manifold-Constrained%20Hyper-Connections</id><content type="html" xml:base="https://yapwh1208.github.io/blog/posts/2026/02/07/mHC-Manifold-Constrained-Hyper-Connections/"><![CDATA[<blockquote>
  <p>Source: “mHC: Manifold-Constrained Hyper-Connections,” <em>arXiv</em>: arXiv:2512.24880.</p>
</blockquote>

<h2 id="introduction">Introduction</h2>
<h3 id="background">Background</h3>
<p>Deep neural network architectures have evolved significantly since the introduction of ResNets in 2016, with residual connections becoming a cornerstone of modern models like Transformers and large language models (LLMs). Hyper-Connections (HC) extended this paradigm by expanding the residual stream width and diversifying connectivity patterns, yielding performance gains but compromising identity mapping stability, leading to training instability and scalability issues. This paper addresses the need for stable, efficient macro-architectures in LLMs to support large-scale training without excessive memory overhead.</p>

<h3 id="objective">Objective</h3>
<p>The primary goal is to develop Manifold-Constrained Hyper-Connections (mHC), a framework that constrains HC’s residual mappings to a doubly stochastic manifold using the Sinkhorn-Knopp algorithm, restoring identity mapping properties for stable signal propagation. Additionally, mHC incorporates infrastructure optimizations like kernel fusion, recomputing, and DualPipe communication to ensure efficiency, enabling scalable training of LLMs with minimal overhead.</p>

<h3 id="conclusion">Conclusion</h3>
<p>mHC successfully stabilizes HC by constraining residual mappings to a doubly stochastic manifold, achieving stable training with only 6.7% time overhead. It outperforms HC on multiple benchmarks and demonstrates superior scalability, contributing to deeper understanding of topological architecture design. The framework opens avenues for exploring diverse manifold constraints for future LLM architectures.</p>

<hr />
<h2 id="literature-review">Literature Review</h2>
<p>Macro-design in deep learning focuses on inter-block topological structures, with ResNet establishing residual connections as fundamental. Extensions like DenseNet, FractalNet, and DLA increased connectivity complexity. Recent works like HC, RMT, MUDDFormer, and Residual expand residual streams but often compromise stability. mHC builds on HC by constraining mappings to manifolds, addressing instability while maintaining expressivity, and differentiates from prior work through rigorous infrastructure optimizations for efficiency.</p>

<hr />
<h2 id="methodology">Methodology</h2>
<div style="text-align: center;"><img src="/blog/assets/img/Pasted%20image%2020260207212217.png" alt="Pasted image 20260207212217.png" /></div>
<p>mHC constrains HC’s residual mapping H<sub>res</sub> to the Birkhoff polytope (the set of doubly stochastic matrices) using the Sinkhorn-Knopp algorithm, ensuring norm preservation and compositional closure for stable propagation. The input and output mappings H<sub>pre</sub> and H<sub>post</sub> are constrained to be non-negative. For efficiency, kernel fusion combines operations, recomputing reduces the memory footprint by discarding intermediates and recomputing them on the backward pass, and DualPipe overlapping optimizes communication in pipeline parallelism. Experiments use MoE-based LLMs with expansion rate n=4.</p>
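<p>The core projection is easy to sketch: Sinkhorn-Knopp alternately normalizes the rows and columns of a positive matrix until both sum to one. This is a generic NumPy illustration of the algorithm, not the paper's fused kernel, and with a finite iteration count the result is only approximately doubly stochastic.</p>

```python
import numpy as np

def sinkhorn_knopp(m, iters=200):
    """Project a positive matrix toward the Birkhoff polytope (doubly
    stochastic matrices) by alternately normalizing rows and columns."""
    m = np.asarray(m, dtype=float)
    for _ in range(iters):
        m = m / m.sum(axis=1, keepdims=True)   # make rows sum to 1
        m = m / m.sum(axis=0, keepdims=True)   # make columns sum to 1
    return m

h = sinkhorn_knopp(np.random.default_rng(0).uniform(0.1, 1.0, size=(4, 4)))
print(h.sum(axis=0))  # columns: 1 (the last normalization step)
print(h.sum(axis=1))  # rows: 1 up to the tolerance of finite iterations
```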

<hr />
<h2 id="experiment">Experiment</h2>
<div style="text-align: center;"><img src="/blog/assets/img/Pasted%20image%2020260207212250.png" alt="Pasted image 20260207212250.png" /></div>
<p>Experiments on 3B, 9B, and 27B MoE models show mHC achieves stable training, mitigating HC’s loss surges and gradient explosions. mHC outperforms HC on downstream benchmarks, with gains on BBH and DROP, and maintains performance advantages across scales. Stability analysis reveals mHC’s gain magnitudes are bounded (max ~1.6) versus HC’s extreme values (up to 3000). Scalability curves indicate robust improvements with compute, and token scaling shows sustained gains. Limitations include slight deviations from perfect doubly stochasticity due to finite Sinkhorn iterations.</p>
<div style="text-align: center;"><img src="/blog/assets/img/Pasted%20image%2020260207212303.png" alt="Pasted image 20260207212303.png" /></div>
<div style="text-align: center;"><img src="/blog/assets/img/Pasted%20image%2020260207212313.png" alt="Pasted image 20260207212313.png" /></div>

<hr />
<h2 id="reference">Reference</h2>
<p>[1] Z. Xie <em>et al.</em>, “mHC: Manifold-Constrained Hyper-Connections,” Jan. 05, 2026, <em>arXiv</em>: arXiv:2512.24880. doi: <a href="https://doi.org/10.48550/arXiv.2512.24880">10.48550/arXiv.2512.24880</a>.</p>]]></content><author><name>Yap Wei Herng, Grok</name></author><category term="AI" /><category term="NLP" /><category term="Novel Research" /><category term="Architecture" /><category term="Training" /><summary type="html"><![CDATA[Source: “mHC: Manifold-Constrained Hyper-Connections,” arXiv: arXiv:2512.24880.]]></summary></entry><entry><title type="html">ConfTuner: Training Large Language Models to Express Their Confidence Verbally</title><link href="https://yapwh1208.github.io/blog/posts/2026/02/03/ConfTuner-Training-Large-Language-Models-to-Express-Their-Confidence-Verbally/" rel="alternate" type="text/html" title="ConfTuner: Training Large Language Models to Express Their Confidence Verbally" /><published>2026-02-03T00:00:00+08:00</published><updated>2026-02-03T00:00:00+08:00</updated><id>https://yapwh1208.github.io/blog/posts/2026/02/03/ConfTuner%20Training%20Large%20Language%20Models%20to%20Express%20Their%20Confidence%20Verbally</id><content type="html" xml:base="https://yapwh1208.github.io/blog/posts/2026/02/03/ConfTuner-Training-Large-Language-Models-to-Express-Their-Confidence-Verbally/"><![CDATA[<blockquote>
  <p>Source: “ConfTuner: Training Large Language Models to Express Their Confidence Verbally,” arXiv:2508.18847.</p>
</blockquote>

<h2 id="introduction">Introduction</h2>
<h3 id="background">Background</h3>
<p>Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as science, law, and healthcare, where accurate expressions of uncertainty are essential for reliability and trust. However, current LLMs often generate incorrect answers with high confidence, a phenomenon known as overconfidence. This undermines trust and poses challenges for safe LLM deployment. Recent efforts have focused on calibrating LLMs’ verbalized confidence, but existing approaches rely on prompt engineering or fine-tuning with heuristically generated uncertainty estimates, which have limited effectiveness and generalizability.</p>

<h3 id="objective">Objective</h3>
<p>The central research question is whether LLMs can be naturally calibrated during training without relying on ground-truth confidence scores or proxy confidence estimates. Inspired by proper scoring rules in classical machine learning, the authors propose ConfTuner, a fine-tuning method using a tokenized Brier score to incentivize LLMs to express confidence that reflects their true likelihood of correctness.</p>

<h3 id="conclusion">Conclusion</h3>
<p>ConfTuner provides accurate confidence estimates that outperform baselines in calibration metrics, generalize across tasks and models, and enable practical benefits like improved self-correction and cost-effective model cascades. The method advances trustworthy LLM systems by aligning verbalized confidence with actual reliability.</p>

<hr />
<h2 id="literature-review">Literature Review</h2>
<p>Prior work on LLM calibration includes prompt-based methods for eliciting verbalized confidence, which have limited effects. Training-based approaches fine-tune LLMs with proxy confidence scores from heuristics like model accuracy on similar questions or consistency across responses. However, these proxies introduce bias and noise. Traditional calibration in classifiers uses proper scoring rules like Brier score, but adapting this to verbalized confidence in LLMs is novel. ConfTuner extends this by defining proper scoring rules for tokenized confidence expressions.</p>

<hr />
<h2 id="methodology">Methodology</h2>
<div style="text-align: center;"><img src="/blog/assets/img/Pasted%20image%2020260203135519.png" alt="Pasted image 20260203135519.png" /></div>
<p>ConfTuner consists of two steps:</p>
<ol>
  <li>Compute probability distribution over confidence tokens (e.g., 0-100%) by extracting logits for confidence tokens after the LLM generates an answer.</li>
  <li>Fine-tune using the tokenized Brier score loss, which penalizes the squared error between predicted confidence and correctness. The loss is proven to be a proper scoring rule, ensuring calibration. Training uses LoRA on HotpotQA with minimal overhead and requires no ground-truth confidence labels.</li>
</ol>
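<p>The tokenized Brier score described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact implementation: the number of confidence tokens (eleven bins, 0%–100%) and the array shapes are assumptions.</p>

```python
import numpy as np

def tokenized_brier_loss(conf_logits, correct):
    """Sketch of a tokenized Brier score.

    conf_logits: (batch, K) logits over K discrete confidence tokens
                 (e.g. "0%", "10%", ..., "100%").
    correct:     (batch,) 1.0 if the generated answer was right, else 0.0.
    """
    conf_logits = np.asarray(conf_logits, dtype=float)
    correct = np.asarray(correct, dtype=float)
    K = conf_logits.shape[-1]
    conf_values = np.linspace(0.0, 1.0, K)   # numeric value each token verbalizes
    z = conf_logits - conf_logits.max(axis=-1, keepdims=True)  # stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # Expected squared error between verbalized confidence and correctness
    sq_err = (conf_values[None, :] - correct[:, None]) ** 2
    return float((probs * sq_err).sum(axis=-1).mean())

# An overconfident-but-wrong prediction is penalized heavily:
logits = np.zeros((1, 11)); logits[0, -1] = 20.0   # nearly all mass on "100%"
print(tokenized_brier_loss(logits, [0.0]))          # close to 1.0
```

<p>Because the loss is a proper scoring rule, the model minimizes it only by reporting a confidence that matches its true probability of being correct.</p>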

<hr />
<h2 id="experiment">Experiment</h2>
<div style="text-align: center;"><img src="/blog/assets/img/Pasted%20image%2020260203135556.png" alt="Pasted image 20260203135556.png" /></div>
<p>ConfTuner outperforms baselines (Base, Ensemble, LACIE, SaySelf) in ECE and AUROC across LLaMA, Qwen, and Ministral models on HotpotQA and out-of-distribution datasets (GSM8K, TriviaQA, StrategyQA, TruthfulQA). It generalizes to linguistic confidence (high/medium/low) and implicit expressions, and calibrates black-box models like GPT-4o. Ablations show efficiency (4 min training, 2,000 samples), and applications include better self-correction and model cascades with up to 9.3% accuracy gains.</p>

<hr />
<h2 id="reference">Reference</h2>
<p>[1] Y. Li, M. Xiong, J. Wu, and B. Hooi, “ConfTuner: Training Large Language Models to Express Their Confidence Verbally,” Aug. 26, 2025, <em>arXiv</em>: arXiv:2508.18847. doi: <a href="https://doi.org/10.48550/arXiv.2508.18847">10.48550/arXiv.2508.18847</a>.</p>]]></content><author><name>Yap Wei Herng, Grok</name></author><category term="AI" /><category term="NLP" /><category term="Novel Research" /><category term="Inference" /><summary type="html"><![CDATA[Source: “ConfTuner: Training Large Language Models to Express Their Confidence Verbally,” arXiv:2508.18847.]]></summary></entry><entry><title type="html">The zero temperature myth why greedy doesn’t always mean same</title><link href="https://yapwh1208.github.io/blog/posts/2026/01/29/The-Zero-Temperature-Myth-Why-Greedy-Doesn't-Always-Mean-Same/" rel="alternate" type="text/html" title="The zero temperature myth why greedy doesn’t always mean same" /><published>2026-01-29T00:00:00+08:00</published><updated>2026-01-29T00:00:00+08:00</updated><id>https://yapwh1208.github.io/blog/posts/2026/01/29/The%20Zero%20Temperature%20Myth%20Why%20Greedy%20Doesn&apos;t%20Always%20Mean%20Same</id><content type="html" xml:base="https://yapwh1208.github.io/blog/posts/2026/01/29/The-Zero-Temperature-Myth-Why-Greedy-Doesn&apos;t-Always-Mean-Same/"><![CDATA[<h1 id="the-zero-temperature-myth-why-greedy-doesnt-always-mean-same">The Zero Temperature Myth: Why “Greedy” Doesn’t Always Mean “Same”</h1>

<p>It’s one of the most common assumptions in AI: set the <strong>Temperature ($T$) to 0</strong>, and you’ll get a perfectly deterministic, robotic, and consistent output every single time. It feels logical. In a world of probabilities, $T = 0$ should mean “stop guessing and just give me the top answer.”</p>

<p>But if you’ve spent enough time prompting, you’ve likely noticed the “ghost in the machine.” Even at absolute zero, the model might occasionally swap a “the” for an “a” or change the structure of a sentence.</p>

<p>Let’s dive into the math, the hardware, and the quirks of probability to understand why $T = 0$ isn’t the “lockdown” we think it is.</p>

<hr />

<h2 id="1-the-math-what-temperature-actually-does">1. The Math: What Temperature Actually Does</h2>

<p>In a Large Language Model, the final layer doesn’t output words; it outputs <strong>logits</strong> (raw scores for every possible word in its vocabulary). To turn these scores into a probability distribution, we use the <strong>Softmax function</strong>.</p>

<p>The Temperature hyperparameter is a scaling factor applied to those logits before they reach the Softmax. The formula looks like this:</p>

<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><msub><mi>P</mi><mi>i</mi></msub><mo>=</mo><mfrac><msup><mi>e</mi><mrow><msub><mi>z</mi><mi>i</mi></msub><mi mathvariant="normal">/</mi><mi>T</mi></mrow></msup><mrow><munder><mo>∑</mo><mi>j</mi></munder><msup><mi>e</mi><mrow><msub><mi>z</mi><mi>j</mi></msub><mi mathvariant="normal">/</mi><mi>T</mi></mrow></msup></mrow></mfrac></mrow><annotation encoding="application/x-tex">P_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8333em;vertical-align:-0.15em;"></span><span class="mord"><span class="mord mathnormal" style="margin-right:0.13889em;">P</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3117em;"><span style="top:-2.55em;margin-left:-0.1389em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.15em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.2778em;"></span><span class="mrel">=</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:2.7209em;vertical-align:-1.1559em;"></span><span class="mord"><span class="mopen nulldelimiter"></span><span class="mfrac"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:1.565em;"><span style="top:-2.2799em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mop"><span class="mop op-symbol small-op" style="position:relative;top:0em;">∑</span><span class="msupsub"><span 
class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.162em;"><span style="top:-2.4003em;margin-left:0em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mathnormal mtight" style="margin-right:0.05724em;">j</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.4358em;"><span></span></span></span></span></span></span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord"><span class="mord mathnormal">e</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8301em;"><span style="top:-3.0051em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.04398em;">z</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3281em;"><span style="top:-2.357em;margin-left:-0.044em;margin-right:0.0714em;"><span class="pstrut" style="height:2.5em;"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mathnormal mtight" style="margin-right:0.05724em;">j</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.2819em;"><span></span></span></span></span></span></span><span class="mord mtight">/</span><span class="mord mathnormal mtight" style="margin-right:0.13889em;">T</span></span></span></span></span></span></span></span></span></span></span><span style="top:-3.23em;"><span class="pstrut" style="height:3em;"></span><span class="frac-line" style="border-bottom-width:0.04em;"></span></span><span style="top:-3.677em;"><span class="pstrut" style="height:3em;"></span><span class="mord"><span class="mord"><span 
class="mord mathnormal">e</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.888em;"><span style="top:-3.063em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight"><span class="mord mathnormal mtight" style="margin-right:0.04398em;">z</span><span class="msupsub"><span class="vlist-t vlist-t2"><span class="vlist-r"><span class="vlist" style="height:0.3281em;"><span style="top:-2.357em;margin-left:-0.044em;margin-right:0.0714em;"><span class="pstrut" style="height:2.5em;"></span><span class="sizing reset-size3 size1 mtight"><span class="mord mathnormal mtight">i</span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:0.143em;"><span></span></span></span></span></span></span><span class="mord mtight">/</span><span class="mord mathnormal mtight" style="margin-right:0.13889em;">T</span></span></span></span></span></span></span></span></span></span></span></span><span class="vlist-s">​</span></span><span class="vlist-r"><span class="vlist" style="height:1.1559em;"><span></span></span></span></span></span><span class="mclose nulldelimiter"></span></span></span></span></span></span>

<p>Where:</p>
<ul>
  <li>$P_i$ is the probability of token $i$</li>
  <li>$z_i$ is the logit (score) for that token</li>
  <li>$T$ is the Temperature</li>
</ul>

<h3 id="the-behavior-of-t">The Behavior of $T$</h3>
<ul>
  <li><strong>High $T$ ($T &gt; 1$):</strong> Flattens the distribution. The gap between the “best” word and the “okay” word shrinks, making the model more creative (and chaotic).</li>
  <li><strong>Low $T$ ($0 &lt; T &lt; 1$):</strong> Sharpens the distribution. The “best” word gets a massive boost in probability compared to everything else.</li>
  <li><strong>$T = 0$:</strong> Mathematically, you can’t divide by zero. In practice, setting $T = 0$ tells the model to perform <strong>Greedy Decoding</strong>. It bypasses the probability distribution entirely and simply picks the token with the highest logit.</li>
</ul>
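<p>A few lines of Python make the behavior concrete. The logit values are arbitrary; the point is that lowering $T$ concentrates probability on the top token, and $T = 0$ is handled as a special case that skips the softmax entirely:</p>

```python
import numpy as np

def softmax_with_temperature(logits, T):
    """Temperature-scaled softmax: P_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                 # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 1.0))   # moderately peaked distribution
print(softmax_with_temperature(logits, 0.1))   # nearly all mass on index 0

# T = 0: no distribution at all — greedy decoding just takes the argmax.
greedy_choice = int(np.argmax(logits))
print(greedy_choice)   # 0
```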

<hr />

<h2 id="2-why-the-inconsistency-the-gpu-chaos">2. Why the Inconsistency? The “GPU Chaos”</h2>

<p>If the model is just picking the highest score, why would it ever change its mind? The answer usually lies in <strong>floating-point math</strong> and <strong>parallel processing</strong>.</p>

<h3 id="non-associative-summation">Non-Associative Summation</h3>

<p>Computers handle numbers using “floating-point” representation. Unlike pure math, in computer science, $(A + B) + C$ is not always exactly equal to $A + (B + C)$.</p>

<p>When a model runs on a GPU, it performs billions of operations in parallel. Because the order of these operations can vary slightly depending on the GPU’s workload or how the threads are scheduled, tiny “rounding errors” occur at the 10th or 15th decimal place.</p>
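<p>You don't need a GPU to see non-associativity; ordinary double-precision floats on a CPU already exhibit it:</p>

```python
# Floating-point addition is not associative: regrouping changes the result.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # 0.1 + 0.2 rounds to 0.30000000000000004 first
right = a + (b + c)   # 0.2 + 0.3 rounds to 0.5 first

print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```

<p>A GPU summing thousands of values in whatever order its threads finish is doing this regrouping constantly, so the final logits can wobble in their last few bits from run to run.</p>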

<blockquote>
  <p><strong>The Butterfly Effect:</strong> If two words (e.g., “Apple” and “Banana”) have scores that are nearly identical, a tiny rounding error in the 16th decimal place can cause “Banana” to suddenly have a higher logit than “Apple.” At $T = 0$, the model picks the winner, even if it won by a microscopic margin.</p>
</blockquote>

<h3 id="the-problem-of-ties">The Problem of “Ties”</h3>
<p>Sometimes, the model genuinely cannot decide. If two tokens have the exact same logit score, the system has to break the tie. Depending on the specific software implementation, this might fall back to a random choice or to the token’s position in the vocabulary, adding another layer of potential variance.</p>
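<p>A toy illustration of how arbitrary tie-breaking can be: Python's built-in <code>max</code> returns the first maximal item it encounters, so the "winner" of an exact tie depends on iteration order, not on the scores themselves. Real serving stacks have their own conventions, but the principle is the same:</p>

```python
logits = {"Apple": 3.14, "Banana": 3.14}   # an exact tie in scores

# max() breaks ties by iteration order: the first maximal item wins.
winner = max(logits, key=logits.get)
print(winner)   # Apple — purely because it was inserted first
```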

<hr />

<h2 id="3-architecture-and-service-quirks">3. Architecture and Service Quirks</h2>
<p>Beyond the hardware, the way AI is served to you can cause shifts:</p>
<ul>
  <li><strong>Mixture of Experts (MoE):</strong> Modern models (like some versions of Gemini or GPT-4) use a “routing” system where different parts of the model handle different parts of a query. If the router has any internal variance, different “experts” might handle your prompt, leading to slightly different logits.</li>
  <li><strong>Quantization:</strong> Many models are compressed to run faster. This compression makes the logit scores even closer together, increasing the likelihood that a tiny hardware fluctuation will flip the “top” choice.</li>
  <li><strong>Batching:</strong> When servers handle multiple requests at once, the way inputs are grouped can subtly affect the numerical precision of the calculations.</li>
</ul>]]></content><author><name>Yap Wei Herng, Gemini</name></author><category term="AI" /><category term="NLP" /><category term="Technical" /><category term="Temperature" /><summary type="html"><![CDATA[The Zero Temperature Myth: Why “Greedy” Doesn’t Always Mean “Same”]]></summary></entry><entry><title type="html">Halogen fantastic llm hallucinations and where to find them</title><link href="https://yapwh1208.github.io/blog/posts/2026/01/25/HALoGEN-Fantastic-LLM-Hallucinations-and-Where-to-Find-Them/" rel="alternate" type="text/html" title="Halogen fantastic llm hallucinations and where to find them" /><published>2026-01-25T00:00:00+08:00</published><updated>2026-01-25T00:00:00+08:00</updated><id>https://yapwh1208.github.io/blog/posts/2026/01/25/HALoGEN%20Fantastic%20LLM%20Hallucinations%20and%20Where%20to%20Find%20Them</id><content type="html" xml:base="https://yapwh1208.github.io/blog/posts/2026/01/25/HALoGEN-Fantastic-LLM-Hallucinations-and-Where-to-Find-Them/"><![CDATA[<blockquote>
  <p>Source: “HALoGEN: Fantastic LLM Hallucinations and Where to Find Them,” <em>arXiv</em>: arXiv:2501.08292.</p>
</blockquote>

<h2 id="introduction">Introduction</h2>
<h3 id="background">Background</h3>
<p>Large language models (LLMs) excel at generating high-quality, fluent text but often produce hallucinations—statements that misalign with established world knowledge or provided input context. Measuring hallucinations is challenging due to the open-ended nature of generations and the expense of human verification. This work addresses these issues by introducing HALOGEN, a benchmark with 10,923 prompts across nine domains, including programming, scientific attribution, and summarization, paired with automatic high-precision verifiers that decompose generations into atomic units for verification against reliable knowledge sources.</p>

<h3 id="objective">Objective</h3>
<p>The primary objectives are to create a scalable, multi-domain benchmark for hallucination evaluation in LLMs and to analyze the underlying causes of hallucinations by classifying them into types based on their relation to pretraining data: Type A (correct fact present but hallucinated), Type B (incorrect fact in data or out of context), and Type C (fabrication not present in data). This framework aims to enable principled study of why LLMs hallucinate and advance the development of trustworthy LLMs.</p>

<h3 id="conclusion">Conclusion</h3>
<p>HALOGEN provides a comprehensive framework for evaluating LLM hallucinations across diverse scenarios, revealing that even top-performing models like GPT-4 exhibit high hallucination rates (up to 86% in some domains). The error classification highlights that hallucinations stem from multiple sources, varying by domain, and not a single cause. This work contributes a benchmark, evaluation metrics, and insights into hallucination origins, paving the way for more truthful LLMs through targeted mitigation strategies.</p>

<hr />
<h2 id="literature-review">Literature Review</h2>
<p>Hallucination in LLMs has been extensively studied, with surveys noting its prevalence (Zhang et al., 2023; Ji et al., 2022). Early detection methods focused on grounded tasks like summarization and dialogue, using entailment or QA-based systems (Maynez et al., 2020; Durmus et al., 2020). More recent reference-based approaches verify against sources like Wikipedia or search (Min et al., 2023; Agrawal et al., 2023), while reference-free methods use LLMs for consistency checks (Manakul et al., 2023). Benchmarks include FActScore for biographies (Min et al., 2023) and TruthfulQA for misconceptions (Lin et al., 2021b). HALOGEN extends these by covering diverse domains, including refusal-based tasks, and implements verifiers for code, citations, and more, enabling scalable evaluation.</p>

<hr />
<h2 id="methodology">Methodology</h2>

<div style="text-align: center;"><img src="/blog/assets/img/Pasted%20image%2020260125201412.png" alt="Pasted image 20260125201412.png" /></div>

<p>HALOGEN consists of nine tasks: Code Packages, Summarization, Simplification, Biographies, Rationalization (Binary/Numerical), Scientific Attribution, Historical Events, and False Presuppositions. Tasks are response-based (expected to respond) or refusal-based (expected to abstain). For each, prompts are constructed from diverse sources, and automatic verifiers decompose responses into atomic units (e.g., package names, citations) and verify against sources like PyPI, Semantic Scholar, or entailment models. Evaluation metrics include Hallucination Score, Response Ratio, and Utility Score. The benchmark evaluates 14 LLMs on ∼150,000 generations. Error classification traces hallucinations to pretraining data: Type A (correct fact present), Type B (incorrect/out-of-context fact), Type C (fabrication).</p>
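<p>The Hallucination Score can be sketched as the unsupported fraction of a response's atomic units. The allow-list verifier below is a stand-in for illustration only; HALOGEN's actual verifiers query knowledge sources such as PyPI, Semantic Scholar, or entailment models.</p>

```python
def hallucination_score(atomic_units, verify):
    """Fraction of a response's atomic units that fail verification.

    atomic_units: decomposed facts from one generation (e.g. package names).
    verify:       callable returning True if the unit is supported by a
                  trusted knowledge source.
    """
    if not atomic_units:
        return 0.0
    unsupported = sum(1 for u in atomic_units if not verify(u))
    return unsupported / len(atomic_units)

# Toy verifier: a fixed allow-list standing in for a real knowledge source.
known_packages = {"numpy", "requests", "torch"}
units = ["numpy", "requests", "definitely_fake_pkg", "torch"]
score = hallucination_score(units, known_packages.__contains__)
print(score)   # 0.25 — one of the four cited packages does not exist
```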

<div style="text-align: center;"><img src="/blog/assets/img/Pasted%20image%2020260125201425.png" alt="Pasted image 20260125201425.png" /></div>

<hr />
<h2 id="results--discussion">Results &amp; Discussion</h2>
<p>Evaluation of 14 LLMs on HALOGEN shows high hallucination rates, with even GPT-4 hallucinating between 4% and 86% of atomic facts depending on the task. GPT-3.5 and GPT-4 outperform open-source models, with Llama-3-70B the best open model. Larger models generally hallucinate less on response-based tasks, but the trend is less consistent for refusal-based ones. Error analysis reveals domain-specific patterns: Type B errors dominate code tasks (the hallucinated packages appear in pretraining data), Type A dominates senator affiliations (the correct information was available), and Type C dominates historical events (the entities never co-occur in the data). Intrinsic hallucinations are more common in summarization than extrinsic ones. This underscores the need for diverse benchmarks and multifaceted mitigation.</p>

<div style="text-align: center;"><img src="/blog/assets/img/Pasted%20image%2020260125201459.png" alt="Pasted image 20260125201459.png" /></div>

<div style="text-align: center;"><img src="/blog/assets/img/Pasted%20image%2020260125201515.png" alt="Pasted image 20260125201515.png" /></div>

<hr />
<h2 id="reference">Reference</h2>
<p>[1] A. Ravichander, S. Ghela, D. Wadden, and Y. Choi, “HALoGEN: Fantastic LLM Hallucinations and Where to Find Them,” Jan. 14, 2025, <em>arXiv</em>: arXiv:2501.08292. doi: <a href="https://doi.org/10.48550/arXiv.2501.08292">10.48550/arXiv.2501.08292</a>.
[2] Zhang et al. Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models. arXiv, 2023.
[3] Ji et al. Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 2022.
[4] Maynez et al. On Faithfulness and Factuality in Abstractive Summarization. ACL, 2020.
[5] Durmus et al. FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization. ACL, 2020.
[6] Min et al. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. arXiv, 2023.
[7] Agrawal et al. Do Language Models Know When They’re Hallucinating References? arXiv, 2023.
[8] Manakul et al. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. arXiv, 2023.
[9] Lin et al. TruthfulQA: Measuring How Models Mimic Human Falsehoods. ACL, 2021.</p>]]></content><author><name>Yap Wei Herng, Grok</name></author><category term="AI" /><category term="NLP" /><category term="Novel Research" /><category term="LLM" /><category term="Hallucination" /><summary type="html"><![CDATA[Source: “HALoGEN: Fantastic LLM Hallucinations and Where to Find Them,” arXiv: arXiv:2501.08292.]]></summary></entry><entry><title type="html">Why diffusion models don’t memorize the role of implicit dynamical regularization in training</title><link href="https://yapwh1208.github.io/blog/posts/2026/01/17/Why-Diffusion-Models-Don-t-Memorize-The-Role-of-Implicit-Dynamical-Regularization-in-Training/" rel="alternate" type="text/html" title="Why diffusion models don’t memorize the role of implicit dynamical regularization in training" /><published>2026-01-17T00:00:00+08:00</published><updated>2026-01-17T00:00:00+08:00</updated><id>https://yapwh1208.github.io/blog/posts/2026/01/17/Why%20Diffusion%20Models%20Don%E2%80%99t%20Memorize%20The%20Role%20of%20Implicit%20Dynamical%20Regularization%20in%20Training</id><content type="html" xml:base="https://yapwh1208.github.io/blog/posts/2026/01/17/Why-Diffusion-Models-Don-t-Memorize-The-Role-of-Implicit-Dynamical-Regularization-in-Training/"><![CDATA[<blockquote>
  <p>Source: “Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training,” The Thirty-ninth Annual Conference on Neural Information Processing Systems</p>
</blockquote>

<h2 id="introduction">Introduction</h2>
<h3 id="background">Background</h3>
<p>Diffusion models have revolutionized generative AI, achieving state-of-the-art performance in tasks like image, audio, and video generation. However, a key challenge is understanding why they generalize well without memorizing training data, despite being theoretically capable of reproducing exact samples. This paper explores the role of training dynamics in this phenomenon.</p>
<h3 id="research-gap">Research Gap</h3>
<p>While previous studies have observed that diffusion models memorize training data for small datasets but generalize for larger ones, the underlying mechanisms—particularly the role of training dynamics in creating implicit regularization—remain poorly understood. Existing explanations focus on architectural biases or finite learning rates, but these do not fully account for the observed scaling with dataset size.</p>
<h3 id="hypothesis">Hypothesis</h3>
<p>The training dynamics of diffusion models exhibit implicit regularization through spectral bias, leading to two timescales: a short τ_gen for generalization and a longer τ_mem ∝ n for memorization, allowing early stopping to prevent memorization.</p>

<h3 id="conclusion">Conclusion</h3>
<p>This work demonstrates that implicit dynamical regularization in training dynamics is key to preventing memorization in diffusion models, expanding the generalization regime through early stopping. The findings bridge numerical observations with theoretical insights, providing practical guidelines for robust DM training.</p>

<div style="text-align: center;"><img src="/blog/assets/img/Pasted%20image%2020260117105001.png" alt="Pasted image 20260117105001.png" /></div>

<hr />
<h2 id="literature-review">Literature Review</h2>
<p>The paper reviews empirical studies on memorization in DMs (e.g., Stable Diffusion), theoretical works on score learning in high-dimensional limits, and spectral bias in neural networks. It builds on prior analyses of generalization-memorization transitions, early stopping benefits, and dynamical regularization mechanisms.</p>

<hr />
<h2 id="experimental-design">Experimental Design</h2>
<h3 id="baseline-methods">Baseline Methods</h3>
<p>U-Net architectures with varying widths (W = 8 to 64) for score estimation in DDPMs, trained on CelebA; and a random-feature neural network (RFNN) for analytical tractability in the high-dimensional limit.</p>
<h3 id="proposed-modifications">Proposed Modifications</h3>
<p>Vary dataset size n (128 to 32768) and model capacity p (via U-Net width W). Monitor training dynamics with early stopping at different τ. Use RFNN to derive spectral properties analytically.</p>
<h3 id="experimental-setup">Experimental Setup</h3>
<p>CelebA downsampled to 32x32 grayscale. U-Nets trained with SGD, batch size min(n,512), diffusion time t=0.01 for loss monitoring. Metrics: FID against 10K test samples, f_mem via nearest neighbor distance (k=1/3). RFNN in high-dim limit with tanh activation, Gaussian data.</p>
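<p>A hedged sketch of a memorization fraction of the kind monitored here: a generated sample is flagged as memorized when its nearest training example is much closer than the second-nearest one. The 1/3 distance-ratio criterion is an assumption about what the paper's threshold denotes, and the arrays below are toy data, not CelebA:</p>

```python
import numpy as np

def memorized_fraction(generated, train, ratio=1/3):
    """Sketch of f_mem: flag a generated sample as memorized when its
    nearest-training-neighbor distance is below `ratio` times the distance
    to the second-nearest neighbor (ratio=1/3 is an assumed threshold)."""
    memorized = 0
    for g in generated:
        d = np.linalg.norm(train - g, axis=1)  # distance to every training point
        d1, d2 = np.sort(d)[:2]                # nearest and second-nearest
        if d1 < ratio * d2:
            memorized += 1
    return memorized / len(generated)

train = np.array([[0.0, 0.0], [10.0, 10.0]])     # toy "training set"
generated = np.array([[0.01, 0.0], [5.0, 5.0]])  # one near-copy, one novel sample
print(memorized_fraction(generated, train))      # 0.5
```

<p>In the paper's setting, tracking such a fraction during training is what reveals that memorization only begins after a timescale growing linearly with n.</p>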

<hr />
<h2 id="results">Results</h2>
<h3 id="main-findings">Main Findings</h3>
<p>Training exhibits two phases: FID reaches minimum at τ_gen ≈ 100K (independent of n), then stabilizes; f_mem starts at 0 and increases after τ_mem ∝ n. RFNN theory predicts τ_mem ∝ n via eigenvalue spectrum, with generalization loss decreasing as n increases.</p>
<h3 id="comparative-analysis">Comparative Analysis</h3>
<p>Compared to smaller n, larger n delays memorization (τ_mem increases), with generalization window widening. Varying p shows τ_mem ∝ n/W, confirming dynamical over architectural regularization. RFNN matches U-Net behavior, validating theory.</p>
<h3 id="statistical-significance">Statistical Significance</h3>
<p>Results averaged over 5 test sets for FIDs, 5 noise realizations for losses. f_mem with 95% CI from 1000 bootstrap samples. Scaling τ_mem ∝ n confirmed with rescaled plots collapsing curves.</p>

<hr />
<h2 id="discussion">Discussion</h2>
<h3 id="interpretation-of-results">Interpretation of Results</h3>
<p>The two timescales arise from spectral bias: low-frequency components (population score) learned quickly, high-frequency (dataset-specific) learned later. Dynamical regularization stabilizes smooth approximations, preventing early memorization and allowing generalization via early stopping.</p>
<div style="text-align: center;"><img src="/blog/assets/img/Pasted%20image%2020260117104942.png" alt="Pasted image 20260117104942.png" /></div>
<div style="text-align: center;"><img src="/blog/assets/img/Pasted%20image%2020260117104920.png" alt="Pasted image 20260117104920.png" /></div>
<h3 id="implications">Implications</h3>
<p>Early stopping at τ_gen prevents memorization in data-scarce settings. Provides guidelines for training DMs: monitor f_mem, stop when generalization loss increases. Extends to other score-based methods like flow matching.</p>
<h3 id="limitations">Limitations</h3>
<p>Experiments on unconditional CelebA; conditional settings and broader datasets need exploration. Theoretical analysis uses simplified models (RFNN, Gaussian data); real architectures and data distributions may differ. Limited p range; full (n,p) phase diagram requires wider exploration.</p>

<hr />
<h2 id="reference">Reference</h2>
<p>[1] T. Bonnaire, R. Urfin, G. Biroli, and M. Mezard, “Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training,” presented at the The Thirty-ninth Annual Conference on Neural Information Processing Systems, Oct. 2025. Accessed: Dec. 11, 2025. [Online]. Available: <a href="https://openreview.net/forum?id=BSZqpqgqM0">https://openreview.net/forum?id=BSZqpqgqM0</a></p>]]></content><author><name>Yap Wei Herng</name></author><category term="AI" /><category term="Novel Research" /><category term="Diffusion" /><summary type="html"><![CDATA[Source: “Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training,” The Thirty-ninth Annual Conference on Neural Information Processing Systems]]></summary></entry><entry><title type="html">Soft thinking unlocking the reasoning potential of llms in continuous concept space</title><link href="https://yapwh1208.github.io/blog/posts/2026/01/11/Soft-Thinking-Unlocking-the-Reasoning-Potential-of-LLMs-in-Continuous-Concept-Space/" rel="alternate" type="text/html" title="Soft thinking unlocking the reasoning potential of llms in continuous concept space" /><published>2026-01-11T00:00:00+08:00</published><updated>2026-01-11T00:00:00+08:00</updated><id>https://yapwh1208.github.io/blog/posts/2026/01/11/Soft%20Thinking%20Unlocking%20the%20Reasoning%20Potential%20of%20LLMs%20in%20Continuous%20Concept%20Space</id><content type="html" xml:base="https://yapwh1208.github.io/blog/posts/2026/01/11/Soft-Thinking-Unlocking-the-Reasoning-Potential-of-LLMs-in-Continuous-Concept-Space/"><![CDATA[<blockquote>
  <p>Source: “Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space,” <em>arXiv</em>: arXiv:2505.15778.</p>
</blockquote>

<h2 id="introduction">Introduction</h2>
<h3 id="background">Background</h3>
<p>Large Language Models (LLMs) have shown impressive capabilities in complex reasoning tasks through Chain-of-Thought (CoT) prompting, which generates intermediate reasoning steps in natural language. However, standard CoT is constrained to discrete token embeddings representing fixed points in semantic space, limiting expressive power and potential. Human cognition involves fluid, abstract concepts beyond discrete linguistic tokens, supported by neuroscientific evidence of non-verbal conceptual processing. This discrete constraint restricts LLMs’ reasoning, causing incomplete exploration of paths in CoT due to sampling one token per step. Humans consider multiple possibilities simultaneously, integrating abstract concepts for more flexible reasoning.</p>

<h3 id="objective">Objective</h3>
<p>The paper aims to enable LLMs to reason with soft, abstract concepts in a continuous concept space, transcending discrete language boundaries. Specifically, it proposes Soft Thinking, a training-free method that replaces discrete token selection with probabilistic soft aggregation over the vocabulary, forming concept tokens that encapsulate multiple meanings and explore various reasoning paths implicitly to converge toward correct answers more effectively.</p>

<h3 id="conclusion">Conclusion</h3>
<p>Soft Thinking introduces a novel reasoning paradigm that breaks the bottleneck of discrete token-based reasoning by operating in a continuous concept space. By leveraging concept tokens formed as convex combinations of token embeddings, it enhances both the comprehensiveness of reasoning and convergence efficiency. Empirical results demonstrate consistent improvements in accuracy and token efficiency across diverse benchmarks without training. The method presents potential for future work on integrating training-based approaches and adapting to OOD inputs, paving the way for more advanced LLM reasoning capabilities.</p>

<hr />
<h2 id="literature-review">Literature Review</h2>
<p>Chain-of-Thought (CoT) reasoning enhances multi-step reasoning by generating intermediate steps, with approaches including prompt-based methods, supervised fine-tuning, and reinforcement learning. However, efficiency concerns arise with longer chains. Continuous space reasoning has been explored, such as decoding intermediate variables from hidden states, interventions on hidden states, and latent planning tokens. Methods like COCONUT use hidden states as embeddings, but face challenges with decoupled input/output spaces in larger models. Soft Thinking addresses this by using probability distributions as a bridge, enabling training-free alignment in continuous spaces.</p>

<hr />
<h2 id="methodology">Methodology</h2>
<p>Soft Thinking replaces discrete token sampling in CoT with concept tokens, which are probability distributions over the vocabulary. At each reasoning step, the model keeps the full output distribution as the concept token and computes the next input embedding as the probability-weighted sum of token embeddings. This defines a continuous concept space: the convex hull of the token embeddings. The Cold Stop mechanism monitors entropy and terminates reasoning early when confidence stays high over consecutive steps, preventing collapse. Theoretically, the method approximates full path-summation via linearization. The implementation uses top-k filtering for efficiency and integrates with SGLang.</p>
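<p>The core update and the Cold Stop check can be sketched as below. The <code>top_k</code> and entropy <code>threshold</code> values are illustrative assumptions, not the paper's settings:</p>

```python
import numpy as np

def concept_token_embedding(logits, embedding_matrix, top_k=5):
    """Soft Thinking step (sketch): instead of sampling one token, feed back
    the probability-weighted mixture of token embeddings as the next input."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-k filtering: keep only the k most likely tokens, then renormalize.
    keep = np.argsort(probs)[-top_k:]
    p = np.zeros_like(probs)
    p[keep] = probs[keep]
    p /= p.sum()
    return p @ embedding_matrix   # convex combination of token embeddings

def should_cold_stop(probs, threshold=0.1):
    """Cold Stop (sketch): stop reasoning once entropy is low, i.e. the
    distribution has collapsed onto a confident choice."""
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    return entropy < threshold

# With an identity embedding matrix, a peaked distribution yields an
# embedding close to the top token's own embedding.
E = np.eye(4)
v = concept_token_embedding(np.array([10.0, 0.0, 0.0, 0.0]), E, top_k=2)
```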

<hr />
<h2 id="experiment">Experiment</h2>
<p>Evaluations on math (Math500, AIME 2024, GSM8K, GPQA-Diamond) and coding (HumanEval, MBPP, LiveCodeBench) benchmarks show Soft Thinking improves pass@1 accuracy by up to 2.48% and reduces token usage by up to 22.4% compared to standard CoT. It outperforms greedy CoT in accuracy while maintaining efficiency. Ablation studies confirm the superiority of probability-weighted embeddings over averages and the necessity of Cold Stop to avoid collapse. Qualitative analysis shows interpretable outputs with shorter, concise reasoning. The method generalizes across model architectures and scales without training.</p>

<hr />
<h2 id="reference">Reference</h2>
<p>[1] Z. Zhang <em>et al.</em>, “Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space,” May 21, 2025, <em>arXiv</em>: arXiv:2505.15778. doi: <a href="https://doi.org/10.48550/arXiv.2505.15778">10.48550/arXiv.2505.15778</a>.</p>]]></content><author><name>Yap Wei Herng, Grok</name></author><category term="AI" /><category term="NLP" /><category term="Novel Research" /><category term="LLM" /><category term="Reasoning" /><summary type="html"><![CDATA[Source: “Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space,” arXiv: arXiv:2505.15778.]]></summary></entry><entry><title type="html">Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model</title><link href="https://yapwh1208.github.io/blog/posts/2026/01/04/Does-Reinforcement-Learning-Really-Incentivize-Reasoning-Capacity-in-LLMs-Beyond-the-Base-Model/" rel="alternate" type="text/html" title="Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model" /><published>2026-01-04T00:00:00+08:00</published><updated>2026-01-04T00:00:00+08:00</updated><id>https://yapwh1208.github.io/blog/posts/2026/01/04/Does%20Reinforcement%20Learning%20Really%20Incentivize%20Reasoning%20Capacity%20in%20LLMs%20Beyond%20the%20Base%20Model</id><content type="html" xml:base="https://yapwh1208.github.io/blog/posts/2026/01/04/Does-Reinforcement-Learning-Really-Incentivize-Reasoning-Capacity-in-LLMs-Beyond-the-Base-Model/"><![CDATA[<blockquote>
  <p>Source: “Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?,” The Thirty-ninth Annual Conference on Neural Information Processing Systems</p>
</blockquote>

<h2 id="introduction">Introduction</h2>
<h3 id="background">Background</h3>
<p>Recent breakthroughs in reasoning-centric Large Language Models (LLMs), such as OpenAI-o1 and DeepSeek-R1, have been largely driven by <strong>Reinforcement Learning with Verifiable Rewards (RLVR)</strong>. In this paradigm, models are trained on tasks like mathematics and coding where the reward (correctness) is automatically computable. The common industry belief is that RLVR allows models to “self-evolve,” discovering novel reasoning strategies (like self-reflection or trial-and-error) that exceed the capabilities of their original base models.</p>

<h3 id="objective">Objective</h3>
<p>The researchers set out to rigorously test whether current RLVR methods genuinely expand an LLM’s reasoning “boundary” or if they simply teach the model to find correct answers more efficiently within a search space it already possesses. They utilize the <strong>pass@k</strong> metric (the probability of getting at least one correct answer in $k$ attempts) to map the true potential of both base and RL-trained models.</p>

<h3 id="conclusion">Conclusion</h3>
<p>The study yields a surprising “cold water” finding: <strong>Current RLVR does not elicit fundamentally new reasoning patterns.</strong> While RLVR significantly improves “sampling efficiency” (making the model more likely to get the answer right on the first try), it actually <strong>narrows</strong> the model’s overall reasoning coverage. The base model, given enough samples, can solve a wider range of problems than its RL-trained descendant. In contrast, <strong>distillation</strong> from a stronger teacher (like o1) is found to be the only current method that genuinely expands the reasoning boundary.</p>

<hr />
<h2 id="literature-review">Literature Review</h2>
<p>The paper situates itself within the rapid evolution of post-training for LLMs. While traditional instruction tuning (SFT) relies on human-curated data, RLVR has gained traction due to its scalability.</p>
<ul>
  <li><strong>Traditional RL (e.g., AlphaGo):</strong> Agents discover entirely new strategies through exploration.</li>
  <li><strong>Current LLM RL (RLVR):</strong> Relies on policy gradient methods (like PPO or GRPO).</li>
  <li><strong>Related Analysis:</strong> The authors build on recent observations that “reflective” behaviors (like “Wait, let me rethink…”) might already exist in base models. This paper provides the first systematic, quantitative evidence, using $pass@k$ across multiple domains (math, code, vision), that the base model serves as an upper bound for RL performance.</li>
</ul>

<hr />
<h2 id="methodology">Methodology</h2>
<p>The authors utilize several quantitative metrics and analytical frameworks:</p>

<ul>
  <li><strong>Metric - pass@k:</strong> To measure the “reasoning boundary,” they use an unbiased estimator:
$$pass@k := \mathbb{E}_{x_i \sim \mathcal{D}} \left[ 1 - \frac{\binom{n-c_i}{k}}{\binom{n}{k}} \right]$$
where $n$ is the total samples, and $c_i$ is the number of correct samples.</li>
  <li><strong>Sampling Efficiency Gap ($\Delta_{SE}$):</strong> This measures how close an RL model’s first-try success ($pass@1$) is to the base model’s potential maximum ($pass@256$).</li>
  <li><strong>Perplexity Analysis ($PPL$):</strong> To see if RL models generate “new” text, they calculate the likelihood of RL-generated paths under the base model’s distribution:
$$PPL_m(Y \mid x) = \exp \left( -\frac{1}{T} \sum_{t=1}^{T} \log P(y_t \mid x, y_1, \dots, y_{t-1}) \right)$$</li>
  <li><strong>Algorithms Evaluated:</strong> They tested six popular RL frameworks: PPO, GRPO, Reinforce++, RLOO, ReMax, and DAPO.</li>
</ul>
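<p>Both estimators above are easy to compute in practice. The sketch below (helper names are my own, NumPy only) uses the standard numerically stable product form of the unbiased $pass@k$ estimator rather than raw binomial coefficients:</p>

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n-c, k) / C(n, k),
    computed as a stable product instead of explicit binomials."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one correct sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

def perplexity(token_logprobs) -> float:
    """PPL_m(Y | x): exp of the negative mean per-token log-likelihood
    log P(y_t | x, y_1, ..., y_{t-1}) under model m."""
    return float(np.exp(-np.mean(token_logprobs)))
```

<p>For example, with $n=10$ samples of which $c=5$ are correct, <code>pass_at_k(10, 5, 1)</code> gives 0.5. Averaging <code>pass_at_k</code> over a dataset yields the paper’s $pass@k$ curves, and comparing a base model’s $pass@256$ against an RL model’s $pass@1$ gives the sampling efficiency gap $\Delta_{SE}$.</p>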

<hr />
<h2 id="experiment">Experiment</h2>
<p>The researchers conducted extensive tests across three main domains:</p>

<div style="text-align: center;"><img src="/blog/assets/img/Pasted%20image%2020260104171158.png" alt="Pasted image 20260104171158.png" /></div>

<ul>
  <li><strong>Mathematics:</strong> Tested on AIME24, MATH500, and GSM8K using Qwen2.5 and LLaMA-3.1 models.
    <ul>
      <li><em>Result:</em> At $k=1$, RL models win. But at $k=1024$, the <strong>base model consistently surpasses the RL model</strong>. This indicates that the RL model “forgot” how to solve certain complex problems while concentrating probability mass on easier ones.</li>
    </ul>
  </li>
  <li><strong>Code Generation:</strong> Evaluated on LiveCodeBench and HumanEval+.
    <ul>
      <li><em>Result:</em> Trends were identical to math. RL models showed a narrower “scope” of solvable problems compared to their base versions.</li>
    </ul>
  </li>
  <li><strong>Visual Reasoning:</strong> Evaluated on MathVista and MathVision.
    <ul>
      <li><em>Result:</em> Even in multimodal tasks, RLVR only sharpened existing patterns rather than creating new ones.</li>
    </ul>
  </li>
  <li><strong>Deep Analysis (The “Distillation” Difference):</strong>
    <ul>
      <li>The authors compared <strong>RLVR</strong> against <strong>Distillation</strong> (e.g., DeepSeek-R1-Distill-Qwen).</li>
      <li><em>Finding:</em> The distilled model’s $pass@k$ curve stays significantly above the base model’s curve even as $k$ increases. This indicates that <strong>knowledge transfer</strong> (SFT on high-quality CoT) genuinely expands a model’s capabilities, whereas <strong>self-improvement RL</strong> currently does not.</li>
    </ul>
  </li>
</ul>

<p><strong>Final takeaway for developers:</strong> Current RL training is excellent for making models “reliable” and “fast” (improving $pass@1$), but to actually make a model “smarter” (expanding the boundary), we still need better data distillation or next-generation RL paradigms that incentivize true exploration.</p>

<hr />
<h2 id="reference">Reference</h2>
<p>[1] Y. Yue <em>et al.</em>, “Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?,” presented at The Thirty-ninth Annual Conference on Neural Information Processing Systems, Oct. 2025. Accessed: Dec. 11, 2025. [Online]. Available: <a href="https://openreview.net/forum?id=4OsgYD7em5">https://openreview.net/forum?id=4OsgYD7em5</a></p>]]></content><author><name>Yap Wei Herng, Gemini</name></author><category term="AI" /><category term="NLP" /><category term="Novel Research" /><category term="Reinforcement-Learning" /><category term="LLM" /><category term="Reasoning" /><summary type="html"><![CDATA[Source: “Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?,” The Thirty-ninth Annual Conference on Neural Information Processing Systems]]></summary></entry></feed>