Revisiting Long Context Modeling From Context Denoising Perspective

Source: “Revisiting Long-context Modeling from Context Denoising Perspective,” arXiv:2510.05862

Introduction

Background

Long-context models (LCMs) have advanced rapidly and can now handle inputs of up to millions of tokens. However, researchers have found that LCMs are susceptible to contextual noise: irrelevant tokens that can mislead the model’s attention.

Objective

Develop an effective method to mitigate contextual noise in long-context inputs, so that the model attends more strongly to the critical tokens.

Conclusion

This paper introduces context denoising training (CDT), an approach that detects noise using integrated gradients and suppresses it during training to improve long-context capabilities. The results show that 8B models can achieve performance comparable to GPT-4o on real-world tasks, along with significant gains on synthetic benchmarks.

Literature Review

“LCMs first retrieve salient information within the context and utilize this information for further prediction” (Source: Chapter 2.1)

This observation suggests that the model often has trouble focusing on the right parts of the given input. The solutions proposed by researchers are:

Model architecture improvements
Enhancing information extraction mechanism
Optimizing training objective

This paper focuses on the context denoising aspect, specifically in long-context post-training. Post-training methods can be classified into two types: context window scaling and long-context alignment.

Context window scaling extends the context length at limited computational cost compared to pretraining. It can be split into two approaches: positional extrapolation and model architecture modification.

Long-context alignment aims to strengthen the long-context capabilities of a model and to address alignment challenges such as hallucination.

Methodology

Critical Tokens Detection

Experiment Design: Injecting irrelevant documents and low-frequency words into long-context window

Two new metrics are used to evaluate the influence of context noise:

Fact Retrieval (FR) Score
Integrated Gradient (IG) Score

Notation:

$X = \{x_i\}_{i=1}^{n}$ : Model input with $n$ tokens
$Y = \{y_j\}_{j=1}^{m}$ : Ground truth with $m$ tokens
$h$ : Index of attention head
$l$ : Index of model layer
$s_j$ : Set of tokens attended by an attention head $h$ at generation step $j$
$\mathcal{T}_r$ : Context token set of type $r \in \{\text{sup}, \text{inter}, \text{irr}, \text{low}\}$
$\mathcal{L}_\theta(Y\mid X)$ : Model’s prediction loss on $Y$
$A_{h,l}$ : Attention matrix of the $h$-th head in the $l$-th layer
$f_\theta$ : Model
$E_\phi(X)$ : Input token embeddings
$lr$ : Learning rate
$\beta$ : Hyperparameter of denoising level

$$\text{FR}^{(r)}_{h,l}=\frac{|s_j\cap\mathcal{T}_r|}{|\mathcal{T}_r|}$$
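
As a rough illustration of the FR score, the sketch below computes the ratio for a single head at a single generation step. The notes do not say how $s_j$ is chosen, so the `top_k_attended` helper (the $k$ positions with the largest attention weight) is a hypothetical choice, and the toy attention weights are made up.

```python
def fr_score(attended, type_tokens):
    """|s_j ∩ T_r| / |T_r| for one head at one generation step.

    attended:    indices of the tokens attended by the head (s_j)
    type_tokens: indices of the context tokens of type r (T_r)
    """
    attended, type_tokens = set(attended), set(type_tokens)
    if not type_tokens:
        raise ValueError("T_r must be non-empty")
    return len(attended & type_tokens) / len(type_tokens)


def top_k_attended(attention_row, k):
    """Hypothetical definition of s_j: the k positions with the largest weight."""
    order = sorted(range(len(attention_row)),
                   key=lambda i: attention_row[i], reverse=True)
    return order[:k]


# Toy attention weights of one head over five context tokens (made up).
attention_row = [0.05, 0.40, 0.10, 0.30, 0.15]
supporting = [1, 2]                       # indices assumed to form T_sup

s_j = top_k_attended(attention_row, k=2)  # positions 1 and 3
score = fr_score(s_j, supporting)         # one of two supporting tokens is hit
```

With these toy values the head retrieves one of the two supporting tokens, so the score is 0.5.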

$$\text{IG}_{h,l}=A^{T}_{h,l}\odot\left|\frac{\partial \mathcal{L}_{\theta}(Y\mid X)}{\partial A_{h,l}}\right|,\quad \text{IG}^{(r)}_{h,l}=\frac{1}{|\mathcal{T}_r|}\sum_{x_i\in\mathcal{T}_r}\sum_{y_j\in Y}\text{IG}_{h,l}[i,j]$$

Here the IG score is a matrix in which each entry $\text{IG}_{h,l}[i,j]$ represents the estimated bidirectional information flow between token $x_i$ and token $y_j$. To assess the overall impact, the total contribution of the tokens in $\mathcal{T}_r$ to the final prediction $Y$ is computed and then averaged across all attention heads and layers to give the final score. A higher IG score indicates a larger contribution of $\mathcal{T}_r$ to $Y$.
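
The per-head IG computation can be sketched as follows. This is one reading of the formula, not the paper’s implementation: it assumes the attention matrix is stored with one row per output token $y_j$ and one column per input token $x_i$, and that the transpose only reindexes the product so that entry $[i,j]$ pairs $x_i$ with $y_j$. The gradient of the loss with respect to the attention matrix is taken as given (in practice it would come from an autograd framework), and all numbers are made up.

```python
def ig_matrix(attn, grad):
    """IG[i][j] = attn[j][i] * |grad[j][i]| for one head in one layer.

    attn, grad: m x n nested lists (rows: output tokens y_j, cols: input tokens x_i).
    Returns an n x m nested list indexed as IG[i][j].
    """
    m, n = len(attn), len(attn[0])
    return [[attn[j][i] * abs(grad[j][i]) for j in range(m)] for i in range(n)]


def ig_score(ig, type_tokens):
    """Average over x_i in T_r of the summed contribution to every y_j."""
    return sum(sum(ig[i]) for i in type_tokens) / len(type_tokens)


# Toy example: 2 output tokens, 3 input tokens (values are illustrative only).
attn = [[0.7, 0.2, 0.1],
        [0.1, 0.6, 0.3]]
grad = [[-0.5, 0.1, 0.0],
        [0.2, -1.0, 0.4]]

ig = ig_matrix(attn, grad)
ig_sup = ig_score(ig, [0, 1])  # tokens 0 and 1 assumed to be supporting facts
ig_irr = ig_score(ig, [2])     # token 2 assumed to be irrelevant
```

The final score described above would additionally average `ig_score` over all heads and layers, which this sketch leaves out.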

Context Denoising Training

Critical Token Detection

Identifier calculation: $\mathbb{I}(x_i)= \begin{cases} 1, & \text{if } \|\nabla_{E_\phi(x_i)}\mathcal{L}_{CE}(x_i)\|_2<t\\ 0, & \text{if } \|\nabla_{E_\phi(x_i)}\mathcal{L}_{CE}(x_i)\|_2\geq t \end{cases}$

$$t=\frac{1}{n}\sum^{n}_{i=1}\|\nabla_{E_\phi(x_i)}\mathcal{L}_{CE}(x_i)\|_2$$

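where $\mathbb{I}(x_i)=1$ means that $x_i$ is an irrelevant token (noise), and $\mathbb{I}(x_i)=0$ means that $x_i$ is a critical token. The threshold $t$ is the mean gradient norm over all input tokens.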
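
A minimal sketch of this detection step, assuming the per-token gradients of the loss with respect to the input embeddings are already available as plain lists (in practice they would come from a backward pass). The function names and toy gradients are illustrative only.

```python
import math


def l2_norm(vec):
    """Euclidean norm of one token's embedding gradient."""
    return math.sqrt(sum(v * v for v in vec))


def detect_noise(token_grads):
    """Return (indicator, t): indicator[i] is 1 for noise, 0 for critical tokens.

    token_grads: one gradient vector per input token.
    The threshold t is the mean of the per-token gradient norms.
    """
    norms = [l2_norm(g) for g in token_grads]
    t = sum(norms) / len(norms)
    indicator = [1 if norm < t else 0 for norm in norms]
    return indicator, t


# Four tokens with 2-dimensional embedding gradients (made-up values).
grads = [[3.0, 4.0], [0.0, 1.0], [1.0, 0.0], [0.0, 5.0]]
indicator, t = detect_noise(grads)
```

Here the norms are 5, 1, 1 and 5, so the threshold is 3 and the two middle tokens are flagged as noise.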

Emphasizing Training

Denoised token embedding:

$$E_\phi(x_i)'=E_\phi(x_i)-\mathbb{I}(x_i)\,\nabla_{E_\phi(x_i)}\mathcal{L}_{CE}(x_i)\times lr\times\beta$$

These denoised embeddings are used for further training, with the loss function formulated as follows:

$$\mathcal{L}_{CDT}(X,Y)=\mathcal{L}_{CE}(f_\theta(E_\phi(X)'),Y)$$
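
The embedding update can be sketched as below, assuming the noise indicator and the per-token embedding gradients have already been computed. Only tokens flagged as noise are shifted; critical tokens keep their original embeddings. The learning rate of 0.1 is a hypothetical value chosen for readability, while $\beta = 5$ follows the configuration in these notes. The second forward pass that would produce $\mathcal{L}_{CDT}$ from the denoised embeddings is not shown.

```python
def denoise_embeddings(embeddings, grads, indicator, lr, beta):
    """E'(x_i) = E(x_i) - I(x_i) * grad_i * lr * beta, applied per token.

    embeddings, grads: one vector per token (same shapes).
    indicator:         1 for noise tokens, 0 for critical tokens.
    """
    return [
        [e - flag * g * lr * beta for e, g in zip(emb, grad)]
        for emb, grad, flag in zip(embeddings, grads, indicator)
    ]


# Two tokens: the first is critical (0), the second is noise (1).
embeddings = [[1.0, 2.0], [0.5, -0.5]]
grads = [[3.0, 4.0], [0.0, 1.0]]
indicator = [0, 1]

denoised = denoise_embeddings(embeddings, grads, indicator, lr=0.1, beta=5)
```

In this toy case the first embedding is unchanged, and the second moves by `lr * beta = 0.5` times its gradient.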

Experiment

Setup: 8 × 92GB H20 GPUs, with a configured sequence length of up to 12K

Tasks trained on:

Real world tasks
Language modeling task
Long-form reasoning task
Synthetic task

Models used:

Short-context models (SCM): Llama-3-8B-Base (64K context length)
Long-context models (LCM):
- Llama-3.1-8B-Base
- Llama-3.1-8B-Instruct

Dataset Used:

Evaluation: LongBench-E, RULER
Context window scaling of SCM to LCM : PG-19 (64K tokens * 10000 samples)
Training: LCM-Instruct (sampled from LongMiT & LongAlpaca; over 8000 samples with context lengths from 16K to 128K)

Configuration:

$\beta$ : 5

Reference

[1] Z. Tang, B. Ji, J. Li, L. Wu, H. Gui, and M. Zhang, “Revisiting Long-context Modeling from Context Denoising Perspective,” Oct. 07, 2025, arXiv:2510.05862. doi: 10.48550/arXiv.2510.05862.
