FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution

1Islamic University of Technology 2University of Utah 3University of Liverpool 4University of Queensland
FrugalPrompt pipeline overview

tldr: FrugalPrompt retains only the most semantically significant tokens in LLM prompts, reducing costs and latency while preserving task performance.

Abstract

Human communication relies heavily on laconism and inferential pragmatics: listeners routinely reconstruct rich meaning from sparse, telegraphic speech. In contrast, large language models (LLMs) owe much of their strong performance to expansive input contexts, yet such verbosity inflates monetary cost, carbon footprint, and inference-time latency. This overhead stems from the redundant, low-utility tokens present in typical prompts, since only a fraction of the tokens carries most of the semantic weight. Inspired by these cognitive psycholinguistic processes, we address this inefficiency by introducing FrugalPrompt, a novel prompt compression framework for LLMs that retains only the most semantically significant tokens. Leveraging two state-of-the-art token attribution methods, GlobEnc and DecompX, we assign a salience score to every token in an input sequence, rank the tokens to retain the top-$k\%$, and obtain a sparse, frugalized prompt. We establish the theoretical stability of our approach and present empirical results across a suite of four NLP tasks to study the trade-off between the proportion of retained tokens and task performance. Experimental findings across retention settings reveal asymmetric performance patterns that suggest potential task contamination effects. We posit that our work contributes to a more nuanced understanding of LLM behavior in performance-efficiency trade-offs and delineates the boundary between tasks tolerant of contextual sparsity and those requiring exhaustive context.

Key Contributions

  • A novel, training-free prompt frugalization strategy for LLMs that controllably filters low-importance tokens based on saliency scores from pre-trained encoders.
  • A theoretically grounded method with empirical results showing strong retention–performance trade-offs. Across four NLP tasks, a 20% prompt reduction preserves performance for most models with negligible parameter overhead.
  • Behavioral insights into how scaling performance relates to inference cost and potential task contamination in benchmarks.

Methodology

Our pipeline (shown in the overview figure above) consists of three stages: token attribution, top-$k\%$ filtering, and inference. We obtain saliency scores using two state-of-the-art token attribution methods, GlobEnc and DecompX, both of which require only a lightweight 110M-parameter BERT encoder, a substantially lower overhead than prior compressor designs.

Ranking via Saliency Scores

We define a task-specific scoring function $\varphi_\tau \in \{\text{GlobEnc}, \text{DecompX}\}$ which assigns a saliency score to each token:

$$\mathbf{s} = \varphi_\tau(T) = \langle s_1, s_2, \dots, s_n \rangle, \quad s_i \in \mathbb{R}$$

Tokens are ranked in decreasing order of their attribution scores to obtain a permutation $\pi$ such that $s_{\pi(1)} \ge s_{\pi(2)} \ge \cdots \ge s_{\pi(n)}$.

Top-$k$ Filtering

We retain only the top $p = \lceil \frac{k}{100} \cdot n \rceil$ tokens by saliency and reorder them by their original positions to preserve temporal coherence:

$$F_k = \langle t_i \mid i \in S_{k\uparrow} \rangle$$

where $S_{k\uparrow}$ denotes the retained index set $\{\pi(1), \dots, \pi(p)\}$ sorted in increasing order.

Example: For the input "The movie was good, and I liked it very much", with attribution scores prioritizing content words (e.g., "movie": 0.95, "good": 0.90, "much": 0.85, "liked": 0.80), filtering the top 40% yields the frugalized output: "movie good liked much"—preserving semantic essence while discarding low-salience tokens.
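The ranking and filtering stages above can be sketched in a few lines of Python. The tokenization and saliency scores below are illustrative stand-ins for actual GlobEnc/DecompX output (punctuation is dropped for simplicity), and `frugalize` is a hypothetical helper, not the paper's released code:

```python
import math

def frugalize(tokens, scores, k):
    """Keep the top-k% tokens by saliency, preserving original order."""
    n = len(tokens)
    p = math.ceil(k / 100 * n)                 # number of tokens to retain
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:p])                  # restore temporal order
    return [tokens[i] for i in kept]

tokens = ["The", "movie", "was", "good", "and", "I", "liked", "it", "very", "much"]
# Illustrative saliency scores (stand-ins for GlobEnc/DecompX output):
scores = [0.10, 0.95, 0.15, 0.90, 0.10, 0.20, 0.80, 0.15, 0.30, 0.85]

print(frugalize(tokens, scores, 40))  # ['movie', 'good', 'liked', 'much']
```

The final `sorted(...)` over the retained indices is what restores the original word order, matching the "movie good liked much" example.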

Theoretical Bounds

We formalize the relationship between deleting tokens and the resulting change in task performance $f(D)$. Under mild assumptions (the harm of deleting any single token is at most $C_\tau$ times its saliency, and pairwise interactions between deleted tokens are bounded by $\gamma_\tau$), we prove:

Theorem (Deletion Bound). For any deletion set $D$ with $|D| = q$:

$$f(D) \le C_\tau \sum_{i \in D} s_i + \frac{\gamma_\tau}{2} q(q-1)$$

This establishes the stability of token deletion by upper-bounding the deletion effect through saliency scores and pairwise interaction terms.
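A minimal numeric sketch of the bound, using the illustrative saliency scores from earlier and hypothetical task constants $C_\tau$ and $\gamma_\tau$ (the paper does not report concrete values): when the deleted set consists of low-saliency tokens, the right-hand side stays small.

```python
def deletion_bound(scores, deleted, C_tau, gamma_tau):
    """Right-hand side of the Deletion Bound theorem:
    C_tau * sum of deleted saliencies + (gamma_tau / 2) * q * (q - 1)."""
    q = len(deleted)
    return C_tau * sum(scores[i] for i in deleted) + 0.5 * gamma_tau * q * (q - 1)

# Illustrative saliency scores and hypothetical task constants:
scores = [0.10, 0.95, 0.15, 0.90, 0.10, 0.20, 0.80, 0.15, 0.30, 0.85]
C_tau, gamma_tau = 1.0, 0.01

# Deleting the six lowest-saliency tokens (what FrugalPrompt discards at k = 40):
low = sorted(range(len(scores)), key=lambda i: scores[i])[:6]
print(deletion_bound(scores, low, C_tau, gamma_tau))  # 1.15
```

Deleting high-saliency tokens instead would inflate the first term, which is exactly how the bound separates safe deletions from harmful ones.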

Experiments and Analysis

We evaluate across four NLP tasks: sentiment analysis (IMDb), summarization (Argilla News), commonsense QA (CosmosQA), and mathematical reasoning (GSM8k). These span both discriminative and generative paradigms, enabling evaluation under diverse linguistic and cognitive demands.

| Model | Attribution | $k\%$ | CLS Acc | CLS F1 | BLEU | R-1 | R-2 | R-L | BERT | METEOR | QA Acc | pass@1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3 8B | N/A | 100 | 0.949 | 0.949 | 0.020 | 0.232 | 0.073 | 0.193 | 0.872 | 0.335 | 0.813 | 0.786 |
| | GlobEnc | 80 | 0.942 | 0.942 | 0.017 | 0.226 | 0.069 | 0.189 | 0.871 | 0.330 | 0.798 | 0.500 |
| | | 60 | 0.942 | 0.942 | 0.015 | 0.220 | 0.064 | 0.181 | 0.870 | 0.319 | 0.754 | 0.243 |
| | | 50 | 0.921 | 0.921 | 0.013 | 0.213 | 0.057 | 0.175 | 0.868 | 0.303 | 0.716 | 0.146 |
| | DecompX | 80 | 0.936 | 0.936 | 0.015 | 0.218 | 0.061 | 0.179 | 0.869 | 0.311 | 0.747 | 0.652 |
| | | 60 | 0.900 | 0.900 | 0.014 | 0.206 | 0.055 | 0.169 | 0.866 | 0.286 | 0.681 | 0.425 |
| | | 50 | 0.868 | 0.868 | 0.012 | 0.198 | 0.048 | 0.161 | 0.864 | 0.270 | 0.670 | 0.265 |
| Llama-3 70B | N/A | 100 | 0.953 | 0.953 | 0.020 | 0.235 | 0.073 | 0.196 | 0.876 | 0.333 | 0.874 | 0.919 |
| | GlobEnc | 80 | 0.948 | 0.948 | 0.018 | 0.235 | 0.071 | 0.195 | 0.875 | 0.331 | 0.862 | 0.669 |
| | | 60 | 0.943 | 0.943 | 0.017 | 0.231 | 0.068 | 0.192 | 0.874 | 0.329 | 0.838 | 0.362 |
| | | 50 | 0.938 | 0.938 | 0.016 | 0.228 | 0.065 | 0.189 | 0.873 | 0.316 | 0.816 | 0.231 |
| | DecompX | 80 | 0.949 | 0.949 | 0.017 | 0.226 | 0.065 | 0.187 | 0.870 | 0.316 | 0.839 | 0.818 |
| | | 60 | 0.891 | 0.891 | 0.015 | 0.215 | 0.057 | 0.176 | 0.868 | 0.292 | 0.797 | 0.587 |
| | | 50 | 0.839 | 0.837 | 0.014 | 0.209 | 0.054 | 0.171 | 0.866 | 0.281 | 0.770 | 0.409 |
| GPT-3.5 | N/A | 100 | 0.949 | 0.949 | 0.039 | 0.282 | 0.093 | 0.237 | 0.889 | 0.359 | 0.779 | 0.772 |
| | GlobEnc | 80 | 0.945 | 0.945 | 0.017 | 0.176 | 0.056 | 0.146 | 0.874 | 0.291 | 0.753 | 0.498 |
| | | 60 | 0.925 | 0.925 | 0.015 | 0.172 | 0.052 | 0.141 | 0.874 | 0.285 | 0.716 | 0.264 |
| | | 50 | 0.918 | 0.918 | 0.014 | 0.170 | 0.049 | 0.140 | 0.873 | 0.278 | 0.705 | 0.158 |
| | DecompX | 80 | 0.942 | 0.942 | 0.036 | 0.268 | 0.084 | 0.225 | 0.888 | 0.337 | 0.704 | 0.660 |
| | | 60 | 0.724 | 0.704 | 0.031 | 0.253 | 0.073 | 0.210 | 0.885 | 0.311 | 0.648 | 0.419 |
| | | 50 | 0.642 | 0.595 | 0.027 | 0.241 | 0.065 | 0.200 | 0.882 | 0.291 | 0.619 | 0.288 |
| Gemini 2.0 Flash Thinking | N/A | 100 | 0.952 | 0.952 | 0.034 | 0.262 | 0.081 | 0.219 | 0.885 | 0.345 | 0.880 | 0.956 |
| | GlobEnc | 80 | 0.947 | 0.947 | 0.031 | 0.252 | 0.081 | 0.212 | 0.882 | 0.344 | 0.879 | 0.704 |
| | | 60 | 0.934 | 0.934 | 0.029 | 0.247 | 0.077 | 0.208 | 0.881 | 0.335 | 0.846 | 0.423 |
| | | 50 | 0.920 | 0.919 | 0.026 | 0.239 | 0.071 | 0.199 | 0.879 | 0.322 | 0.827 | 0.277 |
| | DecompX | 80 | 0.855 | 0.852 | 0.031 | 0.251 | 0.075 | 0.209 | 0.883 | 0.328 | 0.856 | 0.856 |
| | | 60 | 0.713 | 0.690 | 0.028 | 0.236 | 0.068 | 0.194 | 0.878 | 0.307 | 0.795 | 0.665 |
| | | 50 | 0.627 | 0.571 | 0.024 | 0.225 | 0.059 | 0.185 | 0.876 | 0.283 | 0.774 | 0.463 |
| o3-mini | N/A | 100 | 0.957 | 0.957 | 0.023 | 0.221 | 0.065 | 0.182 | 0.860 | 0.297 | 0.845 | 0.961 |
| | GlobEnc | 80 | 0.956 | 0.956 | 0.020 | 0.216 | 0.060 | 0.176 | 0.859 | 0.290 | 0.826 | 0.724 |
| | | 60 | 0.941 | 0.941 | 0.019 | 0.212 | 0.059 | 0.173 | 0.858 | 0.282 | 0.802 | 0.462 |
| | | 50 | 0.932 | 0.932 | 0.018 | 0.204 | 0.055 | 0.166 | 0.857 | 0.272 | 0.785 | 0.332 |
| | DecompX | 80 | 0.842 | 0.839 | 0.020 | 0.208 | 0.056 | 0.170 | 0.858 | 0.273 | 0.787 | 0.850 |
| | | 60 | 0.727 | 0.707 | 0.018 | 0.195 | 0.049 | 0.159 | 0.854 | 0.253 | 0.724 | 0.679 |
| | | 50 | 0.641 | 0.593 | 0.017 | 0.187 | 0.045 | 0.152 | 0.853 | 0.236 | 0.686 | 0.533 |

Table 1. Impact of the two FrugalPrompt variants retaining $k\%$ tokens on baseline LLM performance across text classification (CLS: Acc, F1), summarization (SUM: BLEU, ROUGE-1/2/L, BERTScore, METEOR), question answering (QA: Acc), and reasoning (RSN: pass@1). Rows with $k\% = 100$ are the uncompressed baselines.

Performance across Tasks

Performance degradation becomes more pronounced as the retention threshold decreases, particularly below 60% token retention. At 50% retention, sentiment classification accuracy drops by 5–10% and commonsense QA accuracy by 10–15%, while summarization metrics like ROUGE-1 decline by 20–25%. Mathematical reasoning exhibits the steepest fall, with pass@1 scores plummeting below 15% at lower thresholds, reflecting loss of critical numerical and logical connectors.

Performance difference for classification, QA, and reasoning tasks

Figure 2. Performance difference between reduced tokens and baseline (100% tokens) across classification (CLS), question-answering (QA), and reasoning (RSN) tasks.

Performance difference for summarization tasks

Figure 3. Performance difference between reduced tokens and baseline (100% tokens) across summarization metrics.

Choice of Attribution Method

For classification and QA tasks, GlobEnc outperforms DecompX across all metrics, whereas for reasoning, DecompX outperforms GlobEnc. DecompX's finer-grained, subword-level attribution, which also decomposes linear activations, may retain more of the tokens critical for reasoning, but the resulting token selections appear to overwhelm models on other tasks. For summarization, GlobEnc usually outperforms DecompX, except with GPT-3.5.

Performance vs. Cost

| Model | Cost / 1M Input Tokens ($) | Cost / 1M Output Tokens ($) |
|---|---|---|
| Llama-3 8B | 0.03 | 0.06 |
| Llama-3 70B | 0.30 | 0.40 |
| GPT-3.5 | 0.50 | 1.50 |
| Gemini-2.0 FT | 0.10 | 0.40 |
| o3-mini | 1.10 | 4.40 |

Table 2. Cost comparison of models per 1 million tokens.

For sentiment analysis and question answering, a reduced token set generally maintains performance, with a gentler decline as cost decreases. However, token reduction sharply degrades reasoning performance. Token reduction is therefore more practical for larger, costlier models such as o3-mini; for smaller models, the cost savings are negligible relative to the performance degradation.
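The input-side savings implied by Table 2 can be estimated with a small sketch, assuming linear per-token pricing (the dictionary keys and the `input_saving` helper are illustrative; output-token costs are unaffected by prompt compression):

```python
# Input-token prices from Table 2, in $ per 1M input tokens.
input_cost = {
    "Llama-3 8B": 0.03, "Llama-3 70B": 0.30, "GPT-3.5": 0.50,
    "Gemini-2.0 FT": 0.10, "o3-mini": 1.10,
}

def input_saving(model, n_tokens, k):
    """Dollars saved on input tokens when retaining k% of an n_tokens prompt."""
    full = input_cost[model] * n_tokens / 1e6
    return full * (1 - k / 100)

# Savings on 1M input tokens at 80% retention:
for model in input_cost:
    print(f"{model}: ${input_saving(model, 1_000_000, 80):.3f} saved")
```

At 80% retention, o3-mini saves $0.22 per million input tokens while Llama-3 8B saves only $0.006, which is why compression pays off mainly for the costlier models.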

Random & Bottom-$k$ Tokens

We select random $k\%$ and bottom $k\%$ tokens (based on attribution scores) while preserving text order to establish baselines against our method. Classification, summarization, and QA exhibit strong performance retention even with random/bottom tokens, suggesting potential task contamination. Reasoning tasks experience a sharper decline, indicating genuine reliance on contextual information.
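The two control conditions can be sketched as follows; `baseline_subset` is a hypothetical helper (not the paper's code), and the tokens and scores are the same illustrative values used earlier:

```python
import math
import random

def baseline_subset(tokens, scores, k, mode, seed=0):
    """Retain k% of tokens, order-preserving, for the two ablation baselines:
    'random' draws uniformly at random; 'bottom' keeps the lowest-saliency tokens."""
    n = len(tokens)
    p = math.ceil(k / 100 * n)
    if mode == "random":
        kept = random.Random(seed).sample(range(n), p)
    elif mode == "bottom":
        kept = sorted(range(n), key=lambda i: scores[i])[:p]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [tokens[i] for i in sorted(kept)]  # restore original order

tokens = ["The", "movie", "was", "good", "and", "I", "liked", "it", "very", "much"]
scores = [0.10, 0.95, 0.15, 0.90, 0.10, 0.20, 0.80, 0.15, 0.30, 0.85]

print(baseline_subset(tokens, scores, 40, "bottom"))  # ['The', 'was', 'and', 'it']
```

The bottom-$k$ selection keeps exactly the tokens FrugalPrompt would discard, so comparing the two isolates how much of the task signal the saliency ranking actually captures.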

| Type | Attribution | $k\%$ | CLS Acc | CLS F1 | BLEU | R-1 | R-2 | R-L | BERT | METEOR | QA Acc | pass@1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| N/A | N/A | 100 | 0.955 | 0.955 | 0.039 | 0.282 | 0.093 | 0.237 | 0.889 | 0.359 | 0.880 | 0.961 |
| Random | GlobEnc | 80 | 0.951 | 0.951 | 0.030 | 0.261 | 0.078 | 0.217 | 0.886 | 0.329 | 0.850 | 0.373 |
| | | 60 | 0.893 | 0.893 | 0.022 | 0.240 | 0.064 | 0.197 | 0.883 | 0.293 | 0.761 | 0.103 |
| | | 50 | 0.886 | 0.886 | 0.020 | 0.223 | 0.053 | 0.182 | 0.880 | 0.267 | 0.713 | 0.058 |
| | DecompX | 80 | 0.942 | 0.942 | 0.033 | 0.259 | 0.080 | 0.218 | 0.887 | 0.328 | 0.837 | 0.381 |
| | | 60 | 0.901 | 0.901 | 0.024 | 0.237 | 0.064 | 0.196 | 0.882 | 0.289 | 0.779 | 0.114 |
| | | 50 | 0.893 | 0.893 | 0.022 | 0.223 | 0.055 | 0.182 | 0.879 | 0.262 | 0.734 | 0.011 |
| Bottom | GlobEnc | 80 | 0.914 | 0.914 | 0.026 | 0.234 | 0.067 | 0.195 | 0.883 | 0.289 | 0.764 | 0.040 |
| | | 60 | 0.760 | 0.759 | 0.012 | 0.139 | 0.030 | 0.117 | 0.865 | 0.158 | 0.672 | 0.028 |
| | | 50 | 0.626 | 0.616 | 0.007 | 0.084 | 0.012 | 0.072 | 0.850 | 0.091 | 0.625 | 0.014 |
| | DecompX | 80 | 0.857 | 0.854 | 0.031 | 0.247 | 0.073 | 0.205 | 0.884 | 0.302 | 0.809 | 0.045 |
| | | 60 | 0.755 | 0.740 | 0.022 | 0.208 | 0.053 | 0.170 | 0.876 | 0.249 | 0.714 | 0.021 |
| | | 50 | 0.696 | 0.666 | 0.020 | 0.192 | 0.047 | 0.156 | 0.873 | 0.225 | 0.670 | 0.014 |

Best-performing model per task: CLS and RSN use o3-mini, SUM uses GPT-3.5, QA uses Gemini-2.0 FT.

Table 3. Impact of retaining random and bottom $k\%$ tokens using the best-performing model for each task.

BibTeX

@article{raiyan2025frugalprompt,
  title={Frugalprompt: Reducing contextual overhead in large language models via token attribution},
  author={Raiyan, Syed Rifat and Ishmam, Md Farhan and Imran, Abdullah Al and Moni, Mohammad Ali},
  journal={arXiv preprint arXiv:2510.16439},
  year={2025}
}