FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution

1Islamic University of Technology 2University of Utah 3University of Liverpool 4University of Queensland
FrugalPrompt pipeline overview

tldr: FrugalPrompt retains only the most semantically significant tokens in LLM prompts, reducing costs and latency while preserving task performance.

Abstract

Human communication relies heavily on laconism and inferential pragmatics: listeners routinely reconstruct rich meaning from sparse, telegraphic speech. In contrast, large language models (LLMs) owe much of their strong performance to expansive input contexts, yet such verbosity inflates monetary cost, carbon footprint, and inference-time latency. This overhead stems from the redundant, low-utility tokens present in typical prompts, since only a fraction of the tokens carries most of the semantic weight. Inspired by these cognitive psycholinguistic processes, we address this inefficiency by introducing FrugalPrompt, a novel prompt compression framework for LLMs that retains only the most semantically significant tokens. Leveraging two state-of-the-art token attribution methods, GlobEnc and DecompX, we assign a salience score to every token in an input sequence, rank the tokens to retain the top-$k\%$, and obtain a sparse, frugalized prompt. We establish the theoretical stability of our approach and present empirical results across a suite of four NLP tasks to study the trade-off between the proportion of retained tokens and task performance. Experimental findings across retention settings reveal asymmetric performance patterns that suggest potential task contamination effects. We posit that our work contributes to a more nuanced understanding of LLM behavior in performance-efficiency trade-offs and delineates the boundary between tasks tolerant of contextual sparsity and those requiring exhaustive context.

Key Contributions

  • A novel, training-free prompt frugalization strategy for LLMs that controllably filters low-importance tokens based on saliency scores from pre-trained encoders.
  • A theoretically grounded method with empirical results showing strong retention–performance trade-offs. Across four NLP tasks, a 20% prompt reduction preserves performance for most models with negligible parameter overhead.
  • Behavioral insights into how scaling performance relates to inference cost and potential task contamination in benchmarks.

Methodology

Our pipeline (shown in the overview figure above) consists of three stages: token attribution, top-$k\%$ filtering, and inference. We obtain saliency scores using two state-of-the-art token attribution methods, GlobEnc and DecompX, both of which require only a lightweight 110M-parameter BERT encoder, a substantially lower overhead than prior compressor designs.

Ranking via Saliency Scores

We define a task-specific scoring function $\varphi_\tau \in \{\text{GlobEnc}, \text{DecompX}\}$ which assigns a saliency score to each token:

$$\mathbf{s} = \varphi_\tau(T) = \langle s_1, s_2, \dots, s_n \rangle, \quad s_i \in \mathbb{R}$$

Tokens are ranked in decreasing order of their attribution scores to obtain a permutation $\pi$ such that $s_{\pi(1)} \ge s_{\pi(2)} \ge \cdots \ge s_{\pi(n)}$.

Top-$k$ Filtering

We retain only the top $p = \lceil \frac{k}{100} \cdot n \rceil$ tokens by saliency and reorder them by their original positions to preserve temporal coherence:

$$F_k = \langle t_i \mid i \in S_{k\uparrow} \rangle$$

where $S_{k\uparrow}$ denotes the retained index set $\{\pi(1), \dots, \pi(p)\}$ sorted in increasing order.

Example: For the input "The movie was good, and I liked it very much", with attribution scores prioritizing content words (e.g., "movie": 0.95, "good": 0.90, "much": 0.85, "liked": 0.80), filtering the top 40% yields the frugalized output: "movie good liked much"—preserving semantic essence while discarding low-salience tokens.
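The ranking and filtering stages above can be sketched in a few lines of Python. The tokenization and saliency scores below are illustrative stand-ins for actual GlobEnc/DecompX output (punctuation is dropped for simplicity), and `frugalize` is a hypothetical helper, not the paper's released code:

```python
import math

def frugalize(tokens, scores, k):
    """Keep the top-k% tokens by saliency, preserving original order."""
    n = len(tokens)
    p = math.ceil(k / 100 * n)                 # number of tokens to retain
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:p])                  # restore temporal order
    return [tokens[i] for i in kept]

tokens = ["The", "movie", "was", "good", "and", "I", "liked", "it", "very", "much"]
# Illustrative saliency scores (stand-ins for GlobEnc/DecompX output):
scores = [0.10, 0.95, 0.15, 0.90, 0.10, 0.20, 0.80, 0.15, 0.30, 0.85]

print(frugalize(tokens, scores, 40))  # ['movie', 'good', 'liked', 'much']
```

The final `sorted(...)` over the retained indices is what restores the original word order, matching the "movie good liked much" example.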

Theoretical Bounds

We formalize the relationship between deleting tokens and the resulting change in task performance $f(D)$. Under mild assumptions (the harm of deleting any single token is at most $C_\tau$ times its saliency, and pairwise interactions between deleted tokens are bounded by $\gamma_\tau$), we prove:

Theorem (Deletion Bound). For any deletion set $D$ with $|D| = q$:

$$f(D) \le C_\tau \sum_{i \in D} s_i + \frac{\gamma_\tau}{2} q(q-1)$$

This establishes the stability of token deletion by upper-bounding the deletion effect through saliency scores and pairwise interaction terms.
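A minimal numeric sketch of the bound, using the illustrative saliency scores from earlier and hypothetical task constants $C_\tau$ and $\gamma_\tau$ (the paper does not report concrete values): when the deleted set consists of low-saliency tokens, the right-hand side stays small.

```python
def deletion_bound(scores, deleted, C_tau, gamma_tau):
    """Right-hand side of the Deletion Bound theorem:
    C_tau * sum of deleted saliencies + (gamma_tau / 2) * q * (q - 1)."""
    q = len(deleted)
    return C_tau * sum(scores[i] for i in deleted) + 0.5 * gamma_tau * q * (q - 1)

# Illustrative saliency scores and hypothetical task constants:
scores = [0.10, 0.95, 0.15, 0.90, 0.10, 0.20, 0.80, 0.15, 0.30, 0.85]
C_tau, gamma_tau = 1.0, 0.01

# Deleting the six lowest-saliency tokens (what FrugalPrompt discards at k = 40):
low = sorted(range(len(scores)), key=lambda i: scores[i])[:6]
print(deletion_bound(scores, low, C_tau, gamma_tau))  # 1.15
```

Deleting high-saliency tokens instead would inflate the first term, which is exactly how the bound separates safe deletions from harmful ones.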

Experiments and Analysis

We evaluate across four NLP tasks: sentiment analysis (IMDb), summarization (Argilla News), commonsense QA (CosmosQA), and mathematical reasoning (GSM8k). These span both discriminative and generative paradigms, enabling evaluation under diverse linguistic and cognitive demands.

| Model | Attribution | $k\%$ | CLS Acc | CLS F1 | BLEU | R-1 | R-2 | R-L | BERT | METEOR | QA Acc | pass@1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3 8B | N/A | 100 | 0.949 | 0.949 | 0.020 | 0.232 | 0.073 | 0.193 | 0.872 | 0.335 | 0.813 | 0.786 |
| | GlobEnc | 80 | 0.942 | 0.942 | 0.017 | 0.226 | 0.069 | 0.189 | 0.871 | 0.330 | 0.798 | 0.500 |
| | | 60 | 0.942 | 0.942 | 0.015 | 0.220 | 0.064 | 0.181 | 0.870 | 0.319 | 0.754 | 0.243 |
| | | 50 | 0.921 | 0.921 | 0.013 | 0.213 | 0.057 | 0.175 | 0.868 | 0.303 | 0.716 | 0.146 |
| | DecompX | 80 | 0.936 | 0.936 | 0.015 | 0.218 | 0.061 | 0.179 | 0.869 | 0.311 | 0.747 | 0.652 |
| | | 60 | 0.900 | 0.900 | 0.014 | 0.206 | 0.055 | 0.169 | 0.866 | 0.286 | 0.681 | 0.425 |
| | | 50 | 0.868 | 0.868 | 0.012 | 0.198 | 0.048 | 0.161 | 0.864 | 0.270 | 0.670 | 0.265 |
| Llama-3 70B | N/A | 100 | 0.953 | 0.953 | 0.020 | 0.235 | 0.073 | 0.196 | 0.876 | 0.333 | 0.874 | 0.919 |
| | GlobEnc | 80 | 0.948 | 0.948 | 0.018 | 0.235 | 0.071 | 0.195 | 0.875 | 0.331 | 0.862 | 0.669 |
| | | 60 | 0.943 | 0.943 | 0.017 | 0.231 | 0.068 | 0.192 | 0.874 | 0.329 | 0.838 | 0.362 |
| | | 50 | 0.938 | 0.938 | 0.016 | 0.228 | 0.065 | 0.189 | 0.873 | 0.316 | 0.816 | 0.231 |
| | DecompX | 80 | 0.949 | 0.949 | 0.017 | 0.226 | 0.065 | 0.187 | 0.870 | 0.316 | 0.839 | 0.818 |
| | | 60 | 0.891 | 0.891 | 0.015 | 0.215 | 0.057 | 0.176 | 0.868 | 0.292 | 0.797 | 0.587 |
| | | 50 | 0.839 | 0.837 | 0.014 | 0.209 | 0.054 | 0.171 | 0.866 | 0.281 | 0.770 | 0.409 |
| GPT-3.5 | N/A | 100 | 0.949 | 0.949 | 0.039 | 0.282 | 0.093 | 0.237 | 0.889 | 0.359 | 0.779 | 0.772 |
| | GlobEnc | 80 | 0.945 | 0.945 | 0.017 | 0.176 | 0.056 | 0.146 | 0.874 | 0.291 | 0.753 | 0.498 |
| | | 60 | 0.925 | 0.925 | 0.015 | 0.172 | 0.052 | 0.141 | 0.874 | 0.285 | 0.716 | 0.264 |
| | | 50 | 0.918 | 0.918 | 0.014 | 0.170 | 0.049 | 0.140 | 0.873 | 0.278 | 0.705 | 0.158 |
| | DecompX | 80 | 0.942 | 0.942 | 0.036 | 0.268 | 0.084 | 0.225 | 0.888 | 0.337 | 0.704 | 0.660 |
| | | 60 | 0.724 | 0.704 | 0.031 | 0.253 | 0.073 | 0.210 | 0.885 | 0.311 | 0.648 | 0.419 |
| | | 50 | 0.642 | 0.595 | 0.027 | 0.241 | 0.065 | 0.200 | 0.882 | 0.291 | 0.619 | 0.288 |
| Gemini 2.0 Flash Thinking | N/A | 100 | 0.952 | 0.952 | 0.034 | 0.262 | 0.081 | 0.219 | 0.885 | 0.345 | 0.880 | 0.956 |
| | GlobEnc | 80 | 0.947 | 0.947 | 0.031 | 0.252 | 0.081 | 0.212 | 0.882 | 0.344 | 0.879 | 0.704 |
| | | 60 | 0.934 | 0.934 | 0.029 | 0.247 | 0.077 | 0.208 | 0.881 | 0.335 | 0.846 | 0.423 |
| | | 50 | 0.920 | 0.919 | 0.026 | 0.239 | 0.071 | 0.199 | 0.879 | 0.322 | 0.827 | 0.277 |
| | DecompX | 80 | 0.855 | 0.852 | 0.031 | 0.251 | 0.075 | 0.209 | 0.883 | 0.328 | 0.856 | 0.856 |
| | | 60 | 0.713 | 0.690 | 0.028 | 0.236 | 0.068 | 0.194 | 0.878 | 0.307 | 0.795 | 0.665 |
| | | 50 | 0.627 | 0.571 | 0.024 | 0.225 | 0.059 | 0.185 | 0.876 | 0.283 | 0.774 | 0.463 |
| o3-mini | N/A | 100 | 0.957 | 0.957 | 0.023 | 0.221 | 0.065 | 0.182 | 0.860 | 0.297 | 0.845 | 0.961 |
| | GlobEnc | 80 | 0.956 | 0.956 | 0.020 | 0.216 | 0.060 | 0.176 | 0.859 | 0.290 | 0.826 | 0.724 |
| | | 60 | 0.941 | 0.941 | 0.019 | 0.212 | 0.059 | 0.173 | 0.858 | 0.282 | 0.802 | 0.462 |
| | | 50 | 0.932 | 0.932 | 0.018 | 0.204 | 0.055 | 0.166 | 0.857 | 0.272 | 0.785 | 0.332 |
| | DecompX | 80 | 0.842 | 0.839 | 0.020 | 0.208 | 0.056 | 0.170 | 0.858 | 0.273 | 0.787 | 0.850 |
| | | 60 | 0.727 | 0.707 | 0.018 | 0.195 | 0.049 | 0.159 | 0.854 | 0.253 | 0.724 | 0.679 |
| | | 50 | 0.641 | 0.593 | 0.017 | 0.187 | 0.045 | 0.152 | 0.853 | 0.236 | 0.686 | 0.533 |

Table 1. Impact of the two FrugalPrompt variants retaining $k\%$ tokens on baseline LLM performance across text classification (CLS: Acc, F1), summarization (SUM: BLEU, ROUGE-1/2/L, BERTScore, METEOR), question answering (QA: Acc), and reasoning (RSN: pass@1). Rows with $k\% = 100$ are the uncompressed baselines.

Performance across Tasks

Performance degradation becomes more pronounced as the retention threshold decreases, particularly below 60% token retention. At 50% retention, sentiment classification accuracy drops by 5–10% and commonsense QA accuracy by 10–15%, while summarization metrics like ROUGE-1 decline by 20–25%. Mathematical reasoning exhibits the steepest fall, with pass@1 scores plummeting below 15% at lower thresholds, reflecting loss of critical numerical and logical connectors.

Performance difference for classification, QA, and reasoning tasks

Figure 2. Performance difference between reduced tokens and baseline (100% tokens) across classification (CLS), question-answering (QA), and reasoning (RSN) tasks.

Performance difference for summarization tasks

Figure 3. Performance difference between reduced tokens and baseline (100% tokens) across summarization metrics.

Choice of Attribution Method

For classification and QA tasks, GlobEnc outperforms DecompX across all metrics, whereas for reasoning, DecompX outperforms GlobEnc. DecompX's finer-grained, subword-level attribution, which also decomposes linear activations, may retain more of the tokens critical for reasoning, but the resulting token selections appear to overwhelm models on other tasks. For summarization, GlobEnc usually outperforms DecompX, except with GPT-3.5.

Performance vs. Cost

| Model | Cost / 1M Input Tokens ($) | Cost / 1M Output Tokens ($) |
|---|---|---|
| Llama-3 8B | 0.03 | 0.06 |
| Llama-3 70B | 0.30 | 0.40 |
| GPT-3.5 | 0.50 | 1.50 |
| Gemini-2.0 FT | 0.10 | 0.40 |
| o3-mini | 1.10 | 4.40 |

Table 2. Cost comparison of models per 1 million tokens.

For sentiment analysis and question answering, a reduced token set generally maintains performance, with a gentler decline as cost decreases. However, token reduction sharply degrades reasoning performance. Token reduction is therefore more practical for larger, costlier models such as o3-mini; for smaller models, the cost savings are negligible relative to the performance degradation.
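The input-side savings implied by Table 2 can be estimated with a small sketch, assuming linear per-token pricing (the dictionary keys and the `input_saving` helper are illustrative; output-token costs are unaffected by prompt compression):

```python
# Input-token prices from Table 2, in $ per 1M input tokens.
input_cost = {
    "Llama-3 8B": 0.03, "Llama-3 70B": 0.30, "GPT-3.5": 0.50,
    "Gemini-2.0 FT": 0.10, "o3-mini": 1.10,
}

def input_saving(model, n_tokens, k):
    """Dollars saved on input tokens when retaining k% of an n_tokens prompt."""
    full = input_cost[model] * n_tokens / 1e6
    return full * (1 - k / 100)

# Savings on 1M input tokens at 80% retention:
for model in input_cost:
    print(f"{model}: ${input_saving(model, 1_000_000, 80):.3f} saved")
```

At 80% retention, o3-mini saves $0.22 per million input tokens while Llama-3 8B saves only $0.006, which is why compression pays off mainly for the costlier models.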

Random & Bottom-$k$ Tokens

We select random $k\%$ and bottom $k\%$ tokens (based on attribution scores) while preserving text order to establish baselines against our method. Classification, summarization, and QA exhibit strong performance retention even with random/bottom tokens, suggesting potential task contamination. Reasoning tasks experience a sharper decline, indicating genuine reliance on contextual information.
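The two control conditions can be sketched as follows; `baseline_subset` is a hypothetical helper (not the paper's code), and the tokens and scores are the same illustrative values used earlier:

```python
import math
import random

def baseline_subset(tokens, scores, k, mode, seed=0):
    """Retain k% of tokens, order-preserving, for the two ablation baselines:
    'random' draws uniformly at random; 'bottom' keeps the lowest-saliency tokens."""
    n = len(tokens)
    p = math.ceil(k / 100 * n)
    if mode == "random":
        kept = random.Random(seed).sample(range(n), p)
    elif mode == "bottom":
        kept = sorted(range(n), key=lambda i: scores[i])[:p]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [tokens[i] for i in sorted(kept)]  # restore original order

tokens = ["The", "movie", "was", "good", "and", "I", "liked", "it", "very", "much"]
scores = [0.10, 0.95, 0.15, 0.90, 0.10, 0.20, 0.80, 0.15, 0.30, 0.85]

print(baseline_subset(tokens, scores, 40, "bottom"))  # ['The', 'was', 'and', 'it']
```

The bottom-$k$ selection keeps exactly the tokens FrugalPrompt would discard, so comparing the two isolates how much of the task signal the saliency ranking actually captures.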

| Type | Attribution | $k\%$ | CLS Acc | CLS F1 | BLEU | R-1 | R-2 | R-L | BERT | METEOR | QA Acc | pass@1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| N/A | N/A | 100 | 0.955 | 0.955 | 0.039 | 0.282 | 0.093 | 0.237 | 0.889 | 0.359 | 0.880 | 0.961 |
| Random | GlobEnc | 80 | 0.951 | 0.951 | 0.030 | 0.261 | 0.078 | 0.217 | 0.886 | 0.329 | 0.850 | 0.373 |
| | | 60 | 0.893 | 0.893 | 0.022 | 0.240 | 0.064 | 0.197 | 0.883 | 0.293 | 0.761 | 0.103 |
| | | 50 | 0.886 | 0.886 | 0.020 | 0.223 | 0.053 | 0.182 | 0.880 | 0.267 | 0.713 | 0.058 |
| | DecompX | 80 | 0.942 | 0.942 | 0.033 | 0.259 | 0.080 | 0.218 | 0.887 | 0.328 | 0.837 | 0.381 |
| | | 60 | 0.901 | 0.901 | 0.024 | 0.237 | 0.064 | 0.196 | 0.882 | 0.289 | 0.779 | 0.114 |
| | | 50 | 0.893 | 0.893 | 0.022 | 0.223 | 0.055 | 0.182 | 0.879 | 0.262 | 0.734 | 0.011 |
| Bottom | GlobEnc | 80 | 0.914 | 0.914 | 0.026 | 0.234 | 0.067 | 0.195 | 0.883 | 0.289 | 0.764 | 0.040 |
| | | 60 | 0.760 | 0.759 | 0.012 | 0.139 | 0.030 | 0.117 | 0.865 | 0.158 | 0.672 | 0.028 |
| | | 50 | 0.626 | 0.616 | 0.007 | 0.084 | 0.012 | 0.072 | 0.850 | 0.091 | 0.625 | 0.014 |
| | DecompX | 80 | 0.857 | 0.854 | 0.031 | 0.247 | 0.073 | 0.205 | 0.884 | 0.302 | 0.809 | 0.045 |
| | | 60 | 0.755 | 0.740 | 0.022 | 0.208 | 0.053 | 0.170 | 0.876 | 0.249 | 0.714 | 0.021 |
| | | 50 | 0.696 | 0.666 | 0.020 | 0.192 | 0.047 | 0.156 | 0.873 | 0.225 | 0.670 | 0.014 |

Best-performing model per task: CLS and RSN use o3-mini, SUM uses GPT-3.5, QA uses Gemini-2.0 FT.

Table 3. Impact of retaining random and bottom $k\%$ tokens using the best-performing model for each task.

BibTeX

@article{raiyan2025frugalprompt,
  title={Frugalprompt: Reducing contextual overhead in large language models via token attribution},
  author={Raiyan, Syed Rifat and Ishmam, Md Farhan and Imran, Abdullah Al and Moni, Mohammad Ali},
  journal={arXiv preprint arXiv:2510.16439},
  year={2025}
}