AttributionBench: How Hard is Automatic Attribution Evaluation?

Updates

  • 2024/2/26: We have released the dataset and codebase for the paper. Check it out!
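As a quick start, below is a minimal sketch (not the official script) for loading and inspecting the released data; the Hugging Face identifier "osunlp/AttributionBench" and the exact split/field names are assumptions, so please check the repo and dataset card for the real ones.

# Minimal sketch for inspecting the released data. The dataset identifier and
# field names are assumptions -- see the repo / dataset card for the exact ones.
from datasets import load_dataset

dataset = load_dataset("osunlp/AttributionBench")  # hypothetical identifier

# Print split sizes and the fields of one example.
for split_name, split in dataset.items():
    print(split_name, len(split))

first_split = next(iter(dataset.values()))
print(list(first_split[0].keys()))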

Abstract

Modern generative search engines enhance the reliability of large language model (LLM) responses by providing cited evidence. However, evaluating the answer's attribution, i.e., whether every claim within the generated responses is fully supported by its cited evidence, remains an open problem. This verification, traditionally dependent on costly human evaluation, underscores the urgent need for automatic attribution evaluation methods. To bridge the gap in the absence of standardized benchmarks for these methods, we present AttributionBench, a comprehensive benchmark compiled from various existing attribution datasets. Our extensive experiments on AttributionBench reveal the challenges of automatic attribution evaluation, even for state-of-the-art LLMs. Specifically, our findings show that even a fine-tuned GPT-3.5 only achieves around 80% macro-F1 under a binary classification formulation. A detailed analysis of more than 300 error cases indicates that a majority of failures stem from the model's inability to process nuanced information, and from the discrepancy between the information accessible to the model and that accessible to human annotators.

Figure 1: The illustration of the attribution evaluation task and two typical error examples from AttributionBench generated by GPT-3.5 (w/ CoT). The references are usually manually extracted from webpages by human annotators based on what they think is useful.
Left: fine-grained information insensitivity (i.e., the model disregards or overlooks nuanced details in either the claim or the references, or fails to perform the necessary summarization or inference from the given references, tasks that humans do naturally).
Right: human-model accessible information mismatch (i.e., human annotators can see the whole webpage while the model is only given the extracted evidence, leading to different judgments.)
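To make the task formulation concrete, here is a minimal sketch of how a single (claim, evidence) pair can be posed to an LLM as a binary attribution query; the prompt wording is illustrative rather than the exact template used in the paper, and judge_attribution is a hypothetical helper name.

# Sketch of the binary attribution-evaluation query. The prompt wording is
# illustrative, not the exact template used in the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_attribution(claim: str, evidence: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask an LLM whether every part of the claim is supported by the evidence."""
    prompt = (
        f"Claim: {claim}\n"
        f"Evidence: {evidence}\n\n"
        "Is every piece of information in the claim fully supported by the "
        "evidence? Answer with exactly one of: 'attributable' or 'not attributable'."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()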

Data Statistics

ID & OOD Evaluation

For both in-distribution (ID) and out-of-distribution (OOD) evaluation, we have the following findings:
  1. Fine-tuning on NLI-related data is beneficial to attribution evaluation: T5-XXL-TRUE and AttrScore-Flan-T5 (3B) are fine-tuned on data that includes NLI and achieve strong performance on both ID and OOD sets. Specifically, both models outperform fine-tuned GPT-3.5 in average F1 score on the OOD sets and achieve comparable performance to fine-tuned GPT-3.5 on the ID sets. Also, with further fine-tuning on AttributionBench, AttrScore-Flan-T5 (3B) outperforms the original Flan-T5 (3B) on 4 ID and 2 OOD sets.
  2. Automatic attribution evaluation is challenging under the zero-shot setting: For tasks with shorter claims and evidence, such as AttributedQA and AttrEval-GenSearch, the performance of GPT-3.5 and GPT-4 is over 80%, while for tasks with longer evidence (like Stanford-GenSearch and HAGRID), the zero-shot performance of models is around 60%~70%, which is relatively low. For the challenging task ExpertQA, which consists of domain-specific challenging questions, GPT-3.5, GPT-4, and the NLI-tuned models can only achieve below ~60% performance (a minimal zero-shot scoring sketch is shown after this list).
  3. Fine-tuning on AttributionBench benefits both ID and OOD evaluation: On average across all models, fine-tuning improves performance by 9.0% and 4.6% on the 4 ID tasks and 3 OOD tasks, respectively. Additionally, with training on only 13k examples, the fine-tuned Flan-T5 (770M) model even surprisingly outperforms GPT-3.5 (w/ CoT) on both ID and OOD sets, indicating the effectiveness of fine-tuning on AttributionBench. Furthermore, almost all models obtain performance gains on the 3 OOD sets via fine-tuning, indicating that the models did not just overfit the ID data but also gained the generalizability to solve this task on OOD examples.
  4. Simply switching to stronger models does not significantly improve performance: First, under the zero-shot setting, GPT-3.5 and GPT-4 show their strength on difficult tasks like ExpertQA, a dataset containing hard, domain-specific responses and evidence. Nevertheless, on most other tasks, such as AttributedQA and LFQA, they still underperform smaller models like T5-XXL-TRUE and Flan-UL2. Under the fine-tuned setting, on the 4 ID sets, GPT-3.5 shows competitive performance but still underperforms smaller models, including Flan-T5 (11B) and Flan-UL2 (20B). On the 3 OOD sets, the performance of GPT-3.5 on AttrEval-GenSearch and HAGRID is still lower than that of many models, including T5-XXL-TRUE, AttrScore-Flan-T5 (3B), and, most surprisingly, Flan-T5 (3B) and Flan-T5 (770M).
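To illustrate in spirit how such numbers are produced, here is a minimal zero-shot scoring sketch under the binary formulation, reusing the hypothetical judge_attribution helper from the earlier sketch and scoring with macro-F1; the field names (claim, evidence, attribution_label) are placeholders for whatever the released schema uses, and this is not the official evaluation script.

# Sketch of a zero-shot evaluation loop scored with macro-F1. Field names are
# placeholders; the real dataset schema may differ.
from sklearn.metrics import f1_score

def evaluate_split(examples) -> float:
    gold, pred = [], []
    for ex in examples:
        gold.append(1 if ex["attribution_label"] == "attributable" else 0)
        answer = judge_attribution(ex["claim"], ex["evidence"])
        # Anything that does not clearly answer "attributable" is mapped to
        # the negative class.
        pred.append(1 if answer.startswith("attributable") else 0)
    return f1_score(gold, pred, average="macro")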

We also report the average macro-F1 score over the 7 test sets for GPT-3.5 with different input fields, where Q, C, E, and R stand for question, claim, evidence, and response, respectively. Results show that merely appending the questions and responses, which human annotators might have had access to, to the input is not the key to solving this task.
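The ablation boils down to which fields get concatenated into the model input; below is a minimal sketch of assembling the input from a chosen subset of Q/C/E/R (field names hypothetical).

# Sketch of the input-field ablation: build the model input from a chosen
# subset of question (Q), claim (C), evidence (E), and response (R).
FIELD_MAP = {"Q": "question", "C": "claim", "E": "evidence", "R": "response"}

def build_input(example: dict, fields: str = "CE") -> str:
    """Concatenate the selected fields, e.g. 'CE', 'QCE', or 'QCER'."""
    parts = []
    for code in fields:
        name = FIELD_MAP[code]
        parts.append(f"{name.capitalize()}: {example.get(name, '')}")
    return "\n".join(parts)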

We also show the performance of GPT-3.5 (w/ CoT) with several different prompts. Prompt engineering brings only limited gains over the 7 test sets. Although adjusting prompts brings little gain in overall performance (top), it does change the ratio of false positive (FP) and false negative (FN) cases (bottom). 'comp_und' stands for 'comprehensive understanding'.
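Since the prompts mainly trade false positives for false negatives, a confusion-count breakdown is more informative than a single score; a minimal sketch (labels encoded as 1 = attributable, 0 = not attributable):

# Sketch: count false positives and false negatives to see how a prompt
# shifts the error balance (1 = attributable, 0 = not attributable).
def fp_fn_counts(gold: list[int], pred: list[int]) -> tuple[int, int]:
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    return fp, fn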

Error Analysis

Through a detailed error analysis, we summarize the following takeaways:
  1. Over 66% of error cases are caused by fine-grained information insensitivity: When making a wrong judgment, the model most often fails to compare fine-grained information within the claim and the evidence. Here, 'fine-grained information' encompasses a wide array of specifics, from discrete data points such as numerical figures, dates, names, and locations, to more complex elements like particular events or logical connections.
  2. About 26.8% of errors are caused by the mismatch between the information accessible to the model and that accessible to human annotators: This disparity calls into question the reliability of labels within current attribution evaluation datasets. Our analysis suggests that, in certain cases, determining the label based solely on the provided claim and evidence can be challenging due to insufficient information or ambiguous references (error types: need additional information / reference ambiguity).


  3. Within 'fine-grained information insensitivity', the model makes different kinds of mistakes depending on the error class, i.e., false positive or false negative. In the false positive cases, the model misses certain details within the claim 43% of the time, and in 41% of the instances it makes incorrect connections or inferences, concluding that the claim is supported by the evidence when it actually is not. Conversely, within the false negative category, the model predominantly misinterprets or neglects details in the evidence (around 46% of the time), and 39% of the cases stem from a lack of the necessary inference and summarization to reach the attribution judgment, which would be easy for humans.

  4. We also show the error case distribution across the 7 test sets; a minimal tally sketch follows below.
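Such a per-dataset breakdown can be reproduced from manually annotated error cases with a simple tally; the record format below (a dataset name plus an error-type tag per case) is hypothetical, not the paper's annotation schema.

# Sketch: tally annotated error cases by dataset and error type. The record
# format ("dataset" and "error_type" keys) is hypothetical.
from collections import Counter, defaultdict

def error_distribution(error_cases: list[dict]) -> dict:
    per_dataset = defaultdict(Counter)
    for case in error_cases:
        per_dataset[case["dataset"]][case["error_type"]] += 1
    return {name: dict(counts) for name, counts in per_dataset.items()}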

Reference

Please cite our paper if you use our code, data, or results:

@misc{li2024attributionbench,
  title={AttributionBench: How Hard is Automatic Attribution Evaluation?},
  author={Yifei Li and Xiang Yue and Zeyi Liao and Huan Sun},
  year={2024},
  eprint={2402.15089},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
      
If you use the original datasets, please also cite them accordingly:

@misc{malaviya2023expertqa,
        title={ExpertQA: Expert-Curated Questions and Attributed Answers}, 
        author={Chaitanya Malaviya and Subin Lee and Sihao Chen and Elizabeth Sieber and Mark Yatskar and Dan Roth},
        year={2023},
        eprint={2309.07852},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
      }
      
@inproceedings{liu-etal-2023-evaluating,
          title = "Evaluating Verifiability in Generative Search Engines",
          author = "Liu, Nelson  and
            Zhang, Tianyi  and
            Liang, Percy",
          editor = "Bouamor, Houda  and
            Pino, Juan  and
            Bali, Kalika",
          booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
          month = dec,
          year = "2023",
          address = "Singapore",
          publisher = "Association for Computational Linguistics",
          url = "https://aclanthology.org/2023.findings-emnlp.467",
          doi = "10.18653/v1/2023.findings-emnlp.467",
          pages = "7001--7025",
          abstract = "Generative search engines directly generate responses to user queries, along with in-line citations. A prerequisite trait of a trustworthy generative search engine is verifiability, i.e., systems should cite comprehensively (high citation recall; all statements are fully supported by citations) and accurately (high citation precision; every cite supports its associated statement). We conduct human evaluation to audit four popular generative search engines{---}Bing Chat, NeevaAI, perplexity.ai, and YouChat{---}across a diverse set of queries from a variety of sources (e.g., historical Google user queries, dynamically-collected open-ended questions on Reddit, etc.). We find that responses from existing generative search engines are fluent and appear informative, but frequently contain unsupported statements and inaccurate citations: on average, a mere 51.5{\%} of generated sentences are fully supported by citations and only 74.5{\%} of citations support their associated sentence. We believe that these results are concerningly low for systems that may serve as a primary tool for information-seeking users, especially given their facade of trustworthiness. We hope that our results further motivate the development of trustworthy generative search engines and help researchers and users better understand the shortcomings of existing commercial systems.",
      }
    
@misc{bohnet2023attributed,
          title={Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models}, 
          author={Bernd Bohnet and Vinh Q. Tran and Pat Verga and Roee Aharoni and Daniel Andor and Livio Baldini Soares and Massimiliano Ciaramita and Jacob Eisenstein and Kuzman Ganchev and Jonathan Herzig and Kai Hui and Tom Kwiatkowski and Ji Ma and Jianmo Ni and Lierni Sestorain Saralegui and Tal Schuster and William W. Cohen and Michael Collins and Dipanjan Das and Donald Metzler and Slav Petrov and Kellie Webster},
          year={2023},
          eprint={2212.08037},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }
    
@misc{chen2023understanding,
          title={Understanding Retrieval Augmentation for Long-Form Question Answering}, 
          author={Hung-Ting Chen and Fangyuan Xu and Shane Arora and Eunsol Choi},
          year={2023},
          eprint={2310.12150},
          archivePrefix={arXiv},
          primaryClass={cs.CL}
    }
    
@article{dziri-etal-2022-evaluating,
      title = "Evaluating Attribution in Dialogue Systems: The {BEGIN} Benchmark",
      author = "Dziri, Nouha  and
        Rashkin, Hannah  and
        Linzen, Tal  and
        Reitter, David",
      editor = "Roark, Brian  and
        Nenkova, Ani",
      journal = "Transactions of the Association for Computational Linguistics",
      volume = "10",
      year = "2022",
      address = "Cambridge, MA",
      publisher = "MIT Press",
      url = "https://aclanthology.org/2022.tacl-1.62",
      doi = "10.1162/tacl_a_00506",
      pages = "1066--1083",
      abstract = "Knowledge-grounded dialogue systems powered by large language models often generate responses that, while fluent, are not attributable to a relevant source of information. Progress towards models that do not exhibit this issue requires evaluation metrics that can quantify its prevalence. To this end, we introduce the Benchmark for Evaluation of Grounded INteraction (Begin), comprising 12k dialogue turns generated by neural dialogue systems trained on three knowledge-grounded dialogue corpora. We collect human annotations assessing the extent to which the models{'} responses can be attributed to the given background information. We then use Begin to analyze eight evaluation metrics. We find that these metrics rely on spurious correlations, do not reliably distinguish attributable abstractive responses from unattributable ones, and perform substantially worse when the knowledge source is longer. Our findings underscore the need for more sophisticated and robust evaluation metrics for knowledge-grounded dialogue. We make Begin publicly available at \url{https://github.com/google/BEGIN-dataset}.",
  }
    
@inproceedings{yue-etal-2023-automatic,
      title = "Automatic Evaluation of Attribution by Large Language Models",
      author = "Yue, Xiang  and
        Wang, Boshi  and
        Chen, Ziru  and
        Zhang, Kai  and
        Su, Yu  and
        Sun, Huan",
      editor = "Bouamor, Houda  and
        Pino, Juan  and
        Bali, Kalika",
      booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
      month = dec,
      year = "2023",
      address = "Singapore",
      publisher = "Association for Computational Linguistics",
      url = "https://aclanthology.org/2023.findings-emnlp.307",
      doi = "10.18653/v1/2023.findings-emnlp.307",
      pages = "4615--4635",
      abstract = "A recent focus of large language model (LLM) development, as exemplified by generative search engines, is to incorporate external references to generate and support its claims. However, evaluating the attribution, i.e., verifying whether the generated statement is fully supported by the cited reference, remains an open problem. Although human evaluation is common practice, it is costly and time-consuming. In this paper, we investigate automatic evaluation of attribution given by LLMs. We begin by defining different types of attribution errors, and then explore two approaches for automatic evaluation: prompting LLMs and fine-tuning smaller LMs. The fine-tuning data is repurposed from related tasks such as question answering, fact-checking, natural language inference, and summarization. We manually curate a set of test examples covering 12 domains from a generative search engine, New Bing. Our results on this curated test set and simulated examples from existing benchmarks highlight both promising signals and challenges. We hope our problem formulation, testbeds, and findings will help lay the foundation for future studies on this important problem.",
  }
    
@misc{kamalloo2023hagrid,
    title={HAGRID: A Human-LLM Collaborative Dataset for Generative Information-Seeking with Attribution}, 
    author={Ehsan Kamalloo and Aref Jafari and Xinyu Zhang and Nandan Thakur and Jimmy Lin},
    year={2023},
    eprint={2307.16883},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}