TableLlama: Towards Open Large Generalist Models for Tables

Tianshu Zhang, Xiang Yue, Yifei Li, Huan Sun

The Ohio State University
zhang.11535@osu.edu, yue.149@osu.edu, li.14042@osu.edu, sun.397@osu.edu

In this work, we introduce TableLlama and TableInstruct, the first large open-source generalist model and instruction tuning dataset for tables. Everything is open-source right now!

TableInstruct is a large-scale instruction tuning dataset with diverse, realistic tasks based on real-world tables. It covers 14 datasets spanning 11 tasks, curated from 1.24M tables and comprising 2.6M instances:

All data items are unified into an instruction tuning format for LLM training;
All data items in TableInstruct are collected from real tables and real tasks;
TableInstruct provides both in-domain training tasks to empower the model with fundamental table understanding abilities, and in-domain & out-of-domain evaluation tasks to test the model’s generalization and high-level reasoning ability.
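
To make the unified format concrete, here is a minimal sketch of how a table task might be cast as an instruction tuning record and rendered into a prompt. The field names (`instruction`, `input_seg`, `question`, `output`), the `[TLE]`/`[TAB]`/`[SEP]` serialization, and the prompt template are illustrative assumptions, not the confirmed TableInstruct schema; consult the released dataset and guidelines for the exact format.

```python
from dataclasses import dataclass


@dataclass
class TableExample:
    """Hypothetical record layout; the real TableInstruct schema may differ."""
    instruction: str  # natural-language description of the task
    input_seg: str    # the table, flattened into text
    question: str     # task-specific query about the table
    output: str       # gold answer used as the training target


def serialize_table(caption, header, rows):
    """Flatten a table into a single string (assumed serialization)."""
    parts = [f"[TLE] {caption}", "[TAB] | " + " | ".join(header) + " |"]
    for row in rows:
        parts.append("[SEP] | " + " | ".join(str(cell) for cell in row) + " |")
    return " ".join(parts)


def build_prompt(ex):
    """Render a record into an instruction-style prompt (assumed template)."""
    return (
        "### Instruction:\n" + ex.instruction + "\n\n"
        "### Input:\n" + ex.input_seg + "\n\n"
        "### Question:\n" + ex.question + "\n\n"
        "### Response:\n"
    )


example = TableExample(
    instruction="This is a table QA task. Answer the question based on the table.",
    input_seg=serialize_table(
        "Cup finals", ["Year", "Winner"], [[2019, "Team A"], [2020, "Team B"]]
    ),
    question="Who won in 2020?",
    output="Team B",
)
prompt = build_prompt(example)
print(prompt)
```

Under this sketch, every task (column type annotation, row population, table QA, and so on) differs only in the text placed in the four fields, so a single model can be trained on all of them with the same objective: generate `output` given the prompt.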

TableLlama is a large generalist model for tables based on Llama 2 (7B) and LongLoRA, which can:

Support long input context (up to 8k);
Achieve comparable or even better performance than the SOTA on almost all of the in-domain tasks;
Achieve 5-44 absolute point gains on 6 out-of-domain datasets compared with the base model, demonstrating that TableInstruct can substantially enhance model generalizability.

Figure 1: An overview of TableInstruct and TableLlama. TableInstruct includes a wide variety of realistic tables and tasks with instructions. We make the first step towards developing open-source generalist models for tables with TableInstruct and TableLlama.

Abstract

Semi-structured tables are ubiquitous. There has been a variety of tasks that aim to automatically interpret, augment, and query tables. Current methods often require pretraining on tables or special model architecture design, are restricted to specific table types, or have simplifying assumptions about tables and tasks. This paper makes the first step towards developing open-source large language models (LLMs) as generalists for a diversity of table-based tasks. Towards that end, we construct TableInstruct, a new dataset with a variety of realistic tables and tasks, for instruction tuning and evaluating LLMs. We further develop the first open-source generalist model for tables, TableLlama, by fine-tuning Llama 2 (7B) with LongLoRA to address the long context challenge. We experiment under both in-domain and out-of-domain settings. On 7 out of 8 in-domain tasks, TableLlama achieves comparable or better performance than the SOTA for each task, even though the latter often has task-specific design. On 6 out-of-domain datasets, it achieves 5-44 absolute point gains compared with the base model, showing that training on TableInstruct enhances the model’s generalizability. We open-source our dataset and trained model to boost future work on developing open generalist models for tables.

Updates

2024/03/13: Our paper has been accepted by NAACL 2024 as a long paper!
2024/03/21: We refine the prompts of 4 out-of-domain evaluation datasets: FEVEROUS, HybridQA, WikiSQL and WikiTQ of TableInstruct and update the results. Check the new results!
2024/03/21: We add the results of closed-source LLMs: GPT-3.5 and GPT-4.
2023/11/20: We have released the dataset, model, and codebase for the paper. Check it out!

Note:

Evaluation results can vary substantially with the prompt. Check the detailed guidelines on how to build a suitable prompt for TableLlama and run inference!

Our Dataset: TableInstruct

We construct TableInstruct, a comprehensive table-based instruction tuning dataset that covers a variety of real-world tables and realistic tasks. We include 14 datasets of 11 tasks in total, with both in-domain and out-of-domain evaluation settings. Some examples can be found in the figure below:

Figure 2: Illustration of three exemplary tasks: (a) Column type annotation. This task is to annotate the selected column with the correct semantic types. (b) Row population. This task is to populate rows given table metadata and partial row entities. (c) Hierarchical table QA. For subfigures (a) and (b), we mark candidates with red color in the "task instruction" part. The candidate set size can be hundreds to thousands in TableInstruct.

Data Statistics

TableInstruct includes various kinds of table tasks to comprehensively represent different table-related applications in real-world scenarios. These tasks belong to several categories: table interpretation, table augmentation, question answering, fact verification, dialogue generation, and data-to-text. The table below shows the statistics of TableInstruct:

Task Category	Task Name	Dataset	In-domain	Train (Table/Sample)	Test (Table/Sample)	min tokens	max tokens	median tokens
Table Interpretation	Column Type Annotation	TURL	Yes	397K/628K	1K/2K	106	8192	2613
Table Interpretation	Relation Extraction	TURL	Yes	53K/63K	1K/2K	2602	8192	3219
Table Interpretation	Entity Linking	TURL	Yes	193K/1264K	1K/2K	299	8192	4667
Table Augmentation	Schema Augmentation	TURL	Yes	288K/288K	4K/4K	160	1188	215
Table Augmentation	Row Population	TURL	Yes	286K/286K	0.3K/0.3K	264	8192	1508
Question Answering	Hierarchical Table QA	HiTab	Yes	3K/7K	1K/1K	206	5616	978
Question Answering	Highlighted Cells QA	FeTaQA	Yes	7K/7K	2K/2K	261	5923	740
Question Answering	Hybrid Table QA	HybridQA	No	-	3K/3K	248	2497	675
Question Answering	Table QA	WikiSQL	No	-	5K/16K	198	2091	575
Question Answering	Table QA	WikiTQ	No	-	0.4K/4K	263	2688	709
Fact Verification	Fact Verification	TabFact	Yes	16K/92K	2K/12K	253	4975	630
Fact Verification	Fact Verification	FEVEROUS	No	-	4K/7K	247	8192	648
Dialogue Generation	Table Grounded Dialogue Generation	KVRET	No	-	0.3K/0.8K	187	1103	527
Data-to-Text	Highlighted Cells Description	ToTTo	No	-	7K/8K	152	8192	246

In-domain Evaluation

We first evaluate TableLlama on 8 in-domain test sets. Due to the special semi-structured nature of tables, for most table-based tasks, existing work achieves SOTA results by using pretraining on large-scale tables and/or special model architecture design tailored for tables. Surprisingly, with a unified instruction tuning format without extra special design, TableLlama can achieve comparable or even better performance on almost all the tasks. The table below shows the results:

Datasets	Metric	Base	TableLlama	SOTA	GPT-3.5	§GPT-4
Column Type Annotation	F1	3.01	94.39	94.54	30.88	31.75
Relation Extraction	F1	0.96	91.95	94.91	27.42	52.95
Entity Linking	Accuracy	31.80	93.65	84.90	72.15	90.80
Schema Augmentation	MAP	36.75	80.50	77.55	49.11	58.19
Row Population	MAP	4.53	58.44	73.31	22.36	53.40
HiTab	Exec Acc	14.96	64.71	47.00	43.62	48.40
FeTaQA	BLEU	8.54	39.05	33.44	26.49	21.70
TabFact	Accuracy	41.65	82.55	84.87	67.41	74.40
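
Schema augmentation and row population are scored with MAP (mean average precision) over a ranked list of candidates. As a rough illustration of what that number measures, the sketch below implements the standard textbook definition; the exact evaluation protocol behind the reported scores (candidate pools, cutoffs, normalization) is not given here and may differ, and the toy queries are made up.

```python
def average_precision(ranked, relevant):
    """Average of precision@k taken at each rank k where a relevant item appears."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Normalize by the number of relevant items (standard definition).
    return precision_sum / len(relevant)


def mean_average_precision(queries):
    """queries: list of (ranked_predictions, gold_items) pairs."""
    return sum(average_precision(r, g) for r, g in queries) / len(queries)


# Toy data, not taken from TableInstruct.
toy_queries = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y"], {"y"}),                 # AP = (1/2) / 1
]
map_score = mean_average_precision(toy_queries)
print(f"MAP = {map_score:.4f}")
```

Because each relevant item contributes more when it is ranked higher, MAP rewards a model that places correct headers or entities near the top of its prediction list, not merely somewhere in it.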

Specifically, we observed the following takeaways:

By simply fine-tuning a large language model on TableInstruct, TableLlama can achieve comparable or even better performance on almost all the tasks without any table pretraining or special table model architecture design;
TableLlama displays advantages in table QA tasks: it surpasses the SOTA by 5.61 points on the highlighted-cell-based table QA task (FeTaQA) and by 17.71 points on hierarchical table QA (HiTab), which involves extensive numerical reasoning over tables. Since LLMs have been shown to excel at interacting with humans and answering questions, this suggests that their strong underlying language understanding ability may benefit such table QA tasks, even when the input is a semi-structured table;
For the entity linking task, which requires the model to link the mention in a table cell to the correct referent entity in Wikidata, TableLlama also performs strongly, with an 8.75-point gain over the SOTA. Since each candidate consists of a referent entity name and description, we hypothesize that LLMs can understand these descriptions well enough to help identify the correct entities;
Row population is the only task where TableLlama has a large performance gap compared to the SOTA. We observe that, in order to correctly populate the entities from the given large number of candidates, the model needs to fully understand the inherent relation between the queried entity and each given candidate, which is still challenging for the current model. Detailed analysis and a case study can be found in Section 4.1 and Table 5 (Appendix D) of our paper;
TableLlama achieves better performance on in-domain tasks than closed-source LLMs. This shows that even though closed-source LLMs perform strongly in general, fine-tuning open-source LLMs on task-specific table-based data still yields better performance.
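
The margins quoted in these takeaways can be checked directly against the in-domain results table. The short script below recomputes TableLlama minus SOTA for every task; the numbers are copied from the table above, and the variable names are only for illustration.

```python
# (TableLlama, SOTA) scores copied from the in-domain results table.
in_domain = {
    "Column Type Annotation": (94.39, 94.54),
    "Relation Extraction": (91.95, 94.91),
    "Entity Linking": (93.65, 84.90),
    "Schema Augmentation": (80.50, 77.55),
    "Row Population": (58.44, 73.31),
    "HiTab": (64.71, 47.00),
    "FeTaQA": (39.05, 33.44),
    "TabFact": (82.55, 84.87),
}

# Positive delta: TableLlama is ahead of the SOTA; negative: behind.
deltas = {task: round(ours - sota, 2) for task, (ours, sota) in in_domain.items()}

for task, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{task:<24} {delta:+.2f}")

largest_gap_task = min(deltas, key=deltas.get)
```

Row population stands out with a deficit of 14.87 points, while every other task is either ahead of the SOTA or within 3 points of it, which matches the "7 out of 8" summary.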

Out-of-domain Evaluation

To show the model's generalizability on unseen data and unseen tasks, we evaluate TableLlama on 6 out-of-domain datasets. Overall, TableLlama shows remarkable generalizability across out-of-domain tasks, outperforming the base model by 5 to 44 absolute points. The table below shows the results:

Datasets	Metric	Base	TableLlama	SOTA	▲ Base	GPT-3.5	§GPT-4
FEVEROUS	Accuracy	29.68	73.77	85.60	+44.09	60.79	71.60
HybridQA	Accuracy	23.46	39.38	65.40	+15.92	40.22	58.60
KVRET	Micro F1	38.90	48.73	67.80	+9.83	54.56	56.46
ToTTo	BLEU	10.39	20.77	48.95	+10.38	16.81	12.21
WikiSQL	Accuracy	15.56	50.48	92.70	+34.92	41.91	47.60
WikiTQ	Accuracy	29.26	35.01	57.50	+5.75	53.13	68.40
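
The "▲ Base" column is simply the TableLlama score minus the base model score. As a quick sanity check, the sketch below recomputes it from the Base and TableLlama columns of the table above and reports the range of gains; the variable names are illustrative.

```python
# (Base, TableLlama) scores copied from the out-of-domain results table.
out_of_domain = {
    "FEVEROUS": (29.68, 73.77),
    "HybridQA": (23.46, 39.38),
    "KVRET": (38.90, 48.73),
    "ToTTo": (10.39, 20.77),
    "WikiSQL": (15.56, 50.48),
    "WikiTQ": (29.26, 35.01),
}

# Absolute point gain of TableLlama over the base model on each dataset.
gains = {name: round(tuned - base, 2) for name, (base, tuned) in out_of_domain.items()}

for name, gain in gains.items():
    print(f"{name:<10} +{gain:.2f}")

smallest = min(gains, key=gains.get)
largest = max(gains, key=gains.get)
print(f"range: +{gains[smallest]:.2f} ({smallest}) to +{gains[largest]:.2f} ({largest})")
```

The gains run from 5.75 points on WikiTQ to 44.09 points on FEVEROUS, which is the "5-44 absolute point" range cited for the out-of-domain setting.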