GroundCocoa

Evaluating Compositional & Conditional Reasoning in a Grounding Task

The Ohio State University
GroundCocoa is a benchmark for evaluating conditional and compositional reasoning in large language models. It is based on the real-world challenge of matching flight options with complex user criteria and consists of 4849 test samples of varying conditional and compositional complexity. The samples mix natural user requirements with more atypical ones (e.g. "I want the ticket price to be over $1000") to evaluate robustness to unconventional or non-standard conditions.

Abstract

"If I am leaving on Friday, I need a flight after 8 pm so I can make it after work. However, on Saturday, I prefer leaving before 10 in the morning. Also, if I'm travelling at night I need a first class seat."

The rapid progress of large language models (LLMs) has seen them excel and frequently surpass human performance on standard benchmarks. This has enabled many downstream applications, such as LLM agents, to rely on their sophisticated reasoning to navigate complex task requirements. However, LLMs are known to unexpectedly falter in simple tasks and under seemingly straightforward circumstances - underscoring the need for better and more diverse evaluation setups to measure their true capabilities. To this end, we choose to study compositional and conditional reasoning, two cornerstones of human cognition, and introduce GroundCocoa - a lexically diverse benchmark connecting these reasoning skills to the real-world problem of flight booking. Our task involves aligning detailed user preferences with available flight options presented in a multiple-choice format. Results indicate a significant disparity in performance among current state-of-the-art LLMs with even the best performing model, GPT-4 Turbo, not exceeding 67% accuracy despite advanced prompting techniques.
Figure: Gemini-Pro and GPT-4 Turbo responses for a flight requirement with a single option and a simplified schema.

Dataset Construction

Our data construction process follows a 5-step pipeline:

  • Flight Data Collection: We randomly choose a source and destination from the list of the 50 busiest airports by passenger traffic, then scrape flights between them from Google Flights using Selenium WebDriver. Flights are sampled from economy, business, and first class, along with details such as price, layover times, and carbon emissions.
  • Product-of-Sums Generation: We randomly select 2-6 features, or "slots", such as airline, ticket class, and travel time, and generate a minterm table - the list of all slot combinations that evaluate to '1'. The slot symbols and generated minterms are passed to SymPy, which produces a product-of-sums (POS) expression (see the sketch after this list).
  • Primitive Generalization: For each occurrence of a selected slot in the POS expression, we use rule-based templates to generate a primitive condition which acts as a constraint, along with its negation.
  • LLM Paraphrasing & Human Validation: We substitute the primitives into the individual sum terms and combine them using another simple rule-based template. We then use gpt-4-turbo to rephrase each sum term so that it sounds more natural and human-like. Next, we combine the sum terms into a product (logical AND), and the resulting flight requirement is rephrased once more with gpt-4-turbo. We manually verify each generated query to ensure it is consistent with the primitives and make corrections wherever necessary.
  • Option Matching: We divide the flights for each route into sets of 5, each set containing exactly one flight that matches the user criteria (a minimal grouping sketch appears after the figure below). Thus a user query is often repeated in our dataset, but may be paired with different flight options.
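
To make the product-of-sums and primitive steps concrete, here is a minimal sketch built around SymPy's POSform. The slot names, minterms, and templates below are illustrative placeholders, not the actual slots or templates used in the pipeline.

```python
import random

from sympy import Not, symbols
from sympy.logic.boolalg import And, Or, POSform

# One boolean symbol per selected slot (two slots here for brevity).
price, layovers = slots = symbols("price layovers")

# Minterm table: the slot assignments that should evaluate to '1'.
# Randomly draw 2 of the 4 possible assignments over 2 slots.
minterms = random.sample([[a, b] for a in (0, 1) for b in (0, 1)], k=2)

# SymPy simplifies the minterm table into a product-of-sums expression,
# e.g. (price | layovers) & (~price | ~layovers).
pos = POSform(slots, minterms)

# Illustrative rule-based templates for each primitive and its negation.
primitive = {
    price: "the ticket price is under $500",
    layovers: "the flight has at most one layover",
}
negation = {
    price: "the ticket price is $500 or more",
    layovers: "the flight has two or more layovers",
}

def render_literal(literal):
    # A negated literal maps to the negated template, a plain one to the primitive.
    return negation[literal.args[0]] if isinstance(literal, Not) else primitive[literal]

# Combine the primitives inside each sum term with a simple "or" template;
# these strings are what the paraphrasing step later rewrites.
sum_terms = pos.args if isinstance(pos, And) else (pos,)
for term in sum_terms:
    literals = term.args if isinstance(term, Or) else (term,)
    print(" or ".join(render_literal(lit) for lit in literals))
```

Depending on the drawn minterms, this prints one rule-based constraint per sum term; the pipeline then paraphrases each sum term with gpt-4-turbo and conjoins them into the final requirement.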

The process for generating a single query with 2 slots and 2 minterms is illustrated in the image below:

Figure: Query Generation in GroundCocoa
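
As a complement to the illustration above, here is a minimal sketch of the option-matching step, assuming a hypothetical predicate satisfies(flight) that checks a scraped flight record against the user requirement; the function and field names are illustrative rather than the released code.

```python
import random
from typing import Callable, Dict, List


def build_option_sets(
    flights: List[Dict],
    satisfies: Callable[[Dict], bool],
    options_per_set: int = 5,
) -> List[Dict]:
    """Group flights into multiple-choice sets, each with exactly one match."""
    matching = [f for f in flights if satisfies(f)]
    distractor_pool = [f for f in flights if not satisfies(f)]
    random.shuffle(distractor_pool)

    samples = []
    for answer in matching:
        if len(distractor_pool) < options_per_set - 1:
            break  # not enough non-matching flights left for a full set
        distractors = [distractor_pool.pop() for _ in range(options_per_set - 1)]
        options = distractors + [answer]
        random.shuffle(options)
        samples.append({"options": options, "answer_index": options.index(answer)})
    return samples
```

Because the same user requirement can yield several such sets, a query may recur in the dataset with different flight options, as noted above.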

Dataset Statistics

Key statistics of GroundCocoa are provided below:

Each numbered column corresponds to a (slot, minterm) configuration.

Statistic              (2,2)    (3,2)    (4,2)    (4,3)    (5,2)    (6,2)    Total
Test Samples           1511     1083     710      723      451      371      4849
Test Unique Queries    124      136      117      129      121      101      728
Val. Samples           17       17       8        5        2        3        52
Val. Unique Queries    1        1        1        1        1        1        6
Avg. Query Length      65.04    88.33    103.88   119.14   124.56   148.87   95.95
Avg. Context Length    -        -        -        -        -        -        1252.27
Vocab Size             -        -        -        -        -        -        4200

Data Explorer

The data explorer allows you to view a random query from GroundCocoa along with its corresponding POS equation and primitives (generated from rule-based templates). Additionally, you can specify the number of slots, minterms, or both to view a query of a specific type. The (slot, minterm) configuration must be valid, i.e. it must correspond to one of the columns in the table above.
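
If you prefer to browse the released files directly, the sketch below reproduces the explorer's behaviour offline. The JSON-lines layout and the field names num_slots, num_minterms, and query are assumptions for illustration and may not match the actual release schema.

```python
import json
import random


def sample_query(path, num_slots=None, num_minterms=None):
    """Return a random record, optionally filtered to one (slot, minterm) configuration.

    Assumes a JSON-lines file whose records carry 'num_slots', 'num_minterms',
    and 'query' fields; these names are hypothetical, not the official schema.
    """
    with open(path) as f:
        records = [json.loads(line) for line in f]
    if num_slots is not None:
        records = [r for r in records if r["num_slots"] == num_slots]
    if num_minterms is not None:
        records = [r for r in records if r["num_minterms"] == num_minterms]
    if not records:
        raise ValueError("No samples for the requested (slot, minterm) configuration.")
    return random.choice(records)


# For example, view a random (4, 3) query:
# print(sample_query("groundcocoa_test.jsonl", num_slots=4, num_minterms=3)["query"])
```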


Citation

@misc{kohli2024cleared,
      title={Cleared for Takeoff? Compositional & Conditional Reasoning may be the Achilles Heel to (Flight-Booking) Language Agents},
      author={Harsh Kohli and Huan Sun},
      year={2024},
      eprint={2404.04237},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
