GroundCocoa

Evaluating Compositional & Conditional Reasoning in a Grounding Task

The Ohio State University
GroundCocoa is a benchmark for evaluating conditional and compositional reasoning in large language models. It is based on the real-world challenge of matching flight options with complex user criteria and consists of 4849 test samples of varying conditional and compositional complexity. The samples mix natural user requirements with more atypical ones (e.g. "I want the ticket price to be over $1000") to evaluate robustness to unconventional or non-standard conditions.

Abstract

"If I am leaving on Friday, I need a flight after 8 pm so I can make it after work. However, on Saturday, I prefer leaving before 10 in the morning. Also, if I'm travelling at night I need a first class seat."

The rapid progress of large language models (LLMs) has seen them excel and frequently surpass human performance on standard benchmarks. This has enabled many downstream applications, such as LLM agents, to rely on their sophisticated reasoning to navigate complex task requirements. However, LLMs are known to unexpectedly falter in simple tasks and under seemingly straightforward circumstances - underscoring the need for better and more diverse evaluation setups to measure their true capabilities. To this end, we choose to study compositional and conditional reasoning, two cornerstones of human cognition, and introduce GroundCocoa - a lexically diverse benchmark connecting these reasoning skills to the real-world problem of flight booking. Our task involves aligning detailed user preferences with available flight options presented in a multiple-choice format. Results indicate a significant disparity in performance among current state-of-the-art LLMs with even the best performing model, GPT-4 Turbo, not exceeding 67% accuracy despite advanced prompting techniques.
Figure: Gemini-Pro and GPT-4 Turbo responses for a flight requirement with a single option and a simplified schema.

Dataset Construction

Our data construction process follows a 5-step pipeline:

  • Flight Data Collection: We randomly choose a source and destination from the list of the 50 busiest airports by passenger traffic, then scrape flights between them from Google Flights using Selenium WebDriver. Flights are sampled from economy, business, and first class, along with details such as price, layover times, and carbon emissions.
  • Product-of-Sums Generation: We randomly select 2-6 features, or "slots", such as airline, ticket class, and travel time, and generate a minterm table - the list of all slot combinations that evaluate to '1'. The slot symbols and generated minterms are passed to SymPy, which produces a product-of-sums (POS) expression (see the sketch after this list).
  • Primitive Generalization: For each occurrence of a selected slot in the POS expression, we use rule-based templates to generate a primitive condition which acts as a constraint, along with its negation.
  • LLM Paraphrasing & Human Validation: We substitute the primitives into the individual sum terms and combine them using another simple rule-based template. We then use gpt-4-turbo to rephrase each sum term so that it sounds more natural and human-like. Next, we combine the sum terms into a product (logical AND), and the resulting flight requirement is rephrased once more with gpt-4-turbo. We manually verify each generated query to ensure it is consistent with the primitives and make corrections wherever necessary.
  • Option Matching: We divide the flights for each route into sets of 5, each set containing exactly one flight that matches the user criteria (a minimal grouping sketch appears after the figure below). Thus a user query is often repeated in our dataset, but may be paired with different flight options.
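
To make the product-of-sums and primitive steps concrete, here is a minimal sketch built around SymPy's POSform. The slot names, minterms, and templates below are illustrative placeholders, not the actual slots or templates used in the pipeline.

```python
import random

from sympy import Not, symbols
from sympy.logic.boolalg import And, Or, POSform

# One boolean symbol per selected slot (two slots here for brevity).
price, layovers = slots = symbols("price layovers")

# Minterm table: the slot assignments that should evaluate to '1'.
# Randomly draw 2 of the 4 possible assignments over 2 slots.
minterms = random.sample([[a, b] for a in (0, 1) for b in (0, 1)], k=2)

# SymPy simplifies the minterm table into a product-of-sums expression,
# e.g. (price | layovers) & (~price | ~layovers).
pos = POSform(slots, minterms)

# Illustrative rule-based templates for each primitive and its negation.
primitive = {
    price: "the ticket price is under $500",
    layovers: "the flight has at most one layover",
}
negation = {
    price: "the ticket price is $500 or more",
    layovers: "the flight has two or more layovers",
}

def render_literal(literal):
    # A negated literal maps to the negated template, a plain one to the primitive.
    return negation[literal.args[0]] if isinstance(literal, Not) else primitive[literal]

# Combine the primitives inside each sum term with a simple "or" template;
# these strings are what the paraphrasing step later rewrites.
sum_terms = pos.args if isinstance(pos, And) else (pos,)
for term in sum_terms:
    literals = term.args if isinstance(term, Or) else (term,)
    print(" or ".join(render_literal(lit) for lit in literals))
```

Depending on the drawn minterms, this prints one rule-based constraint per sum term; the pipeline then paraphrases each sum term with gpt-4-turbo and conjoins them into the final requirement.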

The process for generating a single query with 2 slots and 2 minterms is illustrated in the image below:

Figure: Query Generation in GroundCocoa
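
As a complement to the illustration above, here is a minimal sketch of the option-matching step, assuming a hypothetical predicate satisfies(flight) that checks a scraped flight record against the user requirement; the function and field names are illustrative rather than the released code.

```python
import random
from typing import Callable, Dict, List


def build_option_sets(
    flights: List[Dict],
    satisfies: Callable[[Dict], bool],
    options_per_set: int = 5,
) -> List[Dict]:
    """Group flights into multiple-choice sets, each with exactly one match."""
    matching = [f for f in flights if satisfies(f)]
    distractor_pool = [f for f in flights if not satisfies(f)]
    random.shuffle(distractor_pool)

    samples = []
    for answer in matching:
        if len(distractor_pool) < options_per_set - 1:
            break  # not enough non-matching flights left for a full set
        distractors = [distractor_pool.pop() for _ in range(options_per_set - 1)]
        options = distractors + [answer]
        random.shuffle(options)
        samples.append({"options": options, "answer_index": options.index(answer)})
    return samples
```

Because the same user requirement can yield several such sets, a query may recur in the dataset with different flight options, as noted above.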

Dataset Statistics

Key statistics of GroundCocoa are provided below:

Each numbered column corresponds to a (slot, minterm) configuration.

Statistic              (2,2)    (3,2)    (4,2)    (4,3)    (5,2)    (6,2)    Total
Test Samples           1511     1083     710      723      451      371      4849
Test Unique Queries    124      136      117      129      121      101      728
Val. Samples           17       17       8        5        2        3        52
Val. Unique Queries    1        1        1        1        1        1        6
Avg. Query Length      65.04    88.33    103.88   119.14   124.56   148.87   95.95
Avg. Context Length    -        -        -        -        -        -        1252.27
Vocab Size             -        -        -        -        -        -        4200

Data Explorer

The data explorer allows you to view a random query from GroundCocoa along with its corresponding POS equation and primitives (generated from rule-based templates). Additionally, you can specify the number of slots, minterms, or both to view a query of a specific type. The (slot, minterm) configuration must be valid, i.e. it must correspond to one of the columns in the table above.
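
If you prefer to browse the released files directly, the sketch below reproduces the explorer's behaviour offline. The JSON-lines layout and the field names num_slots, num_minterms, and query are assumptions for illustration and may not match the actual release schema.

```python
import json
import random


def sample_query(path, num_slots=None, num_minterms=None):
    """Return a random record, optionally filtered to one (slot, minterm) configuration.

    Assumes a JSON-lines file whose records carry 'num_slots', 'num_minterms',
    and 'query' fields; these names are hypothetical, not the official schema.
    """
    with open(path) as f:
        records = [json.loads(line) for line in f]
    if num_slots is not None:
        records = [r for r in records if r["num_slots"] == num_slots]
    if num_minterms is not None:
        records = [r for r in records if r["num_minterms"] == num_minterms]
    if not records:
        raise ValueError("No samples for the requested (slot, minterm) configuration.")
    return random.choice(records)


# For example, view a random (4, 3) query:
# print(sample_query("groundcocoa_test.jsonl", num_slots=4, num_minterms=3)["query"])
```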


Citation

@misc{kohli2024cleared,
      title={Cleared for Takeoff? Compositional & Conditional Reasoning may be the Achilles Heel to (Flight-Booking) Language Agents},
      author={Harsh Kohli and Huan Sun},
      year={2024},
      eprint={2404.04237},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
