Our data construction process follows a 5-step pipeline:
- Flight Data Collection: We randomly choose a source and destination from the list of the 50 busiest airports by passenger traffic. We then scrape flights between the two airports from Google Flights using Selenium WebDriver. Flights are sampled from economy, business, and first class, and each record includes all other details such as price, layover times, and carbon emissions.
- Product-of-Sums Generation: We randomly select 2-6 features, or "slots", such as airline, ticket class, and travel time, and generate a minterm table: the list of all slot input combinations for which the output is '1'. The slot symbols and generated minterms are passed to SymPy, which generates a product-of-sums (POS) expression.
- Primitive Generalization: For each occurrence of a selected slot in the POS expression, we use rule-based templates to generate a primitive condition that acts as a constraint. We also generate the negation of each condition.
- LLM Paraphrasing & Human Validation: We substitute the individual primitives into the individual sum terms and combine them using another simple rule-based template. We then use gpt-4-turbo to rephrase the individual sum terms to make them sound more natural and human-like. Next, we combine the individual sum terms into a product (logical AND). The resulting flight requirement is again rephrased with gpt-4-turbo. We manually verify each generated query to ensure it is consistent with the primitives and make changes wherever necessary.
- Option Matching: We divide the flights for each route into sets of 5, with each set containing one flight that matches the user criteria. Thus a user query is often repeated in our dataset, but may have different flight options.
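
As a rough sketch of the Product-of-Sums Generation and primitive substitution steps above, the snippet below builds a POS expression for 2 slots and 2 minterms with SymPy's `POSform` and fills in rule-based templates. The slot names, the template wording (including the airline "Delta"), and the helper functions `clauses`, `literals`, `primitive`, and `render` are assumptions made for illustration, not the templates used to build the dataset.

```python
from itertools import product

from sympy import symbols
from sympy.logic import POSform
from sympy.logic.boolalg import And, Not, Or

# Hypothetical templates: slot name -> (primitive condition, its negation).
TEMPLATES = {
    "airline": ("the airline is Delta", "the airline is not Delta"),
    "ticket_class": ("the ticket is business class",
                     "the ticket is not business class"),
}

# Two slots and a minterm table with two rows; each row is one
# combination of slot truth values for which the requirement holds.
slots = symbols("airline ticket_class")
minterms = [[0, 1], [1, 0]]

# SymPy simplifies the minterm table into a product-of-sums expression.
pos_expr = POSform(slots, minterms)


def clauses(expr):
    """Return the sum terms (OR clauses) of a POS expression."""
    return expr.args if isinstance(expr, And) else (expr,)


def literals(clause):
    """Return the literals (slots or negated slots) of one sum term."""
    return clause.args if isinstance(clause, Or) else (clause,)


def primitive(literal):
    """Map a literal to its templated primitive condition."""
    if isinstance(literal, Not):
        return TEMPLATES[literal.args[0].name][1]
    return TEMPLATES[literal.name][0]


def render(expr):
    """Join primitives with 'or' inside a sum term and 'and' across terms."""
    sum_terms = [
        " or ".join(primitive(lit) for lit in literals(clause))
        for clause in clauses(expr)
    ]
    return " and ".join(f"({term})" for term in sum_terms)


def satisfied_rows(expr, variables):
    """List every truth assignment (as 0/1 rows) that satisfies expr."""
    rows = []
    for values in product([0, 1], repeat=len(variables)):
        assignment = {v: bool(x) for v, x in zip(variables, values)}
        if bool(expr.subs(assignment)):
            rows.append(list(values))
    return rows


requirement = render(pos_expr)
print(pos_expr)
print(requirement)
```

The templated string printed here corresponds to the raw text that would then be paraphrased; `satisfied_rows` is only a sanity check that the POS expression reproduces the minterm table. The sketch does not handle the degenerate cases where `POSform` returns a constant `true` or `false`.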
The process for generating a single query with 2 slots and 2 minterms is illustrated in the image below: