TravelPlanner

A Benchmark for Real-World Planning with Language Agents

1Fudan University 2The Ohio State University
3The Pennsylvania State University 4Meta AI
* Equal Contribution
† Corresponding to jianx0321@gmail.com, zhang.13253@osu.edu, su.809@osu.edu

Overview of TravelPlanner. Given a query, language agents are tasked with employing various search tools to gather information. Based on the collected information, language agents are expected to deliver a plan that not only meets the user's needs specified in the query but also adheres to commonsense constraints.


We introduce TravelPlanner, a comprehensive benchmark designed to evaluate the planning abilities of language agents in real-world scenarios across multiple dimensions. Without loss of generality, TravelPlanner uses travel planning as its test environment, with all relevant information meticulously crafted to minimize data contamination. TravelPlanner does not have a single ground truth for each query. Instead, the benchmark employs several pre-defined evaluation scripts to assess each tested plan, determining whether the language agent can effectively use tools to create a plan that satisfies both the implicit commonsense and the explicit user needs outlined in the query (i.e., the commonsense constraints and the hard constraints). Every query in TravelPlanner has undergone thorough human verification to guarantee that feasible solutions exist. Additionally, TravelPlanner evaluates the language agent's capability by varying the breadth and depth of planning, controlled through the number of travel days and the number of hard constraints.
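The idea of scoring a plan with constraint checks rather than against a single reference answer can be sketched as follows. This is a minimal illustration, not the benchmark's actual evaluation scripts: the plan layout, the `within_budget` and `no_repeated_restaurants` checks, and the `evaluate` function are hypothetical names assumed for this example.

```python
from typing import Any, Callable, Dict, List

# Hypothetical plan layout: one dict per day (not the benchmark's real schema).
Plan = List[Dict[str, Any]]
Check = Callable[[Plan], bool]


def within_budget(budget: float) -> Check:
    """Illustrative hard constraint: total cost must not exceed the budget."""
    def check(plan: Plan) -> bool:
        return sum(day.get("cost", 0.0) for day in plan) <= budget
    return check


def no_repeated_restaurants(plan: Plan) -> bool:
    """Illustrative commonsense constraint: no restaurant is visited twice."""
    meals = [meal for day in plan for meal in day.get("meals", [])]
    return len(meals) == len(set(meals))


def evaluate(plan: Plan, commonsense: List[Check], hard: List[Check]) -> Dict[str, bool]:
    """Run every check; the plan passes overall only if all checks pass."""
    commonsense_ok = all(check(plan) for check in commonsense)
    hard_ok = all(check(plan) for check in hard)
    return {
        "commonsense": commonsense_ok,
        "hard": hard_ok,
        "final": commonsense_ok and hard_ok,
    }


# Assumed example data: "Cafe A" appears on both days, total cost is 550.
example_plan: Plan = [
    {"meals": ["Cafe A", "Diner B"], "cost": 300.0},
    {"meals": ["Bistro C", "Cafe A"], "cost": 250.0},
]
result = evaluate(example_plan, [no_repeated_restaurants], [within_budget(600.0)])
print(result)
```

In this example the plan stays within the assumed budget but repeats a restaurant, so it satisfies the hard constraint while failing the commonsense one. Any plan that passes every check would count as valid, which is why no single ground-truth plan is needed.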

TravelPlanner Dataset


We introduce TravelPlanner, a benchmark crafted for evaluating language agents on tool use and complex planning under multiple constraints. Grounded in travel planning, a real-world use case that naturally includes diverse constraints such as user needs and commonsense constraints in the environment, TravelPlanner evaluates whether language agents can develop reasonable travel plans by collecting information via diverse tools and making decisions while satisfying the constraints. For a given query, language agents are expected to formulate a comprehensive plan that includes transportation, daily meals, attractions, and accommodation for each day. From the perspective of real-world applications, we design three types of constraints: Environment Constraint, Commonsense Constraint, and Hard Constraint. TravelPlanner comprises 1,225 queries in total. The number of days and the number of hard constraints are designed to test agents' abilities across both the breadth and depth of complex planning.
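To make the expected plan contents concrete, the sketch below models one day of a plan and reports which of the four components (transportation, meals, attractions, accommodation) are missing. The `DayPlan` class, its field names, and `missing_components` are assumptions made for illustration and do not reflect the benchmark's actual plan format. The check is also deliberately simplified: it requires every component on every day.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DayPlan:
    """One day of a travel plan; all field names are hypothetical."""
    day: int
    transportation: Optional[str] = None
    meals: List[str] = field(default_factory=list)
    attractions: List[str] = field(default_factory=list)
    accommodation: Optional[str] = None


def missing_components(plan: List[DayPlan]) -> List[str]:
    """Return a description of every component absent from any day."""
    missing = []
    for d in plan:
        if not d.transportation:
            missing.append(f"day {d.day}: transportation")
        if not d.meals:
            missing.append(f"day {d.day}: meals")
        if not d.attractions:
            missing.append(f"day {d.day}: attractions")
        if not d.accommodation:
            missing.append(f"day {d.day}: accommodation")
    return missing


# Assumed example data: day 2 has no accommodation.
plan = [
    DayPlan(1, "Flight F123", ["Cafe A"], ["City Museum"], "Hotel H"),
    DayPlan(2, "Taxi", ["Diner B"], ["Old Town"]),
]
gaps = missing_components(plan)
print(gaps)
```

Longer trips add more `DayPlan` entries to fill, which corresponds to the breadth of planning mentioned above, while additional hard constraints narrow the set of acceptable values for each field.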

The benchmark is divided into training, validation, and test sets.

  • Train Set: 5 queries per group, each with a corresponding human-annotated plan, resulting in a total of 45 query-plan pairs.
  • Validation Set: 20 queries from each group, amounting to 180 queries in total.
  • Test Set: 1,000 randomly distributed queries.
The dataset can be downloaded from Hugging Face Datasets.
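The split sizes can be cross-checked with a few lines of arithmetic. The number of groups is not stated directly in the split description, so the sketch below infers it from the per-group counts and the stated totals; the variable names are illustrative only.

```python
# Per-group counts and totals as stated in the split description.
QUERIES_PER_GROUP = {"train": 5, "validation": 20}
STATED_TOTALS = {"train": 45, "validation": 180, "test": 1000}

# Infer the number of groups from the train split: 45 / 5.
num_groups = STATED_TOTALS["train"] // QUERIES_PER_GROUP["train"]

# The validation split must imply the same number of groups: 180 / 20.
assert STATED_TOTALS["validation"] // QUERIES_PER_GROUP["validation"] == num_groups

# The three splits together should give the stated benchmark size.
total_queries = sum(STATED_TOTALS.values())

print(num_groups, total_queries)  # 9 1225
```

Both per-group splits imply nine groups, and the three splits sum to the 1,225 queries reported for the benchmark.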


Dataset distribution of TravelPlanner.

Examples in the train set:



TravelPlanner constraint description. The environment constraint is manifested through the feedback received from the environment, assessing whether the language agent can adjust its plan appropriately. The commonsense constraint and hard constraint are evaluated based on how well the language agent's plan aligns with these specific criteria.



Tool description and the number of items in the database. The original data for each tool is sourced from publicly available internet data. We then modify this data, which includes adding, deleting, and altering certain keys and values to suit our requirements. In this way, we effectively avoid the problem of data contamination.

Experiment Results

Results on Existing Large Language Models and Planning Strategies

Case Study


title={TravelPlanner: A Benchmark for Real-World Planning with Language Agents},
author={Jian Xie and Kai Zhang and Jiangjie Chen and Tinghui Zhu and Renze Lou and Yuandong Tian and Yanghua Xiao and Yu Su},
booktitle={Forty-first International Conference on Machine Learning},