Beyond Clicking:

A Step Towards Generalist GUI Grounding via Text Dragging

Zeyi Liao1, Yadong Lu2, Boyu Gou1, Huan Sun1, Ahmed Awadallah2

1The Ohio State University, 2Microsoft Research
† work done while at MSR

Abstract

    Graphical user interface (GUI) grounding, the process of mapping human instructions to GUI actions, serves as a fundamental basis for autonomous GUI agents. While existing grounding models achieve promising performance in simulating mouse clicks on various click-based benchmarks, another essential mode of mouse interaction, dragging, remains largely underexplored. Yet dragging the mouse to select and manipulate textual content is a prevalent and important operation in practical GUI scenarios. To narrow this gap, we first introduce GUI-Drag, a diverse dataset of 161K text dragging examples synthesized through a scalable pipeline. To support systematic and robust evaluation, we further construct ScreenDrag, a benchmark of 5,333 examples spanning three levels of interface context, together with three dedicated metrics for assessing text dragging capability. Models trained on GUI-Drag with an efficient continual training strategy achieve substantial improvements on ScreenDrag while preserving the original click-based performance on ScreenSpot, ScreenSpot-v2, and OSWorld-G. Our work encourages further research on broader GUI grounding beyond clicking and paves the way toward a truly generalist GUI grounding model.
    GUI-Drag pipeline and benchmark overview
    (Left) ScreenDrag benchmark across three interface contexts. (Right) GUI-Drag pipeline overview. The grounding model receives a screenshot and an instruction, and then outputs the starting and ending coordinates for text dragging.

GUI-Drag Training & Dataset

  • We design a scalable pipeline to synthesize text dragging examples directly from screenshots. It consists of three stages: 1) Instruction Generation, which covers five categories of instructions referencing text spans at five granularities; 2) Grounding, which uses an OCR module to ground each instruction into pixel-level coordinates (a minimal sketch of this step follows the list below); and 3) Filtering, which removes examples with ambiguous intent or incorrect grounding.
  • We carefully analyze and filter existing GUI grounding datasets, including UGround and Jedi, to retain screenshots that are rich in textual content. In addition, we collect 20K public academic-style documents to better reflect real-world scenarios where text-dragging capabilities are frequently used, particularly in academic and document-editing contexts.
  • Overall, we synthesize 161K high-quality text dragging examples, which we denote as GUI-Drag.
  • Unlike prior works that train grounding models from a general foundation model with massive amounts of data, we adopt an efficient continual training strategy that enhances text dragging capability while preserving the original click-based performance of the base model. Our models, continually trained from Jedi-3B/7B, are denoted GUI-Drag-3B/7B.
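
For intuition, below is a minimal sketch of the Grounding stage: it matches an instructed text span against OCR word boxes and returns the starting and ending coordinates for the drag. The OCR record format and the function name are assumptions made for illustration, not the pipeline's actual implementation.

```python
# Minimal sketch of the Grounding stage: match an instructed text span against
# OCR word boxes and derive the starting/ending drag coordinates.
# The OCR record format (word, x, y, w, h) and helper name are illustrative assumptions.

from typing import Dict, List, Optional, Tuple

def ground_text_span(target_words: List[str],
                     ocr_words: List[Dict]) -> Optional[Tuple[Tuple[int, int], Tuple[int, int]]]:
    """Return ((x1, y1), (x2, y2)) for dragging over `target_words`, or None if not found."""
    tokens = [w["word"].lower() for w in ocr_words]
    target = [t.lower() for t in target_words]
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            first, last = ocr_words[i], ocr_words[i + len(target) - 1]
            # Start at the left-middle of the first word, end at the right-middle of the last word.
            start = (first["x"], first["y"] + first["h"] // 2)
            end = (last["x"] + last["w"], last["y"] + last["h"] // 2)
            return start, end
    return None  # span not found; such examples would be dropped by the Filtering stage

# Toy OCR output for two words on one line:
ocr = [
    {"word": "Beyond", "x": 100, "y": 40, "w": 62, "h": 18},
    {"word": "Clicking", "x": 168, "y": 40, "w": 70, "h": 18},
]
print(ground_text_span(["Beyond", "Clicking"], ocr))  # ((100, 49), (238, 49))
```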

ScreenDrag Benchmark

  • ScreenDrag comprises 5,333 human-annotated examples across three popular productivity applications, Word, PowerPoint, and a PDF reader, where text dragging is a core part of user workflows.
  • The screenshots within ScreenDrag span three levels of interface context — document view, application window, and full desktop. This design captures diverse real-world usage scenarios of GUI agents, particularly when employing MCP techniques.
  • The instructions within ScreenDrag span multiple categories and granularities, and are expressed in both explicit and implicit forms.
  • We further split the benchmark into text-sparse and text-dense subsets to quantify fine-grained localization difficulty. An illustrative example record is sketched below.
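
For concreteness, a ScreenDrag example might be represented as a record like the one below. The field names and values are hypothetical; they only illustrate the kinds of annotations described above (application, interface context, text-density split, instruction category/granularity/form, and gold start/end coordinates).

```python
# Hypothetical ScreenDrag-style record; all field names and values are illustrative only.
example = {
    "screenshot": "word_document_view_0042.png",  # one screenshot per example
    "application": "Word",                        # Word, PowerPoint, or PDF reader
    "context_level": "document_view",             # document view / application window / full desktop
    "density": "text-dense",                      # text-sparse vs. text-dense subset
    "instruction": "Select the sentence that defines GUI grounding.",
    "instruction_form": "implicit",               # explicit vs. implicit phrasing
    "granularity": "sentence",                    # one of the annotated granularities
    "start": [312, 488],                          # gold starting coordinate (x, y)
    "end": [946, 510],                            # gold ending coordinate (x, y)
}
```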

Evaluation Metrics

ScreenDrag reports complementary metrics that jointly capture action triggering, spatial alignment, and exact selection quality.

  • Drag Trigger Rate (DTR): the fraction of examples for which the model produces a complete dragging trajectory.
  • Boundary Distance (B-Dist): the average index difference between the predicted and ground-truth boundary boxes.
  • Selection Rate (SR): the percentage of cases in which the selected text span exactly matches the gold annotation.
ScreenDrag metrics illustration
A screenshot with set-of-mark (SOM) annotations in gray. In the top-left black box, the target text span is the first sentence, "Like … tools.". Given the ground-truth and predicted bboxes (the predicted start and end coordinates fall within bbox 0 and bbox 20, omitted here to avoid clutter), the B-Dist is 3 according to Equation 2 in the paper. In the bottom-right box, the target span is the last sentence, "For … Word." Here, both predictions (blue and green) yield zero B-Dist, but only the green coordinates correctly capture the target span: the blue prediction fails because its pixel distance d_pixel exceeds the threshold, whereas the green one succeeds thanks to the text snapping mechanism. A simplified sketch of how these metrics could be computed follows.
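
As a rough sketch, the three metrics could be computed as below, assuming predictions have already been snapped to text bounding boxes. The exact definitions, including the pixel-distance threshold and text snapping described above, follow the paper (e.g., Equation 2 for B-Dist); the index-based aggregation used here is a simplifying assumption.

```python
# Simplified sketch of the ScreenDrag metrics (DTR, B-Dist, SR).
# Assumes predictions are already snapped to text bounding boxes; the exact
# definitions (thresholds, snapping, aggregation) follow the paper.

from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start_box_index, end_box_index) after text snapping

def evaluate(pred_spans: List[Optional[Span]], gold_spans: List[Span]):
    n = len(gold_spans)

    # Drag Trigger Rate: fraction of examples with a complete (start, end) trajectory.
    dtr = sum(p is not None for p in pred_spans) / n

    # Boundary Distance: average index difference between predicted and gold boundary boxes
    # (a simplified stand-in for Equation 2 in the paper).
    dists = [abs(p[0] - g[0]) + abs(p[1] - g[1])
             for p, g in zip(pred_spans, gold_spans) if p is not None]
    b_dist = sum(dists) / max(len(dists), 1)

    # Selection Rate: fraction of examples whose selected span exactly matches the gold span.
    sr = sum(p == g for p, g in zip(pred_spans, gold_spans)) / n

    return dtr, b_dist, sr

# Toy usage: one prediction off by three box indices in total, one exact match.
print(evaluate([(0, 20), (5, 9)], [(0, 17), (5, 9)]))  # (1.0, 1.5, 0.5)
```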

Key Results on ScreenDrag

Continual training on GUI-Drag lifts the average selection rate (SR) by up to 18 points absolute over the strongest baseline.

Performance over the ScreenDrag benchmark.
Experimental Setting    | DTR↑    | Text-Sparse B-Dist↓ | Text-Sparse SR↑ | Text-Dense B-Dist↓ | Text-Dense SR↑ | Avg. B-Dist↓ | Avg. SR↑
Open-Source
Qwen2.5-VL-3B           | 5.0%    | 44.5                | 3.2%            | 43.8               | 0.0%           | 44.3         | 2.26%
Qwen2.5-VL-7B           | 5.0%    | 27.7                | 3.35%           | 30.8               | 1.16%          | 28.7         | 2.64%
Qwen2.5-VL-32B          | 55.8%   | 23.5                | 5.84%           | 27.1               | 1.65%          | 24.4         | 5.09%
Jedi-3B                 | 94.1%   | 12.1                | 19.0%           | 17.2               | 8.53%          | 13.4         | 16.3%
Jedi-7B                 | 77.5%   | 14.3                | 12.1%           | 18.3               | 4.99%          | 15.4         | 10.3%
UI-TARS-1.5-7B*         | 84.6%   | 13.0                | 23.6%           | 19.5               | 9.36%          | 14.7         | 20.0%
Closed-Source
OpenAI CUA*             | 85.7%   | 9.70                | 21.4%           | 12.98              | 8.13%          | 10.1         | 16.0%
Claude CUA*             | 47.4%   | 10.44               | 17.0%           | 8.92               | 12.59%         | 10.5         | 18.1%
OpenAI CUA (w/ hint)*   | 91.7%   | 8.68                | 18.0%           | 12.83              | 6.74%          | 9.15         | 16.1%
Claude CUA (w/ hint)*   | 96.9%   | 8.63                | 16.9%           | 10.74              | 11.39%         | 9.73         | 16.6%
Ours
Jedi-3B (Drag)          | 100.0%  | 7.9                 | 39.7%           | 9.2                | 20.1%          | 8.2          | 34.7%
Jedi-7B (Drag)          | 100.0%  | 7.4                 | 36.1%           | 8.8                | 16.6%          | 7.7          | 31.2%
GUI-Drag-3B             | 100.0%  | 6.9                 | 43.6%           | 7.2                | 22.9%          | 7.0          | 38.1%
GUI-Drag-7B             | 100.0%  | 6.2                 | 38.1%           | 6.7                | 19.8%          | 6.4          | 33.1%

* Models with native drag support. Lower is better for B-Dist; higher is better for DTR and SR.

Analysis Highlights

  1. Bias toward clicking. Existing grounding models are heavily biased toward click actions, even when the instructions explicitly require dragging.
    Figure: Action distributions.
  2. Continual training effectiveness. Continual training effectively enhances text dragging capability while preserving the original click-based performance.
    Drag and click performance trade-off
    Continual training results on ScreenDrag and click benchmarks (ScreenSpot, ScreenSpot-v2, OSWorld-G; averaged).
  3. Practical agent gains. Case studies on OSWorld show that GUI-Drag enables agents to follow natural strategies instead of relying on peculiar shortcuts; a minimal drag-execution sketch follows this list.
    OSWorld case study success rates (SR).
    Model | OpenAI CUA | Claude CUA | o3  | o3 + GUI-Drag-7B
    SR    | 2/3        | 2/3        | 0/3 | 3/3
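
To make the agent-side usage concrete, the snippet below shows one way a predicted text-dragging action could be executed with a generic GUI automation library such as pyautogui. This is a sketch of the action interface only, under assumed coordinates; it is not the agent harness used in the OSWorld case studies.

```python
# Minimal sketch: execute a predicted text-dragging action with pyautogui.
# The ((x1, y1), (x2, y2)) prediction format matches the grounding model's output
# described above; the surrounding agent harness is assumed for illustration.

import pyautogui

def execute_drag(start, end, duration: float = 0.5) -> None:
    """Drag from `start` to `end` to select the target text span."""
    x1, y1 = start
    x2, y2 = end
    pyautogui.moveTo(x1, y1, duration=0.2)                      # move to the span's start point
    pyautogui.dragTo(x2, y2, duration=duration, button="left")  # press, drag, and release at the end point

# Example with a hypothetical prediction:
execute_drag((312, 488), (946, 510))
```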
