Beyond Clicking:

A Step Towards Generalist GUI Grounding via Text Dragging

Zeyi Liao1, Yadong Lu2, Boyu Gou1, Huan Sun1, Ahmed Awadallah2

1The Ohio State University, 2Microsoft Research
† work done while at MSR

Abstract

    Graphical user interface (GUI) grounding, the process of mapping human instructions to GUI actions, serves as a fundamental basis for autonomous GUI agents. While existing grounding models achieve promising performance in simulating mouse clicks on various click-based benchmarks, another essential mode of mouse interaction, dragging, remains largely underexplored. Yet dragging the mouse to select and manipulate textual content is a prevalent and important operation in practical GUI scenarios. To narrow this gap, we first introduce GUI-Drag, a diverse dataset of 161K text dragging examples synthesized through a scalable pipeline. To support systematic and robust evaluation, we further construct ScreenDrag, a benchmark of 5,333 examples spanning three levels of interface context, together with three dedicated metrics for assessing text dragging capability. Models trained on GUI-Drag with an efficient continual training strategy achieve substantial improvements on ScreenDrag while preserving the original click-based performance on ScreenSpot, ScreenSpot-v2, and OSWorld-G. Our work encourages further research on broader GUI grounding beyond clicking and paves the way toward a truly generalist GUI grounding model.
    GUI-Drag pipeline and benchmark overview
    (Left) ScreenDrag benchmark across three interface contexts. (Right) GUI-Drag pipeline overview. The grounding model receives a screenshot and an instruction, and then outputs the starting and ending coordinates for text dragging.

GUI-Drag Training & Dataset

  • We design a scalable pipeline to synthesize text dragging examples directly from screenshots. It consists of three stages: 1) Instruction Generation, which covers five categories of instructions referencing text spans at five granularities; 2) Grounding, which uses an OCR module to ground each instruction into pixel-level coordinates (a minimal sketch of this step follows the list below); and 3) Filtering, which removes examples with ambiguous intent or incorrect grounding.
  • We carefully analyze and filter existing GUI grounding datasets, including UGround and Jedi, to retain screenshots that are rich in textual content. In addition, we collect 20K public academic-style documents to better reflect real-world scenarios where text-dragging capabilities are frequently used, particularly in academic and document-editing contexts.
  • Overall, we synthesize 161K high-quality text dragging examples, which we denote as GUI-Drag.
  • Unlike prior works that train grounding models from a general foundation model with massive amounts of data, we adopt an efficient continual training strategy that enhances text dragging capability while preserving the original click-based performance of the base model. Our models, continually trained from Jedi-3B/7B, are denoted GUI-Drag-3B/7B.
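
For intuition, below is a minimal sketch of the Grounding stage: it matches an instructed text span against OCR word boxes and returns the starting and ending coordinates for the drag. The OCR record format and the function name are assumptions made for illustration, not the pipeline's actual implementation.

```python
# Minimal sketch of the Grounding stage: match an instructed text span against
# OCR word boxes and derive the starting/ending drag coordinates.
# The OCR record format (word, x, y, w, h) and helper name are illustrative assumptions.

from typing import Dict, List, Optional, Tuple

def ground_text_span(target_words: List[str],
                     ocr_words: List[Dict]) -> Optional[Tuple[Tuple[int, int], Tuple[int, int]]]:
    """Return ((x1, y1), (x2, y2)) for dragging over `target_words`, or None if not found."""
    tokens = [w["word"].lower() for w in ocr_words]
    target = [t.lower() for t in target_words]
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            first, last = ocr_words[i], ocr_words[i + len(target) - 1]
            # Start at the left-middle of the first word, end at the right-middle of the last word.
            start = (first["x"], first["y"] + first["h"] // 2)
            end = (last["x"] + last["w"], last["y"] + last["h"] // 2)
            return start, end
    return None  # span not found; such examples would be dropped by the Filtering stage

# Toy OCR output for two words on one line:
ocr = [
    {"word": "Beyond", "x": 100, "y": 40, "w": 62, "h": 18},
    {"word": "Clicking", "x": 168, "y": 40, "w": 70, "h": 18},
]
print(ground_text_span(["Beyond", "Clicking"], ocr))  # ((100, 49), (238, 49))
```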

ScreenDrag Benchmark

  • ScreenDrag comprises 5,333 human-annotated examples across three popular productivity applications, Word, PowerPoint, and a PDF reader, where text dragging is a core part of user workflows.
  • The screenshots within ScreenDrag span three levels of interface context — document view, application window, and full desktop. This design captures diverse real-world usage scenarios of GUI agents, particularly when employing MCP techniques.
  • The instructions within ScreenDrag span multiple categories and granularities, and are expressed in both explicit and implicit forms.
  • We further split the benchmark into text-sparse and text-dense subsets to quantify fine-grained localization difficulty. An illustrative example record is sketched below.
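
For concreteness, a ScreenDrag example might be represented as a record like the one below. The field names and values are hypothetical; they only illustrate the kinds of annotations described above (application, interface context, text-density split, instruction category/granularity/form, and gold start/end coordinates).

```python
# Hypothetical ScreenDrag-style record; all field names and values are illustrative only.
example = {
    "screenshot": "word_document_view_0042.png",  # one screenshot per example
    "application": "Word",                        # Word, PowerPoint, or PDF reader
    "context_level": "document_view",             # document view / application window / full desktop
    "density": "text-dense",                      # text-sparse vs. text-dense subset
    "instruction": "Select the sentence that defines GUI grounding.",
    "instruction_form": "implicit",               # explicit vs. implicit phrasing
    "granularity": "sentence",                    # one of the annotated granularities
    "start": [312, 488],                          # gold starting coordinate (x, y)
    "end": [946, 510],                            # gold ending coordinate (x, y)
}
```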

Evaluation Metrics

ScreenDrag reports complementary metrics that jointly capture action triggering, spatial alignment, and exact selection quality.

  • Drag Trigger Rate (DTR): the fraction of examples for which the model produces a complete dragging trajectory.
  • Boundary Distance (B-Dist): the average index difference between the predicted and ground-truth boundary boxes.
  • Selection Rate (SR): the percentage of cases in which the selected text span exactly matches the gold annotation.
ScreenDrag metrics illustration
A screenshot with set-of-mark (SOM) annotations in gray. In the top-left black box, the target text span is the first sentence, "Like … tools.". Given the ground-truth and predicted bboxes (the predicted start and end coordinates fall within bbox 0 and bbox 20, omitted here to avoid clutter), the B-Dist is 3 according to Equation 2 in the paper. In the bottom-right box, the target span is the last sentence, "For … Word." Here, both predictions (blue and green) yield zero B-Dist, but only the green coordinates correctly capture the target span: the blue prediction fails because its pixel distance d_pixel exceeds the threshold, whereas the green one succeeds thanks to the text snapping mechanism. A simplified sketch of how these metrics could be computed follows.
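
As a rough sketch, the three metrics could be computed as below, assuming predictions have already been snapped to text bounding boxes. The exact definitions, including the pixel-distance threshold and text snapping described above, follow the paper (e.g., Equation 2 for B-Dist); the index-based aggregation used here is a simplifying assumption.

```python
# Simplified sketch of the ScreenDrag metrics (DTR, B-Dist, SR).
# Assumes predictions are already snapped to text bounding boxes; the exact
# definitions (thresholds, snapping, aggregation) follow the paper.

from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start_box_index, end_box_index) after text snapping

def evaluate(pred_spans: List[Optional[Span]], gold_spans: List[Span]):
    n = len(gold_spans)

    # Drag Trigger Rate: fraction of examples with a complete (start, end) trajectory.
    dtr = sum(p is not None for p in pred_spans) / n

    # Boundary Distance: average index difference between predicted and gold boundary boxes
    # (a simplified stand-in for Equation 2 in the paper).
    dists = [abs(p[0] - g[0]) + abs(p[1] - g[1])
             for p, g in zip(pred_spans, gold_spans) if p is not None]
    b_dist = sum(dists) / max(len(dists), 1)

    # Selection Rate: fraction of examples whose selected span exactly matches the gold span.
    sr = sum(p == g for p, g in zip(pred_spans, gold_spans)) / n

    return dtr, b_dist, sr

# Toy usage: one prediction off by three box indices in total, one exact match.
print(evaluate([(0, 20), (5, 9)], [(0, 17), (5, 9)]))  # (1.0, 1.5, 0.5)
```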

Key Results on ScreenDrag

Continual training on GUI-Drag lifts the average selection rate (SR) by up to 18 points absolute over the strongest baseline.

Performance over the ScreenDrag benchmark.
Experimental Setting    | DTR↑    | Text-Sparse B-Dist↓ | Text-Sparse SR↑ | Text-Dense B-Dist↓ | Text-Dense SR↑ | Avg. B-Dist↓ | Avg. SR↑
Open-Source
Qwen2.5-VL-3B           | 5.0%    | 44.5                | 3.2%            | 43.8               | 0.0%           | 44.3         | 2.26%
Qwen2.5-VL-7B           | 5.0%    | 27.7                | 3.35%           | 30.8               | 1.16%          | 28.7         | 2.64%
Qwen2.5-VL-32B          | 55.8%   | 23.5                | 5.84%           | 27.1               | 1.65%          | 24.4         | 5.09%
Jedi-3B                 | 94.1%   | 12.1                | 19.0%           | 17.2               | 8.53%          | 13.4         | 16.3%
Jedi-7B                 | 77.5%   | 14.3                | 12.1%           | 18.3               | 4.99%          | 15.4         | 10.3%
UI-TARS-1.5-7B*         | 84.6%   | 13.0                | 23.6%           | 19.5               | 9.36%          | 14.7         | 20.0%
Closed-Source
OpenAI CUA*             | 85.7%   | 9.70                | 21.4%           | 12.98              | 8.13%          | 10.1         | 16.0%
Claude CUA*             | 47.4%   | 10.44               | 17.0%           | 8.92               | 12.59%         | 10.5         | 18.1%
OpenAI CUA (w/ hint)*   | 91.7%   | 8.68                | 18.0%           | 12.83              | 6.74%          | 9.15         | 16.1%
Claude CUA (w/ hint)*   | 96.9%   | 8.63                | 16.9%           | 10.74              | 11.39%         | 9.73         | 16.6%
Ours
Jedi-3B (Drag)          | 100.0%  | 7.9                 | 39.7%           | 9.2                | 20.1%          | 8.2          | 34.7%
Jedi-7B (Drag)          | 100.0%  | 7.4                 | 36.1%           | 8.8                | 16.6%          | 7.7          | 31.2%
GUI-Drag-3B             | 100.0%  | 6.9                 | 43.6%           | 7.2                | 22.9%          | 7.0          | 38.1%
GUI-Drag-7B             | 100.0%  | 6.2                 | 38.1%           | 6.7                | 19.8%          | 6.4          | 33.1%

* Models with native drag support. Lower is better for B-Dist; higher is better for DTR and SR.

Analysis Highlights

  1. Bias toward clicking. Existing grounding models are heavily biased toward click actions, even when the instructions explicitly require dragging.
    Figure: Action distributions.
  2. Continual training effectiveness. Continual training effectively enhances text dragging capability while preserving the original click-based performance.
    Drag and click performance trade-off
    Continual training results on ScreenDrag and click benchmarks (ScreenSpot, ScreenSpot-v2, OSWorld-G; averaged).
  3. Practical agent gains. Case studies on OSWorld show that GUI-Drag enables agents to follow natural strategies instead of relying on peculiar shortcuts; a minimal drag-execution sketch follows this list.
    OSWorld case study success rates (SR).
    Model | OpenAI CUA | Claude CUA | o3  | o3 + GUI-Drag-7B
    SR    | 2/3        | 2/3        | 0/3 | 3/3
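
To make the agent-side usage concrete, the snippet below shows one way a predicted text-dragging action could be executed with a generic GUI automation library such as pyautogui. This is a sketch of the action interface only, under assumed coordinates; it is not the agent harness used in the OSWorld case studies.

```python
# Minimal sketch: execute a predicted text-dragging action with pyautogui.
# The ((x1, y1), (x2, y2)) prediction format matches the grounding model's output
# described above; the surrounding agent harness is assumed for illustration.

import pyautogui

def execute_drag(start, end, duration: float = 0.5) -> None:
    """Drag from `start` to `end` to select the target text span."""
    x1, y1 = start
    x2, y2 = end
    pyautogui.moveTo(x1, y1, duration=0.2)                      # move to the span's start point
    pyautogui.dragTo(x2, y2, duration=duration, button="left")  # press, drag, and release at the end point

# Example with a hypothetical prediction:
execute_drag((312, 488), (946, 510))
```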
