ScreenDrag reports complementary metrics that jointly capture action triggering, spatial alignment, and exact selection quality.
Continual training on GUI-Drag lifts the average success rate by up to 18 percentage points over the strongest baseline (38.1% for GUI-Drag-3B versus 20.0% for UI-TARS-1.5-7B).
Experimental Setting | DTR↑ | Text-Sparse B-Dist↓ | Text-Sparse SR↑ | Text-Dense B-Dist↓ | Text-Dense SR↑ | Avg. B-Dist↓ | Avg. SR↑ |
---|---|---|---|---|---|---|---|
Open-Source | | | | | | | |
Qwen2.5-VL-3B | 5.0% | 44.5 | 3.2% | 43.8 | 0.0% | 44.3 | 2.26% |
Qwen2.5-VL-7B | 5.0% | 27.7 | 3.35% | 30.8 | 1.16% | 28.7 | 2.64% |
Qwen2.5-VL-32B | 55.8% | 23.5 | 5.84% | 27.1 | 1.65% | 24.4 | 5.09% |
Jedi-3B | 94.1% | 12.1 | 19.0% | 17.2 | 8.53% | 13.4 | 16.3% |
Jedi-7B | 77.5% | 14.3 | 12.1% | 18.3 | 4.99% | 15.4 | 10.3% |
UI-TARS-1.5-7B* | 84.6% | 13.0 | 23.6% | 19.5 | 9.36% | 14.7 | 20.0% |
Closed-Source | | | | | | | |
OpenAI CUA* | 85.7% | 9.70 | 21.4% | 12.98 | 8.13% | 10.1 | 16.0% |
Claude CUA* | 47.4% | 10.44 | 17.0% | 8.92 | 12.59% | 10.5 | 18.1% |
OpenAI CUA (w/ hint)* | 91.7% | 8.68 | 18.0% | 12.83 | 6.74% | 9.15 | 16.1% |
Claude CUA (w/ hint)* | 96.9% | 8.63 | 16.9% | 10.74 | 11.39% | 9.73 | 16.6% |
Ours | | | | | | | |
Jedi-3B (Drag) | 100.0% | 7.9 | 39.7% | 9.2 | 20.1% | 8.2 | 34.7% |
Jedi-7B (Drag) | 100.0% | 7.4 | 36.1% | 8.8 | 16.6% | 7.7 | 31.2% |
GUI-Drag-3B | 100.0% | 6.9 | 43.6% | 7.2 | 22.9% | 7.0 | 38.1% |
GUI-Drag-7B | 100.0% | 6.2 | 38.1% | 6.7 | 19.8% | 6.4 | 33.1% |
* Models with native drag support. Lower is better for B-Dist; higher is better otherwise. Highlighted entries mark the best result; underlined entries mark the second best.
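The headline gain can be checked directly against the SR columns of the table above. The following is a minimal sketch, not part of the benchmark code: the dictionary names and the `absolute_gains` helper are illustrative, and the values are transcribed by hand from the table (asterisks dropped from model names). For each SR column it compares the best model in the "Ours" group with the best baseline.

```python
# SR values in percent, transcribed from the table above.
# Tuple order: (Text-Sparse SR, Text-Dense SR, Avg. SR).
# All identifiers here are illustrative, not from the benchmark's code.
BASELINES = {
    "Qwen2.5-VL-3B": (3.2, 0.0, 2.26),
    "Qwen2.5-VL-7B": (3.35, 1.16, 2.64),
    "Qwen2.5-VL-32B": (5.84, 1.65, 5.09),
    "Jedi-3B": (19.0, 8.53, 16.3),
    "Jedi-7B": (12.1, 4.99, 10.3),
    "UI-TARS-1.5-7B": (23.6, 9.36, 20.0),
    "OpenAI CUA": (21.4, 8.13, 16.0),
    "Claude CUA": (17.0, 12.59, 18.1),
    "OpenAI CUA (w/ hint)": (18.0, 6.74, 16.1),
    "Claude CUA (w/ hint)": (16.9, 11.39, 16.6),
}
OURS = {
    "Jedi-3B (Drag)": (39.7, 20.1, 34.7),
    "Jedi-7B (Drag)": (36.1, 16.6, 31.2),
    "GUI-Drag-3B": (43.6, 22.9, 38.1),
    "GUI-Drag-7B": (38.1, 19.8, 33.1),
}
COLUMNS = ("Text-Sparse SR", "Text-Dense SR", "Avg. SR")


def best(models, col):
    """Return (name, value) of the model with the highest SR in column `col`."""
    name = max(models, key=lambda m: models[m][col])
    return name, models[name][col]


def absolute_gains():
    """Per-column gain (percentage points) of the best 'Ours' model over the best baseline."""
    gains = {}
    for i, column in enumerate(COLUMNS):
        base_name, base_sr = best(BASELINES, i)
        ours_name, ours_sr = best(OURS, i)
        gains[column] = (ours_name, base_name, round(ours_sr - base_sr, 2))
    return gains


for column, (ours_name, base_name, gain) in absolute_gains().items():
    print(f"{column}: {ours_name} vs {base_name}: +{gain} points")
```

On these transcribed values, the gain on Avg. SR is 18.1 points (GUI-Drag-3B over UI-TARS-1.5-7B), which matches the reported figure. The text-sparse and text-dense gains are 20.0 and 10.31 points respectively.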
Model | OpenAI CUA | Claude CUA | o3 | o3 + GUI-Drag-7B |
---|---|---|---|---|
SR | 2/3 | 2/3 | 0/3 | 3/3 |