Agent skills occupy a privileged position in the agent workflow, as agents are expected to implicitly follow and execute them, rendering third-party skills a vulnerable attack surface. Existing studies have revealed unsafe agent behaviors induced by skill-based attacks, but they primarily evaluate poisoned skills within a single task execution and enumerate harms through ad-hoc risk lists. To bridge these gaps, we introduce SkillHarm, a benchmark of skill-based attacks across the skill-use lifecycle, paired with a systematic taxonomy of skill-relevant risks. SkillHarm evaluates two attack scenarios: Fixed-Payload Poisoning (FPP), where a fixed poisoned skill package directly compromises any task session that invokes it, and Self-Mutating Poisoning (SMP), where an initially benign execution silently mutates persistent skill content, deferring harm until a subsequent reuse. It further defines 12 risk types based on the agent workflow component targeted by the harm: data pipelines, system environments, and agent autonomy. To instantiate these attacks at scale, we build AutoSkillHarm, an automated construction pipeline with coding agents driven by natural-language harnesses. The resulting benchmark contains 879 attack samples across 71 skills. Experiments show that current agents remain vulnerable, with attack success rates of up to 86.3% in FPP and 69.3% in SMP. Our analysis further reveals a latent risk: many apparent attack failures stem from the agent failing to engage with the poisoned file rather than from genuine resistance, and current defenses still fail to reliably mitigate the threat.
The attacker publishes a skill whose malicious payload is present at installation and remains fixed thereafter. The payload may live in SKILL.md, reference documents, helper scripts, or attacker-added artifacts. When an agent invokes the poisoned skill for a relevant task, the payload can realize the harmful outcome within that same session.
Captures direct single-session compromise.
The skill contains a mutation hook rather than an immediately harmful payload. The first session using the skill appears harmless, while silently modifying persistent skill content. The harmful outcome materializes only when a later session reuses the modified skill.
Exposes a failure mode single-session evaluations cannot observe.
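Because SMP works by changing persistent skill content between sessions, one way to make that change observable is to fingerprint the skill directory before and after each session and compare the two snapshots. The following is a minimal defensive sketch, not part of SkillHarm itself; the function names `fingerprint_skill` and `diff_fingerprints` are hypothetical, and it assumes the skill is stored as ordinary files under a single directory.

```python
import hashlib
from pathlib import Path


def fingerprint_skill(skill_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to skill_dir) to the SHA-256 of its bytes."""
    digests: dict[str, str] = {}
    for path in sorted(skill_dir.rglob("*")):
        if path.is_file():
            rel = path.relative_to(skill_dir).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_fingerprints(
    before: dict[str, str], after: dict[str, str]
) -> dict[str, list[str]]:
    """Report files added, removed, or modified between two snapshots."""
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "modified": sorted(p for p in shared if before[p] != after[p]),
    }
```

Taking one snapshot at installation and another after every session would flag a first session that looks harmless but rewrites SKILL.md or a helper script. It would not, by itself, decide whether a given change is malicious.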
A coding agent drives three sequential stages, each specified by a natural-language harness that defines stage inputs, objectives, constraints, required outputs, and review criteria.
Figure 2. Overview of AutoSkillHarm. Target selection identifies reachable attack targets; attack design generates contextualized payloads with deterministic evaluators; quality filtering removes invalid candidates.
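The stage structure described above can be pictured as a small data type plus a sequential runner. This is an illustrative sketch only: `StageHarness`, `render`, `run_stages`, and the `agent` callable are hypothetical names, and the sketch assumes each stage simply hands its output to the next. The source does not specify how AutoSkillHarm represents harnesses or invokes the coding agent.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class StageHarness:
    """The five elements a natural-language harness defines for one stage."""
    name: str
    inputs: str
    objectives: str
    constraints: str
    required_outputs: str
    review_criteria: str

    def render(self) -> str:
        """Flatten the harness into a plain-text instruction block."""
        fields = [
            ("Stage", self.name),
            ("Inputs", self.inputs),
            ("Objectives", self.objectives),
            ("Constraints", self.constraints),
            ("Required outputs", self.required_outputs),
            ("Review criteria", self.review_criteria),
        ]
        return "\n".join(f"{label}: {value}" for label, value in fields)


def run_stages(
    harnesses: Sequence[StageHarness],
    agent: Callable[[str, str], str],
    seed: str,
) -> str:
    """Run stages in order; each stage receives the previous stage's output.

    `agent` is a stand-in for a coding agent: it takes the rendered harness
    and the current artifact, and returns the next artifact.
    """
    artifact = seed
    for harness in harnesses:
        artifact = agent(harness.render(), artifact)
    return artifact
```

With three `StageHarness` instances named after the stages in Figure 2 (target selection, attack design, quality filtering), `run_stages` would call the agent three times in that order.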
12 risk types organized by the agent workflow component through which harm materializes.
Across six representative model-harness configurations spanning four agent harnesses (Claude Code, Codex, Gemini CLI, OpenCode), current agents remain highly vulnerable. Our analysis further surfaces a latent risk: many apparent failures occur not because the agent resists the attack, but because it never engages with the poisoned file. Conditional on engagement, the attack success rate (ASR) rises sharply. Explicit refusal is rare across all evaluated agents, and standard defenses (skill scanners and defensive system prompts) fail to reliably mitigate the threat.
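The gap between overall ASR and engagement-conditional ASR is easiest to see with a small calculation. The sketch below assumes each trial is logged with two flags, whether the agent engaged with the poisoned file and whether the attack succeeded. The `Trial` record and both function names are hypothetical, and the numbers are invented for illustration, not results from the benchmark.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Trial:
    engaged: bool    # did the agent read or execute the poisoned file?
    succeeded: bool  # did the evaluator mark the attack as successful?


def attack_success_rate(trials: Sequence[Trial]) -> Optional[float]:
    """Fraction of all trials in which the attack succeeded."""
    if not trials:
        return None
    return sum(t.succeeded for t in trials) / len(trials)


def engaged_attack_success_rate(trials: Sequence[Trial]) -> Optional[float]:
    """Fraction of successful attacks among trials where the agent engaged."""
    engaged = [t for t in trials if t.engaged]
    if not engaged:
        return None
    return sum(t.succeeded for t in engaged) / len(engaged)


# Invented example: 10 trials, 5 with engagement, 4 of those successful.
trials = (
    [Trial(engaged=True, succeeded=True)] * 4
    + [Trial(engaged=True, succeeded=False)]
    + [Trial(engaged=False, succeeded=False)] * 5
)
print(f"{attack_success_rate(trials):.1%}")          # 40.0%
print(f"{engaged_attack_success_rate(trials):.1%}")  # 80.0%
```

In this made-up case, the overall ASR of 40% understates the risk: the five non-engaged trials count as failures even though the agent never had the chance to resist, and among engaged trials the attack succeeds 80% of the time.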
@article{ning2026skillharm,
  title         = {SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction},
  author        = {Ning, Yuting and Zhang, Zhehao and Lal, Yash Kumar and Gou, Boyu
                   and Li, Junyi and Ruan, Weitong and Ye, Chentao and Gupta, Rahul
                   and Yang, Diyi and Su, Yu and Sun, Huan},
  journal       = {arXiv preprint arXiv:2606.02540},
  year          = {2026},
  eprint        = {2606.02540},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}