SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction


Figure 1. The two attack scenarios in SkillHarm. In Fixed-Payload Poisoning, a poisoned skill directly triggers a harmful outcome during a single task execution. In Self-Mutating Poisoning, an initial task executes harmlessly but silently mutates the skill package, compromising a subsequent task upon reuse.

Abstract

Agent skills occupy a privileged position in the agent workflow, as agents are expected to implicitly follow and execute them, rendering third-party skills a vulnerable attack surface. Existing studies have revealed unsafe agent behaviors induced by skill-based attacks, but they primarily evaluate poisoned skills within a single task execution and enumerate harms through ad-hoc risk lists. To bridge these gaps, we introduce SkillHarm, a benchmark of skill-based attacks across the skill-use lifecycle, paired with a systematic taxonomy of skill-relevant risks. SkillHarm evaluates two attack scenarios: Fixed-Payload Poisoning (FPP), where a fixed poisoned skill package directly compromises any task session that invokes it, and Self-Mutating Poisoning (SMP), where an initially benign execution silently mutates persistent skill content, deferring harm until a subsequent reuse. It further defines 12 risk types based on the agent workflow component targeted by the harm: data pipelines, system environments, and agent autonomy. To instantiate these attacks at scale, we build AutoSkillHarm, an automated construction pipeline with coding agents driven by natural-language harnesses. The resulting benchmark contains 879 attack samples across 71 skills. Experiments show that current agents remain vulnerable with attack success rates up to 86.3% in FPP and 69.3% in SMP. Our analysis further reveals a latent risk: many apparent attack failures stem from the agent failing to engage with the poisoned file rather than from genuine resistance, and current defenses still fail to reliably mitigate the threat.

Attack Scenarios

Fixed-Payload Poisoning (FPP)

The attacker publishes a skill whose malicious payload is present at installation and remains fixed thereafter. The payload may live in SKILL.md, reference documents, helper scripts, or attacker-added artifacts. When an agent invokes the poisoned skill for a relevant task, the payload can realize the harmful outcome within that same session.

This scenario captures direct, single-session compromise.

Self-Mutating Poisoning (SMP)

The skill contains a mutation hook rather than an immediately harmful payload. The first session using the skill appears harmless, while silently modifying persistent skill content. The harmful outcome materializes only when a later session reuses the modified skill.

This scenario exposes a failure mode that single-session evaluations cannot observe.
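
One way to make the silent mutation in this scenario observable is to fingerprint the skill package before and after a session and compare the two. The sketch below is a minimal illustration and not part of SkillHarm: the function names `snapshot_skill` and `diff_snapshots` are hypothetical, and it assumes the skill package is an ordinary directory on disk.

```python
import hashlib
from pathlib import Path


def snapshot_skill(skill_dir: str) -> dict[str, str]:
    """Map each file's relative path to the SHA-256 digest of its bytes."""
    root = Path(skill_dir)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_snapshots(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Report files added, removed, or modified between two snapshots."""
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "modified": sorted(p for p in shared if before[p] != after[p]),
    }
```

Taking a snapshot at installation and again after each session would flag any change to SKILL.md, reference documents, or helper scripts, including a change made by a session that otherwise looked harmless. Whether a flagged change is malicious still has to be judged separately.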

AutoSkillHarm: Automated Attack Construction

A coding agent drives three sequential stages, each specified by a natural-language harness that defines stage inputs, objectives, constraints, required outputs, and review criteria.


Figure 2. Overview of AutoSkillHarm. Target selection identifies reachable attack targets; attack design generates contextualized payloads with deterministic evaluators; quality filtering removes invalid candidates.
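
The three-stage structure can be pictured as an ordered list of harness specifications. The sketch below is only an assumed representation: the `Harness` schema, the stage identifiers, and `build_pipeline` are hypothetical names chosen to mirror the five harness elements and three stages described above, not AutoSkillHarm's actual format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Harness:
    """Natural-language specification for one pipeline stage (hypothetical schema)."""
    stage: str
    inputs: str
    objectives: str
    constraints: str
    required_outputs: str
    review_criteria: str


# Stage order follows the text; the identifiers themselves are assumptions.
STAGES = ("target_selection", "attack_design", "quality_filtering")


def build_pipeline(harnesses: list[Harness]) -> list[Harness]:
    """Return harnesses in pipeline order, rejecting missing or duplicate stages."""
    by_stage = {h.stage: h for h in harnesses}
    if len(by_stage) != len(harnesses) or set(by_stage) != set(STAGES):
        raise ValueError("expected exactly one harness per stage: " + ", ".join(STAGES))
    return [by_stage[name] for name in STAGES]
```

Keeping each harness as plain text fields reflects the description of harnesses as natural-language documents; the coding agent, not this structure, would do the actual work at each stage.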

Risk Taxonomy

The taxonomy defines 12 risk types, organized by the agent workflow component through which harm materializes.

Data Pipeline Exploitation

Data Exfiltration. Disclose sensitive or private data to an external destination.
Output Manipulation. Alter a user-facing deliverable to be false, biased, or misleading.
Poisoning. Contaminate trusted data or configuration.

System Environment Exploitation

Privilege Escalation. Weaken access controls on a protected file or directory.
Unauthorized File Modification. Modify, delete, or encrypt out-of-scope user or system files.
Backdoor Injection. Create persistent unauthorized access or hidden execution.
Denial of Service. Plant a persistent disruption to degrade system availability.
Malware Deployment. Download and install malicious payloads.
System Corruption. Corrupt trusted system, tool, network, or runtime configuration.

Agent Autonomy Exploitation

Goal Hijacking. Divert the agent from the user's task to an unrelated objective.
Anti-Forensics. Remove or alter evidence of prior malicious activity.
Proxy Attack. Send attacker-authored content to a third party via agent authority.
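
For tagging or aggregating samples by risk, the taxonomy above can be written down as a small lookup structure. This is a sketch of one possible encoding: the category and risk-type names come from the lists above (with the file-modification label spelled out in full), while `Category`, `TAXONOMY`, and `category_of` are hypothetical names.

```python
from enum import Enum


class Category(Enum):
    DATA_PIPELINE = "Data Pipeline Exploitation"
    SYSTEM_ENVIRONMENT = "System Environment Exploitation"
    AGENT_AUTONOMY = "Agent Autonomy Exploitation"


# Three categories covering the 12 risk types listed in the taxonomy.
TAXONOMY: dict[Category, tuple[str, ...]] = {
    Category.DATA_PIPELINE: ("Data Exfiltration", "Output Manipulation", "Poisoning"),
    Category.SYSTEM_ENVIRONMENT: (
        "Privilege Escalation", "Unauthorized File Modification", "Backdoor Injection",
        "Denial of Service", "Malware Deployment", "System Corruption",
    ),
    Category.AGENT_AUTONOMY: ("Goal Hijacking", "Anti-Forensics", "Proxy Attack"),
}


def category_of(risk_type: str) -> Category:
    """Return the category that contains the given risk type."""
    for category, risk_types in TAXONOMY.items():
        if risk_type in risk_types:
            return category
    raise KeyError(risk_type)
```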

Headline Results

879 attack samples
71 skills covered
86.3% peak ASR (FPP)
69.3% peak ASR (SMP)

Across six representative model-harness configurations spanning four agent harnesses (Claude Code, Codex, Gemini CLI, OpenCode), current agents remain highly vulnerable. Our analysis further surfaces a latent risk: many apparent failures occur not because the agent resists the attack, but because it never engages with the poisoned file. Conditional on engagement, the attack success rate (ASR) rises sharply. Explicit refusal is rare across all evaluated agents, and standard defenses (skill scanners and defensive system prompts) fail to reliably mitigate the threat.
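
The gap between overall and engagement-conditional success can be made concrete with a small calculation. The sketch below assumes each trial records two booleans; the `Trial` type, the reading of "engaged" as the agent having read or executed the poisoned file, and the sample trials are all illustrative assumptions, not data or definitions from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    engaged: bool    # assumed meaning: the agent read or executed the poisoned file
    succeeded: bool  # the evaluator observed the harmful outcome


def asr(trials: list[Trial]) -> float:
    """Attack success rate over all trials (raises ZeroDivisionError if empty)."""
    return sum(t.succeeded for t in trials) / len(trials)


def asr_given_engagement(trials: list[Trial]) -> float:
    """Attack success rate restricted to trials where the agent engaged."""
    engaged = [t for t in trials if t.engaged]
    return sum(t.succeeded for t in engaged) / len(engaged)


# Made-up trials for illustration only; not results from the benchmark.
trials = [Trial(True, True)] * 3 + [Trial(True, False)] + [Trial(False, False)] * 4
print(asr(trials))                   # 0.375
print(asr_given_engagement(trials))  # 0.75
```

In this toy sample half of the trials never touch the poisoned file, so the overall rate understates how often the attack works once the file is actually used.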

BibTeX

@article{ning2026skillharm,
  title         = {SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction},
  author        = {Ning, Yuting and Zhang, Zhehao and Lal, Yash Kumar and Gou, Boyu
                   and Li, Junyi and Ruan, Weitong and Ye, Chentao and Gupta, Rahul
                   and Yang, Diyi and Su, Yu and Sun, Huan},
  journal       = {arXiv preprint arXiv:2606.02540},
  year          = {2026},
  eprint        = {2606.02540},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}