How Indirect Prompt Injection Attacks Work in Practice

Indirect prompt injection is often explained as a theoretical flaw in large language models, but the more important question for security teams is practical: how do these attacks actually get built, delivered and executed in real environments? The answer is that attackers do not rely on a single trick. They combine web design techniques, content obfuscation, runtime execution and jailbreak wording to make malicious instructions readable to AI systems while staying hidden from humans and basic security checks. That is what makes web-based indirect prompt injection so operationally relevant.

At a high level, the attack chain is simple. An attacker places hidden or manipulated instructions inside webpage content. An AI system later visits, parses, summarizes or evaluates that content during a normal workflow. The model then treats attacker-controlled text as instructions instead of untrusted data. In agentic systems, that may lead to approval of a scam page, biased recommendations, leakage of sensitive data, resource exhaustion or even attempts to trigger transactions. This creates a web-scale attack surface as AI becomes more deeply embedded in browsers, search, moderation, analysis and autonomous workflows.

The attacker’s playbook: two layers of engineering

A useful way to understand these attacks is to separate them into two technical layers. The first is prompt delivery: how the malicious prompt is embedded in the webpage. The second is jailbreaking: how the instructions are phrased so the model is more likely to obey them. In practice, real attacks often combine both. An attacker may hide a prompt in HTML attributes or off-screen DOM elements, then wrap it in authority language such as “developer instruction” or “security compliance update” to increase compliance.

1. Visible plaintext: the simplest method still works

One of the more surprising findings in recent research is that attackers do not always need sophisticated hiding techniques. In many observed cases, the most common delivery method has simply been visible plaintext. That matters because it shows that some attackers are placing prompt-like text directly into web content and relying on the AI system to read and act on it. In some workflows, the model is easier to influence than defenders expect.

This is an important lesson for defenders: prompt injection is not only about invisible text or complicated obfuscation. Any content channel an AI ingests can become an instruction channel if the system is poorly designed. User comments, product descriptions, metadata fields and article text can all become attack surfaces.

2. Visual concealment: hiding prompts in plain sight

A more advanced and very practical method is visual concealment. Here, the attacker puts malicious instructions into the DOM but makes them invisible or nearly invisible to a human reviewer. Common tricks include setting the font size to zero, collapsing line height, hiding content inside zero-height containers, or positioning text far off-screen with extreme negative coordinates. Attackers may also use display: none, visibility: hidden, opacity: 0, or white text on a white background. In some cases, prompts are placed inside elements such as <textarea> and then hidden with CSS while remaining part of the page structure.

This works because many AI systems do not process a page the way a human sees it. They may read raw HTML, extracted text, accessibility content or rendered DOM output. So content that appears invisible to a user may still be fully visible to an AI.

3. Obfuscation in markup: making prompts look semantically irrelevant

Attackers also hide prompts inside markup structures that appear harmless to traditional parsers. For example, malicious instructions can be placed inside SVG or XML content, including CDATA sections, where they may look like non-executable data. They can also be hidden in HTML attributes such as data-* fields, where the content may not appear meaningful to a conventional browser or scanner but still remains readable to an AI pipeline.

This is important because many legacy security controls focus on rendered output or obvious script abuse. But AI systems may consume raw text from places that were never meant to act as instructions. The attacker is using the webpage as a storage container for prompt content, trusting that the model will extract meaning from locations ordinary tools may ignore.

4. Dynamic execution: building the prompt at runtime

One of the more technically sophisticated methods is runtime assembly through JavaScript. Instead of placing the full instruction in the initial HTML, the attacker encodes it and reconstructs it after the page loads. For example, a malicious page may contain a Base64-encoded payload that JavaScript decodes and inserts into an invisible DOM element after a short delay. This can help the attacker evade shallow scans that inspect only the original page source.

In some cases, attackers also use canvas-based rendering or delayed content injection so that the malicious text appears only after the page has already loaded and initial checks have passed. This creates a serious inspection problem: defenders may scan the static source code, while the AI agent later interacts with the fully rendered and dynamically altered page.

5. Payload fragmentation and evasion tricks

Attackers also use text-level obfuscation to make detection harder. They may insert zero-width Unicode characters, use homoglyphs that resemble ordinary letters, split a malicious sentence across multiple HTML elements, or encode parts of the prompt in ways that bypass simple filters. A standard string-matching rule may fail to detect a known phrase, while the language model still reconstructs the intended meaning.

This is one of the reasons indirect prompt injection is difficult to stop with conventional defenses. Many traditional filters look for exact strings or obvious patterns. LLMs, by contrast, are built to infer meaning from fragmented or noisy input. That gives attackers a clear advantage.

6. Jailbreak wording: persuading the model to cooperate

Prompt delivery gets the malicious content in front of the model. Jailbreak phrasing increases the chance that the model will obey it. Many real-world attacks rely heavily on social engineering language. The malicious text may claim to be a developer instruction, a system update, a compliance requirement or an administrative override. It may frame the task as urgent or authorized in order to manipulate the model’s behavior.

This matters because the attack is not purely technical. It also exploits the model’s learned tendency to follow instructions that sound official, privileged or time-sensitive. In effect, indirect prompt injection sits at the intersection of web exploitation and machine-targeted persuasion.

Real-world example: the ad review bypass case

One of the clearest field examples described by researchers involved a deceptive ad page that appeared designed to influence an AI-based ad review system. The page used multiple prompt injection methods at once, including hidden instructions intended to coerce the AI into approving a scam advertisement that should have been rejected. This example is significant because it shows attackers actively targeting AI moderation and review workflows, not just experimenting in theory.

Even where researchers cannot publicly confirm that a given attack succeeded against a production system, the intent is clear. Attackers are investing effort in layered payloads, runtime evasion and authority-based jailbreak language because they believe AI-enabled systems are exploitable and valuable targets.

What security teams should take away

The main lesson is that indirect prompt injection is not one bug but a family of techniques. It exploits the gap between how humans review content, how traditional scanners inspect it and how AI systems interpret it. A webpage becomes both a document and a covert instruction channel.

That is why defenders need more than keyword filtering. They need stronger separation between trusted instructions and untrusted content, better runtime inspection, DOM-aware analysis, visibility checks, and layered safeguards around what AI agents are allowed to do. As organizations give AI systems more access to tools, workflows and external content, the cost of getting this wrong rises quickly.

Source: This article is based primarily on Fooling AI Agents: Web-Based Indirect Prompt Injection Observed in the Wild by Palo Alto Networks Unit 42.

FAQ

1. What is the most common delivery method seen in real attacks?
In observed campaigns, visible plaintext has been one of the most common methods, showing that attackers do not always need elaborate hiding techniques.

2. Why do hidden prompts still affect AI if users cannot see them?
Because AI systems may process raw HTML, extracted text, DOM content, accessibility data or other machine-readable forms rather than only what appears on the screen.

3. Are these attacks only about hidden text?
No. They can also involve JavaScript-based runtime assembly, encoded payloads, SVG or XML embedding, attribute cloaking, payload splitting and social-engineering jailbreak language.

4. What makes indirect prompt injection so hard to defend against?
It exploits a structural problem: LLMs can interpret untrusted content as instructions, while many existing defenses still rely on shallow inspection or static pattern matching.