Posted in

How to Protect Against Indirect Prompt Injection

Hands typing on a laptop with an e-commerce website open, showcasing online shopping.

As AI systems become more capable, they also become more exposed. Large language models are no longer limited to answering isolated questions. They now summarize webpages, process documents, review content, interact with business tools, and in some cases act as agents that can make decisions or trigger actions. That shift creates major productivity gains, but it also creates a new class of security risk: indirect prompt injection.

In simple terms, indirect prompt injection happens when an attacker hides instructions inside content that an AI system later reads and mistakes for legitimate commands. A poisoned webpage, document, metadata field, or embedded script can influence the model’s behavior without the user realizing it. That means protecting against this threat is no longer just about filtering bad inputs. It is about designing AI systems so they can safely handle untrusted content.

For both cybersecurity professionals and security-aware consumers, the key point is this: indirect prompt injection cannot be solved with one control alone. It requires defense in depth, careful system design, and clear limits on what the model is allowed to do.

Why defense is difficult

The challenge begins with how language models work. Traditional software usually separates code from data very clearly. LLMs do not. They process everything in a shared context, which means a malicious instruction hidden in a webpage may sit next to the legitimate task the model is supposed to perform. If the system is not carefully designed, the model may struggle to distinguish between content it should analyze and instructions it should ignore.

This is why the problem is more than a typical web filtering issue. An attacker is not always trying to exploit the browser or the user directly. Instead, the goal is to exploit the AI’s interpretation layer. That makes indirect prompt injection both a security problem and an architecture problem.

1. Separate trusted instructions from untrusted content

The most important defense is to create a clear boundary between system instructions and external data.

If an AI system is asked to analyze a webpage, that page should be treated strictly as untrusted content. The model should be reminded, in the strongest possible way, that anything inside the page is data to inspect, not instructions to follow. This sounds simple, but in practice it is one of the most important safeguards. Many indirect prompt injection attacks work because the model is allowed to blur that distinction.

This principle should be built into the design of every AI workflow. The model should know:

  • which instructions come from the developer or system owner

  • which content comes from the user

  • and which material comes from external, potentially hostile sources

Without that hierarchy, the model has little chance of making safe decisions consistently.

2. Limit what the AI is allowed to do

The next major defense is least privilege.

An AI assistant that only summarizes text is less dangerous than an AI agent that can send emails, approve ads, move money, modify records, or access sensitive documents. The more authority the model has, the more damaging a successful indirect prompt injection becomes.

That means organizations should avoid giving AI systems broad, unrestricted permissions. If a model does not need access to a business system, it should not have it. If it does need access, that access should be narrow, monitored, and limited to specific tasks.

In practice, this means:

  • restricting tool use to approved functions

  • limiting access to sensitive systems and data

  • adding approval steps before high-risk actions

  • and preventing the model from chaining together powerful actions without oversight

A poisoned webpage is far less dangerous if the model cannot do much with the instructions it receives.

3. Put humans in the loop for sensitive actions

One of the best defenses is also one of the oldest: require human review where the stakes are high.

If an AI system is being used to approve advertisements, authorize payments, trigger transactions, modify customer data, or access internal information, it should not be allowed to act autonomously on weak evidence. Human review creates friction, but it also breaks the attacker’s most important objective: converting hidden instructions into real-world action.

This does not mean every AI workflow must become manual. It means the most sensitive actions should have clear escalation points. A model can assist, summarize, and prioritize, but final approval for high-impact decisions should rest with a person or with a separate control layer that validates the result.

4. Inspect content the way the AI sees it

A major reason indirect prompt injection succeeds is that human reviewers and AI systems do not always see the same thing. A webpage may look clean to a person while still containing hidden text, encoded payloads, invisible DOM elements, or dynamically injected instructions.

That means defensive inspection has to go beyond ordinary page review. Security teams need to think about how the AI processes content:

  • raw HTML

  • extracted text

  • DOM structure

  • accessibility content

  • dynamically rendered elements

  • and potentially OCR or other recovery methods

If security tools only inspect what is visible on screen, they may miss the very instructions the AI consumes. Defensive analysis needs to account for hidden CSS content, off-screen elements, obfuscated attributes, encoded strings, and JavaScript-generated text.

5. Filter and sanitize external content before it reaches the model

Another useful control is to preprocess external content before the AI sees it. This can reduce the attack surface significantly.

For example, organizations can strip or neutralize:

  • hidden elements

  • suspicious attributes

  • script-generated text

  • excessive formatting tricks

  • encoded payloads

  • and patterns commonly associated with prompt injection attempts

The goal is not to perfectly detect every attack in advance. That would be unrealistic. The goal is to reduce the amount of attacker-controlled instruction-like content that reaches the model in the first place.

Sanitization is especially important for systems that routinely ingest web pages, email content, user comments, uploaded files, and scraped text from multiple external sources.

6. Use layered prompt defenses, not prompt wording alone

Many teams respond to prompt injection by strengthening the system prompt. That helps, but it is not enough.

A well-written system instruction can remind the model to ignore commands found in external content, but prompt-only defenses are fragile. Attackers can still use authority language, social engineering, obfuscation, and repeated instructions to influence behavior. The safer approach is to combine prompt-level defenses with architectural controls.

These may include:

  • instruction hierarchy

  • context isolation

  • adversarial testing

  • response validation

  • tool access restrictions

  • and runtime monitoring

In other words, the prompt is only one layer. It should not be the entire security strategy.

7. Monitor for suspicious behavior and anomalous outputs

Indirect prompt injection often reveals itself through behavior. An AI system may suddenly produce irrelevant answers, leak internal instructions, approve unsafe content, or recommend actions that make little sense in the user’s context.

That makes monitoring essential. Security teams should log:

  • which sources the AI consumed

  • what tools it attempted to use

  • what outputs it generated

  • and whether those outputs included signs of manipulation or policy override

The goal is not only detection but learning. Over time, organizations can use these signals to identify common attack patterns, refine filters, and improve model safeguards.

8. Test AI systems like you would test any exposed application

If an organization uses AI in production, it should assume that attackers will probe it. That means indirect prompt injection needs to be part of normal security testing.

Red-team exercises should include:

  • poisoned webpages

  • malicious documents

  • hidden text payloads

  • encoded instructions

  • role-based jailbreak attempts

  • and workflows that try to trick the model into leaking data or taking unintended actions

Testing should also focus on full pipelines, not just the model in isolation. A system may appear safe in a lab but fail once browsing, rendering, tool use, or automation is introduced.

9. Reduce trust in third-party and open web content

Many AI workflows are built on the assumption that public content is mostly benign. That assumption is becoming less safe.

Organizations should treat public web content the way they treat any other untrusted input. The fact that a page appears in search results or looks professional does not mean it is safe for an AI agent to consume. The rise of hidden prompts, scam pages, manipulative SEO content, and AI-targeted payloads means the open web is now part of the threat surface for AI systems.

For consumers, the lesson is similar. If an AI assistant recommends a payment action, a subscription, or a website based on external content, it is worth pausing and checking the source manually before acting.

What organizations should do now

For companies using AI agents or LLM-powered workflows, the practical response is clear. They should review where external content enters their systems, what permissions their models have, and what would happen if a malicious page successfully influenced the output.

A mature strategy should include:

  • strict separation of trusted instructions and untrusted content

  • limited permissions for AI tools and agents

  • human approval for high-risk actions

  • preprocessing and sanitization of external content

  • runtime monitoring and anomaly detection

  • and regular adversarial testing

This is not just about preventing embarrassing outputs. It is about preventing data leaks, fraudulent actions, unsafe approvals, and loss of control over automated workflows.

Conclusion

Indirect prompt injection is dangerous because it exploits one of the core strengths of AI systems: their ability to interpret natural language from almost any source. That same flexibility can become a weakness when hostile instructions are hidden inside content the model is supposed to read.

The right response is not panic, but discipline. Organizations should assume that any external content may contain manipulative instructions and build their AI systems accordingly. Security for AI agents is not just about making models smarter. It is about making the systems around them more controlled, more observable, and harder to misuse.

Source: This article is based primarily on Fooling AI Agents: Web-Based Indirect Prompt Injection Observed in the Wild by Palo Alto Networks Unit 42.

FAQ

1. What is the single most important defense against indirect prompt injection?
The most important defense is to clearly separate trusted system instructions from untrusted external content, so the model treats outside material as data to analyze rather than commands to follow.

2. Can indirect prompt injection be solved by a stronger system prompt alone?
No. Better prompts help, but they are not enough on their own. Effective protection requires layered controls such as limited permissions, content sanitization, monitoring, and human review for sensitive actions.

3. Why is least privilege important for AI agents?
Because the impact of an attack depends heavily on what the model is allowed to do. A compromised summarizer is far less dangerous than a compromised agent that can approve ads, access data, or trigger transactions.

4. What should consumers do if they use AI assistants that browse or recommend sites?
They should treat high-stakes recommendations carefully, especially when money, logins, subscriptions, or downloads are involved, and verify important actions manually before proceeding.