Document Formats

IPI supports 7 document formats for payload generation. Each format targets a different document ingestion pathway that AI agents commonly encounter. Use `qai ipi formats` to list all formats with their implementation status and technique count.

PDF

File extension: `.pdf`

PDFs are the most common document format processed by AI agents in enterprise settings: they are uploaded to chatbots, ingested by RAG pipelines, and parsed by document processing services. PDF has the broadest technique coverage in IPI, with 10 techniques across two phases that exploit the rich internal structure of the PDF specification (text rendering, form fields, annotations, JavaScript actions, metadata, and file attachments).

Available Techniques

Technique ID	Description
`white_ink`	White text on white background
`off_canvas`	Text at negative coordinates (off page)
`metadata`	Hidden in PDF metadata fields
`tiny_text`	0.5pt font — below visual threshold
`white_rect`	Text covered by white rectangle
`form_field`	Hidden AcroForm text field
`annotation`	PDF annotation/comment layer
`javascript`	Document-level JavaScript action
`embedded_file`	Hidden file attachment stream
`incremental`	PDF incremental update section

Parser Behavior Notes

PDF text extraction varies significantly across parsers. Some agents use `pdfminer`, others use `PyMuPDF`, `pdfplumber`, or cloud-based OCR. Metadata and annotation techniques tend to work across parsers, while form field and JavaScript techniques depend on the parser’s feature support.

Image

File extensions: `.png`, `.jpg`

Images are processed by vision-language models (VLMs) and OCR pipelines. As multimodal AI becomes standard, image-based injection is an increasingly relevant attack surface.

Available Techniques

Technique ID	Description
`visible_text`	Human-readable text overlay
`subtle_text`	Low contrast, small font, edge-placed text
`exif_metadata`	Payload in EXIF metadata fields

Parser Behavior Notes

VLMs (GPT-4V, Claude vision, Gemini) process the visual content directly, making `visible_text` and `subtle_text` effective. EXIF metadata is only relevant when the processing pipeline extracts metadata before or alongside visual analysis. OCR-based pipelines (AnythingLLM, some Open WebUI configurations) may miss subtle text depending on contrast and font size.

Markdown

File extension: `.md`

Markdown files are widely used in documentation, README files, and knowledge bases. AI agents frequently ingest Markdown through RAG pipelines, file uploads, and web scraping.

Available Techniques

Technique ID	Description
`html_comment`	Payload in HTML comment tags (`<!-- -->`)
`link_reference`	Payload in unused link reference definition
`zero_width`	Payload encoded with zero-width Unicode chars
`hidden_block`	Payload in HTML div with `display:none`

Parser Behavior Notes

Most AI systems process raw Markdown text, so HTML comments and link references are visible to the model even though they don’t render visually. Zero-width character encoding works when the pipeline passes the raw text through without stripping invisible characters; standard Unicode normalization forms such as NFC and NFKC leave zero-width characters in place, so normalization alone does not remove them. Hidden blocks depend on whether the agent processes raw HTML or pre-rendered text.
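The gap between raw and rendered Markdown can be sketched in a few lines. This is a minimal illustration, not IPI's implementation: the sample document, the marker strings, the helper names, and the one-character-per-bit zero-width scheme are all assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical sample: the markers stand in for payloads and are harmless.
RAW_MD = (
    "# Release notes\n"
    "\n"
    "Version 2 ships today.\n"
    "<!-- COMMENT-MARKER -->\n"
    "\n"
    '[unused]: https://example.com "LINKREF-MARKER"\n'
)

# Assumed encoding for this sketch: one zero-width character per bit.
ZERO, ONE = "\u200b", "\u200c"  # ZERO WIDTH SPACE, ZERO WIDTH NON-JOINER


def zw_encode(text: str) -> str:
    """Encode text as a run of invisible characters (8 per UTF-8 byte)."""
    bits = "".join(f"{byte:08b}" for byte in text.encode("utf-8"))
    return "".join(ONE if bit == "1" else ZERO for bit in bits)


def zw_decode(carrier: str) -> str:
    """Recover text from whatever zero-width characters remain in carrier."""
    bits = "".join("1" if ch == ONE else "0" for ch in carrier if ch in (ZERO, ONE))
    usable = len(bits) - len(bits) % 8
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("utf-8", errors="replace")


def strip_html_comments(markdown: str) -> str:
    """Approximate one thing a renderer hides: HTML comments."""
    return re.sub(r"<!--.*?-->", "", markdown, flags=re.DOTALL)


def strip_format_chars(text: str) -> str:
    """Drop Unicode format characters (category Cf), including zero-width ones."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# Visible sentence followed by an invisible, encoded marker.
carrier = "Version 2 ships today." + zw_encode("ZW-MARKER")
```

In this sketch, `zw_decode(carrier)` recovers the marker, while `zw_decode(strip_format_chars(carrier))` returns an empty string. The comment marker is present in `RAW_MD` but absent from `strip_html_comments(RAW_MD)`, which is the difference between what a raw-text pipeline and a rendering pipeline hand to the model.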

HTML

File extension: `.html`

HTML documents are encountered through web scraping, email rendering, and document conversion pipelines.

Available Techniques

Technique ID	Description
`script_comment`	Payload in JavaScript comment inside script tag
`css_offscreen`	Payload in element positioned off-screen with CSS
`data_attribute`	Payload in HTML `data-*` attribute
`meta_tag`	Payload in HTML `<meta>` tag content

Parser Behavior Notes

Agents that process raw HTML source (common in web scraping pipelines) are exposed to all four techniques. Agents that receive pre-rendered text (after browser rendering or HTML-to-text conversion) typically miss `css_offscreen` and `script_comment` content. Meta tag and data attribute techniques work when the parser extracts full DOM content.
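The following sketch contrasts a text-only extraction with one that also collects attribute values, using only Python's standard `html.parser`. The class names, sample markup, and marker strings are hypothetical and chosen for illustration. Note that this simple extractor does not evaluate CSS, so the off-screen element's text still appears in its output; a pipeline that renders the page or filters by style may behave differently.

```python
from html.parser import HTMLParser

# Hypothetical sample page; each marker stands in for one technique.
SAMPLE_HTML = """<html><head><meta name="description" content="META-MARKER">
<script>// SCRIPT-MARKER
</script></head>
<body><p data-note="DATA-MARKER">Visible paragraph.</p>
<div style="position:absolute; left:-9999px">OFFSCREEN-MARKER</div></body></html>"""


class TextOnly(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


class FullDom(TextOnly):
    """Also collect data-* attribute values and <meta content="...">."""

    def handle_starttag(self, tag, attrs):
        super().handle_starttag(tag, attrs)
        for name, value in attrs:
            is_meta_content = tag == "meta" and name == "content"
            if value and (name.startswith("data-") or is_meta_content):
                self.chunks.append(value)


def extract(parser_cls, html):
    """Run one of the parsers above and return the collected strings."""
    parser = parser_cls()
    parser.feed(html)
    parser.close()
    return parser.chunks


text_only = extract(TextOnly, SAMPLE_HTML)
full_dom = extract(FullDom, SAMPLE_HTML)
```

Here `text_only` contains neither the script comment nor the attribute values, while `full_dom` additionally picks up `META-MARKER` and `DATA-MARKER`. An agent reading `SAMPLE_HTML` as raw source sees all four markers.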

DOCX

File extension: `.docx`

Word documents are common in business workflows and frequently uploaded to AI assistants for summarization, analysis, and editing.

Available Techniques

Technique ID	Description
`docx_hidden_text`	Text with hidden font attribute (invisible in normal view)
`docx_tiny_text`	0.5pt font — below visual threshold
`docx_white_text`	White text on white background
`docx_comment`	Payload in Word comment/annotation
`docx_metadata`	Payload in document core properties
`docx_header_footer`	Payload in document header or footer

Parser Behavior Notes

DOCX parsing typically uses `python-docx` or similar libraries. The hidden text attribute is run-level formatting that most parsers ignore, so hidden text is returned alongside visible body text by default. Comment and metadata techniques depend on whether the parser extracts these elements alongside body text. Header/footer content is often included in full-text extraction.
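To see why hidden text tends to survive extraction, it helps to look at the underlying WordprocessingML. The sketch below assumes the hidden attribute is expressed as a `w:vanish` run property (the standard element for hidden text) and works on a hand-written XML fragment rather than a real `.docx` archive; the function name and marker string are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written stand-in for word/document.xml: one visible run, one hidden run.
DOCUMENT_XML = f"""<w:document xmlns:w="{W}"><w:body><w:p>
<w:r><w:t>Quarterly summary.</w:t></w:r>
<w:r><w:rPr><w:vanish/></w:rPr><w:t>HIDDEN-MARKER</w:t></w:r>
</w:p></w:body></w:document>"""


def run_texts(xml: str, skip_hidden: bool) -> list:
    """Return the text of each run, optionally skipping runs marked hidden.

    Simplification: any <w:vanish> element counts as hidden; a real
    filter would also honor an explicit w:val="false" on it.
    """
    root = ET.fromstring(xml)
    texts = []
    for run in root.iterfind(".//w:r", NS):
        hidden = run.find("w:rPr/w:vanish", NS) is not None
        if skip_hidden and hidden:
            continue
        texts.extend(t.text or "" for t in run.iterfind("w:t", NS))
    return texts


naive = run_texts(DOCUMENT_XML, skip_hidden=False)
filtered = run_texts(DOCUMENT_XML, skip_hidden=True)
```

The naive pass, which mirrors an extractor that concatenates every run, returns both strings; only the pass that explicitly checks run properties drops the hidden one.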

ICS

File extension: `.ics`

iCalendar files are used for calendar invites and scheduling. AI email assistants and calendar management tools parse ICS attachments to extract event details.

Available Techniques

Technique ID	Description
`ics_description`	Payload in event `DESCRIPTION` property
`ics_location`	Payload in event `LOCATION` property
`ics_valarm`	Payload in `VALARM` reminder `DESCRIPTION`
`ics_x_property`	Payload in custom `X-` extension property

Parser Behavior Notes

Calendar parsers typically extract `DESCRIPTION` and `LOCATION` as primary fields, making those techniques broadly effective. Support for `VALARM` and `X-` properties varies: some parsers ignore custom extensions, while others include all properties in their output.
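The two parser styles described above can be sketched with a small hand-rolled reader. This is an illustrative simplification, not a full iCalendar parser: it splits each line on the first colon, ignores property parameters, and the sample event, the `X-EXAMPLE-NOTE` property name, and the marker values are all hypothetical.

```python
# Hypothetical calendar invite; each marker stands in for one technique.
SAMPLE_ICS = "\r\n".join([
    "BEGIN:VCALENDAR",
    "VERSION:2.0",
    "PRODID:-//Example//Sketch//EN",
    "BEGIN:VEVENT",
    "UID:1@example.com",
    "DTSTAMP:20240101T000000Z",
    "DTSTART:20240102T100000Z",
    "SUMMARY:Planning sync",
    "DESCRIPTION:DESCRIPTION-MARKER",
    "LOCATION:LOCATION-MARKER",
    "X-EXAMPLE-NOTE:XPROP-MARKER",
    "BEGIN:VALARM",
    "ACTION:DISPLAY",
    "TRIGGER:-PT15M",
    "DESCRIPTION:VALARM-MARKER",
    "END:VALARM",
    "END:VEVENT",
    "END:VCALENDAR",
]) + "\r\n"

PRIMARY = {"SUMMARY", "DESCRIPTION", "LOCATION"}


def unfold(text: str) -> list:
    """Join folded lines: a line starting with space or tab continues the previous one."""
    lines = []
    for raw in text.splitlines():
        if raw[:1] in (" ", "\t") and lines:
            lines[-1] += raw[1:]
        elif raw:
            lines.append(raw)
    return lines


def parse_properties(text: str) -> list:
    """Return (component_path, NAME, value) tuples for every property."""
    props, stack = [], []
    for line in unfold(text):
        name, _, value = line.partition(":")
        name = name.split(";")[0].upper()  # drop parameters such as ;TZID=...
        if name == "BEGIN":
            stack.append(value.upper())
        elif name == "END":
            if stack:
                stack.pop()
        else:
            props.append(("/".join(stack), name, value))
    return props


props = parse_properties(SAMPLE_ICS)

# Style 1: only well-known event-level fields.
primary_fields = {
    name: value
    for path, name, value in props
    if path == "VCALENDAR/VEVENT" and name in PRIMARY
}

# Style 2: every property value, including nested alarms and X- extensions.
all_values = [value for _, _, value in props]
```

The first style surfaces only the event's `DESCRIPTION` and `LOCATION` markers; the second also exposes the `VALARM` description and the custom `X-` property.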

EML

File extension: `.eml`

Email files are processed by AI email assistants, support ticket systems, and email analysis tools.

Available Techniques

Technique ID	Description
`eml_x_header`	Payload in custom `X-` email header
`eml_html_hidden`	Payload in hidden HTML div (`display:none`)
`eml_attachment`	Payload in text file attachment

Parser Behavior Notes

Email processing pipelines typically extract headers, body text, and attachments. Custom `X-` headers are included by most parsers. HTML hidden div techniques depend on whether the agent processes raw HTML or rendered text. Attachment-based techniques work when the pipeline extracts and processes attached files alongside the email body.
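As a rough illustration of these extraction paths, the sketch below builds a small message with Python's standard `email` package and then reads it back the way a pipeline might. The addresses, the `X-Note` header name, the file name, and the marker strings are hypothetical; how a real agent combines these views is pipeline-specific.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a hypothetical message: custom header, HTML alternative, text attachment.
msg = EmailMessage()
msg["From"] = "sender@example.com"
msg["To"] = "recipient@example.com"
msg["Subject"] = "Invoice"
msg["X-Note"] = "HEADER-MARKER"
msg.set_content("Plain body.")
msg.add_alternative(
    '<p>Plain body.</p><div style="display:none">HTML-MARKER</div>',
    subtype="html",
)
msg.add_attachment("ATTACHMENT-MARKER\n", filename="notes.txt")

raw_eml = msg.as_bytes()


def extract_views(raw: bytes) -> dict:
    """Return the headers, both body variants, and attachment contents."""
    parsed = message_from_bytes(raw, policy=policy.default)
    plain = parsed.get_body(preferencelist=("plain",))
    html = parsed.get_body(preferencelist=("html",))
    return {
        "headers": {name: str(value) for name, value in parsed.items()},
        "plain": plain.get_content() if plain else "",
        "html": html.get_content() if html else "",
        "attachments": [part.get_content() for part in parsed.iter_attachments()],
    }


views = extract_views(raw_eml)
```

A pipeline that forwards only `views["plain"]` sees none of the markers. One that forwards headers, the raw HTML body, or attachment contents sees the corresponding marker.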

Format Summary

Format	Extension(s)	Techniques	Primary Attack Surface
PDF	`.pdf`	10	Document upload, RAG ingestion
Image	`.png`, `.jpg`	3	VLM processing, OCR pipelines
Markdown	`.md`	4	Knowledge bases, documentation
HTML	`.html`	4	Web scraping, email rendering
DOCX	`.docx`	6	Business document processing
ICS	`.ics`	4	Calendar/scheduling assistants
EML	`.eml`	3	Email processing, support tools
