Supported File Formats

Anyparser handles a wide variety of file types, which makes it well suited to businesses and developers who need to process documents, media, and other content formats. It extracts structured data from scanned images, PDFs, Word documents, videos, and more.

Supported Document Formats

1. PDF Files

Text Extraction: Extracts plain text along with tables and images.
OCR Support: Automatically uses OCR for scanned documents or images within PDFs.
Structured Output: Converts document structure into Markdown and JSON.

2. Word Documents (DOCX)

Text and Metadata: Extracts text, headings, paragraphs, and metadata such as title and author.
Tables and Lists: Converts tables and bullet lists to Markdown format or JSON structure.
Preserved Formatting: Retains document formatting for easy integration.

3. Excel Files (XLSX)

Table Parsing: Parses tables and cell data into structured Markdown tables or JSON.
Complex Layouts: Handles multi-sheet and complex Excel workbooks.
Data Integrity: Maintains data accuracy and structure during conversion.
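
To illustrate what a "structured Markdown table" looks like for cell data, here is a minimal Python sketch. It is not Anyparser's implementation; the function names `rows_to_markdown` and `_cell` are hypothetical, and the sketch assumes the first row of a sheet is the header.

```python
def _cell(value):
    # Empty cells become empty strings; pipes are escaped so they do not
    # split a cell in the rendered Markdown table.
    return "" if value is None else str(value).replace("|", "\\|")


def rows_to_markdown(rows):
    """Render a list of rows (first row = header) as a Markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows so every line has the same number of cells.
    normalized = [
        [_cell(v) for v in row] + [""] * (width - len(row)) for row in rows
    ]
    header, body = normalized[0], normalized[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


if __name__ == "__main__":
    sheet = [["Region", "Q1"], ["North", 120], ["South", None]]
    print(rows_to_markdown(sheet))
```

A multi-sheet workbook could be handled by calling the function once per sheet and placing each table under its sheet name.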

4. PowerPoint Files (PPTX)

Slide Parsing: Extracts text and images from slides.
Markdown Slides: Converts each slide into a Markdown-compatible format, preserving headers and content.

5. HTML Documents

Web Content Extraction: Extracts text, links, images, and metadata from HTML documents or websites.
Formatting Retention: Maintains headings, paragraphs, and tables for easy conversion to Markdown.
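
As a rough illustration of how headings and links in HTML map to Markdown, the sketch below uses Python's standard `html.parser` module. The class name `HeadingLinkExtractor` is hypothetical, and this is only an assumption about the general technique, not a description of how Anyparser processes HTML.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingLinkExtractor(HTMLParser):
    """Collect headings as Markdown lines and gather link targets."""

    def __init__(self):
        super().__init__()
        self.lines = []     # Markdown heading lines, in document order
        self.links = []     # href values of <a> tags
        self._level = None  # heading level currently open, if any
        self._buffer = []   # text collected inside the open heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._buffer = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Close the heading only on its own end tag, so inline tags such
        # as <em> inside the heading do not cut the text short.
        if self._level is not None and tag == f"h{self._level}":
            text = "".join(self._buffer).strip()
            self.lines.append("#" * self._level + " " + text)
            self._level = None


if __name__ == "__main__":
    parser = HeadingLinkExtractor()
    parser.feed('<h1>Title</h1><p>See <a href="https://example.com">site</a></p>')
    print(parser.lines, parser.links)
```

Paragraphs and tables would need additional handlers; they are omitted to keep the sketch short.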

6. Text Files (TXT)

Plain Text Parsing: Extracts all text from plain text files.
Simple Formatting: Converts text into structured Markdown for easier reading and analysis.

7. Images (OCR)

Text Recognition: Uses Optical Character Recognition (OCR) to extract text from images (JPEG, PNG, TIFF).
Scanned Documents: Converts scanned paper documents into structured text or Markdown.

8. Audio and Video Files

Audio Transcription: Transcribes speech in MP3, WAV, and other audio formats into text.
Video Parsing: Extracts text from video content (MP4, AVI), including subtitles and transcriptions. This feature is coming soon.
Timestamps: Captures timestamps for video/audio to help retain context during data extraction.

9. Ebooks (EPUB, MOBI)

Content Parsing: Extracts text, images, and metadata from popular ebook formats.
Metadata Extraction: Captures information like title, author, and publisher.

10. Emails (EML, MSG)

Email Parsing: Extracts content from emails including body text, subject, sender, and recipient.
Attachment Handling: Extracts content from email attachments such as PDFs and images.
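
The fields listed above can be pictured with a short Python sketch built on the standard library's `email` package. It covers EML input only (MSG is a different, binary format), the function name `summarize_email` is hypothetical, and the sketch is an illustration of the fields involved rather than Anyparser's own code.

```python
from email import message_from_bytes, policy


def summarize_email(raw_bytes):
    """Return subject, sender, recipient, body text, and attachment names."""
    # policy.default yields an EmailMessage, which provides get_body()
    # and iter_attachments().
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain", "html"))
    return {
        "subject": str(msg["subject"] or ""),
        "from": str(msg["from"] or ""),
        "to": str(msg["to"] or ""),
        "body": body_part.get_content() if body_part is not None else "",
        # Attachment payloads could be passed on to a PDF or image parser;
        # only the file names are listed here.
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }
```

For a file on disk, the raw bytes can be read with `open(path, "rb").read()` and passed to the function.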

Supported Output Formats

Anyparser can output the parsed data in two formats:

Markdown: Well suited to integrating extracted content into documents or knowledge management systems. It also works well for chunking content and embedding it in vector databases for AI applications.
JSON: Provides a structured format for easier processing in downstream applications or for database storage.
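
As an example of the chunking use case mentioned for Markdown output, the following Python sketch splits a Markdown string into one chunk per heading and serializes the result as JSON. The function name `chunk_markdown` and the chunk fields `heading` and `text` are assumptions made for illustration, not part of Anyparser's output schema. The sketch also does not account for `#` lines inside fenced code blocks.

```python
import json
import re

# An ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(markdown):
    """Split Markdown into one chunk per heading-delimited section."""
    chunks, title, lines = [], None, []

    def flush():
        # Skip the empty preamble that precedes the first heading.
        if title is not None or any(line.strip() for line in lines):
            chunks.append({"heading": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    doc = "# Intro\nHello\n\n## Details\nMore text\n"
    print(json.dumps(chunk_markdown(doc), indent=2))
```

Each chunk could then be passed to an embedding model and stored with its heading as metadata.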

Special Use Cases

Mixed Content Files: Files with both text and embedded media (such as PDFs with images) are fully supported.
Complex Layouts: Excel files with complex layouts, PowerPoint presentations with multimedia, and websites with dynamic content can be parsed without losing structure.