
Parse Documents

Anyparser's parsing engine handles tasks ranging from simple text extraction to full document understanding. The SDK processes PDFs, Word documents, rich text, and more, and returns structured, analysis-ready data in a few lines of code. Table detection, image extraction, and format preservation are available for document automation workflows.

Quick Start

Let’s start with a simple example of parsing a PDF document:

import asyncio
from anyparser_core import Anyparser

async def main():
    # Initialize the parser
    parser = Anyparser()
    # Parse a document
    result = await parser.parse("docs/sample.pdf")
    # Access the parsed content
    print(f"File: {result.original_filename}")
    print(f"Characters: {result.total_characters}")
    print(f"Content:\n{result.markdown}")

asyncio.run(main())

Supported File Types

📄 Documents

  • PDF (with OCR support)
  • Microsoft Word (DOCX, DOC)
  • Rich Text (RTF)
  • Plain Text (TXT)

🖼️ Images

  • PNG
  • JPEG/JPG
  • TIFF
  • WebP

See our OCR Guide for image processing details.

🌐 Web Content

🚀 Coming Soon

  • PowerPoint (PPTX, PPT)
  • Excel (XLSX, XLS)
  • Audio transcription
  • Video transcription
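
The Documents and Images lists above can be turned into a simple pre-flight check before any files are submitted. The following is a minimal sketch, not part of the SDK: `SUPPORTED_EXTENSIONS` and `split_by_support` are hypothetical names, and the extension set is copied from the lists above, so it may not cover every variant the service accepts.

```python
from pathlib import Path

# Assumption: extensions taken from the "Documents" and "Images" lists above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".rtf", ".txt",
    ".png", ".jpeg", ".jpg", ".tiff", ".webp",
}


def split_by_support(paths):
    """Partition paths into (supported, unsupported) by file extension.

    The comparison is case-insensitive and input order is preserved.
    """
    supported, unsupported = [], []
    for path in paths:
        suffix = Path(path).suffix.lower()
        if suffix in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported
```

A check like this only looks at the file name, so it catches obvious mismatches such as a PPTX file but says nothing about whether the file content is valid.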

Configuration Options

Customize the parsing behavior to match your needs:

from anyparser_core import Anyparser, AnyparserOption

options = AnyparserOption(
    format="json",  # Output format
    image=True,     # Extract images
    table=True,     # Extract tables
)
parser = Anyparser(options)

See the API Reference for more details.

Batch Processing

Process multiple documents efficiently in a single request:

import asyncio
from anyparser_core import Anyparser

files = ["docs/sample1.pdf", "docs/sample2.docx", "docs/sample3.png"]

async def main():
    parser = Anyparser()
    # Pass a list of paths to parse them in a single request
    result = await parser.parse(files)
    for doc in result:
        print(f"Processing: {doc.original_filename}")
        print(f"Type: {doc.file_type}")
        print(f"Size: {doc.file_size} bytes")
        print(f"Characters: {doc.total_characters}")
        print("---")

asyncio.run(main())
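
When the file list is long, it may be useful to split it into several smaller requests. The sketch below shows one way to do that. `chunked` and `parse_in_batches` are hypothetical helpers, not SDK functions, and the default batch size of 10 is an assumption rather than a documented limit. The `parse` argument stands in for any coroutine function that accepts a list of paths and returns an iterable of results, as `parser.parse` does in the example above.

```python
import asyncio


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def parse_in_batches(parse, files, batch_size=10):
    """Call `parse` once per batch and collect the results in input order.

    Assumption: `parse` takes a list of paths and returns an iterable of
    results. The batch size of 10 is illustrative only.
    """
    results = []
    for batch in chunked(files, batch_size):
        # One request per batch; batches are sent sequentially.
        results.extend(await parse(batch))
    return results
```

With a configured parser, a call might look like `await parse_in_batches(parser.parse, files, batch_size=5)`.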

Best Practices

  1. Choose the Right Format

    • Use json for structured data processing
    • Use markdown for RAG pipelines
    • Use html for quick viewing
  2. Optimize Performance

    • Process documents in batches when possible
    • Consider implementing application-level caching
  3. Handle Errors

    • Implement proper error handling
    • Use retries for transient failures
    • Monitor processing status
    • Log errors appropriately
  4. Monitor Usage

    • Track API consumption in Anyparser Studio
    • Set up usage alerts
    • Monitor processing times
    • Optimize based on analytics
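
Two of the practices above, retrying transient failures and application-level caching, can be sketched as small wrappers around a parse call. This is a minimal illustration under stated assumptions, not SDK functionality: `parse_with_retry`, `file_digest`, and `cached_parse` are hypothetical names, the choice of `ConnectionError` and `TimeoutError` as "transient" errors is a guess about what a network client might raise, and the cache is a plain dictionary keyed by file content only.

```python
import asyncio
import hashlib
import random


async def parse_with_retry(parse, source, attempts=3, base_delay=0.5,
                           retry_on=(ConnectionError, TimeoutError)):
    """Await `parse(source)`, retrying with exponential backoff.

    Assumption: the exceptions in `retry_on` are the transient ones.
    Any other exception propagates immediately.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await parse(source)
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error.
            # Delay doubles on each attempt; jitter spreads out retries.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 64 KiB blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


async def cached_parse(parse, path, cache):
    """Return a cached result for `path`, parsing only on a cache miss.

    `cache` is any dict-like object. The key is the file's content hash,
    so renamed copies of the same file share one entry.
    """
    key = file_digest(path)
    if key not in cache:
        cache[key] = await parse(path)
    return cache[key]
```

The two wrappers compose: passing `lambda p: parse_with_retry(parser.parse, p)` as the `parse` argument of `cached_parse` retries only on cache misses. If the same file is parsed with different options, the options would need to be part of the cache key as well.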