LlamaIndex Integration
Anyparser integrates with LlamaIndex to provide powerful document parsing for your RAG (Retrieval-Augmented Generation) applications. This guide covers how to use Anyparser with both the Python and JavaScript versions of LlamaIndex.
Installation
```bash
pip install anyparser-llamaindex
```

```bash
npm install @anyparser/llamaindex
# or
yarn add @anyparser/llamaindex
```
Basic Usage
Here’s how to use Anyparser with LlamaIndex:
```python
from anyparser_llamaindex import AnyparserReader
from llama_index import Document, VectorStoreIndex

# Initialize the reader
reader = AnyparserReader(api_key="your-api-key")

# Load documents
documents = reader.load_data("document.pdf")

# Create index from documents
index = VectorStoreIndex.from_documents(documents)

# Query your documents
query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")
```
```javascript
import { AnyparserReader } from "@anyparser/llamaindex";
import { Document, VectorStoreIndex } from "llamaindex";

// Initialize the reader
const reader = new AnyparserReader({ apiKey: "your-api-key" });

// Load documents
const documents = await reader.loadData("document.pdf");

// Create index from documents
const index = await VectorStoreIndex.fromDocuments(documents);

// Query your documents
const queryEngine = index.asQueryEngine();
const response = await queryEngine.query("What is this document about?");
```
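Once the index is built, you will usually want to persist it so documents are not re-parsed on every run. A minimal Python sketch, assuming the standard `llama_index` storage APIs match your installed version:

```python
from llama_index import StorageContext, load_index_from_storage

# Persist the index to disk after the first run
index.storage_context.persist(persist_dir="./storage")

# On later runs, reload it instead of re-parsing the source documents
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
```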
Advanced Configuration
Configure the reader with various options:
```python
reader = AnyparserReader(
    api_key="your-api-key",
    format="markdown",   # Output format
    model="text",        # Model type
    image=True,          # Extract images
    table=True,          # Extract tables
    encoding="utf-8",    # Specify encoding
    chunk_size=1000,     # Size of text chunks
    chunk_overlap=200,   # Overlap between chunks
)
```
```javascript
const reader = new AnyparserReader({
  apiKey: "your-api-key",
  format: "markdown",   // Output format
  model: "text",        // Model type
  image: true,          // Extract images
  table: true,          // Extract tables
  encoding: "utf-8",    // Specify encoding
  chunkSize: 1000,      // Size of text chunks
  chunkOverlap: 200,    // Overlap between chunks
});
```
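The `chunk_size` and `chunk_overlap` settings control how parsed text is split before indexing. A quick way to sanity-check the configuration is to inspect what the reader returns; this sketch assumes the reader yields one LlamaIndex `Document` per chunk, which may differ in your Anyparser version:

```python
documents = reader.load_data("document.pdf")

# Each chunk should be roughly chunk_size characters, with chunk_overlap
# characters shared between neighbors (assumption: one Document per chunk)
for i, doc in enumerate(documents[:3]):
    print(f"Chunk {i}: {len(doc.text)} characters")
```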
Metadata Handling
Anyparser automatically extracts and preserves metadata for LlamaIndex:
```python
# Load documents with metadata
documents = reader.load_data(
    "document.pdf",
    include_metadata=True,  # Include document metadata
)

# Access metadata
for doc in documents:
    print(f"File: {doc.metadata['filename']}")
    print(f"Pages: {doc.metadata['total_pages']}")
    print(f"Format: {doc.metadata['format']}")
```
```javascript
// Load documents with metadata
const documents = await reader.loadData(
  "document.pdf",
  { includeMetadata: true }  // Include document metadata
);

// Access metadata
documents.forEach(doc => {
  console.log(`File: ${doc.metadata.filename}`);
  console.log(`Pages: ${doc.metadata.totalPages}`);
  console.log(`Format: ${doc.metadata.format}`);
});
```
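Preserved metadata is also useful at query time. As a sketch (the `MetadataFilters` import path varies across LlamaIndex versions), you can restrict retrieval to chunks from a single source file using the `filename` field shown above:

```python
from llama_index.vector_stores import MetadataFilters, ExactMatchFilter

# Only retrieve chunks whose metadata matches a specific source file
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="filename", value="document.pdf")]
)
query_engine = index.as_query_engine(filters=filters)
response = query_engine.query("Summarize this file only.")
```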
Batch Processing
Process multiple documents efficiently:
```python
# Load multiple documents
documents = reader.load_data([
    "document1.pdf",
    "document2.docx",
    "document3.txt",
])

# Create index from batch
index = VectorStoreIndex.from_documents(documents)
```
```javascript
// Load multiple documents
const documents = await reader.loadData([
  "document1.pdf",
  "document2.docx",
  "document3.txt",
]);

// Create index from batch
const index = await VectorStoreIndex.fromDocuments(documents);
```
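For larger collections you can build the file list programmatically instead of hard-coding paths. A minimal sketch using the standard library; the `docs/` directory and the extension list are placeholders:

```python
from pathlib import Path

# Gather every supported file in a folder and parse them in one batch call
paths = [str(p) for p in Path("docs/").iterdir() if p.suffix in {".pdf", ".docx", ".txt"}]
documents = reader.load_data(paths)
index = VectorStoreIndex.from_documents(documents)
```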
Error Handling
Implement proper error handling:
```python
try:
    documents = reader.load_data("document.pdf")
except Exception as e:
    print(f"Error loading document: {str(e)}")
    # Handle error appropriately
```
```javascript
try {
  const documents = await reader.loadData("document.pdf");
} catch (error) {
  console.error(`Error loading document: ${error.message}`);
  // Handle error appropriately
}
```
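For transient failures such as network timeouts, a retry loop with exponential backoff is often more useful than a bare `try`/`except`. A sketch in Python; the broad `Exception` catch is a placeholder for whatever exception types your Anyparser version actually raises:

```python
import time

def load_with_retries(reader, path, retries=3, base_delay=1.0):
    """Retry reader.load_data with exponential backoff between attempts."""
    for attempt in range(retries):
        try:
            return reader.load_data(path)
        except Exception:
            if attempt == retries - 1:
                raise  # Give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))

documents = load_with_retries(reader, "document.pdf")
```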