LlamaIndex Integration
Anyparser integrates with LlamaIndex to provide powerful document parsing for your RAG (Retrieval-Augmented Generation) applications. This guide covers how to use Anyparser with both the Python and JavaScript versions of LlamaIndex.
Installation
```bash
pip install anyparser-llamaindex
```

```bash
npm install @anyparser/llamaindex
# or
yarn add @anyparser/llamaindex
```
Basic Usage
Here’s how to use Anyparser with LlamaIndex:
```python
from anyparser_llamaindex import AnyparserReader
from llama_index import Document, VectorStoreIndex

# Initialize the reader
reader = AnyparserReader(api_key="your-api-key")

# Load documents
documents = reader.load_data("document.pdf")

# Create index from documents
index = VectorStoreIndex.from_documents(documents)

# Query your documents
query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")
```
```javascript
import { AnyparserReader } from "@anyparser/llamaindex";
import { Document, VectorStoreIndex } from "llamaindex";

// Initialize the reader
const reader = new AnyparserReader({ apiKey: "your-api-key" });

// Load documents
const documents = await reader.loadData("document.pdf");

// Create index from documents
const index = await VectorStoreIndex.fromDocuments(documents);

// Query your documents
const queryEngine = index.asQueryEngine();
const response = await queryEngine.query("What is this document about?");
```
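Once the index is built, you will usually want to persist it so documents are not re-parsed on every run. A minimal Python sketch, assuming the standard `llama_index` storage APIs match your installed version:

```python
from llama_index import StorageContext, load_index_from_storage

# Persist the index to disk after the first run
index.storage_context.persist(persist_dir="./storage")

# On later runs, reload it instead of re-parsing the source documents
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
```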
Advanced Configuration
Configure the reader with various options:
```python
reader = AnyparserReader(
    api_key="your-api-key",
    format="markdown",   # Output format
    model="text",        # Model type
    image=True,          # Extract images
    table=True,          # Extract tables
    encoding="utf-8",    # Specify encoding
    chunk_size=1000,     # Size of text chunks
    chunk_overlap=200,   # Overlap between chunks
)
```
```javascript
const reader = new AnyparserReader({
  apiKey: "your-api-key",
  format: "markdown",   // Output format
  model: "text",        // Model type
  image: true,          // Extract images
  table: true,          // Extract tables
  encoding: "utf-8",    // Specify encoding
  chunkSize: 1000,      // Size of text chunks
  chunkOverlap: 200,    // Overlap between chunks
});
```
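The `chunk_size` and `chunk_overlap` settings control how parsed text is split before indexing. A quick way to sanity-check the configuration is to inspect what the reader returns; this sketch assumes the reader yields one LlamaIndex `Document` per chunk, which may differ in your Anyparser version:

```python
documents = reader.load_data("document.pdf")

# Each chunk should be roughly chunk_size characters, with chunk_overlap
# characters shared between neighbors (assumption: one Document per chunk)
for i, doc in enumerate(documents[:3]):
    print(f"Chunk {i}: {len(doc.text)} characters")
```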
Metadata Handling
Anyparser automatically extracts and preserves metadata for LlamaIndex:
```python
# Load documents with metadata
documents = reader.load_data(
    "document.pdf",
    include_metadata=True,  # Include document metadata
)

# Access metadata
for doc in documents:
    print(f"File: {doc.metadata['filename']}")
    print(f"Pages: {doc.metadata['total_pages']}")
    print(f"Format: {doc.metadata['format']}")
```
```javascript
// Load documents with metadata
const documents = await reader.loadData(
  "document.pdf",
  { includeMetadata: true }  // Include document metadata
);

// Access metadata
documents.forEach(doc => {
  console.log(`File: ${doc.metadata.filename}`);
  console.log(`Pages: ${doc.metadata.totalPages}`);
  console.log(`Format: ${doc.metadata.format}`);
});
```
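Preserved metadata is also useful at query time. As a sketch (the `MetadataFilters` import path varies across LlamaIndex versions), you can restrict retrieval to chunks from a single source file using the `filename` field shown above:

```python
from llama_index.vector_stores import MetadataFilters, ExactMatchFilter

# Only retrieve chunks whose metadata matches a specific source file
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="filename", value="document.pdf")]
)
query_engine = index.as_query_engine(filters=filters)
response = query_engine.query("Summarize this file only.")
```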
Batch Processing
Process multiple documents efficiently:
```python
# Load multiple documents
documents = reader.load_data([
    "document1.pdf",
    "document2.docx",
    "document3.txt",
])

# Create index from batch
index = VectorStoreIndex.from_documents(documents)
```
```javascript
// Load multiple documents
const documents = await reader.loadData([
  "document1.pdf",
  "document2.docx",
  "document3.txt",
]);

// Create index from batch
const index = await VectorStoreIndex.fromDocuments(documents);
```
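For larger collections you can build the file list programmatically instead of hard-coding paths. A minimal sketch using the standard library; the `docs/` directory and the extension list are placeholders:

```python
from pathlib import Path

# Gather every supported file in a folder and parse them in one batch call
paths = [str(p) for p in Path("docs/").iterdir() if p.suffix in {".pdf", ".docx", ".txt"}]
documents = reader.load_data(paths)
index = VectorStoreIndex.from_documents(documents)
```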
Error Handling
Implement proper error handling:
```python
try:
    documents = reader.load_data("document.pdf")
except Exception as e:
    print(f"Error loading document: {str(e)}")
    # Handle error appropriately
```
```javascript
try {
  const documents = await reader.loadData("document.pdf");
} catch (error) {
  console.error(`Error loading document: ${error.message}`);
  // Handle error appropriately
}
```
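For transient failures such as network timeouts, a retry loop with exponential backoff is often more useful than a bare `try`/`except`. A sketch in Python; the broad `Exception` catch is a placeholder for whatever exception types your Anyparser version actually raises:

```python
import time

def load_with_retries(reader, path, retries=3, base_delay=1.0):
    """Retry reader.load_data with exponential backoff between attempts."""
    for attempt in range(retries):
        try:
            return reader.load_data(path)
        except Exception:
            if attempt == retries - 1:
                raise  # Give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))

documents = load_with_retries(reader, "document.pdf")
```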