Web Crawling

Master web content extraction with Anyparser's enterprise-grade crawler engine. Built on a distributed architecture with intelligent rate limiting and adaptive traversal algorithms, the crawler extracts content from entire sites while maintaining strict compliance with web standards and site-specific crawling policies.

Key Features

Smart Navigation

  • Respects robots.txt
  • Follows sitemap.xml
  • HTML content processing
  • Maintains crawl order

Content Extraction

  • Clean text extraction
  • Table recognition
  • Image processing
  • Metadata collection

Performance

  • Parallel processing
  • Rate limiting
  • Resource optimization
  • Cache management

Compliance

  • Ethical crawling
  • Polite delays
  • Error handling
  • Usage monitoring

Quick Start

Start crawling with a basic configuration:

import asyncio

from anyparser_core import Anyparser, AnyparserOption


async def main():
    # Initialize the crawler
    parser = Anyparser(AnyparserOption(
        model="crawler",
        format="markdown",
        max_depth=20,               # Crawl up to 20 levels deep
        max_executions=100,         # Process up to 100 pages
        strategy="LIFO",            # Crawling strategy
        traversal_scope="subtree",  # Crawling scope
    ))

    # Start crawling
    result = await parser.parse("https://example.com")
    print(result)

asyncio.run(main())
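
Before scaling up, it is worth verifying the setup with a much smaller crawl budget. The sketch below is a minimal variation of the Quick Start example that only lowers the limits; the start URL and the specific values are illustrative assumptions, not recommended defaults.

import asyncio

from anyparser_core import Anyparser, AnyparserOption

async def smoke_test():
    # Deliberately small limits: a shallow depth and a low page budget
    # make it easy to inspect the output before launching a full crawl.
    parser = Anyparser(AnyparserOption(
        model="crawler",
        format="markdown",
        max_depth=2,        # Follow links at most two levels from the start URL
        max_executions=10,  # Stop after processing 10 pages
        strategy="LIFO",
        traversal_scope="subtree",
    ))
    result = await parser.parse("https://example.com/docs/")
    print(result)

asyncio.run(smoke_test())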

Processing Results

Handle crawled content effectively:

async def process_results(result):
    for candidate in result:
        print("Start URL        :", candidate.start_url)
        print("Total characters :", candidate.total_characters)
        print("Total items      :", candidate.total_items)
        # print("Markdown         :", candidate.markdown)
        print("Robots directive :", candidate.robots_directive)
        print("\n")
        print("*" * 100)
        print("Begin Crawl")
        print("*" * 100)
        print("\n")

        for index, item in enumerate(candidate.items):
            # Print a separator between consecutive pages
            if index > 0:
                print("-" * 100)
                print("\n")
            print("URL              :", item.url)
            print("Title            :", item.title)
            print("Status message   :", item.status_message)
            print("Total characters :", item.total_characters)
            print("Politeness delay :", item.politeness_delay)
            print("Content:\n")
            print(item.markdown)

        print("*" * 100)
        print("End Crawl")
        print("*" * 100)
        print("\n")

Best Practices

  1. Start Small

    • Begin with low max_executions values
    • Test with a subset of the site
    • Monitor initial results
    • Adjust settings gradually
  2. Respect Websites

    • Honor robots.txt rules
    • Use appropriate delays
    • Stay within rate limits
    • Monitor server responses (see the monitoring sketch after this list)
  3. Optimize Performance

    • Use subtree crawling
    • Enable caching
    • Implement retries
    • Monitor resource usage
  4. Handle Errors

    • Implement timeouts
    • Retry failed requests (see the retry-with-timeout sketch after this list)
    • Log issues properly
    • Monitor crawl status
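
Much of the information needed for monitoring is already exposed on the crawl result, as shown in Processing Results. A minimal monitoring sketch, assuming result comes from parser.parse() and using only the fields shown above, could log the per-page status, politeness delay, and robots directive:

import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("crawl-monitor")

def monitor_crawl(result):
    # Summarize compliance- and health-related fields for each crawled page.
    for candidate in result:
        log.info("Robots directive for %s: %s",
                 candidate.start_url, candidate.robots_directive)
        for item in candidate.items:
            log.info("%s | status=%s | politeness_delay=%s | chars=%s",
                     item.url, item.status_message,
                     item.politeness_delay, item.total_characters)

Calling monitor_crawl(result) after each crawl produces a compact log of how every page was fetched, which makes it easier to spot failed pages or unexpected delays.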
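
The crawler handles politeness and rate limiting itself, but the calling code can still guard against network hiccups and overly long runs on its side. A sketch of a client-side timeout with exponential backoff follows; this is a generic asyncio pattern, not an Anyparser feature, and the timeout, attempt count, and caught exception types are assumptions.

import asyncio

from anyparser_core import Anyparser, AnyparserOption

async def crawl_with_retries(url: str, attempts: int = 3, timeout: float = 300.0):
    parser = Anyparser(AnyparserOption(
        model="crawler",
        format="markdown",
        max_depth=2,
        max_executions=10,
        strategy="LIFO",
        traversal_scope="subtree",
    ))
    for attempt in range(1, attempts + 1):
        try:
            # Abort the attempt if the whole crawl exceeds `timeout` seconds.
            return await asyncio.wait_for(parser.parse(url), timeout=timeout)
        except (asyncio.TimeoutError, OSError) as exc:
            if attempt == attempts:
                raise
            delay = 2 ** attempt  # Exponential backoff between attempts
            print(f"Attempt {attempt} failed ({exc!r}); retrying in {delay}s")
            await asyncio.sleep(delay)

asyncio.run(crawl_with_retries("https://example.com"))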