Master web content extraction with Anyparser’s enterprise-grade crawler engine. Built on a distributed architecture with intelligent rate limiting and adaptive traversal algorithms, our crawler handles content extraction from websites while maintaining strict compliance with web standards and site-specific crawling policies. Its capabilities span four areas: smart navigation, content extraction, performance, and compliance.

Smart Navigation
- Respects robots.txt
- Follows sitemap.xml
- Processes HTML content
- Maintains crawl order
Start crawling with a basic configuration:
```python
import asyncio

from anyparser_core import Anyparser, AnyparserOption

async def main():
    # Initialize crawler
    parser = Anyparser(AnyparserOption(
        model="crawler",
        format="markdown",
        max_depth=20,               # Crawl up to 20 levels deep
        max_executions=100,         # Process up to 100 pages
        strategy="LIFO",            # Crawling strategy
        traversal_scope="subtree",  # Crawling scope
    ))

    # Start crawling
    result = await parser.parse("https://example.com")
    print(result)

asyncio.run(main())
```
```typescript
import { Anyparser, AnyparserOption } from '@anyparser/core';

async function main() {
  // Initialize crawler
  const parser = new Anyparser({
    model: 'crawler',
    format: 'markdown',
    maxDepth: 20,       // Crawl up to 20 levels deep
    maxExecutions: 10,  // Process up to 10 pages
    strategy: 'LIFO',
    traversalScope: 'subtree',
  });

  // Start crawling
  const result = await parser.parse('https://example.com');
  console.log(result);
}

main().catch(console.error);
```
Handle crawled content effectively:
```python
async def process_results(result):
    for candidate in result:
        print("Start URL        :", candidate.start_url)
        print("Total characters :", candidate.total_characters)
        print("Total items      :", candidate.total_items)
        # print("Markdown         :", candidate.markdown)
        print("Robots directive :", candidate.robots_directive)
        print("\n")
        print("*" * 100)
        print("Begin Crawl")
        print("*" * 100)
        print("\n")

        # enumerate() avoids the O(n) lookup (and duplicate-item bug)
        # of candidate.items.index(item)
        for index, item in enumerate(candidate.items):
            if index > 0:
                print("-" * 100)
                print("\n")

            print("URL              :", item.url)
            print("Title            :", item.title)
            print("Status message   :", item.status_message)
            print("Total characters :", item.total_characters)
            print("Politeness delay :", item.politeness_delay)
            print("Content:\n")
            print(item.markdown)

        print("*" * 100)
        print("End Crawl")
        print("*" * 100)
        print("\n")
```
```typescript
import type { AnyparserUrl } from '@anyparser/core';

async function processResults(result) {
  for (const candidate of result) {
    console.log('\n')
    console.log('Start URL        :', candidate.startUrl)
    console.log('Total characters :', candidate.totalCharacters)
    console.log('Total items      :', candidate.totalItems)
    console.log('Robots directive :', candidate.robotsDirective)
    console.log('\n')
    console.log('*'.repeat(100))
    console.log('Begin Crawl')
    console.log('*'.repeat(100))
    console.log('\n')

    const resources = candidate.items ?? [] as AnyparserUrl[]

    for (let index = 0; index < resources.length; index++) {
      const item = resources[index]

      if (index > 0) {
        console.log('-'.repeat(100))
        console.log('\n')
      }

      console.log('URL              :', item.url)
      console.log('Title            :', item.title)
      console.log('Status message   :', item.statusMessage)
      console.log('Total characters :', item.totalCharacters)
      console.log('Politeness delay :', item.politenessDelay)
      console.log('Content:\n')
      console.log(item.markdown)
    }
  }
}
```
- Start Small: Begin with low `max_executions` values and raise the limits once the output looks right; see the sketch after this list.
- Respect Websites: Honor robots.txt directives and the politeness delays reported with each page.
- Optimize Performance: Keep crawls focused with `subtree` crawling so the crawler stays within the relevant part of a site.
- Handle Errors: Check each item's status message and handle failed pages gracefully.
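As a concrete starting point, here is a conservative configuration sketch. The parameter names come from the quickstart above; the specific limits are illustrative assumptions, not official recommendations:

```python
from anyparser_core import Anyparser, AnyparserOption

# Conservative first run: shallow depth, few pages, one subtree.
# Raise max_depth / max_executions gradually once results look right.
parser = Anyparser(AnyparserOption(
    model="crawler",
    format="markdown",
    max_depth=2,                # Follow links only two levels deep
    max_executions=10,          # Stop after 10 pages
    strategy="LIFO",
    traversal_scope="subtree",  # Stay within the start URL's subtree
))
```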