Master the technical intricacies of Anyparserbot, our enterprise web crawling system. This comprehensive guide covers the crawler’s core architecture, from intelligent directive processing and adaptive rate limiting to robust security protocols. Whether you’re integrating with our infrastructure or optimizing crawler behavior, you’ll find detailed specifications, implementation patterns, and operational insights for building production-grade web scraping solutions.
- ℹ️ Directives
- 🚦 Rate Limiting
- 🛡️ Security
- 🥇 Best Practices ✨
Anyparserbot recognizes both general and bot-specific meta directives:
```html
<!-- Standard robots directive -->
<meta name="robots" content="noindex, nofollow">

<!-- Anyparserbot-specific directive -->
<meta name="anyparserbot" content="noindex, nofollow">
```
| Directive | Effect | Example |
| --- | --- | --- |
| `noindex` | Prevents content indexing | `content="noindex"` |
| `nofollow` | Prevents link following | `content="nofollow"` |
| `none` | Combines `noindex`, `nofollow` | `content="none"` |
| `noarchive` | Prevents archiving | `content="noarchive"` |
| `nosnippet` | Disables snippets | `content="nosnippet"` |
| `unavailable_after` | Time-based exclusion | `content="unavailable_after: [RFC 850 date]"` |
| `max-snippet` | Limits snippet length | `content="max-snippet:50"` |
| `max-image-preview` | Controls image previews | `content="max-image-preview:large"` |
| `max-video-preview` | Limits video previews | `content="max-video-preview:0"` |
| `notranslate` | Prevents translation | `content="notranslate"` |
| `noimageindex` | Prevents image indexing | `content="noimageindex"` |
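To make the tag handling concrete, here is a minimal Python sketch of collecting these directives from a page. `MetaDirectiveParser` and `parse_meta_directives` are hypothetical names for illustration, not part of any Anyparser API:

```python
# Hypothetical sketch (not Anyparser's actual implementation) of reading
# robots/anyparserbot meta directives; class and function names are invented.
from html.parser import HTMLParser

class MetaDirectiveParser(HTMLParser):
    """Collects directive sets from robots and bot-specific meta tags."""
    def __init__(self, bot_name="anyparserbot"):
        super().__init__()
        self.bot_name = bot_name
        self.directives = {}  # meta name -> set of directive tokens

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in ("robots", self.bot_name):
            tokens = {t.strip().lower() for t in (attr.get("content") or "").split(",")}
            self.directives.setdefault(name, set()).update(tokens)

def parse_meta_directives(doc, bot_name="anyparserbot"):
    parser = MetaDirectiveParser(bot_name)
    parser.feed(doc)
    # Bot-specific directives override the generic robots tag when present.
    return parser.directives.get(bot_name, parser.directives.get("robots", set()))

doc = ('<meta name="robots" content="noindex, nofollow">'
       '<meta name="anyparserbot" content="noarchive">')
print(parse_meta_directives(doc))  # {'noarchive'}
```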
Control crawler behavior via `X-Robots-Tag` headers:
```
# Single directive
X-Robots-Tag: noindex

# Time-based exclusion
X-Robots-Tag: unavailable_after: 25 Jun 2024 15:00:00 GMT

# Bot-specific rules
X-Robots-Tag: anyparserbot: noindex, nofollow
```
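A hedged sketch of interpreting these header values follows; the helper name and the precedence rule (bot-scoped values win over generic ones) are assumptions based on the examples above:

```python
# Hedged sketch of interpreting X-Robots-Tag header values; the helper name
# and the precedence rule are assumptions, not an official API.
def parse_x_robots_tag(header_values, bot_name="anyparserbot"):
    """Split header values into generic and bot-scoped directive sets."""
    generic, specific = set(), set()
    for value in header_values:
        head, sep, rest = value.partition(":")
        if sep and head.strip().lower() == bot_name:
            # "anyparserbot: noindex, nofollow" scopes directives to this bot.
            specific.update(d.strip().lower() for d in rest.split(","))
        else:
            generic.update(d.strip().lower() for d in value.split(","))
    # Bot-specific directives win when any are present.
    return specific or generic

headers = ["noarchive", "anyparserbot: noindex, nofollow"]
print(sorted(parse_x_robots_tag(headers)))  # ['nofollow', 'noindex']
```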
Priority Order: bot-specific `anyparserbot:` directives take precedence over generic `robots` directives when both are present.

Combination Logic: when multiple directives apply to the same page, they are combined and the most restrictive result is honored.
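A plausible combination strategy is sketched below, assuming directives from all applicable sources are unioned and `none` expands to `noindex, nofollow` as in the directive table; `combine_directives` is an illustrative name:

```python
# Illustrative sketch: union directives from several sources (meta tags,
# headers), expanding the "none" shorthand; not Anyparser's actual logic.
def combine_directives(*sources):
    combined = set()
    for directives in sources:
        combined |= {d.lower() for d in directives}
    if "none" in combined:
        # Per the directive table, "none" combines noindex and nofollow.
        combined.discard("none")
        combined |= {"noindex", "nofollow"}
    return combined

print(sorted(combine_directives({"noindex"}, {"nofollow"})))  # ['nofollow', 'noindex']
```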
Anyparserbot follows these steps when processing robots.txt:
1. Location Check: fetch `/robots.txt` at the domain root
2. Rule Processing: parse `AnyparserBot`-specific rules first, then general (`User-agent: *`) rules as a fallback
3. Access Control: apply the matching `Allow`/`Disallow` rules to each requested URL
```
# Anyparserbot-specific rules
User-agent: AnyparserBot
Disallow: /private/
Allow: /public/
Crawl-delay: 2

# General rules (fallback)
User-agent: *
Disallow: /admin/
Allow: /
```
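These example rules can be checked with Python's standard `urllib.robotparser`, which implements the same `Allow`/`Disallow` and `Crawl-delay` matching; the user-agent strings here are for illustration:

```python
# Exercising the robots.txt rules above with Python's standard library.
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: AnyparserBot
Disallow: /private/
Allow: /public/
Crawl-delay: 2

User-agent: *
Disallow: /admin/
Allow: /
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("AnyparserBot", "https://example.com/public/page"))   # True
print(rp.can_fetch("AnyparserBot", "https://example.com/private/page"))  # False
print(rp.crawl_delay("AnyparserBot"))                                    # 2
```

Note that the bot-specific group matches first, so the fallback `User-agent: *` rules only govern other crawlers.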
Anyparserbot respects various link attributes:
```html
<!-- Basic nofollow -->
<a href="..." rel="nofollow">Link text</a>

<!-- Multiple attributes -->
<a href="..." rel="nofollow sponsored">Link text</a>

<!-- Canonical reference -->
<link rel="canonical" href="https://example.com/canonical-page" />
```
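A small sketch of honoring these attributes when extracting links, assuming only `rel="nofollow"` blocks following; `LinkCollector` and `followable` are invented names:

```python
# Illustrative link extraction that respects rel="nofollow"; names invented.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects (href, rel-token-set) pairs for anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        rel = set((attr.get("rel") or "").lower().split())
        self.links.append((attr.get("href"), rel))

def followable(links):
    # Follow only links whose rel tokens do not include "nofollow".
    return [href for href, rel in links if "nofollow" not in rel]

collector = LinkCollector()
collector.feed('<a href="/a">A</a><a href="/b" rel="nofollow sponsored">B</a>')
print(followable(collector.links))  # ['/a']
```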
Initial Behavior: Anyparserbot starts from a conservative default request rate and honors any `Crawl-delay` directive found in robots.txt.

Adjustment Logic: the rate then adapts to server responses:
| Response | Action | Retry? |
| --- | --- | --- |
| 429 | 50% rate reduction | Yes |
| 503 | Stop crawling | No |
| Other 5xx | Exponential backoff | Yes |
| 403/401 | Skip permanently | No |
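The adjustment table can be sketched as a function over response codes; the specific constants (doubling the delay for a 50% rate reduction, base-2 backoff) are assumptions, not documented values:

```python
# Hypothetical sketch of the rate-adjustment table; constants are illustrative.
def adjust_delay(status, delay, attempt):
    """Return (new_delay_seconds, should_retry) for one HTTP response."""
    if status == 429:            # 50% rate reduction == doubled delay
        return delay * 2, True
    if status == 503:            # stop crawling this host
        return delay, False
    if 500 <= status < 600:      # other 5xx: exponential backoff, then retry
        return delay * (2 ** attempt), True
    if status in (401, 403):     # skip permanently
        return delay, False
    return delay, True           # success: keep the current pace

print(adjust_delay(429, 1.0, 0))  # (2.0, True)
```

Note that 503 must be checked before the generic 5xx branch, since it is the one 5xx code that halts crawling instead of backing off.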
㊙️ Data Protection
🚫 Access Control
| Error Code | Description | Action |
| --- | --- | --- |
| `ROBOTS_BLOCKED` | Blocked by robots.txt | Skip |
| `META_BLOCKED` | Blocked by meta directive | Skip |
| `RATE_LIMITED` | Too many requests | Retry |
| `ACCESS_DENIED` | Authentication required | Skip |
| `INVALID_URL` | Malformed or unreachable URL | Skip |
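Client code might dispatch on these codes as sketched below; the dictionary and function mirror the table but are illustrative, not an official client API:

```python
# Illustrative mapping of crawl error codes to actions, per the table above.
ERROR_ACTIONS = {
    "ROBOTS_BLOCKED": "skip",   # blocked by robots.txt
    "META_BLOCKED": "skip",     # blocked by a meta directive
    "RATE_LIMITED": "retry",    # too many requests
    "ACCESS_DENIED": "skip",    # authentication required
    "INVALID_URL": "skip",      # malformed or unreachable URL
}

def handle_crawl_error(code):
    """Return the action for an error code; unknown codes are skipped."""
    return ERROR_ACTIONS.get(code, "skip")

print(handle_crawl_error("RATE_LIMITED"))  # retry
```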