Anyparserbot Technical Details

Master the technical intricacies of Anyparserbot, our enterprise web crawling system. This comprehensive guide covers the crawler’s core architecture, from intelligent directive processing and adaptive rate limiting to robust security protocols. Whether you’re integrating with our infrastructure or optimizing crawler behavior, you’ll find detailed specifications, implementation patterns, and operational insights for building production-grade web scraping solutions.

Quick Reference

ℹ️ Directives

  • Meta tags
  • HTTP headers
  • robots.txt rules
  • Link attributes

🚦 Rate Limiting

  • Automatic adjustment
  • Server responses
  • Politeness delays
  • Error handling

🛡️ Security

  • Data protection
  • Access control
  • Authentication
  • Limitations

🥇 Best Practices ✨

  • Implementation tips
  • Resource protection
  • Performance optimization
  • Error handling

Directive Support

Meta Tags

Anyparserbot recognizes both general and bot-specific meta directives:

<!-- Standard robots directive -->
<meta name="robots" content="noindex, nofollow">

Supported Directives

Directive | Effect | Example
noindex | Prevents content indexing | content="noindex"
nofollow | Prevents link following | content="nofollow"
none | Combines noindex, nofollow | content="none"
noarchive | Prevents archiving | content="noarchive"
nosnippet | Disables snippets | content="nosnippet"
unavailable_after | Time-based exclusion | content="unavailable_after: [RFC 850 date]"
max-snippet | Limits snippet length | content="max-snippet:50"
max-image-preview | Controls image previews | content="max-image-preview:large"
max-video-preview | Limits video previews | content="max-video-preview:0"
notranslate | Prevents translation | content="notranslate"
noimageindex | Prevents image indexing | content="noimageindex"
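
These directives can be read directly from a page's <head>. The sketch below uses Python's standard html.parser to collect both general (name="robots") and bot-specific directives; the name="anyparserbot" meta name is an assumption here, mirroring the bot-specific header form shown later.

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects general and bot-specific robots meta directives."""

    def __init__(self):
        super().__init__()
        self.general = []   # from <meta name="robots">
        self.bot = []       # from <meta name="anyparserbot"> (assumed name)

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        content = attrs.get("content") or ""
        directives = [d.strip().lower() for d in content.split(",") if d.strip()]
        if name == "robots":
            self.general.extend(directives)
        elif name == "anyparserbot":
            self.bot.extend(directives)

parser = RobotsMetaParser()
parser.feed('<meta name="robots" content="noindex, nofollow">')
print(parser.general)  # ['noindex', 'nofollow']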

HTTP Headers

Control crawler behavior via X-Robots-Tag headers:

# Single directive
X-Robots-Tag: noindex
# Time-based exclusion
X-Robots-Tag: unavailable_after: 25 Jun 2024 15:00:00 GMT
# Bot-specific rules
X-Robots-Tag: anyparserbot: noindex, nofollow
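
These headers can be inspected with any HTTP client. The minimal sketch below uses Python's urllib and separates bot-specific values (prefixed with anyparserbot:) from general ones; the URL is a placeholder, and a real crawler would add timeouts and error handling.

from urllib.request import urlopen

with urlopen("https://example.com/") as resp:
    header_values = resp.headers.get_all("X-Robots-Tag") or []

bot_rules, general_rules = [], []
for value in header_values:
    if value.lower().startswith("anyparserbot:"):
        directives = value.split(":", 1)[1]            # strip the bot prefix
        bot_rules += [d.strip() for d in directives.split(",")]
    else:
        general_rules += [d.strip() for d in value.split(",")]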

Processing Rules

  1. Priority Order

    • Bot-specific directives (anyparserbot:)
    • General directives (no bot specified)
    • Default behavior (if no directives)
  2. Combination Logic

    • Multiple directives are combined
    • Most restrictive rules take precedence
    • Bot-specific rules override general ones (see the sketch below)
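
The sketch below models this resolution order: bot-specific directives win when present, and combining directives keeps every restriction, which yields the most restrictive outcome. It is a simplified illustration, not Anyparserbot's actual implementation.

def resolve_directives(bot_rules, general_rules):
    # Bot-specific directives take priority; otherwise fall back to general ones.
    source = bot_rules if bot_rules else general_rules
    effective = set()
    for directive in source:
        directive = directive.strip().lower()
        if directive == "none":
            effective.update({"noindex", "nofollow"})  # expand the shorthand
        else:
            effective.add(directive)
    return effective

print(resolve_directives(["noindex"], ["nofollow"]))    # {'noindex'}
print(resolve_directives([], ["noindex", "nofollow"]))  # both noindex and nofollow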

robots.txt Implementation

File Processing

Anyparserbot follows these steps when processing robots.txt:

  1. Location Check

    • Looks for /robots.txt at domain root
    • Handles redirects appropriately
    • Times out after 30 seconds
  2. Rule Processing

    • Processes most specific rules first
    • Handles wildcards and patterns
    • Applies crawl-delay directives
  3. Access Control

    • Respects Allow/Disallow rules
    • Handles pattern matching
    • Implements politeness delays

Example Configuration

# Anyparserbot-specific rules
User-agent: AnyparserBot
Disallow: /private/
Allow: /public/
Crawl-delay: 2

# General rules (fallback)
User-agent: *
Disallow: /admin/
Allow: /
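
To check how rules like these resolve, Python's standard urllib.robotparser can evaluate the Anyparserbot-specific group above; the example.com URLs are placeholders.

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""
User-agent: AnyparserBot
Disallow: /private/
Allow: /public/
Crawl-delay: 2
""".splitlines())

print(rp.can_fetch("AnyparserBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("AnyparserBot", "https://example.com/public/page"))   # True
print(rp.crawl_delay("AnyparserBot"))                                    # 2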

rel Attributes

Anyparserbot respects various link attributes:

<!-- Basic nofollow -->
<a href="..." rel="nofollow">Link text</a>
<!-- Multiple attributes -->
<a href="..." rel="nofollow sponsored">Link text</a>
<!-- Canonical reference -->
<link rel="canonical" href="https://example.com/canonical-page" />
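
In practice, honoring these attributes means dropping any link whose rel value contains nofollow before it is queued. A minimal sketch using Python's standard html.parser:

from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects hrefs of links that may be followed (no rel="nofollow")."""

    def __init__(self):
        super().__init__()
        self.followable = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if "nofollow" not in rel and attrs.get("href"):
            self.followable.append(attrs["href"])

extractor = LinkExtractor()
extractor.feed('<a href="/a" rel="nofollow">x</a><a href="/b">y</a>')
print(extractor.followable)  # ['/b']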

Rate Limiting & Politeness

Dynamic Rate Control

Initial Behavior

  • Starts with conservative rate
  • Default delay: 2 seconds
  • Respects crawl-delay
  • Monitors response times

Adjustment Logic

  • Reduces rate on 429s
  • Backs off on 503s
  • Adapts to server load
  • Implements exponential backoff (see the sketch below)
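
The sketch below models this behavior: the delay starts at the 2-second default (or the site's Crawl-delay, whichever is larger), doubles when the server asks the crawler to slow down, and drifts back toward its floor on healthy responses. The cap and recovery factor are illustrative assumptions, not Anyparserbot's actual tuning.

import time

DEFAULT_DELAY = 2.0   # conservative starting delay in seconds
MAX_DELAY = 60.0      # assumed upper bound for this sketch

def next_delay(current, status_code, floor=DEFAULT_DELAY):
    # 429 means "slow down": doubling the delay halves the request rate.
    if status_code == 429:
        return min(current * 2, MAX_DELAY)
    # Other error codes are dispatched per the response table below;
    # on healthy responses the delay drifts back toward its floor.
    return max(floor, current * 0.9)

floor = max(DEFAULT_DELAY, 2.0)   # honor Crawl-delay: 2 from robots.txt
delay = next_delay(floor, 429, floor)
time.sleep(delay)                 # wait before issuing the next request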

Response Handling

Response | Action | Retry?
429 | 50% rate reduction | Yes
503 | Stop crawling | No
50x | Exponential backoff | Yes
403/401 | Skip permanently | No
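
Expressed as code, the table maps response codes to actions roughly as follows; the action names are illustrative, not Anyparserbot's internals.

def handle_response(status_code):
    """Return (action, should_retry) for an HTTP status code."""
    if status_code == 429:
        return ("reduce_rate_50_percent", True)
    if status_code == 503:
        return ("stop_crawling", False)
    if 500 <= status_code < 600:
        return ("exponential_backoff", True)
    if status_code in (401, 403):
        return ("skip_permanently", False)
    return ("continue", True)

print(handle_response(429))  # ('reduce_rate_50_percent', True)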

Security & Limitations

Security Features

㊙️ Data Protection

  • No form processing
  • No cookie manipulation
  • No session tracking
  • Secure connections only

🚫 Access Control

  • Respects authentication
  • Honors IP blocking
  • No bypass attempts
  • Follows security headers

Technical Boundaries

Error Handling

Error Types & Actions

Error Code | Description | Action
ROBOTS_BLOCKED | Blocked by robots.txt | Skip
META_BLOCKED | Blocked by meta directive | Skip
RATE_LIMITED | Too many requests | Retry
ACCESS_DENIED | Authentication required | Skip
INVALID_URL | Malformed/unreachable URL | Skip
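
For integrators, these codes map naturally onto a small lookup table; the structure below is an illustrative sketch, not an Anyparserbot API.

ERROR_ACTIONS = {
    "ROBOTS_BLOCKED": "skip",   # blocked by robots.txt
    "META_BLOCKED": "skip",     # blocked by a meta directive
    "RATE_LIMITED": "retry",    # too many requests
    "ACCESS_DENIED": "skip",    # authentication required
    "INVALID_URL": "skip",      # malformed or unreachable URL
}

def action_for(error_code):
    # Unknown codes are skipped by default in this sketch.
    return ERROR_ACTIONS.get(error_code, "skip")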