Anyparserbot Technical Details

Master the technical intricacies of Anyparserbot, our enterprise web crawling system. This comprehensive guide covers the crawler’s core architecture, from intelligent directive processing and adaptive rate limiting to robust security protocols. Whether you’re integrating with our infrastructure or optimizing crawler behavior, you’ll find detailed specifications, implementation patterns, and operational insights for building production-grade web scraping solutions.

Quick Reference

ℹ️ Directives

  • Meta tags
  • HTTP headers
  • robots.txt rules
  • Link attributes

🚦 Rate Limiting

  • Automatic adjustment
  • Server responses
  • Politeness delays
  • Error handling

🛡️ Security

  • Data protection
  • Access control
  • Authentication
  • Limitations

🥇 Best Practices ✨

  • Implementation tips
  • Resource protection
  • Performance optimization
  • Error handling

Directive Support

Meta Tags

Anyparserbot recognizes both general and bot-specific meta directives:

<!-- Standard robots directive -->
<meta name="robots" content="noindex, nofollow">

Supported Directives

Directive | Effect | Example
noindex | Prevents content indexing | content="noindex"
nofollow | Prevents link following | content="nofollow"
none | Combines noindex, nofollow | content="none"
noarchive | Prevents archiving | content="noarchive"
nosnippet | Disables snippets | content="nosnippet"
unavailable_after | Time-based exclusion | content="unavailable_after: [RFC 850 date]"
max-snippet | Limits snippet length | content="max-snippet:50"
max-image-preview | Controls image previews | content="max-image-preview:large"
max-video-preview | Limits video previews | content="max-video-preview:0"
notranslate | Prevents translation | content="notranslate"
noimageindex | Prevents image indexing | content="noimageindex"
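
These directives can be read directly from a page's <head>. The sketch below uses Python's standard html.parser to collect both general (name="robots") and bot-specific directives; the name="anyparserbot" meta name is an assumption here, mirroring the bot-specific header form shown later.

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects general and bot-specific robots meta directives."""

    def __init__(self):
        super().__init__()
        self.general = []   # from <meta name="robots">
        self.bot = []       # from <meta name="anyparserbot"> (assumed name)

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        content = attrs.get("content") or ""
        directives = [d.strip().lower() for d in content.split(",") if d.strip()]
        if name == "robots":
            self.general.extend(directives)
        elif name == "anyparserbot":
            self.bot.extend(directives)

parser = RobotsMetaParser()
parser.feed('<meta name="robots" content="noindex, nofollow">')
print(parser.general)  # ['noindex', 'nofollow']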

HTTP Headers

Control crawler behavior via X-Robots-Tag headers:

# Single directive
X-Robots-Tag: noindex
# Time-based exclusion
X-Robots-Tag: unavailable_after: 25 Jun 2024 15:00:00 GMT
# Bot-specific rules
X-Robots-Tag: anyparserbot: noindex, nofollow
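
These headers can be inspected with any HTTP client. The minimal sketch below uses Python's urllib and separates bot-specific values (prefixed with anyparserbot:) from general ones; the URL is a placeholder, and a real crawler would add timeouts and error handling.

from urllib.request import urlopen

with urlopen("https://example.com/") as resp:
    header_values = resp.headers.get_all("X-Robots-Tag") or []

bot_rules, general_rules = [], []
for value in header_values:
    if value.lower().startswith("anyparserbot:"):
        directives = value.split(":", 1)[1]            # strip the bot prefix
        bot_rules += [d.strip() for d in directives.split(",")]
    else:
        general_rules += [d.strip() for d in value.split(",")]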

Processing Rules

  1. Priority Order

    • Bot-specific directives (anyparserbot:)
    • General directives (no bot specified)
    • Default behavior (if no directives)
  2. Combination Logic

    • Multiple directives are combined
    • Most restrictive rules take precedence
    • Bot-specific rules override general ones (see the sketch below)
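
The sketch below models this resolution order: bot-specific directives win when present, and combining directives keeps every restriction, which yields the most restrictive outcome. It is a simplified illustration, not Anyparserbot's actual implementation.

def resolve_directives(bot_rules, general_rules):
    # Bot-specific directives take priority; otherwise fall back to general ones.
    source = bot_rules if bot_rules else general_rules
    effective = set()
    for directive in source:
        directive = directive.strip().lower()
        if directive == "none":
            effective.update({"noindex", "nofollow"})  # expand the shorthand
        else:
            effective.add(directive)
    return effective

print(resolve_directives(["noindex"], ["nofollow"]))    # {'noindex'}
print(resolve_directives([], ["noindex", "nofollow"]))  # both noindex and nofollow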

robots.txt Implementation

File Processing

Anyparserbot follows these steps when processing robots.txt:

  1. Location Check

    • Looks for /robots.txt at domain root
    • Handles redirects appropriately
    • Times out after 30 seconds
  2. Rule Processing

    • Processes most specific rules first
    • Handles wildcards and patterns
    • Applies crawl-delay directives
  3. Access Control

    • Respects Allow/Disallow rules
    • Handles pattern matching
    • Implements politeness delays

Example Configuration

# Anyparserbot-specific rules
User-agent: AnyparserBot
Disallow: /private/
Allow: /public/
Crawl-delay: 2

# General rules (fallback)
User-agent: *
Disallow: /admin/
Allow: /
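
To check how rules like these resolve, Python's standard urllib.robotparser can evaluate the Anyparserbot-specific group above; the example.com URLs are placeholders.

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("""
User-agent: AnyparserBot
Disallow: /private/
Allow: /public/
Crawl-delay: 2
""".splitlines())

print(rp.can_fetch("AnyparserBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("AnyparserBot", "https://example.com/public/page"))   # True
print(rp.crawl_delay("AnyparserBot"))                                    # 2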

rel Attributes

Anyparserbot respects various link attributes:

<!-- Basic nofollow -->
<a href="..." rel="nofollow">Link text</a>
<!-- Multiple attributes -->
<a href="..." rel="nofollow sponsored">Link text</a>
<!-- Canonical reference -->
<link rel="canonical" href="https://example.com/canonical-page" />
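
In practice, honoring these attributes means dropping any link whose rel value contains nofollow before it is queued. A minimal sketch using Python's standard html.parser:

from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects hrefs of links that may be followed (no rel="nofollow")."""

    def __init__(self):
        super().__init__()
        self.followable = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if "nofollow" not in rel and attrs.get("href"):
            self.followable.append(attrs["href"])

extractor = LinkExtractor()
extractor.feed('<a href="/a" rel="nofollow">x</a><a href="/b">y</a>')
print(extractor.followable)  # ['/b']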

Rate Limiting & Politeness

Dynamic Rate Control

Initial Behavior

  • Starts with conservative rate
  • Default delay: 2 seconds
  • Respects crawl-delay
  • Monitors response times

Adjustment Logic

  • Reduces rate on 429s
  • Backs off on 503s
  • Adapts to server load
  • Implements exponential backoff (see the sketch below)
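
The sketch below models this behavior: the delay starts at the 2-second default (or the site's Crawl-delay, whichever is larger), doubles when the server asks the crawler to slow down, and drifts back toward its floor on healthy responses. The cap and recovery factor are illustrative assumptions, not Anyparserbot's actual tuning.

import time

DEFAULT_DELAY = 2.0   # conservative starting delay in seconds
MAX_DELAY = 60.0      # assumed upper bound for this sketch

def next_delay(current, status_code, floor=DEFAULT_DELAY):
    # 429 means "slow down": doubling the delay halves the request rate.
    if status_code == 429:
        return min(current * 2, MAX_DELAY)
    # Other error codes are dispatched per the response table below;
    # on healthy responses the delay drifts back toward its floor.
    return max(floor, current * 0.9)

floor = max(DEFAULT_DELAY, 2.0)   # honor Crawl-delay: 2 from robots.txt
delay = next_delay(floor, 429, floor)
time.sleep(delay)                 # wait before issuing the next request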

Response Handling

Response | Action | Retry?
429 | 50% rate reduction | Yes
503 | Stop crawling | No
50x | Exponential backoff | Yes
403/401 | Skip permanently | No
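
Expressed as code, the table maps response codes to actions roughly as follows; the action names are illustrative, not Anyparserbot's internals.

def handle_response(status_code):
    """Return (action, should_retry) for an HTTP status code."""
    if status_code == 429:
        return ("reduce_rate_50_percent", True)
    if status_code == 503:
        return ("stop_crawling", False)
    if 500 <= status_code < 600:
        return ("exponential_backoff", True)
    if status_code in (401, 403):
        return ("skip_permanently", False)
    return ("continue", True)

print(handle_response(429))  # ('reduce_rate_50_percent', True)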

Security & Limitations

Security Features

㊙️ Data Protection

  • No form processing
  • No cookie manipulation
  • No session tracking
  • Secure connections only

🚫 Access Control

  • Respects authentication
  • Honors IP blocking
  • No bypass attempts
  • Follows security headers

Technical Boundaries

Error Handling

Error Types & Actions

Error Code | Description | Action
ROBOTS_BLOCKED | Blocked by robots.txt | Skip
META_BLOCKED | Blocked by meta directive | Skip
RATE_LIMITED | Too many requests | Retry
ACCESS_DENIED | Authentication required | Skip
INVALID_URL | Malformed/unreachable URL | Skip
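
For integrators, these codes map naturally onto a small lookup table; the structure below is an illustrative sketch, not an Anyparserbot API.

ERROR_ACTIONS = {
    "ROBOTS_BLOCKED": "skip",   # blocked by robots.txt
    "META_BLOCKED": "skip",     # blocked by a meta directive
    "RATE_LIMITED": "retry",    # too many requests
    "ACCESS_DENIED": "skip",    # authentication required
    "INVALID_URL": "skip",      # malformed or unreachable URL
}

def action_for(error_code):
    # Unknown codes are skipped by default in this sketch.
    return ERROR_ACTIONS.get(error_code, "skip")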