Spidr

Spidr is a versatile Ruby web spidering library that can spider a site, multiple domains, certain links, or spider indefinitely. Spidr is designed to be fast and easy to use.

Features

  • Follows:
    • a tags.
    • iframe tags.
    • frame tags.
    • Cookie-protected links.
    • HTTP 300, 301, 302, 303, and 307 redirects.
    • Meta-refresh redirects.
    • HTTP Basic Auth-protected links.
  • Black-list or white-list URLs based upon:
    • URL scheme.
    • Host name.
    • Port number.
    • Full link.
    • URL extension.
  • Optional /robots.txt support.
  • Provides callbacks for:
    • Every visited Page.
    • Every visited URL.
    • Every visited URL that matches a specified pattern.
    • Every origin and destination URI of a link.
    • Every URL that failed to be visited.
  • Provides action methods to:
    • Pause spidering.
    • Skip processing of pages.
    • Skip processing of links.
  • Restore the spidering queue and history from a previous session.
  • Custom User-Agent strings.
  • Custom proxy settings.
  • HTTPS support.

Install

$ gem install spidr

Examples

Start spidering from a URL:

Spidr.start_at('http://tenderlovemaking.com/')

Spider as a host:

Spidr.host('solnic.eu')

Spider a site:

Spidr.site('http://www.rubyflow.com/')

Spider multiple hosts:

Spidr.start_at(
  'http://company.com/',
  hosts: [
    'company.com',
    /host[\d]+\.company\.com/
  ]
)

Do not spider certain links:

Spidr.site('http://company.com/', ignore_links: [%{^/blog/}])

Do not spider links on certain ports:

Spidr.site('http://company.com/', ignore_ports: [8000, 8010, 8080])

Do not spider links blacklisted in robots.txt:

Spidr.site(
  'http://company.com/',
  robots: true
)


Copyright (c) 2008-2016 Hal Brodigan

Source: https://github.com/postmodern/