Spidr

Spidr is a versatile Ruby web spidering library that can spider a site, multiple domains, certain links, or spider indefinitely. Spidr is designed to be fast and easy to use.

Features

  • Follows:
    • a tags.
    • iframe tags.
    • frame tags.
    • Cookie-protected links.
    • HTTP 300, 301, 302, 303, and 307 redirects.
    • Meta-refresh redirects.
    • HTTP Basic Auth-protected links.
  • Black-list or white-list URLs based upon:
    • URL scheme.
    • Host name.
    • Port number.
    • Full link.
    • URL extension.
  • Optional /robots.txt support.
  • Provides callbacks for:
    • Every visited Page.
    • Every visited URL.
    • Every visited URL that matches a specified pattern.
    • Every origin and destination URI of a link.
    • Every URL that failed to be visited.
  • Provides action methods to:
    • Pause spidering.
    • Skip processing of pages.
    • Skip processing of links.
  • Restore the spidering queue and history from a previous session.
  • Custom User-Agent strings.
  • Custom proxy settings.
  • HTTPS support.

Install

$ gem install spidr

Examples

Start spidering from a URL:

Spidr.start_at('http://tenderlovemaking.com/')

Spider as a host:

Spidr.host('solnic.eu')

Spider a site:

Spidr.site('http://www.rubyflow.com/')

Spider multiple hosts:

Spidr.start_at(
  'http://company.com/',
  hosts: [
    'company.com',
    /host[\d]+\.company\.com/
  ]
)

Do not spider certain links:

Spidr.site('http://company.com/', ignore_links: [%{^/blog/}])

Do not spider links on certain ports:

Spidr.site('http://company.com/', ignore_ports: [8000, 8010, 8080])

Do not spider links blacklisted in robots.txt:

Spidr.site(
  'http://company.com/',
  robots: true
)


Copyright (c) 2008-2016 Hal Brodigan

Source: https://github.com/postmodern/