LinkChecker

Documentation

Basic usage

To check a URL like http://www.example.org/, it is enough to type linkchecker www.example.org on the command line or to type www.example.org into the GUI application. This will check the complete domain of http://www.example.org recursively. All links pointing outside of the domain are also checked for validity.
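
For example, the check described above can be started from a shell as follows; the --verbose flag, which logs every checked URL instead of only problems, is documented in linkchecker(1):

  # Check the complete domain of http://www.example.org recursively
  linkchecker www.example.org

  # Same check, but log every checked URL, not just errors and warnings
  linkchecker --verbose www.example.org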

Local files can also be checked. On Unix or OS X systems, the syntax is file:///path/to/my/file.html. On Windows, the syntax is file://C|/path/to/my/file.html. When directories are checked, all files they contain will be checked.
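
For example, with hypothetical paths:

  # Check a single local file on Unix or OS X
  linkchecker file:///path/to/my/file.html

  # Check a single local file on Windows (note the drive letter syntax)
  linkchecker "file://C|/path/to/my/file.html"

  # Check a local directory; every file it contains will be checked
  linkchecker file:///path/to/my/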

In the GUI client, the Edit menu has shortcuts for bookmark files. For example, if Google Chrome is installed, there will be a menu entry called Insert Google Chrome bookmark file, which can be used to check all browser bookmarks.

Options

The command line client options are documented in the linkchecker(1) manual page.

In the GUI client, the corresponding options are available in the Edit -> Options dialog.

Configuration file

Each user can edit a configuration file with advanced options for checking and filtering. The linkcheckerrc(5) manual page documents all the options.
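
As a rough sketch, a configuration file could look like the example below. The file location, section names and option names shown here are assumptions; the authoritative reference is the linkcheckerrc(5) manual page.

  # ~/.linkchecker/linkcheckerrc (assumed location; may differ per platform)
  [checking]
  # assumed option: limit the recursion depth to five levels
  recursionlevel=5

  [filtering]
  # assumed option: skip URLs matching these regular expressions
  ignore=
    ^http://www\.example\.org/private/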

In the GUI client, the configuration file can be edited directly from the Edit -> Options dialog by clicking on Edit.

Performed checks

All URLs have to pass a preliminary syntax test. After the syntax check passes, the URL is queued for connection checking. All connection check types are described below.

Recursion

Before LinkChecker descends recursively into a URL, the URL has to fulfill several conditions. The conditions are checked in this order:

  1. The URL must be valid.
  2. The URL must be parseable. This currently includes HTML files, bookmark files (Opera, Chrome or Safari), directories and, on Windows systems, MS Word files if Word and the Pywin32 module are installed. If a file type cannot be determined (for example, it does not have a common HTML file extension and the content does not look like HTML), it is assumed to be non-parseable.
  3. The URL content must be retrievable. This is usually the case, with exceptions such as mailto: links or unknown URL types.
  4. The maximum recursion level must not be exceeded. It is configured with the --recursion-level command line option, the recursion level GUI option, or through the configuration file. The recursion level is unlimited by default.
  5. The URL must not match the ignored URL list. This is controlled with the --ignore-url command line option or through the configuration file. Both this option and the recursion level are shown in the command line example after this list.
  6. The Robots Exclusion Protocol must allow links in the URL to be followed recursively. This is checked by evaluating the server's robots.txt file and searching for a "nofollow" directive in the HTML header data, as sketched right after this list.
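
As an illustration of the last condition, a server or a page can prevent recursion in either of the following ways; the path and file contents are made-up examples:

  # robots.txt at the server root: keeps all robots out of /private/
  User-agent: *
  Disallow: /private/

  <!-- robots meta tag in the HTML header: links on this page are not followed -->
  <meta name="robots" content="nofollow">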

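The recursion level and the ignore list from conditions 4 and 5 can be combined on the command line; the regular expression below is a made-up example, and both options are documented in linkchecker(1):

  # Descend at most two levels deep and skip everything under /archive/
  linkchecker --recursion-level=2 --ignore-url="^http://www\.example\.org/archive/" http://www.example.org/
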
Note that the local and FTP directory recursion reads all files in that directory, not just a subset like index.htm*.