![]() |
LinkChecker |
To check a URL like http://www.example.org/
it is enough to
type linkchecker www.example.org
on the command line or
type www.example.org
in the GUI application. This will check the
complete domain of http://www.example.org
recursively. All links
pointing outside of the domain are also checked for validity.
Local files can also be checked. On Unix or OSX systems the syntax
is file:///path/to/my/file.html
. On Windows the syntax is
file://C|/path/to/my/file.html
. When directories are checked,
all included files will be checked.
On the GUI client the Edit
menu has shortcuts for bookmark
files. For example if Google Chrome is installed, there will be
a menu entry called Insert Google Chrome bookmark file
which
can be used to check all browser bookmarks.
The commandline client options are documented in the linkchecker(1) manual page.
In the GUI client, the following options are available:
Recursive depth
Check recursively all links up to given depth.
A negative depth (eg. -1
) will enable infinite recursion.
Verbose output
If set, log all checked URLs. Default is to log only errors and warnings.
Debug
Prints debugging output in a separate window which can be seen with
Help -> Show debug
.
Debug memory usage
Profiles memory usage and writes statistics and a dump file when checking stops. The dump file can be examined with external tools. This option should only be useful for developers.
Warning strings
Log a warning if any strings are found in the content of the checked URL. Strings are entered one per line.
Use this to check for pages that contain some form of error, for example
"This page has moved
" or "Oracle Application error
".
Ignoring URLs
URLs matching the given regular expressions will be ignored and not checked.
Useful if certain URL types should not be checked like emails (ie.
"^mailto:
").
Each user can edit a configuration with advanced options for checking or filtering. The linkcheckerrc(5) manual page documents all the options.
In the GUI client the configuration file can be edited directly from
the dialog Edit -> Options
and the clicking on Edit
.
All URLs have to pass a preliminary syntax test. After the syntax check passes, the URL is queued for connection checking. All connection check types are described below.
HTTP links (http:
, https:
)
After connecting to the given HTTP server the given path or query is requested. All redirections are followed, and if user/password is given it will be used as authorization when necessary. Permanently moved pages (status code 301) issue a warning. All final HTTP status codes other than 2xx are errors.
For HTTPS links, the SSL certificate is checked against the given hostname. If it does not match, a warnings is printed.
Local files (file:
)
A regular, readable file that can be opened is valid. A readable directory is also valid. All other files, for example unreadable, non-existing or device files are errors.
File contents are checked for recursion. If they are parseable files (for example HTML files), all links in that file will be checked.
Mail links (mailto:
)
A mailto: link resolves to a list of email addresses. If one email address fails the whole list will fail. For each email address the following things are checked:
FTP links (ftp:
)
For FTP links the following is checked:
anonymous
, the default password is anonymous@
.Telnet links (telnet:
)
A connect and if user/password are given, login to the given telnet server is tried.
NNTP links (news:
, snews:
, nntp
)
A connect is tried to connect to the given NNTP server. If a news group or article is specified, it will be requested from the server.
Unsupported links (javascript:
, etc.)
An unsupported link will print a warning, but no error. No further checking will be made.
The complete list of recognized, but unsupported links can be seen in the unknownurl.py source file. The most prominent of them are JavaScript links.
Before descending recursively into a URL, it has to fulfill several conditions. The conditions are checked in this order:
--recursion-level
command line option, the recursion
level GUI option, or through the configuration file.
The recursion level is unlimited by default.--ignore-url
command line option or through the
configuration file.Note that the local and FTP directory recursion reads all files in that
directory, not just a subset like index.htm*
.