LINKCHECKER(1) | LinkChecker command line usage | LINKCHECKER(1)
NAME
linkchecker - command line client to check HTML documents and websites for broken links

SYNOPSIS
linkchecker [options] [file-or-url]...

DESCRIPTION
LinkChecker features:

- recursive and multithreaded checking,
- output in colored or normal text, HTML, SQL, CSV, XML or a sitemap graph in different formats,
- support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet and local file links,
- restriction of link checking with URL filters,
- proxy support,
- username/password authorization for HTTP, FTP and Telnet,
- support for the robots.txt exclusion protocol,
- support for Cookies,
- support for HTML5,
- HTML and CSS syntax check,
- antivirus check,
- a command line, GUI and web interface.
EXAMPLES
The most common use checks the given domain recursively, plus any URL pointing outside of the domain:

linkchecker http://www.example.net/
linkchecker --ignore-url=^mailto: mysite.example.org
linkchecker ../bla.html
linkchecker c:\temp\test.html
linkchecker www.example.com
linkchecker -r0 ftp.example.org
linkchecker -odot -v www.example.com | dot -Tps > sitemap.ps
OPTIONS
General options
- -fFILENAME, --config=FILENAME
- Use FILENAME as the configuration file. By default LinkChecker uses ~/.linkchecker/linkcheckerrc.
- -h, --help
- Help me! Print usage information for this program.
- --stdin
- Read a whitespace-separated list of URLs to check from stdin (see the example at the end of this subsection).
- -tNUMBER, --threads=NUMBER
- Generate no more than the given number of threads. The default is 10 threads. To disable threading, specify a non-positive number.
- -V, --version
- Print version and exit.
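The --stdin option is handy for checking a prepared list of URLs. A minimal sketch, assuming a plain text file urls.txt (a hypothetical name) containing one URL per line:

cat urls.txt | linkchecker --stdin

Every URL read from the file is then checked as if it had been given on the command line.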
Output options
- --check-css
- Check syntax of CSS URLs with cssutils. If it's not installed, check with the W3C online validator.
- --check-html
- Check syntax of HTML URLs with HTML tidy. If it's not installed, check with the W3C online validator.
- --complete
- Log all URLs, including duplicates. Default is to log duplicate URLs only once.
- -DSTRING, --debug=STRING
- Print debugging output for the given logger. Available loggers are cmdline, checking, cache, gui, dns and all. Specifying all is an alias for specifying all available loggers. The option can be given multiple times to debug with more than one logger. For accurate results, threading will be disabled during debug runs.
- -FTYPE[/ENCODING][/FILENAME], --file-output=TYPE[/ENCODING][/FILENAME]
- Output to a file linkchecker-out.TYPE, $HOME/.linkchecker/blacklist for blacklist output, or FILENAME if specified. The ENCODING specifies the output encoding; the default is that of your locale. Valid encodings are listed at http://docs.python.org/library/codecs.html#standard-encodings. See the example at the end of this subsection.
- --no-status
- Do not print check status messages.
- --no-warnings
- Don't log warnings. Default is to log warnings.
- -oTYPE[/ENCODING], --output=TYPE[/ENCODING]
- Specify output type as text, html, sql, csv, gml, dot, xml, none or blacklist. Default type is text. The various output types are documented below.
- --profile
- Write profiling data into a file named linkchecker.prof in the current working directory. See also --viewprof.
- -q, --quiet
- Quiet operation, an alias for -o none. This is only useful with -F.
- --scan-virus
- Scan content of URLs for viruses with ClamAV.
- --trace
- Print tracing information.
- -v, --verbose
- Log all checked URLs once. Default is to log only errors and warnings.
- --viewprof
- Print out previously generated profiling data. See also --profile.
- -WREGEX, --warning-regex=REGEX
- Define a regular expression which prints a warning if it matches any content of the checked link. This applies only to valid pages, so their content can be retrieved.
- --warning-size-bytes=NUMBER
- Print a warning if content size info is available and exceeds the given number of bytes.
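As a sketch of combining the output options above, the following command (www.example.com and report.html are placeholder names) logs all checked URLs and writes an HTML report encoded in UTF-8:

linkchecker -v -Fhtml/utf-8/report.html http://www.example.com/

Without the FILENAME part the report goes to linkchecker-out.html instead.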
Checking options
- -a, --anchors
- Check HTTP anchor references. Default is not to check anchors. This option enables logging of the warning url-anchor-not-found.
- -C, --cookies
- Accept and send HTTP cookies according to RFC 2109. Only cookies which are sent back to the originating server are accepted. Sent and accepted cookies are provided as additional logging information.
- --cookiefile=FILENAME
- Read a file with initial cookie data. The cookie data format is explained below.
- --ignore-url=REGEX
- URLs matching the given regular expression will only have their syntax checked (see the combined example at the end of this subsection).
- -NSTRING, --nntp-server=STRING
- Specify an NNTP server for news: links. Default is the environment variable NNTP_SERVER. If no host is given, only the syntax of the link is checked.
- --no-follow-url=REGEX
- Check but do not recurse into URLs matching the given regular expression.
- -p, --password
- Read a password from console and use it for HTTP and FTP authorization. For FTP the default password is anonymous@. For HTTP there is no default password. See also -u.
- -PNUMBER, --pause=NUMBER
- Pause the given number of seconds between two subsequent connection requests to the same host. Default is no pause between requests.
- -rNUMBER, --recursion-level=NUMBER
- Check recursively all links up to given depth. A negative depth will enable infinite recursion. Default depth is infinite.
- --timeout=NUMBER
- Set the timeout for connection attempts in seconds. The default timeout is 60 seconds.
- -uSTRING, --user=STRING
- Try the given username for HTTP and FTP authorization. For FTP the default username is anonymous. For HTTP there is no default username. See also -p.
- --user-agent=STRING
- Specify the User-Agent string to send to the HTTP server, for example "Mozilla/4.0". The default is "LinkChecker/X.Y" where X.Y is the current version of LinkChecker.
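Checking options are often combined. A sketch with a placeholder site that limits recursion to two levels, pauses one second between requests to the same host and only syntax-checks mailto: links:

linkchecker -r2 -P1 --ignore-url=^mailto: http://www.example.com/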
CONFIGURATION FILES
Configuration files can specify all options above. They can also specify some options that cannot be set on the command line. See linkcheckerrc(5) for more info.

OUTPUT TYPES
Note that by default only errors and warnings are logged. You should use the --verbose option to get the complete URL list, especially when outputting a sitemap graph format.

- text
- Standard text logger, logging URLs in keyword: argument fashion.
- html
- Log URLs in keyword: argument fashion, formatted as HTML. Additionally has links to the referenced pages. Invalid URLs have HTML and CSS syntax check links appended.
- csv
- Log check result in CSV format with one URL per line.
- gml
- Log parent-child relations between linked URLs as a GML sitemap graph.
- dot
- Log parent-child relations between linked URLs as a DOT sitemap graph.
- gxml
- Log check result as a GraphXML sitemap graph.
- xml
- Log check result as machine-readable XML.
- sql
- Log check result as SQL script with INSERT commands. An example script to create the initial SQL table is included as create.sql.
- blacklist
- Suitable for cron jobs (see the example after this list). Logs the check result into a file ~/.linkchecker/blacklist which only contains entries with invalid URLs and the number of times they have failed.
- none
- Logs nothing. Suitable for debugging or checking the exit code.
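As a sketch of the cron usage mentioned for the blacklist type, the following crontab entry (the schedule and URL are placeholders) runs a quiet weekly check and updates ~/.linkchecker/blacklist:

0 4 * * 1 linkchecker --quiet -Fblacklist http://www.example.com/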
REGULAR EXPRESSIONS
LinkChecker accepts Python regular expressions. See http://docs.python.org/howto/regex.html for an introduction.

COOKIE FILES
A cookie file contains standard HTTP header (RFC 2616) data with the following possible names:

- Scheme (optional)
- Sets the scheme the cookies are valid for; default scheme is http.
- Host (required)
- Sets the domain the cookies are valid for.
- Path (optional)
- Gives the path the cookies are valid for; default path is /.
- Set-cookie (optional)
- Set cookie name/value. Can be given more than once.
Multiple entries are separated by a blank line. For example:

Host: example.com
Path: /hello
Set-cookie: ID="smee"
Set-cookie: spam="egg"

Scheme: https
Host: example.org
Set-cookie: baggage="elitist"; comment="hologram"
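A sketch of using such a file for a check run; cookies.txt is a placeholder filename:

linkchecker --cookiefile=cookies.txt http://example.com/hello/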
PROXY SUPPORT
To use a proxy on Unix or Windows set the $http_proxy, $https_proxy or $ftp_proxy environment variables to the proxy URL. The URL should be of the form http://[user:pass@]host[:port]. LinkChecker also detects manual proxy settings of Internet Explorer under Windows systems. On a Mac use the Internet Config to select a proxy. You can also set a comma-separated domain list in the $no_proxy environment variable to ignore any proxy settings for these domains.

Setting an HTTP proxy on Unix for example looks like this:

export http_proxy="http://proxy.example.com:8080"

Setting a proxy with user and password:

export http_proxy="http://user1:mypass@proxy.example.org:8081"

Setting a proxy on the Windows command prompt:

set http_proxy=http://proxy.example.com:8080
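To bypass the proxy for selected domains, a sketch with placeholder domains:

export no_proxy="example.com,example.org"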
PERFORMED CHECKS
All URLs have to pass a preliminary syntax test. Minor quoting mistakes will issue a warning; all other invalid syntax issues are errors. After the syntax check passes, the URL is queued for connection checking. All connection check types are described below.

- HTTP links (http:, https:)
- After connecting to the given HTTP server the given path or query is requested. All redirections are followed, and if user/password is given it will be used as authorization when necessary. Permanently moved pages issue a warning. All final HTTP status codes other than 2xx are errors. HTML page contents are checked for recursion.
- Local files (file:)
- A regular, readable file that can be opened is valid. A readable directory is also valid. All other files, for example device files, unreadable or non-existing files, are errors. HTML or other parseable file contents are checked for recursion.
- Mail links (mailto:)
- A mailto: link eventually resolves to a list of email
addresses. If one address fails, the whole list will fail. For each mail
address we check the following things:
1) Check the address syntax, both the part before and the part after the @ sign.
2) Look up the MX DNS records. If no MX record is found, print an error.
3) Check if one of the mail hosts accepts an SMTP connection. Check hosts with higher priority first. If no host accepts SMTP, print a warning.
4) Try to verify the address with the VRFY command. If an answer is received, print the verified address as an info.
- FTP links (ftp:)
- For FTP links we do the following:
1) Connect to the specified host.
2) Try to login with the given user and password. The default user is ``anonymous``, the default password is ``anonymous@``.
3) Try to change to the given directory.
4) List the file with the NLST command.
- Telnet links (telnet:)
- We try to connect and, if user/password are given, login to the given telnet server.
- NNTP links (news:, snews:, nntp:)
- We try to connect to the given NNTP server. If a news group or article is specified, try to request it from the server.
- Ignored links (javascript:, etc.)
- An ignored link will only print a warning. No further checking will be made.
Here is a complete list of recognized, but ignored links. The most prominent of them are JavaScript links.
- ``acap:`` (application configuration access protocol)
- ``afs:`` (Andrew File System global file names)
- ``chrome:`` (Mozilla specific)
- ``cid:`` (content identifier)
- ``clsid:`` (Microsoft specific)
- ``data:`` (data)
- ``dav:`` (dav)
- ``fax:`` (fax)
- ``find:`` (Mozilla specific)
- ``gopher:`` (Gopher)
- ``imap:`` (internet message access protocol)
- ``isbn:`` (ISBN (int. book numbers))
- ``javascript:`` (JavaScript)
- ``ldap:`` (Lightweight Directory Access Protocol)
- ``mailserver:`` (Access to data available from mail servers)
- ``mid:`` (message identifier)
- ``mms:`` (multimedia stream)
- ``modem:`` (modem)
- ``nfs:`` (network file system protocol)
- ``opaquelocktoken:`` (opaquelocktoken)
- ``pop:`` (Post Office Protocol v3)
- ``prospero:`` (Prospero Directory Service)
- ``rsync:`` (rsync protocol)
- ``rtsp:`` (real time streaming protocol)
- ``service:`` (service location)
- ``shttp:`` (secure HTTP)
- ``sip:`` (session initiation protocol)
- ``steam:`` (Steam browser protocol)
- ``tel:`` (telephone)
- ``tip:`` (Transaction Internet Protocol)
- ``tn3270:`` (Interactive 3270 emulation sessions)
- ``vemmi:`` (versatile multimedia interface)
- ``wais:`` (Wide Area Information Servers)
- ``z39.50r:`` (Z39.50 Retrieval)
- ``z39.50s:`` (Z39.50 Session)
RECURSION
Before descending recursively into a URL, it has to fulfill several conditions. They are checked in this order:

1. A URL must be valid.
2. A URL must be parseable. This currently includes HTML files, Opera bookmarks files, and directories. If a file type cannot be determined (for example it does not have a common HTML file extension, and the content does not look like HTML), it is assumed to be non-parseable.
3. The URL content must be retrievable. This is usually the case except for example mailto: or unknown URL types.
4. The maximum recursion level must not be exceeded. It is configured with the ``--recursion-level`` option and is unlimited by default.
5. It must not match the ignored URL list. This is controlled with the ``--ignore-url`` option.
6. The Robots Exclusion Protocol must allow links in the URL to be followed recursively. This is checked by searching for a "nofollow" directive in the HTML header data.
NOTES
URLs on the command line starting with ftp. are treated like ftp://ftp., and URLs starting with www. are treated like http://www.. You can also give local files as arguments.

ENVIRONMENT
NNTP_SERVER - specifies default NNTP server

RETURN VALUE
The return value is 2 when

- a program error occurred.

The return value is 1 when

- invalid links were found or
- link warnings were found and warnings are enabled.

Else the return value is zero.
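These exit codes make LinkChecker easy to use from shell scripts. A minimal sketch with a placeholder URL:

#!/bin/sh
if ! linkchecker --quiet -Fhtml http://www.example.com/; then
    echo "linkchecker exited with a non-zero status; see linkchecker-out.html"
fi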
LIMITATIONS
LinkChecker consumes memory for each queued URL to check. With thousands of queued URLs the amount of consumed memory can become quite large. This might slow down the program or even the whole system.

FILES
~/.linkchecker/linkcheckerrc - default configuration file

SEE ALSO
linkcheckerrc(5)

AUTHOR
Bastian Kleineidam <calvin@users.sourceforge.net>

COPYRIGHT
Copyright © 2000-2012 Bastian Kleineidam

2010-07-01 | LinkChecker