Note: This package is still useful and will be maintained. However, new
functions will go into the Perl module HTML::TagReader, which is available
from http://cpan.org/authors/id/G/GU/GUS/
------------------------------------------------------------------

The webgrep tool box consists of 9 utilities for the web-master

wchck
---------
This is a cgi-bin to check html pages. It is written in perl.
Take a look at the top of the wchck file to get detailed information about
this program and how to install it.

lshtmlref
---------
This is a utility to build tar archives from web-pages and include all the
necessary GIFs, text files etc. This can of course only work for relative
links. A good web editor uses only relative links anyway, as this makes
the pages re-locatable :-).

blnkcheck
--------
blnkcheck checks web-pages for broken links. It searches only the relative
links and is therefore very fast and not dependent on a web-server.
On a Pentium 75 MHz with local disk it can e.g. check 5000 links in 3 seconds.
It is ideal to verify that your complete web-server is consistent.
The idea is that after editing your page in a html or plain text editor you
just type "blnkcheck the_page_you_changed.html". This tells you if all links
are correct.
blnkcheck checks relative links (things like href="../info.html" or
src="point.jpg" etc.). You can however list the absolute links with -a
and run httpcheck as a post processor on them.

NOTE: blnkcheck is designed as a fast checker for web-masters that have
shell and file system level access to their web-pages. It can also be used
if you are able to keep a mirror of the web site on your local disk.
Other programs like e.g. curl (http://curl.haxx.nu/) can be used if
you want to check your web-pages only remotely via a web-server. curl comes
with a dead link checker called checklinks.pl.

httpcheck
---------
httpcheck is a post processor for "blnkcheck -a" and can be used to check
absolute links of protocol type http.
It does the checks by sending HEAD requests to the web-servers for the page
in question. This is a lot faster than fetching the whole web-page but
still quite slow.
httpcheck is written in perl. It requires perl 5 in /usr/bin/perl.
As of version 1.6 httpcheck can also handle proxies.

taggrep
-------
taggrep is a program to grep for html tags, e.g. to search for meta tags or
to list the titles of a number of web-pages.
To quickly see which file is which web-page you may type

  taggrep -c title title `find . -name '*.htm*' -print`

The command
> taggrep -c li,ol li doc
applied to a web-page that looks like this:

  <ol>
  <li>item 1
  <li>item 2
  <li type=disc> three
  <li type=square>four
  </ol>

produces:

  doc:12: <li>item 1 <li>
  doc:13: <li>item 2 <li type=disc>
  doc:14: <li type=disc> three <li type=square>
  doc:15: <li type=square>four </ol>

webfgrep
--------
webfgrep is a web search engine that works well for websites with up to
1Mb of html pages. It uses memory mapped file access and is therefore
quite fast. It is best used from a perl cgi-bin wrapper that produces and
evaluates the form. Make sure you write a secure cgi-bin! One solution is
to escape all meta characters with quotemeta() before passing the search
string to the shell and webfgrep, but the simpler and more secure one may
be to use the -s option of webfgrep.
There are 2 sample cgi-bin programs called websearch and websearch-s
available in this distribution. They are made for English web-pages and do
not support any special characters that you might have in other languages.
webfgrep as such is basically a very fast fgrep program that excludes
tags from the search text.

srcgrep
-------
srcgrep searches web-pages for <img ... src=...> or
<body ... background=...> and displays the data contained in the tag in a
readable format.
This is useful if you need to re-work web-pages (e.g. to check what images
are included).

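To illustrate the kind of search srcgrep performs, here is a minimal
sketch in Python. It is not srcgrep's implementation (srcgrep is written
in C) and it does not reproduce srcgrep's options or output format. The
names WANTED, SrcCollector and list_sources are hypothetical and exist
only for this example. The sketch uses the standard html.parser module to
collect the src attribute of img tags and the background attribute of
body tags, together with the line on which each tag starts.

```python
from html.parser import HTMLParser

# Hypothetical mapping for this sketch: which attribute to report per tag.
WANTED = {"img": "src", "body": "background"}


class SrcCollector(HTMLParser):
    """Collects (line, tag, attribute, value) for the tags in WANTED."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names in lower case.
        wanted_attr = WANTED.get(tag)
        if wanted_attr is None:
            return
        line, _column = self.getpos()  # position where the tag starts
        for name, value in attrs:
            if name == wanted_attr and value:
                self.found.append((line, tag, name, value))


def list_sources(html_text):
    """Return all img src / body background references found in html_text."""
    parser = SrcCollector()
    parser.feed(html_text)
    parser.close()
    return parser.found


if __name__ == "__main__":
    page = '<body background="bg.gif">\n<img src="point.jpg">\n</body>'
    for line, tag, name, value in list_sources(page):
        print(f"{line}: {tag} {name}={value}")
```

Because the line number is taken at the start of the tag, a tag that is
broken over several lines is still reported on the line where it begins.
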
hrefgrep
--------
hrefgrep is like srcgrep except that it searches for <a href=...>...</a> or
an area tag of the form <area ... href=...>.
It otherwise takes exactly the same options.

htmlpp
------
htmlpp removes line breaks in html tags that contain one of
href=, name=, background=, src= and compensates for the removed newlines
later on by adding them after the next newline outside a tag. This way
all tags start at the same line number as in the original file.
This makes it possible to post-process tags with programs that work best
in a line oriented mode (sed, awk, perl....).
This program does not edit the file. All output goes to stdout.
At the moment htmlpp is not used for anything. Scripts will follow.

------------------------
See the INSTALL file for a description of how to install this software.
------------------------
History:
1.0: first usable c-version.
1.1: 1999-02-27 hrefgrep added, documentation improved.
1.2: 1999-03-01 hrefgrep, srcgrep: Now you can list each file name only
     once with the option -t
1.3: 1999-03-19 area tag added to hrefgrep
1.4: 1999-04-08
     - added webfgrep, a poor man's web search engine
     - webfgrep -i option.
1.5: 1999-04-30 comments are now handled.
     blnkcheck, httpcheck and lshtmlref added
1.6: 1999-05-05 timeout added to httpcheck. Some servers connect but
     do not respond.
     blnkcheck: it is now ok to have a space or \n in the path of a link
     httpcheck: can now handle proxies
     httpcheck: incompatible change with 1.5, option -b removed
     lshtmlref: incompatible change with 1.5, option -w removed
1.7: 1999-05-10 Some corrections in documentation and help functions.
     taggrep added.
     blnkcheck, lshtmlref, srcgrep: check also the "background="
1.8: 1999-05-17 Documentation updates.
     httpcheck: keep a cache of already requested pages in ram to speed up
     repeated checks for the same URL.
1.9: statistics format of blnkcheck changed
2.0: 1999-11-22 A complete re-write:
     - hrefgrep and srcgrep no longer have the options -t and -a as
       they were obsolete anyhow.
     - blnkcheck has been enhanced significantly. For references to
       named anchors ("page.html#anchor") it now also checks that the
       anchor really exists in that file.
     - The hash tables have been improved and allow for fast random
       access to ordered data tables.
2.1: 1999-11-24
     - cgi-bin's are now correctly checked if they are included in the
       same directory tree as html pages and referenced by a relative
       link (e.g. href=../qq.pl?xx=1).
     - code clean up to use more regexp matching. httpcheck now always
       prints ERROR if it could not verify a page.
2.2: 1999-12-19
     - option -a for hrefgrep
     - print error for anchors that are terminated with a new anchor:
       <a href=....>...<a href=...
     - print error for unterminated anchor tags
2.3: 2000-01-12
     - delete @ENV{'IFS', 'CDPATH', 'ENV', 'BASH_ENV','PATH'}; added to
       cgi-bin
     - lshtmlref -Wa file.html did append "index.html" if the links did
       not exist.
     - lshtmlref: new options -A and -i, bug fix for option -L
     - blnkcheck: option -n and -w are now case insensitive
     - blnkcheck: option -O added. Bug fix for image tags nested inside
       anchor
2.4: 2000-01-29
     - bug in httpcheck: Url in all upper case could not be checked,
       HTTP://WWW...
     - bug in httpcheck: The cache did not work correctly for urls that
       were not broken
2.5: 2000-04-09
     - httpcheck: some web-servers want to have a Host: ...\n\r in the
       HEAD request.
     - htmlpp added
2.6: - make it compile on HPUX
2.7: 2000-10-18
     - some comments added in httpcheck
     - bug reported by Matthijs Hollemans: broken refs to named anchors
       in other files are only reported once.
2.8: 2000-11-05
     - it is possible to have a tag like <a href=... name=...> ...
       </a>
2.9: 2000-12-20
     - updates to lshtmlref, httpcheck, blnkcheck
     - new cgi-bin wchck
     - blnkcheck should also check for file://
2.9b: 2002-01-28
     - changed httpcheck to work with more recent perl versions
2.9d: 2002-10-07
     - editorial updates for upload to
       www.ibiblio.org/pub/Linux/apps/www/misc/
2.10: 2002-10-08
     - updates to httpcheck
2.11: 2003-01-06
     - better rpm spec file
------------------------
Author: Guido Socher, guido@linuxfocus.org
Copyright: GPL (see http://www.gnu.org/copyleft/gpl.html)
------------------------
This program is available from:
http://linuxfocus.org/~guido
or
http://www.toppoint.de/~utuxfan/g/
------------------------