Contributed by Steve Eidemiller on 10/10/2003 (steve.eidemiller@childrenshc.org)
This package includes exe and dll binaries from the following sources:
    htdig    http://www.htdig.org
    Cygwin   http://www.cygwin.com
    catdoc   http://packages.debian.org/stable/text/catdoc.html
    xpdf     http://www.foolabs.com/xpdf/download.html
This package contains Windows binaries built from the htdig 3.1.6 distro
using Cygwin 1.3.12-1 and gcc 2.95.3-5 (cygwin special). These binaries
have been tested on Windows 2000 Professional SP4, Windows 2000 Server SP4, and
Windows XP Professional SP1.
I mostly followed Jim Kerslake's "Idiot's Guide to installing ht://dig on Win32",
with a few modifications:
    prefix=c:/htdig
    CGIBIN_DIR=c:/Inetpub/wwwroot/cgi-bin
    IMAGE_DIR=c:/Inetpub/wwwroot/htdig/images
    IMAGE_URL_PREFIX=/htdig/images
    SEARCH_DIR=c:/Inetpub/wwwroot/htdig
I also applied this date tag patch:
=======================================
FROM: Gilles Detillieux
DATE: 02/07/2002 13:40:30
SUBJECT: [htdig] PATCH - fix meta date tag parsing in 3.1.6
This patch fixes a problem introduced in 3.1.6's handling of use_doc_date,
which wasn't in the 3.1.5 patches for this feature. The new date parsing
code in 3.1.6 didn't allow a '-' character after the year in the content
attribute of meta date tags, but only allowed white space, which is
obviously not in accordance with the ISO 8601 date format standard.
Apply this patch in your main htdig-3.1.6 source directory using the
command: patch -p0 < this-message-file
--- htdig/Retriever.cc.orig Thu Jan 31 17:47:17 2002
+++ htdig/Retriever.cc Thu Feb 7 14:47:27 2002
@@ -1139,7 +1139,7 @@ parsedcdate(char *date)
         year += 1900;
     else if (year >= 19100) // seen some programs do it, why not check?
         year -= (19100-2000);
-    while (isspace(*s))
+    while (*s == '-' || isspace(*s))
         s++;
     // get month...
--
Gilles R. Detillieux E-mail: <>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
=======================================
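The effect of the one-line change above can be illustrated with a small standalone sketch. This is not the ht://Dig parser itself: skip_year_separator and its allow_dash flag are hypothetical names invented here, and the function mirrors only the patched loop, assuming s points just past the year digits.

```cpp
#include <cassert>
#include <cctype>

// Hypothetical helper, not part of ht://Dig: advance past whatever separates
// the year from the month in a date string.
//   allow_dash == false  mimics the original 3.1.6 loop (white space only)
//   allow_dash == true   mimics the patched loop ('-' or white space)
const char *skip_year_separator(const char *s, bool allow_dash)
{
    assert(s != nullptr);
    while ((allow_dash && *s == '-') ||
           std::isspace(static_cast<unsigned char>(*s)))
        s++;
    return s;
}
```

For an ISO 8601 style date such as 2002-02-07, the text left after the year is "-02-07". With allow_dash set to false the pointer stays on the '-', so the month digits are never reached; with allow_dash set to true it advances to "02-07".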
And, the following changes to the retry code in htlib\Connection.cc:
=======================================
To increase the number of retries for any given page, and to lengthen the wait between retries,
the following four lines were changed in "htlib\Connection.cc". Changed lines are marked "SteveE"
and note the old value. Make sure BOTH sets of values are changed! Extra lines from the .cc
code are shown so you can get your bearings.
Connection::Connection()
{
    sock = -1;
    connected = 0;
    peer = 0;
    server_name = 0;
    all_connections.Add(this);
    timeout_value = 0;
    retry_value = 6;    //Old value = 1 -- SteveE 09/25/2003
    wait_time = 10;     // wait 10 seconds after a failed connection //Old value = 5 -- SteveE 09/25/2003
}
Connection::Connection(int socket)
{
    sock = socket;
    connected = 0;
    GETPEERNAME_LENGTH_T length = sizeof(server);
    if (getpeername(socket, (struct sockaddr *)&server, &length) < 0)
    {
        perror("getpeername");
    }
    peer = 0;
    server_name = 0;
    all_connections.Add(this);
    timeout_value = 0;
    retry_value = 6;    //Old value = 1 -- SteveE 09/25/2003
    wait_time = 10;     //Old value = 5 -- SteveE 09/25/2003
}
=======================================
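To show roughly what these two values control, here is a minimal sketch of a retry loop. It is an illustration only, not the code in Connection.cc: connect_with_retries, try_connect and pause_seconds are hypothetical names, and the sketch assumes the pause happens after every failed attempt except the last, which may differ from what ht://Dig actually does.

```cpp
#include <cassert>
#include <functional>

// Hypothetical illustration, not the ht://Dig implementation: call
// try_connect up to retry_value times, calling pause_seconds(wait_time)
// after each failed attempt except the last one.
// Returns the number of the attempt that succeeded, or 0 if all failed.
int connect_with_retries(const std::function<bool()> &try_connect,
                         const std::function<void(int)> &pause_seconds,
                         int retry_value, int wait_time)
{
    assert(retry_value >= 0 && wait_time >= 0);
    for (int attempt = 1; attempt <= retry_value; ++attempt) {
        if (try_connect())
            return attempt;             // connected, stop retrying
        if (attempt < retry_value)
            pause_seconds(wait_time);   // back off before the next attempt
    }
    return 0;                           // every attempt failed
}
```

Under these assumptions, the old values (retry_value = 1, wait_time = 5) give up after a single failed attempt, while the new values (6 and 10) allow up to six attempts with at most 50 seconds of waiting in total.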
I also modified conv_doc.pl and added two .bat files that launch catdoc.exe and the pdf
converters. Please refer to c:\htdig\contrib\htdig.conf from this distribution to see
my settings for external_parsers.
TO INSTALL THIS DISTRIBUTION:
=============================
1. Unzip the entire contents to C:\htdig (exe files should end up in C:\htdig\bin)
2. Copy the files in C:\htdig\Inetpub\wwwroot\cgi-bin to your virtual host's cgi-bin folder and
   set execute permissions appropriately.
3. Copy the files in C:\htdig\Inetpub\wwwroot\htdig to /htdig on your virtual host.
4. Edit C:\htdig\conf\htdig.conf appropriately. I have included my .conf file at C:\htdig\contrib\htdig.conf
for reference (edited to remove confidential information). Notable things to edit are the start_url,
limit_urls_to, exclude_urls, and maintainer.
5. Run C:\htdig\bin\htdig.bat to create your databases in C:\htdig\db. I use htdig.bat instead of rundig
   because it generates nice htdig.log and htdig_error.log files. My conf file is set up to make use of the
   external parsers (included) and to generate conv_errors.log to log conversion errors as needed. You can also
   copy C:\htdig\contrib\BrokenLinks.asp to an IIS folder and browse to it to see a broken link report built from
   the htdig.log file generated by my htdig.bat.
6. Edit the htdig.exe and htmerge.exe parameters in htdig.bat to fit your needs.
7. You may wish to set up Task Scheduler (or equivalent) to run htdig.bat on a routine basis.
Many thanks to all the wonderful contributors on the htdig and related projects!!