Idiot's Guide to installing ht://dig on Win32.

Jim Kerslake
York, England
jimkerslake@totalise.co.uk


Introduction:

Having gone through the pain of getting ht://dig to work under Windows, I wanted to pull together the various tips and tricks, which are currently lying scattered around the FAQs and support lists, into this brief guide.

I gratefully acknowledge all sources of information from which this guide has been condensed. These earlier contributors may be found by persistent searching of the htdig.org and cygwin sites - my intention is to collate the work of numerous others, not to ride over the top of them. It seems only fair to make some contribution, considering that the ht://dig project provides such an excellent search tool which has helped me out a lot.

This guide pre-supposes that:
(i) you have some familiarity with how ht://dig works in UNIX/Linux systems
(ii) you haven't much of a clue about how Windows / IIS servers work
(iii) you are enough of an obsessive lunatic to want to try this in the first place

So - that covers pretty much all of us, then ;-)


Overview:

I have now managed to get ht://dig v.3.1.5 to work on three Win32 systems:
1.- A Windows 98 PC running PWS (Microsoft Personal Web Server) - Pentium 75 with 32 MB RAM
2.- A Windows NT 4 Workstation - PII 450, 64 MB RAM
3.- A Windows NT 4 Server running IIS - unsure of its precise specifications because it isn't mine!

I got three different sets of behaviours and error messages from the three different systems…
No.1 was the easiest by far.
No.2 was fairly easy up to a point, but not 100% successful.
No.3 was a pain, because of a persistent htmerge error which took ages to diagnose.

So - perhaps the main thing I learned was that, if in doubt, your safest bet is to compile ht://dig from scratch on the same box on which you're intending to run it. Certainly don't build it on Win98 and just copy the files straight over onto NT Server, and expect them to run! You might get away with compiling on NT Workstation and then copying over to NT Server - but I didn't! I will supply some pre-compiled binaries for NT Server too - but don't be too surprised if you find yourself compiling instead.


Compiling the system under Windows

I will assume the following:

(a) You're going to install ht://dig into the following location:
C:\htdig\
so the main ht://dig binaries, config files, common files, databases, etc., will be at:
C:\htdig\bin\
C:\htdig\conf\
C:\htdig\common\
C:\htdig\db\
and so on.

(b) Your Web site CGI directory is at:
C:\Inetpub\wwwroot\mysite\cgi-bin\
and has been set up with appropriate executable permissions
(you might need to read some stuff in the IIS for Dummies guide, e.g. virtual directories etc., to feel confident about setting this up)

(c) You will make the search form in:
C:\Inetpub\wwwroot\mysite\search\


Here goes...

-----
1.
Get cygwin from http://sources.redhat.com/cygwin/
- I downloaded setup.exe from the /latest folder, and chose to "install from Internet"

Cygwin is great - it gives you a UNIX-like shell on your Windows desktop, within which you can compile your GNU programs… and then, by some magic, they work as binary executables under Windows !!

Only one word of advice: watch out for different versions of Cygwin.
A key file in the installation is: /bin/cygwin1.dll

This was the cause of several days' hair-tearing in my case, because I had accidentally installed an older version… this ran fine on both machines 1. and 2. above, but on the NT Server (box 3. above), it caused htmerge to fail when it tried to call sort.exe ("word sort failed"). Replacing it with a newer version (by accident!) cured the problem instantly.

The version that did work: API=0.24; Build=27/07/2000; ver=1.1.3
The version that didn't work: API=0.21; Build=06/06/2000; ver=1.1.2
(you can get this information by right-clicking on the file, then 'Properties', then Version tab)
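
Since a stale copy of cygwin1.dll caused me so much grief, it can be worth checking how many copies Windows can actually see on its PATH. What follows is only a rough sketch in standard C++, not part of ht://dig or Cygwin: the function names splitPathList and findCopies are made up for illustration, and it simply tests whether a file of the given name can be opened in each directory.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Split a PATH-style string on the given separator
// (';' in a Windows command prompt, ':' inside the Cygwin shell).
// Empty entries are skipped.
std::vector<std::string> splitPathList(const std::string& pathValue, char sep) {
    std::vector<std::string> dirs;
    std::string::size_type start = 0;
    while (start <= pathValue.size()) {
        std::string::size_type end = pathValue.find(sep, start);
        if (end == std::string::npos)
            end = pathValue.size();
        if (end > start)
            dirs.push_back(pathValue.substr(start, end - start));
        start = end + 1;
    }
    return dirs;
}

// Return every directory in pathValue in which a file called fileName
// can be opened for reading (hypothetical helper, for illustration only).
std::vector<std::string> findCopies(const std::string& pathValue, char sep,
                                    const std::string& fileName) {
    std::vector<std::string> hits;
    for (const std::string& dir : splitPathList(pathValue, sep)) {
        std::ifstream probe((dir + "/" + fileName).c_str(), std::ios::binary);
        if (probe.is_open())
            hits.push_back(dir);
    }
    return hits;
}
```

Feeding findCopies the value of the PATH variable, ';' and "cygwin1.dll" from a small main() should list each directory holding a copy. If more than one turns up, check the version of each as described above.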

-----
2.
Get the latest stable distribution of ht://dig (the source version, not pre-compiled binaries) and place it somewhere convenient in your cygwin directory structure (I put mine in /home/htdig/ )

From the Cygwin shell prompt ($), unpack it with the appropriate UNIX command, which will be something like:
tar xzvf htdig-3.1.5.tar.gz


-----
3.
Still in the Cygwin shell, cd into the htdig-3.1.5 directory just created, and run:
./configure
If this fails with error messages, you might need to take another look at your Cygwin installation version, and check whether you got it all successfully downloaded and installed.


-----
4.
Now, in the same directory, find the file CONFIG and edit it:
# This specifies the root of the directory tree to be used by ht://Dig
prefix= c:/htdig

CGIBIN_DIR= c:/Inetpub/wwwroot/mysite/cgi-bin

IMAGE_DIR= c:/Inetpub/wwwroot/mysite/search/images
IMAGE_URL_PREFIX= /search/images
SEARCH_DIR= c:/Inetpub/wwwroot/mysite/search



-----
5. Edit these files to specify correct paths:

file 1 --- /htmerge/Makefile

LOCAL_DEFINES= -DSORT_PROG=\"/usr/bin/sort\"
becomes
LOCAL_DEFINES= -DSORT_PROG=\"c:/htdig/bin/sort.exe\"

(so that you don't forget it later, now is a good time to copy sort.exe from your /cygwin/bin installation into this c:/htdig/bin/ folder!)


file 2 --- /htdig/ExternalParser.cc

String path = getenv("TMPDIR");
if (path.length() == 0)
path = "/tmp";
path << "/htdext." << getpid();

FILE *fl = fopen(path, "w");

becomes:

String path = getenv("TMPDIR");
if (path.length() == 0)
path = "c:/Temp";
path << "/htdext." << getpid();

FILE *fl = fopen(path, "wb");


[ the extra 'b' above opens the temporary file in binary mode, so that Windows newline translation doesn't corrupt binary documents such as PDF files when they are written out for the external parser ]
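
To make the logic of that edit a bit easier to follow, here is a stand-alone sketch of the same idea. It is not the ht://dig code: it uses std::string in place of ht://dig's own String class, and the function names externalParserTempPath and writeBinary are invented for this example. One difference is deliberate: getenv() can return a null pointer when TMPDIR isn't set, so the sketch checks for that explicitly.

```cpp
#include <cstdio>
#include <string>

// Build the name of the scratch file handed to an external parser.
// tmpdirEnv is whatever getenv("TMPDIR") returned; it may be a null
// pointer if the variable is not set.
std::string externalParserTempPath(const char* tmpdirEnv, long pid) {
    std::string path = (tmpdirEnv != nullptr) ? tmpdirEnv : "";
    if (path.empty())
        path = "c:/Temp";  // fallback used in this guide (UNIX default is /tmp)
    path += "/htdext." + std::to_string(pid);
    return path;
}

// Write a buffer to the given file in binary mode ("wb"), so that no
// newline translation is applied on Windows. Returns true on success.
bool writeBinary(const std::string& path, const std::string& data) {
    FILE* fl = std::fopen(path.c_str(), "wb");
    if (fl == nullptr)
        return false;
    std::size_t written = std::fwrite(data.data(), 1, data.size(), fl);
    bool closed = (std::fclose(fl) == 0);
    return closed && written == data.size();
}
```

With TMPDIR unset and a process id of 42, the first function gives c:/Temp/htdext.42, which is why C:\Temp needs to exist and be writable.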

-----
6. From the Cygwin prompt, in the htdig-3.1.5 installation directory, run make followed by make install

-----
7. Copy cygwin/bin/cygwin1.dll and cygwin/bin/libz.dll into a location that is recognised by your Windows system as part of its PATH (on Windows 98 you might need to edit autoexec.bat and re-boot; on NT the PATH is set via Control Panel > System > Environment).
Alternatively, you could be lazy and just drop copies of these two DLL files into both the C:\htdig\bin\ folder and the C:\Inetpub\wwwroot\mysite\cgi-bin\ folder. I have no idea whether the latter action has any web-security implications.

-----
8. If you didn't do it earlier - remember that cygwin/bin/sort.exe needs to be in C:\htdig\bin\

-----
9. Modify the HTML search form in C:\Inetpub\wwwroot\mysite\search\ to specify:
action="/cgi-bin/htsearch.exe"

-----
10. If you want to use an external parser for PDF files, make sure you have perl installed somewhere. You can install ActivePerl, but I had slightly more success with installing the port of perl5.6.0 for cygwin, which you can obtain from the cygwin site.

Also you will need to copy two files pdftotext.exe and pdfinfo.exe into C:\htdig\bin\
These are win32 ports of the files from the xpdf distribution.
I'm using pre-compiled binaries available from the xpdf site, because I failed to compile the xpdf source distribution under cygwin (some problem in the Makefiles? make seems to trawl recursively through my directories)
[** note - these might just have been superseded by a new version 0.91 ]

-----
11. Add this line to htdig.conf:

external_parsers: application/pdf->text/html "c:/perl/bin/Perl.exe c:/htdig/bin/conv_doc.pl"
(assuming you have perl installed at that location on the server)

Personally, I use:
external_parsers: application/pdf->text/html "C:/cygwin/usr/local/bin/perl5.6.0.exe c:/htdig/bin/conv_doc.pl"
(since that's the location of my cygwin port of perl5.6.0)


-----
12. Put conv_doc.pl (from the NT distribution supplied by Stephane Baudet) into C:\htdig\bin and edit it:
$CATPDF = "c:/htdig/bin/pdftotext.exe";
$PDFINFO = "c:/htdig/bin/pdfinfo.exe";

-----
13. - Moment of truth:
edit your C:\htdig\conf\htdig.conf file according to your needs, and test whether htdig.exe works.
If it does, and if it makes a set of database files in C:\htdig\db\ then test whether htmerge.exe runs.
Finally, test whether your HTML search form can query the resultant database successfully.

-----
14. I normally just run the indexing process from a batch (.BAT) file:

htdig.exe -c ../conf/htdig.conf -v >makeindexlog.txt
SETLOCAL
set TMPDIR=C:\TEMP
htmerge.exe -vvv -c ../conf/htdig.conf >>makeindexlog.txt
ENDLOCAL



Bugs and Glitches:

- htdig bombs out during indexing, with loads of "no server running" messages:

On a fast machine, particularly if you are running htdig to index a site hosted on that same server, htdig seems able to fire off page requests faster than the web server can deal with them - i.e. the available web server processes get swamped.

Try using a local_urls specification in the config file, to point ht://dig at the local HTML files (instead of retrieving them via the web server).
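
As a sketch only: assuming your site is served as http://www.mysite.com/ (a made-up host name) from the document root used elsewhere in this guide, the htdig.conf entry would look something like the line below. The left-hand side is the URL prefix and the right-hand side is the matching directory on disk. I haven't verified how every Cygwin build copes with the drive letter, so test it on a small part of the site first.

```
local_urls: http://www.mysite.com/=c:/Inetpub/wwwroot/mysite/
```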

The problem disappears if you are indexing lots of big slow external sites at the same time as your local one.

 

- htmerge fails, with error "word sort failed":

On a UNIX system, the finger of suspicion would point first at the availability of /tmp space, which the 'sort' program needs to use.

BUT - see point 1 above: On NT, I spent ages worrying about Temp filespace, when the real culprit was an old version of cygwin1.dll

If you are sure that you have the appropriate version of cygwin1.dll (no older versions loitering around in /system directories) then you can investigate whether sort.exe is being given the correct directions to its intended Temp space, as follows:

(a) - does C:\Temp exist? (as specified above in point 5)
- do you have permissions to write to it?

(b) - try setting the TMPDIR environment variable (see my batch file in 14 above)

(c) - you can edit the file /htdig-3.1.5/htmerge/words.cc (near the top):

String command = SORT_PROG;
String tmpdir = getenv("TMPDIR");
if (tmpdir.length())
{
command << " -T " << tmpdir;
}

becomes either:

String command = SORT_PROG;
String tmpdir = "C:/Temp";
if (tmpdir.length())
{
command << " -T " << tmpdir;
}


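or try removing the -T option altogether (GNU sort then falls back on the TMPDIR environment variable, or on /tmp):

String command = SORT_PROG;
String tmpdir = getenv("TMPDIR");
if (tmpdir.length())
{
// command << " -T " << tmpdir;   (disabled, so that no -T option is passed to sort)
}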


or do both!

---- obviously then, having edited this source code, you must remove htmerge.exe and the .o files from that /htmerge directory, and re-run make to re-make the binary htmerge.exe file.


(d) try replacing the sort.exe which comes as a part of cygwin, with a different one:
Get the GNU textutils package, compile that under cygwin, and you get a different (bigger) sort.exe .
Use this to replace the one in c:\htdig\bin\

GNU sort (and all the textutils) are at: ftp://ftp.gnu.org/pub/gnu/textutils/
[don't try to use the sort.exe file that comes along as part of Windows and lives in the system32 folder !!]


[ In my case, I tried all 4 of the solutions above, in various combinations, with no success at all, until I finally hit upon the DLL version solution ]
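
For anyone wanting to experiment with checks (a) and (c) outside of ht://dig, here is a small stand-alone sketch. It is not the htmerge source: it uses std::string instead of ht://dig's String class, the function names canWriteTo and buildSortCommand are invented, and the trailing input-file argument is purely an assumption for illustration. The point to notice is the spaces around -T: without them the directory would be glued straight onto the name of the sort program.

```cpp
#include <cstdio>
#include <string>

#ifndef SORT_PROG
// Normally supplied by -DSORT_PROG=... in htmerge/Makefile (see step 5).
#define SORT_PROG "c:/htdig/bin/sort.exe"
#endif

// Check (a): can we create (and then delete) a file in the given directory?
bool canWriteTo(const std::string& dir) {
    std::string probe = dir + "/htdig_probe.tmp";
    FILE* f = std::fopen(probe.c_str(), "w");
    if (f == nullptr)
        return false;
    std::fclose(f);
    std::remove(probe.c_str());
    return true;
}

// Check (c): assemble the sort command line.
// tmpdirEnv is the result of getenv("TMPDIR") and may be a null pointer.
// inputFile is a hypothetical argument, added only to make the example complete.
std::string buildSortCommand(const char* tmpdirEnv, const std::string& inputFile) {
    std::string command = SORT_PROG;
    if (tmpdirEnv != nullptr && *tmpdirEnv != '\0') {
        command += " -T ";  // note the spaces on both sides
        command += tmpdirEnv;
    }
    command += " ";
    command += inputFile;
    return command;
}
```

Printing the result of buildSortCommand and then pasting it into a command prompt by hand is a quick way of finding out whether sort.exe itself is happy with the Temp directory, before blaming htmerge.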


- Form input glitch

If you have indexed multiple sites [www.abc.com; www.def.com; www.ghi.com] and want to build a form which uses "restrict" to allow searching to be limited to one or all of these sites:

<select name="restrict">
<option value="abc.com"> search abc.com
<option value="def.com"> search def.com
<option value="ghi.com"> search ghi.com
<option value="">search the whole lot
</select>

then I find that the last value, restrict="", gives incorrect results.

I don't know whether or not this is NT-specific.
I get around it by:
<option value="/">search the whole lot


That's it - good luck!
JK