From grdetil@scrc.umanitoba.ca Wed Aug 18 09:57:01 1999
Date: Wed, 18 Aug 1999 11:18:05 -0500 (CDT)
From: Gilles Detillieux
To: htdig@htdig.org
Cc: burditt@okstate.edu, jeff@co.mendocino.ca.us, pbuckingham@mps.com
Subject: [htdig] Correction to patch for Acrobat 4

Hi again, folks. I made a silly mistake in my patch last Friday,
August 13, to support Acrobat 4. Here's the fix for that mistake:

--- htdig/PDF.cc.bug Tue Aug 17 11:07:17 1999
+++ htdig/PDF.cc Wed Aug 18 09:22:28 1999
@@ -109,7 +109,7 @@ PDF::parse(Retriever &retriever, URL &ur
     if (notfound) // we only need to complain once
         return;
     String arg0 = acroread;
-    char *endarg = strchr(acroread.get(), ' ');
+    char *endarg = strchr(arg0.get(), ' ');
     if (endarg)
         *endarg = '\0';
     // If first arg is a path, check that it exists, and is a regular file.

It turns out that even without the -pairs option, acroread 4 is still
prone to segmentation violations when generating PostScript, so acroread 3
is a better choice anyway. However, this fix handles a few other problems
with pdf_parser handling, and you may find that Acrobat 4 works OK with
your files. Hopefully Adobe will fix these problems before too long.

Also, if you applied last Friday's patch after applying the patch file
collection I sent out last Monday, August 9, there's a hunk that would
have failed to apply to htdoc/attrs.html, because of a conflicting change
in the patch file collection. You can correct that by applying the patch
below (as well as the one above) after Friday's patch.

--- htdig-3.1.2/htdoc/attrs.html.orig Fri Aug 6 14:00:28 1999
+++ htdig-3.1.2/htdoc/attrs.html Tue Aug 17 10:55:45 1999
@@ -4283,14 +4283,33 @@
 infile outfile,
 where infile is a file to parse and outfile is the PostScript output of the
- parser. The program is supposed to convert to a
+ parser. In the case where acroread is the parser, and
+ the -pairs option is not given, the second parameter
+ will be the output directory rather than the output
+ file name. The program is supposed to convert to a
 variant of PostScript, which is then parsed
- internally. Currently, Adobe's
- acroread program and the pdftops program
- that is part of the program has been tested as a pdf_parser.
+ There is a bug in Acrobat 4's acroread command, which
+ causes it to fail when -pairs is used, hence the special
+ case above.
+ The pdftops program that is part of the
 xpdf
- 0.80 package have been tested as pdf_parsers.
+ package is not suitable as a pdf_parser,
+ because its variant of PostScript is slightly
+ different. However, an alternative is to
+ use xpdf's pdftotext program as a component
+ of an external
+ parser with the xpdf 0.90 package installed
+ on your system, as described in FAQ question 4.9.
+ In either case, to successfully index PDF files,
+ be sure to set the max_doc_size attribute
+ to a value larger than the size of your largest
+ PDF file. PDF documents can not be parsed if they
+ are truncated.

 The default value of this attribute is determined at compile time, to include the path to the acroread

-- 
Gilles R. Detillieux              E-mail:
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
------------------------------------
To unsubscribe from the htdig mailing list, send a message to
htdig@htdig.org containing the single word unsubscribe in
the SUBJECT of the message.