From grdetil@scrc.umanitoba.ca Wed Aug 18 09:57:01 1999
Date: Wed, 18 Aug 1999 11:18:05 -0500 (CDT)
From: Gilles Detillieux <grdetil@scrc.umanitoba.ca>
To: htdig@htdig.org
Cc: burditt@okstate.edu, jeff@co.mendocino.ca.us, pbuckingham@mps.com
Subject: [htdig] Correction to patch for Acrobat 4


Hi again, folks.  I made a silly mistake in my patch last Friday, August
13, to support Acrobat 4: it truncated the acroread string itself at the
first space, rather than the arg0 copy.  Here's the fix for that mistake:

--- htdig/PDF.cc.bug	Tue Aug 17 11:07:17 1999
+++ htdig/PDF.cc	Wed Aug 18 09:22:28 1999
@@ -109,7 +109,7 @@ PDF::parse(Retriever &retriever, URL &ur
     if (notfound)	// we only need to complain once
 	return;
     String arg0 = acroread;
-    char *endarg = strchr(acroread.get(), ' ');
+    char *endarg = strchr(arg0.get(), ' ');
     if (endarg)
 	*endarg = '\0';
     // If first arg is a path, check that it exists, and is a regular file. 

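To illustrate why the one-line change matters, here is a minimal,
standalone sketch.  It does not use ht://Dig's String class; the helper
names first_arg and first_arg_buggy, and the sample command string in the
usage note, are made up for this example.  The point is only that the
truncation at the first space has to happen on a copy, so the full command
line (program plus options) is still intact when it is needed later.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not part of ht://Dig): return the program name,
// i.e. the text before the first space, leaving `command` untouched.
std::string first_arg(const std::string &command)
{
    // Work on a private, NUL-terminated copy, like arg0 in the fixed code.
    std::vector<char> copy(command.begin(), command.end());
    copy.push_back('\0');

    char *endarg = std::strchr(copy.data(), ' ');
    if (endarg)
        *endarg = '\0';          // truncates only the copy
    return std::string(copy.data());
}

// Hypothetical helper showing the mistake: the caller's own buffer is
// truncated in place, so everything after the program name is lost.
std::string first_arg_buggy(std::vector<char> &command_buffer)
{
    // Precondition for this sketch: the buffer is NUL-terminated.
    assert(!command_buffer.empty() && command_buffer.back() == '\0');

    char *endarg = std::strchr(command_buffer.data(), ' ');
    if (endarg)
        *endarg = '\0';          // truncates the original command
    return std::string(command_buffer.data());
}
```

With a command such as "/usr/bin/acroread -toPostScript", first_arg
returns "/usr/bin/acroread" and the command keeps its option, while after
first_arg_buggy the buffer itself reads only "/usr/bin/acroread".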

It turns out that even without the -pairs option, acroread 4 is still
prone to segmentation violations when generating PostScript, so acroread 3
is a better choice anyway.  However, this fix takes care of a few other
problems in the pdf_parser handling, and you may find that Acrobat 4 works
OK with your files.  Hopefully Adobe will fix these problems before too long.

Also, if you applied last Friday's patch after applying the patch file
collection I sent out last Monday, August 9, there's a hunk that would
have failed to apply to htdoc/attrs.html, because of a conflicting
change in the patch file collection.  You can correct that by applying
the patch below (as well as the one above) after Friday's patch.

--- htdig-3.1.2/htdoc/attrs.html.orig	Fri Aug  6 14:00:28 1999
+++ htdig-3.1.2/htdoc/attrs.html	Tue Aug 17 10:55:45 1999
@@ -4283,14 +4283,33 @@
 		      <em>infile outfile</em>,<br>
 		      where <em>infile</em> is a file to parse and
 		      <em>outfile</em> is the PostScript output of the
-		      parser. The program is supposed to convert to a
+		      parser. In the case where acroread is the parser, and
+		      the -pairs option is not given, the second parameter
+		      will be the output directory rather than the output
+		      file name. The program is supposed to convert to a
 		      variant of PostScript, which is then parsed
-		      internally. Currently, Adobe's <a
+		      internally. Currently, only Adobe's <a
 		      href="http://www.adobe.com/prodindex/acrobat/readstep.html">
-		      acroread</a> program and the pdftops program
-		      that is part of the <a
+		      acroread</a> program has been tested as a pdf_parser.
+		      There is a bug in Acrobat 4's acroread command, which
+		      causes it to fail when -pairs is used, hence the special
+		      case above.<br>
+		       The pdftops program that is part of the <a
 		      href="http://www.foolabs.com/xpdf/">xpdf</a>
-		      0.80 package have been tested as pdf_parsers.
+		      package is not suitable as a pdf_parser,
+		      because its variant of PostScript is slightly
+		      different.  However, an alternative is to
+		      use xpdf's pdftotext program as a component
+		      of an <a href="#external_parsers">external
+		      parser</a> with the xpdf 0.90 package installed
+		      on your system, as described in FAQ question <a
+		      href="FAQ.html#q4.9">4.9</a>.<br>
+		       In either case, to successfully index PDF files,
+		      be sure to set the <a
+		      href="#max_doc_size">max_doc_size</a> attribute
+		      to a value larger than the size of your largest
+		      PDF file. PDF documents can not be parsed if they
+		      are truncated.
 			<p>
 			  The default value of this attribute is determined at
 			  compile time, to include the path to the acroread

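The special case described in the documentation change above (acroread
without -pairs takes an output directory as its second parameter instead
of an output file name) can be sketched roughly as follows.  This is only
an illustration under assumptions, not the actual code in PDF.cc: the
function name build_parser_command and its four parameters are
hypothetical, and detecting the case by simple substring search is a
guess at one reasonable approach.

```cpp
#include <string>

// Hypothetical helper (not ht://Dig's actual code): build the command
// line for a pdf_parser setting, choosing the second parameter as the
// documentation describes.
std::string build_parser_command(const std::string &parser,
                                 const std::string &infile,
                                 const std::string &outfile,
                                 const std::string &outdir)
{
    // Assumption: a plain substring test is enough to recognize the case.
    const bool is_acroread = parser.find("acroread") != std::string::npos;
    const bool has_pairs = parser.find("-pairs") != std::string::npos;

    // acroread without -pairs gets the output directory; every other
    // parser setting gets the output file name.
    const std::string &second = (is_acroread && !has_pairs) ? outdir : outfile;
    return parser + " " + infile + " " + second;
}
```

For example, with parser "acroread -toPostScript", the result ends in the
directory name; with "acroread -toPostScript -pairs", it ends in the
output file name.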

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
