From grdetil@scrc.umanitoba.ca Tue Feb 15 11:55:38 2000 Date: Tue, 15 Feb 2000 13:06:23 -0600 (CST) From: Gilles Detillieux To: htdig3-dev@htdig.org Subject: Re: [htdig3-dev] Use of & as CGI variable separator vs. HTML 4.0 Hi, folks. After thinking about some recent messages expressing concern about htsearch's compliance with HTML 4.0, and its potential vulnerability to cross-site scripting, I've decided there are a couple things worth doing. One thing is to add some minimal SGML entity encoding capability to htsearch in the 3.1.x branch, and to support $&(var) and $%(var), as 3.2 does. I have working code to do this, and I'll probably commit this to the tree for an eventual 3.1.5 release. (No, I wasn't planning to do this, but it seems there now may be a need for it.) Another thing which I finally decided to do was to blow the dust off of some e-mail discussions from last April, such as this one... According to me: > According to Budd, S.: > > > From: Gilles Detillieux [SMTP:grdetil@scrc.umanitoba.ca] > > > According to Torsten Neuer: > > > > According to Gilles Detillieux: > > > > >They also recommend using ";" as an alternate separator, so we should > > > > >also handle > > > > > > > > > >?x=1;y=2 > > > > >and use this in Display::createURL(), if we're to follow their > > > > >recommendations. Does anyone see a problem with using a semicolon in > > > > >this manner? > > > > > > > > I disagree with that since AFAIK a semicolon does not necessarily need > > > > to be url-encoded. > > > > Could cause trouble if a semicolon is part of a parameter. > > > > > till now we could not use & as part of a parameter? > > I believe you're right! I was assuming that parameters were getting > URL-encoded individually, and any "&" within a parameter got encoded > to "%26" before it got appended to the query string. That's not > how Display::createURL() currently functions. It concatenates all > parameter strings, separated by "&", then encodes the whole string without > changing the ampersands. So, any "&" within a parameter would cause that > parameter to be truncated at that point when it's re-read as input in > a subsequent search. The CGI input handling code does it correctly. > It separates out the individual variables from the query string, and > decodes only the value of each variable. > > Display::createURL() really ought to do just the opposite - it should > encode the value of each variable separately, ensuring that any special > characters, including "=" & "&" (and ";" if we use it later), get encoded. > Then it should assemble the query string from these encoded variables, > using whatever separator we choose to standardize on. The encodeURL() > function allows you to give it a list of valid punctuation as a 2nd > parameter, so it should be easy to change this. We can then change > htlib/cgi.cc to handle that separator as well as "&". The vertical bar > "|" gets encoded right now, so if we do the encoding properly, theres no > reason we couldn't allow it as a variable separator too. We can also > allow "&" or "&" as separators, if we translate SGML entities > before splitting the variables. Following that discussion, there were some references to RFCs 1738 & 2396, and recommendations about how URL parameters should be encoded. At the time, I shelved this because I didn't have the time to read the RFCs and figure out the distinction between reserved and unreserved punctuation, and how htsearch should handle these. Geoff had quickly patched encodeURL() to use the reserved set of characters as the default valid list, and we left things at that, knowing more work needed to be done on this. I took some time this morning to sort it out a bit more. As I had suggested earlier, it seems the proper way to handle URL parameters is to encode them before you assemble the query string, and only allow _unreserved_ punctuation characters through unencoded. This will prevent conflicts between reserved punctuation in the individual parameters, and the reserved punctuation used to piece together the query string. I implemented this under 3.1.4, as the patch below. I think this is a reasonable approach that does what I had discussed. It now allows URL parameters to be separated by either "&" or ";", as well as "&" (in fact "&;" will work - I hope that's not a problem), but in the URLs that it generates, it will use only ";" instead of "&", for HTML 4.0 compliance. I'd appreciate some extra eyeballs to peer over this code. It compiles quietly on my Red Hat 4.2 Intel Linux system (with gcc-c++-2.7.2.1-2), and it seems to work correctly as far as I can tell. Please let me know if I've done any "no-nos" in this, especially with the encodeInput() macro. If not, I'll adapt it for 3.2 as well. Maybe the unreserved list should go right in URL.h, as the default for encodeURL(). Any preferences on this? Of course, with the parsing of the $%(var) template variable extension, it should also pass this new unreserved list to the encodeURL() call in outputVariable(), for consistency. I don't know how people had intended to use this extension, but it seems more likely to me that it would be used to encode individual parameter values rather than whole query strings, so any reserved character should be encoded. I think we'll also want to change all the default follow-up search forms (nomatch.html, syntax.html, header.html, wrapper.html) to use $&(WORDS), $&(EXCLUDE) and %&(RESTRICT) as the default values for their corresponding input tags. Have I missed any? Feedback is welcome. --- htlib/cgi.cc.oldparm Thu Dec 9 18:28:47 1999 +++ htlib/cgi.cc Tue Feb 15 10:46:51 2000 @@ -92,7 +92,7 @@ cgi::init(char *s) // // Now we need to split the line up into name/value pairs // - StringList list(results, '&'); + StringList list(results, "&;"); // // Each name/value pair now needs to be added to the dictionary --- htsearch/Display.cc.oldparm Tue Feb 15 09:16:54 2000 +++ htsearch/Display.cc Tue Feb 15 11:46:07 2000 @@ -24,6 +24,20 @@ static char RCSid[] = "$Id: Display.cc,v #include "HtURLCodec.h" #include "HtWordType.h" + +// Unreserved punctuation allowed unencoded in URLs. We use a more restricted +// list of unreserved characters than allowed by RFC 2396 (which revises and +// replaces RFC 1738), because it can't hurt to encode any of these +// characters, and they can pose problems in some contexts. RFC 2396 says +// that only alphanumerics, the unreserved characters "-_.!~*'(),", and +// reserved characters used for their reserved purposes may be used +// unencoded within a URL. We encode reserved characters because we now +// encode URL parameter values individually before piecing together the whole +// query string using reserved characters. + +static char unreserved[] = "-_.!~*"; + + //***************************************************************************** // Display::Display(char *indexFile, char *docFile) @@ -560,6 +574,7 @@ Display::createURL(String &url, int page { String s; int i; +#define encodeInput(name) (s = input->get(name), encodeURL(s, unreserved), s.get()) if (strlen(config["script_name"]) != 0) { url << config["script_name"]; @@ -570,34 +585,35 @@ Display::createURL(String &url, int page url << '?'; if (input->exists("restrict")) - s << "restrict=" << input->get("restrict") << '&'; + url << "restrict=" << encodeInput("restrict") << ';'; if (input->exists("exclude")) - s << "exclude=" << input->get("exclude") << '&'; + url << "exclude=" << encodeInput("exclude") << ';'; if (input->exists("config")) - s << "config=" << input->get("config") << '&'; + url << "config=" << encodeInput("config") << ';'; if (input->exists("method")) - s << "method=" << input->get("method") << '&'; + url << "method=" << encodeInput("method") << ';'; if (input->exists("format")) - s << "format=" << input->get("format") << '&'; + url << "format=" << encodeInput("format") << ';'; if (input->exists("sort")) - s << "sort=" << input->get("sort") << '&'; + url << "sort=" << encodeInput("sort") << ';'; if (input->exists("matchesperpage")) - s << "matchesperpage=" << input->get("matchesperpage") << '&'; + url << "matchesperpage=" << encodeInput("matchesperpage") << ';'; if (input->exists("keywords")) - s << "keywords=" << input->get("keywords") << '&'; + url << "keywords=" << encodeInput("keywords") << ';'; if (input->exists("words")) - s << "words=" << input->get("words") << '&'; + url << "words=" << encodeInput("words") << ';'; StringList form_vars(config["allow_in_form"], " \t\r\n"); for (i= 0; i < form_vars.Count(); i++) { if (input->exists(form_vars[i])) { - s << form_vars[i] << '=' << input->get(form_vars[i]) << '&'; + s = form_vars[i]; + encodeURL(s, unreserved); // shouldn't be needed, but just in case + url << s << '='; + url << encodeInput(form_vars[i]) << ';'; } } - s << "page=" << pageNumber; - encodeURL(s); - url << s; + url << "page=" << pageNumber; } //***************************************************************************** -- Gilles R. Detillieux E-mail: Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 ------------------------------------ To unsubscribe from the htdig3-dev mailing list, send a message to htdig3-dev-unsubscribe@htdig.org You will receive a message to confirm this.