Problem
I’m building code that generates HTML automatically, and I want it to encode everything correctly.
Let’s say I want to create a link to the following URL:
http://www.google.com/search?rls=en&q=stack+overflow
All attribute values, I assume, should be HTML-encoded. (If I’m wrong, please correct me.) So that means if I’m putting the above URL into an anchor tag, I should encode the ampersand as &amp;, like this:
<a href="http://www.google.com/search?rls=en&amp;q=stack+overflow">
Is that correct?
Asked by JW.
Solution #1
Yes, it is correct. HTML entities are parsed inside HTML attributes, so a stray & could be ambiguous. That is why, inside all HTML attributes, you should always write &amp; instead of just &.
Only & and quotation marks must be encoded. If your attribute contains special characters like é, you don’t need to encode them to satisfy the HTML parser.
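As a minimal sketch of that rule in generated HTML, assuming Python's standard library is doing the generating (the question does not name a language, and make_link is a hypothetical helper):

```python
from html import escape


def make_link(url: str, text: str) -> str:
    """Build an anchor tag, HTML-escaping the attribute value and the link text."""
    # quote=True (the default) also escapes " and ', so the value cannot
    # break out of the double-quoted attribute.
    return f'<a href="{escape(url, quote=True)}">{escape(text)}</a>'


link = make_link("http://www.google.com/search?rls=en&q=stack+overflow", "search")
print(link)
```

The ampersand in the query string comes out as &amp;, while the rest of the URL is left alone.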
It used to be that URLs containing non-ASCII characters, such as é, required special treatment. Under RFC 1738, which defined URLs at the time, you had to encode them using percent-escapes, which in this case would give %C3%A9. However, RFC 1738 has been superseded by RFC 3986 (URIs, Uniform Resource Identifiers) and RFC 3987 (IRIs, Internationalized Resource Identifiers), on which the WhatWG based its work, starting with HTML5, to define how browsers should behave when they see a URL with non-ASCII characters in it. It’s therefore now safe to include non-ASCII characters in URLs, percent-encoded or not.
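For reference, the percent-escaped form of é can be reproduced with Python's urllib.parse (Python is an assumption here, not something the answer uses). The two bytes are the UTF-8 encoding of the character:

```python
from urllib.parse import quote, unquote

# quote() encodes non-ASCII characters as UTF-8 by default,
# then writes each byte as %XX.
encoded = quote("é")
decoded = unquote(encoded)

print(encoded)
print(decoded)
```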
Answered by zneak
Solution #2
In contexts like this, the ampersand must be escaped, e.g. as &amp;, according to current official HTML recommendations. Browsers, on the other hand, do not require it, and the HTML5 CR proposes to make this the rule, so that special rules apply in attribute values. In this regard, current HTML5 validators are outdated (see bug report with comments).
It will remain possible to escape ampersands in attribute values, but apart from validation with current tools there is no practical need to escape them in href values (and there is a small risk of making mistakes if you start escaping them).
Answered by Jukka K. Korpela
Solution #3
When it comes to URLs in links (<a href), two standards apply.
The first standard is RFC 1866 (HTML 2.0), which lists in “3.2.1. Data Characters” the characters that must be escaped when used as the value for an HTML attribute. (Special characters are not allowed in the attribute names themselves; for example, <a hr&ef="http://… is not allowed, nor is <a hr&amp;ef="http://….)
This was later incorporated into the HTML 4 standard, and the characters that must be escaped are:
< to &lt;
> to &gt;
& to &amp;
" to &quot;
' to &apos;
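As an illustration of this mapping (a sketch using Python's html module, which is an assumption and not part of the answer), html.escape covers the same five characters. Note that it writes the apostrophe as the numeric reference &#x27; rather than &apos;; both decode to the same character:

```python
from html import escape, unescape

special = "<>&\"'"

# quote=True escapes the two quote characters in addition to <, > and &.
escaped = escape(special, quote=True)
print(escaped)

# Decoding restores the original characters, and &apos; and &#x27;
# are two spellings of the same apostrophe.
restored = unescape(escaped)
same_apostrophe = unescape("&apos;") == unescape("&#x27;") == "'"
print(restored, same_apostrophe)
```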
The other standard is RFC 3986, also known as the “Generic URI Standard,” which governs how URLs are handled (this happens when the browser is about to follow a link because the user clicked on the HTML element).
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
It’s critical to escape certain characters so that the client can tell if they’re data or a delimiter.
Example unescaped:
https://example.com/?user=test&password&te&st&goto=https://google.com
Example of the same URL with the data characters percent-escaped:
https://example.com/?user=test&password&te%26st&goto=https%3A%2F%2Fgoogle.com
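To see the difference between data and delimiter, here is a small sketch that parses both query strings with Python's urllib.parse.parse_qsl (using Python is an assumption; a browser or server framework may treat edge cases differently):

```python
from urllib.parse import parse_qsl

raw = "user=test&password&te&st&goto=https://google.com"
escaped = "user=test&password&te%26st&goto=https%3A%2F%2Fgoogle.com"

# keep_blank_values=True keeps parameters that have no "=value" part.
raw_pairs = parse_qsl(raw, keep_blank_values=True)
escaped_pairs = parse_qsl(escaped, keep_blank_values=True)

print(raw_pairs)      # "te" and "st" are split into two separate parameters
print(escaped_pairs)  # "te&st" survives as a single name
```

In the unescaped form the parser has no way to know that the & inside te&st was meant as data, so it treats it as a separator.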
Example of a fully legitimate URL in the value of an HTML attribute, with both layers of escaping applied:
https://example.com/?user=test&amp;password&amp;te%26st&amp;goto=https%3A%2F%2Fgoogle.com
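The two layers can be applied in sequence: percent-encode the data first (RFC 3986), then HTML-escape the finished URL for the attribute. A minimal sketch, assuming Python's standard library; the variable names are illustrative only:

```python
from html import escape, unescape
from urllib.parse import quote

# Layer 1 (URI): percent-encode data so it cannot be mistaken for a delimiter.
# safe="" makes quote() encode "/" as well, which it otherwise leaves alone.
name = quote("te&st", safe="")
target = quote("https://google.com", safe="")
url = f"https://example.com/?user=test&password&{name}&goto={target}"

# Layer 2 (HTML): escape the delimiter ampersands for the attribute value.
attr_value = escape(url, quote=True)
print(f'<a href="{attr_value}">link</a>')

# An HTML parser undoes layer 2 only, handing the URL from layer 1 to the browser.
round_trip = unescape(attr_value)
print(round_trip == url)
```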
Also important scenarios:
I’m creating a new answer because I believe zneak’s answer lacks sufficient examples, fails to distinguish HTML and URI handling as distinct features and standards, and contains a few small flaws.
Answered by Daniel W.
Solution #4
Yes, & should be changed to &amp;.
For questions like this, the W3C’s HTML Validator tool comes in handy. It will inform you of any errors or warnings for a certain page.
Answered by Randy Greencorn
Post is based on https://stackoverflow.com/questions/3705591/do-i-encode-ampersands-in-a-href