XSS Filtering

If you want to protect yourself from an XSS attack, which characters should you escape? I've seen two recommendations:

  • ', ", <, > and & should be converted to &apos;, &quot;, &lt;, &gt;, &amp;
  • Convert anything that isn't ASCII alphanumeric to &#xx;

I've seen the second recommended more and more recently. Which is better?
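To make the two options concrete, here is a minimal Python sketch of each. The function names `escape_minimal` and `escape_everything` are made up for illustration and do not come from any library. The second one assumes decimal `&#NN;` references.

```python
# Hypothetical helpers illustrating the two recommendations.
# Neither name comes from a real library.

_FIVE = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&apos;",
}


def escape_minimal(text: str) -> str:
    """Escape only ', ", <, > and &."""
    return "".join(_FIVE.get(ch, ch) for ch in text)


def escape_everything(text: str) -> str:
    """Replace every character that is not ASCII alphanumeric with &#NN;."""
    return "".join(
        ch if (ch.isascii() and ch.isalnum()) else f"&#{ord(ch)};"
        for ch in text
    )


sample = "<b>Tom & 'Jerry'</b>"
print(escape_minimal(sample))
print(escape_everything(sample))
```

Both functions work one character at a time, so an `&` that is already in the input is escaped exactly once and never double-processed.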

The argument for escaping everything that isn't ASCII alphanumeric

It's a known security tenet that whitelisting is safer than blacklisting. If you're just escaping ', ", <, > and & then you're blacklisting, which isn't as safe as whitelisting.

Here is a practical example of how this can play out:

<a href="$">

(I'm using $ to represent the injection point. This would probably crop up in a template something like this: <a href="<%= escape(userInput) %>">)

If all the escape() function does is escape ', ", <, > and &, then what happens if the user enters a data: URL? You could end up with the following output:

<a href="data:text/html;base64,PHNjcmlwdD5hbGVydCgnWFNTJyk8L3NjcmlwdD4K">test</a>

Which, in case you can't do base64 in your head, is equivalent to this:

<a href="data:text/html,<script>alert('XSS')</script>">test</a>

Clearly this is bad - we've let a user XSS us even though we are filtering for XSS. There are many more examples that are similar.
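If you'd rather not do base64 in your head, a few lines of Python will do the decoding, and will also show that a five-character escape leaves the data: URL untouched. This is only a sketch: it uses the standard library's `html.escape` as a stand-in for the template's escape() function. Note that `html.escape` writes ' as `&#x27;` rather than `&apos;`.

```python
import base64
import html

url = "data:text/html;base64,PHNjcmlwdD5hbGVydCgnWFNTJyk8L3NjcmlwdD4K"

# Split off the base64 payload and decode it.
payload = url.split(",", 1)[1]
decoded = base64.b64decode(payload).decode("ascii")
print(repr(decoded))

# The URL contains none of ', ", <, > or &, so the escape is a no-op.
escaped = html.escape(url, quote=True)
print(escaped == url)
```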

The argument for escaping only ', ", <, > and &

The bad news is that more filtering does not help. If we enhance our escape function to encode every non-alphanumeric character, we get the following output:

<a href="data&#58;text&#47;html&#59;base64&#44;PHNjcmlwdD5hbGVydCgnWFNTJyk8L3NjcmlwdD4K">test</a>

And here's the catch: the above still works. The browser decodes the character references in the attribute value before it interprets the URL, so it sees exactly the same data: URL as before.
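You can imitate that round trip in Python. This sketch assumes that `html.unescape` from the standard library is a fair approximation of the character-reference decoding a browser applies to attribute values. The encoding step is an inline version of the "encode every non-alphanumeric" rule.

```python
import html

url = "data:text/html;base64,PHNjcmlwdD5hbGVydCgnWFNTJyk8L3NjcmlwdD4K"

# Encode everything that is not ASCII alphanumeric as a decimal reference.
encoded = "".join(
    ch if (ch.isascii() and ch.isalnum()) else f"&#{ord(ch)};"
    for ch in url
)
print(encoded)

# Decode the references again, roughly as an HTML parser would.
round_tripped = html.unescape(encoded)
print(round_tripped == url)
```

The encoded form looks harmless to a human reviewer, but the value that reaches the URL handler is unchanged.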

Adding the extra filtering has had the following effect:

  • It has hidden the hole, so we're less likely to notice it and more likely to fall in.
  • It has wasted bandwidth.

So how do we keep ourselves clear of XSS attacks?

The solution is to understand insertion points.

The following insertion points are ones that I believe are safe if ', ", <, > and & are escaped:

  • <div>$</div> (Where div could be p, h*, li, etc - things expecting textual content)
  • <input value="$" ...> (i.e. somewhere else that expects textual content)
  • <script>str = "$";</script> (although a JavaScript string needs different escaping rules from HTML)
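For the last of those, here is one possible set of "different escaping rules", sketched in Python. This is an assumption on my part rather than anything DWR does: it leans on `json.dumps` to produce a double-quoted string literal, then rewrites <, > and & as `\uXXXX` escapes so the value can never contain `</script>`. The name `js_string_literal` is hypothetical.

```python
import json


def js_string_literal(value: str) -> str:
    """Return a quoted JavaScript string literal for use inside <script>.

    The result includes the surrounding double quotes, so the template
    would read: <script>str = $;</script>
    """
    literal = json.dumps(value)  # escapes ", \ and control characters
    return (
        literal.replace("<", "\\u003c")
        .replace(">", "\\u003e")
        .replace("&", "\\u0026")
    )


attack = "</script><script>alert('XSS')</script>"
print(js_string_literal(attack))
```

HTML entities would be the wrong tool here, because the browser does not decode them inside a script block.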

I think virtually any other insertion point is likely to be dangerous. Some examples:

  • <script>$</script> (no amount of escaping will protect you, prepare to die)
  • <div $> (there are countless events we could latch into, including several non-standard, hard to find ones)
  • <div style="$">... (JavaScript pops up in CSS in many places like width:expression(script_here))
  • <a href="$">... (The example we used above)
  • <img src="$"> (For similar reasons)
  • etc.
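Taking the `<a href="$">` case as an example, treating an insertion point on its own terms might look something like the sketch below. It is not a recommendation from this post or a feature of DWR. The allowed-scheme policy and the name `safe_href` are assumptions made for illustration. The idea is to decide whether the URL is acceptable first, and only then apply the usual five-character escape for the attribute value.

```python
import html
from typing import Optional
from urllib.parse import urlsplit

# Assumed policy for this sketch: only absolute http(s) links are allowed.
ALLOWED_SCHEMES = {"http", "https"}


def safe_href(user_input: str) -> Optional[str]:
    """Return an attribute-safe href value, or None if the URL is rejected."""
    try:
        scheme = urlsplit(user_input).scheme  # urlsplit lowercases the scheme
    except ValueError:
        return None
    if scheme not in ALLOWED_SCHEMES:
        # Rejects data:, javascript:, and anything without a scheme.
        return None
    return html.escape(user_input, quote=True)


print(safe_href("https://example.com/?a=1&b=2"))
print(safe_href("data:text/html;base64,PHNjcmlwdD4="))
```

Rejecting anything without a recognised scheme is deliberately strict. It means relative links are refused too, but it avoids guessing how a browser will treat odd leading characters.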

The key is to understand the environment into which we are allowing injection. The trend for separating content, style and action into separate files is good because it more clearly defines the environment, but that doesn't stop HTML from being able to embed CSS.

I once saw some code that was JSP containing Java containing HTML containing CSS and JavaScript containing SQL, all on one line. An environment so confused that it contained its very own security hole built right in.

Filtering in DWR

DWR version 3 is nearly cooked, and our escaping functions use the simpler escaping system of just escaping ', ", <, > and &. If anyone knows of any attack that a broader filtering system would protect people from, then please comment.

