Unicode security issues in PHP

Friday, October 23rd, 2009 | PHP, Vulnerability Analysis, Security, sseq-lib | 5 comments

For the last three days I put other things aside to work on this: a couple of Unicode issues in PHP and Firefox. As you can see, securing web applications is also about knowing and understanding how data is encoded and converted. To me it was obvious that I had to find out how to cope with these problems inside the security library SSEQ-LIB.

What are these vulnerabilities about?

Once again it is all about not checking or encoding user input, which we all know is evil. Let's learn something about overlong UTF-8:

Overlong UTF-8 (non-shortest form)

First go read what sirdarckcat wrote about overlong UTF-8!

So you're back? Let us see how an attacker can manufacture such an overlong UTF-8 sequence. Take the apostrophe (') as an example. Converted to binary it is 00100111. (Strictly speaking, what we put into the converter is the apostrophe's numeric character code, which is 39; see asciitable.)
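The same conversion can be done directly in PHP. This is only a small sketch; the variable names are chosen for illustration:

```php
<?php
// Look up the numeric code of the apostrophe and format it as 8 binary digits.
$code = ord("'");                  // numeric character code of the apostrophe
$binary = sprintf('%08b', $code);  // zero-padded binary representation

echo $code, ' => ', $binary, PHP_EOL; // prints: 39 => 00100111
```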

So we now have the binary string 00100111, which means 39, which in turn corresponds to the apostrophe. Now we are going to make it overlong. But first we must look at how UTF-8 is encoded. Look closely at the bit patterns in the columns Byte 1 to Byte 4: the leading ones and zeros are very important because they tell the UTF-8 decoder how long the entire character is and which bytes belong to it.

OK, back again? We want to enlarge a UTF-8 character by one more byte, so we look at the second row of the table from Wikipedia: the first byte has to start with "110" because that means this UTF-8 character is 2 bytes long. The rest we can fill with zeros: 11000000. That is our first byte.

The second byte has to carry the original value 39, which is our apostrophe. We already know that 39 in binary is 00100111. Too bad that this bit string does not match the UTF-8 definition for second bytes: they have to start with "10". A second byte carries six payload bits, and 39 fits into six bits (100111), so we simply replace the first 2 bits with "10" and we're done! Our second byte is: 10100111.

We put them together: 11000000 10100111
We convert each to hexadecimal: \xc0 \xa7 or url encoded: %c0%a7.
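The construction above can be sketched in a few lines of PHP. The helper name overlong_two_byte() is made up for this illustration and is not part of PHP or SSEQ-LIB; it only handles single ASCII characters:

```php
<?php
// Build the overlong (non-shortest) 2-byte form of a single ASCII character.
// overlong_two_byte() is a hypothetical helper used only for this sketch.
function overlong_two_byte(string $char): string
{
    $code = ord($char);
    if ($code > 0x7F) {
        throw new InvalidArgumentException('expects a single ASCII character');
    }
    $byte1 = 0xC0 | ($code >> 6);   // "110" prefix; remaining bits are all zero for codes below 64
    $byte2 = 0x80 | ($code & 0x3F); // "10" prefix followed by the lower six bits
    return chr($byte1) . chr($byte2);
}

$overlong = overlong_two_byte("'");
echo bin2hex($overlong), PHP_EOL;      // c0a7
echo rawurlencode($overlong), PHP_EOL; // %C0%A7
```

rawurlencode() emits uppercase hex digits; %C0%A7 and %c0%a7 decode to the same two bytes.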

What’s wrong with overlong UTF-8

It is well known that interpreting non-shortest-form UTF-8 is a security issue. Unfortunately PHP does interpret overlong UTF-8. This is not a security breach in itself. The point is that other software such as web application firewalls, vulnerability scanners and even functions like addslashes() do not interpret overlong UTF-8, so attack vectors or characters that should be escaped can pass through unidentified.

So when you escape database input like this:

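A minimal sketch of this kind of escaping, assuming the table and column from the query in this post. build_user_query() is a hypothetical helper; in a real application the name would come from a request parameter:

```php
<?php
// build_user_query() is a hypothetical helper used only for this sketch.
function build_user_query(string $name): string
{
    // addslashes() puts a backslash before ', ", \ and NUL bytes.
    return "SELECT * FROM table WHERE name='" . addslashes($name) . "'";
}

echo build_user_query("O'Brien"), PHP_EOL;
// prints: SELECT * FROM table WHERE name='O\'Brien'
```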

Or when you rely on magic_quotes (and you should not!):


Just hope that no one submits the following as "name": %c0%a7%20OR%201%2F%2A, which would result in a query that asks for all users in the database:
SELECT * FROM table WHERE name='' OR 1/*'
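The whole chain can be traced in a short sketch. lenient_decode() is a hypothetical stand-in for any decoder that accepts non-shortest forms; it is an assumption made for this illustration, not a PHP function:

```php
<?php
// lenient_decode() is a hypothetical decoder that accepts non-shortest forms:
// it maps every 110xxxxx 10xxxxxx byte pair to one byte, without a range check.
function lenient_decode(string $bytes): string
{
    return preg_replace_callback(
        '/([\xC0-\xDF])([\x80-\xBF])/',
        function (array $m): string {
            $code = ((ord($m[1]) & 0x1F) << 6) | (ord($m[2]) & 0x3F);
            return $code < 256 ? chr($code) : '?';
        },
        $bytes
    );
}

$raw = rawurldecode('%c0%a7%20OR%201%2F%2A'); // bytes C0 A7 followed by " OR 1/*"
$escaped = addslashes($raw);                  // nothing to escape: no ', ", \ or NUL byte
$decoded = lenient_decode($escaped);          // the apostrophe only appears at this point
$query = "SELECT * FROM table WHERE name='" . $decoded . "'";

var_dump($escaped === $raw); // bool(true)
echo $query, PHP_EOL;        // SELECT * FROM table WHERE name='' OR 1/*'
```

The escaping step runs before the apostrophe exists, which is why the order of decoding and escaping matters.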

To sum up: addslashes() and magic_quotes are not able to interpret overlong UTF-8, so it passes through without being escaped.

What can we do about it?

I spent some time figuring out how to check whether non-shortest UTF-8 data contains a potentially dangerous payload. In the end, the most precise solution seems to me to be counting the special characters before and after utf8_decode(). This works because this kind of attack relies on smuggling in additional special characters, which stay hidden until utf8_decode() reveals them. So after decoding we should count more special characters than before.

When converting to a narrower encoding, such as from UTF-8 to iso-xxxxxx-x, characters that cannot be represented are replaced by a question mark (?). These question marks must not be counted.

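This function tells potentially dangerous overlong UTF-8 apart from harmless overlong UTF-8:

function seq_mb_count_symbols_($string_ = '') {
    // Count ASCII symbol bytes; the question mark (63) lies outside all ranges.
    $count = 0;
    for ($i = 0; $i < strlen($string_); $i++) {
        $ch = $string_[$i];
        if ( ((ord($ch) >= 33 && ord($ch) <= 62) || (ord($ch) >= 91 && ord($ch) <= 96) || (ord($ch) >= 123 && ord($ch) <= 126)) ) {
            $count++;
        }
    }
    return $count;
}

function seq_check_nonshortest_utf8_($string_ = '') {
    // Compare the symbol count before and after decoding.
    $count = seq_mb_count_symbols_(stripslashes($string_));
    $after_count = seq_mb_count_symbols_(utf8_decode(stripslashes($string_)));
    if ($after_count > $count) {
        return true;
    }
    return false;
}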

Check if string is dangerous:

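A stricter variant of such a check, sketched here as an alternative to the counting approach rather than as the SSEQ-LIB method: reject any input that is not well-formed UTF-8 at all. It relies on preg_match() returning false for a malformed subject when the u modifier is set, and on PCRE treating overlong forms as malformed. is_wellformed_utf8() is a hypothetical helper name:

```php
<?php
// is_wellformed_utf8() is a hypothetical helper used only for this sketch.
// With the "u" modifier, preg_match() returns false if the subject is not
// valid UTF-8; the empty pattern matches every valid string.
function is_wellformed_utf8(string $input): bool
{
    return preg_match('//u', $input) === 1;
}

var_dump(is_wellformed_utf8("O'Brien"));         // plain ASCII
var_dump(is_wellformed_utf8("\xC3\xA9"));        // shortest-form two-byte character
var_dump(is_wellformed_utf8("\xC0\xA7 OR 1/*")); // overlong apostrophe
```

Unlike the counting function, this rejects harmless overlong input as well, so it suits applications that can afford to refuse malformed data outright.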

Tell me if it works for you too, especially if your OS uses some special encoding.

Additional thematic links

Bypass addslashes with UTF-8 characters
