Monday, August 15, 2011

Java: UTF-8 and Optional Byte Order Mark

UTF-8 files may begin with an optional byte order mark (BOM). A BOM indicates whether the encoding stores the low-order byte at the highest address (big endian) or at the lowest address (little endian). UTF-8 has no big-endian or little-endian variants, so the BOM is not really needed. Even so, many text editors look for a BOM to identify UTF-8, and those editors most likely write one when saving a file as UTF-8. For example, since Windows 2000, Notepad writes a BOM at the beginning of any file saved as UTF-8. Other text editors, like Programmer's Notepad, have an option to exclude the BOM.
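To make the byte-order point concrete, here is a small sketch that encodes the BOM character (U+FEFF) in three encodings and prints the resulting bytes. The class name BomBytes and its helper methods are made up for this illustration, and the code assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class BomBytes {
    // Formats bytes as space-separated upper-case hex, e.g. "EF BB BF".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the single character U+FEFF in the given charset.
    static String bomBytes(Charset cs) {
        return hex("\uFEFF".getBytes(cs));
    }

    public static void main(String[] args) {
        System.out.println("UTF-8:    " + bomBytes(StandardCharsets.UTF_8));    // EF BB BF
        System.out.println("UTF-16BE: " + bomBytes(StandardCharsets.UTF_16BE)); // FE FF
        System.out.println("UTF-16LE: " + bomBytes(StandardCharsets.UTF_16LE)); // FF FE
    }
}
```

The two UTF-16 variants produce the same two bytes in opposite order, which is what lets a reader detect the byte order. UTF-8 always produces the same three bytes, so there the mark only signals "this is UTF-8".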

Java I/O does not look for a BOM and will read it as if it were valid data in the file. This causes a problem, since the BOM is not data. To remedy this, one can use a PushbackInputStream to read the first 3 bytes of the file and see if they match the UTF-8 BOM. If they do, just discard them and carry on about your business.
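To see the problem for yourself, the following sketch decodes a small in-memory "file" that starts with a BOM and prints the first character the reader returns. The class name BomLeakDemo and the sample bytes are made up for this illustration, and the code assumes Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class BomLeakDemo {
    // Decodes the bytes as UTF-8 and returns the first character read,
    // or -1 if the input is empty.
    static int firstChar(byte[] data) throws IOException {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            return r.read();
        }
    }

    public static void main(String[] args) throws IOException {
        // A UTF-8 BOM followed by the text "hi"
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        // The BOM is decoded as the character U+FEFF instead of being skipped
        System.out.printf("U+%04X%n", firstChar(withBom)); // prints U+FEFF
    }
}
```

That stray U+FEFF is invisible when printed, which is why it tends to show up as a mysterious failure when comparing or parsing the first line of a file.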

The UTF-8 BOM sequence is EF BB BF. So one can check to see if these match the first 3 bytes (or in this case, check if they don't match) by using the following code:

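//Check and skip the UTF-8 BOM
//Needs java.io.BufferedReader, InputStream, InputStreamReader, PushbackInputStream
  byte[] b = new byte[3];
  InputStream is = null;
  PushbackInputStream pbis = null;
  BufferedReader br = null;
  try {
     is = getClass().getResourceAsStream("/text/example.txt");
     pbis = new PushbackInputStream(is, 3);
     //Peek at the first 3 bytes; n is how many were actually read
     int n = pbis.read(b, 0, 3);
     //Not a BOM: push the bytes back so they are read as data
     if(n > 0 && (n < 3 || b[0] != (byte)0xef || b[1] != (byte)0xbb || b[2] != (byte)0xbf))
        pbis.unread(b, 0, n);
     br = new BufferedReader(new InputStreamReader(pbis, "UTF-8"));
     . . .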
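
For anyone who wants something to paste and run, the fragment above can be expanded into a self-contained sketch along these lines. The class name Utf8BomSkipper and the methods skipBom and firstLine are my own names for illustration, the input is an in-memory byte array rather than a resource file, and the code assumes Java 7 or later. Treat it as one possible approach, not the definitive one.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of any library.
class Utf8BomSkipper {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Returns a stream positioned just past a leading UTF-8 BOM, if one is present.
    static InputStream skipBom(InputStream in) throws IOException {
        PushbackInputStream pbis = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF
        while (n < head.length) {
            int r = pbis.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean isBom = n == UTF8_BOM.length && Arrays.equals(head, UTF8_BOM);
        if (!isBom && n > 0) {
            // Not a BOM: push the bytes back so they are read as data
            pbis.unread(head, 0, n);
        }
        return pbis;
    }

    // Reads the first line of UTF-8 text, ignoring a leading BOM.
    static String firstLine(byte[] data) throws IOException {
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                skipBom(new ByteArrayInputStream(data)), StandardCharsets.UTF_8))) {
            return br.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] withoutBom = {'h', 'i'};
        System.out.println(firstLine(withBom));    // prints hi
        System.out.println(firstLine(withoutBom)); // prints hi
    }
}
```

The loop around read() is there because a single call is allowed to return fewer than 3 bytes, and the pushback uses the actual count so that files shorter than 3 bytes are passed through unchanged.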

Here is a link to a couple of helpful classes that will recognize all Unicode BOM formats:

Some people view this behavior as a bug. I don't, because the UTF-8 specification does not require the BOM. One can argue that Java should detect the BOM and throw it away, but that does not make its absence a bug. It is just a missing feature.
