Friday, August 19, 2011

Java: How to Write UTF-8 Chars to Text File With OutputStream

All the regular ASCII chars on the keyboard are encoded identically in UTF-8, so they need no special handling when written to a UTF-8 encoded text file. If you want to write extended chars not on the keyboard with the java.io.OutputStream subclasses, wrap the stream in an OutputStreamWriter and write the char's Unicode code point; the writer encodes it into UTF-8 bytes. For example, the UTF-8 Byte Order Mark (BOM) is the byte sequence EF BB BF, but to write it to a file in Java you write the code point U+FEFF, i.e. "\uFEFF".

Example:

  FileOutputStream fos = new FileOutputStream("File2Hex.txt");
  OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8");
  osw.write("\uFEFF"); // Encoded as the bytes EF BB BF.
  //osw.write(65279); // Optionally we can write with integer code point.
  osw.close(); // Flush the buffered bytes to the file and release it.

Alternatively, you can pass the code point as an int, 65279 (0xFEFF), as in the commented-out line.
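To check which bytes the writer actually produces, here is a minimal sketch that writes to an in-memory buffer instead of a file. The class name BomWriteDemo and the helpers writeWithBom and toHex are made up for illustration; they are not part of the original example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: names are illustrative only.
class BomWriteDemo {
    // Writes the BOM code point followed by the text through a UTF-8 writer
    // and returns the raw bytes that would end up in the file.
    static byte[] writeWithBom(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer osw = new OutputStreamWriter(buffer, StandardCharsets.UTF_8)) {
            osw.write('\uFEFF'); // the code point, not the three bytes
            osw.write(text);
        } // closing the writer flushes everything into the buffer
        return buffer.toByteArray();
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(toHex(writeWithBom("A"))); // prints EF BB BF 41
    }
}
```

The single code point U+FEFF becomes three bytes, while the ASCII letter A stays one byte (41).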

If you are writing to a file with a single-byte encoding such as Windows 1252 (Cp1252) or ISO8859-1, you can write the three BOM bytes as the chars those bytes represent in that encoding, as shown below.

  FileOutputStream fos = new FileOutputStream("File2Hex.txt");
  OutputStreamWriter osw = new OutputStreamWriter(fos); // System default encoding.
  char[] c = {0xEF, 0xBB, 0xBF}; // Cp1252 or ISO8859_1
  //int[] c = {239, 187, 191}; // Optional int representation.
  //char[] c = {'ï', '»', '¿'}; // Optional char representation.
  for(int i=0; i<c.length; i++)
    osw.write(c[i]);
  osw.close(); // Flush the buffered bytes to the file and release it.
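This trick only works while the writer really uses a single-byte encoding. The sketch below illustrates why, using ISO-8859-1 as a stand-in for Windows 1252 (an assumption made because ISO-8859-1 is guaranteed to exist in every JVM). The class name Latin1BomDemo and its helpers are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: names are illustrative only.
class Latin1BomDemo {
    // The three chars whose single-byte values are EF, BB and BF.
    static final String BOM_CHARS = "\u00EF\u00BB\u00BF";

    // Encodes the three chars with the given charset and returns the bytes as hex.
    static String encodeAsHex(Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : BOM_CHARS.getBytes(charset)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Single-byte encoding: one byte per char, which yields the BOM bytes.
        System.out.println(encodeAsHex(StandardCharsets.ISO_8859_1)); // prints EF BB BF
        // UTF-8: each char above 0x7F becomes two bytes, so no BOM is produced.
        System.out.println(encodeAsHex(StandardCharsets.UTF_8)); // prints C3 AF C2 BB C2 BF
    }
}
```

If the default encoding of the JVM happens to be UTF-8, the fragment above would therefore write six bytes rather than the three BOM bytes.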

By default the file will be written in the system's native encoding. On Windows this would typically be the Windows 1252 encoding (JDK 18 and later default to UTF-8 on every platform). If the BOM bytes are written at the beginning and the rest of the text is plain ASCII, the file is byte-for-byte identical to a UTF-8 encoded file. In that case, Windows Notepad and other text editors will see it as UTF-8 and indicate this in File->"Save As" or in the file properties. The reason is that all the ASCII chars on the keyboard have the same byte values in UTF-8 as they do in Windows 1252 and other similar encodings. Chars above 0x7F differ: they take one byte in Windows 1252 or ISO8859-1, while UTF-8 encodes them in 2 to 4 bytes.

The BOM bytes can also be written as integer values or as the Windows 1252 literal characters, as shown in the commented-out lines of the Cp1252 code fragment above.
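The claim that ASCII text has the same bytes in both encodings, and that UTF-8 uses between 1 and 4 bytes per character, can be checked with a short sketch. ISO-8859-1 again stands in for Windows 1252 (an assumption), and the class EncodingCompareDemo with its helpers is made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: names are illustrative only.
class EncodingCompareDemo {
    // True when the text encodes to identical bytes in UTF-8 and ISO-8859-1.
    static boolean sameBytes(String text) {
        return Arrays.equals(text.getBytes(StandardCharsets.UTF_8),
                             text.getBytes(StandardCharsets.ISO_8859_1));
    }

    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(sameBytes("Hello, world!")); // prints true (ASCII only)
        System.out.println(sameBytes("\u00EF"));        // prints false (U+00EF is above 0x7F)

        System.out.println(utf8Length("A"));            // prints 1
        System.out.println(utf8Length("\u00EF"));       // prints 2
        System.out.println(utf8Length("\u20AC"));       // prints 3 (euro sign)
        System.out.println(utf8Length("\uD83D\uDE00")); // prints 4 (U+1F600, a surrogate pair)
    }
}
```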

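If you are using the NetBeans IDE and you are writing the file in the default Windows 1252 encoding but write the BOM with the Unicode code point "\uFEFF", it will work if you run the program from within NetBeans. If you compile the source into a JAR file and run the JAR file by double-clicking on it, the BOM is downgraded to a single 0x3F byte, the question mark char (?). Windows 1252 has no char for U+FEFF, so the encoder substitutes "?". It presumably works inside NetBeans because the IDE launches the program with a different default encoding.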
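The substitution can be reproduced without NetBeans or a JAR file. The following sketch pushes the BOM code point through an OutputStreamWriter with an explicit charset; ISO-8859-1 is used in place of Windows 1252 (an assumption, since neither encoding can represent U+FEFF), and the class UnmappableBomDemo is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: names are illustrative only.
class UnmappableBomDemo {
    // Writes the code point U+FEFF with the given charset and returns the bytes produced.
    static byte[] encodeBom(Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(buffer, charset)) {
            writer.write("\uFEFF");
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] latin1 = encodeBom(StandardCharsets.ISO_8859_1);
        // The charset cannot represent U+FEFF, so the writer emits its replacement byte.
        System.out.println(latin1.length + " byte: 0x"
                + String.format("%02X", latin1[0] & 0xFF)); // prints 1 byte: 0x3F

        byte[] utf8 = encodeBom(StandardCharsets.UTF_8);
        System.out.println(utf8.length + " bytes"); // prints 3 bytes
    }
}
```

Passing the charset explicitly to OutputStreamWriter, as in the first example of this post, avoids depending on whatever default the program happens to be launched with.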
