Character Encodings

A character encoding is a mapping from a set of characters to their on-disk representation. jEdit can use any encoding supported by the Java platform.
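Since the set of usable encodings comes from the Java platform, one way to check whether a given name is available is to ask the Java runtime directly. The following is a minimal sketch, not part of jEdit; the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;

final class EncodingSupport {
    // Hypothetical helper: true if this Java runtime knows the named encoding.
    static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for syntactically illegal names, such as one containing a space.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("UTF-8 supported: " + isSupported("UTF-8"));
        System.out.println("Encodings available: " + Charset.availableCharsets().size());
    }
}
```

The number of available encodings varies between Java runtimes, so the second line of output will differ from one installation to another.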

Buffers in memory are always stored in UTF-16, which means each character is stored as one or two 16-bit code units, each an integer between 0 and 65535. UTF-16 is the native encoding of the Java platform, and it covers a large enough range of characters to support most modern languages.
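One detail of UTF-16 is that characters beyond the 16-bit range occupy two 16-bit values (a surrogate pair) rather than one. The sketch below uses only standard Java string methods to show the difference between storage units and characters; the class and method names are illustrative.

```java
final class Utf16Units {
    // Number of 16-bit code units Java uses to store the text.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode characters (code points) in the text.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String accented = "\u00e9"; // e with acute accent, fits in 16 bits
        String emoji = new String(Character.toChars(0x1F600)); // beyond 16 bits
        System.out.println(codeUnits(accented) + " unit(s), " + codePoints(accented) + " character(s)");
        System.out.println(codeUnits(emoji) + " unit(s), " + codePoints(emoji) + " character(s)");
    }
}
```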

When a buffer is loaded, it is converted from its on-disk representation to UTF-16 using a specified encoding.
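The conversion can be pictured with plain Java: the same character has different on-disk bytes under different encodings, and decoding with the matching encoding recovers it. This is only a sketch of the idea, not jEdit's loading code; `BufferDecoding` and `decode` are hypothetical names.

```java
import java.nio.charset.Charset;

final class BufferDecoding {
    // Hypothetical helper: turns on-disk bytes into an in-memory (UTF-16) string.
    static String decode(byte[] onDisk, String encodingName) {
        return new String(onDisk, Charset.forName(encodingName));
    }

    public static void main(String[] args) {
        // The letter e with an acute accent, as stored by two different encodings.
        byte[] asUtf8 = {(byte) 0xC3, (byte) 0xA9};
        byte[] asLatin1 = {(byte) 0xE9};
        // Both decode to the same in-memory character when the right encoding is named.
        System.out.println(decode(asUtf8, "UTF-8").equals(decode(asLatin1, "ISO-8859-1")));
    }
}
```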

The default encoding, used to load files for which no other encoding is specified, can be set in the Encodings pane of the Utilities>Options dialog box; see the section called “The Encodings Pane”. Unless you change this setting, it will be your operating system's native encoding, for example MacRoman on Mac OS, windows-1252 on Windows, and ISO-8859-1 on Unix.

An encoding can be explicitly set when opening a file in the file system browser's Commands>Encoding menu.

Note that there is no general way to auto-detect the encoding used by a file. However, jEdit supports “encoding detectors”: some are provided in the core, and others may be provided by plugins through the services API. In the Encodings option pane (see the section called “The Encodings Pane”), you can customize which detectors are used and the order in which they are tried. Here are some of the encoding detectors recognized by jEdit:

BOM: UTF-16 and UTF-8Y files are auto-detected because they begin with a fixed byte sequence (a byte order mark). Note that plain UTF-8 does not mandate a specific header and thus cannot be auto-detected, unless the file in question is an XML file.
XML-PI: Encodings used in XML files with an XML PI like the following are auto-detected:
```
<?xml version="1.0" encoding="UTF-8"?>
```
html: Encodings specified in HTML files with a content= attribute in a meta element may be auto-detected:
```
<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"> 
```
python: Python has its own way of specifying encoding at the top of a file.
```
# -*- coding: utf-8 -*-                
```
buffer-local-property: Use the buffer-local property syntax (see the section called “Buffer-Local Properties”) at the top of the file to specify the encoding.
```
#  :encoding=ISO-8859-1:
```
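To illustrate the idea behind the BOM detector, the sketch below checks the first bytes of a file for the standard byte order marks. It is a simplified illustration and not jEdit's actual detector; the class name `BomSniffer` and its methods are hypothetical, and "UTF-8Y" is used only because it is the name the list above gives to UTF-8 with a signature.

```java
final class BomSniffer {
    // Hypothetical helper: returns an encoding name if the data starts with
    // a known byte order mark, or null if no mark is present.
    static String detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8Y"; // UTF-8 with a signature, as named in the text above
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            // Simplification: UTF-32LE files also begin with FF FE.
            return "UTF-16LE";
        }
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as an unsigned byte value.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
        System.out.println(detect(sample));
    }
}
```

A file with no mark yields null here, which corresponds to the case where another detector, or the default encoding, has to decide.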

The encoding that will be used to save the current buffer is shown in the status bar, and can be changed in the Utilities>Buffer Options dialog box. Note that changing this setting has no effect on the buffer's contents; if you opened a file with the wrong encoding and got garbage, you will need to reload it. An easy way to do this is File>Reload with Encoding.
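The following sketch shows, in plain Java, why a reload is needed: once bytes have been decoded with the wrong encoding, choosing a different encoding for saving only re-encodes the garbage, while decoding the original bytes again with the right encoding recovers the text. `ReloadDemo`, `load`, and `save` are hypothetical names, not jEdit functions.

```java
import java.nio.charset.Charset;

final class ReloadDemo {
    // Hypothetical stand-in for loading: decode on-disk bytes with an encoding.
    static String load(byte[] onDisk, String encoding) {
        return new String(onDisk, Charset.forName(encoding));
    }

    // Hypothetical stand-in for saving: encode buffer text with an encoding.
    static byte[] save(String buffer, String encoding) {
        return buffer.getBytes(Charset.forName(encoding));
    }

    public static void main(String[] args) {
        // A UTF-8 file containing a single e with an acute accent.
        byte[] onDisk = {(byte) 0xC3, (byte) 0xA9};

        String garbled = load(onDisk, "ISO-8859-1"); // wrong encoding: two characters
        byte[] resaved = save(garbled, "UTF-8");     // four bytes, still garbage
        String reloaded = load(onDisk, "UTF-8");     // right encoding: one character

        System.out.println(garbled.length() + " " + resaved.length + " " + reloaded.length());
    }
}
```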

If a file is opened without an explicit encoding specified and it appears in the recent file list, jEdit will use the encoding last used when working with that file; otherwise the default encoding will be used.

Commonly Used Encodings

While the world is slowly converging on UTF-8 and UTF-16 encodings for storing text, a wide range of older encodings is still in widespread use, and Java supports most of them.

The simplest character encoding still in use is ASCII, or “American Standard Code for Information Interchange”. ASCII encodes Latin letters used in English, in addition to numbers and a range of punctuation characters. Each ASCII character consists of 7 bits, so there is a limit of 128 distinct characters, which makes it unsuitable for anything other than English text. jEdit will load and save files as ASCII if the US-ASCII encoding is used.
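The 128-character limit can be seen with Java's standard US-ASCII charset: text containing an accented letter cannot be represented, and encoding it anyway substitutes a question mark for the unmappable character. This is a sketch using standard Java APIs; `AsciiLimit` and its methods are hypothetical names, and it says nothing about how jEdit itself reports such characters.

```java
import java.nio.charset.StandardCharsets;

final class AsciiLimit {
    // True if every character of the text exists in ASCII.
    static boolean fitsInAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    // Encodes the text as ASCII; Java replaces unmappable characters with '?'.
    static byte[] saveAsAscii(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String word = "caf\u00e9"; // "cafe" with an acute accent on the e
        System.out.println(fitsInAscii(word));
        System.out.println(new String(saveAsAscii(word), StandardCharsets.US_ASCII));
    }
}
```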

Because ASCII is unsuitable for international use, most operating systems use an 8-bit extension of ASCII, with the first 128 values mapped to the ASCII characters, and the rest used to encode accents, umlauts, and various less commonly used typographical marks. The three major operating systems all extend ASCII in a different way. Files written by Macintosh programs can be read using the MacRoman encoding; Windows text files are usually stored as windows-1252. In the Unix world, the 8859_1 character encoding (ISO-8859-1) has found widespread usage.
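As a small illustration of how these extensions agree on the ASCII range but differ above it, the sketch below decodes a single byte under several encodings. The class and method names are hypothetical; MacRoman is guarded with an availability check because not every Java runtime ships it.

```java
import java.nio.charset.Charset;

final class EightBitExtensions {
    // Decodes one byte under the named encoding and returns its Unicode code point.
    static int decodeByte(int value, String encoding) {
        String s = new String(new byte[] {(byte) value}, Charset.forName(encoding));
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        String[] names = {"windows-1252", "ISO-8859-1", "MacRoman"};
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not available in this runtime");
                continue;
            }
            // 0x41 is inside the shared ASCII range; 0x80 is outside it.
            System.out.printf("%s: 0x41 -> U+%04X, 0x80 -> U+%04X%n",
                    name, decodeByte(0x41, name), decodeByte(0x80, name));
        }
    }
}
```

Under windows-1252 the byte 0x80 is the euro sign (U+20AC), while ISO-8859-1 maps it to the control character U+0080, so a file written with one encoding and read with the other shows the wrong character.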

On Windows, various other encodings, referred to as code pages and identified by number, are used to store non-English text. The corresponding Java encoding name is windows- followed by the code page number, for example windows-1251.

Many common cross-platform international character sets are also supported: KOI8_R for Russian text, Big5 and GBK for Chinese, and SJIS for Japanese.
