Vocabulary file formats

There are two types of files involved: the vocabulary files that Phrasier imports, and the CHS files storing the flash card practice sessions. The latter is a fixed format used by Phrasier, whereas the vocabulary files may take a number of different formats since they may come from various sources.

Vocabulary file formats

Phrasier can read several vocabulary file formats. The original format used by Phrasier is the CHPU format, but this has now been superseded by the simpler tabular format. Phrasier can also read Chinese vocabulary files in the CEDict format: these contain both traditional and simplified characters. Phrasier should also be able to read vocabulary files in the EDict format, which was originally made for Japanese but should also work with other languages, although I have not tested this thoroughly. All these formats are text files which can be created in a text editor; for use with Phrasier, they should all be saved using UTF-8 encoding. Although the CEDict and EDict formats are, technically speaking, fixed, there may be variants of them in use, so I have implemented a few different rules to recognise different variants.

If you want to make your own vocabulary files, feel free to contact me any time if you have any questions. I may also be able to convert vocabulary files between different formats.

Tabular format

This is the format preferred by Phrasier. The tabular format is a simple tab-separated file, with each line on the form

phrase[tab]pronunciation[tab]translation[tab]comment

In addition, lines starting with # are ignored. This format may be used with any language: Chinese, Japanese, European languages, ...
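As a rough illustration of the tabular format, one line could be split into its fields as in the sketch below. This is not Phrasier's own code: the names Term and parse_tabular_line are hypothetical, and the handling of blank lines and of missing trailing fields is an assumption.

```python
from typing import NamedTuple, Optional


class Term(NamedTuple):
    """One vocabulary term (hypothetical container, not a Phrasier class)."""
    phrase: str
    pronunciation: str
    translation: str
    comment: str


def parse_tabular_line(line: str) -> Optional[Term]:
    """Parse one line of a tabular vocabulary file; return None if it is skipped."""
    line = line.rstrip("\r\n")
    # Lines starting with '#' are ignored. Skipping blank lines is an assumption.
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.split("\t")
    # Assumption: missing trailing fields (e.g. no comment) become empty strings.
    fields += [""] * (4 - len(fields))
    return Term(*fields[:4])


if __name__ == "__main__":
    print(parse_tabular_line("你好\tni3 hao3\thello\tgreeting"))
```

A whole file would be read with open(path, encoding="utf-8") and each line passed through this function, dropping the None results.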

When importing the files into Phrasier, it is important to select the matching language variant, e.g. Chinese tabular format if the files contain Chinese. This allows Phrasier to interpret the tone marks: Pinyin is written on the form word#, where # is the tone number, and Phrasier will convert this to Pinyin with tone marks, and can also produce Bopomofo phonetics from it.
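The conversion from numbered Pinyin to tone marks can be sketched as follows. This uses the commonly taught placement rule (mark a or e if present, the o in ou, otherwise the last vowel) and is not taken from Phrasier's source. The function names, the treatment of "u:" and "v" as ü, and the treatment of tones 0 and 5 as neutral are all assumptions.

```python
import re

# Tone-marked variants of each vowel, indexed by tone number 1-4.
TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}


def mark_syllable(syllable: str) -> str:
    """Convert one numbered syllable such as 'hao3' to its tone-marked form."""
    match = re.fullmatch(r"([A-Za-zü:]+)([0-5])", syllable)
    if not match:
        return syllable  # not numbered Pinyin: leave unchanged
    letters, tone = match.group(1), int(match.group(2))
    # Assumption: 'u:' and 'v' are alternative spellings of 'ü'.
    letters = letters.replace("u:", "ü").replace("v", "ü")
    if tone in (0, 5):
        return letters  # assumption: 0 and 5 both denote the neutral tone
    lower = letters.lower()
    if "a" in lower:
        idx = lower.index("a")
    elif "e" in lower:
        idx = lower.index("e")
    elif "ou" in lower:
        idx = lower.index("o")
    else:
        vowels = [i for i, ch in enumerate(lower) if ch in TONE_MARKS]
        if not vowels:
            return letters
        idx = vowels[-1]
    marked = TONE_MARKS[lower[idx]][tone - 1]
    if letters[idx].isupper():
        marked = marked.upper()
    return letters[:idx] + marked + letters[idx + 1:]


def to_tone_marks(pinyin: str) -> str:
    """Convert space-separated numbered Pinyin, e.g. 'pin1 yin1'."""
    return " ".join(mark_syllable(s) for s in pinyin.split())


if __name__ == "__main__":
    print(to_tone_marks("pin1 yin1"))
```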

CEDict Chinese vocabulary format

The CEDict format is a text file with one line per term, using the format

traditional simplified [pinyin] /translations/

The traditional and simplified Chinese should not contain spaces, since space is used as the separator. The Pinyin is written using a number at the end of each word to indicate the tone (e.g. pin1 yin1), and the words (corresponding to characters) are separated by spaces. There may be one or more translations, separated by "/".
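A line in the basic CEDict layout can be picked apart with a regular expression, as in this minimal sketch. It handles only the standard form with [] around the Pinyin; CedictEntry and parse_cedict_line are hypothetical names, and skipping lines that start with # is an assumption.

```python
import re
from typing import List, NamedTuple, Optional


class CedictEntry(NamedTuple):
    """One CEDict term (hypothetical container, not a Phrasier class)."""
    traditional: str
    simplified: str
    pinyin: str
    translations: List[str]


# traditional simplified [pinyin] /translation 1/translation 2/
CEDICT_LINE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def parse_cedict_line(line: str) -> Optional[CedictEntry]:
    """Parse one CEDict line; return None if it does not match the basic form."""
    if line.startswith("#"):  # assumption: '#' lines are comments
        return None
    match = CEDICT_LINE.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, body = match.groups()
    translations = [t for t in body.split("/") if t]
    return CedictEntry(traditional, simplified, pinyin, translations)


if __name__ == "__main__":
    print(parse_cedict_line("中國 中国 [Zhong1 guo2] /China/Middle Kingdom/"))
```

Variants of the format would need additional patterns tried in turn when this one returns None.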

There are several variations of this format, e.g. using tabs as separators and not having [] around the Pinyin. Some use capital letters in the Pinyin of names, although CEDict specifies that this should preferably not be done. I have tried to implement support for a few different variants.

EDict vocabulary format

The basic EDict format is similar to CEDict (since CEDict was inspired by EDict):

phrase [pronunciation] /translations/

or

phrase /translations/

EDict also allows the translation to start with a ()-enclosed general information field. For Japanese, which EDict was made for, the encoding is specified and is NOT UTF-8, so you may have to convert any Japanese vocabulary files in EDict format to UTF-8 before Phrasier can read them.
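Re-encoding such a file to UTF-8 is a short job in most scripting languages. The sketch below assumes the source file is in EUC-JP, which is how Japanese EDict files are commonly distributed; the text above does not name the encoding, so check your own file. The function names are hypothetical.

```python
from pathlib import Path


def reencode(data: bytes, source_encoding: str = "euc_jp") -> bytes:
    """Decode bytes from the source encoding and re-encode them as UTF-8."""
    # Assumption: EUC-JP is the source encoding; pass another name if needed.
    return data.decode(source_encoding).encode("utf-8")


def convert_file(src: str, dst: str, source_encoding: str = "euc_jp") -> None:
    """Write a UTF-8 copy of the file src to dst."""
    Path(dst).write_bytes(reencode(Path(src).read_bytes(), source_encoding))


if __name__ == "__main__":
    sample = "日本語 [にほんご] /Japanese language/".encode("euc_jp")
    print(reencode(sample).decode("utf-8"))
```

If the file is not really in the assumed encoding, decode will normally raise UnicodeDecodeError rather than silently produce garbage, which makes a wrong guess easy to spot.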

The CHPU vocabulary format

The CHPU format is adapted from the CHP format used by Chinese Practice. These files contain one line per term, each in the format

<CH=phrase><PI=pronunciation><OR=translation><NO=comment>

The order of the tags is not important, and CHP has additional tags that Phrasier simply ignores.
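Since the tags may come in any order and unknown tags are ignored, a CHPU line is easiest to read by collecting all tag=value pairs and keeping only the known ones. A minimal sketch, assuming values never contain < or >; the names used are hypothetical and this is not Phrasier's own parser:

```python
import re
from typing import Dict

# Matches one <TAG=value> group.
CHPU_TAG = re.compile(r"<([A-Za-z0-9]+)=([^<>]*)>")

# The four tags described above, mapped to descriptive field names.
FIELD_NAMES = {
    "CH": "phrase",
    "PI": "pronunciation",
    "OR": "translation",
    "NO": "comment",
}


def parse_chpu_line(line: str) -> Dict[str, str]:
    """Return the known fields of one CHPU line, ignoring any other tags."""
    fields = {}
    for tag, value in CHPU_TAG.findall(line):
        name = FIELD_NAMES.get(tag)
        if name is not None:
            fields[name] = value
    return fields


if __name__ == "__main__":
    print(parse_chpu_line("<PI=ni3 hao3><CH=你好><OR=hello><NO=greeting>"))
```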

Format of practice session files (CHS)

The CHS file format is an XML format. This proved to be convenient for a number of reasons: not least, it makes it easier to make the files both forward and backward compatible. Also, it is editable in an ordinary text editor: if you edit the practice session file, however, be aware that Phrasier will warn you that the checksum does not match the data.

The basic format is as follows:

  • <?xml version="1.0"?>
  • <?phrasier version="Phrasier version"?>
  • <session version="format version" checksum="checksum" encoding="encoding">
  • <options>
  • option elements
  • </options>
  • <vocabulary class="PracticeTerm" locale="">
  • practice term elements
  • </vocabulary>
  • </session>

The first line is simply the standard XML declaration. The second line specifies which program and program version was used to generate the file: this will be mphrasier instead of phrasier if the file was saved by MobilePhrasier.

The main element is the session element. The session format version, like the Phrasier version, is used internally to identify the format used for the session data: even when a new version of Phrasier is released, the format of the session data may remain the same, and this makes compatibility easier to check. The session tag also stores a checksum which is used to verify that the data has not been corrupted: this is only checked within the same session format version. The encoding should be either UTF-8 or ASCII: the UTF-8 encoding is convenient for entering or editing files in an editor, since non-ASCII characters can be displayed, whereas the ASCII encoding (with non-ASCII characters encoded as &#number;) is required for MobilePhrasier.
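The ASCII variant uses ordinary XML numeric character references, so text can be converted in both directions with a few lines. This sketch only illustrates the &#number; notation and is not Phrasier's own code; the function names are hypothetical.

```python
import re


def to_ascii_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal &#number; reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def from_ascii_entities(text: str) -> str:
    """Turn decimal &#number; references back into the characters they denote."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)


if __name__ == "__main__":
    encoded = to_ascii_entities("你好 hello")
    print(encoded)
    print(from_ascii_entities(encoded))
```

An XML parser resolves these references by itself when reading a file, so the decoding step is only needed when working on the raw text.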

Within the session tag are two main tags: options and vocabulary. A list of options may be included in the first tag for storing information about which fields are displayed or hidden and which fonts are used in each field: these are optional and will be created when needed. The second tag contains the list of terms included in the practice session. The class option is there primarily to specify what type of terms the vocabulary contains. The locale specifies language specific rules: the only locale implemented as of now is chinese, but others may come.

The vocabulary element contains a list of terms each of which has the following format:

  • <pterm>
  • <source>source name</source>
  • <phrase>phrase</phrase>
  • <pronounce>pronunciation</pronounce>
  • <translate>translation</translate>
  • <comment>comment</comment>
  • <learnt>learnt-index (number)</learnt>
  • <importance>importance (number)</importance>
  • <selected>selected flag (true or false)</selected>
  • <lasttime>last time practiced (seconds from 1 Jan 2000)</lasttime>
  • </pterm>

The pterm element contains all information for a given term. It should contain one or more source elements specifying the source (typically a file name) from which the term was imported: if a term was imported from more than one source, these instances will be merged. The phrase, pronounce and translate elements contain the phrase itself (e.g. Chinese characters), the pronunciation (e.g. Pinyin), and the translation (e.g. English). The comment element is optional. The elements added by Phrasier are learnt, storing the learnt-index (a floating point number); importance, storing the importance of the term (a floating point number defaulting to 0); selected, taking the value true or false to indicate whether the term should be included in or excluded from practice; and lasttime, giving the time when the term was last practiced (based on Java system time which is milliseconds since start of 1 Jan 1970).
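Because CHS is plain XML, the terms can be read with any XML library. The sketch below uses Python's standard xml.etree.ElementTree on a small made-up session. The sample values (versions, checksum, file names) are invented for illustration, read_terms is a hypothetical name, and the defaults used for a missing learnt or selected element are assumptions. The lasttime value is kept as raw text rather than interpreted.

```python
import xml.etree.ElementTree as ET

# A made-up session file with full tag names; all values are illustrative only.
SAMPLE = """<?xml version="1.0"?>
<?phrasier version="0.0"?>
<session version="1" checksum="0" encoding="UTF-8">
<options>
</options>
<vocabulary class="PracticeTerm" locale="chinese">
<pterm>
<source>lesson1.txt</source>
<source>lesson2.txt</source>
<phrase>你好</phrase>
<pronounce>ni3 hao3</pronounce>
<translate>hello</translate>
<learnt>0.5</learnt>
<selected>true</selected>
<lasttime>0</lasttime>
</pterm>
</vocabulary>
</session>"""


def read_terms(xml_text: str) -> list:
    """Return one dictionary per pterm element in a session document."""
    root = ET.fromstring(xml_text)
    terms = []
    for pterm in root.iter("pterm"):
        terms.append({
            "sources": [s.text or "" for s in pterm.findall("source")],
            "phrase": pterm.findtext("phrase", default=""),
            "pronounce": pterm.findtext("pronounce", default=""),
            "translate": pterm.findtext("translate", default=""),
            "comment": pterm.findtext("comment", default=""),  # optional element
            # Assumption: a missing learnt element is treated as 0.
            "learnt": float(pterm.findtext("learnt") or "0"),
            # importance defaults to 0, as described above.
            "importance": float(pterm.findtext("importance") or "0"),
            # Assumption: a missing selected element means the term is included.
            "selected": (pterm.findtext("selected") or "true") == "true",
            "lasttime": pterm.findtext("lasttime", default=""),  # raw text
        })
    return terms


if __name__ == "__main__":
    for term in read_terms(SAMPLE):
        print(term)
```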

MobilePhrasier, in order to save space and read the file faster, condenses the tag names of the term to the first two characters: i.e. so, ph, etc. Phrasier writes full tag names, while MobilePhrasier writes condensed tag names, but both can read full as well as condensed tag names.
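A reader that accepts both spellings only needs to map each two-character name back to its full form before processing. A minimal sketch, assuming that the condensed names are exactly the first two characters of the term tags listed above; whether the enclosing pterm tag is also condensed is not stated, so it is left out here. The names are hypothetical.

```python
# Full tag names of the elements inside a term, as listed above.
FULL_TAGS = [
    "source", "phrase", "pronounce", "translate", "comment",
    "learnt", "importance", "selected", "lasttime",
]

# Map each two-character prefix to its full tag name: so -> source, ph -> phrase, ...
CONDENSED = {name[:2]: name for name in FULL_TAGS}

# The mapping is only usable because every prefix is unique.
assert len(CONDENSED) == len(FULL_TAGS)


def expand_tag(tag: str) -> str:
    """Return the full tag name for a condensed one; other names pass through."""
    return CONDENSED.get(tag, tag)


if __name__ == "__main__":
    print([expand_tag(t) for t in ("so", "ph", "pr", "translate")])
```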

Last modified April 30, 2009.