UKOLN AHDS Handling International Text



Background

Digital text is one of the oldest description methods, but it remains divided by differing file formats, encoding methods, and schemas. When choosing a digital text format it is necessary to establish the project's needs. Is plain text suitable for the task, or are text markup and formatting required? How will the information be displayed, and where? This document describes these issues and provides some guidelines for addressing them.

What is the Best Tool for the Job?

Digital text has existed in one form or another since the 1960s. Many computer users take for granted that they can quickly write a letter without restriction or technical considerations. A commercial project, however, requires consideration of long-term needs and goals. To avoid complications at a later date, the developer must establish whether the tools in use are the most appropriate for the task and, if not, what can be used in their place. To achieve this, three questions should be answered:

  1. How will textual information be viewable for the user?
  2. What problems may I encounter if textual information is stored incorrectly?
  3. How will textual information be organized?

File Formats

It is often assumed that everyone can read text. However, this is not always the case. Digital text imposes restrictions upon the content that can have a significant impact upon the project.

In particular, there are two main issues:

The choice of format will be dependent upon the following factors:

Character Encoding

Plain text remains useful for universal information access. It has the advantage of being simple to interpret and small in file size. However, there are differences in the methods used to encode text characters. The most common encodings are ASCII (American Standard Code for Information Interchange) and Unicode.
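The practical difference between these encodings can be seen by storing the same text in each. The following is a minimal Python sketch, not a recommended procedure: the function name encoded_sizes and the sample strings are illustrative assumptions, and UTF-8 and UTF-16 are used here as two common ways of storing Unicode text.

```python
def encoded_sizes(text):
    """Return the number of bytes needed to store text in each encoding.

    An encoding that cannot represent the text at all maps to None.
    """
    sizes = {}
    for name in ("ascii", "utf-8", "utf-16-le"):
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            # ASCII covers only 128 characters, so most non-English text fails here.
            sizes[name] = None
    return sizes


# Illustrative samples: an English word and a Greek word
# (written with escapes so the source file itself stays ASCII-safe).
english = encoded_sizes("Hello")
greek = encoded_sizes("\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac")

print(english)  # {'ascii': 5, 'utf-8': 5, 'utf-16-le': 10}
print(greek)    # {'ascii': None, 'utf-8': 16, 'utf-16-le': 16}
```

The English sample is identical in ASCII and UTF-8, which is why existing ASCII files are usually readable as UTF-8 without conversion, whereas the Greek sample cannot be stored in ASCII at all.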

Problems

Several problems may be encountered when storing textual information. For text files it is a simple process to convert the file to Unicode. However, for more complex data, such as databases, the conversion process will become more difficult. Problems may include:

Structural Mark-up

Although ASCII and Unicode are useful for storing information, they are only able to describe each character, not how the characters should be displayed or organized. Structural mark-up languages enable the designer to dictate how information will appear and to establish a structure for its layout. For example, the user can define tags to store a book's author and publication date.
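As an illustration of the book example, the sketch below marks up a single record and reads the fields back using Python's standard xml.etree.ElementTree module. The tag names (book, title, author, published) and the record contents are hypothetical and do not come from any particular schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the tag names and values are illustrative only.
record = """<book>
  <title>An Example Title</title>
  <author>A. N. Author</author>
  <published>2004</published>
</book>"""

# Parse the marked-up text into a tree of elements.
book = ET.fromstring(record)

# Each piece of information can now be retrieved by name
# rather than by its position in the text.
author = book.findtext("author")
published = book.findtext("published")

print(author, published)
```

Because each field is labelled, software can locate the author or date without guessing where it sits in the text, which is the organizational benefit that plain character encodings cannot provide.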

The use of structural mark-up can provide many organizational benefits:

The most common mark-up languages are SGML and XML. Based upon these languages, several schemas have been developed to organize and define data relationships. These allow certain elements to have specific attributes that define their method of use (see the Digital Rights document for more information). To ensure interoperability, XML is advised due to its support for contemporary Internet standards (such as Unicode).
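XML's support for Unicode can be sketched in the same way. The example below, again using Python's standard xml.etree.ElementTree module, writes an element containing Greek text as UTF-8 and as ASCII, then reads both back. The element name, the lang attribute, and the sample text are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical element with an attribute describing how the content is used.
book = ET.Element("book", {"lang": "el"})
title = ET.SubElement(book, "title")
title.text = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"  # a Greek word

# Serialize as UTF-8 bytes: the Greek characters are stored directly.
utf8_data = ET.tostring(book, encoding="utf-8")

# Serialize with the module's default (US-ASCII): characters outside ASCII
# are written as numeric character references such as &#917;.
ascii_data = ET.tostring(book)

# Both forms parse back to exactly the same text.
from_utf8 = ET.fromstring(utf8_data)
from_ascii = ET.fromstring(ascii_data)

print(from_utf8.findtext("title") == title.text)
print(from_ascii.findtext("title") == title.text)
```

Either form preserves the original characters, so the choice of output encoding can follow the needs of the delivery channel without loss of content.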

Further Information