Unicode routines



Allegro can manipulate and display text using any character values from 0 right up to 2^32-1 (although the current implementation of the grabber can only create fonts using characters up to 2^16-1). You can choose between a number of different text encoding formats, which control how strings are stored and how Allegro interprets strings that you pass to it. This setting affects all aspects of the system: whenever you see a function that returns a char * type, or that takes a char * as an argument, that text will be in whatever format you have told Allegro to use.

By default, Allegro uses UTF-8 encoded text (U_UTF8). This is a variable-width format, where characters can occupy anywhere from one to six bytes. The nice thing about it is that characters ranging from 0-127 are encoded directly as themselves, so UTF-8 is upwardly compatible with 7 bit ASCII ("Hello, World!" means the same thing regardless of whether you interpret it as ASCII or UTF-8 data). Any character value of 128 or above, such as accented vowels, the UK currency symbol, and Arabic or Chinese characters, will be encoded as a sequence of two or more bytes, each in the range 128-255. This means you will never get what looks like a 7 bit ASCII character as part of the encoding of a different character value, which makes it very easy to manipulate UTF-8 strings.
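
To make the byte layout concrete, here is a minimal standalone sketch of a UTF-8 encoder in plain C. It is not Allegro's own code: the name utf8_encode() is hypothetical, and the sketch assumes the original one-to-six byte UTF-8 scheme described above, which carries at most 31 bits per character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Allegro: stores code point c in buf as
 * UTF-8 and returns the number of bytes used (1 to 6). buf must have room
 * for 6 bytes. Six bytes carry 31 payload bits, hence the range check. */
int utf8_encode(unsigned char *buf, unsigned long c)
{
   /* marker bits of the lead byte, indexed by continuation-byte count */
   static const unsigned char lead[6] = { 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
   int extra, i;

   assert(buf != NULL);
   assert(c <= 0x7FFFFFFFUL);

   /* how many continuation bytes follow the lead byte */
   if (c < 0x80UL) extra = 0;               /* 7 bit ASCII: stored as itself */
   else if (c < 0x800UL) extra = 1;
   else if (c < 0x10000UL) extra = 2;
   else if (c < 0x200000UL) extra = 3;
   else if (c < 0x4000000UL) extra = 4;
   else extra = 5;

   /* lead byte: marker bits plus the topmost payload bits */
   buf[0] = (unsigned char)(lead[extra] | (c >> (6 * extra)));

   /* continuation bytes: 10xxxxxx, six payload bits each, always >= 128 */
   for (i = 1; i <= extra; i++)
      buf[i] = (unsigned char)(0x80 | ((c >> (6 * (extra - i))) & 0x3F));

   return extra + 1;
}
```

Encoding 'A' yields the single byte 0x41, while the value 0xE9 (e with acute accent) yields the two bytes 0xC3 0xA9, both of which are above 127.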

There are a few editing programs that understand UTF-8 format text files. Alternatively, you can write your strings in plain ASCII or 16 bit Unicode formats, and then use the Allegro textconv program to convert them into UTF-8.

If you prefer to use some other text format, you can set Allegro to work with normal 8 bit ASCII (U_ASCII), or 16 bit Unicode (U_UNICODE) instead, or you can provide some handler functions to make it support whatever other text encoding you like (for example it would be easy to add support for 32 bit UCS-4 characters, or the Chinese GB-code format).

There is some limited support for alternative 8 bit codepages, via the U_ASCII_CP mode. This is very slow, so you shouldn't use it for serious work, but it can be handy as an easy way to convert text between different codepages. By default the U_ASCII_CP mode is set up to reduce text to a clean 7 bit ASCII format, trying to replace any accented vowels with their simpler equivalents (this is used by the allegro_message() function when it needs to print an error report onto a text mode DOS screen). If you want to work with other codepages, you can do this by passing a character mapping table to the set_ucodepage() function.

Note that you can use the Unicode routines before you call install_allegro() or allegro_init(). If you want to work in a text mode other than UTF-8, it is best to set it with set_uformat() just before you call these.

void set_uformat(int type);
Sets the current text encoding format. This will affect all parts of Allegro, wherever you see a function that returns a char *, or takes a char * as a parameter. The type should be one of the values:

      U_ASCII     - fixed size, 8 bit ASCII characters
      U_ASCII_CP  - alternative 8 bit codepage (see set_ucodepage())
      U_UNICODE   - fixed size, 16 bit Unicode characters
      U_UTF8      - variable size, UTF-8 format Unicode characters

Although you can change the text format on the fly, this is not a good idea. Many strings, for example the names of your hardware drivers and any language translations, are loaded when you call allegro_init(), so if you change the encoding format after this, they will be in the wrong format, and things will not work properly. Generally you should only call set_uformat() once, before allegro_init(), and then leave it on the same setting for the duration of your program.

int get_uformat(void);
Returns the currently selected text encoding format.

void register_uformat(int type, int (*u_getc)(const char *s), int (*u_getx)(char **s), int (*u_setc)(char *s, int c), int (*u_width)(const char *s), int (*u_cwidth)(int c), int (*u_isok)(int c));
Installs a set of custom handler functions for a new text encoding format. The type is the ID code for your new format, which should be a 4-character string as produced by the AL_ID() macro, and which can later be passed to functions like set_uformat() and uconvert(). The function parameters are handlers that implement the character access for your new type: see below for details of these.

void set_ucodepage(const unsigned short *table, const unsigned short *extras);
When you select the U_ASCII_CP encoding mode, a set of tables is used to convert between 8 bit characters and their Unicode equivalents. You can use this function to specify a custom set of mapping tables, which allows you to support different 8 bit codepages. The table parameter points to an array of 256 shorts, which contain the Unicode value for each character in your codepage. The extras parameter, if not NULL, points to a list of mapping pairs, which will be used when reducing Unicode data to your codepage. Each pair consists of a Unicode value, followed by the way it should be represented in your codepage. The list is terminated by a zero Unicode value. This allows you to create a many->one mapping, where many different Unicode characters can be represented by a single codepage value (eg. for reducing accented vowels to 7 bit ASCII).
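
The following standalone sketch shows one way the two tables could be consulted when reducing a Unicode value to a codepage character. It illustrates the data layout described above and is not Allegro's implementation: unicode_to_codepage(), make_ascii_table() and the fallback parameter are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fills a 256 entry table in which characters 0-127
 * map to themselves and the upper half maps to '^' as a placeholder. */
void make_ascii_table(unsigned short table[256])
{
   int i;

   for (i = 0; i < 256; i++)
      table[i] = (unsigned short)((i < 128) ? i : '^');
}

/* Hypothetical helper: reduces Unicode value c to an 8 bit codepage
 * character. table holds the Unicode value of each of the 256 codepage
 * characters; extras, if not NULL, is a list of (Unicode value, codepage
 * value) pairs ending with a zero Unicode value. Returns fallback when
 * neither table knows the character. */
int unicode_to_codepage(const unsigned short *table,
                        const unsigned short *extras, int c, int fallback)
{
   int i;

   assert(table != NULL);

   /* direct lookup: is c the Unicode value of some codepage character? */
   for (i = 0; i < 256; i++)
      if (table[i] == c)
         return i;

   /* many->one lookup through the optional pair list */
   if (extras) {
      for (i = 0; extras[i] != 0; i += 2)
         if (extras[i] == c)
            return extras[i + 1];
   }

   return fallback;
}
```

With an extras list such as { 0xE9, 'e', 0xE8, 'e', 0 }, both accented forms of e collapse to the plain letter, which is the many->one behaviour described above.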

int need_uconvert(const char *s, int type, int newtype);
Given a pointer to a string, a description of the type of the string, and the type that you would like this string to be converted into, this function tells you whether any conversion is required. No conversion will be needed if type and newtype are the same, or if one type is ASCII, the other is UTF-8, and the string contains only character values less than 128. As a convenience shortcut, you can pass the value U_CURRENT as either of the type parameters, to represent whatever text format is currently selected.

int uconvert_size(const char *s, int type, int newtype);
Returns the number of bytes that will be required to store the specified string after a conversion from type to newtype, including the zero terminator. The type parameters can use the value U_CURRENT as a shortcut to represent the currently selected encoding format.

void do_uconvert(const char *s, int type, char *buf, int newtype, int size);
Converts the specified string from type to newtype, storing at most size bytes into the output buf. The type parameters can use the value U_CURRENT as a shortcut to represent the currently selected encoding format.

char *uconvert(const char *s, int type, char *buf, int newtype, int size);
Higher level function running on top of do_uconvert(). This function converts the specified string from type to newtype, storing at most size bytes into the output buf, but it checks before doing the conversion, and doesn't bother if the string formats are already the same (either both types are equal, or one is ASCII, the other is UTF-8, and the string contains only 7 bit ASCII characters). If a conversion was performed it returns a pointer to buf, otherwise it returns s itself, so you must use the return value rather than assuming that the string will always be moved to buf. As a convenience, if buf is NULL it will convert the string into an internal static buffer. You should be wary of using this feature, though, because that buffer will be overwritten the next time this routine is called, so don't expect the data to persist across any other library calls.
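
The return value contract is easy to get wrong, so here is a standalone sketch of it in plain C, fixed to a single conversion direction (8 bit text to UTF-8). It does not call Allegro: latin_to_utf8() and needs_conversion() are hypothetical names, and treating 8 bit values 128-255 as the Unicode values 128-255 is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: nonzero if the string holds any byte >= 128, in
 * which case its 8 bit and UTF-8 representations differ. */
static int needs_conversion(const char *s)
{
   const unsigned char *p = (const unsigned char *)s;

   for (; *p; p++)
      if (*p >= 0x80)
         return 1;

   return 0;
}

/* Hypothetical stand-in for the uconvert() contract: returns s itself when
 * no conversion is needed, otherwise writes at most size bytes (terminator
 * included) into buf and returns buf. */
const char *latin_to_utf8(const char *s, char *buf, int size)
{
   const unsigned char *p = (const unsigned char *)s;
   int pos = 0;

   assert(s != NULL && buf != NULL && size > 0);

   if (!needs_conversion(s))
      return s;                          /* buf is left untouched */

   for (; *p; p++) {
      int w = (*p < 0x80) ? 1 : 2;       /* bytes this character needs */

      if (pos + w >= size)               /* keep room for the terminator */
         break;

      if (w == 1) {
         buf[pos++] = (char)*p;
      }
      else {
         buf[pos++] = (char)(0xC0 | (*p >> 6));
         buf[pos++] = (char)(0x80 | (*p & 0x3F));
      }
   }

   buf[pos] = '\0';
   return buf;
}
```

A caller that always reads from buf would see stale or uninitialised data whenever the input was plain 7 bit text, which is why the returned pointer is the one to use.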

char *uconvert_ascii(const char *s, char buf[]);
Helper macro for converting strings from ASCII into the current encoding format. Expands to uconvert(s, U_ASCII, buf, U_CURRENT, sizeof(buf)).

char *uconvert_toascii(const char *s, char buf[]);
Helper macro for converting strings from the current encoding format into ASCII. Expands to uconvert(s, U_CURRENT, buf, U_ASCII, sizeof(buf)).

extern char empty_string[];
You can't just rely on "" to be a valid empty string in any encoding format. This global buffer contains a number of consecutive zeros, so it will be a valid empty string no matter whether the program is running in ASCII, Unicode, or UTF-8 mode.

int ugetc(const char *s);
Low level helper function for reading Unicode text data. Given a pointer to a string in the current encoding format, it returns the next character from the string.

int ugetx(char **s);
int ugetxc(const char **s);
Low level helper functions for reading Unicode text data. Given the address of a pointer to a string in the current encoding format, they return the next character from the string, and advance the pointer to the character after the one just read.

ugetxc is provided for working with pointer-to-pointer-to-const char data.
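
For readers who want to see what "read a character and advance the pointer" involves for a variable-width format, here is a standalone sketch of a UTF-8 reader with the same calling shape as ugetxc(). It is not Allegro's code: utf8_getx() is a hypothetical name, and returning '^' for malformed input is an arbitrary choice made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decodes one UTF-8 character starting at *s, moves
 * *s past it, and returns its value. Malformed input yields '^'. */
int utf8_getx(const char **s)
{
   const unsigned char *p;
   int c, extra = 0;

   assert(s != NULL && *s != NULL);

   p = (const unsigned char *)*s;
   c = *p++;

   if (c >= 0x80) {
      /* the lead byte says how many continuation bytes follow */
      if ((c & 0xE0) == 0xC0)      { extra = 1; c &= 0x1F; }
      else if ((c & 0xF0) == 0xE0) { extra = 2; c &= 0x0F; }
      else if ((c & 0xF8) == 0xF0) { extra = 3; c &= 0x07; }
      else if ((c & 0xFC) == 0xF8) { extra = 4; c &= 0x03; }
      else if ((c & 0xFE) == 0xFC) { extra = 5; c &= 0x01; }
      else {
         *s = (const char *)p;     /* stray continuation byte */
         return '^';
      }

      while (extra-- > 0) {
         if ((*p & 0xC0) != 0x80) {
            *s = (const char *)p;  /* sequence ended too early */
            return '^';
         }
         c = (c << 6) | (*p++ & 0x3F);
      }
   }

   *s = (const char *)p;
   return c;
}
```

Calling it in a loop until it returns zero walks a string one character at a time, regardless of how many bytes each character occupies.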

int usetc(char *s, int c);
Low level helper function for writing Unicode text data. It writes the specified character to the given address in the current encoding format, and returns the number of bytes written.

int uwidth(const char *s);
Low level helper function for testing Unicode text data. It returns the number of bytes occupied by the first character of the specified string, in the current encoding format.

int ucwidth(int c);
Low level helper function for testing Unicode text data. It returns the number of bytes that would be occupied by the specified character value, when encoded in the current format.

int uisok(int c);
Low level helper function for testing Unicode text data. Tests whether the specified value can be correctly encoded in the current format.

int uoffset(const char *s, int index);
Returns the offset in bytes from the start of the string to the character at the specified index. If the index is negative, it counts backward from the end of the string, so an index of -1 will return an offset to the last character.
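
The difference between a character index and a byte offset is easiest to see in code. The sketch below reproduces the described behaviour for UTF-8 data in plain C; it is not Allegro's implementation, the names utf8_strlen() and utf8_offset() are hypothetical, and well-formed input is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts characters by counting every byte that is
 * not a continuation byte (10xxxxxx). */
static int utf8_strlen(const char *s)
{
   int n = 0;

   for (; *s; s++)
      if (((unsigned char)*s & 0xC0) != 0x80)
         n++;

   return n;
}

/* Hypothetical helper: byte offset of the character at the given index.
 * A negative index counts backward from the end of the string. */
int utf8_offset(const char *s, int index)
{
   const char *p = s;

   assert(s != NULL);

   if (index < 0)
      index += utf8_strlen(s);

   while (index-- > 0 && *p) {
      p++;                                         /* skip the lead byte */
      while (((unsigned char)*p & 0xC0) == 0x80)   /* and its followers */
         p++;
   }

   return (int)(p - s);
}
```

For a five character string whose second character takes two bytes, index 2 maps to byte offset 3 and index -1 maps to byte offset 5.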

int ugetat(const char *s, int index);
Returns the character value at the specified index within the string. A zero index parameter will return the first character of the string. If the index is negative, it counts backward from the end of the string, so an index of -1 will return the last character of the string.

int usetat(char *s, int index, int c);
Replaces the character at the specified index within the string with value c, handling any adjustments for variable width data (ie. if c encodes to a different width than the previous value at that location). Returns the number of bytes by which the trailing part of the string was moved. If the index is negative, it counts backward from the end of the string.

int uinsert(char *s, int index, int c);
Inserts the character c at the specified index within the string, sliding the rest of the data along to make room. Returns the number of bytes by which the trailing part of the string was moved. If the index is negative, it counts backward from the end of the string.

int uremove(char *s, int index);
Removes the character at the specified index within the string, sliding the rest of the data back to fill the gap. Returns the number of bytes by which the trailing part of the string was moved. If the index is negative, it counts backward from the end of the string.

int ustrsize(const char *s);
Returns the size of the specified string in bytes, not including the trailing zero.

int ustrsizez(const char *s);
Returns the size of the specified string in bytes, including the trailing zero.

int uwidth_max(int type);
Low level helper function for working with Unicode text data. Returns the largest number of bytes that one character can occupy in the given encoding format. Pass U_CURRENT to represent the current format.

int utolower(int c);
This function returns c, converting it to lower case if it is upper case.

int utoupper(int c);
This function returns c, converting it to upper case if it is lower case.

int uisspace(int c);
Returns nonzero if c is whitespace, that is, carriage return, newline, form feed, tab, vertical tab, or space.

int uisdigit(int c);
Returns nonzero if c is a digit.

char *ustrdup(const char *src);
This function copies the NULL-terminated string src into a newly allocated area of memory. The memory returned by this call must be freed by the caller. Returns NULL if it cannot allocate space for the duplicated string.

char *_ustrdup(const char *src, void* (*malloc_func) (size_t));
Does the same as ustrdup(), but allows the caller to specify a custom memory allocator function.

char *ustrcpy(char *dest, const char *src);
This function copies src (including the terminating NULL character) into dest. The return value is the value of dest.

char *ustrzcpy(char *dest, int size, const char *src);
This function copies src (including the terminating NULL character) into dest, whose length in bytes is specified by size and which is guaranteed to be NULL-terminated. The return value is the value of dest.
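
A size-limited copy of variable-width text has to decide what to do when the limit falls in the middle of a character. The standalone sketch below handles UTF-8 data and stops at the last whole character that fits. It is not Allegro's code: utf8_zcpy() is a hypothetical name, and truncating on a character boundary is an assumption, since the description above does not say where the cut is made.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dest, which is size bytes long,
 * never splitting a multi-byte character and always writing a terminator.
 * Returns dest. */
char *utf8_zcpy(char *dest, int size, const char *src)
{
   int pos = 0;

   assert(dest != NULL && src != NULL && size > 0);

   size -= 1;                       /* reserve one byte for the terminator */

   while (src[pos]) {
      int w = 1;

      /* width of this character: lead byte plus continuation bytes */
      while (((unsigned char)src[pos + w] & 0xC0) == 0x80)
         w++;

      if (pos + w > size)           /* would not fit as a whole: stop */
         break;

      memcpy(dest + pos, src + pos, (size_t)w);
      pos += w;
   }

   dest[pos] = '\0';
   return dest;
}
```

Copying a four byte string that ends in a two byte character into a four byte buffer therefore keeps only the first two characters, rather than leaving half of the last one behind.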

char *ustrcat(char *dest, const char *src);
This function concatenates src to the end of dest. The return value is the value of dest.

char *ustrzcat(char *dest, int size, const char *src);
This function concatenates src to the end of dest, whose length in bytes is specified by size and which is guaranteed to be NULL-terminated. The return value is the value of dest.

int ustrlen(const char *s);
This function returns the number of characters in s. Note that this doesn't have to equal the string's size in bytes.

int ustrcmp(const char *s1, const char *s2);
This function compares s1 and s2. Returns zero if the strings are equal, a positive number if s1 comes after s2 in the ASCII collating sequence, else a negative number.

char *ustrncpy(char *dest, const char *src, int n);
This function is like ustrcpy() except that no more than n characters from src are copied into dest. If src is shorter than n characters, NULL characters are appended to dest as padding until n characters have been written. Note that if src is longer than n characters, dest will not be NULL-terminated. The return value is the value of dest.

char *ustrzncpy(char *dest, int size, const char *src, int n);
This function is like ustrzcpy() except that no more than n characters from src are copied into dest. If src is shorter than n characters, NULL characters are appended to dest as padding until n characters have been written. Note that dest is guaranteed to be NULL-terminated. The return value is the value of dest.

char *ustrncat(char *dest, const char *src, int n);
This function is like ustrcat() except that no more than n characters from src are appended to the end of dest. If the terminating NULL character in src is reached before n characters have been written, the NULL character is copied, but no other characters are written. If n characters are written before a terminating NULL is encountered, the function appends its own NULL character to dest, so that n+1 characters are written. The return value is the value of dest.

char *ustrzncat(char *dest, int size, const char *src, int n);
This function is like ustrzcat() except that no more than n characters from src are appended to the end of dest. If the terminating NULL character in src is reached before n characters have been written, the NULL character is copied, but no other characters are written. Note that dest is guaranteed to be NULL-terminated. The return value is the value of dest.

int ustrncmp(const char *s1, const char *s2, int n);
This function compares up to n characters of s1 and s2. Returns zero if the substrings are equal, a positive number if s1 comes after s2 in the ASCII collating sequence, else a negative number.

int ustricmp(const char *s1, const char *s2);
This function compares s1 and s2, ignoring case.

char *ustrlwr(char *s);
This function replaces all upper case letters in s with lower case letters.

char *ustrupr(char *s);
This function replaces all lower case letters in s with upper case letters.

char *ustrchr(const char *s, int c);
This function returns a pointer to the first occurrence of c in s, or NULL if no match was found. Note that if c is zero, this will return a pointer to the end of the string.

char *ustrrchr(const char *s, int c);
This function returns a pointer to the last occurrence of c in s, or NULL if no match was found.

char *ustrstr(const char *s1, const char *s2);
This function finds the first occurrence of s2 in s1. Returns a pointer within s1, or NULL if s2 wasn't found.

char *ustrpbrk(const char *s, const char *set);
This function finds the first character in s that matches any character in set. Returns a pointer to the first match, or NULL if none are found.

char *ustrtok(char *s, const char *set);
This function retrieves tokens from s which are delimited by characters from set. To initiate the search, pass the string to be searched as s. For the remaining tokens, pass NULL instead. Returns a pointer to the token, or NULL if no more are found. Warning: Since ustrtok alters the string it is parsing, you should always copy the string to a temporary buffer before parsing it. Also, this function is not reentrant (ie. you cannot parse two strings at the same time).

char *ustrtok_r(char *s, const char *set, char **last);
Reentrant version of ustrtok. The last parameter is used to keep track of where the parsing is up to and must be a pointer to a char * variable allocated by the user that remains the same while parsing the same string.

double uatof(const char *s);
Convert as much of the string as possible to an equivalent double precision real number. This function is almost like `ustrtod(s, NULL)'. Returns the equivalent value, or zero if the string does not represent a number.

long ustrtol(const char *s, char **endp, int base);
This function converts the initial part of s to a signed integer, which is returned as a value of type `long int', setting *endp to point to the first unused character, if endp is not a NULL pointer. The base argument indicates what base the digits (or letters) should be treated as. If base is zero, the base is determined from the start of the string: a `0x' or `0X' prefix selects base 16, a leading `0' selects base 8, and base 10 is used if neither prefix is found.
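
The base detection rules match those of the standard C strtol() function, so they can be tried out without Allegro. The sketch below wraps strtol(); parse_long() is a hypothetical name, and the assumption that ustrtol() gives the same results on equivalent text in the current encoding is not something this example verifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical wrapper around the standard strtol(): converts the start of
 * s using the given base (0 means detect from the prefix) and, if rest is
 * not NULL, stores a pointer to the first unused character in *rest. */
long parse_long(const char *s, int base, const char **rest)
{
   char *end;
   long value;

   assert(s != NULL);

   value = strtol(s, &end, base);

   if (rest)
      *rest = end;

   return value;
}
```

With base zero, "0x1F" is read as hexadecimal 31, "017" as octal 15, and "42abc" as decimal 42 with the rest pointer left at the letter a.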

double ustrtod(const char *s, char **endp);
This function converts as many characters of s as look like a floating point number into one, and sets *endp to point to the first unused character, if endp is not a NULL pointer.

const char *ustrerror(int err);
This function returns a string that describes the error code `err', which normally comes from the variable `errno'. Returns a pointer to a static string that should not be modified or freed. If you make subsequent calls to ustrerror, the string might be overwritten.

int usprintf(char *buf, const char *format, ...);
This function writes formatted data into the output buffer. A NULL character is written to mark the end of the string. Returns the number of characters written, not including the terminating NULL character.

int uszprintf(char *buf, int size, const char *format, ...);
This function writes formatted data into the output buffer, whose length in bytes is specified by size and which is guaranteed to be NULL terminated. Returns the number of characters that would have been written had no truncation occurred (as with usprintf), not including the terminating NULL character.
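
Because the return value reports the untruncated length, it can be used to detect that the output did not fit. The standard C99 snprintf() function follows the same convention for byte strings, so the sketch below uses it as a stand-in. format_score() is a hypothetical name, and the example does not cover the distinction between bytes and characters that applies to uszprintf() in multi-byte encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats a score label into buf (size bytes long).
 * Returns the length the full text would have had, and sets *truncated to
 * nonzero if buf was too small to hold all of it. */
int format_score(char *buf, size_t size, int score, int *truncated)
{
   int needed;

   assert(buf != NULL && size > 0 && truncated != NULL);

   /* snprintf never writes more than size bytes, terminator included */
   needed = snprintf(buf, size, "Score: %d", score);

   /* the text fits only if needed characters plus a terminator fit */
   *truncated = (needed < 0 || (size_t)needed >= size);

   return needed;
}
```

Formatting the score 12345 into an eight byte buffer reports a needed length of 12 and leaves the seven characters "Score: " in the buffer.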

int uvsprintf(char *buf, const char *format, va_list args);
This is like usprintf(), but you pass the variable argument list directly, instead of the arguments themselves.

int uvszprintf(char *buf, int size, const char *format, va_list args);
This is like uszprintf(), but you pass the variable argument list directly, instead of the arguments themselves.



