Character sets are collections of characters that a system can recognize and represent. These characters include letters, digits, symbols, and control codes. The representation and handling of character sets in C are critical for the portability of code, especially when dealing with internationalization. The C 2018 edition makes several provisions related to character sets to address these environmental considerations.
1. Basic Character Set: The C 2018 standard defines a basic character set that consists of 96 characters, sufficient to write portable code. This includes:
- The 26 uppercase Latin letters: A–Z
- The 26 lowercase Latin letters: a–z
- The 10 decimal digits: 0–9
- 29 graphic characters: `! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~`
- 5 whitespace characters: space, horizontal tab, vertical tab, form feed, and newline
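One practical consequence of the basic character set is that the standard requires the decimal digits to have consecutive values, while it makes no such promise for letters. The sketch below relies on that guarantee; `digit_value` is a hypothetical helper name chosen for illustration.

```c
/* Convert a decimal digit character to its numeric value, or return -1
   if the character is not a decimal digit. The subtraction is portable
   because '0' through '9' are required to have consecutive values. */
int digit_value(char c)
{
    if (c >= '0' && c <= '9') {
        return c - '0';
    }
    return -1;
}
```

The same trick applied to letters (for example `c - 'a'`) is not portable, since encodings such as EBCDIC do not store the alphabet contiguously.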
2. Extended Character Set: For applications requiring more than the basic character set, the C standard supports extended characters via multibyte and wide character sets. These are particularly useful for internationalization, allowing the representation of characters from various languages.
- Multibyte Character Sets: A multibyte character is a sequence of one or more bytes, each stored in a `char`. Variable-length encodings such as UTF-8 use this form to represent character sets far larger than a single `char` can hold.
- Wide Character Sets: Wide characters (`wchar_t`) provide a fixed-width type, typically larger than `char`, that accommodates more extensive character sets directly. Functions declared in `<wchar.h>` facilitate the manipulation of wide characters and strings.
3. Universal Character Names: The C 2018 standard supports universal character names (UCNs), which let source code name characters outside the basic character set. A UCN is written `\uXXXX` (four hexadecimal digits) or `\UXXXXXXXX` (eight hexadecimal digits) and designates the character with that ISO/IEC 10646 (Unicode) code point. In an ordinary string literal, the bytes that a UCN produces depend on the implementation's execution character set.
```c
const char *greeting = "Hello, \u4e16\u754c"; // "Hello, 世界" in Chinese
```
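When the byte encoding of a UCN must be predictable, a `u8` string literal (available since C11) can be used, since it is always encoded as UTF-8. The sketch below stores the two characters from the greeting in such a literal and exposes their encoded bytes; the names `world_utf8`, `world_utf8_length`, and `world_utf8_byte` are hypothetical and exist only for illustration.

```c
#include <stddef.h>

/* u8 string literals are always encoded as UTF-8, regardless of the
   execution character set. Each of these two characters takes 3 bytes. */
static const unsigned char world_utf8[] = u8"\u4e16\u754c";

/* Number of bytes in the encoded text, excluding the terminating null. */
size_t world_utf8_length(void)
{
    return sizeof world_utf8 - 1;
}

/* Byte at the given index; index must be below world_utf8_length(). */
unsigned char world_utf8_byte(size_t index)
{
    return world_utf8[index];
}
```

U+4E16 encodes as the bytes `E4 B8 96` and U+754C as `E7 95 8C`, so the two characters occupy six bytes in total.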
4. Locale Dependencies: The behavior of certain functions, such as those for character classification and conversion, depends on the locale. The locale determines the rules for character representation for specific linguistic and regional preferences. The C standard library provides `setlocale` in `<locale.h>` to set and query the locale. A program starts in the `"C"` locale and must call `setlocale` explicitly to select another one.
```c
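#include <locale.h>
#include <stdio.h>

int main(void) {
    // Select US English with UTF-8 encoding; setlocale returns NULL if unavailable
    if (setlocale(LC_ALL, "en_US.UTF-8") == NULL) {
        fprintf(stderr, "Locale en_US.UTF-8 is not available\n");
        return 1;
    }
    // Locale-dependent operations here
    return 0;
}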
```
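Locale names other than `"C"` and `""` are implementation-defined, so a request can fail and the return value of `setlocale` is worth checking. The sketch below wraps that check and reads one locale-dependent setting through `localeconv`; `select_locale` and `current_decimal_point` are hypothetical helper names.

```c
#include <locale.h>
#include <stddef.h>

/* Try to select the named locale for all categories. Returns 1 on
   success and 0 if the implementation does not provide that locale. */
int select_locale(const char *name)
{
    return setlocale(LC_ALL, name) != NULL;
}

/* First character of the decimal-point string in the current locale. */
char current_decimal_point(void)
{
    return localeconv()->decimal_point[0];
}
```

In the `"C"` locale the decimal point is always `.`. In a locale such as German, if the system provides one, it would typically be a comma, which in turn affects functions like `printf` and `strtod`.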
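5. Environmental Limits: The standard specifies limits related to characters, such as `MB_LEN_MAX` in `<limits.h>`, the maximum number of bytes in a multibyte character for any supported locale. The related macro `MB_CUR_MAX` in `<stdlib.h>` gives the maximum for the current locale and never exceeds `MB_LEN_MAX`. These limits ensure that implementations provide a predictable environment for developers.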
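The relationship between the compile-time bound `MB_LEN_MAX` (from `<limits.h>`) and the locale-dependent value `MB_CUR_MAX` (from `<stdlib.h>`) can be checked at run time. The following is a small sketch; `mb_limits_consistent` is a hypothetical helper name.

```c
#include <limits.h>   /* MB_LEN_MAX */
#include <stdlib.h>   /* MB_CUR_MAX */

/* Returns 1 if the current locale's maximum multibyte character length
   lies within the compile-time bound, as the standard requires. */
int mb_limits_consistent(void)
{
    size_t current = MB_CUR_MAX;  /* depends on the current LC_CTYPE locale */
    return current >= 1 && current <= (size_t)MB_LEN_MAX;
}
```

Because `MB_LEN_MAX` is a constant expression, it is suitable for sizing buffers (for example `char buf[MB_LEN_MAX];`), whereas `MB_CUR_MAX` may change after a call to `setlocale`.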