Rust Programming: Storing UTF-8 Encoded Text with Strings

Strings in Rust are implemented as a collection of bytes, plus some methods to provide useful functionality when those bytes are interpreted as text. The String type in Rust is a growable, mutable, owned, UTF-8 encoded string type.
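As a small illustration of what "growable, mutable, owned" means in practice, here is a minimal sketch. The helper name build_greeting is made up for this example; the String methods it calls (String::from, push_str, push, len) are from the standard library.

```rust
/// Hypothetical helper: builds a greeting by growing an owned String in place.
fn build_greeting() -> String {
    let mut s = String::from("Hello"); // owned, heap-allocated buffer
    s.push_str(", ");                  // append a string slice
    s.push_str("world");
    s.push('!');                       // append a single char
    s
}

fn main() {
    let greeting = build_greeting();
    println!("{}", greeting);
    // len() reports the size in bytes, not in characters.
    println!("{} bytes", greeting.len());
}
```

Because the text here is plain ASCII, the byte count and the character count happen to match. That stops being true as soon as non-ASCII text is involved.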

New Rustaceans commonly get stuck on strings for a combination of three reasons:

  • Rust's propensity for exposing possible errors
  • Strings being a more complicated data structure than many programmers give them credit for
  • UTF-8

These factors combine in a way that can seem difficult when you're coming from other programming languages.

One of the ways in which String is different from other collections is that indexing into a String is complicated by the differences between how people and computers interpret String data. For example, the following code will not compile:

Rust
let hello = "Здравствуйте";
let answer = &hello[0]; // compile error: a str cannot be indexed by an integer

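This fails because Rust strings cannot be indexed with an integer at all, whatever their contents. The reason for that design is UTF-8. The first character of the string "Здравствуйте" is the Cyrillic capital letter Ze, which takes two bytes to encode in UTF-8. Byte 0 on its own is therefore only half of a character and does not correspond to a valid Unicode scalar value, so returning it would rarely be what the programmer wanted.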
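The gap between bytes and characters can be checked directly. The sketch below uses a made-up helper, byte_and_char_counts, together with the standard len(), chars().count(), as_bytes() and get() methods. It also uses get() with a byte range, which returns None instead of panicking when the range would cut a character in half.

```rust
/// Hypothetical helper: returns (length in bytes, number of Unicode scalar values).
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let hello = "Здравствуйте";

    let (bytes, chars) = byte_and_char_counts(hello);
    println!("{} bytes, {} chars", bytes, chars);

    // The raw first byte is just a number, not a letter.
    println!("first byte: {}", hello.as_bytes()[0]);

    // A byte range that ends inside a character yields None...
    println!("{:?}", hello.get(0..1));
    // ...while a range that ends on a character boundary yields the text.
    println!("{:?}", hello.get(0..2));
}
```

For this string the helper reports 24 bytes and 12 characters, because every letter in it is encoded as two bytes.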

To get at a character safely, use the chars() method, which iterates over the Unicode scalar values in the string. For example, the following code prints the first character of the string "Здравствуйте":

Rust
let hello = "Здравствуйте";

// chars() yields Unicode scalar values; next() returns the first one, if any.
if let Some(c) = hello.chars().next() {
    println!("{}", c);
}

This code will print the following output:

З
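The same idea extends to characters other than the first. The following is a sketch rather than a recommended pattern: nth_char is a hypothetical helper built on the standard chars().nth() call, and it has to walk the string from the start, so it is linear in the position requested. The char_indices() method is shown as well, since it pairs each character with the byte offset where it begins.

```rust
/// Hypothetical helper: returns the n-th Unicode scalar value (0-based),
/// or None if the string has fewer than n + 1 characters.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    let hello = "Здравствуйте";

    match nth_char(hello, 1) {
        Some(c) => println!("second character: {}", c),
        None => println!("the string is too short"),
    }

    // char_indices() yields (byte offset, char) pairs.
    for (offset, c) in hello.char_indices().take(3) {
        println!("byte {}: {}", offset, c);
    }
}
```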

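You can also use the bytes() method to iterate over the raw UTF-8 bytes of a String. This is useful when you need the encoded form itself, for example when writing the text to a byte-oriented file or network protocol. Keep in mind that one character may span several bytes: each of the 12 letters in "Здравствуйте" occupies two, so bytes() yields 24 values for that string.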
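A short sketch of byte iteration follows. The helper first_bytes is invented for this example; bytes(), take() and collect() are standard.

```rust
/// Hypothetical helper: collects the first n UTF-8 bytes of a string.
fn first_bytes(s: &str, n: usize) -> Vec<u8> {
    s.bytes().take(n).collect()
}

fn main() {
    let hello = "Здравствуйте";

    // Each of the first two letters is encoded as two bytes.
    for b in first_bytes(hello, 4) {
        println!("{}", b);
    }
}
```

The four values printed are the two-byte encodings of the letters З and д, which shows why a single byte cannot stand in for a character in this string.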
