Rust Programming: Indexing into Strings

Rust strings are stored as UTF-8 encoded text. UTF-8 is a variable-width encoding, meaning that different characters can be represented by different numbers of bytes. For example, the character a is represented by a single byte, while the character é is represented by two bytes.
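The byte counts above can be checked with a short sketch. The helper name utf8_len is made up for this example; char::len_utf8 and str::len are standard library methods. The é is written as the escape \u{e9} so that the precomposed form is used.

```rust
/// Hypothetical helper: number of bytes a character occupies in UTF-8.
fn utf8_len(c: char) -> usize {
    c.len_utf8()
}

fn main() {
    // 'a' is U+0061 and fits in one byte.
    println!("a: {} byte(s)", utf8_len('a'));
    // 'é' (U+00E9, precomposed) needs two bytes.
    println!("é: {} byte(s)", utf8_len('\u{e9}'));
    // str::len counts bytes, not characters: "hé" has 2 characters but 3 bytes.
    println!("len of \"hé\": {}", "h\u{e9}".len());
}
```

Note that len() on a string reports bytes, which is why it cannot be used directly as a character count.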

This means that indexing into a Rust string is not as straightforward as it might seem. To reach the character at a given position, you need the byte offset at which that character starts. Because characters vary in width, the offset cannot be computed from the position alone: the string has to be scanned from the beginning, which takes linear time rather than the constant time people expect from indexing.
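To make the cost of that lookup concrete, here is a minimal sketch. The helper name byte_offset_of_nth_char is made up for this example; char_indices() is the standard library method it relies on.

```rust
/// Hypothetical helper: byte offset at which the n-th character (counting
/// from zero) starts, or None if the string has fewer characters.
/// The iterator has to walk the string from the start, so this is O(n).
fn byte_offset_of_nth_char(s: &str, n: usize) -> Option<usize> {
    s.char_indices().nth(n).map(|(offset, _)| offset)
}

fn main() {
    // ASCII only: position and byte offset coincide.
    println!("{:?}", byte_offset_of_nth_char("hello", 2));
    // Two-byte Cyrillic letters: the offset is twice the position.
    println!("{:?}", byte_offset_of_nth_char("Здравствуйте", 2));
    // Past the end of the string.
    println!("{:?}", byte_offset_of_nth_char("hi", 5));
}
```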

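Another reason Rust strings don't support indexing is that an integer index would have to refer to a byte, not a character. A single byte taken from the middle of a multi-byte character is not valid text, so returning it could silently produce wrong results. Rust therefore rejects an expression like s[0] at compile time, and slicing with a byte range that does not fall on character boundaries panics at run time.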
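The standard library does offer a non-panicking way to ask for a byte range. The sketch below wraps it in a helper whose name, safe_slice, is made up for this example; str::get with a range and str::is_char_boundary are real standard library methods.

```rust
/// Hypothetical helper: returns the slice for the byte range start..end
/// only if the range is in bounds and both ends fall on UTF-8 character
/// boundaries; otherwise returns None instead of panicking.
fn safe_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    let hello = "Здравствуйте"; // every letter here occupies two bytes

    // Bytes 0..2 cover exactly the first letter.
    println!("{:?}", safe_slice(hello, 0, 2));
    // Bytes 0..1 would cut the first letter in half, so the result is None.
    println!("{:?}", safe_slice(hello, 0, 1));
    // Byte offset 1 is in the middle of a character.
    println!("{}", hello.is_char_boundary(1));
}
```

Writing &hello[0..1] instead would compile but panic at run time, which is why the Option-returning form is usually the safer choice when the offsets come from outside.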

So, instead of allowing indexing, Rust provides a number of other ways to access individual characters in a string. For example, you can use the chars() method to get an iterator over the individual characters, and call nth() on that iterator to reach a particular one. You can also call the get() method with a byte range; it returns an Option holding a string slice, which is None if the range is out of bounds or does not fall on character boundaries.

Here are some examples of how to access individual characters in a Rust string without using indexing:

Rust
let s1 = String::from("hello");

// Get an iterator over the individual characters in the string.
let chars = s1.chars();

// Get the first character in the string (None if the string is empty).
let first_char = s1.chars().next();

// Get the last character in the string (None if the string is empty).
let last_char = s1.chars().last();

// Get the third character; nth(2) walks the string to reach it.
let third_char = s1.chars().nth(2);

// Get a string slice by byte range (None if the range splits a character).
let first_two = s1.get(0..2);

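While it may seem inconvenient at first, Rust's approach to strings is safer and more predictable. By not allowing integer indexing, Rust prevents code from silently splitting a character in half, and it makes the linear cost of finding a character by position visible instead of hiding it behind syntax that looks like a constant-time operation.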

Internal Representation

Internally, a String in Rust is a wrapper around a Vec<u8>. This means that a String is a vector of bytes that is guaranteed to hold valid UTF-8. A byte does not necessarily correspond to a character: one character may occupy anywhere from one to four bytes.

The important thing to note is that Rust uses UTF-8 encoding for its strings. This means that a single Unicode character may be represented by multiple bytes, depending on the character. For example, the Cyrillic letter Ze (З) is represented by two bytes in UTF-8: 208 and 151.
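A small sketch to confirm those two bytes. The helper name utf8_bytes is made up for this example; as_bytes() and String::from_utf8 are standard library functions.

```rust
/// Hypothetical helper: copies the UTF-8 bytes of a string into a Vec<u8>.
fn utf8_bytes(s: &str) -> Vec<u8> {
    s.as_bytes().to_vec()
}

fn main() {
    // The Cyrillic letter Ze (U+0417) is stored as two bytes.
    println!("{:?}", utf8_bytes("З"));

    // Going the other way: the same two bytes decode back to the letter.
    let ze = String::from_utf8(vec![208, 151]).expect("valid UTF-8");
    println!("{}", ze);

    // The first byte alone is not valid UTF-8, so decoding it fails.
    println!("{}", String::from_utf8(vec![208]).is_err());
}
```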

This is why the following code is invalid:

Rust
let hello = "Здравствуйте";
let answer = &hello[0];

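If this code compiled, the expression &hello[0] would have to return the first byte of the string, which is 208. However, 208 is not a valid character on its own, and it is almost certainly not what the author wanted, so Rust refuses to compile the code at all. To get the first character of the string, we need to use a method like chars().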

Rust
let hello = "Здравствуйте";
let first_char = hello.chars().next().unwrap();

The chars() method returns an iterator over the characters in the string. The next() method returns the next element of the iterator wrapped in an Option, and the unwrap() method extracts that element or panics if the iterator is empty, which happens here when the string itself is empty.
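If the string might be empty, the panic can be avoided by keeping the Option instead of unwrapping it. A minimal sketch, with the helper name first_char chosen just for this example:

```rust
/// Hypothetical helper: the first character of a string, or None if the
/// string is empty. No unwrap(), so it cannot panic.
fn first_char(s: &str) -> Option<char> {
    s.chars().next()
}

fn main() {
    // The caller decides what an empty string should mean.
    match first_char("Здравствуйте") {
        Some(c) => println!("first character: {}", c),
        None => println!("the string is empty"),
    }
    println!("{:?}", first_char(""));
}
```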

It is important to understand the internal representation of Strings in Rust so that you can avoid writing code that may produce unexpected results.

Bytes and Scalar Values and Grapheme Clusters! Oh My!

Bytes, scalar values, and grapheme clusters are three different ways to view a string in Rust.

Bytes are the basic unit of data that computers store. A string in Rust is stored as a vector of bytes, encoded in UTF-8. For example, the Hindi word "नमस्ते" is stored as the following vector of bytes:

[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164, 224, 165, 135]

Scalar values are the next level of abstraction. A Unicode scalar value is a single code point (any code point except the surrogates), and it is what Rust's char type holds. For example, the Hindi word "नमस्ते" is represented by the following scalar values:

['न', 'म', 'स', '्', 'त', 'े']

However, not all scalar values represent a single character. For example, the scalar values '्' and 'े' are diacritics, which do not make sense on their own.
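Both views shown so far can be reproduced with the standard library. In this sketch the constant NAMASTE and the two helper names are made up for illustration; the word is spelled with escapes so that the exact code points are unambiguous.

```rust
/// The word "नमस्ते", written as explicit code points (name is illustrative).
const NAMASTE: &str = "\u{0928}\u{092E}\u{0938}\u{094D}\u{0924}\u{0947}";

/// Hypothetical helper: the byte view of a string.
fn byte_view(s: &str) -> Vec<u8> {
    s.bytes().collect()
}

/// Hypothetical helper: the scalar-value view of a string.
fn scalar_view(s: &str) -> Vec<char> {
    s.chars().collect()
}

fn main() {
    let bytes = byte_view(NAMASTE);
    let scalars = scalar_view(NAMASTE);

    // 18 bytes, because each of these code points takes three bytes.
    println!("{} bytes: {:?}", bytes.len(), bytes);
    // 6 scalar values, two of which are diacritics.
    println!("{} scalar values: {:?}", scalars.len(), scalars);
}
```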

Grapheme clusters are the closest thing to what we would call letters. A grapheme cluster is a sequence of one or more scalar values that represents a single character. For example, the Hindi word "नमस्ते" is represented by the following grapheme clusters:

["न", "म", "स्", "ते"]

Rust provides different ways to iterate over strings, depending on which perspective you need. For example, the .bytes() method iterates over the raw bytes in a string, while the .chars() method iterates over the scalar values. The standard library has no iterator for grapheme clusters; a .graphemes() method is available from external crates such as unicode-segmentation.

The reason Rust does not allow you to index into a string to get a character is that it is not always possible to determine the character at a given position without walking through the string from the start. Characters occupy different numbers of bytes, and what a reader perceives as a single character may consist of several Unicode code points.

For example, the following string contains a single grapheme cluster, but two scalar values:

"a\u{030a}"

The scalar value '\u{030a}' is a combining ring above. This diacritic is applied to the previous character, 'a', to form the grapheme cluster 'å'.

If you were to index into this string at index 1, you would get the scalar value '\u{030a}'. However, this is not a valid character on its own. It is only valid when combined with the previous character, 'a'.
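The counts for this string can be verified with a short sketch. The helper name scalar_count is made up for this example; everything it calls is part of the standard library.

```rust
/// Hypothetical helper: number of Unicode scalar values in a string.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let a_ring = "a\u{030a}"; // 'a' followed by a combining ring above

    // One byte for 'a' plus two bytes for the combining mark.
    println!("bytes: {}", a_ring.len());
    // Two scalar values, even though it displays as one character.
    println!("scalar values: {}", scalar_count(a_ring));
    // The second scalar value is the combining mark on its own.
    println!("{:?}", a_ring.chars().nth(1));
}
```

Neither count matches the single character a reader sees, which is the gap that grapheme-aware iteration is meant to close.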

For this reason, Rust does not allow you to index into a string to get a character. Instead, you iterate over the string with .chars() to get scalar values, or with a grapheme iterator such as .graphemes() from the unicode-segmentation crate to get the characters a reader would recognize.
