What does the index of a UTF-8 encoding error indicate?

Question

fn main() {
    // U+D7FF is the last code point before the surrogate range, so this decodes.
    match String::from_utf8(vec![0xed, 0x9f, 0xbf]) {
        Ok(s) => println!("U+D7FF OK! Get {}", s),
        Err(_) => println!("U+D7FF Fail!"),
    }

    // U+D800 is a surrogate, so its three-byte form is not valid UTF-8.
    match String::from_utf8(vec![0xed, 0xa0, 0x80]) {
        Ok(s) => println!("U+D800 OK! Get {}", s),
        Err(e) => println!("{}", e),
    }
}

For the second input, this code prints "invalid utf-8 sequence of 1 bytes from index 0". I understand that it's an encoding error, but why does the error say index 0? Shouldn't it be index 1, since the byte at index 0 (0xed) is the same in both cases?

DK. · Accepted Answer

That's because Rust reports the byte index at which the invalid sequence begins, not the specific byte within that sequence that is at fault. After all, the error could be in the second byte, or the first byte could have been corrupted, or the leading byte of the sequence could have gone missing.

Rust doesn't, and can't, know which, so it reports the most convenient position: the first offset at which it couldn't decode a complete code point. Put another way, the index is the length of the prefix that is known to be valid UTF-8.
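
To make this concrete, here is a small sketch that reads the two numbers from the error message directly off the error value. It assumes the standard library's std::str::Utf8Error, whose valid_up_to() and error_len() methods supply the index and the byte count respectively. The helper name error_position is made up for this illustration.

```rust
use std::str;

/// Returns `(valid_up_to, error_len)` for invalid input, or `None` if
/// `bytes` is valid UTF-8. `error_position` is a hypothetical helper name
/// used only for this illustration.
fn error_position(bytes: &[u8]) -> Option<(usize, Option<usize>)> {
    match str::from_utf8(bytes) {
        Ok(_) => None,
        Err(e) => Some((e.valid_up_to(), e.error_len())),
    }
}

fn main() {
    // The input from the question: no byte before the bad sequence is
    // valid, so the reported index is 0.
    println!("{:?}", error_position(&[0xed, 0xa0, 0x80])); // Some((0, Some(1)))

    // With two valid ASCII bytes in front, the index moves to 2.
    println!("{:?}", error_position(&[b'a', b'b', 0xed, 0xa0, 0x80])); // Some((2, Some(1)))

    // Truncated input: the sequence is incomplete rather than wrong,
    // so there is no error length to report.
    println!("{:?}", error_position(&[0xed, 0x9f])); // Some((0, None))

    // The complete encoding of U+D7FF is valid.
    println!("{:?}", error_position(&[0xed, 0x9f, 0xbf])); // None
}
```

In each failing case the index marks where decoding stopped making progress, and everything before it can safely be treated as a valid string.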

Tags: unicode, utf-8, rust

炸鱼薯条德里克
