
What does the index of a UTF-8 encoding error indicate?

fn main() {
    // 0xED 0x9F 0xBF encodes U+D7FF, the last code point before the surrogate range.
    match String::from_utf8(vec![0xed, 0x9f, 0xbf]) {
        Ok(s) => println!("U+D7FF OK! Get {}", s),
        Err(_) => println!("U+D7FF Fail!"),
    }

    // 0xED 0xA0 0x80 would encode U+D800, a surrogate, which is not valid UTF-8.
    match String::from_utf8(vec![0xed, 0xa0, 0x80]) {
        Ok(s) => println!("U+D800 OK! Get {}", s),
        Err(e) => println!("{}", e),
    }
}

Running this code prints "invalid utf-8 sequence of 1 bytes from index 0". I understand that it's an encoding error, but why does the error say index 0? Shouldn't it be index 1, since the byte at index 0 is the same in both cases?

炸鱼薯条德里克 asked Oct 18 '25 03:10


1 Answer

That's because Rust reports the byte index at which an invalid code point sequence begins, not the index of any specific byte within that sequence. After all, the faulty byte could be the second one, or the first byte could have been corrupted, or the leading byte of the sequence could have gone missing.

Rust doesn't know, and can't know, which of these happened, so it just reports the most convenient position: the first offset at which it couldn't decode a complete code point.
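To see this directly, here is a minimal sketch that reads the two numbers from the error message off the error value itself, using the standard library's Utf8Error::valid_up_to() and Utf8Error::error_len(). The helper name utf8_error_position is made up for illustration, and the extra inputs (a leading b'a', a truncated sequence) are my own additions to the bytes from the question.

```rust
use std::str;

/// Hypothetical helper: returns (valid_up_to, error_len) for invalid input,
/// or None when the bytes are valid UTF-8.
fn utf8_error_position(bytes: &[u8]) -> Option<(usize, Option<usize>)> {
    str::from_utf8(bytes)
        .err()
        .map(|e| (e.valid_up_to(), e.error_len()))
}

fn main() {
    // The bytes from the question: nothing before 0xED is valid, so the index is 0.
    println!("{:?}", utf8_error_position(&[0xed, 0xa0, 0x80]));

    // One valid byte in front: the valid prefix is now 1 byte long, so the index is 1.
    println!("{:?}", utf8_error_position(&[b'a', 0xed, 0xa0, 0x80]));

    // A sequence that is merely cut short: same index, but no error length,
    // because the input ended before the decoder could rule the sequence out.
    println!("{:?}", utf8_error_position(&[0xed, 0x9f]));

    // The valid encoding of U+D7FF produces no error at all.
    println!("{:?}", utf8_error_position(&[0xed, 0x9f, 0xbf]));
}
```

In other words, the index is the length of the longest prefix that is known to be valid UTF-8, which is why it points at 0xED rather than at 0xA0. If you want to skip past the bad bytes and keep decoding, valid_up_to() plus error_len() gives the offset to resume from.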

DK. answered Oct 20 '25 19:10


