Just stumbled across "Why we didn't rewrite our feed handler in Rust" from Databento.
They say they cannot share a buffer across loop iterations. What would be the Rust way to fix this without making copies of the data? I made a minimal example:
struct Source {
    id: usize,
    data: Vec<u8>,
}
impl Source {
    fn fetch_data(&self) -> Vec<u8> {
        println!("Fetching data from source {}", self.id);
        self.data.clone()
    }
}
const SPLITTER: u8 = b',';
fn process_data(chunks: &[&[u8]]) {
    println!("Processing {} chunks:", chunks.len());
    for (i, chunk) in chunks.iter().enumerate() {
        println!(" Chunk {}: {:?}", i, String::from_utf8_lossy(chunk));
    }
}
fn process_sources(sources: Vec<Source>) {
    let mut buffer: Vec<&[u8]> = Vec::new(); // allocate once
    for source in sources {
        let data: Vec<u8> = source.fetch_data();
        buffer.extend(data.split(|b| *b == SPLITTER));
        process_data(&buffer);
        buffer.clear();
    }
}
fn main() {
    let sources = vec![
        Source {
            id: 1,
            data: b"hello,world,this,is,source1".to_vec(),
        },
        Source {
            id: 2,
            data: b"foo,bar,baz".to_vec(),
        },
    ];
    process_sources(sources);
}
The error is:
error[E0597]: `data` does not live long enough
  --> src/main.rs:27:23
   |
   |         let data: Vec<u8> = source.fetch_data();
   |             ---- binding `data` declared here
   |         buffer.extend(data.split(|b| *b == SPLITTER));
   |         ------        ^^^^ borrowed value does not live long enough
   |         |
   |         borrow later used here
...
   |     }
   |     - `data` dropped here while still borrowed
The most idiomatic approach would be to avoid the allocation altogether:
fn process_data<'a, C>(chunks: C)
where
    C: IntoIterator<Item = &'a [u8]>,
{
    let mut n = 0;
    for (i, chunk) in chunks.into_iter().enumerate() {
        println!(" Chunk {}: {:?}", i, String::from_utf8_lossy(chunk));
        n += 1;
    }
    println!("Processed {n} chunks.");
}
fn process_sources(sources: Vec<Source>) {
    for source in sources {
        let data: Vec<u8> = source.fetch_data();
        process_data(data.split(|b| *b == SPLITTER));
    }
}
This doesn't allocate for the chunks at all and avoids looping over the chunks twice, but it doesn't tell us ahead of time how many chunks there are in total.
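If the total is needed up front, one option is to count the separators in a first pass over the raw bytes, which still allocates nothing. The following is only a sketch: `chunk_count` is a hypothetical helper that is not part of the code above, and it relies on `split` always yielding one more chunk than there are separators.

```rust
const SPLITTER: u8 = b',';

/// Hypothetical helper: number of chunks that
/// `data.split(|b| *b == SPLITTER)` would yield, computed without allocating.
fn chunk_count(data: &[u8]) -> usize {
    // `split` yields one chunk per separator, plus one for the remainder
    // (even when the input is empty).
    data.iter().filter(|&&b| b == SPLITTER).count() + 1
}

fn main() {
    let data = b"foo,bar,baz".to_vec();
    println!("Expecting {} chunks.", chunk_count(&data));
}
```

This trades the second pass over the chunks for a second pass over the bytes, so whether it is worthwhile depends on how the count is used.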
You can also convince the borrow checker that what you're doing is OK. The unsafe escape hatch is there to be used when necessary. Here it is additionally wrapped in a safe wrapper, which keeps you from accidentally leaking a reference, as could happen if you used data.split(…).map(|x| unsafe { &*(x as *const _) }) directly:
use buffer::Buffer;
mod buffer {
    pub struct Buffer<'pseudo, T: ?Sized> {
        data: Vec<&'pseudo T>,
    }
    impl<'pseudo, T: ?Sized> Buffer<'pseudo, T> {
        pub fn new() -> Self {
            Self { data: Vec::new() }
        }
        pub fn set<'a>(
            &'a mut self,
            items: impl IntoIterator<Item = &'a T>,
        ) -> impl std::ops::Deref<Target = [&'a T]>
        where
            T: 'a,
            'pseudo: 'a,
        {
            // make sure there are no remaining items with a different actual lifetime in the buffer
            self.data.clear();
            self.data.extend(items.into_iter().map(|x| {
                // SAFETY:
                // - `BufferView` only hands out references with the correct lifetime `'a`
                // - `BufferView` is the only way to retrieve such a reference
                // - We `clear` our internal buffer so we cannot possibly have
                //   items with two different lifetimes in the buffer at the same time
                unsafe { &*(x as *const _) }
            }));
            BufferView {
                buffer: self,
            }
        }
    }
    struct BufferView<'a, 'pseudo, T: ?Sized> {
        buffer: &'a mut Buffer<'pseudo, T>,
    }
    impl<'a, 'pseudo, T: ?Sized> std::ops::Deref for BufferView<'a, 'pseudo, T>
    where
        T: 'a,
        'pseudo: 'a,
    {
        type Target = [&'a T];
        fn deref(&self) -> &[&'a T] {
            &self.buffer.data
        }
    }
}
fn process_sources(sources: Vec<Source>) {
    let mut buffer = Buffer::new(); // allocate O(log(max(n))) times
    for source in sources {
        let data: Vec<u8> = source.fetch_data();
        let buffer = buffer.set(data.split(|b| *b == SPLITTER));
        process_data(&buffer);
    }
}
I didn't come up with this myself (the original source was probably this blog post by David Lattimore, which I found via the standard library proposal linked at the bottom of this entry), but if the Vec is empty, you can change the lifetime in its element type by iterating over the Vec and collecting its (nonexistent) elements into a Vec of the new type:
for source in sources {
    let data: Vec<u8> = source.fetch_data();
    buffer.extend(data.split(|b| *b == SPLITTER));
    process_data(&buffer);
    buffer.clear();
    buffer = buffer.into_iter().map(|_| unreachable!()).collect();
}
You're looping over an empty vector, so the body of the map doesn't matter. Conceptually, this deallocates the allocation and creates a new one; but Rust has an optimisation to reuse a Vec's buffer when you map and collect a Vec and the old and new types have the same size and alignment (which is true in this case: a slice with one lifetime has the same size and alignment as a slice with the same element type but a different lifetime). So the whole reassignment to the buffer optimises out and the memory allocator isn't involved at all.
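The same idea can be packaged in a small helper so that the intent is visible at the call site. This is a sketch rather than a definitive implementation: `recycle` and `chunk_counts` are hypothetical names, and this variant uses two variables (`spare` and `buffer`) so that each can have its own lifetime. Whether the allocation is actually reused depends on the standard library's in-place collect optimisation, which is not a documented guarantee.

```rust
const SPLITTER: u8 = b',';

/// Hypothetical helper (not part of std): empties `v` and returns a Vec of
/// references with an unrelated lifetime.
fn recycle<'a, 'b, T: ?Sized>(mut v: Vec<&'a T>) -> Vec<&'b T> {
    v.clear();
    // The Vec is empty, so the closure is never called.
    v.into_iter().map(|_| unreachable!()).collect()
}

/// Splits each input on `SPLITTER` and returns the chunk count per input,
/// passing one buffer from iteration to iteration.
fn chunk_counts(inputs: &[Vec<u8>]) -> Vec<usize> {
    let mut spare: Vec<&[u8]> = Vec::new();
    let mut counts = Vec::new();
    for input in inputs {
        let data: Vec<u8> = input.clone(); // stands in for `fetch_data()`
        let mut buffer: Vec<&[u8]> = recycle(spare);
        buffer.extend(data.split(|b| *b == SPLITTER));
        counts.push(buffer.len());
        // Hand the (emptied) buffer back before `data` goes out of scope.
        spare = recycle(buffer);
    }
    counts
}

fn main() {
    let inputs = vec![b"hello,world".to_vec(), b"foo,bar,baz".to_vec()];
    println!("{:?}", chunk_counts(&inputs));
}
```

Because `recycle` declares two independent lifetimes, `spare` never holds a borrow of `data`, which is what lets the loop pass the borrow checker.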
This technique is unfortunately quite fragile – you have to know that the buffer-reuse optimisation exists for this code to make any sense – so it would probably be preferable for Rust to have an explicit "reuse this vector's allocation for a new type" method on Vec. But as of the time you asked the question, Rust didn't support that in safe code.
With unsafe code, you can just take the allocation from one Vec and place it in another, but if you get that wrong, e.g. by not emptying the Vec first, or by using it with a type with different alignment, or by giving the wrong capacity figure, you end up with memory safety bugs. So a safe version would be much preferable. As such, there's been recent movement towards adding this as a standard library method: there's now an accepted proposal to add it to the standard library as a safe method on Vec. It probably won't become stable for a while (there are design issues that need to be worked out), but it'll likely become possible in nightly Rust fairly soon.
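For comparison, the unsafe approach described above might look roughly like the following. This is a sketch under stated assumptions, not the proposed standard library method: `reuse_allocation` is a hypothetical name, and the function is restricted to references that differ only in lifetime so that size and alignment cannot differ.

```rust
use std::mem::ManuallyDrop;

/// Hypothetical helper: moves the allocation of `v` into a Vec whose element
/// type differs only in the lifetime of the references.
fn reuse_allocation<'a, 'b, T: ?Sized>(mut v: Vec<&'a T>) -> Vec<&'b T> {
    // Empty the Vec first so no reference with the old lifetime survives.
    v.clear();
    // Prevent the old Vec from freeing the allocation we are about to reuse.
    let mut v = ManuallyDrop::new(v);
    let ptr = v.as_mut_ptr().cast::<&'b T>();
    let cap = v.capacity();
    // SAFETY:
    // - `ptr` and `cap` come from a Vec using the global allocator
    // - `&'a T` and `&'b T` have the same size and alignment
    // - the length is 0, so no element is ever read with the wrong lifetime
    unsafe { Vec::from_raw_parts(ptr, 0, cap) }
}

fn main() {
    let mut v: Vec<&[u8]> = Vec::with_capacity(16);
    v.push(b"abc");
    let recycled: Vec<&[u8]> = reuse_allocation(v);
    println!("len = {}, capacity = {}", recycled.len(), recycled.capacity());
}
```

Each of the three safety conditions corresponds to one of the mistakes listed above, which is why a safe method that enforces them would be easier to trust.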