Things do not always work out. I'm working for a way out.
Published Sep 09, 2024
Apache Arrow是一个高效的列式数据的内存表示,本文所指Arrow是arrow-rs,也就是Arrow的Rust实现,本文介绍其StringViewArray与StringArray的不同,这也是Rust版本才有的特性。
StringArray是arrow的一种基本列实例、是一列字符串数据,一个StringArray主要有
nullmap,指示某一行是否为nulloffset数组,指示某一行的字符串在Buffer上的位移StringViewArray中同样也有nullmap,但是却不使用简单的offset,而是用一个StringView来指示字符串。首先,一个StringView的大小为16字节,前4个字节用来表示字符串长度。剩余的12字节分两种情况,
StringViewArray的buffer的编号;偏移即在该buffer中第几个字节开始为该字符串数据;前缀存放了字符串前4个字节的编码数据。
好处显而易见,如果一个Array中的字符串有大量重复,那么StringView就可以简单指向Buffer中的同一段数据,大大减少内存用量。如下面的示例,对于一个Array:
“Apache StringView”
“Apache DataFusion”
“Apache StringView”
“InfluxDB”
StringViewArray能够重复利用此前出现过的字符串,相比StringArray,Buffer减少了17字节(另外这里可以看出在数据量小的情况,因为StringView占用的内存大于一般的OffsetSize,如i32, i64,StringView的好处较难体现)

arrow提供了StringViewArrayBuilder来构建一个Array,而关键函数就是append_value(&mut self, value: impl AsRef<T::Native>)。
pub fn append_value(&mut self, value: impl AsRef<T::Native>) {
let v: &[u8] = value.as_ref().as_ref();
let length: u32 = v.len().try_into().unwrap();
if length <= 12 {
let mut view_buffer = [0; 16];
view_buffer[0..4].copy_from_slice(&length.to_le_bytes());
view_buffer[4..4 + v.len()].copy_from_slice(v);
self.views_builder.append(u128::from_le_bytes(view_buffer));
self.null_buffer_builder.append_non_null();
return;
}
// Deduplication if:
// (1) deduplication is enabled.
// (2) len > 12
if let Some((mut ht, hasher)) = self.string_tracker.take() {
let hash_val = hasher.hash_one(v);
let hasher_fn = |v: &_| hasher.hash_one(v);
let entry = ht.entry(
hash_val,
|idx| {
let stored_value = self.get_value(*idx);
v == stored_value
},
hasher_fn,
);
match entry {
Entry::Occupied(occupied) => {
// If the string already exists, we will directly use the view
let idx = occupied.get();
self.views_builder
.append(self.views_builder.as_slice()[*idx]);
self.null_buffer_builder.append_non_null();
self.string_tracker = Some((ht, hasher));
return;
}
Entry::Vacant(vacant) => {
// o.w. we insert the (string hash -> view index)
// the idx is current length of views_builder, as we are inserting a new view
vacant.insert(self.views_builder.len());
}
}
self.string_tracker = Some((ht, hasher));
}
let required_cap = self.in_progress.len() + v.len();
if self.in_progress.capacity() < required_cap {
self.flush_in_progress();
let to_reserve = v.len().max(self.block_size.next_size() as usize);
self.in_progress.reserve(to_reserve);
};
let offset = self.in_progress.len() as u32;
self.in_progress.extend_from_slice(v);
let view = ByteView {
length,
prefix: u32::from_le_bytes(v[0..4].try_into().unwrap()),
buffer_index: self.completed.len() as u32,
offset,
};
self.views_builder.append(view.into());
self.null_buffer_builder.append_non_null();
}
大致逻辑如下:
views_builder加入views中当然,这是从零构建一个StringViewArray的方法,我们还可以直接通过generic_view_array::new_unchecked方法来不安全地直接构造一个Array,具体见下文。
substr(str, start, [count])可以帮助我们获得字符串的指定字串,因为substr的结果一定是原字符串的子集,那么如果输入的DataType是StringViewArray的话,我们就可以这样操作:
StringView, 只需要根据通过计算得到的sub_string的index构建view并载入views_buffer中这样做的好处是:
StringViewArray带来了节约内存的好处generic_array_builder一次次调用append_value更高效。虽然buffer可能暂时会有无用的部份,但是GC可以后续把无用的部份回收掉。下面列举一个示例函数用于手动收集view和构造nullmap以便在new_unchecked中使用:
/// Make a `u128` based on the given substr, start(offset to view.offset), and
/// push into to the given buffers
fn make_and_append_view(
views_buffer: &mut Vec<u128>,
null_builder: &mut NullBufferBuilder,
raw: &u128,
substr: &str,
start: u32,
) {
let substr_len = substr.len();
let sub_view = if substr_len > 12 {
let view = ByteView::from(*raw);
make_view(substr.as_bytes(), view.buffer_index, view.offset + start)
} else {
// inline value does not need block id or offset
make_view(substr.as_bytes(), 0, 0)
};
views_buffer.push(sub_view);
null_builder.append_non_null();
}
// Somewhere else:
...
let views_buf = ScalarBuffer::from(views_buf);
let nulls_buf = null_builder.finish();
// Safety:
// (1) The blocks of the given views are all provided
// (2) Each of the range `view.offset+start..end` of view in views_buf is within
// the bounds of each of the blocks
unsafe {
let array = StringViewArray::new_unchecked( // 在这里直接构造一个StringViewArray
views_buf,
string_view_array.data_buffers().to_vec(),
nulls_buf,
);
Ok(Arc::new(array) as ArrayRef)
}
...
完整的优化代码可以参考这里。