anigbrowl 2 hours ago

> Wes McKinney

Sold

For those who don't know: Wes McKinney is the creator of Pandas, the go-to tabular analysis library for Python. That gives his format widespread buy-in from the outset, as well as a couple of decades of Caring About The Problem, which makes his insights unusually valuable.

  • snthpy 7 minutes ago

    I'm a big fan of Wes' work and Pandas was incredibly influential. Technically, however, it wasn't his best work. In terms of selling points, I think the Arrow data format is much better and more influential on the data ecosystem as a whole; see DataFusion, etc.

    That said, now let me see what F3 is actually about (and yes, your comment is what made me want to click through to the link) ...

  • geodel 8 minutes ago

    Also creator of Apache Arrow. A core component of modern data analytics.

  • wodenokoto 8 minutes ago

    His work on Parquet probably stands out as a better appeal to authority.

  • nialse 26 minutes ago

    Mixing data and code is a classic security mistake. Having one somewhat known individual involved doesn’t magically make it less of a mistake.

moelf 3 hours ago

Not in physicists' future! /jk. Exabytes of data produced over the next two decades at the Large Hadron Collider will be stored in a format home-grown at CERN: https://cds.cern.ch/record/2923186/

theamk 2 hours ago

this is messed up

> The decoding performance slowdown of Wasm is minimal (10–30%) compared to a native implementation.

so... you take a 10-30% performance hit _right away_, and you perpetually give up any opportunities to improve the decoder in the future. And you also give up any advanced decoding functions other than "decode whole block and store into memory".

I have no idea why anyone would do this. If you care about speed, then wasm is not going to cut it. If you don't care about speed, you don't need super-fancy encoding algorithms; just use any of the well-known ones.

  • apavlo an hour ago

    > so... you take 10%-30% performance hit _right away_, and you perpetually give up any opportunities to improve the decoder in the future.

    The WASM is meant as a backup. If you have the native decoder installed (e.g., as a crate), then a system will prefer to use that; otherwise, it falls back to WASM. A 10-30% performance hit is worth it compared to not being able to read a file at all.
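
    Roughly, the reader-side selection might look like this - a minimal sketch with hypothetical types, not the actual F3 API:

      // Hypothetical types; only the fallback logic matters here.
      trait Decoder {
          fn decode(&self, bytes: &[u8]) -> Result<Vec<u8>, String>;
      }

      struct EncUnit { bytes: Vec<u8> }

      fn decode(
          unit: &EncUnit,
          native: Option<&dyn Decoder>, // installed native decoder, if any
          wasm_fallback: &dyn Decoder,  // decoder embedded in the file
      ) -> Result<Vec<u8>, String> {
          match native {
              // Fast path: a native implementation (e.g., a crate) is installed.
              Some(d) => d.decode(&unit.bytes),
              // Slow path: the Wasm module shipped inside the file.
              // ~10-30% slower, but the file stays readable everywhere.
              None => wasm_fallback.decode(&unit.bytes),
          }
      }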

  • xyzzy_plugh 2 hours ago

    I kind of agree with you, but there's more to the picture.

    The situation you describe is kind of already the case with various approaches to compression. For example, perhaps we decide to bitpack instead of using the generic compressor. Or change compressors entirely.

    This sort of thing exists without WASM, and it means you have to "transcode", i.e., rewrite the file after updating your software with the new techniques.

    With WASM, it's the same. You just rewrite the file.

    I do agree that this pushes the costs of iteration up the stack in a vastly less efficient way. Overall this seems way more expensive; it's very unclear that future-proofing is worth it. I've worked with exabyte-scale systems, and re-encoding swaths of data regularly would not be good.

digdugdirk 3 hours ago

Forgive me for not understanding the difference between columnar storage and the alternatives, but why is this so cool? Is the big win that you can send around custom little vector embedding databases with a built-in sandbox?

I get the "step change/foundational building block" vibe from the paper - and the project name itself implies more than a little "je ne sais quoi" - but unfortunately I only understand a few sentences per page. The pictures are pretty though, and the colour choices are tasteful yet bold. Two thumbs up from the easily swayed.

  • gopalv an hour ago

    > Is the big win that you can send around custom little vector embedding databases with a built in sandbox?

    No, this is a compatibility layer for future encoding changes.

    For example, ORCv2 never shipped because we tried to bundle all the new features into a new format version: ship all the writers with the features disabled, then ship all the readers with support, and finally flip the writers to write the new format.

    Specifically, there was a new flipped-bit version of float encoding which sent the exponent, mantissa, and sign as separate integers for maximum compression. This would've been so much easier to ship if I could have shipped a wasm shim with the new file and skipped the year+ wait for all readers to support it.
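
    The idea, as a sketch (not the actual ORC code):

      // Split each f64 into sign / exponent / mantissa streams, which
      // compress much better separately than raw interleaved IEEE-754 bits.
      fn split_floats(values: &[f64]) -> (Vec<u8>, Vec<u16>, Vec<u64>) {
          let (mut signs, mut exps, mut mants) = (vec![], vec![], vec![]);
          for v in values {
              let bits = v.to_bits();
              signs.push((bits >> 63) as u8);           // 1 sign bit
              exps.push(((bits >> 52) & 0x7ff) as u16); // 11 exponent bits
              mants.push(bits & 0xF_FFFF_FFFF_FFFF);    // 52 mantissa bits
          }
          (signs, exps, mants)
      }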

    We'd have made progress with the format, but we'd also be able to deprecate a reader impl in code without losing compatibility if the older files carried their own information.

    Today, something like Spark's variant type would benefit from this - the sub-columnarization it does would be so much easier to ship as bytecode than as an interpreter that contains support for all possible recombinations of the split-up columns.

    PS: having spent a lot of nights tweaking TPC-H with ORC and fixing OOMs in the writer, it warms my heart to see it sort of hold its own in the benchmark

  • bbminner an hour ago

    AFAIK the win of columnar storage comes from the fact that you can very quickly scan an entire column across all rows, making very efficient use of OS buffering etc., so queries like "select a where b = 'x'" are very quick.
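
    A sketch of the difference (hypothetical structs, just to illustrate):

      // Row-oriented: fields of one record are adjacent, so scanning one
      // column drags every byte of every row through the cache.
      struct Row { a: i64, b: i64, c: i64 }

      // Column-oriented: each column is one contiguous array, so
      // "select a where b = x" reads only the b (and a) arrays
      // sequentially - ideal for OS readahead and prefetching.
      struct Columns { a: Vec<i64>, b: Vec<i64>, c: Vec<i64> }

      fn select_a_where_b_eq(cols: &Columns, x: i64) -> Vec<i64> {
          cols.b.iter().zip(&cols.a)
              .filter(|(b, _)| **b == x)
              .map(|(_, a)| *a)
              .collect()
      }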

fallat 3 hours ago

A format requiring a program to decode is nuts. Might as well bundle 7zip with every zip file.

  • coderatlarge an hour ago

    not a bad idea if your average file size is a terabyte and there’s never a sure way to get a clean binary.

esafak 4 hours ago

How do they prevent people from embedding malicious payloads in the WebAssembly?

  • aeonfox 3 hours ago

    By sandboxing:

    > We first discuss the implementation considerations of the input to the Wasm-side Init() API call. The isolated linear memory space of Wasm instance is referred to as guest, while the program’s address space running the Wasm instance is referred to as host. The input to a Wasm instance consists of the contiguous bytes of an EncUnit copied from the host’s memory into the guest’s memory, plus any additional runtime options.

    > Although research has shown the importance of minimizing the number of memory copies in analytical workloads, we consider the memory copy while passing input to Wasm decoders hard to avoid for several reasons. First, the sandboxed linear memory restricts the guest to accessing only its own memory. Prior work has modified Wasm runtimes to allow access to host memory for reduced copying, but such changes compromise Wasm’s security guarantees
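
    Concretely, the host-side copy looks something like this with the wasmtime crate (the "alloc"/"init" export names are my assumption, not pinned down by the paper):

      use wasmtime::{Engine, Linker, Module, Store};

      fn run_decoder(wasm: &[u8], enc_unit: &[u8]) -> wasmtime::Result<()> {
          let engine = Engine::default();
          let module = Module::new(&engine, wasm)?;
          let mut store = Store::new(&engine, ());
          let instance = Linker::new(&engine).instantiate(&mut store, &module)?;

          // Guest exports its linear memory plus (assumed) alloc/init functions.
          let memory = instance.get_memory(&mut store, "memory").unwrap();
          let alloc = instance.get_typed_func::<i32, i32>(&mut store, "alloc")?;
          let init = instance.get_typed_func::<(i32, i32), i32>(&mut store, "init")?;

          // The unavoidable copy: EncUnit bytes go from host memory into
          // the sandboxed guest memory before the decoder can touch them.
          let ptr = alloc.call(&mut store, enc_unit.len() as i32)?;
          memory.write(&mut store, ptr as usize, enc_unit)?;
          init.call(&mut store, (ptr, enc_unit.len() as i32))?;
          Ok(())
      }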

    • flockonus 22 minutes ago

      For one, sandboxing can't solve the halting problem.

    • theamk 2 hours ago

      It's "the future work". But possibly, allowlists might help:

      > Security concerns. Despite the sandbox design of Wasm, it still has vulnerabilities, especially with the discovery of new attack techniques. [...] We believe there are opportunities for future work to improve Wasm security. One approach is for creators of Wasm decoding kernels to register their Wasm modules in a central repository to get the Wasm modules verified and tamper-resistant.

tomnicholas1 4 hours ago

The pitch for this sounds very similar to the pitch for Vortex (i.e. obviating the need to create a new format every time a shift occurs in data processing and computing by providing a data organization structure and a general-purpose API to allow developers to add new encoding schemes easily).

But I'm not totally clear what the relationship between F3 and Vortex is. It says their prototype uses the encoding implementation in Vortex, but does not use the Vortex type system?

  • apavlo 3 hours ago

    The backstory is complicated. The plan was to establish a consortium between CMU, Tsinghua, Meta, CWI, VoltronData, Nvidia, and SpiralDB to unify behind a single file format. But that fell through after CMU's lawyers freaked out over the NDA Meta required for access to a preview of Velox Nimble. IANAL, but Meta's NDA seemed reasonable to me. So the plan collapsed after about a year, and then everyone released their own format:

    → Meta's Nimble: https://github.com/facebookincubator/nimble

    → CWI's FastLanes: https://github.com/cwida/FastLanes

    → SpiralDB's Vortex: https://vortex.dev

    → CMU + Tsinghua F3: https://github.com/future-file-format/f3

    On the research side, we (CMU + Tsinghua) weren't interested in developing new encoders and instead wanted to focus on the WASM embedding part. The original idea came as a suggestion from Hannes@DuckDB to Wes McKinney (a co-author with us). We just used Vortex's implementations since they were in Rust, and with some tweaks we could get most of them to compile to WASM. Vortex is orthogonal to the F3 project and has the engineering energy necessary to support it. F3 is an academic prototype right now.

    I note that the Germans also released their own file format this year that also uses WASM. But they WASM-ify the entire file and not individual column groups:

    → Germans: https://github.com/AnyBlox

    • rancar2 2 hours ago

      Andrew, it's always great to read the background from the author on how (and even why!) this all played out. This comment is incredibly helpful for understanding the context in which all these formats were born.

    • digdugdirk 2 hours ago

      ... Are you saying that there are 5 competing "universal" file format projects, each with different, non-compatible approaches? Is this a laughing/crying thing, or a "lots of interesting paths to explore" thing?

      Also, back on topic - is your file format encryptable via that WASM embedding?

lifthrasiir 4 hours ago

This seems to be one of the first file formats that embeds WebAssembly modules. Is there any other prior work? I'm specifically interested from the compression perspective, as a well-chosen WebAssembly preprocessor can greatly boost the compression ratio. (In fact, I'm currently working on such a file format, just in case.)

binary132 an hour ago

This feels like one of those ideas that seems like a good idea during a late-night LLM brainstorming sesh but not so good when you come back to it with a fresh brain in the morning.

  • edoceo an hour ago

    LLM = Lager, lager, mead

fijiaarone 3 hours ago

Nothing screams "future proof" like WASM.

  • DecoPerson 2 hours ago

    Old websites still run very well.

    Is there any reason to believe that a major new browser tech, WASM, will ever have support dropped for its early versions?

    Even if the old versions are not as loved as the new ones (i.e., engines optimized for them and immediately ready to execute), emulation methods work wonders and could easily be downloaded on-demand by browsers needing to run "old WASM".

    I'm quite optimistic about the forward compatibility proposed here.

    • xyzzy_plugh 2 hours ago

      Sure. Just recently Google was pushing to remove XSLT in Chrome. Flash is a bit of a different beast but it died long ago.

      However I don't think it matters too much. Here WASM is not targeting the browser, and there are many more runtimes for WASM, in many diverse languages, and they outnumber browsers significantly. It won't die easily.

bound008 an hour ago

The irony of being a PDF file

trhway 3 hours ago

the embedded decoder may as well be closing in on a full-blown SQL execution engine (as data size trumps executable size these days). Push that execution onto memory chips and into SSD/HDD controllers. I think something similar is happening in filesystem development too, where instead of straight access to raw data you may access some execution API over the data. A modern take on the IBM mainframe's filesystem database.

1oooqooq 3 hours ago

data file format for the future [pdf]

:)

DenisM 4 hours ago

tldr

The proliferation of open-source file formats (i.e., Parquet, ORC) allows seamless data sharing across disparate platforms. However, these formats were created over a decade ago for hardware and workload environments that are much different from today's.

Each self-describing F3 file includes both the data and meta-data, as well as WebAssembly (Wasm) binaries to decode the data. Embedding the decoders in each file requires minimal storage (kilobytes) and ensures compatibility on any platform in case native decoders are unavailable.
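
My mental model of the layout (a sketch, not the exact spec):

  // Sketch of a self-describing file: data, metadata, and the Wasm
  // decoders needed to read the data all travel together.
  struct F3File {
      enc_units: Vec<EncUnit>,      // the encoded column data
      wasm_decoders: Vec<Vec<u8>>,  // embedded Wasm binaries (~KBs each)
      metadata: Metadata,           // schema + which decoder each unit needs
  }

  struct EncUnit {
      decoder_idx: usize, // index into wasm_decoders
      bytes: Vec<u8>,
  }

  struct Metadata { /* schema, stats, offsets, ... */ }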

  • NegativeK 3 hours ago

    I know sandboxes for wasm are very advanced, but decades of file formats with built-in scripting are flashing the danger lights at me.

  • kijin 3 hours ago

    Is that really necessary, though? Data files are useless without a program that knows how to utilize the data. Said program should already know how to decode data on the platform it's running on.

    And if you're really working on an obscure platform, implementing a decoder for a file format is probably easier than implementing a full-blown wasm runtime for that platform.

    • magicalhippo 3 hours ago

      > Data files are useless without a program that knows how to utilize the data.

      As I see it, the point is that the exact details of how the bits are encoded are not really interesting from the perspective of the program reading the data.

      Consider a program that reads CSV files and processes the data in them. First column contains a timestamp, second column contains a filename, third column contains a size.

      As long as there's a well-defined interface that the program can use to extract rows from a file, where each row contains one or more columns of data values and those data values have the correct data type, then the program doesn't really care about this coming from a CSV file. It could just as easily be a 7zip-compressed JSON file, or something else entirely.

      Now, granted, this file format isn't well-suited as a generic format. After all, the decoding API they specify returns data as Apache Arrow arrays, which probably isn't a fit for all uses.
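
      Something like this hypothetical trait, just to illustrate the indirection:

        // The program codes against row access, not a concrete format;
        // CSV, 7zip-compressed JSON, or an F3 file can all sit behind it.
        enum Value { Timestamp(i64), Name(String), Size(u64) }

        trait TableSource {
            fn next_row(&mut self) -> Option<Vec<Value>>;
        }

        fn total_size(src: &mut dyn TableSource) -> u64 {
            let mut total = 0;
            while let Some(row) = src.next_row() {
                // Third column is the size, per the CSV example above.
                if let Some(Value::Size(n)) = row.into_iter().nth(2) {
                    total += n;
                }
            }
            total
        }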

      • mbreese 2 hours ago

        I think the counterargument here is that you're now including a CSV decoder in every CSV data file. At the data sizes we're talking about, this is negligible overhead, but it seems overly complicated to me. Almost like it's trying too hard to be clever.

        How many different storage format implementations will there realistically be?