Several things going on here:
- concurrency is very hard
- .. but object storage "solves" most of that for you, handing you a set of semantics which work reliably
- single file throughput sucks hilariously badly
- .. because 1Gb is ridiculously large for an atomic unit
- (this whole thing resembles a project I did a decade ago for transactional consistency on TFAT on Flash, except that somehow managed faster commit times despite running on a 400Mhz MIPS CPU. Edit: maybe I should try to remember how that worked and write it up for HN)
- therefore, all of the actual work is shifted to the broker. The broker is just periodically committing its state in case it crashes
- it's not clear whether the broker ACKs requests before they're in durable storage? Is it possible to lose requests in flight anyway?
- there's a great design for a message queue system between multiple nodes that aims for at least once delivery, and has existed for decades, while maintaining high throughput: SMTP. Actually, there's a whole bunch of message queue systems?
> The broker runs a single group commit loop on behalf of all clients, so no one contends for the object. Critically, it doesn't acknowledge a write until the group commit has landed in object storage. No client moves on until its data is durably committed.
Yea, the group commit is the real insight here.
I read this blog post and to help wrap my head around it I put together a simple TCP-based KV store with group commit, helped make it click for me.
AFAIK you can kinda "seek" reads in S3 using a range header, WCGW? =D
You can, and it's actually great if you store little "headers" etc to tell you those offsets. Their design doesn't seem super amenable to it because it appears to be one file, but this is why a system that actually intends to scale would break things up. You then cache these headers and, on cache hit, you know "the thing I want is in that chunk of the file, grab it". Throw in bloom filters and now you have a query engine.
Works great for Parquet.
Yep! Other than random reads (~p99=200ms on larger ranges), it's essential to get good download performance of a single file. A single (range) request can "only" drive ~500 MB/s, so you need multiple offsets.
Amazon S3 Select enables SQL queries directly on CSV, JSON, or Apache Parquet objects, allowing retrieval of filtered data subsets to reduce latency and costs
S3 Select is, very sadly, deprecated. It also supported HTTP RANGE headers! But they've killed it and I'll never forgive them :)
Still, it's nbd. You can cache a billion Parquet header/footers on disk/ memory and get 90% of the performance (or better tbh).
Wow I didn't know that. To be fair now that S3 tables exists it is rather redundant.