I really don’t like the highly hierarchical format, that there’s a “..meta” and a “…meta” somewhere else. I can imagine we want to annotate the whole diff, each file and each chunk. That’s a total of 3 levels of depth. Let’s just give them distinct names and not go full yaml with a format for once?
This helps with readability (if one of the “meta” blocks is missing, for example, I could still tell at a glance what it refers to without counting dots), and is less error prone (it makes little sense to me that the metadata associated with a whole diff should have the same fields as the metadata of a file).
Furthermore, why do we have two formats, JSON and key=value pairs? Is there any reason not to use just one, given that the number of things we’d want to annotate sounds quite small? Having a single structure makes it much easier to write parsers or integrate with existing tooling (grep, sed or jq, but not both at once).
Other notes:
- please allow trailing commas in lists
- diffs are inherently splittable. I can grab half of a diff and apply it. How does your format influence that? I guess it breaks because I would need to copy the preamble, then skip 20 lines, then copy the block I need?
- revisions are a file property? Not a commit checksum? (I might just be dumb here)
In the early drafts, we played with a number of approaches for the structure. Things like "commit-meta", etc. In the end, we broke it down into `#<section_level><section_type>`, just to simplify the parsing requirements. Every meta block is a meta block, and knowing what section level you're supposed to be in and comparing it to the section level you get becomes a matter of "count the dots".
The header formats are meant to be very simple key/value pairs that are known by the parser, and not free-form bits of metadata. That's what the "meta" blocks are for. The parsing rules for the header are intentionally very simple.
JSON was chosen after a lot of discussion between us and outside parties and after experimentation with other grammars. The header for a meta block can specify a format used to serialize the data, in case down the road something supplants JSON in a meaningful way. We didn't want to box ourselves in, but we also don't want to just let any format sit in there (as that brings us back to the format compatibility headaches we face today).
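To make the "count the dots" rule concrete, here is a minimal sketch of a header parser. It is inferred only from the `#..meta: format=json, length=270` line quoted in this thread, so the regular expression and the function name `parse_section_header` are assumptions, not the actual DiffX grammar.

```python
import re

# Hypothetical pattern inferred from "#..meta: format=json, length=270".
# The real DiffX grammar may allow other characters or spacing.
HEADER_RE = re.compile(r"^#(?P<dots>\.*)(?P<type>[a-z]+):\s*(?P<opts>.*)$")


def parse_section_header(line):
    """Split a section header into (level, section_type, options)."""
    match = HEADER_RE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"not a section header: {line!r}")

    options = {}
    opts = match.group("opts").strip()
    if opts:
        for pair in opts.split(","):
            key, sep, value = pair.strip().partition("=")
            if not sep or not key:
                raise ValueError(f"malformed option: {pair!r}")
            options[key] = value  # values stay strings; callers convert

    # "Count the dots": the number of dots is the nesting level.
    return len(match.group("dots")), match.group("type"), options


level, section_type, options = parse_section_header(
    "#..meta: format=json, length=270"
)
```

Option values are left as strings here, so a caller would convert `length` to an integer before using it.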
For the other notes:
1. Compatibility is a key factor here, so we'd want to go with base-level JSON. I'd prefer being able to have trailing commas in lists, but not enough to make life hard for someone implementing this without access to a JSON5 parser.
2. If your goal is to simply feed to GNU patch (or similar), you can still split it. This extra data is in the Unified Diff "garbage" areas, so it'll be ignored anyway (so long as it doesn't conflict, and we take care to ensure that in our recommendations on encoding).
If your goal is to split into two DiffX files, it does become more complicated in that you'd need to re-add the leading headers.
That said, not all diff formats used in the wild can be split and still retain all metadata. Mercurial diffs, for example, have a header that must be present at the top to indicate parent commit information. You can remove that and still feed to GNU patch, but Mercurial (or tools supporting the format) will no longer have the information on the parent commit.
3. Revisions depend heavily on the SCM. Some SCMs use a commit identifier. Some use per-file identifiers. Some use a combination of the two. Some use those plus additional information that either gets injected into the diff or needs to be known out-of-band. There's a wide variety of requirements here across the SCM landscape.
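On point 1, a quick illustration of what "base-level JSON" means in practice. This sketch uses only Python's standard `json` module, which follows strict JSON and rejects a trailing comma in a list. The sample documents and the helper name `is_strict_json` are made up for illustration.

```python
import json

strict = '{"files": ["a.c", "b.c"]}'
relaxed = '{"files": ["a.c", "b.c",]}'  # trailing comma: fine in JSON5, not in JSON


def is_strict_json(text):
    """Return True if the standard-library parser accepts the text."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


strict_ok = is_strict_json(strict)
relaxed_ok = is_strict_json(relaxed)
```

A writer that only ever emits strict JSON stays readable by both kinds of parser, which is the compatibility argument being made above.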
> The header formats are meant to be very simple key/value pairs that are known by the parser, and not free-form bits of metadata. That's what the "meta" blocks are for.
One more thing you should prepare for whenever you have "free-form bits of metadata". They somehow turn into: "some user was storing 100MB blobs in there, and that broke our other thing".
> 1. Compatibility is a key factor here, so we'd want to go with base-level JSON. I'd prefer being able to have trailing commas in lists, but not enough to make life hard for someone implementing this without access to a JSON5 parser.
This is what I was referring to. This is not json:
> #..meta: format=json, length=270
> The header formats are meant to be very simple key/value pairs that are known by the parser, and not free-form bits of metadata. That's what the "meta" blocks are for. The parsing rules for the header are intentionally very simple.
Exactly my point. That level of flexibility for a .patch format to support another language embedded in it is overwhelming. Keep in mind that you are proposing a textual format, not a binary format. So people will use 3rd party text parsing tools to play with it. And having 2 distinct languages in there makes that annoying and a pain.
How do they reasonably work around that though? If they want the ability to move away from JSON, you have to know that it is JSON before trying to parse it. And then you need to know how much data to read. So I can see why they put those 2 tidbits of info above the data block.
Maybe they could have said too bad, JSON for life, we'll never change it. OK. But then you still need the length or a delimiter for the "end of json".
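A rough sketch of why both pieces of information are needed: the reader checks the declared format, then consumes exactly `length` units before decoding. This assumes `length` counts bytes and that the payload is UTF-8, neither of which is confirmed in this thread. The function name `read_meta_payload` and the sample data are hypothetical.

```python
import io
import json


def read_meta_payload(stream, options):
    """Read and decode one meta payload from a binary stream.

    `options` is the already-parsed header, for example
    {"format": "json", "length": "270"}.
    """
    fmt = options.get("format")
    if fmt != "json":
        raise ValueError(f"unsupported meta format: {fmt!r}")

    length = int(options["length"])  # assumed to be a byte count
    raw = stream.read(length)
    if len(raw) != length:
        raise ValueError("stream ended before the declared length")
    return json.loads(raw.decode("utf-8"))  # UTF-8 is an assumption


payload = b'{"stats": {"files": 1}}'
stream = io.BytesIO(payload + b"\nrest of the file\n")
meta = read_meta_payload(
    stream, {"format": "json", "length": str(len(payload))}
)
remainder = stream.read()  # everything after the payload is untouched
```

With a fixed length, the reader never has to scan the payload for an end marker, so whatever follows it can be handed to the next stage unchanged.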
What was your reasoning for discarding the existing header format used by git?
> Compatibility is a key factor here, so we'd want to go with base-level JSON. I'd prefer being able to have trailing commas in lists, but not enough to make life hard for someone implementing this without access to a JSON5 parser.
Everyone has access to a JSON5 parser. Everyone has to suffer for the sake of a few people who don't want to pay the trifling tax of pip installing something, when they're using an external library for a novel file format _anyway_?
> Everyone has access to a JSON5 parser.
That's just a lack of imagination. When you're making a product for teams that span everything from a brand new startup using the latest tooling to teams that are working on software that runs on embedded systems from the 90's, you need to consider things like this.
There are JSON5 parsers written in C89 out there. And your embedded systems from the 90s probably don't have a JSON parser built in at all either... If you're going to build your own JSON parser, adding JSON5 support on top is really trivial.
That doesn't mean it's not going to be difficult to use that parser. Not everyone has the luxury of being able to use third-party code, or having the time allotted to write a JSON5 parser. The JSON parser some places are using may have been written two decades ago and works well enough that there's little motivation to implement JSON5 support. Sometimes it's just company policy or internal politics that prevent the usage.
It's also just not that big a deal overall for the intended use of the DiffX format. It's mainly machine-generated and machine-consumed. There are human readability concerns for sure, but the format looks to be designed mainly for tools to create and consume, so missing a few features that JSON5 brings is not that big of a deal.
> That doesn't mean it's not going to be difficult to use that parser. Not everyone has the luxury of being able to use third-party code, or having the time allotted to write a JSON5 parser.
Why are these people the target market?
I understand it may be important to you, but that isn't the same as "matters to target market/audience".
On top of that, the same constraints you mention here would stop you parsing current git patch formats, and lots of other things anyway. So you were never going to be using modern tools that might care here.
This is all also really meta. Who exactly is writing software with >1% market share, needs to parse the patch format, and can't access a JSON parser?
Instead of this theoretical discussion, let's have a concrete one.
In this specific instance, those people are part of the target market because the project chooses to make them part of the target market. It's worked well enough for Review Board.
So the whole world should suffer through vanilla JSON because someone, somewhere, has an overbearing and paranoid software approval process? That's the attitude that delayed universal Unicode adoption by a decade.
That's a bit dramatic. This isn't something as universal as Unicode. You really only need to care about this if you're writing tools that generate or consume the DiffX format, which is not something most people will be doing. The whole world isn't suffering their decision to use JSON instead of JSON5.
I don't think this is true, and honestly, I think it would be a mistake to consider it - they can't serve everyone, down that path is madness. FWIW - I even have a JSON parser in my RTOS-that-must-run-in-less-than-512k.
I also think that target of "embedded systems from the 90's" makes no sense because the tooling for the embedded system, which is what would conceivably want to handle patch format, ran on the host, which easily had access to a JSON parser.
But let's assume it does matter - let's be super concrete - assume they want to serve 95-99% of the users of patch format (I doubt it's even that high).
Which exact pieces of software with even >1% market share that need to process patch format don't have access to a JSON parser?