Is there any reason why XML such as this:
<person>
<firstname>Joe</firstname>
<lastname>Plumber</lastname>
</person>
couldn't be compressed like this for client/server transfer:
<person>
<firstname>Joe</>
<lastname>Plumber</>
</>
It would be smaller - and slightly faster to parse.
Assuming there are no edge cases that would break this - are there any libraries that do such a thing?
This turns out to be a hard thing to Google:
Your search - </> - did not match any documents.
Suggestions: Try different keywords.
Edit: There seems to be some confusion about what I'm asking. I am talking about my own form of compression. I am fully aware that, as it stands, this is NOT XML. The server and client would have to be 'in on the scheme'. It would be especially helpful for schemas that have very long element names, because the bandwidth taken up by those element names would be roughly halved.
If you wrote a compression routine which did that, then yes, you could compress a stream and restore it at the other end.
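For illustration, here is a rough sketch of such a pair of routines (Python; the names shorten and restore are mine, and it assumes well-formed input with no comments, CDATA sections, processing instructions, or '>' characters inside attribute values):

    import re

    # Matches either a complete tag or a run of text between tags.
    TOKEN = re.compile(r'<[^>]*>|[^<]+')

    def shorten(xml):
        """Replace every named closing tag, e.g. </person>, with '</>'."""
        return re.sub(r'</[^>]+>', '</>', xml)

    def restore(shorthand):
        """Undo shorten() by keeping a stack of the currently open tag names."""
        out, stack = [], []
        for tok in TOKEN.findall(shorthand):
            if tok == '</>':
                out.append('</%s>' % stack.pop())
                continue
            if tok.startswith('<') and not tok.startswith('<?') and not tok.endswith('/>'):
                # An opening tag: remember its name for the matching '</>'.
                stack.append(tok[1:-1].split()[0])
            out.append(tok)
        return ''.join(out)

    xml = '<person><firstname>Joe</firstname><lastname>Plumber</lastname></person>'
    assert restore(shorten(xml)) == xml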
The reasons this isn't done are:
- much better XML-agnostic compression schemes already exist, in terms of compression ratio and probably in terms of CPU and space as well (see the gzip comparison after this list). Your scheme also needs working memory proportional to the document to undo: a deeply nested document of N one-character elements - <x>...</x> costs 7 bytes per level, so about 7N bytes of UTF-8 - would get only 14% compression (one byte saved out of every seven), yet require at least 2N bytes of stack space to decompress, rather than the constant space most decompression algorithms require.
- much better XML-aware compression schemes already exist (google 'binary xml'). For schema-aware compression, the schemes based on ASN.1 do far better than merely halving the bytes spent indicating each element's type.
- the decompressor must parse the non-standard XML and keep a stack of the open tags it has encountered, as in the restore() sketch above. So unless you're plugging it in instead of a parser, you have doubled the parsing cost; and if you do plug it in instead of the parser, you're mixing different layers, which is liable to cause confusion at some point.
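To put the first point in perspective, a quick comparison sketch (the byte counts in the comments are for this artificially repetitive input; real documents compress less dramatically, but gzip still wins comfortably):

    import gzip, re

    xml = ('<person><firstname>Joe</firstname>'
           '<lastname>Plumber</lastname></person>') * 1000

    short = re.sub(r'</[^>]+>', '</>', xml)  # the hand-rolled scheme

    print(len(xml))                                  # 71000 bytes of raw XML
    print(len(short))                                # 48000: closing names gone
    print(len(gzip.compress(xml.encode('utf-8'))))   # a few hundred bytes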
That's not valid XML. Closing tags must be named. Omitting the names is potentially error-prone, and frankly I think it'd be less readable your way.
In reference to your clarification that this is a deliberate violation of the XML standard to save a few bytes: it is an incredibly bad idea, for several reasons:
- It's nonstandard, and you may well have to keep supporting it far into the future;
- Standards exist for a reason. Standards and conventions have a lot of power, and 'custom XML' ranks up there with ivory-tower graphic designers who force programmers to write a custom button replacement because the standard one can't do whatever weird, wonderful and confusing behaviour was dreamt up;
- Gzip compression is easy, far more effective, and won't break standards. If you see a gzip octet stream, there's no mistaking it for XML. The real problem with the shorthand scheme you've got is that it still has an XML declaration (<?xml ... ?>) and opening tags at the top, so some poor unsuspecting parser may make the mistake of thinking it's valid and bomb out with a different, misleading error (see the snippet after this list);
- Information theory: compression works by removing redundancy. Removing some of it by hand first doesn't make gzip any more effective, because the same amount of information is represented either way;
- There is significant overhead in converting documents to and from this scheme. It can't be done with a standard XML parser, so you'd effectively have to write your own XML parser and serializer that understand it (in fact the conversion to this format can be done with a standard parser; getting it back is the hard part). That's a lot of work, and a lot of bugs.
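To make the 'misleading error' point concrete, here is what a stock parser does with the shorthand (a small sketch using Python's standard library):

    import xml.etree.ElementTree as ET

    shorthand = '<person><firstname>Joe</><lastname>Plumber</></>'
    try:
        ET.fromstring(shorthand)
    except ET.ParseError as err:
        # expat rejects this with a generic well-formedness error, which
        # says nothing about the real problem: a private, non-XML format.
        print(err)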
If you need better compression and easier parsing, you may try using XML attributes:
<person firstname="Joe" lastname="Plumber" />
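As a rough illustration of the difference (the byte counts in the comments are for these exact strings):

    import xml.etree.ElementTree as ET

    element_form = ('<person><firstname>Joe</firstname>'
                    '<lastname>Plumber</lastname></person>')
    attribute_form = '<person firstname="Joe" lastname="Plumber" />'

    # The attribute form still parses with any standard XML parser.
    person = ET.fromstring(attribute_form)
    print(person.get('firstname'), person.get('lastname'))  # Joe Plumber
    print(len(element_form), len(attribute_form))           # 71 45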
As you say, this isn't XML, so why make it even look like XML? You've already lost the ability to use any XML parsers or tools. I would either:
- Use XML, and compress it on the wire as you'll see far greater savings than with your own scheme
- Use another more compact format like YAML or JSON
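For a sense of scale, a quick sketch of the same record as JSON (sizes depend on formatting):

    import json

    person = {'firstname': 'Joe', 'lastname': 'Plumber'}
    payload = json.dumps(person)
    print(payload)       # {"firstname": "Joe", "lastname": "Plumber"}
    print(len(payload))  # 43, vs 71 for the element-per-field XML in the question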