Tuesday, January 03, 2006

GZipStream and Deflate compression

I have had a low-priority project floating around for about 6 months that requires some analysis of various files and how well they have been compressed. VS2003 did not have any compression support, and I did not want to evaluate, choose, buy and learn a third-party compression library (this project is just not worth it), so I chose to wait for VS2005, since it was to have some compression support.

As an aside, this was my very first VS2005 project. At first glance it really does look totally awesome. However, I will reserve my final judgment until I get into my first big ASP.NET 2.0 application.

The Help is really good on the compression topic. I was not looking forward to all that mucking around with streams, but the examples did exactly what I wanted, and a quick copy and paste later my program was basically finished.

The first file that the program checked was a PDF. PDFs are inherently compressed, so I wasn't expecting it to reduce in size by much. But I did not expect that it would INCREASE in size by 50%! A quick check of the code later, it turned out this was not a bug.

The compression in VS2005 comes in two flavors (the GZipStream and DeflateStream classes): one based on GZip and the other on the Deflate algorithm. The Help says that GZip actually uses the same algorithm as Deflate, but that it can be extended to use other formats.

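I did a bit of analysis to see how much the files would compress by. Each figure is the compressed size as a percentage of the original size, so anything over 100% means the file grew. (The PDF results in this table come from a different PDF file than the one that initially alerted me to these issues, hence the discrepancy.)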

File                     GZip      Deflate
1000000 spaces           0.86%     0.86%
A Word document          88.60%    88.57%
A Word document zipped   152.97%   152.90%
A PDF                    92.76%    92.75%
A PDF zipped             153.35%   153.34%
An XML file              3.24%     3.24%
An XML file zipped       145.46%   145.46%

From these results it is clear that it is not wise to use the .NET compression libraries on data that is already highly compressed. They will make the data about 50% bigger than it was originally. If the data is somewhat compressed (PDF and Word) then there is little benefit in using these libraries.
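The general effect (plain data shrinks a lot, already-compressed data does not shrink at all) can be reproduced outside .NET. The following is a minimal sketch in Python using its zlib module as a stand-in for the .NET classes. The helper name compressed_percent and the generated sample text are made up for illustration, and zlib is a different implementation, so it will not reproduce the 50% growth in the table above.

```python
import random
import zlib


def compressed_percent(data: bytes) -> float:
    """Return the zlib (Deflate) compressed size as a percentage of the original."""
    return 100.0 * len(zlib.compress(data)) / len(data)


# Build repetitive, text-like sample data (a hypothetical stand-in for a document).
rng = random.Random(0)
words = [f"word{i}" for i in range(200)]
text = " ".join(rng.choice(words) for _ in range(50_000)).encode("ascii")

# Compress it once, then measure what a second pass would achieve.
once = zlib.compress(text)

print(f"plain text:         {compressed_percent(text):.2f}%")
print(f"already compressed: {compressed_percent(once):.2f}%")
```

The second figure should land very close to 100%, because the output of the first pass has almost no redundancy left for a second pass to find.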

However, if the data is not compressed at all then the compression works well. For example, if you are moving around XML data (e.g. SOAP) then it may be a great idea to compress the stream. (Caveat emptor: CPU load will go up and you will lose compatibility with computers that do not support the decompression algorithm. This is kind of against the spirit of SOAP.)

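Another point is that the Help's claim that GZip's algorithm is the same as Deflate's appears to be totally true. The files come out at slightly different sizes, but I can only assume this has something to do with header sizes. That would be consistent with the GZip format, which wraps the Deflate data in a small header and a checksum trailer.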
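The header theory is easy to check with any library that can write both formats. Here is a small sketch in Python's zlib module (again a stand-in, not the .NET classes). The deflate_size helper and the XML-like sample payload are invented for the example. A minimal GZip wrapper is a 10-byte header plus an 8-byte trailer, so the two outputs should differ by that fixed amount whatever the input.

```python
import zlib


def deflate_size(data: bytes, wbits: int) -> int:
    """Compress data with zlib and return the output size in bytes.

    wbits=-15 produces a raw Deflate stream; wbits=31 wraps the same
    stream in a GZip header and trailer.
    """
    compressor = zlib.compressobj(6, zlib.DEFLATED, wbits)
    return len(compressor.compress(data) + compressor.flush())


sample = b"<item>value</item>\n" * 10_000  # hypothetical XML-like payload

raw_size = deflate_size(sample, -15)
gzip_size = deflate_size(sample, 31)

print(f"raw Deflate: {raw_size} bytes")
print(f"GZip:        {gzip_size} bytes")
print(f"overhead:    {gzip_size - raw_size} bytes")
```

A fixed overhead of this size would explain why the GZip and Deflate columns only differ in the second decimal place.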

The compression libraries work on streams rather than in 'block mode'. The libraries compress the data as it arrives from the stream rather than having an opportunity to examine the entire data file from beginning to end before compressing it, as is done in programs like WinZip. Stream-based compression apparently can never be as good as block compression.
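To show what working on a stream means in practice, here is a rough sketch in Python's zlib (not the .NET API). The compressor is fed one chunk at a time and only the output size is kept, so the whole file never needs to be in memory. The function name stream_compressed_size and the 64 KB chunk size are arbitrary choices for the example.

```python
import io
import zlib


def stream_compressed_size(stream, chunk_size: int = 64 * 1024) -> int:
    """Compress a readable binary stream chunk by chunk; return the total output size."""
    compressor = zlib.compressobj()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(compressor.compress(chunk))
    total += len(compressor.flush())  # emit whatever is still buffered
    return total


data = b"The quick brown fox jumps over the lazy dog.\n" * 20_000
size = stream_compressed_size(io.BytesIO(data))
print(f"{len(data)} bytes in, {size} bytes out")
```

Any readable binary stream would do in place of io.BytesIO, for example a file opened with open(path, "rb").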

So back to my problem. To determine whether a file is really well compressed, I have resorted to checking whether the .NET GZip algorithm increases its size by around 50%! If it does, then the file is already really well compressed; otherwise there may be an opportunity for further compression.
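The same trick can be sketched with other Deflate libraries, but the threshold has to change. The example below uses Python's zlib, which only grows incompressible input by a tiny amount, so it assumes a cut-off of 99% instead of the 150% seen with .NET. The function name looks_already_compressed, the threshold value and the seeded random bytes (standing in for a compressed file) are all assumptions made for illustration.

```python
import random
import zlib


def looks_already_compressed(data: bytes, threshold_percent: float = 99.0) -> bool:
    """Return True if compressing data again saves almost nothing.

    threshold_percent is an assumed cut-off for zlib, not the 150% figure
    observed with the .NET GZip results in this post.
    """
    if not data:
        return False  # nothing to measure
    percent = 100.0 * len(zlib.compress(data)) / len(data)
    return percent >= threshold_percent


spaces = b" " * 1_000_000
# Seeded pseudo-random bytes stand in for the contents of a compressed file.
noise = random.Random(0).getrandbits(8 * 100_000).to_bytes(100_000, "big")

print(looks_already_compressed(spaces))
print(looks_already_compressed(noise))
```

The first call should report that the run of spaces is not compressed and the second that the noise is, which mirrors the two extremes in the table.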
