Compression


 

Sometimes you don’t get to define the requirements. They can appear to serve a higher purpose that you can’t begin to understand. All you know is that they are requirements, and that decisions were made for various reasons. Sometimes you have to play the cards you are dealt, but how you play them is still your choice.

I’m talking about message formats here. In a specific transaction processing system there are two requirements that we must adhere to:

  1. Accept an incoming fixed-format message of 8,000 to 10,000 bytes.
  2. Log the raw message requests and responses for all interface connections.

Regarding #1, I’d prefer to see a variable message format instead, but I understand the need of an existing system to talk in the language that it is used to. Item #2 had me very concerned when I first heard of it. With my PCI background, I was ready to put my foot down and call people crazy (imagining a request to log raw messages that contained track data, PIN blocks, and card verification numbers). To my surprise, this was not for a financial transaction processing system but for one with a different purpose: one that exists in a highly regulated world with data retention requirements and a need to preserve the integrity of the raw transaction messages for compliance and legal reasons.

The challenge I had with logging the raw messages was their sheer size: about 10K each. A transaction has 4-6 legs (client request, client response, endpoint request, endpoint response, and other transaction paths that sometimes seem recursive), so we have roughly 50K of logging for a single transaction. Multiply that by 3 to 5 million transactions per day, and that is 150 GB to 250 GB of logging per day!
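The arithmetic above can be checked with a few lines of Java. This is only a back-of-the-envelope sketch: the class and method names are made up for illustration, and it assumes decimal units (1 GB = 1,000,000,000 bytes) and the rounded 50K-per-transaction figure from the text.

```java
// Back-of-the-envelope estimate of daily raw-message log volume.
// Figures come from the text; decimal units (1 GB = 10^9 bytes) are an assumption.
final class LogVolumeEstimate {

    /** Bytes logged per day for a given per-transaction size and daily transaction count. */
    static long dailyBytes(long bytesPerTransaction, long transactionsPerDay) {
        return bytesPerTransaction * transactionsPerDay;
    }

    /** Converts bytes to decimal gigabytes. */
    static double toGigabytes(long bytes) {
        return bytes / 1_000_000_000.0;
    }

    public static void main(String[] args) {
        long bytesPerTransaction = 50_000L; // roughly 5 legs at about 10K each
        long low = dailyBytes(bytesPerTransaction, 3_000_000L);
        long high = dailyBytes(bytesPerTransaction, 5_000_000L);
        // Prints: 150 GB to 250 GB per day
        System.out.printf("%.0f GB to %.0f GB per day%n", toGigabytes(low), toGigabytes(high));
    }
}
```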

The easiest solution was to look into compression. How much time would it take to compress the data stream before logging it? Would this impact transaction processing time? How were the raw messages used? If we compress the messages, what needs to happen in the applications on the other end? What languages and platforms are they written in? What is a portable algorithm?

It turns out that these messages contain many repeating unused fields with default values, and these compress very well:

 

image001.png

 

Enter gzip: on our platform, Java’s GZIPOutputStream and GZIPInputStream, and for our clients’ tools, the .NET GZipStream.
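For illustration, a gzip round trip on the Java side might look something like the sketch below. It is not the production code: the class name, the helper methods, and the synthetic message (a made-up header followed by space-padded filler) are all assumptions standing in for the real fixed-format layout. It uses only java.util.zip from the JDK, plus InputStream.readAllBytes, which needs Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, for illustration only.
final class RawMessageGzip {

    /** Compresses a raw message into a gzip byte array before it is logged. */
    static byte[] compress(byte[] raw) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(raw);
        } // closing the stream flushes the data and writes the gzip trailer
        return buffer.toByteArray();
    }

    /** Restores the original raw message from its gzip form. */
    static byte[] decompress(byte[] compressed) throws IOException {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gzip.readAllBytes(); // Java 9+
        }
    }

    public static void main(String[] args) throws IOException {
        // Synthetic stand-in for a fixed-format message: a made-up header
        // followed by space-padded unused fields.
        byte[] raw = new byte[10_000];
        Arrays.fill(raw, (byte) ' ');
        byte[] header = "HDR0001".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(header, 0, raw, 0, header.length);

        byte[] compressed = compress(raw);
        byte[] restored = decompress(compressed);
        System.out.println("raw=" + raw.length
                + " compressed=" + compressed.length
                + " roundTripOk=" + Arrays.equals(raw, restored));
    }
}
```

Because both sides read and write the standard gzip format, a byte array produced this way should be readable by a .NET GZipStream opened in decompress mode, which is what makes it a portable choice here.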

How did this work out?

 

raw_size (bytes)   comp_size (bytes)   Compression %
----------------------------------------------------
3975               393                 90.1
10599              484                 95.4
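The Compression % column reads as the share of the raw size that was saved, that is, 100 × (1 − comp_size / raw_size). A small sketch to reproduce the two rows, with a class and method name invented for the example:

```java
import java.util.Locale;

// Hypothetical helper, for illustration only.
final class CompressionStats {

    /** Percentage of the raw size saved by compression. */
    static double percentSaved(long rawSize, long compressedSize) {
        return 100.0 * (1.0 - (double) compressedSize / rawSize);
    }

    public static void main(String[] args) {
        // The two samples from the table above.
        System.out.println(String.format(Locale.ROOT, "%.1f", percentSaved(3975, 393)));  // 90.1
        System.out.println(String.format(Locale.ROOT, "%.1f", percentSaved(10599, 484))); // 95.4
    }
}
```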

 

How much disk storage, SAN space, and upgrades were saved? 😉 Priceless.

 
