C Program For Arithmetic Coding For Data
. Introduction According to Wikipedia: Arithmetic coding is a method for lossless data compression. Normally, a string of characters such as the words ‘hello there’ is represented using a fixed number of bits per character, as in the ASCII code. Like Huffman coding, arithmetic coding is a form of variable-length entropy encoding that converts a string into another form that represents frequently used characters using more bits, with the goal of using fewer bits in total. As opposed to other entropy encoding techniques that separate the input message into its component symbols and replace each symbol with a code word, arithmetic coding encodes the entire message into a single number, a fraction ne, where (0.0 ≤ n character may be encoded into the number, or alternatively, a number representing the number of characters to encode can be passed to the decoding function/method.
What Every Computer Scientist Should Know About Floating- Point Arithmetic. Appendix D What Every Computer Scientist Should Know About Floating- Point Arithmetic.
The C# implementation The encoding procedure written in C# is included in the downloadable source code for this article. The code for the encoder as well as the decoder were first published (in C) in an article entitled 'Arithmetic Coding for Data Compression' in the February 1987 issue of 'Communications of the ACM', by Ian H. Witten, Radford Neal, and John Cleary, published again by Mark Nelson as C code, and then ported to C# by myself and is being published here with the author's permission. I have modified the code slightly so as to further isolate statistical modeling and arithmetic coding. The coder ( coder.cs in the source code) class is an object that implements arithmetic coding. This class can then be used by any statistical model of the data.
I have included a test project and a few examples of testing the coding and decoding methods of the class. There are two major differences between the algorithms shown earlier and the code included in this article. The first difference is in the way probabilities are transmitted. In the algorithms shown above, the probabilities were kept as a pair of floating point numbers on the 0.0 to 1.0 range. Each symbol had its own section of that range.
In the C# class included here, a symbol has a slightly different definition. Instead of two floating point numbers, the symbol's range is defined as two integers, which are counts along a scale. The scale is also included as part of the coder class definition: as the property ' scale' of the class ( coder.scale).
This scale is used in the methods encodesymbol and decodesymbol to convert the probability integer into its floating point equivalent, or to decode an encoded symbol into its char equivalent. This also means that a user of the class can input the probability of a symbol as integers instead of floating point numbers. See the included test solution for an example.
University Of Calgary
So, for instance in the 'HELLO WORLD' example, the letter H was defined previously as the high/low pair of 0.2 and 0.3. In the code being used here, 'H' would be defined as the low and high counts of 2 and 3, with the symbol scale being 10. The second difference in this algorithm is that all of the comparison and shifting operations are being done in base 2, rather than base 10.
The illustrations given previously were done on base 10 numbers to make the algorithms a little more comprehensible. The algorithms work properly in base 10, but masking off digits and shifting in base 10 on most computers is expensive and slow.
Instead of comparing the two MSD digits, we now compare the two MSD bits. There are two things missing that are needed in order to use the encoding and decoding algorithms. The first is a set of bit oriented input and output routines.
These are shown in the code listing and are presented here as the bit IO routines. The coder.symbol structure is responsible for storing the probabilities of each character, and performing two different transformations. During the encoding process, the coder.symbol structure has to take a character to be encoded and convert it to a probability range. The probability range is defined as a low count, a high count in the structure. During the decoding process, the coder.symbol structure has to take a count derived from the input bit stream and convert it into a character for output.
The coder object is created with a 'dictionary' or 'alphabet' made up of these coder.symbol structures. Test code An example program is shown in the included solution. It implements a compression/expansion program that uses some arbitrary models based on the discussed examples. The decoding method needs to know the length of the encoded string so as to know when to stop decoding the message. The test project encodes an arbitrarily defined input string, and writes it out to a MemoryStream object. This is the return value of the Encode method of the class. The MemoryStream object returned will contain an array of bytes with the compressed data in binary.
One can then decode the stream by calling the Expand method of the class. The Expand method of the class takes a memory stream and the length of the encoded message as parameters.
To test it, we can encode a string and then pass the binary stream returned back to the Expand method for decoding. During the encoding process, a routine called convertinttosymbol is called. This routine gets a given input character, and converts it to a low count, high count, and scale using the current model. Since our model is a set of fixed probabilities, this just means looking up the probabilities in the input coder.symbol struct.
Once those are defined, the encoder can be called. During the decoding process, there are two functions associated with modeling. In order to determine what character is waiting to be decoded on the input stream, the model needs to be interrogated to determine what the present scale is. In our example, the scale (or range of counts) is fixed by the coder.scale property. When decoding, a modeling function called convertsymboltoint is called. It takes the given count, and determines what character matches the count. Finally, the decoder is called again to process that character out of the input stream.
Harvey Green 3-Jan-12 14:33 3-Jan-12 14:33 Thanks for this port from C from the code from Mark Nelson (which is here by the way for reference. ) But I think you could have been a bit smarter with your port. I mean using a char for each symbol isn't really useful when you want to convert binary data (would have preferred byte, or even better, the implementation of a generic type for the symbol struct). Also I don't see the point of using MemoryStream for the output of the coder (the C code used a C iostream) - in my opinion it would have been simpler just to return a List of char or List of byte here.
Just minor points though, thanks again. For further reading, I highly recommend people take a look at this page here which contains a C# app for learning about Arithmetic Coding. There is also some highly useful comments at the bottom. JuddToken 26-Jan-10 5:55 26-Jan-10 5:55 Notice that the encoder is initialized with 0xffff. Then when the first symbol is encoded, you're trying to multiply this by another number - the range. Even if you divide by the scale.
How in the world is this going to fit in a UInt16??? Does anyone use this for something besides BILL GATES or HELLO WORLD? Don't try to change high and low to uint's from ushort. This will cause a lot of problems. Make sure your file statistics are scaled before instantiating a coder object with it, but that still won't fix the problem above. And the only way the MSBit is going to always be bit 15 is when it is always confined to a short. And you need it be since you hard-coded the bit comparisons.
No need, just reformat and repost it here then it will get seen by one of the many moderators on the site. That way you get the best reviews from people who know how to write articles, and may also know something about the subject you have written about.
If you are unsure of how to present the article you can submit via email rather than use the submission wizard, in which case it will be posted via one of the CodeProject editors who will add any enhancements that they feel are necessary. Remember that the important thing with articles is to make them visualy appealing, so people will be drawn to your work first and maybe even vote you up for one of the monthly awards. Last Visit: 31-Dec-99 19:00 Last Update: 9-Feb-18 20:46 1 General News Suggestion Question Bug Answer Joke Praise Rant Admin Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.