Arithmetic coding (method of compression) Suppose we have a message that only contains the characters A, B, and C, with the following frequencies, expressed as fractions: A: 0.5 B: 0.2 C: 0.3 To show how arithmetic compression works, we first set up a table, listing characters with their probabilities along with the cumulative sum of those probabilities. The cumulative sum defines "intervals", ranging from the bottom value to less than, but not equal to, the top value. The order in which characters are listed in the table does not seem to be important, except to the extent that both the coder and decoder have to know what the order is. letter probability interval ______ ___________ _________ C: 0.3 0.0 : 0.3 B: 0.2 0.3 : 0.5 A: 0.5 0.5 : 1.0 ______ ___________ _________ Now each character can be coded by the shortest binary fraction whose value falls in the character's probability interval: letter probability interval binary fraction ______ ___________ _________ ____________________ C: 0.3 0.0 : 0.3 0 B: 0.2 0.3 : 0.5 0.011 = 3/8 = 0.375 A: 0.5 0.5 : 1.0 0.1 = 1/2 = 0.5 ______ ___________ _________ ____________________ This shows how single characters can be assigned minimum-length binary codes. However, arithmetic coding doesn't stop there and simply translate the individual characters in a message as these binary codes. It takes a subtler approach, assigning binary fractions to complete messages. To start, let's consider sending messages consisting of all possible permutations of two of these three characters. We determine the probability of the two-character strings by multiplying the probabilities of the two characters, and then set up a series of intervals using those probabilities. string probability interval binary fraction ______ ___________ ____________ _______________________ CC: 0.09 0.00 : 0.09 0.0001 = 1/16 = 0.0625 CB: 0.06 0.09 : 0.15 0.001 = 1/8 = 0.125 CA: 0.15 0.15 : 0.30 0.01 = 1/4 = 0.25 BC: 0.06 0.30 : 0.36 0.0101 = 5/16 = 0.3125 BB: 0.04 0.36 : 0.40 0.011 = 3/8 = 0.375 BA: 0.10 0.40 : 0.50 0.0111 = 7/16 = 0.4375 AC: 0.15 0.50 : 0.65 0.1 = 1/2 = 0.5 AB: 0.10 0.65 : 0.75 0.1011 = 11/16 = 0.6875 AA: 0.25 0.75 : 1.00 0.11 = 3/4 = 0.75 ______ ___________ ____________ _______________________ The higher the probability of the string, in general the shorter the binary fraction needed to represent it. Let's build a similar table for three characters now: string probability interval binary fraction ______ ___________ _____________ _______________________________ CCC 0.027 0.000 : 0.027 0.000001 = 1/64 = 0.015625 CCB 0.018 0.027 : 0.045 0.00001 = 1/32 = 0.03125 CCA 0.045 0.045 : 0.090 0.0001 = 1/16 = 0.0625 CBC 0.018 0.090 : 0.108 0.00011 = 3/32 = 0.09375 CBB 0.012 0.108 : 0.120 0.000111 = 7/64 = 0.109375 CBA 0.03 0.120 : 0.150 0.001 = 1/8 = 0.125 CAC 0.045 0.150 : 0.195 0.0011 = 3/16 = 0.1875 CAB 0.03 0.195 : 0.225 0.00111 = 7/32 = 0.21875 CAA 0.075 0.225 : 0.300 0.01 = 1/4 = 0.25 BCC 0.018 0.300 : 0.318 0.0101 = 5/16 = 0.3125 BCB 0.012 0.318 : 0.330 0.010101 = 21/64 = 0.328125 BCA 0.03 0.330 : 0.360 0.01011 = 11/32 = 0.34375 BBC 0.012 0.360 : 0.372 0.0101111 = 47/128 = 0.3671875 BBB 0.008 0.372 : 0.380 0.011 = 3/8 = 0.375 BBA 0.02 0.380 : 0.400 0.011001 = 25/64 = 0.390625 BAC 0.03 0.400 : 0.430 0.01101 = 13/32 = 0.40625 BAB 0.02 0.430 : 0.450 0.0111 = 7/16 = 0.4375 BAA 0.05 0.450 : 0.500 0.01111 = 15/32 = 0.46875 ACC 0.045 0.500 : 0.545 0.1 = 1/2 = 0.5 ACB 0.03 0.545 : 0.575 0.1001 = 9/16 = 0.5625 ACA 0.075 0.575 : 0.650 0.101 = 5/8 = 0.625 ABC 0.03 0.650 : 0.680 0.10101 = 21/32 = 0.65625 ABB 0.02 0.680 : 0.700 0.1011 = 11/16 = 0.6875 ABA 0.05 0.700 : 0.750 0.10111 = 23/32 = 0.71875 AAC 0.075 0.750 : 0.825 0.11 = 3/4 = 0.75 AAB 0.05 0.825 : 0.875 0.11011 = 27/32 = 0.84375 AAA 0.125 0.875 : 1.000 0.111 = 7/8 = 0.875 ______ ___________ _____________ _______________________________ * Let's stop here and send one of the binary strings defined (DECODING !) in the table above to a decoder. We'll arbitrarily select the binary string with the decimal value of 0.21875 from the table above. This value was obtained using the probability values and intervals defined earlier: string probability interval ______ ___________ _________ C: 0.3 0.0 : 0.3 B: 0.2 0.3 : 0.5 A: 0.5 0.5 : 1.0 ______ ___________ _________ The value 0.21875 clearly falls into the interval for "C", so "C" must be the first character. We can then "zoom in" on the characters that follow the "C" by subtracting the bottom value of the interval for "C", which happens to be 0, and dividing the result by the width of the probability interval for "C", which is 0.3: (0.21875 - 0) / 0.3 = 0.72917 This is a simple shift and scaling operation. The result falls into the probability interval for "A", and so the second character must be "A". We can then zoom in on the next character by the same approach as before, subtracting the bottom value of the interval for "A", which is 0.5, and dividing the result by the width of the probability interval for "A", which is also 0.5: (0.72917 - 0.5) / 0.5 = 0.4583 This clearly falls into the probability interval for "B", and so the string has been correctly uncompressed to "CAB", which is the correct answer. Poniżej przedstawiono schemat programu realizującego kodowanie arytmetyczna (tzn. realizującego kompresję tą metodą). Set low to 0.0 Set high to 1.0 While there are still input symbols do get an input symbol range = high - low. high = low + range*high_range(symbol) low = low + range*low_range(symbol) End of While output low