Associated Topics || Dr. Math Home || Search Dr. Math

### Data Compression

```Date: 08/10/2003 at 06:18:58
From: John P
Subject: Reducing a binary number

Is there an algorithm that can reduce any binary number to a much
smaller binary number, then later be reversed to regain the original
number? It has to work for any binary number.

If a large binary number can be reduced to a smaller binary number
consistently by applying a formula or technique, then is it possible
to reverse the process to turn the reduced value into the original
large binary value by reversing the sequence or formula?

Let's say I have a binary number.

Step 1: (original binary number) 10001101001111001100010111000011

Step 2: Apply the algorithm

Step 3: (ex. reducing it to say 1011, with no remainder)

When I try it I always get a high remainder.
```

```
Date: 08/11/2003 at 11:21:13
From: Doctor Rick
Subject: Re: Reducing a binary number

Hi, John.

If I understand you correctly, it is not possible. For example, you've
given me a 32-bit binary number to start with. There are 2^32
different 32-bit binary numbers. You want to reduce it to a 4-bit
binary number. There are only 2^4 = 16 of these. If each of the 2^32
32-bit numbers is to be reduced to one of the 16 4-bit numbers, then
we'll have to have a lot of cases in which two different 32-bit
numbers are reduced to the SAME 4-bit number. When you try to reverse
the process, you won't be able to decide which of these 32-bit numbers
you started with, based on the 4-bit number alone.

Perhaps you can tell us why you want to do this. There are tricks by
which seemingly impossible things similar to this can in fact be done,
but they depend on one of two modifications to the problem. Either
there is additional information that tells you that not all 32-bit
numbers are possible to start with (in fact, there must be 16 or
fewer), or else additional information is kept besides the 4-bit
number that allows you to decide which number you started with, out
of those that reduce to the same number.

- Doctor Rick, The Math Forum
http://mathforum.org/dr.math/
```

```
Date: 08/12/2003 at 00:10:49
From: John P
Subject: Reducing a binary number

Thank you for your response, Dr Math.

I am interested in any tricks or alternatives that you might have in
mind to solve the problem. This is an extra credit problem that takes
any size binary number as an input and reduces it to a small binary
number, after applying an algorithm. Also, the problem requires
reversing the process to take the reduced value and regaining the
larger binary value after reversing the algorithm, or another
algorithm.

You mentioned perhaps supplying a key value do handle the problem. If
I understand you correctly, would the reduced have a key value to go
along with it, so that when I reverse the process, I would supply the
two values (reduced value and the key value) as input?

John
```

```
Date: 08/12/2003 at 09:17:22
From: Doctor Warren
Subject: Re: Reducing a binary number

Hi John,

Thanks for writing to Dr. Math!

What you're attempting to do, to represent a large number with a small
one, is example of "data compression."  This is a enormous topic of
great importance to the modern computerized world.  If you can
represent a large file in a more compact way, it will take less time
to transmit it across a network, and it will occupy less space on your
disk.  Network bandwidth and disk space are expensive, so compression
saves money (and time).

Dr. Rick pointed out that it is impossible to represent ANY 32-bit
number in only four bits. It is, in fact, only possible to represent
sixteen different numbers. This is called the "pigeonhole" problem
by computer scientists who work on compression algorithms.  (Yes,
there are people who spend their entire lives trying to do what you're
trying to do, in increasingly better ways). On the outside, I can see
why your teacher assigned this as an extra credit problem - it's one
of those beautiful problems that, as you say, seems so easy at the
start, but really isn't easy at all.

Every now and then, someone will announce to the world that they've
invented a program that can compress any file (files are just strings
of bits) all the way down to just a single bit. But think about that
for a minute: if you gave it a picture of your mother, the program
might give you back "1."  If you gave it a picture of your father, the
program might give you back "0."  If you gave it a picture of your
best friend, it'd had to give you back either "1" or "0," but both of
those already mean "mother" and "father."  You can only represent two
pictures with just one bit - there's no way the program can really
compress ANYTHING all the down to just one bit. At most, it can only
compress two files all the way down to one bit.

Since you're looking for ways to compress a 32-bit number, I'll give
you an example of a simple compression algorithm. It's called "run-
length encoding," and is used in various places such as .gif graphic
files, which are common on the Web. In run-length encoding, you
represent long repetitive strings of the same bits in a more compact
form. For example, take this number:

before: 11111000000010000001111111000000

Notice the long stretches of zeros and ones in this number. These
stretches are called "runs."  For example, the number begins with a
run of 1's, of length 5. In a simple run-length encoding scheme, I
would represent the whole number as a sequence of runs, like this:
"five ones, followed by seven zeros, followed by one one, followed by
six zeros, followed by seven ones, followed by six zeros."  I might
represent each run by a three-digit value (representing the length)
followed by the bit being repeated.  For example, "11111" (five ones)
becomes

1011
^  ^
|  the digit being repeated is a one
101 means repeat five times

If I continue this pattern out, I get the following string:

after RLE: 1011 1110 0011 0110 1111 0110

where I have added spaces between the bits representing each run.
Notice that my new number only 24 bits long, so I've saved 8 bits.

There's also an easy way to gain a little more compression: notice
that the runs alternate. The first run is ones, so the second is
zeros, and the third is ones, and so on. I really only have to store
whether the first run is ones or zeros, and then I can represent the
rest of the runs by just their lengths.

even better RLE: 1 011 111 001 011 111 011

The first bit in this number means only that the first run is ones;
every subsequent triplet of bits encodes the number of bits in the
next run, assuming that the runs always alternate value. If the
original number had a run of more than seven digits in a row (the
maximum length run you can encode with three bits), you'd have to do
something a little ugly, though - you'd have to store a "zero-length"
run in the middle of it. In any event, this number is only 19 bits -
I've saved over 40% of the original space.

As always, there are drawbacks...

RLE is only good at encoding numbers with a lot of repeated ones and
zeros.  The worst case for this algorithm is a number like this:

before: 10101010101010101010101010101010

After I RLE this number, it becomes:

after: 1 001 001 001 001 001 001 001 001 001 001 001 001 001 001
001 001 001 001 001 001 001 001 001 001 001 001 001 001
001 001 001 001

Uh-oh!  This "compressed" number is 97 bits long, over three times the
length of the original number. This is, in fact, a problem common to
ALL compression algorithms. A given compression algorithm will be good
at compressing some kinds of data, will be unable to compress other
kinds of data, and will actually make some kinds of data even bigger.
People have invented all kind of algorithms; some are good for
compressing English text, some for music, some for pictures, and so
on, but there is no single algorithm that is the best for all
kinds of data.

Just to get your creative juices flowing a bit, I'll give you a
pointer to a website where you can learn about another kind of
compression, called "dictionary" or "substitution coding."  More
specifically, it's a compression algorithm called LZ78, developed in
1978 by two fellows, Lempel and Ziv.  This kind of compression is
actually what's used in zip files!

Data Structures and Algorithms: Lempel-Ziv Compression
McGill University: School of Computer Science
http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic23/

The LZ78 algorithm is MUCH better than the simple run-length encoding
I described above.  LZ78 would do pretty well on both of these
numbers:

11111000000010000001111111000000
10101010101010101010101010101010

which are very different. You might want to do some Google searches on
a man named Claude E. Shannon, who first discovered the concept of
"information entropy."  Shannon determined that English text, for
example, has approximately 2.3 bits per character of "raw
information."  You can't compress English text any better than down to
2.3 bits per character, on average.

In general, one of the best places on the Web to learn about
compression techniques is the FAQ for the comp.compression USENET
newsgroup, available here:

Compression FAQ
http://www.faqs.org/faqs/compression-faq/

Note that this a HUGE document, so don't try reading it straight
questions that sound interesting.  I think you'll find a lot of it
very interesting!

Let me know if you have any more questions.

- Doctor Warren, The Math Forum
http://mathforum.org/dr.math/
```
Associated Topics:
College Algorithms
High School Calculators, Computers
High School Permutations and Combinations

Search the Dr. Math Library:

 Find items containing (put spaces between keywords):   Click only once for faster results: [ Choose "whole words" when searching for a word like age.] all keywords, in any order at least one, that exact phrase parts of words whole words

Submit your own question to Dr. Math
Math Forum Home || Math Library || Quick Reference || Math Forum Search