Data CompressionDate: 08/10/2003 at 06:18:58 From: John P Subject: Reducing a binary number Is there an algorithm that can reduce any binary number to a much smaller binary number, then later be reversed to regain the original number? It has to work for any binary number. If a large binary number can be reduced to a smaller binary number consistently by applying a formula or technique, then is it possible to reverse the process to turn the reduced value into the original large binary value by reversing the sequence or formula? Let's say I have a binary number. Step 1: (original binary number) 10001101001111001100010111000011 Step 2: Apply the algorithm Step 3: (ex. reducing it to say 1011, with no remainder) When I try it I always get a high remainder. Date: 08/11/2003 at 11:21:13 From: Doctor Rick Subject: Re: Reducing a binary number Hi, John. If I understand you correctly, it is not possible. For example, you've given me a 32-bit binary number to start with. There are 2^32 different 32-bit binary numbers. You want to reduce it to a 4-bit binary number. There are only 2^4 = 16 of these. If each of the 2^32 32-bit numbers is to be reduced to one of the 16 4-bit numbers, then we'll have to have a lot of cases in which two different 32-bit numbers are reduced to the SAME 4-bit number. When you try to reverse the process, you won't be able to decide which of these 32-bit numbers you started with, based on the 4-bit number alone. Perhaps you can tell us why you want to do this. There are tricks by which seemingly impossible things similar to this can in fact be done, but they depend on one of two modifications to the problem. Either there is additional information that tells you that not all 32-bit numbers are possible to start with (in fact, there must be 16 or fewer), or else additional information is kept besides the 4-bit number that allows you to decide which number you started with, out of those that reduce to the same number. - Doctor Rick, The Math Forum http://mathforum.org/dr.math/ Date: 08/12/2003 at 00:10:49 From: John P Subject: Reducing a binary number Thank you for your response, Dr Math. I am interested in any tricks or alternatives that you might have in mind to solve the problem. This is an extra credit problem that takes any size binary number as an input and reduces it to a small binary number, after applying an algorithm. Also, the problem requires reversing the process to take the reduced value and regaining the larger binary value after reversing the algorithm, or another algorithm. You mentioned perhaps supplying a key value do handle the problem. If I understand you correctly, would the reduced have a key value to go along with it, so that when I reverse the process, I would supply the two values (reduced value and the key value) as input? John Date: 08/12/2003 at 09:17:22 From: Doctor Warren Subject: Re: Reducing a binary number Hi John, Thanks for writing to Dr. Math! What you're attempting to do, to represent a large number with a small one, is example of "data compression." This is a enormous topic of great importance to the modern computerized world. If you can represent a large file in a more compact way, it will take less time to transmit it across a network, and it will occupy less space on your disk. Network bandwidth and disk space are expensive, so compression saves money (and time). Dr. Rick pointed out that it is impossible to represent ANY 32-bit number in only four bits. It is, in fact, only possible to represent sixteen different numbers. This is called the "pigeonhole" problem by computer scientists who work on compression algorithms. (Yes, there are people who spend their entire lives trying to do what you're trying to do, in increasingly better ways). On the outside, I can see why your teacher assigned this as an extra credit problem - it's one of those beautiful problems that, as you say, seems so easy at the start, but really isn't easy at all. Every now and then, someone will announce to the world that they've invented a program that can compress any file (files are just strings of bits) all the way down to just a single bit. But think about that for a minute: if you gave it a picture of your mother, the program might give you back "1." If you gave it a picture of your father, the program might give you back "0." If you gave it a picture of your best friend, it'd had to give you back either "1" or "0," but both of those already mean "mother" and "father." You can only represent two pictures with just one bit - there's no way the program can really compress ANYTHING all the down to just one bit. At most, it can only compress two files all the way down to one bit. Since you're looking for ways to compress a 32-bit number, I'll give you an example of a simple compression algorithm. It's called "run- length encoding," and is used in various places such as .gif graphic files, which are common on the Web. In run-length encoding, you represent long repetitive strings of the same bits in a more compact form. For example, take this number: before: 11111000000010000001111111000000 Notice the long stretches of zeros and ones in this number. These stretches are called "runs." For example, the number begins with a run of 1's, of length 5. In a simple run-length encoding scheme, I would represent the whole number as a sequence of runs, like this: "five ones, followed by seven zeros, followed by one one, followed by six zeros, followed by seven ones, followed by six zeros." I might represent each run by a three-digit value (representing the length) followed by the bit being repeated. For example, "11111" (five ones) becomes 1011 ^ ^ | the digit being repeated is a one 101 means repeat five times If I continue this pattern out, I get the following string: after RLE: 1011 1110 0011 0110 1111 0110 where I have added spaces between the bits representing each run. Notice that my new number only 24 bits long, so I've saved 8 bits. There's also an easy way to gain a little more compression: notice that the runs alternate. The first run is ones, so the second is zeros, and the third is ones, and so on. I really only have to store whether the first run is ones or zeros, and then I can represent the rest of the runs by just their lengths. even better RLE: 1 011 111 001 011 111 011 The first bit in this number means only that the first run is ones; every subsequent triplet of bits encodes the number of bits in the next run, assuming that the runs always alternate value. If the original number had a run of more than seven digits in a row (the maximum length run you can encode with three bits), you'd have to do something a little ugly, though - you'd have to store a "zero-length" run in the middle of it. In any event, this number is only 19 bits - I've saved over 40% of the original space. As always, there are drawbacks... RLE is only good at encoding numbers with a lot of repeated ones and zeros. The worst case for this algorithm is a number like this: before: 10101010101010101010101010101010 After I RLE this number, it becomes: after: 1 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 Uh-oh! This "compressed" number is 97 bits long, over three times the length of the original number. This is, in fact, a problem common to ALL compression algorithms. A given compression algorithm will be good at compressing some kinds of data, will be unable to compress other kinds of data, and will actually make some kinds of data even bigger. People have invented all kind of algorithms; some are good for compressing English text, some for music, some for pictures, and so on, but there is no single algorithm that is the best for all kinds of data. Just to get your creative juices flowing a bit, I'll give you a pointer to a website where you can learn about another kind of compression, called "dictionary" or "substitution coding." More specifically, it's a compression algorithm called LZ78, developed in 1978 by two fellows, Lempel and Ziv. This kind of compression is actually what's used in zip files! Data Structures and Algorithms: Lempel-Ziv Compression McGill University: School of Computer Science http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic23/ The LZ78 algorithm is MUCH better than the simple run-length encoding I described above. LZ78 would do pretty well on both of these numbers: 11111000000010000001111111000000 10101010101010101010101010101010 which are very different. You might want to do some Google searches on a man named Claude E. Shannon, who first discovered the concept of "information entropy." Shannon determined that English text, for example, has approximately 2.3 bits per character of "raw information." You can't compress English text any better than down to 2.3 bits per character, on average. In general, one of the best places on the Web to learn about compression techniques is the FAQ for the comp.compression USENET newsgroup, available here: Compression FAQ http://www.faqs.org/faqs/compression-faq/ Note that this a HUGE document, so don't try reading it straight through. Browse the table of contents in the first part, and read the questions that sound interesting. I think you'll find a lot of it very interesting! Let me know if you have any more questions. - Doctor Warren, The Math Forum http://mathforum.org/dr.math/ |
Search the Dr. Math Library: |
[Privacy Policy] [Terms of Use]
Ask Dr. Math^{TM}
© 1994- The Math Forum at NCTM. All rights reserved.
http://mathforum.org/dr.math/