-1

can someone please name an existing algo which is used for compressing numbers? numbers are integers and totally random with no spaces and decimals, eg. 35637462736423478235687479567456....n

well, so far, all i have is this, it converts the integers into ascii reducing approx 40% of the original size

function intergerToChar($v)
{
    $buffer="";
    $charsLen=strlen($v);
    for($i = 0; $i <= $charsLen; $i++)
    {     
        $asc=$v[$i];
        if($asc==0){$buffer[]=0;}
        elseif($asc==1){$buffer[]=$v[$i].$v[$i+1].$v[$i+2];$i=$i+2;}
        elseif($asc==2)
        {
            if($v[$i+1]<5){$buffer[]=$v[$i].$v[$i+1].$v[$i+2];$i=$i+2;}
            elseif($v[$i+1]==5 && $v[$i+2]<6){$buffer[]=$v[$i].$v[$i+1].$v[$i+2];$i=$i+2;}
            else{$buffer[]=$v[$i].$v[$i+1];$i++;}       
        }
        else{$buffer[]=$v[$i].$v[$i+1];$i++;}  
    }
    return $buffer;   
}

btw, i know PHP is not meant for building a compression tool. I'll be using C/C++

UPDATE: This is another PHP code with better compressing result than the above code, it can compress upto 66% if the integers on the position 1st,6th,12,th, and so on has vales of less than 256 and the 3 integers following them have a values not more than 256 than the preceding 3 integers egs, 134298156286159.... can be compressed upto 66% i knw its not optimal, please feel free to make suggestions/corrections

function intergerToChar2($v)
{
    $buffer="";
    $charsLen=strlen($v);
    for($i = 0; $i <= $charsLen; $i++)
    {     
        if($v[$i].$v[$i+1].$v[$i+2]<256){$base=$v[$i].$v[$i+1].$v[$i+2];$i=$i+2;}
        else{$base=$v[$i].$v[$i+1];$i=$i+1;}$i=$i+1;

        if($v[$i].$v[$i+1].$v[$i+2]<256){$next=$v[$i].$v[$i+1].$v[$i+2];$i=$i+2;}
        else{$next=$v[$i].$v[$i+1];$i=$i+1;}

        if($next!=="")
        {
            $next=$next-$base;
            if($next<0)$next=255+$next;
        }

        $buffer[]=$base;
        $buffer[]=$next;
    }
    return $buffer;   
}

btw, 10 bit encoding or 40 bit encoding can be easily done using base_convert() or 4th comment from http://php.net/manual/en/ref.bc.php page which always shows a compression of about 58.6%.

10
  • 2
    Is that really a number, or just a string with only numeric characters?
    – brianestey
    Commented Aug 7, 2013 at 2:19
  • How are you storing it now?
    – Blender
    Commented Aug 7, 2013 at 2:20
  • @brianestey yes, u r right! a string of numbers. it can be characters too.
    – Nok Imchen
    Commented Aug 7, 2013 at 2:21
  • 1
    Close-voters: OP has included some code which, although not quite optimal, works. You may not like the fact that it's written in PHP; it certainly wouldn't be my choice, but I don't see how you can say that it isn't an attempted solution which demonstrates some understanding of the problem. Voting to reopen.
    – rici
    Commented Aug 8, 2013 at 5:58
  • 1
    @NokImchen: You can always arrange to super-compress certain strings at the cost of other strings being less compressible. That's why you can do better than the optimal random solution if you know something about what strings you can expect to compress. However, the information-theoretic limit is a real limit, averaged over all strings. That's easy to prove. Notwithstanding that fact, the quest for compression schemes which "can always compress random data" is roughly as widespread as the quest for perpetual motion engines, and just as futile.
    – rici
    Commented Aug 8, 2013 at 6:50

1 Answer 1

4

If the digits are random, then you cannot compress the sequence more than the information-theoretic limit, which is log210 bits/digit. (Actually, it's slightly more than that unless the precise length of the string is fixed.) You can achieve that limit by representing the digits as a (very long) binary number; however, that's awkward and timeconsuming to compress and decompress.

A very near optimal solution results from the fact that 1000 is only slightly less than 210, so you can represent 3 digits using 10 bits. That's 3.33 bits/digits, compared with the theoretically optimal 3.32 bits/digit. (In other words, it's about 99.7% optimal.)

Since there actually 1024 possible 10-bit codes, and you only need 1000 of them to represent 3 digits, you have some left over; one of them can be used to indicate the end of the stream, if necessary.

It's a little bit annoying to output 10-bit numbers. It's easier to output 40-bit numbers, since 40 bits is exactly five bytes. Fortunately, most languages these days support 40-bit arithmetic (actually 64-bit arithmetic).

(Note: This is not that different from your solution. But it's a bit easier and a bit more compressed.)

7
  • Actaully, the question is a bit misleading, the real data is something like this pastebin.com/316U5aDt (i've made a representation. Sorry for my bad English). I cant use 10 bits. i have to stick to the regular 8 bits.
    – Nok Imchen
    Commented Aug 7, 2013 at 4:08
  • @NokImchen: Of course you can use a 10-bit encoding. You just have to write them out 8 bits at a time :), which is why I said that computing four of them would be easier, since you can write the 40 bits out as five 8-bit bytes. That was based on your numeric strings being long, though. If they are short, then you're likely to waste bits at the end of each numeric sequence.
    – rici
    Commented Aug 7, 2013 at 4:17
  • yes, 90+% of the data is numeric thanks for explanation, i now understand that i can use 10 bits encoding, will it be better if i use more than 10 bits encoding?
    – Nok Imchen
    Commented Aug 7, 2013 at 4:20
  • @NokImchen: as I said, the 10-bit encoding is 99.7% optimal. Squeezing out the extra 3 bits per thousand will not be worth the effort.
    – rici
    Commented Aug 7, 2013 at 4:28
  • 1
    @NokImchen: Sadly, no. The information-theoretic limit is about 59.75% and that's for a string of known length (otherwise you also need to encode the length). The only way you can do better is if the sttring is not random.
    – rici
    Commented Aug 7, 2013 at 20:59

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.