Unsigned 32-bit integer to binary string function

Question

I've been trying to solidify my understanding of the binary number system by writing code. So, I tried to write a function with what I currently understand about the binary system that returns a string representation of the binary number representing the input.

Any feedback regarding how I may optimize the program is greatly appreciated. Thanks!

char *uint32_to_binarystr(uint32_t n) {
    if (n != UINT32_MAX) {
        int expb2 = 0;              // Exponent of 2.      
        static char binarystr[33];  // 32 bits + 1 termination NULL char.
        binarystr[32] = '\0';       // Add terminating NULL char.
        memset(binarystr, '0', 32); // Initialize 32-bit string as 0.
        while (n != 0) {                         // Continue until n has been completely represented by a sum of powers of 2.
            while (1 << expb2 <= n) { expb2++; } // Find the power of 2 that yields a result greater than n. 
            expb2 -= 1;                          // Once this number is found, we are sure that the previous power yields the largest power of 2 less than n.
            n -= 1 << expb2;                     // We've found a power of 2 to represent this "chunk" of n. Reduce n by the same amount.
            binarystr[31 - expb2] = '1';         // Set the bit at the index with expb2 digits following it to 1.
            expb2 = 0;                           // Reset exponent of 2 and repeat.
        }
        return binarystr;
    } else {
        /*
        In the case that UINT32_MAX is to be represented, just return hard-coded string. 
        Why? 
        A 32-bit shift is not doable in some cases:
        https://stackoverflow.com/questions/7401888/why-doesnt-left-bit-shift-for-32-bit-integers-work-as-expected-when-used
        */
        static char binarystr[33];  
        binarystr[32] = '\0';       
        memset(binarystr, '1', 32); 
        return binarystr;
    }
}

solidify my understanding of the binary number system: I pedagogically suggest attempting a general algorithm to convert any given (input) number with its base, to any given (input) base. Binary, Hex, Decimal, it is all the same really. But not Roman numerals of course! Your code seems to be a bit shifting exercise and necessarily machine (CPU) specific. — radarbob, 14 hours ago

Lundin · Accepted Answer · 2021-06-15 10:34:17Z

As already said in other reviews, you should switch to caller allocation. And there's no initialize the string at all, just overwrite all of it.

As for the algorithm itself, or in general, go with readability and keep things simple. When we do so, we tend to get efficiency for free too. This seems like a good compromise between readability and performance:

void uint32_to_binstr (char dst[33], uint32_t n)
{
  for(size_t i=0; i<32; i++)
  {
    bool one = n & (1u << i);
    size_t index = 32-i-1; 
    dst[index] = '0' + one;
  }
  dst[32]='\0';
}

There's a rule of thumb to always keep (for) loops as simple as possible, as close to idea ideal for(size_t i=0; i<n; i++) as possible. When you do this, the code tends to turn both more readable and efficient. (A bit of fiddling around with disassemblers showed me that this gave somewhat better machine code than the down counting versions or the versions right shifting a mask.)

Since I counted from 0 to 31, I need to compensate for that when determining the index used for storage - the above actually fills the array from the back to the front. The calculation 32-i-1 gives 31, 30, 29...

The bool one = n & (1u << i); is basically the same as if(n & (1u << i)) and will likely result in a branch too, probably there's way around it without turning the code real ugly.

Note that I used 1u for shifting, since 1 gives a signed int and shifting signed numbers is a very bad idea. In case of left shift we might shift data into the sign bit = undefined behavior bug. In case of right shift we don't know if we get arithmetic or logical shift = non-portable code and generally a bad idea.

Peeking at the resulting machine code (gcc x86 -O3):

uint32_to_binstr:
        lea     r9, [rdi-32]
        mov     rax, rdi
        mov     r8d, 1
.L2:
        mov     ecx, edi
        mov     edx, r8d
        sub     ecx, eax
        sal     edx, cl
        and     edx, esi
        cmp     edx, 1
        mov     edx, 48
        sbb     dl, -1
        sub     rax, 1
        mov     BYTE PTR [rax+32], dl
        cmp     r9, rax
        jne     .L2
        mov     BYTE PTR [rdi+32], 0
        ret

This is reasonably compact and not many branches.

Now since we decided to use a trivial, readable loop, we automatically get lots of advantages we hadn't even considered from the start. Suppose we want to modify the code to drop leading zeroes or just print a certain amount of digits? We can use pretty much the very same code, just swap out the hard-coded 32 for a parameter and then change the order of parameters:

void uint32_to_binstr (uint32_t n, size_t digits, char dst[digits])
{
  for(size_t i=0; i<digits; i++)
  {
    bool one = n & (1u << i);
    size_t index = digits-i-1; 
    dst[index] = '0' + one;
  }
  dst[digits]='\0';
}

AJNeufeld · Accepted Answer · 2021-06-15 02:32:30Z

Your code has a possible bug, depending on your CPU/compiler.

You are treating UINT32_MAX (0xFFFFFFFF) as a special case, however, you could run into problems with any value greater that 0x80000000.

For example, if we let n be 0x80000001, then ...

        int expb2 = 0;
        ...
        while (n != 0) {
            while (1 << expb2 <= n) { expb2++; } 
            ...

we start increasing expb2, shifting 1 over by more and more bits until it is greater than n.

But on a 32-bit architecture, since 1 << 31 is less n, you increase expb2 to 32. and 1 << 32 will overflow 32-bit integer, which results in undefined behaviour, but my experience, the result will be either 0 or 1. Since this is less than n, the loop continues, and the while (1 << expb2 <= n) { expb2++; } loop will continue forever.

Instead of trying to bit shift to find the ideal bit to start at, since you have a known type (uint32) with a known bit width (32), simply start at bit 31 and work your way down.

char *uint32_to_binarystr(uint32_t n) {
    static char binarystr[33];
    binarystr[32] = '\0';
    memset(binarystr, '0', 32);

    for(int expb2 = 31; expb2 >= 0; expb2--) {
        if (1 << expb2 >= n) {
            binarystr[31 - expb2] = '1';
            n -= 1 << expb2;
        }
    }
    return binarystr;
}

Since the maximum shift is 31, there is no need to special case UINT32_MAX.

Thanks for the informative feedback and debugging! The UINT32_MAX special case felt really "hacky" when I implemented it, so thanks again for the alternative algorithm. — Sean Xie, 11 hours ago

Toby Speight · Accepted Answer · 2021-06-15 10:11:55Z

The interface is problematic. What output does this code produce?

#include <stdio.h>
int main(void)
{
    const char *s1 = uint32_to_binarystr(0);
    const char *s2 = uint32_to_binarystr(1000000);
    printf("%s\n%s\n", s1, s2);
}

You might expect it to output 00000000000000000000000000000000 followed by 00000000000011110100001001000000, but that's not what we get. Do you see why?

The problem is

    static char binarystr[33];  
    ⋮
    return binarystr;

s1 and s2 both point to the same data!

The two usual solutions to this problem are

return newly-allocated memory that the caller must free(), or
have the caller supply a buffer into which to write.

I tend to prefer the latter - it avoids overhead of heap allocation in many cases, and reduces the risk of memory leaks due to misuse.

I would write that as

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

bool uint32_to_binarystr(char *binarystr, size_t len, uint32_t n) {
    if (len < 33) {
        // 32 bits + 1 terminating NUL char
        return false;
    } 
    ⋮
    return true;
}

int main(void)
{
    char s1[33];
    char s2[33];
    assert(uint32_to_binarystr(s1, sizeof s1, 0));
    assert(uint32_to_binarystr(s2, sizeof s2, 1000000));
    printf("%s\n%s\n", s1, s2);
}

Passing the length available might seem unnecessary (who would supply a buffer that's too short?) but it does turn out to be a good idea in practice.

For the algorithm, I would approach this with a "moving bit" (for simplicity, I'll demonstrate the uint8_t version). If we start with our number abcdefgh and an initial mask 10000000, then we can decide whether the first digit is 0 or 1 using a bitwise AND operation n & mask. Then move to the next digit by shifting mask rightwards by one. When the bit runs off the end, mask becomes zero, and we know we're done.

In code (and switching back to 32 bits), that looks like:

for (uint32_t mask = 0x80000000;  mask;  mask >>= 1) {
    if (n & mask) {
        /* we have a 1 bit */
    } else {
        /* we have a 0 bit */
    }
}

We can simplify the body of the loop, too, because C guarantees that '1' is one more than '0':

char digit = '0' + ((n & mask) != 0);

If we write all the characters like this, then there's no need for memset() at the beginning.

Putting this all together, I have a replacement which is worth studying:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

bool uint32_to_binarystr(char *buf, size_t len, uint32_t n) {
    if (len < 33) {
        // 32 bits + 1 terminating NUL char
        return false;
    }
    for (uint32_t mask = 0x80000000;  mask;  mask >>= 1) {
        bool bit_is_set = n & mask;
        *buf = '0' + bit_is_set;
        ++buf;
    }
    *buf = '\0';          /* add the terminator */
    return true;
}

*buf++ = '0' + ((n & mask) != 0); is really cryptic to read for beginners though. I would split that expression up in several lines, to increase readability. — Lundin, 4 hours ago
@Lundin, I tried to offset that by showing the intermediate step with if/else. But I've now edited to break up the expression. — Toby Speight, 3 hours ago
Turns out I was writing an answer with nearly the same solution as you did that :) The bool seems to give slightly better machine code than if(n & mask) *buf = '0'; else *buf = '1';. — Lundin, 3 hours ago

Abstract Approach · Accepted Answer · 2021-06-15 06:41:24Z

If you have a good G.P.U. ......... Never mind, these calculations are so small and fast I think you went the right way.

In theory using cuda or something similar with an actual multiply instruction might be better if you were converting A LOT of numbers. But I see no reason for that.

I wonder if you could optimize this with assembly. When we write something like mov x2, #10 it has to convert that "10" to a "1010" and I'm sure it does so via a lookup table or hardware somehow.

Good exercise, now I am curious (I've always given computers hex values as I feel like it would be at least a few clock cycles to figure out what a base 10 representation of a number should be stored as.)

Ben Eater on YouTube has a video about hardwiring a decimal to hex converter with discrete logic gates, I imagine a similar principle applies. There should be some way you can use it if the computer can't.

Is it possible it would be faster to just

fprintf(temporary.txt, "%x", 42)

and convert the output to binary one digit at a time, make a lookup table for every 4 bits, but I'm guessing the persistent memory access will outweigh any gain.

TL;DR Like I said, from my armature eyes it seems like you took the right approach and tracked down the only bug. I'm just giving food for thought and thinking out loud a bit lol,..... I'll stop now

For someone just getting to grips with binary, CUDA is probably a bit too advanced! — Toby Speight, 7 hours ago

Stack Exchange Network

current community

your communities

more stack exchange communities

Unsigned 32-bit integer to binary string function

4 Answers 4

Your Answer

Not the answer you're looking for? Browse other questions tagged c or ask your own question.

Hot Network Questions

Unsigned 32-bit integer to binary string function

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged c or ask your own question.

Related

Hot Network Questions