Sort a binary file without loading it into memory or using a temporary file

Question

I had a silly assignment:

You are given a binary file FLOATS.DAT of unknown length, type float data has been written in it. Write a program which sorts the data in the file (the output is a sorted file FLOATS.DAT). You need to sort it without loading the file into memory, or by using a temporary file. Use the Shell sort algorithm.

Obviously by using a temp. file it would be easy, even more if you could load it into memory.

Here is my solution:

#include <stdio.h>

// sizeof(float) = 4 on my machine

int main(int argc, char **argv)
{
    size_t sajz = sizeof(float); // for convenience 
    int n; // total number of floats in the file

    FILE *dat;

    dat = fopen("FLOATS.DAT", "r+b");

    if (dat == NULL)
    {
        printf("Error while opening.\n");
        return 1;
    }

    // find the number of floats in the file
    fseek(dat, 0, SEEK_END);
    n = ftell(dat)/sajz;
    rewind(dat);

    float tjmh, tx;
    int i,j,h;

    // standard shell sort algorithm, the hard part is when I have to
    // read a position I first have to jump to it, then read it,
    // same when writing to a location
    for (h = n/2; h>0; h/=2)
    {
        for (i=h; i<n; i++)
        {
            j = i;

            fseek(dat, i*sajz, SEEK_SET);
            fread(&tx, sajz, 1, dat);
            rewind(dat);

            fseek(dat, (j-h)*sajz, SEEK_SET);
            fread(&tjmh, sajz, 1, dat);
            rewind(dat);

            for (; j>=h && tx < tjmh; j -= h)
            {

                fseek(dat, j*sajz, SEEK_SET);
                fwrite(&tjmh, sajz, 1, dat);
                rewind(dat);

                // this was tricky to figure out
                // I need the value of `j` after it has been
                // subtracted from h
                fseek(dat, (j - 2*h)*sajz, SEEK_SET);
                fread(&tjmh, sajz, 1, dat);
                rewind(dat);
            }

            fseek(dat, j*sajz, SEEK_SET);
            fwrite(&tx, sajz, 1, dat);
            rewind(dat);

        }
    }

    fclose(dat);

    return 0;
}

Can this be improved in any way?

Code snippet to generate the file with numbers 2.0, 3.0, 10.0, 4.0, 1.0, 7.0, 9.0, 5.0, 6.0, 8.0

int main()
{
    FILE *dat;

    dat = fopen("FLOATS.DAT", "wb");

    float a;
    if (dat)
    {
        a = 2.00;
        fwrite(&a, sizeof(float), 1, dat);
        a = 3.00;
        fwrite(&a, sizeof(float), 1, dat);
        a = 10.00;
        fwrite(&a, sizeof(float), 1, dat);
        a = 4.00;
        fwrite(&a, sizeof(float), 1, dat);
        a = 1.00;
        fwrite(&a, sizeof(float), 1, dat);
        a = 7.00;
        fwrite(&a, sizeof(float), 1, dat);
        a = 9.00;
        fwrite(&a, sizeof(float), 1, dat);
        a = 5.00;
        fwrite(&a, sizeof(float), 1, dat);
        a = 6.00;
        fwrite(&a, sizeof(float), 1, dat);
        a = 8.00;
        fwrite(&a, sizeof(float), 1, dat);

        fclose(dat);
    }

    return 0;
}

You can use an array here as well.

Where did this assignment come from? Would mmap() be considered "loading the file into memory"? — 200_success♦, yesterday
By "Shell Sort" I though it meant use the sort built into your shell. But what I think it now means you need to sort a stream of floating point numbers using "shell sort". Obviously you have to read it into memory the point of not loading it is to force you to sort a stream of data. — Loki Astari, yesterday
Of course, in reality if one is "given [implying by someone else] a binary file FLOATS.DAT of unknown length, type float data", one first has to determine whether it's possible to unambiguously determine the non-portable bit representation that was used to write it, otherwise it's of no use... and even if you can figure that out, your system might differ, meaning you'd have to convert to its representation, which isn't necessarily trivial. All of which supports your conclusion that - like so many others - it's a "silly assignment". ;) — underscore_d, yesterday
@underscore_d: in practice, the IEEE-754 floating point representation is near-universal. gcc defines macros for float endianness, but IDK if that's an issue on real CPUs the way integer endianness is. — Peter Cordes, 22 hours ago

mdfst13 · Answer 1 · 2016-07-19 03:02:38Z

Don't Repeat Yourself (DRY)

            fseek(dat, i*sajz, SEEK_SET);
            fread(&tx, sajz, 1, dat);
            rewind(dat);

You do this three times. The general rule is first time write, second time copy, third time refactor into a common function. Note that if there is sufficient commonality, you may create the common function with only two uses (e.g. with the fwrite version). Make a function:

void read_at_position(FILE *source, size_t position, size_t size, void *datum)
{
    fseek(source, position * size, SEEK_SET);
    fread(datum, size, 1, source);
    rewind(source);
}

I made datum a void * because that's what fread takes. This way you don't have to write a new function if you change the datum type.

Now you can rewrite the original section (and the other four) like

            read_at_position(dat, i, sajz, &tx);

If you're worried about the overhead, you could create a macro instead

#define READ_AT_POSITION(source, position, size, datum) \
    do {\
        fseek(source, (position) * (size), SEEK_SET);\
        fread(datum, size, 1, source);\
        rewind(source);\
    } while (0)

which you'd use as

            READ_AT_POSITION(dat, i, sajz, &tx);

And the compiler will turn it into your original version.

Note that a good compiler will probably do that with the function as well.

Better names

The only way that I know what sajz and dat do is from looking at their declarations. Good names should balance being easy to write (e.g. size) with being descriptive (size_of_each_datum or size_of_float). Perhaps datum_size instead of sajz.

Similar problem with tjmh and tx. Neither name tells me anything about what they do.

Consider changing h to gap_size. Yes, h is a standard name in a shellsort (as per Wikipedia), but will most people remember that if they see it?

Reusability

Rather than writing this directly into main consider making a function. E.g.

void sort_floats_in_file(FILE *source, size_t datum_size);

You could generalize this by passing in a comparison function.

void sort_in_file(FILE *source, size_t datum_size, int (*compar)(const void *, const void*));

However, that can add unnecessary overhead in simple cases like this. Perhaps the compiler takes care of that. Perhaps not.

Robustness

You say

    size_t sajz = sizeof(float); // for convenience

And later

    float tjmh, tx;

Consider flipping the order around and saying

    float tjmh, tx;
    size_t sajz = sizeof tjmh; // for convenience

Then if you change the data type of tjmh and tx, sajz will change to match automatically. As is, it would be easy to forget and only change in the one place.

Similarly,

    float a;
    if (dat)
    {
        a = 2.00;
        fwrite(&a, sizeof(float), 1, dat);

Consider rewriting this as

    if (dat)
    {
        test_write(2.00, dat);

with

void test_write(float a, FILE *dat)
{
    fwrite(&a, sizeof a, 1, dat);
}

Not only is this less typing, it avoids the problem of changing the type in one place and not the other. Admittedly there's some extra work to write the function, but the reduction per test case should cover it.

JS1 · Answer 2 · 2016-07-19 05:11:58Z

up vote 5 down vote

Actually loads file into memory

By virtue of using fread and fwrite, you are using buffered I/O and you are actually loading a large part of the file into memory. In fact, for your small test case, the entire file will be loaded into memory on the first fread. To avoid this, you either need to:

Use read() and write() instead.
Turn buffering off using setvbuf().

Be kind, rewind?

Your calls to rewind() are unnecessary because you always use fseek(..., SEEK_SET) which sets the file position from the start of the file. So you can just remove all your calls to rewind().

answered yesterday

JS1

17k21561

There's a tight upper bound on how much total memory will be used at any one time by stdio buffers. So this is an in-place sort that uses O(1) space to sort an arbitrary sized file. That should satisfy the spirit of the assignment. If you want to avoid ending up with the whole file in memory (in the pagecache) you'd need to open(..., O_RDRW|O_DIRECT) to use direct I/O (problematic since IIRC you can only read / write whole pages), or use madvise(MADV_DONTNEED) to flush the kernel's disk cache. Upvoted for rewind, though. – Peter Cordes 22 hours ago

Unclear why suggest read() instead of fread() as post is not system specific and read() is not in the standard C library. Suppose read() is likely to exist. – chux 11 hours ago

add a comment |

chux · Answer 3 · 2016-07-19 21:32:41Z

Additions to previous 2 fine answers

Check I/O results. Robust check for errors, especially I/O. Code does use if (dat == NULL), which is good. Consider checking other file I/O.

size_t sajz = sizeof(float); // for convenience 
int n; // total number of floats in the file
... 
// fseek(dat, 0, SEEK_END);
// n = ftell(dat)/sajz;
// rewind(dat);

if (fseek(dat, 0, SEEK_END)) Handle_Error();
long pos = ftell(dat);
if (pos == -1) Handle_Error();
n = pos/sajz;
rewind(dat);  // No error return

Use long rather than int. Code is limited to files LONG_MAX or less in size as code uses long ftell(). Yet code mixes int with long with various calculations. This only makes a difference when 1) INT_MAX < LONG_MAX, and 2) the file is huge. Yet this is precisely the scenario to use an in-place sort: when the file is huge.
```
// int n; // total number of floats in the file
// ...
// int i,j,h;

long n; // total number of floats in the file
...
long i,j,h;
```
Hidden compare: tx < tjmh in for (; j>=h && tx < tjmh; j -= h). Suggest un-hiding it. The central computation of the entire code is this compare.
```
// for (; j>=h && tx < tjmh; j -= h)

for (; j>=h && cmp(tx, tjmh); j -= h) {
// or 
for (; j>=h; j -= h) {
  if (tx < tjmh) break;
```

Consider that code may want another compare. Example: With typical floating point, Not-a-number and -0.0 may occur. How will your code sort that, if < is not to coding goals? Perhaps a helper function.

int cmp_lt(float f1, float f2) {
  // first 3 if's return a result when f1, f1 are well ordered. (Neither NaN)
  if (f1 < f2) return 1;
  if (f1 > f2) return 0;
  if (f1 == f2) {
    if (f1 == 0.0f && (signbit(f1) != signbit(f2)) {
      // f1 and f2 are both some 0.0 (one + and one -)
      // let -0.0 be "less than" +0.0
      return signbit(f1) == 0;
    }
    // Consider then equal
    return 0;
  }
  // at least 1 of f1,f2 is a NaN
  return isnan(f2);  // Consider f1 "less than" f2 when f2 is NaN
}

for (; j>=h && cmp_lt(tx, tjmh); j -= h) {

asked	yesterday
viewed	458 times
active	today

current community

your communities

more stack exchange communities

Sort a binary file without loading it into memory or using a temporary file

3 Answers 3

Don't Repeat Yourself (DRY)

Better names

Reusability

Robustness

Actually loads file into memory

Be kind, rewind?

Your Answer

Not the answer you're looking for? Browse other questions tagged c sorting or ask your own question.

Visit Chat

Hot Network Questions

current community

your communities

more stack exchange communities

Sort a binary file without loading it into memory or using a temporary file

3 Answers 3

Don't Repeat Yourself (DRY)

Better names

Reusability

Robustness

Actually loads file into memory

Be kind, rewind?

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged c sorting or ask your own question.

Visit Chat

Related

Hot Network Questions