
I would like to store a very simple pojo object in binary format:

public class SampleDataClass {
    private long field1;
    private long field2;
    private long field3;
}

To do this, I have written a simple serialize/deserialize pair of methods:

public class SampleDataClass {

    // ... Fields as above        

    public static void deserialize(ByteBuffer buffer, SampleDataClass into) {
        into.field1 = buffer.getLong();
        into.field2 = buffer.getLong();
        into.field3 = buffer.getLong();
    }

    public static void serialize(ByteBuffer buffer, SampleDataClass from) {
        buffer.putLong(from.field1);
        buffer.putLong(from.field2);
        buffer.putLong(from.field3);
    }
}

Simple and efficient, and most importantly the size of the objects in binary format is fixed. I know the size of each record serialized will be 3 x long, i.e. 3 x 8bytes = 24 bytes.

This is crucial, as I will be recording these sequentially and I need to be able to find them by index later on, i.e. "Find me the 127th record".
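Since every record is exactly 24 bytes, "find me the 127th record" reduces to a seek to index × 24. A minimal sketch of that lookup using a positional `FileChannel` read (the class name `RecordIndex` is illustrative; the returned buffer would be handed to a deserialize method like the one below):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class RecordIndex {
    static final int RECORD_SIZE = 3 * Long.BYTES; // 3 longs = 24 bytes

    // Byte offset of the n-th record (0-based) in the file.
    static long offsetOf(long index) {
        return index * RECORD_SIZE;
    }

    // Positional read: fetches exactly one record without moving the channel's position.
    static ByteBuffer readRecord(FileChannel channel, long index) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(RECORD_SIZE);
        channel.read(buffer, offsetOf(index));
        buffer.flip();
        return buffer; // pass to SampleDataClass.deserialize(buffer, into)
    }
}
```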

This is working fine for me, but I hate the boilerplate - and the fact that at some point I'm going to make a mistake and end up writing a load of data that can't be read back because of an inconsistency between my serialize / deserialize methods.

Is there a library that can generate something like this for me?

Ideally I'm looking for something like protobuf, but with a fixed-length encoding scheme. Later on, I'd like to encode strings too. These will also have a fixed length: if a string exceeds the length it's truncated to n bytes; if a string is too short, I'll null-pad it (or similar).
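A minimal sketch of the truncate-or-pad string scheme described above (the field width, the ASCII charset, and the class name `FixedString` are all assumptions for illustration; a multi-byte encoding would need more care since one char is no longer one byte):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class FixedString {
    // Writes s into exactly `width` bytes: truncates if too long, zero-pads if short.
    static void putFixed(ByteBuffer buffer, String s, int width) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        int n = Math.min(bytes.length, width);
        buffer.put(bytes, 0, n);
        for (int i = n; i < width; i++) {
            buffer.put((byte) 0); // null padding, as described above
        }
    }

    // Reads a fixed-width field back, dropping the trailing null padding.
    static String getFixed(ByteBuffer buffer, int width) {
        byte[] bytes = new byte[width];
        buffer.get(bytes);
        int end = 0;
        while (end < width && bytes[end] != 0) end++;
        return new String(bytes, 0, end, StandardCharsets.US_ASCII);
    }
}
```

Because every string field occupies exactly `width` bytes regardless of content, record size stays constant and the index-times-record-size lookup keeps working.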

Finally, protobuf supports different versions of the protocol. It is inevitable I'll need to do that eventually.

I was hoping someone had a suggestion before I start rolling my own.

I've checked-out Cap'n Proto, but (A) it's not ready for prime-time and (B) it's only got reliable support for C++ at the moment. –  jwa Nov 22 '13 at 15:46

3 Answers

Make your class implement the java.io.Serializable interface. Then you can use java.io.ObjectOutputStream and java.io.ObjectInputStream to serialize / deserialize objects to / from streams, wrapping byte-array streams if you need byte[]s. To make it fixed length, standardize the size of the byte[] arrays used.
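A sketch of that round trip through byte arrays (checked exceptions are wrapped as unchecked here purely to keep the sketch short). Note this is standard Java serialization, so the resulting byte[] is not 24 bytes - it embeds class descriptors - which is why the answer suggests padding to a standardized size:

```java
import java.io.*;

public class JavaSerDemo {
    // Serializes an object to a byte array via ObjectOutputStream.
    static byte[] serialize(Serializable obj) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Reads the object back from the byte array via ObjectInputStream.
    static Object deserialize(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```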

    
To make this a fixed size, how do I figure out the length in bytes that it needs to be? Does this have to be hard-coded? If so I've still got that boilerplate to manage. It could be a step forward from hand-coding it all, but I'd still need a way to handle versioning. Also, from your suggestion I assume the ObjectInputStream would be able to handle the trailing zeros left by a "shorter" object? –  jwa Nov 22 '13 at 16:18
    
I thought you knew ahead of time how many bytes the pojo would be when serialized (ie. 24 bytes), no? In case of versioning, you can use ObjectOutputStream.reset in order to force a write and avoid reusing cached data. –  CaliforniaDreaming Nov 22 '13 at 16:25
    
Now that I think of it, you may also be able to use reflection to determine the size of the fields in your object, thus avoiding hard coding values in your serializer. –  CaliforniaDreaming Nov 22 '13 at 16:28
    
I do indeed have the length of the object coded from an up-front calculated value; just clarifying that this wouldn't solve that problem for me. –  jwa Nov 22 '13 at 16:30
    
To clarify if this helps me with versioning. If I add field4 and attempt to deserialize a byte-array that was only written with fields 1-3 (i.e. 24 bytes) will it handle the fact field4 is missing? Even if this is the case, it would prevent me from ever re-ordering my fields? –  jwa Nov 22 '13 at 16:32

The most difficult part here is capping your strings or collections. You can do this with Kryo for Strings by overriding the default serializers. Placing strings into a custom buffer class (e.g. a FixedSerializableBuffer) that stores, or is annotated with, the length to cut at also makes sense.

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.ByteBufferOutput;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

public class KryoDemo {
    static class Foo{
        String s;
        long v;

        Foo() {
        }

        Foo(String s, long v) {
            this.s = s;
            this.v = v;
        }

        @Override
        public String toString() {
            final StringBuilder sb = new StringBuilder("Foo{");
            sb.append("s='").append(s).append('\'');
            sb.append(", v=").append(v);
            sb.append('}');
            return sb.toString();
        }
    }

    public static void main(String[] args) {
        Kryo kryo = new Kryo();

        Foo foo = new Foo("test string", 1);

        kryo.register(String.class, new Serializer<String>() {
            {
                setImmutable(true);
                setAcceptsNull(true);
            }

            public void write(Kryo kryo, Output output, String s) {
                // Cap strings at 4 characters; null is allowed (setAcceptsNull above)
                if (s != null && s.length() > 4) {
                    s = s.substring(0, 4);
                }

                output.writeString(s);
            }

            public String read(Kryo kryo, Input input, Class<String> type) {
                return input.readString();
            }
        });

        // serialization part, data is binary inside this output
        ByteBufferOutput output = new ByteBufferOutput(100);

        kryo.writeObject(output, foo);

        System.out.println("before: " + foo);
        System.out.println("after: " + kryo.readObject(new Input(output.toBytes()), Foo.class));
    }
}

This prints:

before: Foo{s='test string', v=1}
after: Foo{s='test', v=1}

If the only additional requirement over standard serialization is efficient random access to the n-th entry, there are alternatives to fixed-size entries, and the fact that you will be storing variable-length entries (such as strings) makes me think these alternatives deserve consideration.

One such alternative is to have a "directory" with fixed length entries, each of which points to the variable length content. Random access to an entry is then implemented by reading the corresponding pointer from the directory (which can be done with random access, as the directory entries are fixed size), and then reading the block it points to. This approach has the disadvantage that an additional I/O access is required to access the data, but permits a more compact representation of the data, as you don't have to pad variable length content, which in turn speeds up sequential reading. Of course, neither the problem nor the above solution is novel - file systems have been around for a long time ...
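A rough in-memory sketch of the directory idea (the class name `DirectoryStore` and the length-prefix layout are illustrative; on disk, the directory's fixed 8-byte offset slots and the variable-length data would live in separate files or regions):

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class DirectoryStore {
    private final ByteArrayOutputStream data = new ByteArrayOutputStream();
    private final List<Long> directory = new ArrayList<>(); // fixed 8-byte slots on disk

    // Appends a variable-length entry and records its start offset in the directory.
    public int append(String entry) {
        directory.add((long) data.size());
        byte[] bytes = entry.getBytes(StandardCharsets.UTF_8);
        byte[] len = ByteBuffer.allocate(4).putInt(bytes.length).array();
        data.write(len, 0, len.length);     // 4-byte length prefix
        data.write(bytes, 0, bytes.length); // entry content, no padding needed
        return directory.size() - 1;        // the entry's index
    }

    // Random access: one directory lookup, then read the block it points to.
    public String get(int index) {
        ByteBuffer buf = ByteBuffer.wrap(data.toByteArray());
        buf.position(directory.get(index).intValue());
        byte[] bytes = new byte[buf.getInt()];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Random access to entry n costs one fixed-offset read into the directory plus one read of the data block, while the data itself stays compact.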

    
I see what you are getting at. My requirement actually leads to sequential reads, i.e. read elements 20-50. The advantage of fixed width would be that they are (HDD fragmentation aside) stored sequentially on disk. Looking up against the table would require at least some seeking, back to the directory, then back into the data blocks. –  jwa Nov 22 '13 at 16:40
    
Not necessarily. You could keep the directory in a separate file, or use a separate file descriptor for reading from the directory. That would cause the OS to cache both the current directory and the current data block, speeding up sequential reading. –  meriton Nov 22 '13 at 16:44
    
Or - if you know the range beforehand - you can read the relevant part of the directory first, and then sequentially read the data entries. –  meriton Nov 22 '13 at 16:45
    
Good suggestion, thanks for the clarification. –  jwa Nov 22 '13 at 16:45
