
Hi,

I need an efficient Java structure to manipulate very sparse vectors of doubles: basic read/write operations. I implemented it with a HashMap, but access is too slow. Should I use another data structure? Do you recommend any free library?

Looking for some peaceful advice :)

Thanks a lot,

Marie

The access for a HashMap is too slow? I know there's overhead with using a HashMap, but that's only calling your hashCode and equals methods. Did you make sure that those methods have optimal implementations? – Malaxeur Dec 10 '09 at 12:56
What kind of access do you need to be fast? It's hard to believe that the HashMap is causing much overhead over a hypothetical optimal solution, unless your indexes do not hash properly... – Pascal Cuoq Dec 10 '09 at 12:57
@pascal HashMap boxes all indices into Integer objects, this is slow. – Adrian Dec 10 '09 at 13:23
Are you accessing the vector in a sorted order or are your accesses random? Are you making loops over all values or are you selectively processing just a subset of the values? – jabbie Dec 11 '09 at 0:39

4 Answers


How big is your dataset? Much larger than Integer.MAX_VALUE? The problem is that HashMap is backed by an array, and collisions will slow performance. Perhaps it's not the mechanism of the hashmap that is too slow, but the fact that you have many collisions. Perhaps if you partitioned your data first (e.g. using another hash function), then stored each partition in its own HashMap, you'd have more luck.
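A hedged sketch of that partitioning idea: a second hash picks one of several inner HashMaps. The class name, partition count, and hash mix are my own illustration, not anything from the question.

```java
// Sketch: partition keys across several HashMaps using a second hash.
// Class name and hash function are illustrative only.
import java.util.HashMap;

final class PartitionedMap {
    private final HashMap<Integer, Double>[] parts;

    @SuppressWarnings("unchecked")
    PartitionedMap(int partitions) {
        parts = new HashMap[partitions];
        for (int i = 0; i < partitions; i++) parts[i] = new HashMap<>();
    }

    private HashMap<Integer, Double> part(int key) {
        int h = key ^ (key >>> 16);               // second hash to spread keys
        return parts[(h & 0x7fffffff) % parts.length];
    }

    void put(int key, double value) { part(key).put(key, value); }

    double get(int key) { return part(key).getOrDefault(key, 0.0); }
}
```

Each inner map then holds fewer entries, so chains stay short even when the raw indices hash poorly.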


HashMap is the way to go. It shouldn't be slow. Run your code through a profiler to see where all the time goes and then optimize accordingly. If you need tips to optimize the code, post an example here so we can help with a specific issue.

[EDIT] Depending on the size of the indexes, you can use a technique like the one in Integer.valueOf(int) to cache the objects used for boxing. But this will only help when you create lots of maps and the indexes fall in a somewhat limited range.
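A minimal sketch of that caching technique, assuming indexes stay below some bound (the class name and the 65536 limit are my own choices; note that Integer.valueOf itself only caches -128..127):

```java
// Sketch of an Integer cache for boxing indices, in the spirit of
// Integer.valueOf(int). Class name and cache range are illustrative.
final class IndexCache {
    private static final int LIMIT = 1 << 16;           // cache indices 0..65535
    private static final Integer[] CACHE = new Integer[LIMIT];
    static {
        for (int i = 0; i < LIMIT; i++) CACHE[i] = i;   // box each index once
    }
    static Integer box(int index) {
        // reuse the cached object for small indices, fall back to boxing otherwise
        return (0 <= index && index < LIMIT) ? CACHE[index] : Integer.valueOf(index);
    }
}
```

Calling `map.get(IndexCache.box(i))` then avoids allocating a fresh Integer on every lookup for indices in range.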

Or you can try IntHashMap from commons-lang. It's a bit awkward to use (it's package private), but you can copy the code.

Lastly, you could use your own implementation of an int-based HashMap with optimized value lookup for your case.
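A hedged sketch of what such an int-based map could look like: open addressing over parallel `int[]`/`double[]` arrays, so lookups never box. This is my own minimal illustration (no deletion, and it assumes non-negative keys), not an existing library class.

```java
// Minimal open-addressing int -> double map; a sketch only, not
// production code (no remove(), keys must be non-negative because
// -1 is used as the empty-slot sentinel).
import java.util.Arrays;

final class IntDoubleMap {
    private static final int EMPTY = -1;
    private int[] keys;
    private double[] values;
    private int size;

    IntDoubleMap(int capacity) {
        int cap = Integer.highestOneBit(Math.max(capacity, 8) * 2); // power of two
        keys = new int[cap];
        values = new double[cap];
        Arrays.fill(keys, EMPTY);
    }

    public void put(int key, double value) {
        if ((size + 1) * 2 > keys.length) grow();   // keep load factor under 0.5
        int i = indexOf(key);
        if (keys[i] == EMPTY) { keys[i] = key; size++; }
        values[i] = value;
    }

    public double get(int key) {
        int i = indexOf(key);
        return keys[i] == key ? values[i] : 0.0;    // absent entries read as zero
    }

    private int indexOf(int key) {
        int mask = keys.length - 1;
        int i = (key * 0x9E3779B9) & mask;          // cheap integer mixing
        while (keys[i] != EMPTY && keys[i] != key) i = (i + 1) & mask; // linear probe
        return i;
    }

    private void grow() {
        int[] oldKeys = keys; double[] oldValues = values;
        keys = new int[oldKeys.length * 2];
        values = new double[keys.length];
        Arrays.fill(keys, EMPTY);
        size = 0;
        for (int j = 0; j < oldKeys.length; j++)
            if (oldKeys[j] != EMPTY) put(oldKeys[j], oldValues[j]);
    }
}
```

Since both arrays hold primitives, a get is a probe plus an array read, with no hashCode/equals calls and no Integer allocation.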

OK, thanks all for your quick reply. From your answers, I understand that HashMap is the best strategy to use. Actually, I am using *very* big data sets which I unfortunately can not optimize further and I was wondering how to make my code more efficient. :) Thanks again! – Marie Dec 10 '09 at 13:05
OP says map is too slow, sorry but reading the Q should be the minimum before answering. – Adrian Dec 10 '09 at 13:15
@Adrian: And reading my answer should be the minimum before downvoting it. – Aaron Digulla Dec 10 '09 at 15:34
@Marie: As I said: Run it through a profiler to determine the bottlenecks and then come back. – Aaron Digulla Dec 10 '09 at 15:36
@aaron The OP has integer indices; for that, a map is just not the way to go. A sparse vector with binary search on ordered indices is better than relying on the hash values of integers (which are, well, the integers themselves), not to speak of integer boxing in Java. – Adrian Dec 10 '09 at 16:07

For a 1D sparse array, a map is normally the way to go. You only need a library if it's multi-dimensional.

If you compare access time between map and array,

   map.get(99);
   array[99];

map is going to be much slower. Any library would have the same issue.

Isn't that what sparse arrays are all about? You trade time for space.
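A naive timing sketch of that comparison (numbers vary wildly by JVM and warmup, so treat it as indicative only; the class name is mine):

```java
// Naive micro-benchmark: summing a million doubles through a raw array
// versus a HashMap<Integer, Double>. Timings are only indicative.
import java.util.HashMap;

public class MapVsArray {
    public static void main(String[] args) {
        int n = 1_000_000;
        double[] array = new double[n];
        HashMap<Integer, Double> map = new HashMap<>();
        for (int i = 0; i < n; i++) { array[i] = i; map.put(i, (double) i); }

        long t0 = System.nanoTime();
        double sumArray = 0;
        for (int i = 0; i < n; i++) sumArray += array[i];
        long t1 = System.nanoTime();
        double sumMap = 0;
        for (int i = 0; i < n; i++) sumMap += map.get(i);  // boxes i on every call
        long t2 = System.nanoTime();

        if (sumArray != sumMap) throw new AssertionError("sums differ");
        System.out.printf("array: %d ms, map: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```

The map loop pays for boxing the index, hashing, and unboxing the Double on every iteration; the array loop is a plain indexed read.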

OP says map is too slow, sorry but reading the Q should be the minimum before answering. – Adrian Dec 10 '09 at 13:14
Reading the A should be the minimum before downvoting :( – John Mill Dec 10 '09 at 13:21
Your answer suggests neither another data structure nor a library; the OP asked for that. If you fix this, I'll remove the downvote. – Adrian Dec 10 '09 at 13:35
So you wouldn't take NO for an answer :) Slow is a relative term. It's always going to be slower compared with a regular array. My point is that any other approach will be slower. You won't find a library that compresses the array and also provides a speed gain. For speed, you use a raw array. For space, you compress it. Can't get both. – ZZ Coder Dec 10 '09 at 14:34
If your answer is "nothing is faster than a map" then it's false. There are data structures that are faster than a hash map for some use cases, and sparse vectors are one of these use cases. – Adrian Dec 10 '09 at 16:03

You can copy paste the sparse vector from my Hapax project: ch.akuhn.matrix.SparseVector

PS: to all the other answers and comments that don't grok why using a map is too slow: it is slow because a map boxes all indices into Integer objects!

The sparse vector presented here is fast for read access and for appending values, but not for putting at random indices. It is optimal for a scenario where you first create the sparse vector by putting values in order of increasing indices, and later mostly read from it.

Important methods in the sparse vector class are

import java.util.Arrays;

// ...

public class SparseVector {

    /*default*/ int[] keys;
    /*default*/ int size, used;
    /*default*/ double[] values;

    public SparseVector(int size, int capacity) {
    	assert size >= 0;
    	assert capacity >= 0;
    	this.size = size;
    	this.keys = new int[capacity];
    	this.values = new double[capacity];
    }

    public double get(int key) {
    	if (key < 0 || key >= size) throw new IndexOutOfBoundsException(Integer.toString(key));
    	int spot = Arrays.binarySearch(keys, 0, used, key);
    	return spot < 0 ? 0 : values[spot];
    }

    public boolean isUsed(int key) {
    	return 0 <= Arrays.binarySearch(keys, 0, used, key);
    }

    public double put(int key, double value) {
    	if (key < 0 || key >= size) throw new IndexOutOfBoundsException(Integer.toString(key));
    	int spot = Arrays.binarySearch(keys, 0, used, key);
    	if (spot >= 0) return values[spot] = value;
    	else return update(-1 - spot, key, value);
    }

    public void resizeTo(int newSize) {
    	if (newSize < this.size) throw new UnsupportedOperationException();
    	this.size = newSize;
    }

    public int size() {
    	return size;
    }

    private double update(int spot, int key, double value) {
    	// grow if reaching end of capacity
    	if (used == keys.length) {
    		int capacity = (keys.length * 3) / 2 + 1;
    		keys = Arrays.copyOf(keys, capacity);
    		values = Arrays.copyOf(values, capacity);
    	}
    	// shift values if not appending
    	if (spot < used) {
    		System.arraycopy(keys, spot, keys, spot + 1, used - spot);
    		System.arraycopy(values, spot, values, spot + 1, used - spot);
    	}
    	used++;
    	keys[spot] = key;
    	return values[spot] = value;
    }

    public int used() {
    	return used;
    }

    public void trim() {
    	keys = Arrays.copyOf(keys, used);
    	values = Arrays.copyOf(values, used);
    }

}
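To see the core technique in isolation, here is a self-contained sketch of the same idea as `SparseVector.get()`: sorted parallel key/value arrays read via binary search. The class name and data are my own illustration.

```java
// Self-contained sketch of the parallel-array + binary-search technique
// used by SparseVector above. Data and class name are illustrative.
import java.util.Arrays;

public class SparseDemo {
    public static void main(String[] args) {
        int[] keys = {3, 42, 999_999};       // sorted indices that hold values
        double[] values = {1.5, 2.5, 3.5};   // values parallel to keys
        System.out.println(get(keys, values, 42));   // prints 2.5
        System.out.println(get(keys, values, 100));  // prints 0.0
    }

    static double get(int[] keys, double[] values, int key) {
        int spot = Arrays.binarySearch(keys, key);
        return spot < 0 ? 0.0 : values[spot];  // absent indices read as zero
    }
}
```

Reads cost O(log used) with no boxing, which is why this layout beats a HashMap for read-mostly sparse vectors; the price is the array shifting on out-of-order puts.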
OK, thanks, this seems very nice. The question here would become: is this implementation more time-efficient than a HashMap? I am really not an expert... let's vote (but please not fight) – Marie Dec 10 '09 at 13:28
Putting might be slower, getting will be faster. I'll add that to the answer. – Adrian Dec 10 '09 at 13:37
Vote?? Measure! – Svante Dec 10 '09 at 13:39
That's because voting might be more time-efficient for a bad programmer like me :) I already started changing the code to test it. – Marie Dec 10 '09 at 13:44
We cannot vote about what's faster in your use case. Different data structures have different properties; which one is faster ultimately depends on your particular use case. – Adrian Dec 10 '09 at 14:00
