
In my data mining project, I'm given a huge, complicated multidimensional array of arrays that contains all the info I require, except that I have to perform a "fix" on it before I can process it. I've written some code that takes care of the issue, but it's taking way too long for the amount of data I have to "fix," and I'm hoping someone can help me find a more efficient solution.

Essentially, the array I'm working with is first indexed by an integer, like any run-of-the-mill array, i.e. $x[0], $x[1], $x[2]. Each element is an associative array that contains the key-value pairs I need (such as $x[0]['item'], $x[0]['price']). However, one key is stored a bit deeper: the ID.

An ID number exists in the array as $x[0]['@attributes']['id'], and I would like to simplify the structure by duplicating it alongside the other key-value pairs, e.g. as $x[0]['id'].

The data set I'm working with is large, but here is a simplified example of my situation:

$attrib1 = array('id'=>'101');
$item1 = array('@attributes'=>$attrib1, 'item'=>'milk', 'price'=>'3.50');
$attrib2 = array('id'=>'102');
$item2 = array('@attributes'=>$attrib2, 'item'=>'butter', 'price'=>'2.45');
$attrib3 = array('id'=>'103');
$item3 = array('@attributes'=>$attrib3, 'item'=>'bread', 'price'=>'1.19');
$items = array($item1, $item2, $item3);
echo "Starting data - items using itemid as attribute:\n";
print_r($items);

# set item numbers by key instead of attribute
$i=0;
while(isset($items[$i]['@attributes']['id'])) {
   $items[$i]['itemid'] = $items[$i]['@attributes']['id'];
   #unset($items[$i]['@attributes']);
   $i++;
} # while
echo "\nDesired result - items using itemid as key:\n";
print_r($items);

Here is the output from the example above:

Starting data - items using itemid as attribute:
Array
(
    [0] => Array
        (
            [@attributes] => Array
                (
                    [id] => 101
                )

            [item] => milk
            [price] => 3.50
        )

    [1] => Array
        (
            [@attributes] => Array
                (
                    [id] => 102
                )

            [item] => butter
            [price] => 2.45
        )

    [2] => Array
        (
            [@attributes] => Array
                (
                    [id] => 103
                )

            [item] => bread
            [price] => 1.19
        )

)

Desired result - items using itemid as key:
Array
(
    [0] => Array
        (
            [@attributes] => Array
                (
                    [id] => 101
                )

            [item] => milk
            [price] => 3.50
            [itemid] => 101
        )

    [1] => Array
        (
            [@attributes] => Array
                (
                    [id] => 102
                )

            [item] => butter
            [price] => 2.45
            [itemid] => 102
        )

    [2] => Array
        (
            [@attributes] => Array
                (
                    [id] => 103
                )

            [item] => bread
            [price] => 1.19
            [itemid] => 103
        )

)

Note the added [itemid] key-value pair in the desired result. Is there a faster / more elegant way of accomplishing this? I've looked at some of PHP's fancy array functions, but I can't wrap my head around this more complicated situation to make use of them. Any ideas?


closed as too localized by Baba, hakre, Jocelyn, tereško, PeeHaa Oct 27 '12 at 23:57


    
How many arrays are we talking about? Have you considered that parallel execution might be required for large amounts of data? – Dan Oct 25 '12 at 21:31
    
The arrays can contain as many as 300 to 4000 elements, each element containing a variety of associative key data, similar to my example data. The trouble is, there are tens of thousands of these array sets I have to process, so even cutting out a few seconds for each one could potentially cut the full job by hours. – Emo Mosley Oct 26 '12 at 1:08

Memory Efficiency

According to a comment in the PHP documentation, the memory footprint of SplFixedArray is about 37% of a regular array of the same size.

SplFixedArray also implements Iterator, which means it encapsulates the list and exposes one element at a time, which can make iterating over it more efficient.

In PHP 5, a by-value foreach loop may work on a copy of the array passed to it. If you are processing a large amount of data, using it directly on your array can be a performance issue.
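One way to sidestep building a second array is to iterate by reference and update each element in place. The following is only a minimal sketch on sample data shaped like the question's example; whether it is actually faster depends on the PHP version and the data, so it should be benchmarked before being relied on.

```php
<?php
// Sample data shaped like the question's example (values are illustrative).
$items = array(
    array('@attributes' => array('id' => '101'), 'item' => 'milk', 'price' => '3.50'),
    array('@attributes' => array('id' => '102'), 'item' => 'butter', 'price' => '2.45'),
);

// Iterate by reference so each element is updated in place.
foreach ($items as &$item) {
    if (isset($item['@attributes']['id'])) {
        $item['itemid'] = $item['@attributes']['id'];
    }
}
unset($item); // break the reference so later writes to $item do not touch $items

print_r($items);
```

The unset() after the loop matters: without it, $item stays bound to the last element and a later assignment to $item would silently overwrite it.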

Also see How big are PHP arrays (and values) really? (Hint: BIG!)

You can try

$it = SplFixedArray::fromArray($items);
foreach ( $it as $value ) {
    // Play with big array
}
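To check the memory claim on your own setup, something like the sketch below can compare the two containers. The size $n is an arbitrary value chosen for illustration, and the numbers reported by memory_get_usage() will differ between PHP versions, so treat the 37% figure as something to verify rather than a given. Note also that SplFixedArray::fromArray() only converts the outer list; the inner associative arrays stay regular arrays.

```php
<?php
// Hypothetical size, chosen only for illustration.
$n = 100000;

// Memory used by a regular array holding $n integers.
$before = memory_get_usage();
$plain = range(1, $n);
$plainBytes = memory_get_usage() - $before;

// Memory used by an SplFixedArray holding the same integers.
$before = memory_get_usage();
$fixed = new SplFixedArray($n);
for ($i = 0; $i < $n; $i++) {
    $fixed[$i] = $i + 1;
}
$fixedBytes = memory_get_usage() - $before;

printf("array: %d bytes, SplFixedArray: %d bytes\n", $plainBytes, $fixedBytes);
```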

Speed

Here is a simple benchmark

set_time_limit(0);
echo "<pre>";

$total = 10000;
$item = array("milk","butter","bread");
$items = array();

// Generating Random Data
for($i = 0; $i < $total; $i ++) {
    $att = array('id' => $i);
    $items[] = array('@attributes' => $att,'item' => $item[$i % 3],'price' => mt_rand(100, 5000) / 100);
}
// Pure array no copy
function m1($array) {
    foreach ( $array as $k => $v ) {
        isset($v['@attributes']) and $array[$k]['id'] = $v['@attributes']['id'];
        unset($array[$k]['@attributes']);
    }
    return $array;
}

// Array clean copy
function m2($array) {
    $items = array();
    foreach ( $array as $k => $v ) {
        isset($v['@attributes']) and $items[$k]['id'] = $v['@attributes']['id'];
        $items[$k]['item'] = $v['item'];
        $items[$k]['price'] = $v['price'];
    }
    return $items;
}

// Array Iterator
function m3($array) {
    $it = new ArrayIterator($array);
    $items = array();
    foreach ( $it as $k => $v ) {
        isset($v['@attributes']) and $items[$k]['id'] = $v['@attributes']['id'];
        $items[$k]['item'] = $v['item'];
        $items[$k]['price'] = $v['price'];
    }
    return $items;
}

// SplFixedArray Array
function m4($array) {
    $it = SplFixedArray::fromArray($array);
    $items = array();
    foreach ( $it as $k => $v ) {
        isset($v['@attributes']) and $items[$k]['id'] = $v['@attributes']['id'];
        $items[$k]['item'] = $v['item'];
        $items[$k]['price'] = $v['price'];
    }
    return $items;
}

// Array Map
function m5($array) {
    $items = array_map(function ($v) {
        isset($v['@attributes']) and $v['id'] = $v['@attributes']['id'];
        unset($v['@attributes']);
        return $v;
    }, $array);
    return $items;
}

// Array Walk
function m6($array) {
    array_walk($array, function (&$v, $k) {
        isset($v['@attributes']) and $v['id'] = $v['@attributes']['id'];
        unset($v['@attributes']);
        return $v;
    });
    return $array;
}

$result = array('m1' => 0,'m2' => 0,'m3' => 0,'m4' => 0,'m5' => 0,'m6' => 0);

for($i = 0; $i < 1; ++ $i) {
    foreach ( array_keys($result) as $key ) {
        $alpha = microtime(true);
        $key($items);
        $result[$key] += microtime(true) - $alpha;
    }
}

echo '<pre>';
echo "Single Run\n";
print_r($result);
echo '</pre>';

$result = array('m1' => 0,'m2' => 0,'m3' => 0,'m4' => 0,'m5' => 0,'m6' => 0);

for($i = 0; $i < 2; ++ $i) {
    foreach ( array_keys($result) as $key ) {
        $alpha = microtime(true);
        $key($items);
        $result[$key] += microtime(true) - $alpha;
    }
}

echo '<pre>';
echo "Dual Run\n";
print_r($result);
echo '</pre>';

The results are interesting:

PHP 5.3.10

Single Run
Array
(
    [m1] => 0.029280185699463 <--------------- fastest
    [m2] => 0.038463115692139
    [m3] => 0.049274921417236
    [m4] => 0.03856086730957
    [m5] => 0.032699823379517
    [m6] => 0.032186985015869
)

Dual Run
Array
(
    [m1] => 0.068470001220703
    [m2] => 0.077174663543701
    [m3] => 0.085768938064575
    [m4] => 0.07695198059082
    [m5] => 0.073209047317505
    [m6] => 0.065080165863037 <--------------- fastest over 2 runs
)

PHP 5.4.1

Single Run
Array
(
    [m1] => 0.029529094696045
    [m2] => 0.035377979278564
    [m3] => 0.03830099105835
    [m4] => 0.034613132476807
    [m5] => 0.031363010406494
    [m6] => 0.028403043746948  <---------- fastest
)

Dual Run
Array
(
    [m1] => 0.072367191314697
    [m2] => 0.071731090545654
    [m3] => 0.078131914138794
    [m4] => 0.075049877166748
    [m5] => 0.065959930419922
    [m6] => 0.060923099517822  <---------- Fastest
)
    
I want to test this, but codepad doesn't have SplFixedArray. Could you run this test and report the results please? – Asad Saeeduddin Oct 25 '12 at 21:47
    
It's not new; see nikic.github.com/2011/12/12/… – Baba Oct 25 '12 at 21:47
    
I see. Turns out it is almost twice as fast as the other answer codepad.viper-7.com/mvPm7w – Asad Saeeduddin Oct 25 '12 at 21:54
    
You are looking at speed alone. What about the memory implications? It performs better in both :) – Baba Oct 25 '12 at 21:57
Very good answer! – Madara Uchiha Oct 26 '12 at 4:20

That looks like it's coming from XML, so I would add that @attributes may contain more than just the ID. Assuming that won't happen, you could try using a foreach instead, though I'm not sure about speed gains.

There may be an impact because you are modifying the same array you are looping over (I can't find evidence for this though, so an experiment is required).

$cleanedArray = array();
foreach($bigArray as $subArray)
{
  if(isset($subArray['@attributes']))
  {
    $subArray['itemid'] = $subArray['@attributes']['id'];
    unset($subArray['@attributes']); //Optional
    $cleanedArray[] = $subArray;
  }
}

Apologies if that ends up slower

Edit: Missing index added
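If @attributes does turn out to hold more than the ID, one possible variation is to merge every attribute into the element with the array union operator. This is only a sketch: the 'lang' attribute is made up for illustration, the ID lands under 'id' rather than 'itemid', and, unlike the loop above, elements without @attributes are kept rather than skipped.

```php
<?php
// Hypothetical records; the 'lang' attribute is invented for illustration.
$bigArray = array(
    array('@attributes' => array('id' => '101', 'lang' => 'en'), 'item' => 'milk', 'price' => '3.50'),
    array('item' => 'butter', 'price' => '2.45'), // no attributes at all
);

$cleanedArray = array();
foreach ($bigArray as $subArray) {
    if (isset($subArray['@attributes'])) {
        // Array union: keys already present in $subArray win over attribute keys.
        $subArray = $subArray + $subArray['@attributes'];
        unset($subArray['@attributes']);
    }
    $cleanedArray[] = $subArray;
}

print_r($cleanedArray);
```

Because the union keeps the left-hand value on a key clash, an attribute named 'item' or 'price' would not overwrite the element's own data.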

codepad.org/A8E9EfvR, seems faster by a pretty consistent ratio of 9 to 4, from my crude test – Asad Saeeduddin Oct 25 '12 at 21:41
    
Seems like it's over twice as fast? I wasn't expecting that to be honest.. but good news! – Martin Lyne Oct 25 '12 at 21:42
    
Also thanks for that codepad link, never seen that before, looks useful. – Martin Lyne Oct 25 '12 at 21:43
    
Yes - the source data is XML - good eye. The only thing stored as an attribute is that ID number. I'll have to try this and other routines on some live data to see how they spec out - I'll post results. – Emo Mosley Oct 25 '12 at 23:28
BTW, your algorithm basically works, but you need $subArray['@attributes']['id']. I tested my original and this one (no unsetting for either one), and I'm seeing this one to be much slower. – Emo Mosley Oct 26 '12 at 1:58

This isn't an answer so much as it is a comparison of the approaches provided:

I used this script to average out the times the algorithms took:

<?php
//base data
$attrib1 = array('id'=>'101');
$item1 = array('@attributes'=>$attrib1, 'item'=>'milk', 'price'=>'3.50');
$attrib2 = array('id'=>'102');
$item2 = array('@attributes'=>$attrib2, 'item'=>'butter', 'price'=>'2.45');
$attrib3 = array('id'=>'103');
$item3 = array('@attributes'=>$attrib3, 'item'=>'bread', 'price'=>'1.19');
$results = array('test1'=>array(),'test2'=>array(),'test3'=>array());

//set trials
$trials=1000;

//test 1
for($count=0;$count<$trials;$count++){
    unset($items);
    $items = array($item1, $item2, $item3);
    $timer1=microtime(true); // true returns a float, so the subtraction below is valid
    $i=0;
    while(isset($items[$i]['@attributes']['id'])) {
        $items[$i]['itemid'] = $items[$i]['@attributes']['id'];
        $i++;
    }
    $timer1=microtime(true)-$timer1;
    $results['test1'][$count]=$timer1;
}

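//test 2
for($count=0;$count<$trials;$count++){
    unset($items);
    unset($cleanedArray);
    $items = array($item1, $item2, $item3);
    $cleanedArray = array();
    $timer2=microtime(true); // true returns a float, so the subtraction below is valid
    foreach($items as $subArray)
    {
        if(isset($subArray['@attributes']))
        {
            // copy the ID up a level, as test 1 does, so the tests do comparable work
            $subArray['itemid'] = $subArray['@attributes']['id'];
            unset($subArray['@attributes']);
            $cleanedArray[] = $subArray;
        }
    }
    $timer2=microtime(true)-$timer2;
    $results['test2'][$count]=$timer2;
}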

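//test 3
for($count=0;$count<$trials;$count++){
    unset($items);
    unset($it);
    $items = array($item1, $item2, $item3);
    $it = SplFixedArray::fromArray($items);
    $cleanedArray = array(); // start from an empty result on every trial
    $timer3=microtime(true); // true returns a float, so the subtraction below is valid
    foreach($it as $subArray)
    {
        if(isset($subArray['@attributes']))
        {
            // copy the ID up a level, as test 1 does, so the tests do comparable work
            $subArray['itemid'] = $subArray['@attributes']['id'];
            unset($subArray['@attributes']);
            $cleanedArray[] = $subArray;
        }
    }
    $timer3=microtime(true)-$timer3;
    $results['test3'][$count]=$timer3;
}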

//results
$factor=pow(10,-6);
echo "Test 1 averaged " . round(array_sum($results['test1']) / count($results['test1'])/$factor,1) . " µs, with range: " . round((max($results['test1'])-min($results['test1']))/$factor,1) . " µs - (min: " . (min($results['test1'])/$factor) . ", max: " . (max($results['test1'])/$factor) . ")<br/>";

echo "Test 2 averaged " . round(array_sum($results['test2']) / count($results['test2'])/$factor,1) . " µs, with range: " . round((max($results['test2'])-min($results['test2']))/$factor,1) . " µs - (min: " . (min($results['test2'])/$factor) . ", max: " . (max($results['test2'])/$factor) . ")<br/>";

echo "Test 3 averaged " . round(array_sum($results['test3']) / count($results['test3'])/$factor,1) . " µs, with range: " . round((max($results['test3'])-min($results['test3']))/$factor,1) . " µs - (min: " . (min($results['test3'])/$factor) . ", max: " . (max($results['test3'])/$factor) . ")<br/>";

echo "<pre>";
var_dump($results);
echo "</pre>";

The results here are extremely variable at low numbers of trials, but the differences between the approaches should become more pronounced with a larger base array and a larger number of trials.
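Since individual timings at this scale are noisy, a median can be a steadier summary than the mean, which a single slow trial can pull upward. The helper below is a hypothetical addition, not part of the script above, and the timings are made-up values for illustration; it assumes a non-empty list.

```php
<?php
// Hypothetical helper: median of a non-empty list of timings (in seconds).
function median(array $timings) {
    sort($timings);
    $n = count($timings);
    $mid = (int) floor($n / 2);
    // Odd count: middle value. Even count: average of the two middle values.
    return ($n % 2) ? $timings[$mid] : ($timings[$mid - 1] + $timings[$mid]) / 2;
}

// Made-up timings with one outlier, for illustration only.
$timings = array(0.000004, 0.000005, 0.000004, 0.000250, 0.000005);

printf("mean:   %.1f µs\n", array_sum($timings) / count($timings) * 1e6);
printf("median: %.1f µs\n", median($timings) * 1e6);
```

With the script above, this could be applied as median($results['test1']) and so on for each test.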

    
My concern is not unsetting the unused data - I placed that code in my example to clean up the output for clarity in this posting. My main focus is to quickly consolidate the ID in with the other associative data in the array. – Emo Mosley Oct 26 '12 at 1:11
    
@EmoMosley The repeated unsets are to make the tests fair. They are not timed. The gist of this code is to use each approach a large number of times and average the results. – Asad Saeeduddin Oct 26 '12 at 3:19
