Performance of using objects with scripting dictionary

Question

The code below is a working excerpt from a larger application that I am re-factoring. My question focuses on the performance aspect of using (or not using) objects in this case.

I have a medium-ish size data set on a worksheet (6500 rows by 12 columns) and am building an application to merge (mostly) new data records with the existing data.

My overall approach is to:

1. Ingest the new data into a single object. The new dataset is held internally in both an array and a dictionary.
2. Ingest the existing data into a single object. The existing data is held internally in both an array and a dictionary.
3. Determine if a record from the new dataset exists in the existing dataset by creating a "key" and using it to search the dictionary. This unique key is created by concatenating the first four fields of a data record.
4. If the data record does not exist in the existing dataset, add it to the existing data object (to both the dictionary and the array).

So far, so good. I have working code that accomplishes these steps, though without the architecture of classes/objects and good functional design.

My question concerns performance issues I'm seeing in Step 2. (A fully functional example is listed at the bottom of this post.)

I'm seeing wildly different execution performance times when using objects and the dictionary as opposed to when I side-step the objects completely. Searches online for questions related to object performance turned up this reference, but nothing directly related to my experience.

Giving you the answers up front, my performance is measured as:

Array Used is (1000 by 12)
set up array                                = 7.046 ms
populate dict with    objects               = 4775.396 ms
populate dict without objects (store key)   = 11.222 ms
populate dict without objects (store array) = 7502.135 ms

Clearly, when I'm using my full dataset at 6500+ rows, my execution time grows to an unacceptable level.

I would appreciate comments and feedback on the performance of the methods in the example. Beyond that, if you have any guidance for more efficiently merging the datasets I'm very open to that as well.

To run my example below, create two classes. The first timer class CTimer uses the code found in this post. The second simple class is called LaborRecord:

Option Explicit

'------------ Class Public Variables --------------------------------------
'
Public projectID As String
Public projectName As String
Public resource As String
Public laborDate As String
Public laborHours As String
Public ftePerLaborMonth As String
Public laborMonth As String
Public laborLevel As String
Public laborCost As String
Public laborBurdened As String
Public sprintNumber As String
Public ftePerSprint As String
Public duplicate As Boolean

'------------ Class Private Variables -------------------------------------
'
Private Type TRecord
    key As String
End Type
Private this As TRecord

Private Const PROJID_COL = 1
Private Const PROJNAME_COL = 2
Private Const RESOURCE_COL = 3
Private Const LABORDATE_COL = 4
Private Const LABORHOURS_COL = 5
Private Const FTEMONTH_COL = 6
Private Const LABORMONTH_COL = 7
Private Const LABORLEVEL_COL = 8
Private Const LABORCOST_COL = 9
Private Const LABORBURD_COL = 10
Private Const SPRINTNUM_COL = 11
Private Const FTESPRINT_COL = 12


'------------ Class Properties --------------------------------------------
'
Public Property Get key() As String
    key = this.key
End Property

Public Property Let Record(arrayRow As Variant)
    '--- expects a single row of the labor data array, then sets all
    '    internal values using this data
    projectID = arrayRow(PROJID_COL)
    projectName = arrayRow(PROJNAME_COL)
    resource = arrayRow(RESOURCE_COL)
    laborDate = arrayRow(LABORDATE_COL)
    laborHours = arrayRow(LABORHOURS_COL)
    ftePerLaborMonth = arrayRow(FTEMONTH_COL)
    laborMonth = arrayRow(LABORMONTH_COL)
    laborLevel = arrayRow(LABORLEVEL_COL)
    laborCost = arrayRow(LABORCOST_COL)
    laborBurdened = arrayRow(LABORBURD_COL)
    sprintNumber = arrayRow(SPRINTNUM_COL)
    ftePerSprint = arrayRow(FTESPRINT_COL)
    this.key = projectID & projectName & resource & laborDate
End Property

'------------ Class Public Methods ----------------------------------------
'

'------------ Class Constructor and Destructor ----------------------------
'
Private Sub Class_Initialize()
    duplicate = False
End Sub

Private Sub class_terminate()
End Sub

The full test module is as follows:

Option Explicit

Sub setup()
    Dim data As Variant
    Dim i As Long, j As Long
    Const NUM_ROWS = 1000
    Const NUM_COLS = 12
    Dim myTimer As CTimer
    Set myTimer = New CTimer

    '--- init the array
    ReDim data(1 To NUM_ROWS, 1 To NUM_COLS)
    myTimer.StartCounter
    For i = 1 To UBound(data, 1)
        For j = 1 To UBound(data, 2)
            data(i, j) = i & "+" & j
        Next j
    Next i
    Dim arraySetupTime As Double
    arraySetupTime = myTimer.TimeElapsed

    Dim dictWithObjTime As Double
    DictWithObjects data
    dictWithObjTime = myTimer.TimeElapsed - arraySetupTime

    Dim dictWithoutObjTimeKey As Double
    DictWithoutObjectsKey data
    dictWithoutObjTimeKey = myTimer.TimeElapsed - dictWithObjTime

    Dim dictWithoutObjTimeArr As Double
    DictWithoutObjectsArr data
    dictWithoutObjTimeArr = myTimer.TimeElapsed - dictWithoutObjTimeKey

    Debug.Print "Array Used is (" & NUM_ROWS & " by " & NUM_COLS & ")"
    Debug.Print "set up array                                = " & Format(arraySetupTime, "0.000 ms")
    Debug.Print "populate dict with    objects               = " & Format(dictWithObjTime, "0.000 ms")
    Debug.Print "populate dict without objects (store key)   = " & Format(dictWithoutObjTimeKey, "0.000 ms")
    Debug.Print "populate dict without objects (store array) = " & Format(dictWithoutObjTimeArr, "0.000 ms")
End Sub

Sub DictWithObjects(ByRef arr As Variant)
    Dim i As Long
    Dim duplicatesFound As Boolean
    Dim thisRecord As LaborRecord
    Dim thisDict As Dictionary
    Set thisDict = New Dictionary
    duplicatesFound = False

    For i = LBound(arr, 1) To UBound(arr, 1)
        Set thisRecord = New LaborRecord
        thisRecord.Record = Application.WorksheetFunction.Index(arr, i, 0)
        If Not thisDict.Exists(thisRecord.key) Then
            thisDict.Add thisRecord.key, thisRecord  'the Item is the object
        Else
            duplicatesFound = True
        End If
    Next i
End Sub

Sub DictWithoutObjectsKey(ByRef arr As Variant)
    Dim i As Long
    Dim duplicatesFound As Boolean
    Dim thisRecord As LaborRecord
    Dim thisDict As Dictionary
    Dim key As String
    Set thisDict = New Dictionary
    duplicatesFound = False

    Dim projectID As String
    Dim projectName As String
    Dim resource As String
    Dim laborDate As String
    Dim laborHours As String
    Dim ftePerLaborMonth As String
    Dim laborMonth As String
    Dim laborLevel As String
    Dim laborCost As String
    Dim laborBurdened As String
    Dim sprintNumber As String
    Dim ftePerSprint As String


    For i = LBound(arr, 1) To UBound(arr, 1)
        projectID = arr(i, 1)
        projectName = arr(i, 2)
        resource = arr(i, 3)
        laborDate = arr(i, 4)
        laborHours = arr(i, 5)
        ftePerLaborMonth = arr(i, 6)
        laborMonth = arr(i, 7)
        laborLevel = arr(i, 8)
        laborCost = arr(i, 9)
        laborBurdened = arr(i, 10)
        sprintNumber = arr(i, 11)
        ftePerSprint = arr(i, 12)
        key = projectID & projectName & resource & laborDate
        If Not thisDict.Exists(key) Then
            thisDict.Add key, key          'the Item is just the key
        Else
            duplicatesFound = True
        End If
    Next i
End Sub

Sub DictWithoutObjectsArr(ByRef arr As Variant)
    Dim i As Long
    Dim duplicatesFound As Boolean
    Dim thisRecord As LaborRecord
    Dim thisDict As Dictionary
    Dim key As String
    Set thisDict = New Dictionary
    duplicatesFound = False

    Dim projectID As String
    Dim projectName As String
    Dim resource As String
    Dim laborDate As String
    Dim laborHours As String
    Dim ftePerLaborMonth As String
    Dim laborMonth As String
    Dim laborLevel As String
    Dim laborCost As String
    Dim laborBurdened As String
    Dim sprintNumber As String
    Dim ftePerSprint As String


    For i = LBound(arr, 1) To UBound(arr, 1)
        projectID = arr(i, 1)
        projectName = arr(i, 2)
        resource = arr(i, 3)
        laborDate = arr(i, 4)
        laborHours = arr(i, 5)
        ftePerLaborMonth = arr(i, 6)
        laborMonth = arr(i, 7)
        laborLevel = arr(i, 8)
        laborCost = arr(i, 9)
        laborBurdened = arr(i, 10)
        sprintNumber = arr(i, 11)
        ftePerSprint = arr(i, 12)
        key = projectID & projectName & resource & laborDate
        If Not thisDict.Exists(key) Then
            thisDict.Add key, arr          'the Item is the array slice
        Else
            duplicatesFound = True
        End If
    Next i
End Sub

Are you actually using the generated objects (or arrays for that matter) for anything elsewhere in your code? I guess I'm a little confused as to why you're including a benchmark for storing a {key, key} combination. — Comintern, Sep 12 '16 at 23:42
I previously used data directly from Cells within the Range to process the merge. While this didn't take overly long (at first), as the dataset grows the time increases. My intent was to ingest both the new and existing data into arrays to speed the merge, then copy the updated dataset back to the sheet. Using an intermediate object may help as a mechanism, but is not strictly necessary. This performance issue came up because of the original intent to refactor and because I was exploring different paths. I was quite surprised at the speed difference and measured it, then asked my question. — PeterT, Sep 13 '16 at 1:45
I have rolled back the last edit. Please see what you may and may not do after receiving answers. — Vogel612, Sep 13 '16 at 15:27
Thanks for the roll-back. Converted to a secondary answer below. — PeterT, Sep 13 '16 at 15:37

Community · Accepted Answer · 2017-05-23 12:40:52Z

As far as the relative performance goes:

When you instantiate an object and add it to the Dictionary in DictWithObjects, the only thing that actually gets stored in the Dictionary is a reference to the object. This is basically just a pointer. So in your "populate dict with objects" case, you are making a copy of the object reference and storing that. In this case, you are adding the overhead of instantiating the object and initializing it. However, this case also does something that none of the other test cases do - it uses Application.WorksheetFunction.Index to slice the input array. This alone invalidates your benchmarks because none of the other subs do this.

In your "populate dict without objects (store array)" case, you are actually cramming a significant amount of data into the Dictionary. Not only that, but in this line...

thisDict.Add key, arr          'the Item is the array slice

...you are not only storing the entire array in the Dictionary, you're making a copy of the entire array on each call to Add and storing that. I would guess that a very large part of the poor performance of DictWithoutObjectsArr is due solely to this bug in the benchmarking code. That said, adding the missing call to Application.WorksheetFunction.Index(arr, i, 0) makes the performance of this method even worse: when I add the array slicing, the large memory copies turn out to be cheaper than an array slice plus a much smaller memory copy.
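The difference between storing an object and storing an array can be checked with a small experiment. The following is only a sketch: the Sub name and keys are made up for illustration, a Collection stands in for LaborRecord so the code fits in one standard module, and late binding is used so no reference to the Scripting Runtime is required.

```vba
Option Explicit

Sub ShowReferenceVersusCopy()
    ' Late binding: no reference to Microsoft Scripting Runtime is needed.
    Dim dict As Object
    Set dict = CreateObject("Scripting.Dictionary")

    ' Case 1: the Item is an object (a Collection stands in for LaborRecord).
    Dim rec As Collection
    Set rec = New Collection
    dict.Add "objKey", rec
    rec.Add "added after the Dictionary.Add call"

    ' The Dictionary holds a reference to the same object,
    ' so the element added afterwards is visible through it.
    Dim storedObj As Object
    Set storedObj = dict.Item("objKey")
    Debug.Print "object item count: " & storedObj.Count

    ' Case 2: the Item is an array. The Dictionary gets its own copy.
    Dim arr As Variant
    ReDim arr(1 To 3)
    arr(1) = "original"
    dict.Add "arrKey", arr
    arr(1) = "changed after the Dictionary.Add call"

    ' The stored copy was taken at Add time and still holds the old value.
    Dim storedArr As Variant
    storedArr = dict.Item("arrKey")
    Debug.Print "array item value:  " & storedArr(1)
End Sub
```

If the object count reflects the later change while the array value does not, that matches the description above: one small reference per object, but a full copy per array.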

In your final case, storing the unique row identifier as both the key and the value, you are simply storing a String. This case is sacrificing long-term performance in exchange for not collecting the same information that the other cases do. The trade-off will come later in a finished product when it is time to get information back out. Obviously, that isn't possible (since you only have a key-key combination), and you aren't benchmarking everything that is relevant to the overall performance. Assuming that you need to get the row information at some point, this benchmark should either be ignored or taken only as a proxy estimate of how long it takes to add a generic key-value pair to a Dictionary. Otherwise, at the very least the value should be the row; in that case you can get some useful information out of it later.

In short, the above makes the benchmarking results almost entirely meaningless.
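Separately from the points above, the timing arithmetic in setup is worth a second look. Assuming CTimer.TimeElapsed reports the milliseconds since the single StartCounter call (the linked class is not shown, so this is an assumption), every reading is cumulative. Subtracting only the previous segment's duration, as the third and fourth measurements do, would then leave earlier segments in the reported figure. A minimal sketch of timing each section from its own checkpoint follows. It uses the built-in Timer function in place of CTimer, and BusyWork is a made-up placeholder workload.

```vba
Option Explicit

Sub TimeSectionsIndependently()
    ' Timer returns seconds since midnight. It is coarse and wraps at
    ' midnight, so this only illustrates the bookkeeping pattern.
    Dim checkpoint As Double
    Dim firstMs As Double
    Dim secondMs As Double

    checkpoint = Timer                     ' fresh checkpoint for section 1
    BusyWork 1000000
    firstMs = (Timer - checkpoint) * 1000

    checkpoint = Timer                     ' fresh checkpoint for section 2
    BusyWork 2000000
    secondMs = (Timer - checkpoint) * 1000

    Debug.Print "first section  = " & Format(firstMs, "0.000") & " ms"
    Debug.Print "second section = " & Format(secondMs, "0.000") & " ms"
End Sub

' Placeholder workload (hypothetical); stands in for the DictWith... subs.
Private Sub BusyWork(ByVal iterations As Long)
    Dim i As Long
    Dim total As Double
    For i = 1 To iterations
        total = total + Sqr(i)
    Next i
End Sub
```

With CTimer, the equivalent would presumably be to call StartCounter before each section, or to keep a running total of the time already accounted for and subtract that from each new reading.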

A couple other things:

Your object shouldn't be exposing all of its members as public class variables. These should be encapsulated into properties. As currently written, the class is little more than a heavy-weight Type with an initializer property (more on this below).
The objects should be responsible for calculating their own keys - not the calling code.
Your "constructor" seems really misplaced in the write-only property Record. Better would be to simply make it a Sub. Best would be to add the VB_PredeclaredId attribute and create an object factory. However, taking an array as a parameter seems suspect. If the class is intended to mirror what you would find on a row, a Range parameter would probably be clearer and more performant than slicing a single dimension out of a 2D array. If you're going to pull your data range into an array (which makes sense), it's probably much faster to just set the individual properties. If you aren't pulling the dataset into an array, just pass it a Range. There's nothing that says you can only have one class factory. In fact, in a comparison benchmark, setting properties directly cut the object creation overhead by 75% compared with slicing and passing an array using WorksheetFunction.Index.
The line duplicate = False in Class_Initialize raises red flags about the class design also. It's hard to tell what it is there for (due mainly to the stripped down version of the code submitted), but regardless, it probably shouldn't be there. An individual object shouldn't have any concept of being a "duplicate" outside of possibly an Equals(other As LaborRecord) utility function. If your class structure is designed properly, a LaborRecord shouldn't know or care about other LaborRecords.
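As an illustration of the "make it a Sub and skip the slice" suggestion, here is a sketch of a method that could be added to the LaborRecord class module from the question. The name LoadFromRow is hypothetical. The fragment relies on the public fields, the column constants, and the private this.key member already declared in that class, so it is not a standalone module.

```vba
' --- inside the LaborRecord class module (hypothetical addition) ---
' Reads one row straight from the 2D array, so the caller never has to
' slice it with Application.WorksheetFunction.Index.
Public Sub LoadFromRow(ByRef data As Variant, ByVal rowIndex As Long)
    projectID = data(rowIndex, PROJID_COL)
    projectName = data(rowIndex, PROJNAME_COL)
    resource = data(rowIndex, RESOURCE_COL)
    laborDate = data(rowIndex, LABORDATE_COL)
    laborHours = data(rowIndex, LABORHOURS_COL)
    ftePerLaborMonth = data(rowIndex, FTEMONTH_COL)
    laborMonth = data(rowIndex, LABORMONTH_COL)
    laborLevel = data(rowIndex, LABORLEVEL_COL)
    laborCost = data(rowIndex, LABORCOST_COL)
    laborBurdened = data(rowIndex, LABORBURD_COL)
    sprintNumber = data(rowIndex, SPRINTNUM_COL)
    ftePerSprint = data(rowIndex, FTESPRINT_COL)
    ' The object still computes its own key, as before.
    this.key = projectID & projectName & resource & laborDate
End Sub

' --- in the calling loop, replacing the Record assignment ---
'     Set thisRecord = New LaborRecord
'     thisRecord.LoadFromRow arr, i
```

Passing the whole array ByRef does not copy it, so the per-row cost is limited to the twelve assignments and the key concatenation.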

You mention that dictionaries basically store pointers; do collections behave the same way? — Raystafarian, Sep 13 '16 at 10:02
@Raystafarian - "Pointer" is a bit of a simplification, but yes - collections and arrays will also just hold a reference to the object. — Comintern, Sep 13 '16 at 12:26
And that is different than taking values from a range object and storing them in a dictionary because the values will be used rather than "pointers" to the original object? — Raystafarian, Sep 13 '16 at 14:29
@Raystafarian - It's different because you only have one copy of the data that the class holds, and you only have to instantiate the object once. Arrays and structs don't have control of when they go out of scope, so Collections and Dictionaries copy them. They can't just store a reference because the scope of the container might (actually will most likely) outlive the scope of whatever the reference is pointing to. — Comintern, Sep 13 '16 at 14:33

PeterT · Accepted Answer · 2016-09-13 15:38:26Z

Thank you to @Comintern and @Raystafarian for their discussion, because that focused on exactly my question and confusion. I initially believed my benchmark code was valid at the function level because my intent was to measure the differences between the overall methodologies (including whether or not to use Application.WorksheetFunction.Index). My error was in the assumption of exactly "what" is stored as the Item in a Dictionary. To correct my mistake in storing the entire array instead of the intended slice, I present the updated benchmarks and code below. Clearly, taking the array slice consumes more time than I believed, and I was not aware of when an object "pointer" is stored versus the entire array. (As an aside, the LaborRecord was exactly intended as a convenient heavyweight Type, so the exposure of all members as Public was deliberate. The duplicate variable is necessary when performing the merge function, but shouldn't have been kept in the example.)

Ultimately, I will be storing just the index (row number) in the dictionary to provide an access to the data fields.
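A rough sketch of that row-index approach is below. It is an illustration only, not the final application code: the procedure names, the sample values, and the four-column test arrays are invented, and late binding is used so the module runs without a reference to the Scripting Runtime. The key is built the same way as in LaborRecord, by concatenating the first four fields.

```vba
Option Explicit

' Builds a Dictionary of key -> row number, reading the array directly.
Function BuildRowIndex(ByRef data As Variant) As Object
    Dim rowByKey As Object
    Set rowByKey = CreateObject("Scripting.Dictionary")
    Dim i As Long
    Dim key As String
    For i = LBound(data, 1) To UBound(data, 1)
        key = RowKey(data, i)
        If Not rowByKey.Exists(key) Then rowByKey.Add key, i
    Next i
    Set BuildRowIndex = rowByKey
End Function

' Same key recipe as LaborRecord: the first four fields concatenated.
Function RowKey(ByRef data As Variant, ByVal rowIndex As Long) As String
    RowKey = data(rowIndex, 1) & data(rowIndex, 2) & _
             data(rowIndex, 3) & data(rowIndex, 4)
End Function

Sub DemoRowIndex()
    Dim existing As Variant
    Dim incoming As Variant
    ReDim existing(1 To 2, 1 To 4)
    ReDim incoming(1 To 2, 1 To 4)
    FillRow existing, 1, "P1", "Alpha", "Ann", "2016-09-01"
    FillRow existing, 2, "P2", "Beta", "Bob", "2016-09-01"
    FillRow incoming, 1, "P2", "Beta", "Bob", "2016-09-01"   ' already present
    FillRow incoming, 2, "P3", "Gamma", "Cy", "2016-09-02"   ' new record

    Dim rowByKey As Object
    Set rowByKey = BuildRowIndex(existing)

    Dim i As Long
    Dim key As String
    For i = LBound(incoming, 1) To UBound(incoming, 1)
        key = RowKey(incoming, i)
        If rowByKey.Exists(key) Then
            Debug.Print "incoming row " & i & " matches existing row " & rowByKey.Item(key)
        Else
            Debug.Print "incoming row " & i & " is new"
        End If
    Next i
End Sub

' Helper for the demo data only.
Private Sub FillRow(ByRef data As Variant, ByVal r As Long, _
                    ByVal a As String, ByVal b As String, _
                    ByVal c As String, ByVal d As String)
    data(r, 1) = a: data(r, 2) = b: data(r, 3) = c: data(r, 4) = d
End Sub
```

Because the Item is a Long, nothing is sliced or copied per row, and the remaining fields stay reachable through the array at the stored row number.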

Updated benchmark results:

Array Used is (1000 by 12)
set up array                                = 8.523 ms
populate dict with    objects               = 4477.020 ms
populate dict without objects (store key)   = 4386.910 ms
populate dict without objects (store array) = 8821.639 ms

Updated code:

Option Explicit

Sub setup()
    Dim data As Variant
    Dim i As Long, j As Long
    Const NUM_ROWS = 1000
    Const NUM_COLS = 12
    Dim myTimer As CTimer
    Set myTimer = New CTimer

    '--- init the array
    ReDim data(1 To NUM_ROWS, 1 To NUM_COLS)
    myTimer.StartCounter
    For i = 1 To UBound(data, 1)
        For j = 1 To UBound(data, 2)
            data(i, j) = i & "+" & j
        Next j
    Next i
    Dim arraySetupTime As Double
    arraySetupTime = myTimer.TimeElapsed

    Dim dictWithObjTime As Double
    DictWithObjects data
    dictWithObjTime = myTimer.TimeElapsed - arraySetupTime

    Dim dictWithoutObjTimeKey As Double
    DictWithoutObjectsKey data
    dictWithoutObjTimeKey = myTimer.TimeElapsed - dictWithObjTime

    Dim dictWithoutObjTimeArr As Double
    DictWithoutObjectsArr data
    dictWithoutObjTimeArr = myTimer.TimeElapsed - dictWithoutObjTimeKey

    Debug.Print "Array Used is (" & NUM_ROWS & " by " & NUM_COLS & ")"
    Debug.Print "set up array                                = " & Format(arraySetupTime, "0.000 ms")
    Debug.Print "populate dict with    objects               = " & Format(dictWithObjTime, "0.000 ms")
    Debug.Print "populate dict without objects (store key)   = " & Format(dictWithoutObjTimeKey, "0.000 ms")
    Debug.Print "populate dict without objects (store array) = " & Format(dictWithoutObjTimeArr, "0.000 ms")
End Sub

Sub DictWithObjects(ByRef arr As Variant)
    Dim i As Long
    Dim thisRecord As LaborRecord
    Dim thisDict As Dictionary
    Set thisDict = New Dictionary

    For i = LBound(arr, 1) To UBound(arr, 1)
        Set thisRecord = New LaborRecord
        thisRecord.Record = Application.WorksheetFunction.Index(arr, i, 0)
        If Not thisDict.Exists(thisRecord.key) Then
            thisDict.Add thisRecord.key, thisRecord  'the Item is the object
        End If
    Next i
End Sub

Sub DictWithoutObjectsKey(ByRef arr As Variant)
    Dim i As Long
    Dim thisRecord As LaborRecord
    Dim thisDict As Dictionary
    Dim key As String
    Set thisDict = New Dictionary

    Dim projectID As String
    Dim projectName As String
    Dim resource As String
    Dim laborDate As String
    Dim laborHours As String
    Dim ftePerLaborMonth As String
    Dim laborMonth As String
    Dim laborLevel As String
    Dim laborCost As String
    Dim laborBurdened As String
    Dim sprintNumber As String
    Dim ftePerSprint As String
    Dim arrSlice As Variant

    For i = LBound(arr, 1) To UBound(arr, 1)
        arrSlice = Application.WorksheetFunction.Index(arr, i, 0)
        projectID = arrSlice(1)
        projectName = arrSlice(2)
        resource = arrSlice(3)
        laborDate = arrSlice(4)
        laborHours = arrSlice(5)
        ftePerLaborMonth = arrSlice(6)
        laborMonth = arrSlice(7)
        laborLevel = arrSlice(8)
        laborCost = arrSlice(9)
        laborBurdened = arrSlice(10)
        sprintNumber = arrSlice(11)
        ftePerSprint = arrSlice(12)
        key = projectID & projectName & resource & laborDate
        If Not thisDict.Exists(key) Then
            thisDict.Add key, key          'the Item is just the key
        End If
    Next i
End Sub

Sub DictWithoutObjectsArr(ByRef arr As Variant)
    Dim i As Long
    Dim thisRecord As LaborRecord
    Dim thisDict As Dictionary
    Dim key As String
    Set thisDict = New Dictionary

    Dim projectID As String
    Dim projectName As String
    Dim resource As String
    Dim laborDate As String
    Dim laborHours As String
    Dim ftePerLaborMonth As String
    Dim laborMonth As String
    Dim laborLevel As String
    Dim laborCost As String
    Dim laborBurdened As String
    Dim sprintNumber As String
    Dim ftePerSprint As String
    Dim arrSlice As Variant

    For i = LBound(arr, 1) To UBound(arr, 1)
        arrSlice = Application.WorksheetFunction.Index(arr, i, 0)
        projectID = arrSlice(1)
        projectName = arrSlice(2)
        resource = arrSlice(3)
        laborDate = arrSlice(4)
        laborHours = arrSlice(5)
        ftePerLaborMonth = arrSlice(6)
        laborMonth = arrSlice(7)
        laborLevel = arrSlice(8)
        laborCost = arrSlice(9)
        laborBurdened = arrSlice(10)
        sprintNumber = arrSlice(11)
        ftePerSprint = arrSlice(12)
        key = projectID & projectName & resource & laborDate
        If Not thisDict.Exists(key) Then
            thisDict.Add key, arrSlice       'the Item is the array slice
        End If
    Next i
End Sub
