The code below is a working excerpt from a larger application that I am re-factoring. My question focuses on the performance aspect of using (or not using) objects in this case.
I have a medium-ish size data set on a worksheet (6500 rows by 12 columns) and am building an application to merge (mostly) new data records with the existing data.
My overall approach is to:
- Ingest the new data into a single object. The new dataset is held internally in both an array and a dictionary.
- Ingest the existing data into a single object. The existing data is held internally in both an array and a dictionary.
- Determine if a record from the new dataset exists in the existing dataset by creating a "key" and using it to search the dictionary. This unique key is created by concatenating the first four fields of a data record.
- If the data record does not exist in the existing dataset, add it to the existing data object (to both the dictionary and the array).
So far, so good. I have working code that accomplishes these steps, though without the architecture of classes/objects and good functional design.
My question concerns performance issues I'm seeing in Step 2. (A fully functional example is listed at the bottom of this post.)
I'm seeing wildly different execution performance times when using objects and the dictionary as opposed to when I side-step the objects completely. Searches online for questions related to object performance turned up this reference, but nothing directly related to my experience.
Giving you the answers up front, my performance is measured as:
Array Used is (1000 by 12) set up array = 7.046 ms populate dict with objects = 4775.396 ms populate dict without objects (store key) = 11.222 ms populate dict without objects (store array) = 7502.135 ms
Clearly, when I'm using my full dataset at 6500+ rows, my execution time grows to an unacceptable level.
I would appreciate comments and feedback on the performance of the methods in the example. Beyond that, if you have any guidance for more efficiently merging the datasets I'm very open to that as well.
To run my example below, create two classes. The first timer class CTimer
uses the code found in this post. The second simple class is called LaborRecord
:
Option Explicit
'------------ Class Public Variables --------------------------------------
'
Public projectID As String
Public projectName As String
Public resource As String
Public laborDate As String
Public laborHours As String
Public ftePerLaborMonth As String
Public laborMonth As String
Public laborLevel As String
Public laborCost As String
Public laborBurdened As String
Public sprintNumber As String
Public ftePerSprint As String
Public duplicate As Boolean
'------------ Class Private Variables -------------------------------------
'
Private Type TRecord
key As String
End Type
Private this As TRecord
Private Const PROJID_COL = 1
Private Const PROJNAME_COL = 2
Private Const RESOURCE_COL = 3
Private Const LABORDATE_COL = 4
Private Const LABORHOURS_COL = 5
Private Const FTEMONTH_COL = 6
Private Const LABORMONTH_COL = 7
Private Const LABORLEVEL_COL = 8
Private Const LABORCOST_COL = 9
Private Const LABORBURD_COL = 10
Private Const SPRINTNUM_COL = 11
Private Const FTESPRINT_COL = 12
'------------ Class Properties --------------------------------------------
'
Public Property Get key() As String
key = this.key
End Property
Public Property Let Record(arrayRow As Variant)
'--- expects a single row of the labor data array, then sets all
' internal values using this data
projectID = arrayRow(PROJID_COL)
projectName = arrayRow(PROJNAME_COL)
resource = arrayRow(RESOURCE_COL)
laborDate = arrayRow(LABORDATE_COL)
laborHours = arrayRow(LABORHOURS_COL)
ftePerLaborMonth = arrayRow(FTEMONTH_COL)
laborMonth = arrayRow(LABORMONTH_COL)
laborLevel = arrayRow(LABORLEVEL_COL)
laborCost = arrayRow(LABORCOST_COL)
laborBurdened = arrayRow(LABORBURD_COL)
sprintNumber = arrayRow(SPRINTNUM_COL)
ftePerSprint = arrayRow(FTESPRINT_COL)
this.key = projectID & projectName & resource & laborDate
End Property
'------------ Class Public Methods ----------------------------------------
'
'------------ Class Constructor and Destructor ----------------------------
'
Private Sub Class_Initialize()
duplicate = False
End Sub
Private Sub class_terminate()
End Sub
The full test module is as follows:
Option Explicit
Sub setup()
Dim data As Variant
Dim i As Long, j As Long
Const NUM_ROWS = 1000
Const NUM_COLS = 12
Dim myTimer As CTimer
Set myTimer = New CTimer
'--- init the array
ReDim data(1 To NUM_ROWS, 1 To NUM_COLS)
myTimer.StartCounter
For i = 1 To UBound(data, 1)
For j = 1 To UBound(data, 2)
data(i, j) = i & "+" & j
Next j
Next i
Dim arraySetupTime As Double
arraySetupTime = myTimer.TimeElapsed
Dim dictWithObjTime As Double
DictWithObjects data
dictWithObjTime = myTimer.TimeElapsed - arraySetupTime
Dim dictWithoutObjTimeKey As Double
DictWithoutObjectsKey data
dictWithoutObjTimeKey = myTimer.TimeElapsed - dictWithObjTime
Dim dictWithoutObjTimeArr As Double
DictWithoutObjectsArr data
dictWithoutObjTimeArr = myTimer.TimeElapsed - dictWithoutObjTimeKey
Debug.Print "Array Used is (" & NUM_ROWS & " by " & NUM_COLS & ")"
Debug.Print "set up array = " & Format(arraySetupTime, "0.000 ms")
Debug.Print "populate dict with objects = " & Format(dictWithObjTime, "0.000 ms")
Debug.Print "populate dict without objects (store key) = " & Format(dictWithoutObjTimeKey, "0.000 ms")
Debug.Print "populate dict without objects (store array) = " & Format(dictWithoutObjTimeArr, "0.000 ms")
End Sub
Sub DictWithObjects(ByRef arr As Variant)
Dim i As Long
Dim duplicatesFound As Boolean
Dim thisRecord As LaborRecord
Dim thisDict As Dictionary
Set thisDict = New Dictionary
duplicatesFound = False
For i = LBound(arr, 1) To UBound(arr, 1)
Set thisRecord = New LaborRecord
thisRecord.Record = Application.WorksheetFunction.Index(arr, i, 0)
If Not thisDict.Exists(thisRecord.key) Then
thisDict.Add thisRecord.key, thisRecord 'the Item is the object
Else
duplicatesFound = True
End If
Next i
End Sub
Sub DictWithoutObjectsKey(ByRef arr As Variant)
Dim i As Long
Dim duplicatesFound As Boolean
Dim thisRecord As LaborRecord
Dim thisDict As Dictionary
Dim key As String
Set thisDict = New Dictionary
duplicatesFound = False
Dim projectID As String
Dim projectName As String
Dim resource As String
Dim laborDate As String
Dim laborHours As String
Dim ftePerLaborMonth As String
Dim laborMonth As String
Dim laborLevel As String
Dim laborCost As String
Dim laborBurdened As String
Dim sprintNumber As String
Dim ftePerSprint As String
For i = LBound(arr, 1) To UBound(arr, 1)
projectID = arr(i, 1)
projectName = arr(i, 2)
resource = arr(i, 3)
laborDate = arr(i, 4)
laborHours = arr(i, 5)
ftePerLaborMonth = arr(i, 6)
laborMonth = arr(i, 7)
laborLevel = arr(i, 8)
laborCost = arr(i, 9)
laborBurdened = arr(i, 10)
sprintNumber = arr(i, 11)
ftePerSprint = arr(i, 12)
key = projectID & projectName & resource & laborDate
If Not thisDict.Exists(key) Then
thisDict.Add key, key 'the Item is just the key
Else
duplicatesFound = True
End If
Next i
End Sub
Sub DictWithoutObjectsArr(ByRef arr As Variant)
Dim i As Long
Dim duplicatesFound As Boolean
Dim thisRecord As LaborRecord
Dim thisDict As Dictionary
Dim key As String
Set thisDict = New Dictionary
duplicatesFound = False
Dim projectID As String
Dim projectName As String
Dim resource As String
Dim laborDate As String
Dim laborHours As String
Dim ftePerLaborMonth As String
Dim laborMonth As String
Dim laborLevel As String
Dim laborCost As String
Dim laborBurdened As String
Dim sprintNumber As String
Dim ftePerSprint As String
For i = LBound(arr, 1) To UBound(arr, 1)
projectID = arr(i, 1)
projectName = arr(i, 2)
resource = arr(i, 3)
laborDate = arr(i, 4)
laborHours = arr(i, 5)
ftePerLaborMonth = arr(i, 6)
laborMonth = arr(i, 7)
laborLevel = arr(i, 8)
laborCost = arr(i, 9)
laborBurdened = arr(i, 10)
sprintNumber = arr(i, 11)
ftePerSprint = arr(i, 12)
key = projectID & projectName & resource & laborDate
If Not thisDict.Exists(key) Then
thisDict.Add key, arr 'the Item is the array slice
Else
duplicatesFound = True
End If
Next i
End Sub
Cells
within theRange
to process the merge. While this didn't take overly long (at first), as the dataset grows the time increases. My intent was to ingest both the new and existing data into arrays to speed the merge, then copy the updated dataset back to the sheet. Using an intermediate object may help as a mechanism, but is not strictly necessary. This performance issue came up because of the original intent to refactor and because I was exploring different paths. I was quite surprised at the speed difference and measured it, then asked my question. \$\endgroup\$