|
- How many dimensions are needed?
- C++ Template programming may require some code duplication for each level of higher dimension.
- Address calculation is the easy part. A simple approach can be used for dimensions up to a dozen.
- For a 3-dimensional example:
- Let the size of the array be
[m, n, p]
- For each dimension, we calculate a "weight vector" by successively multiplying terms:
[p*n, p, 1]
- To access the element at
[i, j, k] , its linearized element index can be calculated as:
[i, j, k] dotproduct [p*n, p, 1] or ((((i * n) + j) * p) + k)
- For higher dimensions, care must be taken to reduce the time spent in multiplications needed for element address calculation.
- What are the typical sizes of the array, to the order of magnitude?
- For small-sized arrays (less than 1000 elements), basically any design and implementation will work, unless proven otherwise by a profiler.
- For medium-sized arrays (as long as it can still fit entirely in the computer's memory or the application's address space), performance will be an increasingly important concern.
- For large arrays (which cannot fit as a contiguous piece of memory), architecture concern will dominate.
- Architecture / Memory Layout
- Driving forces:
- Size / scale
- Performance
- Computational platform
- Interoperability (with other code / libraries)
- Access semantics
- Available choices:
- Contiguous memory (the easiest, most straight-forward choice)
- Non-contiguous memory (example: tiled array)
- Memory-mapped file
- Distributed across multiple machines in a network
- Does your implementation need to support high performance?
- As a ballpark, anything requiring more than one billion elementary operations per second can be considered high-performance for the purpose of choosing a design.
- Designing for high-performance will impose constraints on:
- Choice of computational platform
- Choice of memory layout
- Computational platform?
- If high performance is not needed, scalar C++ code will be sufficient.
- Otherwise, a high-performance computational platform will be needed.
- This will impose constraints on:
- Computational choices
- Scalar C++
- Multi-core computing
- Manual threading
- Concurrent programming library (Intel TBB, Microsoft PPL/AMP)
- Compiler-assisted (OpenMP)
- Parallel programming languages. Example: Cilk, Brook
- Open-soure library (OpenCV)
- SIMD (Single-Instruction Multiple Data) example: Intel SSE2, AVX, ARM NEON, PowerPC AltiVec
- GPU programming (CUDA, OpenCL)
- Does the array need to be accessible by other libraries, without copying?
- If so, your array's memory layout must be compatible with that other library you are using.
- For small array sizes, or for access patterns where writes (modifications) occur infrequently, copying is not an issue.
- For large arrays which are both read and written very frequently, data copying can become an performance issue.
- A story where a bad combination of data copying and access semantics can cause an increasingly disproportionate performance issue:
- Consider two modules, each of which has its own copy of a 100 MB array, and which they have to keep synchronized at regular intervals.
- Suppose that one of the modules freely modifies its own copy, without keeping track of which region of the array is being modified.
- When the time comes to perform synchronization, the first module will have to copy the entire 100 MB array over, because it cannot tell which part needs synchronization.
- The user of the first module will complain that "if I am only modifying a few elements of this array, why is the library taking so long to synchronize?"
- Lesson: by requiring modifications to go through an API call, access semantics can be used to reduce inefficiency in data copying.
- Access semantics.
- Boils down to three questions:
- Who computes the memory address of elements? given the subscripts
(i, j, k)
- How is the array's lifetime (ownership) managed?
- Is the array implementation being notified of which part of the array is modified?
- Choices:
- Direct access to underlying memory, no API call.
- The array implementation will not know which part of the array is modified / used.
- Direct access to underlying memory, requiring a API call to lock/unlock an array view.
- Known as "advisory locks" because it cannot enforce access control.
- An advisory lock is still useful, because by locking/unlocking the library is notified of current users of the array, as well as the region being modified.
- API method for reading and writing one element at a time
- Easiest to implement and use
- Slowest
- API method for copying an array view at a time
- Copies between a sub-rectangle of your array and a normal C/C++/Java array
- Good balance between high-performance and safety
- Allows your array implementation to use a different memory layout
- Design of the library interface and syntax
If you have made it thus far, congratulations, you can start designing the interface and the syntax!
|
|
answered Sep 8 '13 at 14:52
|
|