Constraint Algorithm Optimization in C++

Question

I'm especially interested in ways to further optimize a constraint algorithm. So far I've switched to SIMD sse4 vector operations which nearly doubled the performance. I'm hoping there is more performance to squeeze out that I simply don't see.

Here is the algorithm:

FVector4 rest(REST_DISTANCE); // Set the constant simd vector outside the loop
FVector4 stiffness(STIFFNESS);

for (int i = 0; i < TEST_SIZE; i++)
{
    const auto& constraint = Constraints[i];
    auto& position1 = Particle[constraint.first].Position;
    auto& position2 = Particle[constraint.second].Position;

    for (int j = 0; j < ITERATION_COUNT; j++)
    {
        auto delta = position2 - position1;
        auto distance = delta.norm();

        auto correctionDistance = (distance - rest) / distance;
        auto pCorrection = correctionDistance * delta * stiffness;

        position1 += pCorrection;
        position2 -= pCorrection;
    }
}

Constraints is an array of a simple struct that holds two unsigned int's, Particle is an array of FParticle structs which holds three FVector4 fields and a Fvector4 simply wraps a __m128 and provides operators and functions like norm() which utilize _mm_* intrinsic functions.

Currently the algorithm takes 55 cycles per Constraint per Iteration, this is with TEST_SIZE of 150,000 and ITERATION_COUNT at 16. I know there is performance to gain from making it parallel but I'm hoping to get more performance out of the single threaded version specifically.

I can post the actual structs if needed.

FVector4 code:

struct FVector4
{
    FVector4()
    {}
    explicit FVector4(const __m128& InitData) : Data(InitData)
    {
    }
    explicit FVector4(__m128&& InitData) : Data(InitData)
    {
    }
    explicit FVector4(float InitValue)
    {
        Data = _mm_set_ps1(InitValue);
    }
    explicit FVector4(float X, float Y, float Z, float W = 0.f)
    {
        Data = _mm_setr_ps(X, Y, Z, W);
    }

    __forceinline float& operator[](int index)
    {
        return Data.m128_f32[index];
    }
    __forceinline const float& operator[](int index) const
    {
        return Data.m128_f32[index];
    }
    __forceinline __m128 squaredNorm() const
    {
        return _mm_dp_ps(Data, Data, 0x7F);
    }
    __forceinline __m128 norm() const
    {
        auto sqNorm = squaredNorm();
        return _mm_mul_ps(sqNorm, _mm_rsqrt_ps(sqNorm));
    }
    __forceinline __m128 normalized() const
    {
        auto dp = _mm_dp_ps(Data, Data, 0x7F);
        dp = _mm_rsqrt_ps(dp);
        return _mm_mul_ps(Data, dp);
    }
    __forceinline void normalize()
    {
        Data = normalized();
    }
    __m128 Data;
};

Operators as used in the algorithm:

__forceinline FVector4 operator-(const FVector4& a, const FVector4& b)
{
    return FVector4(_mm_sub_ps(a.Data, b.Data));
}
__forceinline FVector4 operator*(const FVector4& a, const FVector4& b)
{
    return FVector4(_mm_mul_ps(a.Data, b.Data));
}
__forceinline FVector4 operator/(const FVector4& a, const FVector4& b)
{
    return FVector4(_mm_mul_ps(a.Data, _mm_rcp_ps(b.Data)));
}
__forceinline FVector4& operator+=(FVector4& a, const FVector4& b)
{
    a.Data = _mm_add_ps(a.Data, b.Data);
    return a;
}
__forceinline FVector4& operator-=(FVector4& a, const FVector4& b)
{
    a.Data = _mm_sub_ps(a.Data, b.Data);
    return a;
}

glampert · Accepted Answer · 2016-01-06 17:50:47Z

I don't know much about SSE and can't offer you any suggestions on performance right now, just a few suggestions to improve your code a little bit. Nevertheless, it looks well written and clean.

Since the default constructor does nothing, you can make it = default. That's the new syntax for such cases since C++11. It also conveys more meaning that you want a default no-op constructor.
The move constructor is pointless. An __m128 is a native type like an integer, so it fits in a CPU register. It will always be copied. You can remove this constructor overload.
The other constructor that takes a __m128 by const reference is probably less efficient that it could be. Again, __m128 can be passed via registers, so taking by reference will instead pass a pointer that has to be dereferenced. I bet you it is generating more code than if you'd just pass it by value.
The constructor that takes the four X,Y,Z,W floats doesn't have to be explicit. For any constructor taking more than one argument that has no default, explicit doesn't make sense. It is only meaningful when the constructor has a single required argument and you want to force writing something like func(FVector{42.0f}). If the constructor takes more than one argument, it can't be implicitly called.
const float& operator[](int index) const should instead return by value. Again you are forcing the compiler to generate indirection code when the value can be cheaply passed in a register.
I would also recommend asserting in the array subscript operators to ensure index in >= 0 && < 4.
If you decide to port this code in the future, __forceinline is going to be a problem. It is a Microsoft extension. I would #define a portable macro for it instead or just use standard inline. __forceinline is no guarantee of inlining, despite the misleading name. It just provides a slightly stronger hint to the compiler than inline.
__m128 is somewhat an ugly type to pass around, with those double underscores. In your place, I would declare some alias for it, perhaps simd_t or something. It might even be of value in the future if you decide to port FVector to some other SIMD set like AltiVec.

Bizkit · Accepted Answer · 2016-01-06 17:54:10Z

1

    auto delta = position2 - position1;
    auto distance = delta.norm();

    auto correctionDistance = (distance - rest) / distance;
    auto pCorrection = correctionDistance * delta * stiffness;

I would mark these all as const auto as they aren't changing during each loop cycle and might enable the compiler to make certain optimizations it couldn't do before as well as express their constness to the reader.

EDIT: If you're willing to make certain assumptions about ITERATION_COUNT, you might be able to do a little manual loop unrolling.

edited Jan 6, 2016 at 17:54

answered Jan 6, 2016 at 17:47

Bizkit

1,73910 silver badges17 bronze badges

\$\begingroup\$ True, I'll make those changes and try it. Yes, the iteration count will never go below 4 so I could unroll it at least that many times. \$\endgroup\$

NIZGTR
– NIZGTR

2016-01-06 18:10:41 +00:00
Commented Jan 6, 2016 at 18:10

Add a comment |

Stack Exchange Network

Constraint Algorithm Optimization in C++

2 Answers 2

Your Answer

Hot Network Questions

Constraint Algorithm Optimization in C++

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Related

Hot Network Questions