Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scan_by_key results are non-deterministic for floats [NVBug 3477443] #1587

Open
allisonvacanti opened this issue Jan 10, 2022 · 4 comments
Open
Assignees
Labels
enhancement P2: nice to have repro: verified

Comments

@allisonvacanti
Copy link
Collaborator

@allisonvacanti allisonvacanti commented Jan 10, 2022

While floating point reduction is non-associative and some floating point error is expected, the scan algorithms guarantee consistent results "run-to-run" on the same device. This is not the case for the scan_by_key algorithms, which currently produce different results (within fp error) run-to-run on the same device.

We should look into this and see if it's possible to provide the same guarantee for the keyed algorithms.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

#include <thrust/copy.h>
#include <thrust/fill.h>
#include <thrust/sequence.h>
#include <thrust/scan.h>
#include <iostream>

int main()
{
  auto const numElements = 250000;
  thrust::device_vector<double> data(numElements, 0.1);
  thrust::device_vector<double> keys(numElements, 1);

  thrust::device_vector<double> out1(numElements);
  thrust::device_vector<double> out2(numElements);

  thrust::host_vector<double> hostOut1(numElements);
  thrust::host_vector<double> hostOut2(numElements);

  thrust::inclusive_scan_by_key(keys.begin(), keys.end(), data.begin(), out1.begin());
  thrust::inclusive_scan_by_key(keys.begin(), keys.end(), data.begin(), out2.begin());

  // copy all of out1 and out2 to the host
  thrust::copy(out1.begin(), out1.end(), hostOut1.begin());
  thrust::copy(out2.begin(), out2.end(), hostOut2.begin());

  // Check the outputs are exactly the same
  for(int i = 0; i <numElements; i++) {
    if (hostOut1[i] != hostOut2[i]) {
      std::cout << "Element "<< i << " is not equal" << std::endl;
    }
  }

  return 0;
}
@allisonvacanti allisonvacanti added enhancement P2: nice to have repro: verified labels Jan 10, 2022
@allisonvacanti allisonvacanti changed the title scan_by_key results are non-deterministic for floats scan_by_key results are non-deterministic for floats [NVBug 3477443] Jan 10, 2022
@allisonvacanti
Copy link
Collaborator Author

@allisonvacanti allisonvacanti commented Jan 24, 2022

Note that there are currently two versions of scan by key -- one in CUB and one in Thrust. Eventually we'll be updating Thrust to use CUB's implementation, so the fix should go in to CUB.

@senior-zero
Copy link
Collaborator

@senior-zero senior-zero commented Jan 25, 2022

All our scan implementations rely on decoupled look-back approach. That means that CUB and Thrust scans don't provide run-to-run determinism for scan operations on floating-point types. In other words, this issue isn't unique to the scan-by-key variant of the algorithm.

Here's an illustration of the tile states of five thread blocks. Each thread block reads the tile states of the predecessor block. If the tile state of the predecessor block contains only tile aggregate, this aggregate is added to the partial sum, and the previous tile state gets inspected. If one of the predecessor states has the full prefix, it gets added to the collected aggregate, and the result is stored in the tile state of the current thread block.

image

Multiple thread blocks run concurrently. Updates of the tile states get visible non-deterministically. For the example above, it's possible that thread block four will observe the final prefix. In this case it'll write (a + b + c + d) + e as its result. It's also possible that only tile aggregate will be observed. In this case, the result might be (a + b + c) + (d + e).

I don't see how this algorithm might provide run-to-run determinism. The original paper that proposes this algorithm states that:

An interesting implication is that scan results are not necessarily deterministic for pseudo-associative operators. For example, the prefix sum across a given floating-point dataset may vary from run to run because the number and order of scan operators applied by CUB may vary from one run to the next.

I suggest we update the documentation and remove this guarantee. We could add a stable scan implementation that would use a different algorithm for floating-point values.

@allisonvacanti
Copy link
Collaborator Author

@allisonvacanti allisonvacanti commented Jan 25, 2022

Thanks for looking into this!

I was hoping we had some new tech in the regular decoupled scan algorithm that managed to address this limitation, but it sounds like we don't. I re-ran the test program using the regular scan algorithms (which are documented to be the same run-to-run) and confirmed that they also differ. I think you're right, the docs are just out of date.

Let's just fix the docs for now. If there's significant interest we can look into adding some stable versions later.

@lilohuang
Copy link
Contributor

@lilohuang lilohuang commented Feb 25, 2022

I have created a follow up question #1621, please take a look, and I appreciate your help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement P2: nice to have repro: verified
Projects
None yet
Development

No branches or pull requests

3 participants