
Share buffers between CPU and GPU #1696

Merged
merged 5 commits into from Jun 5, 2023

Conversation

kiltyj
Contributor

@kiltyj kiltyj commented Jun 5, 2023

This updates ggml_metal_add_buffer to use MTLDevice.newBufferWithBytesNoCopy to attempt to share buffers between the CPU and GPU rather than re-allocating memory for the GPU.

I've been following #1642 and noticed that this might help some of the swapping-related issues some have been seeing with larger models and devices with < 96GB memory. With this change, I'm no longer seeing any swapping or odd/corrupted output.

Apologies if I missed any contribution steps, or if this change is missing something obvious. I'm not sure whether this covers all possible cases. One thing to note is that newBufferWithBytesNoCopy requires a page-aligned source buffer. This seems to be the case with mmap, but I'm not sure it will remain true for all possible configurations / code paths.

@ggerganov
Owner

Ah very interesting!
I did see the MTLDevice.newBufferWithBytesNoCopy function and tried it but it kept crashing.
I guess I was running with --no-mmap at that time.

One thing to note is that newBufferWithBytesNoCopy requires a page-aligned source buffer. This seems to be the case with mmap, but not sure if it will remain true for all possible configurations / code paths.

Currently, it's not the case without mmap. Should be easy to fix, I think.

@kiltyj
Contributor Author

kiltyj commented Jun 5, 2023

@ggerganov I took my best shot at it, but I'm definitely stepping into unfamiliar territory. Even without this code, these buffers seemed to be page-aligned when running with --no-mmap on my M1 Max, so I'm not sure how to best test this other than confirming that the modified code still seems to work.

@akx

akx commented Jun 5, 2023

👍 Anecdotally seems to work fine on my laptop (M2 Max). Initialization seems to be a lot faster, at least.

@x4080

x4080 commented Jun 5, 2023

Hi, I tried this and haven't gotten any disk swap on an M2 Pro 16GB using llama-13b-supercot.ggmlv3.q4_0.bin.
Result:
result:

llama_print_timings:        load time =  1501.98 ms
llama_print_timings:      sample time =     6.33 ms /   327 runs   (    0.02 ms per token)
llama_print_timings: prompt eval time =  1140.83 ms /    16 tokens (   71.30 ms per token)
llama_print_timings:        eval time = 23182.15 ms /   326 runs   (   71.11 ms per token)

@ggerganov ggerganov (Owner) left a comment

Great work!

@ggerganov ggerganov merged commit 9d0693b into ggerganov:master Jun 5, 2023
5 of 21 checks passed
@kiltyj kiltyj deleted the metal-shared-memory branch June 5, 2023 20:30
@ggerganov ggerganov mentioned this pull request Jun 6, 2023
@Johnhersh

After this PR was merged I now get this error:
ggml_metal_add_buffer: buffer 'data' size 18300780544 is larger than buffer maximum of 17179869184

Any idea why?

@kiltyj
Contributor Author

kiltyj commented Jun 6, 2023

@Johnhersh The error you shared was actually added in #1706. Was it running without issues before #1696? I suspect that before that it was actually failing silently when creating that buffer, e.g. hanging or producing garbage output.

The root cause is that macOS is refusing to create a buffer big enough to hold the model you're loading. It seems like macOS caps the size of an individual MTLBuffer at roughly half the total unified memory, which a 30B model exceeds on machines with 32GB or less.

Since the limit seems to be on individual buffer size and not total memory usage, last night I started playing with some ideas for how to split the data buffer in a way that downstream Metal code would be happy with, but haven't gotten anything working yet, as I'm not very familiar with how this buffer is / can be segmented.

@Johnhersh

Was it running without issues before #1696? I suspect that before that it was actually failing silently when creating that buffer, e.g. hanging or producing garbage output.

Yes, you're correct: before this it was outputting ?? repeatedly. So this error message is quite correct, then. Thank you!

@kiltyj
Contributor Author

kiltyj commented Jun 6, 2023

So that I could confirm that this is not related to the newBufferWithBytes => newBufferWithBytesNoCopy change, I just tried running the pre-#1696 version on my 32GB machine with a 33B model. Even with newBufferWithBytes, Metal refuses to allocate a buffer large enough for the model (returns nil).

Not too surprising. Just wanted to make sure that the MTLDevice.maxBufferLength limit applies to both methods of creating a MTLBuffer. So @ggerganov, seems like we might need to figure out how to split up buffers that exceed this limit.


5 participants