
Add the framework to support prepack #4413

Merged — 9 commits merged into master from roli/pre-packing-memory on Aug 7, 2020

Conversation

yufenglee (Member) commented Jul 2, 2020:

Description:
Set up the framework to support prepacking.

Motivation and Context:
Prepacking an initialized constant tensor can improve performance for some operators, for example prepacking the constant matrix B for MatMul, or the weights for GRU and LSTM. However, our current mechanism for prepacking has a drawback: it introduces memory overhead.
This PR introduces a way to remove that overhead. It adds a virtual function, PrePack, that allows an OpKernel to prepack its tensors. The function is invoked during the SessionState initialization stage. If no other op uses an initialized constant tensor after it has been packed, the tensor is released.

@yufenglee yufenglee requested review from skottmckay, pranavsharma and tracysh Jul 2, 2020
@yufenglee yufenglee requested a review from microsoft/onnxruntime as a code owner Jul 2, 2020
pranavsharma (Contributor) left a comment:

Changes look good overall. As discussed offline, can you add a session option to turn this on/off?

Status SessionState::PrePackConstantTensors() {
  // calculate the use count of each value
  std::unordered_map<std::string, size_t> node_arg_use_count;
  for (auto& node : GetGraphViewer().Nodes()) {

pranavsharma (Contributor) commented Jul 2, 2020:

Can this be const auto&? Explicitly call out const per the coding guidelines: "Qualify usages of 'auto' with 'const', '*', '&' and '&&' where applicable to more clearly express the intent."
Check the other loops below as well.

@@ -99,6 +99,8 @@ Status FinalizeSessionState(SessionState& session_state,
session_state.CleanInitializedTensorsFromGraph();

ORT_RETURN_IF_ERROR(session_state.CreateKernels(kernel_registry_manager));
ORT_RETURN_IF_ERROR(session_state.PrePackConstantTensors());

pranavsharma (Contributor) commented Jul 2, 2020:

Can we call this PrePackInitializedTensors to keep the naming consistent with SaveInitializedTensors and CleanInitializedTensorsFromGraph?

pranavsharma (Contributor) commented Jul 2, 2020:

Also, please fill in the PR template (description and motivation/context) appropriately.


for (auto& node : GetGraphViewer().Nodes()) {
  auto kernel = GetMutableKernel(node.Index());
  int input_idx = 0;

skottmckay (Contributor) commented Jul 3, 2020:

Can delay this until later.

if (is_packed && node_arg_use_count.count(input_name) && node_arg_use_count[input_name] == 1) {
  // release the constant initialized tensor
  constant_initialized_tensors_[ort_value_idx] = OrtValue();
}

skottmckay (Contributor) commented Jul 3, 2020:

Would it be better to remove it from the map completely?

skottmckay (Contributor) commented Jul 3, 2020:

Overall approach looks nice and clean to me.

}

for (auto& node : GetGraphViewer().Nodes()) {
  auto kernel = GetMutableKernel(node.Index());

skottmckay (Contributor) commented Jul 3, 2020:

Does this work correctly if an initializer from the main graph is packed in a subgraph and becomes unused? It possibly does, via the reference counting in OrtValue.

// }
// return Status::OK();
// }

skottmckay (Contributor) commented Jul 3, 2020:

nit: we probably need a more complete example showing that the kernel needs to do an allocation and keep a member that owns that buffer.

@yufenglee yufenglee force-pushed the roli/pre-packing-memory branch 2 times, most recently from e18e6aa to a22f6c7 Jul 28, 2020
@yufenglee yufenglee changed the title [WIP] add the framework to support prepack Add the framework to support prepack Jul 28, 2020
@yufenglee yufenglee force-pushed the roli/pre-packing-memory branch from 91f7920 to 4243a9b Aug 4, 2020
yufenglee added 2 commits Aug 5, 2020
@@ -857,6 +857,10 @@ struct OrtApi {
* \param index index of string tensor element to fill
*/
ORT_API2_STATUS(FillStringTensorElement, _Inout_ OrtValue* value, _In_ const char* s, size_t index);

// Control pre-packing of initialized constant tensors
ORT_API2_STATUS(EnablePrePacking, _Inout_ OrtSessionOptions* options);

pranavsharma (Contributor) commented Aug 6, 2020:

You might be able to achieve this without adding these 2 APIs once this PR is merged.

gwang-msft (Contributor) commented Aug 7, 2020:

You may merge first, and I can move the prepacking option into the session config map in my change.



const size_t packed_b_size = MlasGemmPackBSize(N, K, b_is_signed);
if (packed_b_size == 0) {
  return;
}
auto alloc = Info().GetAllocator(0, OrtMemTypeDefault);

pranavsharma (Contributor) commented Aug 6, 2020:

nit: auto& alloc

@@ -81,5 +81,8 @@ struct SessionOptions {

// Deterministic compute is likely not as performant. This option is default to false.
bool use_deterministic_compute = false;

// Control the pre-packing of initialized constant tensors
bool use_prepacking = true;

pranavsharma (Contributor) commented Aug 7, 2020:

nit: could be called use_weight_prepacking

@yufenglee yufenglee merged commit b22091d into master Aug 7, 2020
40 checks passed:
- Android CI Pipeline: Build #20200805.22 succeeded
- Linux CPU CI Pipeline: Build #20200805.21 succeeded
- Linux CPU x64 NoContribops CI Pipeline: Build #20200805.21 succeeded
- Linux GPU CI Pipeline: Build #20200805.21 succeeded
- Linux GPU TensorRT CI Pipeline: Build #20200805.23 succeeded
- Linux OpenVINO CI Pipeline: Build #20200805.19 succeeded
- MacOS CI Pipeline: Build #20200805.22 succeeded
- MacOS NoContribops CI Pipeline: Build #20200805.22 succeeded
- Windows CPU CI Pipeline: Build #20200805.21 succeeded (build debug/release, build_x64_no_contrib_ops debug/release, x86_build debug/release, x86_no_contrib_ops debug/release all succeeded)
- Windows GPU CI Pipeline: Build #20200805.21 succeeded (build debug and build release succeeded)
- Windows GPU TensorRT CI Pipeline: Build #20200805.22 succeeded (build debug and build release succeeded)
- centos7_cpu: Build #20200805.22 succeeded (linux_centos_ci Debug and Release succeeded)
- license/cla: All CLA requirements met
- orttraining-linux-ci-pipeline: Build #20200805.21 succeeded (Onnxruntime_Linux_CPU_Training Debug and Release succeeded)
- orttraining-linux-gpu-ci-pipeline: Build #20200805.21 had test failures (Onnxruntime_Linux_GPU_Training Debug and Release succeeded)
- orttraining-mac-ci-pipeline: Build #20200805.22 succeeded
- orttraining-win-ci-pipeline: Build #20200805.22 succeeded (Win_CPU_Training Debug and RelWithDebInfo succeeded)
- orttraining-win-gpu-ci-pipeline: Build #20200805.21 succeeded (Win_GPU_Training Debug and RelWithDebInfo succeeded)
@yufenglee yufenglee deleted the roli/pre-packing-memory branch Aug 7, 2020