[ET-VK][runtime] Add prepack cache to avoid duplicate weight prepacking #18361
SS-JIA merged 5 commits into gh/SS-JIA/499/base
Conversation
When embedding and linear ops share tied weights and both use the same prepacking function (prepack_quantized_linear_weight), the weight gets prepacked twice, wasting GPU memory. Add a cache to ComputeGraph keyed by (input ValueRef, kernel name) that returns the already-prepacked tensor on cache hit, avoiding the duplicate allocation. Differential Revision: [D97430801](https://our.internmc.facebook.com/intern/diff/D97430801/) [ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18361
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Unrelated Failures as of commit fc166c7 with merge base 38b40bc.
NEW FAILURE: the following job has failed. BROKEN TRUNK: the following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
When embedding and linear ops share tied weights and both use the same prepacking function (prepack_quantized_linear_weight), the weight gets prepacked twice, wasting GPU memory. Add a cache to ComputeGraph keyed by (input ValueRef, kernel name) that returns the already-prepacked tensor on cache hit, avoiding the duplicate allocation. Differential Revision: [D97430801](https://our.internmc.facebook.com/intern/diff/D97430801/) ghstack-source-id: 355089157 Pull Request resolved: #18361
Merge commit (#18390): Pull Request resolved: #18361. ghstack-source-id: 355397958. Co-authored-by: ssjia <ssjia@devvm26340.ftw0.facebook.com>
Stack from ghstack (oldest at bottom):
When embedding and linear ops share tied weights and both use the same
prepacking function (prepack_quantized_linear_weight), the weight gets
prepacked twice, wasting GPU memory. Add a cache to ComputeGraph keyed
by (input ValueRef, kernel name) that returns the already-prepacked
tensor on cache hit, avoiding the duplicate allocation.
Differential Revision: D97430801