Conversation
Force-pushed from b5cf68c to 3c4b31b.
Michael, I finished what I wanted.

@amontoison probably this is why the …
A more realistic example: a batched QuadraticModel where we vary only the RHS: JuliaSmoothOptimizers/QuadraticModels.jl@main...klamike:QuadraticModels.jl:mk/rhsbatch

Amazing, Michael!
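The RHS-batch idea above can be sketched in a few lines of base Julia. This is illustrative only — the struct and function names here are hypothetical, not the actual code in the linked `mk/rhsbatch` branch:

```julia
# Sketch: a QP  min ½xᵀQx + cᵀx  s.t.  Ax = b,  where only b varies per instance.
# Q, A, c are stored once; B holds one RHS per column.
struct RHSBatchQP{T}
    Q::Matrix{T}   # shared quadratic term
    A::Matrix{T}   # shared constraint matrix
    c::Vector{T}   # shared linear term
    B::Matrix{T}   # ncon × nbatch: column i is the RHS of instance i
end

nbatch(qp::RHSBatchQP) = size(qp.B, 2)

# Constraint residual of instance i at a common point x (zero-copy view of B):
cons_residual(qp::RHSBatchQP, x, i) = qp.A * x .- view(qp.B, :, i)
```

Since everything except `B` is shared, the per-instance memory cost is just one RHS vector, which is what makes large batches practical.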
Should we hardcode …
I think so, yes, to be consistent with the regular API. Both can be updated at the same time later (maybe only in 0.22).
We probably want to define some more meta functions like …
Force-pushed from 65dcf55 to a20cc30.
@amontoison what do you think about having an API for updating the nbatch? And maybe some optional …

@klamike I don't understand what you mean by updating the …
Yes, I meant updating the … In the ExaModels case, of course, it depends on how you do the batching. When I added parameters to ExaModels, I specifically made the lower-level functions all take the parameter vector as an input, to make it possible to implement the batching the way I did in BatchNLPKernels. It is based on a single ExaModel and does the same number of kernel launches for the batch evaluation as ExaModels would for a regular model, just with 2D grids. Since the base parametric ExaModel is built "unbatched", it is trivial to implement the …
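The design point above — low-level evaluators that take the parameter vector explicitly, so a batch is just a parameter matrix — can be sketched as follows. Names are illustrative, not the actual BatchNLPKernels API:

```julia
# Stand-in for a parametric objective: the evaluator takes θ explicitly,
# so nothing about the model itself needs to know about batching.
obj(x, θ) = sum(abs2, x .- θ)

# Batched evaluation over a parameter matrix Θ (one column per instance).
# On the GPU this loop would be a single kernel launch over a 2D grid
# (variable index × batch index); here it is a plain loop for clarity.
function obj_batch!(out, x, Θ::AbstractMatrix)
    for i in axes(Θ, 2)
        out[i] = obj(x, view(Θ, :, i))
    end
    return out
end

Θ = rand(10, 32)            # 32 parameter instances
out = zeros(32)
obj_batch!(out, zeros(10), Θ)
```

Because the base model stays unbatched, the batch dimension lives entirely in the data (`Θ`) and the launch geometry, not in the model definition.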

Actually, the … I do still think the …

@klamike Feel free to add what you need in the API on …

I think it's good to go; I was overcomplicating things. I got the MadIPM UniformBatch + RHSBatchQuadraticModel working locally, and will clean it up and push to the MadIPM PR soon.

@klamike Do you have any benchmark with …

Sorry about that, I missed your message. The latest result is on a batch of 128 9241_pegase DCOPF instances: ~6.5x faster than sequential, and ~1.85x faster than multi-threading with 8 threads (comparing 1 task/mini-batch vs. 1 task/problem).
Force-pushed from 26093b5 to 5da0386.
Good 🥇 Also, should I add a prefix …? @michel2323 worked on …

For the … Regarding the …

Which GPU did you run the benchmark on, @klamike?

I believe it was an RTX 6000 Pro Blackwell.

I see. For case 9241, practical GPU throughput is about 15 CPU threads' worth? I wonder how optimized the batched solver is — e.g., is the symbolic factorization reused in each solve? Also, fp64 flops have not bottlenecked us so far, but they might in this regime. I wonder how it performs on B100/H100 GPUs.
(replied via email to "[JuliaSmoothOptimizers/NLPModels.jl] Batch API (PR #540)")

We are indeed reusing the symbolic factorization, using cuDSS's uniform-batch feature. But the overall performance of the solver is not quite optimized: currently a lot of time is spent on packing/unpacking between individual and batched buffers, since that made it simpler to batch-ify incrementally. Now that (most of) the solver is batched, I plan to revisit this. I think we have H100 and H200 but no B100; I'll give it a try soon, with more CPU threads too.

@klamike We discussed the batch API with Sungho last week and converged on a storage layout where everything is a multi-dimensional array and the last dimension is the batch size. I already updated CUDSS.jl last weekend for that: …

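The last-dimension-is-batch convention can be illustrated with plain dense arrays (illustrative only — this is not the CUDSS.jl interface):

```julia
# Every batched quantity is one dense array whose LAST dimension indexes the batch.
n, nrhs, nbatch = 5, 2, 8
X = zeros(n, nrhs, nbatch)    # batched solutions
B = rand(n, nrhs, nbatch)     # batched right-hand sides

# Instance i is a zero-copy view, so unbatched kernels still apply per slice:
B₁ = view(B, :, :, 1)

# A per-instance fallback is just a loop over the last dimension:
for i in 1:nbatch
    X[:, :, i] .= B[:, :, i]  # placeholder for a per-instance solve
end
```

A nice property of this layout is that the whole batch is contiguous in memory, so copying it to or from the device is a single transfer.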
I like that approach! But for the KKT system nzval, I think a strided vector is actually nicer (we can reuse the transfer kernels by just building a batched map). A matrix is definitely more natural for a user-facing API, so it makes sense to have both in CUDSS. As I am working on this version, I have come across several kernels that can be written exactly the same (differing only in argument types), or with only one or two words different (e.g. same …
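The strided-vector/batched-map idea mentioned above can be sketched like this (a minimal sketch, assuming a shared sparsity pattern across the batch; `bmap` is a hypothetical helper):

```julia
# Store all batched nzvals contiguously: a matrix view for the user-facing API,
# and a flat vector view over the same memory for the transfer kernels.
nnzK, nbatch = 5, 3
V = zeros(nnzK, nbatch)      # column i is instance i's nzval
vals = vec(V)                # same memory, viewed as one long strided vector

# "Batched map": entry k of instance i lives at offset (i - 1) * nnzK + k,
# so an unbatched index map lifts to the batch by adding the per-instance offset.
bmap(k, i) = (i - 1) * nnzK + k

vals[bmap(2, 3)] = 1.0
@assert V[2, 3] == 1.0       # the matrix and vector views alias the same memory
```

Since `vec` in Julia shares memory with the parent array, an existing unbatched transfer kernel can be reused verbatim on `vals`, just with the offset-shifted map.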
Co-authored-by: Michael Klamkin <klamike@users.noreply.github.com>
Force-pushed from 997a31c to edcec16.

@klamike Do you need anything else before I merge the PR?
cc @klamike