WebML Working Group participants
- Introduction
- Goals
- Non-goals
- User research
- Use cases
- Proposed API
- Considered alternatives
- Related work
- Privacy and security considerations
- References
## Introduction

The WebNN API enables web applications to perform ML model inference by constructing a graph representation of the model (`MLGraphBuilder`), compiling it into a native format (`MLGraph`), and executing it via `MLContext.dispatch()`. However, compiling large models for certain devices, such as NPUs, can be time-consuming. This is especially challenging on the web, where compilation must happen on potentially slower end-user devices rather than ahead of time. To address this, we propose an explicit API for caching compiled graphs, allowing web applications to save and reuse them, thereby reducing the overhead of repeated compilation.
This proposal documents ongoing discussions in the W3C WebML Working Group and builds on existing mechanisms in frameworks like ONNX Runtime.
## Goals

- Provide a mechanism for web applications to save and load compiled `MLGraph` objects.
- Reduce the time required for repeated ML model inference by avoiding redundant graph compilation.
- Ensure compatibility with existing WebNN API constructs and workflows.
## Non-goals

- This proposal does not aim to define a universal format for graph serialization across all frameworks.
- It does not address caching mechanisms for non-WebNN APIs or other types of computational graphs.
- Cross-origin model sharing is out of scope.
## User research

[If any user research has been conducted to inform the design choices presented, discuss the process and findings. We strongly encourage that API designers consider conducting user research to verify that their designs meet user needs and iterate on them, though we understand this is not always feasible.]
## Use cases

A web application performing real-time image recognition can save the compiled graph after the first inference. If the page is reloaded, subsequent inferences reuse the cached graph, significantly reducing latency by avoiding both the model redownload and recompilation steps.
## Proposed API

```webidl
partial interface MLContext {
  Promise<sequence<DOMString>> listGraphs();
  Promise<MLGraph> loadGraph(DOMString key);
  Promise<undefined> saveGraph(DOMString key, MLGraph graph);
  undefined deleteGraph(DOMString key);
};
```

- `listGraphs()`: Returns a list of keys for all cached graphs.
- `loadGraph(key)`: Loads the cached graph associated with the given key.
- `saveGraph(key, graph)`: Saves the provided graph under the specified key.
- `deleteGraph(key)`: Deletes the cached graph associated with the given key.
A graph may be evicted from the cache due to storage pressure or browser/platform updates which render previously compiled graphs invalid. Developers should consider the level of durability to be somewhere between IndexedDB and the HTTP cache. [For specification purposes, reuse the Storage standard concepts as applicable.]
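Because a cached graph may be evicted at any time, applications should treat the cache as best-effort and always be prepared to rebuild. A load-or-compile flow under this proposal might look like the following sketch; the key string and the `buildGraph` helper are placeholders, not part of the proposal:

```javascript
// Sketch of a load-or-compile flow using the proposed caching API.
// `buildGraph` stands in for application code that fetches the source
// model and compiles it with MLGraphBuilder; the key is an arbitrary
// application-chosen string.
const MODEL_KEY = "image-classifier-v1";

async function getCompiledGraph(context, buildGraph) {
  // The graph may have been evicted (storage pressure, browser or
  // platform update), so check the cache before loading.
  const keys = await context.listGraphs();
  if (keys.includes(MODEL_KEY)) {
    return context.loadGraph(MODEL_KEY);
  }
  // Cache miss: compile from the source model, then persist for reuse.
  const graph = await buildGraph(context);
  await context.saveGraph(MODEL_KEY, graph);
  return graph;
}
```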
A JS ML framework, such as ONNX Runtime Web, may need to know the input and output operand information (name, shape, and data type) to construct input and output tensors for an inference session. This information is available when the user passes the source model, e.g. an ONNX model. With model caching, the user may pass only the model key, so the framework needs to obtain the operand information from the `MLGraph` itself. It would therefore be necessary to expose the `inputDescriptors` and `outputDescriptors` internal slots of the `MLGraph` interface.
```webidl
partial interface MLGraph {
  record<USVString, MLOperandDescriptor> inputs;
  record<USVString, MLOperandDescriptor> outputs;
};
```

## Considered alternatives

A separate `saveGraph()` API might introduce overhead in some native ML frameworks, such as ONNX Runtime, because the implementation may need to hold the source model in memory and recompile it when user code calls `saveGraph()`.
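Given the proposed `inputs` and `outputs` records on `MLGraph`, a framework could construct I/O tensors for a cached graph without re-parsing the source model. The sketch below assumes those records and uses the existing `MLContext.createTensor()`; the helper name is an assumption:

```javascript
// Sketch: recover tensor descriptors from a cached graph's proposed
// `inputs` record instead of re-parsing the source model.
async function createInputTensors(context, graph) {
  const tensors = {};
  for (const [name, desc] of Object.entries(graph.inputs)) {
    // desc carries the operand's dataType and shape; mark the tensor
    // writable so input data can be uploaded before dispatch.
    tensors[name] = await context.createTensor({ ...desc, writable: true });
  }
  return tensors;
}
```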
An alternative consideration is to have a buildAndSave() method. The implementation can just compile the graph once and drop the source model after the compilation.
```webidl
partial interface MLGraphBuilder {
  Promise<MLGraph> buildAndSave(MLNamedOperands outputs, DOMString key);
};
```

However, a compliant implementation of `build()` could save the compiled model into a temporary file which is deleted unless `saveGraph()` is called later, rendering an explicit `buildAndSave()` unnecessary.
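For comparison, the alternative collapses compilation and persistence into a single call. This sketch assumes the `buildAndSave()` signature above, with `outputs` being the `MLNamedOperands` the application has already defined on the builder:

```javascript
// Sketch of the alternative: compile the graph and persist it under a
// key in one step, letting the implementation drop the source model
// immediately after compilation.
async function compileAndCache(builder, outputs) {
  return builder.buildAndSave(outputs, "image-classifier-v1");
}
```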
GPU shader caching is implicit; the difference is that a shader program is a small input, so it is easy for the site to regenerate the shader and for the browser to hash it for comparison against the cache. ML models, on the other hand, are large because of their weights. Loading all the weights just to discover that a cached version of the model is available would waste time and resources. (via comment)
Furthermore, an ML model can't be compiled without the weights because the implementation may perform device-specific constant folding and memory layout optimizations.
## Related work

ONNX Runtime introduced the EPContext mechanism to encapsulate compiled blobs into ONNX models. This approach inspired the WebNN caching proposal but is tailored to ONNX-specific workflows.
The WebGPU API employs a shader caching mechanism. While similar in concept, it is designed for GPU shaders rather than ML model graphs.
## Privacy and security considerations

To prevent cross-origin data leakage, cached graphs must be partitioned per origin. This ensures that a graph saved by one website cannot be accessed by another.
For security reasons, model compilation and inference will typically happen in sandboxed processes. This introduces implementation challenges, and care must be taken in how the caching mechanism allows data to be read from and written to disk.