Replies: 5 comments 8 replies
-
There are 3 layers to consider in a solution:
Zip layer: docx4j uses commons-compress to unzip the docx; each part is read into a byte array. Typically, if you want to do something with a part containing XML for which we have a JAXB content model, that byte array is unmarshalled via JAXB to Java objects when required. You might wish (maybe in a Phase 2) to investigate other ways of handling the zip layer (i.e. so that each zip entry doesn't consume RAM), but at least the code here does not need to unmarshal the whole document part.
StAX/JAXB processing layer: https://github.com/plutext/docx4j/blob/VERSION_11_5_3/docx4j-samples-docx4j/src/main/java/org/docx4j/samples/ContentControlsViaStAX.java demonstrates manipulating content controls at the docx4j object model level, without unmarshalling the entire document.xml byte array, then outputting to a new docx file.
Handling the actual content controls: the demo code needs to instead use the code in org.docx4j.model.datastorage to perform the binding. Some glue/integration code will be required to do this, but not much, and then you will be done :-)
This approach will handle nested content controls. Obviously you won't save much memory if "most" of the document is within a single content control.
As to how much memory can be saved, it is best just to try it in a profiler on your sample documents. Please let us know the results.
Summary of answers:
Can docx4j handle extracting all nested content controls in a memory-efficient way? Can the memory consumption be estimated?
--> Yes, at least the use of JAXB can be kept to just where it is required to process content controls, skipping over the rest.
Can docx4j modify the docx file with the resulting values for the content controls memory-efficiently?
--> Yes, see above.
In case I have to parse the OOXML with StAX by hand, is there any task that docx4j could be supporting?
In case I have to write the resulting docx "manually": how high is the risk, given that only content controls have to be stripped out of the OOXML, that the resulting docx (with the modified OOXML) is still valid?
--> N/A, the demo uses docx4j to write the docx. As long as you don't add or modify rels (e.g. add images or hyperlinks), the resulting docx should be good.
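To make the zip and StAX layers a bit more concrete, here is a minimal sketch that uses only JDK classes (java.util.zip and javax.xml.stream), not docx4j. It streams the zip, finds the main document part and lists every content control with its nesting depth, without ever building an object tree. The names SdtScanner, scan and scanDocx are made up for illustration, the sketch assumes the main part lives at word/document.xml, and it uses ZipInputStream where docx4j itself uses commons-compress.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper, not a docx4j class: lists content controls by streaming the XML. */
final class SdtScanner {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Streams the zip and scans the entry assumed (!) to hold the main document part. */
    static List<String> scanDocx(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if ("word/document.xml".equals(e.getName())) {
                    return scan(zip); // decompressed on the fly as the parser pulls bytes
                }
            }
            return new ArrayList<>();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Returns one line per w:sdt, indented two spaces per nesting level, showing its w:tag value. */
    static List<String> scan(InputStream documentXml) {
        List<String> found = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // docx parts need no DTDs
            XMLStreamReader r = factory.createXMLStreamReader(documentXml);
            int depth = 0; // number of w:sdt elements currently open
            while (r.hasNext()) {
                int ev = r.next();
                boolean element = ev == XMLStreamConstants.START_ELEMENT
                        || ev == XMLStreamConstants.END_ELEMENT;
                if (!element || !W_NS.equals(r.getNamespaceURI())) {
                    continue;
                }
                String name = r.getLocalName();
                if (ev == XMLStreamConstants.START_ELEMENT && "sdt".equals(name)) {
                    found.add("  ".repeat(depth) + "(untagged)");
                    depth++;
                } else if (ev == XMLStreamConstants.START_ELEMENT && "tag".equals(name) && depth > 0) {
                    // w:tag sits in w:sdtPr, which precedes w:sdtContent, so it always
                    // belongs to the most recently opened w:sdt.
                    found.set(found.size() - 1,
                            "  ".repeat(depth - 1) + r.getAttributeValue(W_NS, "val"));
                } else if (ev == XMLStreamConstants.END_ELEMENT && "sdt".equals(name)) {
                    depth--;
                }
            }
            r.close();
        } catch (XMLStreamException ex) {
            throw new IllegalStateException(ex);
        }
        return found;
    }
}
```

Calling SdtScanner.scanDocx(new FileInputStream("template.docx")) on a real file would give the hierarchy that could be handed to the legacy API. Memory use here is dominated by the parser's buffers rather than by the size of the document, which is the property the StAX layer is after.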
-
19e43c1 is most of a proof of concept (POC) for the binding step. A little more to do...
I will also have a go at the conditions/repeats part.
I have deleted the ContentControlsViaStAX.java example since it is superseded by better code.
conditions that decide whether content X or content Y should be included in the document
-
I will also have a go at the conditions/repeats part.
After the OpenDoPE step, bookmark handling currently causes the document to be unmarshalled, so this needs to be looked into next.
After writing BindingTraverserStAX yesterday, it occurred to me that the fully featured BindingTraverserXSLT could take a StAX source (currently the JAXB objects are marshalled to an org.w3c.dom.Document).
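On the StAX source idea: the JDK's javax.xml.transform API accepts a StAXSource directly, so an XSLT step can pull from an XMLStreamReader instead of being handed an org.w3c.dom.Document. Below is a minimal sketch using only JDK classes. It is not the BindingTraverserXSLT code; the class name StaxXsltSketch, the method countSdt and the toy stylesheet (which just counts w:sdt elements) are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical sketch, not docx4j code: feeds an XSLT transform from a StAX reader. */
final class StaxXsltSketch {
    // Toy stylesheet for illustration only: outputs the number of w:sdt elements as text.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
        + " xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"/\"><xsl:value-of select=\"count(//w:sdt)\"/></xsl:template>"
        + "</xsl:stylesheet>";

    /** Runs the toy stylesheet over the given XML, reading it through a StAXSource. */
    static String countSdt(String documentXml) {
        try {
            XMLInputFactory inputFactory = XMLInputFactory.newInstance();
            inputFactory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            // A freshly created reader is positioned at START_DOCUMENT, as StAXSource requires.
            XMLStreamReader reader =
                inputFactory.createXMLStreamReader(new StringReader(documentXml));
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StAXSource(reader), new StreamResult(out));
            return out.toString();
        } catch (XMLStreamException | TransformerException ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

One caveat: an XSLT 1.0 processor will typically still build its own internal model of the input, so what this avoids is the JAXB objects and the intermediate DOM, not every in-memory copy of the document.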
-
OpenDoPEHandler followed by BindingHandler now works without unmarshalling :-)
BindingTraverserXSLT now supports StAX input, so it is probably best to use this, given that it is feature complete (unless it is still a lot slower than the non-XSLT alternatives, which are missing features).
Note: currently you need to use the ContentControlBindingExtensionsOld sample to see this work without unmarshalling, since an outstanding issue is cloning (i.e. whether to destructively change the input template, or make a clone of it), which the Docx4J facade does. With ContentControlBindingExtensionsOld you can easily see each step in the process and comment it in or out.
In terms of the entire workflow, here is the current status of StAXifying:
-
"Invoice" sample, preliminary unscientific timings:
via the existing JAXB approach:
using StAX:
Now, with the invoice filled with 20 pages of filler content:
via the existing JAXB approach:
using StAX:
What about 396 pages?
via the existing JAXB approach:
using StAX:
Again, these are unscientific: just a single run, with the JVM not warmed up. I also haven't measured the difference in the time taken to save the docx (with StAX, the MainDocumentPart (MDP) doesn't need to be marshalled again), though this will be insignificant.
But with the 396 page docx, the XSLT-based steps (OpenDoPEIntegrity, BindingHandler, OpenDoPEIntegrityAfterBinding, RemovalHandler) are all a fair bit quicker, leading to a noticeable overall speed improvement (roughly 0.6 times the JAXB time).
Interestingly, OpenDoPEHandler is slower, but it's a relatively small component of overall processing time for this very simple sample docx. My hunch is that that part wouldn't get any better with a more complex docx.
I didn't measure memory usage, but there should be some good savings here without the extra JAXB objects.
How many pages are your input templates likely to be? If there are many tables, are these mostly within content controls?
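If anyone wants to repeat the timings a little less unscientifically, a rough sketch of a harness with warm-up runs is below. It is plain JDK code, the names TimingSketch and medianNanos are made up, and it is no substitute for a proper benchmarking tool such as JMH; it only addresses the "single run, JVM not warmed up" caveat.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of docx4j: times a task after some untimed warm-up runs. */
final class TimingSketch {
    /**
     * Runs the task `warmups` times without timing it, then `runs` times with timing,
     * and returns the median elapsed time in nanoseconds.
     */
    static long medianNanos(Runnable task, int warmups, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        for (int i = 0; i < warmups; i++) {
            task.run(); // gives the JIT a chance to compile the hot paths
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2]; // the median is less sensitive to GC pauses than the mean
    }
}
```

For example, medianNanos(() -> runJaxbPipeline(template), 5, 11) compared against the same call for the StAX pipeline, where runJaxbPipeline stands for whatever method wraps the steps being compared (a placeholder, not an existing method).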
-
What am I trying to do?
I want to replace all content controls, including nested ones, within a given docx file. The values are retrieved from a legacy API, which needs all content controls, including their structure, at once to calculate the values. Apart from the content controls, everything in the document has to stay exactly the same (pictures, links, static text / tables, ...).
What is my approach to solve this?
First, I extract all the content controls. Then I pass all of them together, in a format which also represents their hierarchy, to the legacy API. With the resulting values I create a copy of the docx in which I omit the w:sdt / w:sdtContent tags and replace them with either the values or their whole child structure.
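The copy step could look roughly like the sketch below, which uses only the JDK's StAX event API. It copies document.xml event by event, drops the w:sdt and w:sdtContent wrappers together with the whole w:sdtPr / w:sdtEndPr subtrees, and keeps the children untouched. Injecting the values from the legacy API is left out, and the names SdtUnwrapper and unwrap are made up for illustration, so this is a sketch of the idea rather than a finished solution.

```java
import java.io.Reader;
import java.io.Writer;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

/** Hypothetical helper: copies WordprocessingML while stripping content control wrappers. */
final class SdtUnwrapper {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    private static boolean isW(QName name, String local) {
        return W_NS.equals(name.getNamespaceURI()) && local.equals(name.getLocalPart());
    }

    /** Streams from in to out, never holding more than the current event in memory. */
    static void unwrap(Reader in, Writer out) {
        try {
            XMLInputFactory inputFactory = XMLInputFactory.newInstance();
            inputFactory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            XMLEventReader reader = inputFactory.createXMLEventReader(in);
            XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
            int skipDepth = 0; // > 0 while inside a w:sdtPr or w:sdtEndPr subtree
            while (reader.hasNext()) {
                XMLEvent ev = reader.nextEvent();
                if (skipDepth > 0) {
                    if (ev.isStartElement()) skipDepth++;
                    else if (ev.isEndElement()) skipDepth--;
                    continue; // drop everything belonging to the properties subtree
                }
                if (ev.isStartElement()) {
                    QName n = ev.asStartElement().getName();
                    if (isW(n, "sdtPr") || isW(n, "sdtEndPr")) {
                        skipDepth = 1;
                        continue;
                    }
                    if (isW(n, "sdt") || isW(n, "sdtContent")) continue; // drop wrapper, keep children
                } else if (ev.isEndElement()) {
                    QName n = ev.asEndElement().getName();
                    if (isW(n, "sdt") || isW(n, "sdtContent")) continue;
                }
                writer.add(ev);
            }
            writer.flush();
            writer.close();
        } catch (XMLStreamException ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

Because the children of w:sdtContent are, as far as I understand the schema, the same kind of elements that are allowed where the w:sdt itself sits (block-level, run-level, and so on), unwrapping in place should keep the markup valid, but I would still open a few results in Word to check.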
What problems do I have?
What is my Question?
Thank you so much for your help. Any idea or input is highly appreciated.
P.S.: Initially this was a question on SO, which got closed automatically: https://docs.aspose.com/words/java/memory-requirements/#how-much-memory-asposewords-needs
P.P.S.: already a big thanks to Jason for his support on SO!