Replacing all Content Controls (also nested ones) #617

Frueps · 2025-04-04T09:09:55Z

Frueps
Apr 4, 2025

What am I trying to do?

I want to replace all content controls, including nested ones, within a given docx file. The values are retrieved from a legacy API, which needs all content controls, including their structure, at once to calculate the values. Apart from the content controls, everything in the document has to stay exactly the same (pictures, links, static text / tables, ...).

What is my approach to solve this?

First, I extract all the content controls. Then I pass all of them together, in a format that also represents their hierarchy, to the legacy API. With the resulting values I create a copy of the docx in which I omit the w:sdt / w:sdtContent tags and replace them with either the values or their whole child structure.

What problems do I have?

Replacing the content controls runs on a server where many users can start the process in parallel. So memory efficiency is key.
I struggle to find good tooling support. As far as I evaluated my options the best support for handling content controls seems to be here in docx4j. Alternatives which I had a look at include Aspose.Words and Apache POI. Aspose.Words has a crazy amount of memory consumption according to their own calculations (https://docs.aspose.com/words/java/memory-requirements/#how-much-memory-asposewords-needs). Apache POI has only rudimentary and experimental support for content controls according to their own documentation.
The most memory-efficient way I have found so far is to parse the OOXML by hand with a StAX parser. This is really cumbersome, and I am also afraid that I will not produce a valid docx file once I write the result values back into the newly created OOXML copy.
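A minimal sketch of what the hand-rolled StAX extraction could look like, using only the JDK's javax.xml.stream API (no docx4j). It walks document.xml once and records each w:sdt with its w:tag value, using indentation to show nesting. The class name SdtOutline and the "(untagged)" placeholder are made up for illustration, and the sketch assumes that w:tag only occurs inside the w:sdtPr of the innermost open w:sdt.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of docx4j.
class SdtOutline {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** One entry per content control, in document order; two spaces of indent per nesting level. */
    static List<String> outline(String documentXml) {
        List<String> lines = new ArrayList<>();
        Deque<Integer> open = new ArrayDeque<>(); // index in 'lines' of each currently open w:sdt
        try {
            XMLInputFactory factory = XMLInputFactory.newFactory();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
            XMLStreamReader r = factory.createXMLStreamReader(new StringReader(documentXml));
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT && W.equals(r.getNamespaceURI())) {
                    if ("sdt".equals(r.getLocalName())) {
                        open.push(lines.size());
                        lines.add(indent(open.size() - 1) + "(untagged)");
                    } else if ("tag".equals(r.getLocalName()) && !open.isEmpty()) {
                        // Assumption: this w:tag belongs to the innermost open w:sdt.
                        lines.set(open.peek(), indent(open.size() - 1) + r.getAttributeValue(W, "val"));
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && W.equals(r.getNamespaceURI()) && "sdt".equals(r.getLocalName())) {
                    open.pop();
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("document.xml is not well-formed", e);
        }
        return lines;
    }

    static String indent(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) sb.append("  ");
        return sb.toString();
    }
}
```

Only the stack of open controls and the result list are held in memory, so the cost grows with the number of content controls rather than with the size of the document.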

What is my Question?

Can docx4j handle extracting all nested content controls in a memory efficient way? Can the memory consumption be estimated?
Can docx4j modify the docx file with the resulting values for the content controls memory efficiently?
In case I have to parse the OOXML with StAX by hand, is there any task that docx4j could support?
In case I have to write the resulting docx "manually": how high is the risk, given that only content controls have to be stripped out of the OOXML, that the resulting docx (with the modified OOXML) is still valid?

Thank you so much for your help. Any idea or input is highly appreciated.

P.S.: initially this was a Question on SO which got closed automatically: https://docs.aspose.com/words/java/memory-requirements/#how-much-memory-asposewords-needs
P.P.S.: already a big thanks to Jason for his support on SO!

plutext · 2025-04-06T00:58:05Z

plutext
Apr 6, 2025
Maintainer

There are 3 layers to consider in a solution:

the zip file layer
the stax/jaxb processing
handling the content controls

Zip layer: docx4j uses commons-compress to unzip the docx; each part is read into a ByteArray. Typically, if you want to do something with a part which contains XML for which we have a JAXB content model, then that Byte Array is unmarshalled when required via JAXB to Java objects.

You might wish (maybe in a Phase 2) to investigate other ways of handling the zip layer (ie so each zip entry doesn't consume RAM), but at least the code here does not need to unmarshall the whole document part.
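As a rough illustration of a zip layer that does not buffer each entry, the sketch below feeds the main document part straight from a java.util.zip.ZipInputStream into a StAX reader. This is not how docx4j does it (docx4j uses commons-compress and byte arrays, as described above); the class name DocxPartStreamer is hypothetical, and the part name word/document.xml is an assumption, since the real location is defined by the package relationships.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of docx4j.
class DocxPartStreamer {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Counts w:sdt elements in word/document.xml without loading the part into a byte array. */
    static int countContentControls(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (!"word/document.xml".equals(e.getName())) {
                    continue; // other parts are skipped, never buffered
                }
                // The zip stream reports end-of-stream at the end of the current entry,
                // so the parser sees exactly one XML document.
                XMLStreamReader r = XMLInputFactory.newFactory().createXMLStreamReader(zip);
                int count = 0;
                while (r.hasNext()) {
                    if (r.next() == XMLStreamConstants.START_ELEMENT
                            && "sdt".equals(r.getLocalName())
                            && W.equals(r.getNamespaceURI())) {
                        count++;
                    }
                }
                return count;
            }
            return 0; // no main document part found under the assumed name
        } catch (IOException | XMLStreamException ex) {
            throw new IllegalStateException("could not read docx", ex);
        }
    }
}
```

Something like countContentControls(new FileInputStream("template.docx")) would then scan the file with roughly constant memory; whether that matters in practice is best checked in a profiler.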

stax/jaxb processing layer: https://github.com/plutext/docx4j/blob/VERSION_11_5_3/docx4j-samples-docx4j/src/main/java/org/docx4j/samples/ContentControlsViaStAX.java demonstrates manipulating content controls at the docx4j object model level, without unmarshalling the entire document.xml ByteArray, then outputting to a new docx file.

Handling the actual content controls: the demo code needs to instead use the code in org.docx4j.model.datastorage to perform the binding. Some glue/integration code will be required to do this, but not much, and then you will be done :-)

This approach will handle nested content controls. Obviously you won't save much memory if "most" of the document is within a single content control.

As to how much memory can be saved, best just to try it in a profiler on your sample documents. Please let us know results.
Previous experience suggests you may see 1/4 heap usage, and quicker execution.

Summary of answers:-

Can docx4j handle extracting all nested content controls in a memory efficient way? Can the memory consumption be estimated?

--> Yes. At the very least, use of JAXB can be limited to where it is required to process content controls, skipping over the rest

Can docx4j modify the docx file with the resulting values for the content controls memory efficiently?

--> Yes, see above

In case I have to parse the OOXML with StAX by hand, is there any task that docx4j could support?

--> N/A, see https://github.com/plutext/docx4j/blob/VERSION_11_5_3/docx4j-samples-docx4j/src/main/java/org/docx4j/samples/ContentControlsViaStAX.java

In case I have to write the resulting docx "manually": how high is the risk, given that only content controls have to be stripped out of the OOXML, that the resulting docx (with the modified OOXML) is still valid?

--> N/A, demo uses docx4j to write the docx. As long as you don't add or modify rels (eg add images or hyperlinks), the resulting docx should be good.

3 replies

Frueps Apr 10, 2025
Author

Hi Jason,

thank you so much for your time and effort! Also thanks for the code example with StAX and JAXB. My use case is slightly different, because I have to collect all content controls before I can hand them off as a whole collection to the legacy API. But I can easily adapt that part. This also means I have to read the XML twice :/ but unfortunately there is no other way without changing the legacy API.

"Obviously you won't save much memory if "most" of the document is within a single content control."
--> that idea also crossed my mind at some point. Therefore I will take the safest approach and use StAX only to extract all the information from the content controls. That should be fairly easy.

However, writing back the content controls while stripping them out, or changing their visibility to false, is a bit trickier. I will build a prototype and let you know my results.
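For the "stripping them out" part, a bare-bones sketch with the JDK's StAX event API might look like this. It copies every event to the output, drops the w:sdt and w:sdtContent tags themselves, and skips the w:sdtPr and w:sdtEndPr subtrees, so the children of w:sdtContent move up into the place of the control. The class name SdtStripper is invented for this example; it does not insert any values and is not the docx4j approach.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of docx4j.
class SdtStripper {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static String strip(String documentXml) {
        try {
            XMLEventReader in = XMLInputFactory.newFactory()
                    .createXMLEventReader(new StringReader(documentXml));
            StringWriter out = new StringWriter();
            XMLEventWriter writer = XMLOutputFactory.newFactory().createXMLEventWriter(out);
            int skipDepth = 0; // > 0 while inside a w:sdtPr or w:sdtEndPr subtree
            while (in.hasNext()) {
                XMLEvent ev = in.nextEvent();
                if (skipDepth > 0) {
                    if (ev.isStartElement()) skipDepth++;
                    else if (ev.isEndElement()) skipDepth--;
                    continue; // drop everything inside the properties
                }
                String name = wordLocalName(ev);
                if ("sdtPr".equals(name) || "sdtEndPr".equals(name)) {
                    skipDepth = 1; // only a start element can reach this point
                    continue;
                }
                if ("sdt".equals(name) || "sdtContent".equals(name)) {
                    continue; // drop the wrapper tags, keep their children
                }
                writer.add(ev);
            }
            writer.flush();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("document.xml is not well-formed", e);
        }
    }

    /** Local name if the event is a start/end element in the w: namespace, else null. */
    static String wordLocalName(XMLEvent ev) {
        QName q = ev.isStartElement() ? ev.asStartElement().getName()
                : ev.isEndElement() ? ev.asEndElement().getName() : null;
        return q != null && W.equals(q.getNamespaceURI()) ? q.getLocalName() : null;
    }
}
```

One thing to check in a prototype: if a control that gets emptied was the only content of a table cell, the cell still needs at least one paragraph for Word to accept the file.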

plutext Apr 10, 2025
Maintainer

However, writing back the content controls while stripping them out, or changing their visibility to false, is a bit trickier.

--> You should be able to do this without any issues following the ContentControlsViaStAX.java example combined with the code in org.docx4j.model.datastorage

Is there repeating data? For example, invoice line items.

How about conditional content?

Frueps Apr 10, 2025
Author

"Is there repeating data? For example, invoice line items."
--> Yes there is. I will adapt your OpenDoPe approach to our needs, as we also need to store even more additional meta data in a single content control.

How about conditional content?
--> Yes this is also a requirement. Not only for "normal" content but also for repeats. So we need the possibility to repeat 0..N times. The standard behavior of Word is 1..N times as far as I could figure it out.

We will also have additional requirements in the future, for example conditions that do not control whether or not content is included in the document, but conditions that decide whether content X or content Y should be included in the document. (This is only the beginning. We are trying to replace our old system, which used MailMerge within a Word document.)

plutext · 2025-04-12T03:02:41Z

plutext
Apr 12, 2025
Maintainer

19e43c1 is most of a binding step POC done. A little more to do...

I will also have a go at the conditions/repeats part.

I have deleted the ContentControlsViaStAX.java example since it is superseded by better code.

conditions that decide whether content X or content Y should be included in the document
--> you can effectively do "or" already: include content X using condition1==true; include content Y using condition1==false. Or am I missing something?
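To make the "two complementary conditions" idea concrete, here is a small sketch using the JDK's javax.xml.xpath API. It assumes the conditions are XPath expressions evaluated against the bound XML data; the class EitherOrCondition, the sample data and the expressions are invented for illustration, and none of this is docx4j or OpenDoPE API.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of docx4j.
class EitherOrCondition {

    /** Evaluates an XPath expression as a boolean against the given XML data. */
    static boolean holds(String xpath, String dataXml) {
        try {
            XPath xp = XPathFactory.newInstance().newXPath();
            return (Boolean) xp.evaluate(xpath,
                    new InputSource(new StringReader(dataXml)), XPathConstants.BOOLEAN);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath or data: " + xpath, e);
        }
    }

    /**
     * Models two independent conditional controls: one guarded by the condition,
     * the other by its negation. Exactly one of them contributes content.
     */
    static String choose(String condition, String dataXml, String contentX, String contentY) {
        StringBuilder out = new StringBuilder();
        if (holds(condition, dataXml)) out.append(contentX);                // control 1
        if (holds("not(" + condition + ")", dataXml)) out.append(contentY); // control 2
        return out.toString();
    }
}
```

With data such as `<invoice><paid>true</paid></invoice>` and the condition `/invoice/paid = 'true'`, choose returns content X; with `false` in the data it returns content Y. The price compared with a real "if else" is that the condition is written (and evaluated) twice.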

3 replies

Frueps Apr 16, 2025
Author

Of course, you are absolutely right, this is already possible. I was still thinking of our old solution, which could handle this a little more elegantly, i.e. more in an "if else" kind of way.

plutext Apr 17, 2025
Maintainer

https://www.docx4java.org/forums/data-binding-java-f16/conditional-content-binding-ternary-operator-t3146.html

Somebody on your team, or a coincidence? :-)

Frueps Apr 17, 2025
Author

haha, no this is a coincidence. Currently I am working solo on this feature.

plutext · 2025-04-13T02:13:36Z

plutext
Apr 13, 2025
Maintainer

I will also have a go at the conditions/repeats part.
--> As of 44d65e0, the OpenDoPEHandler step can use StAX. A useful part of that commit is https://github.com/plutext/docx4j/blob/VERSION_11_5_3/docx4j-core/src/main/java/org/docx4j/openpackaging/parts/WordprocessingML/SdtStAXHandler.java, which abstracts the "drop down from StAX to JAXB for content control manipulation" bit. Similar abstract classes could be used for StAX to JAXB handling for other objects, if useful (bookmarks?).

After the OpenDoPE step, bookmark handling currently causes the document to be unmarshalled, so this needs to be looked into next.

After writing BindingTraverserStAX yesterday, it occurred to me that the fully featured BindingTraverserXSLT could take a StAX source (currently the JAXB is marshalled to org.w3c.dom.Document).

0 replies

plutext · 2025-04-14T02:11:12Z

plutext
Apr 14, 2025
Maintainer

OpenDoPEHandler then BindingHandler now works without unmarshalling :-)

BindingTraverserXSLT now supports StAX input, so it is probably best to use this given that it is feature complete (unless it is still a lot slower than the non-XSLT alternatives, which are missing features).

Note: currently you need to use the ContentControlBindingExtensionsOld sample to see this work without unmarshalling. The outstanding issue is cloning (i.e. whether to destructively change the input template or make a clone of it), which the Docx4J facade does.

With ContentControlBindingExtensionsOld you can easily see/comment in/out each step in the process.

In terms of the entire workflow, here is the current status of StAXifying:

OpenDoPEHandlerComponents: TODO, not necessary unless you are using components
OpenDoPEHandler: done
OpenDoPEIntegrity: done
BindingHandler: done, with both XSLT and non-XSLT approaches
OpenDoPEIntegrityAfterBinding: done
XsltFinisher (optional): done
RemovalHandler (optional): done
OpenDoPEReverter (optional): TODO

0 replies

plutext · 2025-04-14T03:17:41Z

plutext
Apr 14, 2025
Maintainer

"Invoice" sample preliminary unscientific timings via the existing JAXB approach:

Unmarshalling: 4
OpenDoPEHandler: 72
OpenDoPEIntegrity: 77
BindingHandler.applyBindings: 71
OpenDoPEIntegrityAfterBinding: 7
RemovalHandler: 14

using StAX:

OpenDoPEHandler: 87
OpenDoPEIntegrity: 56
BindingHandler.applyBindings: 65
OpenDoPEIntegrityAfterBinding: 7
RemovalHandler: 15

Now, with the invoice filled with 20 pages of filler content, via the existing JAXB approach:

Unmarshalling: 18
OpenDoPEHandler: 79
OpenDoPEIntegrity: 153
BindingHandler.applyBindings: 164
OpenDoPEIntegrityAfterBinding: 92
RemovalHandler: 81

using StAX:

OpenDoPEHandler: 91
OpenDoPEIntegrity: 105
BindingHandler.applyBindings: 128
OpenDoPEIntegrityAfterBinding: 43
RemovalHandler: 71

What about 396 pages? via the existing JAXB approach:

Unmarshalling: 98
OpenDoPEHandler: 106
OpenDoPEIntegrity: 1106
BindingHandler.applyBindings: 1090
OpenDoPEIntegrityAfterBinding: 910
RemovalHandler: 901

using StAX:

OpenDoPEHandler: 162
OpenDoPEIntegrity: 587
BindingHandler.applyBindings: 656
OpenDoPEIntegrityAfterBinding: 499
RemovalHandler: 507

Again, these are unscientific. Just a single run, JVM not warmed up. And I haven't measured the difference in time saving the docx (with StAX, the MDP doesn't need to be marshalled again), though this will be insignificant.

But with the 396-page docx, the XSLT-based steps (OpenDoPEIntegrity, BindingHandler, OpenDoPEIntegrityAfterBinding, RemovalHandler) are all a fair bit quicker, leading to a noticeable overall speed improvement (about 0.6 times the JAXB time). Interestingly, OpenDoPEHandler is slower, but it's a relatively small component of overall processing time for this very simple sample docx. My hunch is that this part wouldn't get any better with a more complex docx.
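The overall ratio can be checked by simply summing the per-step figures reported above for the 396-page document. The snippet below does that arithmetic; the class name TimingRatio is made up, and the numbers are copied from the post in whatever unit they were measured (the unit is not stated).

```java
import java.util.Locale;

// Hypothetical helper for the arithmetic only.
class TimingRatio {

    static int sum(int... values) {
        int total = 0;
        for (int v : values) total += v;
        return total;
    }

    /** Total StAX time divided by total JAXB time. */
    static double ratio(int[] stax, int[] jaxb) {
        return (double) sum(stax) / sum(jaxb);
    }

    public static void main(String[] args) {
        // 396-page run, JAXB: Unmarshalling, OpenDoPEHandler, OpenDoPEIntegrity,
        // applyBindings, OpenDoPEIntegrityAfterBinding, RemovalHandler
        int[] jaxb = {98, 106, 1106, 1090, 910, 901};
        // 396-page run, StAX: same steps, without the unmarshalling step
        int[] stax = {162, 587, 656, 499, 507};
        System.out.println(sum(jaxb) + " vs " + sum(stax));                   // prints "4211 vs 2411"
        System.out.println(String.format(Locale.ROOT, "%.2f", ratio(stax, jaxb))); // prints "0.57"
    }
}
```

So for this single run the StAX path took about 57% of the JAXB time, in line with the rough 0.6 figure.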

I didn't measure memory usage, but there should be some good savings here without the extra JAXB objects.

How many pages are your input templates likely to be? If there are many tables, are these mostly within content controls?

2 replies

Frueps Apr 16, 2025
Author

Hi Jason,

awesome to see that you got some inspiration out of my question =)

"I didn't measure memory usage, but there should be some good savings here without the extra JAXB objects."
--> I am pretty sure about that as well.

"How many pages are your input templates likely to be?"
--> Difficult to say, because this will run on the server of a large-scale ERP system. Clients working with Word could be doing anything from invoices, to project management, to real estate management, to many other different things. My educated guess is that the average template is less than 10 pages. Our memory problem does not arise from large templates but rather from many users on poor self-hosted server hardware. But in theory a user can build a huge template as well, so we want to be prepared.

"If there are many tables, are these mostly within content controls?"
--> highly likely, yes. As far as I understand, content controls were designed for a single Word user to manually insert some data into the document. But what we need is some mechanism to generate documents which are populated by values from our database.

plutext Apr 17, 2025
Maintainer

awesome to see that you got some inspiration out of my question =)
--> :-) yes, it was an interesting question, and once down the rabbit hole, best to "strike while the iron is hot" (sorry to mix metaphors!)

I subsequently took a quick look at Heap usage in VisualVM, and for those tests at least it was about half.

I asked about tables because these use lots of elements (tr, tc etc), whereas paragraphs can be simple (or not, if they have a lot of direct formatting). If the tables are inside the content controls, then with the StAX approach they will get unmarshalled, so no efficiency gain there.
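One way to get a feel for this on a given template is to count how many XML elements sit inside content controls versus the document as a whole, since only the former would still be unmarshalled. The following JDK-only sketch does that in a single StAX pass; the class name SdtCoverage is hypothetical, and element count is just a crude proxy for the number of JAXB objects.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical diagnostic, not part of docx4j.
class SdtCoverage {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns {elements inside any w:sdt (the w:sdt itself included), total elements}. */
    static int[] count(String documentXml) {
        int sdtDepth = 0;
        int inside = 0;
        int total = 0;
        try {
            XMLStreamReader r = XMLInputFactory.newFactory()
                    .createXMLStreamReader(new StringReader(documentXml));
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    total++;
                    if (isSdt(r)) sdtDepth++;
                    if (sdtDepth > 0) inside++;
                } else if (event == XMLStreamConstants.END_ELEMENT && isSdt(r)) {
                    sdtDepth--;
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("document.xml is not well-formed", e);
        }
        return new int[] {inside, total};
    }

    static boolean isSdt(XMLStreamReader r) {
        return "sdt".equals(r.getLocalName()) && W.equals(r.getNamespaceURI());
    }
}
```

If the first number is close to the second (for example because most tables are wrapped in content controls), the StAX approach would presumably save little memory for that template.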

Looking at the timing measures, it would be nice to avoid OpenDoPEIntegrity and OpenDoPEIntegrityAfterBinding if possible.

OpenDoPEIntegrityAfterBinding only handles bookmarks in plain text content controls. Therefore it could be avoided if we know there are no such bookmarks: either because you know there are none in your input template, or because we checked for these in BindingHandler.applyBindings. I haven't looked into it, but you'd think applyBindings could handle this, obviating the need for OpenDoPEIntegrityAfterBinding entirely. It is only really useful if other unrelated integrity issues surface which it could handle.

OpenDoPEIntegrity does a bit more. It handles:

ID collisions on comment, footnoteReference and endnoteReference
table integrity
removes w15 repeats

so not low hanging fruit.
