ValidationError in TDocumentSchema().load for json from start_document_analysis LAYOUT #169

I used the command below to extract text from a PDF using Textractor:

```python
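# `client` is a boto3 Textract client; Bucket, Name, S3Bucket, S3Prefix
# and KMSKeyId are variables defined elsewhere.
response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': Bucket,
            'Name': Name
        }
    },
    FeatureTypes=['LAYOUT', 'FORMS'],
    OutputConfig={
        'S3Bucket': S3Bucket,
        'S3Prefix': S3Prefix
    },
    KMSKeyId=KMSKeyId
)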
```
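For comparison with the file written to S3, the results of an asynchronous job can also be read back through the API. The following is a minimal sketch, assuming a boto3 Textract client and a job that has already finished; `collect_blocks` and `FakeClient` are hypothetical names used only for illustration, while `get_document_analysis`, `JobStatus`, `Blocks` and `NextToken` come from the Textract API.

```python
def collect_blocks(client, job_id):
    """Gather all Blocks of a finished analysis job by following NextToken."""
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        page = client.get_document_analysis(**kwargs)
        # This sketch does not poll; it expects the job to be done already.
        if page.get("JobStatus") != "SUCCEEDED":
            raise RuntimeError(f"job not finished: {page.get('JobStatus')}")
        blocks.extend(page.get("Blocks", []))
        next_token = page.get("NextToken")
        if not next_token:
            return blocks


class FakeClient:
    """Stand-in for a boto3 Textract client, used only to exercise the loop."""

    def __init__(self):
        self.pages = {
            None: {"JobStatus": "SUCCEEDED", "Blocks": [{"Id": "a"}], "NextToken": "t1"},
            "t1": {"JobStatus": "SUCCEEDED", "Blocks": [{"Id": "b"}]},
        }

    def get_document_analysis(self, **kwargs):
        return self.pages[kwargs.get("NextToken")]


all_blocks = collect_blocks(FakeClient(), "hypothetical-job-id")
print(len(all_blocks))
```

With a real client, `FakeClient()` would be replaced by `boto3.client("textract")` and the job ID by `response["JobId"]` from the call above.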

I took the output file and added a ".json" extension (an optional step). Then I tried to run the CSV data extraction example from the page below.

https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter

```python
import csv
import io
import json

from textractprettyprinter import get_layout_csv_from_trp2
from trp.trp2 import TDocument, TDocumentSchema

with open(<some_test_file>) as input_fp:
    trp2_doc: TDocument = TDocumentSchema().load(json.load(input_fp))
    layout_csv = get_layout_csv_from_trp2(trp2_doc)
    csv_output = io.StringIO()
    csv_writer = csv.writer(csv_output, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for page in layout_csv:
        csv_writer.writerows(page)
    # getvalue() returns the buffered CSV text; printing the StringIO
    # object itself would only show its repr.
    print(csv_output.getvalue())
```

`json.load(input_fp)` works fine, but `TDocumentSchema().load(json.load(input_fp))` throws a `ValidationError`:

```
Cell In[13], line 4
	1 with open("1.json") as input_fp:
	2 	TDocumentSchema().load(json.load(input_fp))

File ~\.conda\envs\python310\lib\site-packages\marshmallow\schema.py:722, in Schema.load(self, data, many, partial, unknown)
	691 def load(
	692 	self,
	693 	data: (
	(...)
	700 unknown: str | None = None,
	701 ):
	702 		"""Deserialize a data structure to an object defined by this schema's fields.
	703
	704 		:param data: The data to deserialize.
	(...)
	720 			if invalid data are passed.
	721			"""
	722 	return self._do_load(
	723 		data, many=many, partial=partial, unknown=unknown, postprocess=True
	724 	)
	
File ~\.conda\envs\python310\lib\site-packages\marshmallow\schema.py:909, in Schema._do_load(self, data, many, partial, unknown, postprocess)
	907 	exc = ValidationError(errors, data=data, valid_data=result)
	908 	self.handle_error(exc, data, many=many, partial=partial)
	909 	raise exc
	911 return result
	
ValidationError: {'Blocks': {0: {'Confidence': ['Field may not be null'], 'Text': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null'].........
```
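The error message above is truncated, so it is hard to see which fields are rejected and how often. A minimal sketch for summarising it, assuming the nested dict shape shown in the traceback (marshmallow exposes this dict as `ValidationError.messages`); `count_failing_fields` is a hypothetical helper and the sample data is made up for illustration:

```python
from collections import Counter


def count_failing_fields(messages):
    """Count how often each field name is flagged across all Blocks."""
    counts = Counter()
    for block_errors in messages.get("Blocks", {}).values():
        # Each entry maps a field name to a list of error strings.
        if isinstance(block_errors, dict):
            counts.update(block_errors.keys())
    return counts


# Illustrative data in the same shape as the error above.
sample_messages = {
    "Blocks": {
        0: {"Confidence": ["Field may not be null."], "Text": ["Field may not be null."]},
        1: {"Confidence": ["Field may not be null."]},
    }
}
summary = count_failing_fields(sample_messages)
for field, count in summary.most_common():
    print(field, count)
```

In practice the input would be `err.messages` from an `except ValidationError as err:` block around the `load` call.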

I tried with both a multi-page PDF and a single-page PDF, but I always get this error.
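Since every message reads "Field may not be null", one possible workaround is to drop null-valued keys from the parsed JSON before handing it to the schema. This is only a sketch under the assumption that the explicit nulls in the S3 output file are what the schema rejects; `drop_nulls` is a hypothetical helper, not part of any Textract library, and the sample document is made up.

```python
def drop_nulls(value):
    """Recursively remove dict keys whose value is None."""
    if isinstance(value, dict):
        return {key: drop_nulls(item) for key, item in value.items() if item is not None}
    if isinstance(value, list):
        # List items are kept as they are; only nested dicts are cleaned.
        return [drop_nulls(item) for item in value]
    return value


# Illustrative input with the kind of null fields named in the error.
sample_doc = {
    "Blocks": [
        {
            "BlockType": "LINE",
            "Text": "Hello",
            "ColumnIndex": None,
            "Geometry": {"BoundingBox": {"Width": 0.5}, "Polygon": None},
        }
    ]
}
cleaned = drop_nulls(sample_doc)
print(cleaned)
```

The cleaned dict could then be passed on as `TDocumentSchema().load(drop_nulls(json.load(input_fp)))`. Whether this is enough for the layout CSV export has not been verified here.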

I also uploaded a document through the console (Amazon Textract --> Bulk Document Uploader --> Upload Documents --> AnalyzeDocument - Layout), took the output 'analyzeDocResponse.json', and followed the same procedure. `TDocumentSchema().load(json.load(input_fp))` worked fine on that file, but `get_layout_csv_from_trp2(trp2_doc)` failed with the error "AttributeError: 'NoneType' object has no attribute 'ids'". I am not looking for a fix for this second error, as I will not be using the console output in production; I only tested it.

--------------------------------------------------------------------------------
Given below are the environment details

Operating System: Windows 11 Pro
Python Version: 3.10.12

amazon-textract-caller==0.2.1
amazon-textract-pipeline-pagedimensions==0.0.9
amazon-textract-prettyprinter==0.1.8
amazon-textract-textractor==1.4.5
amazon-textract-response-parser==1.0.2
marshmallow==3.20.1
textract-trp==0.1.3

Any help to get this error resolved is highly appreciated.
