ValidationError in TDocumentSchema().load for json from start_document_analysis LAYOUT #169

@Risho92

I used the command below to extract text from a PDF with Textract:

response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': Bucket,
            'Name': Name
        }
    },
    FeatureTypes=['LAYOUT', 'FORMS'],
    OutputConfig={
        'S3Bucket': S3Bucket,
        'S3Prefix': S3Prefix
    },
    KMSKeyId=KMSKeyId
)

I took the output file and added a ".json" extension (an optional step). Then I tried to run the example from the page below, which extracts the layout data in CSV format.

https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter

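import csv
import io
import json

from trp.trp2 import TDocument, TDocumentSchema
from textractprettyprinter import get_layout_csv_from_trp2

with open(<some_test_file>) as input_fp:
    # Deserialize the Textract JSON into a TDocument, then build the per-page CSV rows
    trp2_doc: TDocument = TDocumentSchema().load(json.load(input_fp))
    layout_csv = get_layout_csv_from_trp2(trp2_doc)
    csv_output = io.StringIO()
    csv_writer = csv.writer(csv_output, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for page in layout_csv:
        csv_writer.writerows(page)
    # getvalue() returns the buffered CSV text; printing the StringIO object only shows its repr
    print(csv_output.getvalue())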

json.load(input_fp) works fine, but TDocumentSchema().load(json.load(input_fp)) raises a ValidationError:
Cell In[13], line 4
	1 with open("1.json") as input_fp:
	2 	TDocumentSchema().load(json.load(input_fp))

File ~\.conda\envs\python310\lib\site-packages\marshmallow\schema.py:722, in Schema.load(self, data, many, partial, unknown)
	691 def load(
	692 	self,
	693 	data: (
	(...)
	700 unknown: str | None = None,
	701 ):
	702 		"""Deserialize a data structure to an object defined by this schema's fields.
	703
	704 		:param data: The data to deserialize.
	(...)
	720 			if invalid data are passed.
	721			"""
	722 	return self._do_load(
	723 		data, many=many, partial=partial, unknown=unknown, postprocess=True
	724 	)
	
File ~\.conda\envs\python310\lib\site-packages\marshmallow\schema.py:909, in Schema._do_load(self, data, many, partial, unknown, postprocess)
	907 	exc = ValidationError(errors, data=data, valid_data=result)
	908 	self.handle_error(exc, data, many=many, partial=partial)
	909 	raise exc
	911 return result
	
ValidationError: {'Blocks': {0: {'Confidence': ['Field may not be null'], 'Text': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null'].........
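
Since the error output above is truncated, a small helper can summarize which fields fail and how often. This is only a sketch: `count_failing_fields` is a hypothetical name, and it assumes the error has the same nested shape as the message above (block index, then field name, then a list of messages).

```python
from collections import Counter


def count_failing_fields(messages):
    """Count how often each field name appears in a nested error dict."""
    counts = Counter()

    def walk(node):
        if not isinstance(node, dict):
            return
        for key, value in node.items():
            if isinstance(value, dict):
                # Intermediate level such as 'Blocks' or a block index: descend
                walk(value)
            else:
                # Leaf level: key is a field name, value is its list of messages
                counts[key] += 1

    walk(messages)
    return counts


# Illustrative sample shaped like the error above (not real output)
sample = {
    "Blocks": {
        0: {"Confidence": ["Field may not be null."], "Text": ["Field may not be null."]},
        1: {"Confidence": ["Field may not be null."]},
    }
}
print(count_failing_fields(sample))
```

With marshmallow, the nested dict is available as the `messages` attribute of the caught `ValidationError`, so the helper could be called as `count_failing_fields(err.messages)` inside an `except ValidationError as err:` block.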

I tried with both a multi-page PDF and a single-page PDF, but I always get this error.
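
The "Field may not be null" messages suggest that the file written to S3 contains keys with explicit null values. One possible workaround, untested against this exact file, is to drop those keys before handing the data to the schema. `strip_nulls` below is a hypothetical helper, not part of any of the Textract libraries.

```python
def strip_nulls(value):
    """Recursively remove dict keys whose value is None."""
    if isinstance(value, dict):
        return {k: strip_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        # Keep list items in place; only clean the dicts inside them
        return [strip_nulls(v) for v in value]
    return value


# Illustrative block shaped like a Textract response fragment (not real output)
raw = {
    "Blocks": [
        {"BlockType": "PAGE", "Confidence": None, "Text": None, "Id": "a"},
        {"BlockType": "LINE", "Confidence": 99.5, "Text": "hello", "Id": "b"},
    ]
}
cleaned = strip_nulls(raw)
print(cleaned)
```

The cleaned dict could then be passed on, for example `TDocumentSchema().load(strip_nulls(json.load(input_fp)))`. Whether that is enough for the rest of the pipeline here is an assumption I have not verified.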

I also uploaded a document through the console (Amazon Textract --> Bulk Document Uploader --> Upload Documents --> AnalyzeDocument - Layout), took the output 'analyzeDocResponse.json', and followed the same procedure. On that file, TDocumentSchema().load(json.load(input_fp)) worked fine, but get_layout_csv_from_trp2(trp2_doc) failed with "AttributeError: 'NoneType' object has no attribute 'ids'". I am not looking for a fix for this second error, as I will not use the console output in production; I only tested it.


Given below are the environment details

Operating System: Windows 11 Pro
Python Version: 3.10.12

amazon-textract-caller==0.2.1
amazon-textract-pipeline-pagedimensions==0.0.9
amazon-textract-prettyprinter==0.1.8
amazon-textract-textractor==1.4.5
amazon-textract-response-parser==1.0.2
marshmallow==3.20.1
textract-trp==0.1.3

Any help to get this error resolved is highly appreciated.
