This package is intended to replace the Apario Contribution project, from which most of this code is taken and on which it builds. That first iteration was necessary because it was designed specifically to ingest the JFK Files from the official National Archives .gov site. All official. All legit, with metadata XLSX analysis support.
This has expanded into the Apario Writer project, whose goal is to take resources and prepare them for consumption by the Apario Reader application. The writer is responsible for generating what the reader presents to end users at the configured domain name.
apario-writer \
--download-pdf-url "https://www.cia.gov/readingroom/docs/CIA-RDP96-00788R001500160012-7.pdf" \
--database-directory "/idoread.com-data/stargate-tmp" \
--pdf-title "STATEMENT BEFORE THE INVESTIGATIONS SUBCOMMITTEE HOUSE ARMED SERVICES" \
--metadata-json "{\"Collection\":\"STARGATE\",\"Released At\":\"2004-05-17\",\"Created At\":\"2016-11-04\"}" \
--log "/var/log/idoread.com/apario-writer.log"
NOTE:
- The --database-directory flag is required on every command unless a --config config.yaml defines it elsewhere.
- The --pdf-title flag is required when using --download-pdf-url.
- The --metadata-json flag is always optional, but its value must be a flattened map[string]string of data only (KEY=VALUE list).
- The --log flag, when omitted, assumes a local directory called logs exists in which to write an engine-*.log file.
- In addition to --download-pdf-url, other input options can be used. Currently XLSX and CSV uploads are permitted; however, the header/column titles are fixed and must be defined according to specifications. They were originally designed for the STARGATE files and the JFK Assassination records.
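The flat-map requirement on --metadata-json is easy to violate when the command is assembled by a script. The following is a minimal Python sketch, not part of apario-writer itself, that rejects nested or non-string metadata before the command is built. The helper names build_metadata_json and build_command are hypothetical, and only flags shown in this README are used.

```python
import json
import shlex


def build_metadata_json(metadata):
    """Serialize metadata for --metadata-json.

    Raises TypeError unless metadata is a flat dict of str -> str,
    mirroring the map[string]string requirement described above.
    """
    if not isinstance(metadata, dict):
        raise TypeError("metadata must be a dict")
    for key, value in metadata.items():
        if not isinstance(key, str) or not isinstance(value, str):
            raise TypeError(f"metadata entry {key!r} must map a string to a string")
    return json.dumps(metadata)


def build_command(url, title, database_directory, metadata=None, log_path=None):
    """Assemble the argument list for a single-PDF run (hypothetical helper)."""
    argv = [
        "apario-writer",
        "--download-pdf-url", url,
        "--database-directory", database_directory,
        "--pdf-title", title,  # required together with --download-pdf-url
    ]
    if metadata:
        argv += ["--metadata-json", build_metadata_json(metadata)]
    if log_path:
        argv += ["--log", log_path]
    return argv


if __name__ == "__main__":
    argv = build_command(
        "https://www.cia.gov/readingroom/docs/CIA-RDP96-00788R001500160012-7.pdf",
        "STATEMENT BEFORE THE INVESTIGATIONS SUBCOMMITTEE HOUSE ARMED SERVICES",
        "/idoread.com-data/stargate-tmp",
        metadata={"Collection": "STARGATE", "Released At": "2004-05-17"},
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Passing the list to subprocess.run(argv) avoids hand-escaping the JSON quotes the way the shell example has to.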
Output:
/idoread.com-data/stargate-tmp/<checksum of url>/CIA-RDP96-00788R001500160012-7.pdf
/idoread.com-data/stargate-tmp/<checksum of url>/record.json
/idoread.com-data/stargate-tmp/<checksum of url>/extracted.json
/idoread.com-data/stargate-tmp/<checksum of url>/pages/
/idoread.com-data/stargate-tmp/<checksum of url>/pages/CIA-RDP96-00788R001500160012-7_page_1.pdf
/idoread.com-data/stargate-tmp/<checksum of url>/pages/ocr.0000001.txt
/idoread.com-data/stargate-tmp/<checksum of url>/pages/page.000001.json
/idoread.com-data/stargate-tmp/<checksum of url>/pages/page.dark.0000001.original.jpg
/idoread.com-data/stargate-tmp/<checksum of url>/pages/page.dark.0000001.large.jpg
/idoread.com-data/stargate-tmp/<checksum of url>/pages/page.dark.0000001.medium.jpg
/idoread.com-data/stargate-tmp/<checksum of url>/pages/page.dark.0000001.small.jpg
/idoread.com-data/stargate-tmp/<checksum of url>/pages/page.dark.0000001.social.jpg
/idoread.com-data/stargate-tmp/<checksum of url>/pages/page.light.0000001.original.jpg
/idoread.com-data/stargate-tmp/<checksum of url>/pages/page.light.0000001.large.jpg
/idoread.com-data/stargate-tmp/<checksum of url>/pages/page.light.0000001.medium.jpg
/idoread.com-data/stargate-tmp/<checksum of url>/pages/page.light.0000001.small.jpg
/idoread.com-data/stargate-tmp/<checksum of url>/pages/page.light.0000001.social.jpg
This is the default intended usage of the apario-writer application.
- Currently the page.<dark|light>.#######.social.jpg is not created in the pipeline.
- No <basename-no-extension>.dark.pdf file is created. Only the original PDF is downloaded and replaced with OCR.
- Extracted text may come from a PDF file whose keywords are longer than 17 characters. If so, the keywords are concatenated into the extracted text.
- Not tested on Windows, as there are a lot of runtime requirements. Tested on macOS and Rocky Linux.
- Using the Docker container wrapper requires knowledge of Docker beyond the "hello world" level.
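To check a finished run against the output listing, a short Python sketch can generate the expected page image names and report which are missing. It assumes the page.<dark|light>.#######.<size>.jpg pattern with a seven-digit, zero-padded page number, as inferred from the listing above. The function names are hypothetical, and the social size is skipped by default because it is not currently created.

```python
from pathlib import Path

THEMES = ("dark", "light")
SIZES = ("original", "large", "medium", "small", "social")


def page_image_names(page_number, include_social=False):
    """Return the expected JPEG file names for one page.

    The seven-digit padding is inferred from the listing, not from the source code.
    """
    names = []
    for theme in THEMES:
        for size in SIZES:
            if size == "social" and not include_social:
                continue  # social images are not produced by the pipeline yet
            names.append(f"page.{theme}.{page_number:07d}.{size}.jpg")
    return names


def missing_page_images(pages_dir, page_number, include_social=False):
    """List expected image names that do not exist inside pages_dir."""
    pages = Path(pages_dir)
    return [
        name
        for name in page_image_names(page_number, include_social)
        if not (pages / name).is_file()
    ]


if __name__ == "__main__":
    # Replace the placeholder with the checksum directory the writer created.
    pages_dir = "/idoread.com-data/stargate-tmp/<checksum of url>/pages"
    for name in missing_page_images(pages_dir, 1):
        print("missing:", name)
```

The checksum directory is passed in rather than computed, since the checksum algorithm is not stated here.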
This software is released under the GPL-3 Open Source license.