6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,6 @@
## PAVES 0.8.0: 2025-12-31

- Correct various problems with table detection
- Add more optional table detectors
- Add preliminary table structure prediction
- Remove "bears" API that I never used for anything
260 changes: 24 additions & 236 deletions README.md
@@ -1,247 +1,35 @@
# PAVÉS: Bajo los adoquines, la PLAYA 🏖️

[**PLAYA**](https://github.com/dhdaines/playa) is intended
to get objects out of PDF, with no
dependencies or further analysis. So, over top of **PLAYA** there is
**PAVÉS**: "**P**DF, **A**nalyse et **V**isualisation ... plus
avancé**ES**", I guess?
[**PLAYA**](https://github.com/dhdaines/playa) is intended to get
objects out of PDF, with no dependencies or further analysis. So,
over top of **PLAYA**, this package provides **P**DF, **A**nalyse et
**V**isualisation simplifi**É**e**S**.

Anything that deviates from the core mission of "getting objects out
of PDF" goes here, so, hopefully, more interesting analysis and
extraction that may be useful for all of you AI Bros doing
"Partitioning" and "Retrieval-Assisted-Generation" and suchlike
things. But specifically, visualization stuff inspired by the "visual
debugging" features of `pdfplumber` but not specifically tied to its
data structures and algorithms.
Or, if you prefer, **P**DF **A**nalysis and **V**isualization for
dummi**ES**.

There will be dependencies. Oh, there will be dependencies.
The goal here is not to provide elaborate, enterprise-grade,
battle-tested, cloud and AI-native, completely configurable and
confoundingly complex classes for ETL. It's to give you some helpful
functions that you can use to poke around in PDFs and get useful
things out of them, often but not exclusively in the context of a
Jupyter notebook.

## Installation

```console
pip install paves
```

## Looking at Stuff in a PDF

When poking around in a PDF, it is useful not simply to read
descriptions of objects (text, images, etc) but also to visualise them
in the rendered document. `pdfplumber` is quite nice for this, though
it is oriented towards the particular set of objects that it can
extract from the PDF.

The primary goal of [PLAYA-PDF](https://dhdaines.github.io/playa)
is to give access to all the objects and
particularly the metadata in a PDF. One goal of PAVÉS (because there
are a few) is to give an easy way to visualise these objects and
metadata.

First, maybe you want to just look at a page in your Jupyter notebook.
Okay!

```python
import playa, paves.image as pi
pdf = playa.open("my_awesome.pdf")
page = pdf.pages[3]
pi.show(page)
```

Something quite interesting to do is, if your PDF contains a logical
structure tree, to look at the bounding boxes of the contents of those
structure elements for a given page:

```python
pi.box(page.structure)
```

![Structure Elements](./docs/page3-elements.png)

Note however that this only gives you the elements associated with
*marked content sections*, which are the leaf nodes of the structure
tree. So, you can also search up the structure tree to find things
like tables, figures, or list items:

```python
pi.box(page.structure.find_all("Table"))
pi.box(page.structure.find_all("Figure"))
pi.box(page.structure.find_all("LI"))
```

You can even search with regular expressions, to find headers for
instance:

```python
import re
pi.box(page.structure.find_all(re.compile(r"H\d+")))
```
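To see which structure types a pattern like `H\d+` would select, here is a minimal standalone sketch. It does not call PLAYA or PAVÉS: the tag list is made up, and it assumes the pattern has to match the whole tag name (how `find_all` actually applies the pattern is not confirmed here).

```python
import re

# Hypothetical structure types, as they might appear in a tagged PDF.
tags = ["H1", "H2", "P", "Table", "H", "TH", "H6", "LI"]

header_pattern = re.compile(r"H\d+")

# Assumption: the pattern must match the entire tag name.
headers = [tag for tag in tags if header_pattern.fullmatch(tag)]
print(headers)
```

Note that the unnumbered `H` type is not selected by this pattern, since `\d+` requires at least one digit.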

Alternately, if you have annotations (such as links), you can look at
those too:

```python
pi.box(page.annotations)
```

![Annotations](./docs/page2-annotations.png)

You can of course draw boxes around individual PDF objects, or
one particular sort of object, or filter them with a generator
expression:

```python
pi.box(page) # outlines everything
pi.box(page.texts)
pi.box(page.images)
pi.box(t for t in page.texts if "spam" in t.chars)
```

Alternately you can "highlight" objects by overlaying them with a
semi-transparent colour, which otherwise works the same way:

```python
pi.mark(page.images)
```

![Annotations](./docs/page298-images.png)

If you wish you can give each type of object a different colour:

```python
pi.mark(page, color={"text": "red", "image": "blue", "path": "green"})
```

![Annotations](./docs/page298-colors.png)

You can also add outlines and labels around the highlighting:

```python
pi.mark(page, outline=True, label=True,
        color={"text": "red", "image": "blue", "path": "green"})
```

![Annotations](./docs/page298-outlines.png)
See the [documentation](https://dhdaines.github.io/paves) for more
information. There will also be some helpful notebooks soon, to help
you.

By default, PAVÉS will assign a new colour to each distinct label based
on a colour cycle [borrowed from
Matplotlib](https://matplotlib.org/stable/gallery/color/color_cycle_default.html)
(no actual Matplotlib was harmed in the making of this library). You
can use Matplotlib's colour cycles if you like:

```python
import matplotlib
pi.box(page, color=matplotlib.color_sequences["Dark2"])
```

![Color Cycles](./docs/page2-color-cycles.png)

Or just any list (it must be a `list`) of color specifications (which
are either strings, 3-tuples of integers in the range `[0, 255]`, or
3-tuples of floats in the range `[0.0, 1.0]`):

```python
pi.mark(page, color=["blue", "magenta", (0.0, 0.5, 0.32), (233, 222, 111)], labelfunc=repr)
```

![Cycle Harder](./docs/page298-color-cycles.png)

(yes, that just cycles through the colors for each new object)
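The cycling behaviour described above can be illustrated with a small standalone sketch. This is not the PAVÉS implementation: `normalize_color` and `assign_colors` are hypothetical helpers, and the choice to convert integer tuples to floats is an assumption made only for the example.

```python
from itertools import cycle

def normalize_color(spec):
    """Hypothetical helper: keep names as-is, map RGB tuples to floats."""
    if isinstance(spec, str):
        return spec
    if all(isinstance(c, int) for c in spec):
        # Integers are taken to be in the range [0, 255].
        return tuple(c / 255 for c in spec)
    # Floats are taken to be in the range [0.0, 1.0] already.
    return tuple(float(c) for c in spec)

def assign_colors(labels, colors):
    """Give each distinct label the next colour, wrapping around."""
    palette = cycle(colors)
    assigned = {}
    for label in labels:
        if label not in assigned:
            assigned[label] = normalize_color(next(palette))
    return assigned

colors = ["blue", "magenta", (0.0, 0.5, 0.32), (233, 222, 111)]
assigned = assign_colors(["text", "image", "text", "path", "a", "b"], colors)
print(assigned)
```

With four colours and five distinct labels, the fifth label wraps around and receives the first colour again.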

## Working in the PDF mine

`pdfminer.six` is widely used for text extraction and layout analysis
due to its liberal licensing terms. Unfortunately it is quite slow
and contains many bugs. Now you can use PAVÉS instead:

```python
from paves.miner import extract, LAParams

laparams = LAParams()
for page in extract(path, laparams):
    ...  # do something with each page
```

This is generally faster than `pdfminer.six`. You can often make it
even faster on large documents by running in parallel with the
`max_workers` argument, which is the same as the one you will find in
`concurrent.futures.ProcessPoolExecutor`. If you pass `None` it will
use all your CPUs, but due to some unavoidable overhead, it usually
doesn't help to use more than 2-4:

```python
for page in extract(path, laparams, max_workers=2):
    ...  # do something with each page
```
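Since more than 2-4 workers rarely helps, one way to choose a value is to cap the CPU count. The sketch below is only an illustration: `pick_workers` and its default cap of 4 are hypothetical and not part of PAVÉS.

```python
import os

def pick_workers(n_pages, cap=4):
    """Hypothetical helper: never more workers than pages, CPUs, or the cap."""
    cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(n_pages, cpus, cap))

workers = pick_workers(500)
print(workers)
# The result could then be passed along, for example:
# for page in extract(path, laparams, max_workers=workers): ...
```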

There are a few differences with `pdfminer.six` (some might call them
bug fixes):

- By default, if you do not pass the `laparams` argument to `extract`,
no layout analysis at all is done. This is different from
`extract_pages` in `pdfminer.six` which will set some default
parameters for you. If you don't see any `LTTextBox` items in your
`LTPage` then this is why!
- Rectangles are recognized correctly in some cases where
`pdfminer.six` thought they were "curves".
- Colours and colour spaces are the PLAYA versions, which do not
correspond to what `pdfminer.six` gives you, because what
`pdfminer.six` gives you is not useful and often wrong.
- You have access to the list of enclosing marked content sections in
every `LTComponent`, as the `mcstack` attribute.
- Bounding boxes of rotated glyphs are the actual bounding box.

Probably more... but you didn't use any of that stuff anyway, you just
wanted to get `LTTextBoxes` to feed to your hallucination factories.

## PLAYA Bears

[PLAYA](https://github.com/dhdaines/playa) has a nice "lazy" API which
is efficient but does take a bit of work to use. If, on the other
hand, **you** are lazy, then you can use `paves.bears`, which will
flatten everything for you into a friendly dictionary representation
(but it is a
[`TypedDict`](https://typing.readthedocs.io/en/latest/spec/typeddict.html#typeddict))
which, um, looks a lot like what `pdfplumber` gives you, except
possibly in a different coordinate space, as defined [in the PLAYA
documentation](https://github.com/dhdaines/playa#an-important-note-about-coordinate-spaces).

```python
from paves.bears import extract

for dic in extract(path):
    print(f"it is a {dic['object_type']} at ({dic['x0']}, {dic['y0']})")
    print(f" the color is {dic['stroking_color']}")
    print(f" the text is {dic['text']}")
    print(f" it is in MCS {dic['mcid']} which is a {dic['tag']}")
    print(f" it is also in Form XObject {dic['xobjid']}")
```

This can be used to do machine learning of various sorts. For
instance, you can write `page.layout` to a CSV file:

```python
from csv import DictWriter

from paves.bears import FIELDNAMES, extract

# outfh is assumed to be an open, writable text file
writer = DictWriter(outfh, fieldnames=FIELDNAMES)
writer.writeheader()
for dic in extract(path):
    writer.writerow(dic)
```

you can also create a Pandas DataFrame:

```python
df = pandas.DataFrame.from_records(extract(path))
```

or a Polars DataFrame or LazyFrame:

```python
from paves.bears import SCHEMA
## Installation

df = polars.DataFrame(extract(path), schema=SCHEMA)
```
Install it from PyPI (as `paves`) with `pip` or `uv`, preferably in a
virtual environment. That's all. If you want to play around in the
source code you can use `hatch` or `uv` (your choice), for instance:

As above, you can use multiple CPUs with `max_workers`, and this will
scale considerably better than `paves.miner`.
    # with hatch
    hatch shell
    # with uv
    uv venv
    . .venv/bin/activate

## License

46 changes: 0 additions & 46 deletions benchmarks/miner.py

This file was deleted.

32 changes: 29 additions & 3 deletions docs/index.md
@@ -1,3 +1,29 @@
{%
include-markdown "../README.md"
%}
# PAVÉS: Bajo los adoquines, la PLAYA 🏖️

[**PLAYA**](https://github.com/dhdaines/playa) is intended to get
objects out of PDF, with no dependencies or further analysis. So,
over top of **PLAYA**, this package provides **P**DF, **A**nalyse et
**V**isualisation simplifi**É**e**S**.

Or, if you prefer, **P**DF **A**nalysis and **V**isualization for
dummi**ES**.

The goal here is not to provide elaborate, enterprise-grade,
battle-tested, cloud and AI-native, completely configurable and
confoundingly complex classes for ETL. It's to give you some helpful
functions that you can use to poke around in PDFs and get useful
things out of them, often but not exclusively in the context of a
Jupyter notebook.

## Installation

Install it from PyPI (as `paves`) with `pip` or `uv`, preferably in a
virtual environment. That's all. If you want to play around in the
source code you can use `hatch` or `uv` (your choice).

## Quick start

## License

`PAVÉS` is distributed under the terms of the
[MIT](https://spdx.org/licenses/MIT.html) license.