is it possible to output regular files instead of warc? #228

@ftc2

i only want files, not warc.

can grab-site output regular files (like html and images) for me like wget can? (links must be converted to relative links)

side question: has anyone here actually had good results with getting files back out of warc? this wouldn't be such a big deal if that were possible. i've never seen a util that can extract files from warcs with a 100% success rate (and they're usually insanely slow).

i've tried:

  • jwat-tools: seemed the best coded of the bunch but gave me nonsensical filenames like extracted.001, and idk how to get past that
  • warcat: slow and fails on many warcs
  • warc-extractor: the easiest to use of the bunch (it can hit a bunch of warcs in a single dir), but it's insanely slow, and it also fails on many warcs
  • the unarchiver: fails on some warcs
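
for reference, here is a rough sketch of what "getting files back out of a warc" involves, using only the python standard library. it is not how grab-site or any of the tools above work. the function names and the wget-like path scheme (`host/path`, with `index.html` for directory urls) are my own assumptions for illustration. it only handles `response` records and does not undo chunked transfer-encoding or content-encoding, and it does not rewrite links, so it is nowhere near a 100% extractor.

```python
import gzip
import io
from urllib.parse import urlsplit


def iter_warc_records(stream: io.BufferedIOBase):
    """Yield (headers, block) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Content-Length gives the size of the record block in bytes.
        yield headers, stream.read(int(headers["content-length"]))


def url_to_relpath(url):
    """Map a URL to a relative path (hypothetical wget-like naming scheme)."""
    parts = urlsplit(url)
    path = parts.path
    if path == "" or path.endswith("/"):
        path += "index.html"
    # Drop empty, "." and ".." segments so nothing escapes the output dir.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    return "/".join([parts.netloc] + segments)


def extract_responses(stream):
    """Return {relative_path: payload_bytes} for every 'response' record."""
    files = {}
    for headers, block in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue  # skip warcinfo, request, metadata, revisit, ...
        # The block is a raw HTTP response: status line + headers, blank
        # line, then the payload. Later captures of a URL overwrite earlier.
        _, _, payload = block.partition(b"\r\n\r\n")
        files[url_to_relpath(headers["warc-target-uri"])] = payload
    return files


def open_warc(path):
    """Open a .warc or .warc.gz file for reading as bytes."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")
```

usage would be something like `extract_responses(open_warc("site.warc.gz"))`, then writing each payload to its relative path. `gzip.open` reads through the concatenated gzip members that `.warc.gz` files are normally made of. note that this keeps every payload in memory, so a real tool would write files out as it goes.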
