
hocr2pdf

Convert HOCR data into PDFs with integrated image support

hocr2pdf converts HOCR (HTML-based OCR) documents into PDF format, combining the recognized text with the associated image. It is intended for creating searchable PDFs from OCR data and page images, such as scanned documents.

Installing / Getting started

To get started with hocr2pdf, you need Go, Git, and make installed on your machine.

  1. Clone the repository:

    $ git clone https://winlogon.ddns.net/winlogon/hocr2pdf.git
    $ cd hocr2pdf/
    
  2. Build the project:

    $ make
    
  3. Run the application:

    $ ./hocr2pdf -hocr path/to/your.hocr -image path/to/your-image.png -pdf output.pdf
    

    This command generates a PDF named output.pdf from the HOCR file and image provided.

Initial Configuration

No additional initial configuration is required beyond the standard Go setup and dependencies.

Developing

To contribute to hocr2pdf, clone the repository:

$ git clone https://winlogon.ddns.net/winlogon/hocr2pdf.git
$ cd hocr2pdf/

Building

After making code changes, you can build the project with:

$ make

This command compiles the source code into an executable named hocr2pdf.

Deploying / Publishing

To deploy or distribute the project, distribute the built hocr2pdf binary. When publishing to a server, include the executable in your deployment package.

Features

  • Convert HOCR to PDF: Takes HOCR data and an image file to produce a PDF.
  • Bounding box parsing: Extracts text coordinates from HOCR data for accurate placement.
  • Text extraction: Converts HOCR document text into a plain text string for use in PDFs.
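The bounding box and text extraction features above can be illustrated with a minimal sketch. It is not taken from the project's source: the names `BBox`, `Word`, `parseBBox`, `extractWords`, and `plainText` are hypothetical, and it assumes the input follows the usual hOCR convention in which each word is an element with class `ocrx_word` whose `title` attribute carries a `bbox x0 y0 x1 y1` property.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// BBox is a pixel rectangle as written in an hOCR title attribute.
type BBox struct{ X0, Y0, X1, Y1 int }

// Word pairs one recognized word with its bounding box.
type Word struct {
	Text string
	Box  BBox
}

// parseBBox finds the "bbox x0 y0 x1 y1" property in a title attribute.
// Properties are separated by semicolons, e.g. "bbox 10 20 60 40; x_wconf 95".
func parseBBox(title string) (BBox, error) {
	for _, prop := range strings.Split(title, ";") {
		f := strings.Fields(prop)
		if len(f) != 5 || f[0] != "bbox" {
			continue
		}
		var v [4]int
		for i := range v {
			n, err := strconv.Atoi(f[i+1])
			if err != nil {
				return BBox{}, fmt.Errorf("bad bbox value %q: %w", f[i+1], err)
			}
			v[i] = n
		}
		return BBox{v[0], v[1], v[2], v[3]}, nil
	}
	return BBox{}, fmt.Errorf("no bbox in title %q", title)
}

// attr returns the value of the named attribute, or "" if it is absent.
func attr(e xml.StartElement, name string) string {
	for _, a := range e.Attr {
		if a.Name.Local == name {
			return a.Value
		}
	}
	return ""
}

// extractWords walks the markup and collects every ocrx_word element.
// Assumption: the class attribute is exactly "ocrx_word".
func extractWords(hocr string) ([]Word, error) {
	d := xml.NewDecoder(strings.NewReader(hocr))
	d.Strict = false // tolerate HTML-style input
	d.AutoClose = xml.HTMLAutoClose
	d.Entity = xml.HTMLEntity

	var words []Word
	var cur *Word // non-nil while inside a word element
	depth := 0    // nesting depth of child elements inside the word
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return words, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if cur != nil {
				depth++
			} else if attr(t, "class") == "ocrx_word" {
				box, err := parseBBox(attr(t, "title"))
				if err != nil {
					return nil, err
				}
				cur = &Word{Box: box}
			}
		case xml.CharData:
			if cur != nil {
				cur.Text += string(t)
			}
		case xml.EndElement:
			if cur == nil {
				continue
			}
			if depth > 0 {
				depth--
				continue
			}
			cur.Text = strings.TrimSpace(cur.Text)
			words = append(words, *cur)
			cur = nil
		}
	}
}

// plainText joins the recognized words into a single space-separated string.
func plainText(words []Word) string {
	parts := make([]string, len(words))
	for i, w := range words {
		parts[i] = w.Text
	}
	return strings.Join(parts, " ")
}

func main() {
	const sample = `<div class='ocr_page' title='bbox 0 0 200 100'>` +
		`<span class='ocrx_word' title='bbox 10 20 60 40; x_wconf 95'>Hello</span> ` +
		`<span class='ocrx_word' title='bbox 70 20 130 40; x_wconf 91'>world</span></div>`
	words, err := extractWords(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(plainText(words), words)
}
```

The coordinates are kept as integers in image pixels; a real converter would still need to scale them to PDF units, which this sketch leaves out.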

Configuration

The application uses command-line arguments for configuration:

| Argument | Type | Default | Description | Example |
|---|---|---|---|---|
| -hocr | String | "" | Path to the HOCR file to process. | ./hocr2pdf -hocr myfile.hocr -image myimage.png -pdf output.pdf |
| -image | String | "" | Path to the image file to be included in the PDF. | ./hocr2pdf -hocr myfile.hocr -image myimage.png -pdf output.pdf |
| -pdf | String | "" | Path to the output PDF file. | ./hocr2pdf -hocr myfile.hocr -image myimage.png -pdf output.pdf |
| -overwrite | Boolean | false | If true, overwrites the output PDF file if it already exists. | ./hocr2pdf -hocr myfile.hocr -image myimage.png -pdf output.pdf -overwrite |
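For readers curious how arguments like these are typically wired up, here is a minimal sketch using Go's standard `flag` package. It is an illustration, not the project's actual code: the `config` type and the `parseArgs` and `checkOutput` functions are hypothetical, and the rule that `-hocr`, `-image`, and `-pdf` must all be supplied is an assumption based on their empty defaults.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// config mirrors the four documented command-line arguments.
type config struct {
	hocr, image, pdf string
	overwrite        bool
}

// parseArgs reads the documented flags from args (without the program name).
func parseArgs(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("hocr2pdf", flag.ContinueOnError)
	fs.StringVar(&c.hocr, "hocr", "", "path to the HOCR file to process")
	fs.StringVar(&c.image, "image", "", "path to the image file to include in the PDF")
	fs.StringVar(&c.pdf, "pdf", "", "path to the output PDF file")
	fs.BoolVar(&c.overwrite, "overwrite", false, "overwrite the output PDF if it exists")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	// Assumption: all three paths are required.
	if c.hocr == "" || c.image == "" || c.pdf == "" {
		return c, errors.New("-hocr, -image and -pdf are all required")
	}
	return c, nil
}

// checkOutput refuses to replace an existing file unless overwrite is set.
func checkOutput(path string, overwrite bool) error {
	if overwrite {
		return nil
	}
	if _, err := os.Stat(path); err == nil {
		return fmt.Errorf("%s already exists; pass -overwrite to replace it", path)
	}
	return nil
}

func main() {
	args := os.Args[1:]
	if len(args) == 0 {
		// Fall back to the example from the table so the sketch runs without arguments.
		args = []string{"-hocr", "myfile.hocr", "-image", "myimage.png", "-pdf", "output.pdf", "-overwrite"}
	}
	c, err := parseArgs(args)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	if err := checkOutput(c.pdf, c.overwrite); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", c)
}
```

Because `-overwrite` is a boolean flag, it takes no value on the command line; its presence alone sets it to true.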

Contributing

We welcome contributions to improve hocr2pdf. Please fork the repository, make your changes, and submit a pull request.

Licensing

The code in this project is licensed under the BSD 3-Clause License. See LICENSE.md for the full text.