restructure README

This commit is contained in:
judge 2025-11-22 13:46:22 +01:00
parent 249e65c55f
commit 745434e642
No known key found for this signature in database
GPG key ID: 6512C30DD8E017B5
3 changed files with 16 additions and 41 deletions

Cargo.lock generated

@@ -1745,8 +1745,8 @@ dependencies = [
 ]

 [[package]]
-name = "paperless-field-extractor"
-version = "0.1.2"
+name = "paperless-llm-workflows"
+version = "0.3.0"
 dependencies = [
  "actix-web",
  "chrono",


@@ -21,26 +21,21 @@ As a base model this software is using a quantized version of `Qwen3` to reduce
Long term I want to expand the features to enable fine-tuning models on your document corpus. This is where the actual learning would come in.
# Usage
This project spawns an API server that can be integrated via the `paperless` Workflow feature to provide custom processing steps.
## LLM Workflows
After starting the service you can navigate to `http://{paperless-llm-workflows.ip}:8123/api/` to get up-to-date API documentation describing all the endpoints.
To integrate a functionality into paperless you need to add it as a webhook trigger in your paperless workflows:
![Paperless Webhook](./example-workflow-action.png)
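As a quick reachability check, you can also fetch the documentation endpoint from the command line (a minimal sketch; the hostname placeholder is illustrative, the port `8123` matches the container run example further down):

``` sh
# Fetch the generated API documentation page.
# Replace the host placeholder with the address the service listens on.
curl http://{paperless-llm-workflows.ip}:8123/api/
```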
# Custom Field Value Prediction
In its current form `paperless-ngx` does not support predicting the values of custom fields from the contents of a document. Addressing this is a complicated issue:
- As opposed to the values that are currently predictable for every document (`correspondent`, `document_date`, `storage_path`, `document_type`), custom fields do not exist on every document! The values of a custom field are also much more flexible in what their content can be. Paperless trains classical neural nets as predictors for the supported fields from the user-specified variants, while `document_date` uses a combined regex + neural net approach. Neither approach generalizes easily to more field types such as user-defined custom fields.
- The first step in automating custom fields is assigning them to documents. However, since this requires the Paperless system to process the documents first, implementing this as a built-in feature would significantly complicate the standard workflow.
# How does this tool expect the workflow to work?
1. Documents are imported and processed by the paperless default document import, assigning document types as usual
2. Use Paperless workflows to assign unfilled custom fields to the documents based on `document_type` and correspondent
3. Use this software to fill in the empty custom fields:
   - First, all documents are scanned for unfilled custom fields and given a processing tag to indicate to the user that the document is being worked on
   - Then a locally running language model predicts the value of each custom field from the document data
   - Finally, the filled custom fields are uploaded to the corresponding document and a finished tag is set to inform the user that all document processing has finished
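The steps above go through the regular paperless-ngx REST API. For example, you can inspect which custom field values a document carries with a token-authenticated request (a sketch; the document id `123` is illustrative, and the placeholders correspond to the environment variables used in the container examples below):

``` sh
# Show a document's custom field values via the paperless-ngx REST API.
curl -H "Authorization: Token <token>" \
     "<paperless_ngx_url>/api/documents/123/" | jq '.custom_fields'
```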
-# Supported Custom Field Types
+## Supported Custom Field Types
Currently this project predicts the following kinds of custom fields:
- [x] Boolean
@@ -98,6 +93,7 @@ The default container is setup to include a model already and with some environm
<podman/docker> run -it --rm \
--device /dev/kfd \ # give graphics device access to the container
--device /dev/dri \ # give graphics device access to the container
-p 8123:8123 \ # expose the API port
-e PAPERLESS_API_CLIENT_API_TOKEN=<token> \
-e PAPERLESS_SERVER=<paperless_ngx_url> \
-e PAPERLESS_USER=<user> \ # used for tag creation
@@ -106,26 +102,6 @@ The default container is setup to include a model already and with some environm
Currently only the `vulkan` backend has a prebuilt container available; it should be fine for most deployments, even without a graphics processor available.
The easiest way to have this run in the background is to configure a cron job or systemd-timer to run the software regularly, checking for new documents with unfilled custom fields.
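A minimal crontab sketch for such a periodic run (the hourly schedule is an illustrative assumption; the flags mirror the run example above):

``` sh
# Run the extractor once per hour; --rm removes the container afterwards.
0 * * * * podman run --rm --device /dev/kfd --device /dev/dri \
    -e PAPERLESS_API_CLIENT_API_TOKEN=<token> \
    -e PAPERLESS_SERVER=<paperless_ngx_url> \
    -e PAPERLESS_USER=<user> \
    ghcr.io/ju6ge/paperless-field-extractor:<version>-<backend>
```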
## Dry Run for Testing
If you wish to check how this would look for your documents with unfilled custom fields, you can use the dry-run mode.
``` sh
<podman/docker> run -it --rm \
--device /dev/kfd \ # give graphics device access to the container
--device /dev/dri \ # give graphics device access to the container
-e PAPERLESS_API_CLIENT_API_TOKEN=<token> \
-e PAPERLESS_SERVER=<paperless_ngx_url> \
-e PAPERLESS_USER=<user> \ # used for tag creation
ghcr.io/ju6ge/paperless-field-extractor:<version>-<backend> --dry-run
```
This will run the inference and print the results to the terminal, without adding tags to documents or sending the extracted fields back to paperless. This mode is also useful for evaluating different models.
NOTE: The processing and finished tags will still be created on the server, since the software requires their existence.
## Building the Container yourself
@@ -165,7 +141,6 @@ You will need to download a model gguf yourself and configure the `GGUF_MODEL_PA
# Future Work
Depending on interest and requests, the following future updates may come:
- Continuous Serving, using webhooks to automatically trigger custom field extraction instead of requiring a timer setup
- Automated fine-tuning using LoRA on an existing corpus of documents
# LICENSE

BIN example-workflow-action.png (new file, 37 KiB)