I follow the #opensource hashtag on Mastodon, so anything related to Free and Open Source Software shows up on my wall, and it's been great. A user posted a link to Paperless, an open source document management system (DMS). I had never heard of it before, so it immediately piqued my interest.
What is a DMS? Perhaps you have used a CMS - Content Management System - for your web pages? A document management system is more or less the same, except that the focus is on file storage and internal categorization/indexing rather than on external display.
After trying the demo, which I think was broken, I installed the software myself on my server. It consists of a backend written in Python and a frontend written in TypeScript.
A lot of work has been put into how the system actually feels in use, and it feels nice. That is of course a subjective assessment on my part, but navigation and general use of the product feel fluid and solid. I have not yet been able to provoke any error messages, even though I have uploaded many different file types to the system.
Paperless install
You can do the installation itself manually or via Docker Compose. I chose Docker because it's easy to clean up again, although I think this product is a keeper, since it can develop in exciting directions.
The installation was text based and I felt well guided. You simply download a single installation file and then the Paperless installation does the rest with a few questions for username/password, additional package selection, etc.
OCR recognition
The first thing I set out to test was Paperless' OCR. OCR stands for Optical Character Recognition, and its function is the automatic conversion of images of text into machine-readable text. Below you see an example of a scan: the view on the right shows a book page, and the text field on the left shows the OCR-recognized text.
During the installation I had installed Tesseract's OCR modules in Danish, German and English. Tesseract was developed by Hewlett-Packard in the 1980s, which released it as Open Source in 2005 together with the University of Nevada, Las Vegas, USA. The following year, Google started sponsoring the development, and has done so ever since.
The English OCR worked reasonably well, even on book pages that were a bit crooked. There were a few words that were not recognized correctly, but that is to be expected when using OCR software.
The Danish part of Tesseract OCR has problems with ø and å, which are recognized as o and a, but funnily enough not with æ. It could be the font, which is "thin" in certain places. This is where I would like better options for viewing and fine-tuning the OCR settings inside Paperless. I tried saving the source material as black and white at 1200 ppi with a bold font, but it made no difference.
Apart from the OCR integration, which must be said to be the feature that "sells" the system, the user interface appears easy and straightforward. I generally like user interfaces where the setup is kept minimalistic and simple, and where secondary buttons only appear when you need them. Less is more, simply.
Document types and emails
You can ask Paperless to classify documents according to certain words (tags, or labels). So if you use the phrase “Annual Report”, it will collect all reports in a specific folder with specific labels. This, however, requires that the OCR is error-free in Danish. I haven't had time to test this part thoroughly, but I'll update this post once I have.
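To illustrate why OCR errors matter for keyword-based classification, here is a minimal sketch of a matcher that tolerates the ø/å confusion described above. It is not how Paperless does its matching; `fold` and `matches_keyword` are hypothetical helpers made up for this example, and the Danish word "Årsrapport" (annual report) is just sample input.

```python
import unicodedata

# Hypothetical helpers for illustration only; not part of Paperless.
# Map the letters that the OCR tended to misread onto what it produced instead.
OCR_FOLD = str.maketrans({"ø": "o", "å": "a", "Ø": "O", "Å": "A"})


def fold(text: str) -> str:
    """Normalize text so that 'ø'/'å' and their OCR misreadings compare equal."""
    composed = unicodedata.normalize("NFC", text)  # ensure single-code-point letters
    return composed.translate(OCR_FOLD).casefold()


def matches_keyword(ocr_text: str, keyword: str) -> bool:
    """Return True if the keyword occurs in the OCR text after folding both."""
    return fold(keyword) in fold(ocr_text)


# The OCR read "Årsrapport" as "Arsrapport"; the folded comparison still matches.
print(matches_keyword("Arsrapport 2023", "Årsrapport"))
print(matches_keyword("Faktura nr. 12", "Årsrapport"))
```

The trade-off is that folding also makes genuinely different words compare equal, so a rule like this would be more forgiving but less precise.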
Paperless can also be set to monitor incoming emails from certain accounts and then classify the documents afterwards.
Conclusion and wishes
As mentioned, I would like more options to adjust import settings on the OCR in Paperless itself. Conversion-wise there is nothing to complain about: the system recognized everything I threw at it. It insisted on OCRing everything, probably because of the classification feature, and it's super cool to see OCR-recognized words pop up in the search index by themselves. An upload may take some time when it has to be scanned during import. An image of 3000×3000 pixels (1200 ppi) took approx. 15-20 seconds.
The Danish translation of the Paperless backend is not quite complete, but absolutely usable. There is no option to install additional plugins, which would, for example, make it possible to adapt the system to specific work patterns or specific industries. If, for example, you paired the system with the Python imaging library Pillow (the successor to PIL), it would become a fairly capable backend for graphics companies. Unfortunately, the developers write in the documentation that plugins are unlikely to become a reality. I think it's a shame, but okay, concentrating on the core functionality is also a way to keep the system from becoming too broad.
However, the lack of plugin extensions is compensated for by the fact that you can add special fields to your document types, and by the possibility of using a REST API to make Paperless work with other types of software. So there are still plenty of opportunities for integration with external systems.
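As a rough idea of what such an integration could look like, here is a minimal sketch of a full-text search against the REST API using only the Python standard library. The `/api/documents/` endpoint with a `query` parameter and the `Authorization: Token ...` header reflect my reading of the Paperless-ngx API documentation, so check them against your version. `BASE_URL`, `API_TOKEN` and the two function names are placeholders I made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumptions: replace these placeholders with your own instance and API token.
BASE_URL = "http://localhost:8000"
API_TOKEN = "replace-with-your-token"


def build_search_request(
    query: str, base_url: str = BASE_URL, token: str = API_TOKEN
) -> urllib.request.Request:
    """Build (but do not send) a GET request for a full-text document search."""
    params = urllib.parse.urlencode({"query": query})
    url = f"{base_url.rstrip('/')}/api/documents/?{params}"
    headers = {"Authorization": f"Token {token}", "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers)


def search_titles(query: str) -> list[str]:
    """Send the search and return the titles of the documents on the first page."""
    with urllib.request.urlopen(build_search_request(query), timeout=10) as response:
        payload = json.load(response)
    # Assumes a paginated response with a "results" list of documents.
    return [doc["title"] for doc in payload.get("results", [])]


if __name__ == "__main__":
    # Requires a running instance and a valid token.
    for title in search_titles("annual report"):
        print(title)
```

Building the request separately from sending it makes it easy to inspect the URL and headers before pointing the script at a real server.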
All in all, a solid open source product that I can easily recommend trying out.
Installation and screenshots:
https://docs.paperless-ngx.com/setup/
https://docs.paperless-ngx.com/paperless-a-history