Augraphy is a Python library that produces multiple copies of an original document by passing it through an augmentation pipeline. The pipeline randomly distorts each copy, degrading the clean version into dirty, realistic copies that look as if they had been printed, faxed, scanned, or photocopied.
The dataset for the Denoising ShabbyPages competition was generated with Augraphy pipelines, which produced realistic old and noisy documents from “born digital” sources. Simulating these paper-oriented distortions yields large amounts of training data from which AI/ML models can learn to remove them.
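
The idea described above can be sketched in plain Python. This is only a conceptual illustration and does not use Augraphy's API: the functions `fade_ink`, `add_speckle`, `shift_rows`, and `make_copies`, the probabilities, and the tiny 8x8 "document" are all hypothetical stand-ins chosen for this example. Each copy passes through the same list of steps, each step fires with its own probability, and every noisy copy is paired with the clean original to form a training example for a denoising model.

```python
import random

# Hypothetical stand-ins for real augmentations; none of these names
# come from Augraphy. Images are lists of rows of grayscale values
# (0 = black ink, 255 = white paper).

def fade_ink(img, rng):
    """Lighten dark pixels by a random amount, like a weak print."""
    lift = rng.randint(20, 80)
    return [[min(255, px + lift) if px < 128 else px for px in row]
            for row in img]

def add_speckle(img, rng):
    """Turn roughly 2% of pixels black, like toner specks."""
    return [[0 if rng.random() < 0.02 else px for px in row] for row in img]

def shift_rows(img, rng):
    """Shift each row by at most one pixel, like scanner jitter."""
    out = []
    for row in img:
        k = rng.choice([-1, 0, 1])
        if k == 1:
            out.append([255] + row[:-1])   # shift right, pad with paper
        elif k == -1:
            out.append(row[1:] + [255])    # shift left, pad with paper
        else:
            out.append(list(row))
    return out

def make_copies(clean, steps, n_copies, seed=0):
    """Return n_copies randomly distorted copies of `clean`.

    `steps` is a list of (function, probability) pairs; each step is
    applied to a copy only if a random draw falls below its probability.
    """
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        img = [list(row) for row in clean]  # never mutate the original
        for step, prob in steps:
            if rng.random() < prob:
                img = step(img, rng)
        copies.append(img)
    return copies

# A tiny "born digital" page: white paper with one short black stroke.
clean = [[255] * 8 for _ in range(8)]
for x in range(2, 6):
    clean[3][x] = 0

steps = [(fade_ink, 0.5), (add_speckle, 0.7), (shift_rows, 0.5)]

# (clean, noisy) pairs are the training examples for a denoiser.
pairs = [(clean, noisy) for noisy in make_copies(clean, steps, n_copies=4)]
```

Because each step is gated by its own random draw, different copies generally receive different combinations of distortions, which is what makes one clean source useful for producing many distinct training examples.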
If you want to help out the project, fork and star it on GitHub!