Information Extraction and Noise Reduction of Images in PDF Files
The development of the digital publishing industry has entered a period of rapid growth. E-books, rich media content, and fragmented information increasingly meet people's needs for mobile reading, and converting paper resources into digital form has become the norm. In converting publishing information resources across formats, publishing companies therefore face many issues, such as the traceability of content resources, the effective long-term preservation of documents, and how to keep archived copies consistent with the resources distributed on the Internet.
At present, many repositories and archives are stored in PDF format. From the international standard "Document management - Portable document format - Part 1: PDF 1.7" (ISO 32000-1), it is not difficult to see that PDF files are well suited to review, publication, and archiving. The main advantage of PDF is high-fidelity content rendering across multiple platforms with multimedia support. PDF is highly interactive, provides strong security through digital signatures, and can package both unstructured and structured data. Currently, formats such as ePub3 and Mobi fragment PDF content on the basis of XML. Fragmentation in digital publishing has opened broad room for improved services. However, during fragmentation, different conversion tools or manual indexing can easily distort the PDF content, and image information is especially prone to noise.
This article decomposes the text and images in archived PDF files and extracts element information such as the size, font, and color of the text.
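As an illustration of the kind of element information involved, the following minimal sketch uses the PyMuPDF library (our choice for illustration; the article does not name a specific toolchain) to walk the text spans of each page and print their font, size, and color. The file name is a placeholder.

```python
import fitz  # PyMuPDF

doc = fitz.open("archived.pdf")  # placeholder file name
for page in doc:
    for block in page.get_text("dict")["blocks"]:
        for line in block.get("lines", []):  # image blocks carry no "lines"
            for span in line["spans"]:
                # span["color"] is packed as an sRGB integer, e.g. 0xRRGGBB
                print(span["text"], span["font"], span["size"],
                      f'{span["color"]:06x}')
```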
1. PDF Image Preprocessing
Unlike image-processing research built on filtering algorithms such as convolution, the Fourier transform, and wavelet transforms, our work does not involve processing methods such as defogging, sharpening, or deblurring. Likewise, the image preprocessing of PDF files here does not involve information-hiding technologies such as digital watermarking. We focus on decomposing the PDF content and extracting the image information according to the image's original compression algorithm, matrix transformation, position offset, rendering method, and so on, in order to obtain the original image data in the PDF file for subsequent research and analysis.
Image preprocessing mainly extracts and analyzes the image information in the PDF file, including layer identification, color space, image size, image position, compression algorithm, and conversion algorithm. The objects in the PDF file are located through the cross-reference (Xref) table; for example, the base color space behind an image's Indexed space may be DeviceCMYK, and the decoding filters encountered are FlateDecode and DCTDecode. We then obtain the raw stream corresponding to each image, decode the image sample data according to its FlateDecode or DCTDecode filter, and form an image-data-stream filter. After an image has been preprocessed through this filter, parameters such as its spatial transformation and rendering can be obtained from the data stream.
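A minimal sketch of this preprocessing step, again assuming PyMuPDF as the toolchain: it enumerates the image XObjects through the cross-reference table, reads each raw sample stream, and decodes it according to its filter (zlib inflation for FlateDecode; a DCTDecode stream is already a complete JPEG file).

```python
import zlib

import fitz  # PyMuPDF

doc = fitz.open("archived.pdf")  # placeholder file name
for page in doc:
    # full=True reports, per image XObject: xref, ..., color space, name, filter
    for img in page.get_images(full=True):
        xref, cs_name, name, flt = img[0], img[5], img[7], img[8]
        raw = doc.xref_stream_raw(xref)      # undecoded sample stream
        if flt == "FlateDecode":
            samples = zlib.decompress(raw)   # lossless inflate
        elif flt == "DCTDecode":
            samples = raw                    # the raw bytes are a JPEG file
        else:
            samples = fitz.Pixmap(doc, xref).samples  # let MuPDF decode the rest
        print(name, cs_name, flt, len(samples))
```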
2. Image Color Space Conversion and Color Value Calculation
We adopt the CMYK color space as the color-value standard and provide unified CMYK channel values to serve management needs such as publishing and distribution. During research and development, the color spaces encountered in PDF images, such as RGB, Lab, Indexed, DeviceN, and Separation, are all converted to CMYK. After the conversion, we analyze, count, and calculate the color-value information of each image.
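The article does not spell out its conversion formulas. As one hedged example, the sketch below performs the textbook device-level RGB-to-CMYK conversion with NumPy; real PDF color spaces such as Separation and DeviceN would require their embedded tint-transform functions, and an ICC-managed conversion would give different values.

```python
import numpy as np

def rgb_to_cmyk(rgb: np.ndarray) -> np.ndarray:
    """Naive device conversion of an H x W x 3 float array in [0, 1]."""
    k = 1.0 - rgb.max(axis=-1)
    denom = np.where(k < 1.0, 1.0 - k, 1.0)  # avoid divide-by-zero on pure black
    c = (1.0 - rgb[..., 0] - k) / denom
    m = (1.0 - rgb[..., 1] - k) / denom
    y = (1.0 - rgb[..., 2] - k) / denom
    return np.stack([c, m, y, k], axis=-1)
```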
Throughout our research, the images in the PDF file remain unchanged. To keep the original archived state, we do not adjust brightness or contrast or apply image enhancement; we only calculate color values within the color space. For the statistics, we do not use frequency-domain computation but process the data directly in the spatial domain. Any noise or stain data present in the document is kept in the PDF file.
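Continuing the NumPy sketch above, direct spatial-domain statistics over the converted CMYK planes might look like the following; the particular statistics gathered are our assumption, as the article only states that they are computed in the spatial domain.

```python
import numpy as np

def channel_stats(cmyk: np.ndarray) -> dict:
    """Per-channel mean and 256-bin histogram, computed directly in the spatial domain."""
    stats = {}
    for i, ch in enumerate("CMYK"):
        vals = cmyk[..., i]
        hist, _ = np.histogram(vals, bins=256, range=(0.0, 1.0))
        stats[ch] = {"mean": float(vals.mean()), "hist": hist}
    return stats
```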
3. Noise Reduction Through a Multi-Channel Linear Filtering Algorithm
Image noise is widespread. It generally appears as extreme values, large or small, that are added to or subtracted from the true color values of image pixels, producing bright and dark dot interference that greatly reduces image quality. Converting the color space of an image can also introduce noise and stained pixels. At the same time, some original images in PDF files inevitably carry artificial marks; for example, when a few colored points are used as self-identifying marks, such image noise can seriously distort the statistics.
Generally speaking, the noise spectrum of an image lies in regions of relatively high spatial frequency. Low-pass filtering in the spatial domain is used to smooth noise, while high-pass filtering is mainly used to pick out isolated pixels or high peaks within pixel blocks. To shield the statistics and analysis from the errors these factors cause, we adopt a multi-channel threshold control algorithm in the system.
First, high-pass filtering is used to suppress the low-frequency components of the image so that its high-frequency components pass with little or no loss. Second, a multi-channel linear threshold filtering algorithm is designed for low-frequency noise. We use a linear multi-threshold test mainly because it allows fast computation and accurate judgment; moreover, the image itself need not be transformed, since we only need to recognize and extract image-related information. Based on statistics over a large number of PDF image files, and through continuous optimization and adjustment of the channel thresholds, this approach provides a workable technical guarantee for the noise-reduction and decontamination management of PDF images, as sketched below.
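The article does not publish the algorithm itself, so the following is only a minimal sketch of a per-channel linear threshold filter in the spirit described: the local-mean residual serves as a crude high-pass signal, and a pixel whose residual exceeds its channel threshold is treated as noise and replaced by the local median. The 3x3 window and the threshold values are illustrative assumptions; in the system described, the channel thresholds would be tuned iteratively from statistics over many PDF images rather than fixed.

```python
import numpy as np
from scipy.ndimage import median_filter, uniform_filter

def multichannel_threshold_denoise(
    cmyk: np.ndarray,
    thresholds: tuple = (0.25, 0.25, 0.25, 0.25),  # one per C, M, Y, K channel
) -> np.ndarray:
    """Per-channel linear threshold filter over an H x W x 4 CMYK array in [0, 1]."""
    out = cmyk.copy()
    for i, t in enumerate(thresholds):
        ch = cmyk[..., i]
        local_mean = uniform_filter(ch, size=3)  # crude low-pass estimate
        residual = ch - local_mean               # high-frequency residual
        noisy = np.abs(residual) > t             # linear threshold test
        out[..., i][noisy] = median_filter(ch, size=3)[noisy]
    return out
```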
Statistical analysis of the images in PDF files supports practical applications such as indexing related or similar images, multi-image association queries, and searching for images by image, and it provides authoritative management of image fidelity and tampering after fragmentation or transmission. Accurate data facilitates traceability and intellectual-property management, and it also gives content providers more practical management tools in the Internet era. Relevant attributes can be established around PDF content, and an in-depth management system can then be built on the content files. This also provides powerful technical tools for online reading, deep mining, and knowledge services.
We have realized the recognition and extraction of text, images, and other content, implemented noise-reduction and decontamination processing for images, and provided output of the related attribute data. Next, we can automatically index and fragment PDF files to provide services based on XML data. We will also design a self-learning threshold-optimization algorithm based on high-pass filtering, which will intelligently serve more practical applications and provide underlying technical support for digital reading and information management.