Research on PDF Content Extraction
PDF content extraction lays the foundation for converting PDF files to images and text. The accuracy and speed of extraction directly determine whether the subsequent conversion succeeds. To convert the PDF format, the first step is therefore to understand the structure of a PDF file.
Ordinary PDF File Structure
A PDF file is a sequence of 8-bit binary bytes, although it can also be expressed using only 7-bit ASCII. Whichever representation is used, a PDF file is transmitted and stored as a binary file, not a text file. A newly created PDF file consists of a header, a body, a cross-reference table, and a trailer. When a PDF file is modified, the changes are simply appended to the end of the original file: the appended part contains the revised body, a new cross-reference table, and a new trailer. A modified PDF file therefore contains multiple bodies, cross-reference tables, and trailers.
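As a rough illustration of this append-only update model, the following Python sketch lists the revisions in a file by locating each startxref keyword and its closing %%EOF marker. The function name list_revisions and the abbreviated sample bytes are hypothetical and serve only to show the idea.

```python
import re

def list_revisions(data: bytes) -> list[int]:
    """Return the offset recorded after each 'startxref' keyword.

    Every saved revision ends with 'startxref <offset> %%EOF', so the
    number of matches approximates the number of appended sections.
    """
    pattern = re.compile(rb"startxref\s+(\d+)\s+%%EOF")
    return [int(m.group(1)) for m in pattern.finditer(data)]

# Hypothetical, heavily abbreviated file: the original plus one update.
sample = (
    b"%PDF-1.4\n"
    b"...original body...\n"
    b"xref\n...\ntrailer\n<< /Size 4 /Root 1 0 R >>\n"
    b"startxref\n120\n%%EOF\n"
    b"...revised body...\n"
    b"xref\n...\ntrailer\n<< /Size 5 /Root 1 0 R /Prev 120 >>\n"
    b"startxref\n480\n%%EOF\n"
)

print(list_revisions(sample))
```

Each offset in the returned list points to the cross-reference table of one revision, with the most recent revision last.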
Linear PDF File Structure
Linear (linearized) PDF files use a special organization to allow efficient access over a network. Linear PDF files are generally treated as read-only. Modification is allowed, but once modified the file is no longer a linear PDF file and becomes an ordinary PDF file. Linear PDF files differ from ordinary PDF files in two main ways.
A. The rules for the order of objects in PDF files are different.
Indirect objects in a linear PDF file are divided into two groups, each numbered sequentially. The first group includes the document catalog, other document-level objects, and all objects belonging to the first page of the document. These objects are numbered sequentially, starting after the last object number of the second group. The hint streams, which hold the hint tables, may be an exception to this numbering.
The second group consists of the remaining objects in the document: the objects of pages other than the first, all shared objects, and so on. Their object numbers start from 1. The two groups are indexed by two separate cross-reference tables.
B. A linear PDF file adds a data structure called a hint table.
The hint table is the core of the linearization information stored in the file. It allows a client to issue requests that efficiently display any page of the document or retrieve other information.
Linear PDF File Structure from Top to Bottom
a. File header.
b. Linearization parameter dictionary.
It includes /Linearized (version information), /L (file length), /H (offset and length of the primary hint stream), /O (object number of the first page's page object), /E (offset of the end of the first page), /N (number of pages in the document), and /T (offset of the first entry in the main cross-reference table).
c. First-page cross-reference table and trailer.
The trailer here includes /Size (the total number of entries in the cross-reference table), /Prev (the offset of the main cross-reference table), /Root (an indirect reference to the document catalog), and so on.
d. Document catalog and other required document-level objects.
e. Hint stream.
The hint stream is assigned the last object numbers in the file, and its cross-reference entries are at the end of the first-page cross-reference table. The stream mainly contains the page offset hint table, the shared object hint table, and other hint tables.
f. Objects of the first page, both shared and non-shared. The order of e and f can be reversed.
g. Objects of the other pages, including the non-shared objects of these pages.
h. Shared objects of all pages except the first.
i. Other objects not associated with any page.
j. Overflow hint stream (optional).
k. Main cross-reference table and trailer.
This trailer includes /Size, the number of entries in the main cross-reference table. The offset of the first-page cross-reference table is given after startxref. Because linear PDF files can be accessed efficiently over a network, nearly all PDF files in circulation are taken to be in this form. Therefore, only linear PDF files are considered here when extracting content information from PDF files.
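The linearization parameter dictionary sits near the start of the file, so its keys can be read before anything else. The sketch below is one possible way to do this in Python, assuming the dictionary is written as plain text with direct integer values. The function name parse_linearization_dict and all sample values are hypothetical.

```python
import re

def parse_linearization_dict(head: bytes) -> dict:
    """Read the integer keys of the linearization parameter dictionary.

    Assumes the dictionary that contains /Linearized is delimited by the
    nearest '<<' before it and the nearest '>>' after it.
    """
    pos = head.find(b"/Linearized")
    if pos < 0:
        return {}                      # not a linear PDF file
    start = head.rfind(b"<<", 0, pos)
    end = head.find(b">>", pos)
    if start < 0 or end < 0:
        return {}
    body = head[start + 2:end]

    params = {}
    # /H is an array: [offset length] of the primary hint stream.
    hint = re.search(rb"/H\s*\[\s*(\d+)\s+(\d+)", body)
    if hint:
        params["H"] = (int(hint.group(1)), int(hint.group(2)))
    # The remaining keys of interest hold a single integer each.
    for key in (b"L", b"O", b"E", b"N", b"T"):
        m = re.search(rb"/" + key + rb"\s+(\d+)", body)
        if m:
            params[key.decode()] = int(m.group(1))
    return params

# Hypothetical first bytes of a linear PDF file.
sample_head = (
    b"%PDF-1.4\n"
    b"43 0 obj\n"
    b"<< /Linearized 1 /L 54321 /H [ 528 200 ] /O 45 /E 9000 /N 3 /T 54000 >>\n"
    b"endobj\n"
)

params = parse_linearization_dict(sample_head)
print(params)
```

In this sample, params["T"] is the value that would be used to locate the main cross-reference table, and params["N"] gives the number of pages to expect.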
Implementation of PDF Text Extraction
PDF file information extraction is divided into the following three steps.
A. Extraction and merging of main cross-reference table and first-page cross-reference table
In a linear PDF file, there are two cross-reference tables and two trailers. Extracting and merging the two cross-reference tables makes it easy to look up the offset of any indirect object.
The specific implementation is as follows.
a. In the linearization parameter dictionary, /T gives the offset of the first entry of the main cross-reference table. Locate the main cross-reference table and store the offset field of each entry in the cross-reference array CrossTable[].
b. Locate the first-page cross-reference table using the offset given after startxref in the second trailer. In the same way, extract the offset field of each entry and append it to CrossTable[]. The offsets of both cross-reference tables are now extracted and merged.
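The two steps above can be sketched in Python as follows. This is a minimal illustration, not the original implementation: it models CrossTable[] as a dict named cross_table that maps object numbers to offsets, and it assumes the caller already knows the byte position of each xref keyword (derived from /T and from startxref). The function name read_xref_table and the sample tables are hypothetical.

```python
import re

ENTRY = re.compile(rb"(\d{10}) (\d{5}) ([nf])")
SUBSECTION = re.compile(rb"\s*(\d+)\s+(\d+)\s*")

def read_xref_table(data: bytes, xref_pos: int) -> dict[int, int]:
    """Parse one cross-reference table that starts at the 'xref' keyword.

    Returns {object number: byte offset} for in-use ('n') entries only.
    """
    if data[xref_pos:xref_pos + 4] != b"xref":
        raise ValueError("no xref keyword at the given offset")
    pos = xref_pos + 4
    table = {}
    while True:
        head = SUBSECTION.match(data, pos)
        if not head:                   # reached 'trailer' or end of data
            break
        first, count = int(head.group(1)), int(head.group(2))
        pos = head.end()
        for i in range(count):
            entry = ENTRY.match(data, pos)
            if not entry:
                raise ValueError("malformed cross-reference entry")
            if entry.group(3) == b"n":
                table[first + i] = int(entry.group(1))
            pos = entry.end()
            # Skip the end-of-line bytes that pad each entry.
            while data[pos:pos + 1] in (b" ", b"\r", b"\n"):
                pos += 1
    return table

# Hypothetical layout: first-page table first, main table near the end.
first_page_xref = (
    b"xref\n3 2\n"
    b"0000000700 00000 n \n"
    b"0000000800 00000 n \n"
    b"trailer\n"
)
main_xref = (
    b"xref\n0 3\n"
    b"0000000000 65535 f \n"
    b"0000000900 00000 n \n"
    b"0000001000 00000 n \n"
    b"trailer\n"
)
data = first_page_xref + b"...objects...\n" + main_xref

cross_table = {}
cross_table.update(read_xref_table(data, data.index(main_xref)))  # step a
cross_table.update(read_xref_table(data, 0))                      # step b
print(sorted(cross_table.items()))
```

Keying the merged table by object number keeps the lookup of any indirect object to a single dictionary access, which is what the later steps rely on.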
B. Find, extract, and decode the streams that store the content of each page
The object number of the document catalog is obtained from /Root in the first trailer. This is the entry point for finding the body content and leads to the root of the document page tree. Look up the offset of this object in the cross-reference array and go to the object. Then, until all pages of the PDF have been extracted, apply the following rules:
a. If /Type is /Catalog, the object is the document catalog. Get the object number after /Pages and go to that object in the same way as before.
b. If /Type is /Pages, the object is a page tree node. Get all the object numbers after /Kids and go to each of those objects in turn.
c. If /Type is /Page, the object is a page object. If there is no /Contents entry, the page is empty. Otherwise, the value of /Contents is either a single stream or an array of streams. Get all the object numbers after /Contents and record them in the page's content object number array ContentObjectNo[]. For each object number in ContentObjectNo[], go to the corresponding object, read the name of the decoding filter after /Filter, and put everything between stream and endstream into a byte array. The stream describes the content of the page. Viewed in the raw file, the stream appears as garbled bytes and must be decoded.
d. Concatenate the decoded strings of all page streams into TextDecodeStr.
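The traversal in steps a to d can be sketched as follows. To keep the example short it makes several assumptions that are not in the text above: the objects have already been located through the cross-reference array and loaded into a dict keyed by object number, /Length is a direct integer, and only the /FlateDecode filter is handled (using Python's zlib module). All function names and the tiny sample document are hypothetical.

```python
import re
import zlib

def refs_after(key: bytes, obj: bytes) -> list[int]:
    """Object numbers referenced after `key` (single reference or array)."""
    m = re.search(rb"/" + key + rb"\s*(\[[^\]]*\]|\d+\s+\d+\s+R)", obj)
    if not m:
        return []
    return [int(n) for n in re.findall(rb"(\d+)\s+\d+\s+R", m.group(1))]

def decode_stream(obj: bytes) -> bytes:
    """Return the decoded bytes stored between 'stream' and 'endstream'."""
    head, _, rest = obj.partition(b"stream\n")
    length = int(re.search(rb"/Length\s+(\d+)", head).group(1))
    raw = rest[:length]
    if b"/FlateDecode" in head:
        return zlib.decompress(raw)
    return raw                          # other filters are out of scope here

def collect_page_streams(objects: dict[int, bytes], root: int) -> list[bytes]:
    """Walk catalog -> page tree -> pages and decode each content stream."""
    decoded = []
    pending = [root]
    while pending:
        obj = objects[pending.pop(0)]
        type_match = re.search(rb"/Type\s*/(\w+)", obj)
        kind = type_match.group(1) if type_match else b""
        if kind == b"Catalog":                      # step a
            pending = refs_after(b"Pages", obj) + pending
        elif kind == b"Pages":                      # step b
            pending = refs_after(b"Kids", obj) + pending
        elif kind == b"Page":                       # step c
            for content_no in refs_after(b"Contents", obj):
                decoded.append(decode_stream(objects[content_no]))
    return decoded

def make_stream(text: bytes) -> bytes:
    """Build a hypothetical Flate-compressed stream object for the demo."""
    packed = zlib.compress(text)
    header = b"<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(packed)
    return header + packed + b"\nendstream"

objects = {
    1: b"<< /Type /Catalog /Pages 2 0 R >>",
    2: b"<< /Type /Pages /Kids [3 0 R 4 0 R] /Count 2 >>",
    3: b"<< /Type /Page /Parent 2 0 R /Contents 5 0 R >>",
    4: b"<< /Type /Page /Parent 2 0 R /Contents [6 0 R] >>",
    5: make_stream(b"BT (Page one) Tj ET"),
    6: make_stream(b"BT (Page two) Tj ET"),
}

text_decode_str = b"\n".join(collect_page_streams(objects, 1))   # step d
print(text_decode_str.decode("latin-1"))
```

Children are placed at the front of the pending list so that pages are visited in document order even when the page tree is nested.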
C. Extract text and related information from the decoded string
a. The decoded stream string
The decoded stream string contains not only text but also a large amount of font and size information. In addition, the format of this part differs slightly between PDF files converted from different versions of Word. A text object starts with the BT operator and ends with the ET operator. It contains operators that show text strings, move the text position, and set the text state, along with their parameters. The first operand of the text state operator Tf is the font. The operands of the text positioning operator Tm, which sets the text matrix, are [Sx 0 0 Sy Tx Ty].
The graphics state operators are q, Q, and cm. q saves the current graphics state, Q restores the saved graphics state, and cm applies the coordinate transformation from the user coordinate system to the device coordinate system.
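Because operands precede their operator in a content stream, the decoded string can be split into (operator, operands) pairs before any extraction is attempted. The following Python sketch shows one simple way to do so. It is an illustration under simplifying assumptions (no nested parentheses inside strings, no inline images), and the names TOKEN, NUMBER, operations, and sample are hypothetical.

```python
import re

# A token is a literal string, an array, or a run of non-delimiter bytes.
TOKEN = re.compile(rb"\((?:\\.|[^\\)])*\)|\[.*?\]|[^\s\[\]()]+", re.S)
NUMBER = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)")

def operations(stream: bytes):
    """Yield (operator, operands) pairs from a decoded content stream."""
    operands = []
    for tok in TOKEN.findall(stream):
        is_operand = tok[:1] in (b"/", b"(", b"[", b"<") or NUMBER.fullmatch(tok)
        if is_operand:
            operands.append(tok)
        else:
            yield tok, operands        # operands always precede the operator
            operands = []

# Hypothetical decoded stream of a single page.
sample = b"q 1 0 0 1 0 0 cm BT /F1 1 Tf 12 0 0 12 72 720 Tm (Hello) Tj ET Q"

for op, args in operations(sample):
    if op == b"Tf":
        print("font operand:", args[0])
    elif op == b"Tm":
        print("text matrix operands [Sx 0 0 Sy Tx Ty]:", args)
    elif op in (b"q", b"Q", b"cm"):
        print("graphics state operator:", op, args)
```

Working on operator and operand pairs rather than raw bytes makes it straightforward to pick out Tf, Tm, and the text-showing operators in the extraction step.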
b. Extracting text and related information
To extract the text, find the Tj and TJ operators. For Tj, take the content inside the parentheses. For TJ, discard the numeric values and keep only the content inside the parentheses. If the escape character "\" is present, remove it. For English text, Tj and TJ give the original text, which can be displayed directly.
To extract the font, find Tf: its first operand is the font. To extract the font size, find Tm: its first operand is the font size. Finally, look for Td and TD and combine them with Tm to determine whether a line break occurs.
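A minimal Python sketch of the text extraction rules might look like the following. It is not the original implementation: it removes the backslash only from the escapes \(, \), and \\, and it treats any Td or TD with a nonzero vertical offset as a line break, which is a simplifying assumption that ignores Tm. The names extract_lines, unescape, and sample are hypothetical.

```python
import re

NUM = rb"[-+]?(?:\d+\.?\d*|\.\d+)"
EVENT = re.compile(
    rb"(?P<text>\((?:\\.|[^\\)])*\)|\[.*?\])\s*(?P<show>Tj|TJ)\b"
    rb"|" + NUM + rb"\s+(?P<ty>" + NUM + rb")\s+T[dD]\b",
    re.S,
)
STRING = re.compile(rb"\(((?:\\.|[^\\)])*)\)")

def unescape(raw: bytes) -> str:
    """Remove the backslash in front of an escaped '(', ')' or backslash."""
    return re.sub(rb"\\([()\\])", rb"\1", raw).decode("latin-1")

def extract_lines(stream: bytes) -> list[str]:
    """Collect the text shown by Tj/TJ, split into lines at Td/TD moves."""
    lines = [""]
    for m in EVENT.finditer(stream):
        if m.group("show"):
            # For TJ only the parenthesized strings are kept; the numeric
            # spacing adjustments between them are discarded.
            pieces = STRING.findall(m.group("text"))
            lines[-1] += "".join(unescape(p) for p in pieces)
        elif float(m.group("ty").decode()) != 0 and lines[-1]:
            lines.append("")           # vertical move: start a new line
    return [line for line in lines if line]

# Hypothetical decoded stream with one Tj, one TJ, and one line move.
sample = (
    b"BT /F1 1 Tf 12 0 0 12 72 720 Tm "
    b"(Hello, ) Tj [(W) 40 (orld)] TJ "
    b"0 -1.2 Td (Second \\(line\\)) Tj ET"
)

for line in extract_lines(sample):
    print(line)
```

Font and size could be recovered in the same pass by also matching Tf and Tm and recording their first operands alongside each line.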
In Summary
This paper implements the extraction of content and text information from PDF files generated from Word documents. The full text content of each PDF file is extracted as text, together with text attributes such as font, font size, and line number. This information can then be used to extract specific content from PDF documents. For example, the title, author, address, abstract, keywords, and other header information of a PDF paper can be identified from the font, font size, line number, and so on.
Because of the large number of PDF files on the Internet, the arrangement of content in the streams of different PDF files varies widely. The rules for text information such as font, font size, and line breaks were derived from a large number of experiments. The values of some parameters can be extracted accurately within the range of PDF files tested, but errors may occur on a wider range of PDF files, which requires further optimization. In addition, this paper only analyzes PDF files generated from Word documents; PDF files generated by other methods need further study.