# PDF import

## Introduction

The code in this directory parses a PDF file and builds a LibreOffice
document containing similar elements, which can then be edited.
It is invoked when opening a PDF file, but **not** when inserting
a PDF into a document. Inserting a PDF file renders it and inserts
a non-editable, rendered version.

The parsing is done by the [Poppler](https://poppler.freedesktop.org/)
library, which calls back into one layer of this code that is built as a
Poppler output device implementation.

The PDF format is specified by [this document](https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf).

Note that PDF is a language that describes how to **render** a page, not
a language for describing an editable document, so some of the conversion
is heuristic and doesn't always give good results.

Indeed, PDF is Turing complete, and can embed JavaScript, which is also
Turing complete, so it's a wonder that PDFs ever manage to display anything.

## Current limitations

- Not all elements have clipping implemented.

- LibreOffice's clipping routines all use the even-odd fill rule, whereas
PDF can (and usually does) use the non-zero winding rule, making some
clipping operations incorrect.

- In PDF, there's no concept of lines of text or paragraphs; each
character can be entirely separate. The code has very simple heuristics
for reassembling characters back into lines of text.
Other programs, like *pdftotext*, have more complex heuristics that might be worth a try.

- Some operations that are cheap in PDF, like the more advanced fills, generate many
hundreds of objects in LibreOffice, which can make the document painfully
slow to open. At least some of these can be improved by implementing
more of the Poppler API. Some may require expanding LibreOffice's
set of fill types.

- There can be differences between a distribution's Poppler library build
and the Poppler that LibreOffice builds itself when it doesn't have a distro build
to use, e.g. in LibreOffice's own distributed builds or the bibisect
builds. In particular, the distro builds may include a library
(supporting another embedded image type) that LibreOffice's build doesn't.
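
The fill-rule mismatch in the second item can be made concrete with a small sketch. The Python below is illustrative only and is not LibreOffice or Poppler code; the helper names and the example path are assumptions made for this note. It tests points against a path made of two nested squares drawn in the same direction, which is where the two rules disagree.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]


def _is_left(a: Point, b: Point, p: Point) -> float:
    """Positive if p lies to the left of the directed line a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (p[0] - a[0]) * (b[1] - a[1])


def _ray_stats(p: Point, subpaths: List[Polygon]) -> Tuple[int, int]:
    """Cast a ray from p towards +x; return (winding number, crossing count)."""
    winding = crossings = 0
    for poly in subpaths:
        for i in range(len(poly)):
            a, b = poly[i], poly[(i + 1) % len(poly)]
            if a[1] <= p[1] < b[1] and _is_left(a, b, p) > 0:
                winding += 1      # upward edge crossed by the ray
                crossings += 1
            elif b[1] <= p[1] < a[1] and _is_left(a, b, p) < 0:
                winding -= 1      # downward edge crossed by the ray
                crossings += 1
    return winding, crossings


def inside_nonzero(p: Point, subpaths: List[Polygon]) -> bool:
    """Non-zero winding rule: inside when the signed winding count is not 0."""
    return _ray_stats(p, subpaths)[0] != 0


def inside_evenodd(p: Point, subpaths: List[Polygon]) -> bool:
    """Even-odd rule: inside when the ray crosses an odd number of edges."""
    return _ray_stats(p, subpaths)[1] % 2 == 1


# Two counter-clockwise squares, one nested in the other (hypothetical path).
OUTER = [(0, 0), (10, 0), (10, 10), (0, 10)]
INNER = [(3, 3), (7, 3), (7, 7), (3, 7)]
PATH = [OUTER, INNER]

print(inside_nonzero((5, 5), PATH), inside_evenodd((5, 5), PATH))
```

For the point in the inner square the non-zero rule reports "inside" while the even-odd rule reports a hole, so a clip path like this one would clip differently once converted.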

## Fundamental limitations

- The character ordering of fonts embedded in a PDF is often ASCII, but not always.
Sometimes it is arbitrary. The font may then include a *ToUnicode* map allowing
programs to map the arbitrary index back to Unicode. Alas, not all PDFs
include it, and some even use a bogus map to make it harder to copy/edit.
If the same PDF renders correctly in other readers but copy-and-paste from it
fails, then this is probably the issue.

- PDF can use complex programming in many places; for example, a simple fill
could be composed of a complex program to generate the fill tiles instead
of an obvious simple item that can be encoded as a LibreOffice shading type.
Rendering these down to image tiles works OK but can sometimes end up
with a fuzzy image rather than a nice sharp vector representation.

- Poppler's device interface API is not meant to be stable. The code
thus has lots of ifdefs to deal with different Poppler versions.
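
To illustrate the first item, the sketch below decodes the same run of character codes with a correct *ToUnicode*-style table, with a bogus one, and with none at all. It is a deliberately simplified model, not how Poppler or this code represents fonts: the `decode_codes` helper, the ASCII fallback and the dictionaries are all assumptions made for the example.

```python
from typing import Dict, Iterable, Optional

REPLACEMENT = "\ufffd"  # U+FFFD, shown when a code cannot be mapped


def decode_codes(codes: Iterable[int],
                 to_unicode: Optional[Dict[int, str]] = None) -> str:
    """Map font character codes to text (hypothetical helper)."""
    out = []
    for code in codes:
        if to_unicode is not None:
            # A map is present: trust it, even though it may be bogus.
            out.append(to_unicode.get(code, REPLACEMENT))
        elif 0x20 <= code < 0x7F:
            # No map: guess that the font uses ASCII ordering.
            out.append(chr(code))
        else:
            out.append(REPLACEMENT)
    return "".join(out)


# A subset font whose glyphs for "P", "D", "F" sit at arbitrary codes 1, 2, 3.
codes = [1, 2, 3]
good_map = {1: "P", 2: "D", 3: "F"}
bogus_map = {1: "x", 2: "y", 3: "z"}

print(decode_codes(codes, good_map))
print(decode_codes(codes, bogus_map))
print(decode_codes(codes))
```

In all three cases the page would still render the same glyph shapes; only the recovered text differs, which is why such a PDF can look right and still import or copy as garbage.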

## Structure

Note that the structure is dictated by Poppler being GPL licensed, whereas
LibreOffice isn't.

- *xpdfwrapper/* contains the GPL code that's linked with Poppler
and forms the *xpdfimport* binary. That binary outputs a stream
representing the PDF as simpler operations (lines, clipping operations,
images etc). These form a series of commands on stdout, and binary
data (mostly images) on stderr. This does make adding debugging tricky.

- *wrapper/* contains the LibreOffice glue that execs the *xpdfimport*
binary and parses the stream. It also sets up password entry for
protected PDFs. After parsing the keyword and then any data that
should be with the keyword, this layer then calls into the following
tree layer.

- *tree/* forms internal tree objects for each of the calls from the
wrapper layer. The tree is then 'visited' by optimisation layers
(that do things like assemble individual characters into lines of text)
and then by backend-specific XML generators (e.g. for Draw and Writer)
that then generate an XML stream to be parsed by the core of LibreOffice.
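
The tree layer's two kinds of pass can be pictured with a toy model. The Python below is only a sketch of the idea: the `Glyph`, `TextLine` and `Page` classes, the baseline-tolerance heuristic and the XML element names are all invented for illustration and do not correspond to the real classes or output in *tree/*.

```python
from dataclasses import dataclass, field
from typing import List, Union
from xml.sax.saxutils import escape


@dataclass
class Glyph:
    x: float
    baseline: float
    char: str


@dataclass
class TextLine:
    baseline: float
    text: str


@dataclass
class Page:
    children: List[Union[Glyph, TextLine]] = field(default_factory=list)


def assemble_lines(page: Page, tolerance: float = 0.5) -> None:
    """Optimisation pass: merge glyphs on (nearly) the same baseline."""
    glyphs = [c for c in page.children if isinstance(c, Glyph)]
    others = [c for c in page.children if not isinstance(c, Glyph)]
    groups: List[List[Glyph]] = []
    for g in sorted(glyphs, key=lambda g: (g.baseline, g.x)):
        if groups and abs(groups[-1][0].baseline - g.baseline) <= tolerance:
            groups[-1].append(g)
        else:
            groups.append([g])
    page.children = others + [
        TextLine(grp[0].baseline,
                 "".join(g.char for g in sorted(grp, key=lambda g: g.x)))
        for grp in groups
    ]


def emit_xml(page: Page) -> str:
    """Backend pass: serialise the optimised tree as made-up XML."""
    parts = ["<page>"]
    for child in page.children:
        if isinstance(child, TextLine):
            parts.append(f'<line y="{child.baseline}">{escape(child.text)}</line>')
    parts.append("</page>")
    return "".join(parts)


page = Page([Glyph(10.0, 10.0, "!"), Glyph(0.0, 10.0, "H"),
             Glyph(5.0, 10.2, "i"), Glyph(0.0, 30.0, "X")])
assemble_lines(page)
print(emit_xml(page))
```

Keeping the optimisation pass separate from the emitter means the same cleaned-up tree can feed more than one backend, which is the point of the layering described above.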

## The wrapper protocol

The LibreOffice wrapper talks to the GPL wrapper code over a pipe
using a simple line-based protocol before the main decoding is done.

The commands are:

- *Pmypassword* - Set the password to be used for future attempts to open
the PDF. It can be empty.

- *O* - Open the PDF document using the password. This returns a response
line which is either **#OPEN** when it worked or **#ERROR**. The **#ERROR**
line includes information on the failure, as shown in the examples below.

- *G* - Go, i.e. render the previously opened document.
No more commands are accepted after this point, the structure is dumped
to stdout, and the binary data blobs go to stderr.

- *E* - Exit without doing anything more with the file. Used when you give
up on password attempts.

Some example runs might be:

- A normal unencrypted document:

```
P
O
#OPEN
G
```

- An encrypted document:

```
P
O
#ERROR:2:ENCRYPTED
Psecret
O
#OPEN
G
```

- An encrypted document that we give up on:

```
P
O
#ERROR:2:ENCRYPTED
E
```

- A document with some other error:

```
P
O
#ERROR:1:
E
```

Note we don't rely on the error number in the code.
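
The exchanges above can be sketched as a small client loop. This is a minimal Python illustration, not the code in *wrapper/*: the `open_pdf` function is hypothetical, and it assumes that every command and reply is terminated by a newline and that the text after the second colon distinguishes an encrypted file from other failures. In keeping with the note above, it ignores the error number.

```python
import io
from typing import Iterable, TextIO


def open_pdf(to_child: TextIO, from_child: TextIO,
             passwords: Iterable[str]) -> bool:
    """Try each password in turn; send G on success, E when giving up."""
    for password in passwords:
        to_child.write(f"P{password}\n")   # set the password (may be empty)
        to_child.write("O\n")              # ask the child to open the PDF
        to_child.flush()
        reply = from_child.readline().rstrip("\n")
        if reply == "#OPEN":
            to_child.write("G\n")          # go: no more commands after this
            to_child.flush()
            return True
        if not reply.startswith("#ERROR"):
            raise ValueError(f"unexpected reply: {reply!r}")
        # "#ERROR:<number>:<reason>"; only the reason is looked at (assumption).
        fields = reply.split(":", 2)
        reason = fields[2] if len(fields) > 2 else ""
        if reason != "ENCRYPTED":
            break                          # another password will not help
    to_child.write("E\n")                  # give up on this file
    to_child.flush()
    return False


# Replay the "encrypted document" run with scripted replies.
sent = io.StringIO()
replies = io.StringIO("#ERROR:2:ENCRYPTED\n#OPEN\n")
print(open_pdf(sent, replies, ["", "secret"]))
print(sent.getvalue())
```

Passing the two ends of the pipe as plain text streams keeps the loop testable without starting the real binary; in practice they would be the child's stdin and stdout.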

## Hybrid documents

PDF can contain other files, one use of which is to store the original document
file that was used to generate the PDF.

TBD: Once I figure out how it works.

## Bug handling

- Please tag bugs with *filter:pdf* in component *filters and storage*.

- The *pdfseparate* utility, which is part of Poppler, is useful for splitting
a PDF into individual pages to figure out which page is causing a crash
or hang, or for shrinking the problem down.

- [qpdf](https://github.com/qpdf/qpdf) is useful for editing raw PDF
files to really cut down the number of primitives, but takes some
getting used to.

- The *xpdfimport* binary can be run independently of the rest of LibreOffice
to allow the translated stream to be examined:

      ./instdir/program/xpdfimport problem.pdf < /dev/null > stream 2> binarystream