author Thomas Deutschmann <whissi@gentoo.org> 2021-03-30 10:59:39 +0200
committer Thomas Deutschmann <whissi@gentoo.org> 2021-04-01 00:04:14 +0200
commit 5ff1d6955496b3cf9a35042c9ac35db43bc336b1
tree 6d470f7eb448f59f53e8df1010aec9dad8ce1f72 /doc/Devices.htm
parent Import Ghostscript 9.53.1
Import Ghostscript 9.54 (ref: ghostscript-9.54)
Signed-off-by: Thomas Deutschmann <whissi@gentoo.org>
Diffstat (limited to 'doc/Devices.htm')
-rw-r--r-- doc/Devices.htm 84
1 files changed, 68 insertions, 16 deletions
diff --git a/doc/Devices.htm b/doc/Devices.htm
index b8be9431..c1048e59 100644
--- a/doc/Devices.htm
+++ b/doc/Devices.htm
@@ -38,7 +38,6 @@
<li><a href="https://www.ghostscript.com/">Home</a></li>
<li><a href="https://www.ghostscript.com/license.html">Licensing</a></li>
<li><a href="https://www.ghostscript.com/releases.html">Releases</a></li>
- <li><a href="https://www.ghostscript.com/release_history.html">Release History</a></li>
<li><a href="https://www.ghostscript.com/documentation.html" title="Documentation">Documentation</a></li>
<li><a href="https://www.ghostscript.com/download.html" title="Download">Download</a></li>
<li><a href="https://www.ghostscript.com/performance.html" title="Performance">Performance</a></li>
@@ -71,13 +70,18 @@
<li><a href="#BMP">BMP file format</a></li>
<li><a href="#PCX">PCX file format</a></li>
<li><a href="#PSD">PSD file format (DeviceN color model)</a></li>
+<li><a href="#PDFimage">Bitmap PDF output, PCLm output</a></li>
+</ul>
+<li><a href="#OCR-Devices">OCR Devices</a></li>
+<ul>
+<li><a href="#OCR">OCR text output</a></li>
+<li><a href="#PDFocr">Bitmap PDF output (with OCR text)</a></li>
+<li><a href="#PDFwriteocr">Vector PDF output (with OCR Unicode CMaps)</a></li>
</ul>
<li><a href="#High-level">High level formats</a></li>
<ul>
<li><a href="#PDF">PDF file output</a></li>
-<li><a href="#PDFimage">Bitmap PDF output, PCLm output</a></li>
<li><a href="#OCR">OCR devices</a></li>
-<li><a href="#PDFocr">Bitmap PDF output (with OCR text)</a></li>
<li><a href="#PS">PostScript file output</a></li>
<li><a href="#EPS">EPS file output</a></li>
<li><a href="#PXL">PCL-XL file output</a></li>
@@ -955,7 +959,11 @@ of 'high-level' formats. These allow Ghostscript to preserve (as much as
possible) the drawing elements of the input file maintaining flexibility,
resolution independence, and editability.</p>
-<h3><a name="OCR"></a>Optical Character Recognition (OCR) output</h3>
+<hr>
+
+<h2><a name="OCR-Devices"></a>Optical Character Recognition (OCR) devices</h2>
+
+<h3><a name="OCR"></a>OCR text output</h3>
<p>
These devices render internally in 8 bit greyscale, and then
@@ -967,18 +975,29 @@ resolution independence, and editability.</p>
<p>
The Tesseract engine relies on files to encapsulate each
language and/or script. These &quot;traineddata&quot; files
- are available in different forms, including <a href="github.com/tesseract-ocr/tessdata_fast">fast</a>
- and <a href="tesseract-ocr/tessdata_best">best</a> variants.
+ are available in different forms, including <a href="http://github.com/tesseract-ocr/tessdata_fast">fast</a>
+ and <a href="http://github.com/tesseract-ocr/tessdata_best">best</a> variants.
Alternatively, people can train their own data using the
standard Tesseract tools.
</p>
<p>
- These files are looked for from a variety of places. Firstly,
- any files placed in &quot;Resource/Tesseract/&quot; will be
- included in the binary for any standard (COMPILE_INITS=1) build.
- Secondly, files will be searched for in the current directory.
- Thirdly, files will be searched for in the directory given by
- the environment variable TESSDATA_PREFIX.
+ These files are looked for from a variety of places.
+</p>
+<ul>
+ <li>Firstly, files will be searched for in the directory given by the
+ environment variable TESSDATA_PREFIX.
+ <li>Next, they will be searched for within the ROM filing system. Any
+ files placed in &quot;tessdata&quot; will be included within the ROM
+ filing system in the binary for any standard (COMPILE_INITS=1) build.
+ <li>Next, files will be searched for in the configured 'tessdata' path. On
+ Unix, this can be specified at the configure stage using
+ '--with-tessdata=&lt;path&gt;' (where &lt;path&gt; is a list of
+ directories to search, separated by ':' (on Unix) or ';' (on Windows)).
+ <li>Finally, we resort to searching the current directory.
+</ul>
+<p>
+ Please note, this pattern of directory searching differs from the original
+ release of the OCR devices.
</p>
<p>
By default, the OCR process defaults to looking for English text,
@@ -993,7 +1012,7 @@ resolution independence, and editability.</p>
Arabic:</dd></dl>
<blockquote>
<pre>
- <kbd>gs -sDEVICE=ocr -r200 -sOCRLanguage="eng,ara" -o out.txt\
+ <kbd>gs -sDEVICE=ocr -r200 -sOCRLanguage="eng+ara" -o out.txt\
zlib/zlib.3.pdf</kbd>
</pre>
</blockquote>
@@ -1041,6 +1060,39 @@ resolution independence, and editability.</p>
</p>
<p>
+<h3><a name="PDFwriteocr"></a>Vector PDF output (with OCR Unicode CMaps)</h3>
+<p>
+The pdfwrite device has been augmented to use the OCR engine to analyse text
+(not images!) in the input stream, and derive Unicode code points for it.
+That information can then be used to create ToUnicode CMaps which are attached
+to the Font (or CIDFont) objects embedded in the PDF file.
+</p>
+<p>
+Fonts which have ToUnicode CMaps can be reliably (limited by the accuracy of
+the CMap) used in search and copy/paste functions, as well as text extraction
+from PDF files. Note that OCR is not a 100% perfect process; it is possible
+that some text might be misidentified.
+</p>
+<p>
+OCR is a slow operation! In addition it can (for Latin text at least) sometimes
+be preferable not to add ToUnicode information which may be incorrect, but instead
+to use the existing font Encoding. For English text this may give better results.
+</p>
+<p>For these reasons the OCR functionality of pdfwrite can be controlled by using a new
+parameter <code>-sUseOCR</code>. This has three possible values;
+</p>
+<dt><code>-sUseOCR=</code><b><em>string</em></b></dt>
+<dd>
+ <dl>
+ <dt>Never<dd>Default - don't use OCR at all even if support is built-in.
+ <dt>AsNeeded<dd>If there is no existing ToUnicode information, use OCR.
+ <dt>Always<dd>Ignore any existing information and always use OCR.
+ </dl>
+</dd>
+</p>
+
+<hr>
+
<h2><a name="High-level"></a>High-level devices</h2>
<h3><a name="PDF"></a>PDF writer</h3>
@@ -2081,7 +2133,7 @@ spot colors.</p>
<hr>
<p>
-<small>Copyright &copy; 2000-2020 Artifex Software, Inc. All rights reserved.</small>
+<small>Copyright &copy; 2000-2021 Artifex Software, Inc. All rights reserved.</small>
<p>
This software is provided AS-IS with no warranty, either express or
@@ -2094,7 +2146,7 @@ or contact Artifex Software, Inc., 1305 Grant Avenue - Suite 200,
Novato, CA 94945, U.S.A., +1(415)492-9861, for further information.
<p>
-<small>Ghostscript version 9.53.1, 14 September 2020
+<small>Ghostscript version 9.54.0, 30 March 2021
<!-- [3.0 end visible trailer] ============================================= -->
@@ -2122,7 +2174,7 @@ Novato, CA 94945, U.S.A., +1(415)492-9861, for further information.
</ul>
</div>
<div class="col-ft-3 footright"><img src="images/Artifex_logo.png" width="194" height="40" alt=""/> <br>
- © Copyright 2019 Artifex Software, Inc. <br>
+ © Copyright 2019-2021 Artifex Software, Inc. <br>
All rights reserved.
</div>
</div>
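
The documentation added in this diff describes the new -sUseOCR parameter for pdfwrite and the revised traineddata search order, which now checks TESSDATA_PREFIX first. The sketch below shows how the two might be combined on one command line. It is an illustration, not a tested recipe: the directory /usr/share/tessdata and the file names in.pdf and out.pdf are placeholders, and it assumes a Ghostscript build with OCR support. The script only prints the command, so it can be read and tried without gs installed.

```shell
#!/bin/sh
# Sketch: compose a pdfwrite invocation that uses OCR to derive ToUnicode CMaps.
# All paths and file names here are illustrative placeholders.

USE_OCR="AsNeeded"                  # one of: Never (default), AsNeeded, Always
TESSDATA_DIR="/usr/share/tessdata"  # hypothetical directory holding *.traineddata files

build_gs_command() {
    # $1 = input file, $2 = output file
    # TESSDATA_PREFIX is the first location searched for traineddata files.
    printf 'TESSDATA_PREFIX=%s gs -sDEVICE=pdfwrite -sUseOCR=%s -o %s %s\n' \
        "$TESSDATA_DIR" "$USE_OCR" "$2" "$1"
}

# Print the command instead of executing it.
build_gs_command in.pdf out.pdf
```

Setting USE_OCR to Always would make pdfwrite ignore existing ToUnicode information. As the added text notes, that is slower and may give worse results for Latin text than the existing font Encoding.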