author | Thomas Deutschmann <whissi@gentoo.org> | 2021-03-30 10:59:39 +0200 |
---|---|---|
committer | Thomas Deutschmann <whissi@gentoo.org> | 2021-04-01 00:04:14 +0200 |
commit | 5ff1d6955496b3cf9a35042c9ac35db43bc336b1 (patch) | |
tree | 6d470f7eb448f59f53e8df1010aec9dad8ce1f72 /doc/Devices.htm | |
parent | Import Ghostscript 9.53.1 (diff) | |
Import Ghostscript 9.54 (ref: ghostscript-9.54)
Signed-off-by: Thomas Deutschmann <whissi@gentoo.org>
Diffstat (limited to 'doc/Devices.htm')
-rw-r--r-- | doc/Devices.htm | 84 |
1 files changed, 68 insertions, 16 deletions
diff --git a/doc/Devices.htm b/doc/Devices.htm
index b8be9431..c1048e59 100644
--- a/doc/Devices.htm
+++ b/doc/Devices.htm
@@ -38,7 +38,6 @@
  <li><a href="https://www.ghostscript.com/">Home</a></li>
  <li><a href="https://www.ghostscript.com/license.html">Licensing</a></li>
  <li><a href="https://www.ghostscript.com/releases.html">Releases</a></li>
- <li><a href="https://www.ghostscript.com/release_history.html">Release History</a></li>
  <li><a href="https://www.ghostscript.com/documentation.html" title="Documentation">Documentation</a></li>
  <li><a href="https://www.ghostscript.com/download.html" title="Download">Download</a></li>
  <li><a href="https://www.ghostscript.com/performance.html" title="Performance">Performance</a></li>
@@ -71,13 +70,18 @@
 <li><a href="#BMP">BMP file format</a></li>
 <li><a href="#PCX">PCX file format</a></li>
 <li><a href="#PSD">PSD file format (DeviceN color model)</a></li>
+<li><a href="#PDFimage">Bitmap PDF output, PCLm output</a></li>
+</ul>
+<li><a href="#OCR-Devices">OCR Devices</a></li>
+<ul>
+<li><a href="#OCR">OCR text output</a></li>
+<li><a href="#PDFocr">Bitmap PDF output (with OCR text)</a></li>
+<li><a href="#PDFwriteocr">Vector PDF output (with OCR Unicode CMaps)</a></li>
 </ul>
 <li><a href="#High-level">High level formats</a></li>
 <ul>
 <li><a href="#PDF">PDF file output</a></li>
-<li><a href="#PDFimage">Bitmap PDF output, PCLm output</a></li>
 <li><a href="#OCR">OCR devices</a></li>
-<li><a href="#PDFocr">Bitmap PDF output (with OCR text)</a></li>
 <li><a href="#PS">PostScript file output</a></li>
 <li><a href="#EPS">EPS file output</a></li>
 <li><a href="#PXL">PCL-XL file output</a></li>
@@ -955,7 +959,11 @@ of 'high-level' formats.
 These allow Ghostscript to preserve (as much as possible) the drawing elements
 of the input file maintaining flexibility,
 resolution independence, and editability.</p>
-<h3><a name="OCR"></a>Optical Character Recognition (OCR) output</h3>
+<hr>
+
+<h2><a name="OCR-Devices"></a>Optical Character Recognition (OCR) devices</h2>
+
+<h3><a name="OCR"></a>OCR text output</h3>
 <p>
  These devices render internally in 8 bit greyscale, and then
@@ -967,18 +975,29 @@ resolution independence, and editability.</p>
 <p>
  The Tesseract engine relies on files to encapsulate each language
  and/or script. These "traineddata" files
- are available in different forms, including <a href="github.com/tesseract-ocr/tessdata_fast">fast</a>
- and <a href="tesseract-ocr/tessdata_best">best</a> variants.
+ are available in different forms, including <a href="http://github.com/tesseract-ocr/tessdata_fast">fast</a>
+ and <a href="http://github.com/tesseract-ocr/tessdata_best">best</a> variants.
  Alternatively, people can train their own data using the standard
  Tesseract tools.
 </p>
 <p>
- These files are looked for from a variety of places. Firstly,
- any files placed in "Resource/Tesseract/" will be
- included in the binary for any standard (COMPILE_INITS=1) build.
- Secondly, files will be searched for in the current directory.
- Thirdly, files will be searched for in the directory given by
- the environment variable TESSDATA_PREFIX.
+ These files are looked for from a variety of places.
+</p>
+<ul>
+ <li>Firstly, files will be searched for in the directory given by the
+ environment variable TESSDATA_PREFIX.
+ <li>Next, they will be searched for within the ROM filing system. Any
+ files placed in "tessdata" will be included within the ROM
+ filing system in the binary for any standard (COMPILE_INITS=1) build.
+ <li>Next, files will be searched for in the configured 'tessdata' path. On
+ Unix, this can be specified at the configure stage using
+ '--with-tessdata=<path>' (where <path> is a list of
+ directories to search, separated by ':' (on Unix) or ';' (on Windows)).
+ <li>Finally, we resort to searching the current directory.
+</ul>
+<p>
+ Please note, this pattern of directory searching differs from the original
+ release of the OCR devices.
 </p>
 <p>
  By default, the OCR process defaults to looking for English text,
@@ -993,7 +1012,7 @@ resolution independence, and editability.</p>
  Arabic:</dd></dl>
 <blockquote>
 <pre>
- <kbd>gs -sDEVICE=ocr -r200 -sOCRLanguage="eng,ara" -o out.txt\
+ <kbd>gs -sDEVICE=ocr -r200 -sOCRLanguage="eng+ara" -o out.txt\
  zlib/zlib.3.pdf</kbd>
 </pre>
 </blockquote>
@@ -1041,6 +1060,39 @@ resolution independence, and editability.</p>
 </p>
 <p>
+<h3><a name="PDFwriteocr"></a>Vector PDF output (with OCR Unicode CMaps)</h3>
+<p>
+The pdfwrite device has been augmented to use the OCR engine to analyse text
+(not images!) in the input stream, and derive Unicode code points for it.
+That information can then be used to create ToUnicode CMaps which are attached
+to the Font (or CIDFont) objects embedded in the PDF file.
+</p>
+<p>
+Fonts which have ToUnicode CMaps can be reliably (limited by the accuracy of
+the CMap) used in search and copy/paste functions, as well as text extraction
+from PDF files. Note that OCR is not a 100% perfect process; it is possible
+that some text might be misidentified.
+</p>
+<p>
+OCR is a slow operation! In addition it can (for Latin text at least) sometimes
+be preferable not to add ToUnicode information which may be incorrect, but instead
+to use the existing font Encoding. For English text this may give better results.
+</p>
+<p>For these reasons the OCR functionality of pdfwrite can be controlled by using a new
+parameter <code>-sUseOCR</code>. This has three possible values;
+</p>
+<dt><code>-sUseOCR=</code><b><em>string</em></b></dt>
+<dd>
+ <dl>
+ <dt>Never<dd>Default - don't use OCR at all even if support is built-in.
+ <dt>AsNeeded<dd>If there is no existing ToUnicode information, use OCR.
+ <dt>Always<dd>Ignore any existing information and always use OCR.
+ </dl>
+</dd>
+</p>
+
+<hr>
+
 <h2><a name="High-level"></a>High-level devices</h2>
 <h3><a name="PDF"></a>PDF writer</h3>
@@ -2081,7 +2133,7 @@ spot colors.</p>
 <hr>
 <p>
-<small>Copyright © 2000-2020 Artifex Software, Inc. All rights reserved.</small>
+<small>Copyright © 2000-2021 Artifex Software, Inc. All rights reserved.</small>
 <p>
 This software is provided AS-IS with no warranty, either express or
@@ -2094,7 +2146,7 @@ or contact Artifex Software, Inc., 1305 Grant Avenue - Suite 200,
 Novato, CA 94945, U.S.A., +1(415)492-9861, for further information.
 <p>
-<small>Ghostscript version 9.53.1, 14 September 2020
+<small>Ghostscript version 9.54.0, 30 March 2021
 <!-- [3.0 end visible trailer] ============================================= -->
@@ -2122,7 +2174,7 @@ Novato, CA 94945, U.S.A., +1(415)492-9861, for further information.
 </ul>
 </div>
 <div class="col-ft-3 footright"><img src="images/Artifex_logo.png" width="194" height="40" alt=""/> <br>
- © Copyright 2019 Artifex Software, Inc. <br>
+ © Copyright 2019-2021 Artifex Software, Inc. <br>
  All rights reserved. </div>
 </div>
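Reviewer's note: the revised "traineddata" search order documented in this patch (TESSDATA_PREFIX, then the ROM filing system, then the configured 'tessdata' path, then the current directory) can be sketched roughly as follows. This is a minimal Python illustration, not Ghostscript's actual lookup code. The function name `find_traineddata`, its parameters, and the modelling of the ROM filing system as a plain collection of file names are all assumptions made for the example. The `<language>.traineddata` file naming follows the usual Tesseract convention.

```python
import os
from pathlib import Path


def find_traineddata(language, env=None, rom_files=(), configured_path="", cwd="."):
    """Illustrative lookup that follows the documented search order.

    Returns a (source, location) tuple, or None if no file is found.
    Every name here is hypothetical; this is not Ghostscript's API.
    """
    env = os.environ if env is None else env
    name = f"{language}.traineddata"

    # 1. The directory given by the TESSDATA_PREFIX environment variable.
    prefix = env.get("TESSDATA_PREFIX")
    if prefix and (Path(prefix) / name).is_file():
        return ("TESSDATA_PREFIX", str(Path(prefix) / name))

    # 2. The ROM filing system, modelled here as a collection of embedded names.
    if name in rom_files:
        return ("rom", name)

    # 3. The configured 'tessdata' path: a list of directories separated by
    #    ':' on Unix or ';' on Windows (os.pathsep follows the same convention).
    for directory in filter(None, configured_path.split(os.pathsep)):
        candidate = Path(directory) / name
        if candidate.is_file():
            return ("configured", str(candidate))

    # 4. Finally, the current directory.
    candidate = Path(cwd) / name
    if candidate.is_file():
        return ("cwd", str(candidate))

    return None
```

Calling `find_traineddata("eng", configured_path="/usr/share/tessdata")` (the directory is only an example) would consult TESSDATA_PREFIX first and fall through the remaining locations in order, which is the behavioural change this patch documents relative to the original release of the OCR devices.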