author Thomas Deutschmann <whissi@gentoo.org> 2020-09-10 18:10:49 +0200
committer Thomas Deutschmann <whissi@gentoo.org> 2020-09-11 20:06:36 +0200
commit acfc02c1747065fe450c7cfeb6f1844b62335f08
tree 5887806a2e6b99bbb0255e013a9028810e230a7f /doc/Devices.htm
parent Import Ghostscript 9.52
Import Ghostscript 9.53 (ghostscript-9.53)
Signed-off-by: Thomas Deutschmann <whissi@gentoo.org>
Diffstat (limited to 'doc/Devices.htm')
-rw-r--r-- doc/Devices.htm 103
1 files changed, 91 insertions, 12 deletions
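
The command line documented in the patch below can also be assembled programmatically. The following is a minimal sketch, not part of Ghostscript: the helper names `build_ocr_command` and `ocr_environment` are hypothetical, and only the switches that appear in the patch (`-sDEVICE=`, `-r`, `-sOCRLanguage=`, `-o`) and the `TESSDATA_PREFIX` environment variable are used. It builds the argument list without running anything.

```python
import os
import shlex


def build_ocr_command(input_path, output_path, device="ocr",
                      languages=("eng",), resolution=200, gs_binary="gs"):
    """Return the argument list for a Ghostscript OCR run (hypothetical helper).

    device may be any device named in the patch: ocr, hocr,
    pdfocr8, pdfocr24 or pdfocr32.
    """
    if not languages:
        raise ValueError("at least one language is required")
    return [
        gs_binary,
        f"-sDEVICE={device}",
        f"-r{resolution}",
        # Several trained data sets are joined with commas, e.g. "eng,ara".
        "-sOCRLanguage=" + ",".join(languages),
        "-o", output_path,
        input_path,
    ]


def ocr_environment(tessdata_dir, base_env=None):
    """Copy an environment and point TESSDATA_PREFIX at the traineddata files."""
    env = dict(os.environ if base_env is None else base_env)
    env["TESSDATA_PREFIX"] = tessdata_dir
    return env


# Same settings as the example in the patch: English and Arabic at 200 dpi.
cmd = build_ocr_command("zlib/zlib.3.pdf", "out.txt", languages=("eng", "ara"))
print(shlex.join(cmd))
```

To actually run the conversion, the list could be passed to `subprocess.run(cmd, env=ocr_environment(...), check=True)`, assuming a Ghostscript build that includes the OCR devices and that the traineddata files are in the given directory.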
diff --git a/doc/Devices.htm b/doc/Devices.htm
index 166c4080..921211a6 100644
--- a/doc/Devices.htm
+++ b/doc/Devices.htm
@@ -1,15 +1,6 @@
<!doctype html>
<html>
<head>
-<!-- Global site tag (gtag.js) - Google Analytics -->
-<script async src="https://www.googletagmanager.com/gtag/js?id=UA-54391264-2"></script>
-<script>
- window.dataLayer = window.dataLayer || [];
- function gtag(){dataLayer.push(arguments);}
- gtag('js', new Date());
-
- gtag('config', 'UA-54391264-2');
-</script>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link href="https://fonts.googleapis.com/css?family=Source+Sans+Pro" rel="stylesheet">
@@ -85,12 +76,14 @@
<ul>
<li><a href="#PDF">PDF file output</a></li>
<li><a href="#PDFimage">Bitmap PDF output, PCLm output</a></li>
+<li><a href="#OCR">OCR devices</a></li>
+<li><a href="#PDFocr">Bitmap PDF output (with OCR text)</a></li>
<li><a href="#PS">PostScript file output</a></li>
<li><a href="#EPS">EPS file output</a></li>
<li><a href="#PXL">PCL-XL file output</a></li>
<li><a href="#TXT">Text output</a></li>
</ul>
-<li><a href="#Dis play_devices">Display devices</a></li>
+<li><a href="#Display_devices">Display devices</a></li>
<ul>
<li><a href="#x11_devices">X Window System</a></li>
<li><a href="#display_device">display device (MS Windows, OS/2, gtk+)</a></li>
@@ -962,6 +955,92 @@ of 'high-level' formats. These allow Ghostscript to preserve (as much as
possible) the drawing elements of the input file maintaining flexibility,
resolution independence, and editability.</p>
+<h3><a name="OCR"></a>Optical Character Recognition (OCR) output</h3>
+
+<p>
+ These devices render internally in 8-bit greyscale, and then
+ feed the resultant image into an OCR engine. Currently, we
+ are using the Tesseract engine. Not only is this both free
+ and open source, it gives very good results, and supports
+ a huge number of languages/scripts.
+</p>
+<p>
+ The Tesseract engine relies on files to encapsulate each
+ language and/or script. These &quot;traineddata&quot; files
+ are available in different forms, including <a href="https://github.com/tesseract-ocr/tessdata_fast">fast</a>
+ and <a href="https://github.com/tesseract-ocr/tessdata_best">best</a> variants.
+ Alternatively, people can train their own data using the
+ standard Tesseract tools.
+</p>
+<p>
+ These files are searched for in several places. Firstly,
+ any files placed in &quot;Resource/Tesseract/&quot; will be
+ included in the binary for any standard (COMPILE_INITS=1) build.
+ Secondly, files will be searched for in the current directory.
+ Thirdly, files will be searched for in the directory given by
+ the environment variable TESSDATA_PREFIX.
+</p>
+<p>
+ By default, the OCR process looks for English text,
+ using &quot;eng.traineddata&quot;. This can be changed by using the
+ <code>-sOCRLanguage=</code> switch:
+</p>
+<blockquote>
+<dl>
+<dt><code>-sOCRLanguage=</code><b><em>language</em></b></dt>
+<dd>This sets the trained data sets to use within the Tesseract
+ OCR engine. For example, the following will use English and
+ Arabic:</dd></dl>
+<blockquote>
+<pre>
+ <kbd>gs -sDEVICE=ocr -r200 -sOCRLanguage="eng,ara" -o out.txt \
+ zlib/zlib.3.pdf</kbd>
+</pre>
+</blockquote>
+</blockquote>
+<p>
+ The first device is named ocr. It extracts data as Unicode codepoints
+ and outputs them to the device as a stream of UTF-8 bytes.
+</p>
+<p>
+ The second device is named hocr. This extracts the data in
+ <a href="https://en.wikipedia.org/wiki/HOCR">hOCR</a> format.
+</p>
+<p>
+ These devices are implemented as downscaling devices, so the
+ standard parameters can be used to control this process. It
+ may seem strange to use downscaling on an image that is not
+ actually going to be output, but there are actually good reasons
+ for this. Firstly, the higher the resolution, the slower the
+ OCR process. Secondly, the way the Tesseract OCR engine works
+ means that an anti-aliased image performs broadly as well as the
+ super-sampled image from which it came.
+</p>
+
+<h3><a name="PDFocr"></a>PDF image output (with OCR text)</h3>
+
+<p>
+ These devices perform the same render-to-bitmap and wrap-as-PDF process as
+ the <a href="#PDFimage">PDFimage</a> devices above, but with the addition
+ of an OCR step at the end. The OCR'd text is overlaid &quot;invisibly&quot;
+ over the images, so searching and cut/paste should still work.
+</p>
+<p>
+ The OCR engine being used is Tesseract. For information on this,
+ including how to control what language data is used, see the <a href="#OCR">
+ OCR devices</a> section above.
+</p>
+<p>
+ There are three devices named pdfocr8, pdfocr24 and pdfocr32. These
+ produce valid PDF files with a colour depth of 8 (Gray), 24 (RGB) or
+ 32 (CMYK).
+</p>
+<p>
+ These devices accept all the same flags as the <a href="#PDFimage">PDFimage</a>
+ devices described above.
+</p>
+<p>
+
<h2><a name="High-level"></a>High-level devices</h2>
<h3><a name="PDF"></a>PDF writer</h3>
@@ -2002,7 +2081,7 @@ spot colors.</p>
<hr>
<p>
-<small>Copyright &copy; 2000-2019 Artifex Software, Inc. All rights reserved.</small>
+<small>Copyright &copy; 2000-2020 Artifex Software, Inc. All rights reserved.</small>
<p>
This software is provided AS-IS with no warranty, either express or
@@ -2015,7 +2094,7 @@ or contact Artifex Software, Inc., 1305 Grant Avenue - Suite 200,
Novato, CA 94945, U.S.A., +1(415)492-9861, for further information.
<p>
-<small>Ghostscript version 9.52, 19 March 2020
+<small>Ghostscript version 9.53.0, 10 September 2020
<!-- [3.0 end visible trailer] ============================================= -->