Setting up an OCR environment in Java on macOS with tess4j-4.4.1 (including Tesseract installation and configuration)

Author: admin  Category: Environment setup  Published: 2019-10-21 14:14  Views: 81

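I have recently been collecting material from PDFs, but most of them are scanned copies, so the text cannot be copied directly. Typing it out by hand is tedious, so I looked for a technical shortcut. The trial versions of a few OCR programs I tried gave impressive results, so I decided to set up a Java OCR environment and see whether it can reduce the workload.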

OCR (Optical Character Recognition)

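OCR is the process in which an electronic device (such as a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then uses character recognition to translate those shapes into computer text. In other words, for printed characters, the text of a paper document is optically converted into a black-and-white bitmap image, and recognition software then converts the text in that image into a text format that word-processing software can edit further.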

Tess4J

Tess4J is a Java JNA wrapper for the Tesseract OCR API. It lets Java code use Tesseract OCR by calling the Tess4J API. Supported formats: TIFF, JPEG, GIF, PNG, BMP, and PDF.

The Tesseract OCR engine

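Development began at HP Labs in 1985, and by 1995 Tesseract had become one of the three most accurate recognition engines in the OCR industry. However, HP soon decided to abandon its OCR business, and Tesseract was shelved. Years later, HP realized that rather than leaving Tesseract on the shelf, it would be better to contribute it to the open-source community and give it a new life. In 2005 Tesseract was handed over to the Information Science Research Institute at the University of Nevada, Las Vegas, and Google took on the work of improving and optimizing it.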

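Tesseract was published as an open-source project on Google Code and is now hosted on GitHub. Combined with the Leptonica image-processing library, it can read images in a wide range of formats and convert them to text in more than 60 languages. You can also keep training your own data to steadily improve recognition. If a team needs to go deeper, it can use Tesseract as a base for building an OCR engine tailored to its own requirements.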

Download links
Tesseract: https://github.com/tesseract-ocr/tesseract

Tess4J: https://github.com/nguyenq/tess4j

Tess4J configuration

1. Add the dependency to pom.xml

<!-- https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j -->
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.4.1</version>
    <scope>test</scope>
</dependency>

2. A simple Java example

package com.chl.test.orc;

import java.io.File;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class Tess4jOcrTest {

    public static void main(String[] args) {
        String base = "/Users/chenhailong/Desktop/pdf/";
        test1(base + "test.png");
    }

    /**
     * Runs OCR on the image at the given path and prints the recognized text.
     * @param path
     */
    public static void test1(String path) {
        File file = new File(path);
        ITesseract it = new Tesseract();
        try {
            String result = it.doOCR(file);
            System.out.println("OCR result: " + result);
        } catch (TesseractException e) {
            // OCR failed; print the stack trace for diagnosis
            e.printStackTrace();
        }
    }
}

3. Right-click and run the code above; the following error appears

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/chenhailong/.m2/repository/org/slf4j/slf4j-log4j12/1.7.25/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/chenhailong/.m2/repository/ch/qos/logback/logback-classic/1.2.3/logback-classic-1.2.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Detected both log4j-over-slf4j.jar AND bound slf4j-log4j12.jar on the class path, preempting StackOverflowError. 
SLF4J: See also http://www.slf4j.org/codes.html#log4jDelegationLoop for more details.
Exception in thread "main" java.lang.ExceptionInInitializerError
    at org.slf4j.impl.StaticLoggerBinder.<init>(StaticLoggerBinder.java:72)
    at org.slf4j.impl.StaticLoggerBinder.<clinit>(StaticLoggerBinder.java:45)
    at org.slf4j.LoggerFactory.bind(LoggerFactory.java:150)
    at org.slf4j.LoggerFactory.performInitialization(LoggerFactory.java:124)
    at org.slf4j.LoggerFactory.getILoggerFactory(LoggerFactory.java:412)
    at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:357)
    at net.sourceforge.tess4j.Tesseract.<clinit>(Tesseract.java:78)
    at com.chl.test.orc.Tess4jOcrTest.test1(Tess4jOcrTest.java:22)
    at com.chl.test.orc.Tess4jOcrTest.main(Tess4jOcrTest.java:13)
Caused by: java.lang.IllegalStateException: Detected both log4j-over-slf4j.jar AND bound slf4j-log4j12.jar on the class path, preempting StackOverflowError. See also http://www.slf4j.org/codes.html#log4jDelegationLoop for more details.
    at org.slf4j.impl.Log4jLoggerFactory.<clinit>(Log4jLoggerFactory.java:54)
    ... 9 more

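The log shows the cause: slf4j-log4j12 is already on the classpath, and tess4j pulls in log4j-over-slf4j, which cannot coexist with it. The fix is to exclude the conflicting package: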

<!-- https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j -->
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.4.1</version>
    <scope>test</scope>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>log4j-over-slf4j</artifactId>
        </exclusion>
    </exclusions>
</dependency>
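
To confirm which jars are involved before and after adding the exclusion, a small classpath probe can help. This is only a sketch using the JDK; the class name `ClasspathProbe` is made up for illustration, and the two resource names are taken from the SLF4J messages in the log above (SLF4J 1.7.x bindings ship `StaticLoggerBinder`, and both log4j and log4j-over-slf4j provide `org.apache.log4j.Logger`).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Tess4J or SLF4J.
class ClasspathProbe {

    /** Returns every classpath location that provides the given resource. */
    static List<URL> locate(String resourceName) {
        try {
            ClassLoader loader = ClassLoader.getSystemClassLoader();
            return Collections.list(loader.getResources(resourceName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // More than one hit here corresponds to "multiple SLF4J bindings".
        for (URL url : locate("org/slf4j/impl/StaticLoggerBinder.class")) {
            System.out.println("SLF4J binding: " + url);
        }
        // If a log4j-over-slf4j jar shows up here next to slf4j-log4j12,
        // the delegation-loop error from the log is expected.
        for (URL url : locate("org/apache/log4j/Logger.class")) {
            System.out.println("log4j API provider: " + url);
        }
    }
}
```

Run it with the same classpath as the OCR test. After the exclusion takes effect, the log4j-over-slf4j jar should no longer appear in the second list.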

4. Run it again; the following error appears

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/chenhailong/.m2/repository/org/slf4j/slf4j-log4j12/1.7.25/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/chenhailong/.m2/repository/ch/qos/logback/logback-classic/1.2.3/logback-classic-1.2.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "main" java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract':
dlopen(libtesseract.dylib, 9): image not found
dlopen(libtesseract.dylib, 9): image not found
Native library (darwin/libtesseract.dylib) not found in resource path ([file:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/resources.jar, file:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Con... file:/Users/chenhailong/.m2/repository/org/apache/pdfbox/jbig2-imageio/3.0.2/jbig2-imageio-3.0.2.jar, file:/Users/chenhailong/.m2/repository/net/sourceforge/lept4j/lept4j/1.12.3/lept4j-1.12.3.jar, file:/Users/chenhailong/.m2/repository/org/jboss/jboss-vfs/3.2.14.Final/jboss-vfs-3.2.14.Final.jar, file:/Users/chenhailong/.m2/repository/ch/qos/logback/logback-classic/1.2.3/logback-classic-1.2.3.jar, file:/Users/chenhailong/.m2/repository/ch/qos/logback/logback-core/1.2.3/logback-core-1.2.3.jar, file:/Users/chenhailong/.m2/repository/org/slf4j/jul-to-slf4j/1.7.28/jul-to-slf4j-1.7.28.jar])
    at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:302)
    at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:455)
    at com.sun.jna.Library$Handler.<init>(Library.java:192)
    at com.sun.jna.Native.loadLibrary(Native.java:646)
    at com.sun.jna.Native.loadLibrary(Native.java:630)
    at net.sourceforge.tess4j.util.LoadLibs.getTessAPIInstance(LoadLibs.java:85)
    at net.sourceforge.tess4j.TessAPI.<clinit>(TessAPI.java:42)
    at net.sourceforge.tess4j.Tesseract.init(Tesseract.java:427)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:223)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:195)
    at com.chl.test.orc.Tess4jOcrTest.test1(Tess4jOcrTest.java:24)
    at com.chl.test.orc.Tess4jOcrTest.main(Tess4jOcrTest.java:13)
    Suppressed: java.lang.UnsatisfiedLinkError: dlopen(libtesseract.dylib, 9): image not found
        at com.sun.jna.Native.open(Native Method)
        at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:191)
        ... 11 more
    Suppressed: java.lang.UnsatisfiedLinkError: dlopen(libtesseract.dylib, 9): image not found
        at com.sun.jna.Native.open(Native Method)
        at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:204)
        ... 11 more
    Suppressed: java.io.IOException: Native library (darwin/libtesseract.dylib) not found in resource path ([file:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/resources.jar, ... file:/Users/chenhailong/.m2/repository/ch/qos/logback/logback-core/1.2.3/logback-core-1.2.3.jar, file:/Users/chenhailong/.m2/repository/org/slf4j/jul-to-slf4j/1.7.28/jul-to-slf4j-1.7.28.jar])
        at com.sun.jna.Native.extractFromResourcePath(Native.java:1095)
        at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:276)
        ... 11 more

On macOS the tess4j package cannot be used on its own; the native Tesseract library has to be installed.
For the installation commands, see https://github.com/tesseract-ocr/tesseract/wiki.
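
The `UnsatisfiedLinkError` above means JNA could not find `libtesseract.dylib`. A quick way to check whether the library is present, before and after installing, is sketched below. The two directories are assumptions (the usual Homebrew library locations on Intel and Apple Silicon Macs), and `NativeLibProbe` is a made-up name. `jna.library.path` is the JNA system property for extra search directories.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper for locating the native library.
class NativeLibProbe {

    /** Returns the path of the first dir/fileName that exists as a regular file. */
    static Optional<Path> findLibrary(List<Path> dirs, String fileName) {
        return dirs.stream()
                .map(dir -> dir.resolve(fileName))
                .filter(Files::isRegularFile)
                .findFirst();
    }

    public static void main(String[] args) {
        // Assumed Homebrew library directories; adjust for your machine.
        List<Path> candidates = Arrays.asList(
                Paths.get("/usr/local/lib"),
                Paths.get("/opt/homebrew/lib"));
        Optional<Path> lib = findLibrary(candidates, "libtesseract.dylib");
        if (lib.isPresent()) {
            System.out.println("Found " + lib.get());
            System.out.println("If JNA still cannot load it, try -Djna.library.path="
                    + lib.get().getParent());
        } else {
            System.out.println("libtesseract.dylib not found; install Tesseract first");
        }
    }
}
```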

Watch out for account permission issues and the like during installation. The output was as follows:

chenhailongdeMacBook-Pro:~ chenhailong$ brew install tesseract
Updating Homebrew...






^C
==> Installing dependencies for tesseract: libpng, jpeg, libtiff, leptonica
==> Installing tesseract dependency: libpng
==> Downloading https://homebrew.bintray.com/bottles/libpng-1.6.34.high_sierra.bottle.tar.gz


######################################################################## 100.0%
==> Pouring libpng-1.6.34.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/libpng/1.6.34: 26 files, 1.2MB
==> Installing tesseract dependency: jpeg
==> Downloading https://homebrew.bintray.com/bottles/jpeg-9c.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring jpeg-9c.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/jpeg/9c: 21 files, 724.5KB
==> Installing tesseract dependency: libtiff
==> Downloading https://homebrew.bintray.com/bottles/libtiff-4.0.9_4.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring libtiff-4.0.9_4.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/libtiff/4.0.9_4: 246 files, 3.5MB
==> Installing tesseract dependency: leptonica
==> Downloading https://homebrew.bintray.com/bottles/leptonica-1.76.0.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring leptonica-1.76.0.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/leptonica/1.76.0: 48 files, 5.6MB
==> Installing tesseract
==> Downloading https://homebrew.bintray.com/bottles/tesseract-3.05.02.high_sierra.bottle.tar.gz
#########################################                                 58.2%


curl: (56) LibreSSL SSL_read: SSL_ERROR_SYSCALL, errno 54
Error: Failed to download resource "tesseract"
Download failed: https://homebrew.bintray.com/bottles/tesseract-3.05.02.high_sierra.bottle.tar.gz
Warning: Bottle installation failed: building from source.
==> Installing dependencies for tesseract: autoconf, autoconf-archive, automake, libtool, pkg-config
==> Installing tesseract dependency: autoconf
==> Downloading https://homebrew.bintray.com/bottles/autoconf-2.69.high_sierra.bottle.4.tar.gz
######################################################################## 100.0%
==> Pouring autoconf-2.69.high_sierra.bottle.4.tar.gz
==> Caveats
Emacs Lisp files have been installed to:
  /usr/local/share/emacs/site-lisp/autoconf
==> Summary
🍺  /usr/local/Cellar/autoconf/2.69: 71 files, 3.0MB
==> Installing tesseract dependency: autoconf-archive
==> Downloading https://homebrew.bintray.com/bottles/autoconf-archive-2018.03.13.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring autoconf-archive-2018.03.13.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/autoconf-archive/2018.03.13: 585 files, 3.5MB
==> Installing tesseract dependency: automake
==> Downloading https://homebrew.bintray.com/bottles/automake-1.16.1.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring automake-1.16.1.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/automake/1.16.1: 131 files, 3MB
==> Installing tesseract dependency: libtool
==> Downloading https://homebrew.bintray.com/bottles/libtool-2.4.6_1.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring libtool-2.4.6_1.high_sierra.bottle.tar.gz
==> Caveats
In order to prevent conflicts with Apple's own libtool we have prepended a "g"
so, you have instead: glibtool and glibtoolize.
==> Summary
🍺  /usr/local/Cellar/libtool/2.4.6_1: 71 files, 3.7MB
==> Installing tesseract dependency: pkg-config
==> Downloading https://homebrew.bintray.com/bottles/pkg-config-0.29.2.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring pkg-config-0.29.2.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/pkg-config/0.29.2: 11 files, 627.2KB
==> Downloading https://github.com/tesseract-ocr/tesseract/archive/3.05.02.tar.gz
==> Downloading from https://codeload.github.com/tesseract-ocr/tesseract/tar.gz/3.05.02
######################################################################## 100.0%
==> ./autogen.sh
==> ./configure --prefix=/usr/local/Cellar/tesseract/3.05.02
==> make install


==> Downloading https://github.com/tesseract-ocr/tessdata/raw/3.04.00/eng.traineddata
==> Downloading from https://raw.githubusercontent.com/tesseract-ocr/tessdata/3.04.00/eng.traineddata
######################################################################## 100.0%
==> Downloading https://github.com/tesseract-ocr/tessdata/raw/3.04.00/osd.traineddata
==> Downloading from https://raw.githubusercontent.com/tesseract-ocr/tessdata/3.04.00/osd.traineddata
######################################################################## 100.0%
🍺  /usr/local/Cellar/tesseract/3.05.02: 79 files, 38.6MB, built in 10 minutes 40 seconds
==> Caveats
==> autoconf
Emacs Lisp files have been installed to:
  /usr/local/share/emacs/site-lisp/autoconf
==> libtool
In order to prevent conflicts with Apple's own libtool we have prepended a "g"
so, you have instead: glibtool and glibtoolize.

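Pay attention to problems that come up during installation. In my case the tesseract bottle download failed, so Homebrew fell back to building from source, and there was a caveat about libtool being installed as glibtool to avoid a conflict with Apple's own libtool.
Even so, tesseract --version reports normally, as shown below: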

chenhailongdeMacBook-Pro:~ chenhailong$ tesseract --version
tesseract 3.05.02
 leptonica-1.76.0
  libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11

First, reinstall tesseract:

chenhailongdeMacBook-Pro:~ chenhailong$ brew install tesseract
Updating Homebrew...
^CWarning: tesseract 3.05.02 is already installed and up-to-date
To reinstall 3.05.02, run `brew reinstall tesseract`
chenhailongdeMacBook-Pro:~ chenhailong$ brew reinstall tesseract
==> Reinstalling tesseract 
==> Downloading https://homebrew.bintray.com/bottles/tesseract-3.05.02.high_sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring tesseract-3.05.02.high_sierra.bottle.tar.gz
🍺  /usr/local/Cellar/tesseract/3.05.02: 79 files, 38.6MB

Running the code again produces the following error:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/chenhailong/.m2/repository/org/slf4j/slf4j-log4j12/1.7.25/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/chenhailong/.m2/repository/ch/qos/logback/logback-classic/1.2.3/logback-classic-1.2.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Error opening data file ./tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00000001270bfe6b, pid=38486, tid=0x0000000000002703
#
# JRE version: Java(TM) SE Runtime Environment (8.0_171-b11) (build 1.8.0_171-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.171-b11 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [libtesseract.dylib+0x12e6b]  _ZN9tesseract9Tesseract15recog_all_wordsEP8PAGE_RESP10ETEXT_DESCPK4TBOXPKci+0xa7
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/chenhailong/git/LawSpider/hs_err_pid38486.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

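The message says the tessdata resource directory cannot be found.
I followed this reference:
http://www.zmonster.me/2015/04/17/tesseract-install-usage.html
Searching around, some people attribute the problem to permissions and others to environment variables. I set the following: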

export TESSDATA_PREFIX=/opt/language/

The same error still appeared.
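
One possible reason, which is an assumption and not something the log confirms, is that a variable exported in a terminal session is not visible to a JVM launched from an IDE. A quick check from inside Java is sketched below; `TessdataEnv` is a made-up name.

```java
import java.util.Map;

// Hypothetical helper: reports what the JVM sees, not what the shell sees.
class TessdataEnv {

    /** Describes whether TESSDATA_PREFIX is visible in the given environment. */
    static String describe(Map<String, String> env) {
        String prefix = env.get("TESSDATA_PREFIX");
        if (prefix == null || prefix.isEmpty()) {
            return "TESSDATA_PREFIX is not set for this JVM";
        }
        return "TESSDATA_PREFIX=" + prefix;
    }

    public static void main(String[] args) {
        // Run this the same way as the OCR test (same IDE run configuration).
        System.out.println(describe(System.getenv()));
    }
}
```

If the variable turns out not to be set, configuring it in the IDE run configuration or setting the data path in code are the usual alternatives.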
I then followed this reference:

http://www.itkeyword.com/doc/692839954630676463/tesseract-for-java-setting-tessdata-prefix-for-executable-jar

After adding a setDatapath call in the code, it ran normally:

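ITesseract it = new Tesseract();
// Use ONE of the two calls below; a later call overrides an earlier one.
// If the tessdata directory has not been moved, pass "."
it.setDatapath(".");
// If the tessdata directory has been moved, pass its new location
it.setDatapath("/opt/language/");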
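
Because a missing language file crashed the whole JVM in the log above (SIGSEGV in native code) and did not raise a Java exception, it may be worth checking for the file before calling doOCR. The sketch below uses only the JDK; `TessdataCheck` is a made-up name. It accepts either layout, with the file directly in the data path or in a `tessdata` subdirectory, since the error message above shows `tessdata/` being appended and which layout applies may depend on the Tesseract version.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical pre-flight check, not part of the Tess4J API.
class TessdataCheck {

    /** True if <language>.traineddata exists in dataPath or in dataPath/tessdata. */
    static boolean hasLanguageData(Path dataPath, String language) {
        String fileName = language + ".traineddata";
        return Files.isRegularFile(dataPath.resolve(fileName))
                || Files.isRegularFile(dataPath.resolve("tessdata").resolve(fileName));
    }

    public static void main(String[] args) {
        // Defaults to the path used in this post; pass another path as the first argument.
        Path dataPath = Paths.get(args.length > 0 ? args[0] : "/opt/language/");
        if (hasLanguageData(dataPath, "eng")) {
            System.out.println("eng.traineddata found under " + dataPath);
        } else {
            System.out.println("eng.traineddata missing under " + dataPath
                    + "; skip OCR to avoid a native crash");
        }
    }
}
```

In the OCR test, the same check could guard the setDatapath and doOCR calls, so a bad path produces a readable message.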

The test image was as follows:

(Image: OCR test picture)

The recognition result was as follows:

97X-l-449-38I65-3 Tapworllly: Designing Great thunc Apps r 2010 by O'Reilly Media. Inc. Simplified
Chinese cdilion. jninlly published by O‘Reilly M:dia. Inc, and Publishing Hnusc of Elcclmnlcs Induslry.
2m 1. Aulhorucd uanslauun or me English :dllion‘ 2010 O'Reilly Media‘ Ina. lln: owner ofall nghls m
publish and stll m: same. All rights reamed including the rims nfrcpmdnclion in whole or in van in any
rm...

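Some of the recognized text is clearly wrong, probably because the image is not sharp enough. Recognizing an image that contains Chinese characters directly may fail, because the corresponding language data files (traineddata) have to be downloaded first. They are available here:
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-400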
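
If low image quality is the cause, simple preprocessing sometimes helps. The sketch below converts an image to grayscale and enlarges it using only the JDK. It is a commonly suggested step, not something tested in this post, and the scale factor is an arbitrary assumption. `OcrPreprocess` is a made-up name.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical preprocessing helper; results will vary by image.
class OcrPreprocess {

    /** Returns a grayscale copy of src enlarged by the given integer factor. */
    static BufferedImage grayscaleAndScale(BufferedImage src, int factor) {
        int w = src.getWidth() * factor;
        int h = src.getHeight() * factor;
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = out.createGraphics();
        try {
            // Bicubic interpolation keeps character edges smoother when enlarging.
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java OcrPreprocess input.png output.png
        BufferedImage src = ImageIO.read(new File(args[0]));
        if (src == null) {
            System.out.println("Unsupported image format: " + args[0]);
            return;
        }
        ImageIO.write(grayscaleAndScale(src, 2), "png", new File(args[1]));
    }
}
```

The returned image can be written to disk and passed to doOCR as a file, or handed to Tess4J's doOCR(BufferedImage) overload directly.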

At this point the basic runtime environment is set up. Next I will see how accurate Chinese recognition is.


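The final attempt is shown below. The official Chinese language data does not give very satisfying results, and training your own data would take a lot of time. So it is reasonable to use the OCR service offered by Baidu, or comparable technology from another vendor, as long as it meets your needs.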

OCR result:
本书是_位国际知名的管理学租领导力大师对生活经验的捷炼。
量弗蠹德曹智博逼' 在引导管理者及其他人患考折庸巽们的几个
核心人生问踵时' 总能切中胃薯。 普逼的管理学圈书藿完兢忘'
此书却能长驻人们心中' 灏起不断扩敞的阵阵涟潴。
一鳙青神分癫r学擎、 人类学掌` 小说家、 当今世界量伸犬峰恳暴
掌之一, 苏遣-凯卡尔耀士
只有了鬟了自己 才能了解别人。 凭借对动机柏慵感曾力的濂入
洞察, 晕嘉富獭教侵掏示了障藿茌商业行为量谏处的事实: 我们
都是人。
一纂躏每日峄擅集口主庸' 罗摹朱尔子纂
阅读本书' 蜀时而发出会心的微笑, 时而觉得不舒魔, 但是从中
学到了很多东酉。 本书能引发谩春思考这样一个问魉: 窦能对自
己、 家人、 朋友、 同藁膏多好或青有多坏。 晕嘉量篱行文不洵椿
套, 他将人文科学领域的量新研究发珊串在_起' 还举出寰实生
活当中高瀛与低谷的实例' 樱醒我们: 我们是自己人生的蛇手,
妻粑鏖机会。 ′
一曼济学掌、 社套学掌, 星虞能瀑研宛所, 撑耀士 - 克蕨克
一 霞

