[1.2] Parsing PDFs with pdf2htmlEX

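The marketing team recently asked for product information from other companies so that it can be compared with ours. Web pages, as you know, can be crawled with Scrapy and their HTML parsed with BeautifulSoup. But what about PDFs? Convert them to HTML first, of course.

This takes three steps:

  1. Decrypt the PDF
  2. Convert the PDF to HTML
  3. Parse the HTML and extract the information you want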

I. Decrypting the PDF online

https://smallpdf.com/cn/unlock-pdf

II. Converting PDF to HTML

1. Introduction to pdf2htmlEX

III. Installation

Official documentation: https://github.com/coolwanglu/pdf2htmlEX/wiki/Building

3.1 Installing on Mac

On Mac OS X you can install it with brew:

brew install pdf2htmlEX

(This tool depends on far too many packages; installing them all by hand would drive you mad.)

Installation error 1:

==> Downloading ftp://ftp.simplesystems.org/pub/libpng/png/src/libpng16/libpng-1.6.18.tar.xz

curl: (78) RETR response: 550
Error: Failed to download resource "libpng"
Download failed: ftp://ftp.simplesystems.org/pub/libpng/png/src/libpng16/libpng-1.6.18.tar.xz

Solution:

The file has to be downloaded through a proxy:

ftp://ftp.simplesystems.org/pub/libpng/png/src/libpng16/libpng-1.6.30.tar.xz

cd /Users/tanqianshan/Documents/project/8.pdf_convert/lib
wget -c ftp://ftp.simplesystems.org/pub/libpng/png/src/libpng16/libpng-1.6.30.tar.xz
tar xvJf libpng-1.6.30.tar.xz
cd libpng-1.6.30
./configure --prefix=/usr
make check
make install

Installation error 2:

Reinstalling still fails:

tanqianshan[2.其他公司表型整理]$ brew install pdf2htmlEX
Warning: You are using OS X 10.12.
We do not provide support for this pre-release version.
You may encounter build failures or other breakage.
Error: You must `brew link cmake' before pdf2htmlex can be installed

Solution:

brew uninstall cmake
sudo brew install cmake

brew install pdf2htmlEX

Installing pdf2htmlEX manually

Software that must be installed beforehand

1. poppler

Method 1 (failed):

pip install poppler
Could not find a version that satisfies the requirement poppler (from versions: )

Method 2:

cd /Users/tanqianshan/Documents/project/8.pdf_convert/lib/
wget https://poppler.freedesktop.org/poppler-0.56.0.tar.xz
tar xvJf poppler-0.56.0.tar.xz
cd poppler-0.56.0
./configure --prefix=/usr

Installing pdf2htmlEX

cd /Users/tanqianshan/Documents/project/8.pdf_convert/lib
git clone https://github.com/coolwanglu/pdf2htmlEX.git
cd pdf2htmlEX
cmake . && make && sudo make install

The final solution:

Skip libpng for now:

brew install pdf2htmlEX --without-libpng

3. Running pdf2htmlEX

pdf2htmlEX --zoom 1.3 boao_aishenpu.pdf
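
When many PDFs need converting, the command can be driven from a script. The following is a minimal sketch rather than a definitive wrapper: the function names `build_command` and `convert` are hypothetical, and the only options it knows are the ones that appear in this post (`--zoom`, `--hdpi`, `--vdpi`, `--dest-dir`).

```python
import shutil
import subprocess


def build_command(pdf_path, zoom=None, hdpi=None, vdpi=None, dest_dir=None):
    """Assemble a pdf2htmlEX argument list from the options used in this post."""
    cmd = ["pdf2htmlEX"]
    if zoom is not None:
        cmd += ["--zoom", str(zoom)]
    if hdpi is not None:
        cmd += ["--hdpi", str(hdpi)]
    if vdpi is not None:
        cmd += ["--vdpi", str(vdpi)]
    if dest_dir is not None:
        cmd += ["--dest-dir", str(dest_dir)]
    cmd.append(str(pdf_path))  # the input file goes last, as in the command above
    return cmd


def convert(pdf_path, **options):
    """Run pdf2htmlEX, failing early with a clear message if it is not installed."""
    if shutil.which("pdf2htmlEX") is None:
        raise FileNotFoundError("pdf2htmlEX is not installed or not on PATH")
    # check=True raises CalledProcessError when the converter exits non-zero
    return subprocess.run(build_command(pdf_path, **options), check=True)


if __name__ == "__main__":
    # Only prints the command; call convert(...) to actually run it.
    print(" ".join(build_command("boao_aishenpu.pdf", zoom=1.3)))
```

Keeping the argument list separate from the call that runs it makes it easy to print and inspect the command before running it on a whole directory of files.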

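3.2 Installing on CentOS 7 (an install that nearly made me cry)

3.2.1 Installing the dependency libraries

Because of all the errors I looked up many different installation guides and could no longer tell which libraries were actually needed and which were not, so I simply installed all of them.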

yum-config-manager --enable epel
yum -y update

Install the key

cd /etc/pki/rpm-gpg/ 
wget http://mirrors.163.com/centos/RPM-GPG-KEY-CentOS-7 
rpm --import /etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7

cd /etc/yum.repos.d 
wget -c http://linuxsoft.cern.ch/cern/scl/slc6-scl.repo

Upgrade pip

pip install --upgrade pip
pip install --upgrade setuptools
pip install lxml 


yum -y install libtool-ltdl-devel.x86_64 zlib-devel.x86_64 glib2-devel.x86_64 \
    freetype-devel.x86_64 poppler-glib-devel.x86_64 git cmake mk-configure.noarch \
    libjpeg-turbo.x86_64 libtiff.x86_64 libpng-devel.x86_64 giflib-devel.x86_64 \
    libXt-devel.x86_64 autoconf automake libtool bzip2 libxml2.x86_64 \
    libuninameslist-devel.x86_64 libspiro.x86_64 dbus-python-devel.x86_64 \
    pango-devel.x86_64 chrpath uuid-c++.x86_64 uuid.x86_64 uthash-devel.noarch \
    cmake gcc java-1.8.0-openjdk libpng-devel.x86_64 fontforge-devel.x86_64 \
    cairo-devel.x86_64 poppler-devel.x86_64 libspiro-devel.x86_64 \
    freetype-devel.x86_64 poppler-data libjpeg-turbo-devel git gcc-c++ \
    libjpeg-turbo-devel.x86_64 poppler-data.noarch jpackage-utils.noarch \
    gettext.x86_64 jpackage-utils.noarch python27-python-devel.x86_64 \
    libxml2-python27.x86_64 libxml2-python26.x86_64 python27-python-devel.x86_64 \
    libxslt-devel.x86_64 libxslt-python26.x86_64 libxslt.x86_64 libxml2-devel \
    libxslt-devel python-devel python-javapackages.noarch

yum --nogpgcheck install poppler-cpp.x86_64 poppler-cpp-devel.x86_64 \
    libstdc++48-static.x86_64 openjpeg-devel.x86_64



yum install cmake gcc gcc-c++ gtk+-devel gimp-devel gimp-devel-tools gimp-help-browser zlib-devel libtiff-devel libjpeg-devel \
    libpng-devel gstreamer-devel libavc1394-devel libraw1394-devel libdc1394-devel jasper-devel jasper-utils swig python libtool nasm



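(The -dev package names in the next command are Debian/Ubuntu names. On CentOS the equivalents end in -devel, so yum reports most of these as not available.)

yum install autotools-dev libjpeg-dev libtiff4-dev libpng12-dev libgif-dev libxt-dev autoconf automake libtool bzip2 libxml2-dev libuninameslist-dev libspiro-dev python-dev libpango1.0-dev libcairo2-dev chrpath uuid-dev uthash-dev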

yum install  cmake gcc gnu-getopt java-1.8.0-openjdk libpng-devel fontforge-devel cairo-devel poppler-devel libspiro-devel freetype-devel  poppler-data libjpeg-turbo-devel git make gcc-c++

3.2.2 Installing the other libraries

Install openjpeg

wget -c -O openjpeg-2.1.0.tar.gz "https://sourceforge.net/projects/openjpeg.mirror/files/2.1.0/openjpeg-2.1.0.tar.gz/download?use_mirror=nchc"
tar -xzf openjpeg-2.1.0.tar.gz
cd openjpeg-2.1.0
cmake . && make && make install

Install poppler

cd /home/sam/anbank-web/lib

wget -c http://poppler.freedesktop.org/poppler-0.35.0.tar.xz
tar -xf poppler-0.35.0.tar.xz
cd poppler-0.35.0/
./configure --prefix=/usr --enable-xpdf-headers --enable-libjpeg
make && make install

export LD_LIBRARY_PATH=/usr/lib:/usr/lib64

export LD_RUN_PATH=/usr/lib:/usr/lib64

Check the gcc version

gcc -v

Install fontforge

The correct procedure:

cd /home/sam/anbank-web/lib
git clone https://github.com/coolwanglu/fontforge.git fontforge.git
cd fontforge.git
git checkout pdf2htmlEX

./autogen.sh

./configure --enable-debug --prefix=/usr
make V=1  # reports errors
make install
fontforge -version

cp fontforge.pc /usr/local/lib/pkgconfig/ 
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig 
vim CMakeLists.txt 
    #adjust version


export LD_LIBRARY_PATH=/usr/local/lib
export LIBRARY_PATH=/usr/local/lib

Problems encountered during installation, and how they were solved:

cd /home/sam/anbank-web/lib
git clone https://github.com/coolwanglu/fontforge.git fontforge.git
cd fontforge.git

./autogen.sh

Error 1:

at least version 1.6.0 of GNU Autoconf must be installed

Solution:

yum install autoconf

Error 2:

at least version 1.6.0 of GNU Automake must be installed

Solution:

yum install automake

Error 3:

at least version 1.4.2 of GNU Libtool must be installed

Solution:

yum install libtool

Error 4:

libtoolize: `COPYING.LIB' not found in `/usr/share/libtool/libltdl'

Solution:

yum install libtool-ltdl-devel
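
The Autoconf, Automake and Libtool errors above each surface only after the previous one is fixed. A small pre-flight check can report every missing build tool in one pass. This is a sketch under assumptions: the tool list in `REQUIRED_TOOLS` is my own guess at what this build needs, and it only checks that a command exists on PATH, not its version.

```python
import shutil

# Assumed list of commands this build relies on; adjust it to your setup.
REQUIRED_TOOLS = [
    "autoconf", "automake", "libtoolize", "pkg-config",
    "cmake", "make", "gcc", "g++", "git",
]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from the list that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing build tools:", ", ".join(missing))
    else:
        print("All listed build tools were found on PATH.")
```

Missing header packages (for example pango-devel) are not detected this way, because they do not install a command of their own.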

With those errors resolved, continue:

./autogen.sh --verbose

./configure --prefix=/usr

Error 5:

configure: error: Package requirements (pango >= 1.10 pangoxft) were not met:

No package 'pango' found
No package 'pangoxft' found

Consider adjusting the PKG_CONFIG_PATH environment variable if you
installed software in a non-standard prefix.

Alternatively, you may set the environment variables PANGO_CFLAGS
and PANGO_LIBS to avoid the need to call pkg-config.
See the pkg-config man page for more details.

Solution:

yum install pango pango-devel

make

Error 6:

ufo.c:925:12: error: conflicting types for 'SplinePointListInterpretGlif'

Solution:

Run the yum commands listed above to install all of those packages.

Output of make install:

Libraries have been installed in:
   /usr/local/lib64/python2.7/site-packages

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
   - add LIBDIR to the `LD_LIBRARY_PATH' environment variable
     during execution
   - add LIBDIR to the `LD_RUN_PATH' environment variable
     during linking
   - use the `-Wl,-rpath -Wl,LIBDIR' linker flag
   - have your system administrator add LIBDIR to `/etc/ld.so.conf'

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
    
    export LIBDIR=/usr/local/lib64/python2.7/site-packages

3.2.3 Installing pdf2htmlEX

cd /home/sam/anbank-web/lib
git clone https://github.com/coolwanglu/pdf2htmlEX.git

cd pdf2htmlEX
cmake . && make && make install

# cmake -DCMAKE_BUILD_TYPE=Debug builds in debug mode instead

pkg-config --print-provides --cflags --libs poppler

Error:

/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libfontforge.so: undefined reference to `PyTuple_GetItem'
/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libfontforge.so: undefined reference to `PyImport_AppendInittab'
/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libfontforge.so: undefined reference to `PyUnicodeUCS4_AsUTF8String'
/usr/lib/gcc/x86_64-redhat-linux/4.8.5/../../../../lib64/libfontforge.so: undefined reference to `PyString_Decode'
collect2: error: ld returned 1 exit status
make[2]: *** [pdf2htmlEX] Error 1
make[1]: *** [CMakeFiles/pdf2htmlEX.dir/all] Error 2
make: *** [all] Error 2

Cause:

/usr/lib64/libfontforge.so.1 conflicts with libfontforge.so.2

Solution:

[root@localhost lib64]# ll |grep libfont
lrwxrwxrwx.  1 root root       22 9月  25 17:53 libfontconfig.so -> libfontconfig.so.1.7.0
lrwxrwxrwx.  1 root root       22 4月   9 03:38 libfontconfig.so.1 -> libfontconfig.so.1.7.0
-rwxr-xr-x.  1 root root   255968 8月   2 2017 libfontconfig.so.1.7.0
lrwxrwxrwx.  1 root root       21 4月   9 03:42 libfontembed.so.1 -> libfontembed.so.1.0.0
-rwxr-xr-x.  1 root root    53224 8月   3 2017 libfontembed.so.1.0.0
lrwxrwxrwx.  1 root root       19 4月   9 03:38 libfontenc.so.1 -> libfontenc.so.1.0.0
-rwxr-xr-x.  1 root root    27512 8月   2 2017 libfontenc.so.1.0.0
lrwxrwxrwx.  1 root root       21 9月  25 17:53 libfontforge.so -> libfontforge.so.1.0.0
lrwxrwxrwx.  1 root root       21 9月  25 17:53 libfontforge.so.1 -> libfontforge.so.1.0.0
-rwxr-xr-x.  1 root root  4214600 6月  10 2014 libfontforge.so.1.0.0
lrwxrwxrwx.  1 root root       32 9月  29 09:24 libfontforge.so.2 -> /usr/local/lib/libfontforge.so.2


ln -s /usr/local/lib/libfontforge.so.2 /usr/lib64/libfontforge.so.2
ln -s /usr/local/lib/libfontforge.so.2 /usr/lib64/libfontforge.so
ln -s /usr/local/lib/libpoppler.so.54 /usr/lib64/libpoppler.so.54

The next error:

CMakeFiles/pdf2htmlEX.dir/3rdparty/poppler/git/CairoFontEngine.cc.o: In function `CairoFreeTypeFont::create(GfxFont*, XRef*, FT_LibraryRec_*, bool)':
/home/sam/anbank-web/lib/pdf2htmlEX/3rdparty/poppler/git/CairoFontEngine.cc:425: undefined reference to `GfxFont::locateFont(XRef*, PSOutputDev*)'
CMakeFiles/pdf2htmlEX.dir/src/HTMLRenderer/font.cc.o: In function `pdf2htmlEX::HTMLRenderer::install_font(GfxFont*)':
/home/sam/anbank-web/lib/pdf2htmlEX/src/HTMLRenderer/font.cc:889: undefined reference to `GfxFont::locateFont(XRef*, PSOutputDev*)'
CMakeFiles/pdf2htmlEX.dir/src/HTMLRenderer/font.cc.o: In function `pdf2htmlEX::HTMLRenderer::install_external_font(GfxFont*, pdf2htmlEX::FontInfo&)':
/home/sam/anbank-web/lib/pdf2htmlEX/src/HTMLRenderer/font.cc:944: undefined reference to `GfxFont::locateFont(XRef*, PSOutputDev*)'
CMakeFiles/pdf2htmlEX.dir/src/BackgroundRenderer/SplashBackgroundRenderer.cc.o: In function `pdf2htmlEX::SplashBackgroundRenderer::SplashBackgroundRenderer(std::string const&, pdf2htmlEX::HTMLRenderer*, pdf2htmlEX::Param const&)':
/home/sam/anbank-web/lib/pdf2htmlEX/src/BackgroundRenderer/SplashBackgroundRenderer.cc:35: undefined reference to `SplashOutputDev::SplashOutputDev(SplashColorMode, int, bool, unsigned char*, bool, SplashThinLineMode, bool)'
collect2: error: ld returned 1 exit status
make[2]: *** [pdf2htmlEX] Error 1
make[1]: *** [CMakeFiles/pdf2htmlEX.dir/all] Error 2
make: *** [all] Error 2

ln -s /usr/lib/libfontforge.so.2 /usr/lib64/libfontforge.so.2
ln -s /usr/lib/libfontforge.so.2 /usr/lib64/libfontforge.so
ln -s /usr/lib/libpoppler.so.54 /usr/lib64/libpoppler.so.54

Cause:

The libpoppler.so in /usr/lib64 was the wrong version. This problem is now solved.

cd /usr/lib64

[root@localhost lib64]# ll |grep popp
lrwxrwxrwx.  1 root root       24 9月  28 23:37 libpoppler-cpp.so -> libpoppler-cpp.so.10.2.0
lrwxrwxrwx.  1 root root       24 9月  28 23:37 libpoppler-cpp.so.10 -> libpoppler-cpp.so.10.2.0
-rwxr-xr-x.  1 root root    82680 8月  31 2017 libpoppler-cpp.so.10.2.0
lrwxrwxrwx.  1 root root       25 9月  28 23:36 libpoppler-glib.so -> libpoppler-glib.so.18.6.0
lrwxrwxrwx.  1 root root       25 9月  25 17:53 libpoppler-glib.so.18 -> libpoppler-glib.so.18.6.0
-rwxr-xr-x.  1 root root   370648 8月  31 2017 libpoppler-glib.so.18.6.0
lrwxrwxrwx.  1 root root       20 9月  25 17:53 libpoppler.so -> libpoppler.so.46.0.0
lrwxrwxrwx.  1 root root       20 9月  25 17:53 libpoppler.so.46 -> libpoppler.so.46.0.0
-rwxr-xr-x.  1 root root  2689272 8月  31 2017 libpoppler.so.46.0.0
lrwxrwxrwx.  1 root root       25 10月  6 15:07 libpoppler.so.54 -> /usr/lib/libpoppler.so.54


rm libpoppler.so
ln -s libpoppler.so.54 libpoppler.so
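
Reading `ll | grep` output by eye makes it easy to miss which file a symlink finally points to. The sketch below lists every entry in a library directory that starts with a given prefix, together with its fully resolved target. The function name `library_links` is hypothetical, and the directory and prefix in the usage line are simply the ones from this post.

```python
import os
from pathlib import Path


def library_links(lib_dir, prefix):
    """Map each entry whose name starts with prefix to its fully resolved path.

    Regular files resolve to themselves; symlinks resolve to their final
    target, following chains such as libX.so -> libX.so.N -> libX.so.N.0.0.
    """
    resolved = {}
    for entry in sorted(Path(lib_dir).iterdir()):
        if entry.name.startswith(prefix):
            resolved[entry.name] = os.path.realpath(entry)
    return resolved


if __name__ == "__main__":
    lib_dir = "/usr/lib64"  # directory inspected in this post
    if os.path.isdir(lib_dir):
        for name, target in library_links(lib_dir, "libpoppler").items():
            print(f"{name} -> {target}")
```

If `libpoppler.so` and the version the build expects resolve to different files, the linker is picking up the wrong library.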

Error:

Segmentation fault (core dumped)

Cause:

fontforge or one of the other dependencies is too old

Solution:

Reinstall pdf2htmlEX


tail -f /var/log/messages

Oct  5 23:33:09 localhost abrt-server: Executable '/usr/local/bin/pdf2htmlEX' doesn't belong to any package and ProcessUnpackaged is set to 'no'
Oct  5 23:33:09 localhost abrt-server: 'post-create' on '/var/spool/abrt/ccpp-2018-10-05-23:33:09-385272' exited with 1
Oct  5 23:33:09 localhost abrt-server: Deleting problem directory '/var/spool/abrt/ccpp-2018-10-05-23:33:09-385272'
Oct  5 23:33:14 localhost kernel: pdf2htmlEX[385279]: segfault at 6b00000064 ip 000000000044747d sp 00007ffd8197fb40 error 4 in pdf2htmlEX[400000+67000]
Oct  5 23:33:14 localhost abrt-hook-ccpp: Process 385279 (pdf2htmlEX) of user 1001 killed by SIGSEGV - ignoring (repeated crash)
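
The kernel line in this log carries the details that matter for debugging: the faulting address, the instruction pointer, and where the binary is mapped. The sketch below pulls those fields out with a regular expression. The names `SEGFAULT_RE` and `parse_segfault` are hypothetical, and the pattern is written against the line format shown in this log only.

```python
import re

# Matches kernel lines of the form seen above:
#   NAME[PID]: segfault at ADDR ip IP sp SP error N in IMAGE[BASE+SIZE]
SEGFAULT_RE = re.compile(
    r"(?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<image>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return the fields of a kernel segfault line, or None if it does not match."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        return None
    ip = int(match["ip"], 16)
    base = int(match["base"], 16)
    return {
        "process": match["proc"],
        "pid": int(match["pid"]),
        "fault_address": int(match["addr"], 16),
        "ip": ip,
        "image": match["image"],
        "offset_in_image": ip - base,  # instruction pointer relative to the mapping
    }


if __name__ == "__main__":
    sample = ("Oct  5 23:33:14 localhost kernel: pdf2htmlEX[385279]: segfault at "
              "6b00000064 ip 000000000044747d sp 00007ffd8197fb40 error 4 in "
              "pdf2htmlEX[400000+67000]")
    info = parse_segfault(sample)
    print(info["process"], hex(info["ip"]), hex(info["offset_in_image"]))
```

The instruction pointer is identical in every crash in this log, which suggests the same code path fails each time. Whether a symbolizer wants the absolute `ip` or the offset depends on how the binary was linked.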

Error 1:

Fix for the message "Executable '/usr/local/bin/pdf2htmlEX' doesn't belong to any package and ProcessUnpackaged is set to 'no'":

vim /etc/abrt/abrt-action-save-package-data.conf

ProcessUnpackaged = no

Change it to:

ProcessUnpackaged = yes

Then restart the service:

service abrtd restart

Error 2 (it still fails):

[root@localhost ~]# tail -f /var/spool/mail/root 
Process 385655 (pdf2htmlEX) of user 1001 killed by SIGSEGV - ignoring (=
repeated crash)
:10=E6=9C=88 06 00:03:13 localhost.localdomain kernel: pdf2htmlEX[38586=
1]: segfault at 6b00000064 ip 000000000044747d sp 00007fff21daa6a0 erro=
r 4 in pdf2htmlEX[400000+67000]
:10=E6=9C=88 06 00:03:13 localhost.localdomain abrt-hook-ccpp[385862]: =
Process 385861 (pdf2htmlEX) of user 1001 killed by SIGSEGV - dumping co=
re
:[User Logs]:

The output of tail -f /var/log/messages changes to:

Oct  6 12:36:29 localhost kernel: pdf2htmlEX[395835]: segfault at 6b00000064 ip 000000000044747d sp 00007ffc789b0f80 error 4 in pdf2htmlEX[400000+67000]
Oct  6 12:36:29 localhost abrt-hook-ccpp: Process 395835 (pdf2htmlEX) of user 1001 killed by SIGSEGV - dumping core
Oct  6 12:36:29 localhost abrt-server: Duplicate: core backtrace
Oct  6 12:36:29 localhost abrt-server: DUP_OF_DIR: /var/spool/abrt/ccpp-2018-10-06-00:03:13-385861
Oct  6 12:36:29 localhost abrt-server: Deleting problem directory ccpp-2018-10-06-12:36:29-395835 (dup of ccpp-2018-10-06-00:03:13-385861)
Oct  6 12:36:29 localhost abrt-server: 未指定 sender 的电子邮箱。您想要现在制定吗?如果不,将使用 'user@localhost' [y/N] 
Oct  6 12:36:29 localhost abrt-server: 未指定 receiver 的电子邮箱。您想要现在制定吗?如果不,将使用 'root@localhost' [y/N] 
Oct  6 12:36:29 localhost abrt-server: Undefined variable outside of [[ ]] bracket
Oct  6 12:36:29 localhost abrt-server: 发送电子邮件......
Oct  6 12:36:29 localhost abrt-server: 向 root@localhost 发送通知邮件
Oct  6 12:36:29 localhost abrt-server: 已发送电子邮件至:root@localhost

Shared library directories

ldconfig -v

cd /home/sam/anbank-web/test/convert_data

pdf2htmlEX --hdpi 144 --vdpi 144  20180912-T-1-1.pdf --dest-dir test.html 

IV. Parsing the HTML file

See the BeautifulSoup notes.
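
To give an idea of what the parsing step involves, here is a minimal sketch that pulls the text lines out of a converted page. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without extra packages. It assumes that pdf2htmlEX wraps each line of text in a `<div>` whose class list contains `t`; check this against your own output file. The names `TextLineCollector` and `extract_lines` are hypothetical.

```python
from html.parser import HTMLParser


class TextLineCollector(HTMLParser):
    """Collect the text of every <div> whose class list contains line_class."""

    def __init__(self, line_class="t"):
        super().__init__(convert_charrefs=True)
        self.line_class = line_class
        self.lines = []
        self._depth = 0    # > 0 while inside a matching <div>
        self._buffer = []  # text fragments of the current line

    def handle_starttag(self, tag, attrs):
        if self._depth:
            if tag == "div":
                self._depth += 1  # nested div inside a text line
            return
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.line_class in classes:
                self._depth = 1
                self._buffer = []

    def handle_endtag(self, tag):
        if self._depth and tag == "div":
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._buffer).strip()
                if text:
                    self.lines.append(text)

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_lines(html):
    """Return the text lines found in an HTML string, in document order."""
    parser = TextLineCollector()
    parser.feed(html)
    parser.close()
    return parser.lines


if __name__ == "__main__":
    sample = ('<div class="pf"><div class="t m0">Product <span></span>A</div>'
              '<div class="t m0">Price: 10</div></div>')
    print(extract_lines(sample))
```

With BeautifulSoup the same selection would be a single CSS-selector query. The point here is only that the converted HTML keeps one element per text line, so product fields can be picked out by position or by matching on the text.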

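PS: In two days, using Scrapy, BeautifulSoup and pdf2htmlEX for the first time, I extracted the product information of three companies. Feeling rather pleased with myself.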

V. Discussion

  1. pdf-to-word

https://smallpdf.com/cn/pdf-to-word
