Parsing e-book metadata with Python

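EPUB is an open e-book format, and books in this format can be downloaded and shared. EPUB files are generally not compatible with e-readers such as the Amazon Kindle.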


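An EPUB file is itself a zip archive. Alongside the book content, it holds XML files that describe the archive: META-INF/container.xml points to a package file, and the package file carries the book's metadata. The code below uses Python's lxml library to parse these XML files and extract that information: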

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import zipfile
from lxml import etree


def get_epub_info(fname):
    ns = {
        'n': 'urn:oasis:names:tc:opendocument:xmlns:container',
        'pkg': 'http://www.idpf.org/2007/opf',
        'dc': 'http://purl.org/dc/elements/1.1/'
    }

    # prepare to read from the .epub file
    _zip = zipfile.ZipFile(fname)

    # find the contents metafile
    txt = _zip.read('META-INF/container.xml')
    tree = etree.fromstring(txt)
    cfname = tree.xpath('n:rootfiles/n:rootfile/@full-path', namespaces=ns)[0]

    # grab the metadata block from the contents metafile
    cf = _zip.read(cfname)
    # print cf
    tree = etree.fromstring(cf)
    p = tree.xpath('/pkg:package/pkg:metadata', namespaces=ns)[0]

    # repackage the data
    res = {}
    for s in ['title', 'language', 'creator', 'date', 'identifier', 'publisher', 'subject', 'description']:
        # keep only the first value; a missing field becomes None instead of raising IndexError
        values = p.xpath('dc:%s/text()' % s, namespaces=ns)
        res[s] = values[0] if values else None

    # in the sample file, the second dc:identifier holds the ISBN:
    # p.xpath('dc:identifier/text()', namespaces=ns)[1]

    return res


if __name__ == "__main__":
    print(get_epub_info('source/epubsample.epub'))

Output

{'publisher': 'Shoes and Ships and Sealing Wax Ltd', 'description': 'SUMMARY:\nThis unique \'15 books in 1\' edition of L. Frank Baum\'s original "Oz" series contains the following complete works: "The Wonderful Wizard of Oz," "The Marvelous Land of Oz," "Ozma of Oz," "Dorothy and the Wizard in Oz," "The Road to Oz," "The Emerald City of Oz," "The Patchwork Girl Of Oz," "Little Wizard Stories of Oz," "Tik-Tok of Oz," "The Scarecrow Of Oz," "Rinkitink In Oz," "The Lost Princess Of Oz," "The Tin Woodman Of Oz," "The Magic of Oz," and "Glinda Of Oz." For over a hundred years, L. Frank Baum\'s classic fairy stories about the land of Oz have been delighting children and parents alike. Now, for the first time, the entire Oz series is available in this single, great-value, edition!', 'language': 'UND', 'creator': 'L. Frank Baum', 'title': 'The Wonderful Wizard of Oz', 'date': '2010-01-22T00:08:46', 'identifier': 'd1d2e9d3-2d97-44b9-924a-c59416e85df7', 'subject': 'Science fiction'}
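Before parsing, it can help to look at what the archive actually contains. The following is a minimal sketch using only the standard library's zipfile module. The function name `describe_epub` is made up for illustration, and the check assumes the usual EPUB convention that a `mimetype` member holds the string `application/epub+zip`.

```python
import zipfile


def describe_epub(fname):
    """Return the declared mimetype and the member names of an EPUB archive.

    fname may be a path or a file-like object, as accepted by zipfile.ZipFile.
    """
    with zipfile.ZipFile(fname) as zf:
        names = zf.namelist()
        mimetype = None
        # By convention the 'mimetype' member holds 'application/epub+zip'.
        if 'mimetype' in names:
            mimetype = zf.read('mimetype').decode('ascii').strip()
    return {
        'mimetype': mimetype,
        # container.xml is the file get_epub_info reads first
        'has_container': 'META-INF/container.xml' in names,
        'names': names,
    }
```

If `has_container` is False, the file is probably not a valid EPUB, and `get_epub_info` would fail with a KeyError when it tries to read META-INF/container.xml.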

XML sample

<?xml version="1.0"  encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="uuid_id">
  <metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:calibre="http://calibre.kovidgoyal.net/2009/metadata" xmlns:dc="http://purl.org/dc/elements/1.1/">
    <meta name="calibre:series_index" content="1"/>
    <dc:language>UND</dc:language>
    <meta name="calibre:timestamp" content="2010-01-22T00:08:46"/>
    <dc:title>The Wonderful Wizard of Oz</dc:title>
    <meta name="cover" content="cover"/>
    <dc:date>2010-01-22T00:08:46</dc:date>
    <dc:contributor opf:role="bkp">calibre (0.6.34) [http://calibre-ebook.com]</dc:contributor>
    <dc:identifier id="uuid_id" opf:scheme="uuid">d1d2e9d3-2d97-44b9-924a-c59416e85df7</dc:identifier>
  <dc:creator opf:role="aut" opf:file-as="Baum, L. Frank">L. Frank Baum</dc:creator>
<dc:publisher>Shoes and Ships and Sealing Wax Ltd</dc:publisher>
<opf:meta name="calibre:rating" content="8"/>
<dc:identifier opf:scheme="ISBN">9780954840143</dc:identifier>
<dc:subject>Science fiction</dc:subject>
<dc:subject>Fantasy</dc:subject>
<dc:subject>Epic</dc:subject>
<dc:subject>General</dc:subject>
<dc:subject>Fiction</dc:subject>
<dc:subject>Science Fiction &amp; Fantasy</dc:subject>
<dc:subject>Magic</dc:subject>
<dc:subject>Juvenile Fiction</dc:subject>
<dc:subject>Fantasy &amp; Magic</dc:subject>
<dc:subject>American</dc:subject>
<dc:subject>Fantasy fiction</dc:subject>
<dc:subject>Wizards</dc:subject>
<dc:subject>Classics</dc:subject>
<dc:subject>Anthologies</dc:subject>
<dc:subject>Classic fiction (Children's</dc:subject>
<dc:subject>YA)</dc:subject>
<dc:subject>Ages 9-12 Fiction</dc:subject>
<dc:subject>Young Adult Fiction</dc:subject>
<dc:subject>Action &amp; Adventure</dc:subject>
<dc:subject>Children's Books</dc:subject>
<dc:subject>&amp; Magic</dc:subject>
<dc:subject>Fairy tales</dc:subject>
<dc:subject>Children's stories</dc:subject>
<dc:subject>fables</dc:subject>
<dc:subject>Wizard of Oz (Fictitious character)</dc:subject>
<dc:subject>folk tales</dc:subject>
<dc:subject>Juvenile Fiction : General</dc:subject>
<dc:subject>magical tales &amp; traditional stories</dc:subject>
<dc:subject>Oz (Imaginary place)</dc:subject>
<dc:subject>Juvenile Fiction : Fantasy &amp; Magic</dc:subject>
<dc:description>SUMMARY:
This unique '15 books in 1' edition of L. Frank Baum's original "Oz" series contains the following complete works: "The Wonderful Wizard of Oz," "The Marvelous Land of Oz," "Ozma of Oz," "Dorothy and the Wizard in Oz," "The Road to Oz," "The Emerald City of Oz," "The Patchwork Girl Of Oz," "Little Wizard Stories of Oz," "Tik-Tok of Oz," "The Scarecrow Of Oz," "Rinkitink In Oz," "The Lost Princess Of Oz," "The Tin Woodman Of Oz," "The Magic of Oz," and "Glinda Of Oz." For over a hundred years, L. Frank Baum's classic fairy stories about the land of Oz have been delighting children and parents alike. Now, for the first time, the entire Oz series is available in this single, great-value, edition!</dc:description>
</metadata>
  <manifest>
    <item href="Baum, L. Frank - Oz 01 - The Wizard of Oz (illus.)_split_000.htm" id="Baum,_L._Frank_-_Oz_01_-_The_Wizard_of_Oz_(illus.)80" media-type="application/xhtml+xml"/>
...
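In the sample above, dc:subject appears many times and dc:identifier appears twice (a uuid and an ISBN), so keeping only the first value of each field loses information. The sketch below is one possible way to collect every value. It is not part of the original script: it uses the standard library's xml.etree.ElementTree instead of lxml, and the function name `collect_metadata` is made up for illustration. It expects the raw bytes of the package file, for example the result of `_zip.read(cfname)`.

```python
import xml.etree.ElementTree as ET

DC = 'http://purl.org/dc/elements/1.1/'
OPF = 'http://www.idpf.org/2007/opf'


def collect_metadata(opf_xml):
    """Collect all Dublin Core values from a package (OPF) document.

    Returns (fields, identifiers): fields maps each dc element name to a
    list of its values, identifiers maps each opf:scheme to its value.
    """
    root = ET.fromstring(opf_xml)
    metadata = root.find('{%s}metadata' % OPF)
    fields = {}
    identifiers = {}
    if metadata is None:
        return fields, identifiers
    for el in metadata:
        # Skip elements outside the Dublin Core namespace, such as <meta>.
        if not el.tag.startswith('{%s}' % DC):
            continue
        name = el.tag.split('}', 1)[1]
        text = (el.text or '').strip()
        fields.setdefault(name, []).append(text)
        if name == 'identifier':
            # opf:scheme distinguishes e.g. the uuid from the ISBN
            scheme = el.get('{%s}scheme' % OPF)
            if scheme:
                identifiers[scheme] = text
    return fields, identifiers
```

Looking up the ISBN by its opf:scheme attribute is less fragile than relying on it being the second dc:identifier, since the order of elements can differ between files.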

The epub library is also a good option for parsing: https://pypi.python.org/pypi/epub

https://github.com/bettse/epub-reader/blob/master/epub.py

A simple python script to unpack/parse epub books so they can be read on the command line.

Article URL: https://pylist.com/topic/112.html (please credit the source when reposting)
