Parsing e-book metadata with Python
🕛 by pyList at 2015-08-10 11:02
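EPUB is an open e-book format, and books in this format are freely available for download. EPUB files are typically not compatible with e-readers such as the Amazon Kindle.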
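An EPUB file is itself a compressed archive (a .zip file). Besides the book's content, it contains XML files that describe the archive: META-INF/container.xml gives the path of a package file, and the package file holds the book's metadata. The code below uses Python's lxml library to parse these XML files and extract that information: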
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function

import zipfile

from lxml import etree


def get_epub_info(fname):
    """Return a dict of Dublin Core metadata read from the EPUB file fname."""
    ns = {
        'n': 'urn:oasis:names:tc:opendocument:xmlns:container',
        'pkg': 'http://www.idpf.org/2007/opf',
        'dc': 'http://purl.org/dc/elements/1.1/'
    }

    # an .epub file is a ZIP archive, so open it with zipfile
    with zipfile.ZipFile(fname) as _zip:
        # container.xml records the path of the contents metafile
        txt = _zip.read('META-INF/container.xml')
        tree = etree.fromstring(txt)
        cfname = tree.xpath('n:rootfiles/n:rootfile/@full-path', namespaces=ns)[0]

        # read the contents metafile, which holds the metadata block
        cf = _zip.read(cfname)

    tree = etree.fromstring(cf)
    p = tree.xpath('/pkg:package/pkg:metadata', namespaces=ns)[0]

    # repackage the data, keeping the first value of each field
    res = {}
    for s in ['title', 'language', 'creator', 'date', 'identifier',
              'publisher', 'subject', 'description']:
        values = p.xpath('dc:%s/text()' % s, namespaces=ns)
        # a field may be missing from the file; store None rather than raise IndexError
        res[s] = values[0] if values else None
    # In the sample file the second dc:identifier is the ISBN:
    # p.xpath('dc:identifier/text()', namespaces=ns)[1]
    return res


if __name__ == "__main__":
    print(get_epub_info('source/epubsample.epub'))
Output (key order may vary with the Python version):
{'publisher': 'Shoes and Ships and Sealing Wax Ltd', 'description': 'SUMMARY:\nThis unique \'15 books in 1\' edition of L. Frank Baum\'s original "Oz" series contains the following complete works: "The Wonderful Wizard of Oz," "The Marvelous Land of Oz," "Ozma of Oz," "Dorothy and the Wizard in Oz," "The Road to Oz," "The Emerald City of Oz," "The Patchwork Girl Of Oz," "Little Wizard Stories of Oz," "Tik-Tok of Oz," "The Scarecrow Of Oz," "Rinkitink In Oz," "The Lost Princess Of Oz," "The Tin Woodman Of Oz," "The Magic of Oz," and "Glinda Of Oz." For over a hundred years, L. Frank Baum\'s classic fairy stories about the land of Oz have been delighting children and parents alike. Now, for the first time, the entire Oz series is available in this single, great-value, edition!', 'language': 'UND', 'creator': 'L. Frank Baum', 'title': 'The Wonderful Wizard of Oz', 'date': '2010-01-22T00:08:46', 'identifier': 'd1d2e9d3-2d97-44b9-924a-c59416e85df7', 'subject': 'Science fiction'}
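The first step of the function, reading META-INF/container.xml to find the package file, can be tried without a sample book. The following is only a minimal sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies, and the names build_demo_epub, find_package_path and the file name content.opf are hypothetical, made up for illustration. The archive it builds is not a complete, valid EPUB; it holds just enough to exercise the lookup.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {'n': 'urn:oasis:names:tc:opendocument:xmlns:container'}

# Hypothetical minimal files, just enough to exercise the lookup.
CONTAINER_XML = (
    '<?xml version="1.0"?>'
    '<container version="1.0" '
    'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
    '<rootfiles>'
    '<rootfile full-path="content.opf" '
    'media-type="application/oebps-package+xml"/>'
    '</rootfiles>'
    '</container>'
)
PACKAGE_XML = (
    '<?xml version="1.0"?>'
    '<package xmlns="http://www.idpf.org/2007/opf" version="2.0">'
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:title>Demo</dc:title>'
    '</metadata>'
    '</package>'
)


def build_demo_epub():
    """Build an in-memory ZIP archive laid out like the inside of an EPUB."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as zf:
        zf.writestr('META-INF/container.xml', CONTAINER_XML)
        zf.writestr('content.opf', PACKAGE_XML)
    buf.seek(0)
    return buf


def find_package_path(epub_file):
    """Return the full-path attribute of the first rootfile in container.xml."""
    with zipfile.ZipFile(epub_file) as zf:
        root = ET.fromstring(zf.read('META-INF/container.xml'))
    rootfile = root.find('n:rootfiles/n:rootfile', CONTAINER_NS)
    return rootfile.get('full-path')


print(find_package_path(build_demo_epub()))  # prints: content.opf
```

zipfile.ZipFile accepts a file-like object as well as a path, so the same find_package_path call should also work with the path of a real .epub file.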
Sample XML
<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="uuid_id">
<metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:calibre="http://calibre.kovidgoyal.net/2009/metadata" xmlns:dc="http://purl.org/dc/elements/1.1/">
<meta name="calibre:series_index" content="1"/>
<dc:language>UND</dc:language>
<meta name="calibre:timestamp" content="2010-01-22T00:08:46"/>
<dc:title>The Wonderful Wizard of Oz</dc:title>
<meta name="cover" content="cover"/>
<dc:date>2010-01-22T00:08:46</dc:date>
<dc:contributor opf:role="bkp">calibre (0.6.34) [http://calibre-ebook.com]</dc:contributor>
<dc:identifier id="uuid_id" opf:scheme="uuid">d1d2e9d3-2d97-44b9-924a-c59416e85df7</dc:identifier>
<dc:creator opf:role="aut" opf:file-as="Baum, L. Frank">L. Frank Baum</dc:creator>
<dc:publisher>Shoes and Ships and Sealing Wax Ltd</dc:publisher>
<opf:meta name="calibre:rating" content="8"/>
<dc:identifier opf:scheme="ISBN">9780954840143</dc:identifier>
<dc:subject>Science fiction</dc:subject>
<dc:subject>Fantasy</dc:subject>
<dc:subject>Epic</dc:subject>
<dc:subject>General</dc:subject>
<dc:subject>Fiction</dc:subject>
<dc:subject>Science Fiction &amp; Fantasy</dc:subject>
<dc:subject>Magic</dc:subject>
<dc:subject>Juvenile Fiction</dc:subject>
<dc:subject>Fantasy &amp; Magic</dc:subject>
<dc:subject>American</dc:subject>
<dc:subject>Fantasy fiction</dc:subject>
<dc:subject>Wizards</dc:subject>
<dc:subject>Classics</dc:subject>
<dc:subject>Anthologies</dc:subject>
<dc:subject>Classic fiction (Children's</dc:subject>
<dc:subject>YA)</dc:subject>
<dc:subject>Ages 9-12 Fiction</dc:subject>
<dc:subject>Young Adult Fiction</dc:subject>
<dc:subject>Action &amp; Adventure</dc:subject>
<dc:subject>Children's Books</dc:subject>
<dc:subject>&amp; Magic</dc:subject>
<dc:subject>Fairy tales</dc:subject>
<dc:subject>Children's stories</dc:subject>
<dc:subject>fables</dc:subject>
<dc:subject>Wizard of Oz (Fictitious character)</dc:subject>
<dc:subject>folk tales</dc:subject>
<dc:subject>Juvenile Fiction : General</dc:subject>
<dc:subject>magical tales &amp; traditional stories</dc:subject>
<dc:subject>Oz (Imaginary place)</dc:subject>
<dc:subject>Juvenile Fiction : Fantasy &amp; Magic</dc:subject>
<dc:description>SUMMARY:
This unique '15 books in 1' edition of L. Frank Baum's original "Oz" series contains the following complete works: "The Wonderful Wizard of Oz," "The Marvelous Land of Oz," "Ozma of Oz," "Dorothy and the Wizard in Oz," "The Road to Oz," "The Emerald City of Oz," "The Patchwork Girl Of Oz," "Little Wizard Stories of Oz," "Tik-Tok of Oz," "The Scarecrow Of Oz," "Rinkitink In Oz," "The Lost Princess Of Oz," "The Tin Woodman Of Oz," "The Magic of Oz," and "Glinda Of Oz." For over a hundred years, L. Frank Baum's classic fairy stories about the land of Oz have been delighting children and parents alike. Now, for the first time, the entire Oz series is available in this single, great-value, edition!</dc:description>
</metadata>
<manifest>
<item href="Baum, L. Frank - Oz 01 - The Wizard of Oz (illus.)_split_000.htm" id="Baum,_L._Frank_-_Oz_01_-_The_Wizard_of_Oz_(illus.)80" media-type="application/xhtml+xml"/>
...
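The sample XML repeats some elements: there are many dc:subject entries and two dc:identifier entries, one with opf:scheme="uuid" and one with opf:scheme="ISBN". One possible way to collect all of them is sketched below. It is an illustration rather than part of the original script: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies, collect_metadata is a hypothetical helper name, and SAMPLE is a trimmed copy of the XML above.

```python
import xml.etree.ElementTree as ET

NS = {
    'pkg': 'http://www.idpf.org/2007/opf',
    'dc': 'http://purl.org/dc/elements/1.1/',
}
# ElementTree exposes a namespaced attribute as '{namespace-uri}name'.
SCHEME_ATTR = '{http://www.idpf.org/2007/opf}scheme'

# Trimmed copy of the sample XML shown above.
SAMPLE = '''<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="uuid_id">
<metadata xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:identifier id="uuid_id" opf:scheme="uuid">d1d2e9d3-2d97-44b9-924a-c59416e85df7</dc:identifier>
<dc:identifier opf:scheme="ISBN">9780954840143</dc:identifier>
<dc:subject>Science fiction</dc:subject>
<dc:subject>Fantasy</dc:subject>
</metadata>
</package>'''


def collect_metadata(package_xml):
    """Return (subjects, identifiers) parsed from the package XML text.

    subjects is a list of every dc:subject value; identifiers maps each
    opf:scheme value to the text of its dc:identifier element.
    """
    root = ET.fromstring(package_xml)
    meta = root.find('pkg:metadata', NS)
    subjects = [el.text for el in meta.findall('dc:subject', NS)]
    identifiers = {
        el.get(SCHEME_ATTR, 'unknown'): el.text
        for el in meta.findall('dc:identifier', NS)
    }
    return subjects, identifiers


subjects, identifiers = collect_metadata(SAMPLE)
print(subjects)             # prints: ['Science fiction', 'Fantasy']
print(identifiers['ISBN'])  # prints: 9780954840143
```

Keying the identifiers by scheme avoids relying on their position in the file, which may differ from book to book.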
The epub library is also a good option for parsing: https://pypi.python.org/pypi/epub
https://github.com/bettse/epub-reader/blob/master/epub.py
A simple python script to unpack/parse epub books so they can be read on the command line.
Article URL: https://pylist.com/t/1439175771 (please credit the source when reposting)