A blog for sharing personal learning and development experience, http://www.joyphper.net

jsoup 1.7.3 released: a powerful HTML parser

posted @ 2013-11-11 22:44 | Views: 3734 | Comments: 0 | Categories: PHP, News, Software

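jsoup has just released version 1.7.3. It improves form handling, makes character-set detection more reliable, speeds up CSS selectors and parsing, reduces memory use, and fixes several bugs.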

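jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.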

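jsoup's main features are:

- Parse HTML from a URL, file, or string;
- Find and extract data using DOM traversal or CSS selectors;
- Manipulate HTML elements, attributes, and text.

jsoup is released under the MIT license, so it can be used in commercial projects without concern.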
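The features listed above can be sketched roughly as follows. This is a minimal illustration, not taken from the jsoup documentation: the sample markup, the `http://example.com/` base URI, and the class and method names (`JsoupBasics`, `linkTargets`, `rewriteFirstLink`) are all made up for the example, and it assumes the jsoup jar is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class JsoupBasics {
    // Hypothetical sample markup, used only for this sketch.
    static final String HTML =
        "<html><head><title>Demo</title></head><body>"
      + "<div id='news'><a href='/a'>First</a><a href='/b'>Second</a></div>"
      + "</body></html>";

    /** Parse HTML from a string and collect absolute link targets via a CSS selector. */
    static List<String> linkTargets(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Elements links = doc.select("div#news a[href]");
        List<String> targets = new ArrayList<String>();
        for (Element link : links) {
            // The "abs:" prefix resolves the attribute against the base URI.
            targets.add(link.attr("abs:href"));
        }
        return targets;
    }

    /** Manipulate the text and an attribute of the first link, then return its HTML. */
    static String rewriteFirstLink(String html) {
        Document doc = Jsoup.parse(html);
        Element first = doc.select("a").first();
        first.text("Updated").attr("rel", "nofollow");
        return first.outerHtml();
    }

    public static void main(String[] args) {
        System.out.println(Jsoup.parse(HTML).title());
        System.out.println(linkTargets(HTML, "http://example.com/"));
        System.out.println(rewriteFirstLink(HTML));
    }
}
```

To parse a live page instead of a string, `Jsoup.connect(url).get()` returns a `Document` that can be queried in the same way.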

The detailed changes are as follows:

Improvements:
- Added the element type FormElement, to facilitate simple form submissions. Find forms in a doc using Elements.forms(), then prepare it for submission with FormElement.submit().
- Improved the reliability of HTTP character-set recognition from response headers, particularly for when servers return out-of-spec responses.
- Added Document.location() to retrieve the document's location URL. Handy if the request was redirected from the original URL.
- Large decrease in the amount of temporary objects created during parsing, leading to less GC load (helpful particularly on Android), and faster parsing.
- Improved the time to match elements with common CSS selectors by ~ 27%.

Bug Fixes:
- Fixed support for self-closing script tags.
- Fixed a crash when reading an unterminated CDATA section.
- Fixed an issue where elements added via the adoption agency algorithm did not preserve their attributes.
- Fixed an issue when cloning a document with extremely nested elements that could cause a stack-overflow.
- Fixed an issue when connecting or redirecting to a URL that contains a space.
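
The `FormElement` and `Document.location()` additions listed under Improvements might be used along these lines. This is a hedged sketch rather than an official example: the form markup, the base URI, and the names `FormSketch`, `firstForm`, and `fields` are hypothetical. The code only prepares the request and never executes it, so no network access is involved.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.FormElement;

import java.util.LinkedHashMap;
import java.util.Map;

public class FormSketch {
    // Hypothetical base URI and form markup, used only for this sketch.
    static final String BASE_URI = "http://example.com/account/";
    static final String HTML =
        "<form action='/login' method='post'>"
      + "<input type='text' name='user' value='alice'>"
      + "<input type='hidden' name='token' value='abc123'>"
      + "</form>";

    /** Find the first form among the selected elements using Elements.forms(). */
    static FormElement firstForm(Document doc) {
        return doc.select("form").forms().get(0);
    }

    /** Collect the form's name/value pairs in document order. */
    static Map<String, String> fields(FormElement form) {
        Map<String, String> out = new LinkedHashMap<String, String>();
        for (Connection.KeyVal kv : form.formData()) {
            out.put(kv.key(), kv.value());
        }
        return out;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML, BASE_URI);
        // For a document parsed from a string, location() is the base URI passed in.
        System.out.println(doc.location());

        FormElement form = firstForm(doc);
        System.out.println(fields(form));

        // submit() prepares a Connection; calling execute() or post() would send it.
        Connection con = form.submit();
        System.out.println(con.request().method() + " " + con.request().url());
    }
}
```

When the document comes from `Jsoup.connect(url).get()`, `doc.location()` reports the final URL after any redirects, which is the case the changelog entry describes.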

TAG: jsoup
