Apache Nutch
== History ==
Nutch originated with [[Doug Cutting]], creator of both [[Lucene]] and [[Hadoop]], and [[Mike Cafarella]]. In June 2003, a 100-million-page demonstration system was successfully developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented [[MapReduce]] and a [[distributed file system]]. Both were later spun out into their own subproject, called [[Hadoop]].

In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the [[Apache Software Foundation]].<ref>{{Cite web|url=http://nutch.apache.org/#News|title=Apache Nutch -|website=nutch.apache.org}}</ref>

In February 2014, the [[Common Crawl]] project adopted Nutch for its open, large-scale web crawl.<ref name=":0">{{Cite web|title = Common Crawl's Move to Nutch – Common Crawl – Blog|url = http://blog.commoncrawl.org/2014/02/common-crawl-move-to-nutch/|website = blog.commoncrawl.org|access-date = 2015-10-14}}</ref>

While it was once a goal of the Nutch project to release a global, large-scale web search engine, that is no longer the case.{{Citation needed|date=October 2015}}

===Release history===
{| class="wikitable sortable" style="width: 80%"
|-
! style="width: 5%" | 1.x Branch
! style="width: 5%" | 2.x Branch
! style="width: 10%" | Release date
! style="width: 60%" class="unsortable" | Description
|-
| 1.1
|
| 2010-06-06
| This release includes several major upgrades of existing libraries (Hadoop, Solr, Tika, etc.) on which Nutch depends. Various bug fixes and speedups (e.g., to Fetcher2) are also included.
|-
| 1.2
|
| 2010-10-24
| This release includes several improvements (restoring parse-html as a selectable parser, configurable per-field indexing), new features (timing information in all Tool classes, parser timeouts), and bug fixes (an NPE in distributed search, XML formatting issues in Document fields).
|-
| 1.3
|
| 2011-06-07
| This release includes several improvements: improved RSS parsing support, tighter integration with Apache Tika, external parsing support, improved language identification, and a source release tarball an order of magnitude smaller (only about 2 MB).
|-
| 1.4
|
| 2011-11-26
| This release includes several improvements, such as allowing Parsers to declare support for multiple MIME types, configurable Fetcher Queue depth, Fetcher speed improvements, tighter Tika integration, and support for HTTP auth in Solr indexing.
|-
| 1.5
|
| 2012-06-07
| This release includes upgrades of several major components, including Tika 1.1 and Hadoop 1.0.0, improvements to LinkRank and WebGraph elements, and a number of new plugins covering blacklisting, filtering, and parsing, among others.
|-
|
| 2.0
| 2012-07-07
| This release offers users an edition focused on large-scale crawling, which builds on storage abstraction (via Apache Gora) for big data stores such as Apache Accumulo, Apache Avro, Apache Cassandra, Apache HBase, HDFS, an in-memory data store, and various high-profile SQL stores.
|-
| 1.5.1
|
| 2012-07-10
| This is a maintenance release of the popular 1.5.X mainstream version of Nutch, which has been widely adopted within the community.
|-
|
| 2.1
| 2012-10-05
| This release continues to provide Nutch users with a simplified Nutch distribution building on the 2.x development drive, which is growing in popularity amongst the community.
In addition to addressing about 20 bugs, this release offers improved properties for better Solr configuration, upgrades to various Gora dependencies, and the option to build indexes in Elasticsearch.
|-
| 1.6
|
| 2012-12-06
| This release includes over 20 bug fixes and as many improvements, as well as new functionality including a new HostNormalizer, the ability to dynamically set fetchInterval by MIME type, and functional enhancements to the Indexer API, including the normalization of URLs and the deletion of robots noIndex documents. Other notable improvements include the upgrade of key dependencies to Tika 1.2 and Automaton 1.11-8.
|-
|
| 2.2
| 2013-06-08
| This release includes over 30 bug fixes and over 25 improvements, and is the third release in the increasingly popular Nutch 2.x series. It features the inclusion of Crawler-Commons, which Nutch now uses for improved robots.txt parsing, and library upgrades to Apache Hadoop 1.1.1, Apache Gora 0.3, Apache Tika 1.2, and Automaton 1.11-8.
|-
| 1.7
|
| 2013-06-24
| This release includes over 20 bug fixes and as many improvements, most notably a new pluggable indexing architecture that currently supports Apache Solr and Elasticsearch. Following the recent Nutch 2.2 release, parsing of robots.txt is now delegated to Crawler-Commons. Key library upgrades have been made to Apache Hadoop 1.2.0 and Apache Tika 1.3.
|-
|
| 2.2.1
| 2013-07-02
| This release includes library upgrades to Apache Hadoop 1.2.0 and Apache Tika 1.3; it is predominantly a bug fix for NUTCH-1591 (incorrect conversion of ByteBuffer to String).
|-
| 1.8
|
| 2014-03-17
| This release includes library upgrades to Crawler Commons 0.3 and Apache Tika 1.5, along with over 30 bug fixes and 18 improvements.
|-
|
| 2.3
| 2015-01-22
| The Nutch 2.3 release comes packaged with a self-contained Apache Wicket-based web application.
The SQL backend for Gora has been deprecated.<ref>{{cite web |url=http://nutch.apache.org/#22-january-2015-nutch-23-release |title=Nutch 2.3 Release |publisher=The Apache Software Foundation |date=22 January 2015 |website=Apache Nutch News |access-date=18 January 2016}}</ref>
|-
| 1.10
|
| 2015-05-06
| This release includes a library upgrade to Tika 1.6 and provides over 46 bug fixes, 37 improvements, and 12 new features.<ref>{{cite web |url=https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680&version=12327187 |title=Nutch 1.10 Release Notes |publisher=The Apache Software Foundation |date=6 May 2015 |website=ASF JIRA |access-date=18 January 2016}}</ref>
|-
| 1.11
|
| 2015-12-07
| This release includes library upgrades to Hadoop 2.X and Tika 1.11, and provides over 32 bug fixes, 35 improvements, and 14 new features.<ref>{{cite web |url=https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680&version=12329358 |title=Nutch 1.11 Release Notes |publisher=The Apache Software Foundation |date=7 December 2015 |website=ASF JIRA |access-date=18 January 2016}}</ref>
|-
|
| 2.3.1
| 2016-01-21
| This bug fix release addresses around 40 issues.
|-
| 1.12
|
| 2016-06-18
|
|-
| 1.13
|
| 2017-04-02
|
|-
| 1.14
|
| 2017-12-23
|
|-
| 1.15
|
| 2018-08-09
|
|-
| 1.16
|
| 2019-10-11
|
|-
|
| 2.4
| 2019-10-11
| Expected to be the last release in the 2.X series, as "no committer is actively working on it".<ref>{{cite news |url=https://nutch.apache.org/news/legacy-nutch-news/#11-october-2019---nutch-24-release |title=Nutch 2.4 Release |publisher=The Apache Software Foundation |date=11 October 2019 |website=Apache Nutch News |access-date=20 May 2022}}</ref>
|-
| 1.17
|
| 2020-07-02
|
|-
| 1.18
|
| 2021-01-24
|
|-
| 1.19
|
| 2022-08-22
|
|-
| 1.20
|
| 2024-04-09
|
|}