Lucene.NET 4.8 vs Java Lucene 9.x
Don’t let the 4.8 version number fool you. Lucene.NET 4.8 contains the vast majority of the features found in Java Lucene 9.x.
Lucene.NET is a powerful open source search library managed by the Apache Software Foundation. I am a committer for the project and I have used it pretty extensively. I donate my time to the project because it’s a wonderful thing that software this powerful is available for anyone to freely use under the very liberal Apache License 2.0.
Lucene.NET 4.8 is Feature Rich
People sometimes get the mistaken notion that Lucene.NET 4.8 must have very limited features compared to the current Java Lucene version, which is at 9.x. But that’s just not the case. The mistake is easy to understand when you consider that one version number is roughly half of the other. The reality, however, is quite different: Lucene.NET 4.8 contains the majority of the features found in Java Lucene 9.x.
Why is that? Well, the single most significant set of features ever introduced into Java Lucene came in going from 3.x to 4.0, and that release was several years in the making. Since then, the releases have been much smaller, incremental steps. It should therefore be no surprise that getting from Lucene.NET 3.x to 4.x has taken similarly long, because the architectural difference between the two versions is huge.
The Lucene.NET 4.8 library contains MUCH more code than the 3.x version, and its feature set is much richer. This is because Lucene 4.0 introduced a whole new pluggable, configurable architecture to make customization much easier. This architectural change required a rewrite of a great many areas of Lucene.
The Apache Lucene.NET team chose to port version 4.8 rather than 4.0, so it’s basically the 4.x feature set with a bit of time to mature and be polished. That port has taken many years to accomplish, but it is now in a late beta stage (beta16 was just released). In the meantime, a few wonderful new features have been added to the Java version, which is now at 9.x. But the list of major feature differences is fairly short, maybe half a dozen.
I don’t think I’d be going out on a limb to say that version 4.x is much more similar to version 9.x than it is to version 3.x. By that I mean that code examples you find for version 7, for example, are more likely to “just work” in version 4.x than code examples from version 3.x.
So don’t let the version number fool you. Lucene.NET 4.8 is an amazing piece of technology that is extremely relevant today and brings an enormous amount of power to .NET developers. It’s not “old” or “outdated” by any means, and in fact it contains some of the most advanced search technology available anywhere.
The more you learn about Lucene.NET 4.8 the more impressed you will become. Give this Apache Lucene 4 whitepaper a read to see what I mean.
“I’ve been using Lucene.NET 4.8 for over a year and I am blown away by how powerful it is.” Ron Clabo, Founder, GiftOasis
Lucene.NET 4.8’s Features
Here are a few of the high-level features found in Lucene.NET 4.8:
- Inverted Index to support blazing fast full text search
- Document store for storing and retrieving the indexed documents
- Supports schema-less documents that are fully searchable.
- Support for a wide variety of field types
- Ability to show where search phrase occurred in the matching documents
- Wide variety of text Analyzers to support advanced text processing
- Analyzer support for more than 35 languages including English, French, German, Spanish, Japanese, Korean, Chinese and more.
- Can support case insensitive indexing and searching
- Can support spelling correction
- Can collapse singulars and plurals to match both
- Support for Porter stemming so searches are based on word roots
- Supports ability to build and use custom text Analyzers
- Supports the ability to not index common words like a, an, the (i.e. stopword removal)
- Rich query language for querying documents
- Supports non-blocking writes to index
- Supports non-blocking queries by multiple threads
- Supports multiple levels of extra data compression (or none) for stored documents.
- Supports a wide variety of query types
- Use of Finite-State Automata to boost query performance
- Supports multiple document scoring models including TF/IDF and BM25
- Supports custom algorithms for scoring documents
- Supports custom index payloads that can be used for query-time document score boosting. For example, if the search term is found in the title field, it should get more weight in the score than if it is found in the body text.
- Auto complete support
- Faceting support (includes two different faceting approaches).
- Columnar storage of field data for fast sorting, grouping and aggregation.
- Supports pluggable codecs which enable customization of where and how the data is stored.
- Designed by default to use very little RAM so that the system can scale to millions of documents with a lean memory footprint.
- Almost all of the features are opt-in and tunable, so the system is as lean and fast as possible.
And the list above is just a partial list of the top features that come to mind. It’s not anywhere close to an exhaustive list.
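To make a few of those bullets a little more concrete (the inverted index, case-insensitive matching and stopword removal), here is a toy sketch in plain Java with no Lucene dependency. It is not how Lucene or Lucene.NET is implemented, and the class name `TinyInvertedIndex`, its methods and the three-word stopword list are all made up for illustration. It only shows the basic idea: text is analyzed into terms, and each term maps to the documents that contain it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy illustration only: NOT Lucene's data structures or API.
class TinyInvertedIndex {
    // Hypothetical stopword list, chosen just for this sketch.
    private static final Set<String> STOPWORDS = Set.of("a", "an", "the");

    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    // A minimal "analyzer": lowercase, split on non-alphanumerics, drop stopwords.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Index a document by recording its id under every analyzed term.
    void add(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Returns ids of documents containing every analyzed query term (AND semantics).
    List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : analyze(query)) {
            TreeSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? List.of() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "The quick brown fox");
        index.add(2, "A lazy brown dog");
        index.add(3, "The Fox and the hound");
        System.out.println(index.search("FOX"));        // [1, 3]
        System.out.println(index.search("brown fox"));  // [1]
        System.out.println(index.search("the"));        // [] (stopword only)
    }
}
```

Because the same analysis runs at index time and at query time, "FOX" and "fox" land on the same term, and a lookup is a map access rather than a scan over every document. A real search library layers scoring, compression, concurrency and much more on top of this basic shape.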
So if you are thinking of giving Lucene.NET a try, definitely don’t let the 4.8 version number scare you off. It truly is an amazing search library.