What is the best way to run Lucene / Solr on Hadoop?

We run Solr on an Amazon Web Services EC2 instance with a 1 TB EBS volume that stores the index, so we can easily launch additional read-only servers with the same index. However, our index will soon exceed 1 TB, and I really don't want to deal with striping multiple EBS volumes to hold it. In addition, index recovery is very slow. I would like to move index generation (and possibly hosting) to Hadoop, preferably on Amazon Elastic MapReduce, although I can set up separate Hadoop servers if necessary. We use RightScale, so the ServerTemplates library is available to us.

What would be the best place to get started with Lucene / Solr on Hadoop?

2 answers

Take a look at ElasticSearch. You can use Hadoop to bulk-load data into an ElasticSearch index. Infochimps has an open source ElasticSearch indexing tool called Wonderdog, which you can look at as a proof of concept.

https://github.com/infochimps/wonderdog
http://www.elasticsearch.com

It is cloud friendly (see the cloud-aws plugin for discovery) and can scale up or down by adding or removing nodes that store the index.
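As a rough illustration of what bulk loading involves, here is a minimal Python sketch that builds the newline-delimited JSON body ElasticSearch's `_bulk` endpoint expects (one action line, then one document line). The index name `articles`, the `id` field, and the helper `build_bulk_body` are assumptions made up for this example; it is not how Wonderdog itself is implemented, and older ElasticSearch versions may also expect a `_type` in the action line.

```python
import json


def build_bulk_body(docs, index_name):
    """Serialize docs into a newline-delimited JSON body for a _bulk request.

    `docs` is an iterable of dicts; each is assumed (hypothetically) to
    carry its document id under the key "id".
    """
    lines = []
    for doc in docs:
        # Action line: asks ElasticSearch to index the line that follows.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["id"]}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk body has to end with a trailing newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample documents, e.g. the output of one Hadoop task.
docs = [{"id": "1", "title": "first"}, {"id": "2", "title": "second"}]
body = build_bulk_body(docs, "articles")
```

Each Hadoop task could build a body like this for its batch of records and POST it to the cluster's `_bulk` endpoint, which is why the load parallelizes well.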


Is your index sharded? You can shard the index and distribute the shards across multiple instances.
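The idea of splitting an index and spreading the pieces over several instances can be sketched in Python as follows. This is only an illustration under assumptions: the hash-modulo routing rule and the helper names `shard_for` and `partition` are made up for the example, and neither Solr nor ElasticSearch is claimed to route documents exactly this way.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Map a document id to a shard number in a stable, repeatable way."""
    # md5 is used only as a stable hash here, not for security.
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def partition(doc_ids, num_shards):
    """Group document ids by the shard that would hold them."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


# Hypothetical ids; each inner list would be indexed on its own instance.
doc_ids = ["doc-%d" % i for i in range(1000)]
shards = partition(doc_ids, 4)
```

With the documents split this way, each instance holds only its own shard, so no single volume has to store the whole index.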
