Evenly distributing skewed score data in Cassandra

Question

I have a rather tricky problem, so bear with me while I try not to stumble over my words. I do research, and my group is moving to Cassandra. Our study used MySQL before, but the data outgrew the database (192 million rows held in 16 GB of memory, which was the only way to query the data quickly). The data itself is fairly static: there is a lot of it, but new data currently arrives slowly.

The data consists of a large number of classifier/score pairs. We send the database queries that basically say: "Give me the top 500 scores for the following classifiers." The database then returns a batch of scores. For example, if we ask for 500 scores for 2 classifiers, we get 1,000 rows back (each row consists of a classifier ID and a score, e.g. [4, 9100]). The scores themselves are unevenly distributed: they tend to cluster toward one end of the range, which runs from -10000 to 10000.

In moving to Cassandra, we have a number of requirements. First of all, we need to be able to query the top and bottom N scores for each classifier. Normally I'd expect an order-preserving partitioner to be the right fit for this, but, as I said, the scores tend to cluster at the extremes (which would put too much load on one node). So my first question is: how do I evenly distribute the classifier/score pairs while still being able to query for the top or bottom N?
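To make the first requirement concrete, here is a minimal in-memory sketch of the access pattern: one ascending list of scores per classifier, with top and bottom N read as slices from either end. This is only an illustration of the query shape, not a Cassandra client call or a recommended schema; `ScoreStore`, `top`, `bottom`, and `query_top` are hypothetical names invented for this example.

```python
from bisect import insort


class ScoreStore:
    """Hypothetical stand-in: one ascending score list per classifier."""

    def __init__(self):
        self._scores = {}

    def add(self, classifier_id, score):
        # Keep each classifier's scores sorted as they are inserted.
        insort(self._scores.setdefault(classifier_id, []), score)

    def top(self, classifier_id, n):
        # Highest n scores, best first.
        scores = self._scores.get(classifier_id, [])
        if n <= 0:
            return []
        return [(classifier_id, s) for s in reversed(scores[-n:])]

    def bottom(self, classifier_id, n):
        # Lowest n scores, lowest first.
        scores = self._scores.get(classifier_id, [])
        return [(classifier_id, s) for s in scores[:max(n, 0)]]

    def query_top(self, classifier_ids, n):
        # "Top n for each of these classifiers": up to n rows per classifier.
        rows = []
        for cid in classifier_ids:
            rows.extend(self.top(cid, n))
        return rows


store = ScoreStore()
for cid, score in [(4, 9100), (4, -50), (4, 300), (6, 1), (6, 2)]:
    store.add(cid, score)
rows = store.query_top([4, 6], 2)
```

Because each classifier's scores are grouped and sorted on their own, the cost of a top or bottom N read does not depend on how skewed the overall score distribution is. Whether that maps cleanly onto a particular Cassandra layout is exactly the open question here.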

There is a secondary requirement that complicates the first. Sometimes we need to find all the scores that are close to another score. So if I see Classifier 6 with a score of 400, I may ask: "Show me the 500 scores closest to this one" (all within Classifier 6). I am absolutely stumped on this one. I read that Cassandra supports secondary indexes (yay), but only hash-based ones (boo, no range queries). Would we create a separate ColumnFamily for this use case?
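The "closest to this score" query can be pictured as a binary search into one classifier's sorted scores, followed by expanding outward in both directions. The sketch below works on a plain Python list and assumes the scores are already sorted ascending; `closest_scores` is a hypothetical helper for illustration, not something Cassandra provides.

```python
from bisect import bisect_left


def closest_scores(sorted_scores, target, k):
    """Return up to k scores nearest to target from an ascending list."""
    hi = bisect_left(sorted_scores, target)  # first index with score >= target
    lo = hi - 1                              # last index with score < target
    picked = []
    while len(picked) < k and (lo >= 0 or hi < len(sorted_scores)):
        # Take from the left when the right side is exhausted, or when the
        # left neighbour is at least as close as the right one.
        take_left = hi >= len(sorted_scores) or (
            lo >= 0
            and target - sorted_scores[lo] <= sorted_scores[hi] - target
        )
        if take_left:
            picked.append(sorted_scores[lo])
            lo -= 1
        else:
            picked.append(sorted_scores[hi])
            hi += 1
    return picked


classifier_6 = [100, 390, 395, 400, 410, 900]
nearest = closest_scores(classifier_6, 400, 3)
```

The same idea would need two range reads around the target (one ascending, one descending) against whatever sorted storage holds the scores, which is why a hash-only index does not help.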

And finally, speed is paramount. The data is used in an interactive graphical application, so ideally queries should take no more than a few seconds. If all the data ends up on one particular node, that will slow things down.

. , , 500 1, 500 2 .. , 500 1. . , MOST 1, node (, N , 500 * N ). , , , ( - , ).

. , , , . - , ( node, RDBM). , : ? , , . . .

database cassandra database-design cassandra-0.7

Chris Eberle 16 . '11 17:34

Jcs · Accepted Answer · 2011-03-16T22:19:40+0000

. , / 500 . , s, , , 500 s 500 s, 500 s.
