Using sklearn and Python for a large-scale classification / scraping application

I am working on a relatively large text-based web classification problem, and I plan to use the multinomial Naive Bayes classifier in sklearn (Python) together with the scrapy framework for crawling. However, I'm a little worried that sklearn / Python might be too slow for a problem that could involve classifying millions of websites. I have already trained the classifier on several thousand sites from DMOZ. The workflow is as follows:

1) The crawler lands on a domain name and scrapes the text from 20 links on the site (to a depth of no more than one). The number of tokens here appears to range from a few thousand up to 150K for a sample run of the crawler.
2) Run the sklearn multinomial NB classifier with about 50,000 features and record the domain name depending on the result.

My question is whether a Python-based classifier is up to the task for such a large-scale application, or whether I should rewrite the classifier (and possibly the scraper and tokenizer) in a faster environment. If so, which environment? Or is Python enough if it is combined with some parallelization of the code? Thanks.

+5

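Use HashingVectorizer together with one of the linear classifiers that supports the partial_fit API, for instance SGDClassifier, Perceptron, or PassiveAggressiveClassifier. That lets you learn the model incrementally, without having to vectorize and load all the data into memory up front, and it should scale to a very large number of documents and (hashed) features.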
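To make the incremental approach concrete, here is a minimal sketch of HashingVectorizer feeding SGDClassifier.partial_fit one batch at a time. The class names, the toy texts, and the `iter_batches` helper are made up for illustration; in a real setup the batches would come from the crawler.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer is stateless: there is no vocabulary to fit,
# so every batch can be transformed independently of the others.
vectorizer = HashingVectorizer(n_features=2**18)

# Hypothetical label set; partial_fit needs the full list of classes up front.
classes = ["news", "shopping"]


def iter_batches():
    """Hypothetical stand-in for a stream of (texts, labels) batches."""
    yield (["breaking news headlines today", "buy shoes online cheap"],
           ["news", "shopping"])
    yield (["latest world news report", "discount shopping cart checkout"],
           ["news", "shopping"])


clf = SGDClassifier(random_state=0)
for texts, labels in iter_batches():
    X = vectorizer.transform(texts)              # sparse matrix for this batch only
    clf.partial_fit(X, labels, classes=classes)  # update the model in place

pred = clf.predict(vectorizer.transform(["news report today"]))
print(pred)
```

Only one batch is held in memory at a time, so memory use depends on the batch size rather than on the total number of sites.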

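However, you should load a small subsample that fits in memory (for example, 100 thousand documents) and search for good vectorizer parameters using a Pipeline and RandomizedSearchCV. You can also tune the regularization parameter (for example, C for PassiveAggressiveClassifier or alpha for SGDClassifier) using the same RandomizedSearchCV, either on that subsample or on a larger, pre-vectorized dataset that still fits in memory (for example, a couple of million documents).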
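A parameter search over a Pipeline with RandomizedSearchCV might look roughly like the sketch below. The toy corpus and the candidate values in `param_distributions` are assumptions chosen only to keep the example small, and the import path is the one used by current scikit-learn releases (`sklearn.model_selection`).

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Toy in-memory subsample (hypothetical data standing in for real pages).
texts = (["world news report number %d" % i for i in range(20)]
         + ["buy cheap shoes offer %d" % i for i in range(20)])
labels = ["news"] * 20 + ["shopping"] * 20

pipeline = Pipeline([
    ("vect", HashingVectorizer()),
    ("clf", SGDClassifier(random_state=0)),
])

# Candidate values are illustrative assumptions, not recommendations.
param_distributions = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__n_features": [2**14, 2**16],
    "clf__alpha": [1e-6, 1e-5, 1e-4, 1e-3],
}

search = RandomizedSearchCV(pipeline, param_distributions,
                            n_iter=5, cv=3, random_state=0)
search.fit(texts, labels)
print(search.best_params_)
```

The parameters found on the subsample can then be reused when training incrementally on the full data.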

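Linear models can also be averaged (average the coef_ and intercept_ of the models), so you can partition the dataset, train linear models independently on each part, and then average them to get the final model.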
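A sketch of averaging coef_ and intercept_ across independently trained linear models follows. It assumes every model uses the same HashingVectorizer settings and the same class list, so the coefficient arrays line up; the two data partitions are toy examples.

```python
import copy

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**10)
classes = ["news", "shopping"]  # hypothetical label set shared by all models

# Two hypothetical partitions of the training data.
part_a = (["world news report", "buy cheap shoes"], ["news", "shopping"])
part_b = (["breaking news today", "online shopping discount"], ["news", "shopping"])

# Train one model per partition; these fits are independent of each other.
models = []
for texts, labels in (part_a, part_b):
    clf = SGDClassifier(random_state=0)
    clf.partial_fit(vectorizer.transform(texts), labels, classes=classes)
    models.append(clf)

# Build the final model by averaging the learned parameters.
averaged = copy.deepcopy(models[0])
averaged.coef_ = np.mean([m.coef_ for m in models], axis=0)
averaged.intercept_ = np.mean([m.intercept_ for m in models], axis=0)

pred = averaged.predict(vectorizer.transform(["news report"]))
print(pred)
```

Because the per-partition fits do not depend on one another, they could run on separate processes or machines before the averaging step.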

+5

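Fundamentally, if you rely on numpy, scipy, and sklearn, Python will not be the bottleneck, since the most critical parts of those libraries are implemented as C extensions.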

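However, since you are scraping millions of sites, you will be limited by the capacity of a single machine. I would consider a service such as PiCloud [1] or Amazon Web Services (EC2) to distribute the workload across many servers.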

One example would be to funnel the scraping through Cloud Queues [2].
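The queue pattern itself can be tried locally before moving to a hosted service. The sketch below uses only the Python standard library as a stand-in and does not use the PiCloud or AWS APIs; `classify_domain` is a hypothetical placeholder for the real scrape-then-classify step.

```python
import queue
import threading


def classify_domain(domain):
    """Hypothetical placeholder for 'scrape the domain, then classify its text'."""
    return "long-name" if len(domain) > 12 else "short-name"


def worker(tasks, results):
    """Pull domains off the task queue until a None sentinel arrives."""
    while True:
        domain = tasks.get()
        if domain is None:
            tasks.task_done()
            break
        results.put((domain, classify_domain(domain)))
        tasks.task_done()


tasks = queue.Queue()
results = queue.Queue()
domains = ["example.com", "example.org", "a-much-longer-example.net"]

n_workers = 2
threads = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(n_workers)]
for t in threads:
    t.start()
for d in domains:
    tasks.put(d)
for _ in threads:
    tasks.put(None)  # one sentinel per worker so each one exits
for t in threads:
    t.join()

labelled = dict(results.get() for _ in range(len(domains)))
print(labelled)
```

With a hosted queue, the workers would run on separate servers, but the shape stays the same: producers enqueue domains and consumers scrape, classify, and emit results.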

[1] http://www.picloud.com

[2] http://blog.picloud.com/2013/04/03/introducing-queues-creating-a-pipeline-in-the-cloud/

+3
