Doing PCA on a very large dataset in R

Question

I have a very large training set (~2 GB) in a CSV file. The file is too large to read directly into memory (read.csv() brings the computer to a halt), and I would like to reduce the size of the data using PCA. The problem is that (as far as I can tell) I need to read the file into memory to run a PCA algorithm (for example, princomp()).

I tried the bigmemory package to read the file in as a big.matrix, but princomp does not work on big.matrix objects, and it does not look like a big.matrix can be converted to something like a data.frame.

Is there a way to run princomp on a large data file that I am missing?

I am a relative newbie to R, so some of this may be obvious to more experienced users (apologies in advance).

Thanks for any info.

+5

r pca bigdata

user141146 Sep 15 '12 at 1:23

2 answers

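Read a small sample of the file first to determine the column classes, then pass those classes to read.table when reading the full file:

    # assumes comma-separated values with a header row
    initial <- read.table("datatable.csv", header = TRUE, sep = ",", nrows = 100)
    classes <- sapply(initial, class)
    tabAll <- read.table("datatable.csv", header = TRUE, sep = ",", colClasses = classes)

If the dataset is large, use fread() from the data.table package.
Reduce the dimensionality before applying PCA, for example by removing zero-variance variables.
Then apply PCA.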
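As a rough illustration of the steps in this answer, here is a minimal base-R sketch. The toy data frame, its column names (a, b, const) and the temporary file are assumptions made up for the example, and read.csv and prcomp stand in for whichever reader and PCA function you actually use.

```r
# Toy data standing in for the real file (hypothetical column names).
set.seed(1)
toy <- data.frame(a = rnorm(50), b = rnorm(50), const = 1)
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# Infer the column classes from a small sample, then reuse them for the full read.
initial <- read.csv(path, nrows = 10)
classes <- sapply(initial, class)
tab_all <- read.csv(path, colClasses = classes)

# Drop columns with zero variance: they cannot be scaled and carry no information for PCA.
col_var <- sapply(tab_all, var)
keep <- names(col_var)[col_var > 0]
reduced <- tab_all[, keep, drop = FALSE]

# PCA on the remaining columns.
pca <- prcomp(reduced, center = TRUE, scale. = TRUE)
```

Passing colClasses mainly saves read.table from guessing types on the full file; it does not by itself make a file that exceeds memory readable.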

0

Gaurav Chavan 05 . '18 18:10

Paul Hiemstra · Accepted Answer · 2012-10-01T10:09:34+0000

The way I solved this was by computing the sample covariance matrix iteratively, so you only need a subset of the data in memory at any point in time. Reading only a subset of the data can be done with readLines: open a connection to the file and read it chunk by chunk. The algorithm looks something like this (it is a two-stage algorithm):

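Calculate the mean value of each column (assuming the columns are the variables):

1. Open a file connection (con = file(..., open = "r"))
2. Read 1000 lines (readLines(con, n = 1000))
3. Calculate the sum of each column in the chunk
4. Add those sums to a running total (sum_column = sum_column + new_sum)
5. Repeat steps 2-4 until the end of the file.
6. Divide the running total by the number of rows to get the column means.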

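Calculate the covariance matrix:

1. Open a file connection (con = file(..., open = "r"))
2. Read 1000 lines (readLines(con, n = 1000))
3. Subtract the column means from the chunk and calculate all cross-products using crossprod
4. Add those cross-products to a running total
5. Repeat steps 2-4 until the end of the file.
6. Divide the running total by the number of rows minus 1 to get the covariance matrix.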

Once you have the covariance matrix, call princomp with covmat = your_covmat, and princomp will skip computing the covariance matrix itself.

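This way, the data never has to fit in memory all at once. During the iterations, memory use is roughly that of a single chunk (for example, 1000 rows); after that, it is limited to the covariance matrix (nvar * nvar doubles).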
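The two stages described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the toy CSV, the helper name for_each_chunk and the variable names are made up for the example, and the file is assumed to have a header row and only numeric columns.

```r
# Toy CSV standing in for the large file (hypothetical; header row, numeric columns only).
set.seed(42)
toy <- as.data.frame(matrix(rnorm(5000 * 4), ncol = 4))
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

chunk_size <- 1000

# Hypothetical helper: apply fun to successive chunks, each parsed into a numeric matrix.
for_each_chunk <- function(path, chunk_size, fun) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1)            # skip the header row
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break            # end of file
    chunk <- as.matrix(read.csv(text = lines, header = FALSE))
    fun(chunk)
  }
}

# Stage 1: accumulate column sums and the row count to get the column means.
n <- 0
col_sum <- 0
for_each_chunk(path, chunk_size, function(chunk) {
  n <<- n + nrow(chunk)
  col_sum <<- col_sum + colSums(chunk)
})
col_mean <- col_sum / n

# Stage 2: accumulate cross-products of the mean-centered chunks.
cp <- 0
for_each_chunk(path, chunk_size, function(chunk) {
  centered <- sweep(chunk, 2, col_mean)      # subtract the column means
  cp <<- cp + crossprod(centered)
})
your_covmat <- cp / (n - 1)

# princomp uses the supplied covariance matrix instead of computing its own.
pca <- princomp(covmat = your_covmat)
```

When only covmat is supplied, princomp returns the standard deviations and loadings but no scores. To actually reduce the data, a third pass over the chunks that multiplies each centered chunk by the leading columns of pca$loadings would be needed.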
