Doing PCA on a very large data set in R

I have a very large training set (~2 GB) in a CSV file. The file is too large to read directly into memory (read.csv() brings the computer to a halt), and I would like to reduce the size of the data file using PCA. The problem is that (as far as I can tell) I need to read the file into memory to run a PCA algorithm (for example, princomp()).

I tried the bigmemory package to read the file in as a big.matrix, but princomp() does not work on big.matrix objects, and it does not look like a big.matrix can be converted to something like a data.frame.

Is there a way to run princomp() on a large data file that I am missing?

I am a relative newcomer to R, so some of this may be obvious to more experienced users (apologies in advance).

Thanks for any info.

2 answers

The way I solved this was by computing the sample covariance matrix iteratively. That way, you only need a subset of the data in memory at any point in time. Reading only a subset of the data can be done with readLines(): you open a file connection and read from it chunk by chunk. The algorithm looks something like this (it is a two-pass algorithm):

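Calculate the mean value of each column (assuming the columns are the variables):

  1. Open a file connection (con = file(..., open = "r"))
  2. Read 1000 lines (readLines(con, n = 1000))
  3. Calculate the sum per column
  4. Add those sums to a running total (sum_column = sum_column + new_sum)
  5. Repeat steps 2-4 until the end of the file.
  6. Divide the totals by the number of rows to get the means.

Calculate the covariance matrix:

  1. Open a file connection (con = file(..., open = "r"))
  2. Read 1000 lines (readLines(con, n = 1000))
  3. Subtract the column means from the chunk and calculate all cross-products using crossprod
  4. Add those cross-products to a running total
  5. Repeat steps 2-4 until the end of the file.
  6. Divide by the number of rows minus 1 to get the covariance matrix.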

Once you have the covariance matrix, call princomp with covmat = your_covmat, and princomp will skip calculating the covariance matrix itself.

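This way, the data sets you can process can be much larger than your available RAM. During the iterations, memory usage is roughly the size of one chunk (for example, 1000 rows); after that, it is limited to the covariance matrix (nvar * nvar doubles).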
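The two-pass approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the answerer's actual code: the function name chunked_covmat and its arguments are made up here, and the sketch assumes a comma-separated file with a header row, numeric columns only, and no missing values.

```r
# Two-pass, chunked covariance for a CSV that is too large to load at once.
# Assumptions (not from the original post): header row, all columns numeric,
# no missing values. chunked_covmat is a hypothetical helper name.
chunked_covmat <- function(path, chunk_size = 1000) {
  # Read up to chunk_size data rows from an open connection as a matrix;
  # returns NULL once the file is exhausted.
  read_chunk <- function(con) {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) return(NULL)
    as.matrix(read.csv(text = lines, header = FALSE))
  }

  # Column names come from the header plus a single data row.
  col_names <- names(read.csv(path, nrows = 1))
  p <- length(col_names)

  # Pass 1: accumulate column sums and the row count to get the means.
  con <- file(path, open = "r")
  invisible(readLines(con, n = 1))  # skip the header line
  col_sum <- numeric(p)
  n <- 0
  while (!is.null(chunk <- read_chunk(con))) {
    col_sum <- col_sum + colSums(chunk)
    n <- n + nrow(chunk)
  }
  close(con)
  col_mean <- col_sum / n

  # Pass 2: accumulate cross-products of the mean-centered chunks.
  con <- file(path, open = "r")
  invisible(readLines(con, n = 1))  # skip the header line
  cp <- matrix(0, p, p)
  while (!is.null(chunk <- read_chunk(con))) {
    cp <- cp + crossprod(sweep(chunk, 2, col_mean))
  }
  close(con)

  # Divide by n - 1 to get the sample covariance matrix.
  covmat <- cp / (n - 1)
  dimnames(covmat) <- list(col_names, col_names)
  covmat
}
```

With a file path of your own, the result can be passed on as princomp(covmat = chunked_covmat(path)). Note that princomp cannot return scores when it is given only a covariance matrix; projecting the rows onto the loadings would need one more chunked pass over the file.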


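Some things to keep in mind when importing a large data set:

  • Work out the structure (column classes) of the data from a small sample first, then pass those classes to the full read: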

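    # Read a small sample to infer the class of each column.
    # For a comma-separated file, add sep = "," (and header = TRUE if there is a header row) to both calls.
    initial <- read.table("datatable.csv", nrows = 100)

    classes <- sapply(initial, class)

    # Supplying colClasses spares read.table from guessing types on the full file.
    tabAll <- read.table("datatable.csv", colClasses = classes)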

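  • If the data set is large, use fread() from the data.table package.

  • Reduce the dimensionality before applying PCA. For example, remove variables with zero variance, since they carry no information.

  • Then apply PCA.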
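As a rough illustration of the last two points, the sketch below drops zero-variance columns and then runs PCA on what is left. The helper drop_zero_variance and the toy data frame are made up for this example, all columns are assumed to be numeric, and prcomp() is used here as one possible PCA function (the answer does not name one).

```r
# Drop constant (zero-variance) columns; assumes every column is numeric.
# drop_zero_variance is a hypothetical helper name.
drop_zero_variance <- function(df) {
  variances <- vapply(df, var, numeric(1))
  df[, variances > 0, drop = FALSE]
}

# Made-up toy data: column "k" is constant and carries no information.
set.seed(42)
toy <- data.frame(a = rnorm(50), b = rnorm(50), k = rep(1, 50))

reduced <- drop_zero_variance(toy)

# PCA on the remaining columns, centered and scaled to unit variance.
pca <- prcomp(reduced, center = TRUE, scale. = TRUE)
summary(pca)
```

Removing constant columns first also matters in practice here, because prcomp() with scale. = TRUE stops with an error when a column cannot be rescaled to unit variance.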
