Advertisement

Important Big Data Additions to the R Analyst’s Toolchest

By on

Click to learn more about author Steve Miller.

After my partner read my last blog, Frequencies in R — Part 2, where I used R’s data.table and dplyr packages to construct performant frequencies procedures on an in-memory 27M+ row, 30 attribute data.table, he asked if I’d compared my results with the equivalent functionality of R’s MonetDB.R package.

I have indeed used MonetDB.R and love its capabilities. MonetDB is an open source column store DBMS that delivers speedy performance on analytic-like SQL queries. And MonetDB.R provides connectivity to MonetDB database servers, allowing the R developer to execute SQL commands, push and pull data from MonetDB to R, load CSV data into a database, etc. With these capabilities, a programmer can work around R’s memory limitations, accessing subsets of data into R structures as needed.

Somewhat newer is MonetDBLite for R, “a fully embedded version of MonetDB that installs like any other R package. The database runs within the R process itself (like its namesake SQLite), greatly improving efficiency of data transfers. The package is available for Linux, Mac OS X and Windows (64 bit).” Thus, with MonetDBLite, a MonetDB server needn’t be present; only the readily available R packages DBI and MonetDBLite need be installed. An additional benefit of both MonetDB.R and MonetDBLite is the ability to use Hadley Wickham’s powerful dplyr data management package commands in tandem with SQL.

So I decided to take a look at MonetDBLite against data that’s too large for my PC’s memory. Conveniently, a 128M+ record, 70 attribute data set from the Home Morthgage Disclosure Actmet the need. After downloading and unzipping the six annual “f” files, I was set to start. The remainder of the blog demonstrates loading the data from delimited files, then using both SQL and dplyr queries to tabulate attribute frequencies. Important to note is that data must be loaded into MonetDBLite only once. The persistent directory-based storage can then be accessed for subsequent analyses.

First, assign options, load R packages, set working directories, and “connect” to MonetDBlite.

Now load each of the 6 files into the MonetDBLite database in turn, invoking the dbWriteTable function and a helper loaddata procedure that reads and wrangles the data into R. The first call to dbWriteTable overwrites the existing table; subsequent calls append. Load time for the 128M+ record, 70 attribute table is about an hour on my notebook. Once the data are loaded, execute 2 simple SQL queries.

Next create a MonetDBLite handle and use the tbl function to enable dplyr functionality against the hmda table.

frequenciesmonetdb is a generic frequencies function that can be invoked against columns of a MonetDBLite table as well as against columns from an R data.frame. genallfreqmonetdb generates frequencies for either all or designated subsets of table columns. Frequencies for a given column complete in about 5 seconds against the 128M rows.

Now start over but this time access the persistent data. Run the same SQL and dplyr queries.

Observations: NA
Variables: 70
$ v1  <int> 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 200...
$ v2  <chr> "0000000008", "0000000008", "0000000008", "0000000008", "000000...
$ v3  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ v4  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, ...
$ v5  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ v6  <int> 1, 3, 3, 1, 3, 3, 1, 3, 3, 3, 1, 3, 3, 1, 1, 3, 1, 1, 3, 3, 3, ...
$ v7  <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, ...
$ v8  <dbl> 200, 184, 264, 585, 561, 185, 232, 192, 114, 280, 214, 200, 132...
$ v9  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
$ v10 <int> 6, 3, 1, 3, 3, 3, 1, 1, 1, 3, 1, 3, 2, 3, 1, 1, 3, 1, 6, 1, 2, ...
$ v11 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ v12 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
$ v13 <int> 50, 70, 110, 110, 110, 110, 110, 110, 110, 110, 110, 110, 122, ...
$ v14 <chr> "0002.00", "0002.00", "0001.00", "0001.00", "0001.00", "0002.00...
$ v15 <int> 2, 2, 2, 3, 3, 2, 2, 3, 3, 2, 3, 2, 2, 2, 3, 2, 3, 2, 2, 3, 3, ...
$ v16 <int> 2, 5, 5, 3, 5, 2, 5, 5, 5, 2, 5, 5, 5, 5, 3, 2, 3, 2, 5, 5, 3, ...
$ v17 <int> 5, 5, 5, 5, 5, 5, 5, 6, 6, 5, 6, 5, 5, 5, 6, 5, 6, 5, 5, 6, 6, ...
$ v18 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ v19 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ v20 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ v21 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ v22 <int> 5, 8, 8, 5, 8, 5, 8, 8, 8, 5, 8, 8, 8, 8, 6, 5, 6, 5, 8, 8, 6, ...
$ v23 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ v24 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ v25 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ v26 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ v27 <int> 1, 1, 1, 1, 2, 1, 2, 3, 3, 1, 3, 1, 2, 1, 3, 1, 3, 1, 1, 1, 3, ...
$ v28 <int> 2, 5, 5, 2, 5, 2, 5, 5, 5, 2, 5, 5, 5, 5, 3, 2, 3, 2, 5, 5, 3, ...
$ v29 <chr> "0108", "0148", "0288", "0285", NA, "0083", "0070", "0059", "00...
$ v30 <int> 8, 0, 8, 0, 0, 0, 8, 8, 8, 0, 8, 0, 0, 0, 8, 8, 0, 8, 8, 8, 0, ...
$ v31 <chr> NA, "7", NA, "7", "7", "7", NA, NA, NA, "4", NA, "7", NA, "4", ...
$ v32 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ v33 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ v34 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ v35 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
$ v36 <int> 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, ...
$ v37 <int> NA, NA, NA, NA, 6, NA, NA, NA, NA, NA, NA, 6, NA, NA, NA, NA, N...
$ v38 <int> 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, ...
$ v39 <dbl> 2141, 2142, 2143, 2144, 2145, 2146, 2147, 2148, 2149, 2150, 215...
$ v40 <int> 5471, 2512, 4468, 4468, 4468, 7445, 5135, 5135, 5135, 5135, 364...
$ v41 <chr> "73.84", "65.84", "13.59", "13.59", "13.59", "26.02", "29.76", ...
$ v42 <chr> "6.7e+04", "6.7e+04", "6.7e+04", "6.7e+04", "6.7e+04", "6.7e+04...
$ v43 <chr> "115.07", "112.12", "156.37", "156.37", "156.37", "137.56", "11...
$ v44 <int> 752, 449, 1268, 1268, 1268, 1955, 1026, 1026, 1026, 1026, 840, ...
$ v45 <int> 1744, 846, 1648, 1648, 1648, 2400, 1584, 1584, 1584, 1584, 1180...
$ v46 <int> 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 200...
$ v47 <chr> "13-4994650", "13-4994650", "13-4994650", "13-4994650", "13-499...
$ v48 <chr> "JPMORGAN CHASE BANK", "JPMORGAN CHASE BANK", "JPMORGAN CHASE B...
$ v49 <chr> "194 WOOD AVE SOUTH", "194 WOOD AVE SOUTH", "194 WOOD AVE SOUTH...
$ v50 <chr> "ISELIN", "ISELIN", "ISELIN", "ISELIN", "ISELIN", "ISELIN", "IS...
$ v51 <chr> "NJ", "NJ", "NJ", "NJ", "NJ", "NJ", "NJ", "NJ", "NJ", "NJ", "NJ...
$ v52 <chr> "08830", "08830", "08830", "08830", "08830", "08830", "08830", ...
$ v53 <chr> "JPMORGAN CHASE BANK, NA", "JPMORGAN CHASE BANK, NA", "JPMORGAN...
$ v54 <chr> "1111 POLARIS PARKWAY", "1111 POLARIS PARKWAY", "1111 POLARIS P...
$ v55 <chr> "COLUMBUS", "COLUMBUS", "COLUMBUS", "COLUMBUS", "COLUMBUS", "CO...
$ v56 <chr> "OH", "OH", "OH", "OH", "OH", "OH", "OH", "OH", "OH", "OH", "OH...
$ v57 <chr> "43240", "43240", "43240", "43240", "43240", "43240", "43240", ...
$ v58 <chr> "JPMORGAN CHASE BK NA", "JPMORGAN CHASE BK NA", "JPMORGAN CHASE...
$ v59 <chr> "COLUMBUS", "COLUMBUS", "COLUMBUS", "COLUMBUS", "COLUMBUS", "CO...
$ v60 <chr> "OH", "OH", "OH", "OH", "OH", "OH", "OH", "OH", "OH", "OH", "OH...
$ v61 <dbl> 1013985000, 1013985000, 1013985000, 1013985000, 1013985000, 101...
$ v62 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ v63 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
$ v64 <dbl> 756551, 756551, 756551, 756551, 756551, 756551, 756551, 756551,...
$ v66 <chr> "0", "0", "0", "0", "0", "0", "0", NA, NA, "0", NA, "0", "0", "...
$ v67 <int> 0, NA, NA, 0, NA, 0, NA, NA, NA, 0, NA, NA, NA, NA, NA, 0, NA, ...
$ v68 <int> 1, 1, 1, 1, 1, 1, 1, NA, NA, 1, NA, 1, 1, 1, NA, 1, NA, 1, 1, N...
$ v69 <int> 1, NA, NA, 1, NA, 1, NA, NA, NA, 1, NA, NA, NA, NA, NA, 1, NA, ...
$ v70 <int> 5, 5, 5, 5, 5, 5, 5, NA, NA, 5, NA, 5, 5, 5, NA, 5, NA, 5, 5, N...
$ v71 <chr> "Not Hispanic", "Not Hispanic", "Not Hispanic", NA, NA, "Not Hi...
   user  system elapsed 
   0.41    3.35   37.56 



Finally, an illustration of combining a SQL query with dplyr code.

Nice capabilities. A reasonable strategy for handling, say, 100 GB of data for R analysis would be to: 1) serially load and munge “pieces” of the larger data into R for subsequent append to a MonetDBLite “database”, and 2) Use a combination of SQL and dplyr to summarize the database and load query results back into R data.frames/data.tables for further analysis.

For my money, MonetDB, MonetDB.R, and MonetDBLite should be considered important “big data” additions to the R analyst’s toolchest.

Leave a Reply