Big data analytics with R
Big data analytics is a term thrown about a lot, but what's often missing is how you practically go about things when your data gets big. In this post I break down how I think about big data and the tools and packages I reach for in R when I want to read in and manipulate data, from small to large. If you want to get started with tools like vroom, data.table, Arrow, DuckDB, SQLite and Spark, and see how you can use these alongside the tidyverse, then read on.
2 February 2023
15 minute read
First things first, what does big data mean?
Being a lazy person in February 2023 I of course asked ChatGPT this question. The response was extremely boring. So I asked it to spice it up a bit:
"Big data is the 21st century's gold rush! Imagine mountains of information waiting to be mined for hidden treasures, insights, and opportunities…”
This rambling went on for quite a while so I threw in the towel and drew a handy chart instead, outlining the various levels of big data as I see them and how they relate to R (and Excel for good measure).
This article discusses dealing with levels 1 to 4 of big data as I've defined them. For level 5 you'll need some cloud solutions. You can use these with R, but not with your laptop alone.
My data fits in Excel. Happy days.
In level 1 your data fits in Excel with no problems. I would use R anyway. This isn’t a slight on other tools like Python, I just like R. There are many better reasons to use R over Excel than just the size of your data:
Reproducibility is much stronger in R. Write your code and you have a detailed set of instructions of exactly what is happening. In Excel, calculations are buried in individual cells.
It's much easier to accidentally introduce a mistake in Excel, for example by hardcoding a value in a cell where a calculation should be. In R, to run your analysis you just run your script, no copying and pasting required.
You can tackle much more complex problems with R, and often someone has already written the code for you, either built into R or in a package.
There are many more data visualisation options in R.
Moving from Excel to R
Back in the days before data science became trendy, lots of data analytics tasks were carried out in Excel. In lots of cases they still are.
In recent years the size of datasets we typically work with has grown, but the spec of our laptops hasn't kept pace. Excel also has a built-in constraint on the size of your data: 1,048,576 rows by 16,384 columns. That sounds reasonably big, but in practice Excel falls over well before that limit. Files of a few megabytes can cause an oldish laptop a lot of problems, especially if your work has mediocre IT.
Why does it fall over though? Well, your computer has a certain amount of Random Access Memory (RAM). When the data you are working with needs more than your available RAM, things crash. Excel, with all its individual cell calculations, is a memory-hungry tool and just can't cope with much.
You’re going to need something more powerful and R is a brilliant tool for the job. It’s much less memory hungry than a spreadsheet. For the uninitiated, R is an open source coding language. That means it’s free. You can find out how to get started here:
Imagine your dataset is stored as a CSV file and you want it in R.
R has a handy function for precisely this - read.csv. I prefer the consistent syntax of the tidyverse suite of packages though so I like to use read_csv. Both do pretty much the same job though. To use read_csv you can install the readr package directly or install the whole tidyverse (I recommend this) which will also install dplyr for you.
read_csv by default loads your data as a tibble rather than the usual base R data.frame. Tibbles are just data frames with a few better features, so if you haven't heard of a tibble just think data frame.
Then you can use base R, the dplyr package or whatever package you like to manipulate your data. Something like this will select a couple of columns and filter your rows based on a condition.
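A minimal sketch, assuming a file called `my_data.csv` with columns `id` and `value` (the file and column names here are illustrative):

```r
library(tidyverse)

# Read the CSV into a tibble
df <- read_csv("my_data.csv")

# Select a couple of columns and filter rows on a condition
result <- df |>
  select(id, value) |>
  filter(value > 100)
```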
Excellent. Job done.
In my experience this covers most problems you come across, nothing extra needed, though the wrangling will be more involved. Maybe you use a different function for a different type of file but basically it’s the same thing.
What to do when your data is slowing down R
The read_csv function works great on most datasets but does slow down when your data gets past a certain size. What size is that? I can’t really say but you’ll know when it happens. It will depend on your machine.
Luckily R has a package designed for reading in and manipulating data that is much more efficient: data.table. Before you reach for data.table though, if your speed concerns are solely around reading in the data, you can simply use the vroom package. It has a function, also called vroom, whose syntax is a near like-for-like replacement for read_csv, just faster.
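As a sketch, it really is a drop-in swap (file name illustrative):

```r
library(vroom)

# Near like-for-like replacement for read_csv, just faster
df <- vroom("my_data.csv")
```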
But if the actual manipulation is also causing you grief then data.table is a good option. Some code like this will do the same job as above.
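Something like this, assuming the same illustrative `my_data.csv` with columns `id` and `value`:

```r
library(data.table)

# fread is data.table's fast CSV reader
dt <- fread("my_data.csv")

# data.table syntax: dt[rows-to-keep, columns-to-select]
result <- dt[value > 100, .(id, value)]
```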
I find I always forget the syntax of data.table. It's kind of a hybrid between base R and dplyr, but with a couple of differences that catch me out. That's why I reach for read_csv combined with dplyr first. Thanks to the tidyverse, though, you don't need to know data.table's syntax: the dtplyr package works in tandem with data.table, allowing you to use dplyr syntax instead.
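A sketch of the dtplyr route, using the same illustrative file and columns as before:

```r
library(data.table)
library(dtplyr)
library(dplyr)

# Wrap the data.table so dplyr verbs get translated to data.table code
dt <- lazy_dt(fread("my_data.csv"))

# Familiar dplyr syntax; as_tibble() triggers the actual computation
result <- dt |>
  select(id, value) |>
  filter(value > 100) |>
  as_tibble()
```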
To see just how much faster it is at reading and manipulating data I generated a dataset with 10k rows and 100 columns filled with random numbers. Maybe not the greatest test but such is life. I then used the microbenchmark package to compare the run times of each function.
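Roughly how the comparison can be set up (the details here are a reconstruction, not the exact benchmark code):

```r
library(microbenchmark)
library(readr)
library(data.table)

# 10k rows x 100 columns of random numbers, written to a temporary CSV
m <- as.data.frame(matrix(rnorm(10000 * 100), nrow = 10000))
tmp <- tempfile(fileext = ".csv")
write.csv(m, tmp, row.names = FALSE)

# Compare run times of each reader
microbenchmark(
  readr      = read_csv(tmp, show_col_types = FALSE),
  data.table = fread(tmp),
  times = 10
)
```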
The data.table approach was 3.5x quicker, and that includes converting the result to a tibble, which you don't necessarily need to do. For even bigger datasets and more complex manipulation the time savings will grow.
So if things get slow or optimising for speed is a priority I reach for data.table and dtplyr.
My dataset is too big to load into R all in one go. Help.
In the last few years the number of tools available for dealing with datasets which are too big for your RAM but comfortably fit on your hard drive has grown.
The first thing we will look at is how your datasets are actually being stored. So far I have assumed we are working with CSV files. But there are more efficient options available.
A great alternative is the parquet file, an open-source file format from Apache.
Parquet files are a column-orientated data file type. This majorly speeds up typical analytical queries. What does column-orientated mean though? Consider this dataset:
In the background, row-orientated files (like CSV) would store this row by row:
839, 88, FALSE, 573, 31, FALSE, 010, 45, TRUE
But column-orientated files store data like this:
839, 573, 010, 88, 31, 45, FALSE, FALSE, TRUE
If you want to read and write whole rows then option 1 is better. But with analytical queries we usually select certain columns first, which means you can quickly drop big chunks of your data without cycling through all of it. Only then do you perform the calculations and operations that reduce the number of rows, so those operations are faster too.
Storing by column also leads to highly efficient compression and decompression so the actual file size is smaller too.
To work with parquet files in R we can use the arrow package in tandem with dplyr. The arrow package allows you to connect to large files and run your queries. Your query should condense the size of the data: dropping columns, filtering rows, calculating summaries. Only the final, queried dataset is loaded into R’s memory so your strategy should be to get your queried dataset under your crash threshold even if the input file itself is larger.
Arrow is lazily evaluated. What this means is, when you write your commands, they aren't executed straight away which is subtly different to the standard approach. Think of it as a recipe. Only when we use the collect function is the recipe executed. Otherwise it’s pretty similar to using dplyr.
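A sketch, assuming an illustrative `my_data.parquet` with columns `id` and `value`:

```r
library(arrow)
library(dplyr)

# Connect to the parquet file without loading it into memory
ds <- open_dataset("my_data.parquet")

# Write the recipe with familiar dplyr verbs...
result <- ds |>
  select(id, value) |>
  filter(value > 100) |>
  # ...and only here is it executed, pulling the reduced data into R
  collect()
```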
To write parquet files to disk is easy too.
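For example, writing a data frame `df` out (file name illustrative):

```r
library(arrow)

# Write an in-memory data frame to a parquet file on disk
write_parquet(df, "my_data.parquet")
```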
This method of connecting to an external data source, querying it and then returning the queried data is how databases work. And there are a couple of great options for these too.
DuckDB is a fantastic tool currently seeing a rapid rise in the data world. It is designed for fast analytical queries and works brilliantly with big data.
Again it’s column-orientated so comes with all the same efficiency perks I mentioned before. It works by creating a database which you add your data to. Then you can use SQL code for the querying but, again you don’t need to know this and can use dplyr directly.
First we need to connect to a database. You need the duckdb and DBI packages. If your database doesn’t exist already this will create an empty one.
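Something along these lines (the database file name is illustrative):

```r
library(duckdb)
library(DBI)

# Connect to my_database.duckdb, creating it first if it doesn't exist
con <- dbConnect(duckdb(), dbdir = "my_database.duckdb")
```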
Now we need to add some data to the database. Let's say you have a bunch of CSV files, all stored in a single folder, that you want to put into one database table. You could read each file into R, file by file, by looping or mapping through them, but the combined data could then be larger than memory. You don't need to read the files into R first though: the duckdb_read_csv function adds them all to the database directly.
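A sketch, assuming the CSVs live in an illustrative folder called `data_folder` and share the same columns:

```r
# Find every CSV in the folder and load them all into one table
files <- list.files("data_folder", pattern = "\\.csv$", full.names = TRUE)
duckdb_read_csv(con, name = "my_table", files = files)
```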
You can check that this has created a table like so.
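Using the DBI connection from above:

```r
# List the tables in the connected database
dbListTables(con)
```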
Now we can query it in much the same way we did with arrow.
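A sketch, assuming the dbplyr backend is installed and the same illustrative table and columns as before:

```r
library(dplyr)

# tbl() points dplyr at the database table; collect() runs the query
tbl(con, "my_table") |>
  select(id, value) |>
  filter(value > 100) |>
  collect()
```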
So that is a little more involved than arrow but a database has some advantages for structuring your data.
SQLite is another solution you might reach for in certain use cases. Unlike DuckDB, SQLite databases are row-orientated, so for analytical querying DuckDB will usually be the better choice. However, if you are reading and writing individual rows, maybe as part of an application or something, SQLite could be the better option.
Again we need to connect to, or create and connect to, a database.
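A sketch, with illustrative file names, connecting and writing the first CSV to a new table:

```r
library(RSQLite)
library(DBI)

# Connect to my_database.sqlite, creating it first if it doesn't exist
con <- dbConnect(SQLite(), "my_database.sqlite")

# Write the first CSV file to a new table
dbWriteTable(con, "my_table", read.csv("data_folder/file1.csv"))
```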
You can check that this has created a table like so.
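As before:

```r
# List the tables in the connected database
dbListTables(con)
```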
At this point we need to loop or map through the rest of the files to append them to the existing table.
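A sketch of that loop, with the same illustrative folder, dropping the first file since it's already in the table:

```r
# All the CSVs except the one already written
remaining <- list.files("data_folder", pattern = "\\.csv$",
                        full.names = TRUE)[-1]

# Append each file to the existing table
for (f in remaining) {
  dbWriteTable(con, "my_table", read.csv(f), append = TRUE)
}
```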
Now we can query it in much the same way as before.
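The same dplyr pattern works on the SQLite connection (again assuming the dbplyr backend is installed):

```r
library(dplyr)

tbl(con, "my_table") |>
  select(id, value) |>
  filter(value > 100) |>
  collect()
```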
So far all the tools we have covered are only for reading and manipulating data. Often we want to carry out some modelling using a machine learning technique on a big dataset. Let's think about a regression. To fit a model on all your data at once you are going to run into memory trouble. But Spark allows you to do the actual model fitting piece by piece.
To work with Spark in R we can use the sparklyr package. It has one downside compared to the other tools discussed: as well as installing the sparklyr package, you need to download Spark itself. It isn't tricky, but downloading this extra tool can be a problem in corporate settings. Some IT departments won't want you downloading extra software and, even if they do allow it, having an extra piece of tooling to look after can be a headache for them. Maybe not a major one, but you will likely need to speak to someone internally to get the permissions sorted. On your home machine this isn't much of a problem.
Spark does have a few quirks though. In the background Spark uses its own Spark functions, which means you won't always be able to use the same R function directly to calculate something.
To get started with sparklyr just download and load in the usual way.
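That is:

```r
install.packages("sparklyr")
library(sparklyr)
```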
Then to install Spark directly from R run:
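```r
# Downloads and installs a local copy of Spark
spark_install()
```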
I wouldn't reach for Spark over all the other tools if it's just reading and manipulating data. But let's imagine we have a big CSV file we want to work with; doing exactly what we have done above is straightforward.
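A sketch, using the same illustrative file and columns as earlier and a local Spark instance:

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance
sc <- spark_connect(master = "local")

# Read the CSV into Spark, not into R's memory
df <- spark_read_csv(sc, name = "my_table", path = "my_data.csv")

# Familiar dplyr verbs, translated to Spark; collect() brings results into R
result <- df |>
  select(id, value) |>
  filter(value > 100) |>
  collect()

spark_disconnect(sc)
```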
Doing actual machine learning will need to wait for another post, but if you want to get started have a look at the sparklyr GitHub.