First Steps #4: Digging Into DataFrames

DataFrames.jl provides the most widely used tabular data structure in Julia.  In this post we'll explore DataFrames using sample data from RDatasets.jl (and we'll plot stuff using StatsPlots).

A rather timely event: DataFrames.jl has reached version 1.0!

⚙️ Setup

First, install DataFrames and RDatasets via Pkg Mode (]) in the REPL:

(@v1.6) pkg> add DataFrames RDatasets

Now load both packages along with the diamonds dataset from R's ggplot2 package.  The diamonds data contains price/size/quality information on 53,940 different diamonds.  

julia> using DataFrames, RDatasets

julia> df = dataset("ggplot2", "diamonds")
53940×10 DataFrame
   Row │ Carat    Cut        Color  Clarity  Depth    Tabl ⋯
       │ Float64  Cat…       Cat…   Cat…     Float64  Floa ⋯
     1 │    0.23  Ideal      E      SI2         61.5     5 ⋯
     2 │    0.21  Premium    E      SI1         59.8     6
     3 │    0.23  Good       E      VS1         56.9     6
     4 │    0.29  Premium    I      VS2         62.4     5
     5 │    0.31  Good       J      SI2         63.3     5 ⋯
     6 │    0.24  Very Good  J      VVS2        62.8     5
     7 │    0.24  Very Good  I      VVS1        62.3     5
     8 │    0.26  Very Good  H      SI1         61.9     5
   ⋮   │    ⋮         ⋮        ⋮       ⋮        ⋮        ⋮ ⋱

🚀 DataFrames Quickstart

  • Variables (columns) of a DataFrame can be referenced by either strings or symbols, e.g. "I am a string" and :I_am_a_symbol.

Make a Copy of a Column

df[:, "Carat"]

df[:, :Carat]

Extract a Column

  • This command retrieves the exact data held in the DataFrame, without making a copy (df.Carat does the same).  Warning!  Changing the extracted data will also change the values in the DataFrame.

df[!, "Carat"]
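To see the difference between copying and extracting, here is a minimal sketch on a made-up three-row table (the name toy and its values are stand-ins for illustration, not the diamonds data):

```julia
using DataFrames

# Toy data standing in for the diamonds dataset (values are made up)
toy = DataFrame(Carat = [0.23, 0.21, 0.29], Price = [326, 326, 334])

copied = toy[:, :Carat]   # a new vector; edits do not touch `toy`
copied[1] = 9.99
toy.Carat[1]              # still 0.23

shared = toy[!, :Carat]   # the column itself, not a copy
shared[1] = 1.0
toy.Carat[1]              # now 1.0
```

If you only need to read the values, either form works; the copy is the safer choice when you plan to modify the result.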

Selecting a Subset of Columns

select(df, "Carat")

select(df, ["Carat", "Cut"])
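A few more select patterns, sketched on a small made-up table (toy and the Weight column name are assumptions for illustration).  Note that select returns a new DataFrame and leaves the original untouched:

```julia
using DataFrames

# Toy data (values are made up)
toy = DataFrame(Carat = [0.23, 0.21], Cut = ["Ideal", "Premium"], Price = [326, 326])

by_name = select(toy, :Carat, :Cut)       # same columns as select(toy, ["Carat", "Cut"])
dropped = select(toy, Not(:Price))        # every column except Price
renamed = select(toy, :Carat => :Weight)  # select and rename in one step

names(renamed)  # ["Weight"]
names(toy)      # ["Carat", "Cut", "Price"], unchanged
```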

Filtering a Subset of Rows

The syntax x -> do something with x defines an anonymous function (sometimes called a lambda expression).  The filter function applies a function to each row and returns a DataFrame containing the rows for which that function returned true.

filter(row -> row.Carat > 1, df)

  • We can also use indexing (with broadcasting) rather than filter:

df[df.Carat .> 1, :]

  • For functions that accept a function as their first argument, Julia's do-block syntax can help you clean up your code.  Here we are using &&, the logical "and" operator, to combine multiple filter conditions.

filter(x -> x.Carat > 1 && x.Cut == "Premium" && x.Color == "J" && 5000 <= x.Price <= 6000, df)

# Same as above, but with do-block
filter(df) do x
    x.Carat > 1 &&
        x.Cut == "Premium" &&
        x.Color == "J" &&
        5000 <= x.Price <= 6000
end
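As a rough side-by-side comparison, the sketch below applies the same two conditions in three ways on a made-up table (toy and its values are assumptions, not the diamonds data).  The third form, a columns => function pair, is another filter method in DataFrames.jl that passes only the named columns to the function:

```julia
using DataFrames

# Toy data (values are made up)
toy = DataFrame(
    Carat = [0.9, 1.2, 1.5, 2.0],
    Cut   = ["Ideal", "Premium", "Premium", "Good"],
    Price = [4000, 5500, 7000, 5800],
)

# 1. Anonymous function applied to each row
a = filter(row -> row.Carat > 1 && row.Cut == "Premium", toy)

# 2. Boolean mask built with broadcasting (note the dots and the parentheses)
b = toy[(toy.Carat .> 1) .& (toy.Cut .== "Premium"), :]

# 3. `columns => function` pair; each named column arrives as one argument
c = filter([:Carat, :Cut] => (carat, cut) -> carat > 1 && cut == "Premium", toy)

a == b == c  # true: all three keep rows 2 and 3
```

Inside a row function you write && because each value is a scalar; in the broadcast version you need the element-wise .& with parentheses around each comparison.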

You can now do several essential DataFrame tasks:

  • Get a single column
  • Choose a subset of columns
  • Choose a subset of rows

Next we'll use groupby and combine to apply functions across groups of data.

🤔 How does Price relate to Cut?

We are big on learning by example, so let's start by answering this relatively simple question.  First things first: What do the Price and Cut variables look like?

julia> df.Price
53940-element Vector{Int32}:

julia> df.Cut
53940-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
  • Price: The cost in US Dollars.
  • Cut: The rating of cut quality.  In order (best-to-worst): "Ideal", "Premium", "Very Good", "Good", and "Fair".  Side note: The data is stored in a CategoricalArray, which uses less memory than storing each element as a separate String.

Using groupby

We can use the groupby function to group our data by the "Cut" variable.

gdf = groupby(df, :Cut)
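It can help to look inside a grouped DataFrame before aggregating it.  The sketch below uses a made-up five-row table (toy, groups, and ideal are names chosen for illustration) to show that each group is a view of the matching rows:

```julia
using DataFrames

# Toy data (values are made up)
toy = DataFrame(Cut = ["Ideal", "Good", "Ideal", "Good", "Fair"],
                Price = [300, 400, 500, 600, 700])

groups = groupby(toy, :Cut)

length(groups)                    # 3 groups, one per distinct Cut value
ideal = groups[(Cut = "Ideal",)]  # look up one group by its key
ideal.Price                       # the prices 300 and 500

# Each group is a SubDataFrame (a view into `toy`), so iterating does not copy data
for g in groups
    println(first(g.Cut), ": ", nrow(g), " rows")
end
```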

Using our grouped DataFrame, we can then apply a function to a variable in each group using combine.  Let's get the average Price for each level of Cut:

julia> using Statistics # for `mean`

julia> combine(gdf, :Price => mean)
5×2 DataFrame
 Row │ Cut        Price_mean
     │ Cat…       Float64
   1 │ Fair          4358.76
   2 │ Good          3928.86
   3 │ Very Good     3981.76
   4 │ Premium       4584.26
   5 │ Ideal         3457.54
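combine accepts several operations at once, and the source => function => new_name form lets you pick the output column name.  A minimal sketch on made-up data (toy, stats, and the output column names are assumptions for illustration):

```julia
using DataFrames, Statistics

# Toy data (values are made up)
toy = DataFrame(Cut = ["Ideal", "Good", "Ideal", "Good"],
                Price = [300, 400, 500, 600])

stats = combine(groupby(toy, :Cut),
    nrow => :Count,                    # number of rows in each group
    :Price => mean => :MeanPrice,      # source => function => new name
    :Price => median => :MedianPrice)

names(stats)  # ["Cut", "Count", "MeanPrice", "MedianPrice"]
```

Without the trailing => new_name, DataFrames builds the name from the column and function, which is where Price_mean in the output above comes from.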

Now we know the center of the Price distribution for each Cut, but what about the spread and shape?

📊 Using StatsPlots

The StatsPlots package adds functionality and plot recipes to Plots.jl.  We'll use it to do the grouping for us so that we don't need groupby.  First, add StatsPlots:

(@v1.6) pkg> add StatsPlots

Next, use the @df <dataframe> <plot command> syntax to create a violin plot overlaid with a box plot for each level of Cut.

julia> @df df violin(string.(:Cut), :Price, lab="")

julia> @df df boxplot!(string.(:Cut), :Price, alpha=.4, lab="")

[Figure: Price vs. Cut]

Things to note in the code/plot above:

  • The @df macro will replace Symbols with the associated DataFrame columns.
  • We must use string.(:Cut) because Plots/StatsPlots doesn't know how to work with CategoricalArrays directly.
  • We use boxplot! (instead of boxplot) to add a new series to the existing plot.
  • We set lab (shorthand for label) to "" to avoid adding an entry to the plot legend.  If all legend entries are blank, the legend will not appear.
  • We use alpha=.4 to set the opacity of the boxplot so that it doesn't cover up the violin in the layer beneath it.

From our plot, we can see the distributions are all similarly skewed with a long right tail.  Some Cuts (Good, Premium, and Very Good) are bimodal (they have two "peaks").  However, we are ignoring some important factors (such as how Carat and Color affect the price!), so we shouldn't draw any conclusions based solely on this plot.

🚀 That's It!

You now know how to do a little bit of data wrangling with DataFrames.  What do you want to learn about next?

Enjoying Julia For Data Science?  Please share us with a friend and follow us on Twitter at @JuliaForDataSci.
