Saturday, February 16, 2013

Grouping and Summarizing Data with R

It’s often useful to pull a summary statistic by group. You can do this in R using the aggregate() function. First, let’s pull in some data using the method described in the previous post.

> #load the RODBC library
> library(RODBC)

> #make a connection
> conn <- odbcDriverConnect("driver=SQL Server;server=<SERVERNAME>;database=<DB>;uid=<USERNAME>;pwd=<PASSWORD>;")

> #use the connection to fire a SQL statement
> sqlResult <- sqlQuery(conn,"select orderDate, orderDollars, division from orders")

> #look at the summary
> summary(sqlResult)
     orderDate          orderDollars               division   
2011-10-24: 192    Min.   :    0.00    International: 242  
2011-10-19: 174    1st Qu.:   49.49    National     :4758  
2011-10-31: 171    Median :  100.76                       
2011-10-03: 168    Mean   :  163.27                       
2011-11-07: 168    3rd Qu.:  161.99                       
2011-10-12: 160    Max.   :16686.48                       
(Other)   :3967                                          

Now we can use aggregate() to see a summary statistic by group. Let’s use median orderDollars and division.

> aggregate(orderDollars ~ division, sqlResult, FUN = median)
      division orderDollars
1 International          76.49
2      National        101.99

Alternately we can look at sum or range (or any standard stats function).

> aggregate(orderDollars ~ division, sqlResult, FUN = sum)
      division orderDollars
1 International       23154.37
2      National     793206.88

> aggregate(orderDollars ~ division, sqlResult, FUN = range)
      division orderDollars.1 orderDollars.2
1 International            0.00          660.00
2      National            0.00        16686.48

If you’re familiar with SQL you can see that aggregate() is essentially the same as a GROUP BY. We could accomplish the same thing as above in SQL Server with this.

SELECT division, sum(orderDollars) FROM orders GROUP BY division

We could not, however, get to medians or ranges this way, since SQL Server doesn’t support them.

 

No comments:

Post a Comment