Saturday, February 16, 2013

Know Your Data Professionals: Processors vs. Programmers

I've recently been asked to help nontechnical business professionals distinguish between data processing and database programming. Here's my best shot at a clear and concise distinction.

Data processing involves organizing and manipulating data files using a variety of applications, including Microsoft Excel and relational database management systems (RDBMS) like SQL Server and MySQL. It focuses on the data, the specific values contained in the data, and the data's relationship to the information it represents. A good data processor has a deep understanding of the data's business context. Data processing is typically repetitive; a good data processor can consistently execute the same process the same way over and over. Anomalies in data processing are generally the result of either variations in source data or manual errors in processing. A good data processor will be familiar enough with the data's business context to correct variations in source data to the extent that normal processing can occur.

In contrast, database programming involves building procedures that organize and manipulate data files in a consistent and automated manner. While database programming pays attention to the data, its primary focus is on the procedures that act against the data. A good database programmer will have a deep understanding of the technical infrastructure around the data. Database programming is not at all repetitive; a good programmer instinctively adheres to the DRY ("don't repeat yourself") principle and will always code an automated solution to a repetitive challenge. Anomalies in database programming are generally the result of errors in the procedure code. A good database programmer will be able to debug the code in such a way that a future instance of the same anomaly will be handled automatically by the procedure, requiring no additional intervention.
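The DRY idea is language-agnostic, but it's easy to see in miniature. Here's a toy sketch in Python (the file names are hypothetical): rather than repeating the same cleanup step by hand for every file, a programmer captures it once and reuses it.

```python
# DRY in miniature: instead of repeating the same cleanup logic for
# every monthly file, capture it once in a function and reuse it.
# The file names below are made up for illustration.

def clean_filename(name):
    """Normalize a source file name the same way every time."""
    return name.strip().lower().replace(" ", "_")

monthly_files = ["January Orders.CSV", " February Orders.csv ", "MARCH ORDERS.CSV"]

# One definition, applied consistently -- the repetitive work is automated,
# and a future change to the rule happens in exactly one place.
cleaned = [clean_filename(f) for f in monthly_files]
print(cleaned)
```

If the cleanup rule ever changes, only `clean_filename` changes; every file automatically gets the new behavior.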

A good data processor is critical in the early stages of a project, to ensure a full understanding of the data and the business processes represented by the data. Once a data processor has worked out the best approach to project data, a good database programmer can automate the various processes, freeing the data processor from repetitive work, allowing them to move on to the next project, and reducing the likelihood of processing errors.

Know Your Analytics Professionals: Tool Users vs. Coders

Data work falls into a number of different subcategories. One of these subcategories is analytics, also known as business intelligence, data mining, or machine learning. While these terms can be distinguished from one another based on nuance, they all have in common the use of computer programs to find patterns in data that can be generalized for reporting, optimization or prediction.

Analytics professionals fall into two categories, tool users and coders.

Tool users generally live in Stata, SPSS, SAS or sometimes Excel. They can have a reputation in industry for being on average less statistically literate than coders, but can also be more productive by virtue of their streamlined analytical methods that have been honed and optimized over the years by tool manufacturers. While they can be efficient, they tend to be disconnected from overall business processes. These analysts typically need to have data extracted for them from a central system. They do their analysis offline and usually work on aged data.

Coders use R, Python or sometimes SQL. They tend to be less concerned with standard approaches or productivity, and more concerned with flexibility and data integration. While they can be less efficient than tool users when working with extracted, disconnected data, coders will generally have a better sense of how a given process fits into the big picture, and they are more likely than tool users to be able to bolt their work to a larger system and access fresher, sometimes real-time data in an automated way. When this happens they can eclipse the efficiency of packaged tools and provide extraordinary additional value to an analytics effort.

These are of course generalizations. Nonetheless, a data or an analytics manager should be mindful of these distinctions when adding someone to a team. Balance is ideal. It may seem that in a shop where SAS is the standard, you might not want a Python developer lurking around. However, the ability to take proven processes into a live production environment and turn last quarter's reports into yesterday's dashboard might be just the thing to take your team to the next level.

My Local CS Degree Programs Aren’t Helping Me Hire Database Developers

For a big chunk of my career I’ve hired and managed database developers. One thing I’ve noticed over the last fifteen years is that database developers are hard to find.  For every ten programmers you talk to, maybe one has any significant experience with databases.

This doesn’t make any sense. Our digital infrastructure rests upon a foundation of databases. Surely database theory and database development are a big part of any formally trained computer programmer’s education. If so, then why are so many programmers disappointingly inexperienced with database development?

I looked at my local Computer Science departments to see what kind of database course work they require of their students. Within a reasonable drive from my office are several colleges that offer a BS degree in computer science, including Bennett College, Elon University, High Point University, the University of North Carolina at Greensboro, and Wake Forest University.

I was disappointed to find that none of these schools have a database requirement as part of their Computer Science BS degree program.

Where are the database courses? Interestingly, UNCG has an optional bioinformatics concentration in their computer science major, and this concentration requires a Principles of Database Systems course. And if you get an MS in computer science at UNCG, they require an advanced database systems course or its equivalent. There is no such requirement for the MS in computer science at Wake Forest University, however. Additionally, Bennett College and Elon University both have a Computer Information Systems (CIS or CIT) alternative to the traditional CS degree, and each has a database requirement.

The lesson here, I guess, is to stop looking at undergraduate CS degrees and focus instead on people who have studied bioinformatics or CIS, or who have a graduate degree from a school like UNCG that requires database course work. This sucks, since it's the background in algorithms, data structures, design patterns, and matrix math that makes a good database developer valuable, and these topics are at the core of a traditional undergraduate CS degree.

So in the end I can hire someone with a strong foundation in computer science or someone with exposure to databases. That’s the wrong Boolean operator.

Email Best Practices

It’s been a while since I’ve had to worry about coding emails. But recently a coworker asked me to put together a brief presentation on best practices for email development. This is actually two separate topics, email content and email code. Setting aside email content for the moment, here is my completely derivative, unoriginal, you-can-find-it-anywhere advice for avoiding trouble in the inbox when you’re coding email. This advice assumes you’re building an email campaign for deployment to a wide audience and you have no control over the email client application your recipients will use.

Generally, when coding email it’s best to pretend it’s fifteen years ago. Seriously. I know Dreamweaver can do lots of cool things and I know CSS is awesome and HTML5 is just, like, unbelievable. Nonetheless, ignore it all. Back away from the shiny modern tech.

1. Learn to write HTML
Apps like Dreamweaver make it easy to build web pages without ever having to dig into the underlying code. Unfortunately, these apps make assumptions about the best way to code things that run counter to what you’d want to do for HTML email development. The best way around this is to code your own HTML, and there are plenty of great places online to learn how. If you can use Dreamweaver or Photoshop, then you have more than enough technical ability to learn HTML. Just take an hour, one hour, and play around with it. You’ll be surprised how quickly you pick it up.

2. Use tables
Tables are the old way to build compositional structures in HTML. They’re pretty basic: rows and columns, and that’s about it. The great thing about tables is that most email clients understand them.

Here’s a simple example. Assume you want to build an email with a common composition: a header across the top, a wide body area on the left, a narrower sidebar on the right, and a footer across the bottom.


The HTML for this table would look like this.

<TABLE WIDTH=100%>
<TR>
<TD COLSPAN=2>
This is the header.
</TD>
</TR>
<TR>
<TD WIDTH=66% HEIGHT=400>
This is the body.
</TD>
<TD WIDTH=34% HEIGHT=400>
This is the sidebar.
</TD>
</TR>
<TR>
<TD COLSPAN=2>
This is the footer.
</TD>
</TR>
</TABLE>

If you cut and paste that into a SimpleText or Notepad document and save it with a .htm extension, you can use it as a template and go from there. Double-click on it and it will open in your browser. Fool around with it. Change some of the numbers or some of the text and see what happens. Search around online and learn more about what’s going on here. You’ll be amazed how well your emails render in all those different clients.

3. Avoid CSS
Don’t use CSS when you’re building an HTML email. Just don’t. Yes, I know that inline CSS is less problematic than file calls or style definitions in the header, but it’s still problematic. Just don’t use it. If you never use CSS you will eliminate the majority of your problems with inconsistent rendering in your recipients’ inboxes.
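So what replaces CSS? Old-fashioned presentational HTML: attributes like WIDTH, ALIGN, and BGCOLOR, plus the deprecated FONT tag. A small sketch (the colors and sizes here are just illustrative):

```html
<!-- Styling with presentational attributes instead of CSS.
     Every value below is illustrative. -->
<TABLE WIDTH=600 BGCOLOR="#FFFFFF" CELLPADDING=10 CELLSPACING=0 BORDER=0>
<TR>
<TD ALIGN=center BGCOLOR="#EEEEEE">
<FONT FACE="Arial, Helvetica" SIZE=2 COLOR="#333333">
This text is styled without a single line of CSS.
</FONT>
</TD>
</TR>
</TABLE>
```

It's clunky compared to modern web work, but these attributes predate CSS and are understood by even the crustiest email clients.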

4. Don’t Embed Images
Don’t try to embed images into your email. Always link to image files out on a web server or an email content server. An image tag should always look something like this.

<IMG SRC="http://www.example.com/images/logo.jpg" alt="LOGO" />

Note that the SRC value points to an actual URL out on the web somewhere, not to a file on your local computer. An image tag should never look something like this.

<img src="C:\MyPictures\EmailImages\logo.jpg" alt="LOGO" />

Image tags that point to a file on your local computer will result in missing images in your recipients’ inboxes.

5. Don’t forget about an alternate text version
Nearly every Email Service Provider allows you to build a text version of your email. Many users set their email clients to filter out HTML emails and only read text emails. Others read their email on a phone or some other device that only displays text. These users will miss your email if you fail to build a text-only version.

6. Use Google Analytics tags
As the man said, “If you can't measure it, you can't manage it.” Designing and coding an email are just the beginning of the process. In order for testing and analytics to help you improve your email effectiveness, you’ll need to be able to track what happens to your emails once they arrive in the inbox. One of the best ways to do this is with Google Analytics campaign tags. Neglecting to add simple campaign information to the end of a hyperlink can ruin downstream testing and analysis. Learn more about this in Google’s documentation on campaign URL tagging.
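In practice, campaign tagging just means appending Google Analytics’ standard campaign parameters (utm_source, utm_medium, and utm_campaign) to the links in your email. A minimal sketch, with made-up URL and values:

```html
<!-- A link tagged for Google Analytics. utm_source, utm_medium, and
     utm_campaign are the standard campaign parameters; the domain and
     values here are made up for illustration. -->
<A HREF="http://www.example.com/sale?utm_source=newsletter&amp;utm_medium=email&amp;utm_campaign=spring_sale">
Shop the Spring Sale
</A>
```

When a recipient clicks that link, Google Analytics attributes the visit (and anything downstream of it) to this specific email campaign instead of lumping it in with untagged traffic.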

7. Stay Informed
Finally, there is no shortage of great information on building and optimizing email. My personal favorite source is the blog at Bronto, an email service provider. Find their blog on the Bronto site, add it to your RSS reader, and stay current.

These are only the most basic best practices. There are many more things you can do to ensure your email gets where you want it to go and looks like you want it to look when it gets there. But if you follow these seven practices, you’ll be building from a great foundation.

Basic Matrix Operations with R

A while back I posted a series on basic matrix operations in VB.NET. Those posts looked at using arrays to represent matrices in VB.NET, how to add and subtract matrices, and how to transpose and multiply matrices.

I want to revisit those operations in R, mostly as an object lesson in using the right tool for the right job. If you look back at those previous posts, it took over a hundred lines of code to build, populate and operate against two small matrices. Here are the same matrices and the same operations in R.
> #First build the matrices A and B
> A <- matrix(
+ c(1,2,3,5,8,13,21,34,55,89,144,233), #These are the values in the matrix A.
+ nrow=3, #This is the number of rows in A.
+ ncol=4, #This is the number of columns in A.
+ byrow=TRUE)
> B <- matrix(
+ c(2,3,5,7,11,13,17,19,23,29,31,37), #These are the values in the matrix B.
+ nrow=3, #This is the number of rows in B.
+ ncol=4, #This is the number of columns in B.
+ byrow=TRUE)
> A
    [,1] [,2] [,3] [,4]
[1,]    1    2    3    5
[2,]    8   13   21   34
[3,]   55   89  144  233
> B 
    [,1] [,2] [,3] [,4]
[1,]    2    3    5    7
[2,]   11   13   17   19
[3,]   23   29   31   37
> #Add A and B.
> A+B
    [,1] [,2] [,3] [,4]
[1,]    3    5    8   12
[2,]   19   26   38   53
[3,]   78  118  175  270
> #Subtract B from A.
> A-B
    [,1] [,2] [,3] [,4]
[1,]   -1   -1   -2   -2
[2,]   -3    0    4   15
[3,]   32   60  113  196
> #Transpose B. A is 3x4 and B is 3x4, so A %*% B won't conform;
> #transposing B to 4x3 makes matrix multiplication possible.
> Bt <- t(B)
> Bt
    [,1] [,2] [,3]
[1,]    2   11   23
[2,]    3   13   29
[3,]    5   17   31
[4,]    7   19   37
> #Multiply A and Bt.
> A%*%Bt
    [,1] [,2]  [,3]
[1,]   58  183   359
[2,]  398 1260  2470
[3,] 2728 8637 16931
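One caution worth adding: in R, the * operator multiplies matrices elementwise, while %*% performs true matrix multiplication. Since A and B above have identical dimensions, elementwise multiplication works on them directly:

```
> #Elementwise multiplication: each cell of A times the matching cell of B.
> #This is NOT matrix multiplication; for that, use %*% as above.
> A*B
     [,1] [,2] [,3] [,4]
[1,]    2    6   15   35
[2,]   88  169  357  646
[3,] 1265 2581 4464 8621
```

Mixing up the two operators is a classic R mistake, and because * silently succeeds whenever the dimensions match, it won't throw an error to warn you.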

Grouping and Summarizing Data with R

It’s often useful to pull a summary statistic by group. You can do this in R using the aggregate() function. First, let’s pull in some data using the method described in the previous post.

> #load the RODBC library
> library(RODBC)

> #make a connection
> conn <- odbcDriverConnect("driver=SQL Server;server=<SERVERNAME>;database=<DB>;uid=<USERNAME>;pwd=<PASSWORD>;")

> #use the connection to fire a SQL statement
> sqlResult <- sqlQuery(conn,"select orderDate, orderDollars, division from orders")

> #look at the summary
> summary(sqlResult)
     orderDate          orderDollars               division   
2011-10-24: 192    Min.   :    0.00    International: 242  
2011-10-19: 174    1st Qu.:   49.49    National     :4758  
2011-10-31: 171    Median :  100.76                       
2011-10-03: 168    Mean   :  163.27                       
2011-11-07: 168    3rd Qu.:  161.99                       
2011-10-12: 160    Max.   :16686.48                       
(Other)   :3967                                          

Now we can use aggregate() to see a summary statistic by group. Let’s use median orderDollars and division.

> aggregate(orderDollars ~ division, sqlResult, FUN = median)
      division orderDollars
1 International          76.49
2      National        101.99

Alternately we can look at sum or range (or any standard stats function).

> aggregate(orderDollars ~ division, sqlResult, FUN = sum)
      division orderDollars
1 International       23154.37
2      National     793206.88

> aggregate(orderDollars ~ division, sqlResult, FUN = range)
      division orderDollars.1 orderDollars.2
1 International            0.00          660.00
2      National            0.00        16686.48

If you’re familiar with SQL you can see that aggregate() is essentially the same as a GROUP BY. We could accomplish the same thing as above in SQL Server with this.

SELECT division, sum(orderDollars) FROM orders GROUP BY division

We could not, however, get medians or ranges this way nearly as easily, since SQL Server has no built-in MEDIAN or RANGE aggregate functions.
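As a footnote, a median is reachable in newer versions of SQL Server, just not as an ordinary aggregate. A sketch using PERCENTILE_CONT, available in SQL Server 2012 and later, against the same orders table:

```sql
-- Median orderDollars per division using PERCENTILE_CONT (SQL Server 2012+).
-- PERCENTILE_CONT is a window function rather than an aggregate, so
-- DISTINCT collapses the per-row results down to one row per division.
SELECT DISTINCT
    division,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY orderDollars)
        OVER (PARTITION BY division) AS medianOrderDollars
FROM orders;
```

It works, but compare it to R's one-line aggregate() call: this is exactly the right-tool-for-the-job point made earlier.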