This course is worth taking simply because it offers a better summary of the R programming language than the previous course in the sequence, "The R Programming Language," especially in the Week 3 lectures. On the other hand, for a course on "getting and cleaning data," it misses the mark, because it doesn't confront the single most common issue in getting and cleaning data: the need to perform operations on variable data to get data points into the correct format for computation. I work with data sets on a daily basis, and by far more headaches are introduced by data formatting when using or combining data sets than by any other issue: differences in date representation, numeric formatting, string formatting, subtle differences in the meanings of variables, etc. The task that I spend most of my time on when working with data is *processing* set(s) of variable data so that they're in the correct format to be merged with sister data sets or to be fed into program X or system Y.
Instead, the central project for this course focuses most heavily on variable names (i.e., column headers) and writing codebooks. The actual data acquisition is easy (download a ZIP file), as is the "cleaning" (join two data sets with identical numbers of columns that have identical meanings and formats). For my money, this isn't a course on "getting and cleaning data" so much as it is a course on "navigating/summarizing an already imported data set in R" and "documenting your data set." That can be useful in its own right, but I found the emphasis on codebooks in particular to be vexing. Yes, documenting what your variables mean is important, but that's at least as much a technical writing task as it is a coding/computing task, if not more so.
I wanted more theory on using R when you have to combine four disparate data sets into a single, coherent whole, then transform half the values to feed the whole into an existing system.
I could have done with less on joining already structurally identical data sets, summarizing them in various ways, and writing plain-English descriptions of their contents.