Analysis Workflow: Best practices

jpkelley

TS Contributor
#1
Hello all,

I hope this finds everyone doing well. I have a couple of questions about workflow for analyses. Hopefully, this will be of broad interest.

For each of multiple analyses within multiple projects, I have been using a simple workflow structure consisting of the following designations:

  • loadR: storing code for loading packages/libraries
  • functionsR: storing functions to be used in later analyses
  • cleanR: storing code that cleans (aka "massages") data
  • doR: storing code for analyses
  • plotR: storing code for visualizations
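
As a rough illustration of how the five designations above could be chained together, here is a minimal sketch of a master script that runs one file per stage in order. The file names (load.R, functions.R, and so on) and the run_workflow() helper are hypothetical, chosen only to mirror the list; they are not part of any package or of the original setup.

```r
# Hypothetical file names mirroring the designations listed above.
stages <- c("load.R", "functions.R", "clean.R", "do.R", "plot.R")

# Run each stage script in order inside one shared environment,
# so objects created by an earlier stage are visible to later ones.
run_workflow <- function(project_dir, stages) {
  paths <- file.path(project_dir, stages)

  # Fail early if any stage script is missing.
  missing <- paths[!file.exists(paths)]
  if (length(missing) > 0) {
    stop("Missing stage script(s): ", paste(missing, collapse = ", "))
  }

  env <- new.env(parent = globalenv())
  for (p in paths) {
    message("Running ", basename(p))
    source(p, local = env)
  }
  invisible(env)
}

# Example call (assumes the five scripts exist in the project folder):
# results <- run_workflow("path/to/project", stages)
```

Keeping the stage order in a single vector means a project can skip or add a stage without touching the other scripts.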

These have been organized within a freeware organizer called KeyNote NF, which lets me keep all projects in a single file. Recently, though, the program's primitive nature has led to issues (e.g., autosave problems) that have caused some downtime. So, my questions for the forum are these:

  • What other workflow solutions are out there? Programs like Kepler, etc.?
  • Are there established "best practices" out there for code organization (under multiple projects) and code documentation?
As I'm sure many of us are attempting to achieve a healthy balance of high productivity and low out-of-pocket cost, it might be good to know some options for open-source (or other free) solutions to improve workflow.

Hopefully, this gets a discussion going. Maybe many of you assume that everyone already has a system for optimal workflow, but I'm finding that I sure as heck do not. Perhaps if this discussion gains some momentum, we can put a poll out about different workflow solutions and get a "best practices" list going.
 

bryangoodrich

Probably A Mammal
#3
Apparently there's a program out there to help manage or organize workflows? I've never heard of it or software like it. I can see its usefulness in enterprise situations, but most of the computational workflows I've dealt with are small enough not to need any sort of standardization. That's not to say one cannot benefit from a formal workflow, but for smaller projects I think the sort of compartmentalization described above would work.

You can keep your functions and library code in a profile (in R, an .Rprofile file in the study's directory), so that they get loaded every time R is started from that directory for that project. From there, you can make directories for each of the tasks. Usually I simply use my project folder as my analysis directory, plus a preprocessing folder for my data cleaning (which saves its outputs as binary files to the analysis folder). Then I keep an outputs or report folder for whatever the analysis produces.

I guess if I wanted to organize things more formally, I could make a scripts folder for the R code, with subdirectories for the stages those scripts are used in (cleaning, analysis, etc.). It would certainly manage the workflow better, but it may not be as efficient. Of course, that is the tradeoff between a managed workflow and just producing stuff in a disorderly way!

Usually I just keep the data files and scripts in their respective folders/subfolders. For instance, I try to get my data into tab-delimited text files, so my preprocessing folder is full of text files and the relevant cleaning scripts (maybe Python if it was needed). Then my project folder usually has one main script that does all the major work, along with the .RData files produced by the cleaning step.
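
To make the layout described above concrete, here is a minimal sketch in base R. The folder names (preprocessing, outputs) follow the description; the helper names create_project() and clean_to_binary() are hypothetical, and using saveRDS() rather than save() for the binary file is an assumption made for simplicity.

```r
# Hypothetical helper: create the folders described above under a project root.
create_project <- function(root) {
  dirs <- file.path(root, c("preprocessing", "outputs"))
  for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
  invisible(dirs)
}

# Hypothetical cleaning step: read a tab-delimited file from preprocessing/,
# drop rows with missing values, and save a binary copy in the project root
# (the analysis directory) for the main script to pick up.
clean_to_binary <- function(root, txt_name, rds_name) {
  raw <- read.delim(file.path(root, "preprocessing", txt_name),
                    stringsAsFactors = FALSE)
  cleaned <- raw[complete.cases(raw), , drop = FALSE]
  out <- file.path(root, rds_name)
  saveRDS(cleaned, out)
  invisible(out)
}

# Example calls (paths and file names are placeholders):
# create_project("my_study")
# clean_to_binary("my_study", "survey.txt", "survey_clean.rds")
# survey <- readRDS(file.path("my_study", "survey_clean.rds"))
```

The main analysis script then only ever reads the binary file, so the raw text files stay untouched in the preprocessing folder.
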
However, based on my experience with GIS workflows and geodatabases, I might actually keep a folder for all my data, with one folder/database for the intermediate data and one for the final data, and maybe let the text files that build my analysis litter the main folder. I'm contemplating incorporating SQLite databases for this purpose since they're free and relatively easy to manage.
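
For the SQLite idea, a minimal sketch might look like the following. It assumes the DBI and RSQLite packages are installed; the helper names store_table() and fetch_table(), and the database file names in the example, are hypothetical.

```r
# Hypothetical helper: write a data frame to a table in an SQLite file.
# Assumes the DBI and RSQLite packages are installed.
store_table <- function(db_path, name, df) {
  con <- DBI::dbConnect(RSQLite::SQLite(), db_path)
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbWriteTable(con, name, df, overwrite = TRUE)
  invisible(db_path)
}

# Hypothetical helper: read a whole table back as a data frame.
fetch_table <- function(db_path, name) {
  con <- DBI::dbConnect(RSQLite::SQLite(), db_path)
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbReadTable(con, name)
}

# Example calls (file names are placeholders for the two databases):
# store_table("data/intermediate.sqlite", "survey_raw", raw_df)
# store_table("data/final.sqlite", "survey_clean", clean_df)
# clean_df <- fetch_table("data/final.sqlite", "survey_clean")
```

Each SQLite database is a single file, so the intermediate and final data can each live in one file inside the data folder.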
 

jpkelley

TS Contributor
#5
Incredibly useful, Bryan. I like this framework much better than mine. For some reason (perhaps because I'm sick of always shoving parts of my scripts into fixed designations), your approach seems more intuitive.

Re: the program to manage and automate workflows. I've played with Kepler (LINK HERE), but some of it doesn't seem intuitive. As of now, it doesn't appear to allow full integration with R, though there does seem to be great potential.

Thanks for the great response and suggestions.