You Should Learn To Code
If you are a biologist serious about your data, you should learn to code
There’s an interesting reaction I get when I tell my biologist colleagues that they should dip their toes into programming for data analysis—resistance, yes, but tinged with agreement. I imagine it’s what dental hygienists receive when they tell their patients they need to floss more: plenty of excuses, plenty of reluctance, but no real disagreement.
So, think of me as your friendly neighbourhood data hygienist—I see you copying and paste-transposing data around spreadsheets all day, and I’m telling you, you need to learn a little bit of programming. What’s more, I think that deep down, you agree with me.
I can already hear the objections. “I don’t have time”, you say; “I’m not tech-y enough”, “I tried during the lockdowns and it didn’t really stick”, and so on. I know this line of thinking well—I myself am a bioinformatician with absolutely no formal dry-lab background. I failed to teach myself to program for almost a decade before I finally made a dataset too big for any spreadsheet editor, and had to either learn R or find another postdoc. And it was hard. I struggled, and while I think it was worth it for me in the end, I know now that it didn’t have to be as difficult as it was. That’s what prompted me to write this blog series—not to shame you into coding, but to show you how to get started.
Before we get into that, though, I want to explain why you should be analyzing your data programmatically. You might think it has something to do with speed or automation — and yes, programmatic data analysis is faster and is amenable to automation — but even if you never take advantage of those benefits, you should still be analyzing data with code. Why, you ask? Because the act of analyzing your data becomes the act of documenting your analysis.
Something that took me some time to realize: you don’t have to be great at programmatic analysis to take advantage of it. In fact, you have my full permission to become the world’s okayest programmer and stop there — your work, efficiency, and sanity will still benefit greatly from it!
How many times have you looked at a graph you made a year ago and not been able to explain exactly how your raw data became this picture? Or how many times have you removed a datapoint from a graph for a valid scientific reason, documented that in a lab book somewhere, and then had to go digging through years of notes to justify it when pressed?
If your analyses are done through code, you have an automatic, built-in document of every single thing you did to transform raw data into graphs. You can comment throughout the analysis to explain why certain choices were made, and can guarantee that it is complete and accurate at any time by simply running the code on the raw data again—if the same graph comes out, those same steps were followed.
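To make that concrete, here is a minimal sketch of what "the analysis is the documentation" can look like in Python. Everything in it is made up for illustration: the sample names, the values, the exclusion reason, and the helper names `EXCLUDED` and `summarize` are assumptions, not part of any real dataset or library. The raw data is typed inline so the sketch runs anywhere; in a real analysis it would be read from your raw data file.

```python
import csv
import io
from statistics import mean

# Hypothetical raw data, inlined so this sketch is self-contained.
# In a real analysis this would be the untouched file from the instrument.
RAW_CSV = """sample,condition,signal
s1,control,1.0
s2,control,1.5
s3,control,9.5
s4,treated,2.0
s5,treated,2.5
"""

# Every excluded datapoint is recorded here, right next to its reason,
# instead of in a lab book somewhere. (The reason below is invented.)
EXCLUDED = {
    "s3": "hypothetical reason: well was visibly contaminated",
}


def summarize(raw_text):
    """Return the mean signal per condition, skipping excluded samples."""
    groups = {}
    for row in csv.DictReader(io.StringIO(raw_text)):
        if row["sample"] in EXCLUDED:
            continue  # dropped on purpose; see EXCLUDED for why
        groups.setdefault(row["condition"], []).append(float(row["signal"]))
    return {condition: mean(values) for condition, values in groups.items()}


result = summarize(RAW_CSV)
print(result)
```

Running `summarize` on the same raw data a year from now gives the same numbers, and the script itself answers the question "which points did you remove, and why?"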
This is reason enough to embrace programmatic data analysis. So why is there such a lack of programming within biology specifically? I think it comes down to a misconception that good programmers are “math-y” or “tech-y” people, and biologists (or at least, the ones I know/am) wouldn’t describe themselves that way. In fact, the best programmers are analytical people, who can take a black-box system and design experiments and test hypotheses to figure out why it is (or isn’t) behaving the way they need it to. Put another way, the same skills that make you a good laboratory scientist will make you a good programmer: when you write a piece of software that isn’t doing what you want it to do, you come up with hypotheses about why, and then find ways to test them out.
So how do you get started?
There were two things that finally made programming click for me:
- An immediately applicable use-case for my actual day-to-day work.
- Access to a coding environment that was set up and ready for me (so I could focus on programming rather than on setting up a computer).
For the first, that’s what this blog series will walk you through. I’m going to take you through two key skills for the biologist-turned-programmer, which together give you a real, applicable skillset that will transform how you do your laboratory analyses:
- How to plot a clean dataset
- How to clean a messy dataset
If that’s as far as you ever go in programming, you will be a successful, computationally literate biologist.
For the second, that’s why we built Watershed. Watershed is a bioinformatics platform that gives you a powerful compute environment, already set up with notebooks, libraries, and data access tools, so you can get right to coding. It’s more than that, too (it’s an entire operating system for your scientific team’s work), but at its core it’s the answer to a problem biology is facing: the need for computational tools is growing, but so is the barrier to accessing them.
If you don’t have a Watershed account yet and think your organization could benefit from one, reach out! If that isn’t an option right now, you can still follow along with the rest of this blog series, but you’ll need to install a Python environment with Jupyter Notebook on your computer.
Once you’re ready, join us for part 2 of this series (coming soon!), where we’ll learn how to turn numbers into pictures.