Happy Darwin Day!

On this day 206 years ago a man named Charles Darwin was born. His curiosity, observation, and careful organization of evidence and thought changed how we think about the world around us, and dramatically altered the study of life sciences. Today, Darwin Day, allow yourself to be curious, and approach your problems with a scientific mindset. And remember the life of a man who traveled the world and came back to his home country with revolutionary ideas that have held up really well after over a century of research. If you’re interested in learning more about Darwin’s ideas and life, there’s a great website with his publications, notes and manuscripts, and other pieces about Darwin’s travels and observations.

Programming languages and their pros and cons: thoughts from a biologist

Yesterday, Nature published an article about how learning the programming language Python is an important and perhaps necessary thing for biologists to do. In December, Nature published a similar article about the rise of R in biology. It is indisputable that modern biology is becoming increasingly computational, as ‘big data’ takes over the field with the development of new technologies and the advent of the “-omics” era (genomics, transcriptomics, metabolomics, etc.). To deal with these new large datasets, researchers need more than a ‘plug and chug’ approach to data analysis. So I believe that most biologists in the near future will be required to know at least rudimentary computer programming, and many people are turning to so-called ‘easy’ languages like Perl, Python, and R to do their analyses. However, having painstakingly learned C++, R, shell scripting, and bits and pieces of Perl, Python, and HTML in the past 3.5 years of my PhD, I have some opinions on these trends.

When first learning a programming language, it can be very confusing what all the syntax means and how to turn a simple task such as, “I want to read a file, do some simple math, and write a new file” into actual code. In my opinion, this becomes clearest when you define every variable so you know exactly what it is, say explicitly what each function is doing, and clearly mark where each loop ends. I know that was very jargon-y, but basically what I mean is that learning a programming language is so much easier if you are 100% clear on what every single thing in your program/script is doing. And that is why I would advocate starting with C++, not Python, Perl, or R.
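
To make the “read a file, do some simple math, and write a new file” task concrete, here is a minimal sketch. It is written in Python only because that keeps it short. The function name `double_values`, the one-number-per-line file format, and the choice of doubling as the “simple math” are all assumptions made for illustration, not anything from a real analysis. Each variable is commented in the explicit spirit described above.

```python
def double_values(input_path, output_path):
    """Read one number per line, double each one, and write the results."""
    doubled = []  # list of floats holding the doubled values

    # Step 1: read the file line by line.
    with open(input_path) as infile:
        for line in infile:
            line = line.strip()  # remove the trailing newline and spaces
            if line:  # skip blank lines
                value = float(line)  # text -> number
                # Step 2: the "simple math".
                doubled.append(value * 2.0)

    # Step 3: write each result on its own line of a new file.
    with open(output_path, "w") as outfile:
        for value in doubled:
            outfile.write(f"{value}\n")

    return len(doubled)  # how many numbers were processed
```

Calling `double_values("in.txt", "out.txt")` (both file names hypothetical) would write the doubled numbers to `out.txt` and return how many numbers it processed.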

R

I first started by learning bits of R to do statistics without needing to pay a lot of money for statistical analysis programs like JMP, SAS, SPSS, or STATA. And I came to appreciate R and everything I could do with it. It’s great at manipulating matrices of data, and you can apply the same function across an entire column or row of a dataset, or even across the entire dataset. And my favorite thing about it is the flexibility I have in creating graphs! It can produce beautiful figures and you can do almost any manipulation you’d want. It can take a lot of code and a lot of time tweaking things, but you can end up with publication-ready figures like this:

Figure 2 from my paper Sexual selection on female ornaments in the sex-role-reversed Gulf pipefish (Syngnathus scovelli), available here: http://onlinelibrary.wiley.com/doi/10.1111/jeb.12487/full
This figure used the library fields, and actually only required a small amount of code to make it:

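library(fields)  # provides tim.colors()

# surf.te.out is the fitted surface object created earlier in the analysis (not shown)
pdf("Fig2_fitsurf.pdf")
image(surf.te.out, col=tim.colors(256), useRaster = TRUE, lwd=5, 
       las=1, font.lab=2, cex.lab=1.3, mgp=c(2.7,0.5,0), 
       font.axis=1, lab=c(4,5,6),
       xlab=expression(paste("Band Area (mm"^"2", " )")),
       ylab=expression("Band Number"))
dev.off()  # close the PDF device so the file is written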

However, explicit loops in R are very slow, and loops are an incredibly important part of most real programs. In my experience, R is really best for statistics and making graphs, not complex data storage and manipulation tasks.

C++

The next language I started to tackle was C++, because I wanted to create a simulation model for my dissertation and that’s the language my advisor uses. I quickly found it to be much easier to wrap my head around than R. That’s because every single variable you use must be explicitly declared as an integer, double (decimal number), character, string (words), or whatever class type you’re using. It makes it much easier to track what’s going on with every variable and the purpose of everything in the program. It also has very clearly defined syntax, so you always know where a statement ends because there’s a semicolon. And it’s fast! Yes, you have to compile the code to run it, but if you write and debug (test) programs in Visual Studio, it’s super easy: just push the play button!

Visual Studio screenshot

Some people suggest that the more explicit syntax makes it more difficult to learn, and certainly it means that you will likely have more typos tripping up your programs at the beginning, but for me it was so much easier to grasp what my programs were doing and how they were doing it, and it enhanced my understanding of every other programming language I’ve come across. C and C++ also let you control how your program uses the computer’s memory, which lets you optimize it for speed and efficiency. There’s a reason Python has a tool called Cython, which compiles Python-like code to C, to make it run faster.

Shell scripting

My coding journey then took me to shell scripting to do some genomics analyses, but since I don’t really think of shell scripting as a programming language in the same sense as the others, I’m not going to say much about it. It is really helpful to know so that you can link together your programs and have your computer run programs over multiple datasets. I do highly recommend learning some shell scripting, because at the moment it’s basically a necessity for genomics analysis.

Perl 

My next programming adventure was during a mini-class led by a fellow graduate student to help people learn some basic genomics analyses. He knew Perl, and used that for most of his genomics work, so that’s what he taught us. I found Perl to be a less-clear and less-effective version of C++. I did like the flexibility with which it could read in files, but the variables could change their type (they could start out life as an integer and change to a double, for instance). The most confusing and aggravating thing about it to me was the way functions dealt with returning variables. In programming, you can write your own functions to perform a task that you’ll be doing multiple times in the program, so that you don’t have to write out the same code over and over again. In C++, a function declares the type of value it returns, and in practice you give it a variable of that type to write to. (Strictly speaking, C++ will also compile a call that ignores the result, but the declared return type makes it explicit what the function hands back.) Say you write a function to calculate the mean. In C++ this would look something like this:

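#include <cstddef>
#include <iostream>
#include <vector>

// Returns the mean of number_list, or 0 if the list is empty.
double calculate_mean(const std::vector<double>& number_list)
{
	double average = 0;
	int count = 0;
	for(std::size_t i = 0; i < number_list.size(); i++)
	{
		average = average + number_list[i];
		count++;
	}
	if(count > 0)
	{
		average = average/count;
	}
	return average;
}

int main()
{
	std::vector<double> list = {1,2,3,4,5,6,7,8,9,10};
	double list_avg;
	list_avg = calculate_mean(list);	// store the returned mean
	std::cout << list_avg << std::endl;
	return 0;
}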

In Perl, however, you could have essentially the same function (with slightly different syntax) and call it without storing the value it returns, and the program would still run:

#!/usr/bin/perl
use strict;
use warnings;

sub calculate_mean
{
	my @list = @_;
	my $average = 0;
	my $count = 0;
	# foreach hands us each value in turn, so no indexing is needed
	foreach my $value (@list)
	{
		$average = $average + $value;
		$count++;
	}
	if($count > 0)
	{
		$average = $average/$count;
	}
	return $average;
}
my @list = (1,2,3,4,5,6,7,8,9,10);
# The returned mean is never stored, yet Perl runs this without complaint.
calculate_mean(@list);

In this example it’s a bit ridiculous that you would run the function without keeping the value it returns, but it is possible and actually happened in our class. These loopholes can make for sloppy programming practices, which are obviously bad for beginners to learn, and they can also make other people’s code (or your own, for that matter) much more difficult to understand.

Python

I have the least experience with Python, but I did learn a bit while working through the book Practical Computing for Biologists (which I reviewed in 2013). It is supposed to be easier to learn than C++, probably because its syntax is so much lighter. It doesn’t mark the end of a statement with a semicolon (a line break does that job) and, like Perl, the type of variable you’re using is defined by context. Additionally, you don’t use braces to define for loops and if statements and all of those things; you just have to make sure you indent correctly. Here’s what the above program would look like in Python:

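def calculate_mean(number_list):
    count = 0
    average = 0.0  # start as a float so the division below is never integer division
    for i in range(len(number_list)):
        average = average + number_list[i]
        count += 1
    if count > 0:
        average = average / count
    return average

numbers = [1,2,3,4,5,6,7,8,9,10]  # named "numbers" to avoid shadowing Python's built-in list
avg = calculate_mean(numbers)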
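
The two loopholes I complained about in Perl (variables that change type, and returned values that can be silently ignored) exist in Python too. Here is a minimal sketch of both; the names `mean_of` and `demo` are made up for illustration, and the only point is that Python raises no error in either case.

```python
def mean_of(numbers):
    """Return the mean of a list of numbers, or 0.0 for an empty list."""
    if len(numbers) == 0:
        return 0.0
    return sum(numbers) / float(len(numbers))


def demo():
    value = 3                          # starts out life as an integer
    first_type = type(value).__name__
    value = value / 2.0                # silently becomes a float
    second_type = type(value).__name__

    mean_of([1, 2, 3])                 # result discarded; Python does not complain
    kept = mean_of([1, 2, 3])          # result stored explicitly

    return first_type, second_type, kept
```

Nothing in the language forces you to notice the type change or the discarded result, which is exactly the kind of thing I found easier to keep track of in C++.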

Summary, aka my opinion

I am biased by my own experiences, but I do think that starting with a clearly defined language like C++ is really helpful because it leads to more rigorous coding practices. It’s easy to be sloppy when writing code, especially if you’re not using an integrated development environment that tells you when you make syntax errors. Although sloppiness may be initially faster and easier, it not only makes the learning process more frustrating and difficult, but also makes finding errors and bugs much more difficult, and makes your code less comprehensible to others (including your future self). So if someone were to ask me how to get started learning programming, I would definitely suggest a language like C++, where everything is very clear, and after that it will be easy to pick up other languages. However, learning any language will help with learning others, so as Dr. Titus Brown says in the Nature article, it may be best to simply start with whatever language is being used by the people around you, because asking others for help can also be a big boost to the learning process.