Programming languages and their pros and cons: thoughts from a biologist

Yesterday, Nature published an article about how learning the programming language Python is an important and perhaps necessary thing for biologists to do. In December, Nature published a similar article about the rise of R in biology. It is indisputable that modern biology is becoming increasingly computational, as ‘big data’ takes over the field with the development of new technologies and the advent of the “-omics” era (genomics, transcriptomics, metabolomics, etc.). To deal with these large new datasets, researchers need more than a ‘plug and chug’ approach to data analysis, so I believe that most biologists in the near future will be required to know at least rudimentary computer programming. Many people are turning to so-called ‘easy’ languages like Perl, Python, and R to do their analyses. However, having painstakingly learned C++, R, shell scripting, and bits and pieces of Perl, Python, and HTML in the past 3.5 years of my PhD, I have some opinions on these trends.

When first learning a programming language, it can be very confusing to work out what all the syntax means and how to turn a simple task such as, “I want to read a file, do some simple math, and write a new file” into actual code. In my opinion, this becomes clearest when you define every variable so you know exactly what it is, state explicitly what each function is doing, and clearly mark where each loop ends. I know that was very jargon-y, but basically what I mean is that when you start learning a programming language, it is so much easier if you are 100% clear on what every single thing in your program or script is doing. And that is why I would advocate starting with C++, not Python, Perl, or R.
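To make that example task concrete, here is a minimal sketch of "read a file, do some simple math, write a new file". It is written in Python only because the result is short, and it is an illustration rather than a recommendation of the language. The type annotations are optional in Python; they are included to label every variable explicitly, in the spirit of the paragraph above. The function name `scale_values`, its parameters, and the one-number-per-line file layout are all assumptions made for the example.

```python
from pathlib import Path


def scale_values(in_path: Path, out_path: Path, factor: float) -> int:
    """Read one number per line, multiply each by factor, write the results.

    Returns the number of values written. The file layout (one number per
    line) is an assumption made for this sketch.
    """
    # Step 1: read the file, skipping blank lines.
    values: list[float] = []
    for line in in_path.read_text().splitlines():
        if line.strip():
            values.append(float(line))

    # Step 2: do some simple math on every value.
    scaled: list[float] = [value * factor for value in values]

    # Step 3: write a new file, one result per line.
    out_path.write_text("".join(f"{value}\n" for value in scaled))
    return len(scaled)
```

Calling `scale_values(Path("lengths.txt"), Path("lengths_doubled.txt"), 2.0)` would double every number in a (hypothetical) `lengths.txt` and write the results to a new file.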

R

I started by learning bits of R so that I could do statistics without needing to pay a lot of money for statistical analysis programs like JMP, SAS, SPSS, or Stata. And I came to appreciate R and everything I could do with it. It’s great at manipulating matrices of data, and you can apply the same function across an entire column or row of a dataset, or even across the entire dataset. My favorite thing about it is the flexibility I have in creating graphs! It can produce beautiful figures, and you can do almost any manipulation you’d want. It can take a lot of code and a lot of time tweaking things, but you can end up with publication-ready figures like this:

Figure 2 from my paper Sexual selection on female ornaments in the sex-role-reversed Gulf pipefish (Syngnathus scovelli), available here: http://onlinelibrary.wiley.com/doi/10.1111/jeb.12487/full
This figure used the library fields, and actually only required a small amount of code to make it:

library(fields)  # provides tim.colors()

# surf.te.out is the fitted surface computed earlier in the analysis (not shown)
pdf("Fig2_fitsurf.pdf")
image(surf.te.out, col=tim.colors(256), useRaster = TRUE, lwd=5,
       las=1, font.lab=2, cex.lab=1.3, mgp=c(2.7,0.5,0),
       font.axis=1, lab=c(4,5,6),
       xlab=expression(paste("Band Area (mm"^"2", " )")),
       ylab=expression("Band Number"))
dev.off()

However, R is very slow when it comes to explicit loops, which are an essential part of many real-world programs. In my experience, R is really best for statistics and making graphs, not complex data storage and manipulation tasks.

C++

The next language I started to tackle was C++, because I wanted to create a simulation model for my dissertation and that’s the language my advisor uses. I quickly found it to be much easier to wrap my head around than R. That’s because every single variable you use must be explicitly declared as an integer, double (decimal number), character, string (words), or whatever class type you’re using. That makes it much easier to track what’s going on with every variable and the purpose of everything in the program. It also has very clearly defined syntax, so you always know where a statement ends because there’s a semicolon. And it’s fast! Yes, you have to compile the code to run it, but if you write and debug (test) programs in Visual Studio, it’s super easy: just push the play button!

Visual Studio screenshot

Some people suggest that the more explicit syntax makes C++ more difficult to learn, and certainly it means that you will likely have more typos tripping up your programs at the beginning. But for me it was so much easier to grasp what my programs were doing and how they were doing it, and it enhanced my understanding of every other programming language I’ve come across since. C and C++ also have the advantage of letting you control how your program uses memory and optimize how it runs, which can make your programs faster and less memory-hungry. There’s a reason Python has a tool called Cython, which compiles Python-like code to C, to make it run faster.

Shell scripting

My coding journey then took me to shell scripting, which I learned in order to do some genomics analyses. Since I don’t really think of shell scripting as a full programming language, I’m not going to say much about it. It is really helpful to know, because it lets you link your programs together and have your computer run them over multiple datasets. I do highly recommend learning some shell scripting, because at the moment it’s basically a necessity for genomics analysis.

Perl 

My next programming adventure was during a mini-class led by a fellow graduate student to help people learn some basic genomics analyses. He knew Perl, and used that for most of his genomics work, so that’s what he taught us. I found Perl to be a less clear and less effective version of C++. I did like the flexibility with which it could read in files, but the variables could change their type (they could start out life as an integer and change to a double, for instance). The most confusing and aggravating thing about it to me was the way functions dealt with returning values. In programming, you can write your own functions to perform a task that you’ll be doing multiple times in the program, so that you don’t have to write out the same code over and over again. In C++, a function has to declare the type of value it returns, and the natural way to use it is to store that value in a variable of the matching type. (Strictly speaking, C++ will also let you ignore a returned value, although compilers can warn about it, for example when the function is marked [[nodiscard]].) Say you write a function to calculate the mean. In C++ this would look something like this:

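#include <vector>

// Returns the arithmetic mean of number_list, or 0 if the list is empty.
double calculate_mean(const std::vector<double>& number_list)
{
	double average = 0;
	int count = 0;
	for(std::size_t i = 0; i < number_list.size(); i++)
	{
		average = average + number_list[i];
		count++;
	}
	if(count > 0)
	{
		average = average/count;
	}
	return average;
}

int main()
{
	std::vector<double> number_list = {1,2,3,4,5,6,7,8,9,10};
	double list_avg;
	// The returned value is stored in a variable of the matching type.
	list_avg = calculate_mean(number_list);
	return 0;
}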

In Perl, however, you could have exactly the same function (with slightly different syntax) but call it without storing the value it returns, and the program would still run:

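#!/usr/bin/perl
use strict;
use warnings;

# Returns the arithmetic mean of the arguments, or 0 if there are none.
sub calculate_mean
{
	my @list = @_;
	my $average = 0;
	my $count = 0;
	foreach my $value (@list)
	{
		$average = $average + $value;
		$count++;
	}
	if($count > 0)
	{
		$average = $average/$count;
	}
	return $average;
}
my @number_list = (1,2,3,4,5,6,7,8,9,10);
# The returned value is never stored, and Perl does not complain.
calculate_mean(@number_list);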

In this example it’s a bit ridiculous that you would call the function without keeping the value it returns, but it is possible and actually happened in our class. These loopholes can make for sloppy programming practices, which are obviously bad for beginners to learn, and they can also make other people’s code (or your own, for that matter) much more difficult to understand.

Python

I have the least experience with Python, but I did learn a bit while working through the book Practical Computing for Biologists (which I reviewed in 2013). It is supposed to be easier to learn than C++, probably because it has far less syntactic punctuation. It doesn’t mark the end of a statement with a semicolon (the line break does that job) and, like Perl, the type of variable you’re using is defined by context. Additionally, you don’t use curly braces to delimit for loops, if statements, and all of those things; you just have to make sure you indent correctly. Here’s what the above program would look like in Python:

def calculate_mean(number_list):
    count = 0
    average = 0
    for i in range(len(number_list)):
        average = average + number_list[i]
        count += 1
    if count > 0:
        average = average / count
    return average

number_list = [1,2,3,4,5,6,7,8,9,10]
avg = calculate_mean(number_list)
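To show what "the type of variable is defined by context" means in practice, here is a minimal sketch, assuming Python 3. The helper `describe` and the variable `measurement` are made up for the illustration. The same name holds an integer, then a decimal number, then a string, and Python never asks for a declaration along the way.

```python
def describe(value: object) -> str:
    """Return the name of the type Python currently sees for value."""
    return type(value).__name__


measurement = 3                    # starts out life as an integer
first = describe(measurement)

measurement = measurement / 2      # division in Python 3 produces a float
second = describe(measurement)

measurement = "1.5 mm"             # the same name now holds a string
third = describe(measurement)

print(first, second, third)        # prints: int float str
```

In C++ each of these reassignments would either be rejected by the compiler or silently converted to the declared type, which is the explicitness this post is arguing for.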

Summary, aka my opinion

I am biased by my own experiences, but I do think that starting with a clearly defined language like C++ is really helpful because it leads to more rigorous coding practices. It’s easy to be sloppy when writing code, especially if you’re not using an integrated development environment that tells you when you make syntax errors. Although sloppiness may initially be faster and easier, it not only makes the learning process more frustrating and difficult, but also makes finding errors and bugs much harder, and makes your code less comprehensible to others (including your future self). So if someone were to ask me how to get started learning programming, I would definitely suggest a language like C++, where everything is very clear; after that it will be easy to pick up other languages. However, learning any language will help with learning others, so as Dr. Titus Brown says in the Nature article, it may be best to simply start with whatever language is being used by the people around you, because asking others for help can also be a big boost to the learning process.

One thought on “Programming languages and their pros and cons: thoughts from a biologist”

  1. Hi Sarah, Just wanted to say that I enjoyed your article and I find it somewhat inspirational. I am a seasoned C++ programmer but more recently I have been tempted to move to Python as it feels like a higher-level language. I love C++ but it seems in less demand these days. BUT, like you say above, with C++ you know exactly what is what when it comes to variables and that is something that is very important in programming. Python (and Perl) may enable one to get something achieved very quickly, but they lack the strict structure and strongly-typed nature of C++, something that is even more important than I realised. (When I had a spell of using Perl, I did not like not knowing what types the variables were at all. Python is no different.) Way back long ago I once programmed in Ada and that is extremely strongly-typed, but for good reason: safety. I think C++ is a good compromise on Ada’s overly strongly typed nature and is still a great programming language to use today IMHO. Thanks for the article, it’s motivated me to stick with C++. 🙂
