Monday, August 14, 2017

Your first bioinformatics project

Nothing will improve your bioinformatics skills like creating your own project, and nothing proves your bioinformatics skills to a prospective employer (or university admissions committee) like having your own project to showcase. Yes, having a solid foundation in biology and programming is important for anyone looking to do bioinformatics, but studying biology and programming books until you're blue in the face will not get you practical insights like your first project will. 

Picking your First Project

As a beginner it might seem hard to know where you can carve out your own little piece of bioinformatics and contribute to the community, but rest assured that there are more problems than there are bioinformaticians to solve them, so with a little digging you can find something cool to work on while simultaneously solving a problem. Your project doesn't even have to solve a Big Problem™ in bioinformatics. A cooler/better/faster way of doing something existing is still enough for a first project. Even a neat way to visualize existing data is enough. If you're having trouble identifying a project to work on, then this might be a sign that you need to do some more background research. Once you're familiar with a topic, it is REALLY easy to identify a problem that hasn't been solved or an area for improvement. There are many different subspecialties in bioinformatics, so to help narrow down your choices, you could always consider an area that a potential employer might be interested in, or if you still can't decide, then pick an area that might have broad appeal. For example, genomics and personalized medicine are on fire right now, so a project related to next generation sequencing pipelines will probably have more appeal than a project dealing with protein crystal structures. Still looking for ideas? Why not ask a biologist on r/biology about the pain points in their work or data analysis and come up with a bioinformatics solution for them?

If you're having trouble choosing a project or if you're unsure that you have the programming skills it takes to complete a project, then I'd like to suggest working on some bioinformatics-specific programming problems. The exercises in regular programming books are great, but they don't get you thinking about programming in the context of biology. With this in mind, try working through the problems on Rosalind. Rosalind poses real bioinformatics problems for you to solve, gives you the relevant background, and allows you to instantly check your solution through a web interface. Rosalind gives you exposure to a variety of topics including '...computational mass spectrometry, alignment, dynamic programming, genome assembly, genome rearrangements, phylogeny, probability, and string algorithms'. Working your way through these problems (warning: some can be quite challenging) will give you exposure to the variety of problems a bioinformatician might work on, allow you to hone your programming skills, identify gaps in your knowledge, and act as a springboard for you to create your own project. A bonus is that all of your Rosalind solutions can be shown to a potential employer. This isn't as great as having your very own project, but it will prove you have programming experience and you can score some bonus points for having a particularly clever solution to a problem.
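To give a feel for what these exercises look like, here is a minimal sketch of a solution to a warm-up style problem: counting how many times each nucleotide appears in a DNA string. The function name and the input sequence are made up for illustration and are not taken from Rosalind itself.

```python
from collections import Counter


def count_nucleotides(dna):
    """Return the counts of A, C, G, and T in a DNA string, in that order."""
    # Counter returns 0 for bases that never appear, so missing keys are safe.
    counts = Counter(dna.upper())
    return tuple(counts[base] for base in "ACGT")


# Hypothetical input string, made up for this example.
sequence = "AGCTTTTCATTCTGACTGCA"
print(*count_nucleotides(sequence))  # prints "4 5 3 8"
```

A solution this small is still worth keeping in a repository, since it shows you can turn a biological question into working code.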

Hosting Your Project on GitHub

Once you've created your project, it will need to be hosted somewhere for potential employers to access. GitHub is a popular choice that offers free hosting for publicly accessible open source projects, and you can get free private repositories if you're a student. It's now very common practice to ask for a GitHub link as part of the application process, but even if you aren't directly asked, you should always include a link to your project's repository on your resume/CV. Don't ever just provide a link to your GitHub profile because it makes more work for the other person and it might be unclear as to what project you're trying to highlight. If you're wondering, it's considered poor form to offer to email someone a zipped file of your work "upon request". However, this doesn't mean that you simply git push your code, provide a link, and then forget about it. Oh no. There is an entire art to exhibiting a professional looking project.

The first thing you need for your project is a proper readme. This is the information displayed below the folders and files in GitHub. GitHub readmes are written in Markdown, a quick and popular way to style text. A good readme tells a potential user what your project is about and how to use it. You should avoid using field-specific lingo or abbreviations without defining them first. Since prospective employers could be reading this, make sure to use proper grammar, spelling, and punctuation. A good readme template can be found here, and you can find examples of projects with good readmes here. The readme should absolutely include screenshots or sample output if you're generating images. The readme can even include an embedded video in gif form if you want to show some cool visual effects or animations. If you built a web tool, then the readme should link directly to a publicly accessible web server running the software. If it's not a web tool, then there should be a link to an installer or at least easy to follow instructions for building your tool. If it's a pain to get to a useable form of your tool, then you risk frustrating the potential employer or losing their interest. Even if you made the most mind-blowing tool, frustrating a potential employer will mean you have a near zero chance at landing an interview.

The second thing you need to do is properly organize your project's structure. There are certain conventions that people will expect to see when perusing your code, and a professional looking project structure will show recruiters that you know your stuff. PLOS has a great guide for structuring your bioinformatics project. The PLOS guide seems to be more geared for projects that process data, so if your project is more software oriented, then you should check out this guide for project structure.


Suggested structure for your bioinformatics project from PLOS Computational Biology.

If you're a more advanced user, you should consider utilizing continuous integration (CI). CI includes version control (you're already doing that with GitHub) combined with various other practices like build automation and self-testing. CI is good development practice, and you can get sweet badges for your project's page. Potential employers will see a badge on your project's page and know that you're the real deal. Travis CI is a popular choice with GitHub integration, and Travis CI is free for open source projects. Check out this guide for getting Travis CI integrated into your GitHub project.
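The "self testing" part of CI just means your project ships with tests that can be run automatically. As a rough sketch (assuming a Python project; the reverse_complement helper is hypothetical and stands in for a function from your own code), a test file using Python's built-in unittest module might look like this:

```python
import unittest

# Hypothetical helper standing in for a function from your own project.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


class TestReverseComplement(unittest.TestCase):
    def test_known_sequence(self):
        self.assertEqual(reverse_complement("AAAACCCGGT"), "ACCGGGTTTT")

    def test_round_trip(self):
        # Applying the operation twice should give back the original.
        seq = "GATTACA"
        self.assertEqual(reverse_complement(reverse_complement(seq)), seq)

    def test_empty_sequence(self):
        self.assertEqual(reverse_complement(""), "")
```

If this were saved as test_revcomp.py (a made-up filename), running python -m unittest from the project folder would discover and run the tests. A CI service can run that same command on every push, which is where the badge comes from.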

Contribute to Open Source

In closing, an alternative to creating your own project would be to contribute to an existing open source project. It's really easy to find an open source project to contribute to (Galaxy is my favorite recommendation), and you can read this wonderful guide for everything you need to know from finding a project to opening a pull request. Keep in mind that there are good ways and bad ways to go about contributing to an open source project. The good way would mean being involved in multiple aspects like contributing code and being part of a design team or a steering committee. This way you can talk to a prospective employer about many different aspects of the project. The bad way would mean contributing code without really being involved with the project. This way you can only say, "Oh, well I fixed this one bug," which doesn't come off as very impressive during an interview. Contributing to open source the good way has the secondary benefit of building your reputation. Many job candidates are found through networking, so showing people you do good work and push good code can directly help you land a job. If you have a specific company or research group in mind, then contributing to an open source project that is used there (or was written there) can help your chances as an applicant. 

Monday, August 7, 2017

The best programming language for getting started in bioinformatics

This post contains affiliate links, meaning when you click a link and make a purchase, we receive a commission that helps support this site.

"What programming language should I learn?" is one of the very first questions to tackle if you are a beginner wanting to learn bioinformatics. Choice of language is important since it will be a tool that you utilize often, so the obvious answer is that you should learn the "best" one. What features should this best language have? It should be easy to learn, cover all of your needs as a bioinformatician (like data analysis, text processing, and application development), and be amazingly fast at any task you throw at it... Unfortunately, any greybeard can tell you that this language does not exist. 

This language does not exist because it would be impossible for a room full of computer scientists to come to agreement on anything let alone the features of a best language. Instead, we have many computer languages that excel in some aspects while having shortcomings in others. This can probably best be summarized with a hammer analogy. Have you ever wondered why screwdriver attachments for hammers aren't more popular? A hammer is exceptionally good at driving nails, and a screwdriver is exceptionally good at driving screws. Why deal with an over-engineered scrammer when a hammer and a screwdriver work perfectly well independently?

For a seasoned bioinformatician, the "best" programming language would be whatever language gets the job done efficiently. They might choose JavaScript for making a web application, Java for a graphical user interface (GUI), and C for developing a fast algorithm (like the ones used in genomics for sequence alignment). However, this is not helpful for someone who might not know how to program in the first place. For a fledgling bioinformatician, the best language is actually a combination of two languages: R and Python.

Why R and Python? These languages have all of the features you need to be successful, and it is unlikely that you will run into a bioinformatics problem that can't be solved because of the limitations of these languages. R and Python are consistently ranked as the two most popular programming languages for bioinformatics job positions according to indeed.com's job trends (accessed 08-02-17), so knowing these languages will likely help your job prospects.


R and Python are consistently the most popular languages for bioinformatics jobs on indeed.com.


Let's break down three major needs of a bioinformatician (data analysis, text processing, and application development; by no means an exhaustive list) and find out why these two languages are the best for getting started.

Data Analysis (R)
Although bioinformaticians spend a lot of time building software tools, many will spend at least some time working with biological data. For data analysis, R is an excellent choice. It is both a language and an environment for statistical computing and graphics, and it has wide adoption in the statistics and data science communities. This popularity means that there are thousands of libraries developed by others to take advantage of so you don't have to spend extra time coding. Even better, the Bioconductor project exists solely to provide R libraries for many types of bioinformatic analyses. R is a top choice by academics in bioinformatics and statistics, so a lot of the cutting edge tools based on the newest research are only available in R.

R has an absolutely wonderful (and free) integrated development environment (IDE) called RStudio, which takes the vanilla environment and transforms it into something much more useable. The RStudio folks also make Shiny, a web application framework for R. Shiny lets bioinformaticians take their R code and quickly make polished, interactive web applications without needing to know HTML, CSS, or JavaScript. RStudio's Chief Scientist and well-known data scientist, Hadley Wickham, has developed a suite of packages called the tidyverse. 'The tidyverse is a coherent system of packages for data manipulation, exploration and visualization that share a common design philosophy.' Basically, everything from transforming data to string manipulation to eye-pleasing visualizations can all be done with the libraries in this suite. The tidyverse is considered a default library installation by many in the R community, and I can't recommend it enough.

What's the best way to learn R? I am a big proponent of structured classes with homework and project deadlines to help facilitate learning, so I highly recommend the R Programming course at Coursera taught by three big names in the data science world (Peng, Leek, Caffo). This course is part of the Data Science Specialization, which is a great idea if you're going to be spending a lot of time analyzing biological data, and upon completion of the specialization you even receive a certificate that you can list on your resume/CV. If you prefer to be self-taught, then I highly recommend R for Data Science: Import, Tidy, Transform, Visualize, and Model Data by Hadley Wickham. If you already know a programming language, then some of the quirks of R might take some getting used to, but it is still an invaluable addition to your toolbox.

Text Processing and Application Development (Python)
Python is an object-oriented scripting language. Python was designed with code readability in mind, so Python code tends to be more readable and can accomplish tasks in fewer lines compared to other object-oriented languages. Python is one of the most popular languages in the world, and IEEE listed Python as the top ranked programming language of 2017 (up from number 3 last year). Python's popularity benefits bioinformaticians using Python because there are extensive libraries for everything from web frameworks to scientific computing. Python can scrape files and websites right out of the box thanks to built-in string and HTML/XML processing functions. Have a neat idea for utilizing natural language data? Your idea is only a library install away thanks to the Natural Language Toolkit.
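As a small illustration of that built-in HTML processing, here is a sketch that pulls the link targets out of a snippet of HTML using only the standard library's html.parser module. The LinkCollector class, the extract_links function, and the sample HTML are all invented for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return all link targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Made-up HTML snippet for illustration.
sample = (
    '<p>See <a href="https://example.org/genes">genes</a> '
    'and <a href="/proteins">proteins</a>.</p>'
)
print(extract_links(sample))  # prints ['https://example.org/genes', '/proteins']
```

No extra installs are needed for this, which is what makes Python handy for quick scraping jobs. For messier real-world pages you would probably reach for a dedicated library instead.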

Python is great for both web-based and desktop-based application development. For web-based applications, beginning bioinformaticians can choose between three popular frameworks (flask, django, and Pyramid). For starting out, I recommend flask since you can have a "hello world" page up and running in about six lines of code, and it is very easy to find help if you get stuck. For desktop applications, there are a number of library choices. I recommend starting with the tkinter library, included in most Python installs by default, which provides an interface to the Tk GUI toolkit. The Tk GUI toolkit comes with or is available for most operating systems, meaning that tkinter applications you write are platform independent (can work on multiple operating systems). You can have a tkinter "hello world" pop-up window in as little as three or four lines of code, and it is very easy to find help if you need it.

Unlike R where RStudio is really your only IDE choice, Python has several options. I highly recommend using Sublime Text, a fast text editor with must-have features like regex searches and column highlighting. Sublime can run on Mac, Windows, and even Linux. It can run Python code directly, and with the help of a few easy to install packages, it can be your one stop solution for all Python development. Sublime Text is not free but has an essentially indefinite trial version, and you can get rid of the occasional nag screen by purchasing a lifetime license for $75 USD. If you're looking for something free and open source, then Spyder provides an interface similar to RStudio's. Advanced text editors like vim probably have too steep of a learning curve to be useful to a beginner. However, once you get the hang of programming you might be interested in upping your game.

For learning Python, I suggest An Introduction to Interactive Programming in Python and part two, which will get you well on your way. These courses are part of the Fundamentals of Computing Specialization. If you like the two introductory Python courses, then you should consider taking the rest of the courses in the specialization since the mathematical computing skills you will learn will be helpful on your journey as a bioinformatician (and you get a certificate). For self-teaching, I can't recommend Zed Shaw's Learn Python 3 the Hard Way book enough. This book is great because it requires you to type all of the exercises, and I'm a big fan of "learning by doing". For biologists looking to learn how to code, I also recommend the well-written, beginner-friendly Python for Biologists.

Python 2 or 3?
Python 3 is the latest version of the official Python release, and I recommend starting with Python 3 since it's arguably better for beginners. I might have told you otherwise a few years ago due to backwards compatibility issues and lack of library support for Python 3, but Python 3 has now been out since 2008 so these issues are mostly history. If you're still concerned or maybe you need to work with some legacy code, then you can read more about Python 2 vs Python 3 to help you figure out what version is most appropriate.

Why not just Python?
If you twisted my arm while insisting that you didn't have time to learn two languages because you were working two jobs, helping shelter puppies find homes on weekends, and learning to play the cello, then I would concede that it was ok to only learn Python. Python can have R-like data analysis functionality with the help of the pandas library, but you will likely run into situations where a library you need is available in R but not Python. In my experience, R is easier to use for data analysis because it was built for data analysis (see above hammer analogy). Keep in mind that not learning R could hurt your job prospects since there are a LOT of positions that want this kind of experience (see above indeed.com graph). If a position that lists R programming experience came down to a candidate who only knew Python and an equally qualified candidate who knew Python but who also had some small R project up on GitHub, then who do you think would get the job?

In closing
You'll know when it's time to learn another language. You might come across something unbelievably cool in Erlang that draws you in, or you might start running into performance issues with your code. R and Python are great for a lot of things, but they can be very slow for computationally heavy tasks when compared to a language like C. The good news is that you aren't going to have a hard time picking up new languages since your brain has already been introduced to programming concepts by R and Python. 

Biology can be quite complicated and problems are going to come in all shapes and sizes. Taking a complex problem and breaking it down into manageable pieces that can be solved by a computer program is a skill that transcends any language. As a bioinformatician, this is a very important skill to develop in addition to being able to code. With this in mind, I'd like to close with a link to this great article, 'Don't learn to code. Learn to think.'




