Good Software for Grad School

After one year in graduate school, I have found that certain computer programs have made my life much easier. I’ll give a list and describe the uses and advantages of each.

Mendeley

As I stated in a previous post, this program is wonderful for organizing your pdf library. The setup is somewhat like an iTunes library for your journal articles, which can sync across your computers as well as with the web component. Additionally, Mendeley has a social networking component. You can see which papers are most frequently read in your discipline, create groups on a certain topic, and see papers related to the ones in your library. Mendeley will also export your library into Endnote or BibTeX format. All in all, this is one of the most useful programs on my computer.

R

Especially if you’re planning to stay in academia, learn R. It is a free alternative to SAS and is better suited to programming new statistical methods. Many of the methods coming out of statistical genetics are developed in R, and often investigators will provide their code in supplementary material, or even write a new R package to perform the analysis. Thus, you don’t have to wait for a new version of SAS to perform novel analyses. You can be entering your data into the new analysis function in ~30 seconds.

A graphical front end for R that I have really enjoyed is RStudio.

LaTeX

It’s a bit difficult to sense the benefit of LaTeX without using it. It is a document-processing language that creates beautiful documents with less effort than it would take to create documents of the same quality in Word (or any other word processor). I learned enough LaTeX to create most of the documents I need in about 6 months, with off-and-on effort. I had no one to teach me, so if you can find someone, I believe you could be on your way in ~1 hour. Plus, programs like Sweave (a function in R) will allow you to combine your R analysis with LaTeX markup to make wonderful, dynamic statistical reports. I don’t do an analysis now without documenting it using Sweave+R+LaTeX.

UNIX

Mostly, just learn the UNIX/Linux command line. If you stay in science long enough, you will run into a program that requires it. In statistical genetics this is especially true (e.g., PLINK), as it is whenever you need access to a high-performance computing server, which likely runs Linux. Plus, you earn points with the IT department.

While this is certainly not an exhaustive list of useful software, these are the programs that have either (a) made my life demonstrably easier or (b) left me already wishing I knew more about them.

May this summary be helpful.

Poster sessions

I’m aggressively simple when designing a poster. Like, “make a flowchart/diagram of your inclusion criteria for a meta-analysis” simple. The fact is most people are never going to experience your poster as a comprehensive unit. You will be standing next to it, ready to lead them through. So, you’re really just giving a talk with one slide that has all the information on it.

This makes me wonder if we put too much information on our posters (e.g., “Background” or “Definitions” sections). If you want someone to understand your poster, you will explain the background of the subject and any new vocabulary while talking to them. Adding more figures to posters, or even tables of your results, might: (a) make them more pleasing to the eye; (b) give the observer more information about the results of your experiment; and (c) make them more likely to read the entire thing. Just like with grants, a long string of cramped text just makes the reader ache.

So the next time you do a poster, be uncomfortably simple with your design and explanation. In my experience, the reader will walk away with a better understanding of your research than if a paper were blown up to 56×42 size.

Mendeley vs. Papers

Since virtually any article can be found and kept electronically, you naturally end up with many pdfs on your computer. To make storing them easier, some companies have designed software that sorts your articles automatically. It does this by reading the metadata in each file and fetching the citation information over the web. There are two programs I have seen fill this role more often than others: Papers and Mendeley. Mendeley is better, for the following reasons:

  1. Mendeley is free (up to a certain limit of storage). Papers is not.
  2. Mendeley has a web-component, so you can sync your article library across your computers, as well as have constant access to your library on their website.
  3. The pdf viewer within Papers is terrible. I have a last-generation MacBook Pro, and the screen lags as you scroll down a pdf article. Mendeley has shown no such issues.
  4. Over time, I’ve found Mendeley matches the pdf you import with its citation information faster than Papers does, and it has a higher success rate of finding the match on its own. I’d say I’ve had to manually match MOST of the articles I’ve imported into Papers to their online counterparts.
  5. Mendeley has an online networking component, which lets you see what others in your field are reading. Papers has no such component (to my knowledge).

There are probably more reasons why Mendeley is better than Papers. I know this is harsh on Papers, but there are just some things that programs such as these should do well. Mendeley delivers on them all; Papers delivers on none.

Science is a childish thing

Science is a continuation of that childish glee of discovery. When we learned to read. When we were infants and learned that dropping something will make it fall. It’s the infinitesimal moment where we feel connected to the world. Discovery shows you there’s an entire world that you did not see, and that you can affect it and have a place in it. Science makes us feel like explorers tearing off into the woods, where no one else has been.

But we don’t often feel that way. We feel bogged down by deadlines, by having to learn the new edition of a computer program, by wondering whether our project will get funded or whether our advisor is pleased with us. I want that joy from childhood exploration and discovery to be maintained in my career. Those who can maintain it perform more, better science.

T-test (independent samples)

Purpose

Determine whether two group means are significantly different.

Assumptions

  1. The data from each group are normally distributed. (You can check this by making a histogram of the observations; you should only be concerned if there is evidence of gross deviation from normality.)
  2. The variances of the two groups are equal. (If the data violate this assumption, it is a serious problem. Do an equality-of-variances test, like a Folded F test. The t-test is so sensitive to this assumption that some have recommended always doing a Welch-corrected t-test when sample sizes are not equal.)
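
The equal-variance check and the Welch alternative mentioned in the second assumption can be sketched in a few lines. This is a minimal illustration using only Python's standard library. The function names and the sample data are made up for the example, and it returns only the test statistics and their degrees of freedom; turning those into p-values requires an F or t distribution from a statistics package.

```python
import math
import statistics


def folded_f(x, y):
    """Folded F statistic: larger sample variance over the smaller one.

    Returns (F, numerator df, denominator df).
    """
    vx, vy = statistics.variance(x), statistics.variance(y)
    if vx >= vy:
        return vx / vy, len(x) - 1, len(y) - 1
    return vy / vx, len(y) - 1, len(x) - 1


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    # Variance of each group mean; the groups are not pooled.
    a = statistics.variance(x) / nx
    b = statistics.variance(y) / ny
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (nx - 1) + b ** 2 / (ny - 1))
    return t, df


# Hypothetical measurements, for illustration only.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]

print(folded_f(group_a, group_b))
print(welch_t(group_a, group_b))
```

A folded F far above 1 suggests unequal variances, in which case the Welch statistic is the safer one to report.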

Equation

Independent Samples T-test
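
In its standard textbook form, assuming equal variances, the statistic can be written as follows. The notation is an assumption made here for illustration: group i has sample mean \bar{x}_i, sample variance s_i^2, and size n_i, and s_p^2 is the pooled variance.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}
```

Under the null hypothesis of equal means, t is compared against a t-distribution with n_1 + n_2 - 2 degrees of freedom.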

Is the “increasingly collaborative nature of science” a good thing?

This post’s topic occurred to me during lunch with a seminar speaker. She had just given a wonderful talk about genetic variation in fruit flies to UAB’s NORC, and anyone involved in genetics knows how many experiments are involved in pursuing a project. A fellow student asked her who at the University of Alabama ran a certain analysis, and her answer was, to the effect of, “No one. I do all of my own analyses.” All I know is that statement made me respect this woman instantly, and I don’t think I was the only one feeling that way.

As a new graduate student and immigrant to the professional science world, I’ve heard about “the increasingly collaborative nature of science”. But from talking to older graduate students, this approach seems to produce a lot of waiting: on collaborators, on the next meeting, for your programming skills to develop, etc. I personally long for independence as a scientist. If I’m walking to the coffee shop one day and suddenly get struck by a new way to estimate disease risk, I want to run back to the office and get to work thinking, deriving, simulating, analyzing, and writing. Not emailing.

Now, I see a flaw with this plan for independence: I’m a biostatistician. Out of necessity (and desire), a large part of my research will involve bench researchers. I am excited about that! But I’d like to have the ability to not depend on anyone else as far as the statistics goes.

The larger question of whether the “increasingly collaborative” model improves science as a whole is a valid one. My answer to it is: I’m not sure. There are only so many hours in the day, and new technology appears constantly. Perhaps an extensively interconnected scientific community will produce more, better science. But my science, as well as the science of collaborators whose data I am analyzing, will be better if I have more extensive expertise.

Any thoughts?

First reading of “The Probable Error of a Mean” by Student

For those who don’t know, this paper by “Student” (William Sealy Gosset) derives one of the most commonly used distributions in scientific experiments: the t-distribution. The t-test is used to determine whether two group means are significantly different (e.g., the mean blood pressure of college football coaches vs. the mean blood pressure of librarians). If there is truly no difference between the groups, the statistic produced by this test follows the t-distribution, from which you can calculate the probability of observing that statistic, or one even more extreme. This probability is the p-value associated with a t-statistic.
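
To make the link between a t-statistic and its p-value concrete, here is a minimal sketch that integrates the t-distribution's density numerically. It uses only Python's standard library; the function names and the step count are illustrative choices, and in practice a statistics package would do this for you.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """P(|T| >= |t|), via Simpson's rule on the density over [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3  # area under the density from 0 to |t|
    return 1.0 - 2.0 * central_half


# Roughly 0.05: 2.228 is the familiar two-sided 5% critical value for 10 df.
print(two_sided_p(2.228, 10))
```

The "one even more extreme" part of the definition is the tail area: everything outside the interval from -|t| to |t|.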

Statistics papers this old (1908) use a different vocabulary than we use today, making them difficult to follow. As I found out, replicating the derivations is even more of a challenge. The general structure of the paper is: (1) derive the distribution of sample standard deviations (from a normally distributed population); (2) show that the distance between the sample mean and the population mean is independent of the sample standard deviation; (3) derive the distribution of the ratio of the distance between the sample mean and the population mean to the sample standard deviation (the t-distribution, called the z-distribution in the paper); (4) compare this distribution to others known at the time; and (5) provide examples of the t-distribution’s applications.
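
For readers comparing step (3) with a modern textbook, the paper's z and today's t differ only by a scaling factor. The sketch below assumes that the paper's standard deviation s divides by n rather than n - 1; the symbols (\bar{x} for the sample mean, \mu for the population mean, s' for the modern n - 1 standard deviation) are notation chosen here.

```latex
z = \frac{\bar{x} - \mu}{s},
\qquad
s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
t = \frac{\bar{x} - \mu}{s' / \sqrt{n}} = z \sqrt{n - 1},
\quad \text{where } s'^2 = \frac{n}{n - 1}\, s^2 .
```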

What I do like about the paper is its simplicity in design. Even with the 100-year vocabulary gap, you can understand each section’s purpose. Gosset obviously took great care in writing, I assume because back then publishing an article took so much time and effort. I’ll strive to craft publications as well thought out as these landmark papers of our discipline.