A beginner’s guide to statistics for PhD research

Statistics can be invaluable for adding a level of rigour to your analysis, but they can be extremely technical and difficult for non-specialists.

This is not by any means a comprehensive guide, but I will try to give some basic working principles to help reduce the pain and avoid the most common mistakes.

Plot your data

Before doing statistical analysis, wherever possible create a visual representation of your data.

This will give you a much better intuitive understanding of what is going on.

For example, if you have survey data using a Likert scale, where answers to questions are given as:

  1. Strongly disagree
  2. Disagree
  3. Neither agree nor disagree
  4. Agree
  5. Strongly agree

You may want to see how the answers to a specific question are distributed across all respondents. You can do this by plotting a histogram showing the number of responses at each point in the scale.

Here are 3 examples of possible distributions:

[Figures: Likert histogram 1, Likert histogram 2, Likert histogram 3]

Without doing any statistics, you can instantly see how the data is distributed, and you can use this as a basis for your analysis.
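If you don't have plotting software to hand, even a rough text histogram gives the same insight. The following is a minimal sketch using only the Python standard library. The responses are made up for illustration, and the function names are my own, not part of any statistics package.

```python
from collections import Counter

# Hypothetical answers to one question (1 = Strongly disagree ... 5 = Strongly agree).
responses = [4, 5, 4, 3, 4, 2, 5, 4, 3, 4, 1, 4, 5, 3, 4]

LABELS = {
    1: "Strongly disagree",
    2: "Disagree",
    3: "Neither agree nor disagree",
    4: "Agree",
    5: "Strongly agree",
}

def likert_counts(data):
    """Count the responses at each scale point, including points nobody chose."""
    tally = Counter(data)
    return {point: tally.get(point, 0) for point in LABELS}

def text_histogram(counts):
    """Return one line per scale point, with one '#' per response."""
    return "\n".join(
        f"{point} {LABELS[point]:<26} {'#' * n} ({n})"
        for point, n in counts.items()
    )

counts = likert_counts(responses)
print(text_histogram(counts))
```

The same counts can be passed to whatever plotting tool you prefer to draw a proper bar chart.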

What does the mean mean?

If you take the means of each of the three distributions above, you will get values of 3.7, 3 and 2.8.

But what do these values mean? In the first histogram, 3.7 clearly corresponds to the peak at 4. In the second, the distribution is flat, so the mean just represents the middle of the range. In the third, the mean falls closest to the least selected option.

It is up to you to then interpret what the mean means, but you can only do that when you can see the distribution of the data.
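To see how this can happen, here is a rough sketch in Python. The response counts are invented so that the means come out at 3.7, 3 and 2.8; they are not the data behind the histograms above, and the helper names are hypothetical.

```python
# Invented response counts per scale point (1 to 5) for three questions.
peaked = {1: 1, 2: 2, 3: 3, 4: 10, 5: 4}      # most people chose 4
flat = {1: 10, 2: 10, 3: 10, 4: 10, 5: 10}    # every option equally popular
two_peaked = {1: 13, 2: 8, 3: 2, 4: 8, 5: 9}  # opinions split, few chose 3

def mean_from_counts(counts):
    """Mean scale value, weighting each point by its number of responses."""
    total = sum(counts.values())
    return sum(point * n for point, n in counts.items()) / total

def least_selected(counts):
    """Scale point with the fewest responses (first one found if tied)."""
    return min(counts, key=counts.get)

for name, counts in [("peaked", peaked), ("flat", flat), ("two-peaked", two_peaked)]:
    print(f"{name}: mean = {mean_from_counts(counts):.1f}")

print("least selected option in the two-peaked data:", least_selected(two_peaked))
```

In the two-peaked case the mean of 2.8 sits next to option 3, which almost nobody chose, so the number on its own says very little about what respondents actually think.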

Standard deviation

The standard deviation is a measure of the spread of data around the mean. It is widely used, but you need to be careful.

If you use the standard deviation without plotting your data, then you can end up with a meaningless number.

Standard deviation is best used when you have something approximating a normal distribution of data (the classic “bell curve” below).

from http://en.wikipedia.org/wiki/File:Standard_deviation_diagram.svg

When you say the standard deviation = x for normally distributed data, this indicates that about 68% of the data lies within ± x of the mean.

But what if you have a graph with 2 peaks? Then the standard deviation becomes meaningless, even though a statistical program will still give you an answer.

from http://en.wikipedia.org/wiki/File:Bimodal.png
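As a small illustration of the two-peak problem, the sketch below uses made-up measurements that cluster around 2 and around 8. It assumes the population form of the standard deviation (statistics.pstdev); your software may report the sample form instead.

```python
import statistics

# Made-up measurements forming two separate clusters (around 2 and around 8).
data = [1, 2, 2, 2, 3, 7, 8, 8, 8, 9]

mean = statistics.mean(data)
sd = statistics.pstdev(data)  # population standard deviation

# Fraction of points within one standard deviation of the mean.
within_one_sd = sum(1 for x in data if abs(x - mean) <= sd) / len(data)

# Number of points that actually sit near the mean (within 1 unit of it).
near_mean = sum(1 for x in data if abs(x - mean) <= 1)

print(f"mean = {mean:.1f}, standard deviation = {sd:.1f}")
print(f"within ± 1 standard deviation: {within_one_sd:.0%}")
print(f"points within 1 unit of the mean: {near_mean}")
```

With these numbers the mean is 5, a value that no measurement is close to, and 80% of the points fall within one standard deviation rather than the roughly 68% you would expect for a bell curve. The software gives an answer, but it describes neither peak.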

Don’t include numbers you don’t understand

When you use statistical analysis software, it will spit out countless different results. Some will be useful, some not.

Do not include numbers you don’t understand in any report or table of results. Imagine an examiner asking, “what do these numbers mean?” If you can’t answer, either find out or don’t include them.

How many decimal places?

Another potential hazard is that stats software will often give you numbers to many decimal places.

For example, let’s say you measure the height of every adult human being on Earth and calculate the mean. With several billion data points, your calculation of the mean might look something like 1.68234597864422 m (I just made this number up as an example). If you copy and paste this number, you are effectively claiming that you can measure the height of a human being to an accuracy of 0.00000000000002 m, which is much smaller than the radius of an atom.

Much better to give the value as 1.68 or 1.682, since this reflects the accuracy with which you can make a single measurement.
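If you are preparing results in a script, you can do the rounding at the point where the number is reported rather than editing it by hand. A minimal Python sketch, using the made-up height from above (the variable names are just for illustration):

```python
# The made-up mean height from the text, as stats software might print it.
mean_height_m = 1.68234597864422

# Format to a fixed number of decimal places for reporting.
to_cm = f"{mean_height_m:.2f} m"  # nearest centimetre
to_mm = f"{mean_height_m:.3f} m"  # nearest millimetre

print(to_cm)
print(to_mm)
```

Keep the full-precision value for any further calculations and round only the number you quote.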

Quoting errors

The same is true when giving an estimate of the error on a measurement. Giving an error of ± 2.336598774654654 is ridiculous! You can’t be that precise in an error estimate! Stick to one (or two at the most) significant figures.
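One way to apply this consistently is to round the error to one or two significant figures and then quote the value to the same decimal place. The sketch below is a rough illustration in Python. The function names and the measured value of 12.34567 are hypothetical, and it assumes the error is a positive number.

```python
import math

def round_sig(x, sig=1):
    """Round x to the given number of significant figures."""
    if x == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(x)))
    return round(x, sig - 1 - exponent)

def format_with_error(value, error, sig=1):
    """Quote value ± error, with the error rounded to `sig` significant
    figures and the value given to the same decimal place."""
    rounded_error = round_sig(error, sig)
    exponent = math.floor(math.log10(abs(rounded_error)))
    decimals = max(sig - 1 - exponent, 0)
    return f"{value:.{decimals}f} ± {rounded_error:.{decimals}f}"

# Hypothetical measurement, with the over-precise error from the text.
print(format_with_error(12.34567, 2.336598774654654))         # one significant figure
print(format_with_error(12.34567, 2.336598774654654, sig=2))  # two significant figures
```

This sketch does not handle errors of 10 or more specially; for those you would also want to round the value to the nearest ten, hundred and so on.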

Do analysis at a small scale early in your research

If you have 1 month left to submit your thesis, and you are doing analysis for the first time, it’s going to be difficult.

So do some analysis early, on a small scale, so you have some experience before you do the full analysis. You will be able to take your time, while the pressure is still low. Most mistakes happen when doing things in a rush at the last minute, especially if you have never done that type of analysis before.

If you know what methodology you are going to use, do a small trial run and analyse the data you get. Not only will this help you refine your methodology, but it will make the final analysis much, much easier.

Any questions?

I am not an expert in statistics, and cannot answer questions on specific analytical techniques or software, but am happy to answer questions on these basics.

If any statisticians want to contribute, you are more than welcome!

20 thoughts on “A beginner’s guide to statistics for PhD research”

  1. Hi, I have submitted my master’s thesis for examination, but to my horror, the table that I included in the thesis is the one without the standard error. It’s a mistake on my part. Will this affect the validity of my data? Or even my whole thesis? Thanks in advance.

    • I really don’t know, because I don’t know your research and haven’t read your thesis. If you were to add the standard error, would it affect your conclusions?

      If you are really worried, go to your project supervisor or tutor and point out the mistake (and give them the correct table). They might not accept it, but it’s worth trying.

  2. Hi James
    I would like to know if it is cool to use multiple software packages such as SPSS and STATA to analyse my work.

  3. Just wondering, I used a made-up survey for my research, but I’m worried that as I haven’t used standardised measures I will fail when they question me in the viva. Would love your thoughts! Thanks, sharan.

    • I don’t know if you will fail because of this, but if the viva is the first time you get feedback on your research from another academic, this is not good.

      It is crucial that you talk to other academics, your supervisor, or other PhD students throughout the course of your PhD.

  4. Howdy! This is my 1st comment here so I just wanted to give a quick shout out and say I truly enjoy reading your posts.

    Can you suggest any other blogs/websites/forums that go over the same subjects?


  5. I admire your intentions, but you mislead right from the outset: it is NOT legitimate to calculate a mean for Likert data. The categories are ordinal, and the mean requires interval data.

  6. Hey! I have a question about normality! I have 5 treatment conditions in my experiment, and 4 dependent variables which are normally distributed. So I can use t-tests etc. But when I look to split my groups between low experience and high experience, or expert and non-expert, they then become non-normal. Do I then need to use non-parametric tests on those hypotheses? Is it right to switch between parametric and non-parametric with the same data? Thanks James!!

  7. Helpful post thanks James. You explained the relevance behind these stats concepts in a way that I hadn’t heard before. Really helpful. Are you planning part 2?

    • I’m not planning a part 2, but might do if you have questions on really basic principles. I’m not a statistics expert, so have no intention of getting into really detailed stats guides.

  8. There are many stats resources out there in the wild www. However, for those who are at the beginning of their PhD and want to learn certain quantitative techniques that they may use later on, I strongly recommend Coursera: statistics with R, social network analysis and structural equation modeling are just some of the courses provided free of charge on this website: https://www.coursera.org/ . A true gold mine.

  9. Hi! I’m doing my research comparing two groups of participants, patients and controls. I have done all my stats analysis using t-tests on my different measures to check my hypothesised differences between groups, but everybody in my field reports ANOVAs in the results section. Is my analysis reasonable, or does it make no sense at all? Thanks!

    • In general,
      – Independent samples t-tests are used for testing differences between 2 groups.
      – ANOVA is used when you are testing for differences between more than two groups.

        • It depends how many groups you have, but t-tests are better at helping you locate the differences, while ANOVA only tells you that there is a difference. I’d say a t-test is the right thing to do in your case.
