# Designing your bioinformatics project

We'll soon be diving into many specific techniques that you can use in bioinformatics. However, many people find that it is easiest to learn this type of skill if you are applying what you are learning to a specific project that you care about.

So how do you design a good bioinformatics project?

Depending on your perspective, developing a scientific question that you'd like to answer may sound simple - or perhaps frighteningly open-ended. There is a sense (which we'll discuss below) in which something as simple as picking a project speaks to some fairly deep philosophical questions about how we come by scientific knowledge. So there is *a lot* that you can learn about designing an experiment or analysis.

But for now - especially if you are just starting out - it is most important that you *pick an interesting question* and *start trying to answer it*. You can learn a lot along the way. Don't feel like the first question that interests you is written in stone either. There are no penalty points awarded in science for switching questions (indeed this happens all the time!), so it is a good idea to think about the first question that interests you as a starting point.

## What makes a good scientific question?

What makes a good scientific question is of course somewhat subjective. I'm going to present my opinion. This advice is especially focused on scientific questions that you can pursue mostly independently as an advanced undergraduate or early graduate school student, without a source of funding to collect new data. 

If you are an experienced researcher heading a biological laboratory, some important considerations will be different (e.g. how will I get funding for the project?), but if you are in that position you are likely already all too familiar with those considerations.

Down below we'll talk a little more about a specific *process* for developing a question, so don't feel like you have to magically come up with one out of thin air. But for now let's get some sense of what we're aiming for.

__Here's what I think you should look for when picking a scientific question for *your* project:__

- **It interests *you***. Answering most questions in bioinformatics takes significant time and dedication. So the most important criterion in picking a question is that you genuinely care about the answer. You might care because of simple curiosity, or because you think the answer might suggest a new approach to a societal problem (e.g. predicting flu outbreaks to better target vaccines). No two humans have identical experiences, and so it's OK if the question that interests you is not the same as the one that interests other people. Indeed, that's one reason that you have something unique to contribute to bioinformatics. Conversely, just because you think a question is important doesn't necessarily mean it's the best one for *you* to address.

- **It is relatively specific**. The first questions that you think of may be very broad ("How do cells work?" or "How did selfless behavior evolve?"). Some might take a career's worth of work. Others might take concerted effort from the whole global scientific community over a long time to really answer. If you come up with a very broad question, don't throw it away. You can save it, but understand that you will likely have to definitively answer many smaller (more 'bite-sized') questions in order to build up an understanding that speaks to your main question. Referring back to it will be very useful later on.

- **It is one that *you* can start working on now**. In science, getting your hands dirty always beats idle speculation. You want a question that you can start addressing *now*. If you are a student who is new to bioinformatics, that typically means that the data you would need to test different hypotheses about your question are available and in a format that is not too difficult to work with. Down below we'll talk more about hypotheses and go over some of the common data formats that you might look for. Although there are thousands of data formats that are used by specific software packages in bioinformatics, many interesting problems can be addressed with data that are in a much more manageable number of formats (say 4 or 5). Overall, it is much better to pick a question that you can make some progress on soon using data that are already available than it is to pick a very broad or unanswerable question.


>__Example__: Let's suppose (I'm making this up for the sake of illustration) that I have read that giant African land snails (*Achatina fulica*) eat more than 500 species of plants, some of which are toxic to many other animals. I am interested in why they are able to eat so many kinds of plants. The first thing I write down is "How does this snail digest plant toxins?"
>
> If I have local access to that snail, there may be some laboratory experiments I could devise that would let me start working now on how this snail breaks down its food. However, because there aren't many data available on this particular snail, asking questions about its metabolism probably isn't a good topic for a first bioinformatic project.
>
>But I understand that I don't have to stick with my first question - I can adjust the question to make it answerable given the time, resources, and skills I have available to me. So I could broaden the question to ask about metabolism in all snails or all molluscs. I might also notice that 'how does metabolism work' is extremely broad. Maybe what really interested me was how the snail dealt with a particular toxin from a plant that it eats. If so, I could perhaps expand my question to deal with how snails in particular or molluscs in general metabolize plant alkaloids.



## Why bioinformatics needs YOU

No two humans have exactly the same experiences, and therefore all of us view the world through a slightly different lens. Scientists agree (mostly) on the broad process by which we can test ideas about how the world works - the scientific method. By making predictions based on those ideas and then developing experiments or analyses that test those predictions, we can try to figure out which ideas make useful predictions and which do not. However, the process by which we determine which topics are worthy of our attention, what questions are worth asking, and what hypotheses are worth testing is often much more subjective. Researchers with different backgrounds and perspectives (whether due to training, culture, or any number of other factors) may become interested in unique topics, or suggest potential hypotheses that had not previously been considered.

![Hypothesis Testing Diagram](./resources/hypothesis_testing.png "An idealized diagram of hypothesis testing. The diagram shows a cloud representing how our beliefs, training, background, etc. may lead us to propose different hypotheses. These hypotheses are represented as different shapes. An experiment is represented as a board with diamond shaped holes in it. Several hypotheses 'fit' the data generated by the experiment, while others do not. Hypotheses that fit the data carry on while those that do not are abandoned.")

The diagram shows an idealized view of the **scientific method**. You've probably encountered this before, but bear with me. We propose all sorts of hypotheses based on how we understand the world to work. Then we design experiments or analyses, and use our various hypotheses to predict what will happen. Finally we see if the predictions match what happens. Broadly, those hypotheses that best predicted the outcome tend to be the most useful ones. (We'll talk more later about some of the nuances here.)

Here's the point - the process of choosing what broad topics are worthy of scientific inquiry (and how much effort and funding should be devoted to each) and proposing testable hypotheses is as much an art as a science. That's not to say we can't make *any* judgements about which questions are likely to be fruitful, just that it's more a matter of judgement and emphasis than strict rules. Therefore, if we are missing a lot of perspectives - including perhaps yours - we may be missing out on important questions to ask or hypotheses to test.


![Hypothesis Testing Diagram with missing perspectives](./resources/hypothesis_testing_missing_perspectives.png "The same diagram as shown up above, but some perspectives are missing, leading to hypotheses that would explain the data never being tested.")

That being said, it's important to note that **most of our new hypotheses are wrong** and **evidence matters**. If you want to be in the business of scientific inquiry, you have to be very comfortable having your ideas - even your favorite ones - turn out to be wrong pretty often. Over time, you may find - as I have - that you get very excited when the evidence suggests that a hypothesis you thought was a sure thing is incorrect. That means that you may need to adjust your thinking to match what biology is showing you. In other words, you've learned something! 

I think this quote from Robert Pirsig in *Zen and the Art of Motorcycle Maintenance* sums this point up nicely:
>“The TV scientist who mutters sadly, "The experiment is a failure; we have failed to achieve what we had hoped for," is suffering mainly from a bad script writer. An experiment is never a failure solely because it fails to achieve predicted results. An experiment is a failure only when it also fails adequately to test the hypothesis in question, when the data it produces don't prove anything one way or another.”

>― Robert M. Pirsig, *Zen and the Art of Motorcycle Maintenance: An Inquiry Into Values*


## A process for developing a specific scientific question

Once you have a general topic in mind, the basic process for developing a specific scientific question is as follows:

- **Review the literature about your topic.** It's just fine to start with Wikipedia or a news article. It might seem backwards to read when what you want to do is write down a specific question and hypotheses. I promise, however, that if you immerse yourself in what's already known about a topic, questions and ideas will start to emerge. I talk more about some specific ways you can do a simple mini literature review in a very short amount of time below.

- **Talk with researchers in the area.** Once you have done your mini literature review and have some idea of what is known about your topic, it can be helpful to bring a list of brainstormed ideas to someone working in that area. If you are taking a class, you might bring them to office hours or schedule a meeting with the instructor.  

- **Try talking your ideas out with others who are at a similar stage.** If you find a particular question is easy to explain to others in a way that they clearly understand, that may be a sign that the question really is a clear and interesting one.


## Conducting a literature mini-review
Before you spend too much time developing a bioinformatic project, you want to get some sense of what is already known about that topic. Let's do that for the topic that you picked up above.

### Finding peer-reviewed studies
When doing bioinformatics, we want to focus on *peer-reviewed* scientific publications. These are publications that have been read over and approved by several other scientists working in similar areas. The peer-review process is not perfect, but does tend to filter out many very poorly conceived or executed experiments, so in general you will find higher quality results if you look for peer-reviewed studies than if you do a general internet search.

Two common resources for finding peer-reviewed studies on your topic are [Google Scholar](https://scholar.google.com/) and the National Center for Biotechnology Information (NCBI)'s [PubMed search engine](https://www.ncbi.nlm.nih.gov/pubmed/). 
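PubMed can also be searched from a script through NCBI's E-utilities web service. The following is only a minimal sketch of how such a query might be assembled: it builds the search URL but does not send it, and the function name `pubmed_search_url` and the example search term are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Base address of the E-utilities "esearch" endpoint
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(query, max_results=10):
    """Build (but do not send) a PubMed search URL for a free-text query."""
    params = {
        "db": "pubmed",          # search the PubMed database
        "term": query,           # the search terms
        "retmax": max_results,   # maximum number of record IDs to return
        "retmode": "json",       # ask for a JSON response
    }
    return ESEARCH_URL + "?" + urlencode(params)


# Hypothetical search term, borrowed from the snail example above
url = pubmed_search_url("mollusc alkaloid metabolism")
print(url)
```

Pasting the printed URL into a browser should return a list of PubMed record IDs rather than formatted citations. If you plan to send many requests from a script, check NCBI's usage guidelines first.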

### Build a table of studies for your mini-literature review

I am very lazy. Just really abysmally lazy. However I also like to do a good job on projects that I'm pursuing. One way to do that is by getting good at *not wasting effort*. You can start right now as you begin your literature review. Even though your first steps might feel preliminary, getting organized now will save a ton of effort later on. 

So here's what I recommend. Set up a table of studies that address your question. Look up at least 10 peer-reviewed studies using [Google Scholar](https://scholar.google.com/) or the [PubMed search engine](https://www.ncbi.nlm.nih.gov/pubmed/) that look interesting and address your topic. In Google Scholar, papers will by default be sorted in part by their number of citations (how often *other* papers have referenced them). This can be useful when you know nothing about a field to find papers that are actively being discussed in the literature (but note that any given paper might be cited a lot because it is really good or just because it is really controversial).

To lower the barriers to getting started right away, I've put together a table that you can use to summarize your results. Here's the link: [Literature Review Template](./resources/Literature_Review_Template.xlsx). 
You can download it and fill in your literature survey in your copy. It is in Microsoft Excel format and can be loaded in either Excel or Google Sheets (which is free). 

Before you do anything else, just enter the information for these studies into your table. For now, you just want to enter into the table the basic information about the paper (the last name of the first author, the year of publication, the journal, the paper's title and a hyperlink to take you back to it).

Now that you have that recorded, I'll say something that you won't hear very many instructors say: **don't read the papers - or at least, don't feel that you must read each one, in full, from start to finish right now**. You may very well *end up* reading all of them in detail if you continue to work on your question, but it is not necessary (and may be counterproductive) to try to do that right at the beginning. 

*Instead*, **read the abstract of each paper (see below)**. Try to look for the key pieces of information in the example table. When you enter the information, also mark your 'depth of reading' as 'abstract only'. That will be very useful later on when you have read some papers but not others. 

Here's an example of how a first entry in your literature review table might look. It continues the theme of mollusc metabolism of plant alkaloids that we used in the example up above.

![Example Literature Review Table](./resources/Literature_Review_Example.png "An image of a filled literature review table similar to the one at the link up above.")
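If you would rather keep your table as a plain-text file than as a spreadsheet, the same idea can be sketched in a few lines of Python. The column names below are assumptions based on the fields described above (they may not match the headers in the template exactly), and the single entry is a placeholder rather than a real paper.

```python
import csv
import io

# Assumed column names; adjust them to match your copy of the template
COLUMNS = ["first_author", "year", "journal", "title", "link", "depth_of_reading"]


def new_entry(first_author, year, journal, title, link,
              depth_of_reading="abstract only"):
    """Return one row of the literature table as a dictionary."""
    return {
        "first_author": first_author,
        "year": year,
        "journal": journal,
        "title": title,
        "link": link,
        "depth_of_reading": depth_of_reading,
    }


def write_table(entries, handle):
    """Write the header row and then all entries to an open file handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(entries)


# Placeholder entry (not a real paper)
entries = [
    new_entry("Lastname", 2020, "Journal Name", "Title of the paper",
              "https://example.org/paper"),
]

# Write to an in-memory buffer so the result can be printed
buffer = io.StringIO()
write_table(entries, buffer)
print(buffer.getvalue())
```

To save the table to disk instead, you could pass a file opened with `open("literature_review.csv", "w", newline="")` in place of the buffer. Because new entries default to 'abstract only', you only need to update the depth of reading when you return to a paper later.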



Once you have read a number of abstracts, *then* you can select a couple of papers that seem most relevant to your project to read in more depth. 



### Reading an abstract

Most peer-reviewed scientific papers have an abstract, which is like a version of the paper in miniature. They are very, very short - from just 150 to about 500 words - so you can read a **lot** of them in an hour. In a good abstract, you should be able to figure out why the researchers care about the topic they're studying, what question they asked, their hypothesis (if any), the methods they used to test the hypothesis, the results of their experiment or analysis, and the broader implications for their original question. Typically each of these will be covered in just one or two sentences. Sometimes many individual results will be summarized in a single sentence.

### Skimming papers 

Now that you are beginning to get a sense of what's out there, you can pick 1-3 promising papers to skim. Look to understand the broad outline of what the authors did, without getting too lost in exactly how many milligrams of reagent they added to a particular reaction.

When skimming a paper, you do not necessarily need to read it in order. This is especially true if you are already familiar with a field and many common methods used in it. In that case, many readers will read the abstract, then if they understand the experimental setup jump ahead to the key figures, then back up to read the results and methods, and finally read the discussion.

As you read, pay attention to the inline citations used to support claims. If you see a claim that is interesting, it is often worth looking up the citation and adding that paper to your list. A good rule of thumb: you are starting to know a particular literature when you pick up a new paper and recognize citations to several papers you have already encountered.

You may wish to make special note of what software or analytical methods researchers used to answer questions that are similar to those you wish to ask. You will probably need to look up methods that are unfamiliar. Often, you will see some methods used in many papers on the same topic. If you see a method used repeatedly, that's a good sign that it's worth devoting some time to understanding how it works, and seeing if it can be used in your work.

Record the key points in your literature survey so you can refer back to them later. As you skim several papers, you will hopefully find at least one that is worth reading in full. 

### Reading a paper in depth 

Reading a paper in depth takes time. Hopefully, the process outlined up above has weeded out many papers that would be less worth this much time. However, having some key papers that you know inside and out can be a powerful thing in terms of developing your expertise in a given area of science. 

Here are some pointers:

- **Identify the type of paper**. Is the paper a review, perspective, or research paper? 

- **Start with the big picture**. You must try to understand *why* the investigators did each analysis they did. Some papers make this easier than others. It helps to start with the hypotheses the investigators were trying to test.

- **Don't forget context**. Often, papers make more sense in the overall context of a scientist's body of work. It may be useful to look up the senior author on Google Scholar. There you can find a sorted list of all (or most) of a scientist's papers. You might also check out the lab website, as these often have an explanation of the broad questions the lab is trying to answer, written in language that is broadly accessible.

- **Nail down key methods, terms, concepts, and acronyms**. If you see a term that is unfamiliar to you when reading a paper deeply, be sure to pause to look it up and record what you find. Then you'll recognize it in future papers. Sometimes 5 minutes on Wikipedia can save hours of confusion.

- **Distinguish results from interpretation**. Pay lots of attention to the distinction between what exactly the experimenters (or analysts) measured vs. the hypothesis they were trying to test. Are there alternative explanations that could produce the same outcome? Often good papers will include additional analyses to try to test whether these alternative explanations apply.

- **It's worth reading what others are saying about the paper**. For papers that have been around for a while, Google Scholar and similar resources are also very useful for finding other papers that discuss the paper you are reading. You can click on 'cited by' to find links to other papers that discuss this work. This can be very useful in finding what other scientists are saying about a paper. How do they summarize its key conclusions? Do they find other results that agree with the paper? Do they critique it?

- **Don't forget the supplementary materials.** Many papers are written with very strict page and word count limits. As such, often much useful explanation and detail ends up in Supplementary materials, which usually have to be downloaded separately from the journal website. I often find that these are **extremely** useful. The explanations are often somewhat less formal and more accessible, and they often include raw data files that you might end up using in your analysis.

- **Be specific in your critiques**. If you don't like the paper, try to be very specific in what you would do differently. A constructive critique could be the basis of your own project! One good rule of thumb is to focus on criticisms that would actually be likely to change the biological interpretation. For example, if the investigators saw a strong effect with high statistical significance, then even if you wish they had collected ten times as many observations, it is also worth acknowledging that this probably wouldn't have changed the outcome - unless those extra samples differed systematically in some way from the ones the paper actually collected.

- **Summarize the paper in your own words**. Try explaining what the paper is about and what it found succinctly and clearly out loud, as if it were your own work. Can you give a clear one-minute overview of why this study was done, how it worked, and what it found? If so, that's a good sign that you're on the right track.

### Rinse and repeat!

Much of what researchers do to become experts on a topic is to repeat a process like this one. It helps to have an end goal in sight, such as writing the introduction and discussion sections of a paper about your own research.



## Developing hypotheses from your question

If you have picked a sufficiently specific research question, then hypotheses are simply possible answers to that question. However, they are not just any answer to the question. 

The best hypotheses are *very specific* answers to your question that make detailed and testable *predictions* about the world. That is to say, good hypotheses make predictions that are so specific that they can be tested. If our hypothesis isn't right (most aren't and that's OK!) we'd like to find that out as quickly as possible.

The *worst* scientific hypotheses we can write are mushy. They may not make any predictions that can be tested with data. Or they may make predictions, but those predictions are so vague that really any data one could possibly collect would be consistent with them.


### How to spot - and fix - mushy hypotheses.

It's easiest to think about how to design a good hypothesis by considering some pretty bad ones. I work in the area of microbial ecology, studying the various microorganisms that live in and on animals. One hypothesis that several papers in the area purport to test is that the set of microbes living on tropical corals  (the 'coral microbiome') is very diverse.

I consider this hypothesis pretty bad. First, it's quite vague. When we say the microbes are diverse, do we mean in terms of *which* species are present or what they can do (these don't always line up perfectly)? What do we mean by 'very diverse'? If I find 50 species of bacteria living on a coral, is that 'highly diverse'? What about 20? 200? In other words, what would 'not diverse' mean? 

Let's say we are focused on species diversity. One way we could make our hypothesis less mushy is by making it comparative. For example, maybe we mean that corals have a diverse microbiome *relative to* all other animals. Or maybe we mean that corals have a diverse microbiome *relative to* the microbes that normally live in tropical seawater or seafloor sediments. 

Compare these examples and see if you agree that one is a lot less specific than the others:

1. "Coral microbiomes are highly diverse"
2. "More species of microorganisms live on corals than live in surrounding seawater"
3. "More species of microbes live on corals than on other animals"
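To see why the second statement is easier to test than the first, it can help to sketch what the comparison would look like in code. The data below are hypothetical toy values invented purely for illustration (the taxon names and sample contents are not real measurements), and counting the taxa in each sample is only one simple way to measure species diversity.

```python
# Hypothetical toy data: each sample is the set of microbial taxa observed in it
coral_samples = [
    {"taxon_A", "taxon_B", "taxon_C", "taxon_D"},
    {"taxon_A", "taxon_C", "taxon_E", "taxon_F", "taxon_G"},
]
seawater_samples = [
    {"taxon_A", "taxon_B"},
    {"taxon_B", "taxon_H", "taxon_I"},
]


def richness(sample):
    """Species richness: the number of distinct taxa in one sample."""
    return len(sample)


def mean_richness(samples):
    """Average richness across a list of samples."""
    return sum(richness(sample) for sample in samples) / len(samples)


coral_mean = mean_richness(coral_samples)
seawater_mean = mean_richness(seawater_samples)

# Hypothesis 2 predicts that this comparison comes out True
prediction_holds = coral_mean > seawater_mean
print(coral_mean, seawater_mean, prediction_holds)
```

Notice that there is no equivalent last line for hypothesis 1: without something to compare against, there is no outcome that would count against it. A real analysis would also need many more samples and a statistical test before drawing any conclusion.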

#### Comparative hypotheses

If you look carefully, you may see that many hypotheses you encounter in biology, including the ones above, have an element of *comparison* to them.

For example, when evolutionary biologists propose that a given trait is an adaptation to a particular environment, what they are really saying is that individuals with the trait have higher fitness (i.e. a greater number of surviving offspring) in that environment __relative to individuals without that trait__. This hypothesis makes a testable prediction: if we have individuals with or without the trait in that environment, and count up their surviving offspring, and *if* the hypothesis is correct, we should see that individuals with the trait have more surviving offspring. If instead we see that individuals *without* the trait have more offspring, then this would argue against our hypothesis.
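As a rough sketch of how such a prediction could be checked, the code below compares offspring counts between the two groups using a permutation test. This is one possible approach, chosen here because it needs only the Python standard library; the offspring counts are hypothetical numbers invented for illustration, and the function names are assumptions.

```python
import random


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def permutation_test(with_trait, without_trait, n_permutations=10000, seed=0):
    """One-sided permutation test for a difference in mean offspring number.

    Returns the observed difference in means (with minus without) and the
    fraction of random relabelings that give a difference at least as large.
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = mean(with_trait) - mean(without_trait)
    pooled = list(with_trait) + list(without_trait)
    n_with = len(with_trait)

    at_least_as_large = 0
    for _ in range(n_permutations):
        # Shuffle the pooled counts, then treat the first n_with as 'with trait'
        rng.shuffle(pooled)
        difference = mean(pooled[:n_with]) - mean(pooled[n_with:])
        if difference >= observed:
            at_least_as_large += 1

    # Adding 1 to both counts keeps the estimate from being exactly zero
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical counts of surviving offspring per individual
offspring_with_trait = [5, 7, 6, 8, 7, 6]
offspring_without_trait = [4, 5, 3, 5, 4, 6]

observed, p_value = permutation_test(offspring_with_trait, offspring_without_trait)
print(observed, p_value)
```

A positive difference with a small p-value would be consistent with the adaptation hypothesis, while a negative difference would argue against it. The later section on statistics discusses how to interpret results like these more carefully.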

#### Quantitative hypotheses

Many hypotheses in biology are *quantitative*. That is, they are described by a mathematical formula. For example, I might suppose that a particular formula predicts how quickly a certain species of lizard will grow over time. Or another equation might predict the frequency of a recessive lethal mutation in a population after 10 generations of reproduction. The key thing to realize about quantitative hypotheses (especially for biologists) is that **any chump can write down a complex-looking formula**. Simply because mathematical notation is used does not by itself lend any validity to a quantitative hypothesis. (An exception is if the new formula is a mathematical consequence of a different formula that has itself already been verified to accurately predict biological outcomes). As with any hypothesis, **quantitative hypotheses must be tested through confrontation with data**.
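To make the recessive lethal example concrete, here is a minimal sketch using one standard textbook model (not necessarily the one you would choose for a real project). It assumes random mating, a very large population, no new mutation, and that individuals carrying two copies of the allele never reproduce. Under those assumptions, an allele at frequency q in one generation is at frequency q / (1 + q) in the next. The function names and the starting frequency are assumptions made for illustration.

```python
def next_frequency(q):
    """Allele frequency after one generation of complete selection
    against individuals carrying two copies of the recessive allele."""
    return q / (1 + q)


def simulate(q0, generations):
    """Apply the one-generation rule repeatedly, starting from frequency q0."""
    q = q0
    for _ in range(generations):
        q = next_frequency(q)
    return q


def closed_form(q0, generations):
    """Shortcut formula that follows mathematically from the one-generation rule."""
    return q0 / (1 + generations * q0)


# Assumed starting frequency of 0.5, followed for 10 generations
predicted = simulate(0.5, 10)
print(predicted, closed_form(0.5, 10))
```

The two functions agree because the shortcut is a mathematical consequence of the one-generation rule. That agreement says nothing about whether the model describes a real population: the prediction only becomes evidence once it is compared with allele frequencies that were actually measured.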


There will be much more discussion on exactly what we mean by 'confronting hypotheses with data', and several different perspectives on this topic in the section on statistics later on.



 







### Write your hypotheses and specific predictions

**Take your specific question up above. Given what you've read so far in the literature (in your mini-review), write down 3-5 hypotheses for how your question might be answered.** If your hypothesis has already been discussed in the literature, you might note what paper it was proposed in, or what existing evidence supports it. However, this isn't strictly necessary, and many reasonable hypotheses about specialized topics may not yet have been explored in the literature.

| Hypothesis  | Rationale/Explanation  | Specific predictions  |
|-------------|------------------------|-----------------------|
| H1.         |                        |                       |
| H2.         |                        |                       |
| H3.         |                        |                       |

Once you've done that, for each hypothesis, see if you can write down at least one specific prediction it would make about data you have or could collect that differs from one or more of the other hypotheses.



 

## Three Perspectives on how to evaluate Scientific Hypotheses

[Epistemology](https://en.wikipedia.org/wiki/Epistemology) is, very broadly, the study of knowledge. It asks how we know what we know, and what it takes for a particular belief to be justified. Defining hypotheses and confronting them with data is one way of answering some of the questions in epistemology, though certainly not the only one.

However, although the scientific method is typically presented as universally accepted among scientists and a fairly clear, mechanical process, there are actually several distinct ways that scientific hypotheses can be evaluated. Many scientists may use several of these different approaches without formally distinguishing them. However, I would like to define them now, so that we can refer back to them later on. 

In particular, understanding the distinctions between these approaches will be important when we discuss statistics, because different approaches to statistical analysis approach epistemological questions differently. To understand what a particular equation is telling you, it is important to understand the question it is trying to answer.

- **Falsificationism**. This is the approach you were most likely taught as part of the scientific method. In this approach, we write down predictions that our competing hypotheses make about the world. We then devise experiments or analyses that would produce *clearly different outcomes* depending on which hypothesis is correct. If the experiment produces an outcome that could not occur (or could only occur with a probability so low that we are willing to disregard it) under a given hypothesis, then we say that that hypothesis has been *falsified* by the data. However, this approach does not tell us that any given hypothesis is in fact true - it just tells us that some are false, and leaves others open as possibilities. In many modern applications, various trivial null hypotheses are proposed (e.g. there was no real change between two conditions, just random measurement error), and by falsifying them you can progressively describe features of the process under investigation. The Wikipedia article on [statistical hypothesis testing](https://en.wikipedia.org/wiki/Statistical_hypothesis_testing) is well worth a read, and contains more historical detail on how what is currently taught about hypothesis testing through falsification derives from a combination of distinct approaches pursued by Jerzy Neyman & Egon Pearson and by Ronald Fisher.

- **Information-theoretic Model Comparison**. In this approach, one is simultaneously evaluating several models to see which best explains the data at hand. Ideally, the models compared should account for the range of possible explanations for whatever phenomenon you're studying. A key feature of this approach is that it acknowledges that more *flexible* hypotheses (or statistical models with more parameters) will tend to fit *any* dataset more closely than *rigid* hypotheses (or statistical models with only a few parameters). Therefore, information-theoretic model comparison takes into account both the chances that each hypothesis would produce the data you observe and the complexity of the hypothesis. The [Akaike Information Criterion](https://en.wikipedia.org/wiki/Akaike_information_criterion) (AIC) is a simple equation that is often used in practice to pick relatively simple or rigid models that still do a good job of explaining the data that you see.

- **Bayesian Model Comparison**. In Bayesian model comparison, each hypothesis under consideration should be assigned a *prior probability* that reflects the chances that the hypothesis is true prior to starting the investigation. This might be a *flat prior* in which all hypotheses are treated equally if there is truly no available evidence (e.g. 0.50 each if you had two hypotheses). These prior probabilities are then updated based on the likelihood of observing the data under each competing hypothesis using [Bayes' theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem) (we'll discuss this later on). The result is that hypotheses that make the data likely go up in probability relative to their prior probability, while those that don't match the data go down. The resulting probability is called the *posterior probability* and reflects the chances that the hypothesis is actually true. Note that this is a much stronger claim than that used in Falsificationism. However, the accuracy of the posterior probability depends on having a correct prior probability, which can be difficult or impossible to specify objectively. In the extreme case that the chances of getting the observed data under a hypothesis are zero, then the posterior probability of that hypothesis is also zero and the hypothesis has been 'ruled out'. In more typical cases, the results of an experiment or analysis will adjust the probability of the hypothesis.
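The last two approaches each rest on a short calculation, which can be sketched as follows. The AIC function uses the usual definition (twice the number of parameters minus twice the log-likelihood, with lower values preferred), and the posterior function applies Bayes' theorem across a fixed set of competing hypotheses. All of the numbers are hypothetical values chosen for illustration, and the function names are assumptions.

```python
def aic(log_likelihood, n_parameters):
    """Akaike Information Criterion: lower values indicate a better
    balance between fit and model complexity."""
    return 2 * n_parameters - 2 * log_likelihood


def posterior_probabilities(priors, likelihoods):
    """Update prior probabilities using the likelihood of the observed
    data under each hypothesis (Bayes' theorem)."""
    unnormalized = [prior * likelihood
                    for prior, likelihood in zip(priors, likelihoods)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("The data are impossible under every hypothesis.")
    return [value / total for value in unnormalized]


# Hypothetical models: the flexible one fits slightly better (higher
# log-likelihood) but uses more parameters, so its AIC is worse (higher)
aic_rigid = aic(log_likelihood=-10.0, n_parameters=2)
aic_flexible = aic(log_likelihood=-9.5, n_parameters=5)
print(aic_rigid, aic_flexible)

# Two hypotheses with a flat prior; the data are four times as likely under H1
posteriors = posterior_probabilities([0.5, 0.5], [0.8, 0.2])
print(posteriors)

# If the data are impossible under H2, its posterior probability is zero
ruled_out = posterior_probabilities([0.5, 0.5], [0.3, 0.0])
print(ruled_out)
```

Note that the posterior probabilities are relative to the set of hypotheses you supplied. If the true explanation is not among them, the calculation cannot tell you so.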


## Designing a Bioinformatic Analysis to test your hypotheses

We haven't yet talked about the technical details of specific methods for doing bioinformatic analysis. However, before we dive into the details it's very important that we understand what our analysis must do if it is to test one or more hypotheses.

- Your hypotheses should make different predictions about what the result of your analysis will be.

- The data you use to evaluate your hypotheses should be *new data* that was not used in formulating your hypotheses. It turns out to be relatively straightforward to devise hypotheses that fit existing observations. Many fewer hypotheses make useful predictions about future observations. Those that do tend to be the most useful (and most likely to be biologically correct).

- If you imagine the different results you *could* see, each possible outcome should have a clear meaning in terms of your hypotheses.







## Designing a Descriptive Bioinformatic Study

At this point it's important to note that not all bioinformatic analyses, nor all scientific papers, test a hypothesis. Sometimes, especially after something very new has been discovered, *descriptive studies* that explore what's out there as comprehensively as possible can be very important. In some ways, this is much easier than testing a hypothesis: you use many different tools and approaches to characterize what's out there, with the goal of uncovering new phenomena that can be the topic of future inquiries and investigations.

Here are some points to keep in mind when designing a descriptive study:

- If you are just starting out, I strongly recommend practicing hypothesis-driven analyses first. 

- Many successful descriptive studies are large in scope. Although they do not focus on hypothesis testing (at least not as a main focus), they present so much new data (genomes, physiological measurements, microbiomes, ecological surveys, transcriptomes, etc.) that they can become an important new reference data set for the community. Making the data fully open-access and clearly documented, and providing intermediate data products (e.g. summary tables), can help to make this type of study useful.

- It is easy for a descriptive study to become very muddled and hard to follow. It may read like 'one thing after another' without any coherent thread that allows other workers to make sense of it. Even if you don't have a hypothesis in mind, a specific question is still very useful in guiding your investigation and making it coherent.

- Many descriptive studies are *hypothesis generating*. That is, they describe new observations that are intriguing and demand explanation. As such, it is important for researchers developing a descriptive study to both be aware of existing hypotheses and theories in their area of research, and to be open to 'weird' observations that fall outside of existing areas of focus. Many important discoveries have been made by extremely careful, methodical, and relentless pursuit of why a 'weird' finding happened. In many other cases it turned out just to be an experimental error. 

## Exercises

**Exercise 1**: Write down a biological topic that interests you. Using the steps above, conduct a literature review on your topic. Using the template provided above, build a table of at least 10 papers.

**Exercise 2**: Based on your papers, try writing down a biological question. For example, you might ask "How does body size of mammals influence their range?" or "Why do some bacteria have very large genomes, while others have very small ones?" or "Over the course of plant evolution, what types of genes are most commonly or most rarely lost from genomes?".

**Exercise 3**: Once you've got your question, propose three hypotheses that represent potential answers to it. For example,

- hypothesis 1: "larger mammals tend to have smaller populations and therefore smaller total ranges"
- hypothesis 2: "larger mammals migrate further and therefore have larger ranges"

It doesn't matter if your hypotheses are correct or not - it's just important that they make predictions that can be tested.

**Exercise 4**: Consider what kind of data you would need to test your hypotheses. Are there data that would let you tell if one hypothesis was correct vs. another?

For instance, in the mammal range example, you'd need data on the body size and ecological range of many mammals. To test hypothesis 1, you would also need data on the population size for those mammals. To test hypothesis 2, you would also need data on how far the mammals in your study migrate.

Try searching on Google for databases or papers that have this information. For most bioinformatic analyses, it's best to look for resources that let you download raw data.

Describe the data you think you would need, and whether you were able to find a place where you could get it.

**Exercise 5**: Describe in very general terms how you think you might use the data to test the predictions of your hypotheses. 

For example, if larger mammal body sizes result in smaller total ranges, then if we could get data on mammal body size and ecological range for many mammals, we could plot body size on the x-axis and range on the y-axis of a scatterplot. If hypothesis 1 were correct, we'd expect a negative correlation (indicating that bigger mammals have smaller ranges). We could then test whether this negative correlation is statistically significant, or whether a correlation this strong could plausibly arise by chance.
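
As a rough sketch of what such a plan could look like in code, the Python below computes a Pearson correlation and then uses a simple permutation test to ask how often shuffled data produce a correlation at least as negative as the observed one. The numbers are invented placeholders rather than real mammal measurements, the variable and function names are hypothetical, and a permutation test is just one of several reasonable ways to assess significance.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """One-sided p-value for a negative correlation.

    Shuffles ys to break any real association with xs, then counts how
    often the shuffled correlation is at least as negative as the observed one.
    """
    rng = random.Random(seed)
    observed = pearson_r(xs, ys)
    shuffled = list(ys)
    as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if pearson_r(xs, shuffled) <= observed:
            as_extreme += 1
    # Add 1 to numerator and denominator so the p-value is never exactly zero.
    return (as_extreme + 1) / (n_permutations + 1)


# Invented placeholder values (NOT real data): body mass and range size,
# both on an arbitrary log scale, for eight imaginary mammal species.
log_body_mass = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
log_range_area = [5.2, 5.0, 4.6, 4.7, 4.1, 3.9, 3.6, 3.2]

r = pearson_r(log_body_mass, log_range_area)
p = permutation_p_value(log_body_mass, log_range_area)
print(f"r = {r:.2f}, one-sided permutation p = {p:.4f}")
```

With these placeholder values the correlation is strongly negative, which is the pattern hypothesis 1 predicts. With real data you would also want to plot the points and look at them before trusting any single summary statistic.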

Don't worry if this overall plan is a bit general at this point - think of it as a scaffolding that you can fill in with specific details later on.

## Further reading
- Aguiar *et al.*, Chemoecology, 2005. [How do slugs cope with toxic alkaloids?](https://link.springer.com/article/10.1007/s00049-005-0309-5)

## [Reading Responses & Feedback](https://docs.google.com/forms/d/e/1FAIpQLSeUQPI_JbyKcX1juAFLt5z1CLzC2vTqaCYySUAYCNElNwZqqQ/viewform?usp=pp_url&entry.2118603224=Project+Design)
