# Exercise: Revising your writing about Statistical Results

## General Principles

When you first write about statistical results, you are likely just trying to record your thoughts. As you move towards sharing those results with others, you will want to revise your writing to make it clearer and more readable. This may sound mysterious, but there are simple, straightforward steps you can take to clean up writing about statistical results.

First, it's good to write down what you are trying to communicate to the reader. Broadly, if you want your reader to understand a statistical test, you need to tell them:

- What you were trying to figure out 
- What statistical test you ran, and what specific question that test answered. 
- What data the test was done on. This description should include the sample size (n), and where a reader could find the raw data (e.g. in a Supplementary Data table? On GitHub?) if they wanted to rerun the test for themselves.
- The results of the test. This should include both the effect size (e.g. R^2) and significance (e.g. p-value) of the test. When reporting statistical significance it's important to specify if you used a one-tailed or two-tailed p-value. The one-tailed value is appropriate if you know ahead of time you only care about one type of difference (e.g. if the value exceeds the expected one), whereas the two-tailed p-value is appropriate if a difference in either direction (e.g. higher *or* lower) is interesting.
- If multiple statistical tests were run (e.g. of every gene in a genome), a description of how the test accounted for multiple comparisons. For example, you may have used Bonferroni correction to adjust the p-value or Benjamini-Hochberg FDR control and q values.
- How the statistical results relate to the broader biological question

All these details are very important to present. However, the exact format in which they are presented can vary depending on the paper. If you run the same test many times in a row, some of this information may be stated once, and not repeated each time you describe test results. If you are describing many tests, it may be more appropriate to present the details of the effect sizes and significance values in a table rather than in the text. 
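The difference between one-tailed and two-tailed p-values mentioned in the list above can be made concrete with a small sketch. It assumes, purely for illustration, a test statistic `z` that follows a standard normal distribution under the null hypothesis; the function name `normal_p_values` and the value 1.8 are hypothetical.

```python
import math


def normal_p_values(z):
    """Return (one_tailed_upper, two_tailed) p-values for a z statistic.

    Assumes the statistic is standard normal under the null hypothesis
    (an assumption made for this illustration only).
    """
    # Upper tail, P(Z >= z): "is the value higher than expected?"
    one_tailed_upper = 0.5 * math.erfc(z / math.sqrt(2))
    # Both tails, P(|Z| >= |z|): "is the value different in either direction?"
    two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return one_tailed_upper, two_tailed


one_tailed, two_tailed = normal_p_values(1.8)  # hypothetical z statistic
print(f"one-tailed p = {one_tailed:.3f}, two-tailed p = {two_tailed:.3f}")
```

With this hypothetical statistic the one-tailed p-value is about 0.036 and the two-tailed p-value about 0.072, so the same data would fall on different sides of a 0.05 threshold. That is why the text needs to say which one was used.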

## Structuring reports of statistical results

Often bioinformatic papers involve dozens of separate analyses. In some cases, similar analyses will be performed with different programs or methods to test whether the biological conclusion is *sensitive* to the type of analysis used (which would make you trust the conclusion less) or *robust* to these choices (which would make you trust it more). How should one describe all these results in writing? 

Well-written bioinformatic papers rarely just list results one after another. Instead, scientists tend to think about how results relate to each other, and then organize the written description around those relationships. This tends to make your descriptions of statistical results shorter and more fun to read.

The easiest way to see how this works is by example. Read the following description of statistical results from an analysis that measured gene expression of 4 genes in samples from either healthy patients or patients with a disease. In order to focus on the structure of the results, I've used generic names for all the genes and pathways:

>We analyzed our gene expression data using the edgeR software. Gene1 showed significantly higher expression in healthy samples as compared to diseased samples. Gene1 is a member of pathway A. Gene2, a member of pathway B, showed enrichment in diseased samples, but this difference was not significant. Expression of Gene3 was depleted in healthy samples when compared to diseased samples. Gene3 is a member of pathway B. Gene4 was depleted in diseased samples with respect to healthy samples. Gene4 is a member of pathway A.


- Did you enjoy reading this description? Why or why not?
- Does the text make the results clear?
- Does the text make you interested in learning more?
- Can you determine the validity of the statistical conclusions from the text or supporting materials it references?


In my mind, there are at least three related problems with the above text. It would be OK to write something like the above passage down at first, but during revision of our manuscript there are changes we could make that would improve the text.  

#### Problem #1 Lack of Parallelism

One problem with the text above is that it reports similar results in dissimilar ways. This hides some of the patterns in the data. For example, Gene1 is said to be significantly more highly expressed in healthy samples than diseased ones. Gene4 shows exactly the same pattern (higher expression in healthy samples), but is written in terms of lower expression in diseased samples vs. healthy ones. This hides that Gene1 and Gene4 are showing exactly the same pattern of expression across health states.

Parallelism has several meanings in writing, but a key one for science is that ideas that are functionally similar can be expressed in parallel or similar ways to aid in interpretation. Rewriting the results for parallelism still won’t make the above paragraph pretty, but it will at least make the results somewhat clearer.

#### Revision #1 Express results in a consistent way

Let us rewrite the paragraph with the rule that every expression difference is described in terms of healthy samples relative to diseased samples. After all, if a gene is less expressed in diseased samples than in healthy samples, that's the same thing as saying that it is more expressed in healthy samples than diseased ones.

Here’s an example of one way we could revise the text for consistency:

>Gene1 showed significantly higher expression in healthy samples as compared to diseased samples. Gene1 is a member of pathway A. Gene2 was depleted in healthy samples as compared to diseased samples, but this difference was not significant. Gene2 is a member of pathway B. Gene3 was depleted in healthy samples as compared to diseased samples. Gene3 is a member of pathway B. Gene4 was enriched in healthy samples as compared to diseased samples. Gene4 is a member of pathway A.

#### Problem #2 Similar ideas are not clustered.

In the above passage, genes are ordered by number (Gene1, Gene2, etc.). But this means that we first talk about one gene in Pathway A, then two genes in Pathway B, and then have to loop back to talking about a gene in Pathway A. Unless there is a reason the number of the gene is important, the paragraph could be made much easier to read by grouping the description of genes with similar results together. In many cases, you can then merge results described in multiple sentences into a single much more readable sentence.

#### Revision #2  Move related results together

Let's start by just rearranging the order of the sentences so genes in the same pathway are discussed close to each other. Check this out:

>Gene1 showed significantly higher expression in healthy samples as compared to diseased samples. Gene1 is a member of pathway A. Gene4 was enriched in healthy samples as compared to diseased samples. Gene4 is a member of pathway A. Gene2 was depleted in healthy samples as compared to diseased samples, but this difference was not significant. Gene2 is a member of pathway B. Gene3 was depleted in healthy samples as compared to diseased samples. Gene3 is a member of pathway B.

This text clearly still needs some work, and is still pretty boring, but I think it's already a bit better than the initial version. If you reread this version it will likely jump out at you that there are two genes that were more highly expressed in healthy patients, and two others that were less expressed in healthy patients vs. diseased ones. You might also notice that both of the genes expressed more highly in healthy patients are in pathway A, whereas both the genes that show lower expression in healthy patients are in pathway B. 

#### Problem #3 Lack of Synthesis.

The final problem we have to overcome is that the paragraph above is very list-y. It’s just one thing after another. While we have now used consistent phrasing and grouping to at least allow readers to draw connections among the results, the text itself does little to help them do so. This kind of list-y text is sometimes necessary as an intermediate step in writing, but it is - to me at least - often quite painful to write and boring to read. Another way of diagnosing this problem is to say that in this paragraph there is no synthesis of how these results are grouped together logically. If there really is no logical connection the author can discern, then a table might be a better way to list results. No one wants a table written out as text. Try filling out the following table based on the paragraph above:

| Gene | Pathway | Expression in Healthy  vs. Diseased Patients| Significant? |
|------|---------|---------------------------------------------|--------------|
| 1    |         |                                             |              |
| 2    |         |                                             |              |
| 3    |         |                                             |              |
| 4    |         |                                             |              |

Once the results are in a table, the patterns in them should be much easier to see. In our next revision we will *synthesize* the results by simply describing the patterns we see to the reader, like we would to a friend.

#### Revision #3 Synthesize results

First, let's take a look at how the results look when placed together in a table. Here's what I got when I filled in the table based on the discussion up above:

| Gene | Pathway | Expression in Healthy  vs. Diseased Patients| Significant? |
|------|---------|------------------------------|--------------|
| 1    |   A     |     Enriched                 |        *     |
| 2    |   B     |     Depleted                 |              |
| 3    |   B     |     Depleted                 |        *     |
| 4    |   A     |     Enriched                 |        *     |

Now that we have things in a table, the patterns in the results are very clear. We can simply describe them to the reader, as we would to a friend:

>Two genes in pathway A - Gene1 and Gene4 - showed higher expression in samples from healthy patients. In contrast, two genes in pathway B - Gene2 and Gene3 - showed higher expression in samples from patients with the disease.

Try reading the first version out loud and comparing it to this version. To me this reads much better. One thing we haven’t yet described is the statistical significance of each result, so we may want to include that. We’ll have to think a little about the best way to handle this, since all the comparisons except the one for Gene2 were significant:

>Two genes in pathway A - Gene1 and Gene4 - showed significantly higher expression in samples from healthy patients. In contrast, we found that two genes in pathway B - Gene2 and Gene3 - showed higher expression in samples from patients with the disease (this difference was statistically significant for Gene3 but not Gene2).
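The grouping step behind this kind of synthesis can also be sketched in code, which may help when there are far more than four results. The rows below are copied from the filled-in table above; the function name `group_by_pattern` and the output phrasing are hypothetical choices, not part of any analysis package.

```python
from collections import defaultdict

# Rows from the filled-in table:
# (gene, pathway, expression in healthy vs. diseased, significant?)
results = [
    ("Gene1", "A", "Enriched", True),
    ("Gene2", "B", "Depleted", False),
    ("Gene3", "B", "Depleted", True),
    ("Gene4", "A", "Enriched", True),
]


def group_by_pattern(rows):
    """Group genes that share both a pathway and a direction of change."""
    groups = defaultdict(list)
    for gene, pathway, direction, significant in rows:
        groups[(pathway, direction)].append((gene, significant))
    return dict(groups)


for (pathway, direction), genes in sorted(group_by_pattern(results).items()):
    names = " and ".join(gene for gene, _ in genes)
    # Flag any gene in the group whose difference was not significant.
    weak = [gene for gene, significant in genes if not significant]
    note = f" (not significant: {', '.join(weak)})" if weak else ""
    print(f"Pathway {pathway}: {names} {direction.lower()} in healthy samples{note}")
```

Each printed line corresponds to one sentence of the synthesized paragraph: one group per pathway, with the exception for Gene2 set off in parentheses.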


#### Problem #4 Lack of statistical details

Now that we have the general structure sorted out, we want to make sure we report relevant details from the analysis. Depending on the requirements of a scientific journal, these might be described at length in a separate Methods section. Nonetheless, it is usually good practice to at least remind readers of the most important aspects of a statistical test when describing the results. No one wants to have to flip back and forth between methods and results all the time!

Here are the key statistical details we said up above that it was important to include:

- What you were trying to figure out 
- What statistical question the test answered
- What data the test was done on, and ideally where to find these data (e.g. in a Supplementary Data table)
- The effect size (e.g. R^2) and significance (e.g. p-value) of the test
- If multiple statistical tests were run (e.g. of every gene in a genome), a description of how the test accounted for multiple comparisons (e.g. either using Bonferroni correction to adjust the p-value or using Benjamini-Hochberg FDR control and q values).
- How the statistical results relate to the broader biological question

So far, our description of results only talks about which genes were significantly enriched or depleted, but doesn't really describe the statistical methods, whether a correction for multiple comparisons was performed etc. There are several ways to try to incorporate this information. The best solution, in my opinion at least, depends on the context. 
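To make the two multiple-comparison corrections named in the list above less abstract, here is a minimal pure-Python sketch of both. The function names and the raw p-values are hypothetical; in a real analysis you would normally rely on the adjusted values your statistics package reports rather than on hand-written code like this.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (the q values referred to in this guide)."""
    m = len(p_values)
    # Indices of the p-values ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease with rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four tests (not taken from any real analysis).
raw_p = [0.01, 0.04, 0.03, 0.20]
print("Bonferroni:", bonferroni(raw_p))
print("Benjamini-Hochberg:", benjamini_hochberg(raw_p))
```

Bonferroni is the more conservative of the two, which is one reason a reader needs to know which correction was applied before interpreting a reported value.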

#### Revision #4 - Add Statistical Details, Rationale and Interpretation

Clearly, although we now have a better-organized description of our results, we have a lot of detail to add on either side.

##### Revising to Explain the Statistical Test and Relate it to the Broader Question
We'll 'pad out' these results by adding a couple of introductory sentences explaining what analysis we did and roughly how the software we used works, plus a final sentence interpreting the results. We should consider our interpretation carefully, and account for the most likely ways that the data could have been produced. If our result has more than one plausible interpretation, it's good to state that directly. In this case, for example, we might be excited by the idea that genes in pathway B have increased expression in patients with the disease as part of either the actual pathology of the disease or a side-effect of the disease. However, without additional information, it is also entirely possible that differences in gene expression in pathway B instead affect *susceptibility* to the disease.

>In order to study how this disease influences gene expression, we conducted RNA-seq on 20 healthy subjects and 20 subjects with the disease (see Methods; raw data are in Supplementary Data Table 1). We tested genes for differential expression using the edgeR software package. The edgeR software models gene expression in replicated samples using a negative binomial distribution, and then uses an exact test similar to Fisher's exact test to test for significant differences between categories ([Robinson et al., 2010](https://pubmed.ncbi.nlm.nih.gov/19910308/)). We controlled the False Discovery Rate (FDR) using Benjamini-Hochberg FDR control and q values. 

>Our results identified two pathways that were influenced by disease state. Two genes in pathway A - Gene1 and Gene4 - showed significantly higher expression in samples from healthy patients. In contrast, we found that two genes in pathway B - Gene2 and Gene3 - showed higher expression in samples from patients with the disease (this difference was statistically significant for Gene3 but not Gene2). This suggests either that pathways A and B are involved in some way in the pathology of the disease, or perhaps that external factor(s) that predispose patients to contracting the disease also influence gene expression of these pathways.


##### Revising to add effect sizes and significance

Smoothly incorporating effect size and significance without breaking the flow of text is a bit of an art. I find it generally one of the harder revision steps. In some cases, it proves best to consign these details to a Supplementary Table. Let's try to integrate them here. 

**Option 1: put details in a Supplementary Data table and reference it.** If taking this option, we would put all our statistical test results into a Supplementary Data table that would be available for download along with the manuscript. The data in each row could specify the groups compared (e.g. healthy vs. diseased), the test used, the sample size (how many data points went into the analysis), the effect size found (e.g. fold change or whatever is appropriate), raw p-values, and FDR q values (after false discovery rate control). Then in the main text we can simply report the main takeaway findings and refer the reader to this table for additional information:

>In order to study how this disease influences gene expression, we conducted RNA-seq on 20 healthy subjects and 20 subjects with the disease (see Methods; raw data are in Supplementary Data Table 1). We tested genes for differential expression using the edgeR software package. The edgeR software models gene expression in replicated samples using a negative binomial distribution, and then uses an exact test similar to Fisher's exact test to test for significant differences between categories ([Robinson et al., 2010](https://pubmed.ncbi.nlm.nih.gov/19910308/)). We controlled the False Discovery Rate (FDR) using Benjamini-Hochberg FDR control and q values. 

>Our results identified two pathways that were influenced by disease state. Two genes in pathway A - Gene1 and Gene4 - showed significantly higher expression in samples from healthy patients (for significance and effect size see Supplementary Data Table 2). In contrast, we found that two genes in pathway B - Gene2 and Gene3 - showed higher expression in samples from patients with the disease (this difference was statistically significant for Gene3 but not Gene2). This suggests either that pathways A and B are involved in some way in the pathology of the disease, or perhaps that external factor(s) that predispose patients to contracting the disease also influence gene expression of these pathways.
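One possible layout for such a Supplementary Data table is sketched below as a CSV writer. The column names are assumptions based on the fields suggested above, and the single example row holds placeholder values for illustration only; they are not real results from this or any analysis.

```python
import csv
import io

# Hypothetical column names covering the fields suggested for each row.
FIELDS = ["gene", "comparison", "test", "n_healthy", "n_diseased",
          "log2_fold_change", "p_value", "q_value"]


def write_results_table(rows, handle):
    """Write one row per statistical test to a CSV file-like object."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Placeholder values only; these numbers are invented for the example.
example_rows = [
    {"gene": "Gene1", "comparison": "healthy vs. diseased",
     "test": "edgeR exact test", "n_healthy": 20, "n_diseased": 20,
     "log2_fold_change": 1.5, "p_value": 0.001, "q_value": 0.004},
]

buffer = io.StringIO()
write_results_table(example_rows, buffer)
print(buffer.getvalue())
```

Writing every tested gene to the table, not only the ones discussed in the text, lets readers look up results the authors did not highlight.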

**Option 2: put statistical details directly into the text**. For small statistical analyses, you may prefer to put the relevant information directly into the text. If taking this option, we would report each statistical test along with its effect size (e.g. fold change or whatever is appropriate for the test used), raw p-value, and FDR q value (after false discovery rate control) in the sentence that describes the result. In this case since we tested many genes, we should really include a supplementary data table summarizing those results. It is highly likely that readers will want to look through the list for other genes that we weren't interested in, and we want to make sure they can see *all* the data, not just the parts we thought were important. So probably option 1 is best. That said, let's consider another hypothetical example to show how we would phrase this in general. Imagine we were trying to correlate weight to mating success in male Tuatara. We might say something like:

> Among more than 50 surveyed Tuatara, male weight was weakly but significantly correlated with mating success (Spearman r^2 = 0.09, two-tailed p = 0.034).
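For readers unfamiliar with the statistic in that sentence, the following sketch shows how a Spearman rank correlation is obtained: rank both variables, then take the Pearson correlation of the ranks. The function names and the five measurements are invented purely to exercise the code. The sketch does not compute a p-value, which would normally come from a statistics library.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical measurements, invented for this example only.
weights = [410, 520, 480, 600, 450]
matings = [1, 3, 2, 3, 0]
rho = spearman_rho(weights, matings)
print(f"n = {len(weights)}, Spearman rho = {rho:.2f}, rho squared = {rho ** 2:.2f}")
```

Note that squaring the coefficient discards its sign, so a sentence that reports only the squared value should still say in words whether the relationship was positive or negative.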




#### Problem #5 Wordiness

Our results description is getting to be pretty good. As a final editing step, let's see if we can find any places where overly wordy expressions could be replaced by something simpler *while still accurately conveying our results*. 

Why edit for wordiness? Scientific manuscripts are often subject to strict word limits. Even when they are not, replacing wordy expressions with more concise ones can sometimes make the text more readable. This is not a hard-and-fast rule, but something you can try. Often, when you read both versions out loud, the more concise one will sound better. In fact, editing your work to fit into strict word limits is one way to rapidly improve your scientific writing. It forces you to consider what information is important, and to try several ways of expressing your ideas.

#### Revision #5 Replace wordy expressions with more concise ones

Read through the above paragraph. See if you can find any phrases or sets of words that could be replaced with shorter versions without changing the meaning. Once you have, read on.

**Replace stock phrases with shorter alternatives**

Let's read the first sentence:

>In order to study how this disease influences gene expression, we conducted RNA-seq on 20 healthy subjects and 20 subjects with the disease (see Methods; raw data are in Supplementary Data Table 1).

The 'in order to' at the start of the sentence isn't doing much work. It could simply be replaced with 'to':

>To study how this disease influences gene expression, we conducted RNA-seq on 20 healthy subjects and 20 subjects with the disease (see Methods; raw data are in Supplementary Data Table 1).

You may find other stock phrases that can be shortened without changing the meaning or readability of the text. In those cases, consider using the shorter one. In this case we saved two words for free. These small savings add up surprisingly quickly.

Try this out in this sentence:

>In contrast, we found that two genes in pathway B - Gene2 and Gene3 - showed higher expression in samples from patients with the disease (this difference was statistically significant for Gene3 but not Gene2). 

Here, the phrase 'we found that' is doing little to convey meaning. We can easily cut it and not lose anything.


>In contrast, two genes in pathway B - Gene2 and Gene3 - showed higher expression in samples from patients with the disease (this difference was statistically significant for Gene3 but not Gene2). 

We could consider also eliminating 'In contrast', but that phrase is doing useful work. It reminds us that the results for pathway B are *different* from the result for pathway A that we mentioned just before this. I would leave it in unless we were desperate to shorten the manuscript.

A final way to shorten this sentence would be to shorten 'higher expression in samples from patients with the disease' to 'higher expression in disease samples' or 'higher expression in disease'.

Here's one more to try:

> The edgeR software models gene expression in replicated samples using a negative binomial distribution, and then uses an exact test similar to Fisher's exact test to test for significant differences between categories ([Robinson et al., 2010](https://pubmed.ncbi.nlm.nih.gov/19910308/)).

The phrase 'and then' can be safely replaced with 'then' without changing the meaning. Similarly, we could rewrite 'test for significant differences between categories' to 'test categories for significant differences' and save a word. Let's make those edits:

> The edgeR software models gene expression in replicated samples using a negative binomial distribution, then uses an exact test similar to Fisher's exact test to test categories for significant differences ([Robinson et al., 2010](https://pubmed.ncbi.nlm.nih.gov/19910308/)).






Taken together we would then edit the paragraph to something like this:

>To study how this disease influences gene expression, we conducted RNA-seq on 20 healthy subjects and 20 subjects with the disease (see Methods; raw data are in Supplementary Data Table 1). We tested genes for differential expression using the edgeR software package. The edgeR software models gene expression in replicated samples using a negative binomial distribution, then uses an exact test similar to Fisher's exact test to test categories for significant differences ([Robinson et al., 2010](https://pubmed.ncbi.nlm.nih.gov/19910308/)). We controlled the False Discovery Rate (FDR) using Benjamini-Hochberg FDR control and q values. 

>Our results identified two pathways that were influenced by disease state. Two genes in pathway A - Gene1 and Gene4 - showed significantly higher expression in samples from healthy patients (for significance and effect size see Supplementary Data Table 2). In contrast, two genes in pathway B - Gene2 and Gene3 - showed higher expression in disease samples (this difference was statistically significant for Gene3 but not Gene2). This suggests either that pathways A and B are involved in some way in the pathology of the disease, or perhaps that external factor(s) that predispose patients to contracting the disease also influence gene expression of these pathways.





## Summary

I hope this guide provided a useful overview of how one might begin editing initial notes on statistical results into manuscript text. It is hardly comprehensive, nor are the specific problems and fixes highlighted in this text the only ones you'll encounter. The main things that have proven useful for me are:

- Writing is an iterative process. Don't feel like you have to produce text that looks like what you see in papers when you first write down results. You often have to write some bad text in order to have something to revise. 

- It's good to focus each revision pass over a given section on a particular type of problem. For example, you might allocate a certain writing time to ensuring that all citations in a section are in order, or that each paragraph in the overall document has a clear transition to the next. 

- A few things that will often improve writing about statistical results include:
    1) express similar results in a similar way,
    2) reorganize the order in which results are presented so similar results go together,
    3) synthesize results rather than merely listing them,
    4) check that statistical tests are explained and that appropriate statistical details are included, and
    5) replace wordy phrases and jargon with more direct language, where appropriate.

## Exercises

**Revise the following passage**. See if you can express the same information in a way that is more compact, better organized, or more readable: 

"In order to estimate and compare the effect of island size on the diversity of all vertebrate animals, we used line transects in 10 locations per island on each of our 17 study islands to quantify the counts of readily observable species (mostly large mammals, birds, and reptiles), and then correlated the results against the overall size of each island using Pearson correlation."


**Practice revising your writing for wordiness**. Find a paragraph or paper section that you have previously written about something scientific. Use Word Count in Google Docs or Microsoft Word to get a word count. Try to cut ~20% of the words without affecting the meaning. Does the revised section read better when read aloud? Which revisions worked well and which didn't? Could you cut the word count down to 50% of the original without changing the meaning too much?
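If you draft in plain text or Markdown rather than a word processor, a few lines of code can track the word count for this exercise. This is only a sketch: the function names are hypothetical, and counting whitespace-separated tokens may differ slightly from what Google Docs or Microsoft Word reports. The two sentences are shortened versions of the 'In order to' example from earlier in this guide.

```python
def word_count(text):
    """Count whitespace-separated words in a piece of text."""
    return len(text.split())


def percent_cut(original, revised):
    """Percentage of the original words removed by a revision."""
    before, after = word_count(original), word_count(revised)
    return 100 * (before - after) / before


original = ("In order to study how this disease influences gene expression, "
            "we conducted RNA-seq on 20 healthy subjects and 20 subjects with the disease.")
revised = ("To study how this disease influences gene expression, "
           "we conducted RNA-seq on 20 healthy subjects and 20 subjects with the disease.")

print("words saved:", word_count(original) - word_count(revised))
print(f"{percent_cut(original, revised):.1f}% of words cut")
```

Running `percent_cut` after each revision pass gives a quick check on how close you are to the ~20% target.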

