# Biological Sequences

Sequences - ordered arrangements of repeated elements - are ubiquitous in biology. They include the DNA that makes up our genome, RNA that transmits the information in that genome to ribosomes for manufacture into proteins, and the strings of amino acids that compose proteins.

This section will introduce multiple ways in which we can represent such sequences as text and use them to calculate useful quantities. 

#### Prerequisites for this section

None.



## Sequences in Biology

Let's first review some of the biology of DNA, RNA and Protein sequences. Those already familiar with basic molecular biology and the Central Dogma should skim down to the heading on how to represent biological sequences in text. 

### The Central Dogma

Flow of information between different types of biological sequences is important in the biology of cells. The [Central Dogma of molecular biology](https://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology) is a statement of which routes of information transmission are typical, which are rare, and which never happen. 

In cellular organisms, DNA forms the genome. That genome can be replicated by proteins called DNA polymerases to generate new copies of itself during reproduction. This replication is not perfect, and can introduce random changes in the DNA (mutations), which are an important source of variation among organisms. Overall, though, DNA replication is quite accurate under normal circumstances. A key feature of DNA replication is that it is *semi-conservative*. This means that each two-stranded copy of the original two-stranded DNA genome contains one newly synthesized strand and one of the parental strands.

Some DNA forms genes. These contain the information needed to form the sequence of a protein (or, more rarely, a functional RNA). But proteins are not produced from DNA directly. DNA must first be **transcribed** into RNA before translation can occur. Critically, each gene is transcribed at different rates under different conditions and at different points in time. This is important because cells in our liver and in our eye both inherit the same DNA. The main reason these tissues are so different is that they differ in which genes are transcribed into RNA (and therefore can be translated into protein).

Finally, RNA can be **translated** into Protein in a structure with both protein and RNA elements called a ribosome. Much as with transcription, not *all* RNAs are translated into protein (this is known as **translational regulation**). In the ribosome, 3 letter codes in the RNA (known as **codons**) guide the addition of new **amino acids** to a new protein. 

A key tenet of the central dogma is that once proteins are formed, information does not flow backwards to RNA or DNA. That is, while proteins can be generated from the information in DNA or RNA in the cell, the opposite is not true.

Let's consider each of the types of sequences involved in the Central Dogma in more detail.

### Nucleotide sequences
* DNA
* RNA

Nucleotide sequences include DNA and RNA. Although DNA and RNA can differ by only a single hydroxyl group in chemical structure, they play very different roles in the cell. 

**DNA** For cellular organisms (and some but not all viruses) the genome is formed by one or more chromosomes of double-stranded DNA. DNA can be thought of as a form of stable long-term storage of genetic information, very loosely analogous to a computer's hard drive. The DNA is composed of **nucleotides**, which are often abbreviated `nt`.

The double-stranded string of DNA nucleotides has some **coding regions** that encode the information needed to put together proteins. Other regions are **non-coding regions**. While much non-coding DNA may have no specific benefit to the cell, other non-coding sequences hold important information such as sites where specific proteins or protein complexes should bind to the DNA to do things like starting DNA replication or transcribing a particular gene in the DNA into RNA. Broadly speaking, mammals like humans tend to have far more non-coding DNA than single-celled organisms, which in turn tend to have more non-coding DNA than viruses. This is thought to be both because carrying extra DNA is a greater burden on fast-reproducing organisms, and because the larger population sizes of viruses or bacteria may allow for 'streamlining' of the genome to remove unneeded non-coding regions, even though the cost of carrying them is minimal.

The word **gene** refers to any sequence of DNA that encodes an RNA. **Protein-coding genes** encode RNAs that go on to serve as the basis for manufacture of a protein. **Non-protein-coding genes** encode RNAs that serve other functions in the cell. Thus, all **coding regions** are genes, but not all **genes** are coding regions.

**RNA** plays several roles in cells. The most common is to transmit the information in protein-coding genes to ribosomes, where that information can be used to make proteins. In this way, RNA can serve as a temporary form of portable storage for the information contained in the DNA. You could think of it as very loosely analogous to a computer's RAM, in the limited sense that programs are loaded into RAM before they can be executed, and there are generally many more programs on a hard drive than there are running programs at any given moment. Similarly, just as computers typically have more hard drive space than RAM, the DNA genome for organisms is much longer than any individual RNA.

In addition to its role in information transmission, RNA also plays direct functional roles in the cell. It contributes to the structure of some cellular complexes like the ribosome (which translates RNA into protein). It has also been discovered that RNA can catalyze (speed up) reactions, a finding for which [Sidney Altman and Thomas Cech won the 1989 Nobel Prize in Chemistry](https://www.nobelprize.org/prizes/chemistry/1989/summary/).

### Amino Acid Sequences 
* Protein 

**Protein**'s primary role in cells is to catalyze reactions. Catalysis causes reactions that would be energetically favorable anyway to happen much faster than they otherwise would. Although both RNA and proteins can catalyze reactions, proteins catalyze many more of them. Protein also plays important roles in cellular structures such as actin filaments.

### Overview of Common Biological Sequences

Here is an overview of differences between DNA, RNA and protein:

| Sequence Type     | Produced from     | Main functions |Units | Types of Units | Number of units |
| :---| :----|:--- |:--- |:--- |:---: |
| DNA      | DNA by DNA replication or (more rarely) from RNA by reverse-transcription (e.g. in retroviruses)   | Information storage |Nucleotides       | Adenosine, Thymidine, Guanine, Cytosine | 4  | 
| RNA      | Transcription of RNA from DNA | Information transmission, catalysis of some reactions, information storage in some viral genomes (e.g. retroviruses) | Nucleotides       | Adenosine, Uracil, Guanine, Cytosine |4  |   
| Protein  | Translation of RNA to Protein in a ribosome | Catalysis of reactions, structural roles | Amino acids       | Alanine, Arginine, Asparagine, <br> Aspartic acid, Cysteine, Glutamine, <br> Glutamic acid, Glycine, Histidine,<br>Isoleucine,Leucine,Lysine,<br>Methionine,Phenylalanine,Proline,<br>Serine,Threonine,Tryptophan,<br>Tyrosine,Valine| 20* |


\* There are 20 'standard' amino acids that are common in human biology. However, other amino acids appear in biology. For example, under certain circumstances 'UGA' codons - which usually halt translation - can produce selenocysteine, a 21st amino acid discovered in 1986. Further reading: [The 21st Amino Acid (Atkins and Gesteland 2000)](https://www.nature.com/articles/35035189)

## Important features of biological sequences


### Evolutionary relatedness and gene families

Mutations are changes in the sequence of the DNA. Where mutations occur is mostly random. Because the DNA in genes is transcribed into RNA, which may then be translated into protein, mutations in the DNA go on to influence the sequence of RNA, and can also influence the amino acid sequence of protein. 

These changes might substitute one nucleotide with another (e.g. `A` -> `T`), add one or more nucleotides (`A` -> `AAAAAA`), delete one or more nucleotides (`CCAGCA` -> `CC`), or duplicate one or more nucleotides (`CG` -> `CGCG`). 

This last point - that DNA can be duplicated - means that genes in a cell are not all equally related to one another. Instead, over the course of evolution, single genes are sometimes duplicated into two copies ('gene duplication'). This means that over time, whole families of genes can be produced by gene duplication. These are called *gene families*. Members of the same gene family will tend to have more similar DNA sequences than genes from outside that gene family. Often, members of a given gene family encode proteins with similar functions. As time goes on, members of gene families will undergo mutations that change their sequence, which can lead to changes in the function of the protein or functional RNA that those genes encode. 
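
The four kinds of mutation described above can be sketched as simple operations on a text string. This is only an illustration: the function names and index-based arguments are made up for this example, and real mutations are of course not specified by position in this way.

```python
# Illustrative sketch: mutation types as operations on a Python string.
# Function names and arguments are hypothetical, chosen for this example.

def substitute(seq, pos, new_base):
    """Replace the nucleotide at index pos with new_base."""
    return seq[:pos] + new_base + seq[pos + 1:]

def insert(seq, pos, new_bases):
    """Insert new_bases immediately before index pos."""
    return seq[:pos] + new_bases + seq[pos:]

def delete(seq, start, end):
    """Remove the nucleotides from index start up to (not including) end."""
    return seq[:start] + seq[end:]

def duplicate(seq, start, end):
    """Copy seq[start:end] and place the copy right after the original."""
    return seq[:end] + seq[start:end] + seq[end:]

# The same examples used in the text:
print(substitute("A", 0, "T"))   # T
print(insert("A", 1, "AAAAA"))   # AAAAAA
print(delete("CCAGCA", 2, 6))    # CC
print(duplicate("CG", 0, 2))     # CGCG
```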
 

### Complementarity and base pairing in nucleotide sequences 

A key feature of both RNA and DNA nucleotide sequences (but not amino acids) is complementarity. Consider double-stranded DNA. Each base in one strand must pair with a base in the other strand. But not all pairings are equal. Generally, Adenosine (A) nucleotides pair with Thymidine (T), and Guanine (G) nucleotides pair with Cytosine (C). Biology students memorizing this pattern of complementation often state it in shorthand based on the letter codes of each nucleotide:

> In DNA, A pairs with T, and G pairs with C

Base pairing is also important in single-stranded RNA. Since there is not another strand to pair with, single-stranded RNA can pair with other nucleotides in its own sequence, causing the initially linear RNA to fold up into more complex **secondary structures**. If you hold up the end of a shoelace and fold it back on itself over a finger you will form something that looks like a **stem-loop**, a common type of RNA secondary structure. When RNA pairs with itself, complementation works much as it does in DNA, except that Uracil (U) replaces Thymidine. Thus A pairs with U and C pairs with G.

> In RNA, A pairs with U and G pairs with C

Complementarity between RNA and DNA is also important for the process of transcription, in which RNA is produced from DNA. During transcription, an `A` in the DNA will pair with a `U` in RNA.
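
The pairing rules above translate directly into lookup tables. The sketch below is a minimal illustration (the names `DNA_COMPLEMENT`, `RNA_COMPLEMENT` and `complement` are chosen for this example); it assumes the sequence contains only the four standard uppercase letters, and it returns the partner bases position by position without accounting for strand direction.

```python
# Pairing rules from the text, written as dictionaries.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def complement(seq, pairing):
    """Return the base that pairs with each position of seq."""
    return "".join(pairing[base] for base in seq)

print(complement("AAT", DNA_COMPLEMENT))   # TTA
print(complement("GAUC", RNA_COMPLEMENT))  # CUAG
```

Taking the complement twice returns the original sequence, which is one way to sanity-check a pairing table.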

### Transcription, complementation and conventions 

DNA is double-stranded, with each strand being complementary to the other (with rare exceptions). If we say a gene contains the sequence 'AAT', it is therefore critical that we know which strand of DNA we are talking about, because an 'AAT' on one strand will be paired to a 'TTA' on the other. This raises the question of which strand we're referring to.

When coding DNA is transcribed into RNA, each DNA codon is transcribed into its complement. Because the two DNA strands are complementary, it is critical to know which strand is being transcribed in order to be able to accurately predict what RNA will be produced. However, there is a convention for recording DNA sequences that makes this very easy to predict, even if it can be extremely confusing at first. 

By convention, DNA sequences record the 'coding' or 'sense' strand, which runs from the 5' to 3' end of the DNA molecule. The other strand of the DNA is the 'template' or 'anti-sense' strand, which runs 3' to 5' in the DNA molecule. When transcription happens, it is from the *template* strand, **not the coding strand**. Therefore, the final RNA produced has the same sequence as the original (coding or sense strand) DNA, except that any `T` nucleotides have been replaced by `U`. This is essentially because we have taken the complement of the complement of the original DNA.

Here's an example:  the sequence of a DNA codon is 'AAT'. Unless otherwise specified, you should assume sequences are reported for the sense strand, in the 5' to 3' direction. So the antisense strand will have a 'TTA', because this is the complement of 'AAT'. When transcription of RNA from this DNA happens, it will be from the *anti-sense* strand. During transcription, each anti-sense nucleotide will be paired with a complementary RNA nucleotide. So the RNA sequence produced by transcription will be `AAU` - the exact same as the original `AAT` sequence, except that the `T` has been replaced with a `U`. 

Although a bit confusing at first, this convention of reporting the 'coding strand' DNA sequence (even though transcription makes use of the opposite, 'template strand' sequence) makes it easy to predict RNA sequences from reported DNA sequences by just replacing each `T` with a `U`.
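
To see that the shortcut and the 'long way' agree, here is a small sketch that transcribes the example codon both ways. The function names are invented for this illustration, and the code works position by position without modelling strand direction.

```python
# DNA-DNA pairing (coding strand vs. template strand)
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
# DNA-RNA pairing during transcription (template DNA base -> RNA base)
DNA_TO_RNA_PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe_shortcut(coding_dna):
    """Predict the RNA from the reported (coding strand) DNA: swap T for U."""
    return coding_dna.replace("T", "U")

def transcribe_via_template(coding_dna):
    """Predict the RNA by first deriving the template strand, then pairing RNA to it."""
    template = "".join(DNA_COMPLEMENT[base] for base in coding_dna)
    return "".join(DNA_TO_RNA_PAIRING[base] for base in template)

print(transcribe_shortcut("AAT"))      # AAU
print(transcribe_via_template("AAT"))  # AAU
```

Both routes give the same answer because the second one takes the complement of the complement.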


### RNA codons encode amino acids in a 'degenerate' fashion

In coding regions of DNA, nucleotides are arranged into 3 letter chunks known as [codons](https://en.wikipedia.org/wiki/Genetic_code). Most of these 3 letter codons encode a particular amino acid, while three 'stop codons' indicate places that translation should stop, and one 'start codon' `AUG` indicates the start of translation as well as the amino acid methionine. 

During translation, each codon is physically matched to a 3-letter anti-codon on a small transfer RNA (tRNA). The other end of that tRNA has a specific amino acid attached to it. (Separately from translation,  amino acids are attached to tRNAs in the cell by an enzyme called an aminoacyl tRNA synthetase). During translation, each codon is matched by a tRNA, which in turn adds its amino acid to the growing protein.

It's worth thinking about this system from the point of view of how information is encoded, because it has some important features that are a little strange. First, let's consider how many amino acids a 3 letter codon could encode, if each letter was one of 4 standard nucleotides. If codons were just 1 letter long, we could only encode 4 amino acids, since we'd have to map each nucleotide to a single amino acid. If codons were 2 nucleotides long, we could encode 16 amino acids by using every combination of the 4 DNA letters at the 1st position of the codon with every possible second letter as a different code (e.g. 'AA' would map to one amino acid, 'AT' to another, so in total we'd have 4 x 4 = 16 amino acids). With real codons of length 3, we could in theory encode as many as 4 * 4 * 4 = 64 possible amino acids!
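
The counting argument above can be checked by simply enumerating every possible code of a given length. This sketch uses `itertools.product` from the Python standard library; the helper name `possible_codes` is made up for this example.

```python
from itertools import product

NUCLEOTIDES = "ACGU"

def possible_codes(length):
    """List every distinct string of the given length over the 4 nucleotides."""
    return ["".join(letters) for letters in product(NUCLEOTIDES, repeat=length)]

for length in (1, 2, 3):
    print(length, len(possible_codes(length)))
# 1 4
# 2 16
# 3 64
```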

So we've seen that in theory 3 letter codons could encode up to 64 different amino acids. However, most cells only encode 20 amino acids plus 1 start codon and 3 stop codons. This raises the question of what to do with the 'extra' codons we don't need. Evolution has favored using these 'extra' codons to encode amino acids in a redundant or 'degenerate' fashion. Thus, while each codon maps to just one amino acid (with the exception of the 3 stop codons, which map to none), each amino acid can be encoded by multiple different codons! For example, either a 'UUU' or a 'UUC' codon in RNA will produce Phenylalanine. However, the number of codons per amino acid varies: Leucine can be encoded by `CUA`, `CUC`, `CUG`, `CUU`, `UUA`, or `UUG` (6 possibilities) while Methionine is only encoded by `AUG`. As the four Leucine codons beginning with `CU` illustrate, the first two letters of the codon are often most important, while changing the 3rd letter alters the amino acid less commonly. Even more subtly, organisms don't always use these different possible codons for a given amino acid equally! This phenomenon is called 'codon usage bias'. 
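
One way to explore this degeneracy is to build the codon table in code and then invert it. The sketch below encodes the standard genetic code compactly as a 64-letter string of one-letter amino acid codes (with `*` for stop), ordered with the bases `U`, `C`, `A`, `G` at each position. The variable and function names are illustrative, and the toy `translate` function assumes the sequence is already in frame and starts at its first letter (it does not search for a start codon).

```python
from collections import defaultdict
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, in UCAG order:
# first base varies slowest, third base varies fastest. '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Invert the table: which codons encode each amino acid?
codons_for = defaultdict(list)
for codon, amino_acid in CODON_TABLE.items():
    codons_for[amino_acid].append(codon)

def translate(rna):
    """Translate an in-frame RNA sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(len(CODON_TABLE))           # 64
print(codons_for["F"])            # ['UUU', 'UUC']
print(codons_for["M"])            # ['AUG']
print(len(codons_for["*"]))       # 3
print(translate("AUGUUUCUAUGA"))  # MFL
```

Looping over `codons_for` and printing the length of each list is a quick way to see how unevenly the 64 codons are shared out among the amino acids.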

#### Importance of understanding the degeneracy of the DNA code


Understanding the degeneracy of the DNA code, and the molecular biology of how tRNAs are matched to particular amino acids by aminoacyl tRNA synthetases, is important for some new and exciting areas of biotechnology. For example, by understanding these concepts well enough to manipulate them, some labs have been able to trick cells into attaching new non-standard amino acids (NSAAs) to certain tRNAs, thereby expanding the genetic code to include amino acids that are not normally used in nature! At least 71 such [non-standard amino acids](https://en.wikipedia.org/wiki/Expanded_genetic_code) have been introduced into various kinds of cells to form artificial, expanded genetic codes. 

The observation that changing the 3rd codon position often doesn't change an amino acid has important implications for studying evolution using genomes, because mutations at the 3rd codon position are typically less strongly impacted by natural selection compared to mutations at positions 1 or 2.

Finally, the observation that organisms exhibit codon usage bias is important for genetic engineering. It turns out that one reason for codon usage bias is that different organisms have different abundances of particular tRNAs. So for example, perhaps the tRNAs with the anticodon for 'CUA' are far less abundant than those with the anticodon for 'CUU' in a particular cell. If tRNAs with a particular anticodon are very rare, it can slow down translation. So if you want to make *a lot* of a protein, you would want to use codons corresponding to tRNAs that you have in abundance, perhaps using 'CUU' rather than 'CUA' to encode Leucine. Evolution has no such forward-looking intention, but mutations that changed a 'CUA' codon to 'CUU' in our example organism might be selected and take over the population if they let that organism make more of a key protein, and therefore survive and reproduce more often. Indeed, studies of codon usage bias often find particular codons that are preferred in highly expressed proteins. This helps the cell quickly make those proteins in vast quantities.
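
A first step in studying codon usage bias is simply counting how often each codon appears in a coding sequence. The sketch below does this with `collections.Counter`. The short RNA sequence is invented for illustration (it is not from a real gene), and the function names are hypothetical. For simplicity it only compares the four Leucine codons that begin with `CU`.

```python
from collections import Counter

def codon_counts(coding_rna):
    """Count each 3-letter codon in an in-frame coding sequence."""
    codons = [coding_rna[i:i + 3] for i in range(0, len(coding_rna) - 2, 3)]
    return Counter(codons)

def relative_usage(counts, synonymous_codons):
    """Fraction of the time each codon is used, among a set of synonymous codons."""
    total = sum(counts[codon] for codon in synonymous_codons)
    if total == 0:
        return {}
    return {codon: counts[codon] / total for codon in synonymous_codons}

# Made-up example sequence: AUG CUU CUU CUA CUU UGA
counts = codon_counts("AUGCUUCUUCUACUUUGA")
print(counts["CUU"], counts["CUA"])  # 3 1

LEUCINE_CU_CODONS = ("CUA", "CUC", "CUG", "CUU")
print(relative_usage(counts, LEUCINE_CU_CODONS))
# {'CUA': 0.25, 'CUC': 0.0, 'CUG': 0.0, 'CUU': 0.75}
```

In a real analysis you would pool counts across many genes before drawing any conclusion about which codons an organism prefers.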



#### A Side Note about Start Codons and Methionine

You might have noticed an ambiguity in the description up above. We said `AUG` typically indicates the start of translation *and* the amino acid methionine. That might lead you to wonder whether most proteins start with methionine, or whether these are two different meanings for 'AUG'. The answer is that because the 'start codon' for most proteins also encodes methionine, the amino acid methionine does indeed start most new proteins. However, *after* proteins are made, [this initial methionine is often cleaved from the protein](https://www.nature.com/scitable/topicpage/translation-dna-to-mrna-to-protein-393/#:~:text=Although%20methionine%20(Met)%20is%20the,methionine%20is%20removed%20after%20translation.) (i.e. cut off). So if you look at mature proteins, you will see that not all of them start with methionine, though far more will than you would expect by chance. In addition to this cleavage of methionines, in bacteria (and in organelles like mitochondria and chloroplasts that evolved from bacteria by endosymbiosis) the starting methionine is often chemically modified to formylmethionine.





### Hydrophobicity and bonds in amino acid sequences

Unlike DNA and RNA, the amino acids that make up proteins don't link up into complementary pairs. Amino acids do however differ in chemical properties, and those differences are important for the function of the proteins they compose. One key property of amino acids is their degree of [hydrophobicity](https://en.wikipedia.org/wiki/Hydrophobe#:~:text=In%20chemistry%2C%20hydrophobicity%20is%20the,hydrophiles%20are%20attracted%20to%20water.). The word "hydrophobic" means water-fearing, and is the antonym of "hydrophilic", or water loving. Just like olive oil and vinegar won't easily mix, hydrophobic amino acids will tend to clump together, away from water. They might clump together in the middle of a protein, at the site where one protein touches another, or in a place where a protein touches or passes through another hydrophobic substance, like the interior of the fatty [cell membrane](https://en.wikipedia.org/wiki/Cell_membrane) that encloses our cells.

Some hydrophilic amino acids are either positively or negatively charged. Just like magnets stuck together the wrong way repel one another, amino acids with the same charge will tend to avoid interacting. On the other hand, amino acids of opposite charge can form ionic bonds that hold them together. 

Other amino acids, called polar amino acids, don't have a full negative or positive charge, but have a slight difference in the distribution of charge that makes them partially negative or positive. Such amino acids can interact with one another through hydrogen bonds. 

Finally, Cysteine amino acids can form a special kind of durable bond (called a covalent bond) that locks them together with other Cysteine amino acids. The special kind of covalent bond formed by Cysteines is called a disulfide bridge. When present, disulfide bridges are very strong compared to the hydrogen bonds formed by polar amino acids or the ionic bonds formed by charged amino acids.


### 3D Structure and Chemical Modification

Although DNA, RNA and Protein can all be expressed as linear sequences, they also form more complex 3D structures. These structures are often important for function. The role of complementation and base-pairing in RNA secondary structure was mentioned above. But in addition to this secondary structure, RNA can fold into more complex tertiary structures.

Protein also has [secondary structure](https://en.wikipedia.org/wiki/Protein_secondary_structure). This includes spiral or sheet-like structures called alpha-helices and beta-sheets. These alpha-helices and beta-sheets can fold further to form the overall 3D structure of the protein. Many proteins that are enzymes (i.e. that catalyze a reaction) have an active site where the reaction occurs. Changes to the protein sequence that are in the active site often have a bigger effect on the function of that protein than changes at other locations. 

Finally, the 3D structure of proteins in a cell can be dynamic: steroid receptors like the estrogen receptor, for example, change shape when they bind to estrogen. Similarly, many proteins can become **phosphorylated** by a class of proteins called kinases. This adds phosphate groups which, without changing the amino acid sequence, can alter the protein's 3D shape and therefore sometimes also its function and/or interactions with other proteins. For example, these changes in shape (called conformational changes) may in turn expose sites favorable to binding by other proteins or change the activity of a protein from 'off' to 'on' or vice-versa. This trick is used extensively in biology to 'activate' or 'inactivate' proteins in order to allow the cell to respond rapidly to external stimuli.

## Biological sequences can be represented as strings of letters

The International Union of Pure and Applied Chemistry (IUPAC) has established standard codes for representing DNA, RNA, or amino acid (protein) sequences using single letter codes. It is worth noting that these codes represent basic information about the sequence, but in real cells additional signals may be present (e.g. modifications of the histones that wrap up DNA can make it more or less accessible and thereby alter rates of transcription). 

### Representing DNA and RNA as text

Standard codes for DNA nucleotides (a nice summary is available [here](http://zhanglab.ccmb.med.umich.edu/FASTA/)):
<pre>
A: Adenosine
T: Thymidine
C: Cytosine
G: Guanine
</pre>
The codes for RNA nucleotides are similar, except that RNA has Uridine (`U`) instead of Thymidine (`T`):
<pre>
A: Adenosine
U: Uridine
C: Cytosine
G: Guanine
</pre>

#### Codes for ambiguous nucleotides
These cover most cases. However, in some cases DNA sequencing machines can't tell the nucleotides apart.
Therefore there are also ambiguous characters to represent cases where the identity of a nucleotide cannot
fully be determined.

<pre>
N: any nucleotide
R: any purine nucleotide (G or A)
Y: any pyrimidine nucleotide (T or C)
M: any amino nucleotide (A or C)
K: any keto nucleotide (G or T)
S: any nucleotide that forms strong bonds (G or C)
W: any nucleotide that forms weak bonds (A or T)
B: any of G,T,C
D: any of G,A,T
H: any of A,C,T
V: any of G,C,A
</pre>
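
In code, the ambiguity codes above are often handled by mapping each letter to the set of nucleotides it can stand for. The sketch below shows one possible approach, turning a DNA pattern with ambiguity codes into a regular expression using Python's standard `re` module. The names `IUPAC_DNA`, `to_regex` and `matches` are chosen for this example, and it assumes uppercase input.

```python
import re

# Each IUPAC DNA code mapped to the nucleotides it may represent.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}

def to_regex(pattern):
    """Convert a sequence with ambiguity codes into a regular expression."""
    return "".join("[" + IUPAC_DNA[code] + "]" for code in pattern)

def matches(pattern, seq):
    """True if the unambiguous sequence seq is compatible with pattern."""
    return re.fullmatch(to_regex(pattern), seq) is not None

print(to_regex("RY"))           # [AG][CT]
print(matches("AYCG", "ACCG"))  # True  (Y can be C)
print(matches("AYCG", "AGCG"))  # False (Y cannot be G)
```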

#### A special 'indel' character useful when comparing two sequences

When comparing two sequences, it is useful to have a gap character to represent when a letter has been deleted from one of the two sequences or added to the other (this is called an **indel** for insertion or deletion). Indels are represented with a `-` character:

`-` gap in one sequence
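
As a small illustration of how the gap character is used, the sketch below compares two sequences that have already been lined up (aligned) so that they have the same length, and tallies matches, mismatches and gaps. The function name and the example sequences are invented for this illustration; producing the alignment itself is a separate problem that is not addressed here.

```python
def compare_aligned(seq1, seq2):
    """Count matches, mismatches and gap positions between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            gaps += 1        # an indel: a letter is present in only one sequence
        elif a == b:
            matches += 1
        else:
            mismatches += 1  # a substitution
    return matches, mismatches, gaps

print(compare_aligned("ATCGA-T", "ATGGACT"))  # (5, 1, 1)
print(compare_aligned("CCAGCA", "CC----"))    # (2, 0, 4)
```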


### Representing Amino Acid Sequences as text

Each amino acid letter in a protein sequence has a full name, a short three letter name, and a one letter code commonly used in bioinformatic analysis.
<pre>
 A: alanine (ALA)
 C: cysteine (CYS)
 D: aspartate (ASP)
 E: glutamate (GLU)
 F: phenylalanine (PHE)
 G: glycine (GLY)
 H: histidine (HIS)
 I: isoleucine (ILE)
 K: lysine (LYS)
 L: leucine (LEU)
 M: methionine (MET)
 N: asparagine (ASN)
 P: proline (PRO)
 Q: glutamine (GLN)
 R: arginine (ARG)
 S: serine (SER)
 T: threonine (THR)
 U: selenocysteine (SEC)
 V: valine (VAL)
 W: tryptophan (TRP)
 Y: tyrosine (TYR)
</pre>

#### Ambiguous amino acid codes

Just as with nucleotides, certain amino acids are hard to tell apart, and so one letter codes have been developed to represent this ambiguity:
<pre>
X: any amino acid
B: aspartate or asparagine (ASX)
Z: glutamate or glutamine (GLX)
</pre>
**Special amino acid codes**. Special amino acid codes are:
<pre>
- a gap or indel. Only used when comparing sequences, and indicates insertion or deletion of an amino acid in one sequence relative to another. 

* translation stop.
</pre>

## FASTA files store sequence information

DNA, RNA and protein sequences are often stored in text files called FASTA files. These might have any of several file extensions: .fasta, .fna (usually a nucleotide FASTA file), or .faa (usually an amino acid FASTA file).

Here's how the lines of a FASTA file might appear:
<pre>
>gene1
ATCGATCGATCGTACGTCAGTCGTACGTCAGTCAGA
ACTGACTGTACGTACGTACGATGCTACGTACGCATA
ACTACGTACGTACGTACGATCGTACGTACGCATACG
ATGCTACGTACG
>gene2
ACGCTATCGATCGTACGTACGTAGCTACGTGGGGGG
AATATATTTGCGCGCCGTATAATATGCCGATATGCG
GTGCCTCTCTCGGCGCGCGCATTTTTGCGCGAAAAA
AAAAGCGATCGATCGTACGAAAAATGCATAGCTACG
AYCGAYCGAGC
>gene3
AGTCGACTGATCGTAGCTAGCTAGCTACGTAGCTAG
GA
</pre>

Look carefully at the above sequences. Here are the important features:
* each **label line** starts with a greater than ('>') sign.
* depending on where you got the sequence, the label line might contain a great deal of additional structured information, or it might just have an id like `seq1`.
* the text after the label line, but before you encounter the next label
  line or the end of the file, is the sequence that goes with that label.
* the sequences in FASTA files are often broken up across more than one line.
* depending on where you get your FASTA file, the letters could be uppercase,
  lowercase, or a mixture.
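
The features listed above are enough to write a very small FASTA reader. The following is a minimal sketch under some simplifying assumptions: the whole label line (minus the `>`) is used as the key, labels are assumed to be unique, and all letters are converted to uppercase. The function name `parse_fasta` is just for this example; established libraries such as Biopython provide more robust FASTA parsers.

```python
def parse_fasta(lines):
    """Read FASTA-formatted lines into a dict of {label: sequence}."""
    pieces = {}
    label = None
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            label = line[1:]              # label line: drop the '>' sign
            pieces[label] = []
        elif label is not None:
            # sequence line: may be one of several for this label
            pieces[label].append(line.upper())
    # join the line-by-line pieces into one string per label
    return {name: "".join(parts) for name, parts in pieces.items()}

example = """>gene1
ATCG
atcg
>gene3
AGTCGACTGATCGTAGCTAGCTAGCTACGTAGCTAG
GA
"""

records = parse_fasta(example.splitlines())
print(records["gene1"])  # ATCGATCG
print(list(records))     # ['gene1', 'gene3']
```

To read from a file instead of a string, you could pass an open file object, since iterating over a file also yields one line at a time.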
  
  
Hopefully this chapter has served as either a brief reminder or a (probably somewhat overwhelming) introduction to biological sequences. If you'd like to learn more, Nature Education's Scitable articles, like this one on [Translation](https://www.nature.com/scitable/topicpage/translation-dna-to-mrna-to-protein-393/#:~:text=Although%20methionine%20(Met)%20is%20the,methionine%20is%20removed%20after%20translation.), are a great resource that provides short introductions to various topics in biology. 


## Exercises

**Exercise 1**. Guess the Sequence

Given the above information about the letters in DNA, RNA and amino acid sequences, try to figure out what type of biomolecule is represented by each of the following sequences written in IUPAC nomenclature. For each, write down whether you know for sure that it is a DNA, RNA or amino acid sequence. If you have a good guess but can't tell for sure which it is, explain why:


**Sequence 1:** MEELVVEVRGSNGAFYKAFVKDVHEDSITVAFENNWQPDRQIPFHDVRFPPPVGYNKDIN

**Sequence 2:** AUGAUCGUACGUCAGCUCGUACGUCGGCGGUGAUGCUAGCGCUACGUGACUACGUAGCUA

**Sequence 3:** ATGACAACACAATTAAATCCCTATTTTGGTGAATTTGGCGGAATGTATGTGCCGGAAATT


**Sequence 4:** AGC



**Exercise 2**. *Escherichia coli*, or *E. coli* is a bacterium that is commonly used in laboratory experiments. Imagine that in the laboratory you chemically modified the structure of one of *E. coli*'s proteins, such that one amino acid was replaced by another. Use the Central Dogma of Molecular Biology to predict whether this chemical modification will be passed on to the next generation of *E. coli*, and justify your prediction.

Once you have jotted down your answers, you can check them in the [answer key](./biological_sequences_exercise_answers.ipynb).

## [Reading Responses & Feedback](https://docs.google.com/forms/d/e/1FAIpQLSeUQPI_JbyKcX1juAFLt5z1CLzC2vTqaCYySUAYCNElNwZqqQ/viewform?usp=pp_url&entry.2118603224=Biological+Sequences)