The other day I was asked to make a plot for the cBioPortal which summarizes allele frequency across genes in a particular patient. Thought this data --- a list of frequencies (numbers) --- was actually already displayed in a table, this is a perfect illustration (pun intended) of how visualizations help us understand the numbers (see this for an example). Nothing like feeding in data directly through the optical nerve and then applying using the best pattern recognition machine known to man --- the human brain. Anyways, it turns out that you can just fire up R and use the density function on the list of numbers and get something like this:

Without knowing too much statistics, this curve can still provide a great deal of intuition about a tumor. For example, notice that there are two bumps. This indicates that there are at least two subclones in the tumor sample. You'll also notice that most of the distribution's area is shifted to the left of \( 0.5 \), due to tumor impurity. In the lines that follow we go in depth on the biological intuition, statistics, and code implementation of this funny little line.


Let's start by looking at these terms:

  • allele
  • subclone
  • tumor purity

starting with the allele

When I learned about alleles in high-school it was in the context of Mendelian genetics. Somehow you ended up dealing with discrete gene types --- A and a --- each of which represented different values (genotypes) for a particular trait. Since your somatic (Greek for body, i.e. not sex) cells are diploid (Greek for double) there are going to be two alleles for every gene and one can study how these (two) alleles for the same gene interact with each other. In high school there were basically only two types of interactions between alleles: dominant, and recessive.

Nowadys I think that it is more appropriate to think about genes as sequences of base pairs and of alleles as different versions of a particular sequence. Of course, cells are still diploid and so the labels, "homozygous," "heterozygous" are still very relevant. 1

and on to the sequencing (read making) machine

At this juncture it is probably best to say a few words about sequencing data and what is looks like. We're only going to give a very high level overview and are glossing over practically all of the details of the amazing technologies that are making the genome revolution possible, not to mention the countless hours of work by many brilliant scientists and engineers.

Long story short, when we take a tumor biospy, extract the genetic material and then sequence it, what we end up with is a long list of overlapping "reads" --- relatively short sequences of ~20-100s of nucleotides. Of course, when we first get these reads we don't even know a priori where they came form in the genome, so we do data analysis and run algorithms (and scientists, as experts, do some magic) to figure out where they go. This is called read alignment. Here's an example of a portion of an alignment, highly boiled down of course:



The first line is the reference gene, this is the thing that the reads were aligned against. Each row is another read, another measurement. When all of these measurements agree at a nucleotide, then we have high confidence in that measurement. Take a look at the first letter in green. The reference says that it should be C but our first read says that it is a T. Since only one read covers that nucleotide, maybe we aren't so confident in the reliability of the measurement. 2

Now take a look at the sequence of blue reads. We have a few reads there (in reality we'd probably want tens or hundreds of reads at a particular location) --- it's looking good --- except that there are a few red Gs.

How can we interpret these Gs? One possibility is that it is an artifact of the machine, i.e. measurement error as with the first T, but it's also possible that we have discovered an alternate allele in our sample. A natural thing to start thinking about is the fraction of Gs at that particular location, and indeed this blog post pivots on just this fraction. Let's write it out, even if it is mundane:

\[ freq_{position}(allele) = \frac{\#allele}{\#reads} \]

Once you've got the reads, it's simple. In our example, the allele indexed by the red G is simply \( \frac{2}{4} = .5 \).

and now, interpretation!

What sorts of values does one expect to get for allele frequencies? As we mentioned before, human cells are diploid. So if the cell population has one copy of a particular allele, i.e. if it is heterozygous, you'd expect the allele frequency to be \( .5 \). So for a list of heterozygous alleles, you'd expect the distribution of the allele frequencies to look something like this:

Since heterozygousity means that the allele is in half of the genetic material and therefore in half the reads, the distribution is centered around \( 0.5 \). The distribution has width because of the imprecision of the machine. Remember that we probably want tens or hundreds of reads at a location to have statistical confidence (aka power).

In cancer, genetic alterations are often so potent that only heterozygosity (one altered gene) is required to confer an evolutionary advantage. Without evolutionary pressure to change, there isn't likely to be any tractable change and so, if there's a genomic alteration that is cancer related, we expect it to be present in half of the genomic information. If it is in half of the genomic information, then it will be in half of the reads. Under "perfect conditions" (more about this in a moment), we expect a distribution of cancer gene allele frequencies to have a distribution similar to the one above.

tumor composition complects 3

So much for alleles coming from different chromosomes, how about alleles coming from different cells? And first of all, what sorts of cells do we even expect to encounter in the tumor? Remember, in the process of creating the genomic data all the different types of cells that are in the tumor --- the normal, noncancerous cells, the benign cancer cells, and the malignant cells --- are ground up and analyzed as one. This creates complex, heterogeneous data. Or put differently, there are too many variables influencing the data. Analyzing such data like listening to the radio in between channels --- it's noisy. In fact, statisticians and biologists often discuss the signal strength as well as the noise content of data. Biology tends to throw off a pretty noisy signal.

For genomicists, what makes all of these cells different is the fact that they have different genomes. Cancer cells, as we mentioned earlier, may even have genomes with large chunks of missing or additional genomic material. In fact, this genomic heterogeneity is well-known (as they say in the academic papers) and is part and parcel of the disease itself. Cancer originates in the accumulation of certain enabling genomic alterations. These are generally going to be things that most people recognize as cancer-like: loss of the ability to stop growing, ability to dodge the immune system, etc. The genomic instability that lead to this critical mass of genomic alterations snowballs, giving rise to a heterogeneous cell population. A heterogeneous population is exactly what allows for evolution. This is what gives cancer its agility and resilience. Here's a nice visualization of what's happening in Acute Myloid Leukemia (AML) pre- and post-treatment:

Visualization of Tumor Evolution in AML


So when oncologists talk about subclones they are implicitly assuming that evolution has taken place. They assume that there were at most a few "bad cells" that diversified and then evolved when put under the pressure from the environment (which is a mild way of putting it, what we are talking about is surgery, radiation therapy, chemotherapy).

Supposing we have such a heterogeneous population of cancer cell subclones and normal cells, what would a distribution of its cancer cell allele frequencies look like?

First of all, the presence of normal cells shifts the distribution of cancer alleles to the left by diluting them. So we generally expect the frequency of a cancer allele, not to be \( .5 \) but more like \( .3 \):

Assuming that what differentiates the subclones are some collection of mutations (otherwise, this method is mute and we have to look into other ways of slicing this data), we would get allele frequences other than \( .5 \). Remember that what is changing here is the numerator of the allele frequency. We still have the same number of total cells, but the number of times a particular allele shows us in the read counts decreases since the reads are now distributed among different subclones that have different mutation profiles. Furthermore, the new allele frequency that will appear has to be less than \( .5 \). You get something like this:

Here's a relevant thread

In summary, we are basically looking at three statistics calculated from the data:

  1. mean : indicating the purity of the tumor,
  2. variance : due to the imprecision of the sequencing machine, and
  3. modality (number of modes ): indicating the presence, possibly even the number of, subclones

philsophical take away

Take note that while it wasn't necessarily hard or nonintuitive, we did have to do some thinking about the interpretation of this data. The inherent complexity here due to the fact that the reads --- from different chromosomes and from different cells --- are all mixed together in one basket. The job of analysis then is to pull this data apart into its contributing components.

Statistics and Implementation

From this point of view, our list of allele frequencies are just a list of numbers which we want to estimate the generating distribution of. One way to do this is to make a histogram --- break up your data into groups (bins) by setting thresholds (all the numbers that fall between \( .25 \) and \( .5 \) ) and the count the number of data points that fall into each bin and divide by the number of data points you have. But this isn't going to give an algebraic expression for the distribution, nor will it give a smooth curve. To do this you can use a generalization of histograms called Kernel Density Estimators.

You can also think about histograms as a sum of uniform distributions. From your choice of bins make a list of uniform distributions. For example, the bin \( [.25, .50] \) turns into an indicator function \( \mathbb{1}_{[.25, .50]} \) which is equal to \( 1 \) on the interval \( [.25, .50] \) and zero everywhere else. Divide that by the length of the interval, in this case \( .50 - .25 = .25 \) and you get a uniform distribution,

\[ u = \frac{1}{0.25}\mathbb{1}_{[.25, .50]} \] .

Now turn your attention to the data. To each data point \( d\in D \), find which bin it falls into then take the associated uniform distribution. Take the average of all these uniform distributions and you get a functional form, a distribution in fact, for the histogram,

\[ \frac{1}{|D|} \sum_{d\in D} u_d \]

Okay, so I guess I was a bit misleading earlier when I said that you can't find an algebraic expression for the histogram. What's written is a bunch of symbols exactly conveys a histogram! And it sure looks like an algebraic expression to me! But, can we do better? Can we get a function that is continuous? It's also worth pointing out that as the number of data points grows, the error of histogram (distance to the true generating distribution) goes to zero slowly. This is due to the nontrivial relationship between the data and the bin size.

With this form, it isn't hard to generalize the histogram. Just replace the uniform distribution with another distribution, i.e. continuous function that integrates to one. Here's an illustration from wikipedia:

Kernel Density Estimator Construction

Take a Gaussian distribution for example, and replace the \(u\)s in the average above with this function and you would get a Kernel Density Estimator using the Gaussian Kernel.

\[ \frac{1}{\sqrt{2\pi\sigma^2}} e^{\frac{(x - \mu)^2}{2\sigma^2}} \]

Backup a second, where's the word "kernel" coming from? The meaning of this term really seems to depend on who you ask. Some people would think of it as a generalized way of measuring distance between vectors in space. Just as in software where abstraction allows for the increased interplay of generalized abstract tools, here the abstraction of this form of a distance function allows statistical methods to modular in their choice of distance metrics.

In this sense of "generalized distance" then, its clear that a kernel function has to have two arguments. After all, what else would a distance function do if not take two things and return a number representing the distance between them? Okay, big deal, we are looking at a function of two arguments, \( K(x,y) \). How can we turn that Gaussian distribution into a function of two arguments? Pull out the mean \( \mu \):

\[ K(x,y) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{\frac{(x - y)^2}{2\sigma^2}} \]

Lo and behold, this is the Gaussian Kernel. Once you fix a value for \( y \) you can go about calculating some KDEs. Another way of thinking about this abstractly, is that the KDE is the center of mass of the distributions associated with each data point in distribution (kernel) space. This center of mass is parameterized by the mean parameter (or any other parameter that the distribution takes) that we just pulled out of the distribution function. For more about this, take a look at Michael Jordan's lecture notes . He goes into some bounds for histograms, i.e. convergence properties as the number of data points \( n \) goes to infinity, among many other interesting things. It was a pleasure to discover him and his work when poking around the Internet for stuff about KDEs.

But hold on, what about the parameter \( \sigma \)? In this context, \( \sigma \) is the Gaussian Kernel's "smoothing" parameter, aka bandwidth.

When I started looking into calculating this smoothing parameter, I realized that there is a whole literature devoted to KDEs and choosing smoothing parameters. There are papers like this, a whole chapter of a statistics book Maximum Penalized Likelihood Estimation "Choosing the Smoothing Parameter," and many other things too, until I made a quick return to the origin of so many escapades in "research," wikipedia, where I found Silverman's Rule of Thumb.

I poked around a bit looking for a derivation but as soon as I implemented it and saw that it worked, I have to admit that I was satisifed:

// Silverman's rule of thumb for calculating the bandwidth parameter for
// kde with Gaussian kernel.  Turns out that one can calculate the optimal
// bandwidth when using the Gaussian kernel (not a surprise).  It turns out
// to basically be, a function of the variance of the data and the number
// of data points.
var calculate_bandwidth = function(data) {
    var mean = d3.mean(data);
    var variance = d3.mean( { return Math.pow(d - mean, 2); }));
    var standard_deviation = Math.sqrt(variance);
    var bandwidth = 1.06 * standard_deviation * Math.pow(data.length, -1/5);

    return bandwidth;

Why do I say that it's not a surprise that you can calculate the optimal bandwidth when using the Gaussian Kernel? I dunno, things often work out when you use the Gaussian function. There you go for intuition.

Let's take a quick look at some of the implementation details in javascript. I got my entire inspiration for this from a Michael Bostock block :

// params: kernel, list of sampling points
// returns a function that takes a list of data points, and computes the kernel
// density estimator over those data points by averaging the kernel over each one,
// returning an x-y coordinate tuple [sample, kde(sample)]
var kernelDensityEstimator = function(kernel, sampling) {
    return function(data) {
        return {
                return [sample,
                    d3.mean(data, function(v) { return kernel(sample - v); })];

I just renamed a few variables so that they would be a bit clearer to me. Bostock is clearly an extremely talented guy, but why does he always name his variables badly? Is that what happens when you're famous?

Anyway, note that we are computing kernel(sample - v) for every datapoint v in data (nice syntax sugar d3js ), and then taking the mean of all of these mapped values. Since we are doing the norm operation inside of this kernelDensityEstimator function, we don't have to include it in the kernel function itself. Thus, although we wrote (( kde(x) )), what we meant was really \( kernel( | x- y | ) \). In any case, here's an implementation of the Gaussian kernel. No surprises here:

// params: variance, basically a smoothing parameter for this situation
// returns the gaussian function, aka kernel
var gaussianKernel = function(variance) {
    return function(x) {
        return (1 / (variance * Math.sqrt(2 * Math.PI)))
            * Math.exp( -1 * Math.pow(x, 2) / (2 * Math.pow(variance, 2)));

thanks for reading

So we talked a bit about tumor composition, sequencing machines, and how we can go about quickly getting a sense for the real knowledge about a tumor from sequencing data using Kernel Density Estimators.

And just in case it isn't already perfectly clear what the axises of the plot are: the x-axis is allele frequency as we defined above and the y-axis is the number of gene sequences (divided by the total number of genes being considered, though this is optional).

[1] In fact, one of the confounding aspects of the cancer genome is that due to the modifications that have taken place in the cell cycle, and probably among many other reasons, the number of chromosomes will often not be a multiple of 23. I recently learned (thanks Ricardo) that the term for this is aneuploidy -- an-eu-ploidy, Greek: not-good-ploidy. Wikipedia seems to think that ploidy is a kind-of-made-up-word, but I think it is much more poetic to see it this way: ps are like fs, and we get floidy. Sometimes letters switch places so we get folidy. And sometimes letters disappear, especially if make it hard to say a word, goodbye "i", and we get foldy. It's the number of folded up bundles, chromosomes, of genetic material you have during cell division. But, we digress. As the term suggests, aneuploidy is bad. A recessive allele on a duplicated (or triplicated, or...n-plicated) chromosome can overpower a dominant allele, and conversely for a dominant allele on a deleted chromosome. This is a completely different mechanism by which dominant and recessive alleles can be regulated than anything we thought about in high school.

[2] This is, of course the age old notion of measurement precision that every high school student spends far too much time laboring over at the beginning of every grade school science class. Having suffered through this as both a student and a teacher, I'll say that instead of doing this at the beginning when no one appreciates the importance of these subtle notions, what she should be doing at the beginning of a science class is getting hooked on science. Once a person is hooked, they'll go through all kinds of torture: understanding the difference between precision and accuracy (aka variance and mean), learning how to measure and how to record those measurements, and also how to do arithmetical operations on measurements and preserve their precision and accuracy properties (significant figures).

[3] There are actually machines that can sort a heterogeneous population of cells using flow cytometry --- cyto-metry, Greek: cell-measurement. Cells are tagged using a special molecule that, when exposed to a certain wavelength of light, reflects back a specific wavelength of light that can then be measured (fluorescence). You can use different tags to reflect different colors, the sorting machines (FACS for example) can then use the wavelength value to sort the cells. You flow the cells through a tube one at a time, the machine shines a light and then depending on what is reflected back, sends the cell into one of a number of pipes. Lo and behold, you have individually sorted a whole bunch of cells. You can sort ~5,000 cells per second using one of these machines. Wow! It would seem as if we can eliminate the complexity by simply pull apart the different types of cells in the sample.

Of course what we glossed over is also the trickiest part of this process: tagging the cells appropriately. It would be great if we could tag all of the cancer cells and then literally sort them out using one of these machines. Unfortunately, we don't know how to tag tumor cells yet. If we did, then this discussion is mute --- another example of technological advancement resulting in higher quality data, thereby making the statistics anachronus.

We are only at the beginning of a more complete molecular understanding of tumors. In fact, the data set we are working with comes from the hard won work of a large collaborative project called The Cancer Genome Atlas (TCGA). As the name suggests, this is very much an exploratory effort to get a rough map of the terrain of the cancer genome, and more generally of the "mole-ome," of cancer. I don't know how soon it will happen, but sometime in the near future we will have a sufficiently complete picture of the cancer genome to be able to know what to look for, tag it and study it in a relatively unhindered, uncomplected way.
(Thanks Hannah for the comment)!