PCA analysis on genes and transcripts TPM for kallisto and salmon

changed milestone to %downstream

added Blocked Discuss Documentation Doing + 1 deleted label

Depends really on how do you want to integrate it:

If you want a proper MultiQC module and interactive plots - I don't think scatterplots are supported, not even mentioning 3D scatterplots.
That would leave us with injecting plain png's just like we do it now for TIN scores. But if I remember correctly, in order for the plot to be automatically recognised by the aggregator the filename needs to start with mqc_.

I ran Zarp for some samples of GCN4 data set of Nitish and performed PCA for the gene expression quantified by salmon: and compared to results produced by the pipeline:

The scatter looks similar, however, the percentage of the variance explained by PCs differs substantially. I checked the code that was used by the pipeline (https://github.com/zavolanlab/zpca/blob/master/scripts/zpca) and found some differences with my code:

Prior to all manipulations, the table with the gene expression should be log-transformed. Log-transformation makes the distribution of data close to normal and allows to perform PCA. This is a crucial point that should be implemented.
Also I performed only mean centering, I did not modify the variance in the expression (the function scale used in the pipeline performs both mean centering and scales the variance to 1). I don't know, how the function scale normalizes the variance to 1 for genes with the expression 0 in all samples.

@boersch Thanks for the feedback. Can you try to run the updated version? I now add a pseudo count (1/32) everywhere and log2 transform. Just pull the current branch and delete the output for the PCA plots. I think it should look more similar now (though I still expect some changes).

We recently discussed that procedure with @zavolan and @boersch , pseudocounts should not be too low because after log-transformation they will artificially pull down the mean of the distribution of gene expression, thus, skewing the data centering. In that conversation we agreed to add 1 everywhere...

@bakma Thanks for the feedback. I assume that you add +1 to the counts and not to the TPMs?

Hm.... no, the discussion was about TPMs I think...
But now when I think of it... It should be counts, right?

@zavolan @boersch Can you please comment on this? Did you discuss adding +1 on the raw counts, or on the TPM?

I usually added to TPMs. Recently we shortly discussed with @zavolan that it would make more sense to add a pseudocount to raw counts and then normalize to TPMs, but I didn't try this implementation so far.

I just copy-paste here info from the slack discussion to keep it in the thread:

@zavolan wrote

@gypas I think there are currently 2 ways of doing this. The 'old' way was to add a pseudocount of 1 to the counts, and then calculate TPMs. Now, in the single cell analyses I think there is first a TPM calculation done and then a pseudocount is added. @boersch how/where do you implement the psedocounts?

@boersch wrote

First I normalize to TPMs, then I remove lowly expression genes, then I add a pseudocount, then I perform the log transformation, then I center the expression matrix, then I apply singular value decomposition, which provides principal components

@zavolan wrote

Hi all If we add to raw counts, then genes with no expression will get a baseline expression dependent on their length. If we add to TPM then genes with no expression will get the same pseudocount. I guess the first version is more appropriate when we have reads from all over the gene, the second would make more sense when we have 'tag' (e.g. 3' end sequencing).

So, based on the above we should follow the first approach, as we have reads from all over the gene. If you all agree I will update the Snakefile rules according to this.

In general I will follow the following:

Add +1 to the raw counts
Convert the values to TPM
Log2 transform
Run PCA

The updated script is here: https://github.com/zavolanlab/zpca (v0.3). I still have to fix the rules.

Script v0.4 and pipeline are updated. @boersch Can you please have a look at the PCA output of the new pipeline?

added 1 commit

0bda3e58 - Update zpca version from 0.1 to 0.2. This versions addss a pseudocount (1/32)...

Compare with previous version

mentioned in issue #142 (closed)

added 1 commit

7dc88e1a - Update PCA analysis to perform per sample gene/transcript length normalization...

Compare with previous version

changed the description

I ran the modified pipeline using the same data set (4 GCN4 samples from Nitish, 2 WT samples and 2 mutants). The outcome looks very similar across different quantifications (salmon genes, salmon transcripts, kallisto genes, kallisto transcripts), but differs a lot from the outcome obtained by adding a pseudocount to normalized expression (see above). Plots PC1 vs PC2 from the updated version are hard to interpret. If earlier PC1 clearly distinguished WT samples from mutants, now samples are distributed in the space without clear pattern. PC3 adds even more confusion. Shall we discuss the procedure again?

PC1 v PC2 for Salmon transcripts:

PC1 v PC2 for Salmon genes:

PC1 v PC2 for Kallisto transcripts:

PC1 v PC2 for Kallisto genes:

PC1 v PC3 for Salmon transcripts:

PC1 v PC3 for Salmon genes:

PC1 v PC3 for Kallisto transcripts:

PC1 v PC3 for Kallisto genes:

Thanks for the feedback @boersch

Copy paste Mihaela's comment from slack:

Hi all There seem to be more than 1 difference between the old and new computations. @gypas Foivos do you remove genes/transcripts with very low expression before the PCA?

@zavolan I remove all rows with zero counts. When running the PCA, I could filter low expressed transcripts/genes based on TPM. Should I apply that as well?

Copy paste Mihaela's comment from slack:

@boersch what sort of genes did you filter out before?

For testing I did not filter genes. I took the expression of genes from salmon in tpms, added pseudocount, made log2 transformation, performed mean centering and singular value decomposition of the resulting matrix.

OK, so there is even more going on here. First, the "testing" PCA is done without removing low-expression genes, whereas the reference PCA, which is at the beginning of this thread, did remove the low expressed genes.

Second, pseudocounts seems to be added twice, once at the read level, by Foivos, and second time at the TPM level, by Anastasiya, according to the description above. We shouldn't do that. So please, let's test one variable at the time. To compare with the previous results, the expression levels gotten by Foivos should be log-transformed, no additional pseudocounts added, low-expression genes/transcripts should be removed, the data centered and then compared with the result at the beginning of this thread.

These are now results of the test if I remove genes not expressed in all samples (as @gypas did), then perform PCA as I described in the previous comment. They are almost identical to results without filtering.

Pseudocounts are not added twice, I take the expression matrix and perform transformations by myself.

added 1 commit

9cbabb97 - Rollback to TPMs

Compare with previous version

@boersch If you have some time, can you please try once more with the last version?

Current pipeline

Use TPMs
filter transcripts were mean(TPM) per sample < 1
Add +1 TPM as TPM pseudocount
Log2 transform
Run PCA

added 1 commit

398de08c - Update to zpca v0.7. Disable unit variance scalling. Keep genes/transcripts >0 instead of >1.

Compare with previous version

@boersch I would really appreciate it if you can try one more time. I added the changes we discussed. Thank you in advance.

@gypas The pipeline reported the following error in all rules, where zpca-tpm was called:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/pandas-1.1.3-py3.8-linux-x86_64.egg/pandas/core/internals/managers.py", line 1665, in create_block_manager_from
    mgr = BlockManager(blocks, axes)
  File "/usr/local/lib/python3.8/site-packages/pandas-1.1.3-py3.8-linux-x86_64.egg/pandas/core/internals/managers.py", line 149, in __init__
    self._verify_integrity()
  File "/usr/local/lib/python3.8/site-packages/pandas-1.1.3-py3.8-linux-x86_64.egg/pandas/core/internals/managers.py", line 326, in _verify_integrity
    raise construction_error(tot_items, block.shape[1:], self.axes)
ValueError: Shape of passed values is (6655, 3), indices imply (4, 3)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/zpca-tpm", line 4, in <module>
    __import__('pkg_resources').run_script('zpca==0.7', 'zpca-tpm')
  File "/usr/local/lib/python3.8/site-packages/pkg_resources/__init__.py", line 650, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1453, in run_script
    exec(script_code, namespace, namespace)
  File "/usr/local/lib/python3.8/site-packages/zpca-0.7-py3.8.egg/EGG-INFO/scripts/zpca-tpm", line 131, in <module>
  File "/usr/local/lib/python3.8/site-packages/zpca-0.7-py3.8.egg/EGG-INFO/scripts/zpca-tpm", line 112, in main
  File "/usr/local/lib/python3.8/site-packages/pandas-1.1.3-py3.8-linux-x86_64.egg/pandas/core/frame.py", line 497, in __init__
    mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
  File "/usr/local/lib/python3.8/site-packages/pandas-1.1.3-py3.8-linux-x86_64.egg/pandas/core/internals/construction.py", line 234, in init_ndarray
    return create_block_manager_from_blocks(block_values, [columns, index])
  File "/usr/local/lib/python3.8/site-packages/pandas-1.1.3-py3.8-linux-x86_64.egg/pandas/core/internals/managers.py", line 1672, in create_block_manager_from
    raise construction_error(tot_items, blocks[0].shape[1:], axes, e)
ValueError: Shape of passed values is (6655, 3), indices imply (4, 3)

Yes, I noticed. I am fixing it now. I will let you know once I am done, so don't spend any time now.

@boersch I think it's fixed now. Please, let me know if you encounter any other problem.

added 1 commit

fe1c0694 - zpca version 0.8. Fix error of table transpose.

Compare with previous version

I think, now PCA results for GCN4 samples of Nitish obtained in R and with zarp look very similar.

Kallisto transcripts, PC1 vs PC2, zarp: Kallisto transcripts, PC1 vs PC2, R:

Kallisto genes, PC1 vs PC2, zarp: Kallisto genes, PC1 vs PC2, R:

Salmon transcripts, PC1 vs PC2, zarp: Salmon transcripts, PC1 vs PC2, R:

Salmon genes, PC1 vs PC2, zarp: Salmon genes, PC1 vs PC2, R:

Finally! Thanks a lot for checking. I think we can merge now.

marked the checklist item @boersch check if the current PCA analysis gives similar results to the standard PCA analysis she is performing on a real dataset. as completed

marked the checklist item @bakma Check is there are further changes required, in order to integrate the output of the PCA analysis in multiqc. as completed

PCA analysis on genes and transcripts TPM for kallisto and salmon

Merge request reports

Activity