PCA analysis on genes and transcripts TPM for kallisto and salmon
- Delete unused files
scripts/fg_extract_transcripts.py
,scripts/heatmap_and_clustermap.py
,scripts/perform_PCA.py
. - Add rules (
pca_kallisto
,pca_salmon
) that use zpca on genes and transcripts TPM tables from kallisto and salmon. - The output is wired to multiqc_report but the plots are not visualized to multiqc.
- Update documentation.
- Update dag and rulegraph.
- Note: No tests are added, because of random numbers used. No fixed md5sum.
Fixes #140 (closed) #142 (closed)
Before merging I would ask:
-
@boersch check if the current PCA analysis gives similar results to the standard PCA analysis she is performing on a real dataset. -
@katsanto to review the new snakemake rules to make sure that they are according to the snakemake standards. -
@bakma Check is there are further changes required, in order to integrate the output of the PCA analysis in multiqc. -
@kanitz have a look at zpca and work on the merge request.
Thank you in advance.
Merge request reports
Activity
changed milestone to %downstream
added Blocked Discuss Documentation Doing + 1 deleted label
Depends really on how do you want to integrate it:
- If you want a proper
MultiQC
module and interactive plots - I don't think scatterplots are supported, not even mentioning 3D scatterplots. - That would leave us with injecting plain png's just like we do it now for TIN scores. But if I remember correctly, in order for the plot to be automatically recognised by the aggregator the filename needs to start with
mqc_
.
- If you want a proper
I ran Zarp for some samples of GCN4 data set of Nitish and performed PCA for the gene expression quantified by
salmon
:and compared to results produced by the pipeline:
The scatter looks similar, however, the percentage of the variance explained by PCs differs substantially. I checked the code that was used by the pipeline (https://github.com/zavolanlab/zpca/blob/master/scripts/zpca) and found some differences with my code:
- Prior to all manipulations, the table with the gene expression should be log-transformed. Log-transformation makes the distribution of data close to normal and allows to perform PCA. This is a crucial point that should be implemented.
- Also I performed only mean centering, I did not modify the variance in the expression (the function
scale
used in the pipeline performs both mean centering and scales the variance to 1). I don't know, how the functionscale
normalizes the variance to 1 for genes with the expression 0 in all samples.
@boersch Thanks for the feedback. Can you try to run the updated version? I now add a pseudo count (1/32) everywhere and log2 transform. Just pull the current branch and delete the output for the PCA plots. I think it should look more similar now (though I still expect some changes).
We recently discussed that procedure with @zavolan and @boersch , pseudocounts should not be too low because after
log
-transformation they will artificially pull down the mean of the distribution of gene expression, thus, skewing the data centering. In that conversation we agreed to add1
everywhere...@bakma Thanks for the feedback. I assume that you add +1 to the counts and not to the TPMs?
I usually added to TPMs. Recently we shortly discussed with @zavolan that it would make more sense to add a pseudocount to raw counts and then normalize to TPMs, but I didn't try this implementation so far.
I just copy-paste here info from the slack discussion to keep it in the thread:
@zavolan wrote
@gypas I think there are currently 2 ways of doing this. The 'old' way was to add a pseudocount of 1 to the counts, and then calculate TPMs. Now, in the single cell analyses I think there is first a TPM calculation done and then a pseudocount is added. @boersch how/where do you implement the psedocounts?
@boersch wrote
First I normalize to TPMs, then I remove lowly expression genes, then I add a pseudocount, then I perform the log transformation, then I center the expression matrix, then I apply singular value decomposition, which provides principal components
@zavolan wrote
Hi all If we add to raw counts, then genes with no expression will get a baseline expression dependent on their length. If we add to TPM then genes with no expression will get the same pseudocount. I guess the first version is more appropriate when we have reads from all over the gene, the second would make more sense when we have 'tag' (e.g. 3' end sequencing).
Edited by BIOPZ-Gypas FoivosSo, based on the above we should follow the first approach, as we have reads from all over the gene. If you all agree I will update the Snakefile rules according to this.
In general I will follow the following:
- Add +1 to the raw counts
- Convert the values to TPM
- Log2 transform
- Run PCA
The updated script is here: https://github.com/zavolanlab/zpca (v0.3). I still have to fix the rules.
Script v0.4 and pipeline are updated. @boersch Can you please have a look at the PCA output of the new pipeline?
added 1 commit
- 0bda3e58 - Update zpca version from 0.1 to 0.2. This versions addss a pseudocount (1/32)...
mentioned in issue #142 (closed)
added 1 commit
- 7dc88e1a - Update PCA analysis to perform per sample gene/transcript length normalization...
I ran the modified pipeline using the same data set (4 GCN4 samples from Nitish, 2 WT samples and 2 mutants). The outcome looks very similar across different quantifications (salmon genes, salmon transcripts, kallisto genes, kallisto transcripts), but differs a lot from the outcome obtained by adding a pseudocount to normalized expression (see above). Plots PC1 vs PC2 from the updated version are hard to interpret. If earlier PC1 clearly distinguished WT samples from mutants, now samples are distributed in the space without clear pattern. PC3 adds even more confusion. Shall we discuss the procedure again?
PC1 v PC2 for Salmon transcripts:
PC1 v PC2 for Kallisto transcripts:
PC1 v PC3 for Salmon transcripts:
Thanks for the feedback @boersch
Copy paste Mihaela's comment from slack:
Hi all There seem to be more than 1 difference between the old and new computations. @gypas Foivos do you remove genes/transcripts with very low expression before the PCA?
@zavolan I remove all rows with zero counts. When running the PCA, I could filter low expressed transcripts/genes based on TPM. Should I apply that as well?
Edited by BIOPZ-Gypas FoivosCopy paste Mihaela's comment from slack:
@boersch what sort of genes did you filter out before?
OK, so there is even more going on here. First, the "testing" PCA is done without removing low-expression genes, whereas the reference PCA, which is at the beginning of this thread, did remove the low expressed genes.
Second, pseudocounts seems to be added twice, once at the read level, by Foivos, and second time at the TPM level, by Anastasiya, according to the description above. We shouldn't do that. So please, let's test one variable at the time. To compare with the previous results, the expression levels gotten by Foivos should be log-transformed, no additional pseudocounts added, low-expression genes/transcripts should be removed, the data centered and then compared with the result at the beginning of this thread.
These are now results of the test if I remove genes not expressed in all samples (as @gypas did), then perform PCA as I described in the previous comment. They are almost identical to results without filtering.
Pseudocounts are not added twice, I take the expression matrix and perform transformations by myself.
@boersch If you have some time, can you please try once more with the last version?
Current pipeline
- Use TPMs
- filter transcripts were mean(TPM) per sample < 1
- Add +1 TPM as TPM pseudocount
- Log2 transform
- Run PCA
Edited by BIOPZ-Gypas Foivosadded 1 commit
- 398de08c - Update to zpca v0.7. Disable unit variance scalling. Keep genes/transcripts >0 instead of >1.
@boersch I would really appreciate it if you can try one more time. I added the changes we discussed. Thank you in advance.
@gypas The pipeline reported the following error in all rules, where
zpca-tpm
was called:Traceback (most recent call last): File "/usr/local/lib/python3.8/site-packages/pandas-1.1.3-py3.8-linux-x86_64.egg/pandas/core/internals/managers.py", line 1665, in create_block_manager_from mgr = BlockManager(blocks, axes) File "/usr/local/lib/python3.8/site-packages/pandas-1.1.3-py3.8-linux-x86_64.egg/pandas/core/internals/managers.py", line 149, in __init__ self._verify_integrity() File "/usr/local/lib/python3.8/site-packages/pandas-1.1.3-py3.8-linux-x86_64.egg/pandas/core/internals/managers.py", line 326, in _verify_integrity raise construction_error(tot_items, block.shape[1:], self.axes) ValueError: Shape of passed values is (6655, 3), indices imply (4, 3) During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/local/bin/zpca-tpm", line 4, in <module> __import__('pkg_resources').run_script('zpca==0.7', 'zpca-tpm') File "/usr/local/lib/python3.8/site-packages/pkg_resources/__init__.py", line 650, in run_script self.require(requires)[0].run_script(script_name, ns) File "/usr/local/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1453, in run_script exec(script_code, namespace, namespace) File "/usr/local/lib/python3.8/site-packages/zpca-0.7-py3.8.egg/EGG-INFO/scripts/zpca-tpm", line 131, in <module> File "/usr/local/lib/python3.8/site-packages/zpca-0.7-py3.8.egg/EGG-INFO/scripts/zpca-tpm", line 112, in main File "/usr/local/lib/python3.8/site-packages/pandas-1.1.3-py3.8-linux-x86_64.egg/pandas/core/frame.py", line 497, in __init__ mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy) File "/usr/local/lib/python3.8/site-packages/pandas-1.1.3-py3.8-linux-x86_64.egg/pandas/core/internals/construction.py", line 234, in init_ndarray return create_block_manager_from_blocks(block_values, [columns, index]) File "/usr/local/lib/python3.8/site-packages/pandas-1.1.3-py3.8-linux-x86_64.egg/pandas/core/internals/managers.py", line 1672, in create_block_manager_from raise construction_error(tot_items, blocks[0].shape[1:], axes, e) ValueError: Shape of passed values is (6655, 3), indices imply (4, 3)
@boersch I think it's fixed now. Please, let me know if you encounter any other problem.
added 1 commit
- fe1c0694 - zpca version 0.8. Fix error of table transpose.
I think, now PCA results for GCN4 samples of Nitish obtained in R and with
zarp
look very similar.Kallisto transcripts, PC1 vs PC2,
zarp
:Kallisto transcripts, PC1 vs PC2, R:
Kallisto genes, PC1 vs PC2,
zarp
:Kallisto genes, PC1 vs PC2,
R
:Salmon transcripts, PC1 vs PC2,
zarp
:Salmon transcripts, PC1 vs PC2, R:
Salmon genes, PC1 vs PC2,
zarp
:Salmon genes, PC1 vs PC2,
R
:
marked the checklist item @boersch check if the current PCA analysis gives similar results to the standard PCA analysis she is performing on a real dataset. as completed
marked the checklist item @bakma Check is there are further changes required, in order to integrate the output of the PCA analysis in multiqc. as completed