Skip to content
Snippets Groups Projects

PCA analysis on genes and transcripts TPM for kallisto and salmon

Merged BIOPZ-Gypas Foivos requested to merge pca into dev
  • Delete unused files scripts/fg_extract_transcripts.py, scripts/heatmap_and_clustermap.py, scripts/perform_PCA.py.
  • Add rules (pca_kallisto, pca_salmon) that use zpca on genes and transcripts TPM tables from kallisto and salmon.
  • The output is wired to multiqc_report but the plots are not visualized to multiqc.
  • Update documentation.
  • Update dag and rulegraph.
  • Note: No tests are added, because of random numbers used. No fixed md5sum.

Fixes #140 (closed) #142 (closed)

Before merging I would ask:

  • @boersch check if the current PCA analysis gives similar results to the standard PCA analysis she is performing on a real dataset.
  • @katsanto to review the new snakemake rules to make sure that they are according to the snakemake standards.
  • @bakma Check is there are further changes required, in order to integrate the output of the PCA analysis in multiqc.
  • @kanitz have a look at zpca and work on the merge request.

Thank you in advance.

Edited by BIOPZ-Gypas Foivos

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • changed milestone to %downstream

  • Depends really on how do you want to integrate it:

    • If you want a proper MultiQC module and interactive plots - I don't think scatterplots are supported, not even mentioning 3D scatterplots.
    • That would leave us with injecting plain png's just like we do it now for TIN scores. But if I remember correctly, in order for the plot to be automatically recognised by the aggregator the filename needs to start with mqc_.
    • I ran Zarp for some samples of GCN4 data set of Nitish and performed PCA for the gene expression quantified by salmon: image and compared to results produced by the pipeline: PCA_1_2

      The scatter looks similar, however, the percentage of the variance explained by PCs differs substantially. I checked the code that was used by the pipeline (https://github.com/zavolanlab/zpca/blob/master/scripts/zpca) and found some differences with my code:

      • Prior to all manipulations, the table with the gene expression should be log-transformed. Log-transformation makes the distribution of data close to normal and allows to perform PCA. This is a crucial point that should be implemented.
      • Also I performed only mean centering, I did not modify the variance in the expression (the function scale used in the pipeline performs both mean centering and scales the variance to 1). I don't know, how the function scale normalizes the variance to 1 for genes with the expression 0 in all samples.
    • Author Maintainer

      @boersch Thanks for the feedback. Can you try to run the updated version? I now add a pseudo count (1/32) everywhere and log2 transform. Just pull the current branch and delete the output for the PCA plots. I think it should look more similar now (though I still expect some changes).

    • We recently discussed that procedure with @zavolan and @boersch , pseudocounts should not be too low because after log-transformation they will artificially pull down the mean of the distribution of gene expression, thus, skewing the data centering. In that conversation we agreed to add 1 everywhere...

    • Author Maintainer

      @bakma Thanks for the feedback. I assume that you add +1 to the counts and not to the TPMs?

    • Hm.... no, the discussion was about TPMs I think...
      But now when I think of it... It should be counts, right?

    • Author Maintainer

      @zavolan @boersch Can you please comment on this? Did you discuss adding +1 on the raw counts, or on the TPM?

    • I usually added to TPMs. Recently we shortly discussed with @zavolan that it would make more sense to add a pseudocount to raw counts and then normalize to TPMs, but I didn't try this implementation so far.

    • Author Maintainer

      I just copy-paste here info from the slack discussion to keep it in the thread:

      @zavolan wrote

      @gypas I think there are currently 2 ways of doing this. The 'old' way was to add a pseudocount of 1 to the counts, and then calculate TPMs. Now, in the single cell analyses I think there is first a TPM calculation done and then a pseudocount is added. @boersch how/where do you implement the psedocounts?

      @boersch wrote

      First I normalize to TPMs, then I remove lowly expression genes, then I add a pseudocount, then I perform the log transformation, then I center the expression matrix, then I apply singular value decomposition, which provides principal components

      @zavolan wrote

      Hi all If we add to raw counts, then genes with no expression will get a baseline expression dependent on their length. If we add to TPM then genes with no expression will get the same pseudocount. I guess the first version is more appropriate when we have reads from all over the gene, the second would make more sense when we have 'tag' (e.g. 3' end sequencing).

      Edited by BIOPZ-Gypas Foivos
    • Author Maintainer

      So, based on the above we should follow the first approach, as we have reads from all over the gene. If you all agree I will update the Snakefile rules according to this.

      In general I will follow the following:

      • Add +1 to the raw counts
      • Convert the values to TPM
      • Log2 transform
      • Run PCA
    • Author Maintainer

      The updated script is here: https://github.com/zavolanlab/zpca (v0.3). I still have to fix the rules.

    • Author Maintainer

      Script v0.4 and pipeline are updated. @boersch Can you please have a look at the PCA output of the new pipeline?

    • Please register or sign in to reply
  • added 1 commit

    • 0bda3e58 - Update zpca version from 0.1 to 0.2. This versions addss a pseudocount (1/32)...

    Compare with previous version

  • mentioned in issue #142 (closed)

  • added 1 commit

    • 7dc88e1a - Update PCA analysis to perform per sample gene/transcript length normalization...

    Compare with previous version

  • BIOPZ-Gypas Foivos changed the description

    changed the description

    • I ran the modified pipeline using the same data set (4 GCN4 samples from Nitish, 2 WT samples and 2 mutants). The outcome looks very similar across different quantifications (salmon genes, salmon transcripts, kallisto genes, kallisto transcripts), but differs a lot from the outcome obtained by adding a pseudocount to normalized expression (see above). Plots PC1 vs PC2 from the updated version are hard to interpret. If earlier PC1 clearly distinguished WT samples from mutants, now samples are distributed in the space without clear pattern. PC3 adds even more confusion. Shall we discuss the procedure again?

      PC1 v PC2 for Salmon transcripts: PCA_1_2

      PC1 v PC2 for Salmon genes: PCA_1_2

      PC1 v PC2 for Kallisto transcripts: PCA_1_2

      PC1 v PC2 for Kallisto genes: PCA_1_2

      PC1 v PC3 for Salmon transcripts: PCA_1_3

      PC1 v PC3 for Salmon genes: PCA_1_3

      PC1 v PC3 for Kallisto transcripts: PCA_1_3

      PC1 v PC3 for Kallisto genes: PCA_1_3

    • Author Maintainer

      Thanks for the feedback @boersch

    • Author Maintainer

      Copy paste Mihaela's comment from slack:

      Hi all There seem to be more than 1 difference between the old and new computations. @gypas Foivos do you remove genes/transcripts with very low expression before the PCA?

    • Author Maintainer

      @zavolan I remove all rows with zero counts. When running the PCA, I could filter low expressed transcripts/genes based on TPM. Should I apply that as well?

      Edited by BIOPZ-Gypas Foivos
    • Author Maintainer

      Copy paste Mihaela's comment from slack:

      @boersch what sort of genes did you filter out before?

    • For testing I did not filter genes. I took the expression of genes from salmon in tpms, added pseudocount, made log2 transformation, performed mean centering and singular value decomposition of the resulting matrix.

    • Please register or sign in to reply
  • OK, so there is even more going on here. First, the "testing" PCA is done without removing low-expression genes, whereas the reference PCA, which is at the beginning of this thread, did remove the low expressed genes.

    Second, pseudocounts seems to be added twice, once at the read level, by Foivos, and second time at the TPM level, by Anastasiya, according to the description above. We shouldn't do that. So please, let's test one variable at the time. To compare with the previous results, the expression levels gotten by Foivos should be log-transformed, no additional pseudocounts added, low-expression genes/transcripts should be removed, the data centered and then compared with the result at the beginning of this thread.

  • These are now results of the test if I remove genes not expressed in all samples (as @gypas did), then perform PCA as I described in the previous comment. They are almost identical to results without filtering.

    image

    Pseudocounts are not added twice, I take the expression matrix and perform transformations by myself.

  • added 1 commit

    Compare with previous version

  • Author Maintainer

    @boersch If you have some time, can you please try once more with the last version?

    Current pipeline

    • Use TPMs
    • filter transcripts were mean(TPM) per sample < 1
    • Add +1 TPM as TPM pseudocount
    • Log2 transform
    • Run PCA
    Edited by BIOPZ-Gypas Foivos
  • added 1 commit

    • 398de08c - Update to zpca v0.7. Disable unit variance scalling. Keep genes/transcripts >0 instead of >1.

    Compare with previous version

  • Author Maintainer

    @boersch I would really appreciate it if you can try one more time. I added the changes we discussed. Thank you in advance.

    • @gypas The pipeline reported the following error in all rules, where zpca-tpm was called:

      Traceback (most recent call last):
        File "/usr/local/lib/python3.8/site-packages/pandas-1.1.3-py3.8-linux-x86_64.egg/pandas/core/internals/managers.py", line 1665, in create_block_manager_from
          mgr = BlockManager(blocks, axes)
        File "/usr/local/lib/python3.8/site-packages/pandas-1.1.3-py3.8-linux-x86_64.egg/pandas/core/internals/managers.py", line 149, in __init__
          self._verify_integrity()
        File "/usr/local/lib/python3.8/site-packages/pandas-1.1.3-py3.8-linux-x86_64.egg/pandas/core/internals/managers.py", line 326, in _verify_integrity
          raise construction_error(tot_items, block.shape[1:], self.axes)
      ValueError: Shape of passed values is (6655, 3), indices imply (4, 3)
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "/usr/local/bin/zpca-tpm", line 4, in <module>
          __import__('pkg_resources').run_script('zpca==0.7', 'zpca-tpm')
        File "/usr/local/lib/python3.8/site-packages/pkg_resources/__init__.py", line 650, in run_script
          self.require(requires)[0].run_script(script_name, ns)
        File "/usr/local/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1453, in run_script
          exec(script_code, namespace, namespace)
        File "/usr/local/lib/python3.8/site-packages/zpca-0.7-py3.8.egg/EGG-INFO/scripts/zpca-tpm", line 131, in <module>
        File "/usr/local/lib/python3.8/site-packages/zpca-0.7-py3.8.egg/EGG-INFO/scripts/zpca-tpm", line 112, in main
        File "/usr/local/lib/python3.8/site-packages/pandas-1.1.3-py3.8-linux-x86_64.egg/pandas/core/frame.py", line 497, in __init__
          mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
        File "/usr/local/lib/python3.8/site-packages/pandas-1.1.3-py3.8-linux-x86_64.egg/pandas/core/internals/construction.py", line 234, in init_ndarray
          return create_block_manager_from_blocks(block_values, [columns, index])
        File "/usr/local/lib/python3.8/site-packages/pandas-1.1.3-py3.8-linux-x86_64.egg/pandas/core/internals/managers.py", line 1672, in create_block_manager_from
          raise construction_error(tot_items, blocks[0].shape[1:], axes, e)
      ValueError: Shape of passed values is (6655, 3), indices imply (4, 3)
      
    • Author Maintainer

      Yes, I noticed. I am fixing it now. I will let you know once I am done, so don't spend any time now.

    • Author Maintainer

      @boersch I think it's fixed now. Please, let me know if you encounter any other problem.

    • Please register or sign in to reply
  • added 1 commit

    • fe1c0694 - zpca version 0.8. Fix error of table transpose.

    Compare with previous version

    • I think, now PCA results for GCN4 samples of Nitish obtained in R and with zarp look very similar.

      Kallisto transcripts, PC1 vs PC2, zarp: PCA_1_2 Kallisto transcripts, PC1 vs PC2, R: image

      Kallisto genes, PC1 vs PC2, zarp: PCA_1_2 Kallisto genes, PC1 vs PC2, R: image

      Salmon transcripts, PC1 vs PC2, zarp: PCA_1_2 Salmon transcripts, PC1 vs PC2, R: image

      Salmon genes, PC1 vs PC2, zarp: PCA_1_2 Salmon genes, PC1 vs PC2, R: image

    • Author Maintainer

      Finally! Thanks a lot for checking. I think we can merge now.

    • Please register or sign in to reply
  • BIOPZ-Gypas Foivos marked the checklist item @boersch check if the current PCA analysis gives similar results to the standard PCA analysis she is performing on a real dataset. as completed

    marked the checklist item @boersch check if the current PCA analysis gives similar results to the standard PCA analysis she is performing on a real dataset. as completed

  • BIOPZ-Gypas Foivos marked the checklist item @bakma Check is there are further changes required, in order to integrate the output of the PCA analysis in multiqc. as completed

    marked the checklist item @bakma Check is there are further changes required, in order to integrate the output of the PCA analysis in multiqc. as completed

  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
Please register or sign in to reply
Loading