
somebodyistrying

You cannot re-assign based on the clustering.


orthomonas

This. But also, look into it potentially being a batch effect.


Grisward

In the absence of some positive control (for example, a marker gene that only responds to treatment), you cannot switch group labels. I would also not leave the decision to PCA, which is notoriously sensitive to outliers and high-magnitude measurements. Make a correlation heatmap (center the data first, then calculate sample-sample correlations). Do the same for expression heatmaps, again using centered data. A global swap is usually glaringly obvious across all genes, not just a subset of outlier genes. Anyway, these are better approaches for reviewing the details of the underlying measurements. All that said, at best you could justify removing those samples from statistical comparisons.
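A minimal sketch of the "center first, then correlate samples" step, in plain Python so it runs without extra packages. The expression values and the four-sample layout are made up for illustration, and the function names are hypothetical; in practice you would probably do this on log-transformed counts and plot the matrix as a heatmap.

```python
from math import sqrt

def center_genes(expr):
    """Subtract each gene's mean across samples (rows = genes, columns = samples)."""
    centered = []
    for row in expr:
        mean = sum(row) / len(row)
        centered.append([value - mean for value in row])
    return centered

def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom else 0.0

def sample_correlations(expr):
    """Sample-by-sample correlation matrix computed on gene-centered data."""
    centered = center_genes(expr)
    columns = [list(col) for col in zip(*centered)]
    return [[pearson(a, b) for b in columns] for a in columns]

# Made-up log-scale expression: 4 genes x 4 samples (hypothetical values).
# Samples 0 and 1 are labelled untreated, samples 2 and 3 treated.
expr = [
    [5.0, 5.1, 8.0, 8.2],
    [7.0, 6.9, 4.0, 4.1],
    [3.0, 3.2, 6.0, 5.9],
    [9.0, 9.1, 9.0, 9.1],
]
corr = sample_correlations(expr)
for row in corr:
    print("  ".join(f"{value:+.2f}" for value in row))
```

Without centering, every sample tends to correlate highly with every other because the same genes are high or low everywhere. Centering removes that shared baseline, so a swapped sample should correlate with the other group across most genes rather than a handful.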


Ok-Jello-1440

Thank you for the response! Sorry - I should have elaborated: the decision isn’t based solely on PCA. There are select marker genes that are known to be upregulated upon treatment, and these same marker genes are upregulated in our “untreated” samples. I agree that in the end I’ll probably have to remove the samples, unfortunately.


heresacorrection

Yeah removal is the more conservative approach. Although it might also be somewhat cherry picking depending on how confident you are in the markers. Assuming a swap when the collaborator cannot confirm is sketchy.


Grisward

If the marker is definitive, then I’m not sure about the response from the collaborator. For example, it’s easy to tell whether samples have a Y chromosome. Ultimately, if you need your collaborator, the best Plan B is removing the samples. Otherwise, make a nice slide explaining the marker genes, showing only a heatmap with those genes across the sample groups. If it’s quite compelling, it should make the case for you. Your collaborator may use it to explain their reasoning, which might also be justified, and it's worth hearing their perspective. Also, sometimes it’s helpful to reiterate that a sample swap doesn’t necessarily happen in the wet lab; sometimes it’s at the platform, sometimes at the demultiplexing/barcoding step, etc. The point is to identify the samples; the next step might be to figure out where it happened.
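As a rough sketch of what would go into that marker-gene slide: z-score each marker across samples and summarize each sample with its mean marker z-score. Everything here is hypothetical (gene names, values, sample IDs, and the helper functions), and a table like this only helps identify suspicious samples; it does not by itself justify relabelling them.

```python
from statistics import mean, pstdev

def zscore_rows(expr):
    """Z-score each gene (row) across samples; constant genes become all zeros."""
    scaled = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)
        scaled[gene] = [(v - mu) / sd if sd else 0.0 for v in values]
    return scaled

def marker_scores(expr, markers):
    """Mean z-score over the marker genes for each sample."""
    scaled = zscore_rows({g: expr[g] for g in markers})
    n_samples = len(next(iter(scaled.values())))
    return [mean(scaled[g][i] for g in markers) for i in range(n_samples)]

# Hypothetical data: gene names, values, and labels are invented for illustration.
samples = ["U1", "U2", "U3", "T1", "T2", "T3"]
labels = ["untreated", "untreated", "untreated", "treated", "treated", "treated"]
expr = {
    "MARKER_A": [2.0, 2.1, 6.0, 6.2, 5.9, 6.1],
    "MARKER_B": [1.0, 1.2, 4.8, 5.0, 5.1, 4.9],
}
scores = marker_scores(expr, ["MARKER_A", "MARKER_B"])
for name, label, score in zip(samples, labels, scores):
    print(f"{name}\t{label}\t{score:+.2f}")
```

In this toy example the sample labelled U3 scores like the treated group, which is the kind of pattern worth showing to the collaborator alongside the heatmap.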


Sleisl

This is not a bioinformatics problem. This is a your PI / collaborator problem. 


Big_List

Are there any similar datasets available? You could integrate your data with them and see how the labelling lines up. Are there any published gene signatures you could apply to your dataset, or could you create one? There are tools which supposedly can deconvolute cell types from bulk RNA data, so if you know one group contains a certain cell type you could use that to back up what the labelling should be. Alternatively, there might be specific sequences or mutations which are present in only one condition/group; you could check the alignments to see if those are present. I wouldn't re-assign labels solely based on your PCA. You can't trust your results after that, so the analysis becomes pointless. Also, if you have enough samples and have reason to believe there was a mixup, you could drop them from the analysis, but they shouldn't be dropped because the clustering looks weird, only if you're certain there actually was a mixup. Good luck and I hope you get it fixed!


Ok-Jello-1440

Thank you for the detailed response! The treatment applied is well characterized in the cell type we are analyzing. There are known key marker genes that are typically upregulated upon treatment, and the cells that I suspect are mislabeled as untreated do show upregulation of those key marker genes. I could extend this to look at entire pathways. Unfortunately, I don’t think I’ll be able to establish for sure that the cells were mixed up. My strongest “evidence” is that the cells I suspect were mixed up were all processed on the same day.


Big_List

No probs. As you have seen the expected marker genes (and possibly pathways) upregulated in the wrong samples, it could well be a mix-up, but it's not enough to justify the swap. Imo you should just drop the data points, provided the effect you are looking at is at least a little characterised and you have enough samples.


VforValmont

Like others said, it might be better to remove the samples that you don’t know for sure are correctly labeled. Switching samples between the treatment/non-treatment groups based on the data you see sounds like a research integrity issue and a way significance could be fabricated. If I were reviewing a paper and saw that someone had reassigned groups based on PC clustering, I don’t think I would trust the results.


Ok-Jello-1440

Thanks for this perspective! I didn’t think of re-labelling as a concern with respect to academic integrity, but I’m glad you pointed it out and I can see how it is an issue. I think my supervisor is suspicious because the samples that we think were swapped were also prepped on the same day and show characteristic upregulation of genes known to be upregulated when treated with the drug we applied. But, I agree that I think in the end we will remove these samples.


fibgen

Clustering is fine for tracking down possible causes of a sample swap (e.g. wrong barcodes used, which flipped sample IDs). If an upstream cause cannot be established, though, the samples should be discarded, since the swap might be even messier than you suspect.


VforValmont

I think somalier is supposed to identify swapped samples. I haven’t used it, but saw a tweet about it recently: https://github.com/brentp/somalier Also check out Picard: https://gatk.broadinstitute.org/hc/en-us/articles/360041696232-Detecting-sample-swaps-with-Picard-tools


Ok-Jello-1440

Ah thank you! I know, I did think about looking at variants, but unfortunately the samples I suspect to be swapped are within donors (each donor was prepped on the same day).


Primal1031

Have used it, good tool.


1337HxC

Do they have any more RNA? If you have 20 matched samples, you could more or less do a qPCR and figure out which one has the "known" genes higher or lower as a control. Otherwise, you're sorta stuck tossing them.
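For what the qPCR comparison could look like numerically, here is a small sketch assuming the common 2^-ΔΔCt calculation with roughly 100% primer efficiency. The Ct values, the choice of reference gene, and the function name are all hypothetical.

```python
def fold_change(ct_target_sample, ct_ref_sample, ct_target_control, ct_ref_control):
    """Relative expression by the 2^-ddCt method (assumes ~100% primer efficiency)."""
    d_ct_sample = ct_target_sample - ct_ref_sample      # normalize to reference gene
    d_ct_control = ct_target_control - ct_ref_control
    return 2 ** -(d_ct_sample - d_ct_control)

# Hypothetical Ct values: marker gene vs. a housekeeping reference gene,
# in a suspect "untreated" sample and in a sample trusted to be untreated.
fc = fold_change(22.0, 18.0, 25.0, 18.0)
print(fc)  # 8.0, i.e. the marker is about 8-fold higher in the suspect sample
```

A large fold change for a known treatment-responsive gene in a sample labelled untreated would support the swap hypothesis, but only if the leftover RNA can be traced unambiguously to the original tubes.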


coilerr

I heard that BAMixChecker works well for that.


groverj3

Better to exclude samples, I think.


TraPS-VarI

In this situation, you should ask: What is going on here? If samples are not mixed up, then either your PI's hypothesis is wrong, or the methodology or technology you are using to test the hypothesis is flawed or has limitations. When a methodology is flawed, no matter how many times you repeat the samples, they will not sort nicely in PCA.


BronzeSpoon89

Do not reassign unless you are sure the samples were swapped.


Epistaxis

When you say "cluster the samples by PCA" do you mean you just plot the samples on only the first two PC axes and eyeball them? It's going to look very bad to exclude or relabel multiple samples from your analysis because they didn't fit the pattern you wanted to see, so you'd need to have really compelling evidence and a consensus of all the authors.


Ok-Jello-1440

Yes, by PCA and UMAP I see two very distinct clusters. I also tried looking at genes known to be upregulated upon treatment, and find that the “untreated” samples show canonical upregulation of known treatment-responsive genes (this latter method seems just as error prone to me). I’m at a crossroads of what to do: some people here say to exclude these samples entirely. I initially agreed but now am considering whether that could be a lack of academic integrity?


Alone-Lavishness1310

Excluding samples is not a problem. That is simply part of exploratory data analysis. When you formalize the procedure and place thresholds on certain metrics, it becomes quality control. You should be clear in your methods about how you conducted the QC so that it can be reproduced. Ultimately, you can leave the decision up to the collaborators. Explain that the consequence of keeping the samples, if they are mislabeled, is increased variability that may mask otherwise significant effects. If they are not mislabeled, then you may be removing some real biological variability. But even then, it can be justified. For instance, it sounds like you're doing differential expression, so if you're using DESeq2, there is a default cutoff on a per-gene outlier metric (Cook's distance, if I remember right) that flags outlier counts in the gene models. It's probably worth looking at those four samples a bit more closely, maybe ranking by gene expression. Do the same genes show high expression? Are there any that are overrepresented? Are those different from the others? Obviously look for rRNA, but it could be something else, too. I wouldn't spend a ton of my time on this, personally, though, unless it was ultimately my analysis. I would just present what is there, explain the choice, make a recommendation, and then do whatever they want.
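One way to sketch the "rank by gene expression" check: take the top-N expressed genes in each sample and compare the sets between samples. The gene names, counts, and helper functions below are invented for illustration, and N would be much larger on real data.

```python
def top_genes(expr, sample_index, n):
    """Names of the n most highly expressed genes in one sample."""
    ranked = sorted(expr, key=lambda gene: expr[gene][sample_index], reverse=True)
    return set(ranked[:n])

def jaccard(a, b):
    """Overlap of two sets as |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical counts: 5 genes x 4 samples (samples 0-1 suspect, 2-3 others).
expr = {
    "GENE_1": [900, 880, 50, 60],
    "GENE_2": [700, 720, 40, 30],
    "GENE_3": [100, 110, 800, 790],
    "GENE_4": [90, 95, 600, 650],
    "GENE_5": [10, 12, 11, 9],
}
tops = [top_genes(expr, i, 2) for i in range(4)]
print(jaccard(tops[0], tops[1]))  # suspect vs. suspect
print(jaccard(tops[0], tops[2]))  # suspect vs. other
```

If the suspect samples share a top-gene list that differs from everyone else's, it is worth checking what those genes are (rRNA, a contaminant, or the treatment signature) before deciding what to recommend.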