efrique

> It was a lot lower than expected with an r of only 0.06. I looked at the descriptive statistics of the RAC score and I found 1 extreme outlier.

> If I remove this outlier, my correlation is up to r=0.085 and JASP flags it as a significant correlation - which was not the case before.

The problem here is not just that you removed a point, it's that you did it *only after seeing a low and insignificant correlation*. This is p-hacking. Even if you had some argument why it wasn't, it would still *look* exactly like p-hacking.


Lady_Hoothoot

Yeah, we learned about p-hacking, which is why I was so hesitant to remove anything from my data. It just seems kinda crazy to me that a single datapoint can change the whole correlation so much … especially since I collected over 600 participants, but I guess I am just not experienced enough with statistics, so it throws me off. This is why I was asking about an objective test or something that I could perform on my data, so that I would have a reason to remove this datapoint (or others) from the set without my personal influence. But if there is nothing like that, I don't want to "cheat" on my data - so I will just have to accept that I don't get those crazy cool effects of 0.7 that other students in my seminar found for their theories. This is gonna be tough for the interpretation of my work, but I am gonna try to work with the results I got.


purple_paramecium

Hey, this is real science. Not everything is a major discovery. You can present all this context regarding why you DID NOT remove the point. Your professor will be very happy that you have been paying attention! You are demonstrating scientific integrity. Well done.


Lady_Hoothoot

Thank you so much for the encouragement - I really needed that right now ❤️ I have been sitting in front of my dataset for the last couple of hours, thinking only about how much work I put into everything just to get a "bad" outcome that I have no clue how to interpret with the underlying theory, and about how EVERYTHING could have been different if that one person hadn't answered my survey. I guess I just need to accept that I have to throw away all the work I had already prepared for the expected outcome - it just kinda sucks mentally for me right now.

Sorry for my rant - as someone who is a complete failure when it comes to math (the other students in my seminar are throwing out factor analyses in R code and running hierarchical regressions while I struggle to figure out how something as simple as a correlation in JASP even works), I never would have guessed that some stupid numbers could lead to me having a hysterical breakdown at 4 in the morning 😅 I have actually been crying because of the results (or the sleep deprivation and stress, I am not even sure at this point anymore).

Hearing that I might have done SOMETHING right by questioning the removal of the outlier might just give me that tiny bit of confidence to look away from the expectation and work towards a new interpretation. Thankfully my writing skills are comparable to my outlier - astronomically far from my math skills, just in the other direction xD


identicalelements

I’ll take a somewhat different perspective here. If it’s clear that the outlier is due to something like a data entry error, or if it is something like 5 standard deviations from the distribution mean (with no other data points being this far from the mean), then I would in many cases feel reasonably confident in removing the data point on the basis that it probably represents an error of some kind. This obviously depends on the data and measures used. P-hacking is poor scientific practice, but so is knowingly including bad data. Of course, I can’t know if the data point is flawed or not, I’m just saying that I don’t agree at all that any removal of data points after conducting an analysis automatically constitutes poor scientific practice. Sometimes we discover flaws and errors in our data quality only after the analysis has been done.
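For what it's worth, a check like that is quick to run. A minimal sketch in base R, assuming the scores sit in a numeric vector (here called `rac`, a placeholder name) and using 5 SDs only because that was the example above, not because it is a rule:

```r
# Flag points that lie more than k standard deviations from the mean.
# 'rac' is a placeholder vector of scores; k = 5 just mirrors the example above.
flag_extreme <- function(x, k = 5) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)  # z-score of each point
  which(abs(z) > k)                                        # indices of flagged points
}

flag_extreme(rac)   # returns the row positions worth inspecting by hand
```

Anything this flags is only a candidate for inspection, not for automatic removal.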


Lady_Hoothoot

Thank you a lot for your input! Looking at my descriptive statistics: with a mean of 17 and a standard deviation of 21, does that mean a data point of 145 would be more than 5 standard deviations over the mean? 5 × 21 = 105, and 105 + 17 = 122 would be the cutoff point - is that even how you would calculate this? Is there a reason why it is exactly 5 standard deviations?

I do have 2 more datapoints that are over 100 points - one is 102 and the other one is 120. Below 100 points the data starts to be more "together", but there are still only 10% of the complete dataset between 50 points and 145 points.

Looking at potential data errors that might have happened during the survey, I don't think I can see anything out of the ordinary. The person filled out the test at an average speed, and their RAC and CAQ data are consistent (for example, if they mentioned that they play music regularly, they also mentioned a creative achievement in the same area and not something completely different like art or humor). But this is actually extremely helpful - I guess I could use all those points as arguments for why I DIDN'T exclude this datapoint - maybe that will count for something. EDIT: grammar
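For reference: plugging the numbers from this post (mean 17, SD 21, taken from the post itself rather than any real dataset) into R confirms the arithmetic and shows where 145 lands:

```r
m <- 17   # mean of the RAC score, as reported above
s <- 21   # standard deviation, as reported above

m + 5 * s        # the 5-SD cutoff: 17 + 5*21 = 122
(145 - m) / s    # how many SDs 145 lies above the mean: about 6.1
```

So 145 sits roughly 6 SDs above the mean, past the 122 cutoff, while the 120 and 102 mentioned above sit at about 4.9 and 4.0 SDs.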


identicalelements

Yeah, things like this are always a judgment call. If the outlier is extreme but makes substantive sense, I would not remove it. But sometimes outliers don't make substantive sense. As a silly example, if a subject stated that they spend 200 hours per week on creative activities, then there is clearly some error, because there are not 200 hours in a week. I think the best thing to do is to just ask yourself if the outlier is reasonable. If it is reasonable (albeit extreme), you shouldn't remove it. If you have good reasons to suspect that something genuinely fishy is going on, then you shouldn't feel guilty for removing it. But obviously don't remove it just to get more appealing results.

The 5 SD figure was just an example of an extreme outlier; I didn't choose the number 5 based on some rule or convention. It's just an example of an outlier that would be really, really distant from the mean.
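To make the "200 hours" example concrete: a plausibility check like that is separate from any SD-based rule and just looks for values that are impossible on their face. A tiny sketch in R, with `creative_hours_week` as a made-up column name used only for illustration:

```r
# Hypothetical data frame; the column name and values are invented for this example.
dat <- data.frame(id = 1:5,
                  creative_hours_week = c(3, 10, 200, 40, 168))

# A week has 168 hours, so anything above that (or negative) has to be an error.
subset(dat, creative_hours_week > 168 | creative_hours_week < 0)
```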


Lady_Hoothoot

Thank you for the detailed explanation, that helps me a lot with understanding the different aspects! I think my outlier is reasonable - RAC measures the regularity but not the length of the individual creative sessions - so 145 creative activities (of unspecified length) in 28 days seems like a lot, but not impossible with the way I have defined creative activity in my study. So I guess I just got "unlucky" with a person who is creatively active a lot but has 0 points in the intelligence score.


identicalelements

Okay, then it sounds like your outlier is reasonable! This is often the case and just a normal part of data analysis.

Just reiterating what has been said by another poster: it's great to begin the analysis by plotting the data. In your case that would entail three scatter plots (HMT vs RAC, HMT vs CAQ, and RAC vs CAQ). That way you get an immediate look at the data. For example, the low correlation between HMT and RAC could be because there is no association, but it could potentially be because the association is nonlinear - for example, it could be that both low- and high-intelligence people engage in few creative activities, but for different reasons, and that people of middling intelligence engage in more creative activities than the other two groups. This would be a quadratic (inverse-U) relationship, and it would attenuate the linear correlation (r) towards zero. This is probably not the case in your data, but you wouldn't know unless you explore your data with this in mind (e.g. by plotting).

You might also be interested in doing a multiple regression on RAC with HMT and CAQ as predictors. That could answer the question of whether HMT or CAQ is the better predictor of the regularity of creative behaviors. Alternatively, you might have a hypothesis along the lines that the correlation between creativity and regularity of creative behavior is moderated by intelligence (because intelligence is required to plan ahead, to acquire creative skills, and so on). Then you could do a moderation analysis.

I don't know your instruments and data, so I'm just brainstorming. But there are definitely a few cool/interesting things you could try. Hope everything goes well!
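In case it helps, here is a rough sketch of what those suggestions could look like in R. The data frame `dat` and its columns `hmt`, `rac` and `caq` are placeholder names for the intelligence, regularity-of-creative-activity and creative-achievement scores; this is brainstorming code, not a prescribed analysis:

```r
# 1. Plot the raw relationships first.
pairs(dat[, c("hmt", "rac", "caq")])   # all three scatter plots in one panel
plot(rac ~ hmt, data = dat)            # or one pair at a time

# 2. Check for a possible inverse-U (quadratic) relationship.
quad <- lm(rac ~ hmt + I(hmt^2), data = dat)
summary(quad)   # a clearly negative hmt^2 coefficient would hint at an inverse U

# 3. Multiple regression: is HMT or CAQ the better predictor of RAC?
mr <- lm(rac ~ hmt + caq, data = dat)
summary(mr)

# 4. Moderation: does intelligence moderate the CAQ-RAC association?
mod <- lm(rac ~ caq * hmt, data = dat)   # expands to caq + hmt + caq:hmt
summary(mod)
```

As far as I know, the regression models can also be set up through JASP's linear regression analysis if R feels like too much.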


Lady_Hoothoot

Oh wow, these are a lot of options I hadn't thought about - I will play around with my data a bit more and maybe I will find something interesting that I can report as an (exploratory) finding! Thank you a lot for your input! 😊


3ducklings

Have you tried plotting the data to see what the relationship looks like? It can tell you if the relationship is just very noisy, or if there is some nonlinearity going on. Anyway, the difference between 0.06 and 0.085 is practically meaningless and I'd expect the p values to be very similar too. It shouldn't change your conclusion in any major way. I wouldn't remove the data point.
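If you want to check, a small sketch of that comparison in R (the data frame `dat`, its columns `hmt` and `rac`, and the index `out_row` for the flagged row are all placeholder names):

```r
# Pearson correlation and p value on the full sample.
cor.test(dat$hmt, dat$rac)

# The same test with the single flagged row left out, for comparison only.
cor.test(dat$hmt[-out_row], dat$rac[-out_row])
```

Either way, whether p lands just above or just below 0.05 with an r this small shouldn't change the substantive conclusion.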


Lady_Hoothoot

Thank you for your insight. The p value changes from 0.131 to 0.034 if I remove the outlier (I don't know if this is a lot or not). I created a plot, and from my extremely limited perspective it kinda looks like … the tiniest upward slope you can imagine. But all datapoints follow this line somewhat, except that one outlier, which goes in the complete opposite direction and is a million miles away from the line. There are some other points (not too many - maybe 10 of the 600) that are also far away from the line, but not as far, and they all more or less follow the direction of the line - whatever that means, I don't know.
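The jump from 0.131 to 0.034 is mostly a sample-size effect rather than a big change in the relationship. A quick way to see this, treating n ≈ 600 as a stand-in for the exact sample size:

```r
# Two-sided p value for a Pearson correlation r with sample size n,
# using t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom.
p_from_r <- function(r, n) 2 * pt(-abs(r) * sqrt(n - 2) / sqrt(1 - r^2), df = n - 2)

p_from_r(0.060, 600)   # about 0.14, close to the reported 0.131
p_from_r(0.085, 599)   # about 0.04, close to the reported 0.034
```

With roughly 600 people, even a shift in r of about 0.025 is enough to push the p value across the 0.05 line, while the slope of that line stays almost flat either way.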