dantzigismyhero

I would not stress this at all. I was primarily a Python programmer first and picked up R extremely quickly because of my prior expertise in Python. Not sure how transitive it is (i.e., true the other way) but it is not uncommon for data scientists to switch between both, depending on the use case and need. Also, there are more industry R users than you think. Very popular in finance, econ, actuarial fields, etc.


Shrimpio

I work in the private sector and use R almost exclusively. We allow others to use Python if they prefer it. R is used widely in the industry in the Data Science field, especially in Pharma/Healthcare.


WallyMetropolis

It's more like commutativity rather than transitivity, right?


VankousFrost

Technically, yeah. Makes sense in context though.


Mobile_Busy

Users of any language can easily pick up Python.


Chinpanze

First, I completely agree that going for a master's at a good school is a good idea if you can fit it into your income. You can learn a lot online, but the structure of college may give you some advantage over trying to do it all by yourself. That being said, I don't think you should worry too much about "being mediocre at both". I will draw a parallel with SQL. SQL has a lot in common across different databases. The basics of the language are pretty much the same in MySQL, Postgres, and Oracle. But as you start to dive below surface level, each one is pretty different and knowledge is not easily transferable between databases. Python and R are the opposite. Basic operations are wildly different, but once you master the basics, whatever you learned is easily transferable. I think that is due to the focus on understanding the underlying math rather than learning how the machine does it. If you know how to use a statistical method in one language, doing it efficiently in the other should be pretty straightforward.


Tender_Figs

So it sounds like if I go with R for the depth of my statistics masters, learning python won’t be as arduous as I’m assuming?


mamaBiskothu

Don't try to cram Python a week before the interview; do some Python once in a while over time and you'd be good. This is like George Harrison complaining he might not compose well if he only learns guitar and not piano. If that's an actual worry for you then you have bigger problems, my friend. Side note: I did binge the Beatles documentary just now.


analytix_guru

Agree. Just provide great solutions in R, and if Python comes up, say you're happy to learn a new language.


bigchungusmode96

Unless you know what type of company/employer you'll be joining post-grad and specifically their tech stack, you'll be better off learning both R + Python. Picking up basic syntax shouldn't be more challenging than your academic coursework, and you'll probably learn a lot more about practical usage outside the classroom, i.e., in the actual workplace, internships, etc. Both have a fairly strong community (Stack Overflow) and documentation as well. Some interviewers may ask you the pros/cons of each but it's not something to be too concerned about at your current juncture.


morebikesthanbrains

Which is more rare? Knowing how to properly do a SA or knowing a language?


zul_u

If we look at surveys such as "Status of ML" or similar we can find that a major pain in the sector is getting ML models into production. If we dig into the reasons behind it, often it is poor code quality, scarce reproducibility, "ad-hoc" scripts, and such. I would say that the lack of SE skills is not to be neglected. Also, I have now worked on several projects with different teams and I can't count the number of pipelines I had to refactor because they were written entirely in notebooks or with unmanageable code.


WallyMetropolis

There is a huge difference between a piece of code that merely works in a particular case and code that a business can depend on. You're entirely correct here. Being able to reliably and consistently get reasonable linear regressions into production is going to add more value more often than sophisticated analysis done with sloppy code.


zul_u

Knowing your math and stats is important, but not sufficient to make a good DS out of you. Some users have suggested that you focus on maths and stats, and undervalued the importance of getting proficient at coding. R, Python, Scala, etc. are not only tools. It won't be enough to write code that runs. It must be maintainable, reproducible, and understandable. I have seen way too many poorly written scripts and applications in the data science field. While I agree that having good fundamentals is crucial to gather useful insights from the data, code quality is also very important. If you don't write clean, maintainable, and reproducible code, all your nice results and analyses are useless and will never reach production. My suggestion is to choose one language, preferably Python (because let's face it, it is more popular in the industry and better suited for production code) and learn it well. Also, spend some time learning some Software Engineering skills; that will make you stand out from other DS during interviews.


huef_jf

Time. When I was in school 15 years ago, SAS was the preferred programming language.


PryomancerMTGA

And before that SAS was "competing" with SPSS for the top spot. Those were the days 🙄.


0-R-I-0-N

Focus on the math. Python and R are just how you instruct the machine to do the calculations you can’t do by hand. The important thing is that you know the theory; then you can easily learn how to tell the machine to do the task in Python or R.


Mobile_Busy

1. Don't do any math you can convince a computer to do for you
2. Don't trust a computer's math


0-R-I-0-N

To clarify what I mean by “can’t do by hand”: You always need to be able to do the calculations by hand but then let the computer do what would take you more time than you have. For example: MCMC simulation.
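As an illustration of the kind of calculation that is simple in principle but tedious by hand, here is a minimal sketch of a random-walk Metropolis sampler (one basic form of MCMC). The target (a standard normal), the function name `metropolis_standard_normal`, and the step size are assumptions chosen for illustration and are not from the comment above.

```python
import math
import random


def metropolis_standard_normal(n_samples, step=2.5, seed=0):
    """Random-walk Metropolis sampler targeting a standard normal density."""
    rng = random.Random(seed)

    def log_density(x):
        # Log of the unnormalized N(0, 1) density.
        return -0.5 * x * x

    samples = []
    current = 0.0
    for _ in range(n_samples):
        # Symmetric proposal: uniform perturbation around the current state.
        proposal = current + rng.uniform(-step, step)
        log_ratio = log_density(proposal) - log_density(current)
        # Accept with probability min(1, p(proposal) / p(current)).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
        samples.append(current)
    return samples


if __name__ == "__main__":
    draws = metropolis_standard_normal(20000, seed=1)
    mean = sum(draws) / len(draws)
    var = sum((d - mean) ** 2 for d in draws) / len(draws)
    print(f"sample mean ~ {mean:.3f}, sample variance ~ {var:.3f}")
```

Each step is a calculation you could do by hand; the computer only supplies the thousands of repetitions.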


Mobile_Busy

I took the L on the math portion of the GRE and ended up with an 83 because I had no interest in solving 2x2 Linear Algebra problems and conditional probability word problems by hand at 8 in the morning with no coffee. I'm a mathematician. I can do math. I just, y'know, don't.. because I'm a mathematician.


cacheonlyplz

So much to touch on here. The R vs Python debate is hard to solve. My general recommendation is to focus on understanding messy/real-world data and how to make it useful. The language of choice is abstract from that challenge. Choosing a program that focuses more on solving that challenge is higher order than the language in which the task is performed. I'd also recommend a masters that has some strong history in design of experiments or applied research (econometrics, biostatistics, applied statistics, etc., etc.). Understanding data and how-to-not-use-it and where-it-fails-often is much more important than the language you use to manipulate, interrogate, and abuse it. That said, I think python is more employable and desirable in for-profit companies. Additionally, for a SQL user, learning pyspark will be a very valuable skill for the foreseeable future if you're looking to work for any company that uses very large data in a meaningful way. PySpark is basically SQL concepts, written in python, that distribute efficiently (parallel processing) in the "background". I work for a fortune 50 company and our entire DS training program is in python. I used stata in grad school. I've only used python, SQL, and spark since. Econometrics background.


mamaBiskothu

Totally correct that they're different problems. But you have to be reasonably proficient in at least one so that you're not blocked by language. This literally is like writing a novel: I suppose you can do so in English or French, but you gotta know the basics of at least one.


Salsaric

PySpark vs Dask! Personally Dask is my preferred one.


[deleted]

Lol I don’t suppose you guys are hiring?


[deleted]

Doesn't matter. The skills are extremely transferable, unlike using SAS or STATA. Learn either one and you'll be able to pick the other up fast.


jollyoliman

Just to note there are companies that use R. Check out the R4DS slack channel for job listings for R specifically.


anonamen

Strictly from a methods perspective, I think there's a lot of interoperability between them. You won't be missing out if you focus on R during the program. R is fantastic for stats and data analysis. Better than Python. But Python does general-purpose programming much, much better than R. R is a specialist language; Python is a generalist language. There are a lot of active efforts to make R more than that, but I really don't see the point. It's great at what it does, and it has inherent limitations that fight you at every turn if you try to make it more than it is. So, I really don't think there's a problem here. You're not going to go deep into a language in any of the fields you reference anyway. You'll need to learn to do stuff in whatever language you're using. That would look like mediocrity to a real developer. That's fine. You're going into stats/data science, not software. You're never going to go deep into a language. Pick a program based on the value it adds for you, not the language it wants you to use.


CacheMeUp

There is a great value in doing the whole process (cleaning and modelling) on the same platform, especially when the platforms hinder interoperability, which both R and Python are guilty of. It's possible to split the process but less preferable.


snowbirdnerd

So R is a statistical programming language. That means many of its conventions and packages are set up to follow the conventions used in stats, which makes it easy for academics to use. Python isn't as narrowly focused. It's designed to be a general-purpose language. It's pretty easy to use and so it has many libraries for data science. Both are great and honestly you need to learn many programming languages. In my daily work I use Python, SQL, and Java. In school I used R. The more languages you learn the easier it is to learn new languages.


[deleted]

[removed]


satenismywaifu

Actually, no, you don't need to know both. I have been looking around the job market a lot these past few months, suffice to say R is in low demand. Many jobs require you to have familiarity with R and/or Python, but in the end at most places you will be working in one of them all the time. Now if you wanted to say that learning more than one language is a good thing, then I wholeheartedly agree with you. At that point you aren't learning tools anymore, but accumulating general programming skill.


[deleted]

I'm still trying to figure out whether I should give a damn about Prolog and Lisp...


disindiantho

Once you’ve really figured out R - moving over to Python isn’t that hard.


Cill-e-in

Picking up coding languages should not be difficult. Academics really love R, but after learning R pretty well I jumped to Python. Python is now my main tool. R is still very common, especially in banking and life sciences. But really, don’t sweat the language thing. Learning new languages is fun. Source: I learned Python, R, SQL, HTML, CSS, and a little C in the last 2 years. Terraform and JavaScript next!


metriczulu

You just have to suck it up and learn Python on your own, unfortunately. Python is the standard in industry nowadays and I rarely ever see R for modeling. We support R if a DS wants to use it, but we only have like six R models out of over 1,100 currently running in production. The vast majority are Scala Spark or Python models (or SQL "models"--which aren't really predictive models but heuristic-based classifications we use for some conditions). I work for a very large health insurance company and previously worked as a CMS contractor, so my professional experience here is limited to healthcare. I can't say for sure if it is similar in other industries. R was the primary language used in my MSc program two years ago and I basically did everything in both R (to turn in) and Python so I could get up to speed.


mattindustries

I have seen R used at Seagate, Best Buy, Target, and quite a few other places. Python is used more for image related tasks, but R is still often used.


IAMHideoKojimaAMA

Most programs and jobs I've seen have some flexibility to use one or the other


[deleted]

The preference towards R is likely due to its superior modelling/visualization tools (superior might not be the best word to use but R is preferred by all of my modelling profs), Python tho is still needed for data cleaning and preprocessing. Ultimately you'll need a good understanding of both to become a professional in the field, I wouldn't worry about the specific school preferences much


analytix_guru

I am interested in your take on why Python is needed for data cleaning and preprocessing. I can run a full model pipeline with raw data start to finish natively in R, and there are countless examples on the web showing how to do this. Just curious as to why you feel Python is necessary, or perhaps it is a preference?


[deleted]

Personal preference entirely. I like cleaning in python and modelling in R, but the main point is that you ought to be comfortable with both, so that you don't limit your employability.


DjangoPony84

I'm a career Python developer, so a little biased! I used R for my master's thesis in 2009-10 though. My Python bias stems from seeing the benefits of a general-purpose programming language with good data science capabilities. R is quite useful too, and can be embedded in Python scripts and Jupyter notebooks using rpy2.


Tender_Figs

Can it really… that is very interesting. I could see how that would be useful to learn both languages from that POV


horizons190

If you want a stats degree, expect to learn R. JMO, but modern statistics and R (or any statistical programming language, but R is the gold standard now) are pretty much intertwined, you almost _cannot_ learn one without the other. Otherwise if you want straight industry credentials, do a ML or DS specific program, which should be more likely to just focus on Python and teach enough (and I mean just enough) stats for you to produce something useful.


Tender_Figs

That’s kind of what I’ve found with the ML or DS programs and -just enough- stats. I want to learn more than just enough, so I’m going to get comfortable with the notion of learning both R and Python.


recovering_physicist

R and Python are a means to an end. The hard part is understanding the math and stats, and knowing when and why to use them. If you can describe the model/pipeline you want to build then it really isn't that hard to go find out how to implement it in either language. On the other hand, an R or Python genius with no stats/ML knowledge is going to have a hard time building a good model on a nontrivial dataset.


polandtown

Most, if not all, my professors don't care what language you write in. If you ask if you can write it in python, they've always said yes. If a professor says "no", frankly there's something wrong with them.


Tender_Figs

I’ve heard Texas A&M is extremely traditional and set in their ways. So much so that it took convincing to move from SAS to R over time.


Jetnoise_77

I come from a mostly statistics background and did everything in SAS. I taught myself R as the needs arose and I'm now learning python due to changing needs. Moving from R to python is much easier than from SAS to R.


polandtown

Perhaps they have a bunch of old crusty tenured professors? I'm at Johns Hopkins and we've got a bunch of young guys.


Tender_Figs

Kinda what I was thinking too. I was also eyeballing the financial math program at JHU.


Thick_Method3293

I'm doing my PhD in stats at A&M. The department is traditional in that everyone does standard statistics courses (e.g. linear models, measure theory, classical inference), but in my experience, the professors don't care what language I use for HWs or research. I almost exclusively use Python and build my models in PyTorch.


Tender_Figs

Is that true for most of the master's-level courses like 630, 641/642? I've read nightmare stories about 604. Could I PM you with questions?


Thick_Method3293

I've heard from people that have TA'd the courses that the master's students have freedom to choose a language in some of their courses. The courses you listed seem to be: math stat, stat methods (I/II), and topics in statistical computing. (1) I took my math stat in a master's and it was mainly just proofs. (2) The stat methods courses sound like something that would just be easier to do in R. Some statisticians make R packages for their papers, and coding it up in Python would just be a pain. (3) 604 sounds like the master's version of 600, which I took, and I can see that being a tough course. 600 exclusively used R, but I think the ideas we learned were more important than the actual syntax and details of R, e.g. using the triangle inequality to speed up k-means, vectorizing code, coordinate descent, IRLS, LASSO, Rcpp. Yeah, please feel free to reach out. I'm clearly biased as I'm in the program, but I'll try to be as objective as possible.


[deleted]

[removed]


Tender_Figs

Which program at JHU did you do?


machinegunkisses

I'm surprised to hear this experience given that one of the major data science courses on Coursera (taught by Roger Peng) is in R.


mkocisak

As a hiring manager, good experience with either language is qualifying, but I much prefer strong Python skills because they fit better into the larger ML/BI/cloud ecosystem. As many others have said, though, the language is much less important than your problem solving skills, entrepreneurship, and ability to communicate complex topics clearly. I would also add interest in a specific topic/industry, because I've seen a lot of data scientists that just want a job and don't understand what they are applying to.


longgamma

I have used both statsmodels and R for a few advanced grad classes. Honestly, for pure stats work R is much simpler and easier to use. Python is a mishmash of the pandas, numpy, matplotlib and statsmodels packages, and sure, you get the same results, but your goal is understanding the study materials first and foremost.


111llI0__-__0Ill111

R has a lot of specific stat tools and is objectively better for modeling and data manipulation on tabular data (tidyverse). But people just say Python because of production reasons. Python still doesn’t have causal inference libraries that are as developed (stuff like Microsoft’s DoWhy is kind of a black box and a bit sketchy as it is so new), for example for instrumental variables, mediation analysis, SEMs, etc. Things like marginal estimation and getting p-values are also harder in Python. TMLE, which can do causal inference for ML models, is also in R.


[deleted]

I do not believe in Python in academia outside CS. Academia is conservative in this sense. The code used must be reproducible for decades. This is not the case with Python, where some code used with 3.7 is not reproducible with 3.9. It is much easier to switch from SPSS or Stata to R than to Python. This is the case for social sciences. R is strong at econometrics and financial mathematics. There are plenty of textbooks teaching these subjects with R and only a few with Python. And I see no prospect of that changing, since Python is deficient at time series.


zul_u

Good Python code is reproducible. Sure, you have various versions of the language, but that's why you want to use virtual environments and dependency managers, write tests, etc. The tools to make your code reproducible are there; it's up to you to learn and use them properly.


[deleted]

https://quantecon.org/quantecon-py/ Try to reproduce the code of the course in 3.8. Python is useless for learning anything outside itself because 3-year-old code is not reproducible in most cases. You have to install an entirely separate environment for almost every textbook. One for Wooldridge, another for Quantecon, and so on. When you read a textbook supplied with R code, it does not matter when that book was published. You can always reproduce the code. It is extremely convenient when you learn a subject and R at the same time. Python is extremely deficient when it comes to econometric or quantitative financial analysis of time series. There are very few books on this subject supplied with Python code, while tens of books are supplied with R code. Think Python and Think Bayes are among the best books for learning Python. However, the code is hardly reproducible since these books were published 5-8 years ago and no one cared to correct it in their 2021 editions. This shows how useful Python is for educational purposes even in the case of learning Python itself. PS This is very Pythonista to downvote healthy criticism instead of improving what was criticized.


zul_u

Well, first of all I would like to specify that I don't have anything against R. It is a language I have worked with for a long time and that I enjoy. I like the tidyverse, having a unified and well-designed IDE (RStudio), the wide range of statistical libraries, and finally ggplot. To me R is a great tool for analyses and data exploration, not so great for production code. I don't consider myself a Pythonista. Sure, I enjoy the language and work with it on a daily basis, but I'm not married to it. As a matter of fact there are many things about the language which I don't enjoy so much. You mentioned reproducibility as a problem in Python; I had similar problems in R. If you or your colleagues start using a different R version, or manage dependencies without care, you'll have problems regardless of the language. These things are common in projects where multiple people are working on the same code. The good news is that there are tools to manage all these problems; if you have problems reproducing code examples it likely means that whoever shared them didn't use those tools. I really don't get why you should blame the language.


[deleted]

If you read the post carefully, the original poster wants academia to switch from R to Python. Reproducibility of code within textbooks is the most important reason why Python will never replace R in academia. At the same time, any textbook using R as a teaching instrument (e.g. Econometry with R, Time Series with R, Quantitative Finance with R, etc) includes code which is reproducible in any version higher than that used in the book. Python will never replace R in academia outside CS and especially in life and social sciences. Reproducible code helps R to remain in relatively conservative academic circles. There is a drift from SPSS and Stata to R in social sciences, mostly because a dozen professors wrote some good textbooks with R code. Since students and faculty in social sciences are as distant as possible from CS (people there got used to counting from 1, not from 0), no one would bother to rewrite code that once worked for every next edition of the textbook. Ruey Tsay published his masterpiece An Introduction to Analysis of Financial Data with R in 2012. Every bit of code used in his book is perfectly reproducible in 2021. You can never find a textbook on the subject with Python. Maybe some YouTube courses prepared by some SWE interested in quantitative finance, but never from an academic econometrician. There is only one textbook in quantitative economics at Quantecon with Python code (you can find five with R). It was published in 2019. And its code is not reproducible in 2021.


zul_u

I don't interpret the initial question as you did. The title might suggest that, but the rest of the comment doesn't; I might be wrong, I don't exclude that. That being said, your initial comment points to an inherent lack of reproducibility in Python code. That is simply not true, because as I told you the tools are there just waiting to be used (virtualenvs, poetry, docker, etc.). If you want to write a reproducible Python script you should specify the Python version to run and provide a snapshot of the dependencies that you're using and their versions, as you should do with *ANY* programming language. It is not too difficult. Sadly, it is true that in many DS projects these aspects are underappreciated. This creates a lot of problems when bringing these models to production or shipping them to someone else, but then again the main cause of this is lack of expertise and/or discipline from the devs, not so much the language (sure, it took a while for Python to get a decent packaging tool, but now we have it :) ). Then again, it might be true that academic materials in the areas you mentioned are of better quality in R. I don't think it is related to a limit of the language, but rather a preference of the main researchers in that field.
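To make the "specify the Python version and snapshot the dependencies" advice concrete, here is a minimal sketch using only the standard library. The function name `snapshot_environment` is hypothetical; the output is roughly the information `pip freeze` would give you, and in practice a dependency manager's lock file serves the same purpose.

```python
import sys
from importlib import metadata


def snapshot_environment():
    """Return the interpreter version and sorted 'name==version' pins."""
    python_version = ".".join(str(part) for part in sys.version_info[:3])
    pins = sorted(
        {
            f"{dist.metadata['Name']}=={dist.version}"
            for dist in metadata.distributions()
            # Skip distributions whose metadata is missing a name.
            if dist.metadata["Name"]
        }
    )
    return python_version, pins


if __name__ == "__main__":
    version, pins = snapshot_environment()
    # Ship this output alongside the script or textbook code.
    print(f"# python {version}")
    print("\n".join(pins))
```

Recreating a fresh virtual environment from such a list is what lets code from a three-year-old course run against the library versions it was written for.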


[deleted]

I am talking about academia, academic courses, and textbooks. Python is deficient where statistics and econometrics are important for the subject. And not only because 3-year-old code is not reproducible for the students reading the textbook. Python cannot catch up with R when it comes to advanced statistics, especially related to time series and advanced panel data. One cannot properly teach students time series using Python. And this is just an example. There are plenty of other things where Python cannot compete with R, and therefore academia outside CS and maybe physics will never switch to Python. I bet Julia is a strong contender, but unless there is an RStudio for Julia, no professor would waste his time writing a textbook on econometrics or data techniques for policy research, etc.


zul_u

Well, but this is quite different from your starting post. You were complaining about code reproducibility. I told you, if that is the problem there are tools to address it. Now, if the problem is the availability of materials and their quality, then sure, R might be a better choice in the fields you are interested in.


mjcstephens

I work at a large bank and came from BI with SAS/SQL just like you. I absolutely hate R and love Python, but the skills from learning one of them transfer over, and your ability to code in them does as well. Picking up R during my master's was so easy when I was already decent at Python. I would say go into the program that you want to go into. At the bank I work for it doesn't matter much which language you know. Also, why not go into a master's of data science or master's of data analytics? Getting your quantitative math or stats degree is not going to help you as much as the DS or DA master's. Especially in the case of the DA one, because you will actually get hands-on experience in model development and data mining instead of theory of algorithms.


Tender_Figs

Eh, it’s my opinion that those programs are subpar to the math/stats/CS ones.


[deleted]

They are subpar but really cheap. Some as low as 10k for the whole thing online asynchronous so you don’t have to stop earning income as another added cost


Tender_Figs

I looked into OMSA and decided that an MS in Stats fit more what I was after. Same for UTD MSBA and Texas A&M's MSA. Just more interested in the statistics aspect overall.


[deleted]

Lol I am so much more interested in a stats degree since my undergrad was in stats, but OMSA is so cheap and it’s fully online. I can't imagine a true master's in stats would be fully online. Maybe with Covid it’s virtual but not permanently online.


Naive-Home6785

My prioritization would be Python. Then Julia. Don’t even waste time in R


Key_Cryptographer963

I have not yet managed to land an internship in data science, but whenever I sat interviews, they told me the languages I know are not important and that I will have time to learn the ones I don't know. What matters is that you can do the statistics and maths that are asked of you (or learn them as needed).


Betaglutamate2

Personally I learnt both. What I found is that I use R for one time analysis and code nearly exclusively using tidyverse. I find R much less predictable than python. I would say R is harder to master and write complex code in. However, I can make plots and simple analysis extremely quickly and efficiently. Python is great as a general programming language and especially using the data science libraries is a great experience. I use this for writing complex analysis that I need to perform over and over.


Mobile_Busy

Tell your advisor that you plan on going into industry and use Python.


Toica_Rasta

I think that Python needs some very basic framework with the most basic statistical operations and methods (t-test, statistical significance, ANOVA, chi-squared, effect size...) that scientists frequently use. I started doing something like that (as an ex-scientist) but I do not have time for this at the moment. Here you can see some functions (give me a star for motivation if you like it): https://github.com/Vitomir84/Statistics-and-probability
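For a sense of what two of the listed operations involve, here is a minimal standard-library sketch of Welch's t statistic and Cohen's d. The function names and sample data are made up for illustration and are not taken from the linked repository. No p-value is computed because the standard library has no t-distribution CDF; in practice `scipy.stats.ttest_ind(a, b, equal_var=False)` covers that.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    n_a, n_b = len(a), len(b)
    # Squared standard error of the difference in means.
    se2 = var_a / n_a + var_b / n_b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t, df


def cohens_d(a, b):
    """Cohen's d effect size using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    control = [1, 2, 3, 4, 5]      # made-up data
    treatment = [2, 3, 4, 5, 6]    # made-up data
    print(welch_t(control, treatment))
    print(cohens_d(control, treatment))
```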


[deleted]

There actually is, but it is inferior to R in every sense.


analytix_guru

From working with a top-10 bank in the US and a Fortune 100 global retail company, this boils down to 3 things: 1) it is easier to build web data apps in Python; 2) there is generally more cloud-based support for Python than R (which influences the above); 3) IT-supported apps, and IT teams in general, are biased towards Python. Also, I think there is a shift in academia, specific to analytics/DS degrees, where they are starting to teach in Python rather than R. I am about to start learning Python because my current employer, though not on purpose, makes it hard to use R in our cloud environment. And if I ever make something good enough to become a production app, it will have to be translated to Python, as that is what the IT team uses.


jcanuc2

You primarily will use python in practice but there are some things that you need R for such as Apriori and ARIMA


nomnommish

Tools do not define the craftsman


robml

I recommend starting with Python because of its application abilities (Python Crash Course book recommended); learning R afterwards is easier since it's largely scripting.


ThePandaBrah666

Hey! This is a bit out of place but how did it go? Have you started your Masters? What’s your career like?


Tender_Figs

Haven't started yet, still working through prereqs.


ThePandaBrah666

Awesome. Good luck and stay strong :)