share
Stack OverflowWhat can be done in R that can't be done with Python/Numpy/SciPy
[+52] [19] Eli Bendersky
[2009-07-24 11:15:09]
[ python r comparison numpy ]
[ http://stackoverflow.com/questions/1177019]

I've been recently wondering about the over-proliferation of DSLs like R - and thinking whether this is good or bad. Specifically, I wonder what right has R as a stand-alone language and environment? Wouldn't it be much better to build it as an internal DSL/library on some popular language (like Python)?

What can R do that Python with suitable libs (Numpy/Scipy etc.) can't? Is it a matter of performance? Ease of use? Special abilities?

Because I know for sure that there's a damn lot of things Python can do and R definitely can't, so I'm unsure about the motivation for this fragmentation of the open-source programming community.

P.S. Please don't point to RPy - it's absolutely irrelevant to my question. P.P.S. I have nothing against R whatsoever - this is just a rant to raise a valid discussion and gain insight.

(22) The way this question is phrased does you no favours. It would read better as a plain question, without the ranting and swearing. E.g. "proliferation" rather than "over-proliferation" and without the sentence about "(deliberate?) fragmentation". Ranting does not encourage valid discussion, discussion does. - Nat
(1) Nat, I've removed the word 'deliberate?' as it's more incendiary than I intended. Other than that, there's no swearing here AFAIK, and I explicitly declared this is a rant (and tagged it subjective) - Eli Bendersky
(4) well, rants don't really belong here, whether or not they're tagged as such. This is meant to be a site to get programming questions answered. It's not a blog, and it's not a general discussion forum. :) - jalf
(5) from the faq: "Avoid asking questions that are subjective, argumentative, or require extended discussion. This is not a discussion board..." - Argalatyr
(4) why are you so sure that you can do things with Python and not with R? both are full featured multiparadigm (functional, imperative, object oriented) high level dynamic languages, so you should be able to do the same things... just some things will be a little bit easier with one of them. Even if that wasn' true, definetively you can do in C everything you can do in Python, and there are some thigns that you can do in C and cannot in Python (e.g. manual memory management), so whats the motivation of Python and why fragmenting open source community with a useless project like Python? XD - fortran
(5) Using R, you get to interact with a large gang of awesome statisticians, who are having a lot of fun. - Mike Dewar
[+64] [2009-07-24 12:45:37] quillbreaker

The difference between a general purpose language and a DSL is the difference between a cow and a carton of milk. Sure, the cow has lots more features - it can make lots more milk and fertilizer to boot, and it can even pull a wagon if you are so inclined. But if all you want is a glass of milk and you get a cow, then you've got an oversized bovine crapping in your office.


(2) great analogy ! - orlandu63
Couldn't have said it better! - CDR
That just made me laugh. I am an agricultural economist using R so I can't wait to whip that analogy out. - JD Long
(27) While your command of words in creating this analogy is notable, I'd dare say that it's not very relevant. Taking this analogy far enough, every discipline should invent its own programming language and environment. One for statistics, one for algebra, one for calculus, and so on. Somehow in the "commercial tool" world everyone uses Matlab with its toolkits, but in open source, everyone has his own pet milk carton. This fragments the open-source tool community and makes the open-source tools weak against commercial ones - Eli Bendersky
(3) What's wrong with a discipline inventing a programming language to better do the work before it? Surely the betterment of the individual disciplines is more important than the running conflict between open source and closed source. - quillbreaker
(19) +1 for eliben's comment. I'm not a fan of argument by clever quips. It's the kind of thing politicians are fond of pulling out so the rest of us can parrot it endlessly and not think too hard about an issue. How does the spectrum from general purpose to specialized work? When do you cross over from one point to another? Specifically, what gives R its edge? That's far more interesting. - ars
(2) @eliben: s/Taking this analogy far enough/Taking this analogy far enough while completely disregarding common sense/ -- FTFY. - Steve Losh
@ars Such as stackoverflow.com/questions/1109237/… ? Someone should probably tag this question DSL, by the by. - quillbreaker
this is beautiful :)) - ldigas
1
[+48] [2009-07-24 14:20:38] Josh Reich

Intrinsic language differences aside, the difference mainly comes down to mindset. Most people work with R in an interactive fashion. The nature of data analysis is that much of your time is spent exploring the dataset and understanding its shape before you can write any code or apply any pre-existing algorithms.

If you happen to know exactly what your data is like a priori, and you just need to implement algorithm X (where X isn't available in a pre-existing library/module) then you will probably end up using whatever language is most comfortable for you.

However the dirty reality is that for general data analytic problems, most of your time is spent visualizing, cleaning, exception handling and exploring - long before you even begin to do the actual 'analysis.' And while Python, and ipython in particular, provide environments suitable for this type of iterative work - R was designed for this from the ground up.

If you don't have much experience with real world analytics it is easy to fall in to the trap of saying, "Well, all I need to do is fit model X, so all I need to do as implement model X". If you approach data analysis this way, you will spend more time debugging your code as there are so many gotchas to real world data. Statisticians approach things differently, putting work ahead of modelling to ensure that the data fits your preconceived notions of what it should look like. Are there missing values? Are there erroneous values? Does the data follow a particular distribution? Etc.

As I said in response to an earlier posting on this thread, the downside of this approach is that the interactive/iterative approach lends itself to writing some fairly ugly code. And for one off jobs, that can be OK. But if you are trying to build a system that needs to be understood by others you need to take the approach of a Software Engineer.

I tend to fall somewhere between the two; using R for to understand the shape of the data, and then writing code (be it in R, Python, C or Perl) designed with my Software Engineer hat on.


(8) This is an interesting point of view. But Python with matplotlib and pylab (matlab-like interactive commands) and armed with a powerful shell like IPython or Pydee provides very a very good interactive environment, IMHO - Eli Bendersky
2
[+46] [2009-07-24 13:54:45] dalloliogm [ACCEPTED]

I have been using both Python/PyLab and R/ggplot2, but I have more experience with Python.

Python:

  • it is object oriented, this means that you can organize your data in terms of objects instead of tables.

    Let's say you have a 'gene', which has many transcripts, and is related to cancer and other metabolic processes. How would you represent that in your favorite programming language?

    In Python, you can create a class (the syntax is quite easy) named 'gene', containing various attributes, like the list of transcripts (which can also be different objects), the association with cancer and the other metabolic processes.

    In R, you would have to create different tables, one with 'Genes and their transcripts', and another with 'genes and their associated metabolic processes', and maybe merge the two tables.

    The example may be stupid and it is difficult to explain this in a short text, but in general Python has an additional option to organize data, which is object oriented programming, and which is also present in R but with a difficult syntax

  • Python is easy to interface with databases, and has good ORM libraries. For example, have a look at the examples in this tutorial: http://elixir.ematia.de/trac/wiki/TutorialDivingIn

    The example from the previous point can be also implemented as a relational database, with a 'gene' table and a 'transcript' table, etc..

  • in Python, it is easier to write and use functions. Moreover, you can't use loops in R, as they are very slow, so you must think in terms of a procedural language

  • I don't know how is it to use regular expressions in R, but in Python the re module is very good.

  • the bio-opensource community around Python is big, however many people are moving toward R, as there are more functions in Bioconductor [1].

PyLab

  • PyLab and matplotlib [2] are good, but it takes more time to produce a plot with them compared with R. For example, you can't create a plot from a list of strings with the plot function, you can't do things like plot(x = ['a', 'b', 'c'], y = [1, 3, 14]) with PyLab: you have to convert the x variable to a list of numbers and then modify the x labels manually.

  • you don't have a structure like datamatrix in Python, at least not in NumPy [3]/ SciPy [4]. There is not a read.csv function, you have to read the file manually every time.

  • if you use IPython [5], you can mimic the behaviour of an R session file. However, I have never used it.

  • the masked arrays thing is a pain, at least for me. If you have an array of values and one of them is NaN or None, and you want to calculate its average or standard deviation, you can't do it directly: you have to convert it to a masked array and then used a different function, numpy.ma.average

  • there is not a default 'identify' function in PyLab. You can produce interactive graphs, but you have to use some recipes or implement it directly.

R

  • R [6]'s syntax is a bit like PHP: you have a lot of functions with different names, and you have to know them to use them.

    For example, in R you have different functions to read a delimited file: read.delim, read.table, read.csv, etc... In Python, they tend to have the minimum number of function possible, so you would have a read_delim functions which could behave like the other functions according to a parameter (read_delim(filename, type=csv), read_delim(filename, type=tsv), etc..)

  • I like very much the ggplot2 library and approach. You can produce an high quality plot with few functions, without having to worry of the defaults. These kind of libraries are what Python lacks.

  • there are a lot of open source libraries for R, to manipulate microarrays and other biological data. Many R libraries are published as papers in scientific journals, and this improves the quality of the code, because the authors can get fundings and have their work rewarded. On the other side, BioPython [7] and BioPerl [8] are monolitic packages, in which you can't select which packages to install, and it is more difficult to contribute to them.

Well, I am sorry I don't have time to answer you more and I had to write this quickly... Probably now you are more confused than before, but the problem is that there is not an easy answer to your question.

I suggest you to learn Python if you don't use it already, because it is a more general purpose language, and it has a good and easy syntax which will help you to improve your programming skills.

[1] http://en.wikipedia.org/wiki/Bioconductor
[2] http://en.wikipedia.org/wiki/Matplotlib
[3] http://en.wikipedia.org/wiki/NumPy
[4] http://en.wikipedia.org/wiki/SciPy
[5] http://en.wikipedia.org/wiki/IPython
[6] http://en.wikipedia.org/wiki/R_%28programming_language%29
[7] http://en.wikipedia.org/wiki/BioPython
[8] http://en.wikipedia.org/wiki/BioPerl

(7) R does support its own flavor of object orientation. But it does tend to be the case that when you are writing quick and dirty code, it can end up looking a lot like functional PHP. But as with any language, if you take the time to organize your thoughts your code can be as clean as you'd like it to be. - Josh Reich
(3) Very useful answer. Good to hear from someone with experience using the different tools. - John D. Cook
Regarding R's object orientation, the last time I read the documentation it basically said not to push the OO support far, that is works for overloading printing but not much else. Maybe it has gotten better since then, but at least a couple years ago the R team admitted that the OO support was hardly top notch. - John D. Cook
I would argue that R's interface to relational databases is better (or at least more natural) than Python, since relational DBs map directly to R data.frames rather than going through an object persistence layer. - Daniel Dickison
(8) <quote>There is not a read.csv function, you have to read the file manually every time."</quote> Try import csv. - J.F. Sebastian
(10) R's support for object oriented programming is sophisticated and (nearly) complete, but comes from a different heritage to python, java, c# etc. It is based around the idea of generic functions (rather than message passing) inspired by languages like common lisp and dylan. The thing that most new programmer struggle with is that object are not mutable, although there are three add-on OO packages (R.oo, oop, and proto) that remedy this. - hadley
(6) There is more to R (and CRAN) than ggplot2 and you should not colour your experience through that lense only. - Dirk Eddelbuettel
numpy.loadtxt can read csv - Benjamin
pandas (pandas.sourceforge.net) now offers the equivalent of data.frame and read.csv in Python (as DataFrame and read_csv). - Thomas K
3
[+44] [2009-07-25 02:32:01] gappy

What can R do that Python with suitable libs (Numpy/Scipy etc.) can't?

The short answer is: 1600 packages and counting, mostly devoted to statistical methods and visualization. Pick a random R package, and see if Python has an equivalent module. 9 out 10 cases, it doesn't.

The long answer: I feel I can speak for most people in this forum that Python is a beautiful, robust, versatile language, and a pleasure to code in. And it is not too hard to pick up one knowing the other (for a synopsis, check link text [1]). But nowhere in Python can you find the same breadth and depth of statistic functionality that you have in R. There are at least four packages devoted to wavelets; three to L1-L2-elasticnet shrinkage methods; four to variants of boosting alone. Grid, lattice and ggplot2 offer a flexibility and power that cannot be matched by matplotlib. And I am not even mentioning the beautiful, rather esoteric capabilities of Bioconductor packages.

If all these capabilities were available in Python, there would be no need for R. I agree with most (but not all) of dalloliogm's observations about the two languages, and Python is just better designed. R is not terrible however. Some quirks aside, the one big issue I have is that S4 objects seem "bolted on" the language (like Perl OOP).

Regarding your other questions: performance is very similar, but ease of use is greater for Python, because of its language features and great IDEs. In my experience, R is perhaps easier to interface with C, and all computationally intensive modules call C, C++ and Fortran.

Looking forward, I don't see Python gaining any advantage over R in the ML and Statistics communities. Currently, mloss.org lists fewer projects in Python (39) than in R (47), but then only a small set of R packages are listed in mloss. Increasingly, R is used in production environments for prediction tasks at hedge funds and a few companies. Most surprisingly, in the past year R has attracted an extraordinary amount of attention and new users, while increasing its developer base. The questions you see here are mostly from and for neophytes, exactly for this reason.

[1] http://tinyurl.com/matlabpythonR

gappy. I just picked up R. I have interest in quantitative finance, audio dsp, and image processing. You seem reasonably knowledgeable about what's out there in the R package world. Any hints on which of the three fields I mentioned would be most fun to play with in R? - Nosredna
(4) R is being used more and more in finance, and offers many packages on the subject. In dsp I believe the lingua franca for quick prorotyping has been Matlab for some time. But to get started, check out the CRAN Task Views on R's homepage. - gappy
(2) Re: R v. Python in ML; interesting since it's not at all my impression (which may need correction). Do you base this on mloss alone? I get a completely different impression from searching delicious bookmarks, google code, or code released on graduate research sites. In my experience, finance industry (hedge funds, etc) is generally adventurous/cutting edge with tech, but not an indicator of general trends (see K, CORBA, OCaml, etc; ditto for telecom, incidentally). - ars
(2) @ars my evidence is largely circumstantial. mloss.org is one data point. Another is that last three KDD cups were won by teams using R alone or R+C. Another one is that the Journal of Statistical Software publishes more on R sw than anything else. But I am not really interested in keeping tally. Simply, R has a solid and growing group of users and contributors. But ever tool has its use. - gappy
I wasn't really expecting scores or hard data, but wanted to see how much my own views, based on circumstantial/anecdotal points, needed recalibration. :) I wasn't aware of the KDD entries, good to know. Thanks for clarifying. - ars
4
[+26] [2009-07-24 14:16:54] DrewConway

As someone who is very partial to Python...

Without repeating what everyone has already said, I think the best answer to your question comes from the question itself, that is "What can R do that Python with suitable libs (Numpy/Scipy etc.) can't?"

Python cannot do anything (save very basic things in the standard math lib) that R does without using special third party libs. R is designed to be an out of the box statistical analysis platform, while Python is a language to support application design. For example, to create a basic OLS model in Python one has to combine lots of functionality from the standard Python libs, SciPy, and even do some independent scripting, while in R you just call the built-in lm().

In reality, these two language are entirely different and have vastly different purposes (as the previous answers have rightly noted. It just so happens that there is a large enough community of people using Python to do scientific research that the powerful combination of SciPy/NumPy/matplotlib exist.


5
[+25] [2009-07-26 08:13:48] David Cournapeau

(Disclaimer: I am a NumPy developer).

One big advantage of R over any general purpose programming language is the integration. There is CRAN, which can be used to try packages very easily. Python does not have this ATM, and it will never be as good as R I am afraid, because this kind of problems, although relatively simple to solve for a special-purpose language, is incredibly hard for a general purpose, portable programming language like python. This and the history of S/R are the two big advantages of R IMHO.

Note that R has this quite rare special-purpose + open source nature which makes this a killer combination. For other special purposed languages like say matlab, distributing your code is a PITA. There is no real infrastructure to easily share and install 3rd party code.


(2) R has the added advantage of competing against SAS, which many statisticians hate (very expensive, limited features, hard to extend, and non-open). Since extending R is trivial (usually!) compared to writing custom C code, the user base is eager to write modules for it. - Gregg Lind
6
[+23] [2009-07-27 04:38:50] kpierce8

As a former S, now R and more recently python user, the languages have very different goals. I've used R quite a bit longer (around 10 yrs) and python (about 3). Python can do many things better than R. As a programming language it is much more versatile. However, R is a scriptable data analysis tool with 15+ years of code libraries written largely by the academic community for statistical and data analysis. R would be useless for writing a database application, let a alone a website, content management system, word processor, image manipulation program, game, etc etc etc. R, on the other hand, is entirely devoted to data analysis tasks. You can push it a bit to do some rudimentary programming and OOP, but it was never written or conceived for that purpose. Probably anything that can be done in R can be done in python. There's an excellent chance it could be redone in a more elegant fashion.

I would say for non-programmers who understand what they're doing with statistics, R is a good deal easier to get going. Perhaps simply the fact that every assignment creates a complete copy in a new object makes it easier to handle. It precludes many elegant programming constructs, but the vast majority of R users are scientists, statisticians and economists who may well be completely oblivious to those same elegant constructs. I know dozens of R users and would bet 90% of them (my sample only) haven't a clue what a class method is. That doesn't make them stupid, it just means they haven't been exposed to Comp Sci 101. While there may be few things (maybe nothing) that could strictly be done BETTER in R than in python, the fact is hundreds (or more realistically almost thousands) of things have already been done in R and are ready to use. So R is better for data analysis due to years of effort and the world-class pedigree of people writing cutting edge modules for it.

One last thing to remember is that very often R users will have many different data analysis tasks. Any one data set may get worked with in a specific way and not repeated very many times with a new data set. As such, practitioners are not-generally writing general purpose tools to analyze production volumes of data. many tasks are one of a kind that need to draw from many potential techniques. Just my 2 cents worth (if that).


+1 for an informed comparison and realistic view from both sides of the fence. - Joao Figueiredo
7
[+16] [2009-07-24 15:27:48] ars

"what right has R as a stand-alone language and environment"

That's an odd phrasing, I think. At the very least, the question ignores history. These things never come down to what's available today. R has been around since the early-mid 1990s with a wealth of statistical computing facilities. Can you point to comparable Python libraries from around that time? The way these things work, as I'm sure you know, once anything gets some mindshare, it gets harder to give up or displace. Just because Python can do a lot of these things now with scipy packages, etc, isn't going to cause R programmers to jump ship.

But, I'm not convinved Python can do a lot of the statistical computing work with the same effort. For example, poke around:

Can you do some of the regression models, mixed effects, longitudinal analysis, etc in Python with the same ease?

That said, for matrix computing and such, I'd go straight for Python/NumPy. And I also prefer Python the programming language and the wealth of general purpose and specialized (GUI, web, etc) libraries it offers. I do wish the statistics community had discovered Python before writing their own thing. :-) But Python just didn't have as much of a mindshare at that time -- I think Perl was the rage back then.

EDIT: @eliben: I'm speculating here, but I'll take a crack at answering your comment. Sometime in the 1990s, statistical computing folks, especially those who'd already been exposed to S, learned about the open source variant of S, R. Many of these folks were adventurous graduate students, not ingrained in the ways of SAS and the like. Sometime in the late 90s and early 2000s, they became full blown professors with a captive audience of their own students, and chose to spread the R gospel. This narrative would account for at least a super-linear or exponential increase in the R users, which you might see.

Keep in mind that we're not just talking about statistics departments. The people who benefit from statistical computing includes Ecologists [3], Linguists [4], Biologists [5], Astronomers [6], Econometricians [7], etc, etc, etc. Basically, any type of experimental scientist or data analyst. If one wanted to change this state of affairs, one would be advised to develop useful Python libraries for the graduate students of today. I don't actually claim any of this as fact though. Just a guess. :-)

As an aside, I think it's interesting that the statistical NLP and Machine Learning folk have gravitated toward Python.

Here' another theory: because statistics is the new sexy [8]. :-)

[1] http://www.statmethods.net/
[2] http://www.ats.ucla.edu/stat/r/
[3] http://ecology.msu.montana.edu/labdsv/R/
[4] http://www.ualberta.ca/~baayen/publications/BaayenCUPstats.pdf
[5] http://cran.r-project.org/doc/contrib/Seefeld%5FStatsRBio.pdf
[6] http://www.sr.bham.ac.uk/~ajrs/r-project.html
[7] http://www.mayin.org/ajayshah/KB/R/R%5Ffor%5Feconomists.html
[8] http://flowingdata.com/2009/02/25/googles-chief-economist-hal-varian-on-statistics-and-data/

Thanks for the historic background - it is indeed relevant. However, it appears that there's a recent proliferation of interest in R in the programming community (not the SO flash-mob of the past couple of weeks). Until a year ago I never heard of R, but it's now a common feature in Proggit or YC hacker news. This doesn't play well with the historical rationale, as Python is mature now and its libraries provide the vast majority of what's needed for R. - Eli Bendersky
@eliben: updated my answer to offer a response to your comment. - ars
8
[+13] [2009-07-24 12:08:59] Jared

R can use the existing code base that was created before Python was of general use. R was first released in 1993. Python was released in 1991 and at that time was not a viable language for scientific computing. R also tries to do one thing well rather then be a general language. In order to make a python DSL easily usable for the group R is targeted at you’d have to make so many changes that it wouldn’t look like Python anymore.


(3) Timing is everything, and history matters. - dmckee
(1) Good point. And Old S is 1975 and New S is 1988. - Nosredna
9
[+10] [2009-07-24 13:08:23] dmckee

Domain specific tools, use the idiom of the people who work in the field. As such they have much lower barriers to entry for non-geeky, non-programming practitioners of the domain.

Those who have the itch to be real programmers will diversify at a later date.

Moreover most domain specific languages that i have encountered are Turing complete, so

Because I know for sure that there's a damn lot of things Python can do and R definitely can't

is too strongly stated. You meant "things that are easy in Foo that are hard in Bar" or something similar.


(1) Look up R questions here on SO. Most deal with achieving trivial programming tasks like working with strings, GUIs and so on. Seems like people are in love with solving all the old problems all over again. Brainf*ck is Turing complete too, by the way. But I do agree with your statement about using the idiom of the people working in the field, although R's statistical operations don't look much different from Numpy function invocations to me - Eli Bendersky
It helps to know that there has been an intentional effort to bring up the profile of R on StackOverflow the last few days. People are asking whatever just to be asking things. Now, I am not a statistician, and I don't do R, but idomatic things can be subtle, and when a general purpose tool gets them wrong, they itch. It really matters. - dmckee
(1) Admittedly, I don't know R, but - IMO - Python has the lowest barrier to entry for any language I have used (and I have used over a dozen). For me, its syntax is pretty close to how I write pseudocode. - PTBNL
(3) @PTBNL: Sure python is easy---so easy that sometimes it not like writing code at all---but it uses programmer idiom. - dmckee
(4) @dmckee you are spot on. The underlying assumption in many of these comments and answers is that the person learning R is a programmer to begin with. I do not believe it is typical for a programmer to learn R. I believe it is typical for a statistician or economist to learn R in order to do specific analysis. - JD Long
"Most deal with achieving trivial programming tasks like working with strings, GUIs and so on." Well, people are learning the language. That's what happens. Seems like most iPhone questions are about memory leaks. Not sure what that proves. - Nosredna
10
[+9] [2009-07-24 16:40:16] Carlos Santos

It might interest you to know that Ross Ihaka (R co-creator) and Duncan Temple Lang recently proposed a statistical environment based on Common Lisp to address R's perceived shortcomings:

http://books.google.com/books?id=8Cf16JkKz30C&pg=PA21&source=gbs_toc_r&cad=4 [1]

They rule out Python for lack of speed and (optional) static typing. I'm not sold on their arguments. I guess you would oppose their idea, as it will lead to more fragmentation.

[1] http://books.google.com/books?id=8Cf16JkKz30C&pg=PA21&source=gbs_toc_r&cad=4

(1) Thanks for the pointer. Reading that paper was very interesting. Speed, multithreading and language design are some of R's limitations, and Ihaka and Lang are clearly looking ahead. This effort reminds me also of PyPy. Having a thin layer on top of CLOS would make R more portable, and push language optimization where it belongs. - gappy
11
[+8] [2009-09-03 03:03:12] Abhijit

I've been a S+/R programmer for almost 15 years, and a Python programmer for 3 (with some Matlab in the interim). First and foremost to this discussion is the fact that R is a specialized program for statistical analysis. It has developed a huge code base and developer base over the years, and has become, because of Bioconductor, a preferred language in bioinformatics.

Python, on the other hand, is a general purpose tool, one segment of which (the Numpy/Scipy/matplotlib world) is geared towards statistical analysis. Even there, it's strength is in random number generation, and there are not many modules to really do more involved statistical modeling (like glm, mixed effect models, survival analysis, let alone shrinkage methods.) Python has quite a bit of ML code available (see Orange [1] and MLPy, for example). Some of these algorithms are also available in R, which also has an interface to Weka.

The one place I've actually favored Python is in reading large data sets. Python is a lot more efficient and quick in my experience at doing this. The iterators in Python are quite useful for this (though this is being rectified in R), but even reading the data into memory is quicker in Python.

The tools in R that I miss most in Python are a good "apply" function (though I know this exists in Numpy) and its extension in Hadley Wickham's plyr; the ease of using lattice for exceptional graphics; and the code base for modeling that allows me to do quite complex modeling, both traditional and Bayesian (potentially using RWinBugs to talk to WinBugs), quickly and efficiently.

The other feature in R is the availability of the literate programming library Sweave, which I use regularly to write reports for clients and collaborators. However, for this purpose the pendulum is tilting towards Python for me. I have been using restructured text and formatting it using Python docutils to LaTeX, HTML and OpenOffice odt. I already have written Python code to transform documents using the OpenOffice Python API from odt to MS Word doc using other available Python libraries (reflecting the general nature of the Python language). Even the LaTeX done this way creates better formatted document to my eye than Sweave. Tables are also quite easy to create in rst by cutting and pasting raw R output. This has shifted my report-writing of analysis towards Python-based workflows.

[1] http://www.ailab.si/orange/

(1) You can look at pyreport for literate programming in python gael-varoquaux.info/computers/pyreport - bgbg
(1) Python's Sphinx is also good for literate stuff, and (imho) easier to extend than Sweave. - Gregg Lind
I've been playing around with Sphinx a bit. I actually use rst a lot, and I think the literate programming part in R can be extended to use rst via the brew package - Abhijit
I'd like to add that the ascii library in R does quite well with weaving rst code with R, for the most part. - Abhijit
12
[+6] [2009-07-25 00:59:21] Sharpie

For me the defining difference was definitely the package management systems- because it's really the extension packages created for each system that makes them so flexible. Back when I was new to both R and Python, R's R CMD INSTALL and R CMD REMOVE "just worked" consistently across a wide variety of platforms more often than Python's easy_install and...

So I became an R programmer because it was the language that put the tools in my hands the fastest.


13
[+5] [2009-07-25 01:00:21] Nosredna

The inquisitive programmer will learn languages as they gather attention. The popularity of a language often illuminates a need that you may not even know existed. People rarely choose a language to spite other programmers (brainfuck and LOLCode excepted).

I like to learn languages in interesting pairs. LISP and Forth, for instance, make a great pair. What's more fun than contrasting a prefix high-level language with a postfix low-level language?

When a friend of mine learned Python (years and years ago), I rolled my eyes. After all, interpreted languages are toys, right? Too slow to be useful. Why would you even waste the time?

Don't be close-minded when it comes to languages. That reflex will bite you.

However, if you go off tomorrow and write 20,000 lines of R over the next few weeks, feel free to come back and tell me why it sucks. I'll listen to you then.

(BTW, your question has encouraged me to download R. Damn. Now I need a good book.)


Apparently, R is used at Google [1]. Who knew?

[1] http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html

+1 for "go off tomorrow and write 20,000 lines of R". It is hard to pass judgements on a language without experience in how to use that language. - Sharpie
14
[+5] [2011-06-18 06:05:21] Owen

Without regard for statistics/numbers/vectors/etc, I have to say that R as a language is one of the most beautiful programming languages I have ever used. I think it is sad that R is usually regarded as just a tool to get things done.

I've been doing a lot of Python programming lately, but here are some of R's features I sorely miss:

  1. Dynamic binding
    • Allows things like R's with(df, x + y)
    • Some hacks which I won't mention here for risk of being shot
  2. Lazy evaluation:
    • Let's you define new control structures
    • Let's you define new kinds of pattern-matching in assignments
  3. Attributes: sometimes you just need to tack on an extra bit of information
  4. User-defined infix operators: can make code more readable
  5. Perverse protection against side-effects: makes R code substantially slower, but man is it nice not to have to worry whether you're dealing with a copy of a vector or the original one
  6. Unified syntax for accessing associative arrays, struct-like things, and environments: because honestly they're all the same in my mind and having to use different syntaxes feels like an endianness-style distinction
  7. Call-by-name: Automatic labels for plot(); if I'm lazy today I don't have to come back tomorrow and wonder what all those mysterious unlabeled axes are
  8. Default args can refer to each other:
    • Allow multiple calling protocols in one function
    • Define variables without regard for what order they come in
  9. list/dict is the same type of object: encourages people to use names

I think it's sad that almost all of these points have been largely ignored by other programming languages -- if Python would adopt these, I would use nothing else ever again. Because R is a "practical" language, too often the debate over its efficacy centers around libraries. But with decent cross-language integration, any language can use any library. I want a language to be expressive, and R is the best I've come across.

In R, there are a few things I miss from Python:

  1. The ability to easily make self-referential structures: this is hard when assignments make copies all over the place
  2. varargs become a list (or dict) automatically: R's list(...) is a little annoying and leads to hacks like as.list(substitute(list(...)))[-1L]
  3. Visually prettier (just my taste)
  4. Arrays should start at 0

15
[+5] [2009-07-24 18:02:01] Dirk Eddelbuettel

This is not easy to state concisely. And while in a simple Venn-diagram we may see the large overlap between R and Python (or Ruby or ...) the fact remains that R was designed for Programming with Data (to quote one of Chambers' books). With that in mind, it makes data exploration, analysis, programming, estimation, ... fairly easy. Of course, this does not mean you cannot do it in other languages (quite the opposite) but once you grasp some of the intrinsic advantages in R (working with data.frame object, missing values, modeling functions, visualization, ...) you may prefer to work in R.

Chambers most recent book Software for Data Analysys (2008) [1] makes this point rather well.

Lastly, the CRAN Task Views [2] also give a nice overview of what is being done with R by different scientific communities. That takes nothing away from Python, but it may chip away at the somewhat pejorative notion of R as 'yet another DSL'. It has become the primary language for data analysis and statistical computing.

[1] http://books.google.com/books?id=UXneuOIvhEAC&dq=chamber+software+for+data+analysis
[2] http://cran.r-project.org/web/views/

16
[+3] [2009-07-26 14:01:08] hadley

What can python do that R can't?

But seriously, the question seems ill-posed. Both R and python are Turing-complete languages that support aspects of both functional and object-oriented programming paradigms. If python does something that R does, you could port the R code to python, and vice-versa. (This might not be easy, but it is possible).


17
[+3] [2009-07-24 17:34:41] geoffjentry

It isn't an issue of what one language can do that the other can't, but it just comes down to the age old notion that a programmer should use the right tool for the right job. I use both languages heavily (primarily Python now, used to be primarily R). I find myself bouncing between them (as well as some other languages) based on the task at hand.


18
[+2] [2009-07-24 11:30:04] freiksenet

To use Python+NumPy you need to learn Python AND NumPy. For person who just needs a tool for eg scientific calculations it might be an overkill to learn Python.

To use R you just need to learn R. On the other hand Python is more "low-level" in terms of abstraction in this case and it should be easier to modify/extend it than R. (Though I can't be sure as I have not had much experience with R)


(3) why would just learning R be simpler than Python + Numpy? Isn't R also a set of statistical routines + some programming constructs to bind them together? - Eli Bendersky
@eliben Because R is "just this" and Python is general purpose language and it is hard to use NumPy before getting idea of this general-purpose language. - freiksenet
19