I've been recently wondering about the over-proliferation of DSLs like R - and thinking whether this is good or bad. Specifically, I wonder what right has R as a stand-alone language and environment? Wouldn't it be much better to build it as an internal DSL/library on some popular language (like Python)?
What can R do that Python with suitable libs (Numpy/Scipy etc.) can't? Is it a matter of performance? Ease of use? Special abilities?
Because I know for sure that there's a damn lot of things Python can do and R definitely can't, so I'm unsure about the motivation for this fragmentation of the open-source programming community.
P.S. Please don't point to RPy - it's absolutely irrelevant to my question. P.P.S. I have nothing against R whatsoever - this is just a rant to raise a valid discussion and gain insight.
The difference between a general purpose language and a DSL is the difference between a cow and a carton of milk. Sure, the cow has lots more features - it can make lots more milk and fertilizer to boot, and it can even pull a wagon if you are so inclined. But if all you want is a glass of milk and you get a cow, then you've got an oversized bovine crapping in your office.
Intrinsic language differences aside, the difference mainly comes down to mindset. Most people work with R in an interactive fashion. The nature of data analysis is that much of your time is spent exploring the dataset and understanding its shape before you can write any code or apply any pre-existing algorithms.
If you happen to know exactly what your data is like a priori, and you just need to implement algorithm X (where X isn't available in a pre-existing library/module) then you will probably end up using whatever language is most comfortable for you.
However the dirty reality is that for general data analytic problems, most of your time is spent visualizing, cleaning, exception handling and exploring - long before you even begin to do the actual 'analysis.' And while Python, and ipython in particular, provide environments suitable for this type of iterative work - R was designed for this from the ground up.
If you don't have much experience with real world analytics it is easy to fall into the trap of saying, "Well, all I need to do is fit model X, so all I need to do is implement model X". If you approach data analysis this way, you will spend more time debugging your code, as there are so many gotchas in real world data. Statisticians approach things differently, putting work in ahead of modelling to ensure that the data fits their preconceived notions of what it should look like. Are there missing values? Are there erroneous values? Does the data follow a particular distribution? Etc.
As I said in response to an earlier posting on this thread, the downside of this approach is that the interactive/iterative approach lends itself to writing some fairly ugly code. And for one off jobs, that can be OK. But if you are trying to build a system that needs to be understood by others you need to take the approach of a Software Engineer.
I tend to fall somewhere between the two: using R to understand the shape of the data, and then writing code (be it in R, Python, C or Perl) designed with my Software Engineer hat on.
I have been using both Python/PyLab and R/ggplot2, but I have more experience with Python.
it is object oriented, this means that you can organize your data in terms of objects instead of tables.
Let's say you have a 'gene', which has many transcripts, and is related to cancer and other metabolic processes. How would you represent that in your favorite programming language?
In Python, you can create a class (the syntax is quite easy) named 'gene', containing various attributes, like the list of transcripts (which can also be different objects), the association with cancer and the other metabolic processes.
In R, you would have to create different tables, one with 'Genes and their transcripts', and another with 'genes and their associated metabolic processes', and maybe merge the two tables.
The example may be stupid and it is difficult to explain this in a short text, but in general Python has an additional option to organize data, which is object oriented programming, and which is also present in R but with a more difficult syntax.
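To make the 'gene' example above a bit more concrete, here is a minimal sketch of how it could look as Python classes. The class names, field names and values are all made up for illustration; they are not taken from any real library.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    """One transcript of a gene (fields are illustrative)."""
    name: str
    length: int


@dataclass
class Gene:
    """A gene with its transcripts and associated processes."""
    name: str
    transcripts: list = field(default_factory=list)
    cancer_related: bool = False
    processes: list = field(default_factory=list)

    def longest_transcript(self):
        # Return the transcript with the greatest length, or None if there are none.
        return max(self.transcripts, key=lambda t: t.length, default=None)


# Made-up example values, for illustration only.
gene = Gene(
    name="GENE1",
    transcripts=[Transcript("GENE1-001", 1200), Transcript("GENE1-002", 3400)],
    cancer_related=True,
    processes=["cell cycle", "apoptosis"],
)
print(gene.longest_transcript().name)
```

Everything about one gene lives in a single object here, where the table-based approach would spread it over several tables that have to be merged.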
Python is easy to interface with databases, and has good ORM libraries. For example, have a look at the examples in this tutorial: http://elixir.ematia.de/trac/wiki/TutorialDivingIn
The example from the previous point can be also implemented as a relational database, with a 'gene' table and a 'transcript' table, etc..
in Python, it is easier to write and use functions. Moreover, you should avoid explicit loops in R, as they are very slow, so you must think in terms of vectorized operations.
I don't know what it is like to use regular expressions in R, but in Python the re module is very good.
the bio-opensource community around Python is big; however, many people are moving toward R, as there are more functions in Bioconductor.
PyLab and matplotlib are good, but it takes more time to produce a plot with them compared with R. For example, you can't create a plot from a list of strings with the plot function; you can't do things like plot(x = ['a', 'b', 'c'], y = [1, 3, 14]) with PyLab: you have to convert the x variable to a list of numbers and then modify the x labels manually.
if you use IPython, you can mimic the behaviour of an R session file. However, I have never used it.
the masked arrays thing is a pain, at least for me. If you have an array of values and one of them is NaN or None, and you want to calculate its average or standard deviation, you can't do it directly: you have to convert it to a masked array and then use a different function, numpy.ma.average.
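For readers who have not met masked arrays, this is roughly what the extra step looks like. It is only a small sketch with made-up values: numpy.ma.masked_invalid hides the NaN, and the average and standard deviation are then taken over the remaining entries.

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0])

# A plain mean propagates the NaN.
plain_mean = values.mean()

# Mask the invalid entries (NaN/inf), then average only the valid ones.
masked = np.ma.masked_invalid(values)
masked_mean = np.ma.average(masked)
masked_std = masked.std()

print(plain_mean, masked_mean, masked_std)
```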
there is not a default 'identify' function in PyLab. You can produce interactive graphs, but you have to use some recipes or implement it directly.
R's syntax is a bit like PHP: you have a lot of functions with different names, and you have to know them to use them.
For example, in R you have different functions to read a delimited file: read.delim, read.table, read.csv, etc. In Python, they tend to have the minimum number of functions possible, so you would have a single read_delim function which could behave like the other functions according to a parameter (read_delim(filename, type=csv), read_delim(filename, type=tsv), etc.).
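As a rough illustration of the "one function, one parameter" style, here is a sketch built on the standard csv module. Note that read_delim is a hypothetical helper written for this example (it does not exist in the standard library), and the parameter is called kind rather than type to avoid shadowing the builtin.

```python
import csv
import io


def read_delim(source, kind="csv"):
    """Hypothetical helper: parse delimited text, choosing the separator by `kind`."""
    delimiters = {"csv": ",", "tsv": "\t"}
    return list(csv.reader(source, delimiter=delimiters[kind]))


# io.StringIO stands in for an open file; the contents are made up.
csv_rows = read_delim(io.StringIO("gene,score\nA,1\n"), kind="csv")
tsv_rows = read_delim(io.StringIO("gene\tscore\nA\t1\n"), kind="tsv")
print(csv_rows == tsv_rows)
```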
I like the ggplot2 library and approach very much. You can produce a high quality plot with few functions, without having to worry about the defaults. These kinds of libraries are what Python lacks.
there are a lot of open source libraries for R, to manipulate microarrays and other biological data. Many R libraries are published as papers in scientific journals, and this improves the quality of the code, because the authors can get funding and have their work rewarded. On the other side, BioPython and BioPerl are monolithic packages, in which you can't select which packages to install, and it is more difficult to contribute to them.
Well, I am sorry I don't have time to answer you more and I had to write this quickly... Probably now you are more confused than before, but the problem is that there is not an easy answer to your question.
I suggest you learn Python if you don't use it already, because it is a more general purpose language, and it has a good and easy syntax which will help you to improve your programming skills. http://en.wikipedia.org/wiki/Bioconductor
What can R do that Python with suitable libs (Numpy/Scipy etc.) can't?
The short answer is: 1600 packages and counting, mostly devoted to statistical methods and visualization. Pick a random R package, and see if Python has an equivalent module. In 9 out of 10 cases, it doesn't.
The long answer: I feel I can speak for most people in this forum that Python is a beautiful, robust, versatile language, and a pleasure to code in. And it is not too hard to pick up one knowing the other (for a synopsis, check link text). But nowhere in Python can you find the same breadth and depth of statistical functionality that you have in R. There are at least four packages devoted to wavelets; three to L1-L2-elasticnet shrinkage methods; four to variants of boosting alone. Grid, lattice and ggplot2 offer a flexibility and power that cannot be matched by matplotlib. And I am not even mentioning the beautiful, rather esoteric capabilities of Bioconductor packages.
If all these capabilities were available in Python, there would be no need for R. I agree with most (but not all) of dalloliogm's observations about the two languages, and Python is just better designed. R is not terrible however. Some quirks aside, the one big issue I have is that S4 objects seem "bolted on" the language (like Perl OOP).
Regarding your other questions: performance is very similar, but ease of use is greater for Python, because of its language features and great IDEs. In my experience, R is perhaps easier to interface with C, and all computationally intensive modules call C, C++ and Fortran.
Looking forward, I don't see Python gaining any advantage over R in the ML and Statistics communities. Currently, mloss.org lists fewer projects in Python (39) than in R (47), but then only a small set of R packages are listed in mloss. Increasingly, R is used in production environments for prediction tasks at hedge funds and a few companies. Most surprisingly, in the past year R has attracted an extraordinary amount of attention and new users, while increasing its developer base. The questions you see here are mostly from and for neophytes, exactly for this reason. http://tinyurl.com/matlabpythonR
As someone who is very partial to Python...
Without repeating what everyone has already said, I think the best answer to your question comes from the question itself, that is "What can R do that Python with suitable libs (Numpy/Scipy etc.) can't?"
Python cannot do anything (save very basic things in the standard math lib) that R does without using special third party libs. R is designed to be an out of the box statistical analysis platform, while Python is a language to support application design. For example, to create a basic OLS model in Python one has to combine lots of functionality from the standard Python libs, SciPy, and even do some independent scripting, while in R you just call the built-in lm().
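To show what the "independent scripting" route can look like, here is a minimal sketch of a one-predictor OLS fit in plain Python. The function name ols_fit and the data are made up for this example; in R the same fit would be the one-liner lm(y ~ x).

```python
def ols_fit(x, y):
    """Fit y = intercept + slope * x by ordinary least squares (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = (sum of cross-deviations) / (sum of squared x deviations)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data lying exactly on y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
intercept, slope = ols_fit(x, y)
print(intercept, slope)
```

This only recovers the coefficients. Standard errors, residual diagnostics and the formula interface that lm() gives you for free would all need more code or a third party library.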
In reality, these two languages are entirely different and have vastly different purposes (as the previous answers have rightly noted). It just so happens that there is a large enough community of people using Python to do scientific research that the powerful combination of SciPy/NumPy/matplotlib exists.
(Disclaimer: I am a NumPy developer).
One big advantage of R over any general purpose programming language is the integration. There is CRAN, which can be used to try packages very easily. Python does not have this ATM, and it will never be as good as R I am afraid, because this kind of problem, although relatively simple to solve for a special-purpose language, is incredibly hard for a general purpose, portable programming language like python. This and the history of S/R are the two big advantages of R IMHO.
Note that R has this quite rare special-purpose + open source nature which makes this a killer combination. For other special purposed languages like say matlab, distributing your code is a PITA. There is no real infrastructure to easily share and install 3rd party code.
As a former S, now R, and more recently python user, I find that the languages have very different goals. I've used R quite a bit longer (around 10 yrs) than python (about 3). Python can do many things better than R. As a programming language it is much more versatile. However, R is a scriptable data analysis tool with 15+ years of code libraries written largely by the academic community for statistical and data analysis. R would be useless for writing a database application, let alone a website, content management system, word processor, image manipulation program, game, etc etc etc. R, on the other hand, is entirely devoted to data analysis tasks. You can push it a bit to do some rudimentary programming and OOP, but it was never written or conceived for that purpose. Probably anything that can be done in R can be done in python. There's an excellent chance it could be redone in a more elegant fashion.
I would say for non-programmers who understand what they're doing with statistics, R is a good deal easier to get going. Perhaps simply the fact that every assignment creates a complete copy in a new object makes it easier to handle. It precludes many elegant programming constructs, but the vast majority of R users are scientists, statisticians and economists who may well be completely oblivious to those same elegant constructs. I know dozens of R users and would bet 90% of them (my sample only) haven't a clue what a class method is. That doesn't make them stupid, it just means they haven't been exposed to Comp Sci 101. While there may be few things (maybe nothing) that could strictly be done BETTER in R than in python, the fact is hundreds (or more realistically almost thousands) of things have already been done in R and are ready to use. So R is better for data analysis due to years of effort and the world-class pedigree of people writing cutting edge modules for it.
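The remark above about assignment and copies is easier to see from the Python side. This is a small sketch of Python's reference semantics, which is the behaviour that tends to surprise people coming from R (where, as far as I understand it, a modified object behaves as an independent copy).

```python
import copy

a = [1, 2, 3]
b = a                 # b is another name for the same list, not a copy
b[0] = 99             # so this change is visible through a as well

c = copy.deepcopy(a)  # c is an independent copy
c[1] = -1             # a and b are unaffected

print(a, b, c)
```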
One last thing to remember is that very often R users will have many different data analysis tasks. Any one data set may get worked with in a specific way and not repeated very many times with a new data set. As such, practitioners are not generally writing general purpose tools to analyze production volumes of data. Many tasks are one of a kind and need to draw from many potential techniques. Just my 2 cents worth (if that).
"what right has R as a stand-alone language and environment"
That's an odd phrasing, I think. At the very least, the question ignores history. These things never come down to what's available today. R has been around since the early-mid 1990s with a wealth of statistical computing facilities. Can you point to comparable Python libraries from around that time? The way these things work, as I'm sure you know, once anything gets some mindshare, it gets harder to give up or displace. Just because Python can do a lot of these things now with scipy packages, etc, isn't going to cause R programmers to jump ship.
But I'm not convinced Python can do a lot of the statistical computing work with the same effort. For example, poke around:
Can you do some of the regression models, mixed effects, longitudinal analysis, etc in Python with the same ease?
That said, for matrix computing and such, I'd go straight for Python/NumPy. And I also prefer Python the programming language and the wealth of general purpose and specialized (GUI, web, etc) libraries it offers. I do wish the statistics community had discovered Python before writing their own thing. :-) But Python just didn't have as much of a mindshare at that time -- I think Perl was the rage back then.
EDIT: @eliben: I'm speculating here, but I'll take a crack at answering your comment. Sometime in the 1990s, statistical computing folks, especially those who'd already been exposed to S, learned about the open source variant of S, R. Many of these folks were adventurous graduate students, not ingrained in the ways of SAS and the like. Sometime in the late 90s and early 2000s, they became full blown professors with a captive audience of their own students, and chose to spread the R gospel. This narrative would account for at least a super-linear or exponential increase in the R users, which you might see.
Keep in mind that we're not just talking about statistics departments. The people who benefit from statistical computing include Ecologists, Linguists, Biologists, Astronomers, Econometricians, etc, etc, etc. Basically, any type of experimental scientist or data analyst. If one wanted to change this state of affairs, one would be advised to develop useful Python libraries for the graduate students of today. I don't actually claim any of this as fact though. Just a guess. :-)
As an aside, I think it's interesting that the statistical NLP and Machine Learning folk have gravitated toward Python.
Here's another theory: because statistics is the new sexy. :-) http://www.statmethods.net/
R can use the existing code base that was created before Python was of general use. R was first released in 1993. Python was released in 1991 and at that time was not a viable language for scientific computing. R also tries to do one thing well rather than be a general language. In order to make a python DSL easily usable for the group R is targeted at you’d have to make so many changes that it wouldn’t look like Python anymore.
Domain specific tools use the idiom of the people who work in the field. As such they have much lower barriers to entry for non-geeky, non-programming practitioners of the domain.
Those who have the itch to be real programmers will diversify at a later date.
Moreover, most domain specific languages that I have encountered are Turing complete, so
Because I know for sure that there's a damn lot of things Python can do and R definitely can't
is too strongly stated. You meant "things that are easy in Foo that are hard in Bar" or something similar.
It might interest you to know that Ross Ihaka (R co-creator) and Duncan Temple Lang recently proposed a statistical environment based on Common Lisp to address R's perceived shortcomings:
They rule out Python for lack of speed and (optional) static typing. I'm not sold on their arguments. I guess you would oppose their idea, as it will lead to more fragmentation. http://books.google.com/books?id=8Cf16JkKz30C&pg=PA21&source=gbs_toc_r&cad=4
I've been a S+/R programmer for almost 15 years, and a Python programmer for 3 (with some Matlab in the interim). First and foremost to this discussion is the fact that R is a specialized program for statistical analysis. It has developed a huge code base and developer base over the years, and has become, because of Bioconductor, a preferred language in bioinformatics.
Python, on the other hand, is a general purpose tool, one segment of which (the Numpy/Scipy/matplotlib world) is geared towards statistical analysis. Even there, its strength is in random number generation, and there are not many modules to really do more involved statistical modeling (like glm, mixed effect models, survival analysis, let alone shrinkage methods). Python has quite a bit of ML code available (see Orange and MLPy, for example). Some of these algorithms are also available in R, which also has an interface to Weka.
The one place I've actually favored Python is in reading large data sets. Python is a lot more efficient and quick in my experience at doing this. The iterators in Python are quite useful for this (though this is being rectified in R), but even reading the data into memory is quicker in Python.
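As a sketch of what I mean by iterators being useful here: a file object in Python yields one line at a time, so a summary can be computed without holding the whole data set in memory. The function name, the two-column layout and the data below are made up for illustration.

```python
import io


def running_mean(lines):
    """Average the numeric second column of comma-separated lines, one line at a time."""
    total = 0.0
    count = 0
    for line in lines:  # a file object yields lines lazily
        _, value = line.rstrip("\n").split(",")
        total += float(value)
        count += 1
    return total / count if count else float("nan")


# io.StringIO stands in for a large file opened with open(path).
data = io.StringIO("a,1.0\nb,2.0\nc,6.0\n")
mean_value = running_mean(data)
print(mean_value)
```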
The tools in R that I miss most in Python are a good "apply" function (though I know this exists in Numpy) and its extension in Hadley Wickham's plyr; the ease of using lattice for exceptional graphics; and the code base for modeling that allows me to do quite complex modeling, both traditional and Bayesian (potentially using RWinBugs to talk to WinBugs), quickly and efficiently.
The other feature in R is the availability of the literate programming library Sweave, which I use regularly to write reports for clients and collaborators. However, for this purpose the pendulum is tilting towards Python for me. I have been using restructured text and formatting it using Python docutils to LaTeX, HTML and OpenOffice odt. I have already written Python code to transform documents using the OpenOffice Python API from odt to MS Word doc using other available Python libraries (reflecting the general nature of the Python language). Even the LaTeX done this way creates a better formatted document to my eye than Sweave. Tables are also quite easy to create in rst by cutting and pasting raw R output. This has shifted my report-writing of analysis towards Python-based workflows. http://www.ailab.si/orange/
For me the defining difference was definitely the package management systems, because it's really the extension packages created for each system that make them so flexible. Back when I was new to both R and Python, R's R CMD INSTALL and R CMD REMOVE "just worked" consistently across a wide variety of platforms more often than Python's easy_install and...
So I became an R programmer because it was the language that put the tools in my hands the fastest.
The inquisitive programmer will learn languages as they gather attention. The popularity of a language often illuminates a need that you may not even know existed. People rarely choose a language to spite other programmers (brainfuck and LOLCode excepted).
I like to learn languages in interesting pairs. LISP and Forth, for instance, make a great pair. What's more fun than contrasting a prefix high-level language with a postfix low-level language?
When a friend of mine learned Python (years and years ago), I rolled my eyes. After all, interpreted languages are toys, right? Too slow to be useful. Why would you even waste the time?
Don't be close-minded when it comes to languages. That reflex will bite you.
However, if you go off tomorrow and write 20,000 lines of R over the next few weeks, feel free to come back and tell me why it sucks. I'll listen to you then.
(BTW, your question has encouraged me to download R. Damn. Now I need a good book.)
Apparently, R is used at Google. Who knew? http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html
Without regard for statistics/numbers/vectors/etc, I have to say that R as a language is one of the most beautiful programming languages I have ever used. I think it is sad that R is usually regarded as just a tool to get things done.
I've been doing a lot of Python programming lately, but here are some of R's features I sorely miss:
I think it's sad that almost all of these points have been largely ignored by other programming languages -- if Python would adopt these, I would use nothing else ever again. Because R is a "practical" language, too often the debate over its efficacy centers around libraries. But with decent cross-language integration, any language can use any library. I want a language to be expressive, and R is the best I've come across.
In R, there are a few things I miss from Python:
This is not easy to state concisely. And while in a simple Venn-diagram we may see the large overlap between R and Python (or Ruby or ...) the fact remains that R was designed for Programming with Data (to quote one of Chambers' books). With that in mind, it makes data exploration, analysis, programming, estimation, ... fairly easy. Of course, this does not mean you cannot do it in other languages (quite the opposite) but once you grasp some of the intrinsic advantages in R (working with data.frame object, missing values, modeling functions, visualization, ...) you may prefer to work in R.
Chambers' most recent book, Software for Data Analysis (2008), makes this point rather well.
Lastly, the CRAN Task Views also give a nice overview of what is being done with R by different scientific communities. That takes nothing away from Python, but it may chip away at the somewhat pejorative notion of R as 'yet another DSL'. It has become the primary language for data analysis and statistical computing. http://books.google.com/books?id=UXneuOIvhEAC&dq=chamber+software+for+data+analysis
What can python do that R can't?
But seriously, the question seems ill-posed. Both R and python are Turing-complete languages that support aspects of both functional and object-oriented programming paradigms. If python does something that R does, you could port the R code to python, and vice-versa. (This might not be easy, but it is possible).
It isn't an issue of what one language can do that the other can't, but it just comes down to the age old notion that a programmer should use the right tool for the right job. I use both languages heavily (primarily Python now, used to be primarily R). I find myself bouncing between them (as well as some other languages) based on the task at hand.
To use Python+NumPy you need to learn Python AND NumPy. For a person who just needs a tool for, e.g., scientific calculations, learning Python might be overkill.
To use R you just need to learn R. On the other hand Python is more "low-level" in terms of abstraction in this case and it should be easier to modify/extend it than R. (Though I can't be sure as I have not had much experience with R)