Question

I hope we all know by now that Premature optimization is the root of all evil.

One side of that quote means by optimizing you are increasing complexity and complexity is evil. The other less known side is today's optimization might be meaningless tomorrow.

What I mean is we used to think that, "Floating-point operations are slow, therefore you should use integers since they are 100x faster! You must use integers even if it makes your code more complex!

That is completely false today. Today, modern processors actually do floating point operations faster than integer operations. Did you know that? I was surprised as well when I tested it in Java.

Also, what about addition and minus are much faster than multiplications or divide? False today (no difference)

Double is much slower than float! False (only half as slow, which is nothing)

A few other things that I think will be useless soon (if they are not already):

Threads. Threads aren't commonly thought as an optimization, but it really is. It trades complexity for performance. I imagine a concurrent model of tomorrow where we don't explicitly deal with threads.
Inlining functions, the runtime or compiler should be able to figure this out for you
The final keyword in Java. Basically inlining variable references (but we'll still use it for protection, today I use it for both protection and performance)
Buffered IO, the operating system or runtime should figure this out for you
Object caching (and caching in general at many levels), it's not far fetched for the runtime to handle this
Minimizing branch statements, a smarter compiler can remove branches for you
Using arrays instead of Lists, a smarter runtime can just use arrays underneath
Unraveling/reversing nested loops
Replacing recursion with a for loop
Minimizing object size to fit into cache or cache alignment issues. This is a big one. Its a hard problem to crack, but if the runtime could really figure out that really needs to be in cache, then we can finally not care about object size.
Selecting the smallest data type that will work. Maybe in the future an integer is rangeless
Scalability. This sounds like an odd one, but scalability is a performance issue in the end.
newing/deleting memory (looking at you C++), Java already solved some of this, others just need to catch up. It can still be improved even more
Using shift instead of * or /
Avoiding the use of Exceptions
Tuning Garbage collector
Double buffering, I think Swing already does this by default

My main point in all this is to be careful with your optimizations and you really should code for clarity and not speed.

So my question is, what are some "optimizations" you are doing today that probably will be useless tomorrow?

Answer 1

I'm reminded of one of Michael Jackson's aphorisms. No, not THAT one, this one:http://en.wikipedia.org/wiki/Michael_A._Jackson

Jackson's Rules of Optimization:

Rule 1: Don't do it.

Rule 2 (for experts only): Don't do it yet.

Answer 2

As someone who works with embedded systems, I am amused by your assumption of an operating system and a run-time library.

Answer 3

1) Threads. - I don't see how we can avoid working directly with threads if we stick to the traditional OO paradigm. If we want multithreading to be a "behind the scenes" feature, functional programming needs to become more mainstream.

2) Inlining functions - Already obsolete for the most

5) Object caching - Actually, this can decrease performance if used with a generational garbage collector!

7) Using arrays instead of Lists - Unless performance is critical, you should already be using lists. In some languages though, working with arrays are just syntactically more convenient. Python comes a long way to encouraging us to use lists.

11) Selecting the smallest data type that will work. Already bad! Either the data has to be aligned on a word boundary (in which case you won't save space), or it'll be slower to work with than native-length data types (32/64 bit depending on proc).

12) Scalability. I don't understand. Scalability is about choosing the right algorithms and data structures to solve your problems. Unless the compiler already has a preprogrammed solution to your problem, it will never be able to choose your algorithm for you!

14) Using shift instead of * or / - Actually, using explicit shift operators improves readability. It makes it clear that you're manipulating bits rather than performing arithmetic operations.

17) Double buffering, - This is a non-issue if you're just building a dialog box with buttons. If you're doing animation/games, often you just need to set some flag and be done with it.

Answer 4

I'm going to take an alternate tack to this.

If you need the optimization in your code today, then possible future advances are of limited concern.

If you NEED the gains from optimization, and actually do get a measurable gain from such an optimization, then it's use has been justified.

The real trick is keeping on top of the advances so that you are not using FALSE optimizations today simply because they were necessary last year.

Cheers,

-Richard

Answer 5

Many architecture specific things are temporary. For instance, it used to be the fastest way to compute a DFT (Discrete Fourier Transform) was to store the sin and cosine values in memory. Now it CAN be cheaper to compute (or approximate) the sin/cos than it is to do a single non-cached memory access.

In "Writing Solid Code" Steve Maguire advocates writing clear (obvious) code, and allowing the compiler to optimize it. Often when you optimize code, you make it hard for the compiler to help.

Answer 6

Don't agree on point one (thread optimization).

I think that algorithmical thread optimization (for example writing a parallel raytracer) will still be useful tomorrow, since the day compilers will do that type of optimization seems far away to me (and I guess that for that day compilers will do any type of optimization :)

Answer 7

Anything involving the Stack.

I worked a gig doing embedded programming for a while, and we'd have actual conversations about whether passing in two parameters to a function versus one was worth the extra stack space. Recursion was out of the question - and we tended to only write new functions when we absolutely had to, since we barely had any stack space to play with.

But, with the cost of flash memory dropping, that's going to quickly go away - much like it did with PCs. If we had started that project a year later, we could have probably had ten times the ram in that device for the same price, which would have made all my weeks of optimization fine-tuning totally obsolete.

(and good riddance.)

Answer 8

Lots of filesystems use some (or a lot!) of extra computation to optimize for spinning hard drives, whether for access latency or fragmentation. Solid state disks make a lot of this work obsolete.

Answer 9

I'm amused by the hubris of the original article, claiming that those optimizations are unnecessary.

It's like watching a Python guy asking why anyone would bother learning assembly, while at the end, Python is written in C, which the run time has to execute in assembly.

Most of your so call obsolete optimizations are merely shuffle, and rightfully so. The bulk of programmers shouldn't care about shifts vs x/2, yaddi yaddi yadda. But someone still has to learn it if they were to maintain, revise, or create a new language and associated compiler / interpreter.

Answer 10

Addressing some of the points you brought up...

Inlining functions. C and C++ have the biggest problem with this, because of the interface/implementation separation. However, toolchains are getting better at inlining across compilation units (ICC and MSVC, not GCC).
The final/const keyword. This will forever be useful because it contains semantic information.
Buffered IO works for many applications, but involves a data copy. High-performance applications do their own buffering because they can do a better job of it at the application level, and potentially avoid a data copy.
Using arrays instead of Lists is not an optimization, it's an algorithmic design. If anyone ever makes a list structure that performs as well as an array, it's because their array sucks.
Unraveling/reversing nested loops is part of GCC 4.4. It's already on its way out for basic blocks. The problem is, many developers write code that can't be automatically unraveled or reversed because it has trivial data dependencies that the compiler must preserve, due to sloppy coding.
Replacing recursion with a for loop already happens in many languages. However, this can be an algorithmic decision, because using a loop allows you to control your traversal order, whereas recursion implies a depth-first traversal.
Minimizing object size to fit into cache or cache alignment issues will never be fully automated for the systems where it really matters. Too much information that doesn't make it into the source code, in any language.
Selecting the smallest data type that will work may be obviated on a virtual machine, but not on any physical machine. Unfortunately, it means that you can no longer depend upon predictable truncation or overflow behavior, since the virtual machine will just promote your type's size.
Scalability will always be a human concern. If you can teach a computer how to automate scalability, it'll probably end up with an AI-induced crush on you.
Tuning Garbage collector can be automated if you can annotate priorities for certain code paths. The runtime just needs to know what parts of the application are latency-sensitive, and it can avoid disrupting those parts.

Answer 11

Some gripes with this article:

I think there will always be a place for threading. No matter how fast the CPU there will always be slow devices to deal with (if nothing else, distant devices--the computer in Timbuktu.)

I don't believe we can ever have rangeless integers as there's no way for the compiler to know how big a number could get.

Finally, I believe that for many purposes caching can become a thing of the past--if you need the whole thing memory-mapped files are the answer if you have enough address space. There still will be some caching, though--things where the system has no way of knowing that it's acceptable.

Answer 12

Many of the things you point to are already being done by the compiler.

Done by most modern compilers
final keyword, 1.5 handles this automtaically I believe
branching, compilers can minimize these to some extent
loop unrolling, done by compilers where the bounds are known
recursion, compilers will optimize tail call recursion
shifts, again most compilers do this

as for GC, the choice not to have a GC in C++ was intention, it's perfectly possible to add your own, such as the Boehm GC.

Answer 13

Point 11 is already a bad idea. On modern CPUs, a byte can take a lot longer to operate on than a 32/64 bit word.

That said, I think in all these cases you just have to remember your original statement--PREMATURE optimization. If you code it and it fails a specification, THEN consider fixing it, and test each fix.

Answer 14

Another good one is pre-allocation.

How many times have you called new ArrayList with an integer? That's an optimization that can go away.

Answer 15

Floating point operations can be slow on certain processors, as some DSPs don't have the necessary hardware. I recently had to optimize a function by changing all doubles into fixed point integers. While the code did turn out to be harder to read initially, we saw more than a 2x speed up when real-time processing is what we're trying to achieve.

Don't be so quick to assume that just because compilers are improving and processors are getting faster, that optimizations are going to be useless. Programmers are seeing the end of the free ride of faster hardware. Just because there are more core's doesn't mean that everyone knows how to utilize them correctly. If the memory can't supply all of the core's with enough code to keep the pipeline full, then you're just wasting cycles waiting.

Answer 16

EDIT: I've change my wording now that I understand the question.

It isn't that these optimizations will "go away", more that they will be abstracted so that most developers don't have to deal with them directly. They will still be used on a machine code level, but they will be transparent to the developer.

Think about things like openMP and how g++'s libstdc++ now offers parallel versions of many of the standard algorithms. This is one example of many showing how people will be using threads implicitly and not have to worry about them directly.

Also, many of the things you mention are something that modern compilers deal with already automatically. For example, function inlining, shifts instead of multiplications, minimizing branchs, unraveling/reversing nested loops, replacing recursion with a for loop. These are all things that compilers will do for now today.

Many of these optimizations are also be offered as "compiler intrinsics". This allows developers to use advanced features of the CPUs in an abstract yet powerful way.

For these, it doesn't matter if the optimization becomes irrelevant in the future, since only compiler writers will be concerned with it. As newer CPUs come out, the compilers will be able to tune the code appropriately. Most compilers already do this: think -march=i686 vs -march=pentium3 options on gcc. These options produce very different code which favored the architectures features.

As for other optimizations, personally I find that having powerful computers is no excuse to write crappy code. Choose the correct algorithm, implement it in a simple and direct way. And let the compiler do the heavy lifting for you.

There are a few things which I think will go away though. For example fixed point math is largely necessary for day to day code. There are platforms which this is still necessary. But for the usual desktop programming, it is pretty much a thing of the past except for the most performance critical of applications.

Answer 17

Are a lot of these even necessary in today's world. I'm thinking specifically about stuff like "using shift instead of / or *". Maybe in some very extreme instances, this stuff is useful, but for most day-to-day coding tasks, I would have to say, don't do it. Make your code clear, rather than fast, unless you have that specific line of code is a bottleneck.

Answer 18

Let's look at modern, dynamic languages like Python. Most of this list is useless already. The bulk of the list is not a "someday in the future," it's already been done and works really well.

The remaining items, however, are not done. And I think they're always going to be hard. This is my take.

Threads -- always hard. Granularity of locking and contention for resources will always be hard.
Inlining functions -- irrelevant in dynamic languages.
Final -- irrelevant in Python.
Buffered IO -- done.
Object caching -- done. Use an ORM layer to maximize this. Otherwise, buy more memory.
Minimizing branch statements. Plain old design.
Using arrays instead of Lists -- done. Lists are first class, arrays are an extension.
Unraveling/reversing nested loops -- hard. I'm not sure what the vision is, this sounds like plain old design. Loops and search are central to most algorithms and algorithm design is hard.
Replacing recursion -- at the fringes. Sometimes recursion is faster; you can't allow an optimizer to guess at this. [However, see below]
Minimizing object size -- irrelevant. You specifically elect to give up control over this to get something that works and is readable.
Selecting the smallest data type -- done. long integer is rangeless, and Python switches politely from short integers to long integers.
Scalability -- always hard. Always will be hard. Demand rises as fast as capability to deliver. There's no possibility of a "solution" to this.
newing/deleting memory -- done.
Using shift instead of * or / -- you can still use shift if you want to.
Avoiding the use of Exceptions. I'm not sure what the issue is. Exceptions are first-class and low-overhead.
Tuning Garbage collector -- maybe there's still more to do here. AFAIK this is still being tweaked.
Double buffering. Value of this varies; generally we just buy more memory to make this irrelevant except for streaming data (video, audio, game animations, etc.)

Edit on recursion. The Python stack limit may be the remaining big "in the future" optimization issue in Python. You can write some algorithms correctly, but run afoul of the stack limitations. This requires either removing the recursion, introducing memoization (which is clearly an optimization) or replacing the recursion with loops or generators.

Answer 19

One obvious one we missed is compilers.

We compile code transform it to a format that is closer to the hardware and we do this for performance. Therefore all compilers are optimizations.

We could hide the fact that CPU instructions are not similar to source code if source code could be compiled in real time down to the CPU. This will occur in the future and the compiler will become invisible, until it breaks that is.

We already do this in Java and .net with JIT, we'll just do it at a higher level one day.

Answer 20

Here's one that is a little out there, as in it might not even be possible.

In computers today performance is all about caching, and caching is having to deal with multiple levels of memory. As memory goes from small to large, it goes from fast to slow.

Caching is ensure you are maximizing the capabilities of all levels of memory. This is the upper bound of performance for most applications.

All that memory management is complex and is basically a huge optimization. Imagine one day where you just used memory and didn't care what level of memory you used. When you add a hard drive or ram you just get more total memory. Not large slow persistent memory and smaller fast volatile memory. You just get more total memory. It is then up to the operating system/runtime to maximize your hardware for you.

Then going off in a tangent, by removing the designer's responsibility to optimize memory usage, we might as well remove the responsibility to persist data. We could just make all memory appear to be persistent. This is important as user input is sacred.

Answer 21

Out of curiosity, I compiled two programs to object code with gcc -c -O3, one using shift right by 3, and one integer division by 8, and disassembled them to see what the compiler does:

[me@myhost ~]$ cat test1.c
include <stdio.h>
int main(int argc, char *argv[])
{
        volatile int x;
        x = 100;
        printf("x/8 = %d\n", x / 8);
}
[scameron@zuul ~]$ cat test2.c
include <stdio.h>
int main(int argc, char *argv[])
{
        volatile int x;
        x = 100;
        printf("x/8 = %d\n", x >> 3);
}
[me@myhost ~]$ cat test1.dis

test1.o:     file format elf32-i386

Disassembly of section .text:

00000000 :
   0:   8d 4c 24 04             lea    0x4(%esp),%ecx
   4:   83 e4 f0                and    $0xfffffff0,%esp
   7:   ff 71 fc                pushl  -0x4(%ecx)
   a:   55                      push   %ebp
   b:   89 e5                   mov    %esp,%ebp
   d:   51                      push   %ecx
   e:   83 ec 24                sub    $0x24,%esp
  11:   c7 45 f8 64 00 00 00    movl   $0x64,-0x8(%ebp)
  18:   8b 55 f8                mov    -0x8(%ebp),%edx
  1b:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
  22:   89 d0                   mov    %edx,%eax
  24:   c1 f8 1f                sar    $0x1f,%eax
  27:   c1 e8 1d                shr    $0x1d,%eax
  2a:   01 d0                   add    %edx,%eax
  2c:   c1 f8 03                sar    $0x3,%eax
  2f:   89 44 24 04             mov    %eax,0x4(%esp)
  33:   e8 fc ff ff ff          call   34 
  38:   83 c4 24                add    $0x24,%esp
  3b:   59                      pop    %ecx
  3c:   5d                      pop    %ebp
  3d:   8d 61 fc                lea    -0x4(%ecx),%esp
  40:   c3                      ret    
[me@myhost ~]$ cat test2.dis

test2.o:     file format elf32-i386

Disassembly of section .text:

00000000 :
   0:   8d 4c 24 04             lea    0x4(%esp),%ecx
   4:   83 e4 f0                and    $0xfffffff0,%esp
   7:   ff 71 fc                pushl  -0x4(%ecx)
   a:   55                      push   %ebp
   b:   89 e5                   mov    %esp,%ebp
   d:   51                      push   %ecx
   e:   83 ec 24                sub    $0x24,%esp
  11:   c7 45 f8 64 00 00 00    movl   $0x64,-0x8(%ebp)
  18:   8b 45 f8                mov    -0x8(%ebp),%eax
  1b:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
  22:   c1 f8 03                sar    $0x3,%eax
  25:   89 44 24 04             mov    %eax,0x4(%esp)
  29:   e8 fc ff ff ff          call   2a 
  2e:   83 c4 24                add    $0x24,%esp
  31:   59                      pop    %ecx
  32:   5d                      pop    %ebp
  33:   8d 61 fc                lea    -0x4(%ecx),%esp
  36:   c3                      ret    
[me@myhost ~]$

(Pardon the missing "#" in front of "include," couldn't be bothered to figure out how to get it to show up inside the "pre" tags.)

I had been under the impression that "x / 8" would compile to the same code as "x >> 3" but this appears not to be the case, at least with -O3.

The shift does seem to compile down to fewer instructions (no idea if it's faster) though the division did wind up being a shift, if I read that right.

Interesting.

Well, I may have botched things by using "volatile." My intent with that was to prevent the compiler from computing x/3 at compile time.

Answer 22

Well, you sort of covered it, but address spaces and memory size should be exploding soon, such that we'll never have to worry about writing to disk. We'll do everything in memory (full databases etc), and the OS will save things to disk ("as a backup") when idle.

On the other hand, memory access time is the last bottleneck (the speed of light prevents you from getting a bit of information from one end of a CPU chip to the other in one clock cycle), so calculating things on the fly (that used to be "lazy evaluated" & cached) may be the way to go.

Answer 23

When design a data format that is a collection of items, including the count of the number of items is an optimization. This number is used to preallocate space before loading the file.

It can also lead to contradictions if the number of items and the count don't match.

Answer 24

In the future, compilers will be smarter than the programmers using them, and CPUs so fast that they can execute an infinite loop in ten seconds.

Answer 25

Expanding on "11. Selecting the smallest data type that will work. Maybe in the future an integer is rangeless"

The fact we have multiple integer types is to not "waste" bits.

If we could process two 16 bit integers in a 32 bit register, we can get double the performance! That's all nice, but by exposing the different integer sizes it becomes an optimization.

Ideally we just want an integer type that is undefined in size. Underneath it would be a variable length byte array. Now you are all thinking it would be hard to optimize, and that true, it would be harder. But that's pushing the work down to the operating system/runtime and not having the high level designer worry about overflows.

The runtime can still use existing 32/64 bit registers to do integer calculation that fit in that range and fallback to a more complex code if its out of range. So maintaining existing performance with seamless transition to boundless integers with just a bit of overhead. And this is overhead that will disappear if implemented in hardware.

Answer 26

From the perspective of someone who has not used c++ for years, and maybe once every couple years might have to re-learn pointer arithmetic for some unsafe code blocks in c#, things like inlining, malloc/free, are just distant memories, but that does not really mean that optimisation is out the window, there are just different optimisations to think about now.

Whilst premature optimisation is the root of many evils, it is all too common for developers to pull out that chestnut as an excuse for being lazy, it doesn't mean you should just use a bubble sort and let the compiler fix it, or throw more hardware at it until it runs fast enough. On the other hand, bit shifting to do arithmetic is rarely neccessary these days (for what I do anyway, embedded systems maybe).

Would love to not have to think about threading, I hope that comes soon, does my old school head in.

Personally, mainly do web stuff, so most of the optimisations I deal with are to do with bandwidth and network latency, and whilst bandwidth is (slowwwwwly) getting better, it will never be infinite, and the speed of light is not likely to change any time soon. http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing

Optimisations have changed a lot in the 20something years I have been tinkering with computers, but will never go away, because we will always be trying to outdo each other.

There will always be someone willing to get down and dirty to dazzle the market. You can use the out of the box inverse square root if you want, but some freak might just write a faster one.....

Answer 27

I think you'll find it's a bit more complicated than that ^[1]. You are assuming a magic run-time and a magic compiler. The dirty little compiler secret is that most of these optimizations aren't useless, and are unlikely to become so in the near future. Most optimizations sort-of work, most of the time, especially on code that has lots of scalars, and where the entire code is available to the compiler at compile-time.

So lets take a look:

1. Threads. I imagine a concurrent model of tomorrow where we don't explicitly deal with threads.

They exist, with Erlang for example. But it will be a long time before you'll see something like this for free in C/C++ or even Java. We haven't seen anything close to helpful to do automatic threading.

2. Inlining functions, the runtime or compiler should be able to figure this out for you

Yes, it sounds like it should. In fact, this is the biggest optimization performed by Java JITs. Compilers like LLVM are designed for this, and projects like gcc's LTO will deliver a big boost here. However, like many optimizations doing this perfectly in the general case is hard, and it is also very expensive. Static compilers have a hard time knowing good places to inline, which is why inlining is typically only performed at -O3 (aka it might hurt, it might help, we can't tell). gcc's LIPO ^[2] should also help, but this problem is surprisingly unsolved.

3. The final keyword in Java. Basically inlining variable references (but we'll still use it for protection, today I use it for both protection and performance)

Again, yes and no. Java suffers from dynamic classloading, which means that the otherwise brilliant javac can't do some optimizations if final is missing. It can do some of the same optimizations at run-time (in fact, that's what's great about JITs), but it won't do run-time optimizations which are expensive.

4. Buffered IO, the operating system or runtime should figure this out for you

I don't know much about this, but I would ask how you hope that the OS would decide whether you want your IO buffered or not.

5. Object caching (and caching in general at many levels), it's not far fetched for the runtime to handle this

I guess this isn't too far fetched, but I think you'd be lucky.

6. Minimizing branch statements, a smarter compiler can remove branches for you

With feedback-directed optimization it can get them right more often. But no-one uses that. It can't remove branches which might be necessary. A JIT can sometimes, but again, can be expensive.

7. Using arrays instead of Lists, a smarter runtime can just use arrays underneath

Yeah, thats iffy. Doing the transformation should be simple enough. Actually getting a speed-up is uncertain.

8. Unraveling/reversing nested loops

For simple loops, compiler optimizations do a decent job. But if you aren't doing scientific applications with huge multi-dimensional arrays, your mileage will vary, especially if you have function calls and anything in the loop escapes.

9. Replacing recursion with a for loop

OK, finally, one that is mostly solved. But not in the general case.

10. Minimizing object size to fit into cache or cache alignment issues. This is a big one. Its a hard problem to crack, but if the runtime could really figure out that really needs to be in cache, then we can finally not care about object size.

Not a chance. Copying garbage collectors are about as close as we can get. But think about this for C or C++: pointers everywhere! I think this one won't die until C/C++ go away. However, I think LLVM's early research ^[3] was on this.

11. Selecting the smallest data type that will work. Maybe in the future an integer is rangeless

Yeah, in quite a lot of cases. For scalar code that doesn't escape a function, this is already practical.

12. Scalability. This sounds like an odd one, but scalability is a performance issue in the end.

I would love to hear some theory about this :)

13. newing/deleting memory (looking at you C++), Java already solved some of this, others just need to catch up. It can still be improved even more

Sadly, this is pretty hard too. Java does great at it since its in a VM, but for non-managed code like C++, this might never happen.

14. Using shift instead of * or /

This one has long been solved, I believe (well, for built-in types).

15. Avoiding the use of Exceptions

Do you mean that you will no longer have to avoid them? This is largely solved for Java, using a technique call dynamic deoptimization ^[4]. With separate compilation, this is a hard one for C++.

16. Tuning Garbage collector

This one I think will be solved. Garbage collectors have gotten amazing, and you still hear of funky GC research going on.

[1] http://i-think-youll-find-its-a-bit-more-complicated-than-that-and-other-excellent-christmas-gifts/
[2] http://gcc.gnu.org/wiki/LightweightIpo
[3] http://llvm.org/pubs/2005-05-04-LattnerPHDThesis.html
[4] http://java.sun.com/products/hotspot/docs/whitepaper/Java%5FHotspot%5Fv1.4.1/Java%5FHSpot%5FWP%5Fv1.4.1%5F1002%5F4.html

Answer 28

The new Intel Nehalem processor relaxes rules on the data alignment, has larger vectorization capabilities and much more intelligent caching. We are seeing 50% performance boosts without even recompiling in heavily math-intensive applications.

Hell, the Intel compiler can give a 50% boost over gcc by itself.

What we have to jump through hoops for now is to make loops vectorize, and this involves real obfuscation of the simple intent of the loop. It's almost like writing code to efficiently compile under CUDA on the GPU (hideous).

I'd also like to see compilers be able for figure out if and how a loop could be spawned across multiple cores. Some loops are no brainers, some are very difficult to get right, leading to weird code manipulations to get the correct result, which will be confusing for the poor SOB following your code later.

And I wish Intel would devote part of their silicon to understanding Parrot byte code, so that all interpreted languages could benefit down the road.

Answer 29

I got another one and its a biggie.

Shared state is an optimization.

When you pass data to another function its usually by reference so it doesn't need to make a copy of the data. This is an optimization. While it might require some rework, you can design everything to be pass by copy.

And once you achieve that, concurrent code becomes easier.