Stack OverflowWhat optimizations today are going to be useless tomorrow?
[+37] [29] Pyrolistical
[2008-11-17 17:42:32]
[ optimization future speculative ]

I hope we all know by now that Premature optimization is the root of all evil.

One side of that quote means by optimizing you are increasing complexity and complexity is evil. The other less known side is today's optimization might be meaningless tomorrow.

What I mean is we used to think that, "Floating-point operations are slow, therefore you should use integers since they are 100x faster! You must use integers even if it makes your code more complex!

That is completely false today. Today, modern processors actually do floating point operations faster than integer operations. Did you know that? I was surprised as well when I tested it in Java.

Also, what about addition and minus are much faster than multiplications or divide? False today (no difference)

Double is much slower than float! False (only half as slow, which is nothing)

A few other things that I think will be useless soon (if they are not already):

  1. Threads. Threads aren't commonly thought as an optimization, but it really is. It trades complexity for performance. I imagine a concurrent model of tomorrow where we don't explicitly deal with threads.
  2. Inlining functions, the runtime or compiler should be able to figure this out for you
  3. The final keyword in Java. Basically inlining variable references (but we'll still use it for protection, today I use it for both protection and performance)
  4. Buffered IO, the operating system or runtime should figure this out for you
  5. Object caching (and caching in general at many levels), it's not far fetched for the runtime to handle this
  6. Minimizing branch statements, a smarter compiler can remove branches for you
  7. Using arrays instead of Lists, a smarter runtime can just use arrays underneath
  8. Unraveling/reversing nested loops
  9. Replacing recursion with a for loop
  10. Minimizing object size to fit into cache or cache alignment issues. This is a big one. Its a hard problem to crack, but if the runtime could really figure out that really needs to be in cache, then we can finally not care about object size.
  11. Selecting the smallest data type that will work. Maybe in the future an integer is rangeless
  12. Scalability. This sounds like an odd one, but scalability is a performance issue in the end.
  13. newing/deleting memory (looking at you C++), Java already solved some of this, others just need to catch up. It can still be improved even more
  14. Using shift instead of * or /
  15. Avoiding the use of Exceptions
  16. Tuning Garbage collector
  17. Double buffering, I think Swing already does this by default

My main point in all this is to be careful with your optimizations and you really should code for clarity and not speed.

So my question is, what are some "optimizations" you are doing today that probably will be useless tomorrow?

(5) Code for clarity, not speed, except when you have to. :) - Terry Donaghe
This is out of my realm, but is Java the language to use for testing processor speed with ints vs. floats? Aren't you also partially testing how well the VM works with the processor? - MusiGenesis
I'm not a Java-hater, it's just that I have a lot of code that works with int arrays (for other reasons) and one of the side benefits of that is (or at least used to be) improved speed. - MusiGenesis
Well, go ahead and test it on your system. When it comes to profiling, you always have to say "It depends...", in this case its true in Java. - Pyrolistical
Musi--There are issues with FPUs returning incompatible results, so Java tends to do them (more slowly) in software, but the differences are not going to make a difference except in some graphical applications, but to some degree you're right. - Bill K
There's plenty to disagree with here (and I do a little bit), but I really liked this post. +1 Thought I'd let the world know ... - A. Rex
Integers are still faster than floating-point numbers, so "integer"-izing mathematical code is still a valid optimization. Particularly serialization (try it - see how many cycles you burn on printf("%f", floatval) vs printf("%d", intval)). - Tom
Additions are faster than divisions. Time it and see. - Crashworks
Faster, sure. But not 100x faster and that's my point. - Pyrolistical
Integer division produces approx. 2 bits of result / cycle on x86. Addition/subtraction of 32-bit integers takes 1 clock-cycle, whereas 64/32-bit division takes approx. 40. It's still an order of magnitude difference. - zvrba
+1, Best question on SO IMO ! Would upvote it thousand times if I could! :D - missingfaktor
[+23] [2008-11-17 18:43:35] cookre

I'm reminded of one of Michael Jackson's aphorisms. No, not THAT one, this one:

Jackson's Rules of Optimization:

Rule 1: Don't do it.

Rule 2 (for experts only): Don't do it yet.

(3) I'm tempted to give this one a -1, because it's exactly opposite of the question asked - it will be true forever! - Mark Ransom
My best advice on optimization, don't use malloc and don't use processes. - Fu.
[+16] [2008-11-17 20:51:33] Paul Nathan

As someone who works with embedded systems, I am amused by your assumption of an operating system and a run-time library.

Amused how? Do you disagree with that I think should be operating system or runtime's responsibility? - Pyrolistical
(4) My dev system has no: run-time libs, OS, threads, or GC. The optimizations you present largely assume the full bloat of a Python-esque environment. So I honestly am quite amused...all those tricks become very important when there's only 1 MB of RAM and your program does everything itself. - Paul Nathan
Ah well, embedded systems today are what computers were a few years ago. The environment I am thinking about is just a modern PC or Mac and their future. - Pyrolistical
Ahh, and we've just finished a 512byte RAM (in 2 banks, of course), 8-bit processor job - you big memory boys! - Byron Ross
1MB of RAM? Luxury! The embedded systems I used to work on had 1K. And 2K of EEPROM. - Darron
(2) As an added hilarity, we don't have to go all the way to primitive embedded systems. Cutting-edge CPU's like, say, the CELL (as used in PS3 and a lot of high-performance grid setups) goes the opposite way, making a lot of the above optimizations more relevant, not less. If the Cell is a sign of what's to come, you'd better get used to thinking about manual branch prediction, caching, DMA transfers and everything else yourself. - jalf
@jalf. Sweet! Then all those arcane CS courses are coming in handy! - Paul Nathan
[+14] [2008-11-17 19:06:56] Cybis

1) Threads. - I don't see how we can avoid working directly with threads if we stick to the traditional OO paradigm. If we want multithreading to be a "behind the scenes" feature, functional programming needs to become more mainstream.

2) Inlining functions - Already obsolete for the most

5) Object caching - Actually, this can decrease performance if used with a generational garbage collector!

7) Using arrays instead of Lists - Unless performance is critical, you should already be using lists. In some languages though, working with arrays are just syntactically more convenient. Python comes a long way to encouraging us to use lists.

11) Selecting the smallest data type that will work. Already bad! Either the data has to be aligned on a word boundary (in which case you won't save space), or it'll be slower to work with than native-length data types (32/64 bit depending on proc).

12) Scalability. I don't understand. Scalability is about choosing the right algorithms and data structures to solve your problems. Unless the compiler already has a preprogrammed solution to your problem, it will never be able to choose your algorithm for you!

14) Using shift instead of * or / - Actually, using explicit shift operators improves readability. It makes it clear that you're manipulating bits rather than performing arithmetic operations.

17) Double buffering, - This is a non-issue if you're just building a dialog box with buttons. If you're doing animation/games, often you just need to set some flag and be done with it.

Threading optimization will lose prominence as Functional Languages become more mainstream - Chris Ballance
Functional languages will never come mainstream. Perhaps functional style features like closures will become more popular, but that still won't be threadless programming. - Pyrolistical
What he was talking about the * and / operators is the substitution of shifts when performing non-bit related arithmetic. It's like using number & 1 to find out if a number is even or odd rather than performing the more appropriate number % 2. - Cristián Romo
Care to explain #7? Unless you're doing a lot of inserts at both the front and back, dynamically-resized arrays a much more efficient in time, space, and spatial locality. - Mr Fooz
Don't confuse "List" and "Linked List". The "List" datatype in .NET, Java, Python (or "vector" in C++ STL) are implemented as dynamically-resized arrays. What I'm suggesting is that unless performance is critical, you should stay at the appropriate level of abstraction. - Cybis
[+13] [2008-11-17 19:23:05] Huntrods

I'm going to take an alternate tack to this.

If you need the optimization in your code today, then possible future advances are of limited concern.

If you NEED the gains from optimization, and actually do get a measurable gain from such an optimization, then it's use has been justified.

The real trick is keeping on top of the advances so that you are not using FALSE optimizations today simply because they were necessary last year.



I agree, I am not saying don't optimize. I just want to understand what will won't need to optimize in the future. If we were to design a new language today, it would be good to know what you need to handle under the covers to minimize the amount of optimization required at the user level. - Pyrolistical
I agree. Too many people take premature optimization to mean avoid optimizing. If it's needed, do it, and only do it if it's real. Optimize where you have a performance problem and don't just do things because they used to work. Actually measure to determine what your gains are. - bruceatk
[+7] [2008-11-17 17:58:51] Dan Hewett

Many architecture specific things are temporary. For instance, it used to be the fastest way to compute a DFT (Discrete Fourier Transform) was to store the sin and cosine values in memory. Now it CAN be cheaper to compute (or approximate) the sin/cos than it is to do a single non-cached memory access.

In "Writing Solid Code" Steve Maguire advocates writing clear (obvious) code, and allowing the compiler to optimize it. Often when you optimize code, you make it hard for the compiler to help.

[+4] [2008-11-17 18:03:34] friol

Don't agree on point one (thread optimization).

I think that algorithmical thread optimization (for example writing a parallel raytracer) will still be useful tomorrow, since the day compilers will do that type of optimization seems far away to me (and I guess that for that day compilers will do any type of optimization :)

"#pragma omp ..." is the way to get a C or C+++ compiler to parallelize your raytracer. Raytracers are a bad example because individual rays are not required to share any state at all, hence the compiler can already do this optimization for you. - Tom
[+4] [2008-11-17 18:09:49] Electrons_Ahoy

Anything involving the Stack.

I worked a gig doing embedded programming for a while, and we'd have actual conversations about whether passing in two parameters to a function versus one was worth the extra stack space. Recursion was out of the question - and we tended to only write new functions when we absolutely had to, since we barely had any stack space to play with.

But, with the cost of flash memory dropping, that's going to quickly go away - much like it did with PCs. If we had started that project a year later, we could have probably had ten times the ram in that device for the same price, which would have made all my weeks of optimization fine-tuning totally obsolete.

(and good riddance.)

flash? its terrible performance (relatively speaking). Your harddrive is faster than most flash memory modules. - gbjbaanb
Faster in what way? flash is getting faster in all ways every day - Pyrolistical
(1) We're talking microcontrollers here. Believe me a write once flash memory mapped into yoru processor running at the clock speed is a WHOLE lot faster than any harddrive that you care to throw at it. - Spence
[+4] [2008-11-17 18:53:03] sep332

Lots of filesystems use some (or a lot!) of extra computation to optimize for spinning hard drives, whether for access latency or fragmentation. Solid state disks make a lot of this work obsolete.

Solid state disks just change what you need to do to keep things fast and keep the damn things from gaining strings of bad sectors. The problem is about as complex either way. - tloach
(1) I didn't say "This is totally useless!", I said that a lot of it is obsolete. On an SSD, you don't have to worry about reading a lot of discontiguous blocks, or about different parts of the disk reading or writing slower than others. Caching strategies are dramatically different, etc. - sep332
[+3] [2009-01-27 01:20:28] Calyth

I'm amused by the hubris of the original article, claiming that those optimizations are unnecessary.

It's like watching a Python guy asking why anyone would bother learning assembly, while at the end, Python is written in C, which the run time has to execute in assembly.

Most of your so call obsolete optimizations are merely shuffle, and rightfully so. The bulk of programmers shouldn't care about shifts vs x/2, yaddi yaddi yadda. But someone still has to learn it if they were to maintain, revise, or create a new language and associated compiler / interpreter.

(1) We are taught encapsulation and abstraction, but optimizations are the exact opposite of that. - Pyrolistical
[+3] [2009-01-27 02:56:40] Tom

Addressing some of the points you brought up...

  1. Inlining functions. C and C++ have the biggest problem with this, because of the interface/implementation separation. However, toolchains are getting better at inlining across compilation units (ICC and MSVC, not GCC).
  2. The final/const keyword. This will forever be useful because it contains semantic information.
  3. Buffered IO works for many applications, but involves a data copy. High-performance applications do their own buffering because they can do a better job of it at the application level, and potentially avoid a data copy.
  4. Using arrays instead of Lists is not an optimization, it's an algorithmic design. If anyone ever makes a list structure that performs as well as an array, it's because their array sucks.
  5. Unraveling/reversing nested loops is part of GCC 4.4. It's already on its way out for basic blocks. The problem is, many developers write code that can't be automatically unraveled or reversed because it has trivial data dependencies that the compiler must preserve, due to sloppy coding.
  6. Replacing recursion with a for loop already happens in many languages. However, this can be an algorithmic decision, because using a loop allows you to control your traversal order, whereas recursion implies a depth-first traversal.
  7. Minimizing object size to fit into cache or cache alignment issues will never be fully automated for the systems where it really matters. Too much information that doesn't make it into the source code, in any language.
  8. Selecting the smallest data type that will work may be obviated on a virtual machine, but not on any physical machine. Unfortunately, it means that you can no longer depend upon predictable truncation or overflow behavior, since the virtual machine will just promote your type's size.
  9. Scalability will always be a human concern. If you can teach a computer how to automate scalability, it'll probably end up with an AI-induced crush on you.
  10. Tuning Garbage collector can be automated if you can annotate priorities for certain code paths. The runtime just needs to know what parts of the application are latency-sensitive, and it can avoid disrupting those parts.

[+3] [2009-02-21 05:39:11] Loren Pechtel

Some gripes with this article:

I think there will always be a place for threading. No matter how fast the CPU there will always be slow devices to deal with (if nothing else, distant devices--the computer in Timbuktu.)

I don't believe we can ever have rangeless integers as there's no way for the compiler to know how big a number could get.

Finally, I believe that for many purposes caching can become a thing of the past--if you need the whole thing memory-mapped files are the answer if you have enough address space. There still will be some caching, though--things where the system has no way of knowing that it's acceptable.

(1) Will of course there will be threading, just not explicit threading. The idea is to make threading automatic/invisible. We want it to just work - Pyrolistical
I don't think it will be possible to make invisible threading. How can the compiler know what's safe to run in parallel??? - Loren Pechtel
(1) You are stuck in the current paradigm, so of course it seems impossible to make threading invisible. There are many turing complete system that have automatic threading. The most well known one is functional programming. FP has its own issues, but as you can see invisible threading can be done. - Pyrolistical
[+2] [2008-11-17 18:10:28] Alex Gaynor

Many of the things you point to are already being done by the compiler.

  • Done by most modern compilers
  • final keyword, 1.5 handles this automtaically I believe
  • branching, compilers can minimize these to some extent
  • loop unrolling, done by compilers where the bounds are known
  • recursion, compilers will optimize tail call recursion
  • shifts, again most compilers do this

as for GC, the choice not to have a GC in C++ was intention, it's perfectly possible to add your own, such as the Boehm GC.

[+2] [2008-11-17 18:16:51] Bill K

Point 11 is already a bad idea. On modern CPUs, a byte can take a lot longer to operate on than a 32/64 bit word.

That said, I think in all these cases you just have to remember your original statement--PREMATURE optimization. If you code it and it fails a specification, THEN consider fixing it, and test each fix.

[+2] [2008-11-18 01:43:10] Pyrolistical

Another good one is pre-allocation.

How many times have you called new ArrayList with an integer? That's an optimization that can go away.

[+2] [2009-01-27 03:14:47] fitzfitsahero

Floating point operations can be slow on certain processors, as some DSPs don't have the necessary hardware. I recently had to optimize a function by changing all doubles into fixed point integers. While the code did turn out to be harder to read initially, we saw more than a 2x speed up when real-time processing is what we're trying to achieve.

Don't be so quick to assume that just because compilers are improving and processors are getting faster, that optimizations are going to be useless. Programmers are seeing the end of the free ride of faster hardware. Just because there are more core's doesn't mean that everyone knows how to utilize them correctly. If the memory can't supply all of the core's with enough code to keep the pipeline full, then you're just wasting cycles waiting.

[+2] [2009-02-21 05:30:30] Evan Teran

EDIT: I've change my wording now that I understand the question.

It isn't that these optimizations will "go away", more that they will be abstracted so that most developers don't have to deal with them directly. They will still be used on a machine code level, but they will be transparent to the developer.

Think about things like openMP and how g++'s libstdc++ now offers parallel versions of many of the standard algorithms. This is one example of many showing how people will be using threads implicitly and not have to worry about them directly.

Also, many of the things you mention are something that modern compilers deal with already automatically. For example, function inlining, shifts instead of multiplications, minimizing branchs, unraveling/reversing nested loops, replacing recursion with a for loop. These are all things that compilers will do for now today.

Many of these optimizations are also be offered as "compiler intrinsics". This allows developers to use advanced features of the CPUs in an abstract yet powerful way.

For these, it doesn't matter if the optimization becomes irrelevant in the future, since only compiler writers will be concerned with it. As newer CPUs come out, the compilers will be able to tune the code appropriately. Most compilers already do this: think -march=i686 vs -march=pentium3 options on gcc. These options produce very different code which favored the architectures features.

As for other optimizations, personally I find that having powerful computers is no excuse to write crappy code. Choose the correct algorithm, implement it in a simple and direct way. And let the compiler do the heavy lifting for you.

There are a few things which I think will go away though. For example fixed point math is largely necessary for day to day code. There are platforms which this is still necessary. But for the usual desktop programming, it is pretty much a thing of the past except for the most performance critical of applications.

We are talking about exactly the same thing. By making these optimizations invisible to most developers, its the same thing as making them go away in high level code. - Pyrolistical
I guess I misunderstood. Yes they will go away in the sense the high level code won't need it. But they will also stay in the sense that the compilers will still implement them. - Evan Teran
[+1] [2008-11-17 18:06:45] Kibbee

Are a lot of these even necessary in today's world. I'm thinking specifically about stuff like "using shift instead of / or *". Maybe in some very extreme instances, this stuff is useful, but for most day-to-day coding tasks, I would have to say, don't do it. Make your code clear, rather than fast, unless you have that specific line of code is a bottleneck.

Some of them are, some of them aren't, some are still used even though they're unnecessary. For example, modern compilers are very efficient at function inlining. Also, any modern compiler will transform an x / 2 into an x >> 1 under the hood. - Wedge
[+1] [2008-11-17 18:38:56] S.Lott

Let's look at modern, dynamic languages like Python. Most of this list is useless already. The bulk of the list is not a "someday in the future," it's already been done and works really well.

The remaining items, however, are not done. And I think they're always going to be hard. This is my take.

  1. Threads -- always hard. Granularity of locking and contention for resources will always be hard.
  2. Inlining functions -- irrelevant in dynamic languages.
  3. Final -- irrelevant in Python.
  4. Buffered IO -- done.
  5. Object caching -- done. Use an ORM layer to maximize this. Otherwise, buy more memory.
  6. Minimizing branch statements. Plain old design.
  7. Using arrays instead of Lists -- done. Lists are first class, arrays are an extension.
  8. Unraveling/reversing nested loops -- hard. I'm not sure what the vision is, this sounds like plain old design. Loops and search are central to most algorithms and algorithm design is hard.
  9. Replacing recursion -- at the fringes. Sometimes recursion is faster; you can't allow an optimizer to guess at this. [However, see below]
  10. Minimizing object size -- irrelevant. You specifically elect to give up control over this to get something that works and is readable.
  11. Selecting the smallest data type -- done. long integer is rangeless, and Python switches politely from short integers to long integers.
  12. Scalability -- always hard. Always will be hard. Demand rises as fast as capability to deliver. There's no possibility of a "solution" to this.
  13. newing/deleting memory -- done.
  14. Using shift instead of * or / -- you can still use shift if you want to.
  15. Avoiding the use of Exceptions. I'm not sure what the issue is. Exceptions are first-class and low-overhead.
  16. Tuning Garbage collector -- maybe there's still more to do here. AFAIK this is still being tweaked.
  17. Double buffering. Value of this varies; generally we just buy more memory to make this irrelevant except for streaming data (video, audio, game animations, etc.)

Edit on recursion. The Python stack limit may be the remaining big "in the future" optimization issue in Python. You can write some algorithms correctly, but run afoul of the stack limitations. This requires either removing the recursion, introducing memoization (which is clearly an optimization) or replacing the recursion with loops or generators.

Except, in most benchmarking I've done, python is hundreds of times slower than C. So maybe these things are important in practice? - Chris Jefferson
Most of these optimizations aren't available in Python because you're solving a different class of problems than you solve with c. That's the ambiguity with this question. It doesn't specify any particular class of problems. - S.Lott
Doing recursion without blowing the stack is an important optimization. - Svante
Thanks for bring up, but that's sort of like caching - Pyrolistical
Memoization is a very specialized use of caching. For the transcendental functions (like Gamma) and Stirling numbers, and what-not, memoization helps compute results in less time and less memory use. - S.Lott
[+1] [2009-02-21 04:44:50] Pyrolistical

One obvious one we missed is compilers.

We compile code transform it to a format that is closer to the hardware and we do this for performance. Therefore all compilers are optimizations.

We could hide the fact that CPU instructions are not similar to source code if source code could be compiled in real time down to the CPU. This will occur in the future and the compiler will become invisible, until it breaks that is.

We already do this in Java and .net with JIT, we'll just do it at a higher level one day.

[+1] [2009-02-21 04:59:21] Pyrolistical

Here's one that is a little out there, as in it might not even be possible.

In computers today performance is all about caching, and caching is having to deal with multiple levels of memory. As memory goes from small to large, it goes from fast to slow.

Caching is ensure you are maximizing the capabilities of all levels of memory. This is the upper bound of performance for most applications.

All that memory management is complex and is basically a huge optimization. Imagine one day where you just used memory and didn't care what level of memory you used. When you add a hard drive or ram you just get more total memory. Not large slow persistent memory and smaller fast volatile memory. You just get more total memory. It is then up to the operating system/runtime to maximize your hardware for you.

Then going off in a tangent, by removing the designer's responsibility to optimize memory usage, we might as well remove the responsibility to persist data. We could just make all memory appear to be persistent. This is important as user input is sacred.

[+1] [2009-02-21 05:19:04] smcameron

Out of curiosity, I compiled two programs to object code with gcc -c -O3, one using shift right by 3, and one integer division by 8, and disassembled them to see what the compiler does:

[me@myhost ~]$ cat test1.c
include <stdio.h>
int main(int argc, char *argv[])
        volatile int x;
        x = 100;
        printf("x/8 = %d\n", x / 8);
[scameron@zuul ~]$ cat test2.c
include <stdio.h>
int main(int argc, char *argv[])
        volatile int x;
        x = 100;
        printf("x/8 = %d\n", x >> 3);
[me@myhost ~]$ cat test1.dis

test1.o:     file format elf32-i386

Disassembly of section .text:

00000000 :
   0:   8d 4c 24 04             lea    0x4(%esp),%ecx
   4:   83 e4 f0                and    $0xfffffff0,%esp
   7:   ff 71 fc                pushl  -0x4(%ecx)
   a:   55                      push   %ebp
   b:   89 e5                   mov    %esp,%ebp
   d:   51                      push   %ecx
   e:   83 ec 24                sub    $0x24,%esp
  11:   c7 45 f8 64 00 00 00    movl   $0x64,-0x8(%ebp)
  18:   8b 55 f8                mov    -0x8(%ebp),%edx
  1b:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
  22:   89 d0                   mov    %edx,%eax
  24:   c1 f8 1f                sar    $0x1f,%eax
  27:   c1 e8 1d                shr    $0x1d,%eax
  2a:   01 d0                   add    %edx,%eax
  2c:   c1 f8 03                sar    $0x3,%eax
  2f:   89 44 24 04             mov    %eax,0x4(%esp)
  33:   e8 fc ff ff ff          call   34 
  38:   83 c4 24                add    $0x24,%esp
  3b:   59                      pop    %ecx
  3c:   5d                      pop    %ebp
  3d:   8d 61 fc                lea    -0x4(%ecx),%esp
  40:   c3                      ret    
[me@myhost ~]$ cat test2.dis

test2.o:     file format elf32-i386

Disassembly of section .text:

00000000 :
   0:   8d 4c 24 04             lea    0x4(%esp),%ecx
   4:   83 e4 f0                and    $0xfffffff0,%esp
   7:   ff 71 fc                pushl  -0x4(%ecx)
   a:   55                      push   %ebp
   b:   89 e5                   mov    %esp,%ebp
   d:   51                      push   %ecx
   e:   83 ec 24                sub    $0x24,%esp
  11:   c7 45 f8 64 00 00 00    movl   $0x64,-0x8(%ebp)
  18:   8b 45 f8                mov    -0x8(%ebp),%eax
  1b:   c7 04 24 00 00 00 00    movl   $0x0,(%esp)
  22:   c1 f8 03                sar    $0x3,%eax
  25:   89 44 24 04             mov    %eax,0x4(%esp)
  29:   e8 fc ff ff ff          call   2a 
  2e:   83 c4 24                add    $0x24,%esp
  31:   59                      pop    %ecx
  32:   5d                      pop    %ebp
  33:   8d 61 fc                lea    -0x4(%ecx),%esp
  36:   c3                      ret    
[me@myhost ~]$

(Pardon the missing "#" in front of "include," couldn't be bothered to figure out how to get it to show up inside the "pre" tags.)

I had been under the impression that "x / 8" would compile to the same code as "x >> 3" but this appears not to be the case, at least with -O3.

The shift does seem to compile down to fewer instructions (no idea if it's faster) though the division did wind up being a shift, if I read that right.


Well, I may have botched things by using "volatile." My intent with that was to prevent the compiler from computing x/3 at compile time.

yes, they both ended up as a "sar $0x3,%eax" - Evan Teran
Division of -1 by 8 will return 0, whereas shifting -1 right by 3 will still return -1. The two operations are not equivalent over negative numbers, hence the different code. - zvrba
[0] [2008-11-17 18:00:51] James Curran

Well, you sort of covered it, but address spaces and memory size should be exploding soon, such that we'll never have to worry about writing to disk. We'll do everything in memory (full databases etc), and the OS will save things to disk ("as a backup") when idle.

On the other hand, memory access time is the last bottleneck (the speed of light prevents you from getting a bit of information from one end of a CPU chip to the other in one clock cycle), so calculating things on the fly (that used to be "lazy evaluated" & cached) may be the way to go.

This is a good one. No more memory hierarchy. You just persist stuff, and the operating system will eventually write it to non volatile storage. - Pyrolistical
Doing everything in memory and using the disk as a backup has already been done years ago, in a system called KeyKOS (the antecessor of the Eros operating system). - CesarB
Well, no, we will not do everything in memory, because there are too many programmers out there who think that memory is practically unlimited nowadays. 640 GB (sic) should be enough for anyone? - Svante
I say we will. What we consider memory will just change. If you can make the slower devices transparent, then who's to day you don't have 100 TB of "memory" - Pyrolistical
Problem is, slower devices aren't transparent. How many times have you had to wait on disk I/O for the Windows "Start" menu to pop up? - Tom
[0] [2009-01-27 00:42:49] Pyrolistical

When design a data format that is a collection of items, including the count of the number of items is an optimization. This number is used to preallocate space before loading the file.

It can also lead to contradictions if the number of items and the count don't match.

[0] [2009-01-27 03:32:31] Crashworks

In the future, compilers will be smarter than the programmers using them, and CPUs so fast that they can execute an infinite loop in ten seconds.

An infinite loop in 10 seconds, I assume you are being ironic - johnc
yes. And it fits the question. - jalf
[0] [2009-02-21 05:34:52] Pyrolistical

Expanding on "11. Selecting the smallest data type that will work. Maybe in the future an integer is rangeless"

The fact we have multiple integer types is to not "waste" bits.

If we could process two 16 bit integers in a 32 bit register, we can get double the performance! That's all nice, but by exposing the different integer sizes it becomes an optimization.

Ideally we just want an integer type that is undefined in size. Underneath it would be a variable length byte array. Now you are all thinking it would be hard to optimize, and that true, it would be harder. But that's pushing the work down to the operating system/runtime and not having the high level designer worry about overflows.

The runtime can still use existing 32/64 bit registers to do integer calculation that fit in that range and fallback to a more complex code if its out of range. So maintaining existing performance with seamless transition to boundless integers with just a bit of overhead. And this is overhead that will disappear if implemented in hardware.

[0] [2009-02-21 07:00:01] seanb

From the perspective of someone who has not used c++ for years, and maybe once every couple years might have to re-learn pointer arithmetic for some unsafe code blocks in c#, things like inlining, malloc/free, are just distant memories, but that does not really mean that optimisation is out the window, there are just different optimisations to think about now.

Whilst premature optimisation is the root of many evils, it is all too common for developers to pull out that chestnut as an excuse for being lazy, it doesn't mean you should just use a bubble sort and let the compiler fix it, or throw more hardware at it until it runs fast enough. On the other hand, bit shifting to do arithmetic is rarely neccessary these days (for what I do anyway, embedded systems maybe).

Would love to not have to think about threading, I hope that comes soon, does my old school head in.

Personally, mainly do web stuff, so most of the optimisations I deal with are to do with bandwidth and network latency, and whilst bandwidth is (slowwwwwly) getting better, it will never be infinite, and the speed of light is not likely to change any time soon.

Optimisations have changed a lot in the 20something years I have been tinkering with computers, but will never go away, because we will always be trying to outdo each other.

There will always be someone willing to get down and dirty to dazzle the market. You can use the out of the box inverse square root if you want, but some freak might just write a faster one.....

btw, bubble sort is actually really fast if it was automatically parallelized. so yes developers should implement bubble sort and expect compilers to fix it ;) - Pyrolistical
[0] [2009-07-31 13:24:39] Paul Biggar

I think you'll find it's a bit more complicated than that [1]. You are assuming a magic run-time and a magic compiler. The dirty little compiler secret is that most of these optimizations aren't useless, and are unlikely to become so in the near future. Most optimizations sort-of work, most of the time, especially on code that has lots of scalars, and where the entire code is available to the compiler at compile-time.

So lets take a look:

1. Threads. I imagine a concurrent model of tomorrow where we don't explicitly deal with threads.

They exist, with Erlang for example. But it will be a long time before you'll see something like this for free in C/C++ or even Java. We haven't seen anything close to helpful to do automatic threading.

2. Inlining functions, the runtime or compiler should be able to figure this out for you

Yes, it sounds like it should. In fact, this is the biggest optimization performed by Java JITs. Compilers like LLVM are designed for this, and projects like gcc's LTO will deliver a big boost here. However, like many optimizations doing this perfectly in the general case is hard, and it is also very expensive. Static compilers have a hard time knowing good places to inline, which is why inlining is typically only performed at -O3 (aka it might hurt, it might help, we can't tell). gcc's LIPO [2] should also help, but this problem is surprisingly unsolved.

3. The final keyword in Java. Basically inlining variable references (but we'll still use it for protection, today I use it for both protection and performance)

Again, yes and no. Java suffers from dynamic classloading, which means that the otherwise brilliant javac can't do some optimizations if final is missing. It can do some of the same optimizations at run-time (in fact, that's what's great about JITs), but it won't do run-time optimizations which are expensive.

4. Buffered IO, the operating system or runtime should figure this out for you

I don't know much about this, but I would ask how you hope that the OS would decide whether you want your IO buffered or not.

5. Object caching (and caching in general at many levels), it's not far fetched for the runtime to handle this

I guess this isn't too far fetched, but I think you'd be lucky.

6. Minimizing branch statements, a smarter compiler can remove branches for you

With feedback-directed optimization it can get them right more often. But no-one uses that. It can't remove branches which might be necessary. A JIT can sometimes, but again, can be expensive.

7. Using arrays instead of Lists, a smarter runtime can just use arrays underneath

Yeah, thats iffy. Doing the transformation should be simple enough. Actually getting a speed-up is uncertain.

8. Unraveling/reversing nested loops

For simple loops, compiler optimizations do a decent job. But if you aren't doing scientific applications with huge multi-dimensional arrays, your mileage will vary, especially if you have function calls and anything in the loop escapes.

9. Replacing recursion with a for loop

OK, finally, one that is mostly solved. But not in the general case.

10. Minimizing object size to fit into cache or cache alignment issues. This is a big one. Its a hard problem to crack, but if the runtime could really figure out that really needs to be in cache, then we can finally not care about object size.

Not a chance. Copying garbage collectors are about as close as we can get. But think about this for C or C++: pointers everywhere! I think this one won't die until C/C++ go away. However, I think LLVM's early research [3] was on this.

11. Selecting the smallest data type that will work. Maybe in the future an integer is rangeless

Yeah, in quite a lot of cases. For scalar code that doesn't escape a function, this is already practical.

12. Scalability. This sounds like an odd one, but scalability is a performance issue in the end.

I would love to hear some theory about this :)

13. newing/deleting memory (looking at you C++), Java already solved some of this, others just need to catch up. It can still be improved even more

Sadly, this is pretty hard too. Java does great at it since its in a VM, but for non-managed code like C++, this might never happen.

14. Using shift instead of * or /

This one has long been solved, I believe (well, for built-in types).

15. Avoiding the use of Exceptions

Do you mean that you will no longer have to avoid them? This is largely solved for Java, using a technique call dynamic deoptimization [4]. With separate compilation, this is a hard one for C++.

16. Tuning Garbage collector

This one I think will be solved. Garbage collectors have gotten amazing, and you still hear of funky GC research going on.

[1] http://i-think-youll-find-its-a-bit-more-complicated-than-that-and-other-excellent-christmas-gifts/

About Scalability. You can scale up (faster single computer) or scale out (cloud). Right now to scale out you need to use frameworks like MapReduce. This is an optimization due to the fact that you don't have enough performance in a single computer. What I imagine in the future is concepts like MapReduce would be baked into the language and running on one vs N machines is trivial. - Pyrolistical
I see your point. If you put it like that, Erlang has already solved it to a certain extent. But I think it will be a long long time before scalability as part of the language is mainstream. - Paul Biggar
[0] [2009-07-31 14:20:34] xcramps

The new Intel Nehalem processor relaxes rules on the data alignment, has larger vectorization capabilities and much more intelligent caching. We are seeing 50% performance boosts without even recompiling in heavily math-intensive applications.

Hell, the Intel compiler can give a 50% boost over gcc by itself.

What we have to jump through hoops for now is to make loops vectorize, and this involves real obfuscation of the simple intent of the loop. It's almost like writing code to efficiently compile under CUDA on the GPU (hideous).

I'd also like to see compilers be able for figure out if and how a loop could be spawned across multiple cores. Some loops are no brainers, some are very difficult to get right, leading to weird code manipulations to get the correct result, which will be confusing for the poor SOB following your code later.

And I wish Intel would devote part of their silicon to understanding Parrot byte code, so that all interpreted languages could benefit down the road.

Very nice. Cache alignment issues was definitely one problem I don't want to care about. - Pyrolistical
[-2] [2009-02-21 01:54:54] Pyrolistical

I got another one and its a biggie.

Shared state is an optimization.

When you pass data to another function its usually by reference so it doesn't need to make a copy of the data. This is an optimization. While it might require some rework, you can design everything to be pass by copy.

And once you achieve that, concurrent code becomes easier.

Doesn't that mean that every parameter passing will include the overhead of a memcpy? - Crashworks
Yes. Sharing state partly to remove that overhead. It is an optimization. - Pyrolistical