Stack Overflow: Tricky Algorithm Interview Question
[+121] [36] Jessica
[2010-10-30 10:05:27]
[ algorithm microsoft interview-questions ]
[ http://stackoverflow.com/questions/4058172] [DELETED]

Possible Duplicate:
Algorithm: How to tell if an array is a permutation in O(n)? [1]

I was asked this interview question at Microsoft, and it's been bugging me that I haven't been able to solve it. Does anybody have any ideas?

You're passed an array with n integers in it. Each integer is between 1 and n (inclusive). These numbers are not in sorted order and there may be duplicates or missing numbers.

The problem is to figure out if there are any duplicates in the array in O(n) time and O(1) space without destroying the array.

(1) Good luck in the interview process ! - steve
(1) Where did all the comments go? - CodeInChaos
(1) Indeed. Why were the comments removed? - Paul
"duplicates OR missing numbers"?? - Philip
[+41] [2010-10-30 11:01:53] Marcelo Cantos

For a probabilistic solution, compute m different order-independent checksums and verify that they all correspond to the known checksums for an array containing every number once.

To choose checksums, you could sum, multiply, sum the squares, multiply the squares, etc., allowing each of them to wrap rather than using bignums. There's probably some way to synthesise an arbitrarily large family of checksums. For instance checksum q could sum the result of multiplying each element by the (1000+q)th prime number.

Crude analysis: Assuming 32-bit checksums with passable avalanche and reasonable independence between checksums, the probability of a collision is about 1/2^(32m). This means that a five-checksum verification is on par with sha1.

Edit: Note @Steve Jessop's point in the comments that simple multiplication with different multipliers won't produce independent checksums. @CodeInChaos's suggestion to use sha1 on each element would.
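A minimal sketch of the per-element hashing approach from the comments (Python; the function name is mine, and I use plain unsalted SHA-1 with XOR as the order-independent combiner, so this illustrates the idea rather than the attacker-resistant salted variant):

```python
import hashlib

def looks_like_permutation(a):
    """Probabilistic duplicate check: XOR of per-element SHA-1 digests is
    order-independent, so the combined digest of a duplicate-free array
    matches that of range(1, n+1). O(n) time, O(1) extra space; a false
    'no duplicates' answer requires an astronomically unlikely collision."""
    def checksum(values):
        acc = 0
        for v in values:
            digest = hashlib.sha1(str(v).encode()).digest()
            acc ^= int.from_bytes(digest, "big")  # XOR ignores element order
        return acc

    return checksum(a) == checksum(range(1, len(a) + 1))
```

A duplicate forces some value to be missing, so the two XOR totals differ by the hashes of the mismatched values; they only coincide on a hash collision.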


That's a nice way to think about it. - Jessica
(2) One implementation of this which even works if the array is chosen by an attacker who tries to fool your result is *xor*ing the sha1(a[i]+salt) of all entries with a random salt. - CodeInChaos
@CodeInChaos: Hehe, yes I guess that qualifies. O(100*n) == O(n). - Marcelo Cantos
(4) There is one problem: the number of checksums you'll check will grow with n, so technically the complexity will be slightly larger than O(n). - ruslik
@ruslik: Fix m to some constant you are comfortable with. I'd be happy with m = 5 for any size of input. - Marcelo Cantos
(1) "For instance checksum q could sum the result of multiplying each element by the (1000+q)th prime number" - those would be redundant, surely. If you just multiply by a constant, they'll either all match, or none of them will. However, if checksum q raised each element to the power of q modulo some large prime, then we could be in business. Ideal would be a hash function which is cryptographically secure except for the property that it ignores the order of its input bytes. Not sure if such exists and/or can be easily derived from a normal hash as CodeInChaos implies. - Steve Jessop
@ruslik I don't think it grows with n. It only grows if you want to decrease the failure probability. The failure probability is just 1/2^m (m being the bit length of the hash) and thus independent of n, and for m > 100 it is small enough for all practical purposes. - CodeInChaos
@Steve: The choice of multipliers would be important, and my off-the-cuff suggestion is probably too low, but as long as they are sufficiently large to cause wrap-around, they should be fine (and cheaper than raising to some power). The basis of this idea is the LCG, so perhaps using five common LCG multipliers would give sufficient avalanche and independence. - Marcelo Cantos
Sorry, I don't understand the suggestion then. For n=3, (Kx + Ky + Kz) = (K*1 + K*2 + K*3) if and only if (x + y + z) = 6, regardless of K and regardless of wraparound (as long as K isn't a power of 2, of course, but you've excluded that already). I thought this was relevant to your suggestion. - Steve Jessop
@Steve: You are absolutely right. I'm forgetting my modulo arithmetic. - Marcelo Cantos
I voted for this answer anyway, btw, because it's the best so far and I think something along the lines of CodeInChaos's suggestion will probably provide a good set of checksums. - Steve Jessop
@CodeInChaos You're right. I see now. - ruslik
The probabilistic approach is interesting; if one has a set of independent bijective functions, one can compute the checksum of multiple such mappings, selected at random. If the time required for each such mapping is independent of 'N', one could reduce the probability of a false 'no duplicates' report to any particular desired level by using a fixed number of passes. - supercat
I don't get the "crude analysis": There are n^n possible sequences of n numbers in the range [1..n]. So, if I want to be reasonably sure that there's no collision between the checksums of [1..n] and the checksums of any sequence containing duplicates, then n^n * 1/2^(32m) has to be small. And that's only true if n << m. So the solution is at least O(n^2). - nikie
@nikie: This is standard hashing theory, which relies on the ability to have m << n; if you couldn't, then sha1 wouldn't be useful. You don't need to guarantee that there are no collisions at all, which would obviously require m >> n. You just need to ensure a low probability that a bad sequence coincidentally generates the same checksum as a known solution. - Marcelo Cantos
(1) I don't like probabilistic solutions because it is just "waiting for it to break" -- it is like saying the chance of the algorithm failing is very small, but it does happen. Like the chance of winning the lottery is very small, but still some people win it. - 動靜能量
@nikie: Minor correction on my previous comment. To guarantee no collisions you only need m == n, where the "hash" is just the sequence itself. - Marcelo Cantos
(3) @動靜能量: In a universe built on quantum physics, there is no such thing as a non-probabilistic solution. Every possible test for equivalence has a non-zero probability of failure. Scientists and statisticians generally consider a probability of 10**-50 to be zero for all intents and purposes, and sha1 or something similar is on par with that. In practice, the probability of a false positive or negative is much higher than that, since CPUs fail, gamma rays from the sun flip bits randomly every once in a while, etc. The mathematics of cryptographic hashing is usually the least of your worries. - Marcelo Cantos
(3) @動靜能量 The chance of random CPU errors is much higher than the chance of a hash collision for typical hash sizes (160-512 bits), so I wouldn't worry about that. Roughly speaking: unless there is an attacker who breaks the crypto behind the hash function, it would take all the computers in the world longer than any practical timescale for a collision to happen at random in a 512-bit hash function. - CodeInChaos
(3) @動靜能量: P.S.: For a bit of perspective on the numbers, the probability of winning a national lottery is about 10**38 times greater than that of seeing a sha1 collision between two different sequences. If you ran one billion sha1 comparisons per second on random pairs of documents for one billion years, you would still be 10 quadrillion times more likely to win the lottery than to see a collision. - Marcelo Cantos
@Marcelo Cantos it seems the analysis is purely based on sha1, disregarding the size of n. What if n is really large, so large that it doubles or triples the order of magnitude of 10^-50? Then the chance of failure is not as simple as 10^-50. - 動靜能量
(1) @動靜能量: There are two problems with this, either one fatal on its own. 1) How do you propose to create an array of size 10^100? There aren't that many atoms in the known universe, even if you include purported dark matter. - Marcelo Cantos
@動靜能量: 2) n is not the number of checksums you are comparing to each other, it is the size of the input to a single checksum, and the size of the input has absolutely no effect on the security of sha1. Perhaps @CodeInChaos's idea of computing a sha1 sum on each character confused you, but these sums aren't compared to each other, they are simply used to build a total checksum for the entire array. It is this total checksum that is compared to a known checksum to check that the array has no duplicates. Only one checksum comparison is made, with a 2^-160 (about 10^-48) probability of failure. - Marcelo Cantos
If the array size is about 2^(m/2) then we get collisions. So in theory we need O(log(n)) storage for the hash and O(n*log(n)) calculation. But in our finite universe this doesn't really matter. - CodeInChaos
[+30] [2010-10-30 11:41:44] Sudhanshu [ACCEPTED]

Looks like this will work:

Think of each array slot as a linked list node, containing the address of the next slot in the list. We calculate the address of next slot as (k-1), where k is the number stored in the slot, so if array[0] == 4, then next slot to visit is array[3]. An example array:

S = 0 1 2 3 4 (Array Idx)
k = 1 2 3 4 5 (Array Values)

The above array has 5 linked lists of one node each i.e. each list's last slot (the only slot of each list) pointing to the starting slot (i.e. itself). Another possible example:

S = 0 1 2 3 4 (Array Idx)
k = 2 1 3 5 4 (Array Values)

This array has 3 lists (2 1), (3), (5, 4) - two of them containing cycles ending at the starting node i.e. with the last slot in the list pointing to the starting slot of the list. Another example:

S = 0 1 2 3 4 (Array Idx)
k = 2 3 3 5 4 (Array Values)

This array has 3 lists as well, with one containing a cycle, that does not end at the starting node of that list, but at the last node itself: (2,3). There's another list starting in slot 1, which shares a slot with another list (3,3). And the last list is (5,4).



Our goal is to identify a list whose last node points neither to itself nor to the starting slot of that list. If such a list exists there is a duplicate in the array. We'll use this strategy to do this:

  1. Start traversing the first linked list starting at slot 0. Save the starting node idx of the "current" linked list in a variable S, so S = 0.

  2. As we visit each node, we will mark that node as visited by multiplying the number in the slot by -1.

  3. If a visited slot 'k-1' contains a number 'n' such that (n-1) != (k-1) && (n-1) != S, and n < 0, we are visiting the slot again, and we have found our duplicate number 'n'.

  4. However, if n-1 == k-1 || n-1 == S, the traversal of this list ends there and we need to find the next list to traverse.

  5. Store the index of current slot in S. To find the starting node of the next list - walk down the rest of the array in a circular fashion, until you find a slot which does not contain a -ve number, or you reach the slot pointed to by S again.

  6. If you reached the starting slot S, all the nodes in the array have been visited, and there are no duplicates.

  7. If you find a new slot, save its index in S, and go back to 2.

  8. After we are done we'll need to make one more pass through the array to restore the values (we are not supposed to destroy the array). This is still O(n) though.
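A possible Python rendering of this traversal (my own sketch, not Sudhanshu's code). One caveat: the sketch ends a walk only when it returns to its own start slot; taking step 4 literally and also ending on a self-pointer appears to miss the duplicate in the third example above (k = 2 3 3 5 4), since the walk from slot 0 swallows slots 0, 1 and 2 in a single pass:

```python
def has_duplicates(a):
    """Walk the slot -> a[slot]-1 successor chains, marking visited slots
    by negating their contents (the sign-bit trick discussed above).
    In a duplicate-free array the chains are disjoint cycles, so a walk can
    only re-enter a marked slot at its own start; hitting any other marked
    slot proves a duplicate. O(n) time, O(1) extra space; the array is
    restored before returning."""
    dup = False
    for s in range(len(a)):
        if a[s] < 0:                  # slot already visited by an earlier walk
            continue
        cur = s
        while True:
            a[cur] = -a[cur]          # mark the current slot as visited
            nxt = -a[cur] - 1         # successor slot (values are 1-based)
            if a[nxt] < 0:            # stepping onto a visited slot...
                dup = nxt != s        # ...is fine only if it closes this walk
                break
            cur = nxt
        if dup:
            break
    for i in range(len(a)):           # restore pass (step 8)
        if a[i] < 0:
            a[i] = -a[i]
    return dup
```

If every walk closes at its own start, the successor map splits into disjoint cycles, which makes it a bijection on the slots, i.e. a permutation of 1..n.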


(14) O(n) in space (you're using the sign bit). Nice try. - SimonJ
@SimonJ: Not using extra space though apart from S. Just what's already there. :) - Sudhanshu
(4) @Sudhanshu: Nothing says that the domain of values of the array extends from -N to N. - Nabb
@Nabb: I am not sure I get what you are saying. I am just transforming already stored numbers in the array to be -ve to mark them as already visited. That's possible because the question says the domain is between 1..N. And I restore the numbers in the array at the end, after finding the duplicate. - Sudhanshu
This is a very slick solution, and is probably as good as is possible. - Jessica
Using -1 as a flag is possible, and I think it's the key to obtaining other solutions. - Fabio F.
(4) Your step 2 is just the same cheat I do. You use the sign bit, I use the lowest bit. But your algorithm is more complicated without avoiding to cheat. - CodeInChaos
@CodeInChaos: If I had asked this question during the interview, I'd really not consider this as cheating. I don't think your approach is cheating too - except it has the disadvantage of reducing the range of the numbers. I really didn't look at your approach before using the sign bit as flag in my solution though - I saw this approach being used before in some other place (for something else). :) - Sudhanshu
You require the range of numbers from -n to n, and I from 1 to 2n+1. Personally I think both solutions are cheating, but since nobody here has found a clean solution yet this kind of cheating is probably the best deterministic solution one can expect in an interview. - CodeInChaos
(5) @Sudhansu: it all depends whether "integer" in the question should be taken to mean signed int. What if "integer" is supposed to mean unsigned int, and n is supposed to be allowed to range all the way to UINT_MAX? The question doesn't say, so all you'd need is to clarify with the interviewer before continuing. Unfortunately we don't have the interviewer here to ask, so we talk about "cheating" and "not cheating" instead... - Steve Jessop
@CodeInChaos, @Steve: Yup, I agree with you guys - all depends on how you see it. I am happy with the top answers on this page too. I am really not looking to prove that someone's wrong - given that we don't have the interviewer here, and cannot determine what exactly what she wanted. It was satisfying to get to this answer - in the end its all that matters. :) - Sudhanshu
The problem may be trivially solved if there's a spare bit in each array element, since that's O(N) extra space. One could require that each array element be able to hold numbers from 1 to N+k for some constant k (the amount of extra space this would require for each element diminishes as the number of elements increases). I wonder if that would be useful. - supercat
I think you're on the right track. You can find duplicates in a linked list without modifying it, so you should be able to fix your solution. - Gabe
@Gabe: Yeah, you're right about the fact that I can find duplicates in a linked list without modifying it - however, since there can be multiple linked lists in the array - the problem boils down to whether you can keep track of which linked lists you have already looked at without modifying the list? That eludes me so far... - Sudhanshu
I believe this is O(n log n). I can't see it being O(n). - Loren Pechtel
@Loren, this is O(3n). Using an extra bit to mark visited nodes means that in the worst case, you visit each node 3 times, once to mark, once to find the next unmarked node, and once to unmark all the nodes. You can do this without modifying the array in O(n^2) by not marking any element. - MSN
(1) @MSN We have one loop across all of n and it spawns a subloop for each entry hit. You can't nest loops in a O(n) routine. - Loren Pechtel
(1) @Loren, you can if the order of each loop in aggregate is O(n). In this case, the algorithm marks visited array elements and skips them if they are already marked in the top loop. In the worst running case, all elements are unique and within [1, N]. In that case, you still loop when you hit an element that is unmarked, but you are guaranteed to never mark an element more than once, which is O(n) for all sub-loops. - MSN
[+25] [2010-10-30 10:48:00] CodeInChaos

I found one solution, but it borders on cheating. Technically I don't destroy the array if I restore it afterwards to the original state. But of course it's bad style and fails in multithreaded code. And it requires MaxInt>=2*n+1.

I abuse one bit of each of the ints in the array as an n-bit bitvector that records which numbers I have already encountered. And when I'm done I strip those bits out again.

//Empty last bit
for(int i=0;i<n;i++)
{
   a[i]*=2;
}

//Fill last bit with my data
for(int i=0;i<n;i++)
{
   int num=a[i]/2-1;
   a[num]|=1;
}

bool result=false;
//Restore the array and check for a missing number:
//an unset bit means some value in 1..n never occurred,
//which (with n values in 1..n) implies a duplicate elsewhere
for(int i=0;i<n;i++)
{
   if((a[i]&1)==0)
     result=true;
   a[i]/=2;
}
return result;
return result;

(1) You loop through the array 3 times. Isn't that O(n*3) time? - Core Xii
(23) Check the definition of the big O notation. O(n)==O(n*3). en.wikipedia.org/wiki/Big_O_notation - CodeInChaos
(2) What if a[i] *= 2 overflows? - IVlad
(2) If n is doubled then the execution time doubles, hence it's in O(n). Unfortunately this solution is also O(n) in space... - SimonJ
@IVlad overflow would be an issue, but the answer states the requirement that MaxInt>=2*n+1, so as far as interview questions go he's covered it:) - Steve Haigh
@IVlad That's why I added the requirement of MaxInt>=2*n+1. But yes technically it's still O(n) in space. And if one is really precise the input data is O(n*log(n)) in size and thus any algorithm is at least O(n*log(n)) in time. - CodeInChaos
(2) Returns true for both a = {1, 2, 3, 4} and a = {1, 2, 3, 3}, n = 4 for me in C++. - IVlad
It's cheating but very clever cheating :) - tauran
I agree with tauran. This is clever cheating. - Jessica
(15) It is plain cheating... if you are using an array of "flags", then you may as well call it space requirement of O(n). What is the difference between simply using an array of flags and "hacking it" to use an array of flags? And for any problem that requires space of O(n), you can hack it to use an extra 32-bit too... or extra 64-bit by assuming a "bigger" integer size, and that is totally non-O(1). - 動靜能量
(7) +1 for creativeness, cheating or not - smirkingman
@Core Xii: 3*n main operations it's O(n) - dfens
(2) I would confirm with the interviewer that it is an array of 'integers' and not 'unsigned integers' and then would use the negative sign as the flag, rather than multiplying by two - Michael
This partial answer may please the interviewer for creativity. - SHiNKiROU
(1) Isn't this O(n) in memory? - Green Code
[+12] [2010-10-30 10:34:44] stakx

Very interesting problem. I personally suspect that there is no solution to this problem, i.e. they wanted to see your way of reasoning and analyzing the problem... but I'm looking forward to other answers here and would be more than glad if another answer proves me wrong!

The reasons why I suspect that no solution exists are:

  • If you do this by sorting the array, you need a sorting algorithm that takes O(n) time and O(1) memory. I think that the O(1) memory requirement means you'd have to sort in-place, which the interview question explicitly disallows. O(n) time sorting algorithms are rare I think; radix sort comes close with O(k×n), where k depends on the number of digits of your numbers; but even that is disputed by some.

  • If you try to do this without sorting, the obvious way is to somehow remember which numbers you've already encountered. The O(1) memory constraint doesn't allow this.

  • You're then left with "heuristic" methods, such as the n × (n+1) / 2 trick (invented by the mathematician Gauss, I believe), but that's not 100% accurate, as demonstrated in the comments to some other answers here. For this particular solution, I'm not sure whether it would help to multiply each number by a large constant so that they are "spread" a little further apart and won't "interfere" when you add them up or multiply them; but that's a very vague idea which I'm almost sure won't make a difference in the end.
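To see why the sum check alone is inconclusive, here is a quick hand-picked counterexample (Python; the helper name is mine):

```python
def gauss_sum_matches(a):
    # A permutation of 1..n must sum to n*(n+1)/2, but the converse fails:
    # duplicates can compensate exactly for missing numbers.
    n = len(a)
    return sum(a) == n * (n + 1) // 2

print(gauss_sum_matches([1, 2, 3, 4, 5]))  # True: genuine permutation
print(gauss_sum_matches([1, 2, 4, 4, 4]))  # True, despite the duplicates
```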


It would take O(nk) to sort with radix sort, but here, k is equal to log(n). I suspect you're right about a solution not existing, but I'm not 100% sure. It is possible if there is only 1 number missing (duplicated) or really any finite number of numbers missing (duplicated), so it's not outside the realm of possibility that there is a solution. - Jessica
(1) To the downvoter, I'd appreciate it if you would state why you think the answer wrong, or unhelpful. - stakx
I guess the downvoters don't see it as a proof that there is no answer. For example, on the TV program "MythBusters", they try to do something, fail, and then declare the myth busted. So if they tried to build an airplane and failed, would they say "we busted the myth that any man-made machine can fly in the sky"? - 動靜能量
(1) @動靜能量, hehe! I see your point. Still, the question asked was, "Does anybody have any ideas?" This can go both ways. To me it seemed more promising to first reflect about whether the problem can be solved at all, before I'd start thinking about hundreds of potential solutions. But of course my answer isn't thorough enough to prove that no solution exists. It's only a tentative start in that direction. - stakx
(1) Marcelo Cantos's solution shows that it can be solved with arbitrarily high success probability. And it's easy to get the probability beyond the error rate of any physical CPU, so it is solved in practice. - CodeInChaos
What is arbitrarily high? Compared to n being what? What if n is 2^10000000? Then the chance of success is not so high. You have to understand that in math, the limit of x / y with both x and y approaching infinity is indeterminate; you can't say x can be arbitrarily high and therefore x / y must be zero. And if you say n can't be that high because no computer can handle it, then what value of n is high enough to be practical? Will that still be true 10 years later, or 20? So your answer may be practical, but it is not pure computer science. - 動靜能量
(1) Checking whether the sum is n(n+1)/2 doesn't pass the O(1) space requirement,as it takes O(log n) space to store the result of the arithmetic. - uckelman
[+11] [2010-10-30 10:54:37] IVlad

This paper [1] describes a solution to a similar problem under the same constraints. It's pretty advanced but seems to be what you're after.

The only difference is that there's always a duplicate in that problem, whereas in your case there might not be. So you just need to figure out how to detect that.

[1] http://geomete.com/abdali/papers/duplmath.pdf

(1) That's an interesting paper, but it doesn't directly apply since their technique to find an indirection list of infinite period in constant time does not work in this problem. - Jessica
(1) Unfortunately only a straightforward O(N²)/O(1) solution comes out of the paper for this problem, which isn't asymptotically faster than the simpler O(1) space solutions. - Nabb
If you have n numbers and each number is 1..n either you have 1..n exactly or you have at least one missing number AND at least one duplicate. - Hightechrider
@Hightechrider: The difficulty with applying the method in the paper to the stated problem is that it presupposes that one knows an array element whose index will not be found in the list (in an array where the slots at indices 1..n+1 hold values 1..n, the index of slot n+1 will not be found in the list). The puzzle as stated offers no similarly-nice starting point. - supercat
[+9] [2010-10-30 12:45:26] 動靜能量

The original question says:

figure out if there are any duplicates in the array in O(n) time and O(1) space without destroying the array

Does re-arranging the element count as "destroying the array"? For example, if those are just some collected data in some experiment where the order of performing the experiment is not important, then the order of those numbers doesn't matter.

In that case, we can take element 1, say, if it contains 8, then we swap element 1 with element 8, so now element 8 contains the number 8.

Now we look at element 1 again, if it contains 3, we swap it with element 3. If we are swapping into an element that already contains that number, then we have found the duplicate. If we reach the end of array without trying to swap the same number into a slot that already has that number, then we know there is no duplicate.

Just keep on swapping element 1 in this fashion until 1 is swapped into element 1, in which case we will move onto element 2.

It is like doing a Radix sort. It may look like it is O(n * n), but since each swapping will swap 1 more correct element into place, we can at most swap n times, plus the "moving along the array" which is also O(n), so it is O(n + n), which is still O(n).
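The swapping procedure above can be sketched like this (Python; the function name is mine). The amortized argument is visible in the code: every swap parks one value in its home slot for good, so the nested loops perform at most n swaps in total:

```python
def has_duplicates_by_swapping(a):
    """Cycle-sort style check: repeatedly swap a[i] toward its home slot
    a[i]-1. Finding the home slot already holding that value means a
    duplicate. O(n) time, O(1) space, but it rearranges the array."""
    for i in range(len(a)):
        while a[i] != i + 1:          # value not yet in its home slot
            j = a[i] - 1              # home slot for the value at position i
            if a[j] == a[i]:          # home slot already holds this value
                return True           # duplicate found
            a[i], a[j] = a[j], a[i]   # swap the value home
    return False
```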


(1) Agree, worst case is something like [2, 3, 4, 5, 1] where every element needs swapping but once you've followed the cycle (bounded in size by the array length) you need only visit each element one more time. I suspect this is what's meant by "destroying", though... - SimonJ
(3) I interpret rearranging the elements as destroying the array. - Don Kirkby
[+9] [2010-11-04 01:12:28] phetus

The array is a perfect discrete uniform distribution ( Wikipedia [1]). A perfect discrete uniform distribution has a known variance that will change if there are any duplicates, so calculate the variance of the data and compare it to the expected variance. This algorithm seems to be working against a couple of quick tests and is O(n) in time, O(log(n)) in space (only to grow the floating point values), non-destructive to the array. It does start to fail on very large arrays, hence the rounding-error check at the end.

Edit: Added skewness calculation; no false negatives in tests on arrays up to 2^27 elements, the largest I can initialize. It will also detect a single duplicate in a 2^27-item array. The wonky stuff in the loops is the Kahan summation algorithm, which dramatically reduces error in summation.

C#:

public static bool ContainsDuplicates(int[] nums)
{
    double ERROR_ALLOWED = 0.00000000000001;
    double average = 0;
    ulong sum = 0;
    for (int i = 0; i < nums.Length; i++)
    {
        sum += (ulong)nums[i];
    }
    average = (double)sum / (double)nums.Length;

    double variance = (double)nums[0] - average;
    variance *= variance;
    double c = 0;
    for (int i = 1; i < nums.Length; i++)
    {
        double val = (double)nums[i] - average;
        double y = val * val - c;
        double t = variance + y;
        c = (t - variance) - y;
        variance = t;
    }

    variance /= nums.Length;

    double skewBottom = Math.Pow(variance, 1.5);
    double skewness = (double)nums[0] - average;
    skewness *= skewness * skewness;
    c = 0; // reset the Kahan compensation before the skewness pass
    for (int i = 1; i < nums.Length; i++)
    {
        double val = (double)nums[i] - average;
        double y = val * val * val - c;
        double t = skewness + y;
        c = (t - skewness) - y;
        skewness = t;
    }
    skewness /= skewBottom;

    double expectedVariance = nums.Length;
    expectedVariance *= expectedVariance;
    expectedVariance--;
    expectedVariance /= 12;

    double varianceError = Math.Abs(variance / expectedVariance - 1);

    return varianceError > ERROR_ALLOWED ||
        Math.Abs(skewness) > ERROR_ALLOWED;
}
[1] http://en.wikipedia.org/wiki/Uniform_distribution_%28discrete%29

(1) But almost any distribution with two or more parameters (gaussian, binomial...) can be tweaked to have the same mean and standard deviation, right? - nikie
Either (1) this isn't a general solution, or (2) it's not O(1) space. If your double type uses a fixed number of bits, then there is some n after which you'll overflow it and won't get the right answer anymore. If your double type expands to accommodate numbers of any size, then the number of bits you need to store the average and the variance will be linear in the number of bits you need to store n, so O(n) space. Since your code appears to be Java, it looks like you fail on (1). - uckelman
Hi Nikie, I'm not a statistician but I think any duplicates will necessarily change the variance, so I suspect there are no special cases that will give you an algebraically equal variance (although you may get very close values, hence the rounding problems. - phetus
uckelman, you are right, it does start to fail for very large arrays due to rounding problems which could be fixed with arbitrary precision values [O(log n) space? I fixed it in the solution] but I think in practical terms this is an effective and fast solution to the problem. - phetus
(1) I came up with a counter example (1,2,3,4,5 vs 1,1,3,4,4) where the variance is equal to the ideal even with duplicates. However, calculating the skewness (still O(n) in time), still a theoretical O(log(n)) in space, O(1) practically) would eliminate this problem. - phetus
@phetus: Look at it this way: Fixing the mean is one equation. Fixing the standard deviation is another equation. Adding the skewness makes three equations. So you have an equation system with 3 equations and n unknowns. This has infinitely many solutions, if n > 3. You would need n equations to make the solution unique. I.e. to ensure no duplicates, the number of statistical measurements would have to grow with n - And then it's not O(n) anymore. - nikie
[+6] [2010-10-30 11:45:43] Paul

Answered here [1]

looks like I was on the right lines with the linked list idea, but I'm not sure I would have got there

[1] http://aperiodic.net/phil/archives/Geekery/find-duplicate-elements.html

The link here has some interesting ideas, but it does not answer the same question. He is answering the question of how to find duplicates in an array of size n containing numbers 1 to n-1. - Jessica
(1) Looks like the same algorithm as what IVlad linked to, and doesn't solve the problem. - Nabb
This finds a duplicate value when it's known that the array contains at least one such duplicate. How would you adapt it to show the existence of a duplicate in the first place? - SimonJ
(1) I think using floyd cycle finding algorithm might also lead to a solution to the question here ... some tweaks are needed though. - Peer Stritzinger
It's not that case; here you can't use the pigeonhole principle because (1) you don't know that any duplicate exists, and (2) the numbers are between 1..n, whereas for the pigeonhole principle they should be between 1..k where k < n. - Saeed Amiri
I'm also not entirely clear how that algorithm is supposed to deal with the possibility that there may exist perfect cycles which don't have tails. - supercat
Doesn't he find the length of the cycle? If the cycle is of length n, there are no dups. - Thomas Ahle
@TA, yes, but there are many cycles that are less than length n that have no dups too (e.g when array[n]= n) - Paul
[+5] [2010-10-30 12:44:16] Fabio F.

Using Sudhanshu's idea, a possible algorithm in Java could be:

boolean findDuplicate(int[] array){
    boolean duplicate=false;
    for(int i=0;i<array.length && !duplicate;i++){
        int slot=Math.abs(array[i])-1;
        array[slot]*=-1;
        if(array[slot]>0) duplicate=true;//slot was already visited
    }
    for(int i=0;i<array.length;i++){
        //restore: any slot still negative gets its sign flipped back
        if(array[i]<0) array[i]*=-1;
    }
    return duplicate;
}

Is it correct?

(Thanks to Maciej Hehl)


You want array[array[i]] not array[array[i]-1]. and then (array[array[i]])>0 on the next line. And there is a potential tricky sign bit problem. - P i
(3) Hmm, array={1} goes out of bounds with array[array[0]] ... - Fabio F.
(1) I'd change array[array[i]-1]*=-1; to array[abs(array[i])-1]*=-1; and the next line in a similar way. - Maciej Hehl
[+3] [2010-10-30 11:47:05] Jessica

The paper that IVlad linked to inspired me to think of this solution (which unfortunately also cheats by destroying the array):

The idea is that you start at index 1. Then you look at the sequence 1,a[1], a[a[1]], a[a[a[1]]], and so on, and you look for when that sequence hits 1 again, or comes back to a number that it's already hit. If the latter occurs then you know that there is a duplicate. The cheating part comes in where you zero out every array element in this sequence. The reason for this is two-fold: (a) it makes it easy to determine if you hit a number twice in your sequence and (b) it makes the runtime O(n), since we mark this element as not needing to be iterated on when we go through the rest of the array.

Here is python code implementing the algorithm and some test cases, in case my description is not clear:

def arrayHasDuplicates(array):

    startLoopIndex = 1

    while startLoopIndex < len(array):
        loopIndex = startLoopIndex
        while array[loopIndex-1] > 0:
            nextLoopIndex = array[loopIndex-1]
            array[loopIndex-1] = 0
            loopIndex = nextLoopIndex

        if  loopIndex != startLoopIndex:
            return True
        startLoopIndex = startLoopIndex + 1

    return False 

def test(array):
    print str(array)+" duplicates? "+str(arrayHasDuplicates(array))

test([1])
test([1,2])
test([2,2])
test([1,2,3])
test([2,2,2])
test([3,2,3])
test([1,1,1])
test([3,1,2])
test([7,5,3,1,2,4,6])
test([7,5,3,1,2,2,6])

Getting closer ... this inspired me to my answer ... - Peer Stritzinger
There's no need to zero out elements to detect if you've hit a cycle somewhere other than your start point. Just let the iteration run 'n' times and if you haven't hit your start point, you've hit a cycle somewhere else. Since when that happens, you're done, that case will only happen once and it doesn't matter if it takes time 'n'. If one could compute the convolution of an array in-place in linear time, that might be good (since the convolution of a permutation would be reversible) but I don't see any nice way to do that. - supercat
You need to zero out elements to mark cycles you've already visited, otherwise the algorithm is O(n^2). - Jessica
10
[+3] [2010-10-30 13:47:37] P i

[edited!]

The only problem seems to be that the question does not specify whether these are signed or unsigned integers. If they are signed, we can use the sign bit for storage, since the values are all guaranteed to be positive, and the solution is relatively trivial and has been presented already. For the sake of completeness, I present it once more:

//Solution involves the fact that we can use the sign-bit of 
//the i'th element to record whether the value i is duplicated.

// suppose array is P[0] thru P[N-1]
int i;
for (i = 0; i < N; i++)
{
    int k = ABS(P[i]) - 1; // value P[i] lives in slot P[i]-1

    if (P[k] < 0)
        break; // DUPLICATE

    P[k] = -P[k];
}

bool foundDuplicate = (i < N); // if i reached N, that means no duplicates

// now clean array
for (i = 0; i < N; i++)
    if (P[i] < 0)
        P[i] = -P[i];

(NB: this method is not thread safe as it messes with the array before restoring it)

Note that zero duplicates implies that the array contains each of the numbers 1 thru N; in other words, it implies that there are no omissions. So we only need to check for duplicates.

As it is an interview question, it would probably be bad practice to optimise it straight away, as that would just give the interviewer a headache, but it could be sped up considerably.

If, however, we are given unsigned integers, then we are not guaranteed a sign bit to play with. In this case a solution is not obvious, and if one exists, it is almost certainly going to be impractical. My gut feeling is that this problem with unsigned integers is intractable, and that a proof of this could be arrived at, though it would be very tricky.

It is worth remembering that this is a question in the context of an interview, and the interviewer is likely looking for qualities such as the ability to work efficiently, rather than eagerness to waste hours worrying out theoretical minutiae. Business is usually more a matter of making gut calls; most problems in business do not have a perfect solution.

Also, a company hiring software engineers would want people with the ability to detect a flawed specification, rather than simply those that would blindly do what they have been told.

I know this is not going to be a popular angle, but being realistic, while an interview for a position in a university would appreciate a lengthy discussion, an interview for a position in a corporation would most likely put the emphasis on 'ability to get things done'.

I suspect also that the interview question assumes signed integers. It is a syntax convention of C, is it not, that an integer is assumed to be signed unless explicitly marked with the 'unsigned' keyword? I'm not sure if this is contained in the original C specification, but I would be surprised if it isn't.


You're the third one with this idea. But since you assume an integer size with one more bit than is needed to fit n (the sign bit), this can be considered cheating. Check the solutions of me and Sudhanshu. - CodeInChaos
Of course the question doesn't specify whether the data type of N is the same as the datatype of P[]. I am guessing that this algorithm together with a brief exposition of the potential pitfalls would be enough to make the interviewer happy. - P i
A signed 16 bit int typically has a range -32768 to 32767, so this is no problem. But nobody stated that they are signed. The only thing we know is that they support the interval from 1 to n. - CodeInChaos
I just deleted my comment which CodeInChaos was answering above, as it contained wrong information. - P i
11
[+2] [2010-10-30 21:10:11] Saeed Amiri

It's easy to find one duplicate: start from the first position in the array and check its value. If it is 1, move on to the next position; if it is some i > 1, move the value to its home a[i] (saving the current a[i]), and mark the vacated home by adding n to the value stored there. Then move the saved value to its own home in the same way (a value j belongs in a[j]), and so on. If you ever find that a home is already marked (holds a value of the form n + j) when a second copy arrives, you have found a duplicate. In the case where a[1] is already in place (1st value in array), select a[2] and do the same. The running time is O(n) (because you visit each home at most twice) and the extra space is O(1). During the iteration you should also skip items whose values are bigger than n (already-marked homes).


And how do you restore the array in the end? Your code looks destructive to me. - CodeInChaos
@CodeInChaos, it's easy, just do the reverse in O(n): since a marked home holds n + i you know its exact place :D. I said just one duplicate; when you find it, you do the reversal. - Saeed Amiri
Can you post that as code? Your description sounds like 動靜能量's solution, but his destroys the ordering of elements. - CodeInChaos
@CodeInChaos, Ok but i didn't find any similarity - Saeed Amiri
Oh it's yet another of the cheater solutions where you use one bit on each array element as tag. Like I used the lowest bit while multiplying the original number with two, or like using the sign as tag. - CodeInChaos
@CodeInChaos, I tried to write it, but it's 2 am and I should go to sleep; I have to wake up at 8 tomorrow. I should change n + i to n * j + i. I'll write it tomorrow. - Saeed Amiri
12
[+2] [2010-10-30 18:09:12] Zarkonnen

A solution, assuming that:

  • we can modify the array as long as we put it back the way it was.
  • we can store numbers at least twice the size of n in each array element. Specifically, that there is at least one high binary digit available that will never be used for storing n.

The algorithm: Go over the elements of the array. At each element with value k, go to element k - 1 (assuming 0-based indexing) of the array, and set its high digit to 1. This gives us a bitmap of which elements are taken without destroying the original data.

Then, go over the array again, and check if there are any elements that do not have the high digit set. While doing so, clear the high digit to return the array to its original state.

If there are elements that do not have the high digit set, we know that there must be missing numbers, and hence that there must be duplicates.

So we end up using O(1) extra space (the boolean used to keep track of whether any element does not have its high bit set) and O(n) time (we loop over the array exactly twice).
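A minimal Python sketch of the two passes, assuming (as the answer does) that every element has a spare high bit above the range needed to store n; bit 62 here is an arbitrary illustrative choice, and the function name is hypothetical:

```python
def has_duplicates(a):
    # Sketch only: Python ints are not fixed-width, so the "spare high
    # bit" is simulated with an arbitrary flag bit.
    FLAG = 1 << 62            # assumed spare bit; requires n < 2**62
    n = len(a)
    # Pass 1: for each value k, set the flag bit of element k-1.
    for i in range(n):
        k = a[i] & ~FLAG      # strip the flag in case slot i was already marked
        a[k - 1] |= FLAG
    # Pass 2: an unflagged slot means a missing value, hence (by
    # pigeonhole) a duplicate.  Clear flags to restore the array.
    missing = False
    for i in range(n):
        if a[i] & FLAG:
            a[i] &= ~FLAG
        else:
            missing = True
    return missing
```

The restore pass means the array is returned to its original state whether or not a duplicate was found.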


You're the fourth one with that idea. - CodeInChaos
Check the comments on my solution for a discussion why it can be considered cheating. - CodeInChaos
(1) Gah, yes, you're quite right. I did that thing where I thought I'd read the comments, and hadn't. - Zarkonnen
13
[+2] [2010-10-30 18:13:58] Joe D

Firstly, let me start off by saying that this is cheating. This is O(n) as O(n * SIZE_MAX), where SIZE_MAX is constant, is equivalent to O(n). And it uses O(1) space by allocating the same amount of space entirely independent of n.

This won't work on a real machine, as it is a horrible abuse of big-O notation.

int
test_dups (size_t *array, size_t n)
{
  int tmp[SIZE_MAX] = { 0 };
  for (size_t i = 0; i < n; i++)
    tmp[array[i]] += 1;

  for (size_t i = 1; i <= n; i++)
    if (tmp[i] > 1)
      return 1;
  return 0;
}

(1) Yep, I noticed this as well. +1 for pointing it out. This is a bit like the philosophy interview: Q:"Prove this table exists." A:"What table?" as it moves the focus to the definition of the question. if the question is designed to be tricky, a tricky answer is an apt response. - P i
14
[+2] [2010-10-30 19:01:17] Niraj Nawanit

A solution without doing any sum or multiplication:

Credit to Paul as I got this idea from his comment: "One thing about the problem as stated is every index must present in the array. So if you consider the array as a set of linked lists where teh value is used as the next index, then the sum of the lengths of all of the linked lists must be N. it seems this should be possible in O(N) but I can't see how to do it yet...".

Algo:

int i = 0;
int j = 0;
while(true) {
    if (arr[i] > 0) {
        arr[i] = - arr[i];
    }
    else {
        printf("duplicate number at index %d\n", i);
        break;
    }
    i = -1 * arr[i]; // we just made a[i] negative!
    j++;
    if (j == n) {
        break;
    }
}
// Again run a loop to make each number positive if it had become negative.

Proof:

1) If j < n and we are able to detect a cycle then there is a duplicate.

2) If j = n and we still did not catch cycle, then we have gone through each index and so each number is unique.

3) From the pigeonhole principle, if at least one number is missing, then some number must be a duplicate


(3) Yet another variation of the cheater solution where you use one bit of each array element as memory. - CodeInChaos
15
[+1] [2010-10-30 22:00:28] 動靜能量

This is in fact similar to CodeInChaos's solution: the original question states

You're passed an array with n integers in it.

without saying unsigned integers, so we can toggle the numbers to be negative:

For each element in the array: if it contains, say, an 8, then go to element 8 and toggle the number there to negative. If it was already negative, that means it was toggled previously, and therefore a duplicate is found.

Otherwise just go on; if an element we reach holds a negative number, simply take the absolute value (-10 becomes 10, and go to element 10 to do the toggling).

If we can toggle every element without ever finding that the value we want to toggle is already toggled, then we know there is no duplicate.

At the end, we just go through the array and toggle the elements back to positive numbers (and O(n + n) is the same as O(n)).

We can always negate the number because, for example, for an 8-bit integer the max is 127 and the min is -128, so 127 can always be toggled to -127.

It is a bit like "using an extra bit", but make use of the fact that we can turn the number into a negative one.
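The signed-integer toggling just described can be sketched in a few lines of Python (a sketch only; the function name is illustrative, and it assumes a[i] in 1..n with signed storage):

```python
def has_dup(a):
    # Walk the array, toggling the sign of the slot each value points at.
    dup = False
    for i in range(len(a)):
        k = abs(a[i]) - 1      # value a[i] points at slot k
        if a[k] < 0:
            dup = True         # slot already toggled: duplicate found
            break
        a[k] = -a[k]
    for i in range(len(a)):
        a[i] = abs(a[i])       # toggle everything back to positive
    return dup
```

The second loop restores the array even when the first loop breaks early.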


if the given numbers are unsigned integers, we can still do it like this:

8 7 2 3 5 1 9 6 4

we use 2 pointers, one pointing at the 8, one pointing at the 1, and swap the numbers if the left one is greater than the median and the right one is less than the median. If the number on either side is already on the correct side, just move that pointer one step forward and repeat the above. This moves all numbers to the correct side of the median and is O(n).

1 4 2 3 5 8 9 6 7

now that we have this array, we can do the toggling like the above by using (n + 1) - i, here 10 - i... so if we want to toggle 1, we put a 9 in it; if we want to toggle 6, we put a 3 in it. The idea is to replace the value with one that is not supposed to be on that side of the array, to use as a flag.

So for example the number 1, if it is "toggled", becomes 9. The number 9 is not supposed to be on the left side of the array, so we know that this number has been toggled. It uses the same method as the algorithm first described in this post, only there the toggling is done by making the number negative, while here it is done by using (n + 1) - i.

This method involves moving the elements, so I don't know if it counts as "destroying the array". (Or we can use my other method in this thread, which involves rearranging the items in the array as well.)


16
[+1] [2010-10-30 12:13:45] Peer Stritzinger

This won't work, see the comments ... just leaving it up for discussion's sake

Consider the following directed graph: n nodes (one for each array position); the edges go from node k to node a[k].

Use Floyd's cycle-finding algorithm to find a cycle and measure its length. If the length is equal to n there are no duplicates.

Only if the numbers are exactly the set 1..n the cycle length will be exactly n.

Cycle detection [1]

Even better, use Brent's algorithm to find the cycle length directly (Brent's algorithm with Python example [2]):

Brent's algorithm: Richard P. Brent described an alternative cycle detection algorithm that, like the tortoise and hare algorithm, requires only two pointers into the sequence. [3] However, it is based on a different principle: searching for the smallest power of two 2^i that is larger than both λ and μ. For i = 0, 1, 2, etc., the algorithm compares x_{2^i−1} with each subsequent sequence value up to the next power of two, stopping when it finds a match. It has two advantages compared to the tortoise and hare algorithm: it finds the correct length λ of the cycle directly, rather than needing to search for it in a subsequent stage, and its steps involve only one evaluation of ƒ rather than three.

[1] http://en.wikipedia.org/wiki/Cycle_detection
[2] http://en.wikipedia.org/wiki/Cycle_detection#Brent.27s_algorithm
[3] http://en.wikipedia.org/wiki/Cycle_detection#Brent.27s_algorithm
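For illustration, the quoted description can be sketched as a generic cycle-length finder in Python (a sketch only; as the comments note, knowing one cycle's length does not settle the duplicate question when the permutation splits into several cycles):

```python
def brent_cycle_length(f, x0):
    # Brent's algorithm: search for the smallest power of two 2**i
    # exceeding the cycle length lam, comparing x_(2**i - 1) against
    # each subsequent value of the sequence x0, f(x0), f(f(x0)), ...
    power = lam = 1
    tortoise = x0
    hare = f(x0)
    while tortoise != hare:
        if power == lam:        # start a new power of two
            tortoise = hare
            power *= 2
            lam = 0
        hare = f(hare)
        lam += 1
    return lam                  # length of the cycle eventually reached
```

Each step evaluates f only once, and the cycle length λ falls out directly, with no second search phase.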

(2) what about (2,1,3). Doesn't this have two independent cycles of length 2 and 1? And I found no way to extend cycle based code quickly to multiple cycles. - CodeInChaos
rats you are right :-( - Peer Stritzinger
17
[+1] [2010-10-31 05:02:33] Margus

You're passed an array with n integers in it. Each integer is between 1 and n (inclusive). These numbers are not in sorted order and there may be duplicates or missing numbers.

The problem is to figure out if there are any duplicates in the array in O(n) time and O(1) space without destroying the array

This is my solution (in Java). It is O(n) time and O(1) space and it does not destroy the array.

How does it work and why is this correct?

  1. My solution works within a limited range, but that limit cannot be reached - or actually it could, if the JRE could address 8.59 GB of memory on a 32-bit kernel or 12.9 GB on a 64-bit kernel (and arrays need a contiguous memory block :D).
  2. Assuming you can't do 1., then this enables me to use 1 bit in every number in the array, because by task definition values are between 1 and n. Also note that all values are positive.
  3. This is not sorting, because I know where the values will go in the array (like linear reordering). I'm not switching elements, but changing 1 bit in numbers.
  4. At all times the other 31 bits remain the same, and I'm not destroying any value that could be there. After I'm done, I will restore even that 1 bit that I temporarily changed. This can be made thread safe if threads read data with value & ((1 << 31)-1).

Code example:

// Finds if array contains at least 1 duplicate
// O(n); Space(1); does not break array
public static boolean isDup(int a[]) {
    int n = a.length, temp, mask = 1 << 31, i;
    boolean result = false;

    for (i = 0; i < n; i++) {
        if ((temp = a[i]) < 0)
            temp ^= mask;
        if (a[temp - 1] > 0)
            a[temp - 1] ^= mask;
        else {
            result = true;
            break;
        }
    }
    for (i = 0; i < n; i++) 
        if (a[i] < 0)
            a[i] ^= mask;
    return result;
}

In case for C or C++ you only need to change int n = a.length in the method body.

Pro bono publico, an answer to a similar question - if you would need to print out all the duplicates. Here I'm exploiting the fact that xor is a fast operation and that 0 is false in C (this will not work in Java).

    // Finds all duplicates
    // O(n^2); Space(1); does not break array
    int i, j;
    for (i = 0; i < n - 1; i++)
        for (j = i + 1; j < n; j++)
            if (!(a[i] ^ a[j]))
                printf("Duplicate: %d : %d.", i, j);

If you need one extra bit per integer, this yields O(n). There are many answers with your solution here. - dark_charlie
Extra bit is contained inside existing array, so no extra space is required, this is O(1). O(n) space would be, if I used external array for this. And at moment I posted this, there was no algorithm or example that was correct (and there is no atm. - you can not calculate checksums or add/multiply/divide anything). - Margus
How is that different from for example my solution? Except that you're using the sign bit instead of the lowest one. Or one of 動靜能量's solutions is pretty much identical, using the sign bit. - CodeInChaos
@CodeInChaos 動靜能量 swaps elements, I don't. Your solution uses multiplication and division (have you any idea how costly that is?). My solution shows that you do not even have to shift them (using << 1 and >> 1). Note also my solution changes the last bit only for elements that it needs; in your case, that is not even possible. for (i = 0; i < n; i++) if (a[i] < 0) a[i] ^= mask; should be something like int umask = mask - 1; for (; i >= 0; i--) a[i] &= umask;, but I guess you get the point already. - Margus
Multiplication and Division by constant two are just bitshifts. And bitshifts are close to free. And none of the differences affects the runtime order. They are just a constant cost multiplicator. And if there are no duplicates you'll need all elements. - CodeInChaos
18
[+1] [2010-10-30 11:35:05] corriganjc

I think rados was on to something. If, in addition to the check that the sum adds up to n(n+1)/2, you add a check that the product of the array is n!, then if either check fails you have duplicates.


You can't compute n(n+1)/2 and n! under the given constraints. - IVlad
n! takes O(n) bits to store. - Jessica
(1) @Jessica: ... and n takes O(log n) bits to store, which is also beyond the O(1) space requirement. - William
19
[+1] [2010-10-30 21:04:58] macieksk

Based on "http://www.wolframalpha.com/input/?i=Sum(e^(k/n*2*pi*i),1..n)"

if n is prime, compute

sum=0
for k in 1..n: sum += exp(a[k]/n * 2*pi*i )

Now, it's enough to check if sum==0. (Here i = sqrt(-1) and computation is on complex plane).

Primality of n is important, because there is no way to make the sum be 0 when there are duplicates (no proper subset of the n-th roots of 1 on the unit circle sums to 0; only the whole set does). The sum can however be pretty close to 0 in some cases of repeated numbers, and the difficulty is to assess how close.

Edit:

The precision of the floats needs to be high enough to see if sum == 0. I don't know yet how to bound the number of bits needed for the complex numbers with respect to n.
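A small Python sketch of the check (floating-point only, with an arbitrary tolerance; the open question above is precisely how to choose that tolerance safely, so the function name and eps are illustrative):

```python
import cmath

def roots_sum_is_zero(a, eps=1e-9):
    # Sum the n-th roots of unity selected by the array values.  For a
    # permutation of 1..n with n prime, the sum is exactly zero;
    # duplicates leave a nonzero residue (up to floating-point error).
    n = len(a)
    s = sum(cmath.exp(2j * cmath.pi * x / n) for x in a)
    return abs(s) < eps
```

For n = 5 (prime), a permutation hits all five 5th roots of unity once, and they cancel exactly.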


I'd guess the required precision for the number is O(n), which would make this algorithm violate both constraints (the math on such large numbers would be at least O(n), so total time at least O(n^2)) - CodeInChaos
@CodeInChaos: hmm, numbers used here are of length 1 (they lie on a circle of radius 1), so any sum of such is < n... The question is, how much precision we need to really see if something is 0 or not. - macieksk
there are n^n - n! array states which must be recognized as wrong. These would fill n*log2(n) bits. So I'd guess we need about that many fractional bits to encode the number - CodeInChaos
@CodeInChaos: ...most of those states n^n states, are pretty far from 0, aren't they? So, many of them can blend together (as it normally happens with floating point precision)... it seems that we only need enough precision around zero... then, the summation might introduce new errors... i don't know. - macieksk
Yes the bit count was only a rough estimation. But it still seems like you won't get below O(n) bits. And proving correctness and not just that it's likely correct is really hard. So I'd rather go with Marcelo's probabilistic solution since it's easy to estimate the chance of failure for it. - CodeInChaos
Somehow this approach leads to a perfect "hashing" strategy in Marcelo's solution. Instead of using SHA, which leaves some very small prob. of error, we construct a perfect hash for a table of 2^m numbers (m smallest possible so n<=2^m). We compute it for lacking numbers n+1,...,2^m (time O(n)), and for n numbers in the array. The hash is a sum on every bit position treating 0 as -1 and 1 as 1. We need m=ceiling(log n) counters, each spanning range -2^(m-1),2^(m-1). So, the required space is (log n)^2. Result is ok if every counter is 0 at the end. The idea from above for p=2 & group product. - macieksk
20
[0] [2010-11-22 23:05:53] mbrevoort

A little late to the party, but here's an implementation I wrote in Groovy a few months ago [1]:

// check if a list of n integers between 1 and n (inclusive) has a duplicate

def hasDup(list) {
  def hasDup = false
  list.each { item ->
    def key = Math.abs(item) - 1
    list[ key ] = -list[ key ]     // mark slot as visited by toggling its sign
  }
  list.eachWithIndex { item, pos ->
    if (item > 0)
      hasDup = true                // slot not negative: value pos+1 is missing or appears an even number of times
    list[pos] = Math.abs(item)     // restore the value to its original positive value
  }
  return hasDup
}

assert hasDup([1,2,3,4,5,5])         == true
assert hasDup([1,2,3,4,5])           == false
assert hasDup([1,2,2,3,4,5])         == true
assert hasDup([1,2,2,3,3,5])         == true
assert hasDup([1,1,3,3,5,5])         == true
assert hasDup([5,5,3,3,1,1])         == true
assert hasDup([2,4,3,3,1,2])         == true
assert hasDup([5,1,1,5,2,5])         == true
assert hasDup([1,6,1,6,1,6])         == true
assert hasDup([1,6,1,6,1,6,6,2])     == true
assert hasDup([1,6,1,6,1,6,6,6,1,2]) == true
assert hasDup([5,1,4,2,3,7,6])       == false
assert hasDup([4,3,2,1])             == false
println "all assertions valid"
[1] http://groovyconsole.appspot.com/script/232001

21
[0] [2010-10-30 15:54:10] supercat

Interesting puzzle. If one assumes that array elements are sized to precisely fit numbers 1..N, then there will be a tiny bit of extra space in the array, but I'm not sure how one could really exploit it, especially since the number of array elements one has to munge to get each extra bit of storage will increase with N. Indeed, experiments with values of N up to 2^24 suggest that one will have to munge about 90.9% of the numbers to free up 'n' bits (this ratio is essentially the same with n=2^16 or n=2^24). If one could free up 'n' bits in linear time, of course, the problem would readily be solved from there on out, but I don't see any nice way of subdividing the problem which wouldn't result in NlgN or worse time.


22
[0] [2010-10-30 23:07:01] dfens
n = 0
i = 0
retval = true
while true
  i = array[i].abs - 1     # follow the cycle: values are 1..n, indices 0..n-1
  array[i] = -array[i].abs # mark the slot we land on
  n += 1
  break if n == array.size
end
array.each_index do |i|
  if array[i] > 0
    retval = false         # slot never visited
  else
    array[i] = -array[i]   # restore
  end
end

return retval

hope this works


23
[0] [2010-10-30 23:23:34] mikera

Here's an interesting (though ultimately flawed) approach for algorithm connoisseurs:

  • Allocate a constant size buffer of length 100 integers
  • Loop through the entire input array for k=2 to 100, and create a histogram of (array[i] mod k) in your buffer
  • For each value of k, check that the histogram is correct (i.e. what you would expect for an array containing all the numbers 1 up to n)

This is O(n) time and O(1) space.

It's clearly not a perfect solution as it can fail if n is too large (it can't distinguish between 1 and 69720375229712477164533808935312303556801 for example).

It possibly may fail for some other very carefully arranged input. But I haven't yet been able to find a case with 64-bit integers where it doesn't work. Counterexamples welcome!
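The three steps above can be sketched in Python (the helper name is illustrative; the two buffers of at most 100 counters each are the constant-size space):

```python
def mod_histograms_ok(a):
    # Probabilistic check: for every modulus k in 2..100, the histogram
    # of a[i] mod k must match the histogram expected from 1..n.
    n = len(a)
    for k in range(2, 101):
        got = [0] * k
        want = [0] * k
        for x in a:
            got[x % k] += 1
        for v in range(1, n + 1):
            want[v % k] += 1
        if got != want:
            return False   # histograms differ: definitely not a permutation
    return True            # probably a permutation
```

A mismatch for any single k is conclusive proof of a duplicate; agreement for all k only makes a permutation likely, which is the flaw the answer concedes.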


p.s. this algorithm was inspired by the Miller-Rabin primality test - in the sense that it is using lots of modulo tests to show that there are probably no duplicates in the array. - mikera
24
[-1] [2010-10-30 23:23:35] William

Edit: Below, I'm going to assume that the O(1) space requirement "really means" that we can't store more than a constant number of integers, not bits. Yes, I realize you can store anything encoded in an arbitrarily-large integer. Oh well :-) Note that even storing n is going to violate the space requirement by requiring O(log n) bits of storage.

[Assuming the array contains only elements from [n], as stated by the OP:]

The array is a permutation of [n] (that is, the array passes the OP's check) if and only if the sum of elements equals T_n and the product of elements equals n!.

The reasoning: Suppose the array A is a permutation of [n]. Now suppose we want to generate an array A' to "fool" the above-implied algorithm that checks sum(A) and product(A). Without loss of generality, we begin with a sum-preserving alteration. Take any two elements x and y of A. To preserve sum, we change them to x-k and y+k for some 0<k<min(x,n-y+1). Now, for this alteration to be product-preserving, it must be that xy=(x-k)(y+k). This simplifies to k=x-y. Finally, this implies that the sum-preserving alteration has not, in fact, altered anything; x and y have simply switched locations in the array. Thus, A remains a permutation.

Edit: The above reasoning will not work, as one can essentially perform another sum-preserving alteration to compensate for the product difference generated by the first one. Thanks go to the commenters for patiently getting me round to this.
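The counterexamples offered in the comments below are easy to verify mechanically; a quick Python check of Jessica's array shows both the sum and the product tests passing despite the duplicates:

```python
from math import factorial, prod

a = [7, 5, 10, 4, 9, 9, 4, 4, 1, 2]   # duplicates: 4 and 9 repeat
n = len(a)
sum_ok = sum(a) == n * (n + 1) // 2   # matches T_10 = 55
prod_ok = prod(a) == factorial(n)     # matches 10! = 3628800
# both checks pass even though a is not a permutation of 1..10
```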


Unfortunately, while computing the sum of all the elements may be done in "constant" space (meaning that one unit of storage is presumed to be capable of storing a number from 1 to N, no matter how big N happens to be) the space required to store N! is not going to be constant. It might be possible to subdivide things somehow, but I suspect one will either end up having to use a non-constant amount of space or non-linear time. - supercat
There are counter-examples to the hypothesis: "The array is a permutation of [n] (that is, the array passes the OP's check) if and only if the sum of elements equals T_n and the product of elements equals n!" - Jessica
n! is of similar size as n^n, and requires about n*log(n) bits or n integers. - CodeInChaos
@supercat, @CodeInChaos: very fair points about differentiating between "O(1) space" interpreted as being able to essentially store O(log n) bits, versus being able to store the O(n log n) size of n!. - William
@Jessica: I invite you to present the counter-examples that you claim exist. - William
@William: Consider the groups 3,4,12 or 2,8,9. Both total 19, and both multiply to 144. So replacing the numbers 3, 4, and 12 with duplicates of 2, 8, and 9 would not change the product nor the sum. - supercat
@supercat: I have to apologize here, since I made an unstated assumption that elements could only be from [n]. My hypothesis should be clarified; alas, this question's closure has disabled editing. My invitation stands for anyone to show counterexamples with elements only from [n]. - William
I tried the edit again and it worked. My chrome is mis-behaving. - William
It seems impolite for me to comment so much, but I have to wonder at this point: Is anyone disagreeing with the "reasoning" paragraph? Or is this being skipped and are gut feelings the root of the "counter-examples exist" response? - William
@William - Both [7, 5, 10, 4, 9, 9, 4, 4, 1, 2] and [1,2,3,4,5,6,7,8,9,10] sum to 55 and have a product of 10!. - Jessica
As has been proved elsewhere in the thread, any finite number of check-sums can be defeated. - Jessica
@William: In a list of 12 numbers, replacing the 3, 4, and 12 with duplicates of 2, 8, and 9 [i.e. having the list be (1, 2, 2, 8, 5, 6, 7, 8, 9, 10, 11, 9)] would yield a sum and product indistinguishable from (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) even though all numbers are 1-12. Jessica's example changes 3, 6, and 8 to 4, 4, and 9 (total 17, product 144). - supercat
Thank you guys for the counterexamples. It had seemed that the "two-at-a-time" alteration reasoning was extensible to three or more. I didn't give this more than a cursory check, though. It's a bad feeling when one realizes a mistake that could have been prevented with simple thoroughness. Anyway, I appreciate that you guys didn't give me up for brain-dead and kept commenting. - William
25
[-1] [2010-10-31 01:00:15] SHiNKiROU

What about checking for duplicates, THEN summing all the array elements? The problem is how to check duplicates with O(n) time complexity. It sounds impossible, but there may be special mathematical properties of arrays of [1..n]; the interviewers may be wanting you to be a math contest winner. Or, they may be expecting you to mention quantum computers.

My previous attempt, "If the array has no duplicates, then the sum is 1+2+...+n" failed, since sum [1,2,3,4] = sum [2,2,3,3].

This may help: http://aperiodic.net/phil/archives/Geekery/find-duplicate-elements.html


How about [2, 2, 3, 3] ? - Maciej Hehl
26
[-1] [2010-11-08 18:27:52] Ivan

If array is a permutation of 1,2,3,...,n then:

The sum of the entries in the array will equal a known value
AND
The sum of each entries squared will equal a known value.

Compute both of these values and see if they match the expected output. I bet it's impossible (or at least very hard) to find an array such that both conditions hold and the input array is not a permutation.

For example:
Is {1 4 3 3} a permutation of {1 2 3 4}?

Well: sumOfEntries( {1 2 3 4} ) = 10
sumOfEntriesSquare( {1 2 3 4} ) = 1 + 4 + 9 + 16 = 30

But:

sumOfEntries( {1 4 3 3} ) = 11
sumOfEntriesSquare( {1 4 3 3} ) = 1 + 16 + 9 + 9 = 25

{1 4 3 3} is not a permutation because 10 != 11
{1 4 3 3} is not a permutation because 30 != 25
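As the comments go on to show, sum plus sum-of-squares can also be fooled; the counterexample found below is quick to verify in Python:

```python
a = [1, 1, 4, 5, 5, 6, 6]   # duplicates: 1, 5 and 6 repeat
n = len(a)
sum_ok = sum(a) == n * (n + 1) // 2        # 28 == 28
squares_ok = (sum(x * x for x in a)
              == sum(i * i for i in range(1, n + 1)))  # 140 == 140
# both checks pass even though a is not a permutation of 1..7
```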


I'm too lazy to search for it, but somebody posted a counterexample where your test thinks it's a permutation, but it isn't. - CodeInChaos
@Carlos, I think you may be misremembering the "factorial" example. I'm computing the sum of squared entries. I'm not multiplying out all entries and testing to see if it equals the factorial. - Ivan
@Carlos, you could always add the "factorial test" and make it a 3 part test. - Ivan
This doesn't work. - Jessica
"I'll bet" is not a proof - Paul
@Jessica -- Do you have a counter example? The counter example you posted for the "factorial + sum" solution won't work here; multiplication is much "looser" than sums of squares combined with sums - Ivan
I found a counter example: 1 2 3 4 5 6 7 :: 140 (sum squares) :: 784 (sum cubes) BUT: 1 1 4 5 5 6 6 has the same sum and sum of squares (140) but not sum of cubes (748). It seems like the solution would be to compute several "known values" and try to match 3-5 of them. - Ivan
No solution like this will work. If you're computing any sort of cumulative checksum, then information by information theoretic arguments, you'll need to keep track of at least O(n) numbers in order to differentiate a permutation from a non-permutation. - Jessica
27
[-1] [2010-10-30 22:05:05] Fred

If the array has no duplicates, the sum of the elements of the array will equal ((n+1)*n)/2.


However, if the sum is ((n+1)*n)/2 it does not follow that the array has no duplicates. - Don Roby
True, but if there are duplicates, the sum could also be n(n+1)/2. You can't use this alone to distinguish all the cases. - uckelman
(1) For the record, Jessica mentioned long time ago [2,2,2] and [1,2,3] both add up to 6. or it can be [1,2,3,4,5] vs [1,1,3,5,5] or [1,2,3,4,5,6,7,8,9] vs [1,2,2,4,5,6,7,7,9] - 動靜能量
28
[-1] [2010-10-30 19:09:40] chuck

If the numbers are unique they can be sorted in O(n). Assume the numbers are in an array a indexed 1..n. Go to element 1; if it is 1, go to the next element, and continue this way until there is an element out of place, with value m. Go to a[m]. If a[m] == m there is a duplicate; otherwise set t = a[m], a[m] = m, m = t, and continue. Eventually you will end up where you started. The starting element is now in place and you can continue as before. This will only fail to sort the array if there are duplicate elements, and that will be detected in the process.
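
A sketch of this cycle-following idea in Python (it works on a copy here, since the algorithm as described rearranges the array):

```python
def has_duplicates(a):
    # Cycle-following sort: each swap moves one value into its home slot,
    # so there are at most n swaps in total -- O(n) time overall.
    # Assumes every value is between 1 and n, as the question guarantees.
    a = list(a)  # copy, because the algorithm rearranges the array
    for i in range(len(a)):
        while a[i] != i + 1:
            home = a[i] - 1          # where the value a[i] belongs
            if a[home] == a[i]:      # home slot already holds this value
                return True          # duplicate found
            a[i], a[home] = a[home], a[i]
    return False
```

Of course, the copy costs O(n) extra space, so this sketch trades the "don't destroy the array" problem for an O(1)-space violation.
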


(1) And by sorting the array inplace you destroy it. And this solution has already been posted. - CodeInChaos
It can't be sorted in O(n). - Jessica
29
[-1] [2010-10-30 19:38:39] Bwmat

here's something that came to mind

let s = 0
go through the array, and for every element a[i],
if a[i] > i, s++
if a[i] < i, s--
else do nothing

if you finish with s = 0, there are no duplicates.

my intuition is that if every number from 1 to n appears once, this sum will always be zero regardless of permutation, but I don't really know if this is true

unfortunately this requires O(log n) space even if it does work.


{2,3,1} gives 1 but has no duplicates. {3,3,1,1} gives 0 but has duplicates. - Paul
(2,2,2) => s = 0 - CodeInChaos
yeah I had a feeling it wouldn't work. Oh well. - Bwmat
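
For completeness, the proposed counter is trivial to test against the comments' counterexamples (a quick Python sketch):

```python
def balance_is_zero(a):
    # s from the answer: +1 when a[i] > i, -1 when a[i] < i (1-indexed).
    s = 0
    for i, x in enumerate(a, start=1):
        if x > i:
            s += 1
        elif x < i:
            s -= 1
    return s == 0

print(balance_is_zero([2, 3, 1]))     # False -- yet it has no duplicates
print(balance_is_zero([3, 3, 1, 1]))  # True  -- yet it has duplicates
```
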
30
[-1] [2010-10-30 17:33:57] someuser

Use a vector as a set to record each integer seen as you iterate through the array. The set has constant (O(1)) size: one entry for each possible integer, so the maximum number of entries is always MAX_INT. Only one iteration through the array is needed, so it is O(n), provided the vector operations are constant-time.


and for n>MAX_INT it fails... - CodeInChaos
Did you not know that the number of integers is infinite? Your solution is fine for 32 bit numbers, but if we say the numbers are 64 bit, you'll need 2^56 bytes for your memory. Good luck with that. - JeremyP
Your solution is NOT O(1) space, it's O(n) space since as n goes up you need more bits in your vector. - Hightechrider
31
[-1] [2010-10-30 11:57:58] 動靜能量

(Update: per Jessica's and Fabio's comments, we also need to check that each number is less than or equal to n, and that the numbers add up to (1+n) * n / 2. I was assuming at first that they add up to that number, with some numbers greater and some smaller, compensating for each other -- under that condition, can they still multiply together to exactly n!?)

(Update: the idea is to find a mathematical property f such that f(1, 2, 3, ..., n) gives a value that cannot be reproduced by a different set of numbers. If f alone is not enough, then maybe f and g together, or f, g, and h together -- but that's the idea behind approaching the problem this way.)

I suspect we can multiply all the elements together and check the result against n! (n factorial).

This assumes the floating point result is accurate no matter what order the numbers are multiplied in.

Or we can use integer (bignum) arithmetic, but I think strictly speaking it is then O(n log n).

This requires a mathematical proof. The idea is that we expect a perfect occurrence of 1, 2, 3, ..., n; if there is any deviation from it, the numbers can't multiply to exactly n!.

Also, if we look at:

1 2 3 4 5 6 7 8 9

if we first make sure there is a median 5, the number left of it, 4, and the number right of it, 6 -- if any of those numbers doesn't exist, we know there must be a duplicate by the pigeonhole principle. Now we go through the unsorted array, calculate (a[i] - 4) ^ 2 and sum it up, likewise (a[i] - 6) ^ 2, and compare the results against those for a perfect 1 2 3 4 5 6 7 8 9. In a case like 1 2 3 3 5 7 7 8 9, measuring the deviation from 5 with (a[i] - 5) ^ 2 is not useful, but if we shift the point the deviation is measured from, the sums can't come out equal. We could also use (a[i] - 4) ^ 3 instead of squares. It really comes down to finding a math property where a perfect f(1, 2, 3, ..., n) cannot be exactly the same as f(n1, n2, n3, ...) for any other inputs -- or, if f alone is not enough, a combination of f and g that together ensures that when the values are equal, the numbers can only be 1 to n, sequentially.


(1,1,6) breaks your algorithm. - Jessica
hm... with the check that no number can be greater than n - 動靜能量
(1,2,3,4,5,6,7,8) = (1,2,3,3,5,8,7,8) - Fabio F.
sorry, forgot to add that, we already first check that they add up to (1+n)*n / 2 - 動靜能量
No, this doesn't work. - Jessica
Both [7, 5, 10, 4, 9, 9, 4, 4, 1, 2] and [1,2,3,4,5,6,7,8,9,10] sum to 55 and have a product of 10!. - Jessica
haha Jessica you are good at finding cases where it breaks - 動靜能量
The property exists: for each different set of numbers, the sum of 2^a[i] is different. - Fabio F.
I'm not good at finding cases where it breaks .... my computer is. - Jessica
do you just "brute-force" all the numbers to see if it breaks? Then if n becomes 20 and it hasn't broken, your algorithm can take a very long time to run!? (20^20 is 1.05E+26) - 動靜能量
@Jessica -- Yo, can you update your program to test my answer (which is related to this one) - Ivan
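
The combined sum-and-product test discussed above, together with Jessica's counterexample, can be sketched as (Python):

```python
from math import factorial

def sum_and_product_match(a):
    # Necessary conditions for a permutation of 1..n: the elements sum to
    # n(n+1)/2 and multiply to n!. Not sufficient, as the comments show.
    n = len(a)
    total, product = 0, 1
    for x in a:
        total += x
        product *= x
    return total == n * (n + 1) // 2 and product == factorial(n)

print(sum_and_product_match(list(range(1, 11))))               # True
print(sum_and_product_match([7, 5, 10, 4, 9, 9, 4, 4, 1, 2]))  # True: false positive
print(sum_and_product_match([1, 1, 6]))                        # False: sum is 8, not 6
```
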
32
[-1] [2010-10-30 10:24:08] djjeck

You could use a bitmask variable v=0, and set, for each number found i,

v |= 1<<(i-1)

At the end, check if

v == (1<<n)-1

This works for n<64 with a long v, and can be extended easily to greater n

This isn't really O(1) space, but neither is n(n+1)/2!!
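
In Python, where integers have arbitrary precision, the same bitmask idea extends past n = 64 out of the box (still O(n) bits of state, as the comments note); a sketch:

```python
def is_permutation(a):
    # Set bit (x - 1) for each value x seen; a repeated value hits a set bit.
    n = len(a)
    v = 0
    for x in a:
        bit = 1 << (x - 1)
        if v & bit:          # this value was already seen -> duplicate
            return False
        v |= bit
    return v == (1 << n) - 1  # all n low bits set: every value 1..n present
```
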


actually I don't like this answer, but it's better than nothing =) - djjeck
(3) This solution doesn't work because it's O(n) space. - Jessica
That's true, but the assumption that n<64 seems somehow more limiting than that n<2**64 - sepp2k
agreed, not a valid solution =) - djjeck
33
[-2] [2010-10-30 10:37:02] Antoine Pelisse

You can probably try this code: (this is python)

integers = [ 1, 4, 5, 3, 2 ]

sum = 0

for i in integers:
    sum += 2 ** (i - 1)

if sum != 2 ** (len(integers)) - 1:
    print('Duplicate')
else:
    print('Perfetto')

Which only runs through the integers once (O(n)) and uses sum for space (O(1))


The memory used is not O(1) and the time isn't O(n) either as exponentiation is not O(1). - IVlad
(3) This is O(n) space. You're just hiding it by using bignums. - Marcelo Cantos
(3) This is the same as djjeck’s proposal: A bit map. - Gumbo
Yes it is, then it's O(lg n) in space, and it's still O(n) in time as exponentiation IS O(1) as long as you count in binary. Then it's probably not possible to use O(1) in space ... - Antoine Pelisse
-1 for stating exponentiation is O(1). It's only O(1) if the base is 2 for the fundamental data types. If i is 1000, you're not going to be able to calculate 2^1000 in O(1), nor add them all together in O(1). - IVlad
This is probably not the same order of magnitude and probably can't be counted in a loop: by your reasoning, it would be virtually impossible to have an O(n) algorithm, as addition in a loop would make it bigger than O(n) - Antoine Pelisse
Maybe we need to reconsider the first question: It's probably, how can you make sure there is no duplicate, reading all the integers only once and not saving all read values. It's probably not, is there a O(1) algorithm to sum/multiply/whatever operation you want - Antoine Pelisse
(2) @Antoine Pelisse - you can assume that addition of two 32-bit integers is O(1) as per the OP's clarifications. What you cannot assume is that addition of n 32-bit integers without overflow is O(n), because you might have to implement your own data type to store the result, since it might be too big for the fundamental types. You don't have to worry about saving the values yourself obviously, you need to worry about writing a function that is fed the values and computes the answer in O(n). - IVlad
(1) @IVlad: The addition of 'n' integers which are of a size large enough to hold 'n' takes time O(n), since the resulting total can be at most N^2 and will thus fit in an integer twice as big as the original numbers. - supercat
@supercat - yes, but he is adding powers of two, which stop fitting in any 32 bit integer when the exponent reaches 32. - IVlad
@IVlad: Your statement was about adding 'N' 32-bit integers. If the integers are size 'N', then that will clearly become O(N^2). - supercat
34
[-2] [2010-10-31 07:36:50] Corey Ogburn

Alright, this is different, but it works!

I ran this test program (C#):

    static void Main(string[] args) {
        //Run 100 tests, each test with an array the size of the test number
        for (int i = 1; i <= 100; i++) {
            int[] temp = new int[i];//array the size of the test number
            //Fill temp array with a permutation
            for (int j = 0; j < i; j++) {
                temp[j] = j + 1;
            }
            //Now that setup is done, test for a permutation
            bool check = IsPermutation(temp);
            Console.WriteLine("Permutation: " + check);
        }
    }

    static bool IsPermutation(int[] array) {
        int x = 0;
        for (int i = 0; i < array.Length; i++) { //O(n)
            //Because it's an XOR, it's commutative,
            //and doesn't depend on the order of the array
            x ^= array[i];
        }
        bool result = false;

        switch (array.Length % 4) {
            case 0:
                result = (x == array.Length);
                break;
            case 1:
                result = (x == 1);
                break;
            case 2:
                result = (x == array.Length + 1);
                break;
            case 3:
                result = (x == 0);
                break;
        }
        return result;
    }

So I put up my previous answer in a bit of a haste. My code wasn't very readable and didn't convey what I was doing. The actual algorithm is the IsPermutation method, everything else in the main method is setup for the IsPermutation method, therefore irrelevant for the O(n) requirement.

As for Jessica's { 4, 3, 2, 1 }:

First, XOR ALL elements, not just the first three: 4 ^ 3 = 7, 7 ^ 2 = 5, 5 ^ 1 = 4. So in my code, x is 4. Because array.Length % 4 == 0, we take the x == array.Length branch of the switch, and it's true, so it's a permutation.


This doesn't work. - Jessica
Please Jessica... details. I know I only skimmed at this a little bit, and it's not perfect. But it could be perfected. "This doesn't work" does not solve the problem. - Corey Ogburn
(1) What you wrote is not an algorithm that takes any input and does computation on in to determine something. In fact, I don't see an array or equivalent data structure anywhere in your code. (2) The code you wrote is O(n^2), not O(n), but I'm not 100% sure since I don't know with certainty what you meant the code to do with the input array. (3) It fails on the case (4,3,2,1). The XOR of the first 3 elements is 5, which is not equal to zero. - Jessica
"The XOR of the first 3 elements is 5" Yes, yes it is, BUT the XOR of the ENTIRE array, all 4 elements, is 4 and the logic in the switch correctly identifies { 4, 3, 2, 1 } as a permutation. Please re-read my revised code to see where the arrays come into play. This updated code does exactly what the previous set of code does, it just actually uses arrays this time and separates the logic of the algorithm away from the setup of the test arrays. P.S. This only tests working permutations, there might be some invalid entries that still work but I haven't found a case yet. - Corey Ogburn
You changed the code that you posted. This code fails on [1,1,1,1,1]. No solution similar to this will work. - Jessica
It might be possible to combine this with other answers people have to form a sort of hash. If an algebraic "hash" technique was combined with my binary technique, perhaps an answer can be discovered. For example, your [1,1,1,1,1] answer would fail many algebraic formulas depicted in other users' answers. - Corey Ogburn
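
The check boils down to comparing against the period-4 pattern of 1 ^ 2 ^ ... ^ n; a Python sketch reproducing both the { 4, 3, 2, 1 } pass and the [1,1,1,1,1] false positive from the comments:

```python
def xor_matches(a):
    # XOR of 1..n cycles with period 4: n, 1, n+1, 0 for n % 4 == 0, 1, 2, 3.
    n = len(a)
    x = 0
    for v in a:
        x ^= v
    return x == (n, 1, n + 1, 0)[n % 4]

print(xor_matches([4, 3, 2, 1]))     # True: a real permutation
print(xor_matches([1, 1, 1, 1, 1]))  # True: the false positive Jessica found
```
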
35
[-4] [2010-10-30 11:05:00] rados

This should do the job:

bool has_duplicates(const int * a, int n) {
    int expected = (n * (n + 1)) / 2;
    int computed = 0;
    for (int i = 0; i < n; ++i) computed += a[i];
    return expected != computed;
}

Hint: "array with n integers" and "integer is between 1 and n (inclusive)"


There can be multiple duplicates in the array! - nikie
(2) This won't work: e.g. n = 4: no dup 1+2+3+4 = 10, counterexample 1+1+4+4 = 10 also - Peer Stritzinger
no, fails on {2,2,2}: expected = 6, as is computed. - Paul
36