mathematical algorithm中文

簡單的面試問題變得更加困難:給定數字1..100,找到缺少的數字(s) (20)

A very simple way to do it in roughly O(N) time is to remove each element when seen in both lists. This works for unsorted lists too and can be easily further optimized if the lists are both sorted.

import random

K = 2
missingNums = range(0, 101)
incompleteList = range(0, 101)

#Remove K numbers
for i in range(K):
    valueToRemove = random.choice(incompleteList)

dummyVariable = [missingNums.remove(num) for num in p if num in missingNums]

print missingNums

我有一段很有意思的工作面試經歷。 問題開始很簡單:

Q1 :我們有一個包含數字3 ,..., 100 。 每個數字只出現一次,所以有100個數字。 現在從包裡隨機挑選一個號碼。 找到缺少的號碼。


A1 :那麼,數字1 + 2 + 3 + … + N的總和是(N+1)(N/2) (參見Wikipedia:算術系列之和 )。 對於N = 100 ,總和是5050

因此,如果所有數字都出現在包中,則總和恰好為5050 。 由於缺少一個數字,總和將小於這個數字,不同之處在於該數字。 所以我們可以在O(N)時間和O(1)空間找到缺失的數字。


Q2 :這是正確的,但是現在如果兩個數字不見了,你會如何做到這一點?

我以前從未見過/聽過/考慮過這種變化,所以我驚慌失措,無法回答這個問題。 面試官堅持了解我的思維過程,所以我提到或許我們可以通過與預期產品進行比較來獲得更多的信息,或者在收集了第一遍等的一些信息之後進行第二次通過,但我真的只是拍攝在黑暗中而不是實際上有一個清晰的解決方案路徑。

面試官確實試圖鼓勵我說擁有第二個等式確實是解決問題的一種方法。 在這一點上,我有點不高興(因為事先不知道答案),並詢問這是一個普通的(閱讀:“有用的”)編程技巧,或者它只是一個技巧/疑難答案。

面試官的回答讓我感到吃驚:你可以推廣這項技術來找出3個缺失的數字。 事實上,你可以推廣它找到k個缺失的數字。

Qk :如果包裡沒有K個數字,你會如何有效地找到它?

這是幾個月前,我仍然無法弄清楚這種技術是什麼。 顯然,由於我們必須至少掃描一次所有數字,因此存在Ω(N)時間下限,但採訪者堅持認為求解技術的TIMESPACE複雜度(減去O(N)時間輸入掃描)定義為k不是N。


  • 你將如何解決Q2
  • 你會如何解決Q3
  • 你將如何解決QK


  • 通常從1 .. NN個數字,而不僅僅是1..100。
  • 我沒有尋找明顯的基於集合的解決方案,例如使用一個位集合 ,通過指定位的值對每個數字的存在/不存在進行編碼,因此在額外的空間中使用O(N)位。 我們無法承擔任何與N成比例的額外空間。
  • 我也沒有在尋找明顯的排序優先方法。 這個和基於集合的方法在訪談中值得一提(它們很容易實現,取決於N ,可能非常實用)。 我正在尋找聖杯解決方案(可能或不可行,但仍具有所需的漸近特性)。


For Q2 this is a solution that is a bit more inefficient than the others, but still has O(N) runtime and takes O(k) space.

The idea is to run the original algorithm two times. In the first one you get a total number which is missing, which gives you an upper bound of the missing numbers. Let's call this number N . You know that the missing two numbers are going to sum up to N , so the first number can only be in the interval [1, floor((N-1)/2)] while the second is going to be in [floor(N/2)+1,N-1] .

Thus you loop on all numbers once again, discarding all numbers that are not included in the first interval. The ones that are, you keep track of their sum. Finally, you'll know one of the missing two numbers, and by extension the second.

I have a feeling that this method could be generalized and maybe multiple searches run in "parallel" during a single pass over the input, but I haven't yet figured out how.

I don't know whether this is efficient or not but I would like to suggest this solution.

  1. Compute xor of the 100 elements
  2. Compute xor of the 98 elements (after the 2 elements are removed)
  3. Now (result of 1) XOR (result of 2) gives you the xor of the two missing nos i..ea XOR b if a and b are the missing elements
    4.Get the sum of the missing Nos with your usual approach of the sum formula diff and lets say the diff is d.

Now run a loop to get the possible pairs (p,q) both of which lies in [1 , 100] and sum to d.

When a pair is obtained check whether (result of 3) XOR p = q and if yes we are done.

Please correct me if I am wrong and also comment on time complexity if this is correct

I have an idea, but this is assuming that the actual size of the array is 100 and the missing numbers are replaced with something else (like -1).

Basically, do a sort that's kind of a modified version of a selection sort that swaps the list in-place. I believe this is O(n) time (correct me if I'm wrong though) because we make use of the fact we already know the numbers that should exist. We swap the value with the "correct" position, until the index we are at has the correct number (or has -1).

After we're done with that, we just loop the list again and the index will basically be the missing numbers

#Set up the input
input = (1..100).to_a.shuffle
input[rand(0..99)] = -1
input[rand(0..99)] = -1

def f(a)
  for i in 0..99
    while (a[i] != i+1) && a[i] > 0
      v1 = a[i]
      v2 = a[v1 - 1]
      a[i] = v2
      a[v1 - 1] = v1

  result = []
  a.each_with_index do |value, index|
    if value < 0
      result.push(index + 1)

  return result

#Returns the 2 missing numbers
puts f(input)

I think this can be generalized like this:

Denote S, M as the initial values for the sum of arithmetic series and multiplication.

S = 1 + 2 + 3 + 4 + ... n=(n+1)*n/2
M = 1 * 2 * 3 * 4 * .... * n 

I should think about a formula to calculate this, but that is not the point. Anyway, if one number is missing, you already provided the solution. However, if two numbers are missing then, let's denote the new sum and total multiple by S1 and M1, which will be as follows:

S1 = S - (a + b)....................(1)

Where a and b are the missing numbers.

M1 = M - (a * b)....................(2)

Since you know S1, M1, M and S, the above equation is solvable to find a and b, the missing numbers.

Now for the three numbers missing:

S2 = S - ( a + b + c)....................(1)

Where a and b are the missing numbers.

M2 = M - (a * b * c)....................(2)

Now your unknown is 3 while you just have two equations you can solve from.

I'd take a different approach to that question and probe the interviewer for more details about the larger problem he's trying to solve. Depending on the problem and the requirements surrounding it, the obvious set-based solution might be the right thing and the generate-a-list-and-pick-through-it-afterward approach might not.

For example, it might be that the interviewer is going to dispatch n messages and needs to know the k that didn't result in a reply and needs to know it in as little wall clock time as possible after the nk th reply arrives. Let's also say that the message channel's nature is such that even running at full bore, there's enough time to do some processing between messages without having any impact on how long it takes to produce the end result after the last reply arrives. That time can be put to use inserting some identifying facet of each sent message into a set and deleting it as each corresponding reply arrives. Once the last reply has arrived, the only thing to be done is to remove its identifier from the set, which in typical implementations takes O(log k+1) . After that, the set contains the list of k missing elements and there's no additional processing to be done.

This certainly isn't the fastest approach for batch processing pre-generated bags of numbers because the whole thing runs O((log 1 + log 2 + ... + log n) + (log n + log n-1 + ... + log k)) . But it does work for any value of k (even if it's not known ahead of time) and in the example above it was applied in a way that minimizes the most critical interval.

There is a general way to generalize streaming algorithms like this. The idea is to use a bit of randomization to hopefully 'spread' the k elements into independent sub problems, where our original algorithm solves the problem for us. This technique is used in sparse signal reconstruction, among other things.

  • Make an array, a , of size u = k^2 .
  • Pick any universal hash function , h : {1,...,n} -> {1,...,u} . (Like multiply-shift )
  • For each i in 1, ..., n increase a[h(i)] += i
  • For each number x in the input stream, decrement a[h(x)] -= x .

If all of the missing numbers have been hashed to different buckets, the non-zero elements of the array will now contain the missing numbers.

The probability that a particular pair is sent to the same bucket, is less than 1/u by definition of a universal hash function. Since there are about k^2/2 pairs, we have that the error probability is at most k^2/2/u=1/2 . That is, we succeed with probability at least 50%, and if we increase u we increase our chances.

Notice that this algorithm takes k^2 logn bits of space (We need logn bits per array bucket.) This matches the space required by @Dimitris Andreou's answer (In particular the space requirement of polynomial factorization, which happens to also be randomized.) This algorithm also has constant time per update, rather than time k in the case of power-sums.

In fact, we can be even more efficient than the power sum method by using the trick described in the comments.

This might sound stupid, but, in the first problem presented to you, you would have to see all the remaining numbers in the bag to actually add them up to find the missing number using that equation.

So, since you get to see all the numbers, just look for the number that's missing. The same goes for when two numbers are missing. Pretty simple I think. No point in using an equation when you get to see the numbers remaining in the bag.

Very nice problem. I'd go for using a set difference for Qk. A lot of programming languages even have support for it, like in Ruby:

missing = (1..100).to_a - bag

It's probably not the most efficient solution but it's one I would use in real life if I was faced with such a task in this case (known boundaries, low boundaries). If the set of number would be very large then I would consider a more efficient algorithm, of course, but until then the simple solution would be enough for me.

We can do the Q1 and Q2 in O(log n) most of the time.

Suppose our memory chip consists of array of n number of test tubes . And a number x in the the test tube is represented by x milliliter of chemical-liquid.

Suppose our processor is a laser light . When we light up the laser it traverses all the tubes perpendicularly to it's length. Every-time it passes through the chemical liquid, the luminosity is reduced by 1 . And passing the light at certain milliliter mark is an operation of O(1) .

Now if we light our laser at the middle of the test-tube and get the output of luminosity

  • equals to a pre-calculated value(calculated when no numbers were missing), then the missing numbers are greater than n/2 .
  • If our output is smaller, then there is at least one missing number that is smaller than n/2 . We can also check if the luminosity is reduced by 1 or 2 . if it is reduced by 1 then one missing number is smaller than n/2 and other is bigger than n/2 . If it is reduced by 2 then both numbers are smaller than n/2 .

We can repeat the above process again and again narrowing down our problem domain. In each step, we make the domain smaller by half. And finally we can get to our result.

Parallel algorithms that are worth mentioning(because they are interesting),

  • sorting by some parallel algorithm, for example, parallel merge can be done in O(log^3 n) time. And then the missing number can be found by binary search in O(log n) time.
  • Theoretically, if we have n processors then each process can check one of the inputs and set some flag that identifies the number(conveniently in an array). And in the next step each process can check each flag and finally output the number that is not flagged. The whole process will take O(1) time. It has additional O(n) space/memory requirement.

Note, that the two parallel algorithms provided above may need additional space as mentioned in the comment .

You could try using a Bloom Filter . Insert each number in the bag into the bloom, then iterate over the complete 1-k set until reporting each one not found. This may not find the answer in all scenarios, but might be a good enough solution.

You'd probably need clarification on what O(k) means.

Here's a trivial solution for arbitrary k: for each v in your set of numbers, accumulate the sum of 2^v. At the end, loop i from 1 to N. If sum modulo 2^i is zero, then i is missing.

Easy, right? O(N) time, O(1) storage, and it supports arbitrary k.

Except that you're computing enormous numbers that on a real computer would each require O(N) space. In fact, this solution is identical to a bit vector.

So you could be clever and compute the sum and the sum of squares and the sum of cubes... up to the sum of v^k, and do the fancy math to extract the result. But those are big numbers too, which begs the question: what abstract model of operation are we talking about? How much fits in O(1) space, and how long does it take to sum up numbers of whatever size you need?

以下是Dimitris Andreou的鏈接摘要。

記住第i個冪的和,其中i = 1,2,...,k。 這減少了解決方程組問題

a 1 + a 2 + ... + a k = b 1

a 1 2 + a 2 2 + ... + a k 2 = b 2


a 1 k + a 2 k + ... + a k k = b k

使用牛頓的身份 ,知道允許計算

c 1 = a 1 + a 2 + ... a k

c 2 = a 1 a 2 + a 1 a 3 + ... + a k-1 a k


c k = a 1 a 2 ... a k

如果展開多項式(xa 1 )...(xa k ),則係數恰好為c 1 ,...,c k - 請參閱Viète公式 。 由於每個多項式因子都是唯一的(多項式環是一個歐幾里德域 ),這意味著一個i是唯一確定的,直到排列。

這結束了記憶能力足以恢復數字的證明。 對於常數k,這是一個好方法。

然而,當k變化時,計算c 1 ,...,c k的直接方法是非常昂貴的,因為例如c k是所有缺失數量的大小n!/(nk)!的乘積。 為了克服這個問題,在Z q字段中進行計算,其中q是一個素數,使得n <= q <2n - 它由Bertrand的公式存在。 證明不需要改變,因為公式仍然存在,並且多項式的因式分解仍然是唯一的。 您還需要有限域上的因式分解算法,例如BerlekampCantor-Zassenhaus


  • 計算給定數字的第i個冪
  • 減去獲得未知數的第i次冪。 給打錢。
  • 使用牛頓的身份來計算b i的係數; 稱他們為 。 基本上,c 1 = b 1 ; c 2 =(c 1 b 1 - b 2 )/ 2; 請參閱Wikipedia以了解確切的公式
  • 分解多項式x k -c 1 x k-1 + ... + c k
  • 多項式的根是所需的數字a 1 ,...,a k

對於變化的k,使用例如Miller-Rabin找到一個素數n <= q <2n,然後執行步驟,所有數字都是模q的減數。

正如海因里希Apfelmus評論說,而不是一個素數q,你可以使用q = 2⌈logn⌉在有限 域內執行算術

你可以通過閱讀Muthukrishnan的幾頁- 數據流算法:謎題1:尋找缺失的數字來找到它。 它顯示了你正在尋找的概括 。 可能這就是你的面試官閱讀的內容,以及他為什麼提出這些問題。

現在,如果只有人會開始刪除被Muthukrishnan的處理所包含或取代的答案,並且使這個文本更容易找到。 :)

另請參閱sdcvvc's直接相關答案 ,其中還包括偽代碼(歡呼!不需要閱讀那些棘手的數學公式:))(謝謝,很棒的工作!)。

基於數字總和的解決方案存在的問題是,他們沒有考慮存儲和使用大指數數字的成本......在實踐中,因為它適用於非常大的數字n,所以會使用大數字圖書館。 我們可以分析這些算法的空間利用率。

我們可以分析sdcvvc和Dimitris Andreou算法的時間和空間複雜度。


l_j = ceil (log_2 (sum_{i=1}^n i^j))
l_j > log_2 n^j  (assuming n >= 0, k >= 0)
l_j > j log_2 n \in \Omega(j log n)

l_j < log_2 ((sum_{i=1}^n i)^j) + 1
l_j < j log_2 (n) + j log_2 (n + 1) - j log_2 (2) + 1
l_j < j log_2 n + j + c \in O(j log n)`

所以l_j \in \Theta(j log n)

使用的總存儲量: \sum_{j=1}^k l_j \in \Theta(k^2 log n)

使用的空間:假設計算a^j需要ceil(log_2 j)時間,總時間:

t = k ceil(\sum_i=1^n log_2 (i)) = k ceil(log_2 (\prod_i=1^n (i)))
t > k log_2 (n^n + O(n^(n-1)))
t > k log_2 (n^n) = kn log_2 (n)  \in \Omega(kn log n)
t < k log_2 (\prod_i=1^n i^i) + 1
t < kn log_2 (n) + 1 \in O(kn log n)

總使用時間: \Theta(kn log n)

如果這個時間和空間令人滿意,你可以使用一個簡單的遞歸算法。 讓b!i是包中的第i個條目,n是刪除前的數字個數,k是刪除的個數。 在Haskell語法中...

  -- O(1)
  isInRange low high v = (v >= low) && (v <= high)
  -- O(n - k)
  countInRange low high = sum $ map (fromEnum . isInRange low high . (!)b) [1..(n-k)]
  findMissing l low high krange
    -- O(1) if there is nothing to find.
    | krange=0 = l
    -- O(1) if there is only one possibility.
    | low=high = low:l
    -- Otherwise total of O(knlog(n)) time
    | otherwise =
         mid = (low + high) `div` 2
         klow = countInRange low mid
         khigh = krange - klow
         findMissing (findMissing low mid klow) (mid + 1) high khigh
  findMising 1 (n - k) k

使用的存儲:列表O(k) ,堆棧O(log(n))O(k + log(n))該算法更直觀,具有相同的時間複雜度,並且使用更少的空間。



d = sum(l1) - sum(l2)
m = mul(l1) / mul(l2)

We can optimise this since the sum of an arithmetic series is n times the average of the first and last terms:

n = len(l1)
d = (n/2)*(n+1) - sum(l2)

Now we know that (if a and b are the removed numbers):

a + b = d
a * b = m

So we can rearrange to:

a = s - b
b * (s - b) = m

And multiply out:

-b^2 + s*b = m

And rearrange so the right side is zero:

-b^2 + s*b - m = 0

Then we can solve with the quadratic formula:

b = (-s + sqrt(s^2 - (4*-1*-m)))/-2
a = s - b

Sample Python 3 code:

from functools import reduce
import operator
import math
x = list(range(1,21))
sx = (len(x)/2)*(len(x)+1)
mul = lambda l: reduce(operator.mul,l)
s = sx - sum(x)
m = mul(range(1,21)) / mul(x)
b = (-s + math.sqrt(s**2 - (-4*(-m))))/-2
a = s - b
print(a,b) #15,5

I do not know the complexity of the sqrt, reduce and sum functions so I cannot work out the complexity of this solution (if anyone does know please comment below.)

我問了一個4歲的孩子來解決這個問題。 他整理了數字,然後一起計算。 這有一個O(廚房地板)的空間要求,它的工作原理就像很多球丟失一樣。


等一下。 正如問題所述,包裡有100個數字。 不管k有多大,這個問題都可以在恆定的時間內解決,因為你可以使用一個集合,並且在一個循環中最多100-k次迭代中從集合中刪除數字。 100是恆定的。 剩下的一組數字就是你的答案。

如果我們將解推廣到從1到N的數字,除了N不是一個常數,沒有什麼變化,所以我們處於O(N-k)= O(N)時間。 例如,如果我們使用一個位集,我們在O(N)時間中將位設置為1,遍歷數字,隨著我們去(O(Nk)= O(N))將位設置為0,然後我們有答案。

在我看來,面試官問你如何在O(k)時間而不是O(N)時間打印出最終集合的內容。 顯然,只需稍微設置一下,您就必須遍歷所有N位,以確定您是否應打印該數字。 但是,如果您改變設置的實施方式,則可以以k次迭代打印出數字。 這是通過將數字放入要存儲在哈希集和雙向鍊錶中的對象來完成的。 從哈希集中刪除對象時,也可以從列表中刪除它。 答案將留在現在長度為k的列表中。

要解決2(和3)缺失的數字問題,您可以修改quickselect ,它平均在O(n)運行,並且如果在原地進行分區,則使用常量內存。

  1. 將該組相對於隨機樞軸p分區到包含小於樞軸的數字的分區l和包含大於樞軸的數字的r

  2. 通過將樞軸值與每個分區的大小進行比較來確定2個缺失數字所在的分區( p - 1 - count(l) = count of missing numbers in ln - count(r) - p = count of missing numbers in r p - 1 - count(l) = count of missing numbers in l n - count(r) - p = count of missing numbers in r

  3. a)如果每個分區缺少一個數字,則使用求和方法的差異來查找每個缺失的數字。

    (1 + 2 + ... + (p-1)) - sum(l) = missing #1((p+1) + (p+2) ... + n) - sum(r) = missing #2



只有2個缺失數字,該算法總是捨棄至少一個分區,因此它保留了O(n)快速選擇的平均時間複雜度。 同樣,如果有3個缺失的數字,這個算法每次都會丟棄至少一個分區(因為缺少2個數字,最多只有1個分區會包含多個缺失的數字)。 但是,我不確定在添加更多缺失數字時性能會下降多少。



  $list = range(1,100);


  function findMissing($list, $min, $max) {
    if(empty($list)) {
      print_r(range($min, $max));

    $l = $r = [];
    $pivot = array_pop($list);

    foreach($list as $number) {
      if($number < $pivot) {
        $l[] = $number;
      else {
        $r[] = $number;

    if(count($l) == $pivot - $min - 1) {
      // only 1 missing number use difference of sums
      print array_sum(range($min, $pivot-1)) - array_sum($l) . "\n";
    else if(count($l) < $pivot - $min) {
      // more than 1 missing number, recurse
      findMissing($l, $min, $pivot-1);

    if(count($r) == $max - $pivot - 1) {
      // only 1 missing number use difference of sums
      print array_sum(range($pivot + 1, $max)) - array_sum($r) . "\n";
    } else if(count($r) < $max - $pivot) {
      // mroe than 1 missing number recurse
      findMissing($r, $pivot+1, $max);