



Sorting 1 million 8-digit numbers in 1 MB of RAM

I have a computer with 1 MB of RAM and no other local storage. I must use it to accept 1 million 8-digit decimal numbers over a TCP connection, sort them, and then send the sorted list out over another TCP connection.

The list of numbers may contain duplicates, which I must not discard. The code will be placed in ROM, so I need not subtract the size of my code from the 1 MB. I already have code to drive the Ethernet port and handle TCP/IP connections, and it requires 2 KB for its state data, including a 1 KB buffer via which the code will read and write data. Is there a solution to this problem?

Source of the question and answers:
slashdot.org

cleaton.net


I have a computer with 1M of RAM and no other local storage

Another way to cheat: you could use non-local (networked) storage instead (your question does not preclude this) and call a networked service that could use straightforward disk-based mergesort (or just enough RAM to sort in-memory, since you only need to accept 1M numbers), without needing the (admittedly extremely ingenious) solutions already given.

This might be cheating, but it's not clear whether you are looking for a solution to a real-world problem, or a puzzle that invites bending of the rules... if the latter, then a simple cheat may get better results than a complex but "genuine" solution (which as others have pointed out, can only work for compressible inputs).


(My original answer was wrong, sorry for the bad math, see below the break.)

How about this?

The first 27 bits store the lowest number you have seen, then the difference to the next number seen, encoded as follows: 5 bits to store the number of bits used in storing the difference, then the difference. Use 00000 to indicate that you saw that number again.

This works because as more numbers are inserted, the average difference between numbers goes down, so you use fewer bits to store each difference as you add more numbers. I believe this is called a delta list.

The worst case I can think of is all numbers evenly spaced (by 100), e.g. assuming 0 is the first number:

000000000000000000000000000 00111 1100100
                            ^^^^^^^^^^^^^
                            a million times

27 + 1,000,000 * (5+7) bits = ~ 427k
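
For concreteness, here is a minimal sketch of that delta-list encoding (my own illustration, not the answerer's code). It assumes the values arrive already sorted and packs bits MSB-first into a java.util.BitSet:

import java.util.BitSet;

// Sketch of the delta-list scheme above: 27 bits for the first value, then for
// each gap a 5-bit bit-length followed by the gap itself; length 00000 = duplicate.
class DeltaListSketch {
    static final BitSet bits = new BitSet();
    static int pos = 0;

    static void write(int value, int n) {        // append the n low bits of value
        for (int i = n - 1; i >= 0; i--, pos++)
            if ((value >>> i & 1) != 0) bits.set(pos);
    }

    static void encode(int[] sorted) {
        write(sorted[0], 27);                    // lowest number seen, 27 bits
        for (int i = 1; i < sorted.length; i++) {
            int diff = sorted[i] - sorted[i - 1];
            if (diff == 0) { write(0, 5); continue; }          // duplicate marker
            int len = 32 - Integer.numberOfLeadingZeros(diff); // bits in the gap
            write(len, 5);                       // 5-bit length prefix
            write(diff, len);                    // the gap itself
        }
    }

    public static void main(String[] args) {
        encode(new int[]{0, 100, 100, 350});
        System.out.println(pos + " bits used");  // 27 + (5+7) + 5 + (5+8) = 57
    }
}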

Reddit to the rescue!

If all you had to do was sort them, this problem would be easy. It takes 122k (1 million bits) to store which numbers you have seen (the 0th bit on if 0 was seen, the 2300th bit on if 2300 was seen, etc.).

You read the numbers, store them in the bit field, and then shift the bits out while keeping a count.

BUT, you have to remember how many you have seen. I was inspired by the sublist answer above to come up with this scheme:

Instead of using one bit, use either 2 or 27 bits:

  • 00 means you did not see the number.
  • 01 means you saw it once.
  • 1 means you saw it, and the next 26 bits are the count of how many times.

I think this works: if there are no duplicates, you have a 244k list. In the worst case you see each number twice (if you see one number three times, it shortens the rest of the list for you), that means you have seen 50,000 more than once, and you have seen 950,000 items 0 or 1 times.

50,000 * 27 + 950,000 * 2 = 396.7k.

You can make further improvements if you use the following encoding:

  • 0 means you did not see the number.
  • 10 means you saw it once.
  • 11 is how you keep count.

Which will, on average, result in 280.7k of storage.

EDIT: my Sunday morning math was wrong.

The worst case is we see 500,000 numbers twice, so the math becomes:

500,000 *27 + 500,000 *2 = 1.77M

The alternate encoding results in an average storage of

500,000 * 27 + 500,000 = 1.70M

: (


Google's (bad) approach, from an HN thread: store RLE-style counts.

Your initial data structure is '99999999:0' (all zeros, haven't seen any numbers). Then let's say you see the number 3,866,344, so your data structure becomes '3866343:0,1:1,96133654:0'. As you can see, the numbers will always alternate between the count of zero bits and the count of '1' bits, so you can just assume the odd numbers represent 0-bits and the even numbers 1-bits. This becomes (3866343, 1, 96133654).

Their problem doesn't seem to cover duplicates, but let's say they use "0:1" for duplicates.

Big problem #1: insertions for 1M integers would take ages.

Big problem #2: like all plain delta-encoding solutions, some distributions can't be covered this way. For example, 1M integers with a distance of 99 between each pair (i.e. +99 each). Now think of the same but with random distances in the range 0..99. (Note: 99,999,999/1,000,000 = 99.99.)

Google's approach is both inefficient (slow) and incorrect. But in their defense, their problem might have been slightly different.
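
To make problem #1 concrete, here is a hedged sketch (mine, not from the HN thread) of a single insertion into such an alternating run-length list. Every insertion scans from the front, so a million insertions are quadratic overall:

import java.util.ArrayList;
import java.util.List;

// Sketch: even indices hold zero-run lengths, odd indices hold one-run lengths,
// over the value range 0..99,999,999 (initially one big run of zeros).
class RleInsertSketch {
    static List<Integer> runs = new ArrayList<>(List.of(100_000_000, 0));

    static void insert(int v) {
        int pos = 0;
        for (int i = 0; i < runs.size(); i += 2) {
            int zeros = runs.get(i), ones = runs.get(i + 1);
            if (v < pos + zeros) {                    // v falls inside this zero-run
                int before = v - pos;
                runs.set(i, before);                  // zeros before v
                runs.add(i + 1, zeros - before - 1);  // zeros after v
                runs.add(i + 1, 1);                   // the new run of a single 1
                return;                               // O(n) scan to get here
            }
            pos += zeros + ones;                      // skip past this run pair
        }
        // A duplicate (v already a 1-bit) falls through; it would need the
        // "0:1" extension mentioned above. Zero-length runs are left unmerged.
    }
}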


See the first correct answer or the later answer with arithmetic encoding. Below you may find some fun, but not a 100% bullet-proof, solution.

This is quite an interesting task, and here is another solution for it. I hope somebody will find the result useful (or at least interesting).

Stage 1: initial data structure, rough compression approach, basic results

Let's do some simple math: we have 1M (1,048,576 bytes) of RAM initially available to store 10^6 8-digit decimal numbers in the range [0; 99,999,999]. So 27 bits are needed to store one number (assuming unsigned numbers are used), and a raw stream would need ~3.5M of RAM to store. Somebody already said this doesn't seem feasible, but I would say the task can be solved if the input is "good enough". Basically, the idea is to compress the input data with a compression factor of 0.29 or better, and to do the sorting in a proper way.

Let's solve the compression problem first. There are some relevant tests already available:

http://www.theeggeadventure.com/wikimedia/index.php/Java_Data_Compression

"I ran a test to compress one million consecutive integers using various forms of compression. The results are as follows:"

None     4000027
Deflate  2006803
Filtered 1391833
BZip2    427067
Lzma     255040

It looks like LZMA (the Lempel-Ziv-Markov chain algorithm) is a good choice to continue with. I've prepared a simple PoC, but there are still some details to highlight:

  1. Memory is limited, so the idea is to pre-sort the numbers and use compressed buckets (of dynamic size) as temporary storage
  2. It is easier to achieve a better compression factor with pre-sorted data, so there is a static buffer for each bucket (the numbers from the buffer are sorted before LZMA)
  3. Each bucket holds a specific range, so the final sort can be done for each bucket separately
  4. The bucket size can be properly set, so there will be enough memory to decompress the stored data and do the final sort for each bucket separately

Please note that the attached code is a POC; it can't be used as a final solution, it just demonstrates the idea of using several smaller buffers to store pre-sorted numbers in some optimal way (possibly compressed). LZMA is not proposed as a final solution; it is used here as the fastest way to introduce compression to this PoC.

See the PoC code below (please note it's just a demo; compiling it will require LZMA-Java):

import java.io.*;
import java.util.*;

// LZMA-Java classes (package name assumed; adjust to the distribution you use)
import lzma.sdk.lzma.Decoder;
import lzma.sdk.lzma.Encoder;

public class MemorySortDemo {

static final int NUM_COUNT = 1000000;
static final int NUM_MAX   = 100000000;

static final int BUCKETS      = 5;
static final int DICT_SIZE    = 16 * 1024; // LZMA dictionary size
static final int BUCKET_SIZE  = 1024;
static final int BUFFER_SIZE  = 10 * 1024;
static final int BUCKET_RANGE = NUM_MAX / BUCKETS;

static class Producer {
    private Random random = new Random();
    public int produce() { return random.nextInt(NUM_MAX); }
}

static class Bucket {
    public int size, pointer;
    public int[] buffer = new int[BUFFER_SIZE];

    public ByteArrayOutputStream tempOut = new ByteArrayOutputStream();
    public DataOutputStream tempDataOut = new DataOutputStream(tempOut);
    public ByteArrayOutputStream compressedOut = new ByteArrayOutputStream();

    public void submitBuffer() throws IOException {
        Arrays.sort(buffer, 0, pointer);

        for (int j = 0; j < pointer; j++) {
            tempDataOut.writeInt(buffer[j]);
            size++;
        }            
        pointer = 0;
    }

    public void write(int value) throws IOException {
        if (isBufferFull()) {
            submitBuffer();
        }
        buffer[pointer++] = value;
    }

    public boolean isBufferFull() {
        return pointer == BUFFER_SIZE;
    }

    public byte[] compressData() throws IOException {
        tempDataOut.close();
        return compress(tempOut.toByteArray());
    }        

    private byte[] compress(byte[] input) throws IOException {
        final BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(input));
        final DataOutputStream out = new DataOutputStream(new BufferedOutputStream(compressedOut));

        final Encoder encoder = new Encoder();
        encoder.setEndMarkerMode(true);
        encoder.setNumFastBytes(0x20);
        encoder.setDictionarySize(DICT_SIZE);
        encoder.setMatchFinder(Encoder.EMatchFinderTypeBT4);

        ByteArrayOutputStream encoderProperties = new ByteArrayOutputStream();
        encoder.writeCoderProperties(encoderProperties);
        encoderProperties.flush();
        encoderProperties.close();

        encoder.code(in, out, -1, -1, null);
        out.flush();
        out.close();
        in.close();

        // The compressed stream itself goes to compressedOut; only the coder
        // properties (needed later by the decoder) are returned here.
        return encoderProperties.toByteArray();
    }

    public int[] decompress(byte[] properties) throws IOException {
        InputStream in = new ByteArrayInputStream(compressedOut.toByteArray());
        ByteArrayOutputStream data = new ByteArrayOutputStream(10 * 1024);
        BufferedOutputStream out = new BufferedOutputStream(data);

        Decoder decoder = new Decoder();
        decoder.setDecoderProperties(properties);
        decoder.code(in, out, 4 * size);

        out.flush();
        out.close();
        in.close();

        DataInputStream input = new DataInputStream(new ByteArrayInputStream(data.toByteArray()));
        int[] array = new int[size];
        for (int k = 0; k < size; k++) {
            array[k] = input.readInt();
        }

        return array;
    }
}

static class Sorter {
    private Bucket[] bucket = new Bucket[BUCKETS];

    public void doSort(Producer p, Consumer c) throws IOException {

        for (int i = 0; i < bucket.length; i++) {  // allocate buckets
            bucket[i] = new Bucket();
        }

        for(int i=0; i< NUM_COUNT; i++) {         // produce some data
            int value = p.produce();
            int bucketId = value/BUCKET_RANGE;
            bucket[bucketId].write(value);
            c.register(value);
        }

        for (int i = 0; i < bucket.length; i++) { // submit non-empty buffers
            bucket[i].submitBuffer();
        }

        byte[] compressProperties = null;
        for (int i = 0; i < bucket.length; i++) { // compress the data
            compressProperties = bucket[i].compressData();
        }

        printStatistics();

        for (int i = 0; i < bucket.length; i++) { // decode & sort buckets one by one
            int[] array = bucket[i].decompress(compressProperties);
            Arrays.sort(array);

            for(int v : array) {
                c.consume(v);
            }
        }
        c.finalCheck();
    }

    public void printStatistics() {
        int size = 0;
        int sizeCompressed = 0;

        for (int i = 0; i < BUCKETS; i++) {
            int bucketSize = 4*bucket[i].size;
            size += bucketSize;
            sizeCompressed += bucket[i].compressedOut.size();

            System.out.println("  bucket[" + i
                    + "] contains: " + bucket[i].size
                    + " numbers, compressed size: " + bucket[i].compressedOut.size()
                    + String.format(" compression factor: %.2f", ((double)bucket[i].compressedOut.size())/bucketSize));
        }

        System.out.println(String.format("Data size: %.2fM",(double)size/(1024*1024))
                + String.format(" compressed %.2fM",(double)sizeCompressed/(1024*1024))
                + String.format(" compression factor %.2f",(double)sizeCompressed/size));
    }
}

static class Consumer {
    private Set<Integer> values = new HashSet<>();

    int v = -1;
    public void consume(int value) {
        if(v < 0) v = value;

        if(v > value) {
            throw new IllegalArgumentException("Current value is greater than previous: " + v + " > " + value);
        }else{
            v = value;
            values.remove(value);
        }
    }

    public void register(int value) {
        values.add(value);
    }

    public void finalCheck() {
        System.out.println(values.size() > 0 ? "NOT OK: " + values.size() : "OK!");
    }
}

public static void main(String[] args) throws IOException {
    Producer p = new Producer();
    Consumer c = new Consumer();
    Sorter sorter = new Sorter();

    sorter.doSort(p, c);
}
}

Random numbers produce the following:

bucket[0] contains: 200357 numbers, compressed size: 353679 compression factor: 0.44
bucket[1] contains: 199465 numbers, compressed size: 352127 compression factor: 0.44
bucket[2] contains: 199682 numbers, compressed size: 352464 compression factor: 0.44
bucket[3] contains: 199949 numbers, compressed size: 352947 compression factor: 0.44
bucket[4] contains: 200547 numbers, compressed size: 353914 compression factor: 0.44
Data size: 3.85M compressed 1.70M compression factor 0.44

For a simple ascending sequence (a single bucket is used) it produces:

bucket[0] contains: 1000000 numbers, compressed size: 256700 compression factor: 0.06
Data size: 3.85M compressed 0.25M compression factor 0.06

EDIT

Conclusions:

  1. Don't try to fool Nature
  2. Use simpler compression with a lower memory footprint
  3. Some additional clues are really needed; a common bullet-proof solution doesn't seem feasible

Stage 2: enhanced compression, final conclusion

As already mentioned in the previous section, any suitable compression technique can be used. So let's get rid of LZMA in favor of a simpler and better (if possible) approach. There are a lot of good solutions, including arithmetic coding, radix trees, etc.

Anyway, a simple but useful encoding scheme will be more illustrative than yet another external library providing some nifty algorithm. The actual solution is pretty straightforward: since there are buckets with partially sorted data, deltas can be used instead of the numbers themselves.

A test with random input shows slightly better results:

bucket[0] contains: 10103 numbers, compressed size: 13683 compression factor: 0.34
bucket[1] contains: 9885 numbers, compressed size: 13479 compression factor: 0.34
...
bucket[98] contains: 10026 numbers, compressed size: 13612 compression factor: 0.34
bucket[99] contains: 10058 numbers, compressed size: 13701 compression factor: 0.34
Data size: 3.85M compressed 1.31M compression factor 0.34

Sample code

// Note: getBinarySize(n) is assumed to return the number of significant bits in n;
// BinaryIn / BinaryOut are the bit-stream helpers linked below.
public static void encode(int[] buffer, int length, BinaryOut output) {
    short size = (short)(length & 0x7FFF);

    output.write(size);
    output.write(buffer[0]);

    for(int i=1; i< size; i++) {
        int next = buffer[i] - buffer[i-1];
        int bits = getBinarySize(next);

        int len = bits;

        if(bits > 24) {
          output.write(3, 2);
          len = bits - 24;
        }else if(bits > 16) {
          output.write(2, 2);
          len = bits-16;
        }else if(bits > 8) {
          output.write(1, 2);
          len = bits - 8;
        }else{
          output.write(0, 2);
        }

        if (len > 0) {
            if ((len % 2) > 0) {
                len = len / 2;
                output.write(len, 2);
                output.write(false);
            } else {
                len = len / 2 - 1;
                output.write(len, 2);
            }

            output.write(next, bits);
        }
    }
}

public static short decode(BinaryIn input, int[] buffer, int offset) {
    short length = input.readShort();
    int value = input.readInt();
    buffer[offset] = value;

    for (int i = 1; i < length; i++) {
        int flag = input.readInt(2);

        int bits;
        int next = 0;
        switch (flag) {
            case 0:
                bits = 2 * input.readInt(2) + 2;
                next = input.readInt(bits);
                break;
            case 1:
                bits = 8 + 2 * input.readInt(2) +2;
                next = input.readInt(bits);
                break;
            case 2:
                bits = 16 + 2 * input.readInt(2) +2;
                next = input.readInt(bits);
                break;
            case 3:
                bits = 24 + 2 * input.readInt(2) +2;
                next = input.readInt(bits);
                break;
        }

        buffer[offset + i] = buffer[offset + i - 1] + next;
    }

    return length;
}

Please note that this approach:

  1. does not consume a lot of memory
  2. works with streams
  3. provides not-so-bad results

The complete code can be found here; the BinaryInput and BinaryOutput implementations can be found here.

Final conclusion

No final conclusion :) Sometimes it is really a good idea to move one level up and review the task from a meta-level point of view.

It was fun to spend some time with this task. BTW, there are a lot of interesting answers below. Thank you for your attention and happy coding.


I think one way to think about this is from a combinatorics viewpoint: how many possible combinations of sorted number orderings are there? If we give the combination 0,0,0,....,0 the code 0, and 0,0,0,...,1 the code 1, and 99999999, 99999999, ... 99999999 the code N, what is N? In other words, how big is the result space?

Well, one way to think about this is noticing that this is a bijection of the problem of finding the number of monotonic paths in an N x M grid, where N = 1,000,000 and M = 100,000,000. In other words, if you have a grid that is 1,000,000 wide and 100,000,000 tall, how many shortest paths from the bottom left to the top right are there? Shortest paths of course require you only ever either move right or up (if you were to move down or left you would be undoing previously accomplished progress). To see how this is a bijection of our number sorting problem, observe the following:

You can imagine any horizontal leg in our path as a number in our ordering, where the Y location of the leg represents the value.

So if the path simply moves to the right all the way to the end, then jumps all the way to the top, that is equivalent to the ordering 0,0,0,...,0. If it instead begins by jumping all the way to the top and then moves to the right 1,000,000 times, that is equivalent to 99999999,99999999,...,99999999. A path that moves right once, then up once, then right once, then up once, and so on to the very end (and then necessarily jumps all the way to the top) is equivalent to 0,1,2,3,...,999999.

Luckily for us this problem has already been solved, such a grid has (N + M) Choose (M) paths:

(1,000,000 + 100,000,000) Choose (100,000,000) ~= 2.27 * 10^2436455

N thus equals 2.27 * 10^2436455, and so the code 0 represents 0,0,0,...,0 and the code 2.27 * 10^2436455 and some change represents 99999999,99999999,..., 99999999.

In order to store all the numbers from 0 to 2.27 * 10^2436455 you need lg2 (2.27 * 10^2436455) = 8.0937 * 10^6 bits.

1 megabyte = 8388608 bits > 8093700 bits

So it appears that we at least actually have enough room to store the result! Now, of course, the interesting bit is doing the sorting as the numbers stream in. Not sure what the best approach to this is, given that we have 294,908 bits remaining. I imagine an interesting technique would be to at each point assume that what you have so far is the entire ordering, find the code for that ordering, and then, as you receive each new number, go back and update the previous code. Hand wave hand wave.
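
The bound is easy to check numerically without big-integer arithmetic by summing logarithms. A quick sketch (my own, not the answerer's), using C(N+M, N) = prod(i=1..N) (M+i)/i:

// Numeric check of lg2((1,000,000 + 100,000,000) Choose (1,000,000)).
public class PathCountBound {
    public static void main(String[] args) {
        final int N = 1_000_000;              // list length
        final long M = 100_000_000L;          // value range size
        double bits = 0;
        for (int i = 1; i <= N; i++)
            bits += Math.log((M + i) / (double) i);
        bits /= Math.log(2);
        System.out.printf("%.1f bits%n", bits);   // about 8093729.5, under 8388608
    }
}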


I think the solution is to combine techniques from video encoding, namely the delta step that accompanies the discrete cosine transformation. In digital video, rather than recording the brightness or colour of video as regular values such as 110 112 115 116, each is subtracted from the last (similar to run-length encoding): 110 112 115 116 becomes 110 2 3 1. The values 2 3 1 require fewer bits than the originals.

So let's say we create a list of the input values as they arrive on the socket. We store in each element not the value itself but the offset from the one before it. We sort as we go, so the offsets are only going to be positive. An offset could be 8 decimal digits wide, which fits in 3 bytes, but we can't afford 3 bytes per element, so we need to pack them. We can use the top bit of each byte as a "continue" bit, indicating that the next byte is part of the number, with the lower 7 bits of each byte combined. Zero is valid, and marks a duplicate.

As the list fills up, the numbers should get closer together, meaning on average only 1 byte is needed to encode the distance to the next value. 7 bits of value and 1 continuation bit is convenient, but there may be a sweet spot that requires fewer than 8 bits for a "continue" value.

Anyway, I did some experimenting. Using a random number generator, I can fit a million sorted 8-digit decimal numbers into about 1,279,000 bytes. The average space between the numbers is consistently 99...

import java.io.*;
import java.util.*;

public class Test {
    public static void main(String[] args) throws IOException {
        // 1 million values
        int[] values = new int[1000000];

        // create random values up to 8 digits long
        Random random = new Random();
        for (int x=0;x<values.length;x++) {
            values[x] = random.nextInt(100000000);
        }
        Arrays.sort(values);

        ByteArrayOutputStream baos = new ByteArrayOutputStream();

        int av = 0;    
        writeCompact(baos, values[0]);     // first value
        for (int x=1;x<values.length;x++) {
            int v = values[x] - values[x-1];  // difference
            av += v;
            System.out.println(values[x] + " diff " + v);
            writeCompact(baos, v);
        }

        System.out.println("Average offset " + (av/values.length));
        System.out.println("Fits in " + baos.toByteArray().length);
    }

    public static void writeCompact(OutputStream os, long value) throws IOException {
        do {
            int b = (int) value & 0x7f;
            value = (value & 0x7fffffffffffffffl) >> 7;
            os.write(value == 0 ? b : (b | 0x80));
        } while (value != 0);
    }
}

I would try a Radix Tree. If you could store the data in a tree, you could then do an in-order traversal to transmit the data.

I'm not sure you could fit this into 1MB, but I think it's worth a try.
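
For what it's worth, here is a hedged sketch of the idea (mine, not the answerer's): a 10-ary digit trie with duplicate counts at depth 8, walked in order. The per-node pointer overhead is exactly why fitting it into 1 MB is doubtful.

// Sketch: decimal radix tree over the 8 digits; duplicates bump a leaf count.
class RadixSketch {
    static final class Node { Node[] kid = new Node[10]; int count; }
    static final Node root = new Node();

    static void insert(int value) {                   // value in 0..99,999,999
        Node n = root;
        for (int div = 10_000_000; div >= 1; div /= 10) {
            int d = (value / div) % 10;
            if (n.kid[d] == null) n.kid[d] = new Node();
            n = n.kid[d];
        }
        n.count++;
    }

    static void emit(Node n, int prefix, int depth) { // in-order traversal
        if (depth == 8) {
            for (int i = 0; i < n.count; i++) System.out.printf("%08d%n", prefix);
            return;
        }
        for (int d = 0; d < 10; d++)
            if (n.kid[d] != null) emit(n.kid[d], prefix * 10 + d, depth + 1);
    }
}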


If the input stream could be received a few times this would be much easier (there is no info about that; this is an idea and a time-performance problem). Then, we could simply count the decimal values. With counted values it would be easy to make the output stream. Compress by counting the values. It depends on what would be in the input stream.
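
A sketch of what that could look like (my own illustration, assuming the stream really can be replayed; readStream() and send() are stand-ins):

import java.util.Arrays;

// Sketch: counting sort in slices. Each pass counts one slice of the value
// range with a counts array that fits in RAM, then emits that slice in order.
class SliceCountSketch {
    static int[] readStream() { return new int[]{5, 5, 250_001, 3}; } // stand-in
    static void send(int v) { System.out.println(v); }                // stand-in

    public static void main(String[] args) {
        final int RANGE = 100_000_000, SLICE = 250_000;  // 250k counters ~ 1 MB
        int[] counts = new int[SLICE];
        for (int lo = 0; lo < RANGE; lo += SLICE) {      // 400 replays in total
            Arrays.fill(counts, 0);
            for (int v : readStream())
                if (v >= lo && v < lo + SLICE) counts[v - lo]++;
            for (int i = 0; i < SLICE; i++)
                for (int c = counts[i]; c > 0; c--) send(lo + i);
        }
    }
}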


If we don't know anything about those numbers, we are limited by the following constraints:

  • we need to load all numbers before we can sort them,
  • the set of numbers is not compressible.

If these assumptions hold, there is no way to carry out your task, as you will need at least 10^6 * log2(10^8) ≈ 26,575,425 bits of storage (3,321,929 bytes).

What can you tell us about your data?


My suggestions here owe a lot to Dan's solution.

First off I assume the solution must handle all possible input lists. I think the popular answers do not make this assumption (which IMO is a huge mistake).

It is known that no form of lossless compression will reduce the size of all inputs.

All the popular answers assume they will be able to apply compression effective enough to allow them extra space. In fact, a chunk of extra space large enough to hold some portion of their partially completed list in an uncompressed form and allow them to perform their sorting operations. This is just a bad assumption.

For such a solution, anyone with knowledge of how they do their compression will be able to design some input data that does not compress well for this scheme, and the "solution" will most likely then break due to running out of space.

Instead I take a mathematical approach. Our possible outputs are all the lists of length LEN consisting of elements in the range 0..MAX. Here the LEN is 1,000,000 and our MAX is 100,000,000.

For arbitrary LEN and MAX, the amount of bits needed to encode this state is:

Log2(MAX Multichoose LEN)

So for our numbers, once we have completed receiving and sorting, we will need at least Log2(100,000,000 MC 1,000,000) bits to store our result in a way that can uniquely distinguish all possible outputs.

This is ~988 KB. So we actually have enough space to hold our result. From this point of view, it is possible.

[Deleted pointless rambling now that better examples exist...]

The best answer is here.

Another good answer is here; it basically uses insertion sort as the function to expand the list by one element (it buffers a few elements and pre-sorts, to allow insertion of more than one at a time, which saves a bit of time). It uses a nice compact state encoding too: buckets of seven-bit deltas.


Sorting is a secondary problem here. As others have said, just storing the integers is hard, and it cannot work on all inputs, since 27 bits per number would be necessary.

My take on this is: store only the differences between consecutive (sorted) integers, as they will most likely be small. Then use a compression scheme, e.g. with 2 additional bits per input number, to encode how many bits each number is stored in. Something like:

00 -> 5 bits
01 -> 11 bits
10 -> 19 bits
11 -> 27 bits

It should be possible to store a fair number of the possible input lists within the given memory constraint. The math of how to pick the compression scheme so that it works on the maximum number of inputs is beyond me.

I hope you may be able to exploit domain-specific knowledge of your input to find a good enough integer compression scheme based on this.

Oh and then, you do an insertion sort on that sorted list as you receive data.
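
As a concrete illustration of the tag table above (my sketch, not the answerer's), this computes the storage such a scheme would need for a given sorted array:

// Cost of the 2-bit-tag delta scheme: 00->5, 01->11, 10->19, 11->27 bits per gap.
static long encodedBits(int[] sorted) {
    long total = 27;                          // first value stored raw
    for (int i = 1; i < sorted.length; i++) {
        int gap = sorted[i] - sorted[i - 1];
        int width = gap < (1 << 5)  ? 5
                  : gap < (1 << 11) ? 11
                  : gap < (1 << 19) ? 19 : 27;
        total += 2 + width;                   // tag + gap field
    }
    return total;
}

For uniform random input the average gap is about 100, so most gaps land in the 11-bit bucket: roughly 13 bits per number, which overshoots the budget. Tuning the four widths to the input is exactly the part the answer leaves open.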


Suppose this task is possible. Just prior to output, there will be an in-memory representation of the million sorted numbers. How many different such representations are there? Since there may be repeated numbers we can't use nCr (choose), but there is an operation called multichoose that works on multisets.

  • There are 2.2e2436455 ways to choose a million numbers in range 0..99,999,999.
  • That requires 8,093,730 bits to represent every possible combination, or 1,011,717 bytes.

So theoretically it may be possible, if you can come up with a sane (enough) representation of the sorted list of numbers. For example, an insane representation might require a 10MB lookup table or thousands of lines of code.

However, if "1M RAM" means one million bytes, then clearly there is not enough space. The fact that 5% more memory makes it theoretically possible suggests to me that the representation will have to be VERY efficient and probably not sane.


There are 10^6 values in a range of 10^8, so there's one value per hundred code points on average. Store the distance from the Nth point to the (N+1)th. Duplicate values have a skip of 0. This means that the skip needs an average of just under 7 bits to store, so a million of them will happily fit into our 8 million bits of storage.

These skips need to be encoded into a bitstream, say by Huffman encoding. Insertion is by iterating through the bitstream and rewriting after the new value. Output by iterating through and writing out the implied values. For practicality, it probably wants to be done as, say, 10^4 lists covering 10^4 code points (and an average of 100 values) each.

A good Huffman tree for random data can be built a priori by assuming a Poisson distribution (mean=variance=100) on the length of the skips, but real statistics can be kept on the input and used to generate an optimal tree to deal with pathological cases.
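
Under that Poisson assumption the average code length of an a-priori tree can be estimated from the distribution's entropy. A small sketch (mine; the Poisson model and its mean are taken from the paragraph above):

// Entropy of skip lengths under Poisson(mean=100); a Huffman code built from
// these probabilities would average within 1 bit of this figure.
static double poissonSkipEntropy() {
    double mean = 100, lnP = -mean;           // ln P(0) = -mean
    double entropy = 0;
    for (int k = 0; k <= 1000; k++) {
        double p = Math.exp(lnP);
        if (p > 0) entropy -= p * (lnP / Math.log(2));   // -p * log2(p)
        lnP += Math.log(mean / (k + 1));      // ln P(k+1) = ln P(k) + ln(mean/(k+1))
    }
    return entropy;                           // about 5.4 bits per skip
}

That lands comfortably inside the "just under 7 bits" budget, though for truly uniform random input the gaps are closer to geometric, which costs about 8.1 bits per skip.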


There is one solution to this problem across all possible inputs. Cheat.

  1. Read m values over TCP, where m is near the max that can be sorted in memory, maybe n/4.
  2. Sort the 250,000 (or so) numbers and output them.
  3. Repeat for the other 3 quarters.
  4. Let the receiver merge the 4 lists of numbers it has received as it processes them. (It's not much slower than using a single list.)
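
A minimal sketch of this cheat (my own illustration; receive() and send() are stand-ins for the TCP plumbing described in the question):

// Sketch: sort and ship the input in four batches; the receiver merges them.
static int receive(int[] buf, int max) { /* stand-in: fill from TCP */ return max; }
static void send(int v)                { /* stand-in: write to TCP */ }

static void sortInBatches() {
    final int TOTAL = 1_000_000, BATCH = 250_000;
    int[] buf = new int[BATCH];        // 250k ints is ~1 MB, so a real version
                                       // would trim this below the RAM budget
    for (int b = 0; b < TOTAL / BATCH; b++) {
        int n = receive(buf, BATCH);               // next quarter of the input
        java.util.Arrays.sort(buf, 0, n);
        for (int i = 0; i < n; i++) send(buf[i]);  // one sorted run of four
    }
}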

To represent the sorted array one can just store the first element and the differences between adjacent elements. In this way we are concerned with encoding 10^6 elements that can sum up to at most 10^8. Let's call this D. To encode the elements of D one can use a Huffman code. The dictionary for the Huffman code can be created on the go, and the array updated every time a new item is inserted in the sorted array (insertion sort). Note that when the dictionary changes because of a new item, the whole array should be updated to match the new encoding.

The average number of bits for encoding each element of D is maximized if we have an equal number of each unique element. Say the elements d1, d2, ..., dN in D each appear F times. In that case (in the worst case we have both 0 and 10^8 in the input sequence) we have

sum(1 <= i <= N) F * di = 10^8

where

sum(1 <= i <= N) F = 10^6, or F = 10^6/N, and the normalized frequency will be p = F/10^6 = 1/N

The average number of bits will be -log2(p) = log2(N). Under these circumstances we should find a case that maximizes N. This happens if we have consecutive numbers for di starting from 0, or di = i-1, therefore

10^8 = sum(1 <= i <= N) F * di = sum(1 <= i <= N) (10^6/N) (i-1) = (10^6/N) * N(N-1)/2

so N <= 201. And for this case the average number of bits is log2(201) = 7.6511, which means we will need around 1 byte per input element for saving the sorted array. Note that this doesn't mean D in general cannot have more than 201 elements. It just shows that if the elements of D are uniformly distributed, it cannot have more than 201 unique values.


We could play with the networking stack to send the numbers in sorted order before we have all the numbers. If you send 1M of data, TCP/IP will break it into 1500 byte packets and stream them in order to the target. Each packet will be given a sequence number.

We can do this by hand. Just before we fill our RAM we can sort what we have and send the list to our target, but leave holes in our sequence numbering around each number. Then process the second half of the numbers the same way, using those holes in the sequence.

The networking stack on the far end will assemble the resulting data stream in order of sequence before handing it up to the application.

It's using the network to perform a merge sort. This is a total hack, but I was inspired by the other networking hack listed previously.


What kind of computer are you using? It may not have any other "normal" local storage, but does it have video RAM, for example? 1 megapixel x 32 bits per pixel (say) is pretty close to your required data input size.

(I largely ask in memory of the old Acorn RISC PC , which could 'borrow' VRAM to expand the available system RAM, if you chose a low resolution or low colour-depth screen mode!). This was rather useful on a machine with only a few MB of normal RAM.


While receiving the stream, do these steps.

First, set some reasonable chunk size.

Pseudo-code idea:

  1. The first step would be to find all the duplicates, stick them in a dictionary with their counts, and remove them.
  2. The second step would be to place the numbers that form arithmetic sequences in special dictionaries, storing the first number and the step, like n, n+1..., n+2, 2n, 2n+1, 2n+2...
  3. Begin to compress in chunks some reasonable ranges of numbers, like every 1,000 or every 10,000, of the remaining numbers that appear to repeat less often.
  4. Uncompress a range if a number is found in it, add the number to the range, and leave it uncompressed for a while longer.
  5. Otherwise just add the number to a byte[chunkSize].

Continue the first 4 steps while receiving the stream. The final step would be either to fail if you exceed memory, or to start outputting the result once all the data is collected: begin sorting the ranges, spit out the results in order, uncompress those that need to be uncompressed in order, and sort them when you get to them.


A solution is possible only because of the difference between 1 megabyte and 1 million bytes. There are about 2^8093729.5 different ways to choose 1 million 8-digit numbers with duplicates allowed and ordering unimportant, so a machine with only 1 million bytes of RAM doesn't have enough states to represent all the possibilities. But 1M (minus less than 2k for TCP/IP) is 1022 * 1024 * 8 = 8,372,224 bits, so a solution is possible.

Part 1, initial solution

This approach needs a bit more than 1M; I'll squeeze it down to fit into 1M later.

I'll store a compact sorted list of the numbers in the range 0 to 99,999,999 as a sequence of sublists of 7-bit numbers. The first sublist holds numbers from 0 to 127, the second sublist holds numbers from 128 to 255, and so on. 100,000,000 / 128 is exactly 781,250, so 781,250 such sublists will be needed.

Each sublist consists of a 2-bit sublist header followed by a sublist body. The sublist body takes up 7 bits per sublist entry. The sublists are all concatenated together, and the format makes it possible to tell where one sublist ends and the next begins. The total storage required for a fully populated list is 2 * 781,250 + 7 * 1,000,000 = 8,562,500 bits, which is about 1.021 megabytes.

The 4 possible sublist header values are:

00 Empty sublist, nothing follows.

01 Singleton, there is only one entry in the sublist and the next 7 bits hold it.

10 The sublist holds at least 2 distinct numbers. The entries are stored in non-decreasing order, except that the last entry is less than or equal to the first. This allows the end of the sublist to be identified. For example, the numbers 2, 4, 6 would be stored as (4, 6, 2). The numbers 2, 2, 3, 4, 4 would be stored as (2, 3, 4, 4, 2).

11 The sublist consists of 2 or more repetitions of a single number. The next 7 bits give the number. Then come zero or more 7-bit entries with the value 1, followed by a 7-bit entry with the value 0; the length of the sublist body dictates the number of repetitions. For example, the numbers 12, 12 would be stored as (12, 0), the numbers 12, 12, 12 would be stored as (12, 1, 0), 12, 12, 12, 12 would be (12, 1, 1, 0), and so on.
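
Here is a hedged sketch (my own, not from the original post) of emitting one sublist in this format; write(value, bits) stands in for a bit-level output routine:

// Sketch: write one sublist per the Part 1 format. 'vals' holds this sublist's
// values reduced mod 128, in sorted order (7 bits each).
static void write(int value, int bits) { /* stand-in for a bit-level writer */ }

static void writeSublist(int[] vals) {
    if (vals.length == 0) {
        write(0b00, 2);                            // 00: empty sublist
    } else if (vals.length == 1) {
        write(0b01, 2); write(vals[0], 7);         // 01: singleton
    } else if (vals[0] == vals[vals.length - 1]) {
        write(0b11, 2); write(vals[0], 7);         // 11: repeats of one number
        for (int i = 2; i < vals.length; i++) write(1, 7);
        write(0, 7);                               // value 0 ends the run
    } else {
        write(0b10, 2);                            // 10: two or more distinct
        for (int i = 1; i < vals.length; i++) write(vals[i], 7);
        write(vals[0], 7);                         // last <= first marks the end
    }
}

For example, 2, 4, 6 produces header 10 and entries (4, 6, 2), and 12, 12, 12 produces header 11 and entries (12, 1, 0), matching the descriptions above.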

I start off with an empty list, read a bunch of numbers and store them as 32-bit integers, sort the new numbers in place (using heapsort, probably), and then merge them into a new compact sorted list. Repeat until there are no more numbers left to read, then walk the compact list once more to generate the output.

The line below represents memory just before the start of the list merge operation. The "O"s are the region that holds the sorted 32-bit integers. The "X"s are the region that holds the old compact list. The "=" signs are the expansion room for the compact list, 7 bits for each integer in the "O"s. The "Z"s are other random overhead.

ZZZOOOOOOOOOOOOOOOOOOOOOOOOOO==========XXXXXXXXXXXXXXXXXXXXXXXXXX

The merge routine starts reading at the leftmost "O" and at the leftmost "X", and starts writing at the leftmost "=". The write pointer doesn't catch up with the compact-list read pointer until all of the new integers have been merged, because both pointers advance 2 bits for each sublist and 7 bits for each entry in the old compact list, and there is enough extra room for the 7-bit entries for the new numbers.

Part 2, cramming it into 1M

To squeeze the solution above into 1M, I need to make the compact list format a bit more compact. I'll get rid of one of the sublist types, so that there will be just 3 different possible sublist header values. Then I can use "00", "01" and "1" as the sublist header values and save a few bits. The sublist types are:

A Empty sublist, nothing follows.

B Singleton, there is only one entry in the sublist and the next 7 bits hold it.

C The sublist holds at least 2 distinct numbers. The entries are stored in non-decreasing order, except that the last entry is less than or equal to the first. This allows the end of the sublist to be identified. For example, the numbers 2, 4, 6 would be stored as (4, 6, 2). The numbers 2, 2, 3, 4, 4 would be stored as (2, 3, 4, 4, 2).

D The sublist consists of 2 or more repetitions of a single number.

My 3 sublist header values will be "A", "B" and "C", so I need a way to represent D-type sublists.

Suppose I have a C-type sublist header followed by 3 entries, such as "C[17][101][58]". This can't be part of a valid C-type sublist as described above, since the third entry is less than the second but greater than the first. I can use this type of construct to represent a D-type sublist. In bit terms, anywhere I have "C{00?????}{1??????}{01?????}" is an impossible C-type sublist. I'll use this to represent a sublist consisting of 3 or more repetitions of a single number. The first two 7-bit words encode the number (the "N" bits below), followed by zero or more {0100001} words, followed by a {0100000} word.

For example, 3 repetitions: "C{00NNNNN}{1NN0000}{0100000}", 4 repetitions: "C{00NNNNN}{1NN0000}{0100001}{0100000}", and so on.

That just leaves lists that hold exactly 2 repetitions of a single number. I'll represent those with another impossible C-type sublist pattern: "C{0??????}{11?????}{10?????}". There's plenty of room for the 7 bits of the number in the first 2 words, but this pattern is longer than the sublist that it represents, which makes things somewhat more complex. The five question marks at the end can be considered not part of the pattern, so I have "C{0NNNNNN}{11N????}10" as my pattern, with the number to be repeated stored in the "N"s. That's 2 bits too long.

I'll have to borrow 2 bits and pay them back from the 4 unused bits in this pattern. When reading, on encountering "C{0NNNNNN}{11N00AB}10", output 2 instances of the number in the "N"s, overwrite the "10" at the end with bits A and B, and rewind the read pointer by 2 bits. Destructive reads are OK for this algorithm, since each compact list only gets walked once.

When writing a sublist of 2 repetitions of a single number, write "C{0NNNNNN}11N00" and set the borrowed-bit counter to 2. On every write where the borrowed-bit counter is non-zero, it is decremented for each bit written, and "10" is written once the counter hits zero. So the next 2 bits written will go into slots A and B, and then the "10" will get dropped onto the end.

With the 3 sublist header values represented by "00", "01" and "1", I can assign "1" to the most popular sublist type. I'll need a small table to map sublist header values to sublist types, and I'll need an occurrence counter for each sublist type so that I know what the best sublist header mapping is.

The worst-case minimal representation of a fully populated compact list occurs when all sublist types are equally popular. In that case I save 1 bit for every 3 sublist headers, so the list size is 2 * 781,250 + 7 * 1,000,000 - 781,250 / 3 = 8,302,083.3 bits. Rounding up to a 32-bit word boundary, that's 8,302,112 bits, or 1,037,764 bytes.

1M minus the 2k for TCP/IP state and buffers is 1022 * 1024 = 1,046,528 bytes, leaving me 8,764 bytes to play with.

But what about the process of changing the sublist header mapping? In the memory map below, "Z" is random overhead, "=" is free space, and "X" is the compact list.

ZZZ=====XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Start reading at the leftmost "X", start writing at the leftmost "=", and work rightwards. When it's done, the compact list will be a little shorter and it will be at the wrong end of memory:

ZZZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX=======

So then I'll need to shunt it to the right:

ZZZ=======XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

In the header mapping change process, up to 1/3 of the sublist headers will change from 1 bit to 2 bits. In the worst case these will all be at the head of the list, so I'll need at least 781,250 / 3 bits of free storage before I start, which takes me back to the memory requirements of the previous version of the compact list :(

To get around that, I'll split the 781,250 sublists into 10 sublist groups of 78,125 sublists each. Each group has its own independent sublist header mapping. Using the letters A to J for the groups:

ZZZ=====AAAAAABBCCCCDDDDDEEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ

During a sublist header mapping change, each sublist group shrinks or stays the same:

ZZZ=====AAAAAABBCCCCDDDDDEEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAA=====BBCCCCDDDDDEEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABB=====CCCCDDDDDEEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCC======DDDDDEEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCCDDDDD======EEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCCDDDDDEEE======FFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCCDDDDDEEEFFF======GGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCCDDDDDEEEFFFGGGGGGGGGG=======HHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCCDDDDDEEEFFFGGGGGGGGGGHH=======IJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCCDDDDDEEEFFFGGGGGGGGGGHHI=======JJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCCDDDDDEEEFFFGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ=======
ZZZ=======AAAAAABBCCCDDDDDEEEFFFGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ

The worst-case temporary expansion of a sublist group during a mapping change is 78,125 / 3 = 26,042 bits. If I allow 4k on top of the 1,037,764 bytes for a fully populated compact list, that leaves me 8,764 - 4,096 = 4,668 bytes for the "Z"s in the memory map.

That should be plenty for the 10 sublist header mapping tables, the 30 sublist header occurrence counts, and the other counters, pointers and small buffers I'll need, plus space I've used without noticing, like stack space for function call return addresses and local variables.

Part 3, how long would it take to run?

With an empty compact list, the 1-bit list header will be used for the empty sublists, and the starting size of the list will be 781,250 bits. In the worst case the list grows by 8 bits for each number added, so 32 + 8 = 40 bits of free space are needed for each of the 32-bit numbers to be placed at the top of the list buffer and then sorted and merged. In the worst case, changing the sublist header mapping results in a space usage of 2 * 781,250 + 7 * entries - 781,250 / 3 bits.

With a strategy of changing the sublist header mapping after every fifth merge once there are at least 800,000 numbers in the list, a worst-case run would involve a total of about 30M of compact-list reading and writing activity.

Source:

http://nick.cleaton.net/ramsortsol.html


Gilmanov's answer is very wrong in its assumptions. It starts speculating based on a pointless measure of a million consecutive integers. That means no gaps. Those random gaps, however small, really make it a poor idea.

Try it yourself. Get 1 million random 27-bit integers, sort them, and compress with 7-Zip, xz, or whatever LZMA you want. The result is over 1.5 MB. The premise above is compression of sequential numbers. Even delta encoding of that is over 1.1 MB. And never mind, this uses over 100 MB of RAM for compression. So even the compressed integers don't fit the problem, never mind run-time RAM usage.

It saddens me how people just upvote pretty graphics and rationalizations.

#include <stdint.h>
#include <stdio.h>   // for fopen/fwrite
#include <stdlib.h>
#include <time.h>

int32_t ints[1000000]; // Random 27-bit integers

int cmpi32(const void *a, const void *b) {
    return ( *(int32_t *)a - *(int32_t *)b );
}

int main() {
    int32_t *pi = ints; // Pointer to input ints (REPLACE W/ read from net)

    // Fill pseudo-random integers of 27 bits
    srand(time(NULL));
    for (int i = 0; i < 1000000; i++)
        ints[i] = rand() & ((1<<27) - 1); // Random bits masked to 27 bits

    qsort(ints, 1000000, sizeof (ints[0]), cmpi32); // Sort 1000000 int32s

    // Now delta encode, optional, store differences to previous int
    for (int i = 1, prev = ints[0]; i < 1000000; i++) {
        ints[i] -= prev;
        prev    += ints[i];
    }

    FILE *f = fopen("ints.bin", "wb"); // binary mode for portability
    fwrite(ints, 4, 1000000, f);
    fclose(f);
    exit(0);

}

Now compress ints.bin with LZMA...

$ xz -f --keep ints.bin       # 100 MB RAM
$ 7z a ints.bin.7z ints.bin   # 130 MB RAM
$ ls -lh ints.bin*
    3.8M ints.bin
    1.1M ints.bin.7z
    1.2M ints.bin.xz



