Bulk Creation


So far we have seen how to create a single document (either a BaseDocument or one of our own document classes) and persist it in CouchDB using createDocument and createOrUpdateDocument.

But this is not very efficient when you have to create many documents (for example, when migrating a database from Oracle to CouchDB), because every document needs its own round trip between your application and the CouchDB server.

To solve this problem, CouchDB lets you insert a list of documents in a single operation, available through bulkCreateDocuments.

The following code shows a test function that uses this method to create total documents in chunks of bulkSize documents (the size of each bulk).

long begin = System.currentTimeMillis();
int i = 0;
// Note: every bulk holds exactly bulkSize documents, so when total is not
// a multiple of bulkSize this loop creates more than total documents.
while (i < total) {
    int j = 0;
    // Create the list of documents for this bulk
    List<BaseDocument> docList = new ArrayList<BaseDocument>(bulkSize);
    while (j < bulkSize) {
        BaseDocument doc = new BaseDocument();
        doc.setProperty("doc", "Document #" + i);
        docList.add(doc);
        i++;
        j++;
    }
    // Invoke bulk creation in CouchDB
    List<DocumentInfo> docInfo = db.bulkCreateDocuments(docList);
}
// Compute how long it took
long time = System.currentTimeMillis() - begin;
System.out.printf("%5d/%4d|%7d|%4d\n",
                  total, bulkSize, time, (time / total));
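
When the documents already sit in a list, the chunking step can be written on its own. The following is only a minimal sketch: BulkChunks and partition are hypothetical names that are not part of the CouchDB client, and plain strings stand in for documents. It also covers the case where the total is not a multiple of the bulk size, by letting the last bulk be shorter.

```java
import java.util.ArrayList;
import java.util.List;

final class BulkChunks {
    // Splits items into consecutive bulks of at most bulkSize elements.
    // The last bulk is shorter when items.size() is not a multiple of bulkSize.
    static <T> List<List<T>> partition(List<T> items, int bulkSize) {
        if (bulkSize <= 0) {
            throw new IllegalArgumentException("bulkSize must be positive");
        }
        List<List<T>> bulks = new ArrayList<>();
        for (int from = 0; from < items.size(); from += bulkSize) {
            int to = Math.min(from + bulkSize, items.size());
            bulks.add(new ArrayList<>(items.subList(from, to)));
        }
        return bulks;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            docs.add("Document #" + i);
        }
        for (List<String> bulk : partition(docs, 4)) {
            // With the real client, each bulk would be passed to bulkCreateDocuments here
            System.out.println(bulk.size() + " documents: " + bulk);
        }
    }
}
```

With 10 items and a bulk size of 4, this yields bulks of 4, 4 and 2 elements.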

bulkCreateDocuments returns a List of DocumentInfo containing the identifier and the revision assigned by CouchDB (remember that you can either assign your own identifiers or let CouchDB generate one for you).

If the creation of a document fails, you get a null value for its revision. Alternatively, you can request all-or-nothing behaviour (by adding an extra parameter to bulkCreateDocuments with the value true), in which case either all the documents in the list are persisted or none of them is.

NOTE: If you provide your own identifiers and some of them are duplicated, none of the documents with duplicated identifiers is created in CouchDB.
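
Checking the returned list for null revisions could look roughly like the sketch below. To keep it runnable without CouchDB, InfoStub is a hypothetical stand-in that holds only the two values mentioned above (identifier and revision); the real DocumentInfo class and its accessors may differ, and failedIds is an illustrative name as well.

```java
import java.util.ArrayList;
import java.util.List;

final class BulkResultCheck {
    // Hypothetical stand-in for the library's DocumentInfo (assumption:
    // only identifier and revision are modelled here).
    static final class InfoStub {
        final String id;
        final String revision; // null when the document was not created

        InfoStub(String id, String revision) {
            this.id = id;
            this.revision = revision;
        }
    }

    // Returns the identifiers of the documents whose revision is null.
    static List<String> failedIds(List<InfoStub> infos) {
        List<String> failed = new ArrayList<>();
        for (InfoStub info : infos) {
            if (info.revision == null) {
                failed.add(info.id);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<InfoStub> infos = new ArrayList<>();
        infos.add(new InfoStub("doc-1", "1-abc"));
        infos.add(new InfoStub("doc-2", null));
        infos.add(new InfoStub("doc-3", "1-def"));
        System.out.println("Not created: " + failedIds(infos));
    }
}
```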
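
Because duplicated identifiers keep the affected documents from being created, it can help to look for them before sending a bulk. This is a minimal sketch that works on plain identifier strings; DuplicateIds and findDuplicates are hypothetical names, not part of the CouchDB client.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class DuplicateIds {
    // Returns every identifier that appears more than once, in the order
    // in which its first repetition is found.
    static Set<String> findDuplicates(List<String> ids) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String id : ids) {
            if (!seen.add(id)) { // add() returns false when id was already present
                duplicates.add(id);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("a", "b", "a", "c", "b");
        System.out.println("Duplicated identifiers: " + findDuplicates(ids));
    }
}
```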

Some examples of performance using bulk creation

Running the previous code with different bulk sizes gives the following results (times in milliseconds):

#Rec./bulk | total | Avg.
-----------+-------+-----
 1000/     |   5632|   5
 1000/    1|   5370|   5
 1000/   10|    669|   0
 1000/  100|    342|   0
 1000/ 1000|    270|   0

The first line corresponds to a loop of 1000 documents in non-bulk creation mode (using createDocument), while the following lines correspond to 1000 documents and bulk sizes of 1, 10, 100 and 1000. We see that as the bulk size grows, the total time required to create the same number of documents (1000) decreases, and so (obviously) does the average time per document.

But don’t expect these numbers to keep getting better and better. On my machines (see Linux Setup), for 10K documents and bulk sizes of 1K, 2K, 4K, 6K, 8K and 10K, I got:
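
The Avg. column shows 0 for most rows because the test code divides two integer values, which drops the fractional part. As a small worked example using the 669 ms measured for a bulk size of 10, the sketch below compares that integer division with a floating-point one (AverageTime and averageMillis are illustrative names only):

```java
import java.util.Locale;

final class AverageTime {
    // Average time per document in milliseconds, keeping the fraction.
    static double averageMillis(long totalMillis, int documents) {
        return (double) totalMillis / documents;
    }

    public static void main(String[] args) {
        long time = 669; // total time in ms for 1000 documents, bulk size 10
        int total = 1000;
        // Integer division, as in the test code
        System.out.println(time / total);
        // Floating-point division keeps the fractional milliseconds
        System.out.printf(Locale.ROOT, "%.3f%n", averageMillis(time, total));
    }
}
```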

#Rec./bulk | total | Avg.
-----------+-------+-----
10000/ 1000|   1603|   0
10000/ 2000|   1603|   0
10000/ 4000|   1889|   0
10000/ 6000|   2209|   0
10000/ 8000|   2692|   0

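This shows that the times for the 1K and 2K bulk sizes are the same (1603 ms) and that the time then starts increasing. Keep in mind that 4K, 6K and 8K do not divide 10K evenly, so the test loop shown earlier, which always fills a complete bulk, actually creates 12000, 12000 and 16000 documents in those runs; part of the increase comes from those extra documents.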

NOTE: These results are likely to change with different document sizes.
