Moving data from SQL Server to MongoDB/CouchDB/Couchbase

I must have rewritten the first sentence of this post 20 times. The reason is that I realize the audience for it must be incredibly small. You have to work somewhere that is using SQL Server, and you also have to work somewhere that is using MongoDB or a Couch product (CouchDB, Couchbase).

On top of that, you have to want to move data from SQL Server to a document database that stores its data as JSON. But who knows, maybe someone will find this approach useful. =)

The goal is to take data from SQL Server (relational) and move it into MongoDB (document). I’m assuming that, since you are running SQL Server, you have some .NET expertise around the office.

To start, make a new Visual Studio solution and create a Visual C# SQL CLR Database Project. You can find it under the Database grouping of project types.

The basic idea is that you write some code, deploy it to SQL Server, and then run it as a stored procedure that produces JSON.

Example code:

using System.Collections.Generic;
using System.Data;
using Microsoft.SqlServer.Server;
using POC;
using POC.PageElements;


public partial class StoredProcedures
{
    [Microsoft.SqlServer.Server.SqlProcedure]
    public static void StoredProcedure1()
    {
        Builder builder = new Builder();
        var articles = builder.Build<Article>("exec usp_get_articles");

        foreach (var article in articles)
        {
            article.Contributors = builder.Build<Contributor>("exec usp_get_article_contributors " + article.ArticleId);
            foreach (var contributor in article.Contributors)
            {
                contributor.SocialLinks = builder.Build<SocialLink>("select * from contributor_social_link where contributor_id = " + contributor.Id);
            }

            article.Categories = builder.Build<Category>("exec usp_get_article_categories " + article.ArticleId);

            article.Pages = builder.Build<Page>("exec usp_get_article_pages " + article.ArticleId);
            foreach (var page in article.Pages)
            {
                var paragraphs = builder.Build<Paragraph>("exec usp_get_paragraphs " + page.Id);
                foreach (var paragraph in paragraphs)
                {
                    var images = builder.Build<Image>("exec usp_get_paragraph_image " + paragraph.Id);
                    paragraph.Image = (images.Count > 0) ? images.ToArray()[0] : null;
                    page.PageElements.Add(paragraph);
                }


                var tables = builder.Build<Table>("exec usp_get_tables " + page.Id);
                foreach (var table in tables)
                {
                    page.PageElements.Add(table);
                }


                var imageGroups = builder.Build<ImageGroup>("exec usp_get_image_groups " + page.Id);
                foreach (var imageGroup in imageGroups)
                {
                    var images = builder.Build<Image>("exec usp_get_image_group_images " + imageGroup.Id);
                    imageGroup.Images = (images.Count > 0) ? images : new List<Image>();
                    page.PageElements.Add(imageGroup);
                }


                var inlineImages = builder.Build<Image>("exec usp_get_inline_images " + page.Id);
                foreach (var inlineImage in inlineImages)
                {
                    page.PageElements.Add(inlineImage);
                }


                var videos = builder.Build<Video>("exec usp_get_videos " + page.Id);
                foreach (var video in videos)
                {
                    page.PageElements.Add(video);
                }


                var blockQuotes = builder.Build<BlockQuote>("exec usp_get_block_quotes " + page.Id);
                foreach (var blockQuote in blockQuotes)
                {
                    page.PageElements.Add(blockQuote);
                }
            }
        }



        // Stream the results back to the caller as rows of (id, json).
        SqlPipe sp = SqlContext.Pipe;

        var dataRecord = new SqlDataRecord(
            new SqlMetaData("id", SqlDbType.Int),
            new SqlMetaData("json", SqlDbType.Text));
        sp.SendResultsStart(dataRecord);

        foreach (var article in articles)
        {
            dataRecord.SetInt32(0, article.Id);
            dataRecord.SetString(1, article.ToString());
            sp.SendResultsRow(dataRecord);
        }

        sp.SendResultsEnd();
    }
};

Some of the code calls functions that I haven’t included, but you can use your imagination to understand what they do. They simply run queries that populate a list of objects. We keep constructing a very rich Article object by making one database call after another.
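
If your imagination needs a hand, a minimal sketch of a Build&lt;T&gt; helper might look something like the following. This is illustrative rather than my exact implementation; the important detail is the "context connection=true" connection string, which runs each query on the in-process connection of the session that invoked the CLR procedure.

using System.Collections.Generic;
using System.Data.SqlClient;
using System.Reflection;

public class Builder
{
    public List<T> Build<T>(string sql) where T : new()
    {
        var results = new List<T>();

        using (var connection = new SqlConnection("context connection=true"))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    var item = new T();

                    // Copy each column into a property with a matching name, if one exists.
                    for (int i = 0; i < reader.FieldCount; i++)
                    {
                        PropertyInfo property = typeof(T).GetProperty(reader.GetName(i));
                        if (property != null && !reader.IsDBNull(i))
                        {
                            property.SetValue(item, reader.GetValue(i), null);
                        }
                    }

                    results.Add(item);
                }
            }
        }

        return results;
    }
}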

Once we’re done, we have a list of articles that have been populated from the database. Remember, the code above is actually deployed as executable code in SQL Server. Inside Management Studio you would navigate to your database and then Programmability -> Assemblies. If deployment goes well, you’ll see your assembly there.

At the end of the code blob above you can see that we call article.ToString(). The contents of that call:

        // Requires a reference to Newtonsoft.Json and "using Newtonsoft.Json;" at the top of the file.
        public override string ToString()
        {
            return JsonConvert.SerializeObject(this);
        }

As you can see we’re simply serializing the article to JSON using the Newtonsoft JSON library for C#.

I will say that deploying code to SQL Server (especially third-party libraries) is not easy. It’s possible, but it will take some Googling and investigation.
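
To give a rough idea of the moving parts, the manual version of the deployment looks roughly like the T-SQL below. The assembly names and file paths here are placeholders, and Visual Studio’s deploy step can do much of this for you:

-- Enable CLR integration on the server.
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;

-- Third-party dependencies such as Newtonsoft.Json must be registered as assemblies too,
-- and typically need a permission set above SAFE (which in turn requires the database to be
-- TRUSTWORTHY or the assembly to be signed).
CREATE ASSEMBLY [Newtonsoft.Json]
FROM 'C:\deploy\Newtonsoft.Json.dll'
WITH PERMISSION_SET = UNSAFE;

CREATE ASSEMBLY [ArticleExport]
FROM 'C:\deploy\ArticleExport.dll'
WITH PERMISSION_SET = UNSAFE;

-- Expose the CLR method as a regular stored procedure.
CREATE PROCEDURE dbo.usp_export_articles_json
AS EXTERNAL NAME [ArticleExport].[StoredProcedures].[StoredProcedure1];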

The benefit of being able to produce JSON from SQL Server is massive, however. You can quickly change the C# objects that get serialized to JSON, which helps development velocity. In my use case each article runs roughly 13 queries, which means the total for 1,000 articles would be 13,000 queries!

Even with that query count, in my testing I was able to produce around 100,000 articles in less than 10 minutes. The secret sauce is that the code runs on the database machine, so we never pay the price of network latency between queries.

Hopefully this approach is helpful in sending non-trivially formed data from SQL Server to MongoDB!


Quick CouchDB Javascript engine replacement test (couchjs)

Somewhere online I read about a new project from a guy at IrisCouch to replace the built-in Javascript engine that comes with CouchDB.

You can see his project on GitHub here: https://github.com/iriscouch/couchjs

The instructions for replacing the Javascript engine of CouchDB are pretty simple. I spent most of my time fighting to install the right versions of node.js and npm on my local Ubuntu VM.
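
I won’t repeat the project’s install steps here, but the swap itself boils down to pointing CouchDB’s query server configuration at the new binary. In a CouchDB 1.x-style local.ini that is the [query_servers] section; the paths below are only examples and will differ per install:

[query_servers]
; stock engine that ships with CouchDB:
; javascript = /usr/bin/couchjs /usr/share/couchdb/server/main.js
; Node.js-based replacement from the couchjs project:
javascript = /usr/local/bin/couchjs /usr/share/couchdb/server/main.js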

Unfortunately, after switching to the new couchjs engine I didn’t notice any performance gains. I’m not exactly sure what the motivation for writing a new Javascript engine for CouchDB would be if not performance. Maybe the advantages are negated because I’m running BigCouch? The GitHub page says the new couchjs engine uses V8. Do you remember the first time you ran a Javascript-heavy feature in Chrome? It was much faster, right? I was hoping for a similar 10-20% performance gain, but instead saw no visible, or measured, gains.

Bummer.

Something to keep an eye on though in case I need to squeeze every last drop of performance out of CouchDB.


Writing a CouchDB replication filter in Erlang

I’m working on a project that uses BigCouch (a fork of CouchDB), and the performance when replicating data from one machine to another was a little less than I had hoped for.

Replication with CouchDB uses the _changes feed, and instead of replicating an entire database I had a filter set up to limit the number of documents that would go across the wire.

The filter I wrote was in Javascript and looks like this:

{
   "_id": "_design/segmenting",
   "filters": {
      "by_year_month": "function(doc, req){var month = req.query.month;var year = req.query.year;if (doc.pubYear == year && doc.pubMonth == month){return true;}else{return false;}}"
   }
}

The goal is to get documents by using a month field and a year field that exist in each document.

Here is a sample test to see if the filter is working:

http://127.0.0.1:5984/database_name/_changes?filter=segmenting/by_year_month&month=4&year=2012
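
The same filter is what gets referenced when kicking off replication; the filter’s parameters go in query_params. Something like this (the source, target, and host names are placeholders):

POST /_replicate
{
   "source": "database_name",
   "target": "http://other-server:5984/database_name",
   "filter": "segmenting/by_year_month",
   "query_params": { "month": "4", "year": "2012" }
}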

While reading about something totally unrelated, I stumbled across the idea that I could write my filter function in Erlang instead of Javascript. This has the advantage of speaking CouchDB’s native language, Erlang.

After much reading, researching, and testing I ended up with the equivalent filter in Erlang:

{
   "_id": "_design/fast_segmenting",
   "language": "erlang",
   "filters": {
      "by_year_month": "fun({Doc}, {Req}) -> {Query} = proplists:get_value(<<\"query\">>, Req), Month = list_to_integer(binary_to_list(proplists:get_value(<<\"month\">>, Query))), Year = list_to_integer(binary_to_list(proplists:get_value(<<\"year\">>, Query))), case {proplists:get_value(<<\"pubMonth\">>, Doc), proplists:get_value(<<\"pubYear\">>, Doc)} of {Month, Year} -> true; _ -> false end end."
   }
}

This code is gross to read, so let me show the Javascript and Erlang filters with some better formatting. Keep in mind every document has a field called “pubMonth” representing the month and “pubYear” representing the year.

Javascript:

function(doc, req)
{
   var month = req.query.month;
   var year = req.query.year;

   if (doc.pubYear == year && doc.pubMonth == month)
   {
      return true;
   }
   else
   {
      return false;
   }
}

Erlang:

fun({Doc}, {Req}) ->
   {Query} = proplists:get_value(<<"query">>, Req),
   Month = list_to_integer(binary_to_list(proplists:get_value(<<"month">>, Query))),
   Year = list_to_integer(binary_to_list(proplists:get_value(<<"year">>, Query))),

   case {proplists:get_value(<<"pubMonth">>, Doc), proplists:get_value(<<"pubYear">>, Doc)} of
      {Month, Year} -> true;
      _ -> false
   end
end.
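
One gotcha: design documents with "language": "erlang" only run if the native Erlang query server is enabled in CouchDB’s configuration (it is off by default because it executes arbitrary Erlang). In the CouchDB 1.x family that is a local.ini entry along these lines; BigCouch may name things slightly differently:

[native_query_servers]
erlang = {couch_native_process, start_link, []}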

To prove the Erlang version was faster, I ran a Javascript test and an Erlang test with the same parameters. I ran each test three times per database on a server that had no other traffic hitting it.

The largest database had 5,896 documents and the smallest had 910. Averaging the three runs for each database, and then averaging those numbers across all databases, the improvement came to 52%. In other words, over three test runs across all databases, the Erlang filter was 52% faster than the Javascript one!
