HDInsight – Log storage attempt #1

I plan on doing a series of posts describing my attempts to get data into HDInsight, which is hosted in Azure.

At my current job there is a need to do business intelligence reporting. We are just starting to investigate and plan out a data platform where we can store an arbitrary amount of data and then run reports on it later.

Some of the informal requirements we have for the new data platform are:

1. Store an arbitrarily large amount of data. We don’t want to worry about deleting data; we want to keep everything as history.

2. Cheap. Storage needs to be inexpensive, and we want to minimize the cost of scaling out nodes as our data grows.

3. Dynamic. This is an important requirement. We don’t want to have to manage a rigid schema. We will constrain the “records” to be append-only: ideally we only add new fields over time and never change or remove old ones.

Candidate Data

The candidate data set to get a proof of concept off the ground is 404 logs: if someone comes to our site and the page isn’t there, we write that event to the data platform.

Hadoop felt like a natural fit because it has been around for a long time and has a great ecosystem around it. We would store the 404 logs in HDFS and then run map/reduce jobs for reporting.

In my first proof of concept I got a small Linux VM up and running with Hadoop and sent 404 logs to HDFS via WebHDFS.
It worked great: the 404.log file in HDFS grew and grew as pages couldn’t be found on our website. The key point here is that I was only ever appending lines to the end of a file.
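For reference, the WebHDFS calls behind that proof of concept look roughly like this as curl commands. The host, port, and file path are placeholders from my setup, not anything to copy verbatim.

```shell
# Sketch of the WebHDFS calls used to append 404 lines to HDFS.
NAMENODE="http://hadoop-vm:50070"   # assumed namenode address
LOGPATH="/logs/404.log"             # assumed HDFS file

webhdfs_url() {
  # Build the namenode URL for a WebHDFS operation ($1 = CREATE or APPEND).
  echo "${NAMENODE}/webhdfs/v1${LOGPATH}?op=$1&user.name=hdfs"
}

# WebHDFS is a two-step dance: the namenode answers with a 307 redirect
# to a datanode, and the second request carries the actual bytes.
#
# One-time create (CREATE is an HTTP PUT):
#   curl -i -X PUT "$(webhdfs_url CREATE)"
#   curl -i -X PUT -T line.txt "<Location header from the redirect>"
#
# Every 404 after that is just an append (APPEND is an HTTP POST):
#   curl -i -X POST "$(webhdfs_url APPEND)"
#   curl -i -X POST -T line.txt "<Location header from the redirect>"
```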

The next big step was to make it production ready.

Going to Production

In Azure it is clearly cheaper to spin up an HDInsight cluster than to run my own cluster of Linux VMs running Hadoop. That is great news, but how do I get data into HDInsight?

The Azure answer is to put the data into blob storage.

This is where the roadblock went up. The blob storage APIs do not allow appending data to existing files. You either have to queue up writes, build blocks, and commit an updated block list (complicated), or download the entire file, append a line, and re-upload the whole thing.
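To make the pain concrete, here is the simple-but-wasteful option sketched as a local simulation: a temp directory stands in for the container, and the function name is invented.

```shell
# Simulate "download the whole blob, append one line, re-upload it all".
BLOB_DIR=$(mktemp -d)                                    # fake container
printf 'GET /missing-page 404\n' > "$BLOB_DIR/404.log"   # existing "blob"

append_line_to_blob() {
  local blob="$1" line="$2" tmp
  tmp=$(mktemp)
  cp "$BLOB_DIR/$blob" "$tmp"        # download the entire blob
  printf '%s\n' "$line" >> "$tmp"    # append one line locally
  cp "$tmp" "$BLOB_DIR/$blob"        # re-upload the entire blob
  rm -f "$tmp"
}

append_line_to_blob 404.log 'GET /also-missing 404'
```

Every append moves the whole file across the wire twice, so the cost grows with the size of the log, which is exactly backwards for an append-only platform.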

I scoured the web for how to append small amounts of data to blobs and couldn’t find any good answers. I looked through log4net code (that writes to blob storage), I tweeted everyone I could find related to blob storage and HDInsight and … no good replies.

This might be a good time to challenge my assertion that appending data is good. Why do it that way?

The word on the street is that Hadoop prefers a small number of large files instead of a large number of small files when running map/reduce jobs. Because of this I want to be able to cheaply add data to large files.
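What I wanted the write path to do, sketched locally with made-up paths and an absurdly small roll threshold: keep appending to one big file and only start a new one past a size limit, so HDFS ends up with a few large files rather than thousands of tiny ones.

```shell
# Append records to the current log file; roll to a new numbered file
# once it crosses MAX_BYTES. Threshold is tiny just for the demo.
LOG_DIR=$(mktemp -d)
MAX_BYTES=64

current_log() {
  # Highest-numbered log file, creating the first one if needed.
  local last
  last=$(ls "$LOG_DIR" | sort | tail -n 1)
  [ -n "$last" ] || { last="404.0.log"; : > "$LOG_DIR/$last"; }
  echo "$LOG_DIR/$last"
}

append_record() {
  local file n
  file=$(current_log)
  if [ "$(wc -c < "$file")" -ge "$MAX_BYTES" ]; then
    # Roll: bump the numeric suffix and start a fresh file.
    n=$(basename "$file" | cut -d. -f2)
    file="$LOG_DIR/404.$((n + 1)).log"
    : > "$file"
  fi
  printf '%s\n' "$1" >> "$file"
}

for i in 1 2 3 4 5 6; do
  append_record "GET /missing-$i 404"
done
```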

A backdoor?

Eventually I found two .jars (https://github.com/prateek/wasb-parcel/tree/master/parcel-src/lib/hadoop/lib) that would integrate Azure blob storage with HDFS on the command line. I could communicate over wasb:// and copy files into blob storage, read files from blob storage, etc.

After getting a proof of concept running with the .jars I prematurely tweeted my excitement because I thought I had finally reached my goal. Why premature? Because when I tried to execute the -appendToFile command I got a sad “Operation not supported.” message.
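For the record, the wasb:// usage looked roughly like this; the storage account and container names here are placeholders, and the commands only run on a box with those jars on Hadoop’s classpath and the storage key in core-site.xml.

```shell
# Placeholder container URI in the wasb:// scheme those jars provide.
CONTAINER="wasb://logs@mystorageacct.blob.core.windows.net"

# The normal filesystem commands just work against blob storage:
#
#   hadoop fs -ls "$CONTAINER/"
#   hadoop fs -copyFromLocal /data/404.log "$CONTAINER/404.log"
#   hadoop fs -cat "$CONTAINER/404.log"
#
# ...right up until the one command I actually needed:
#
#   hadoop fs -appendToFile newlines.txt "$CONTAINER/404.log"
#   # => "Operation not supported."
```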

In the end I decided to run one small Linux VM in Azure with Hadoop in a Docker container (https://registry.hub.docker.com/u/sequenceiq/hadoop-docker/). This way I can accept true append-only writes and persist them to HDFS. Storage is cheap, so I can keep adding disks if I want, or just rotate out old data and archive it in blob storage. Then I could copy the files into blob storage with the Azure HDFS wrapper, spin up my cluster, and report to my heart’s content.

This would have worked, except that a few hours after going to production Hadoop would get confused and start complaining about corrupted blocks. Was this because of Docker? Azure? Bad config?

2014-09-16 14:27:44,497 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
java.io.EOFException: Premature EOF: no length prefix available
	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1986)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1346)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1194)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:531)
2014-09-16 14:27:44,498 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 22acceea218d:50010:DataXceiver error processing WRITE_BLOCK operation  src: / dst: /
java.io.IOException: Corrupted block: ReplicaBeingWritten, blk_1073741859_7815, RBW
  getNumBytes()     = 910951
  getBytesOnDisk()  = 910951
  getVisibleLength()= 910951
  getVolume()       = /tmp/hadoop-root/dfs/data/current
  getBlockFile()    = /tmp/hadoop-root/dfs/data/current/BP-953099033-
	at org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.createStreams(ReplicaInPipeline.java:218)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:214)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
	at java.lang.Thread.run(Thread.java:744)

I have no idea and I’m done trying to figure it out.

The next blog post will be about the shift to trying the ELK (Elasticsearch, Logstash, Kibana) stack. I hear you can pump data from Elasticsearch into Hadoop without too much trouble.


Azure+TeamCity, Build Server Fun!

At work we have a strong need for a build server with pseudo continuous integration, and the task of getting something up and running was passed to me. Many years ago, at a different job, I cobbled together CruiseControl and some NAnt scripts into something like a build server. This time around I knew I could do better, and on the recommendation of a co-worker I went with TeamCity.

TeamCity is awesome! Poor CruiseControl.Net looks awfully outdated compared to TeamCity.

Some quick things I love about TeamCity:

1. Lots of integrations out of the box. It can talk to TFS, MSTest, MSBuild, PowerShell, etc.
2. It just works out of the box. The web portal fired up fine, the Windows service installed cleanly, etc.
3. There is a ton of support online for it. CruiseControl has some of this, but not nearly as much as TeamCity.

It was a bit harder to get TeamCity to publish projects to Azure, but with the help of a blog or two online, I got it working.

Our process now:

1. Check out latest solution from TFS.
2. Update NuGet packages.
3. Build the entire solution.
4. Run unit tests.
5. Publish to Azure (the staging instance just to be safe).
6. Flush Redis.
7. Run integration tests.
8. Email the team that a build is successful.
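For the curious, here is the same pipeline as a plain shell sketch. The tool names are real, but the solution names, paths, and Redis host are invented, and it runs in dry-run mode (printing each step) since the real tools only live on the build agent, where TeamCity runs them for us.

```shell
# Dry-run sketch of the build pipeline: print each step instead of
# executing it, and remember the last step that "ran".
run() { echo "step: $*"; LAST_STEP="$*"; }

run tf get '$/OurProject' /recursive           # 1. latest from TFS
run nuget update OurSolution.sln               # 2. update NuGet packages
run msbuild OurSolution.sln /m                 # 3. build the solution
run vstest.console.exe UnitTests.dll           # 4. unit tests
run msbuild Cloud.ccproj /t:Publish            # 5. publish to Azure staging
run redis-cli -h staging-redis FLUSHALL        # 6. flush Redis
run vstest.console.exe IntegrationTests.dll    # 7. integration tests
run sendmail team@example.com                  # 8. notify the team
```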

It’s been a few days of work getting it all dialed in, but it feels good to add some automation to our process. It was getting tiresome to deploy manually and run tests a few times a day.


Azure Fatigue

The battle rages on with Microsoft Azure. Having come from a solid year of working with Amazon’s EC2, I find working with Azure to be very frustrating at times.

A recent example is how the IP addresses – both public and private – change on instances.

For example, I have a cloud service that is up and running with a production instance and a staging instance. The private IP addresses end with .11 and .12.

I deploy my project to the staging instance and then swap staging and production once I have verified that staging is good.

What is the state of things now?

Now I have private IP addresses ending in .10 and .12. The address ending in .11 is gone, and the new .10 machine is not allowed to communicate with the other machines because it is unrecognized.

I remember going through similar pains with EC2 (getting machines to communicate internally), but at least EC2 kept things consistent. The IPs would stick to a box that wasn’t stopped and restarted; if I just rebooted a machine, the IP address stayed the same, and there was no tight deployment integration doing “magic” for me.


Windows Azure SDK 2.0 Troubleshooting


Several years ago I got the chance to work with Microsoft’s AppFabric for caching. I had a heck of a time getting it running well because when it failed, it did so silently. No stack traces, no log files, no messages in the console … just null objects and non-working software.

Recently I ran into some trouble with the Windows Azure SDK and it felt all too familiar.

The Start

Like the AppFabric experience, everything started well. I added a new worker role to an existing cloud service project and it ran fine. Data went from point A to point B, I checked in my code and forgot about it.

Later I heard groans coming from another developer at work. It turns out I had upgraded the Azure SDK from version 1.8 to 2.0, forcing everyone else to download the new SDK just so the code would build.

After spending a few minutes upgrading we made sure the build was good and moved on to other things.


Several weeks later the decision was made to move my worker role project to a different cloud service. I created the new project, added the worker role, checked in and since everything built fine I figured I was safe.

I didn’t need to test it again because I hadn’t made any code changes. As long as it built, it should run the same as it had before.


A new feature came across my desk regarding the worker role, so it was time to fire it up and do some work.

The first thing I did was to run the project to refresh my memory as to how it worked.

Failure. The Windows Azure Emulator would start and then immediately stop. Huh?

Eventually I was able to scrape out the contents of the emulator’s console and I could see that the worker role was indeed starting, but right after starting, it would die. No descriptive errors, no activity in Visual Studio, nothing. Silence.

Stabs in the Dark

A short list of what I tried:

1. Refreshing the DLLs in the worker role project. Version 1.8? Try 2.0. Version 2.0? Maybe 1.8 again?
2. Removing all code related to dependencies. Maybe our TraceLogger class was failing in some odd way?
3. Tracing. I tried to trace the SDK and my code, but couldn’t get it working.
4. Breakpoints. I would set them all over but none of them were getting hit.
5. Event viewer. The old standby. Empty. No messages, no errors.
6. Lots and lots of Googling. I read a lot of articles about flipping this bit or that flag, but nothing worked.
7. Running another worker project in the same codebase. That one ran fine, but I couldn’t see why mine didn’t.
8. Doing a diff between the .csproj file of a working worker role and my failing one.
9. Doing a diff between a working cloud service project and my broken one.


The prior steps all took place in a few hours towards the end of a work day. The next morning I came in and tried this:

1. Slowly and methodically start commenting code out. I removed one method at a time – or stubbed it out – and re-ran the worker role locally.
2. I went through all the code in this manner. The class, the base class it inherited from and all properties/fields that were in both.
3. Changed the custom base class (a wrapper around RoleEntryPoint) to RoleEntryPoint. Ah ha! That worked!

The problem ended up being that we had several projects (“Common”, “Utility”) that referenced an older version of the Azure SDK. They were all using 1.8 while my worker role was trying to use 2.0.

Just inheriting from a class in one of those other projects caused the failure.
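In hindsight, one thing that might have surfaced (or even fixed) the mismatch is an assembly binding redirect forcing every project to resolve the 2.0 runtime assembly. Something along these lines in app.config; the version numbers and public key token here are from memory, so verify them against the actual DLLs:

```xml
<configuration>
  <runtime>
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <assemblyIdentity name="Microsoft.WindowsAzure.ServiceRuntime"
                          publicKeyToken="31bf3856ad364e35" culture="neutral" />
        <!-- Redirect any 1.x reference up to the 2.0 assembly. -->
        <bindingRedirect oldVersion="0.0.0.0-2.0.0.0" newVersion="2.0.0.0" />
      </dependentAssembly>
    </assemblyBinding>
  </runtime>
</configuration>
```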

Once again, though, I wish there had been some output somewhere that I could have latched on to, something that would have pointed me in the right direction.
