I plan on doing a series of posts describing my attempts to get data into HDInsight, Azure's hosted Hadoop service.
At my current job there is a need to do business intelligence reporting. We are just starting to investigate and plan out a data platform where we can store an arbitrary amount of data and then run reports on it later.
Some of the informal requirements we have for the new data platform are:
1. Store an arbitrarily large amount of data. We don’t want to worry about deleting data; we want to keep everything, forever, for historical reporting.
2. Cheap. The storage needs to be cheap, and we want to minimize the cost of scaling out nodes as our data grows.
3. Dynamic. This is an important requirement. We don’t want to have to manage a rigid schema. We will constrain writes to appending new records; ideally we only ever add fields, never change or remove old ones.
The candidate data set to get a proof of concept off the ground is 404 logs: if someone comes to our site and the page isn’t there, we write that event to the data platform.
Hadoop felt like a natural fit because it has been around for a long time and has a great ecosystem around it. We would store the 404 logs in HDFS and then run map/reduce jobs for reporting.
In my first proof of concept I got a small Linux VM up and running with Hadoop and sent 404 logs to HDFS via WebHDFS.
It worked great: the 404.log file in HDFS grew and grew as pages couldn’t be found on our website. The key point here is that I was only ever appending lines to the end of a file.
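For the curious, a WebHDFS append is a two-step dance: you POST to the namenode, it answers with a redirect to a datanode, and you send the actual bytes there. A rough sketch with curl (the hostnames, ports, file path, and user name are placeholders for my setup):

```shell
# Step 1: ask the namenode to append. It does not take the data itself;
# it responds with a 307 redirect whose Location header points at a datanode.
curl -i -X POST \
  "http://namenode:50070/webhdfs/v1/logs/404.log?op=APPEND&user.name=hadoop"

# Step 2: POST the new log lines to the exact URL from that Location header
# (shown here schematically -- use the redirect verbatim, don't build it by hand).
curl -i -X POST -T new-404-lines.txt \
  "http://datanode:50075/webhdfs/v1/logs/404.log?op=APPEND&user.name=hadoop"
```

Cheap, simple, and exactly the append-only write pattern I wanted.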
The next big step was to make it production ready.
Going to Production
In Azure it is clearly cheaper to spin up an HDInsight cluster than to run my own cluster of Linux VMs running Hadoop. That is great news, but how do I get data into HDInsight?
The Azure answer is to put the data into blob storage.
This is where the roadblock went up. The blob storage APIs do not allow appending data to an existing blob. You either have to queue up writes, assemble them into blocks, and commit an updated block list (complicated), or download the entire file, append a line, and re-upload the whole thing.
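To make the limitation concrete, here is roughly what the "download, append, re-upload" workaround looks like with the cross-platform Azure CLI (the container and blob names are made up, the exact CLI syntax may vary by version, and credentials are assumed to be in the environment):

```shell
# Pull the whole blob down just to add one line...
azure storage blob download logs 404.log ./404.log

# ...append locally...
cat new-404-lines.txt >> ./404.log

# ...and push the whole thing back up.
azure storage blob upload ./404.log logs 404.log
```

Every append costs a full round trip of the entire file, so the bigger the log grows, the more expensive each tiny append becomes. That is exactly backwards from what I need.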
I scoured the web for how to append small amounts of data to blobs and couldn’t find any good answers. I looked through log4net code that writes to blob storage, I tweeted at everyone I could find related to blob storage and HDInsight, and … no good replies.
This might be a good time to defend my assertion that appending data is good. Why append at all?
The word on the street is that Hadoop prefers a small number of large files over a large number of small files when running map/reduce jobs. Because of this I want to be able to cheaply add data to large files.
Eventually I found two .jars (https://github.com/prateek/wasb-parcel/tree/master/parcel-src/lib/hadoop/lib) that would integrate Azure blob storage with HDFS on the command line. I could communicate over wasb:// and copy files into blob storage, read files from blob storage, etc.
After getting a proof of concept running with the .jars I prematurely tweeted my excitement because I thought I had finally reached my goal. Why premature? Because when I tried to execute the -appendToFile command I got a sad “Operation not supported.” message.
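For reference, this is roughly how the wasb:// integration works once those jars are on the Hadoop classpath and the storage account key is configured in core-site.xml (under a property like fs.azure.account.key.myaccount.blob.core.windows.net; the account and container names below are placeholders):

```shell
# Listing and copying against blob storage works fine:
hadoop fs -ls wasb://logs@myaccount.blob.core.windows.net/
hadoop fs -copyFromLocal 404.log \
  wasb://logs@myaccount.blob.core.windows.net/logs/404.log

# ...but the one command I actually needed does not:
hadoop fs -appendToFile new-404-lines.txt \
  wasb://logs@myaccount.blob.core.windows.net/logs/404.log
# => "Operation not supported."
```

In hindsight this makes sense: wasb:// is a thin wrapper over the blob APIs, and the blob APIs don’t support appends, so neither can the wrapper.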
In the end I decided to run one small Linux VM in Azure that would run Hadoop in a Docker container (https://registry.hub.docker.com/u/sequenceiq/hadoop-docker/). This way I can accept true append-only writes and persist them to HDFS. Storage is cheap, so I can keep adding disks if I want, or just rotate out old data and archive it in blob storage. From there I could copy the files into blob storage with the Azure HDFS wrapper, spin my cluster up, and report to my heart’s content.
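The setup was roughly this (the image tag and bootstrap command are assumptions based on the image’s docs, and the ports are the standard namenode/datanode WebHDFS ports):

```shell
# Run Hadoop in a container on the small VM, exposing WebHDFS so
# the 404 logger can append over HTTP as before.
docker run -i -t -p 50070:50070 -p 50075:50075 \
  sequenceiq/hadoop-docker:2.5.1 /etc/bootstrap.sh -bash

# Periodically ship the accumulated file to blob storage, where the
# HDInsight cluster can pick it up (account/container are placeholders):
hadoop fs -cp hdfs:///logs/404.log \
  wasb://logs@myaccount.blob.core.windows.net/logs/404.log
```

Appends stay cheap because they hit real HDFS; the expensive whole-file copy to blob storage only happens occasionally, right before a reporting run.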
This would have worked except that after going to production I started to notice that after a few hours Hadoop would get confused and start complaining about corrupted blocks. Was this because of Docker? Azure? Bad config?
2014-09-16 14:27:44,497 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1986)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1346)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1194)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:531)
2014-09-16 14:27:44,498 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 22acceea218d:50010:DataXceiver error processing WRITE_BLOCK operation src: /172.17.0.5:58867 dst: /172.17.0.5:50010
java.io.IOException: Corrupted block: ReplicaBeingWritten, blk_1073741859_7815, RBW
  getNumBytes()     = 910951
  getBytesOnDisk()  = 910951
  getVisibleLength()= 910951
  getVolume()       = /tmp/hadoop-root/dfs/data/current
  getBlockFile()    = /tmp/hadoop-root/dfs/data/current/BP-953099033-10.0.0.17-1409838183920/current/rbw/blk_1073741859
  bytesAcked=910951
  bytesOnDisk=910951
    at org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.createStreams(ReplicaInPipeline.java:218)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:214)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
    at java.lang.Thread.run(Thread.java:744)
I have no idea and I’m done trying to figure it out.
The next blog post will be about the shift to trying the ELK (Elasticsearch, Logstash, Kibana) stack. I hear you can pump data from Elasticsearch to Hadoop without too much trouble.