Project Euler!

Well it appears I am quite late to the party. The other day while trying to solve a CodeChef problem I stumbled across Project Euler. It was right up my alley as I’ve been trying to get into sites that help me practice algorithms and math.

Briefly some notes after my first 37 problems:

1. Project Euler encourages me to write really bad code. Since there are no time or memory constraints (just give us the right answer), I end up writing the worst possible code to brute force answers.
2. I love Python. What else is there to say? There is a reason Python is one of the most popular languages for solving Project Euler problems. String parsing, math functions, the terseness of the language, the ability to easily handle large numbers … it’s just fantastic.
3. I am about done with C++ for problems. The syntax and constant Googling of how to use various data structures became too annoying for me. Plus the extra few steps of having to compile and then run … it’s not that much, but it adds up. Again I love just running Python code without having to compile first.
4. It’s addicting. The site was made for someone with my personality. Goals, levels, competition, etc.

I read a blog the other day of a guy talking about how Project Euler was boring. He had solved about 150 problems (he even contributed a problem!) and then got tired of the site. Something along those lines.

Well I plan to beat that guy. We’ll see how it goes. =)

Project Euler!

Best Practice Exhaustion

I’m sure everyone has days like this – you feel a little worn out physically and exhausted mentally. I always know I’m extra fatigued mentally when I lose all interest in my work. Just like when I was racing mountain bikes, some days your mind refuses to go out and train. It’s too much. The idea of shirking your responsibility and doing anything except what you are supposed to do is irresistable.

While considering how I came to this point of fatigue, a new thought came to me in regards to development.

From time to time I believe I experience best practice exhaustion.

Work is filled with people that know best practices, but any idiot can read and remember rules. Most best practices start with the words “always” and “never”. Some examples:

1. Always check in your code to version control.
2. Always write unit tests.
3. Always do code reviews.
4. Never let an unanswered email fester.
5. Never let someone else drop the ball on a project.
6. Never use XYZ data structure, ABC pattern.

In software, it’s easy to compile a long list of best practices. Things that every good developer will always do.

Here are some implications of best practices:

1. You are responsible for any and all problems with a project.
2. If other people are lazy/non-responsive/not professional, that’s your fault too.
3. If a bug appears, and you could have prevented it, you’re guilty. Why didn’t you follow best practices?
4. If you’re exhausted at your desk and decide to stare aimlessly people wonder, why aren’t you working?
5. Not only are you responsible for coding, but you are responsible for project management, people management, specifications, deployments, etc.
6. Your velocity is always measured by code only and not these other factors.

So I find myself sometimes stuck in a bit of a rut. I take a task at work, I think about it, start developing and this laundry list of do’s and dont’s floods my thought process.

“Have I checked in every piece of code I’ve ever written? Even one off scripts? Even trivial database scripts?”
“Did I release something without a code review?”
“Did I log my time perfectly in XYZ tool?”
“Does everyone always know what I’m doing at all times? Did I send enough emails?”
“Have I thought about every possible boundary/use case no matter how obscure?”

I could go on and on and on.

The bottom line that summarizes this post is the statement: You never have any excuse for not doing something perfectly.

At least that’s how software engineering feels at times. I have found little grace and forgiveness amongst co-workers for simple human error.

We’ll see if there are solutions to the “no excuses” problem going forward.

Best Practice Exhaustion

Code Chef Problem – Distinct Prime Factors (KPrime)

Whew. This problem took me the better part of a week to solve. The first few days I misunderstood the problem and went off on a bit of a rabbit trail. Even though I have consistently learned to not rush head long into solving a problem before properly understanding it, I keep repeating this mistake.

The Problem

The problem is in two parts:

1. Find the number of distinct primes for a number n. (For example, 10=2*5, so that is 2 distinct primes. 20=5*2*2, so again, 2 distinct primes)
2. For a range of numbers, for example, 1 to 100, how many numbers have k number of distinct primes?

An example:

If we consider the numbers 4 to 10, with k=2. The question is … 4,5,6,7,8,9,10 … out of those numbers, how many have 2 distinct prime factors?

The answer?

– 4=2*2, so it has 1 distinct prime factor
– 5=5*1, so it has 1 distinct prime factor
– 6=2*3, so it has 2 distinct prime factors
– 7=7*1, so it has 1 distinct prime factor
– 8=2*2*2, so it has 1 distinct prime factor
– 9=3*3, so it has 1 distinct prime factor
– 10=2*5, so it has 2 distinct prime factors

The answer must be 2 then. 6 and 10 have two distinct prime factors.

The constraints were that the range could go up to 100,000 and k could range from 1 to 5.

First Swings

Initially I thought of actually generating prime number combinations. For example, if k=3, I would generate [2,3,5], [3,5,7], [5,7,11], etc. The idea being that order doesn’t matter and numbers can’t be repeated. The problem was that this would only yield numbers that had uniquely distinct prime factors. In other words, [2,2,3,3,5,5,5] also have three distinct prime factors, so in that sense it is the same as [2,3,5].

I abandoned this approach after trying it out for way too long.

My Good Friend OEIS

After Googling around a bit I ended up on this page: OEIS A001221.

For algorithm competitions, OEIS (Online Encyclopedia of Integer Sequences) is invaluable.

Unfortunately there were no code samples for generating k-distinct primes on the OEIS page. I tried to read the math specifics on Wikipedia and various other math sites and found the function I was looking for was omega(n). Even though I knew the function, I couldn’t find any good implementations of it.

New Approach

I did actually see something on OEIS that gave me an idea. There was a MAPLE implementation that inspired me to do this:

1. Loop over all the primes under 100,000.
2. For each prime, go through the multiples of it within a limit. (The limit ended up being ceil(100000 / prime(i)) )
3. For each number in (2), we add 1 to an index in a vector. For example, if we’re looking at 7 for the prime, we would have:
a. nums[7*1] = nums[7*1] + 1
b. nums[7*2] = nums[7*2] + 1
c. nums[7*3] = nums[7*3] + 1
d. …

In this way I could create a vector that cached the distinct prime count for numbers 1 to 100,000. Then when a range was given to me, I could iterate over the vector and count how many times a given number appeared.

Temptation to Cheat

At one point in my search for an implementation – or at least good written explanation – of omega(n), I came across this page: http://stackoverflow.com/questions/17545888/number-of-distinct-prime-factors-of-a-number.

A guy was doing the same problem and gave up and asked Stack Overflow to finish it for him. Not cool.

The temptation was there to just give in after a week of work. I could just read the answers on SO and submit the problem and be done with it. However I hadn’t spent all that time just to quit and cheat.

First Submission

My first submission failed from the time constraint. A few weeks ago I made a subtle transition over to C++ for algorithm competitions because I had to use it for a problem on TalentBuddy. The language is still fairly new to me, so when my first submission failed I wasn’t surprised. I must have missed some major performance issues with my code.

The good news was that I didn’t get a “Wrong Answer” reply.

Micro Optimizations Work!

Every single time an algorithm has failed on CodeChef, micro optimizations have not worked. I’ll try minimizing variables, making loops shorter, etc. and it never works. There is always a major issue with the algorithm that only a major re-factor can fix.

For the first time, I was able to do micro optimizations and it worked.

I did some performance profiling and was shocked that my prime generation function only took 2ms to run. Moreover, the caching of k-distinct primes only took 23ms! 25ms out of 1000ms allowed was spent caching the data I needed.

That meant looping through the vector of k-distinct primes was too slow! This really surprised me because I figured that code was as good as it could get.

The big change was accessing a vector by index instead of using .at().

nums.at(i)

… became …

nums[i]

Just like that my vector caching of k-distinct primes went from 23ms to 10ms. My submission finally worked!

Conclusion

C++ is very powerful and micro optimizations add up. Over time I figure I’ll learn more and more of them. I’m excited to see how other people solved this problem.

Code: https://github.com/ryan1234/codechef/tree/master/jul-2013/kprime

Code Chef Problem – Distinct Prime Factors (KPrime)

Integration Testing with Entity Framework

At work we have a .NET codebase that is largely uncovered by unit tests. Currently we have about 20% coverage and unfortunately a large chunk of the untested code is integrated tightly with Entity Framework.

After searching the web for a while I realized that it was going to be very tough to unit test anything related to Entity Framework. The best I could hope for was writing integration tests.

I’ve come across the problem of doing integration testing against databases before and each time I’ve failed miserably. Two examples stand out in my mind:

Youthful attempts at MySQL integration testing

I tried to do integration testing with MySQL for a SaaS company many years ago. The steps went something like this for testing:

a. Restore a production backup for MySQL to a new instance.
b. Run a test against the database.
c. Check the diffs to make sure we wrote data correctly, updated data, etc.
d. Kill the instance.
e. Repeat from (a) for the next test.

The problem was that it took too long to restore the database for each test. Also the schema and data changed often enough that it ended up breaking the tests that depended on a static database.

Overall the approach was just too complex to manage. I was in a bit over my head at that time in my career.

More mature integration testing with Mongo and MySQL

Fast forward a few years and I found myself needing to do database integration tests with Mongo and MySQL.

This time instead of trying to create a live instance of MySQL, I came across the idea of creating an embedded database in memory that my tests would point to. For MySQL I ended up using H2 since the data types were fairly similar. I tried other databases, but things like Derby had compatability issues when trying to map data/schemas from MySQL.

For Mongo we used a project called “Embedded Mongo” and it worked pretty well.

Even though I successfully got both integration tests off the ground and working, I would consider the effort a failure. Why?

After several months of heavy development around the databases, I realized I had neglected the integration tests. I simply did not keep up with the maintenance of them and by the time I re-visited them, they were all broken.

The effort was a failure because I made the decision to abandon the integration tests instead of update them. The gap between reality and the tests had grown so large that I didn’t have the mental energy to fix the issues.

Entity Framework Integration Testing

Back to present day, and I find myself in a familiar place. I want to do integration tests against a database. I remember all the lessons I’ve learned from the past:

1. Make sure your tests can be updated easily in the future.
2. Try to use an embedded/in memory database.
3. Speed matters. You don’t want to create an entire database from scratch for each test.
4. Complexity kills. It’s easy to spiral out of control with complexity making the testing work 10x harder than the actual implementation.

After spending a few days waist deep in thought and neck deep in code, I’ve given up for the time being on doing integration testing with Entity Framework.

In a way I’m proud of myself because in the past I would easily spend (or waste I suppose) weeks trying to nail a solution. The older I get, the better I’ve become at cutting my loses. Better to abandon an approach entirely and come back to it later than be stubborn and let other projects suffer.

Some things I noticed from my attempts to test Entity Framework:

1. It’s confusing. The documentation, generated code, the articles online … nothing was intuitive to me.
2. Lots of hacks. The people that want to do integration tests around EF are getting really creative. I always get nervous when I don’t sense a community concensus around an approach.
3. SQL Server Compact Edition (CE) is a poor choice to mimic SQL Server. Lots of people suggested replacing a live SQL Server instance with the embedded version. While I was able to get this working, the schema discrepancies were too dramatic to make it feasible. Plus it was confusing how I could alternate between SQL Server and SQL Server CE in EF.
4. Complexity. It was just too complex. Yes there are tools to generate a schema that you can use to construct a database in SQL Server CE. Yes there are tools to generate data for SQL Server CE. Yesyou can automate all of this. No it cannot be easy.

Conclusion

The real issue at work might be that we’ve hidden a ton of Linq that ties directly to Entity Framework in one database access class. The design from the start didn’t have unit testing in mind and it might be too late with project deadlines, etc. to refactor.

I would agree with a statement I read online that it’s a poor substitute to do testing against Linq to objects because EF generates actual SQL from the Linq you’ve written. It’s best to test the actual SQL generation to find potential bugs.

One of these days I hope to find a solid approach that will span technologies for doing integration testing with databases.

Integration Testing with Entity Framework

Broken Windows

In the book Pragmatic Programmer the author outlines the theory of Broken Windows in regards to coding. Jeff Atwood has a nice article about the theory here.

Broken Window Theory: There was a study done that showed that when a building suffered a broken window, other windows were quickly broken and the building fell into disrepair. The theory is that if you leave something in a broken state – or neglect a flaw – the system as a whole will quickly degrade. The Pragmatic Programmer book argues the same thing happens with code.

Broken windows in regards to code look like the following:

1. Comments that have become wrong/incorrect.
2. Conventions being broken – for example inconsistent names for objects.
3. Lazy implementations that could easily be improved upon.
4. Unorganized code and data structures.

The idea is that as you work with a codebase, if you see some of the things above, you are likely to believe no one cares for the code anymore. There are broken windows all over, so why try hard to write good code? Why try to refactor existing systems and classes? Why do extensive testing?

Lately I’ve found myself falling into this mindset when working with legacy code. At the end of this article I’ve listed a few broken windows I’ve found in the past few months. At a low level in my brain I find myself making excuses for taking shortcuts with my code. There are so many broken windows, so why not skip testing everything I touched?

My resolution is that even though I might be working with legacy code, I need to uphold the same standards I would have for new code.


Some examples:

1. A comment in code:

// UN-COMMON THIS LINE WHEN READY TO ROLL

2. A comment in a stored procedure in the database:

// COMMENTS: What is IndividualDates all about ? 

3. We need to be using source control:

// DISABLE CACHING FOR NOW 9/26/06
// Check cache to see if cache exists

4. Justin needs to test:

//Added true so the menu is always fresh
//You can remove....but justin wants to test the menu

5. Updating!

            if (e.CommandName == "Up")
            {
                if (indx > 0)
                {
                    newCat = lCats[indx - 1];
                    upDateDB = true;
                }
            }
            else
            {
                if (indx < lCats.Count - 1)
                {
                    newCat = lCats[indx + 1];
                    upDateDB = true;
                }
            }

6. Good luck reading that for loop.

            for (int i=0;i<lCats.Count;i++)
                if (lCats[i].MenuCategoryID == mCatID)
                {
                    indx = i;
                    oldCat = lCats[i];
                }

7. Long, long, long. Found a 700 line function to initialize a web forms .aspx page. 850 lines total in the class, so about 82% of the code in this class is dedicated to one method!

Broken Windows

Stubbornness with Tools

For my entire computing life I’ve been slow to adopt new tools. I always end up learning a program or tool and then stick with it no matter what.

Some quick examples:

Purpose Old/Comfortable Huge Advance
Writing Java vim Eclipse (even though I generally hate Eclipse)
Operating System DOS Windows 95
Html/Javascript Notepad Notepad++
Javascript Pure Javascript jQuery
Prototyping Temp .NET Projects LinqPad

The latest entry is using LinqPad to do quick C# prototyping. In the past if I wanted to test an algorithm or some simple function, I would create a new project, call it “Sandbox” and start coding. It would take several minutes to make the console project, make a bunch of static methods and then test. Then I’d have to open a CLI prompt and run it. Bleh.

With LinqPad I can start just banging out C# and run it without all the extra “stuff”. Awesome.

I heard about LinqPad years ago, but refused to look into it. Like all the other entries in my list above, it really comes down to pride.

Deep down in my brain there is this thought, “You’ve wasted a ton of time by not using XYZ tool!” It’s hard to admit that for years I wrote raw Javascript to accomplish tasks that jQuery could have done far better. It’s also hard to admit that I would notice other developers using Notepad++ and thought, “Ha! So weak. Notepad++. I don’t need that.” Now I can’t imagine being without it!

The lesson for me is to be quicker to adjust to new tools that can make me more productive.

Stubbornness with Tools

String Searching Analysis – Rabin Karp versus Naive Search

I recently finished an algorithm problem that involved implementing the Rabin Karp search algorithm.

After reading several articles about it, I thought it would be fun to run some tests to prove how much faster it is than a naive search algorithm.

Unfortunately most of my tests show that it is actually far worse than two naive/brute force approaches that I used. This surprised me because on TalentBuddy my initial naive approaches were too slow to pass their tests. Only after implementing Rabin Karp could I get the tests to pass.

I’ve been anxious to start using R to visualize data, and this was a great opportunity to do so.

Benchmark details:

1. For each graph, the tests were run 10 times back to back and the graphs show the mean of each algorithm.
2. The random data was generated from an English dictionary.
3. 100,000 bytes were taken from the end of the randomly generated file and this was used as the search pattern. Then I concatenated the random file together four times to make the pattern appear four times.
4. It was a bit hard to find long common phrases in the Bible and Shakespeare.
5. The slow behavior of Rabin Karp seemed to take place in the “sliding” part of the algorithm. When it moves up one character to do a comparison … that loop was the slowest part.

Naive 1 = My best attempt at a naive search.
Naive 2 = http://www.geeksforgeeks.org/searching-for-patterns-set-1-naive-pattern-searching/
Rabin Karp = http://www.geeksforgeeks.org/searching-for-patterns-set-3-rabin-karp-algorithm/

results4

results3

results2

results1

I used this article to generate my random word file.

All my code, scripts, data, etc. can be found here: https://github.com/ryan1234/talentbuddy/tree/master/mapreduce/hashstring/analysis/hash-string

String Searching Analysis – Rabin Karp versus Naive Search