And now, for something completely different. Hadoop.

I’ve been playing with everything I can get my hands on (and can find time for) with Hadoop for a while now. Have tried Hortonworks on CentOS (seems like the Hortonworks “reference” platform), Hortonworks HDP on Windows, and HDInisght. I even think I may have a clue about what big data is and why it matters. But that’s a topic for another time. It’s intruiging that Hadoop on Linux has a built-in GUI dev tools (like Hue) but the Windows’ versions (put out only by Hortonworks/Microsoft) have no GUI except for the NameNode/YARN/HBase status pages, and almost mandate you work on PowerShell. I thought the Linux folks were the bigger command-line wonks. I haven’t tried any other distos (e.g. Cloudera) yet (or tried building it myself for that matter).

Over the weekend, I installed Hortonworks new GA distro for Windows. The main reason I installed it was to try out Hive 0.13 (with vectorized query – think SQL Server columnstore batch mode, ORC files are the columnstore) and Hive under Tez. Think of Tez as a more flexible, powerful, faster MapReduce. I did a single-node install for simplicity. Ran into some problems and anomolies that I’ll mention here. But did get most of what there is working. I say “what there is” because the reference diagram at mentions all the things I do see, but the Windows distro doesn’t include Accumulo (another fast data storage offering based on BigTable) and Solr (a search offering). Perhaps the CentOS distro does. And, of course, no Hue. And, although there’s an exe and it’s in the list of Windows services, HWI (Hive Web Interface) doesn’t start up. That’s been reported by others too.

BTW, this version isn’t available on HDInight yet. Maybe soon. Microsoft allocated some engineers to Hive 0.13 (and other phases of the Stinger Initiative), so I’m pretty sure they want it in HDInsight too.

There were the usual potholes in the docs, if you’re trying to follow the docs by rote. First, the docs warn you not to use path names with spaces in the Java or Python installs ( Good idea in any case (don’t have to put quotes around names in command lines) but especially with Unix ports to Windows. But they warn you in the Python directions and Python 2.7 doesn’t install in “Program Files”. Java does. And one of the instructions says you should add (step 5c) e.g C:\Java\jdk1.7.0_45\bin as the value for JAVA_HOME. Don’t use the “bin”. Step 6b is wrong too; the path should include the “bin” directory. Easy to fix. The install reports “JAVA_HOME is not set” if you mess it up. Ask me how I know…I should have been paying attention.

If you choose “Run Hive Under Tez” you must peform Section 6.1 “Setting up Tez for Hive”. But there’s a glitch there too. Step 5: “Copy the Tez home directory on the local machine into the HDFS /apps/tez directory” is wrong too.
%HADOOP_HOME%\bin\hadoop.cmd dfs -put %TEZ_HOME%* /apps/tez
should be:
%HADOOP_HOME%\bin\hadoop.cmd dfs -put %TEZ_HOME%/* /apps/tez
(note the slash after %TEZ_HOME% and before the *)

Until you do that, Hive fails starting up the command line app with /apps/tez/lib not found.

And its likely that step 6 should be:
%HADOOP_HOME%\bin\hadoop.cmd dfs -rmr -skipTrash /apps/tez/conf
because there is no conf3 directory.

With the right copy command the Tez smoke test and Hive under Tez run just fine. I did have trouble with the HCatalog and Hive smoke tests at first (something about hadoop user needing a GRANT, GRANTs are expanded in Hive 0.13), but after enough tries (3 reinstalls of the distro and 7-8 runs of smoke tests) it just started working. Don’t know exactly what I did to make that start happening. So all the smoke tests (including optional components) succeed.

So THANKS folks! Lots more to try. There’s a lovely Hive/Tez/vectorized query tutorial (do it sans-GUI) that shows the new features off. And I’m glad that I can specify precision/scale for decimal types. And all other Stinger features and additional ecosystem components too.

BTW, I ran this on Windows 2012 R2 OS, even though the docs say only Windows 2008 R2 an 2012 OS are officially supported. It lists these under minimal requirements.

Hopefully this information will be helpful to someone else.

Cheers, @bobbeauch