A few months ago, I embarked on a project to learn more about data mining, machine learning and, as a prerequisite, statistics. I was tired of hearing “statistics show that…” without enough “proof” and have taken a long side-trip toward Statistical Inference and Reproducible Research. I’d be much more interested in reading a scientific paper that comes with code and data than one that offers only the paper and foregone conclusions (possibly influenced by grant money). And after hearing “this would be clearer with an understanding of calculus”, I brushed up on my calculus and linear algebra as well (it had been over 40 years, after all). I’d been working in languages like Octave and R, and when Azure Machine Learning (AzureML) arrived, the idea that you could use R scripts in AzureML experiments was intriguing.
AzureML features a drag-and-drop canvas-based project system (the projects are called “experiments”). I tend to like programming with code over drag-drop-cut-paste, but the killer feature of AzureML may be that you can publish an endpoint that enables singleton or batch scoring, using your work. But back to the R scripts.
AzureML currently has about 25 sample experiments to use as templates or exemplars. After looking at a few of them, I thought it would be nice to find the experiments that use R scripts and see how the samples used them. Turns out that you run into problems doing this automatically. You can’t, at this point, persist an experiment using, say, an XML vocabulary or some JSON format. At all. To me, this was quite strange, because you can’t pick experiments up and move them, like you can with, say, an SSIS package. The only artifacts in the Azure storage account are the blob containers “experimentoutput” and “uploadedresources”, and neither one, as far as I could see, contains the experiment “definition” (i.e. the project).
I validated this observation (on the AzureML forum) and put in a request on the AzureML improvement suggestions site. Since there already is a fairly standard XML vocabulary called PMML, I suggested using that. Other reasons for having this available, in addition to the uses I’ve already mentioned, would be version control and an offline format in the event that you inadvertently delete your AzureML storage account.
To end this post with something useful, I did do the search for R scripts in the sample AzureML experiments “by hand”, by copying each sample, opening the canvas and typing each unique component it uses (but not the parameters or scripts) into Notepad. Enclosed is a file that contains this information; I hope you’ll find it useful. Please excuse any inadvertent typos.
Hopefully, in the future, this will be a simple case of querying the set of “experiment projects” with XPath. Or, if you’re XML-query-phobic, with grep.
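To make that wish concrete, here is a minimal Python sketch of what such a query might look like. It is purely illustrative: AzureML has no export format today, so the `<experiment>` and `<module>` elements, the `type` and `name` attributes, the file names, and the helper `experiments_using` are all hypothetical, invented here to show the idea rather than to describe any real vocabulary (PMML or otherwise).

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL export format: AzureML offers no such files today.
# Element names, attributes and contents are invented for illustration.
SAMPLE_EXPERIMENTS = {
    "sample_a.xml": """<experiment name="Sample A">
  <module type="Reader"/>
  <module type="Execute R Script"/>
  <module type="Train Model"/>
</experiment>""",
    "sample_b.xml": """<experiment name="Sample B">
  <module type="Reader"/>
  <module type="Split"/>
</experiment>""",
}


def experiments_using(module_type, documents):
    """Return the names of experiments containing a module of the given type."""
    matches = []
    for text in documents.values():
        root = ET.fromstring(text)
        # ElementTree supports a limited XPath subset, which includes
        # attribute predicates such as [@type='...'].
        if root.findall(f".//module[@type='{module_type}']"):
            matches.append(root.get("name"))
    return matches


print(experiments_using("Execute R Script", SAMPLE_EXPERIMENTS))
# prints ['Sample A']
```

With real files on disk, the grep equivalent would be a one-line text search for the module name across the exported experiments, again assuming such exports existed.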