Work Highlights

This page summarizes some of the main highlights and results from my work since 2000, most recent first. It is of course biased toward the projects that were the most interesting and went well.

There are many colleagues without whom these highlights would not have been possible, and I am grateful to all of them.

Driving Dotscience MLOps Platform Product Roadmap, Developer Experience & Content Strategy (2019-2020)

As the most senior data scientist in the company (de facto Chief Data Scientist), I owned and drove the product roadmap for Dotscience, reporting directly to the CEO. In this role I combined feedback from sales, marketing, and engineering with my own professional judgment, skills, and experience to define and prioritize the most critical revenue-driving features, such as RBAC, A/B testing, and enhanced statistical model monitoring. I also took ownership of the product documentation to improve the developer experience on the platform, and of the content strategy, for which I authored key blog posts that staked out the company's whole value proposition in terms that make sense to the target audience, such as "Why do Data Scientists Need DevOps for Machine Learning (MLOps)?"

Previous roadmap areas I have covered include AutoML, auto-featurization, missing value handling, model performance, data preparation, time series, and scalability, as well as many smaller areas and details from the whole end-to-end data science process.

NYC Taxi Blog (2016)

I wrote various blog entries for Skytree, and later Dotscience. The New York taxi data one was nice because we showed a 500-million-row dataset being used as the training set for a machine learning model, within the Skytree graphical user interface. It also included data preparation and featurization from the raw public data. One improvement to the analysis would be to redo the data preparation in PySpark rather than streaming it through awk, a legacy of some of my astronomy coding from years before.

Training set of 433,437,625 rows loaded into the Skytree graphical user interface
Gradient boosted decision tree trained on this data using smart search model hyperparameter tuning
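
For illustration, here is a rough sketch of what the PySpark version of that data preparation might look like. The file name and column names (tpep_pickup_datetime and so on) are assumptions based on the public NYC taxi trip schema, not the original awk pipeline.

# Minimal PySpark sketch of the data preparation and featurization described above.
# File path and column names are assumptions from the public NYC taxi schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nyc-taxi-prep").getOrCreate()

trips = spark.read.csv("yellow_tripdata_2015.csv", header=True, inferSchema=True)

features = (
    trips
    # Drop obviously bad rows rather than passing them to the model.
    .filter((F.col("trip_distance") > 0) & (F.col("fare_amount") > 0))
    # Simple engineered features: trip duration in minutes, hour and weekday of pickup.
    .withColumn(
        "duration_min",
        (F.unix_timestamp("tpep_dropoff_datetime") - F.unix_timestamp("tpep_pickup_datetime")) / 60.0,
    )
    .withColumn("pickup_hour", F.hour("tpep_pickup_datetime"))
    .withColumn("pickup_weekday", F.dayofweek("tpep_pickup_datetime"))
    .select("duration_min", "pickup_hour", "pickup_weekday", "trip_distance", "fare_amount")
)

features.write.mode("overwrite").parquet("taxi_features.parquet")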

Trillion Element Training Set (2016)

I didn't write this one, but the data science and engineering teams did. It showed the Skytree command line using a dataset of 10 billion rows by 100 columns (hence 1 trillion elements) as the training set for its distributed gradient boosted decision tree, running on 10 Hadoop nodes. I did not see a comparably sized training set anywhere else at the time.

Skytree Benchmarking (2015)

I led the project benchmarking Skytree against various competitor software, showing single-node CPU speedups of 150× vs. R, 100× vs. scikit-learn, 71× vs. MLlib, and 2× vs. H2O and XGBoost. XGBoost is still cited as state of the art seven years later. It is likely that these and the other packages are better now, since they have had seven years to develop, and there are further tools such as Dask for Python, but the experienced people in both data science and engineering helped ensure the comparisons were fair at the time. Skytree was written from the ground up in C++, which was one reason for the speed; another was the expertise of the engineers from Alex Gray's FASTlab.
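
For a flavor of how such single-node comparisons are set up, here is a minimal sketch timing two gradient boosted tree implementations on the same synthetic training set. It is illustrative only, not the original Skytree benchmark; a fair comparison also needs matched hyperparameters, repeated runs, and identical hardware.

# Minimal sketch of a single-node wall-clock training comparison.
# The dataset and hyperparameters are placeholders, not the original benchmark setup.
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=50_000, n_features=50, random_state=0)

def time_fit(model, name):
    # Train on the shared dataset and report wall-clock time.
    start = time.perf_counter()
    model.fit(X, y)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.1f} s")
    return elapsed

t_sklearn = time_fit(GradientBoostingClassifier(n_estimators=100, max_depth=6), "scikit-learn")
t_xgboost = time_fit(XGBClassifier(n_estimators=100, max_depth=6, tree_method="hist"), "XGBoost")

print(f"speedup: {t_sklearn / t_xgboost:.1f}x")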

Skytree PoCs (2012-2017)

We did a wide variety of customer projects as PoCs, solving various business problems. A PoC (proof of concept) is used by many companies looking to add AI to their business as a way of showing that it can generate business value for them. It may sound hyperbolic, but as far as I can tell it is true: Skytree never lost a PoC (lost in the sense that a competitor had better model performance). Since I didn't write the software, I can't really take credit for this record, but working with a team and software capable of establishing it was definitely a highlight.

CANFAR+Skytree (2010-2013)

This one I can take more credit for, in the sense that I initiated the collaboration that led to it. CANFAR was a project of the Canadian Astronomy Data Centre, where I worked from 2009-13, and Skytree was founded in 2010. CANFAR provided a cloud-like system for astronomers, giving them access to large compute resources via virtual machines but with the queuing functionality of a supercomputer. Adding Skytree to it therefore combined the machine learning speed we saw above with the ability to run in parallel across 500 nodes. This was one of the first systems in the world to open up large-scale machine learning to astronomy in this way. Unfortunately it generated little interest at the time, but the fact that we made it possible is a highlight in itself for me. Even today many astronomers use Python for their ML, which, as we saw above, does not quite have the same performance.

IJMPD Review: Data Mining and Machine Learning in Astronomy (2009-2010)

In 2009 I was invited to write a literature review on Data Mining and Machine Learning in Astronomy for the International Journal of Modern Physics D. The 61-page article, published in 2010, has 272 citations as of May 2020 according to Google Scholar. Its distinction is similar to CANFAR+Skytree: one of the world's first examples of its kind, in this case a review of this subject in a refereed journal. As with CANFAR+Skytree, I don't want to claim it was the first, as I can't substantiate that, but having reviewed the literature (as you might expect given the type of article it is), there were no obvious precursors covering what would now be called data science in astronomy, or astroinformatics (or possibly data-intensive astronomy, which I quite like as a name).

Quasar Photometric Redshift Probability Density Functions (2008)

This was cool because, as far as I know, it solved a problem that hadn't been solved before: producing a sample of quasars whose distances are all accurate, rather than one containing catastrophic failures in which a significant fraction of the distances are completely wrong. Photometric redshift refers to measuring the distance to an astronomical object using only its images, without needing to take a spectrum, which is useful because far more objects have images than have spectra. You can thus get a much better map of the universe and where objects are within it, valuable for many different science questions.

The failures come about for quasars because, at certain distances, the bright lines in the spectra used for the training set drop between the color filters on a telescope, so quasars at different distances can appear the same color, making their distance ambiguous and hence producing failures in the photometric redshifts. The ML we did solved this: by using k nearest neighbors, specifically k=1, which you might expect to be a bad choice, the failures stopped blending into the good results and instead formed their own region on the diagram of predicted distance versus true distance. By itself this is not useful, but when we perturbed the inputs we could generate a "probability density function" (PDF) of possible distance values for each object, and the failures had two peaks. Removing the objects with two peaks left a subsample with accurate distances, something I hadn't seen done with another method.

Typical quasar redshift predicted-versus-true testing set performance from regular machine learning in 2008, showing large errors
Selection of quasar subset with single-peaked PDFs from k=1 nearest neighbors removed almost all bad objects

One thing I was never quite satisfied about was whether the "PDF"s we made were really probability density functions in a strict statistical sense, but the part about taking the objects with only one peak to get a subset with much better distances clearly worked. It was nice computationally too, because the nearest neighbor search was fast (thanks to a kd-tree implementation by a colleague at NCSA), and each object and perturbation were independent, so it was easily parallelized on the NCSA computer. The method is thus viable for any sky survey in the parameter space where at least some spectra have been taken (e.g., deep spectra for a narrow region of sky, from which you can get photometric redshifts + PDFs for much wider regions of sky to the same depth).
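
For illustration, here is a minimal sketch of the perturbation-plus-k=1-nearest-neighbors idea using scikit-learn (not the original NCSA kd-tree code). The choice of four photometric colors as features, the Gaussian noise model, and the peak-counting threshold are assumptions for the sketch, not the original implementation.

# Minimal sketch: perturb each object's colors, predict redshift with k=1 NN each time,
# build a histogram ("PDF") of the predictions, and keep only single-peaked objects.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Placeholder training set: four colors (e.g., u-g, g-r, r-i, i-z) and spectroscopic redshifts.
train_colors = rng.uniform(-1, 3, size=(5000, 4))
train_z = rng.uniform(0, 4, size=5000)

knn = KNeighborsRegressor(n_neighbors=1).fit(train_colors, train_z)

def redshift_pdf(colors, color_err, n_perturb=100, n_bins=40):
    # Perturb one object's colors by its photometric errors and collect the k=1 predictions.
    perturbed = colors + rng.normal(0.0, color_err, size=(n_perturb, colors.size))
    z_samples = knn.predict(perturbed)
    hist, _ = np.histogram(z_samples, bins=n_bins, range=(0, 4), density=True)
    return hist

def n_peaks(hist, rel_height=0.1):
    # Count contiguous runs of bins above a threshold; two runs flags a likely catastrophic failure.
    above = (hist > rel_height * hist.max()).astype(int)
    return int(np.sum(np.diff(np.concatenate(([0], above))) == 1))

# Keep only objects whose redshift "PDF" has a single peak.
obj_colors = rng.uniform(-1, 3, size=(100, 4))
obj_errs = np.full(4, 0.05)
single_peaked = [c for c in obj_colors if n_peaks(redshift_pdf(c, obj_errs)) == 1]
print(f"{len(single_peaked)} of {len(obj_colors)} objects kept")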

Morphological Galaxy Classification Using Artificial Neural Networks (2000-2004)

This was my master's thesis in 2000 and then part of my PhD work from 2001-4. Technically it included deep learning, because some of the networks had more than one hidden layer, but they were the classical backpropagation architectures of the time, not the modern CNN/LSTM/etc. We showed that ANNs could classify galaxies into morphological types, e.g., Hubble classes, with the same accuracy as human experts. The later well-known Galaxy Zoo project collected a few hundred thousand classifications by crowdsourcing over several years, if I remember correctly. They went into more detail and also obtained other results, worthy of many more papers, but the ANNs, once trained, could assign in under a minute classifications that took the crowd years to make.

Hubble tuning fork, one method of morphological galaxy classification (https://en.wikipedia.org/wiki/Hubble_sequence)
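
For illustration, here is a minimal sketch of the kind of classical feed-forward network described above, mapping measured galaxy parameters to morphological classes. The input features, class labels, and network size are placeholder assumptions, not the original thesis configuration, and scikit-learn's MLPClassifier stands in for the backpropagation code of the time.

# Minimal sketch: a small multi-layer perceptron classifying galaxies into morphological types.
# Inputs and labels are placeholders; real work would use measured galaxy parameters and expert labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder inputs: e.g., concentration, asymmetry, surface brightness, color, axis ratio.
X = rng.normal(size=(10_000, 5))
# Placeholder morphological classes: 0=elliptical, 1=spiral, 2=irregular.
y = rng.integers(0, 3, size=10_000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)

# Two hidden layers of sigmoid units, trained by gradient descent with backpropagated gradients.
mlp = MLPClassifier(hidden_layer_sizes=(16, 16), activation="logistic", max_iter=500, random_state=0)
mlp.fit(scaler.transform(X_train), y_train)

print("test accuracy:", mlp.score(scaler.transform(X_test), y_test))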