Work Experience

Here I describe my work experience and most significant projects undertaken after leaving full-time academic research in 2012. My previous academic work is described under academic research.

This page is essentially like my LinkedIn profile, except much longer. For a shorter and more fun-to-read version of both this and the research, see the highlights page.

The main intent here is not so much to say “hey look, I did lots of stuff” (any data scientist can say that after a few years), but, having been a generalist covering a lot of areas, to substantiate what some of those areas have been.

Dotscience (2019−2020)

From my LinkedIn:

Principal role as the leading data scientist at this startup company, in a generalist capacity. Wearer of many hats, including leading product, working with engineering, sales, marketing, UX, and customers. Responsible for Dotscience product roadmap, documentation, data science content, competitor analysis, and leading the data science side of customer engagements. Contributor to blog content, writing and giving conference talks, writing and giving product demos, product testing, and company strategy.

Oracle: Lead Data Scientist (2018−2019)

My Oracle position did not allow much time for substantive achievement, as I was only there 5 months before the position was eliminated, along with several thousand others, in spring 2019. This is a quick list of day−to−day items covered under my generalist role:

  • Learn & use datascience.com product
  • Learn & use Oracle Data Science Service
  • Learn & use Oracle Cloud Infrastructure
  • Man Oracle OpenWorld booth Oct 22−24 2018
  • Feedback on product definition and roadmap
  • Customer meetings (including travel to onsite)
  • Data science pitfalls talk
  • Fraud detection functional demo
  • Competitor analysis: pricing, cloud products
  • Meetings/discussions with about a dozen other Oracle groups on algorithms, data science process, product integrations, research, etc.
  • Presentation to Stanford undergraduate students via Oracle Recruiting group
  • Give feedback on, and approve or reject, submitted technical blog entries for the team’s public website

Also within the job scope, but with no opportunity to carry out, were

  • Lead a team of data scientists
  • Present Oracle Data Science Service to customers

Infosys: Principal Data Scientist (2017−2018)

Our group was under-resourced, not I think intentionally, but it was one of the reasons I left. It also limited my opportunities to mostly inward-facing, day-to-day generalist data science work. Examples include:

  • Bank POC: Fraud detection with supervised machine learning, including preparation of raw data and handling highly imbalanced classes (a sketch of the imbalanced-class handling follows this list)
  • Develop product roadmap: several dozen Jira tickets and epics
  • Product testing and feedback from user point of view
  • Demos: credit risk, customer churn, fraud detection, lead scoring, spam filtering
  • Demo videos
  • Pre-sales calls with salespeople and engineers, sometimes leading
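
The bank POC above mentions handling highly imbalanced classes; one common way to do this is to reweight the rare fraud class during training and to judge the model by precision-recall rather than raw accuracy. A minimal sketch of that idea, using scikit-learn and synthetic data as stand-ins for the actual (proprietary) tooling and the bank’s data:

```python
# Minimal sketch of supervised fraud detection with highly imbalanced classes.
# Illustrative only: scikit-learn and a synthetic dataset stand in for the
# proprietary tooling and the bank's real transaction data used in the POC.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with ~1% positive (fraud) class.
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.99],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

# class_weight="balanced" upweights the rare fraud class during training.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0)
clf.fit(X_train, y_train)

# With heavy imbalance, raw accuracy is misleading; area under the
# precision-recall curve is a more informative headline metric.
scores = clf.predict_proba(X_test)[:, 1]
print("Average precision:", average_precision_score(y_test, scores))
```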

Skytree: Data Scientist / Staff Data Scientist (2012−2017)

My most substantive industry data science experience is with the startup company Skytree, “The Machine Learning Company”, between 2012 and 2017. Skytree was founded by Alex Gray and others from the Georgia Institute of Technology in Atlanta, an outgrowth of the FASTLab there. This group held a number of records for the fastest machine learning algorithm implementations. The technology and team were acquired by Infosys in 2017.

Note, “POC” refers to a major, end-to-end paid project with a customer, with a view to putting the resulting Skytree models into production. Production deployment was part of the software’s capability, so this was always in scope. On projects I led, I was therefore responsible for turning the customer’s payment, up to six or seven figures, into a return on investment for them.

Lead data scientist on customer projects (2013+)
  • Critical role as lead data scientist on interactions with over 60 commercial companies
  • Including deals over $1M, international deals, onsite work (primarily data science, plus some customer training), and leading data science teams
  • Lead for about 20 projects over time: 5 of Skytree’s 10 paying customers, 10 other POCs, and many more of the 60+ pre-POC engagements
Example POCs

Unfortunately the company names cannot be made public. Some were international, outside the US, e.g., in Europe and Asia.

  • Equipment manufacturer: Turbine failure prediction, primarily time-series anomaly detection (a sketch of the idea follows this list)
  • Data science consulting company: Supported customers in Japan and trained them
  • Bank: Find more fraud than was being detected from customer reports. An upsurge in fraud in the previous year, including suspected specific targeting of the bank by groups, had left their existing system unable to cope with the fraud volume. Any approach that could detect more fraud, and thus prevent financial loss, was welcome.
  • Battery manufacturer: Battery failure prediction, so that their monitoring system could recommend replacing a rack before it fails
  • Wafer manufacturer: What-if prediction on wafer substrates
  • Domain name provider: Customer churn on domain name registry
  • Pharmaceutical company: Fibromyalgia diagnosis, supervised learning, ∼200 columns
  • Job recruiter: Compare Skytree to wise.io and Data Robot for machine learning on a 450 MB dataset of 37,973 resumes to predict who is a good candidate
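
The turbine failure POC above relied primarily on time-series anomaly detection. A purely illustrative sketch of the general idea, flagging readings that drift far from their recent rolling behaviour, and not the specific method used on the customer’s sensor streams:

```python
# Purely illustrative sketch of rolling-window time-series anomaly detection
# on synthetic "sensor" data; not the specific method used on the customer's
# turbine sensor streams.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sensor = pd.Series(np.sin(np.linspace(0, 20, 2000)) + rng.normal(0, 0.1, 2000))
sensor.iloc[1500:1510] += 3.0                      # injected fault-like excursion

window = 100
rolling_mean = sensor.rolling(window).mean()
rolling_std = sensor.rolling(window).std()
z = (sensor - rolling_mean) / rolling_std          # rolling z-score

anomalies = sensor.index[z.abs() > 4]              # threshold is a tuning choice
print("Anomalous time steps:", list(anomalies[:10]))
```
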
Research and development
  • Benchmarking: Led a benchmarking project with a team of data scientists and engineers comparing Skytree to competitors (MLlib, R, scikit-learn, etc.). The results were used for the remainder of Skytree’s existence on presentation slides for marketing, sales, technical audiences, and investors
  • Auto-featurization: Skytree command-line and SDK scripts for feature selection, e.g., backward elimination by variable importances (a sketch of the idea follows this list); Skytree snippets for auto-featurization; a list of desirable transforms and transform snippets
  • Data science roadmap: integral part of data science + engineering team defining and prioritizing our product roadmap. Authored many pages on our Atlassian Confluence system, and other material
  • Demonstrated the Skytree UI on multi-billion-element data (NYC taxi). The next step would have been to extend this to data preparation using the software’s PySpark-based transform snippet mechanism
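
The backward elimination mentioned in the auto-featurization item is the usual greedy loop: repeatedly fit a model, drop the feature it ranks least important, and keep the subset that cross-validates best. The real scripts drove the Skytree command line and SDK; a minimal sketch of the same idea with scikit-learn and synthetic data:

```python
# Minimal sketch of backward elimination by variable importances, the idea
# behind the auto-featurization scripts above. Illustrative only: the real
# scripts drove the Skytree command line and SDK; scikit-learn and synthetic
# data stand in here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=15, n_informative=5,
                           random_state=0)
features = list(range(X.shape[1]))
best_features, best_score = list(features), 0.0

while len(features) > 1:
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    score = cross_val_score(model, X[:, features], y, cv=3).mean()
    if score >= best_score:                 # remember the best subset seen so far
        best_score, best_features = score, list(features)
    # Drop the feature the current model ranks least important, then repeat.
    model.fit(X[:, features], y)
    weakest = features[int(np.argmin(model.feature_importances_))]
    features.remove(weakest)

print("Best CV score %.3f with features %s" % (best_score, best_features))
```
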
UI/SDK/CLI testing
  • Using Skytree as a customer would and giving feedback, especially versions 15.3 onwards
  • Feedback totalling hundreds of JIRA tickets (technical, user experience, presentation) since 2013 that significantly improved the product
  • Includes many documentation improvements and changes
Skytree Demos and Videos
  • Videos forming the main introductory content on the Skytree website, viewed by hundreds of people: technical differentiators, how to download and install, speed and automation. Go-to person in the company for videos because it was vital that the content be both technically correct and well presented
  • Four 15-minute videos of demo projects (income, fraud, churn, leads) for partner companies, plus a Hadoop virtual machine
  • Often asked by Skytree CEO to give demos to important prospective customers and VC partners in conference and in person, onsite and at the office
  • Demo material included on the Skytree Express free-download virtual machine, e.g., partial dependence (PDP) plots for the income demo. A later release had a full project walkthrough
  • Results using Skytree Server command line to be shown in demos, e.g., UFO sighting data, visualized by Leland Wilkinson’s student Tuan Dang
Other marketing, sales, outreach
  • Skytree star award from VP Worldwide Sales for work with sales in the field
  • Conference booths: talking to customers, answering general and technical questions, live demos
  • Booth scope included handling the entire process solo, e.g., in Dallas I did a “one-person conference” (talk, booth, carrying the booth supplies) because Skytree trusted me to represent the company
  • Meetups: Multiple presentations of Skytree at meetups, in the Bay Area and further afield, e.g., general Skytree overviews, demos, galaxy distances
  • Reviewed marketing collateral: combining suitability for a general audience with proper technical presentation in correct English
  • Toastmasters Competent Communicator certification (and commended by colleagues as best speaker among the technical people)
Other technical content
  • Skytree blog entries (external): Astronomy Data, Analyzing Massive Datasets Whitepaper, SFO use case, SDSS Galaxy Distances (x3), New York Taxi Dataset on 500 million row training set
  • Conference presentations (external): Lawrence Berkeley Laboratory MANTISSA day (poster), New York Data Science Environments Initiative (poster)
  • Tree of Knowledge (internal): Added about 50 questions internally, plus many answers
  • “Keeper of the data” (internal): Documentation on available public large datasets
  • Document templates: Based on content from Max Shron’s “Thinking with Data” and the CRISP-DM data mining scheme, these document templates for a Skytree proposal, Skytree training, and a final report, all for customers, formed the basis of a more rigorous framework for Skytree data science to solve business problems. The document templates were used on many occasions for Skytree final reports to paid customers.
  • Data Scientist Guide: Used by Skytree CTO Alex Gray for data scientist training, this formed the basis for Skytree data scientist best practice
  • Tutorial slide decks: Course on how to use Skytree, utilized by multiple data scientists to teach multiple commercial customers
Other work
  • National Science Foundation Review Panel for Astroinformatics
  • Lunch & learn talk
  • Presentation to sales on data science lessons learned
  • Internal research on petascale data
  • Wrote sections of published Skytree whitepapers
  • Review content of 4 patents and provide feedback/changes
  • Evaluate competitor UIs (e.g., H2O)
  • Skytree use case taxonomy

Plus ongoing small everyday contributions to various Skytree areas, as in any job of this type, e.g.:

  • Internal email and chat discussions
  • Internal meetings
  • Definition of epics (sets of tasks) for projects
  • Confluence pages
  • Sharing interesting/relevant stories from industry to ml-interest, etc.

Commended for an aptitude for organization, an even disposition with difficult customers and situations, being great to work with in the field with customers, and being the best speaker among the data scientists.

Incomplete Projects

As with any wide-ranging job, there were also projects that produced some work but were not completed. Several are astronomy-related but were carried out after I moved to Skytree. They are listed here for interest.

Significant work done
  • AstroML: Extend existing scikit-learn-based analyses to use Skytree. Alex Gray is a coauthor of the book Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data (Ivezic et al.), which has AstroML as its primary software
  • Image featurization: Use scagnostics code from Leland Wilkinson and his student Tuan Dang on SETI images. These allow nonlinear outlier detection.
  • Trillion element dataset on real data: “How can we transform a large-scale dark-matter simulation into something we can compare with our observations?” … “Find new and more accurate ways to produce mock observations of a simulated universe.” Run Skytree on 1 trillion rows of astronomical catalog data using the LBL Cori supercomputer in collaboration with NERSC/LBL and Stanford KIPAC astronomy. Colleague quote: “In the usual run of things, we’d use a mock catalog to produce simulated images, which we’d then run through SExtractor or similar. We’re cutting out both those steps − so we go directly from catalogs of mock (simulated) galaxies, to catalogues of the same galaxies with properties as observed by the telescope+SExtractor. This means we avoid the computationally expensive image simulation AND SExtractor image analysis pipelines.”
  • SETI: Featurize data using scagnostics, combining machine learning and image processing; save the Allen Telescope Array from throwing away 99% of its data; distinguish signals from no signal; classify signals; unsupervised full analysis of every pixel. Clustering qualitatively separated the image types, but there was insufficient training data at the time for supervised separation beyond signal/no signal. Had the results been better, the publication would have been, e.g., Ball N.M., Richards J., Harp G., et al., “Detection of SETI Signals Using Machine Learning”.
  • Skytree Server algorithms on big astronomy data: Run Skytree Server’s algorithms on astronomy datasets of 100 million+ objects, including finding outliers (k-means clustering, nearest neighbors, kernel density estimation; a sketch of the nearest-neighbor idea follows this list), classification (e.g., quasars), and data preparation via ETL. Formed the basis for Skytree demos showcasing its ability to scale to big data. Other uses for the algorithms: testing of the product before customers use it; academic publication(s), e.g., scaling and outliers papers; Skytree demos, e.g., outliers; datasets for benchmarking; and machine learning use case stories for marketing, e.g., classifying quasars. If published, this would have been, e.g., Ball N.M., et al., “Outliers in a Billion Rows of Astronomy Data”
  • Skytree scaling (2014): If published, this would have been, e.g., Ball N.M., Gray A., Ram P., Riegel R., and Schade D., 2014, “CANFAR+Skytree: The World’s First Cloud Computing Data Mining System for Astronomy” (to submit to, e.g., Astronomy & Computing). Coauthors were unable to document for publication the relation of the empirical scaling found to the complex theoretical expectations.
  • “Data Mining: Astronomical Discoveries through Exploration of Big Data” (Nick Ball, Ashish Mahabal, Kirk Borne): Invited review for New Astronomy Reviews special issue on Next Generation Sky Surveys. Became Ball N.M., Mahabal A., McConnell S., Borne K., 2014, “Data Mining: Astronomical Discoveries through Exploration of Big Data” (Invited review for New Astronomy Reviews special issue on Next Generation Sky Surveys), but insufficient manpower to finish
  • Cancer dataset: Distinguish cancer from non-cancer using supervised learning based on very high dimensional (100,000+) mass spectroscopy data
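
The outlier finding mentioned in the Skytree Server item above used, among other algorithms, nearest neighbors: objects far from their k-th nearest neighbour score as outlier candidates. Skytree Server ran this kind of analysis at the 100-million-plus-row scale; a minimal sketch of the idea at toy scale with scikit-learn:

```python
# Minimal sketch of nearest-neighbour outlier scoring: objects far from their
# k-th nearest neighbour are candidate outliers. Illustrative only: Skytree
# Server ran this kind of analysis at the 100-million-row scale; scikit-learn
# on a toy "catalogue" stands in here.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Toy catalogue: a dense cluster of points plus a few scattered ones.
catalogue = np.vstack([rng.normal(0, 1, size=(1000, 5)),
                       rng.uniform(-10, 10, size=(10, 5))])

k = 10
nn = NearestNeighbors(n_neighbors=k + 1).fit(catalogue)  # +1: a point is its own neighbour
distances, _ = nn.kneighbors(catalogue)
outlier_score = distances[:, -1]                         # distance to the k-th true neighbour

# The highest-scoring rows are the outlier candidates.
top = np.argsort(outlier_score)[::-1][:10]
print("Candidate outlier row indices:", top)
```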

Initial stages started

That’s a nice way of saying these projects didn’t go far enough to count as incomplete.

  • Widefield OuTlier Finder: Science working group for the Square Kilometre Array radio telescope
  • DRAO/EMU: Apply data mining to Dominion Radio Astrophysical Observatory data and the EMU project. The latter has working groups on both outliers and photometric redshifts
  • Drexel University: Classify quasars, in collaboration with researchers there
  • SETI+Kepler: Potential interest at the SETI Institute in using ML to find exoplanets in the Kepler data
  • University of Texas at Brownsville: Analyze LIGO and GEO600 gravitational wave data
  • Princeton: Demonstrate supervised learning on internet advertising data to use as part of Princeton degree course
  • NED-Z (2012+): Predict whether a NASA Astrophysics Data System paper contains non-redshift-based galaxy distances using supervised learning. Find what the distances are
  • NCSA Dark Energy Survey / Private Sector Program: Analyze DES data on NCSA supercomputers. Extend to Private Sector Program data on those machines
  • Photometric Redshifts for the MegaPipe Reductions of the Canada-France-Hawaii Telescope Legacy Survey (2009+): ApJ/AJ/MNRAS; assign full probabilistic photo-zs to all ∼26 million CFHTLS galaxies in 5 bands (∼130 million objects); the catalogue runs to billions of rows, the scale of next-generation sky surveys (a sketch of one common approach to probabilistic photo-zs follows)
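
For the probabilistic photo-zs in the last item, one common way to produce a per-galaxy redshift PDF rather than a single estimate is to treat the spread of per-tree predictions from a random forest as a rough distribution. A minimal sketch of that approach on synthetic “5-band photometry”, not the specific method planned for the CFHTLS/MegaPipe catalogue:

```python
# Minimal sketch of one common way to get a per-galaxy redshift PDF rather
# than a single photo-z estimate: treat the spread of per-tree predictions
# from a random forest as a rough probability distribution. Illustrative only,
# on synthetic "5-band photometry"; not the specific method planned for the
# CFHTLS/MegaPipe catalogue.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_train, n_bands = 5000, 5
mags = rng.normal(22, 2, size=(n_train, n_bands))          # fake 5-band magnitudes
z_spec = np.clip(0.1 * mags.mean(axis=1) - 1.5
                 + rng.normal(0, 0.05, n_train), 0, None)  # fake training redshifts

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(mags, z_spec)

galaxy = mags[:1]                                           # one galaxy's photometry
per_tree_z = np.array([tree.predict(galaxy)[0] for tree in forest.estimators_])

# A histogram of the per-tree predictions serves as a crude photo-z PDF;
# its mean and spread give a point estimate and an uncertainty.
print("photo-z = %.3f +/- %.3f" % (per_tree_z.mean(), per_tree_z.std()))
```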