Work Experience

Here I describe my work experience and most significant projects undertaken after leaving full-time academic research in 2012. My previous academic work is described under academic research.

This page is essentially like my LinkedIn profile, except much longer. For a shorter and more fun-to-read version of both this and the research, see the highlights page.

The main intent here is not so much to say “hey look, I did lots of stuff” (any data scientist can say that after a few years), but, having been a generalist covering a lot of areas, to substantiate what some of those areas have been.

Dotscience (2019−2020)

From my LinkedIn:

Principal role as the leading data scientist at this startup company, in a generalist capacity. Wearer of many hats, including leading product, working with engineering, sales, marketing, UX, and customers. Responsible for Dotscience product roadmap, documentation, data science content, competitor analysis, and leading the data science side of customer engagements. Contributor to blog content, writing and giving conference talks, writing and giving product demos, product testing, and company strategy.

Oracle: Lead Data Scientist (2018−2019)

My Oracle position did not allow much time for substantive achievement, as I was only there 5 months before the position was eliminated, along with several thousand others, in spring 2019. This is a quick list of day−to−day items covered under my generalist role:

  • Learn & use datascience.com product
  • Learn & use Oracle Data Science Service
  • Learn & use Oracle Cloud Infrastructure
  • Man Oracle OpenWorld booth Oct 22−24 2018
  • Feedback on product definition and roadmap
  • Customer meetings (including travel to onsite)
  • Data science pitfalls talk
  • Fraud detection functional demo
  • Competitor analysis: pricing, cloud products
  • Meetings/discussions with about a dozen other Oracle groups on algorithms, data science process, product integrations, research, etc.
  • Presentation to Stanford undergraduate students via Oracle Recruiting group
  • Give feedback on, and approve or reject, submitted technical blog entries for the team’s public website

Also within the job scope, but with no opportunity to carry out, were

  • Lead a team of data scientists
  • Present Oracle Data Science Service to customers

Infosys: Principal Data Scientist (2017−2018)

Our group was under-resourced, not I think intentionally, but it was one of the reasons I left. It also limited my opportunities to mostly inward-facing, day-to-day generalist data science work. Examples include:

  • Bank POC: Fraud detection with supervised machine learning, including preparation of raw data and handling highly imbalanced classes (a sketch of the imbalanced-class handling follows this list)
  • Develop product roadmap: several dozen Jira tickets and epics
  • Product testing and feedback from user point of view
  • Demos: credit risk, customer churn, fraud detection, lead scoring, spam filtering
  • Demo videos
  • Pre-sales calls with salespeople and engineers, sometimes leading
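
The bank POC above mentions handling highly imbalanced classes; one common way to do this is to reweight the rare fraud class during training and to judge the model by precision-recall rather than raw accuracy. A minimal sketch of that idea, using scikit-learn and synthetic data as stand-ins for the actual (proprietary) tooling and the bank’s data:

```python
# Minimal sketch of supervised fraud detection with highly imbalanced classes.
# Illustrative only: scikit-learn and a synthetic dataset stand in for the
# proprietary tooling and the bank's real transaction data used in the POC.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with ~1% positive (fraud) class.
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.99],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0)

# class_weight="balanced" upweights the rare fraud class during training.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0)
clf.fit(X_train, y_train)

# With heavy imbalance, raw accuracy is misleading; area under the
# precision-recall curve is a more informative headline metric.
scores = clf.predict_proba(X_test)[:, 1]
print("Average precision:", average_precision_score(y_test, scores))
```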

Skytree: Data Scientist / Staff Data Scientist (2012−2017)

My most substantive industry data science experience is with the startup company Skytree, “The Machine Learning Company”, between 2012 and 2017. Skytree was founded by Alex Gray and others from the Georgia Institute of Technology in Atlanta, an outgrowth of the FASTLab there. This group held a number of records for the fastest machine learning algorithm implementations. The technology and team were acquired by Infosys in 2017.

Note, “POC” refers to a major, end-to-end paid project with a customer, with a view to putting the resulting Skytree models into production. Production deployment was part of the software’s capability, so this was always in scope. On projects I led, I was therefore responsible for turning the customer’s payment, up to six or seven figures, into a return on investment for them.

Lead data scientist on customer projects (2013+)
  • Critical role as lead data scientist on interactions with over 60 commercial companies
  • Including deals over $1M, international deals, onsite work (primarily data science, plus some customer training), and leading data science teams
  • Lead for about 20 projects over time: 5 of Skytree’s 10 paying customers, 10 other POCs, and many more of the 60+ pre-POC engagements
Example POCs

Unfortunately the company names cannot be made public. Some were international, outside the US, e.g., in Europe and Asia.

  • Equipment manufacturer: Turbine failure prediction, primarily time-series anomaly detection (a sketch of the idea follows this list)
  • Data science consulting company: Supported customers in Japan and trained them
  • Bank: Find more fraud than was being detected from customer reports. An upsurge in fraud in the previous year, including suspected specific targeting of the bank by groups, had left their existing system unable to cope with the fraud volume. Any approach that could detect more fraud, and thus prevent financial loss, was welcome.
  • Battery manufacturer: Battery failure prediction, so that their monitoring system could recommend replacing a rack before it fails
  • Wafer manufacturer: What-if prediction on wafer substrates
  • Domain name provider: Customer churn on domain name registry
  • Pharmaceutical company: Fibromyalgia diagnosis, supervised learning, ∼200 columns
  • Job recruiter: Compare Skytree to wise.io and Data Robot for machine learning on a 450 MB dataset of 37,973 resumes to predict who is a good candidate
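
The turbine failure POC above relied primarily on time-series anomaly detection. A purely illustrative sketch of the general idea, flagging readings that drift far from their recent rolling behaviour, and not the specific method used on the customer’s sensor streams:

```python
# Purely illustrative sketch of rolling-window time-series anomaly detection
# on synthetic "sensor" data; not the specific method used on the customer's
# turbine sensor streams.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sensor = pd.Series(np.sin(np.linspace(0, 20, 2000)) + rng.normal(0, 0.1, 2000))
sensor.iloc[1500:1510] += 3.0                      # injected fault-like excursion

window = 100
rolling_mean = sensor.rolling(window).mean()
rolling_std = sensor.rolling(window).std()
z = (sensor - rolling_mean) / rolling_std          # rolling z-score

anomalies = sensor.index[z.abs() > 4]              # threshold is a tuning choice
print("Anomalous time steps:", list(anomalies[:10]))
```
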
Research and development
  • Benchmarking: Led a benchmarking project with a team of data scientists and engineers comparing Skytree to competitors (MLlib, R, scikit-learn, etc.). The results were used for the remainder of Skytree’s existence on presentation slides for marketing, sales, technical audiences, and investors
  • Auto-featurization: Skytree command-line and SDK scripts for feature selection, e.g., backward elimination by variable importances (a sketch of the idea follows this list); Skytree snippets for auto-featurization; a list of desirable transforms and transform snippets
  • Data science roadmap: integral part of data science + engineering team defining and prioritizing our product roadmap. Authored many pages on our Atlassian Confluence system, and other material
  • Demonstrated the Skytree UI on multi-billion-element data (NYC taxi). The next step would have been to extend this to data preparation using the software’s PySpark-based transform snippet mechanism
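
The backward elimination mentioned in the auto-featurization item is the usual greedy loop: repeatedly fit a model, drop the feature it ranks least important, and keep the subset that cross-validates best. The real scripts drove the Skytree command line and SDK; a minimal sketch of the same idea with scikit-learn and synthetic data:

```python
# Minimal sketch of backward elimination by variable importances, the idea
# behind the auto-featurization scripts above. Illustrative only: the real
# scripts drove the Skytree command line and SDK; scikit-learn and synthetic
# data stand in here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=15, n_informative=5,
                           random_state=0)
features = list(range(X.shape[1]))
best_features, best_score = list(features), 0.0

while len(features) > 1:
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    score = cross_val_score(model, X[:, features], y, cv=3).mean()
    if score >= best_score:                 # remember the best subset seen so far
        best_score, best_features = score, list(features)
    # Drop the feature the current model ranks least important, then repeat.
    model.fit(X[:, features], y)
    weakest = features[int(np.argmin(model.feature_importances_))]
    features.remove(weakest)

print("Best CV score %.3f with features %s" % (best_score, best_features))
```
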
UI/SDK/CLI testing
  • Using Skytree as a customer would and giving feedback, especially versions 15.3 onwards
  • Feedback totalling hundreds of JIRA tickets (technical, user experience, presentation) since 2013 that significantly improved the product
  • Includes many documentation improvements and changes
Skytree Demos and Videos
  • Videos forming the main introductory content on the Skytree website, viewed by hundreds of people: technical differentiators, how to download and install, speed and automation. Go-to person in the company for videos because it was vital that the content be both technically correct and well presented
  • Four 15-minute videos of demo projects (income, fraud, churn, leads) for partner companies, plus a Hadoop virtual machine
  • Often asked by Skytree CEO to give demos to important prospective customers and VC partners in conference and in person, onsite and at the office
  • Demo material included on the Skytree Express free-download virtual machine, e.g., partial dependence (PDP) plots for the income demo. A later release had a full project walkthrough
  • Results using Skytree Server command line to be shown in demos, e.g., UFO sighting data, visualized by Leland Wilkinson’s student Tuan Dang
Other marketing, sales, outreach
  • Skytree star award from VP Worldwide Sales for work with sales in the field
  • Conference booths: talking to customers, answering general and technical questions, live demos
  • Booth scope included handling the entire process solo, e.g., in Dallas I did a “one-person conference” (talk, booth, carrying the booth supplies) because Skytree trusted me to represent the company
  • Meetups: Multiple presentations of Skytree at meetups, in the Bay Area and further afield, e.g., general Skytree overviews, demos, galaxy distances
  • Reviewed marketing collateral: combining suitability for a general audience with proper technical presentation in correct English
  • Toastmasters Competent Communicator certification (and commended by colleagues as best speaker among the technical people)
Other technical content
  • Skytree blog entries (external): Astronomy Data, Analyzing Massive Datasets Whitepaper, SFO use case, SDSS Galaxy Distances (x3), New York Taxi Dataset on 500 million row training set
  • Conference presentations (external): Lawrence Berkeley Laboratory MANTISSA day (poster), New York Data Science Environments Initiative (poster)
  • Tree of Knowledge (internal): Added about 50 questions internally, plus many answers
  • “Keeper of the data” (internal): Documentation on available public large datasets
  • Document templates: Based on content from Max Shron’s “Thinking with Data” and the CRISP-DM data mining scheme, these document templates for a Skytree proposal, Skytree training, and a final report, all for customers, formed the basis of a more rigorous framework for Skytree data science to solve business problems. The document templates were used on many occasions for Skytree final reports to paid customers.
  • Data Scientist Guide: Used by Skytree CTO Alex Gray for data scientist training, this formed the basis for Skytree data scientist best practice
  • Tutorial slide decks: Course on how to use Skytree, utilized by multiple data scientists to teach multiple commercial customers
Other work
  • National Science Foundation Review Panel for Astroinformatics
  • Lunch & learn talk
  • Presentation to sales on data science lessons learned
  • Internal research on petascale data
  • Wrote sections of published Skytree whitepapers
  • Review content of 4 patents and provide feedback/changes
  • Evaluate competitor UIs (e.g., H2O)
  • Skytree use case taxonomy

Plus ongoing small everyday contributions to various Skytree areas, as in any job of this type, e.g.:

  • Internal email and chat discussions
  • Internal meetings
  • Definition of epics (sets of tasks) for projects
  • Confluence pages
  • Sharing interesting/relevant stories from industry to ml-interest, etc.

Commended for an aptitude for organization, an even disposition with difficult customers and situations, being great to work with in the field with customers, and being the best speaker among the data scientists.

Incomplete Projects

As with any wide-ranging job, there were also projects that produced some work but were not completed. Several are astronomy-related but were carried out after I moved to Skytree. They are listed here for interest.

Significant work done
  • AstroML: Extend existing scikit-learn-based analyses to use Skytree. Alex Gray is a coauthor of the book Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data (Ivezic et al.), which has AstroML as its primary software
  • Image featurization: Use scagnostics code from Leland Wilkinson and his student Tuan Dang on SETI images. These allow nonlinear outlier detection.
  • Trillion element dataset on real data: “How can we transform a large-scale dark-matter simulation into something we can compare with our observations?” … “Find new and more accurate ways to produce mock observations of a simulated universe.” Run Skytree on 1 trillion rows of astronomical catalog data using the LBL Cori supercomputer in collaboration with NERSC/LBL and Stanford KIPAC astronomy. Colleague quote: “In the usual run of things, we’d use a mock catalog to produce simulated images, which we’d then run through SExtractor or similar. We’re cutting out both those steps − so we go directly from catalogs of mock (simulated) galaxies, to catalogues of the same galaxies with properties as observed by the telescope+SExtractor. This means we avoid the computationally expensive image simulation AND SExtractor image analysis pipelines.”
  • SETI: Featurize data using scagnostics, combining machine learning and image processing; save the Allen Telescope Array from throwing away 99% of its data; distinguish signals from no signal; classify signals; unsupervised full analysis of every pixel. Clustering qualitatively separated the image types, but there was insufficient training data at the time for supervised separation beyond signal/no signal. Had the results been better, the publication would have been, e.g., Ball N.M., Richards J., Harp G., et al., “Detection of SETI Signals Using Machine Learning”.
  • Skytree Server algorithms on big astronomy data: Run Skytree Server’s algorithms on astronomy datasets of 100 million+ objects, including finding outliers (k-means clustering, nearest neighbors, kernel density estimation; a sketch of the nearest-neighbor idea follows this list), classification (e.g., quasars), and data preparation via ETL. Formed the basis for Skytree demos showcasing its ability to scale to big data. Other uses for the algorithms: testing of the product before customers use it; academic publication(s), e.g., scaling and outliers papers; Skytree demos, e.g., outliers; datasets for benchmarking; and machine learning use case stories for marketing, e.g., classifying quasars. If published, this would have been, e.g., Ball N.M., et al., “Outliers in a Billion Rows of Astronomy Data”
  • Skytree scaling (2014): If published, this would have been, e.g., Ball N.M., Gray A., Ram P., Riegel R., and Schade D., 2014, “CANFAR+Skytree: The World’s First Cloud Computing Data Mining System for Astronomy” (to submit to, e.g., Astronomy & Computing). Coauthors were unable to document for publication the relation of the empirical scaling found to the complex theoretical expectations.
  • “Data Mining: Astronomical Discoveries through Exploration of Big Data” (Nick Ball, Ashish Mahabal, Kirk Borne): Invited review for New Astronomy Reviews special issue on Next Generation Sky Surveys. Became Ball N.M., Mahabal A., McConnell S., Borne K., 2014, “Data Mining: Astronomical Discoveries through Exploration of Big Data” (Invited review for New Astronomy Reviews special issue on Next Generation Sky Surveys), but insufficient manpower to finish
  • Cancer dataset: Distinguish cancer from non-cancer using supervised learning based on very high dimensional (100,000+) mass spectroscopy data
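
The outlier finding mentioned in the Skytree Server item above used, among other algorithms, nearest neighbors: objects far from their k-th nearest neighbour score as outlier candidates. Skytree Server ran this kind of analysis at the 100-million-plus-row scale; a minimal sketch of the idea at toy scale with scikit-learn:

```python
# Minimal sketch of nearest-neighbour outlier scoring: objects far from their
# k-th nearest neighbour are candidate outliers. Illustrative only: Skytree
# Server ran this kind of analysis at the 100-million-row scale; scikit-learn
# on a toy "catalogue" stands in here.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Toy catalogue: a dense cluster of points plus a few scattered ones.
catalogue = np.vstack([rng.normal(0, 1, size=(1000, 5)),
                       rng.uniform(-10, 10, size=(10, 5))])

k = 10
nn = NearestNeighbors(n_neighbors=k + 1).fit(catalogue)  # +1: a point is its own neighbour
distances, _ = nn.kneighbors(catalogue)
outlier_score = distances[:, -1]                         # distance to the k-th true neighbour

# The highest-scoring rows are the outlier candidates.
top = np.argsort(outlier_score)[::-1][:10]
print("Candidate outlier row indices:", top)
```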

Initial stages started

That’s a nice way of saying these projects didn’t go far enough to count as incomplete.

  • Widefield OuTlier Finder: Science working group for the Square Kilometre Array radio telescope
  • DRAO/EMU: Apply data mining to Dominion Radio Astrophysical Observatory data and the EMU project. The latter has working groups on both outliers and photometric redshifts
  • Drexel University: Classify quasars, in collaboration with researchers there
  • SETI+Kepler: Potential interest at the SETI Institute in using ML to find exoplanets in the Kepler data
  • University of Texas at Brownsville: Analyze LIGO and GEO600 gravitational wave data
  • Princeton: Demonstrate supervised learning on internet advertising data to use as part of Princeton degree course
  • NED-Z (2012+): Predict whether a NASA Astrophysics Data System paper contains non-redshift-based galaxy distances using supervised learning. Find what the distances are
  • NCSA Dark Energy Survey / Private Sector Program: Analyze DES data on NCSA supercomputers. Extend to Private Sector Program data on those machines
  • Photometric Redshifts for the MegaPipe Reductions of the Canada-France-Hawaii Telescope Legacy Survey (2009+): ApJ/AJ/MNRAS; assign full probabilistic photo-zs to all ∼26 million CFHTLS galaxies in 5 bands (∼130 million objects); the catalogue runs to billions of rows, the scale of next-generation sky surveys (a sketch of one common approach to probabilistic photo-zs follows)
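
For the probabilistic photo-zs in the last item, one common way to produce a per-galaxy redshift PDF rather than a single estimate is to treat the spread of per-tree predictions from a random forest as a rough distribution. A minimal sketch of that approach on synthetic “5-band photometry”, not the specific method planned for the CFHTLS/MegaPipe catalogue:

```python
# Minimal sketch of one common way to get a per-galaxy redshift PDF rather
# than a single photo-z estimate: treat the spread of per-tree predictions
# from a random forest as a rough probability distribution. Illustrative only,
# on synthetic "5-band photometry"; not the specific method planned for the
# CFHTLS/MegaPipe catalogue.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_train, n_bands = 5000, 5
mags = rng.normal(22, 2, size=(n_train, n_bands))          # fake 5-band magnitudes
z_spec = np.clip(0.1 * mags.mean(axis=1) - 1.5
                 + rng.normal(0, 0.05, n_train), 0, None)  # fake training redshifts

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(mags, z_spec)

galaxy = mags[:1]                                           # one galaxy's photometry
per_tree_z = np.array([tree.predict(galaxy)[0] for tree in forest.estimators_])

# A histogram of the per-tree predictions serves as a crude photo-z PDF;
# its mean and spread give a point estimate and an uncertainty.
print("photo-z = %.3f +/- %.3f" % (per_tree_z.mean(), per_tree_z.std()))
```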