This expands on my resume to give a fuller picture of my technical skills. For a wide-ranging generalist data scientist like myself, it is especially important to spell out what I know well, what I know some of, and what is outside my scope.
Put another way: after a few years, any data scientist has a giant stack of tools they have used or tried. So which ones do I know, know a bit of, or not know?
The page is subdivided into skills listed by
- Business considerations
- The various fields of data science
- Software tools
This reflects my view that these 3 areas are equally important for end-to-end data science. It’s also a good way of seeing what I should learn next.
Technical Skills: Business
As with the conceptual and tools parts of the subject on the two related pages, my data science business skills are the sum total of my experience in the field since the early 2000s. Here they are divided into the broad areas of management & leadership, and business use cases.
In my Infosys and Oracle jobs, part of the job description was to help hire and then lead a data science team. However, neither group was well resourced or staffed (a situation outside my control), so the opportunity never materialized.
Some general material about the relationship of the data science process to business problems and processes is on the data science page.
Management & leadership
Example | Details |
--- | --- |
POC primary data scientist | Formulation and solution of business problems with customers across multiple industries as key part of data science process |
Models in production | My company roles included aiding customers whose models were in production. CANFAR+Skytree was in production and available to the astronomical community |
Benchmarking | Led benchmarking project with team of data scientists and engineers on Skytree vs. competitors (MLlib, R, scikit−learn, etc.). Results used for remainder of Skytree’s existence on presentation slides for marketing, sales, technical, and investors |
Presentations & demos | Frequent business-critical presentations and product demos to customers up to the senior executive level in a variety of contexts |
Documentation templates | Conception and production of templates for documenting data science POC work, used on multiple projects totaling millions of dollars of paid customer value |
NGVS SWG group lead | Group lead on the Next Generation Virgo Cluster survey for the luminosity function Science Working Group. Also led the photometric redshifts subgroup |
GALEX PI | Principal investigator on Galaxy Evolution Explorer grant to improve galaxy distance estimates via photometric redshifts (total funding $40,000). Co-investigator on grants with funding totaling $970,000 |
Student mentor | Supervision of PhD student projects as postdoctoral researcher |
Teaching Assistant (TA) | Student lab supervision and grading homework assignments as graduate student |
Functional use cases
This gives example broad industry areas in which I have experience. Some specific example use cases are in work experience. Details such as company names and quantitative business value created are, however, confidential.
In each of these, machine learning was used to improve on existing customer results. The end-to-end process of business problem definition, data understanding & preparation, modeling, and production, along with documentation, interpretation, and presentation, was always in scope. Some functions listed have several example projects.
Area | Details |
--- | --- |
Customer churn | Predict which customers will stay with a company and which will leave |
Fraud detection | Detect fraudulent transactions in large datasets, including highly imbalanced data where non-frauds outnumber frauds 1000:1 |
Healthcare diagnosis | Disease diagnosis from pharmaceutical company data |
Human resources | Which resumes are likely to produce the best job candidates |
Predictive maintenance | Time to failure of various industrial equipment such as turbines or hard drives |
Recommender systems | Content recommendation for media, including appropriate level of variety |
Revenue prediction | Expected revenue or payout from, e.g., insurance |
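The fraud detection row above mentions 1000:1 class imbalance. A minimal sketch of one standard remedy, cost-sensitive class weighting, is below; the data, model choice, and parameters are illustrative inventions, not the original customer work.

```python
# Hypothetical sketch of handling heavily imbalanced data (e.g., fraud)
# via class weighting, so the model does not simply predict the majority
# class for everything. Synthetic data, not from any customer project.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 20000
X = rng.normal(size=(n, 5))
# Rare positive class: "frauds" sit in a shifted region of feature space
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n) > 3.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights errors by inverse class frequency
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

print("positive rate:", round(y.mean(), 4))
print("recall on rare class:", round(recall_score(y_te, clf.predict(X_te)), 2))
```

In practice, resampling (over/undersampling) and threshold tuning on a business metric are common alternatives or complements to class weights.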
Technical Skills: Data Science
This lists the areas I have worked on from the data science point of view rather than the tools point of view. All of these involve coding (e.g., in Python) and/or command line usage, some of it on distributed or HPC systems.
I have divided the list into 2 sections:
- Experienced: areas in which I have extensive experience. With these, I can “hit the ground running” to do data science.
- Some experience: areas I have worked in, either less than the above, or a long time ago. With these I could “hit the ground walking”, and get back up to speed if needed.
At the end, I have named a few further areas which I have not used, but would be nice to pick up given their utility.
Experienced
Area | Example(s) |
--- | --- |
AutoML | Skytree had a Bayesian-based search through algorithm hyperparameters. Its AutoML generalized this to multiple algorithms. |
Data featurization | Feature filtering, selection, and engineering in several domains |
Data preparation | Raw data to final analysis on many customer POCs and use cases. Also extensive in academia. |
Ensembles | Mixture of experts, ensemble gradient boosted tree in research and customer POCs |
Evaluation | Usual machine learning metrics, also business metrics such as direct dollar value |
Gradient-boosted decision trees | Many customer POCs: the well-tuned GBT was often best. As far as I know, Skytree never lost a POC on model performance. |
Interpretation | Model-dependent (e.g., variable importances), model-agnostic (e.g., partial dependence plots) |
K nearest neighbors | Showed Skytree linear scaling to 400 million objects on CANFAR cloud computing & data mining system |
Large datasets | Datasets too large to fit in memory thus not loadable by many analysis tools |
Model scoring metrics | Accuracy, recall/sensitivity/true positive rate, precision, F-score, Gini and normalized Gini, ROC/Lorenz curve, capture deviation, precision at k, MAE, ranking and yield scoring, random permutation variable importances, etc. |
Model tuning | Tuning hyperparameters of all ML algorithms listed in this table and the next one |
Model validation | Train/tune/test, cross-validation |
Models in production | In business, working with customers who had our software in production. In academia, projects backed by grant funding (up to 7 figures) had models that generated published results. Hence they were in production. |
Neural networks | Neural networks for galaxy classification and distance prediction (circa 2000–2004, hence pre-deep learning) [1,2]
Random forest decision trees | Suitable for some problems when GBT was not best (more stable, better variable importances) |
Research | Led Skytree benchmarking project, results used in almost every company customer pitch |
[2] Some of them were multilayer, so technically they were deep learning, but only with the basic fully connected architecture
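A few of the scoring and validation rows above can be illustrated concretely. The sketch below is not the original Skytree tooling; it uses scikit-learn on synthetic data to show cross-validated AUC, the Gini coefficient (Gini = 2·AUC − 1), and precision at k.

```python
# Illustrative metrics sketch on synthetic data (not customer data):
# cross-validated AUC, Gini derived from AUC, and precision at k.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, random_state=0)
model = GradientBoostingClassifier(random_state=0)

# Cross-validated AUC: the "model validation" row in practice
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

# Gini coefficient relates directly to AUC: Gini = 2*AUC - 1
gini = 2 * auc - 1
print(f"CV AUC = {auc:.3f}, Gini = {gini:.3f}")

# Precision at k: of the k highest-scored cases, what fraction are positives?
model.fit(X, y)
scores = model.predict_proba(X)[:, 1]
k = 100
top_k = np.argsort(scores)[::-1][:k]
print(f"precision@{k} = {y[top_k].mean():.2f}")
```

Precision at k is the natural metric when only the top-scored cases (e.g., accounts flagged for fraud review) will ever be acted on.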
Some experience
Area | Example |
--- | --- |
Clustering | Various algorithms, especially K-means as the most basic |
Deep learning | Dotscience demos using TensorFlow for image processing and hyperparameter tuning |
Density estimation | As part of Skytree software training for customers; software testing & feedback; company demos |
Generalized linear model | As baseline to compare to nonlinear ML models; includes logistic regression |
Imbalanced data | Customer POCs, e.g., fraud detection; company demos |
Model monitoring | Generally customers ran their own deployments and monitored them themselves, so deploying and monitoring models was not in my scope; we provided product support. At Dotscience I wrote demo material monitoring deployed models using PromQL |
Outlier / anomaly detection | Company demos, e.g., nearest neighbor outliers on millions of objects with 3D interactive visualization in Partiview |
Recommender systems | Company demos & customer training |
Singular value decomposition | Fast SVD as part of company software training for customers; software testing & feedback (SVD includes PCA) |
Sparse data | Especially for large datasets |
Support vector machine | Classification of high−dimensional data; linear and nonlinear |
Text analysis | Customer POCs, e.g., resume scoring; company demos, e.g., Skytree demo with UFO sighting reports |
Time series analysis | Customer POCs, e.g., predictive maintenance; company demos |
Two point correlation function | Excess probability versus random that an object (e.g., a galaxy) is within a given distance |
What-if / sensitivity analysis | Customer POCs, e.g., effect of changing data inputs; company demos |
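The outlier detection row above mentions nearest-neighbor outliers on millions of objects. A toy version of that idea, with invented data at a tiny scale, scores each point by its distance to its k-th nearest neighbor, so isolated points score highest:

```python
# Minimal sketch of nearest-neighbor outlier scoring (synthetic data,
# nothing like the original million-object demo scale).
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
cloud = rng.normal(size=(500, 3))        # dense cluster of inliers
outlier = np.array([[8.0, 8.0, 8.0]])    # one far-away point
X = np.vstack([cloud, outlier])

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own neighbor
dist, _ = nn.kneighbors(X)
scores = dist[:, -1]                             # distance to k-th true neighbor

# The appended far-away point should have the largest score
print("most outlying index:", int(np.argmax(scores)))
```

At large scale the brute-force pairwise approach breaks down; tree-based or approximate nearest-neighbor indices are what make this feasible on millions of objects.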
Would be nice to add
Area | Benefit(s) |
--- | --- |
A/B testing | As part of production |
Edge computing | Handle IoT |
Graphical models | Most datatypes can be re-expressed as a graph, greatly aiding data integration |
Image/speech/video processing | Flexibility of use cases |
Reinforcement learning | Superhuman performance or when little/no training data |
Transfer learning | Pre-trained models for GUI/nontechnical users, esp. cloud and deep learning |
Some other ideas not currently listed: Active learning, association rules, EM algorithm, Gaussian mixture modeling, hidden Markov models, ICA, information bottleneck, LDA, linear regression, naive Bayes, NNMF, particle swarm, semi-supervised, simulated annealing, SOM, stacking, wavelets
Technical Skills: Tools
I have divided the list into 2 sections: “experienced” and “some experience”. At the end, I have named a few further tools which I have not used, but am working on picking up, given their utility, when I get time.
Many of the tools in the latter 2 categories embody generic data science concepts with which I am familiar, lessening the learning curve of picking them up. For example, I would not need to learn about machine learning from scratch before learning to use TensorFlow.
Experienced
These are tools in which I have extensive experience. With these, I can “hit the ground running” to do data science.
Tool | Example(s) |
--- | --- |
Skytree | End-to-end data science from 2009–2018, including machine learning, large scale, customer POCs, product development, research, etc. GUI, SDK, and command line interfaces. |
H2O [1] | Internal company customer-facing machine learning demos; benchmarking |
Python | Extensive use since the mid-2000s as a data scientist and technical user, coding to solve problems (but not software engineering) |
Bash shell scripting | Many examples from last 20 years, e.g., preparing data too large to load into memory (awk, sed, etc.), managing distributed computation (e.g., NCSA supercomputers), etc. |
[1] By far my most extensive practical machine learning experience is in Skytree, which is no longer publicly available. H2O, which remains available, is now quite similar and embodies the same concepts, meaning I can solve the same problems with it. In the future, cloud computing tools (Amazon, Microsoft, Google) may become more similar too.
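The shell scripting row above mentions preparing data too large to load into memory with streaming tools like awk and sed. The same streaming idea carries over to Python; a tiny hedged sketch, with a hypothetical file and made-up column names, aggregates a CSV chunk by chunk so only one slice of rows is ever in memory:

```python
# Hedged sketch of out-of-core aggregation in pandas. The file stands in
# for a real multi-GB CSV; column names are hypothetical.
import pandas as pd

# Build a small CSV in place of the real oversized file
pd.DataFrame({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]}).to_csv(
    "big_file.csv", index=False
)

totals = {}
# chunksize keeps only one slice of rows in memory at a time
for chunk in pd.read_csv("big_file.csv", chunksize=2):
    for g, s in chunk.groupby("group")["value"].sum().items():
        totals[g] = totals.get(g, 0) + s

print(totals)  # per-group sums combined across chunks
```

For aggregations that are not simple sums (e.g., medians), chunked processing needs more care, which is where tools like awk pipelines, Spark, or Dask earn their keep.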
Some experience
These tools I have used, either less than the above, or quite a lot but a long time ago. With these tools I could “hit the ground walking”, and get (back) up to speed if needed. Because my work moved from one project and area of data science to another, this list is much longer than the one above.
Tool/Language | Example(s) |
--- | --- |
C | Generalized code for calculating galaxy luminosity functions from univariate to bivariate |
Cloud computing | CANFAR, Oracle Cloud Infrastructure (OCI) |
Condor | And other job scheduling systems in HPC environments |
FITS | Flexible Image Transport System, the standard astronomical data format for images and tables; enabled image processing and plotting of millions of points on one plot (used in later work also) |
Genetic algorithms | Feature selection for quasar photometric redshifts using multi-wavelength data, in collaboration with Illinois Genetic Algorithms Laboratory (IlliGAL) |
Grafana | Model monitoring. Pairs well with Prometheus. |
Hadoop/YARN | Skytree software used Hadoop/YARN when run distributed (many use cases), with the HDFS filesystem |
HPC | General supercomputer usage with allocated processor core hours, especially at NCSA in Illinois |
IDL | Interactive data language; was commonly used for astronomical data analysis |
LaTeX | Document typesetting for submission to refereed journals for publication |
Matlab | Neural network classification of galaxies; galaxy properties as a function of their environment |
Matplotlib | Sporadic use in Python code in many projects including publications |
NumPy, SciPy | As part of general Python usage |
Office | Microsoft 365, Excel, PowerPoint, and Word, as the de facto format for presentations shared with some colleagues and customers. Also the Mac equivalents Keynote & Pages. Not really lacking any needed skill in this, but included for completeness. |
Pandas | General data preparation where the tool’s flexibility made it the best choice, e.g., small-scale customer POCs |
Partiview | 3D visualization of large scale astronomy data that allows the user to fly through it |
Prometheus | Time series database suited to model monitoring. Pairs well with Grafana. |
PySpark | Generic large-scale data preparation for Skytree customer POCs |
R | Internal company benchmarking: set up equivalent tool hyperparameters versus Python, H2O, and others |
Scikit-learn | Demos and benchmarking (it was unsuitable for most customer work) |
SExtractor | Astronomical image extraction |
SQL | Querying astronomy databases for scientifically correct datasets used in publications, e.g., Sloan Digital Sky Survey |
VirtualBox | And other virtual machines in desktop and HPC environments. The Skytree free trial was distributed to customers on VirtualBox |
XGBoost | Internal company customer-facing machine learning demos; benchmarking |
Several others in the list were used in various papers (see publications page) or public-facing material.
Not experienced, but working on
These I am picking up as I get time.
Tool | Benefit(s) |
--- | --- |
Amazon SageMaker | Cloud computing end-to-end data science. Test Dotscience integrations |
Databricks/MLFlow | Data flows (looked at for competitor analysis with Dotscience) |
Datashader | Visualize large datasets |
Docker | Part of deploying models as microservices. Used indirectly within Dotscience |
GitHub | Part of dataflow on Oracle and Dotscience work |
Google Cloud | Cloud computing end-to-end data science |
Keras | Deep learning in TensorFlow |
Koalas | Scaling of Spark combined with flexibility of Pandas |
Microsoft Azure | Cloud computing end-to-end data science. Competitor analysis between Dotscience and Azure Databricks |
Plotly | Interactive inline notebook plots |
TensorFlow | Now a commonly used tool for enterprise data science, especially deep learning |
Grab-bag of more
It would be nice to try these but it’s better to focus on the above first.
Tool | Benefit(s) |
--- | --- |
Cython | Speed up Python code |
Dask | Combine Python ease of use with scale |
Eli5 | Visualize machine learning models |
ELKI | Advanced data clustering with many algorithms |
fast.ai | Deep learning on PyTorch |
Gensim | Topic modeling |
Geopandas | Geographical data |
GeoPy | Geographical data |
Google Facets | Visualize image datasets |
ggplot2 | Plotting in R |
Isolation forest | Outlier detection |
Julia | Combine ease of language use with speed |
Kafka | Streaming data |
Kubeflow | Data flows |
Kubernetes | Container orchestration for deploying models as microservices |
Neo4j | Graph database |
NetworkX | Study graphs and networks |
NLTK | Natural language toolkit |
Numba | Speed up Python code |
One-class SVM | Outlier detection |
ONNX | Interchangeable neural network models |
Parquet | Columnar data is faster than row-based |
PyTorch | Deep learning competitor to TensorFlow, widely used in research |
R Shiny | Interactive web apps from R |
Scagnostics | Outlier detection |
Seaborn | Statistical plots |
spaCy | Python NLP |
StatsModels | Python statistics |
Tableau | Tell the data story |
Theano | Deep learning |
Trifacta | Data preparation |
Not Listed
Used, but usage too small or obscure to be useful now:
Basic, C++, D2K, FPGA, GPU, Hive, Java, Lisp, MLlib, Perl, Tcl/Tk, Visual Basic
Not used but well-known:
Bokeh, C#, Caffe, Cassandra, Dataiku, Data Robot, Flask, Go, HBase, KNIME, libSVM, Mahout, MongoDB, MXNet, .NET, PHP, Pig, PyCharm, PyMC3, RapidMiner, Ruby, Rust, SAS, Scala, SPSS, Talend, Watson, Weka
etc.