Technical Skills

This page expands on my resume to give a fuller picture of my technical skills. For a wide-ranging generalist data scientist like myself, it is especially important to spell out what I know well, what I know some of, and what is outside my scope.

Put another way: after a few years, any data scientist has accumulated a giant stack of tools they have used or tried. So which ones do I know, which do I know a bit, and which don't I know?

The page is subdivided into skills listed by

  • Business considerations
  • The various fields of data science
  • Software tools

This reflects the equal importance I assign to these three areas for end-to-end data science. It’s also a good way of seeing what I should learn next.

Technical Skills: Business

As with the conceptual and tools parts of the subject on the two related pages, my data science business skills are the sum total of my experience in the subject since the early 2000s. Here they are divided into two broad areas: management & leadership, and business use cases.

In my Infosys and Oracle jobs, part of the job description was to help hire and then lead a data science team. However, neither group was well resourced or staffed, a situation outside my control, so the opportunity never materialized.

Some general material about the relationship of the data science process to business problems and processes is on the data science page.

Management & leadership

  • POC primary data scientist: formulation and solution of business problems with customers across multiple industries, as a key part of the data science process
  • Models in production: my company roles included aiding customers whose models were in production; CANFAR+Skytree was in production and available to the astronomical community
  • Benchmarking: led a benchmarking project with a team of data scientists and engineers comparing Skytree to competitors (MLlib, R, scikit-learn, etc.); the results were used for the remainder of Skytree’s existence in presentation slides for marketing, sales, technical staff, and investors
  • Presentations & demos: frequent business-critical presentations and product demos to customers, up to the senior executive level, in a variety of contexts
  • Documentation templates: conception and production of templates for documenting data science POC work, used on multiple projects totaling millions of dollars of paid customer value
  • NGVS SWG group lead: group lead for the luminosity function Science Working Group of the Next Generation Virgo Cluster Survey; also led the photometric redshifts subgroup
  • GALEX PI: principal investigator on a Galaxy Evolution Explorer grant to improve galaxy distance estimates via photometric redshifts (total funding $40,000); co-investigator on grants with funding totaling $970,000
  • Student mentor: supervision of PhD student projects as a postdoctoral researcher
  • Teaching Assistant (TA): student lab supervision and homework grading as a graduate student

Functional use cases

This section gives broad example industry areas in which I have experience. Some specific example use cases are given under work experience. Details such as company names and quantitative business value created are, however, confidential.

In each of these, machine learning was used to improve on existing customer results. The end-to-end process of business problem definition, data understanding & preparation, modeling, and production, as well as documentation, interpretation, and presentation, was always in scope. Some functions listed have several example projects.

  • Customer churn: predict which customers will stay with a company and which will leave
  • Fraud detection: detect fraudulent transactions in large datasets, including highly imbalanced data where non-frauds outnumber frauds 1000:1
  • Healthcare diagnosis: disease diagnosis from pharmaceutical company data
  • Human resources: which resumes are likely to produce the best job candidates
  • Predictive maintenance: time to failure of industrial equipment such as turbines or hard drives
  • Recommender systems: content recommendation for media, including an appropriate level of variety
  • Revenue prediction: expected revenue or payout from, e.g., insurance
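The fraud detection case above, with non-frauds outnumbering frauds 1000:1, is a classic imbalanced-data problem. A minimal sketch of one common remedy, inverse-frequency class weighting (the heuristic behind scikit-learn's `class_weight='balanced'`); the numbers here are illustrative, not from any customer project:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency class weights: w_c = n_samples / (n_classes * n_c),
    so the rare class gets proportionally more weight in the loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

# 1000:1 imbalance: 1000 non-frauds (0) for every fraud (1)
labels = [0] * 1000 + [1]
w = balanced_class_weights(labels)
# w[1] / w[0] == 1000: each fraud counts as much as 1000 non-frauds
```

These weights are then passed to the training loss (most libraries accept a per-class or per-sample weight), which is usually a better first step than discarding majority-class data.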

Technical Skills: Data Science

This lists the areas I have worked on from the data science point of view rather than the tools point of view. All of these involve coding (e.g., in Python) and/or command-line usage, some of it on distributed or HPC systems.

I have divided the list into 2 sections:

  • Experienced: areas in which I have extensive experience. With these, I can “hit the ground running” to do data science.
  • Some experience: areas I have worked in, either less than the above, or a long time ago. With these I could “hit the ground walking”, and get back up to speed if needed.

At the end, I have named a few further areas which I have not used, but would be nice to pick up given their utility.

Experienced

  • AutoML: Skytree had a Bayesian search through algorithm hyperparameters; its AutoML generalized this to multiple algorithms
  • Data featurization: feature filtering, selection, and engineering in several domains
  • Data preparation: raw data to final analysis on many customer POCs and use cases; also extensive in academia
  • Ensembles: mixture of experts and ensembles of gradient-boosted trees in research and customer POCs
  • Evaluation: the usual machine learning metrics, plus business metrics such as direct dollar value
  • Gradient-boosted decision trees: many customer POCs; the well-tuned GBT was often best, and as far as I know Skytree never lost a POC on model performance
  • Interpretation: model-dependent (e.g., variable importances) and model-agnostic (e.g., partial dependence plots)
  • K nearest neighbors: showed Skytree linear scaling to 400 million objects on the CANFAR cloud computing & data mining system
  • Large datasets: datasets too large to fit in memory and thus not loadable by many analysis tools
  • Model scoring metrics: accuracy, recall/sensitivity/true positive rate, precision, F-score, Gini, ROC/Lorenz curve, capture deviation, precision at k, MAE, normalized Gini, ranking scoring, random permutation variable importances, yield scoring, etc.
  • Model tuning: tuning hyperparameters of all ML algorithms listed in this table and the next one
  • Model validation: train/tune/test splits, cross-validation
  • Models in production: in business, working with customers who had our software in production; in academia, projects backed by grant funding (up to 7 figures) had models that generated published results, hence they were in production
  • Neural networks: neural networks for galaxy classification and distance prediction (circa 2000–2004, hence pre-deep learning) [1,2]
  • Random forest decision trees: suitable for some problems when GBT was not best (more stable, better variable importances)
  • Research: led the Skytree benchmarking project; results were used in almost every company customer pitch
[1] ANN ref, photoz ref
[2] Some of them were multilayer, so technically they were deep learning, but only with the basic fully connected architecture
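To make one of the scoring metrics above concrete: the Gini coefficient is a simple transform of the ROC AUC, which can itself be computed as a Mann-Whitney rank statistic. A small self-contained sketch with toy data (not from any real project):

```python
def roc_auc(y_true, scores):
    """ROC AUC as the Mann-Whitney statistic: the probability that a
    randomly chosen positive is scored above a randomly chosen negative
    (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(y_true, scores):
    """Gini coefficient = 2 * AUC - 1: 0 for a random model, 1 for perfect."""
    return 2 * roc_auc(y_true, scores) - 1

y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
```

Here `roc_auc(y, s)` is 0.75, so `gini(y, s)` is 0.5; this quadratic-time version is fine for illustration, while production code sorts by score for O(n log n).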

Some experience

  • Clustering: various algorithms, especially K-means as the most basic
  • Deep learning: Dotscience demos using TensorFlow for image processing and hyperparameter tuning
  • Density estimation: as part of Skytree software training for customers; software testing & feedback; company demos
  • Generalized linear models: as a baseline to compare to nonlinear ML models; includes logistic regression
  • Imbalanced data: customer POCs, e.g., fraud detection; company demos
  • Model monitoring: customers generally deployed and monitored models themselves, so this was mostly outside my scope; we provided product support, and at Dotscience I wrote demo material monitoring deployed models using PromQL
  • Outlier / anomaly detection: company demos, e.g., nearest neighbor outliers on millions of objects with 3D interactive visualization in Partiview
  • Recommender systems: company demos & customer training
  • Singular value decomposition: fast SVD as part of company software training for customers; software testing & feedback (SVD includes PCA)
  • Sparse data: especially for large datasets
  • Support vector machines: classification of high-dimensional data; linear and nonlinear
  • Text analysis: customer POCs, e.g., resume scoring; company demos, e.g., a Skytree demo with UFO sighting reports
  • Time series analysis: customer POCs, e.g., predictive maintenance; company demos
  • Two-point correlation function: the excess probability versus random that an object (e.g., a galaxy) lies within a given distance of another
  • What-if / sensitivity analysis: customer POCs, e.g., effect of changing data inputs; company demos
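The note above that SVD includes PCA can be made concrete: PCA is an SVD of the centered data matrix, with the squared singular values giving the component variances. A minimal NumPy sketch on random data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))  # correlated features

Xc = X - X.mean(axis=0)                    # PCA requires centered data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt                            # principal axes (rows)
explained_var = s**2 / (len(Xc) - 1)       # variance along each axis
projected = Xc @ Vt.T                      # data in the principal-axis frame

# The same variances appear as eigenvalues of the covariance matrix:
eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc.T)))[::-1]
```

Working via the SVD avoids ever forming the covariance matrix, which is both faster and numerically better behaved for wide or ill-conditioned data.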

Would be nice to add

  • A/B testing: as part of production
  • Edge computing: handle IoT
  • Graphical models: most datatypes can be re-expressed as a graph, greatly aiding data integration
  • Image/speech/video processing: flexibility of use cases
  • Reinforcement learning: superhuman performance, or when there is little or no training data
  • Transfer learning: pre-trained models for GUI/nontechnical users, especially in cloud and deep learning

Some other ideas not currently listed: active learning, association rules, EM algorithm, Gaussian mixture modeling, hidden Markov models, ICA, information bottleneck, LDA, linear regression, naive Bayes, NNMF, particle swarm, semi-supervised learning, simulated annealing, SOM, stacking, wavelets

Technical Skills: Tools

I have divided the list into two sections: “experienced” and “some experience”. At the end, I have named a few further tools which I have not used but am working on picking up, given their utility, when I get time.

Many of the tools in the latter two categories embody generic data science concepts with which I am already familiar, lessening the learning curve of picking them up. For example, I would not need to learn about machine learning from scratch before learning to use TensorFlow.

Experienced

These are tools in which I have extensive experience. With these, I can “hit the ground running” to do data science.

  • Skytree: end-to-end data science from 2009 to 2018, including machine learning, large scale, customer POCs, product development, research, etc.; GUI, SDK, and command line interfaces
  • H2O [1]: internal company and customer-facing machine learning demos; benchmarking
  • Python: extensive use since the mid-2000s as a data scientist and technical user, coding to solve problems (but not software engineering)
  • Bash shell scripting: many examples from the last 20 years, e.g., preparing data too large to load into memory (awk, sed, etc.) and managing distributed computation (e.g., NCSA supercomputers)
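The "too large to load into memory" preparation mentioned above comes down to streaming: process one record at a time in a single pass, whether with awk or in Python. A toy Python analogue of the awk idiom `'{s+=$2} END {print s/NR}'`, here reading CSV; the in-memory `StringIO` is a hypothetical stand-in for a file of any size:

```python
import csv
import io

def mean_of_column(lines, col):
    """Single-pass, O(1)-memory mean of one CSV column: behaves the same
    on a 3-line StringIO or a 100 GB file opened with open()."""
    total = 0.0
    count = 0
    for row in csv.reader(lines):
        total += float(row[col])
        count += 1
    return total / count

data = io.StringIO("a,1\nb,2\nc,3\n")
m = mean_of_column(data, 1)  # 2.0
```

Any aggregate that can be updated one row at a time (sums, counts, min/max, histograms) fits this pattern; quantities like medians need a different approach, such as approximate sketches.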

[1] By far my most extensive practical machine learning experience is in Skytree, which is no longer publicly available. H2O, which remains available, is now quite similar and embodies the same concepts, meaning I can solve the same problems with it. In the future, cloud computing tools (Amazon, Microsoft, Google) may become more similar too.

Some experience

These are tools I have used, either less than the above or quite a lot but a long time ago. With these tools I could “hit the ground walking” and get back up to speed if needed. Because my work moved from one project and area of data science to another, this list is much longer than the one above.

  • C: generalized code for calculating galaxy luminosity functions, from univariate to bivariate
  • Cloud computing: CANFAR, Oracle Cloud Infrastructure (OCI)
  • Condor: and other job scheduling systems in HPC environments
  • FITS: general astronomy data format, enabling image processing and plotting of millions of points on one plot (used in later work also)
  • Genetic algorithms: feature selection for quasar photometric redshifts using multi-wavelength data, in collaboration with the Illinois Genetic Algorithms Laboratory (IlliGAL)
  • Grafana: model monitoring; pairs well with Prometheus
  • Hadoop/YARN: Skytree software used Hadoop/YARN when run distributed (many use cases), with the HDFS filesystem
  • HPC: general supercomputer usage with allocated processor core hours, especially at NCSA in Illinois
  • IDL: Interactive Data Language; was commonly used for astronomical data analysis
  • LaTeX: document typesetting for submission to refereed journals for publication
  • Matlab: neural network classification of galaxies; galaxy properties as a function of their environment
  • Matplotlib: sporadic use in Python code in many projects, including publications
  • NumPy, SciPy: as part of general Python usage
  • Office 365: Excel, PowerPoint, and Word as the de facto format for presentations shared with some colleagues and customers; also the Mac equivalents Keynote & Pages (not really lacking any needed skill here, but included for completeness)
  • Pandas: general data preparation where the tool’s flexibility made it the best choice, e.g., small-scale customer POCs
  • Partiview: 3D visualization of large-scale astronomy data that allows the user to fly through it
  • Prometheus: time series database suited to model monitoring; pairs well with Grafana
  • PySpark: generic large-scale data preparation in Skytree customer POCs
  • R: internal company benchmarking; set up equivalent tool hyperparameters versus Python, H2O, and others
  • Scikit-learn: demos and benchmarking (it was unsuitable for most customer work)
  • SExtractor: astronomical image source extraction
  • SQL: querying astronomy databases, e.g., the Sloan Digital Sky Survey, for scientifically correct datasets used in publications
  • VirtualBox: and other virtual machines in desktop and HPC environments; the Skytree free trial was distributed to customers on VirtualBox
  • XGBoost: internal company and customer-facing machine learning demos; benchmarking
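As an illustration of the SQL row above, catalogue queries of this kind select clean, well-defined samples with explicit cuts. A miniature sketch using Python's built-in sqlite3; the table, columns, and cuts are invented for illustration (real SDSS queries run against its own servers, with similar SELECT/WHERE structure):

```python
import sqlite3

# Hypothetical miniature of a catalogue query: a magnitude-limited,
# clean-photometry sample (flags = 0), as one would cut an astronomy table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE objects (id INTEGER, mag REAL, flags INTEGER)")
con.executemany(
    "INSERT INTO objects VALUES (?, ?, ?)",
    [(1, 17.2, 0), (2, 19.8, 0), (3, 16.5, 4), (4, 18.1, 0)],
)
rows = con.execute(
    "SELECT id FROM objects WHERE mag < 19 AND flags = 0 ORDER BY id"
).fetchall()
# rows -> [(1,), (4,)]: object 2 fails the magnitude cut, object 3 the flag cut
```

Expressing the cuts in SQL documents the sample selection exactly, which is what makes the resulting datasets reproducible in publications.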

Several others in this list were used in various papers (see the publications page) or public-facing material.

Not experienced, but working on

These are tools I have not yet used but am working on picking up, as time allows.

  • Amazon SageMaker: cloud computing end-to-end data science; testing Dotscience integrations
  • Databricks/MLflow: data flows (examined for competitor analysis with Dotscience)
  • Datashader: visualize large datasets
  • Docker: part of deploying models as microservices; used indirectly within Dotscience
  • GitHub: part of the dataflow in Oracle and Dotscience work
  • Google Cloud: cloud computing end-to-end data science
  • Keras: deep learning in TensorFlow
  • Koalas: scaling of Spark combined with the flexibility of Pandas
  • Microsoft Azure: cloud computing end-to-end data science; competitor analysis between Dotscience and Azure Databricks
  • Plotly: interactive inline notebook plots
  • TensorFlow: now a commonly used tool for enterprise data science, especially deep learning

Grab-bag of more

It would be nice to try these but it’s better to focus on the above first.

  • Cython: speed up Python code
  • Dask: combine Python ease of use with scale
  • Eli5: visualize machine learning models
  • ELKI: advanced data clustering with many algorithms
  • fast.ai: deep learning on PyTorch
  • Gensim: topic modeling
  • GeoPandas: geographical data
  • GeoPy: geographical data
  • Google Facets: visualize image datasets
  • ggplot2: plotting in R
  • Isolation forest: outlier detection
  • Julia: combine ease of language use with speed
  • Kafka: streaming data
  • Kubeflow: data flows
  • Kubernetes: container orchestration for deploying models as microservices
  • Neo4j: graph database
  • NetworkX: study graphs and networks
  • NLTK: natural language toolkit
  • Numba: speed up Python code
  • One-class SVM: outlier detection
  • ONNX: interchangeable neural network models
  • Parquet: columnar data is faster than row-based
  • PyTorch: deep learning competitor to TensorFlow, widely used in research
  • R Shiny: interactive web apps from R
  • Scagnostics: outlier detection
  • Seaborn: statistical plots
  • spaCy: Python NLP
  • StatsModels: Python statistics
  • Tableau: tell the data story
  • Theano: deep learning
  • Trifacta: data preparation

Not Listed

Used, but in too small or obscure a way to be useful now:

Basic, C++, D2K, FPGA, GPU, Hive, Java, Lisp, MLlib, Perl, Tcl/Tk, Visual Basic

Not used but well-known:

Bokeh, C#, Caffe, Cassandra, Dataiku, Data Robot, Flask, Go, HBase, KNIME, libSVM, Mahout, MongoDB, MXNet, .NET, PHP, Pig, PyCharm, PyMC3, RapidMiner, Ruby, Rust, SAS, Scala, SPSS, Talend, Watson, Weka

etc.