This expands on my resume to give a fuller picture of my technical skills. For a wide-ranging generalist data scientist like myself, it is especially important to spell out what I know well, what I know some of, and what is outside my scope.
Put another way: after a few years, any data scientist has a giant stack of tools they have used or tried. So which ones do I know, know a bit of, or not know?
The page is subdivided into skills listed by
- Business considerations
- The various fields of data science
- Software tools
This reflects my view that these 3 areas are equally important for end-to-end data science. It’s also a good way of seeing what I should learn next.
Technical Skills: Business
As with the conceptual and tools parts of the subject on the two related pages, my data science business skills are the sum total of my experience in the field since the early 2000s. Here they are divided into the broad areas of management & leadership, and business use cases.
In my Infosys and Oracle jobs, part of the job description was to help hire and then lead a data science team. However, neither group was well resourced or staffed (a situation outside my control), so the opportunity never materialized.
Some general material about the relationship of the data science process to business problems and processes is on the data science page.
Management & leadership
Example | Details |
--- | --- |
POC primary data scientist | Formulation and solution of business problems with customers across multiple industries as key part of data science process |
Models in production | My company roles included aiding customers whose models were in production. CANFAR+Skytree was in production and available to the astronomical community |
Benchmarking | Led benchmarking project with team of data scientists and engineers on Skytree vs. competitors (MLlib, R, scikit−learn, etc.). Results used for remainder of Skytree’s existence on presentation slides for marketing, sales, technical, and investors |
Presentations & demos | Frequent business-critical presentations and product demos to customers up to the senior executive level in a variety of contexts |
Documentation templates | Conception and production of templates for documenting data science POC work, used on multiple projects totaling millions of dollars of paid customer value |
NGVS SWG group lead | Group lead on the Next Generation Virgo Cluster survey for the luminosity function Science Working Group. Also led the photometric redshifts subgroup |
GALEX PI | Principal investigator on Galaxy Evolution Explorer grant to improve galaxy distance estimates via photometric redshifts (total funding $40,000). Co-investigator on grants with funding totaling $970,000 |
Student mentor | Supervision of PhD student projects as postdoctoral researcher |
Teaching Assistant (TA) | Student lab supervision and grading homework assignments as graduate student |
Functional use cases
This gives example broad industry areas in which I have experience. Some specific example use cases are in work experience. Details such as company names and quantitative business value created are, however, confidential.
In each of these, machine learning was used to improve on existing customer results. The end-to-end process of business problem definition, data understanding & preparation, modeling, and production, along with documentation, interpretation, and presentation, was always in scope. Some functions listed have several example projects.
Area | Details |
--- | --- |
Customer churn | Predict which customers will stay with a company and which will leave |
Fraud detection | Detect fraudulent transactions in large datasets, including highly imbalanced data where non-frauds outnumber frauds 1000:1 |
Healthcare diagnosis | Disease diagnosis from pharmaceutical company data |
Human resources | Which resumes are likely to produce the best job candidates |
Predictive maintenance | Time to failure of various industrial equipment such as turbines or hard drives |
Recommender systems | Content recommendation for media, including appropriate level of variety |
Revenue prediction | Expected revenue or payout from, e.g., insurance |
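The fraud detection row above mentions 1000:1 class imbalance. A minimal sketch of one standard remedy, cost-sensitive class weighting, is below; the data, model choice, and parameters are illustrative inventions, not the original customer work.

```python
# Hypothetical sketch of handling heavily imbalanced data (e.g., fraud)
# via class weighting, so the model does not simply predict the majority
# class for everything. Synthetic data, not from any customer project.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 20000
X = rng.normal(size=(n, 5))
# Rare positive class: "frauds" sit in a shifted region of feature space
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n) > 3.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights errors by inverse class frequency
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

print("positive rate:", round(y.mean(), 4))
print("recall on rare class:", round(recall_score(y_te, clf.predict(X_te)), 2))
```

In practice, resampling (over/undersampling) and threshold tuning on a business metric are common alternatives or complements to class weights.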
Technical Skills: Data Science
This lists the areas I have worked on from the data science point of view rather than the tools point of view. All of these involve coding (e.g., in Python) and/or command line usage, some of it on distributed or HPC systems.
I have divided the list into 2 sections:
- Experienced: areas in which I have extensive experience. With these, I can “hit the ground running” to do data science.
- Some experience: areas I have worked in, either less than the above, or a long time ago. With these I could “hit the ground walking”, and get back up to speed if needed.
At the end, I have named a few further areas which I have not used, but would be nice to pick up given their utility.
Experienced
Area | Example(s) |
--- | --- |
AutoML | Skytree had a Bayesian-based search through algorithm hyperparameters. Its AutoML generalized this to multiple algorithms. |
Data featurization | Feature filtering, selection, and engineering in several domains |
Data preparation | Raw data to final analysis on many customer POCs and use cases. Also extensive in academia. |
Ensembles | Mixture of experts, ensemble gradient boosted tree in research and customer POCs |
Evaluation | Usual machine learning metrics, also business metrics such as direct dollar value |
Gradient-boosted decision trees | Many customer POCs: the well-tuned GBT was often best. As far as I know, Skytree never lost a POC on model performance. |
Interpretation | Model-dependent (e.g., variable importances), model-agnostic (e.g., partial dependence plots) |
K nearest neighbors | Showed Skytree linear scaling to 400 million objects on CANFAR cloud computing & data mining system |
Large datasets | Datasets too large to fit in memory thus not loadable by many analysis tools |
Model scoring metrics | Accuracy, recall/sensitivity/true positive rate, precision, F-score, Gini and normalized Gini, ROC/Lorenz curve, capture deviation, precision at k, MAE, ranking and yield scoring, random permutation variable importances, etc. |
Model tuning | Tuning hyperparameters of all ML algorithms listed in this table and the next one |
Model validation | Train/tune/test, cross-validation |
Models in production | In business, working with customers who had our software in production. In academia, projects backed by grant funding (up to 7 figures) had models that generated published results. Hence they were in production. |
Neural networks | Neural networks for galaxy classification and distance prediction (circa 2000–2004, hence pre-deep learning) [1,2]
Random forest decision trees | Suitable for some problems when GBT was not best (more stable, better variable importances) |
Research | Led Skytree benchmarking project, results used in almost every company customer pitch |
[2] Some of them were multilayer, so technically they were deep learning, but only with the basic fully connected architecture
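A few of the scoring and validation rows above can be illustrated concretely. The sketch below is not the original Skytree tooling; it uses scikit-learn on synthetic data to show cross-validated AUC, the Gini coefficient (Gini = 2·AUC − 1), and precision at k.

```python
# Illustrative metrics sketch on synthetic data (not customer data):
# cross-validated AUC, Gini derived from AUC, and precision at k.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, random_state=0)
model = GradientBoostingClassifier(random_state=0)

# Cross-validated AUC: the "model validation" row in practice
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

# Gini coefficient relates directly to AUC: Gini = 2*AUC - 1
gini = 2 * auc - 1
print(f"CV AUC = {auc:.3f}, Gini = {gini:.3f}")

# Precision at k: of the k highest-scored cases, what fraction are positives?
model.fit(X, y)
scores = model.predict_proba(X)[:, 1]
k = 100
top_k = np.argsort(scores)[::-1][:k]
print(f"precision@{k} = {y[top_k].mean():.2f}")
```

Precision at k is the natural metric when only the top-scored cases (e.g., accounts flagged for fraud review) will ever be acted on.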
Some experience
Area | Example |
--- | --- |
Clustering | Various algorithms, especially K-means as the most basic |
Deep learning | Dotscience demos using TensorFlow for image processing and hyperparameter tuning |
Density estimation | As part of Skytree software training for customers; software testing & feedback; company demos |
Generalized linear model | As baseline to compare to nonlinear ML models; includes logistic regression |
Imbalanced data | Customer POCs, e.g., fraud detection; company demos |
Model monitoring | Generally customers ran their own deployments and monitored them themselves, so deploying and monitoring models was not in my scope; we provided product support. At Dotscience I wrote demo material monitoring deployed models using PromQL |
Outlier / anomaly detection | Company demos, e.g., nearest neighbor outliers on millions of objects with 3D interactive visualization in Partiview |
Recommender systems | Company demos & customer training |
Singular value decomposition | Fast SVD as part of company software training for customers; software testing & feedback (SVD includes PCA) |
Sparse data | Especially for large datasets |
Support vector machine | Classification of high−dimensional data; linear and nonlinear |
Text analysis | Customer POCs, e.g., resume scoring; company demos, e.g., Skytree demo with UFO sighting reports |
Time series analysis | Customer POCs, e.g., predictive maintenance; company demos |
Two point correlation function | Excess probability versus random that an object (e.g., a galaxy) is within a given distance |
What-if / sensitivity analysis | Customer POCs, e.g., effect of changing data inputs; company demos |
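The outlier detection row above mentions nearest-neighbor outliers on millions of objects. A toy version of that idea, with invented data at a tiny scale, scores each point by its distance to its k-th nearest neighbor, so isolated points score highest:

```python
# Minimal sketch of nearest-neighbor outlier scoring (synthetic data,
# nothing like the original million-object demo scale).
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
cloud = rng.normal(size=(500, 3))        # dense cluster of inliers
outlier = np.array([[8.0, 8.0, 8.0]])    # one far-away point
X = np.vstack([cloud, outlier])

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own neighbor
dist, _ = nn.kneighbors(X)
scores = dist[:, -1]                             # distance to k-th true neighbor

# The appended far-away point should have the largest score
print("most outlying index:", int(np.argmax(scores)))
```

At large scale the brute-force pairwise approach breaks down; tree-based or approximate nearest-neighbor indices are what make this feasible on millions of objects.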
Would be nice to add
Area | Benefit(s) |
--- | --- |
A/B testing | As part of production |
Edge computing | Handle IoT |
Graphical models | Most datatypes can be re-expressed as a graph, greatly aiding data integration |
Image/speech/video processing | Flexibility of use cases |
Reinforcement learning | Superhuman performance or when little/no training data |
Transfer learning | Pre-trained models for GUI/nontechnical users, esp. cloud and deep learning |
Some other ideas not currently listed: Active learning, association rules, EM algorithm, Gaussian mixture modeling, hidden Markov models, ICA, information bottleneck, LDA, linear regression, naive Bayes, NNMF, particle swarm, semi-supervised, simulated annealing, SOM, stacking, wavelets
Technical Skills: Tools
I have divided the list into 2 sections: “experienced” and “some experience”. At the end, I have named a few further tools which I have not used, but am working on picking up, given their utility, when I get time.
Many of the tools in the latter 2 categories embody generic data science concepts with which I am familiar, lessening the learning curve of picking them up. For example, I would not need to learn about machine learning from scratch before learning to use TensorFlow.
Experienced
These are tools in which I have extensive experience. With these, I can “hit the ground running” to do data science.
Tool | Example(s) |
--- | --- |
Skytree | End-to-end data science from 2009–2018, including machine learning, large scale, customer POCs, product development, research, etc. GUI, SDK, and command line interfaces. |
H2O [1] | Internal company customer-facing machine learning demos; benchmarking |
Python | Extensive use since the mid-2000s as a data scientist and technical user, coding to solve problems (but not software engineering) |
Bash shell scripting | Many examples from last 20 years, e.g., preparing data too large to load into memory (awk, sed, etc.), managing distributed computation (e.g., NCSA supercomputers), etc. |
[1] By far my most extensive practical machine learning experience is in Skytree, which is no longer publicly available. H2O, which remains available, is now quite similar and embodies the same concepts, meaning I can solve the same problems with it. In the future, cloud computing tools (Amazon, Microsoft, Google) may become more similar too.
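The shell scripting row above mentions preparing data too large to load into memory with streaming tools like awk and sed. The same streaming idea carries over to Python; a tiny hedged sketch, with a hypothetical file and made-up column names, aggregates a CSV chunk by chunk so only one slice of rows is ever in memory:

```python
# Hedged sketch of out-of-core aggregation in pandas. The file stands in
# for a real multi-GB CSV; column names are hypothetical.
import pandas as pd

# Build a small CSV in place of the real oversized file
pd.DataFrame({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]}).to_csv(
    "big_file.csv", index=False
)

totals = {}
# chunksize keeps only one slice of rows in memory at a time
for chunk in pd.read_csv("big_file.csv", chunksize=2):
    for g, s in chunk.groupby("group")["value"].sum().items():
        totals[g] = totals.get(g, 0) + s

print(totals)  # per-group sums combined across chunks
```

For aggregations that are not simple sums (e.g., medians), chunked processing needs more care, which is where tools like awk pipelines, Spark, or Dask earn their keep.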
Some experience
These tools I have used, either less than the above, or quite a lot but a long time ago. With these tools I could “hit the ground walking”, and get (back) up to speed if needed. Because my work moved from one project and area of data science to another, this list is much longer than the one above.
Tool/Language | Example(s) |
--- | --- |
C | Generalized code for calculating galaxy luminosity functions from univariate to bivariate |
Cloud computing | CANFAR, Oracle Cloud Infrastructure (OCI) |
Condor | And other job scheduling systems in HPC environments |
FITS | Flexible Image Transport System, the standard astronomical data format for images and tables; enabled image processing and plotting of millions of points on one plot (used in later work also) |
Genetic algorithms | Feature selection for quasar photometric redshifts using multi-wavelength data, in collaboration with Illinois Genetic Algorithms Laboratory (IlliGAL) |
Grafana | Model monitoring. Pairs well with Prometheus. |
Hadoop/YARN | Skytree software used Hadoop/YARN when run distributed (many use cases), with the HDFS filesystem |
HPC | General supercomputer usage with allocated processor core hours, especially at NCSA in Illinois |
IDL | Interactive data language; was commonly used for astronomical data analysis |
LaTeX | Document typesetting for submission to refereed journals for publication |
Matlab | Neural network classification of galaxies; galaxy properties as a function of their environment |
Matplotlib | Sporadic use in Python code in many projects including publications |
NumPy, SciPy | As part of general Python usage |
Office | Microsoft 365, Excel, PowerPoint, and Word, as the de facto format for presentations shared with some colleagues and customers. Also the Mac equivalents Keynote & Pages. Not really lacking any needed skill in this, but included for completeness. |
Pandas | General data preparation where the tool’s flexibility made it the best choice, e.g., small-scale customer POCs |
Partiview | 3D visualization of large scale astronomy data that allows the user to fly through it |
Prometheus | Time series database suited to model monitoring. Pairs well with Grafana. |
PySpark | Generic large-scale data preparation for Skytree customer POCs |
R | Internal company benchmarking: set up equivalent tool hyperparameters versus Python, H2O, and others |
Scikit-learn | Demos and benchmarking (it was unsuitable for most customer work) |
SExtractor | Astronomical image extraction |
SQL | Querying astronomy databases for scientifically correct datasets used in publications, e.g., Sloan Digital Sky Survey |
VirtualBox | And other virtual machines in desktop and HPC environments. The Skytree free trial was distributed to customers on VirtualBox |
XGBoost | Internal company customer-facing machine learning demos; benchmarking |
Several others in the list were used in various papers (see publications page) or public-facing material.
Not experienced, but working on
These I am picking up as I get time.
Tool | Benefit(s) |
--- | --- |
Amazon SageMaker | Cloud computing end-to-end data science. Test Dotscience integrations |
Databricks/MLFlow | Data flows (looked at for competitor analysis with Dotscience) |
Datashader | Visualize large datasets |
Docker | Part of deploying models as microservices. Used indirectly within Dotscience |
GitHub | Part of dataflow on Oracle and Dotscience work |
Google Cloud | Cloud computing end-to-end data science |
Keras | Deep learning in TensorFlow |
Koalas | Scaling of Spark combined with flexibility of Pandas |
Microsoft Azure | Cloud computing end-to-end data science. Competitor analysis between Dotscience and Azure Databricks |
Plotly | Interactive inline notebook plots |
TensorFlow | Now a commonly used tool for enterprise data science, especially deep learning |
Grab-bag of more
It would be nice to try these but it’s better to focus on the above first.
Tool | Benefit(s) |
--- | --- |
Cython | Speed up Python code |
Dask | Combine Python ease of use with scale |
Eli5 | Visualize machine learning models |
ELKI | Advanced data clustering with many algorithms |
fast.ai | Deep learning on PyTorch |
Gensim | Topic modeling |
Geopandas | Geographical data |
GeoPy | Geographical data |
Google Facets | Visualize image datasets |
ggplot2 | Plotting in R |
Isolation forest | Outlier detection |
Julia | Combine ease of language use with speed |
Kafka | Streaming data |
Kubeflow | Data flows |
Kubernetes | Container orchestration for deploying models as microservices |
Neo4j | Graph database |
NetworkX | Study graphs and networks |
NLTK | Natural language toolkit |
Numba | Speed up Python code |
One-class SVM | Outlier detection |
ONNX | Interchangeable neural network models |
Parquet | Columnar data is faster than row-based |
PyTorch | Deep learning competitor to TensorFlow, widely used in research |
R Shiny | Interactive web apps from R |
Scagnostics | Outlier detection |
Seaborn | Statistical plots |
spaCy | Python NLP |
StatsModels | Python statistics |
Tableau | Tell the data story |
Theano | Deep learning |
Trifacta | Data preparation |
Not Listed
Used, but usage too small or obscure to be useful now:
Basic, C++, D2K, FPGA, GPU, Hive, Java, Lisp, MLlib, Perl, Tcl/Tk, Visual Basic
Not used but well-known:
Bokeh, C#, Caffe, Cassandra, Dataiku, Data Robot, Flask, Go, HBase, KNIME, libSVM, Mahout, MongoDB, MXNet, .NET, PHP, Pig, PyCharm, PyMC3, RapidMiner, Ruby, Rust, SAS, Scala, SPSS, Talend, Watson, Weka
etc.