Novice data scientists sometimes have the notion that all they need to do is to find the right model for their data and then fit it. Nothing could be farther from the actual practice of data science. In fact, data wrangling (also called data cleansing and data munging) and exploratory data analysis often consume 80% of a data scientist’s time.
Although data wrangling and exploratory data analysis are conceptually simple, they can be hard to get right. Uncleansed or badly cleansed data is garbage, and the GIGO principle (garbage in, garbage out) applies to modeling and analysis just as much as it does to any other aspect of data processing.
What is data wrangling?
Data rarely comes in usable form. It’s often contaminated with errors and omissions, rarely has the desired structure, and usually lacks context. Data wrangling is the process of discovering the data, cleaning the data, validating it, structuring it for usability, enriching the content (possibly by adding information from public data such as weather and economic conditions), and in some cases aggregating and transforming the data.
Exactly what goes into data wrangling can vary. If the data comes from instruments or IoT devices, data transfer can be a major part of the process. If the data will be used for machine learning, transformations can include normalization or standardization as well as dimensionality reduction. If exploratory data analysis will be performed on personal computers with limited memory and storage, the wrangling process may include extracting subsets of the data. If the data comes from multiple sources, the field names and units of measurement may need consolidation through mapping and transformation.
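As a minimal sketch of that consolidation step, assuming two hypothetical sources that report the same measurement under different field names and units, the mapping in pandas might look like this:

```python
import pandas as pd

# Hypothetical sources: both report temperature, but under different
# field names and in different units.
source_a = pd.DataFrame({"sensor_id": [1, 2], "temp_f": [68.0, 72.5]})
source_b = pd.DataFrame({"device": [3, 4], "temperature_c": [21.0, 19.5]})

# Map the field names onto a shared vocabulary.
a = source_a.rename(columns={"sensor_id": "device_id", "temp_f": "temp_c"})
b = source_b.rename(columns={"device": "device_id", "temperature_c": "temp_c"})

# Convert Fahrenheit to Celsius so the unit of measurement is consistent.
a["temp_c"] = (a["temp_c"] - 32) * 5 / 9

combined = pd.concat([a, b], ignore_index=True)
```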
What is exploratory data analysis?
Exploratory data analysis is closely associated with John Tukey, of Princeton University and Bell Labs. Tukey proposed exploratory data analysis in 1961, and wrote a book about it in 1977. Tukey’s interest in exploratory data analysis influenced the development of the S statistical language at Bell Labs, which later led to S-Plus and R.
Exploratory data analysis was Tukey’s reaction to what he perceived as over-emphasis on statistical hypothesis testing, also called confirmatory data analysis. The difference between the two is that in exploratory data analysis you investigate the data first and use it to suggest hypotheses, rather than jumping right to hypotheses and fitting lines and curves to the data.
In practice, exploratory data analysis combines graphics and descriptive statistics. A highly cited book chapter in the Tukey tradition, for example, uses R to explore the 1990s Vietnamese economy with histograms, kernel density estimates, box plots, means and standard deviations, and illustrative graphs.
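A quick version of that kind of first look in Python might use pandas and Matplotlib. This is only a sketch: the file name and column names (household_survey.csv, income, region) are hypothetical, and the KDE plot needs SciPy installed.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("household_survey.csv")   # hypothetical dataset

# Descriptive statistics: count, mean, standard deviation, quartiles.
print(df.describe())

# Histogram plus a kernel density estimate for one numeric column.
df["income"].plot.hist(bins=30, alpha=0.5)
df["income"].plot.kde(secondary_y=True)
plt.show()

# Box plots of income by region to spot skew and outliers.
df.boxplot(column="income", by="region")
plt.show()
```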
ETL and ELT for data analysis
In traditional database usage, ETL (extract, transform, and load) is the process for extracting data from a data source, often a transactional database, transforming it into a structure suitable for analysis, and loading it into a data warehouse. ELT (extract, load, and transform) is a more modern process in which the data goes into a data lake or data warehouse in raw form, and then the data warehouse performs any necessary transformations.
Whether you have data lakes, data warehouses, both, or neither, the ELT process is more appropriate for data analysis, and specifically for machine learning, than the ETL process. The underlying reason is that machine learning often requires you to iterate on your data transformations in the service of feature engineering, which is essential to making good predictions.
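To make the distinction concrete, here is a toy ELT pass in pandas, with Parquet files standing in for the lake or warehouse. The file names and transformations are hypothetical, and writing Parquet assumes pyarrow or fastparquet is installed.

```python
import pandas as pd

# Extract and Load: land the raw export untouched in the "lake".
raw = pd.read_csv("orders_export.csv")          # hypothetical source file
raw.to_parquet("lake/orders_raw.parquet")       # assumes lake/ exists

# Transform later, and as often as needed, without re-extracting.
orders = pd.read_parquet("lake/orders_raw.parquet")
orders["order_date"] = pd.to_datetime(orders["order_date"])
orders["revenue"] = orders["quantity"] * orders["unit_price"]

# Each feature-engineering iteration is just another transform pass.
daily = orders.groupby(orders["order_date"].dt.date)["revenue"].sum()
daily.to_frame("daily_revenue").to_parquet("lake/orders_daily.parquet")
```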
Screen scraping for data mining
There are times when your data is available in a form your analysis programs can read, either as a file or via an API. But what about when the data is only available as the output of another program, for example as tables on a website?
It’s not that hard to parse and collect web data with a program that mimics a web browser. That process is called screen scraping, web scraping, or data scraping. Screen scraping originally meant reading text data from a computer terminal screen; these days it’s much more common for the data to be displayed in HTML web pages.
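For data that is already laid out in HTML tables, pandas can often do the scraping for you; for messier markup you can fall back to requests and Beautiful Soup. The URL and the CSS selector below are placeholders, read_html needs an HTML parser such as lxml installed, and you should check a site's terms of service before scraping it.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/stats"   # placeholder URL

# Parse every <table> element on the page into a list of DataFrames.
tables = pd.read_html(URL)

# For less regular markup, fetch the page and pick out elements yourself.
html = requests.get(URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")
prices = [cell.get_text(strip=True) for cell in soup.select("td.price")]
```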
Cleaning data and imputing missing values for data analysis
Most raw real-world datasets have missing or obviously wrong data values. The simple steps for cleaning your data include dropping columns and rows that have a high percentage of missing values. You might also want to remove outliers later in the process.
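In pandas, those simple dropping rules take only a couple of lines; the file name and the 40 percent and 50 percent thresholds here are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.read_csv("raw_data.csv")   # hypothetical file

# Drop columns where more than 40% of the values are missing.
df = df.loc[:, df.isna().mean() <= 0.4]

# Keep only rows that still have at least half of their fields populated.
df = df.dropna(thresh=int(df.shape[1] * 0.5))
```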
Sometimes, if you follow those rules, you lose too much of your data. An alternative way of dealing with missing values is to impute them, which essentially means guessing what they should be. This is easy to do with standard Python libraries.
The Pandas data import functions, such as read_csv(), can replace a placeholder symbol such as '?' with NaN. The Scikit-learn class SimpleImputer can replace NaN values using one of four strategies: column mean, column median, column mode (most frequent value), and constant. For the constant strategy, the default replacement value is 0 for numeric fields and 'missing_value' for string or object fields; you can set fill_value to override that default.
Which imputation strategy is best? It depends on your data and your model, so the only way to know is to try them all and see which strategy yields the fit model with the best validation accuracy scores.
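As a sketch of that comparison, assuming a hypothetical patients.csv with '?' placeholders, numeric features, and an outcome column, you can wrap each strategy in a Scikit-learn pipeline and compare cross-validated scores:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Treat '?' placeholders as NaN on import.
df = pd.read_csv("patients.csv", na_values="?")
X = df.drop(columns="outcome").select_dtypes("number")
y = df["outcome"]

# Try each imputation strategy and compare cross-validated accuracy.
for strategy in ("mean", "median", "most_frequent", "constant"):
    model = make_pipeline(SimpleImputer(strategy=strategy),
                          LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{strategy:>13}: {scores.mean():.3f}")
```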
Feature engineering for predictive modeling
A feature is an individual measurable property or characteristic of a phenomenon being observed. Feature engineering is the construction of a minimum set of independent variables that explain a problem. If two variables are highly correlated, either they need to be combined into a single feature, or one should be dropped. Sometimes people perform principal component analysis (PCA) to convert correlated variables into a set of linearly uncorrelated variables.
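Here is a hedged sketch of both approaches, assuming a hypothetical all-numeric features.csv and an arbitrary 0.9 correlation cutoff:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

X = pd.read_csv("features.csv")   # hypothetical all-numeric feature matrix

# Flag one variable from every pair whose absolute correlation exceeds 0.9.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)

# Alternatively, let PCA replace the correlated columns with
# linearly uncorrelated components that keep 95% of the variance.
components = PCA(n_components=0.95).fit_transform(X)
```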
Categorical variables, usually in text form, must be encoded into numbers to be useful for machine learning. Assigning an integer for each category (label encoding) seems obvious and easy, but unfortunately some machine learning models mistake the integers for ordinals. A popular alternative is one-hot encoding, in which each category is assigned to a column (or dimension of a vector) that is either coded 1 or 0.
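Both encodings are one-liners in pandas and Scikit-learn. The color column below is a made-up example, and the sparse_output argument assumes a recent Scikit-learn release (older versions call it sparse).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Label encoding: compact, but models may read the codes as ordered values.
df["color_code"] = df["color"].astype("category").cat.codes

# One-hot encoding with pandas: one 0/1 column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# The Scikit-learn equivalent, convenient inside a modeling pipeline.
encoded = OneHotEncoder(sparse_output=False).fit_transform(df[["color"]])
```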
Feature generation is the process of constructing new features from the raw observations. For example, subtract Year_of_Birth from Year_of_Death and you construct Age_at_Death, which is a prime independent variable for lifetime and mortality analysis. The Deep Feature Synthesis algorithm is useful for automating feature generation; you can find it implemented in the open source Featuretools framework.
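The Age_at_Death example is a one-line transformation in pandas (the records below are invented for illustration):

```python
import pandas as pd

# Invented mortality records.
df = pd.DataFrame({"Year_of_Birth": [1890, 1923, 1951],
                   "Year_of_Death": [1967, 2001, 2020]})

# Construct the new feature directly from the raw observations.
df["Age_at_Death"] = df["Year_of_Death"] - df["Year_of_Birth"]
```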
Feature selection is the process of eliminating unnecessary features from the analysis, to avoid the “curse of dimensionality” and overfitting of the data. Dimensionality reduction algorithms can do this automatically. Techniques include removing variables with many missing values, removing variables with low variance, removing or combining highly correlated variables, feature importance rankings from decision trees and random forests, backward feature elimination, forward feature selection, factor analysis, and PCA.
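A short sketch of two of those techniques in Scikit-learn, assuming a hypothetical numeric training.csv with a target column and with arbitrary threshold and feature-count settings:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, VarianceThreshold

df = pd.read_csv("training.csv")   # hypothetical numeric dataset
X = df.drop(columns="target")
y = df["target"]

# Remove near-constant variables.
X_var = VarianceThreshold(threshold=0.01).fit_transform(X)

# Backward feature elimination: recursively drop the weakest features,
# as ranked by a random forest, until 10 remain.
rfe = RFE(RandomForestClassifier(n_estimators=100), n_features_to_select=10)
X_selected = rfe.fit_transform(X_var, y)
```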
Data normalization for machine learning
To use numeric data for machine learning regression, you usually need to normalize it. Otherwise, features with larger ranges tend to dominate the Euclidean distance between feature vectors, their effects can be magnified at the expense of the other fields, and steepest descent optimization may have trouble converging. There are several ways to normalize and standardize data for machine learning, including min-max normalization, mean normalization, standardization, and scaling to unit length. This process is often called feature scaling.
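Scikit-learn provides scalers for the most common choices; the two-column matrix below is invented to show the effect of very different ranges.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales (invented values).
X = np.array([[50_000.0, 2.1], [82_000.0, 3.7], [61_000.0, 1.4]])

# Min-max normalization rescales each column to the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization centers each column at 0 with unit variance.
X_standard = StandardScaler().fit_transform(X)

# In practice, fit the scaler on the training data only and reuse it
# on the test data so test-set information doesn't leak into training.
```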
Data analysis lifecycle
While there are probably as many variations on the data analysis lifecycle as there are analysts, one reasonable formulation breaks it down into seven or eight steps, depending on how you want to count:
- Identify the questions to be answered for business understanding and the variables that need to be predicted.
- Acquire the data (also called data mining).
- Clean the data and account for missing data, either by discarding rows or imputing values.
- Explore the data.
- Perform feature engineering.
- Perform predictive modeling, including machine learning, validation, and statistical methods and tests.
- Visualize the data.
- Return to step one (business understanding) and continue the cycle.
Steps two and three are often considered data wrangling, but it’s important to establish the context for data wrangling by identifying the business questions to be answered (step one). It’s also important to do your exploratory data analysis (step four) before modeling, to avoid introducing biases in your predictions. It’s common to iterate on steps five through seven to find the best model and set of features.
And yes, the lifecycle almost always restarts when you think you’re done, either because the conditions change, the data drifts, or the business needs to answer additional questions.