crisp

Phase One: Business Understanding
Perhaps the most important phase of any data mining project, the initial business understanding phase focuses on understanding the project objectives from a business perspective, converting this knowledge into a data mining problem definition, and then developing a preliminary plan designed to achieve the objectives. To understand which data should later be analyzed, and how, data mining practitioners must fully understand the business for which they are finding a solution. The business understanding phase involves four key steps: determining business objectives, assessing the situation, determining the data mining goals, and producing the project plan.
Determine the Business Objectives
Understanding a client’s true goal is critical to uncovering the important factors involved in the planned project—and to ensuring that the project does not result in producing the right answers to the wrong questions. To accomplish this, the data analyst must uncover the primary business objective as well as the related questions the business would like to address.
For example, the primary business goal could be to retain current customers by predicting when they are prone to move to a competitor. Examples of related business questions might be, “How does the primary channel (e.g., ATM, branch visit, Internet) of a bank customer affect whether they stay or go?” or “Will lower ATM fees significantly reduce the number of high-value customers who leave?” A secondary issue might be to determine whether lower fees affect only one particular customer segment.
Finally, a good data analyst always determines the measure of success. Success may be measured by reducing lost customers by 10 percent or simply by achieving a better understanding of the customer base. Data analysts should beware of setting unattainable goals and should make sure that each success criterion relates to at least one of the specified business objectives.
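A success criterion such as "reduce lost customers by 10 percent" is easiest to check later if it is written down as a calculation. The following is a minimal sketch; the function names and the customer counts are hypothetical, and the target is read here as a reduction relative to a baseline period.

```python
def churn_reduction(baseline_lost: int, current_lost: int) -> float:
    """Relative reduction in lost customers (0.10 means 10 percent)."""
    if baseline_lost <= 0:
        raise ValueError("baseline_lost must be positive")
    return (baseline_lost - current_lost) / baseline_lost


def meets_target(baseline_lost: int, current_lost: int, target: float = 0.10) -> bool:
    """True when the reduction reaches the agreed success criterion."""
    return churn_reduction(baseline_lost, current_lost) >= target


# Hypothetical figures: 1,200 customers lost last year, 1,050 this year.
print(churn_reduction(1200, 1050))
print(meets_target(1200, 1050))
```

Stating the baseline and the target explicitly also makes it easier to judge whether the goal is attainable.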
Assess the Situation
In this step, the data analyst outlines the resources, from personnel to software, that are available to accomplish the data mining project. Particularly important is discovering what data is available to meet the primary business goal. At this point, the data analyst also should list the assumptions made in the project, such as, “To address the business question, a minimum number of customers over age 50 is necessary.” The data analyst also should list the project risks, list potential solutions to those risks, create a glossary of business and data mining terms, and construct a cost-benefit analysis for the project.
Determine the Data Mining Goals
The data mining goal states project objectives in technical terms, such as, “Predict how many widgets a customer will buy given their purchases in the past three years, demographic information (age, salary, city, etc.), and the item price.” Success also should be defined in these terms; for instance, success could be defined as achieving a certain level of predictive accuracy. If the business goal cannot be effectively translated into a data mining goal, it may be wise to consider redefining the problem at this point.
Produce a Project Plan
The project plan describes the intended approach for achieving the data mining goals, including specific steps and a proposed timeline, an assessment of potential risks, and an initial assessment of the tools and techniques needed to support the project. Generally accepted industry timeline standards allocate 50 to 70 percent of the time and effort in a data mining project to the Data Preparation Phase; 20 to 30 percent to the Data Understanding Phase; only 10 to 20 percent to each of the Modeling, Evaluation, and Business Understanding Phases; and 5 to 10 percent to the Deployment Planning Phase.

Phase Two: Data Understanding
The data understanding phase starts with an initial data collection. The analyst then proceeds to increase familiarity with the data, to identify data quality problems, to discover initial insights into the data, or to detect interesting subsets to form hypotheses about hidden information. The data understanding phase involves four steps, including the collection of initial data, the description of data, the exploration of data, and the verification of data quality.
Collect the Initial Data
Here a data analyst acquires the necessary data, including loading and integrating this data if necessary. The analyst should make sure to report problems encountered and his or her solutions to aid with future replications of the project. For instance, data may have to be collected from several different sources, and some of these sources may have a long lag time. It is helpful to know this in advance to avoid potential delays.
Describe the Data
During this step, the data analyst examines the “gross” or “surface” properties of the acquired data and reports on the results, examining issues such as the format of the data, the quantity of the data, the number of records and fields in each table, the identities of the fields, and any other surface features of the data. The key question to ask is: Does the data acquired satisfy the relevant requirements? For instance, if age is an important field and the data does not reflect the entire age range, it may be wise to collect a different set of data. This step also provides a basic understanding of the data on which subsequent steps will build.
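The surface properties described above can be gathered with very little code. The sketch below assumes the data has already been loaded as a list of dictionaries; the records, the field names, and the helper names describe and covers_range are hypothetical.

```python
from collections import Counter

# Hypothetical records; in practice these would come from the collected data.
customers = [
    {"id": 1, "age": 34, "city": "Leeds"},
    {"id": 2, "age": 71, "city": "York"},
    {"id": 3, "age": None, "city": "Leeds"},
    {"id": 4, "age": 19, "city": "Hull"},
]


def describe(records):
    """Report the record count, the field names, and the value types per field."""
    fields = sorted({name for row in records for name in row})
    types = {
        name: Counter(type(row.get(name)).__name__ for row in records)
        for name in fields
    }
    return {"n_records": len(records), "fields": fields, "types": types}


def covers_range(records, field, low, high):
    """Check whether the observed values span at least the range [low, high]."""
    values = [row[field] for row in records if row.get(field) is not None]
    return bool(values) and min(values) <= low and max(values) >= high


summary = describe(customers)
print(summary["n_records"], summary["fields"])
print(covers_range(customers, "age", 18, 65))
```

A check such as covers_range gives a direct answer to the question of whether an important field like age reflects the range the project requires.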
Explore the Data
This task tackles the data mining questions, which can be addressed using querying, visualization, and reporting. For instance, a data analyst may query the data to discover the types of products that purchasers in a particular income group usually buy. Or the analyst may run a visualization analysis to uncover potential fraud patterns. The data analyst should then create a data exploration report that outlines first findings, or an initial hypothesis, and the potential impact on the remainder of the project.
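As an illustration of the querying described above, the following sketch counts which products each income group buys most often. The purchase records and the function name top_products_by_group are hypothetical, and a real project would more likely run such a query in a database or analysis tool.

```python
from collections import Counter, defaultdict

# Hypothetical purchase records.
purchases = [
    {"income_group": "high", "product": "wine"},
    {"income_group": "high", "product": "cheese"},
    {"income_group": "high", "product": "wine"},
    {"income_group": "low", "product": "bread"},
    {"income_group": "low", "product": "milk"},
    {"income_group": "low", "product": "bread"},
    {"income_group": "low", "product": "bread"},
]


def top_products_by_group(rows, n=2):
    """Return the n most frequently bought products for each income group."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["income_group"]][row["product"]] += 1
    return {group: counter.most_common(n) for group, counter in counts.items()}


top = top_products_by_group(purchases, n=1)
print(top)
```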
Verify Data Quality
At this point, the analyst examines the quality of the data, addressing questions such as: Is the data complete? Missing values often occur, particularly if the data was collected across long periods of time. Some common items to check include: missing attributes and blank fields; whether all possible values are represented; the plausibility of values; the spelling of values; and whether attributes with different values have similar meanings (e.g., low fat, diet). The data analyst also should review any attributes that may give answers that conflict with common sense (e.g., teenagers with high income).
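Several of these checks can be automated. The sketch below is one possible arrangement, not a standard tool: the records, field names, plausible ranges, synonym table, and the "teen with high income" rule are all hypothetical and would be chosen per project.

```python
def is_blank(value):
    """Treat None and the empty string as missing."""
    return value is None or value == ""


def quality_report(records, required, ranges, synonyms, rules):
    """Collect simple data quality findings for a list of dict records."""
    # Count of blank or missing values per required field.
    missing = {f: sum(is_blank(r.get(f)) for r in records) for f in required}

    # Values outside the plausible numeric range for their field.
    implausible = []
    for index, row in enumerate(records):
        for field, (low, high) in ranges.items():
            value = row.get(field)
            if not is_blank(value) and not (low <= value <= high):
                implausible.append((index, field, value))

    # Values known to mean the same as another value (e.g. "diet" = "low fat").
    variants = {
        field: sorted({row[field] for row in records if row.get(field) in mapping})
        for field, mapping in synonyms.items()
    }

    # Row indexes flagged by a "conflicts with common sense" rule.
    suspicious = {
        name: [i for i, row in enumerate(records) if check(row)]
        for name, check in rules.items()
    }
    return {
        "missing": missing,
        "implausible": implausible,
        "synonym_variants": variants,
        "suspicious": suspicious,
    }


# Hypothetical data and rules.
records = [
    {"age": 15, "income": 250000, "fat_content": "low fat"},
    {"age": 42, "income": 52000, "fat_content": "diet"},
    {"age": None, "income": 48000, "fat_content": ""},
    {"age": 230, "income": 61000, "fat_content": "regular"},
]
report = quality_report(
    records,
    required=["age", "fat_content"],
    ranges={"age": (0, 120)},
    synonyms={"fat_content": {"diet": "low fat"}},
    rules={
        "teen_high_income": lambda r: not is_blank(r.get("age"))
        and r["age"] < 20
        and (r.get("income") or 0) > 100_000
    },
)
print(report)
```

Flagged rows are candidates for review rather than automatic removal; a teenager with a high income may be an error or a genuine case.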

Phase Three: Data Preparation
The data preparation phase covers all activities to construct the final data set or the data that will be fed into the modeling tool(s) from the initial raw data. Tasks include table, record, and attribute selection, as well as transformation and cleaning of data for modeling tools. The five steps in data preparation are the selection of data, the cleansing of data, the construction of data, the integration of data, and the formatting of data.
Select Data
Deciding on the data that will be used for the analysis is based on several criteria, including its relevance to the data mining goals, as well as quality and technical constraints such as limits on data volume or data types. For instance, while an individual’s address may be used to determine which region that individual is from, the actual street address data can likely be eliminated to reduce the amount of data that must be evaluated. Part of the data selection process should involve explaining why certain data was included or excluded. It is also a good idea to decide if one or more attributes are more important than others are.
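One way to keep the selection and its rationale together is to record both in the same place. The sketch below assumes list-of-dictionary records; the attribute names, the stated reasons, and the example address are hypothetical.

```python
# Hypothetical selection decisions, each recorded with its rationale.
INCLUDED = {
    "customer_id": "key needed to link records",
    "region": "relevant to the data mining goal",
    "annual_spend": "relevant to the data mining goal",
}
EXCLUDED = {
    "street_address": "region already captures location; reduces data volume",
}


def select_attributes(records, included):
    """Keep only the attributes chosen for analysis."""
    return [{name: row[name] for name in included} for row in records]


raw = [
    {"customer_id": 1, "region": "North", "annual_spend": 420.0,
     "street_address": "1 Example Road"},
    {"customer_id": 2, "region": "South", "annual_spend": 135.5,
     "street_address": "2 Example Road"},
]
selected = select_attributes(raw, INCLUDED)
print(selected)
```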
Clean Data
Without clean data, the results of a data mining analysis are in question. Thus at this stage, the data analyst must either select clean subsets of data or incorporate more ambitious techniques such as estimating missing data through modeling analyses. At this point, data analysts should make sure they outline how they addressed each quality problem reported in the earlier “Verify Data Quality” step.
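The two options can be contrasted in a short sketch. Median imputation is used here only as a simple stand-in for the more ambitious model-based estimation mentioned above; the records and function names are hypothetical.

```python
from statistics import median


def complete_rows(records, fields):
    """Option 1: keep only the records with no missing value in the given fields."""
    return [r for r in records if all(r.get(f) is not None for f in fields)]


def impute_median(records, field):
    """Option 2: fill missing values with the median of the observed values."""
    observed = [r[field] for r in records if r.get(field) is not None]
    fill = median(observed)
    return [{**r, field: fill} if r.get(field) is None else dict(r) for r in records]


# Hypothetical records with one missing income.
records = [
    {"id": 1, "income": 40000},
    {"id": 2, "income": None},
    {"id": 3, "income": 60000},
    {"id": 4, "income": 50000},
]
clean_subset = complete_rows(records, ["income"])
imputed = impute_median(records, "income")
print(len(clean_subset), imputed[1])
```

Whichever option is chosen, noting it next to the quality problem it resolves keeps the cleaning decisions traceable.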
Construct Data
After the data is cleaned, the data analyst should undertake data preparation operations such as developing entirely new records or producing derived attributes. An example of a new record would be the creation of an empty purchase record for customers who made no purchases during the past year. Derived attributes, in contrast, are new attributes that are constructed from existing attributes, such as Area = Length × Width. These derived attributes should only be added if they ease the modeling process or facilitate the modeling algorithm, not just to reduce the number of input attributes. For instance, perhaps “income per head” is a better or easier attribute to use than “income per household.” Another type of derived attribute is the single-attribute transformation, usually performed to fit the needs of the modeling tools. These transformations may be necessary to convert ranges to symbolic fields (e.g., ages to age bands), or symbolic fields (“definitely yes,” “yes,” “don’t know,” “no”) to numeric values. Modeling tools or algorithms often require these transformations.
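The three kinds of construction named above (a derived attribute, a range-to-band transformation, and a symbolic-to-numeric transformation) might look as follows. The band boundaries and the numeric scores are assumptions made for illustration, not values taken from the text.

```python
def income_per_head(household_income, household_size):
    """Derived attribute built from two existing attributes."""
    return household_income / household_size


def age_band(age):
    """Single-attribute transformation: numeric age to a symbolic band.

    The band boundaries are hypothetical.
    """
    if age < 18:
        return "under 18"
    if age < 35:
        return "18-34"
    if age < 55:
        return "35-54"
    return "55+"


# Symbolic answers mapped to numeric values; the scores are an assumption.
ANSWER_SCORES = {"definitely yes": 3, "yes": 2, "don't know": 1, "no": 0}


def encode_answer(answer):
    """Single-attribute transformation: symbolic field to a numeric value."""
    return ANSWER_SCORES[answer.strip().lower()]


print(income_per_head(90000, 3), age_band(34), encode_answer("Yes"))
```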
Integrate Data
Integrating data involves combining information from multiple tables or records to create new records or values. With table-based data, an analyst can join two or more tables that have different information about the same objects. For instance, a retail chain has one table with information about each store’s general characteristics (e.g., floor space, type of mall), another table with summarized sales data (e.g., profit, percent change in sales from previous year), and another table with information about the demographics of the surrounding area. Each of these tables contains one record for each store. These tables can be merged together into a new table with one record for each store, combining fields from the source tables.
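The store example can be sketched as a join on a shared key. The tables, the key name store_id, and the helper join_on_key are hypothetical; the sketch relies on the stated condition that each table holds one record per store.

```python
def join_on_key(key, *tables):
    """Merge tables that each hold one record per key into one record per key."""
    merged = {}
    for table in tables:
        for row in table:
            merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]


# Hypothetical tables, one record per store in each.
characteristics = [
    {"store_id": 1, "floor_space": 1200, "mall_type": "strip"},
    {"store_id": 2, "floor_space": 3400, "mall_type": "enclosed"},
]
sales = [
    {"store_id": 1, "profit": 81000, "pct_change": 4.2},
    {"store_id": 2, "profit": 153000, "pct_change": -1.1},
]
demographics = [
    {"store_id": 1, "median_age": 37},
    {"store_id": 2, "median_age": 44},
]
stores = join_on_key("store_id", characteristics, sales, demographics)
print(stores[0])
```

If a store were missing from one of the tables, this sketch would still emit a record for it, only with fewer fields, so checking the field count per record is a cheap sanity test.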
Data integration also covers aggregations. Aggregations refer to operations where new values are computed by summarizing information from multiple records and/or tables. For example, an aggregation could include converting a table of customer purchases, where there is one record for each purchase, into a new table where there is one record for each customer. The table’s fields could include the number of purchases, the average purchase amount, the percent of orders charged to credit cards, the percent of items under promotion, etc.
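The purchase example can be sketched as a group-and-summarize step. The records and field names are hypothetical, and only three of the summary fields mentioned above are computed.

```python
from collections import defaultdict


def summarize_by_customer(purchases):
    """Turn one record per purchase into one summary record per customer."""
    grouped = defaultdict(list)
    for purchase in purchases:
        grouped[purchase["customer_id"]].append(purchase)

    summary = {}
    for customer_id, rows in grouped.items():
        n = len(rows)
        summary[customer_id] = {
            "n_purchases": n,
            "avg_amount": sum(r["amount"] for r in rows) / n,
            "pct_credit_card": 100 * sum(r["paid_by_card"] for r in rows) / n,
        }
    return summary


# Hypothetical purchase records.
purchases = [
    {"customer_id": "A", "amount": 20.0, "paid_by_card": True},
    {"customer_id": "A", "amount": 40.0, "paid_by_card": False},
    {"customer_id": "B", "amount": 10.0, "paid_by_card": True},
]
per_customer = summarize_by_customer(purchases)
print(per_customer)
```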
Format Data
In some cases, the data analyst will change the format or design of the data. These changes might be simple—for example, removing illegal characters from strings or trimming them to a maximum length—or they may be more complex, such as those involving a reorganization of the information. Sometimes these changes are needed to make the data suitable for a specific modeling tool. In other instances, the changes are needed to pose the necessary data mining questions.
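The simple case, removing illegal characters and trimming to a maximum length, can be sketched as below. Which characters count as illegal and what the maximum length is depend on the modeling tool; the pattern and the limit used here are assumptions.

```python
import re


def clean_string(value, max_length=20, illegal=r"[^A-Za-z0-9 _-]"):
    """Remove characters matching the illegal pattern, then trim to max_length."""
    cleaned = re.sub(illegal, "", value)
    return cleaned[:max_length]


print(clean_string("Acme, Inc.\t(UK)"))
print(clean_string("Acme, Inc.\t(UK)", max_length=6))
```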

Phase Four: Modeling
In this phase, various modeling techniques are selected and applied and their parameters are calibrated to optimal values. Typically, several techniques exist for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase may be necessary. Modeling steps include the selection of the modeling technique, the generation of test design, the creation of models, and the assessment of models.
Select the Modeling Technique
This task refers to choosing one or more specific modeling techniques, such as decision tree building with C4.5 or neural network generation with back propagation. If assumptions are attached to the modeling technique, these should be recorded.
Generate Test Design
Before building a model, the data analyst must define how the model’s quality and validity will be tested, planning the empirical testing that will determine the strength of the model. In supervised data mining tasks such as classification, it is common to use error rates as quality measures for data mining models. Therefore, the data set is typically separated into a training set and a test set; the model is built on the training set, and its quality is estimated on the separate test set. In other words, the data analyst develops the model based on one set of existing data and tests its validity using a separate set of data. This enables the data analyst to measure how well the model can predict history before using it to predict the future. Designing the test procedure before the model is built also has implications for data preparation.
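A test design of this kind can be sketched with a random split and an error rate. The majority-class "model" below is only a placeholder for a real modeling technique, and the data, the 30 percent test fraction, and the fixed seed are all hypothetical choices.

```python
import random
from collections import Counter


def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into train and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_majority_class(train):
    """Placeholder model: always predict the most common training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def error_rate(predicted_label, test):
    """Fraction of test records whose label differs from the prediction."""
    wrong = sum(1 for _, label in test if label != predicted_label)
    return wrong / len(test)


# Hypothetical labeled data: (features, label) pairs.
data = [((i,), "stay" if i % 4 else "leave") for i in range(40)]
train, test = train_test_split(data)
model = fit_majority_class(train)
print(len(train), len(test), error_rate(model, test))
```

Fixing the seed makes the split repeatable, so that different models are compared on exactly the same test records.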
Build the Model
Once the test design is in place, the data analyst runs the modeling tool on the prepared data set to create one or more models.
Assess the Model
The data mining analyst interprets the models according to his or her domain knowledge, the data mining success criteria, and the desired test design. The data mining analyst judges the success of the application of modeling and discovery techniques technically, but he or she should also work with business analysts and domain experts in order to interpret the data mining results in the business context. The data mining analyst may even choose to have the business analyst involved when creating the models for assistance in discovering potential problems with the data.
For example, a data mining project may test the factors that affect bank account closure. If data is collected at different times of the month, it could cause a significant difference in the account balances of the two data sets collected. (Because individuals tend to get paid at the end of the month, the data collected at that time would reflect higher account balances.) A business analyst familiar with the bank’s operations would note such a discrepancy immediately.
In this phase, the data mining analyst also tries to rank the models. He or she assesses the models according to the evaluation criteria and takes into account business objectives and business success criteria. In most data mining projects, the data mining analyst applies a single technique more than once or generates data mining results with different alternative techniques. In this task, he or she also compares all results according to the evaluation criteria.
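Ranking can be as simple as sorting the candidate models on the agreed criteria. In the sketch below, the model names, error rates, and the business-criterion flag are invented for illustration, and the rule of placing models that meet the business criterion first is one assumed way of combining the two kinds of criteria.

```python
# Hypothetical assessment results for three candidate models.
results = [
    {"model": "decision tree", "error_rate": 0.18, "meets_business_criterion": True},
    {"model": "neural network", "error_rate": 0.15, "meets_business_criterion": False},
    {"model": "pruned decision tree", "error_rate": 0.16, "meets_business_criterion": True},
]


def rank_models(results):
    """Order models: business criterion met first, then lowest error rate."""
    return sorted(
        results,
        key=lambda r: (not r["meets_business_criterion"], r["error_rate"]),
    )


ranking = [r["model"] for r in rank_models(results)]
print(ranking)
```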

Phase Five: Evaluation
Before proceeding to final deployment of the model built by the data analyst, it is important to evaluate the model more thoroughly and review the model’s construction to be certain it properly achieves the business objectives. Here it is critical to determine if some important business issue has not been sufficiently considered. At the end of this phase, the project leader then should decide exactly how to use the data mining results. The key steps here are the evaluation of results, the process review, and the determination of next steps.
Evaluate Results
Previous evaluation steps dealt with factors such as the accuracy and generality of the model. This step assesses the degree to which the model meets the business objectives and determines if there is some business reason why this model is deficient. Another option here is to test the model(s) on real-world applications—if time and budget constraints permit. Moreover, evaluation also seeks to unveil additional challenges, information, or hints for future directions.
At this stage, the data analyst summarizes the assessment results in terms of business success criteria, including a final statement about whether the project already meets the initial business objectives.
Review Process
It is now appropriate to do a more thorough review of the data mining engagement to determine if there is any important factor or task that has somehow been overlooked. This review also covers quality assurance issues (e.g., did we correctly build the model? Did we only use allowable attributes that are available for future deployment?).
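The question about allowable attributes lends itself to a mechanical check: compare the attributes the model uses with those that will exist at deployment time. The attribute names below are hypothetical.

```python
def unavailable_at_deployment(model_inputs, deployment_attributes):
    """Return model inputs that will not be available when the model is deployed."""
    return sorted(set(model_inputs) - set(deployment_attributes))


# Hypothetical attribute lists.
model_inputs = ["age", "primary_channel", "account_closed_date"]
deployment_attributes = ["age", "primary_channel", "balance"]

problems = unavailable_at_deployment(model_inputs, deployment_attributes)
print(problems)
```

An empty result means every model input can be supplied in production; any name that is returned points to an attribute the model should not depend on.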
Determine Next Steps
At this stage, the project leader must decide whether to finish this project and move on to deployment or whether to initiate further iterations or set up new data mining projects.

Phase Six: Deployment
Model creation is generally not the end of the project. The knowledge gained must be organized and presented in a way that the customer can use it, which often involves applying “live” models within an organization’s decision-making processes, such as the real-time personalization of Web pages or repeated scoring of marketing databases.
Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise. Even though it is often the customer, not the data analyst, who carries out the deployment steps, it is important for the customer to understand up front what actions must be taken in order to actually make use of the created models. The key steps here are planning deployment, planning monitoring and maintenance, producing the final report, and reviewing the project.
Plan Deployment
In order to deploy the data mining result(s) into the business, this task takes the evaluation results and develops a strategy for deployment.
Plan Monitoring and Maintenance
Monitoring and maintenance are important issues if the data mining result is to become part of the day-to-day business and its environment. A carefully prepared maintenance strategy avoids incorrect usage of data mining results.
Produce Final Report
At the end of the project, the project leader and his or her team write up a final report. Depending on the deployment plan, this report may be only a summary of the project and its experiences (if they have not already been documented as an ongoing activity) or it may be a final and comprehensive presentation of the data mining result(s). This report includes all of the previous deliverables and summarizes and organizes the results. Also, there often will be a meeting at the conclusion of the project, where the results are verbally presented to the customer.
Review Project
The data analyst should assess failures and successes as well as potential areas of improvement for use in future projects. This step should include a summary of important experiences during the project and can include interviews with the significant project participants. This document could include pitfalls, misleading approaches, or hints for selecting the best-suited data mining techniques in similar situations. In ideal projects, experience documentation also covers any reports written by individual project members during the project phases and tasks.