Supporting the Management of Predictive Analytics Projects in a Decision-Making Center using Process Mining

Article history: Received: 11 January, 2021 Accepted: 12 April, 2021 Online: 28 April, 2021
A Decision-Making Center (DMC) environment facilitates stakeholders' decision-making processes using predictive models and diverse what-if scenarios. An essential element of this environment is the management of Decision Support Components (e.g., models or systems), which need to be created with mature methodologies and within a good delivery time. However, there has been a gap in the understanding of project management best practices in DMC environments and in the application of methodologies to ease project execution. In this paper, we address that gap by analyzing six predictive analytics projects executed in a Mexican DMC using Process Mining techniques. We perform process discovery using a detailed activity event log, which has not been possible in previous studies. Additionally, we perform a compliance evaluation against the de facto methodology to identify the current process alignment gaps, and finally, we analyze the social networks present in the process execution. The research reveals that (1) process mining models are helpful to address management issues of Predictive Analytics/Data Mining (PA/DM) projects; (2) PA/DM projects require alignment to mature methodologies to improve process performance and avoid execution problems; and (3) PA/DM project execution should be revised at the activity level to identify issues and to propose specific strategies. This study's findings can help project managers to perform process analyses and to make informed decisions in PA/DM projects. This paper is an extension of the article "Applying Process Mining to Support Management of Predictive Analytics/Data Mining Projects in a Decision-Making Center" presented at the 2019 International Conference on Systems and Informatics (ICSAI 2019).


Introduction
Decision-Making Centers (DMCs) are immersive virtual environments used to understand complex problems, simplify decision-making, and visualize the results of predictive and scenario-based models [1]. To operate, these environments depend on the process of creating tools such as Predictive Analytics/Data Mining (PA/DM) models [2]. Nevertheless, the authors have shown in previous studies that DMC processes focus on high-level tasks and exclude detailed and standard PA/DM activities [2]. The absence of commonality in PA/DM project execution generates issues, since (1) models are built using empirical methodologies and (2) managers cannot follow up on specific technical activities because they differ in every project.
In this research, we propose three approaches to overcome these issues and help managers and modelers make informed decisions about PA/DM projects. First, we apply process mining techniques to a set of PA/DM processes to discover the timing, flow, frequency, and performance of activities from diverse perspectives (e.g., process, organizational, and case). Second, we compare a real PA/DM project execution with an accepted PA/DM methodology (i.e., CRISP-DM) to identify how well the real processes align with the formal methodology and what gaps need to be closed to achieve compliance. Third, we perform complementary human resources analyses to visualize the relationships between resources and the communication channels during process execution.
We expect managers in DMCs to use the models presented in this study to evaluate their processes and to consider the implementation of specific management strategies.

ASTESJ ISSN: 2415-6698
The remainder of the paper is organized as follows: Sections 2 and 3 present the background and the literature review, respectively. Section 4 describes the experimental design. Section 5 provides the results and discussion, and Section 6 presents conclusions and future work.

Process Mining Techniques and Project Management applications
Process Mining is a reverse engineering approach in which process models are generated from event logs [2]. In [3], the author classifies the process mining techniques that we use for our analysis: discovery, conformance, and enhancement.
The process discovery technique aims to mine process models using discovery algorithms, so that the resulting model helps managers answer specific questions [4]. Examples of discovery algorithms include the alpha algorithm, heuristic miner, fuzzy miner, genetic miner, region miner, and integer linear programming (ILP) miner. In contrast, the process conformance technique aims to measure process quality through metrics such as fitness, precision, generalization, and simplicity [4]. In this category, conformance checking is used to compare the expected model with the reality observed in event logs, which makes it possible to identify commonalities, similarities, and deviations among processes [3]. Finally, the process enhancement technique aims to extend the model with relevant information [4], for instance, statistical metrics based on timestamps (e.g., throughput time, working time, and waiting time) or replay analysis to visualize process execution. In this research, we use the Disco and ProM 6 tools to implement the process mining techniques explained above.
Finally, in [5], the authors explain that the project management field requires the process mining discipline to identify optimal workflows within project life cycles. In this regard, we consider that the following managerial issues identified in PA/DM projects can be analyzed using process mining techniques: establishing realistic goals, creating good teams, gaining knowledge of the data, lack of infrastructure, poor project communication methodology, and lack of risk management and change management [6], [7].
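To make the discovery and enhancement ideas above concrete, the following minimal sketch counts directly-follows relations (the raw material of many discovery algorithms) and computes a throughput time per case. It is only an illustration under stated assumptions: the event log is a hypothetical toy example, the function names are ours, and it is not the algorithm implemented by Disco or ProM.

```python
from collections import Counter
from datetime import datetime

# Hypothetical toy event log: (case id, activity, date). Not data from the study.
EVENT_LOG = [
    ("P1", "Collect initial data", "2016-04-25"),
    ("P1", "Explore data", "2016-05-02"),
    ("P1", "Select modeling technique", "2016-05-16"),
    ("P1", "Build model", "2016-06-06"),
    ("P2", "Collect initial data", "2017-01-09"),
    ("P2", "Select modeling technique", "2017-01-23"),
    ("P2", "Build model", "2017-02-13"),
]


def traces_by_case(log):
    """Group events by case and order them by timestamp."""
    cases = {}
    for case_id, activity, ts in log:
        cases.setdefault(case_id, []).append((datetime.fromisoformat(ts), activity))
    return {case_id: sorted(events) for case_id, events in cases.items()}


def directly_follows(log):
    """Count how often activity a is immediately followed by activity b."""
    counts = Counter()
    for events in traces_by_case(log).values():
        for (_, a), (_, b) in zip(events, events[1:]):
            counts[(a, b)] += 1
    return counts


def throughput_days(log):
    """Days elapsed between the first and the last event of each case."""
    return {
        case_id: (events[-1][0] - events[0][0]).days
        for case_id, events in traces_by_case(log).items()
    }


if __name__ == "__main__":
    for (a, b), n in directly_follows(EVENT_LOG).most_common():
        print(f"{a} -> {b}: {n}")
    print(throughput_days(EVENT_LOG))
```

Pairs with high counts correspond to the thick arrows of a process map, while the throughput values are one example of the timestamp-based metrics used for enhancement.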

CRISP-DM Framework
The Cross-Industry Standard Process for Data Mining (CRISP-DM) is the most accepted methodology in the field for executing data mining projects [8]-[10]. In the framework, the project life cycle includes the following key phases [11]: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. The methodology provides a universal process with generic tasks that can be executed in all data mining projects. The CRISP-DM reference model is shown in Figure 1, and the generic tasks to be performed in each phase are listed in Table 1. The generic tasks are defined at the activity level, which facilitates their integration into high-level DMC processes. In this study, we examine the data understanding, data preparation, modeling, and evaluation phases of the CRISP-DM methodology because of information constraints.

PA/DM processes with CRISP-DM
In [2], we discussed the importance of integrating PA/DM processes with DMC processes and made an integration effort. However, PA/DM processes at the activity level have not been studied separately. In this study, we focus only on PA/DM processes, as a part of DMC processes, since no work in the literature performs such an analysis using process mining techniques. Finally, we assume that the CRISP-DM methodology matches the DMC's PA/DM processes.

Literature review
Limited research has focused on project management using process mining techniques. In the literature, studies concentrate on the analysis of project management processes using data mining [12]-[14] and predictive analytics [15]-[17] techniques, but not process mining techniques. Likewise, most analyses focus on software life cycles [12]-[18], but not on PA/DM processes. For instance, in [18], the authors use conformance checking techniques to reveal aspects of processes and identify deviations in software project execution. Additionally, the work in [5] presents an application to optimize the software development life cycle of projects using process mining. However, in both cases, the research is not focused on PA/DM projects or DMC environments.
Finally, the authors developed a previous study to analyze PA/DM processes in a DMC [2]. Nevertheless, that paper analyzes project phases rather than project activities, which limits the possibility of discovering low-level issues and applying targeted strategies. Besides, that study relies mostly on enhancement techniques and a limited number of discovery algorithms. In the present study, we address those limitations.

Experimental Design
The DMC located at Tecnologico de Monterrey, Mexico City campus, also referred to as the "Decision Laboratory", is a room with seven large-format, high-definition screens that offers a space to make consensual decisions and to present solution proposals to a group of decision-makers, with the goal of selecting the best possible solution [19]. A picture of Tecnologico de Monterrey's DMC is shown in Figure 2. In the Mexican DMC, managers execute PA/DM projects to create models that support decision-making. The team organization is defined according to the knowledge of the resources and their affinity for developing specific models. The project manager role is performed by one resource, and one or more product owners define the business requirements. After project execution, the modelers report that no formal methodology is applied to create the models.
Even though project managers try to deliver quality models on time, modelers and the supervisor report the following issues during project execution: (1) wrong selection of modeling technique and (2) lack of standardization of the data glossary among models.
With this in mind, we examine the possible causes behind the reported issues and perform a complete analysis of the PA/DM process execution in the Mexican center.

Questions to be answered
For this experiment, we aim to answer the following questions about the execution of PA/DM projects in a DMC.
1) RQ1: What do the dependency, frequency, and performance statistics of the process model reveal?
2) RQ2: How compliant is the discovered model vs. the CRISP-DM reference model?
3) RQ3: How is the interaction among resources during process execution?
4) RQ4: What are the possible causes for the reported issues?

Information Gathering
We obtained qualitative and quantitative information from six real PA projects executed at Tecnologico de Monterrey DMC by interviewing modelers and managers. The format utilized for quantitative data gathering is available in Appendix A.
During this phase, five modelers and one manager were interviewed. The requested data include information from four CRISP-DM process stages (i.e., data understanding, data preparation, modeling, and evaluation), since we do not have access to data from the business understanding and deployment phases. The following data was obtained from stakeholders: start and finish dates of activities, the average number of hours per activity per day, and the number of resources involved in each activity. At the end of each interview, we requested impressions about the execution processes to identify specific issues.

Event Log Generation
We create 4945 records with timestamps based on the provided information. The records per project are listed in Table 2. In this phase, no assumptions were needed since the modelers provided specific times and dates for each activity.
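As an illustration of how interview data of this kind (start and finish dates, hours per day, and resources per activity) could be turned into timestamped event records, consider the sketch below. The expansion rule is our assumption, not the procedure used in the study: one event per resource per weekday, starting at 09:00. The function name, field names, and the sample record are hypothetical.

```python
from datetime import date, datetime, time, timedelta


def expand_activity(case_id, activity, start, finish, hours_per_day, resources):
    """Return one event per resource per weekday from start to finish (inclusive)."""
    events = []
    day = start
    while day <= finish:
        if day.weekday() < 5:  # assumption: work happens Monday to Friday
            begin = datetime.combine(day, time(9, 0))  # assumption: 09:00 start
            for resource in resources:
                events.append({
                    "case": case_id,
                    "activity": activity,
                    "resource": resource,
                    "start": begin,
                    "end": begin + timedelta(hours=hours_per_day),
                })
        day += timedelta(days=1)
    return events


# Hypothetical interview record, not taken from the study's data.
rows = expand_activity(
    "P1", "Clean data", date(2016, 4, 25), date(2016, 5, 1), 4, ["M1", "M2"]
)

if __name__ == "__main__":
    for row in rows:
        print(row["case"], row["activity"], row["resource"], row["start"], row["end"])
```

Each resulting row carries the case, activity, resource, and timestamp fields that process mining tools typically expect when importing an event log.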

Event Log Analysis
We use the Disco and ProM 6 applications to perform process mining. The first is a commercial tool that provides accurate process models [20], while the second supports other types of functionality, such as Petri nets and social networks [21]. Table 3 shows the modules used in each application. We use the Map view of the Disco application to visualize the flow of activities, dependencies, frequencies, and performance. Likewise, the Statistics view is used to identify the process event distribution, the activities, and the frequency of resources. Finally, we use the Filtering functionality to analyze the process model by specific cases. From the ProM application, we use the Social Network Miner to identify relationships among resources.
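As a rough illustration of what a social network analysis over an event log computes, the sketch below counts handovers of work, i.e., how often one resource is followed by a different resource within the same case. This is an assumption for illustration only: the text does not state which metric of the Social Network Miner was applied, and the events, resource labels, and function names below are hypothetical.

```python
from collections import Counter

# Hypothetical events, ordered in time within each case: (case, activity, resource).
EVENTS = [
    ("P1", "Collect initial data", "M1"),
    ("P1", "Clean data", "M2"),
    ("P1", "Build model", "M2"),
    ("P1", "Review process", "PO1"),
    ("P2", "Collect initial data", "M1"),
    ("P2", "Build model", "M3"),
    ("P2", "Review process", "PO1"),
]


def handover_of_work(events):
    """Count transfers of work between different resources within a case."""
    counts = Counter()
    last_resource = {}  # most recent resource seen for each case
    for case_id, _activity, resource in events:
        previous = last_resource.get(case_id)
        if previous is not None and previous != resource:
            counts[(previous, resource)] += 1
        last_resource[case_id] = resource
    return counts


def connections(handovers):
    """Number of distinct colleagues each resource exchanges work with."""
    neighbours = {}
    for giver, receiver in handovers:
        neighbours.setdefault(giver, set()).add(receiver)
        neighbours.setdefault(receiver, set()).add(giver)
    return {resource: len(others) for resource, others in neighbours.items()}


if __name__ == "__main__":
    handovers = handover_of_work(EVENTS)
    print(dict(handovers))
    print(connections(handovers))
```

Resources that appear in the event log but never show up in the `connections` result would correspond to isolated nodes in the resulting network.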

Results and Discussion
During process analysis, we document the global statistics shown in Table 4. The number of events represents the total records in the event log, and the cases correspond to the number of processes. The activities represent 16 generic tasks of the following phases of the CRISP-DM methodology: data understanding, data preparation, modeling, and evaluation. The examined projects were executed between April 25th, 2016 and June 30th, 2019. The statistics reveal that, on average, the project duration is 30 weeks. In addition, the statistics per activity shown in Table 5 reveal the most, the average, and the least executed activities in the project. In the following subsections, we respond to the defined research questions.

RQ1: What do dependency, frequency, and performance statistics of the process model reveal?
Figure 3 shows the process map of the event log. There are four thick arrows in the diagram that represent significant dependence among activities. For instance, the strongest dependency, and the only bidirectional one, is present between the process review and the determination of next steps activities. This strong reliance means that the execution order of those tasks is the same in the majority of cases. Besides, managers should pay attention to the iteration that involves all data manipulation activities (i.e., collect, explore, verify, select, clean, construct, and integrate data) and the modeling technique selection. The evidence suggests that the team is having trouble gaining knowledge of the data, which is a common problem in these kinds of projects. We can assume that the lack of consistent execution of the data description and data selection activities could be the cause of the described problem.
Lastly, the diagram shows a dependency between the modeling technique selection and the initial data collection, which should be revised. Strangely, a change in data impacts the modeling technique and also the model construction. We recommend the inclusion of roles with expertise in modeling techniques to break that dependency.
The absolute frequency of activities is represented with color. A high-frequency task is painted in intense blue, while a low-frequency task is depicted in light blue. For this process model, the activity with the highest frequency is the model construction, followed by the evaluation of results and the modeling technique selection. As mentioned, the model construction is affected by previously executed or non-executed activities. We can assume that improving those previous activities has a positive impact on the construction activity. In this case, we recommend using lean prototypes to facilitate the technique selection and reduce the time devoted to the model construction.
Finally, the performance of the model is shown in Figure 4. The model shows the total task duration and the delays between activities. The model construction is the longest task, with 31.9 weeks. Likewise, a delay of 20.6 weeks is present between the select modeling technique and collect initial data activities. In this case, managers should focus on reducing the time between those two activities by involving more resources and/or experts in the project.
A second delay is exposed between the process review and the definition of the next steps; however, this case should be analyzed separately since all resources execute these activities simultaneously, and that variable could affect the metric so that it does not represent the real delay.

RQ2: How compliant is the discovered model vs. the CRISP-DM reference model?
Figure 5 shows the process map by case frequency, which is useful to analyze compliance. We examine six cases, and theoretically, all activities should be present in all cases; in reality, they are not. Specifically, for this DMC, we discovered that the activities with the lowest presence in the execution are data formatting and data description. The latter represents an issue in subsequent tasks, since modelers report that the lack of data description has delayed the model's development and integration processes. Likewise, the data cleaning and construction activities have compliance problems in one of the cases, so DMC managers need to review these deviations. Finally, the flow of activities has some compliance issues. In contrast to CRISP-DM, the actual execution is iterative in the data understanding and preparation phases. With this information, managers can create initiatives to align the process to CRISP-DM in stages and increase its performance.
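The case-frequency view of compliance can be approximated with a few lines of code. The sketch below counts, for each generic task, the number of cases in which it appears and lists the tasks missing from each case. The task list follows the 16 generic tasks of the four CRISP-DM phases examined here as we recall them from the CRISP-DM guide; the cases themselves are hypothetical and do not reproduce the study's event log.

```python
# Generic tasks of the four examined CRISP-DM phases (assumed from the CRISP-DM guide).
CRISP_DM_TASKS = [
    "Collect initial data", "Describe data", "Explore data", "Verify data quality",
    "Select data", "Clean data", "Construct data", "Integrate data", "Format data",
    "Select modeling technique", "Generate test design", "Build model", "Assess model",
    "Evaluate results", "Review process", "Determine next steps",
]

# Hypothetical cases: the set of activities observed in each project.
ALL_TASKS = set(CRISP_DM_TASKS)
CASES = {
    "P1": ALL_TASKS,
    "P2": ALL_TASKS - {"Describe data", "Format data"},
    "P3": ALL_TASKS - {"Format data"},
}


def case_frequency(cases, reference):
    """Number of cases in which each reference task occurs at least once."""
    return {task: sum(task in observed for observed in cases.values())
            for task in reference}


def missing_tasks(cases, reference):
    """Reference tasks that never occur in each case, in reference order."""
    return {case_id: [task for task in reference if task not in observed]
            for case_id, observed in cases.items()}


if __name__ == "__main__":
    frequency = case_frequency(CASES, CRISP_DM_TASKS)
    for task in sorted(frequency, key=frequency.get):
        print(f"{task}: present in {frequency[task]} of {len(CASES)} cases")
    print(missing_tasks(CASES, CRISP_DM_TASKS))
```

This only checks the presence of activities; deviations in the order of activities would require a conformance checking technique against the reference model.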

RQ3: How is the interaction among resources during process execution?
To answer this question, we use the Social Network mining capability of ProM. As shown in Figure 6, the two product owners are critical intermediaries in the project execution. The project manager (PM) role is key in the process; however, the manager seems distant from individual modelers. On the other hand, modelers M1 and M2 seem to be more connected to the group. This can have two explanations: (1) the existence of a functional dependency among parties (e.g., infrastructure, software, etc.) or (2) the modeler has participated in several projects, which allows them to collaborate with more people. Finally, managers must pay attention to the isolated modelers M8 and M9 and understand why they are separated from the group.

RQ4: What are the possible causes for the reported issues?
We analyze the possible causes of the reported issues using the previous process maps.
Wrong selection of modeling technique: It seems that the modeling technique selection is an exploratory process that takes too much time to define. As recommended above, the inclusion of an expert and the implementation of prototypes can address this problem, since the modeling technique can be evaluated with a prototype and with the expert's support.
Lack of standardization of data glossary among models: This problem is caused by the lack of execution of the data documentation activity. We believe that an alignment to CRISP-DM methodology can solve this problem.
Finally, it is relevant to mention the limitations of the present study, which can be addressed in future research.
First, the presented model represents PA/DM execution processes of DMCs exclusively, so PA/DM processes outside this environment are not represented in the research. Second, we do not include the business understanding and deployment phases in the modeling given data restrictions, so we represent only part of the PA/DM process in this study. Lastly, the absence of previous research limits the possibility of comparing and contrasting our model with others and of evaluating its completeness.

Conclusions and Future Work
In this research, we reveal the value of process mining as a tool to support project management of PA/DM projects in DMCs. Likewise, we expose the need to implement mature PA/DM processes in DMCs that facilitate (1) project management and (2) process improvement.
In this study, we create a process model to identify project execution issues, gaps in compliance, and the interaction of resources. We perform interviews with modelers and managers to obtain detailed data about the PA process execution. An event log at the activity level was created considering CRISP-DM generic tasks, timestamps, and resources. The Disco application was used to apply process discovery and process enhancement techniques. The ProM application was used to perform Social Network mining. The results of the study reveal that: (1) process mining models are helpful to analyze and address common management issues of PA/DM projects; (2) PA/DM projects require alignment to mature methodologies to improve process performance and avoid execution problems; (3) PA/DM project execution should be revised at the activity level to identify issues and to propose specific strategies; and (4) PA/DM projects should be analyzed from different perspectives to obtain valuable information for the management team.
Although this study shows that Process Mining techniques are useful tools to support the management of PA/DM projects, there is work that needs to be addressed in the future. For instance, we need to use the ProM tool to obtain compliance metrics of the process models. Likewise, we need to use additional social network algorithms to analyze other organizational relationships.