Using Naming Patterns for Identifying Architectural Technical Debt

Hasty software development can produce quick implementations whose source code is unnecessarily complex and hard to read. These small forms of software decay generate a technical debt that can grow large enough to seriously affect future maintenance activities. This work presents an analysis technique for identifying architectural technical debt related to non-uniformity of naming patterns; the technique is based on term frequency over package hierarchies. The proposal has been evaluated on projects of two popular organizations, Apache and Eclipse. The results show that most of the projects have frequent occurrences of the proposed naming patterns, and that using a graph model with aggregated data enables simple queries for debt identification. The technique has features that favor its applicability to emergent architectures and agile software development.


Introduction
This paper is an extension of work originally presented at the 8th Euro American Conference on Telematics and Information Systems (EATIS) [38]. Taking an easy short-term solution in any phase of software development (e.g., requirements, design, implementation) can generate accumulated technical debt which, over time, can become large enough to affect future deliveries and make a successful outcome hard to achieve [6,24,37]. The debt comprises any aspect known to be inappropriate that has not been addressed in due time (e.g., complex source code that needs refactoring) [24]. Interest in this topic has increased over the years [36]. Frequently, technical debt is barely visible to decision makers in software development at the moment it is introduced [5]. Developing techniques for identifying and monitoring incidences of technical debt is therefore important for making the debt explicit, so that it can be resolved in due time [3,11,22,24,35,37].
Technical debt can be introduced by not complying with the architectural design, or by not using programming conventions or standards [35]. Including it as a decision factor in software development requires information about the incidences of technical debt in the software system, where they are located, and their magnitude; such information can be obtained through source code analysis [5].
The objective of this work is to present:
1. An analysis technique for identifying architectural technical debt by non-uniformity of patterns.
2. A set of naming patterns across the package hierarchy of the software system.

Architectural Technical Debt (ATD)
ATD is a kind of technical debt that comprises sub-optimal solutions regarding internal or external quality attributes defined in the intended architecture, mainly compromising the attributes of maintainability and evolvability [2,11].
Changes related to design qualities but not directly related to the external behavior of the system are frequently postponed or neglected to reduce the delivery time of the software system [3], increasing the incidences of ATD.
ATD is closely related to source code [24]; in practice, however, it is hard to identify because it does not provide observable behavior to final users [11,36], and it can change over time due to information obtained from implementation details [2]. Therefore, ATD cannot be completely identified at an initial stage [2].

ASTESJ ISSN: 2415-6698
In [2], a set of ATD types is introduced. Among them, ATD by non-uniformity of patterns relates to naming conventions that are applied in part of the system but not followed in other parts [2]. This instance of ATD is addressed in this work.
Furthermore, several agile approaches consider the architecture an emergent feature: there is no early design, but the source code is refactored and the architectural elements are refined [21]. Refactoring is a regular practice in agile approaches and is often applied to source code [1]; it contributes to the emergence of a successful architecture by improving the internal structure of the application, making the architectural elements more comprehensible, and avoiding architecture decay, especially in architectures that are only lightly defined [15,21]. Incomplete refactoring is itself a cause of ATD, since it can introduce part of the ATD and generate new debt [2]. Refactoring can be performed manually, semi-automatically, or fully automatically. The fully automatic approach carries out both the identification and the transformation of code elements, although a human still commits the modifications [1,16]. This work enables fully automatic refactoring, in which the identification is done by the proposed analysis and the transformation is a renaming of classes. Class renaming is a kind of global refactoring (i.e., it affects classes in more than one package) [10] at the API (Application Programming Interface) level [30]; it is often applied automatically in programming environments [8,30] for purposes of organization and conceptualization [25], and it stands out over other forms of refactoring by supporting software traceability [1].

Naming Patterns
As software evolves, its code becomes an up-to-date source of relevant information about the application domain [14]. Complex code is a major source of technical debt [22]; the correct use of naming conventions defined by the architecture speeds up and eases software comprehension activities [34]. Nevertheless, these conventions may not be followed throughout the software system. This phenomenon can be amplified in agile teams [2]: teams are empowered in terms of design, different development teams working in parallel accumulate differences in design and architecture, and naming policies are not always defined explicitly and formally, so divergences arise and require effort to resolve [2].
The relevance of class names lies in their effect on code legibility, portability, maintainability, and accessibility to new team members, and in relating the source code to the problem domain [19]. Industry experts also highlight the importance of identifier names in software [12,28,31]. This importance extends to architectural analysis, where identifying component terms is less complicated when identifiers are composed of complete words or meaningful acronyms [9,14]. The following subsections present a set of naming patterns inspired by the organization of source code into packages; the patterns are defined according to the frequent use of terms in class names within the underlying package hierarchy. Examples are taken from several real projects of the organizations Apache and Eclipse.

Pattern: Package
In this pattern, the term is often used by classes included in the same package. As an example, figure 1 shows packages of Apache MyFaces. Let f be the minimal frequency; T the set of terms used in class names; P the set of packages; C(p) the set of classes of p ∈ P; and C(p,t) the set of classes of p whose names contain the term t ∈ T. The terms t of this pattern are those for which |C(p,t)| / |C(p)| ≥ f and |C(p)| > 2.
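The condition above can be sketched in Python as follows. This is a minimal illustration and not the application used in the evaluation: the function names, the CamelCase splitting rule, and the package and class names in the example are hypothetical.

```python
import re
from collections import defaultdict

# Assumed CamelCase splitting rule: acronyms, capitalized words, lowercase runs.
TERM_PATTERN = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z0-9]*|[a-z0-9]+")


def split_terms(class_name):
    """Split a CamelCase class name into its terms."""
    return TERM_PATTERN.findall(class_name)


def frequent_terms_in_package(classes_by_package, f=0.8):
    """Return {(package, term): frequency} where |C(p,t)| / |C(p)| >= f and |C(p)| > 2."""
    result = {}
    for package, class_names in classes_by_package.items():
        if len(class_names) <= 2:  # the pattern requires |C(p)| > 2
            continue
        counts = defaultdict(int)
        for name in class_names:
            # Count each class at most once per term, as C(p,t) is a set of classes.
            for term in set(split_terms(name)):
                counts[term] += 1
        for term, count in counts.items():
            frequency = count / len(class_names)
            if frequency >= f:
                result[(package, term)] = frequency
    return result


# Hypothetical input: package name -> class names.
example = {
    "org.example.renderkit": [
        "HtmlButtonRenderer", "HtmlFormRenderer", "HtmlGridRenderer",
        "HtmlTextRenderer", "RendererUtils",
    ],
    "org.example.util": ["Strings", "Dates"],
}
frequent = frequent_terms_in_package(example)
```

In this example, Renderer appears in all five classes of the first package and Html in four of them, so both reach the minimal frequency of 0.8; the second package is skipped because it has only two classes.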

Pattern: Package Name
The term is often used by classes included in packages with same name. Figure

Pattern: Package Name and Level
The term is often used by classes included in packages with the same name at the same level of the package hierarchy. As an example, figure 3 shows packages of Apache Hadoop. Let N be the set of package levels; G(n,m) the set of packages with name m that are located at level n ∈ N; and G(n,m,t) the set of packages with name m, at level n, that contain classes having the term t in their names. The terms t of this pattern are those for which |G(n,m,t)| / |G(n,m)| ≥ f and |G(n,m)| > 2.
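A possible reading of this definition is sketched below in Python. It assumes that the terms of each package have already been extracted, that the level of a package is the number of dot-separated segments in its qualified name, and that its name is the last segment; the function name and the example packages are hypothetical.

```python
from collections import defaultdict


def frequent_terms_by_name_and_level(terms_by_package, f=0.8):
    """Return {(level, name, term): frequency} where |G(n,m,t)| / |G(n,m)| >= f and |G(n,m)| > 2."""
    # Build G(n, m): packages grouped by (level, simple name).
    groups = defaultdict(list)
    for package in terms_by_package:
        segments = package.split(".")
        groups[(len(segments), segments[-1])].append(package)

    result = {}
    for (level, name), packages in groups.items():
        if len(packages) <= 2:  # the pattern requires |G(n,m)| > 2
            continue
        counts = defaultdict(int)
        for package in packages:
            # Each package contributes at most once per term.
            for term in set(terms_by_package[package]):
                counts[term] += 1
        for term, count in counts.items():
            frequency = count / len(packages)
            if frequency >= f:
                result[(level, name, term)] = frequency
    return result


# Hypothetical input: qualified package name -> terms used in its class names.
example = {
    "org.a.protocol": {"Client", "Protocol"},
    "org.b.protocol": {"Datanode", "Protocol"},
    "org.c.protocol": {"Protocol", "Server"},
    "org.d.e.protocol": {"Protocol"},
}
frequent = frequent_terms_by_name_and_level(example)
```

Only the three packages named protocol at level 3 form a group large enough to be considered; the package at level 4 is alone in its group and is ignored.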

Pattern: Package immediately superior
The term is often used by classes included in packages that are located in the same superior package. Figure 5 shows packages of Eclipse BPMN2. Let H(p) be the set of packages located in package p ∈ P; and H(p,t) the set of packages located in p that contain classes using the term t in their names. The terms t of this pattern are those for which |H(p,t)| / |H(p)| ≥ f and |H(p)| > 2.
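The same kind of computation applies here, grouping packages by their immediate parent. The Python sketch below assumes that "located in p" means direct sub-packages of p and that the terms per package are already available; the function name and the example data are hypothetical.

```python
from collections import defaultdict


def frequent_terms_by_parent(terms_by_package, f=0.8):
    """Return {(parent, term): frequency} where |H(p,t)| / |H(p)| >= f and |H(p)| > 2."""
    # Build H(p): direct sub-packages grouped by their immediately superior package.
    children = defaultdict(list)
    for package in terms_by_package:
        if "." in package:
            parent = package.rsplit(".", 1)[0]
            children[parent].append(package)

    result = {}
    for parent, packages in children.items():
        if len(packages) <= 2:  # the pattern requires |H(p)| > 2
            continue
        counts = defaultdict(int)
        for package in packages:
            for term in set(terms_by_package[package]):
                counts[term] += 1
        for term, count in counts.items():
            frequency = count / len(packages)
            if frequency >= f:
                result[(parent, term)] = frequency
    return result


# Hypothetical input: qualified package name -> terms used in its class names.
example = {
    "org.x.ui.editor": {"Property", "Editor"},
    "org.x.ui.views": {"Property", "View"},
    "org.x.ui.wizards": {"Property", "Wizard"},
    "org.x.core.model": {"Model"},
}
frequent = frequent_terms_by_parent(example)
```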

Analysis Procedure
The analysis procedure performs the following steps: reading of packages and classes; creation of a graph of packages and classes; creation of a graph of terms with aggregated nodes; and querying of frequent terms and their frequency in the graph. The following subsections provide more detail about the relevant features.

Graph based storage
The extracted terms are stored in a graph database; this model was chosen for its visualization capabilities and for the ease of adding labels to nodes and creating nodes with aggregated data. CQL (i.e., Code Query Language) has been developed to perform exhaustive analysis on source code [27]. However, querying the source code directly without aggregating data could affect response time. In this work, the graph query language is used as a CQL in order to take full advantage of the database query mechanisms, which are designed to manage considerable amounts of data, and to visualize the results graphically. Moreover, having a graph enables software architects to query and visualize the data for purposes beyond this work.

Analysis of Term Frequency
The procedure for identifying frequent terms uses an analysis based on term frequency with collection range [32], that is, frequency related to the number of times that a term occurs in a collection (e.g., names of classes organized in packages). The frequency is computed by taking the percentage of term occurrences in the same package (pattern Package) or in several packages. For each occurrence, the term position inside the name is considered (e.g., for ClientProtocol, the term Protocol is located in the second position).
The graph of terms is created by querying the names of classes and storing the occurrences of terms. New nodes are created that aggregate the number of occurrences for each naming pattern: by package, by package name, by package name and level, and by package immediately superior (these nodes are called "aggregated nodes"); this data aggregation simplifies the graph queries. Aggregated nodes are then labeled as frequent terms when they reach the minimal frequency.
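As an illustration of how terms and their positions could be extracted from a class name, the following Python sketch splits a CamelCase identifier and numbers its terms starting at 1. The splitting rule and the function name are assumptions made for this example, not the implementation used in the evaluation.

```python
import re

# Assumed CamelCase splitting rule: acronyms, capitalized words, lowercase runs.
TERM_PATTERN = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z0-9]*|[a-z0-9]+")


def terms_with_positions(class_name):
    """Return the terms of a CamelCase name paired with their 1-based position."""
    terms = TERM_PATTERN.findall(class_name)
    return [(term, position) for position, term in enumerate(terms, start=1)]


print(terms_with_positions("ClientProtocol"))
# prints [('Client', 1), ('Protocol', 2)]
```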

Results
This work can be considered a valid proposal for ATD because it corresponds to ATD by non-uniformity of patterns [2] and it addresses a debt that affects the maintainability and evolvability of software, without falling under the topics that are not accepted as technical debt [37]. The approach of this proposal gives relevance to class names, which determine the maintainability and legibility of software, among other qualities [4,7,18,28,29,33,34]. According to the standard ISO/IEC FDIS 25010, maintainability includes the following quality attributes: modularity, reusability, analyzability, modifiability and testability. Considering that identifying components is a less complicated task when the identifiers are composed of significant terms [9,14], the presented analysis can support analyzability and modifiability by obtaining terms that are significant because of their frequent use (i.e., they are representative). Furthermore, if the naming patterns are not found in a software implementation, this could be evidence of poor design and implementation choices with regard to the terms used, affecting the test case artifacts [17]; in this sense, the analysis can also support testability. Table 1 shows some data about the projects considered henceforth: LOC (lines of code), QF (quantity of files), QP (quantity of packages), and QT (quantity of terms).
To evaluate the proposed analysis technique, an application was implemented to extract the terms used in class names following the CamelCase coding style (the predominant style due to its ease of writing and adoption [7,13]) and to store the terms in a Neo4j database (a standard graph database in the industry [26]). The application was executed on twenty projects of the organizations Apache and Eclipse (see table 1). All the source code was obtained from the repositories of Apache and Eclipse on GitHub (https://github.com/). Some project names were simplified for display; their names on GitHub are: eclipselink.runtime, hudson.core, scout.rt, servicemix-components.
This evaluation employs a minimal frequency of 0.8 to find frequent terms. Tables 2, 3, 4, and 5 show the following data for patterns 1, 2, 3, and 4 (i.e., their order in section Naming Patterns) respectively: N (quantity of frequent terms), Min (minimal frequency found in frequent terms), Max (maximal frequency found), Avg (average frequency), Stdv (standard deviation of frequency), TN (quantity of terms with a frequency lower than 1).
Pattern Package is the most used pattern in the set, and pattern Package Name and Level is the most restrictive and the least used. The number of projects that have no occurrences of some pattern is very low. In general, the frequent terms comply with a pattern in more than ninety percent of their occurrences (i.e., an average value of 0.9), with some cases reaching one hundred percent. TN values show the quantity of ATD incidences by non-uniformity of patterns. The percentage of TN in N shows the percentage of frequent terms that were not applied uniformly. The maximal accepted value for this percentage can be defined by the development team, in accordance with the degree of use of naming conventions and how well defined the architecture is.
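The measures reported per pattern can be derived from the list of frequencies of the frequent terms, as in the Python sketch below. The function name and the input values are hypothetical, and the use of the population standard deviation is an assumption, since the variant behind Stdv is not stated here.

```python
from statistics import mean, pstdev


def summarize(frequencies):
    """Compute N, Min, Max, Avg, Stdv, TN and the share of TN in N for one pattern."""
    n = len(frequencies)
    if n == 0:
        return {"N": 0, "Min": None, "Max": None, "Avg": None,
                "Stdv": None, "TN": 0, "TN_pct": None}
    # TN: frequent terms that are not applied uniformly (frequency below 1).
    tn = sum(1 for value in frequencies if value < 1)
    return {
        "N": n,
        "Min": min(frequencies),
        "Max": max(frequencies),
        "Avg": mean(frequencies),
        "Stdv": pstdev(frequencies),  # assumption: population standard deviation
        "TN": tn,
        "TN_pct": tn / n,
    }


# Hypothetical frequencies of four frequent terms found for one pattern.
summary = summarize([0.8, 1.0, 1.0, 0.9])
```

A team could compare TN_pct against its own accepted maximum to decide whether the non-uniformity found for a pattern should be treated as debt.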
To show the simplicity of the queries, figure 6 shows a query in the Cypher language that gets the frequent terms with their respective packages for the pattern Package. Code conventions can often be expressed as common practices that follow a certain consensus rather than as imposed rules [19]. The proposed analysis enables identifying a consensus of terms that follow the naming patterns. Given that refactoring can introduce poor design and implementation choices, making such emergent consensus in the source code evident is useful before performing refactoring [2,19]. Table 6 shows some frequent terms that stand out because they match concepts used in popular designs and architectures, showing that it is possible to obtain emergent and significant concepts from the names of source code artifacts. The following query gets the TN terms for all naming patterns.
MATCH (t:FrequentTerm) WHERE t.percentage < 1 RETURN DISTINCT t.term
("Abstract":technical debt) AND ("Abstract":name OR "Abstract":names OR "Abstract":naming OR "Abstract":identifier OR "Abstract":identifiers)
For ScienceDirect:

ABS("technical debt") AND (ABS(name) OR ABS(names) OR ABS(naming) OR ABS(identifier) OR ABS(identifiers))
The numbers of results obtained for ACM, IEEE Xplore and ScienceDirect are 64, 0, and 1, respectively. The result obtained in ScienceDirect is a book chapter with refactoring advice. Many of the results from ACM are studies about the scope, causes, impact, and features of technical debt; a few results are slightly related to this work, addressing static analysis of source code at a low level by inspecting the source code content (i.e., operations and code statements). Consequently, it can be affirmed that there are no proposals similar to this work, which focuses on the naming of source code artifacts.

Conclusions
The naming patterns showed frequent occurrences in several projects of the organizations Apache and Eclipse, and most of the frequent terms comply with each pattern in about ninety percent of their occurrences.
The proposed analysis identifies architectural technical debt by non-uniformity of naming patterns, that is, patterns that are applied frequently but not followed throughout the system. The approach, based on naming patterns of source code artifacts, differs from other approaches that use the source code content (e.g., operations, statements) for identifying technical debt.
The use of a graph database was relevant because it enabled using the database query capabilities as a CQL, avoiding the limitations that a conventional CQL tool could present [27]; it also enabled aggregating data in new nodes and eased the elaboration of queries, which could be more complex or harder to define with a conventional CQL.
The proposal is applicable under an agile approach, which promotes focusing on product features and taking care of the uncertainty regarding ATD [2]. The analysis performed on source code does not require an architecture specification as input and could be automated through the continuous execution of queries during software development, enabling the tracking of ATD. Additionally, the discovered frequent terms can be useful for identifying new emergent concepts in the software architecture.