Background: Despite similar education and background, programmers can exhibit vast differences in efficacy. While research has identified some potential factors, such as programming experience and domain knowledge, the effect of these factors on programmers' efficacy is not well understood. Aims: We aim to unravel the relationship between efficacy (speed and correctness) and measures of programming experience. We further investigate the correlates of programmer efficacy in terms of reading behavior and cognitive load. Method: For this purpose, we conducted a controlled experiment with 37 participants using electroencephalography (EEG) and eye tracking. We asked participants to comprehend up to 32 Java source-code snippets and observed their eye gaze and neural correlates of cognitive load. We analyzed the correlation of participants' efficacy with popular programming experience measures. Results: We found that programmers with high efficacy read source code in a more targeted manner and with lower cognitive load. Commonly used experience levels do not predict programmer efficacy well, but self-estimation and indicators of learning eagerness are fairly accurate. Implications: The identified correlates of programmer efficacy can be used for future research and practice (e.g., hiring). Future research should also consider efficacy as a group sampling method, rather than using simple experience measures.
Background: Researchers and practitioners have been using code complexity metrics for decades to predict how developers comprehend a program. While it is plausible and tempting to use them for this purpose, their validity is questionable, since they rely on code properties and rarely consider particularities of human cognition. Aims: We investigate whether and how code complexity metrics reflect difficulty of program comprehension. Method: We conducted a functional magnetic resonance imaging (fMRI) study with 19 participants observing program comprehension of short code snippets at varying complexity levels. We dissected four classes of code complexity metrics and their relationship to neuronal, behavioral, and subjective correlates of program comprehension, overall analyzing more than 41 metrics. Results: While we could corroborate that complexity metrics can, to a limited degree, explain programmers' cognition in program comprehension, fMRI allowed us to gain more insights into why some properties of code can be difficult to process. In particular, the code's textual size drives programmers' attention and vocabulary size burdens programmers' working memory. Conclusion: Our results provide neuro-scientific evidence that supports warnings of prior research questioning the validity of code complexity metrics and reveal factors relevant to program comprehension. Future Work: We outline a number of follow-up experiments investigating fine-grained effects of code complexity and describe possible refinements to complexity metrics.
Eye tracking allows us to shed light on how developers read and
understand source code and how that is linked to cognitive processes. However, studies with eye trackers are usually tied to a
laboratory, requiring researchers to observe participants one at a time, which
is especially challenging in the current pandemic. To allow for safe
and parallel observation, we present our tool REyeker, which allows
researchers to observe developers remotely while they understand
source code from their own computer without having to directly
interact with the experimenter. The image of the source code is blurred to
distort text regions and make them illegible, requiring participants to
click on areas of interest to deblur them and make them readable.
While REyeker naturally can only track eye movements to a limited degree, it allows researchers to get a basic understanding of
developers’ reading behavior.
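The click-to-deblur idea described above can be sketched as follows. This is a minimal illustration, not REyeker's actual implementation: the function names, the circular-radius rule, and the click format (timestamp in milliseconds, x, y) are all assumptions made for the example.

```python
def legible_lines(click_y, line_height, radius, n_lines):
    """Return indices of code lines whose vertical center lies within
    `radius` pixels of the click; all other lines stay blurred.
    (Hypothetical rule, chosen only for illustration.)"""
    legible = []
    for i in range(n_lines):
        center_y = i * line_height + line_height / 2
        if abs(center_y - click_y) <= radius:
            legible.append(i)
    return legible


def approximate_reading_order(clicks, line_height, radius, n_lines):
    """Map clicks (timestamp_ms, x, y), sorted by time, to the lines
    that each click made readable."""
    ordered = sorted(clicks, key=lambda c: c[0])
    return [legible_lines(y, line_height, radius, n_lines) for _, _, y in ordered]


# Two hypothetical clicks on a six-line snippet, given out of time order.
clicks = [(2000, 40, 90), (500, 40, 10)]
print(approximate_reading_order(clicks, line_height=20, radius=15, n_lines=6))
```

The sequence of deblurred lines serves as a coarse stand-in for a gaze path, which is why such a log can only approximate real eye movements.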
The human factor is prevalent in empirical software engineering research. However, human studies often do not use the full potential of analysis methods by combining analysis of individual tasks and participants with an analysis that aggregates results over tasks and/or participants. This may hide interesting insights into tasks and participants and may lead to false conclusions by overrating or underrating single-task or participant performance. We show that studying multiple levels of aggregation of individual tasks and participants allows researchers to have both insights from individual variations and generalized, reliable conclusions based on aggregated data. Our literature survey revealed that most human studies perform either a fully aggregated analysis or an analysis of individual tasks. To show that there is important, non-trivial variation when including human participants, we reanalyze 12 published empirical studies, thereby changing the conclusions or making them more nuanced. Moreover, we demonstrate the effects of different aggregation levels by answering a novel research question on published sets of fMRI data. We show that, when more data are aggregated, the results become more accurate. This proposed technique can help researchers to find a sweet spot in the tradeoff between cost of a study and reliability of conclusions.
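To illustrate how aggregation can hide individual variation, the following sketch computes the same measure at three aggregation levels. The response times and all names are hypothetical and are not taken from any of the reanalyzed studies.

```python
from statistics import mean

# Hypothetical response times in seconds: times[participant][task]
times = {
    "p1": {"t1": 10.0, "t2": 30.0},
    "p2": {"t1": 20.0, "t2": 20.0},
    "p3": {"t1": 30.0, "t2": 10.0},
}


def per_participant(data):
    """Aggregate over tasks: one mean per participant."""
    return {p: mean(tasks.values()) for p, tasks in data.items()}


def per_task(data):
    """Aggregate over participants: one mean per task."""
    task_names = sorted({t for tasks in data.values() for t in tasks})
    return {t: mean(tasks[t] for tasks in data.values() if t in tasks)
            for t in task_names}


def overall(data):
    """Fully aggregated: a single mean over all observations."""
    return mean(v for tasks in data.values() for v in tasks.values())


print(per_participant(times))
print(per_task(times))
print(overall(times))
```

In this constructed data set, every aggregate equals 20 seconds, although individual observations range from 10 to 30 seconds. Reporting only one aggregation level would therefore miss that p1 and p3 behave in opposite ways on the two tasks.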
Background: How programmers comprehend source code depends on several factors, including the source code itself and the programmer. Recent studies showed that novice programmers tend to read source code more like natural language text, whereas experts tend to follow the program execution flow. However, it is unknown how the linearity of source code and the comprehension strategy influence programmers' linearity of reading order. Objective: We replicate two previous studies with the aim of additionally providing empirical evidence on the influencing effects of linearity of source code and programmers' comprehension strategy on linearity of reading order. Methods: To understand the effects of linearity of source code on reading order, we conducted a non-exact replication of studies by Busjahn et al. and Peachock et al., which compared the reading order of novice and expert programmers. Like the original studies, we used an eye-tracker to record the eye movements of participants (12 novice and 19 intermediate programmers). Results: In line with Busjahn et al. (but different from Peachock et al.), we found that experience modulates the reading behavior of participants. However, the linearity of source code has an even stronger effect on reading order than experience, whereas the comprehension strategy has a minor effect. Implications: Our results demonstrate that studies on the reading behavior of programmers must carefully select source code snippets to control the influence of confounding factors. Furthermore, we identify a need for further studies on how programmers should structure source code to align it with their natural reading behavior to ease program comprehension.
An early study showed that indentation is not a matter of style, but provides actual support for program comprehension. In this paper, we present a non-exact replication of this study. Our aim is to provide empirical evidence for the level of indentation suggested by many style guides. Following Miara and others, we also included the perceived difficulty, and we extended the original design to gain additional insights into the influence of indentation on visual effort by employing an eye-tracker. In the course of our study, we asked 22 participants to calculate the output of Java code snippets with different levels of indentation, while we recorded their gaze behavior. We could not find any indication that the indentation levels affect program comprehension or visual effort, so we could not replicate the findings of Miara and others. Nevertheless, our modernization of the original experiment design is a promising starting point for future studies in this field.
Program comprehension is a central cognitive process in programming and has been a focus of researchers for decades, but is still not thoroughly unraveled. Multi-modal measurement methods are a way to gain a more holistic understanding of program comprehension. However, there is no proper tool support that lets researchers explore synchronized, conjoint multi-modal data, specifically designed for the needs in software engineering. In this paper, we present CodersMUSE, a prototype implementation that aims to satisfy this crucial need.
Abstract: Program comprehension is an important but hard-to-measure cognitive process. This makes it difficult to provide suitable
programming languages, tools, or coding conventions to support developers in their everyday work. Here, we explore whether
functional magnetic resonance imaging (fMRI) is feasible for soundly measuring program comprehension. To this end, we observed 17
participants inside an fMRI scanner while they were comprehending source code. The results show a clear, distinct activation of five
brain regions, which are related to working memory, attention, and language processing, which all fit well to our understanding of
program comprehension. Furthermore, we found reduced activity in the default mode network, indicating the cognitive effort necessary
for program comprehension. We also observed that familiarity with Java as underlying programming language reduced cognitive effort
during program comprehension. To gain confidence in the results and the method, we replicated the study with 11 new participants and
largely confirmed our findings. Our results encourage us and, hopefully, others to use fMRI to observe programmers and, in the long
run, answer questions, such as: How should we train programmers? Can we train someone to become an excellent programmer? How
effective are new languages and tools for program comprehension?
Background: Researchers have recently started using functional magnetic resonance imaging (fMRI) to validate decades-old program-comprehension models. While fMRI helps us to understand neuronal correlates of cognitive processes during program comprehension, its comparatively low temporal resolution (i.e., seconds)
cannot capture the fast cognitive subprocesses (i.e., milliseconds).
Aims: To increase the explanatory power of fMRI measurement of programmers, we are exploring the feasibility of adding simultaneous eye tracking to the fMRI measurement. By observing programmers with two complementary methods, we aim at obtaining
a more holistic understanding of program comprehension.
Method: We conducted a controlled fMRI experiment of 22 student participants with simultaneous eye tracking.
Results: We could successfully capture fMRI and eye-tracking data, although with some limitations, including spatial imprecision and
a negligible drift. The biggest issue that we experienced is the partial loss of data, such that for only 10 participants, we could
collect a complete set of high-precision eye-tracking data. Since some participants of fMRI studies show excessive head motion, the
proportion of full and high-quality data on fMRI and eye tracking is rather low. Still, the remaining data allowed us to confirm our
prior hypothesis of semantic recall during program comprehension, which was not possible with fMRI alone.
Conclusions: Simultaneous measurement of program comprehension with fMRI and eye tracking is feasible and promising. By adding
simultaneous eye tracking to our fMRI study framework, we can conduct more fine-grained fMRI analyses, which in turn helps us
to understand programmer behavior better.
This article extends Hofmeister, Siegmund, & Holt (2017) @ SANER17, see below. We analyze and discuss participants’ visual focus. The data were obtained in the original study using a restricted focus viewer, called the "letterbox", which limited the visible code to 7 lines at once.
Developers spend the majority of their time comprehending code, a process in which identifier names play a key role. Although many identifier naming styles exist, they often lack an empirical basis and it is not quite clear whether short or long identifier names facilitate comprehension. In this paper, we investigate the effect of different identifier naming styles (letters, abbreviations, words) on program comprehension, and whether these effects arise because of their length or their semantics. We conducted an experimental study with 72 professional C# developers, who looked for defects in source-code snippets. We used a within-subjects design, such that each developer saw all three versions of identifier naming styles and we measured the time it took them to find a defect. We found that words lead to, on average, 19% faster comprehension speed compared to letters and abbreviations, but we did not find a significant difference in speed between letters and abbreviations. The results of our study suggest that defects in code are more difficult to detect when code contains only letters and abbreviations. Words as identifier names facilitate program comprehension and can help to save costs and improve software quality.
Most modern software programs cannot be understood in their entirety by a single programmer. Instead, programmers must rely on a set of cognitive processes that aid in seeking, filtering, and shaping relevant information for a given programming task. Several theories have been proposed to explain these processes, such as beacons, for locating relevant code, and plans, for encoding cognitive models. However, these theories are decades old and lack validation with modern cognitive-neuroscience methods. In this paper, we report on a study using functional magnetic resonance imaging (fMRI) with 11 participants who performed program comprehension tasks. We manipulated experimental conditions related to beacons and layout to isolate specific cognitive processes related to bottom-up comprehension and comprehension based on semantic cues. We found evidence of semantic chunking during bottom-up comprehension and lower activation of brain areas during comprehension based on semantic cues, confirming that beacons ease comprehension.