Deep learning applied to the assessment of online student programming exercises
Abstract
Massive online open courses (MOOCs) teaching coding are increasing in number and popularity. They commonly include homework assignments in which the students must write code that is evaluated by
functional tests. Functional testing can to some extent be automated;
however, providing more qualitative evaluation and feedback may
be prohibitively labor-intensive. Providing qualitative evaluation
automatically and at scale is the subject of much research effort.
In this thesis, deep learning is applied to the task of performing
automatic assessment of source code, with a focus on provision of
qualitative feedback. Four tasks are considered in detail: language modeling, detecting idiomatic code, semantic code search, and predicting variable names.
First, deep learning models are applied to the task of language modeling of source code. A comparison is made between the performance of
different deep learning language models, and it is shown how language
models can be used for source code auto-completion. It is also demonstrated how language models trained on source code can be used for
transfer learning, providing improved performance on other tasks.
Next, an analysis is made of how the language models from the
previous task can be used to detect idiomatic code. It is shown that
these language models are able to locate where a student has deviated
from correct code idioms. These locations can be highlighted to the
student in order to provide qualitative feedback.
Then, results are presented for semantic code search, again comparing performance across a variety of deep learning models. It is demonstrated how semantic code search can be used to reduce the time taken
for qualitative evaluation, by automatically pairing a student submission with an instructor’s hand-written feedback.
Finally, it is examined how deep learning can be used to predict
variable names within source code. In a qualitative evaluation setting, these models can suggest more appropriate variable names. It is also shown that
these models can even be used to predict the presence of functional
errors.
Novel experimental results show that fine-tuning a pre-trained
language model is an effective way to improve performance across a
variety of tasks on source code, improving performance by 5% on average; that pre-trained language models can be used as zero-shot learners across a variety of tasks, with the zero-shot performance of some architectures outperforming the fine-tuned performance of others; and
that language models can be used to detect both semantic and syntactic errors. Other novel findings include that removing the non-variable
tokens within source code has a negligible impact on model performance, and that the remaining tokens can be shuffled with only a
minimal decrease in performance.