Data from: Can we make it better? Assessing and improving quality of GitHub repositories
datasetposted on 07.03.2022, 04:06 by GEDE ARTHA AZRIADI PRANA (SMU)GEDE ARTHA AZRIADI PRANA (SMU)
This is the related dataset for the PhD dissertation by G. A. A. Prana, "Can We Make It Better? Assessing and Improving Quality of GitHub Repositories", available at https://ink.library.smu.edu.sg/etd_coll/373/
The code hosting platform GitHub has gained immense popularity worldwide in recent years, with over 200 million repositories hosted as of June 2021. Due to its popularity, it has great potential to facilitate widespread improvements across many software projects. Naturally, GitHub has attracted much research attention, and the source code in the various repositories it hosts also provide opportunity to apply techniques and tools developed by software engineering researchers over the years. However, much of existing body of research applicable to GitHub focuses on code quality of the software projects and ways to improve them. Fewer work focus on potential ways to improve quality of GitHub repositories through other aspects, although quality of a software project on GitHub is also affected by factors outside a project's source code, such as documentation, the project's dependencies, and pool of contributors.
The three works that form this dissertation focus on investigating aspects of GitHub repositories beyond the code quality, and identify specific potential improvements that can be applied to improve wide range of GitHub repositories. In the first work, we aim to systematically understand the content of README files in GitHub software projects, and develop a tool that can process them automatically. The work begins with a qualitative study involving 4,226 README file sections from 393 randomly-sampled GitHub repositories, which reveals that many README files contain the ``What'' and ``How'' of the software project, but often do not contain the purpose and status of the project. This is followed by a development and evaluation of a multi-label classifier that can predict eight different README content categories with F1 of 0.746. From our subsequent evaluation of the classifier, which involve twenty software professionals, we find that adding labels generated by the classifier to README files ease information discovery.
Our second work focuses on characteristics of vulnerabilities in open-source libraries used by 450 software projects on GitHub that are written in Java, Python, and Ruby. Using an industrial software composition analysis tool, we scanned every version of the projects after each commit made between November 1, 2017 and October 31, 2018. Our subsequent analyses on the discovered library names, versions, and associated vulnerabilities reveal, among others, that ``Denial of Service'' and ``Information Disclosure'' vulnerability types are common. In addition, we also find that most of the vulnerabilities persist throughout the observation period, and that attributes such as project size, project popularity, and experience level of commit authors do not translate to better or worse handling of vulnerabilities in dependent libraries. Based on the findings in the second work, we list a number of implications for library users, library developers, as well as researchers, and provide several concrete recommendations. This includes recommendations to simplify projects' dependency sets, as well as to encourage research into ways to automatically recommend libraries known to be secure to developers.
In our third work, we conduct a multi-region geographical analysis of gender inclusion on GitHub. We use a mixed-methods approach involving a quantitative analysis of commit authors of 21,456 project repositories, followed by a survey that is strategically targeted to developers in various regions worldwide and a qualitative analysis of the survey responses. Among other findings, we discover differences in diversity levels between regions, with Asia and Americas being highest. We also find no strong correlation between gender and geographic diversity of a repository's commit authors. Further, from our survey respondents worldwide, we also identify barriers and motivations to contribute to open-source software. The results of this work provides insights on the current state of gender diversity in open source software and potential ways to improve participation of developers from under-represented regions and gender, and subsequently improve the open-source software community in general. Such potential ways include creation of codes of conduct, proximity-based mentorship schemes, and highlighting of women / regional role models.