Large scale data integration of OSS repositories for automated soft and technical factors assessment
Iqbal, Muhammad Aftab
Today, software development revolves not only around source code but also around the large volume of project-related information held in the different software repositories that host a software project. These repositories produce a variety of software artifacts (e.g., source code, bugs, source control commit logs, emails, and documentation) throughout the software development lifecycle. Beyond the information spread across the repositories of a single project, software project related information is also distributed on the Web in heterogeneous open source software repositories. Examples include collaborative infrastructures for software project development (i.e., code forges), social networking infrastructures (e.g., Twitter) that disseminate software project related information to a wider audience, and statistical services that provide statistical information about software project development. Hence, information related to software projects is distributed across the Web. The information contained in these heterogeneous software repositories is vital to software stakeholders for their day-to-day development needs. However, it is not readily accessible because of the distributed nature of software repositories and the lack of integration among them. In this thesis, we propose to integrate software repositories using a Linked Data approach that allows easy integration and identification of related information about software artifacts across heterogeneous software repositories. We start by describing our approach to publishing and integrating software repositories (based on software artifacts) using Linked Data and show how the interlinked information can be delivered to software stakeholders in their development environments.
Further, we present our approach to identifying and interlinking the multiple IDs that a software developer uses to interact with the different software repositories of a software project. Moreover, we present some use case scenarios that can be realized by interlinking these IDs. With respect to the hosting of software projects on publicly available development infrastructures (i.e., code forges), we propose to integrate different code forges based on metadata, similar software projects, and software developers. We demonstrate the integration of software project and software developer related information across different code forges, as well as relevant information that is available through statistical services. Further, we show that this integration enables software stakeholders not only to query statistical information about a software project or a software developer but also to keep track of the involvement of software developers in multiple software projects across different code forges. With regard to the social aspects, we present evidence that software project and software developer related information also exists on social media channels. Based on our case study of the usage of Twitter by software developers, we motivate the integration of social media channels and software repositories. Finally, we exploit our linked datasets to investigate the evolving social dependencies and social relations among software developers over time.