School of Information Studies - Faculty Scholarship

The Perils and Pitfalls of Mining SourceForge

Document Type

Article

Date

2004

Keywords

SourceForge, open source software, data mining

Disciplines

Library and Information Science

Description/Abstract

SourceForge provides abundant, accessible data from Open Source Software development projects, making it an attractive data source for software engineering research. However, it is not without theoretical peril and practical pitfalls. In this paper, we outline practical lessons gained from our spidering, parsing and analysis of SourceForge data. SourceForge can be practically difficult: projects are defunct, data from earlier systems has been dumped in, and crucial data is hosted outside SourceForge, all of which dirties the retrieved data. These practical issues play directly into analysis: decisions made in screening projects can reduce the range of variables, skewing data and biasing correlations. SourceForge is also theoretically perilous: it provides easily accessible data items for each project, tempting researchers to fit their theories to these limited data. Worse, few of these items are plausible dependent variables. Studies are thus likely to test the same hypotheses even if they start from different theoretical bases. To avoid these problems, analyses of SourceForge projects should go beyond project-level variables and carefully consider which variables are used for screening projects and which for testing hypotheses.
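
The abstract's point that screening can reduce the range of a variable and bias correlations can be illustrated with a small simulation. The following is a minimal sketch on synthetic data, not SourceForge data and not the authors' analysis; the variable names `activity` and `downloads`, the 0.6 effect size, and the screening cutoff are all assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n_projects=5000, cutoff=1.0, seed=0):
    """Compare the correlation before and after screening on one variable."""
    rng = random.Random(seed)
    # Hypothetical project-level variables (synthetic):
    # "downloads" depends on "activity" plus independent noise.
    activity = [rng.gauss(0, 1) for _ in range(n_projects)]
    downloads = [0.6 * a + 0.8 * rng.gauss(0, 1) for a in activity]
    full_r = pearson(activity, downloads)

    # Screening step: keep only projects whose activity exceeds the cutoff,
    # which narrows the range of "activity" in the retained sample.
    kept = [(a, d) for a, d in zip(activity, downloads) if a > cutoff]
    screened_r = pearson([a for a, _ in kept], [d for _, d in kept])
    return full_r, screened_r, len(kept)


if __name__ == "__main__":
    full_r, screened_r, n_kept = simulate()
    print(f"all projects:      r = {full_r:.2f}")
    print(f"screened projects: r = {screened_r:.2f} (n = {n_kept})")
```

In this sketch the relationship between the two variables is identical for every project, yet the correlation measured on the screened subset is weaker because the screening variable's range has been restricted. This is one reason to keep screening variables separate from the variables used to test hypotheses.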

Recommended Citation

Howison, J. and Crowston, K. (2004a). The perils and pitfalls of mining SourceForge. In Proc. of Workshop on Mining Software Repositories at the International Conference on Software Engineering (ICSE).

Source

local input

Creative Commons License

This work is licensed under a Creative Commons Attribution 3.0 License.

Included in

Library and Information Science Commons
