ORCID

Jeffrey Stanton: 0000-0001-6120-7273

Document Type

Article

Date

Fall 10-1-2020

Keywords

data analysis, text analysis, word embedding

Language

English

Disciplines

Applied Linguistics | Databases and Information Systems | Library and Information Science | Management Sciences and Quantitative Methods

Description/Abstract

Researchers from many fields have used statistical tools to make sense of large bodies of text. Many tools support quantitative analysis of documents within a corpus, but relatively few studies have examined statistical characteristics of whole corpora. Statistical summaries of whole corpora and comparisons between corpora have potential application in the analysis of topically organized applications such as social media platforms. In this study, we created matrix representations of several corpora and examined several statistical tests to make comparisons between pairs of corpora with respect to the topical homogeneity of documents within each corpus. Results of three experiments suggested that a matrix of cosine distances calculated from vector summaries of short phrases contains useful information about how closely the documents within a corpus relate to one another. Both the tested summarization method and a non-parametric test for comparing cosine distance matrices appear to have utility for examining and comparing corpora containing brief texts.
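
As a rough illustration of the idea of a cosine distance matrix over document vectors, the following is a minimal sketch and not the authors' implementation. It assumes each document has already been reduced to one numeric vector (for example, by averaging word embeddings, which is an assumption here). The function names and the toy two-dimensional vectors are hypothetical, and the mean pairwise distance is used only as one simple possible summary of homogeneity; the non-parametric test mentioned in the abstract is not reproduced.

```python
import numpy as np


def cosine_distance_matrix(vectors):
    """Return pairwise cosine distances (1 - cosine similarity) between rows.

    Assumes no row is all zeros; a zero vector has no defined direction.
    """
    v = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    unit = v / norms
    # Clip to guard against tiny floating-point overshoot outside [-1, 1].
    similarity = np.clip(unit @ unit.T, -1.0, 1.0)
    return 1.0 - similarity


def mean_pairwise_distance(dist):
    """Average the upper triangle of a distance matrix, excluding the diagonal."""
    upper = np.triu_indices(dist.shape[0], k=1)
    return float(dist[upper].mean())


# Hypothetical toy "corpora": each row stands in for one document vector.
tight_corpus = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]])
loose_corpus = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

tight_score = mean_pairwise_distance(cosine_distance_matrix(tight_corpus))
loose_score = mean_pairwise_distance(cosine_distance_matrix(loose_corpus))
print(tight_score < loose_score)
```

Under this sketch, a smaller mean pairwise distance indicates document vectors that point in more similar directions, which is one way to read "topical homogeneity" for a corpus of brief texts.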

Recommended Citation

Stanton, Jeffrey M. and Sang, Yisi, "Assessing Topical Homogeneity with Word Embedding and Distance Matrices" (2020). School of Information Studies - Faculty Scholarship. 193.
https://surface.syr.edu/istpub/193

Source

submission

Creative Commons License

This work is licensed under a Creative Commons Attribution-No Derivative Works 4.0 International License.

Included in

Applied Linguistics Commons, Databases and Information Systems Commons, Library and Information Science Commons, Management Sciences and Quantitative Methods Commons

School of Information Studies - Faculty Scholarship

Assessing Topical Homogeneity with Word Embedding and Distance Matrices
