Data preparation


Check the data

Are the texts we collected clear about vaccination requirements? If not, we need to find additional documents as evidence to support our guesses.

In other words, have we collected everything we need?


Clean the data

Before the texts can be analyzed in any software, we first need to acquire them. In our case, this means downloading the community messages from their web pages.

We chose to export the web pages as PDF after removing the header, footer, styling, etc. This is not necessarily the best way to acquire texts, but it is probably the easiest for this workshop.
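As a rough illustration of what "removing the header, footer, styling" can mean in practice, here is a minimal Python sketch that keeps only the body text of a page. It is not the procedure used in this workshop (we exported PDFs by hand). The list of tags to skip, the class name BodyTextExtractor, and the sample page are all made up for the example.

```python
from html.parser import HTMLParser

# Tags whose contents we treat as page chrome, not message text (an assumption).
SKIP_TAGS = {"header", "footer", "nav", "style", "script"}


class BodyTextExtractor(HTMLParser):
    """Collect text that is not nested inside any of the SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside all skipped tags.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Invented sample page, for illustration only.
sample_html = """
<html><head><style>p { color: red; }</style></head>
<body>
<header>University News | Login</header>
<p>All students must be vaccinated before the fall semester.</p>
<footer>Contact us</footer>
</body></html>
"""

print(extract_text(sample_html))
```

Real pages rarely mark their header and footer this cleanly, so the tag list would likely need adjusting for each site.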

Of course, in the real world, this is going to be much more complicated. For text mining, for instance, a lot of preprocessing is needed before the texts are machine readable. This may include text normalization, removing stop words, stemming, lemmatization, and so on.
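To make these preprocessing steps more concrete, here is a small Python sketch using only the standard library. It is a toy under stated assumptions: the stop-word list is a tiny hand-picked sample, and crude_stem is a hypothetical suffix stripper rather than a real stemming algorithm such as Porter's. Lemmatization is left out because it needs a dictionary or a trained model.

```python
import re
import unicodedata

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "at", "be", "for", "in", "is", "of", "the", "to"}


def normalize(text):
    """Unify Unicode forms, lowercase, and replace punctuation with spaces."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Toy stemmer: strip one common suffix, keeping at least 3 characters."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    tokens = normalize(text).split()
    tokens = remove_stop_words(tokens)
    return [crude_stem(t) for t in tokens]


# Invented sentence, for illustration only.
example = "Students are required to be vaccinated."
print(preprocess(example))  # ['student', 'requir', 'vaccinat']
```

Stems such as "requir" are not real words. That is expected: stemming only aims to map related word forms to the same token.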


It’s not a linear process!

Data collection, data preparation, and later data analysis do NOT form a linear process. These steps may be interrelated and may happen simultaneously.

The steps of collecting, checking, and refining data may be repeated. The criteria for completing these steps may be constantly challenged and updated. For instance, when analyzing qualitative data, researchers may attempt to saturate the codes/categories they use to aid analysis; they may continue collecting data (e.g. interviewing) until new information no longer provides further insight into the category (Creswell, 2007, p. 160).
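The idea of collecting data until the codes are saturated can be sketched as a simple loop. This is only one possible way to make the idea operational, not a rule taken from Creswell. The function name, the patience threshold (how many consecutive interviews with no new codes count as saturation), and the interview codes are all assumptions made up for the example.

```python
def interviews_until_saturation(coded_interviews, patience=2):
    """Return (number of interviews processed, codes seen).

    Stops early once `patience` consecutive interviews add no new codes.
    """
    seen = set()
    quiet = 0  # consecutive interviews that contributed nothing new
    for count, codes in enumerate(coded_interviews, start=1):
        new_codes = set(codes) - seen
        seen |= new_codes
        quiet = 0 if new_codes else quiet + 1
        if quiet >= patience:
            return count, seen
    return len(coded_interviews), seen


# Invented codes per interview, for illustration only.
interviews = [
    {"cost", "access"},
    {"access", "trust"},
    {"trust"},
    {"cost"},
    {"safety"},
]

count, codes = interviews_until_saturation(interviews)
print(count, sorted(codes))  # 4 ['access', 'cost', 'trust']
```

Note that the fifth interview would have introduced a new code ("safety"). A mechanical stopping rule can end collection too early, which is one reason the criteria for completing a step may need to be revisited.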

