Data and Methods

The data accessible through the API and AIReD (the dashboard) are collected from the following sources:

  • Twitter posts were harvested (until Twitter unilaterally cut off access in April 2023) using the Twitter Streaming API, which provided real-time access to all newly posted tweets. The stream was filtered to include only users who were either located in Australia at the time of posting or mentioned Australia in their Twitter profile. Retweets were excluded to avoid duplicate documents and to minimise, as far as possible, content likely produced by “bots”;
  • Mastodon: only public posts from the Mastodon.au node are collected;
  • Flickr posts are filtered geographically, with a bounding box covering the whole of Australia;
  • YouTube posts are filtered using multiple geographic queries that cover the whole country;
  • Reddit posts are collected from subreddits whose names include Australian locations. Currently, the subreddits used for harvesting are “wollongong”, “geelong”, “centralcoastnsw”, “newcastle”, “hobart”, “GoldCoast”, “bluemountains”, “canberra”, “ballarat”, “Wodonga”, “Mildura”, “shepparton”, “Morwell”, “LatrobeValley”, “gippsland”, “Cairns”, “lakemacquarie”, “Melbourne”, “Sydney”, “Perth”, “WaggaNSW”, “Riverina”, “Cessnock”, “darwin”, “ipswich”, “Launceston”, “rockhampton”, “HuonValley”, “Armidale”, “redcliffe”, “MargaretRiver”, “albury”, “Toowoomba”, “albanywa”, “Tamworth”, “Footscray”, “altona”;
  • Bluesky posts are harvested from users who are believed to be based in Australia, on the basis that their bio contains Australian place names, that they follow several Australian users, or that a significant fraction of their posts mention Australian topics;
  • GDELT events are harvested when their coordinates fall within Australia or when they name Australian places as their location.
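Several of the sources above are filtered geographically. A minimal sketch of a bounding-box check of the kind applied to Flickr posts is shown below; the coordinates and the `within_australia` helper are illustrative only, not the collector's actual code.

```python
# Approximate bounding box for Australia (illustrative values only).
AUS_BBOX = {
    "min_lon": 112.0, "max_lon": 154.0,   # roughly spans the mainland east-west
    "min_lat": -44.0, "max_lat": -10.0,   # includes Tasmania in the south
}

def within_australia(lon: float, lat: float, bbox: dict = AUS_BBOX) -> bool:
    """Return True if the point falls inside the bounding box."""
    return (bbox["min_lon"] <= lon <= bbox["max_lon"]
            and bbox["min_lat"] <= lat <= bbox["max_lat"])

posts = [
    {"id": 1, "lon": 144.96, "lat": -37.81},  # Melbourne: kept
    {"id": 2, "lon": -0.13, "lat": 51.51},    # London: dropped
]
kept = [p for p in posts if within_australia(p["lon"], p["lat"])]
```

A single bounding box inevitably admits some ocean and excludes nothing near the borders, which is why sources such as YouTube are instead covered with multiple geographic queries.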

Some data collections are further processed to group conversations by topic using the BERTopic algorithm.

The AIReD implementation follows a multi-stage preprocessing pipeline:

1. Entity Removal & Tokenization: raw documents are stripped of entities (URLs, mentions, hashtags), then tokenised using NLTK; tokens shorter than a configurable minimum length are discarded.

2. Document Filtering: Documents with fewer than a given number of tokens are discarded to ensure sufficient content for meaningful topic discovery.
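Steps 1 and 2 can be sketched as follows. This is a hedged approximation: a simple regex tokeniser stands in for NLTK, and the two thresholds are illustrative values for what the pipeline exposes as configuration.

```python
import re

MIN_TOKEN_LEN = 3    # illustrative; the pipeline makes this configurable
MIN_DOC_TOKENS = 4   # illustrative; the pipeline makes this configurable

# URLs, @-mentions, and #-hashtags are stripped before tokenisation.
ENTITY_RE = re.compile(r"https?://\S+|[@#]\w+")

def preprocess(doc: str) -> list[str]:
    """Entity stripping, tokenisation, and short-token filtering (step 1)."""
    stripped = ENTITY_RE.sub(" ", doc)
    tokens = re.findall(r"\w+", stripped.lower())  # regex stand-in for NLTK
    return [t for t in tokens if len(t) >= MIN_TOKEN_LEN]

docs = [
    "Check out the beach at #Bondi today http://example.com with @friend",
    "ok",  # too short once tokenised
]
# Step 2: drop documents that end up with too few tokens.
corpus = [toks for toks in map(preprocess, docs) if len(toks) >= MIN_DOC_TOKENS]
```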

3. Linguistic Processing: POS tagging is applied to retain only nouns; non-alphabetic tokens are removed; English stopwords are filtered using NLTK; and Porter stemming is applied for normalisation.
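The linguistic-processing step can be sketched without dependencies as below. The real pipeline uses NLTK's POS tagger, stopword list, and Porter stemmer; here a pretend tagger output, a tiny stopword set, and a crude suffix-stripper stand in for them.

```python
NOUN_TAGS = {"NN", "NNS"}  # Penn Treebank tags for singular/plural common nouns

# Pretend output of a POS tagger (NLTK would produce pairs like these).
TAGGED = [("bushfires", "NNS"), ("are", "VBP"), ("burning", "VBG"),
          ("near", "IN"), ("towns", "NNS"), ("in", "IN"), ("the", "DT"),
          ("mountains", "NNS")]

STOPWORDS = {"the", "in", "are"}  # tiny illustrative subset of NLTK's English list

def toy_stem(word: str) -> str:
    """Very rough stand-in for Porter stemming: strip a plural 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

tokens = [w for w, tag in TAGGED if tag in NOUN_TAGS]              # nouns only
tokens = [w for w in tokens if w.isalpha() and w not in STOPWORDS]  # alphabetic, non-stopword
stems = [toy_stem(w) for w in tokens]
```

Restricting to nouns before stemming keeps the topic vocabulary focused on entities and things rather than function words and verb forms.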

4. Word2Vec Feature Generation: Before BERTopic clustering, the pipeline builds Word2Vec embeddings from preprocessed tokens using configurable parameters (vector size, window size, minimum count, training algorithm). These embeddings are crucial for the similarity calculations in BERTopic’s topic coherence metrics.

5. Topic modelling: The process generates topic assignments and probabilities, calculates topic similarity scores using Word2Vec cosine similarity between topic terms, and outputs structured results with topic IDs, sizes, representative terms, and document assignments.
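The topic-similarity calculation in steps 4–5 reduces to cosine similarity between embedding vectors. A sketch with toy vectors follows; real vectors would come from the trained Word2Vec model and have hundreds of dimensions, and the averaging of term vectors into a topic vector is an illustrative assumption about the aggregation.

```python
import math

# Toy 3-dimensional "Word2Vec" vectors for four topic terms.
VECS = {
    "beach":   [0.9, 0.1, 0.0],
    "surf":    [0.8, 0.2, 0.1],
    "drought": [0.1, 0.9, 0.2],
    "rain":    [0.2, 0.8, 0.3],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def topic_vector(terms):
    """Average the embeddings of a topic's representative terms."""
    dims = len(next(iter(VECS.values())))
    return [sum(VECS[t][d] for t in terms) / len(terms) for d in range(dims)]

coastal = topic_vector(["beach", "surf"])
weather = topic_vector(["drought", "rain"])
sim = cosine(coastal, weather)  # moderate similarity between distinct topics
```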

Reproducibility Issues

BERTopic has a reproducibility problem because several of its internal steps are inherently stochastic. UMAP’s dimensionality reduction uses random initialisation and is highly sensitive to small perturbations, so different runs on the same data can produce different embeddings. HDBSCAN then clusters those embeddings, and because the underlying space shifts slightly each run, cluster assignments (and thus topics) can vary. Even when you fix random seeds, full determinism is not guaranteed because UMAP relies on non-deterministic neighbour graph construction and parallelism. The downstream c-TF-IDF representation simply reflects whatever clusters were produced, so topic names, counts, and even the number of topics can differ between runs.
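Run-to-run variation can be reduced, though not eliminated, by pinning seeds on the stochastic components. The configuration sketch below assumes the standard `umap-learn`, `hdbscan`, and `bertopic` packages and their documented constructor parameters; the specific parameter values are illustrative. Note that setting `random_state` on UMAP also disables its parallelism, trading speed for reproducibility.

```python
# Configuration sketch only: requires umap-learn, hdbscan, and bertopic.
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

# Fixing random_state makes UMAP deterministic (and single-threaded).
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)

# HDBSCAN is deterministic given a fixed embedding, so pinning UMAP
# stabilises the clustering that follows it.
hdbscan_model = HDBSCAN(min_cluster_size=15, metric="euclidean",
                        cluster_selection_method="eom", prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
```

Even with pinned seeds, changing the input corpus, library versions, or hardware can still shift the embedding space enough to alter topic boundaries.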

BibTeX citation:

@misc{aio2025,
	Author = {AIReD},
	Title = {AIReD website},
	Howpublished = {https://www.aio.eresearch.unimelb.edu.au/},
	Url = {https://www.aio.eresearch.unimelb.edu.au/},
	Year = {2025}
}