LDAShiny is an interactive R package built with the Shiny framework and the golem architecture. It provides a complete, graphical workflow for Latent Dirichlet Allocation (LDA) topic modeling applied to bibliometric data exported from Scopus and Web of Science (WoS).
The package was developed by the GEMC Research Group (Grupo de Estadística y Métodos Cuantitativos) at Universidad del Magdalena, Colombia.
If you use LDAShiny in your research, please cite:
De la Hoz-M., J.; Fernandez-Gomez, M. J.; Mendes, S. (2021). LDAShiny: An R Package for Exploratory Review of Scientific Literature Based on a Bayesian Probabilistic Model and Machine Learning Tools. Mathematics, 9(14), 1671. DOI: 10.3390/math9141671
Install the development version from GitHub:
Or, once published on CRAN:
Start the interactive dashboard with a single call:
By default, the application accepts file uploads up to 500 MB. You can adjust this limit:
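In plain Shiny, the upload ceiling is controlled by the `shiny.maxRequestSize` option, expressed in bytes. How LDAShiny exposes this setting is not shown here, so the following is only a sketch that assumes the option is set before the app is launched; the 1 GB value is an arbitrary example.

```r
# Assumption: the limit is applied through Shiny's standard option,
# set before the app starts. The value is given in bytes.
limit_mb <- 1024                                   # hypothetical new limit (1 GB)
options(shiny.maxRequestSize = limit_mb * 1024^2)

# Confirm the value that Shiny will read, converted back to MB
getOption("shiny.maxRequestSize") / 1024^2
```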
The dashboard opens in your default web browser and presents five sequential modules in the left-hand sidebar, each building on the output of the previous one.
The full analysis pipeline consists of five modules:

    ┌──────────────────────┐
    │ 1. Data Import       │  Upload Scopus CSV + WoS TXT → merged data.frame
    └──────────┬───────────┘
               │
    ┌──────────▼───────────┐
    │ 2. Text Preprocessing│  Tokenise · Stopwords · Stemming · DTM
    └──────────┬───────────┘
               │
    ┌──────────▼───────────┐
    │ 3. Inference (K)     │  Run LDA for k_min..k_max → coherence curve
    └──────────┬───────────┘
               │
    ┌──────────▼───────────┐
    │ 4. Final LDA Model   │  Train model at optimal K → β, γ, word clouds
    └──────────┬───────────┘
               │
    ┌──────────▼───────────┐
    │ 5. Trend Analysis    │  Linear regression of topic intensity over time
    └──────────────────────┘

| Source | Format | Notes |
|---|---|---|
| Scopus | .csv | Standard Scopus export |
| Web of Science | .txt | Plain-text tagged format |
| Integrated file | .xlsx | Previously exported merged dataset |
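The module standardises each upload to a shared set of columns (doi, title, year, Journal, abstract, database) and merges the sources into a single data.frame.

The merged dataset can be downloaded as an .xlsx file for reuse via the Load Integrated Excel File option in future sessions.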
Two internal functions handle the parsing and standardisation steps:
- parse_wos(path): reads a WoS plain-text export and returns a data.frame with columns doi, title, year, Journal, abstract, database. Missing fields (e.g. absent DOI) are set to NA.
- standardize_scopus(df): maps Scopus column names to the shared schema. Any missing column is filled with NA.
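The standardisation step can be illustrated with a minimal sketch. It is not the package's internal code: `standardize_scopus_sketch()` is a hypothetical helper, and the Scopus header names used in the mapping (`DOI`, `Title`, `Year`, `Source title`, `Abstract`) are assumptions about a typical export.

```r
# Shared schema used by the Data Import module
schema <- c("doi", "title", "year", "Journal", "abstract", "database")

# Hypothetical helper, not the package's internal function.
# Assumed Scopus export headers: DOI, Title, Year, Source title, Abstract.
standardize_scopus_sketch <- function(df) {
  col_map <- c(doi = "DOI", title = "Title", year = "Year",
               Journal = "Source title", abstract = "Abstract")
  out <- data.frame(row.names = seq_len(nrow(df)))
  for (target in names(col_map)) {
    src <- col_map[[target]]
    # Fill the column with NA when the export lacks it
    out[[target]] <- if (src %in% names(df)) df[[src]] else rep(NA, nrow(df))
  }
  out$database <- rep("Scopus", nrow(df))
  out[, schema]
}

# Toy Scopus export without DOI or journal columns
scopus_raw <- data.frame(
  Title = c("Paper A", "Paper B"),
  Year = c(2019, 2021),
  Abstract = c("Topic models for reviews.", "Trends in bibliometrics."),
  check.names = FALSE
)
scopus_std <- standardize_scopus_sketch(scopus_raw)

# Toy WoS records already in the shared schema
wos_std <- data.frame(
  doi = "10.0000/example", title = "Paper C", year = 2020,
  Journal = "Example Journal", abstract = "Latent topics in abstracts.",
  database = "WoS"
)

# Because both sources share the same columns, merging is a row bind
merged <- rbind(scopus_std, wos_std)
```

Mapping every source to one fixed column set is what lets the later modules ignore where a record came from.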
This module converts the free-text field (typically
abstract) into a Document-Term Matrix
(DTM) ready for LDA.
| Option | Default | Description |
|---|---|---|
| Text column | abstract | Column used as the document text |
| Document ID column | title | Column used as row identifier |
| Min / Max n-gram | 1 / 2 | Unigrams and bigrams by default |
| Stemming (Porter) | Enabled | Reduces words to their root form |
| Remove numbers | Enabled | |
| Remove punctuation | Enabled | |
| Sparse filter | 0.995 | Removes terms appearing in fewer than 0.5 % of docs |
| Custom stopwords | — | Upload an .xlsx file with one word per row |
| CPUs | max - 1 | Parallel cores for DTM construction |
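To make the sparse filter and the term-frequency columns concrete, here is a small base-R sketch on a toy matrix. It mirrors the idea rather than the package's actual calls: the idf formula `log(N / doc_freq)` and the strict comparison against the threshold are assumptions, and the toy threshold of 0.7 stands in for the app default of 0.995.

```r
# Toy document-term matrix: 4 documents x 4 terms (counts)
dtm <- matrix(
  c(2, 1, 0, 0,
    1, 0, 0, 3,
    0, 1, 0, 0,
    1, 2, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("topic", "model", "rare", "lda"))
)

# Term table: total frequency, document frequency, inverse document frequency
doc_freq <- colSums(dtm > 0)
tf_tab <- data.frame(
  term = colnames(dtm),
  term_freq = colSums(dtm),
  doc_freq = doc_freq,
  idf = log(nrow(dtm) / doc_freq),   # assumed idf definition
  row.names = NULL
)

# Sparsity of a term = share of documents in which it is absent.
# Terms whose sparsity is not below the threshold are dropped.
sparse_threshold <- 0.7              # toy value; the app default is 0.995
sparsity <- 1 - doc_freq / nrow(dtm)
dtm_filtered <- dtm[, sparsity < sparse_threshold, drop = FALSE]

colnames(dtm_filtered)
```

With the default of 0.995, a term survives only if it appears in more than 0.5 % of the documents, which is why the filter mainly removes very rare terms.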
After clicking “Run Preprocessing”:

- A term-frequency table (tf_mat) is produced, with term, document frequency, and inverse document frequency columns.
- The DTM (.rds) and term-frequency table (.xlsx) can be downloaded for external use.

The preprocessing pipeline uses:

- textmineR::CreateDtm() for tokenisation, n-gram construction, and stopword removal.
- SnowballC::wordStem() (Porter algorithm) for optional stemming.
- quanteda + tm::removeSparseTerms() for sparsity filtering.
- Matrix::dgCMatrix (sparse matrix storage) for memory-efficient LDA fitting.

Choosing the right number of topics K is a critical step. This module fits multiple LDA models over a user-defined range of K values and evaluates each using mean topic coherence, a measure of the semantic interpretability of topics.
| Parameter | Default | Description |
|---|---|---|
| k start | 5 | Minimum number of topics to test |
| k end | 40 | Maximum number of topics to test |
| k step | 1 | Increment between successive K values |
| Iterations | 500 | Gibbs sampler iterations per model |
| Burn-in | 50 | Initial iterations discarded before sampling |
| Alpha | 0.1 | Document-topic concentration (Dirichlet prior) |
| CPUs | max − 1 | Parallel workers (one model per core) |
A higher coherence score means that the top terms of a topic tend to co-occur frequently in the same documents, producing more semantically coherent topics.
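The co-occurrence idea can be sketched in a few lines of base R. The exact coherence measure computed by the app is not specified here; the code below assumes a probabilistic coherence formulation in which, for each ordered pair of top terms (a, b), the score is P(b | a) − P(b) estimated from document-level presence. `coherence_sketch()` is a hypothetical helper and expects at least two top terms that each occur in the corpus.

```r
# Hypothetical helper: mean of P(b | a) - P(b) over ordered pairs of top terms.
# top_terms must be ordered from most to least probable in the topic.
coherence_sketch <- function(top_terms, dtm) {
  present <- dtm[, top_terms, drop = FALSE] > 0   # document-level presence
  n_docs <- nrow(present)
  scores <- c()
  for (i in seq_len(length(top_terms) - 1)) {
    for (j in (i + 1):length(top_terms)) {
      p_b <- sum(present[, j]) / n_docs
      p_b_given_a <- sum(present[, i] & present[, j]) / sum(present[, i])
      scores <- c(scores, p_b_given_a - p_b)
    }
  }
  mean(scores)
}

# Toy corpus: 6 documents, two loosely separated vocabularies
dtm <- matrix(
  c(1, 1, 1, 0, 0,
    1, 1, 0, 0, 0,
    1, 1, 1, 0, 0,
    0, 0, 0, 1, 1,
    0, 0, 0, 1, 1,
    1, 0, 0, 1, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("doc", 1:6),
                  c("gene", "dna", "protein", "market", "price"))
)

# Terms that share documents score higher than a mixed set
coherent_topic <- coherence_sketch(c("gene", "dna", "protein"), dtm)
mixed_topic    <- coherence_sketch(c("gene", "price", "dna"), dtm)
```

In this toy corpus the first term set scores positive and the mixed set scores negative, which matches the intuition that coherent topics group terms found in the same documents.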
The coherence plot is fully customisable (colors, themes, font sizes) and can be exported in PNG, TIFF, JPEG, or PDF format.
If the coherence curve has multiple local maxima, prefer the smaller K for a more parsimonious model, unless domain knowledge justifies a larger number of topics.
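As a sketch of that rule of thumb, the snippet below finds the local maxima of a made-up coherence curve and keeps the smallest K among the peaks that are close to the best score. The coherence values and the tie tolerance are illustrative assumptions, not output from the app.

```r
# Hypothetical coherence curve: one mean-coherence value per tested K
k_grid <- seq(5, 40, by = 5)                     # k start, k end, k step
coherence <- c(0.08, 0.15, 0.12, 0.11, 0.15, 0.10, 0.09, 0.07)

# A point is a local maximum if it is at least as high as both neighbours
n <- length(coherence)
left  <- c(-Inf, coherence[-n])
right <- c(coherence[-1], -Inf)
is_peak <- coherence >= left & coherence >= right

# Among peaks within a small tolerance of the best score, take the smallest K
tolerance <- 0.01                                # assumed tie tolerance
candidates <- k_grid[is_peak & coherence >= max(coherence) - tolerance]
k_selected <- min(candidates)
```

Here the curve peaks at K = 10 and K = 25 with the same score, so the more parsimonious K = 10 is selected.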
Once an optimal K is identified, this module trains the definitive LDA model. The K field is pre-populated with the value selected in Module 3, though it can be overridden.
| Parameter | Default | Description |
|---|---|---|
| K | (from inference) | Number of topics |
| Iterations | 500 | Gibbs sampler iterations |
| Burn-in | 50 | Warm-up iterations |
| Alpha | 50 / K | Document-topic prior (auto-scaled) |
| Beta | 0.05 | Topic-term prior |
| Optimize Alpha | Yes | Whether to update alpha during sampling |
| Metrics | Likelihood, Coherence | Optional: also compute R² |
After clicking “Train Final LDA Model”, the results are organised in the following tabs:
| Tab | Content |
|---|---|
| Model Evaluation Metrics | K, iterations, α, β, mean coherence, log-likelihood |
| Top Terms Matrix | Top M terms per topic (configurable, default M = 20) |
| Document Topic Assignment | Dominant topic for each document |
| Top Documents per Topic | Top M documents per topic ranked by γ weight |
| Topic-Term Weights (Beta) | Full β matrix in tidy long format |
| Document-Topic Weights (Gamma) | Full γ matrix in tidy long format |
| Topic Word Cloud | Interactive word cloud per topic |
β (phi matrix): probability of each term given a topic. A high β value for a term in a topic means that term is strongly associated with that topic.
γ (theta matrix): probability of each topic given a document. A high γ value means the document strongly belongs to that topic.
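The two matrices can be read with a few lines of base R. The sketch below uses toy values and assumes the orientation suggested by the definitions above: phi has one row per topic and one column per term, theta has one row per document and one column per topic. Topic labels such as `t_1` are illustrative.

```r
# Toy phi (topics x terms) and theta (documents x topics); each row sums to 1
phi <- matrix(
  c(0.50, 0.30, 0.15, 0.05,
    0.05, 0.10, 0.25, 0.60),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("t_1", "t_2"), c("gene", "dna", "price", "market"))
)
theta <- matrix(
  c(0.9, 0.1,
    0.2, 0.8,
    0.6, 0.4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("t_1", "t_2"))
)

# Top M terms per topic, ranked by beta (one column per topic)
top_m <- 2
top_terms <- apply(phi, 1, function(p) {
  names(sort(p, decreasing = TRUE))[seq_len(top_m)]
})

# Dominant topic per document: the topic with the largest gamma
dominant <- data.frame(
  document = rownames(theta),
  topic = colnames(theta)[max.col(theta, ties.method = "first")],
  gamma = apply(theta, 1, max),
  row.names = NULL
)

# Tidy long format of beta: one row per (topic, term) pair
beta_long <- as.data.frame(as.table(phi), responseName = "beta")
names(beta_long)[1:2] <- c("topic", "term")
```

These three objects correspond in spirit to the Top Terms Matrix, Document Topic Assignment, and Topic-Term Weights tabs.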
Select any topic from the dropdown, set the maximum number of words,
choose a color palette (from RColorBrewer), and click
“Generate Word Cloud”. The word size is proportional to
the term’s β weight. Word clouds can be exported in PNG, JPEG, PDF, or
TIFF format.
All result tables are available as .xlsx files. The full
trained model object can be saved as an .rds file:

    # Load a previously saved model
    lda_model <- readRDS("lda_final_model.rds")

    # Inspect the phi (topic-term) matrix: first five term columns
    head(lda_model$phi[, 1:5])

    # Inspect the theta (document-topic) matrix: first five topics (requires K >= 5)
    head(lda_model$theta[, 1:5])

This module examines how topics have evolved over time by fitting a simple linear regression of mean topic intensity (mean γ) against publication year, separately for each topic.
Each topic is assigned one of three trend categories:
| Category | Criterion | Plot color |
|---|---|---|
| HOT (Increasing) | Slope > 0, p-value < threshold | Red |
| COLD (Decreasing) | Slope < 0, p-value < threshold | Light blue |
| EQUAL (Stable) | p-value ≥ threshold (non-significant slope) | Grey |
The default significance threshold is p = 0.05, adjustable via the P-Value Threshold input.
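The classification rule can be sketched with `lm()` on made-up yearly means. The data, the column names (`topic`, `year`, `mean_gamma`), and the helper `classify_trend()` are illustrative assumptions; only the decision rule follows the table above.

```r
# Hypothetical yearly mean gamma for three topics
years <- 2015:2022
topic_year <- rbind(
  data.frame(topic = "t_1", year = years,
             mean_gamma = c(0.10, 0.12, 0.15, 0.16, 0.19, 0.22, 0.24, 0.27)),
  data.frame(topic = "t_2", year = years,
             mean_gamma = c(0.30, 0.27, 0.26, 0.22, 0.21, 0.18, 0.15, 0.14)),
  data.frame(topic = "t_3", year = years,
             mean_gamma = c(0.20, 0.22, 0.19, 0.21, 0.20, 0.22, 0.19, 0.21))
)

# Fit mean_gamma ~ year for one topic and apply the HOT / COLD / EQUAL rule
classify_trend <- function(df, threshold = 0.05) {
  coefs <- summary(lm(mean_gamma ~ year, data = df))$coefficients
  slope <- coefs["year", "Estimate"]
  p_value <- coefs["year", "Pr(>|t|)"]
  category <- if (p_value >= threshold) {
    "EQUAL"
  } else if (slope > 0) {
    "HOT"
  } else {
    "COLD"
  }
  data.frame(topic = df$topic[1], slope = slope,
             p_value = p_value, category = category)
}

# One regression per topic, collected into a single results table
results <- do.call(
  rbind,
  lapply(split(topic_year, topic_year$topic), classify_trend)
)
```

Because the significance test is applied first, a topic with a positive but non-significant slope is reported as EQUAL rather than HOT.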
| Tab | Content |
|---|---|
| Topic-Year Data | Raw yearly mean γ per topic |
| Regression Results | Slope estimate and p-value for each topic |
| Topic Trend Plot | Scatter plot with fitted regression line, color-coded |
The trend plot is fully customisable and exportable in multiple formats.
Data quality
Preprocessing
- Upload a custom stopword list (.xlsx, one term per row) to remove domain-specific noise (e.g. “study”, “result”, “method”).
Choosing K
Reproducibility
- Save the trained model (.rds) to avoid retraining for downstream analyses.

De la Hoz-M., J.; Fernandez-Gomez, M. J.; Mendes, S. (2021). LDAShiny: An R Package for Exploratory Review of Scientific Literature Based on a Bayesian Probabilistic Model and Machine Learning Tools. Mathematics, 9(14), 1671. https://doi.org/10.3390/math9141671
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.
Chang, W., Cheng, J., Allaire, J., et al. (2023). shiny: Web Application Framework for R. R package version 1.7.5. https://CRAN.R-project.org/package=shiny
Jones, T. (2019). textmineR: Functions for Text Mining and Topic Modeling. R package. https://CRAN.R-project.org/package=textmineR