---
title: "From R to RDF"
output:
  rmarkdown::html_vignette:
    md_extensions: -autolink_bare_uris
vignette: >
  %\VignetteIndexEntry{From R to RDF}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{css, echo=FALSE}
.smaller .table {
  font-size: 11px;
}

.smaller pre,
.smaller code {
  font-size: 11px;
  line-height: 1.2;
}
```

## From tidy data to RDF triples

This vignette demonstrates how to convert tidy R datasets into semantically enriched RDF triple structures, using the `dataset` and `rdflib` packages. These packages help you annotate variables with machine-readable concepts, units, and links to controlled vocabularies.

We’ll start with a small example of a tidy dataset representing countries (`geo`) with unique identifiers (`rowid`) and then show how to transform the dataset into RDF triples using standard vocabularies.

```{r setup}
library(dataset)
library(rdflib)
data("gdp")
```

## Creating a minimal semantically defined dataset

```{r minimaldf}
small_geo <- dataset_df(
  geo = defined(
    gdp$geo[1:3],
    label = "Geopolitical entity",
    concept = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
    namespace = "https://www.geonames.org/countries/$1/"
  ),
  identifier = c(
    obs = "https://dataset.dataobservatory.eu/examples/dataset.html#"
  )
)
```

The dataset has no creator or author, but its rows have identifiers that resolve under <https://dataset.dataobservatory.eu/examples/dataset.html#>. In real publishing scenarios, you would replace these with persistent URIs that identify actual datasets and their observations, for example a DOI-based identifier such as:

`https://doi.org/10.5281/zenodo.14917851#obs:1`

Let's see how this minimal dataset prints in R:

```{r printsmallgeodf}
print(small_geo)
```

A tidy dataset can always be pivoted into a three-column long format, in which every cell value of the table is expressed as a subject-predicate-object triple.

```{r triplesdf, eval=FALSE}
triples_df <- dataset_to_triples(small_geo)
knitr::kable(triples_df)
```

::: smaller
```{r triplesdfprintsmall, echo=FALSE}
triples_df <- dataset_to_triples(small_geo)
knitr::kable(triples_df)
```
::: 
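
The pivot itself can be sketched in base R. This is only an illustration of the idea under simplified assumptions, not the implementation of `dataset_to_triples()`; the toy table `wide` and its `rowid` values are hypothetical.

```r
# Hypothetical toy table; row identifiers and column names are illustrative only
wide <- data.frame(
  rowid = c("obs1", "obs2", "obs3"),
  geo = c("AD", "LI", "SM"),
  year = c(2020, 2020, 2020),
  stringsAsFactors = FALSE
)

value_cols <- setdiff(names(wide), "rowid")

# One block of rows per variable: subject = row id, predicate = column name,
# object = cell value (coerced to character so that all columns can be stacked)
long <- do.call(rbind, lapply(value_cols, function(col) {
  data.frame(
    s = wide$rowid,
    p = col,
    o = as.character(wide[[col]]),
    stringsAsFactors = FALSE
  )
}))

long
```

A table with three rows and two variables yields six statements, one per cell.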

Serialised as N-Triples, the triples of `small_geo` look like this:

```{r createntriples}
ntriples <- dataset_to_triples(small_geo, format = "nt")
```
```{r pritriples, eval=FALSE}
cat(ntriples, sep = "\n")
```

::: smaller
```{r printsmaller, echo=FALSE}
cat(ntriples, sep = "\n")
```
:::

Each row of your dataset becomes a **subject**, each variable a **predicate**, and each value an **object**: either a URI or a typed literal (such as a date or a number), depending on how it is defined. The first statement in the example describes the intersection of the first row (the observation identified by its `rowid`, `dataset#eg:1`) and the column [reference area](http://purl.org/linked-data/sdmx/2009/dimension#refArea), and gives its value as the URI for [Andorra](https://www.geonames.org/countries/AD/). The advantage of this approach is that the row and column definitions, as well as the coded cell values, have permanent metadata definitions.
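
To make the distinction between URI objects and typed literals concrete, here is a minimal base-R sketch of how a single N-Triples line could be assembled. The helper `to_ntriple()` is hypothetical and not part of `dataset` or `rdflib`; it handles one statement at a time, does not escape special characters, and the `example.org` IRIs are placeholders. For real data, use `dataset_to_triples()`.

```r
# Hypothetical helper: format one subject-predicate-object statement
to_ntriple <- function(s, p, o, datatype = NULL) {
  # Assumption: anything starting with http:// or https:// is a URI node
  is_uri <- grepl("^https?://", o)
  obj <- if (is_uri) {
    paste0("<", o, ">")
  } else if (is.null(datatype)) {
    paste0('"', o, '"') # plain literal
  } else {
    paste0('"', o, '"^^<', datatype, ">") # typed literal
  }
  paste0("<", s, "> <", p, "> ", obj, " .")
}

uri_line <- to_ntriple(
  "https://example.org/dataset#obs1",
  "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
  "https://www.geonames.org/countries/AD/"
)
literal_line <- to_ntriple(
  "https://example.org/dataset#obs1",
  "https://example.org/def#gdp",
  "2354.8",
  datatype = "http://www.w3.org/2001/XMLSchema#decimal"
)

cat(uri_line, literal_line, sep = "\n")
```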

### RDF triples enable interoperability

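The Resource Description Framework (RDF) represents data as subject–predicate–object triples. This allows your dataset to be machine-readable, linkable to external vocabularies, and queryable via SPARQL.

A single statement, such as the title of an observation, can be written as an N-Triples line with `n_triple()`: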

```{r ntripleexample}
n_triple(
  s = "https://dataset.dataobservatory.eu/examples/dataset.html#obs1",
  p = "http://purl.org/dc/terms/title",
  o = "Small Country Dataset"
)
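```

The N-Triples lines created earlier can be written to a file and parsed into an RDF graph with `rdflib`:
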
```{r readrdf}
# Write the N-Triples created earlier to a temporary file
temp_file <- tempfile(fileext = ".nt")
writeLines(ntriples, con = temp_file)

rdf_graph <- rdf()
rdf_parse(rdf_graph, doc = temp_file, format = "ntriples")
rdf_graph
```
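
Once the statements are in a graph, they can be queried with SPARQL. The following is a small self-contained sketch using `rdf_add()` and `rdf_query()` from `rdflib`; the graph `g` and its `example.org` subjects are placeholders built only for this illustration, not the graph parsed above.

```r
library(rdflib)

# Build a tiny in-memory graph; the example.org subjects are placeholders
g <- rdf()
rdf_add(
  g,
  subject = "https://example.org/dataset#obs1",
  predicate = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
  object = "https://www.geonames.org/countries/AD/",
  objectType = "uri"
)
rdf_add(
  g,
  subject = "https://example.org/dataset#obs2",
  predicate = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
  object = "https://www.geonames.org/countries/LI/",
  objectType = "uri"
)

# Select every observation together with its reference area
sparql <- "
PREFIX sdmx: <http://purl.org/linked-data/sdmx/2009/dimension#>
SELECT ?obs ?area
WHERE { ?obs sdmx:refArea ?area }
"
areas <- rdf_query(g, sparql)
areas
```

The result is a data frame with one row per matching statement and one column per SPARQL variable.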

A simple, serverless scaffolding for publishing `dataset_df` objects on the web (with HTML and RDF exports) is available at <https://github.com/dataobservatory-eu/dataset-template>, together with the example from this vignette.

## Clean up

It is good practice to close connections and to remove large objects from memory once they are no longer needed:

```{r cleanup}
# Clean up: delete file and clear RDF graph
unlink(temp_file)
rm(rdf_graph)
gc()
```
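
When temporary files are created inside a function, the clean-up can be made automatic with `on.exit()`. The sketch below uses only base R; `with_temp_lines()` is a hypothetical helper introduced here for illustration, and the triple it writes is a placeholder.

```r
# Hypothetical helper: write lines to a temporary file, hand the path to
# `fun`, and delete the file afterwards even if `fun` fails
with_temp_lines <- function(lines, fun) {
  path <- tempfile(fileext = ".nt")
  on.exit(unlink(path), add = TRUE)
  writeLines(lines, con = path)
  fun(path)
}

# Count the lines of a one-statement file; the file is gone after the call
n_lines <- with_temp_lines(
  "<https://example.org/s> <https://example.org/p> \"o\" .",
  function(path) length(readLines(path))
)
n_lines
```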

## Scale up

We build a slightly bigger graph, save it, and reload it. 

```{r scaleup}
small_country_dataset <- dataset_df(
  geo = defined(
    gdp$geo,
    label = "Country name",
    concept = "http://dd.eionet.europa.eu/vocabulary/eurostat/geo/",
    namespace = "https://www.geonames.org/countries/$1/"
  ),
  year = defined(
    gdp$year,
    label = "Reference Period (Year)",
    concept = "http://purl.org/linked-data/sdmx/2009/dimension#refPeriod"
  ),
  gdp = defined(
    gdp$gdp,
    label = "Gross Domestic Product",
    unit = "https://dd.eionet.europa.eu/vocabularyconcept/eurostat/unit/CP_MEUR",
    concept = "http://data.europa.eu/83i/aa/GDP"
  ),
  unit = gdp$unit,
  freq = defined(
    gdp$freq,
    label = "Frequency",
    concept = "http://purl.org/linked-data/sdmx/2009/code"
  ),
  identifier = c(
    obs = "https://dataset.dataobservatory.eu/examples/dataset.html#"
  ),
  dataset_bibentry = dublincore(
    title = "Small Country Dataset",
    creator = person("Jane", "Doe"),
    publisher = "Example Inc.",
    datasource = "https://doi.org/10.2908/NAIDA_10_GDP",
    rights = "CC-BY",
    coverage = "Andorra, Liechtenstein and San Marino"
  )
)
```

```{r smallcountrydfnt}
small_country_df_nt <- dataset_to_triples(
  small_country_dataset,
  format = "nt"
)
```


The lines printed below (1, 11, 21, 31 and 41) read as follows:

-   [1] `Observation #1` is a geopolitical entity, `Andorra`.
-   [11] `Observation #1` has a reference time period of `2020`.
-   [21] `Observation #1` has a decimal GDP value of `2354.8`.
-   [31] `Observation #1` has a unit of `million euros, current prices`.
-   [41] `Observation #1` has a measurement frequency that is `annual`.

:::: smaller
```{r smallcountrydfntsample}
## See lines 1, 11, 21, 31 and 41
small_country_df_nt[c(1, 11, 21, 31, 41)]
```
::::

The statements about `Observation 1`, i.e. Andorra's national economy in 2020, are not serialised consecutively in the text file. They do not need to be, because each cell is precisely connected to its *row* (the first part of the triple) and its *column* (the second part of the triple). In effect, the entire map back to the original dataset is embedded in the flat text file, so it can be imported into a database easily.
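
Because every statement names its own row and column, the order of the lines does not matter when the table is rebuilt. A minimal base-R sketch of this idea follows; the `triples` data frame is hypothetical and uses short labels instead of full IRIs to keep the example readable.

```r
# Hypothetical triples in column-by-column order: all geo statements come
# first, then all year statements, so one observation's cells are not adjacent
triples <- data.frame(
  s = c("obs1", "obs2", "obs1", "obs2"),
  p = c("geo", "geo", "year", "year"),
  o = c("AD", "LI", "2020", "2020"),
  stringsAsFactors = FALSE
)

rows <- unique(triples$s)
cols <- unique(triples$p)

# Place each object at the (subject, predicate) position that it names
wide <- matrix(
  NA_character_,
  nrow = length(rows), ncol = length(cols),
  dimnames = list(rows, cols)
)
wide[cbind(triples$s, triples$p)] <- triples$o
wide
```

Shuffling the rows of `triples` before the assignment gives the same `wide` matrix, which is why the serialisation order is free.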

*Note: The `.html#` in these example IRIs does not mean the resource is an HTML file.  
Any absolute IRI is valid in RDF. This form is used here only for illustration;  
in practice, a bare namespace such as `/dataset#` is more conventional.*


```{r readrdf2}
# Write the N-Triples created earlier to a temporary file
temp_file <- tempfile(fileext = ".nt")
writeLines(small_country_df_nt,
  con = temp_file
)

rdf_graph <- rdf()
rdf_parse(rdf_graph, doc = temp_file, format = "ntriples")
```
```{r readrdf2print, eval = FALSE}
rdf_graph
```
:::: smaller
```{r readrdf2printsmaller, echo=FALSE}
rdf_graph
```
:::: 

Your dataset is now ready to be exported in line with the FAIR principles, because it is:

- **self-descriptive**: variables carry labels, units, and definitions.
- **machine-readable**: it uses linked vocabularies and standard identifiers.
- **ready to publish and share**: it carries the metadata of each variable, potentially of each observation unit, and, through metadata standards like Dublin Core and DataCite, information about the whole dataset, too.

```{r readjsonld}
# Create temporary JSON-LD output file
jsonld_file <- tempfile(fileext = ".jsonld")

# Serialize (export) the entire graph to JSON-LD format
rdf_serialize(rdf_graph, doc = jsonld_file, format = "jsonld")
```

Read it back into R for display (only the first 30 lines are shown):

:::: smaller
```{r readjsonldprint}
cat(head(readLines(jsonld_file), 30), sep = "\n")
```
::::
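
JSON-LD is one of several serialisations that `rdf_serialize()` supports. As a hedged sketch, the same approach can write Turtle with a namespace prefix, which is often easier to read. The graph `g` below is a one-statement placeholder with an `example.org` subject, and the `sdmx` prefix name is an arbitrary choice made for this illustration.

```r
library(rdflib)

# One-statement placeholder graph
g <- rdf()
rdf_add(
  g,
  subject = "https://example.org/dataset#obs1",
  predicate = "http://purl.org/linked-data/sdmx/2009/dimension#refArea",
  object = "https://www.geonames.org/countries/AD/",
  objectType = "uri"
)

# Write Turtle, abbreviating the SDMX dimension namespace with a prefix
turtle_file <- tempfile(fileext = ".ttl")
rdf_serialize(
  g,
  doc = turtle_file,
  format = "turtle",
  namespace = c(sdmx = "http://purl.org/linked-data/sdmx/2009/dimension#")
)

turtle_lines <- readLines(turtle_file)
cat(turtle_lines, sep = "\n")

unlink(turtle_file)
```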

```{r clenup2, echo=FALSE, message=FALSE}
unlink(c(temp_file, jsonld_file))
rm(rdf_graph)
gc()
```
