This vignette introduces the smap family of functions. These functions are useful when you want to apply custom igraph-based operations to glycan structure vectors.
This guide assumes you are comfortable with R programming and have some familiarity with graph concepts. If you are just getting started, read the “Getting Started with glyrepr” vignette first.
library(glyrepr)
Before using smap, it helps to understand why these functions exist.
Working with glycan structures means working with graphs, and graph operations are computationally expensive. When you are analyzing thousands of glycans from a large-scale study, this becomes a real bottleneck.
glyrepr implements an optimization called unique structure storage. Instead of storing thousands of identical graphs, it stores only the unique ones and keeps track of which original positions they belong to.
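The idea can be sketched in base R with plain character values standing in for graphs. This is only a conceptual illustration under that assumption, not glyrepr's internal implementation; the names `unique_values` and `codes` are made up for the example.

```r
# Conceptual sketch only: NOT glyrepr's internal implementation.
# Character values stand in for glycan graphs.
x <- rep(c("A", "B", "C"), times = 4)

unique_values <- unique(x)        # each distinct value is stored once
codes <- match(x, unique_values)  # for every position, which unique value it holds

length(unique_values)  # number of stored values
length(codes)          # number of positions in the full vector

# The full vector can always be rebuilt from the two pieces.
identical(unique_values[codes], x)
```

Storing one copy per distinct value plus a small integer index is what keeps memory use low when values repeat often.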
Let’s see this in action:
# Our test data: some common glycan structures
iupacs <- c(
"Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-", # N-glycan core
"Gal(b1-3)GalNAc(a1-", # O-glycan core 1
"Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-", # O-glycan core 2
"Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(a1-", # Branched mannose
"GlcNAc6Ac(b1-4)Glc3Me(a1-" # With decorations
)
struc <- as_glycan_structure(iupacs)
# Now create a realistic dataset with lots of repetition.
large_struc <- rep(struc, 1000) # 5,000 total structures
large_struc
#> <glycan_structure[5000]>
#> [1] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> [2] Gal(b1-3)GalNAc(a1-
#> [3] Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-
#> [4] Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(a1-
#> [5] GlcNAc6Ac(b1-4)Glc3Me(a1-
#> [6] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> [7] Gal(b1-3)GalNAc(a1-
#> [8] Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-
#> [9] Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(a1-
#> [10] GlcNAc6Ac(b1-4)Glc3Me(a1-
#> ... (4990 more not shown)
#> # Unique structures: 5
Notice that the object reports only 5 unique structures. The vector has 5,000 elements, but only 5 unique graphs are stored internally.
We can verify that directly:
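# Inspect the internal "structures" attribute. In this run (glyrepr 0.11.0)
# the lookup returns length 0, so the attribute is apparently not set in this
# version; the "# Unique structures: 5" line printed above gives the count.
length(attr(large_struc, "structures"))
#> [1] 0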
# But we have 5,000 total elements
length(large_struc)
#> [1] 5000
library(lobstr)
#> Warning: package 'lobstr' was built under R version 4.5.2
obj_sizes(struc, large_struc)
#> * 13.78 kB
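#> * 40.69 kB
The memory difference can be substantial. Here the vector with 1,000 times as many elements occupies only about three times the memory (40.69 kB versus 13.78 kB). For repeated structures, the optimized representation can be much smaller than storing every graph independently.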
smap Family
There is one important consequence of this internal representation: regular lapply() or purrr::map() functions do not operate directly on a glycan structure vector as if it were a list of graphs.
# This will not work and will raise an error.
tryCatch(
purrr::map_int(large_struc, ~ igraph::vcount(.x)),
error = function(e) cat("Error:", rlang::cnd_message(e))
)
#> Error: ℹ In index: 1.
#> Caused by error in `ensure_igraph()`:
#> ! Must provide a graph object (provided wrong object type).
Why does this fail? Because purrr functions don’t understand the internal structure optimization of glycan_structure objects.
The smap functions are structure-aware alternatives to purrr mapping functions. They understand the unique structure optimization and work directly with the underlying graph objects.
vertex_counts <- smap_int(large_struc, ~ igraph::vcount(.x))
vertex_counts[1:10]
#> [1] 5 2 3 5 2 5 2 3 5 2
The “s” stands for “structure”: these functions operate on the underlying igraph objects that represent glycan structures.
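A deduplication-aware map can be sketched in base R as follows. The helper `dedup_map_int()` and the call counter are hypothetical names introduced for this illustration, and character values again stand in for graphs; how glyrepr implements smap internally may differ.

```r
# Hypothetical helper for illustration; not part of glyrepr.
dedup_map_int <- function(x, f) {
  unique_values <- unique(x)
  codes <- match(x, unique_values)
  # f runs once per unique value ...
  unique_results <- vapply(unique_values, f, integer(1), USE.NAMES = FALSE)
  # ... and the results are expanded back to the original positions.
  unique_results[codes]
}

# Count how often the mapped function is actually called.
calls <- 0L
counting_nchar <- function(s) {
  calls <<- calls + 1L
  nchar(s)
}

x <- rep(c("Gal", "GlcNAc", "Man"), times = 1000)
res <- dedup_map_int(x, counting_nchar)

length(res)  # one result per element of x
calls        # one call per unique value
```

The result has the same length as the input, while the mapped function runs only once for each distinct value.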
smap Toolkit
The smap family provides glycan-aware equivalents for the most commonly used purrr mapping and predicate functions:
| purrr | smap | purrr | smap |
|---|---|---|---|
| map() | smap() | map2() | smap2() |
| map_lgl() | smap_lgl() | map2_lgl() | smap2_lgl() |
| map_int() | smap_int() | map2_int() | smap2_int() |
| map_dbl() | smap_dbl() | map2_dbl() | smap2_dbl() |
| map_chr() | smap_chr() | map2_chr() | smap2_chr() |
| some() | ssome() | pmap() | spmap() |
| every() | severy() | pmap_*() | spmap_*() |
| none() | snone() | imap() | simap() |
| | | imap_*() | simap_*() |
As a simple rule, replace map with smap, pmap with spmap, and imap with simap. The function signatures are designed to feel familiar if you already use purrr.
Count vertices in each structure:
vertex_counts <- smap_int(large_struc, igraph::vcount)
summary(vertex_counts)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>     2.0     2.0     3.0     3.4     5.0     5.0
Find structures with more than 4 vertices:
has_many_vertices <- smap_lgl(large_struc, ~ igraph::vcount(.x) > 4)
sum(has_many_vertices)
#> [1] 2000
Get the degree sequence of each structure:
degree_sequences <- smap(large_struc, ~ igraph::degree(.x))
degree_sequences[1:3]
#> [[1]]
#> 1 2 3 4 5
#> 1 1 3 2 1
#>
#> [[2]]
#> 1 2
#> 1 1
#>
#> [[3]]
#> 1 2 3
#> 1 1 2
Check if any structure has isolated vertices:
ssome(large_struc, ~ any(igraph::degree(.x) == 0))
#> [1] FALSE
Verify all structures are connected:
severy(large_struc, ~ igraph::is_connected(.x))
#> [1] TRUE
smap()
Quick examples of the extended family:
# smap2: Apply function with additional parameters
thresholds <- c(3, 4, 5)
large_enough <- smap2_lgl(struc[1:3], thresholds, function(g, threshold) {
igraph::vcount(g) >= threshold
})
large_enough
#> [1] TRUE FALSE FALSE
# simap: Include position information
indexed_report <- simap_chr(large_struc[1:3], function(g, i) {
paste0("#", i, ": ", igraph::vcount(g), " vertices")
})
indexed_report
#> [1] "#1: 5 vertices" "#2: 2 vertices" "#3: 3 vertices"
Performance note: simap functions do not benefit from the unique structure optimization. Since each element has a different index, the combination of (structure, index) is always unique, breaking the deduplication that makes other smap functions fast. Use simap only when you truly need position information.
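A small base R sketch shows why adding the index defeats deduplication. Character values stand in for structures here, and the variable names are made up for the example.

```r
# Conceptual sketch: character values stand in for glycan structures.
x <- rep(c("A", "B", "C"), times = 4)

# Without the index, only the distinct values need processing.
n_unique_values <- length(unique(x))

# With the index attached, every (value, index) pair is distinct.
pairs <- data.frame(value = x, index = seq_along(x))
n_unique_pairs <- nrow(unique(pairs))

n_unique_values  # number of distinct values
n_unique_pairs   # equals length(x): nothing left to deduplicate
```

Once the index is part of the input, the number of distinct inputs equals the length of the vector, so the function has to run once per element.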
The main performance benefit of smap functions comes from automatic deduplication:
# Create a large dataset with high redundancy
huge_struc <- rep(struc, 5000) # 25,000 structures, only 5 unique
cat("Dataset size:", length(huge_struc), "structures\n")
#> Dataset size: 25000 structures
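# The "structures" attribute lookup returns length 0 in this run (glyrepr 0.11.0),
# so the two figures below do not reflect the 5 unique structures in the data.
cat("Unique structures:", length(attr(huge_struc, "structures")), "\n")
#> Unique structures: 0
cat("Redundancy factor:", length(huge_struc) / length(attr(huge_struc, "structures")), "x\n")
#> Redundancy factor: Inf x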
library(tictoc)
# Optimized approach: smap only processes 5 unique structures
tic("smap_int (optimized)")
vertex_counts_optimized <- smap_int(huge_struc, igraph::vcount)
toc()
#> smap_int (optimized): 0.001 sec elapsed
# Naive approach: extract all graphs and process each one
tic("Naive approach (all graphs)")
all_graphs <- get_structure_graphs(huge_struc) # Extracts all 25,000 graphs
vertex_counts_naive <- purrr::map_int(all_graphs, igraph::vcount)
toc()
#> Naive approach (all graphs): 0.086 sec elapsed
# Verify results are equivalent (though data types may differ)
all.equal(vertex_counts_optimized, vertex_counts_naive)
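#> [1] TRUE
The higher the redundancy, the larger the performance gain. In the run above, smap_int took 0.001 s against 0.086 s for the naive approach, although timings this short are noisy and the dataset is extremely redundant. In real glycoproteomics datasets with repeated structures, this optimization can provide speedups of about 10x.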
The function you pass to smap must accept an igraph object as its first argument. You can use purrr-style lambda notation:
# Calculate clustering coefficient for each structure
clustering_coeffs <- smap_dbl(large_struc, ~ igraph::transitivity(.x, type = "global"))
summary(clustering_coeffs)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
#>       0       0       0       0       0       0    2000
# Create a compact structure summary.
structure_analysis <- smap(large_struc, function(g) {
list(
vertices = igraph::vcount(g),
edges = igraph::ecount(g),
diameter = if (igraph::is_connected(g)) igraph::diameter(g) else NA_real_,
clustering = igraph::transitivity(g, type = "global")
)
})
# Convert to a more usable format
analysis_df <- do.call(rbind, lapply(structure_analysis, data.frame))
head(analysis_df)
#> vertices edges diameter clustering
#> 1 5 4 3 0
#> 2 2 1 1 NaN
#> 3 3 2 1 0
#> 4 5 4 2 0
#> 5 2 1 1 NaN
#> 6 5 4 3 0
# Find only structures with exactly 5 vertices
has_five_vertices <- smap_lgl(large_struc, ~ igraph::vcount(.x) == 5)
five_vertex_structures <- large_struc[has_five_vertices]
cat("Found", sum(has_five_vertices), "structures with exactly 5 vertices\n")
#> Found 2000 structures with exactly 5 vertices
smap Functions
Use smap functions when:
- You need to apply igraph-based functions to glycan structures.
Use regular R functions when:
- You are working with other data types.
Special note on simap:
While simap functions are convenient for position-aware operations, they do not provide performance benefits over regular imap functions. The inclusion of index information breaks the unique structure optimization, making each (structure, index) pair unique even when structures are identical.
Here’s how you might build a custom glycan analysis pipeline using smap functions:
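# Custom motif detector: flags any residue with total degree >= 3.
# Note: a residue with only two neighbors is not flagged, so O-glycan core 2
# above (degrees 1, 1, 2) does not count as branched under this rule.
detect_branching <- function(g) {
degrees <- igraph::degree(g)
any(degrees >= 3)
}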
# Apply to a large dataset using unique structure optimization.
has_branching <- smap_lgl(large_struc, detect_branching)
cat("Structures with branching:", sum(has_branching), "out of", length(large_struc), "\n")
#> Structures with branching: 2000 out of 5000
# Use smap2 to check structures against complexity thresholds
complexity_thresholds <- rep(c(3, 4, 5, 2, 4), 1000) # Thresholds for each structure
meets_threshold <- smap2_lgl(large_struc, complexity_thresholds, function(g, threshold) {
igraph::vcount(g) >= threshold
})
cat("Structures meeting complexity threshold:", sum(meets_threshold), "out of", length(large_struc), "\n")
#> Structures meeting complexity threshold: 2000 out of 5000
The smap family provides structure-aware mapping functions for glycan structure vectors. It lets you write custom graph-based analyses while preserving the unique structure optimization used by glyrepr.
Key takeaways:
- smap functions are purrr-like tools that understand glycan structure vectors.
- Use smap for structures, and use regular R or purrr functions for other data types.
sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Tahoe 26.3.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
#>
#> locale:
#> [1] C/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8
#>
#> time zone: Asia/Shanghai
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] tictoc_1.2.1 lobstr_1.1.3 dplyr_1.2.1 tibble_3.3.1 purrr_1.2.2
#> [6] glyrepr_0.11.0
#>
#> loaded via a namespace (and not attached):
#> [1] jsonlite_2.0.0 compiler_4.5.1 tidyselect_1.2.1 stringr_1.6.0
#> [5] jquerylib_0.1.4 yaml_2.3.12 fastmap_1.2.0 R6_2.6.1
#> [9] generics_0.1.4 igraph_2.2.3 knitr_1.51 backports_1.5.1
#> [13] checkmate_2.3.4 rstackdeque_1.1.1 bslib_0.9.0 pillar_1.11.1
#> [17] rlang_1.2.0 utf8_1.2.6 cachem_1.1.0 stringi_1.8.7
#> [21] xfun_0.55 sass_0.4.10 otel_0.2.0 cli_3.6.6
#> [25] magrittr_2.0.5 digest_0.6.39 lifecycle_1.0.5 prettyunits_1.2.0
#> [29] vctrs_0.7.3 evaluate_1.0.5 glue_1.8.0 rmarkdown_2.30
#> [33] tools_4.5.1 pkgconfig_2.0.3 htmltools_0.5.9