Efficient Glycan Manipulation with smap

Overview

This vignette introduces the smap family of functions. These functions are useful when you want to apply custom igraph-based operations to glycan structure vectors.

This guide assumes you are comfortable with R programming and have some familiarity with graph concepts. If you are just getting started, read the “Getting Started with glyrepr” vignette first.

library(glyrepr)

Unique Structure Optimization

Before using smap, it helps to understand why these functions exist.

The Problem

Working with glycan structures means working with graphs, and graph operations are computationally expensive. When you are analyzing thousands of glycans from a large-scale study, this becomes a real bottleneck.

The Solution

glyrepr implements an optimization called unique structure storage. Instead of storing thousands of identical graphs, it stores only the unique ones and keeps track of which original positions they belong to.

Let’s see this in action:

# Our test data: some common glycan structures
iupacs <- c(
  "Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-",  # N-glycan core
  "Gal(b1-3)GalNAc(a1-",                                    # O-glycan core 1
  "Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-",                     # O-glycan core 2
  "Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(a1-",          # Branched mannose
  "GlcNAc6Ac(b1-4)Glc3Me(a1-"                              # With decorations
)

struc <- as_glycan_structure(iupacs)

# Now create a realistic dataset with lots of repetition.
large_struc <- rep(struc, 1000)  # 5,000 total structures
large_struc
#> <glycan_structure[5000]>
#> [1] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> [2] Gal(b1-3)GalNAc(a1-
#> [3] Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-
#> [4] Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(a1-
#> [5] GlcNAc6Ac(b1-4)Glc3Me(a1-
#> [6] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> [7] Gal(b1-3)GalNAc(a1-
#> [8] Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-
#> [9] Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(a1-
#> [10] GlcNAc6Ac(b1-4)Glc3Me(a1-
#> ... (4990 more not shown)
#> # Unique structures: 5

Notice that the object reports only 5 unique structures. The vector has 5,000 elements, but only 5 unique graphs are stored internally.

We can verify that directly:

# Only 5 distinct structures back the whole vector
length(unique(large_struc))
#> [1] 5

# But we have 5,000 total elements
length(large_struc)
#> [1] 5000
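
The bookkeeping behind this kind of storage can be pictured with a toy sketch in base R. This illustrates the idea only and is an assumption about the general technique, not a description of glyrepr's internal layout: plain strings stand in for graphs, and the names unique_values and index are made up for the example.

```r
# Toy illustration of "unique structure storage" using base R only.
# Assumption: strings stand in for glycan graphs; this is not glyrepr's
# real internal representation.
x <- rep(c("Gal(b1-3)GalNAc(a1-", "GlcNAc6Ac(b1-4)Glc3Me(a1-", "Gal(b1-3)GalNAc(a1-"),
         times = 2)

unique_values <- unique(x)         # each distinct value is kept once
index <- match(x, unique_values)   # where each element lives in unique_values

length(x)              # 6 elements in the vector
length(unique_values)  # but only 2 values actually stored
index

# The original vector can be rebuilt from the two pieces
identical(unique_values[index], x)
```

Repeating the vector more times only lengthens index; unique_values stays the same size.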

Memory Savings

library(lobstr)
#> Warning: package 'lobstr' was built under R version 4.5.2
obj_sizes(struc, large_struc)
#> * 13.78 kB
#> * 40.69 kB

The memory difference can be substantial. In this example, a vector 1,000 times longer takes only about three times the memory (40.69 kB versus 13.78 kB), because repeated elements share the same stored graphs instead of each carrying its own copy.

The smap Family

There is one important consequence of this internal representation: regular lapply() or purrr::map() functions do not operate directly on a glycan structure vector as if it were a list of graphs.

# This will not work and will raise an error.
tryCatch(
  purrr::map_int(large_struc, ~ igraph::vcount(.x)),
  error = function(e) cat("Error:", rlang::cnd_message(e))
)
#> Error: ℹ In index: 1.
#> Caused by error in `ensure_igraph()`:
#> ! Must provide a graph object (provided wrong object type).

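Why does this fail? purrr functions don’t understand the internal structure optimization of glycan_structure objects. They iterate over the vector's elements as they see them, and those elements are not igraph objects, so igraph::vcount() rejects them.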

Structure-Aware Mapping

The smap functions are structure-aware alternatives to purrr mapping functions. They understand the unique structure optimization and work directly with the underlying graph objects.

vertex_counts <- smap_int(large_struc, ~ igraph::vcount(.x))
vertex_counts[1:10]
#>  [1] 5 2 3 5 2 5 2 3 5 2

The “s” stands for “structure”: these functions operate on the underlying igraph objects that represent glycan structures.
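
To make "structure-aware" concrete, here is a minimal base R sketch of the general technique: evaluate the function once per unique value, then copy each result back to every position that holds that value. The helper dedup_map_int() is hypothetical and not part of glyrepr, and strings stand in for graphs.

```r
# Hypothetical helper (not part of glyrepr): apply `f` once per unique value,
# then expand the results back to the original positions.
dedup_map_int <- function(x, f) {
  unique_values <- unique(x)
  unique_results <- vapply(unique_values, f, integer(1), USE.NAMES = FALSE)
  unique_results[match(x, unique_values)]
}

labels <- rep(c("aaaaa", "bb", "ccc"), times = 4)  # 12 elements, 3 unique
result <- dedup_map_int(labels, nchar)

length(result)                      # one result per element
identical(result, nchar(labels))    # same answer as mapping over every element
```

The output has the same length and order as the input, so callers do not need to know that deduplication happened.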

The smap Toolkit

The smap family provides glycan-aware equivalents for the main purrr mapping and predicate functions:

purrr       smap        purrr       smap
map()       smap()      map2()      smap2()
map_lgl()   smap_lgl()  map2_lgl()  smap2_lgl()
map_int()   smap_int()  map2_int()  smap2_int()
map_dbl()   smap_dbl()  map2_dbl()  smap2_dbl()
map_chr()   smap_chr()  map2_chr()  smap2_chr()
some()      ssome()     pmap()      spmap()
every()     severy()    pmap_*()    spmap_*()
none()      snone()     imap()      simap()
                        imap_*()    simap_*()

As a simple rule, replace map with smap, pmap with spmap, and imap with simap. The function signatures are designed to feel familiar if you already use purrr.

Basic Examples

Count vertices in each structure:

vertex_counts <- smap_int(large_struc, igraph::vcount)
summary(vertex_counts)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>     2.0     2.0     3.0     3.4     5.0     5.0

Find structures with more than 4 vertices:

has_many_vertices <- smap_lgl(large_struc, ~ igraph::vcount(.x) > 4)
sum(has_many_vertices)
#> [1] 2000

Get the degree sequence of each structure:

degree_sequences <- smap(large_struc, ~ igraph::degree(.x))
degree_sequences[1:3]
#> [[1]]
#> 1 2 3 4 5 
#> 1 1 3 2 1 
#> 
#> [[2]]
#> 1 2 
#> 1 1 
#> 
#> [[3]]
#> 1 2 3 
#> 1 1 2

Check if any structure has isolated vertices:

ssome(large_struc, ~ any(igraph::degree(.x) == 0))
#> [1] FALSE

Verify all structures are connected:

severy(large_struc, ~ igraph::is_connected(.x))
#> [1] TRUE

Beyond Basic smap()

Quick examples of the extended family:

# smap2: Apply function with additional parameters
thresholds <- c(3, 4, 5)
large_enough <- smap2_lgl(struc[1:3], thresholds, function(g, threshold) {
  igraph::vcount(g) >= threshold
})
large_enough
#> [1]  TRUE FALSE FALSE

# simap: Include position information
indexed_report <- simap_chr(large_struc[1:3], function(g, i) {
  paste0("#", i, ": ", igraph::vcount(g), " vertices")
})
indexed_report
#> [1] "#1: 5 vertices" "#2: 2 vertices" "#3: 3 vertices"

Performance note: simap functions do not benefit from the unique structure optimization. Since each element has a different index, the combination of (structure, index) is always unique, breaking the deduplication that makes other smap functions fast. Use simap only when you truly need position information.
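
A small base R sketch shows why pairing each element with its index defeats deduplication. It is a toy illustration under the assumption that work is shared only between identical inputs; the letters are stand-ins for glycan structures.

```r
# Toy illustration: letters stand in for glycan structures.
structures <- rep(c("A", "B"), times = 3)
positions <- seq_along(structures)

# Keyed by structure alone: only a few distinct inputs need evaluating
n_by_structure <- nrow(unique(data.frame(structure = structures)))

# Keyed by (structure, index): every pair is distinct, so nothing is shared
n_by_pair <- nrow(unique(data.frame(structure = structures, index = positions)))

c(by_structure = n_by_structure, by_pair = n_by_pair)
```

With the index included, the number of distinct inputs always equals the length of the vector.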

Performance

The main performance benefit of smap functions comes from automatic deduplication:

# Create a large dataset with high redundancy
huge_struc <- rep(struc, 5000)  # 25,000 structures, only 5 unique

cat("Dataset size:", length(huge_struc), "structures\n")
#> Dataset size: 25000 structures
n_unique <- length(unique(huge_struc))
cat("Unique structures:", n_unique, "\n")
#> Unique structures: 5
cat("Redundancy factor:", length(huge_struc) / n_unique, "x\n")
#> Redundancy factor: 5000 x

library(tictoc)

# Optimized approach: smap only processes 5 unique structures
tic("smap_int (optimized)")
vertex_counts_optimized <- smap_int(huge_struc, igraph::vcount)
toc()
#> smap_int (optimized): 0.001 sec elapsed

# Naive approach: extract all graphs and process each one
tic("Naive approach (all graphs)")
all_graphs <- get_structure_graphs(huge_struc)  # Extracts all 25,000 graphs
vertex_counts_naive <- purrr::map_int(all_graphs, igraph::vcount)
toc()
#> Naive approach (all graphs): 0.086 sec elapsed

# Verify results are equivalent (though data types may differ)
all.equal(vertex_counts_optimized, vertex_counts_naive)
#> [1] TRUE

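The higher the redundancy, the larger the performance gain. In the timing above, smap_int() finished in about 0.001 s compared with 0.086 s for the naive approach. The speedup on real glycoproteomics datasets depends on how often structures repeat and on how expensive the per-graph function is.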
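
The relationship between redundancy and work can be sketched in base R by counting function calls. This is an illustration under stated assumptions, not a benchmark of glyrepr: count_dedup_calls() is a hypothetical helper, and strings stand in for graphs.

```r
# Hypothetical helper (not part of glyrepr): count how many times the mapped
# function runs when results are computed once per unique value.
count_dedup_calls <- function(x) {
  calls <- 0L
  counted_nchar <- function(value) {
    calls <<- calls + 1L   # updates the counter in count_dedup_calls()
    nchar(value)
  }
  unique_values <- unique(x)
  vapply(unique_values, counted_nchar, integer(1), USE.NAMES = FALSE)
  calls
}

base_values <- c("a", "bb", "ccc", "dddd", "eeeee")
repeats <- c(10, 100, 1000)

calls_needed <- vapply(
  repeats,
  function(times) count_dedup_calls(rep(base_values, times)),
  integer(1)
)

# A naive map calls the function once per element; the deduplicated
# version calls it once per unique value, however long the vector is.
data.frame(
  elements = length(base_values) * repeats,
  dedup_calls = calls_needed
)
```

The call count stays constant while the vector grows, which is why the gain increases with redundancy.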

Additional Patterns

Working with Complex Functions

The function you pass to smap must accept an igraph object as its first argument. You can use purrr-style lambda notation:

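# Calculate the global clustering coefficient for each structure.
# It is NaN (reported as NA's below) when a graph has no connected triples,
# as in the two-vertex structures.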
clustering_coeffs <- smap_dbl(large_struc, ~ igraph::transitivity(.x, type = "global"))
summary(clustering_coeffs)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>       0       0       0       0       0       0    2000
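
The function argument can be written in any of the forms already used in this vignette: a named function, an anonymous function, or a purrr-style formula. The sketch below checks that they agree on a small input. It assumes glyrepr and igraph are installed; count_edges is a name made up for the example.

```r
library(glyrepr)

small_struc <- as_glycan_structure(c(
  "Gal(b1-3)GalNAc(a1-",
  "Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-"
))

# Illustrative named function; the graph is its first argument
count_edges <- function(g) igraph::ecount(g)

by_name    <- smap_int(small_struc, count_edges)
by_formula <- smap_int(small_struc, ~ igraph::ecount(.x))
by_closure <- smap_int(small_struc, function(g) igraph::ecount(g))

# All three forms should give the same result
identical(by_name, by_formula) && identical(by_formula, by_closure)
```

A named function is usually the easiest to test and reuse once the logic grows beyond a single call.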

Combining Multiple Metrics

# Create a compact structure summary.
structure_analysis <- smap(large_struc, function(g) {
  list(
    vertices = igraph::vcount(g),
    edges = igraph::ecount(g),
    diameter = if (igraph::is_connected(g)) igraph::diameter(g) else NA_real_,
    clustering = igraph::transitivity(g, type = "global")
  )
})

# Convert to a more usable format
analysis_df <- do.call(rbind, lapply(structure_analysis, data.frame))
head(analysis_df)
#>   vertices edges diameter clustering
#> 1        5     4        3          0
#> 2        2     1        1        NaN
#> 3        3     2        1          0
#> 4        5     4        2          0
#> 5        2     1        1        NaN
#> 6        5     4        3          0

Memory-Efficient Filtering

# Find only structures with exactly 5 vertices
has_five_vertices <- smap_lgl(large_struc, ~ igraph::vcount(.x) == 5)
five_vertex_structures <- large_struc[has_five_vertices]

cat("Found", sum(has_five_vertices), "structures with exactly 5 vertices\n")
#> Found 2000 structures with exactly 5 vertices

When to Use smap Functions

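Use smap functions when you want to apply igraph-based operations to a glycan structure vector, especially one with many repeated structures, so that the work is shared between identical structures.

Use regular R functions when you are working with ordinary R objects, such as the list returned by smap() or the graphs extracted with get_structure_graphs().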

Special note on simap:

While simap functions are convenient for position-aware operations, they do not provide performance benefits over regular imap functions. The inclusion of index information breaks the unique structure optimization, making each (structure, index) pair unique even when structures are identical.

Example: Custom Motif Detection

Here’s how you might build a custom glycan analysis pipeline using smap functions:

# Custom motif detector
detect_branching <- function(g) {
  degrees <- igraph::degree(g)
  any(degrees >= 3)
}

# Apply to a large dataset using unique structure optimization.
has_branching <- smap_lgl(large_struc, detect_branching)
cat("Structures with branching:", sum(has_branching), "out of", length(large_struc), "\n")
#> Structures with branching: 2000 out of 5000

# Use smap2 to check structures against complexity thresholds
complexity_thresholds <- rep(c(3, 4, 5, 2, 4), 1000)  # Thresholds for each structure
meets_threshold <- smap2_lgl(large_struc, complexity_thresholds, function(g, threshold) {
  igraph::vcount(g) >= threshold
})
cat("Structures meeting complexity threshold:", sum(meets_threshold), "out of", length(large_struc), "\n")
#> Structures meeting complexity threshold: 2000 out of 5000

Summary

The smap family provides structure-aware mapping functions for glycan structure vectors. It lets you write custom graph-based analyses while preserving the unique structure optimization used by glyrepr.

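Key takeaways: smap functions take advantage of repeated structures instead of recomputing the same result for every element; their names and signatures mirror purrr (map becomes smap, pmap becomes spmap, imap becomes simap); and simap functions do not benefit from deduplication, so use them only when you need position information.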

Session Information

sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Tahoe 26.3.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
#> 
#> locale:
#> [1] C/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8
#> 
#> time zone: Asia/Shanghai
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] tictoc_1.2.1   lobstr_1.1.3   dplyr_1.2.1    tibble_3.3.1   purrr_1.2.2   
#> [6] glyrepr_0.11.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] jsonlite_2.0.0    compiler_4.5.1    tidyselect_1.2.1  stringr_1.6.0    
#>  [5] jquerylib_0.1.4   yaml_2.3.12       fastmap_1.2.0     R6_2.6.1         
#>  [9] generics_0.1.4    igraph_2.2.3      knitr_1.51        backports_1.5.1  
#> [13] checkmate_2.3.4   rstackdeque_1.1.1 bslib_0.9.0       pillar_1.11.1    
#> [17] rlang_1.2.0       utf8_1.2.6        cachem_1.1.0      stringi_1.8.7    
#> [21] xfun_0.55         sass_0.4.10       otel_0.2.0        cli_3.6.6        
#> [25] magrittr_2.0.5    digest_0.6.39     lifecycle_1.0.5   prettyunits_1.2.0
#> [29] vctrs_0.7.3       evaluate_1.0.5    glue_1.8.0        rmarkdown_2.30   
#> [33] tools_4.5.1       pkgconfig_2.0.3   htmltools_0.5.9