---
title: "6. tidyped Class Structure and Extension Notes"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{6. tidyped Class Structure and Extension Notes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This document describes the structural contract of the `tidyped` class in
visPedigree 1.8.0. It is intended for maintenance and extension work.

## 1. Class identity

`tidyped` is an S3 class layered on top of `data.table`.

Expected class vector:

```r
c("tidyped", "data.table", "data.frame")
```

The class is created through `new_tidyped()` (internal constructor) and checked
with `is_tidyped()`.

## 2. Core design goals

`tidyped` is designed to be:

1. **safe for C++**: integer pedigree indices (`IndNum`, `SireNum`, `DamNum`)
  are always aligned with row order, so C++ routines can index directly
  without translation;
2. **fast for large pedigrees**: the fast path skips redundant validation when
  the input is already a `tidyped`;
3. **compatible with `data.table`**: in-place modification via `:=` and `set()`
  preserves class and metadata without copying;
4. **explicit about structural degradation**: row subsets that break pedigree
  completeness are downgraded to plain `data.table` with a warning.

## 3. The head invariant: IndNum == row index

The single most important structural rule in visPedigree:

> **`IndNum[i]` must equal `i` for every row.**

This means `SireNum` and `DamNum` are direct row pointers: the sire of
individual `i` lives at row `SireNum[i]`, and `0L` encodes a missing parent.

Every C++ function in visPedigree (inbreeding coefficients, relationship
matrices, BFS tracing, topological sorting) relies on this invariant. If it
breaks, the C++ routines read the wrong rows as parents.

This invariant is enforced at three levels:

- **`tidyped()`**: builds indices from scratch during construction.
- **`[.tidyped`**: rebuilds indices in-place after valid row subsets.
- **`ensure_tidyped()` / `ensure_complete_tidyped()`**: detect and repair
  stale indices when the class was accidentally dropped.
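
As a minimal sketch of what the invariant buys, the toy pedigree below is laid
out by hand rather than built by `tidyped()`. The `parent_id()` helper is a
hypothetical name used only for illustration; it is not part of the package.

```r
# Toy pedigree laid out by hand; real objects come from tidyped().
ped <- data.frame(
  Ind     = c("A", "B", "C", "D"),
  IndNum  = 1:4,
  SireNum = c(0L, 0L, 1L, 3L),
  DamNum  = c(0L, 0L, 2L, 2L),
  stringsAsFactors = FALSE
)

# The invariant itself: IndNum[i] == i for every row.
stopifnot(identical(ped$IndNum, seq_len(nrow(ped))))

# Hypothetical helper: resolve parent indices to IDs, mapping 0L to NA.
parent_id <- function(ind, num) {
  out <- rep(NA_character_, length(num))
  known <- num > 0L
  out[known] <- ind[num[known]]
  out
}

sire_ids <- parent_id(ped$Ind, ped$SireNum)  # direct row lookup, no ID matching
dam_ids  <- parent_id(ped$Ind, ped$DamNum)
```

Because the indices are row pointers, no ID matching is needed at lookup time;
this is the property the C++ routines depend on.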

## 4. Column contract

### 4.1 Minimal structural columns

These four columns define a valid `tidyped`:

| Column | Type      | Description                          |
|--------|-----------|--------------------------------------|
| `Ind`  | character | Unique individual ID                 |
| `Sire` | character | Sire ID, `NA` for unknown            |
| `Dam`  | character | Dam ID, `NA` for unknown             |
| `Sex`  | character | `"male"`, `"female"`, or `"unknown"` |

Checked by `validate_tidyped()`.

### 4.2 Integer pedigree columns

| Column    | Type    | Description                         |
|-----------|---------|-------------------------------------|
| `IndNum`  | integer | Row index (== row number, see §3)   |
| `SireNum` | integer | Row index of sire, `0L` for missing |
| `DamNum`  | integer | Row index of dam, `0L` for missing  |

These exist whenever `tidyped()` is called with `addnum = TRUE` (default).
They are the interface between R and C++.
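
The index columns can be derived from the ID columns with base R `match()`.
The sketch below shows the idea on a hand-built table; it is an illustration
under that assumption, not the package's internal code.

```r
ped <- data.frame(
  Ind  = c("A", "B", "C", "D"),
  Sire = c(NA, NA, "A", "C"),
  Dam  = c(NA, NA, "B", "B"),
  stringsAsFactors = FALSE
)

# Row position doubles as the individual's index.
ped$IndNum <- seq_len(nrow(ped))

# match() returns the row of each parent; nomatch = 0L encodes "missing".
ped$SireNum <- match(ped$Sire, ped$Ind, nomatch = 0L)
ped$DamNum  <- match(ped$Dam,  ped$Ind, nomatch = 0L)
```

Note that `nomatch = 0L` also maps a non-`NA` parent that is absent from `Ind`
to `0L`, so completeness has to be checked separately before the indices are
trusted.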

### 4.3 Other common columns

| Column       | Description                                  |
|--------------|----------------------------------------------|
| `Gen`        | Generation number                            |
| `Family`     | Family group code                            |
| `FamilySize` | Number of offspring in the family            |
| `Cand`       | `TRUE` for candidate individuals             |
| `f`          | Inbreeding coefficient (added by `inbreed()`) |

### 4.4 Column naming convention

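All data columns use **PascalCase** (`Ind`, `SireNum`, `MeanF`), matching the
core column style, with acronyms kept upper-case as in `ECG`. The inbreeding
coefficient column `f` (§4.3) is a lower-case exception.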

## 5. Metadata layer

Pedigree-level metadata is stored in a single attribute:

```r
attr(x, "ped_meta")
```

Built by `build_ped_meta()`, accessed by `pedmeta()`.

| Field              | Type      | Description                             |
|--------------------|-----------|-----------------------------------------|
| `selfing`          | logical   | Whether self-fertilization mode was used |
| `bisexual_parents` | character | IDs appearing as both sire and dam       |
| `genmethod`        | character | `"top"` or `"bottom"` generation numbering |

No other pedigree-level attributes should be added outside `ped_meta`.
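
A minimal sketch of the single-attribute layout, assuming `data.table` is
installed. The field names follow the table above, the values are made up, and
`demo_meta()` is a hypothetical stand-in for `pedmeta()`.

```r
library(data.table)

x <- data.table(Ind = c("A", "B"), Sire = NA_character_, Dam = NA_character_)

# One named list holds all pedigree-level metadata.
meta <- list(
  selfing          = FALSE,
  bisexual_parents = character(0),
  genmethod        = "top"
)
setattr(x, "ped_meta", meta)  # set by reference, x is not copied

# Hypothetical accessor: callers read fields through a helper,
# not through attr() directly.
demo_meta <- function(x) attr(x, "ped_meta", exact = TRUE)
gm <- demo_meta(x)$genmethod
```

Keeping everything in one list means a new field is a new list element rather
than a new attribute that every subsetting method would have to carry forward.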

## 6. Structural invariants

The following invariants must hold for a valid `tidyped`:

1. **IndNum == row index** (see §3).
2. **Ind is unique** — no duplicate individual IDs.
3. **Completeness** — every non-`NA` Sire and Dam appears in `Ind`.
4. **Acyclicity** — no individual is its own ancestor.
5. **SireNum / DamNum consistency** — `0L` for missing parents, valid row
  indices otherwise.
6. **ped_meta is the sole metadata container** — no scattered attributes.

Invariants 1–5 are established by `tidyped()` and guarded by `[.tidyped`.
Invariant 6 is a development convention.
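
The invariants that can be checked cheaply in R are sketched below.
`check_ped_invariants()` is a hypothetical helper written for this note, it
works on a plain `data.frame`, and it omits the acyclicity check.

```r
check_ped_invariants <- function(ped) {
  n <- nrow(ped)
  # TRUE when num is 0L for unknown parents and points at the right row otherwise
  points_to <- function(num, id) {
    known <- !is.na(id)
    all(num[!known] == 0L) && identical(ped$Ind[num[known]], id[known])
  }
  parents <- c(ped$Sire, ped$Dam)
  c(
    indnum_is_row   = identical(ped$IndNum, seq_len(n)),
    ind_unique      = anyDuplicated(ped$Ind) == 0L,
    complete        = all(is.na(parents) | parents %in% ped$Ind),
    nums_consistent = points_to(ped$SireNum, ped$Sire) &&
      points_to(ped$DamNum, ped$Dam)
  )
}

ped <- data.frame(
  Ind  = c("A", "B", "C", "D"),
  Sire = c(NA, NA, "A", "C"),
  Dam  = c(NA, NA, "B", "B"),
  IndNum  = 1:4,
  SireNum = c(0L, 0L, 1L, 3L),
  DamNum  = c(0L, 0L, 2L, 2L),
  stringsAsFactors = FALSE
)
ok <- check_ped_invariants(ped)

# Reordering rows without rebuilding indices leaves them stale.
bad <- check_ped_invariants(ped[c(2, 1, 3, 4), ])
```

The reordered table still has unique IDs and complete parents, but both index
checks fail, which is the situation the guards and `[.tidyped` exist to catch.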

## 7. Constructor pipeline

`tidyped()` currently has two distinct tracing paths:

- **Raw-input path** (`data.frame` / `data.table`) — uses igraph for loop
  detection, candidate tracing, and topological sorting before integer indices
  are finalized.
- **Fast path** (`tidyped` + `cand`) — skips graph rebuilding and uses C++ for
  candidate tracing and topological sorting on existing integer pedigree
  indices.

### 7.1 Full path: `tidyped(raw_input)`

When the input is a raw `data.frame` or `data.table`:

1. `validate_and_prepare_ped()` — normalize IDs, detect duplicates and
  bisexual parents, inject missing founders.
2. Loop detection — igraph builds a directed graph and checks `is_dag()`;
  `which_loop()` and `shortest_paths()` are used only on the error path to
  report informative loop diagnostics.
3. Candidate tracing — if `cand` is supplied, igraph neighborhood search is
  used on the raw-input path.
4. Topological sort — igraph `topo_sort()` on the raw-input path.
5. Generation assignment — C++ (`cpp_assign_generations_top` /
  `cpp_assign_generations_bottom`) using the pedigree implied by the sorted
  rows.
6. Sex inference — resolve unknowns from parental roles.
7. Build integer indices — `IndNum`, `SireNum`, `DamNum`.
8. `new_tidyped()` + attach `ped_meta`.
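
Steps 2 and 4 can be illustrated with the igraph functions named above. This is
a sketch on a hand-built table, assuming `igraph` is installed; it is not the
package's actual pipeline, and individuals with no parent or offspring edges
would need to be added as isolated vertices.

```r
library(igraph)

ped <- data.frame(
  Ind  = c("A", "B", "C", "D"),
  Sire = c(NA, NA, "A", "C"),
  Dam  = c(NA, NA, "B", "B"),
  stringsAsFactors = FALSE
)

# Parent -> offspring edges, dropping unknown parents.
edges <- rbind(
  cbind(ped$Sire, ped$Ind)[!is.na(ped$Sire), , drop = FALSE],
  cbind(ped$Dam,  ped$Ind)[!is.na(ped$Dam),  , drop = FALSE]
)
g <- graph_from_edgelist(edges, directed = TRUE)

# A pedigree loop would make the graph cyclic.
stopifnot(is_dag(g))

# Order in which every parent precedes its offspring.
ord <- as_ids(topo_sort(g, mode = "out"))
```

Sorting rows in this order before assigning `IndNum` is one way to obtain
indices in which parents precede their offspring.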

### 7.2 Fast path: `tidyped(tp, cand = ids)`

When the input is already a `tidyped` **and** `cand` is supplied:

- **Skipped**: ID validation, loop detection, sex inference, founder injection.
- **Executed**: C++ BFS tracing → C++ topo sort → C++ generation assignment →
  rebuild indices → `new_tidyped()` + `ped_meta`.

The fast path is the preferred workflow for repeated local tracing from a
previously validated master pedigree:

```r
tp_master <- tidyped(raw_ped)
tp_local  <- tidyped(tp_master, cand = ids, trace = "up", tracegen = 3)
```
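
To show why the fast path only needs the integer indices, here is a plain R
sketch of upward breadth-first tracing. `trace_up()` and its `max_gen` argument
are hypothetical; the package does this step in C++, and the exact meaning of
`tracegen` there may differ.

```r
# Return row indices of `start` plus ancestors up to max_gen generations back.
trace_up <- function(sire_num, dam_num, start, max_gen = Inf) {
  seen <- rep(FALSE, length(sire_num))
  seen[start] <- TRUE
  frontier <- start
  gen <- 0
  while (length(frontier) > 0 && gen < max_gen) {
    parents <- unique(c(sire_num[frontier], dam_num[frontier]))
    parents <- parents[parents > 0L]      # drop the 0L "missing" code
    parents <- parents[!seen[parents]]    # visit each row once
    seen[parents] <- TRUE
    frontier <- parents
    gen <- gen + 1
  }
  which(seen)
}

sire_num <- c(0L, 0L, 1L, 3L, 4L)
dam_num  <- c(0L, 0L, 2L, 2L, 0L)

one_gen <- trace_up(sire_num, dam_num, start = 5L, max_gen = 1)
all_gen <- trace_up(sire_num, dam_num, start = 5L)
```

No ID lookups or graph objects are involved: each step is a vector index into
`sire_num` and `dam_num`, which is only valid while the head invariant holds.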

### 7.3 `new_tidyped()` — internal constructor

`new_tidyped()` attaches the `"tidyped"` class via `setattr()` (no copy) and
clears data.table's invisible flag via `x[]`. It does **not** attach
`ped_meta` — that is the caller's responsibility. It should only be called when
the caller has already ensured structural validity.
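
The no-copy class attachment can be sketched with `data.table::setattr()` and
`data.table::address()`. `new_demo_ped()` and the `"demo_ped"` class name are
hypothetical stand-ins, and the `x[]` step is left out of this sketch.

```r
library(data.table)

# Hypothetical stand-in for new_tidyped(): attach a class, validate nothing.
new_demo_ped <- function(x) {
  stopifnot(is.data.table(x))
  setattr(x, "class", c("demo_ped", "data.table", "data.frame"))
  x
}

dt <- data.table(Ind = c("A", "B"))
before <- address(dt)
dp <- new_demo_ped(dt)

same_object <- identical(address(dp), before)  # class was set in place
cls <- class(dt)                               # the original binding sees it too
```

Because the class is set by reference, the caller's original object changes as
well, which is one reason the constructor should only be called on data that
has already been validated.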

## 8. Three-tier guard system

Analysis functions must guard their inputs. visPedigree provides three guard
levels, chosen based on what each function needs.

### 8.1 `validate_tidyped()` — visualization guard

- Attempts silent class recovery via `ensure_tidyped()`.
- Checks only that `Ind`, `Sire`, `Dam`, `Sex` exist.
- **Does not require** pedigree completeness.
- Used by: `visped()`, `plot.tidyped()`, `summary.tidyped()`.

### 8.2 `ensure_tidyped()` — structure-light guard

- If already `tidyped`: returns as-is.
- If the class was dropped but the eight core columns (`Ind`, `Sire`, `Dam`,
  `Sex`, `Gen`, `IndNum`, `SireNum`, `DamNum`) are present: rebuilds `IndNum`
  if stale, restores the class, emits a message.
- **Does not check** pedigree completeness.
- Used by: `pedsubpop()`, `splitped()`, `pedne(method = "demographic")`,
  `pedstats(ecg = FALSE, genint = FALSE)`, `pedfclass()` (when `f` column
  already exists).

### 8.3 `ensure_complete_tidyped()` — complete-pedigree guard

- Everything `ensure_tidyped()` does, **plus**:
- Calls `require_complete_pedigree()` — verifies that every non-`NA` Sire/Dam
  is present in `Ind`. Stops with an error if not.
- Required by any function that recurses through pedigree structure in C++.
- Used by: `inbreed()`, `pedecg()`, `pedgenint()`, `pedrel()`,
  `pedne()` with `method = "inbreeding"` or `method = "coancestry"`,
  `pedcontrib()`, `pedancestry()`, `pedfclass()` (when `f` must be computed),
  `pedpartial()`, `pediv()`, `pedmat()`, `pedhalflife()`.

### 8.4 Choosing the right guard

| Guard                       | Recovers class? | Requires completeness? | When to use                   |
|-----------------------------|:---------------:|:----------------------:|-------------------------------|
| `validate_tidyped()`        | yes             | no                     | Visualization                 |
| `ensure_tidyped()`          | yes             | no                     | Summaries on existing columns |
| `ensure_complete_tidyped()` | yes             | **yes**                | Pedigree recursion in C++     |

Some functions are **conditionally guarded**: they use `ensure_tidyped()` by
default but escalate to `ensure_complete_tidyped()` when a parameter triggers
pedigree recursion (for example `pedstats(ecg = TRUE)`,
`pedne(method = "coancestry")`).
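
The conditional pattern can be sketched as follows. All three functions are
hypothetical simplifications written for this note; the real guards also
handle class recovery and index repair.

```r
# Hypothetical light guard: only checks that the structural columns exist.
demo_ensure <- function(ped) {
  missing_cols <- setdiff(c("Ind", "Sire", "Dam", "Sex"), names(ped))
  if (length(missing_cols) > 0) {
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  }
  ped
}

# Hypothetical strict guard: light guard plus pedigree completeness.
demo_ensure_complete <- function(ped) {
  ped <- demo_ensure(ped)
  parents <- c(ped$Sire, ped$Dam)
  orphans <- setdiff(parents[!is.na(parents)], ped$Ind)
  if (length(orphans) > 0) {
    stop("parents without their own record: ", paste(orphans, collapse = ", "))
  }
  ped
}

# Conditionally guarded function: escalate only when recursion is requested.
demo_stats <- function(ped, recurse = FALSE) {
  ped <- if (recurse) demo_ensure_complete(ped) else demo_ensure(ped)
  nrow(ped)
}

# An incomplete pedigree: sire "A" has no record of its own.
ped <- data.frame(Ind = "C", Sire = "A", Dam = NA_character_, Sex = "male",
                  stringsAsFactors = FALSE)
light  <- demo_stats(ped)
strict <- tryCatch(demo_stats(ped, recurse = TRUE), error = function(e) "error")
```

The same input passes the light guard and is rejected by the strict one, so
column-level summaries keep working on subsets that could not safely be handed
to pedigree recursion.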

## 9. Safe subsetting contract

`[.tidyped` is the key protection layer.

### 9.1 `:=` operations

Modify-by-reference is passed through safely. Class and metadata are preserved
via `setattr()`. No copy occurs.

### 9.2 Column-only selections

If the selection removes core pedigree columns, the result is returned as a
plain `data.table` without warning.

### 9.3 Row subsets

After row subsetting, `[.tidyped` checks pedigree completeness:

- **Complete subset** (all referenced parents still present): `IndNum`,
  `SireNum`, `DamNum` are rebuilt in-place, class and `ped_meta` are preserved.
- **Incomplete subset** (parent records missing): result is downgraded to plain
  `data.table` with a warning guiding the user to
  `tidyped(tp, cand = ids, trace = "up")`.

This downgrade is deliberate. It prevents stale integer indices from reaching
C++ routines.
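
The row-subset rule can be sketched on a plain `data.frame`.
`subset_ped_rows()` and the `"demo_ped"` class are hypothetical, and this
sketch leaves the index columns untouched in the downgrade branch; what
`[.tidyped` does with them is not covered here.

```r
subset_ped_rows <- function(ped, keep) {
  sub <- ped[keep, , drop = FALSE]
  rownames(sub) <- NULL
  parents <- c(sub$Sire, sub$Dam)
  complete <- all(is.na(parents) | parents %in% sub$Ind)
  if (!complete) {
    warning("subset drops parent records; returning a plain table")
    return(sub)
  }
  # Complete subset: rebuild the row pointers for the new row order.
  sub$IndNum  <- seq_len(nrow(sub))
  sub$SireNum <- match(sub$Sire, sub$Ind, nomatch = 0L)
  sub$DamNum  <- match(sub$Dam,  sub$Ind, nomatch = 0L)
  class(sub) <- c("demo_ped", "data.frame")
  sub
}

ped <- data.frame(
  Ind  = c("E", "A", "B", "C", "D"),
  Sire = c(NA, NA, NA, "A", "C"),
  Dam  = c(NA, NA, NA, "B", "B"),
  stringsAsFactors = FALSE
)
ped$IndNum  <- seq_len(nrow(ped))
ped$SireNum <- match(ped$Sire, ped$Ind, nomatch = 0L)
ped$DamNum  <- match(ped$Dam,  ped$Ind, nomatch = 0L)

kept    <- subset_ped_rows(ped, -1)   # drops unrelated founder E
dropped <- tryCatch(subset_ped_rows(ped, 4:5), warning = function(w) "warned")
```

Dropping `E` shifts every remaining row up by one, so the old `SireNum` and
`DamNum` values would point at the wrong rows if they were not rebuilt.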

## 10. Computational boundaries: C++ vs igraph

visPedigree delegates heavy pedigree recursion to C++ and uses igraph where a
graph object is still the simplest representation.

### 10.1 C++ — core computation path

| Task                          | C++ function                                         |
|-------------------------------|------------------------------------------------------|
| Ancestry / descendant tracing | `cpp_trace_ancestors`, `cpp_trace_descendants`       |
| Topological sorting           | `cpp_topo_order`                                     |
| Generation assignment         | `cpp_assign_generations_top`, `cpp_assign_generations_bottom` |
| Inbreeding coefficients       | `cpp_calculate_inbreeding` (Meuwissen & Luo)         |
| Relationship matrices         | `cpp_addmat`, `cpp_dommat`, `cpp_aamat`, `cpp_ainv`  |

All C++ functions consume `SireNum` / `DamNum` integer vectors and assume the
head invariant (§3).

### 10.2 igraph — graph-specific tasks

| Task                   | Where                         | igraph functions                                     |
|------------------------|-------------------------------|------------------------------------------------------|
| Pedigree visualization | `visped()` pipeline           | `graph_from_data_frame`, `layout_with_sugiyama`, `plot.igraph` |
| Connected components   | `splitped()`                  | `graph_from_edgelist`, `components`                  |
| Loop detection         | `tidyped()` raw-input path    | `graph_from_edgelist`, `is_dag`                      |
| Loop diagnosis         | `tidyped()` error path        | `which_loop`, `shortest_paths`, `neighbors`, `components` |
| Candidate tracing      | `tidyped()` raw-input path    | `neighborhood`                                       |
| Topological sorting    | `tidyped()` raw-input path    | `topo_sort`                                          |

igraph is not used in the core numerical pedigree analysis routines such as
`inbreed()`, `pedmat()`, `pedecg()`, or `pedrel()`, but it is still part of
the preprocessing and visualization stack.

## 11. Extension rules

When extending the class, follow these rules.

### 11.1 Do not add new pedigree-level attributes

Prefer adding fields to `ped_meta` instead of scattering new standalone
attributes.

### 11.2 Keep computed state derivable

If a column can be rebuilt from pedigree structure, prefer derivation over
storing opaque cached state.

### 11.3 Preserve `data.table` semantics

Use `:=`, `set()`, and `setattr()` carefully. Avoid patterns that trigger full
copies unless unavoidable.

### 11.4 Respect downgrade semantics

Any future method that subsets rows must preserve the current rule:

- complete subset: the result stays a `tidyped`;
- incomplete subset: the result is downgraded to a plain `data.table` with a
  warning.

### 11.5 Document C++ assumptions

Any feature using `IndNum`, `SireNum`, or `DamNum` should document whether it
requires:

- topologically ordered rows,
- dense consecutive indices,
- `0L` encoding for missing parents.
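
These three assumptions can be stated as executable checks, which is one way to
document them next to a new feature. `check_index_assumptions()` is a
hypothetical helper, not a package function.

```r
check_index_assumptions <- function(ind_num, sire_num, dam_num) {
  n <- length(ind_num)
  idx <- seq_len(n)
  nums <- c(sire_num, dam_num)
  c(
    # dense consecutive indices: 1, 2, ..., n with no gaps
    dense_indices   = identical(ind_num, idx),
    # missing parents are 0L, never NA or out of range
    zero_is_missing = is.integer(nums) && !anyNA(nums) &&
      all(nums >= 0L & nums <= n),
    # topological order: every parent sits in an earlier row
    parents_first   = isTRUE(all(sire_num < idx)) && isTRUE(all(dam_num < idx))
  )
}

sorted   <- check_index_assumptions(1:4, c(0L, 0L, 1L, 3L), c(0L, 0L, 2L, 2L))
unsorted <- check_index_assumptions(1:3, c(2L, 0L, 0L), c(3L, 0L, 0L))
```

In the second call the offspring sits in row 1 and its parents in rows 2 and 3,
so the indices are dense and correctly encoded but not topologically ordered.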

## 12. User-facing inspection helpers

| Function                  | Returns                           |
|---------------------------|-----------------------------------|
| `is_tidyped(x)`           | `TRUE` if class is present        |
| `is_complete_pedigree(x)` | `TRUE` if all non-`NA` Sire/Dam are in Ind |
| `pedmeta(x)`              | The `ped_meta` named list         |
| `has_inbreeding(x)`       | `TRUE` if `f` column exists       |
| `has_candidates(x)`       | `TRUE` if `Cand` column exists    |

Future extensions should prefer helper functions over direct attribute access.

## 13. Maintenance checklist

Before merging a structural change to `tidyped`, check:

1. Does class identity remain `c("tidyped", "data.table", "data.frame")`?
2. Is the head invariant `IndNum == row index` preserved after every code path?
3. Are `ped_meta` fields preserved correctly?
4. Does `[.tidyped` still handle `:=` without copy issues?
5. Do incomplete row subsets still downgrade with warning?
6. Are integer pedigree columns rebuilt whenever a subset remains valid?
7. Does the fast path `tidyped(tp_master, cand = ...)` return the same result
  as the full path run on the raw input with the same `cand`?
8. After `setorder()` or `merge()`, are indices rebuilt before reaching C++?
9. Do package tests and vignettes build cleanly?

## 14. Recommended workflow

For large pedigrees, the intended usage pattern is:

```r
library(visPedigree)

# raw_ped, ids, and pheno are placeholders for your own data

# build one validated master pedigree
tp_master <- tidyped(raw_ped)

# reuse it for repeated local tracing (fast path)
tp_local <- tidyped(tp_master, cand = ids, trace = "up", tracegen = 3)

# modify analysis columns in place; pheno must follow tp_master's row order
tp_master[, Phenotype := pheno]

# split only when disconnected components matter
parts <- splitped(tp_master)
```

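This pattern validates the pedigree once, routes repeated tracing through the
fast path, and adds analysis columns without copying the master table.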