```
library(forcats)
<- factor(c("apple", "bear", "banana", "dear"))
x <- c(fruit = "apple", fruit = "banana")
fruit_levels fct_recode(x, !!!fruit_levels)
```

```
[1] fruit bear fruit dear
Levels: fruit bear dear
```

The closest thing to a mystery-thriller that I’ve experienced in the ggplot2 ecosystem

data-viz

how-to

ggplot2

Author

Cynthia Huang

Published

November 27, 2023

Modified

November 27, 2023

I promise this post is about making ggplots, but first some background. I’ve been working on a new approach to validating data pipelines which involve recoding or redistributing values between categories. The core idea of approach is to validate *Mapping Objects* instead of, or in addition, to the data itself. For more details about the appraoch see this interview or the docs for the {xmap} package (pronounced “crossmap”).

Mapping objects can include named vectors or lists, as well as lookup tables and crosswalks – basically anything that encodes instructions of the form “category A connects to category B”. For instance, when using `forcats::fct_recode()`

, you might store the `new = old`

mappings as a named vector specifying which `old`

level “connects” to which `new`

level as below:

```
library(forcats)
x <- factor(c("apple", "bear", "banana", "dear"))
fruit_levels <- c(fruit = "apple", fruit = "banana")
fct_recode(x, !!!fruit_levels)
```

```
[1] fruit bear fruit dear
Levels: fruit bear dear
```

Now given a particular mapping object, you might want to verify that it has certain properties before using it. For example, when renaming columns with `dplyr::rename(...)`

, where the `…`

takes `new = old`

pairs, you probably only want 1-to-1 relations. No 2 `old`

columns should get the same `new`

name, and a single `old`

column being renamed into 2 `new`

columns is just duplicating data.

Of course this is a somewhat trivial example that you could quickly check by looking at the code, but as mappings get more complex and involve more categories, it becomes less obvious how to ensure you’re actually performing the intended transformations. Add a combination of recoding, aggregating and disaggregating numeric counts (e.g. occupation level statistics, or population by administrative area) to your data wrangling pipeline and you’re only one coding mistake away from accidentally (and often silently) dropping or corrupting some of your data (trust me, I’ve done it before).

Now, where do graphs come in? Well, assertive programming is a good preventive measure against funny business in your data wrangling pipelines. However, it’s not always obvious what assertions you should be checking. In the case of recoding or redistributing data, it turns out that thinking of mapping objects as directed bipartite weighted graphs is quite informative for designing assertions.

As a quick reminder, bipartite graphs are graphs where the nodes or vertices can be split into two disjoint sets, and edges or links are only allowed between the two sets, not within. The weighted part refers to the addition of a numeric attribute to each link. When the graph represents recoding or redistributing (i.e. collapsing or splitting) values, weights will be between 0 and 1. I call this graph-based representation a *Crossmap*.

With this representation we can identify assertions like if we only want 1-to-1 relations, we should check that the `old`

and `new`

sets have the same number of unique elements:

Notice that this condition doesn’t hold in the example above, since that is a many-to-one relation.

`[1] "apple" "banana"`

`[1] "fruit"`

You can take these assertions and check them using existing assertive programming tools like `{assertr}`

and `{validate}`

or even just using `{testthat}`

. Alternatively, the `{xmap}`

(crossmap) package wraps these conditions into assertive functions that you can call before using a mapping object:

```
library(xmap)
fruit_color <- c(green = "apple", yellow = "banana") |>
xmap::verify_named_all_1to1()
fct_recode(x, !!!fruit_color)
```

```
[1] green bear yellow dear
Levels: green yellow bear dear
```

Ok, so what else can we do with these graph representations? Well, wouldn’t it be nice if could easily summarise and visualise mapping objects? especially more complex ones?… and thus began my journey down the rabbit hole of graph data structures and ggplot2 extensions.

I’ll spare you the lengthier rabbit hole detours for now (skip ahead for some failed experiments), but I’ve landed on using a combination of `{ggraph}`

, `{tidygraph}`

and `{igraph}`

to power the `autoplot()`

method I want to add to the `{xmap}`

package. I’m still wrapping my head around `ggproto`

and how to implement new Geoms and Stats, but here’s what I’ve learnt so far.

Before we begin, a little context:

`{igraph}`

provides R bindings to the core igraph network analysis library written in C. It has its own class for graphs (`igraph`

) and offers a lot of graph analysis and layout algorithms that are meant much more complex graphs than a humble bipartite representation of data recoding objects.`{tidygraph}`

provides an tidy API for graph/network manipulation, including the`tbl_graph`

class which is a thin wrapper around an`igraph`

object.`{ggraph}`

is a`ggplot2`

extension which lets you turn graphs (`tbl_graph`

) into ggplots using layouts, nodes and edges.

And a note on the types of mappings I’m trying to plot. I think the crossmap format is particularly useful when you are working with combinations of one-to-one, one-to-many, many-to-one and many-to-many relations, rather than just one type of relation. For example, if you’re just doing one-to-one recodings, a two-column look up table is a much more space and time efficient summary method.

With that in mind, let’s see how far I got plotting the following simple crossmap that connects some nodes with `lower`

case names with ones with `upper`

case names with :

To start with we need to convert our table of edges into a graph object. Luckily `{tidygraph}`

handles this for us with ease:

```
# A tbl_graph: 9 nodes and 6 edges
#
# A rooted forest with 3 trees
#
# Node Data: 9 × 1 (active)
name
<chr>
1 a
2 b
3 c
4 d
5 AA
6 BB
# … with 3 more rows
#
# Edge Data: 6 × 3
from to share
<int> <int> <dbl>
1 1 5 1
2 2 6 1
3 3 6 1
# … with 3 more rows
```

From here we just need to generate a layout and add some `geom_edge_*`

s and `geom_node_*`

s.

What’s a layout you ask? I had the same question. According to `vignette("Layouts", package = "ggraph")`

:

In very short terms, a layout is the vertical and horizontal placement of nodes when plotting a particular graph structure. Conversely, a layout algorithm is an algorithm that takes in a graph structure (and potentially some additional parameters) and return the vertical and horizontal position of the nodes.

Ok, so we just need to pick a sensible layout algorithm, and lucky for me `ggraph`

offers the option of using the `igraph::layout_as_bipartite()`

algorithm. But wait a minute, apparently I haven’t supplied a bipartite graph? And what’s this `types`

argument?

`type`

of graph, maybe too special.It turns out that `igraph`

only recognises graphs as bipartite if you add a logical `type`

attribute to each of the vertices. It’s not clear to me if there’s an easy way to add this attribute once you’ve jumbled all your `to`

and `from`

nodes together into the `Node Data`

component of a tidygraph `tbl_graph`

.

A somewhat cumbersome workaround is to coerce the edge list into a matrix first and then use the `as_tbl_graph.matrix()`

method which handles the creation of the `type`

attribute. Again lucky for me, I’ve been experimenting with this in the `xmap`

package (because matrix representations also reveal useful assertions but I digress):

```
library(xmap)
(mtx_abc <- edges_abc |>
as_xmap_df(from = lower, to = upper, weights = share) |>
xmap_to_matrix())
```

```
4 x 5 sparse Matrix of class "dgCMatrix"
upper
lower AA BB CC DD EE
a 1 . . . .
b . 1 . . .
c . 1 . . .
d . . 0.3 0.6 0.1
```

The function currently returns a sparse matrix by default, but we can easily switch that into a base `matrix`

that `as_tbl_graph()`

can handle:

```
# A tbl_graph: 9 nodes and 6 edges
#
# A rooted forest with 3 trees
#
# Node Data: 9 × 2 (active)
type name
<lgl> <chr>
1 FALSE a
2 FALSE b
3 FALSE c
4 FALSE d
5 TRUE AA
6 TRUE BB
# … with 3 more rows
#
# Edge Data: 6 × 3
from to weight
<int> <int> <dbl>
1 1 5 1
2 2 6 1
3 3 6 1
# … with 3 more rows
```

Notice that we now have a logical `type`

attribute indicating the two disjoint `lower`

and `upper`

node sets. Let’s try again with the bipartite layout algorithm:

```
tgm_abc |>
ggraph(layout = "igraph", algorithm = "bipartite") +
geom_node_point() +
geom_edge_link()
```

Success! We have something that loosely resembles my earlier hand-drawn sketch. So now we can move on to customisation using the familiar layered ggplot grammar and the extra features offered by `ggraph`

’s `geom_edge_*`

and `geom_node_*`

families.

But before that, maybe we don’t need to mess around with converting to matrices after all.

As a general rule, I like things to be efficient and elegant, and the edgelist to matrix detour is neither. So, I started to dig around in the `igraph`

and `tidygraph`

documentation and code base to see if I could find a less clunky way of generating a bigraph layout.

Now, you have to remember that I’m not a graph theorist, or network researcher so a lot of the documentation was basically gibberish to me. Then there’s the added layer of traversing between `ggraph`

and `igraph`

layout functions (via `tidygraph`

?), and untangling the relationship between the `tbl_graph`

and `igraph`

classes.

Nevertheless, I eventually stumbled across this little breadcrumb in the `igraph::layout.bipartite()`

manual entry:

The layout is created by first placing the vertices in two rows, according to their types. Then the positions within the rows are optimized to minimize edge crossings, using the Sugiyama algorithm (see

`layout_with_sugiyama()`

).

At first glance, the examples in `igraph::layout_with_sugiyama()`

don’t bear much resemblance to the bipartite graph I sketched earlier. Take this example:

```
library(igraph, quietly = TRUE)
## Another example, from the following paper:
## Markus Eiglsperger, Martin Siebenhaller, Michael Kaufmann:
## An Efficient Implementation of Sugiyama's Algorithm for
## Layered Graph Drawing, Journal of Graph Algorithms and
## Applications 9, 305--325 (2005).
ex <- graph_from_literal(
0 -+ 29:6:5:20:4,
1 -+ 12,
2 -+ 23:8,
3 -+ 4,
4,
5 -+ 2:10:14:26:4:3,
6 -+ 9:29:25:21:13,
7,
8 -+ 20:16,
9 -+ 28:4,
10 -+ 27,
11 -+ 9:16,
12 -+ 9:19,
13 -+ 20,
14 -+ 10,
15 -+ 16:27,
16 -+ 27,
17 -+ 3,
18 -+ 13,
19 -+ 9,
20 -+ 4,
21 -+ 22,
22 -+ 8:9,
23 -+ 9:24,
24 -+ 12:15:28,
25 -+ 11,
26 -+ 18,
27 -+ 13:19,
28 -+ 7,
29 -+ 25
)
layers <- list(
0, c(5, 17), c(2, 14, 26, 3), c(23, 10, 18), c(1, 24),
12, 6, c(29, 21), c(25, 22), c(11, 8, 15), 16, 27, c(13, 19),
c(9, 20), c(4, 28), 7
)
layex <- layout_with_sugiyama(ex, layers = apply(
sapply(
layers,
function(x) V(ex)$name %in% as.character(x)
),
1, which
))
origvert <- c(rep(TRUE, vcount(ex)), rep(FALSE, nrow(layex$layout.dummy)))
realedge <- as_edgelist(layex$extd_graph)[, 2] <= vcount(ex)
plot(layex$extd_graph,
vertex.label.cex = 0.5,
edge.arrow.size = .5,
vertex.size = ifelse(origvert, 5, 0),
vertex.shape = ifelse(origvert, "square", "none"),
vertex.label = ifelse(origvert, V(ex)$name, ""),
edge.arrow.mode = ifelse(realedge, 2, 0)
)
```

But wait, let’s take a closer look at the documentation (emphasis mine):

This layout algorithm is designed for

directed acyclic graphswhere each vertex is assigned to a layer. Layers are indexed from zero, and vertices of the same layer will be placed on the same horizontal line. The X coordinates of vertices within each layer are decided by the heuristic proposed by Sugiyama et al. to minimize edge crossings.

the usage:

```
layout_with_sugiyama(
graph,
layers = NULL,
hgap = 1,
vgap = 1,
maxiter = 100,
weights = NULL,
attributes = c("default", "all", "none")
)
```

and the `layers`

argument:

- layers
A numeric vector or NULL. If not NULL, then it should specify the layer index of the vertices. Layers are numbered from one. If NULL, then igraph calculates the layers automatically.

Are you following? Not quite? Well, what if I told you that *directed bipartite graphs* are just a special kind of directed acyclic graph?

Then, we’ve found a way to bypass the `type`

requirement for using `layout_as_bipartite()`

, which as far as I can tell probably just converts the logical `type`

attribute into a numeric vector (i.e. 0 or 1) assigning a “layer index” to the two types of nodes, and then calls `layout_with_sugiyama()`

.

Translating this back to `ggraph::create_layout()`

/`ggraph()`

syntax we can get around our earlier error without fiddling with matrices by letting `layout_with_sugiyama()`

work out the disjoint `lower`

and `upper`

sets:

Another benefit of using the Sugiyama algorithm directly is that it lends itself nicely to plotting multi-step transformations as directed multipartite graphs. We would just need to calculate a “layer index” for each step and let the `ggraph`

/`tidygraph`

handle the rest. Instead of just two disjoint sets, we could have three or more classification schemes that we transform our data between. For example, we could extend my earlier hand draw example by adding an aggregation step:

Now, for the grand reveal:

```
tg_abc |>
## calculating edge properties
activate(edges) |>
mutate(frac_weight = ifelse(share < 1, TRUE, FALSE)) |>
mutate(edge_linetype = ifelse(frac_weight, "dashed", "solid")) |>
mutate(edge_label_pos = ifelse(frac_weight, 0.8, 0.2)) |>
## calculating node properties
activate(nodes) |>
mutate(n_from = centrality_degree(mode = "in"),
in_from = n_from == 0,
collapse = n_from > 1) |>
## now we plot...
ggraph::ggraph(layout = "sugiyama") +
## unit weight links,
ggraph::geom_edge_diagonal(
aes(edge_linetype = I(edge_linetype),
edge_alpha = share,
filter = !frac_weight),
end_cap = circle(6, 'mm'),
show.legend = FALSE
) +
## frac weight links,
ggraph::geom_edge_diagonal(
aes(edge_linetype = I(edge_linetype),
edge_alpha = share,
filter = frac_weight,
label = share,
label_pos = edge_label_pos,
),
end_cap = circle(6, 'mm'),
show.legend = FALSE,
angle_calc = "along",
label_dodge = grid::unit(2, "mm")
) +
## from nodes,
ggraph::geom_node_label(aes(label = name,
filter=in_from),
) +
## to nodes,
ggraph::geom_node_label(aes(label = name,
fill = collapse,
filter=!in_from
),
show.legend = FALSE,
) +
scale_fill_brewer() +
## and finally modify coordinates, scale and theme
coord_flip() +
scale_y_reverse() +
theme_minimal() +
ggraph::th_no_axes()
```

The first modification I made was to swap the basic `geom_node_point()`

and `geom_edge_link()`

for their fancier cousins `geom_node_label()`

and `geom_edge_diagonal()`

:

```
tg_abc |>
ggraph(layout = "sugiyama") +
geom_node_label(aes(label = name)) + # was: geom_node_point()
geom_edge_diagonal() # was: geom_edge_link()
```

Then I shuffled the layer order and tweaked the coordinates, scales and theme:

```
tg_abc |>
ggraph(layout = "sugiyama") +
geom_edge_diagonal() +
geom_node_label(aes(label = name)) +
coord_flip() +
scale_y_reverse() +
theme_minimal() +
ggraph::th_no_axes()
```

The final set of modifications required a bit of set up and rely heavily on the very useful but lightly documented `filter`

aesthetic offered in all `ggraph`

geoms. If you unfold the code above in Finally, a ggplot, you’ll see that I calculated a bunch of variables to partition the edges (by having or not having `frac_weight`

) and nodes (by being `in_from`

or in the target set). I then mapped these variables to various aesthetics in standard ggplot style.

That’s it for today. If you want more `ggplot2`

stories keep reading, otherwise stay tuned for updates from the next phase of my journey to add `autoplot()`

to the `xmap`

package: *Geoms, Stats and autoplot, Oh My!*

Obviously I’m not the first person to try and plot bipartite graphs, but navigating the wonderfully rich R and ggplot2 ecosystems to find what I needed proved much more difficult than I anticipated. Here are selections of my experiments using `geoms`

from various ggplot2 extension packages:

Attempts at plotting bipartite graphs using `{ggalluvial}`

, `{ggsankey}`

and `{ggbump}`

.

As you can see, my most successful Frankenstein was the `{ggbump}`

graphic. Young, naive, and too intimidated to even think about reading the “Extending ggplot2” vignette, I shoved a creative combination of `geom_text()`

, `geom_label()`

, `ggbump::geom_sigmoid()`

and far too many ad-hoc computations in a wrapper function and hoped for the best. Unfortunately, I soon ran into problems because as Thomas Lin Pedersen notes in the `{ggraph}`

package:

I made some truly unsightly ggplots with my Frankenstein wrapper function including this monstrosity:

After which, I did not stop and rethink my approach, but came up with this workaround:

On the surface, this workaround looks pretty good, but don’t even think about resizing it, and against all principles of reproducibility, the image above is actually a screenshot of a ggplot with two tables underneath it. I couldn’t work out how to get `patchwork`

and `GridExtra`

to hold onto my two tables with top alignment, and so I just placed a ggplot chunk on top of a `knitr::kable()`

chunk and took a screenshot…

It turns out if you squint hard enough, almost anything can be a graph. This can be both useful and frustrating, but definitely lead me to a few realisations:

- a whole lot of ggplot2 extensions and network analysis packages offer slight variations on what I’m trying to do:
`{ggsankey}`

,`{ggalluvial}`

and`{ggparallel}`

all seem to expect a meaningful aesthetic mapping for the edge width (e.g. flow amount)`{ggbump}`

produces bump charts that look a lot like directed graphs, but there’s some mental gymnastics required to work out what the nodes, edges and attributes are.`ggbump::geom_sigmoid()`

can probably be thought of as a more rigid variant of`ggraph::geom_edge_diagonal()`

`{ggdag}`

,`{ggbipart}`

,`{ggnet2}`

also offer plots that look a bit like what I’m after, but I couldn’t quite connect the dots between their network / DAG syntax and my use case.`{multigraph}`

offers functions for plotting two-mode data, but doesn’t seem to use`ggplot2`

making it difficult to customise the appearance of the plots.

- multipartite graphs seem to get the worst of both worlds:
- they are too complicated to be neatly wrapped up into a simple ggplot2 pipeline and retain a sensible amount of flexibility (because they are graphs, and graphs are tricky).
- but they also need much more rigid and specific layout features and aesthetic mappings compared to network, hierarchical and matrix layouts.

- BUT many things are possible once we separate the creation and analysis of graphs (i.e.
`{tidygraph}`

) from the visualisation (i.e`{ggraph}`

). Thanks Thomas Lin Pedersen!

And on that note, thanks for reading! Please feel free to @me on Twitter, Mastodon, or GitHub with suggestions, corrections and/or ideas.

BibTeX citation:

```
@online{huang2023,
author = {Huang, Cynthia},
title = {Visualising Data Recoding as {Bipartite} {Graphs} Using
Igraph, Tidygraph, Ggraph and Ggplot2},
date = {2023-05-04},
url = {https://www.cynthiahqy.com/posts/xmap-ggraph-bipartite},
langid = {en}
}
```

For attribution, please cite this work as:

Huang, Cynthia. 2023. “Visualising Data Recoding as Bipartite
Graphs Using Igraph, Tidygraph, Ggraph and Ggplot2.” May 4, 2023.
https://www.cynthiahqy.com/posts/xmap-ggraph-bipartite.