64 Interactive Visualization Tools

Static charts answer questions you already know to ask. A bar chart of model accuracy by class tells you which classes are hard, but it cannot tell you why a particular image was misclassified, how confidence shifts as you slide a decision threshold, or whether a cluster of errors hides inside a feature subspace you never plotted. Interactivity closes that gap. It turns a single rendered answer into a surface you can probe, where the next view is one hover, brush, or filter away. For machine learning work, where the interesting structure lives in high dimensional spaces and long tails, that capability is not a luxury. This chapter surveys the modern Python ecosystem for interactive visualization, explains when interactivity earns its cost, and shows how to build both exploratory and explanatory interactive views for ML.

64.0.1 What this chapter covers

We proceed from principle to practice. Section 1 frames interactivity through a small amount of theory: Shneiderman’s information-seeking mantra, a set-theoretic account of brushing and linking, and a latency budget grounded in human perception. Section 2 compares the three dominant Python libraries by their underlying philosophy. Sections 3 and 4 treat the two distinct jobs of interactive views, exploration and explanation, with a worked threshold-and-confusion-matrix example. Section 5 covers dashboards, and Section 6 distills practical guidance on performance, reproducibility, and embedding.

64.1 1. When Interactivity Helps

64.1.1 1.1 The two jobs of a visualization

Visualizations do one of two jobs, and confusing them is the most common reason interactive tools get misused. An exploratory visualization helps you, the analyst, discover something you did not already know. An explanatory visualization communicates something you already understand to an audience that does not. The audience, the level of polish, and the acceptable complexity differ sharply between the two.

Exploration rewards breadth and speed. You want to slice, zoom, recolor, and recompute quickly, accepting rough edges because you are the only consumer. Explanation rewards focus and restraint. You have found the insight, and now every interactive control you add is a question you are asking your reader to answer for themselves. Sometimes that is exactly right, because the reader genuinely has different questions. Often it is a sign you have not finished thinking.

64.1.2 1.2 What interactivity actually buys you

Interactivity pays off when the data has more structure than a single static frame can hold, and when the viewer benefits from steering. Three patterns recur in ML work.

The first is high cardinality. A scatter plot of ten thousand embeddings is an ink blob until you can zoom into a region and hover to read the underlying record. The second is conditional structure, where the relationship you care about only appears after you filter or facet, such as a calibration curve that looks fine overall but falls apart for one customer segment. The third is parameter sensitivity, where the question is how an output moves as an input changes, such as precision and recall as a function of a threshold, or a partial dependence curve as you swap the feature being varied.

64.1.3 1.3 The information-seeking mantra

A useful organizing principle for interactive design is Shneiderman’s visual information-seeking mantra: overview first, zoom and filter, then details on demand [11]. Each clause names a task that interactivity supports and that a static frame cannot.

Overview gives the gestalt of the whole dataset, the shape, the gross outliers, the major clusters.
Zoom narrows the spatial extent to a region of interest while preserving context.
Filter removes records that fail a predicate, so the remaining structure is no longer occluded.
Details on demand retrieves the full record for a single mark, typically via hover or click, without permanently cluttering the view.

A common extension adds relate (show relationships among items), history (keep an undoable trail of actions), and extract (save a sub-collection for further work). These seven tasks form a practical checklist. When you cannot articulate which of them an interactive control serves, the control is probably decoration. Conversely, the patterns of Section 1.2 map cleanly onto the mantra: high cardinality demands zoom and details, conditional structure demands filter and relate, and parameter sensitivity demands a continuous control bound to a recomputed view.

64.1.4 1.4 Brushing and linking, formally

The single most powerful exploratory interaction, linked selection, has a clean set-theoretic description that clarifies what the tool is actually doing. The technique, often called brushing, dates to early interactive statistical graphics [14]. Let the dataset be a set of records $D = \{r_1, \dots, r_n\}$. A view is a function that maps each record to a visual mark in some coordinate space, and a brush is a region $B$ drawn in one view’s coordinate space. The brush induces a selection

\[ S \;=\; \{\, r_i \in D \;:\; \pi(r_i) \in B \,\}, \]

where $\pi$ is the projection that places record $r_i$ in that view’s space (for a scatter plot, $\pi(r_i) = (x_i, y_i)$). Linking means that every other view $V_k$ renders the same partition of $D$ into the selected set $S$ and its complement $D \setminus S$, typically by emphasizing $S$ and de-emphasizing $D \setminus S$ rather than discarding the latter, so context is retained.

The reason this is powerful is that the brush in one feature space answers questions in every other feature space at once. Brush the high-residual band of a residuals-versus-fitted plot, defining $S$ by a predicate on residual magnitude, and the linked histograms reveal the distributions of class, feature value, and timestamp conditional on membership in $S$. Each linked view is, in effect, displaying the conditional distribution $P(\text{attribute} \mid r \in S)$. Static small multiples can show marginals, or conditionals fixed in advance; only linking shows conditionals chosen interactively at query time.
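The set-theoretic description translates almost line for line into code. The following is a minimal sketch in plain Python, not any library's implementation: the `Record` fields, the rectangular brush, and the sample values are all assumptions made for illustration. It computes $S$ from a box $B$ and then the empirical $P(\text{label} \mid r \in S)$ that a linked bar chart would display.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    fitted: float
    residual: float
    label: str


def brush_select(records, x_range, y_range):
    """Return S: the records whose projection (fitted, residual) lies in the box B."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return [
        r for r in records
        if x_lo <= r.fitted <= x_hi and y_lo <= r.residual <= y_hi
    ]


def conditional_distribution(selected, attribute):
    """Empirical P(attribute | r in S) as a dict of value -> proportion."""
    counts = Counter(getattr(r, attribute) for r in selected)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}


# Hypothetical records, invented for this sketch.
records = [
    Record(0.2, 0.1, "cat"),
    Record(0.4, 1.9, "dog"),
    Record(0.5, 2.3, "dog"),
    Record(0.7, 2.1, "cat"),
    Record(0.9, -0.2, "cat"),
]

# Brush the high-residual band: any fitted value, residual between 1.5 and 3.0.
S = brush_select(records, x_range=(0.0, 1.0), y_range=(1.5, 3.0))
linked_view = conditional_distribution(S, "label")
print(linked_view)
```

Here the brush captures three of the five records, and the linked view reports that two thirds of them are labeled "dog", a share that the unconditioned data (two of five) would not suggest.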

64.1.5 1.5 When to stay static

Interactivity has real costs. It adds JavaScript payload, slows page loads, complicates reproducibility, and can fail silently when a dependency drifts. If the message is a single comparison, a static figure is faster to make, faster to read, and trivially embeddable in a paper or slide. A good rule is that interactivity should remove ambiguity, not add decoration. If a reader can extract the full message without touching a control, the controls are noise. Reserve interaction for the moments where the reader genuinely needs to ask a question you cannot answer for them in advance.

64.1.6 1.6 The latency budget

Interaction feels like thinking only when the response is fast enough that the loop between question and answer stays unbroken. Three thresholds from the human-computer interaction literature set the budget [12, 13].

Response time	Subjective effect	Design implication
Under ~0.1 s	Feels instantaneous; perceived as direct manipulation	Target for hover, pan, brush, and zoom
Up to ~1 s	Noticeable, but thought flow is uninterrupted	Acceptable for filter-driven recomputation
Beyond ~10 s	Attention wanders; the task is abandoned mid-thought	Requires a progress indicator and usually a redesign

These numbers convert directly into engineering constraints. A brush callback that must stay under 0.1 s cannot ship a million rows to the browser and re-layout them on every mouse move, which is why aggregation, sampling, and precomputed indices (Section 6.2) are not optional polish but the difference between a tool that supports thought and one that interrupts it. The budget also explains the architectural split between client-side and server-side interaction: a control wired to a JavaScript callback updating an in-browser data structure can hit the 0.1 s target, whereas a round trip to a Python server for recomputation lives in the 1 s tier and should be reserved for interactions that genuinely require it.
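One way to keep the budget honest is to time a callback and name the tier it lands in. The sketch below uses only the standard library. The tier labels, the `hypothetical_brush_handler` function, and the choice to report the slowest of several runs are assumptions for illustration, and the label for the range between 1 s and 10 s fills a gap that the table leaves implicit.

```python
import time


def latency_tier(seconds):
    """Map a measured response time onto the tiers in the table above."""
    if seconds < 0.1:
        return "instantaneous"
    if seconds <= 1.0:
        return "flow preserved"
    if seconds <= 10.0:
        return "flow interrupted"  # assumption: label for the gap between table rows
    return "attention lost"


def time_callback(callback, *args, repeats=5):
    """Return the slowest of several timed runs, a cautious per-interaction estimate."""
    worst = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        callback(*args)
        worst = max(worst, time.perf_counter() - start)
    return worst


def hypothetical_brush_handler(points, lo, hi):
    """Stand-in for a brush callback: count points inside a 1-D interval."""
    return sum(1 for p in points if lo <= p <= hi)


points = [i / 1000 for i in range(1000)]
elapsed = time_callback(hypothetical_brush_handler, points, 0.25, 0.75)
print(f"{elapsed * 1000:.2f} ms -> {latency_tier(elapsed)}")
```

Running a check like this against realistic data sizes, rather than a toy sample, is what reveals whether a callback belongs in the browser or needs precomputation.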

flowchart LR
    A["Overview of full dataset"] --> B["Zoom and filter to a region"]
    B --> C["Details on demand for one mark"]
    C --> D["Form a hypothesis"]
    D --> A
    B -. "linked selection" .-> E["Other views update conditionally"]
    E --> D

Figure 64.1: The information-seeking loop. Each step maps to an interaction the reader can drive, and each must respect the latency budget.

64.2 2. The Library Landscape

Three plotting libraries dominate interactive work in Python, and they embody three different philosophies. Understanding the philosophy matters more than memorizing the API, because it predicts where each tool will feel natural and where it will fight you.

64.2.1 2.1 Plotly: imperative and batteries included

Plotly builds figures imperatively. You construct traces, attach them to a figure, and tune a layout dictionary. Its plotly.express module offers a high level interface that produces a richly interactive chart from a single call, complete with hover tooltips, zoom, pan, and a legend that toggles series on click.

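import plotly.express as px

# df: a pandas DataFrame with columns pc1, pc2, label, sample_id, and confidence
fig = px.scatter(
    df, x="pc1", y="pc2", color="label",
    hover_data=["sample_id", "confidence"],  # extra fields shown in the tooltip
    title="Embedding projection by predicted label",
)
fig.update_layout(legend_title_text="Class")
fig.show()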

Plotly’s strengths are coverage and polish. It handles 3D surfaces, geographic maps, and animation, and its output works in notebooks, exported HTML, and dashboards without changes. Its cost is that complex customization means navigating a large and sometimes inconsistent configuration surface, and large datasets can produce heavy HTML unless you downsample or switch to its WebGL backends.

64.2.2 2.2 Bokeh: a model for building interactive applications

Bokeh is less a chart library than a framework for browser based interactive graphics. Its core abstraction is the ColumnDataSource, a shared data model that glyphs render from and that widgets and callbacks mutate. Because the data source is explicit and shared, Bokeh excels when multiple views must stay linked, such as a brush on one plot that highlights the same points on three others.

from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource

# df: a pandas DataFrame with columns fitted and residual
source = ColumnDataSource(df)  # shared data model; other plots can reuse it
p = figure(title="Residuals vs fitted", tools="box_select,lasso_select,reset")
p.scatter("fitted", "residual", source=source, size=6, alpha=0.5)
show(p)

Bokeh can run with a live Python server, which means callbacks execute real Python rather than precompiled JavaScript. That unlocks interactions backed by arbitrary computation, including rerunning a model on the selected subset. The tradeoff is that a server adds deployment weight, and standalone HTML output limits you to callbacks that can be expressed in Bokeh’s JavaScript layer.

64.2.3 2.3 Altair: declarative grammar of graphics

Altair takes the opposite stance from Plotly. You do not describe how to draw; you declare what the data means by mapping columns to visual channels such as x, y, color, and size. It compiles to Vega-Lite, a JSON specification that a JavaScript runtime renders. The declarative style makes Altair concise and composable, and its selection and binding primitives express linked filtering and cross highlighting with remarkable economy.

import altair as alt

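# df: a pandas DataFrame with columns feature_x, feature_y, and label
brush = alt.selection_interval()
base = alt.Chart(df)
# Attach the brush to the scatter plot only; the bar chart just reads from it.
points = base.mark_circle().encode(
    x="feature_x", y="feature_y",
    color=alt.condition(brush, "label:N", alt.value("lightgray")),
).add_params(brush)
bars = base.mark_bar().encode(x="count()", y="label:N").transform_filter(brush)
points | bars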

Altair’s discipline is its gift and its limit. Composed views and linked selections that would take pages of callback code elsewhere fall out of a few operators. But Vega-Lite historically materializes data into the spec, so very large datasets need aggregation, sampling, or an external data URL before they render comfortably.

64.2.4 2.4 Choosing among them

For fast exploratory plotting and presentation ready figures with minimal fuss, reach for Plotly. For applications where linked views, custom widgets, and server side computation are central, reach for Bokeh. For analysis where the visualization is a precise statement about data relationships and you value reproducible, composable specifications, reach for Altair. None is wrong; they optimize for different parts of the workflow, and mature teams often use more than one.

64.3 3. Building Exploratory Interactive Views

64.3.1 3.1 The exploratory mindset

Exploratory interaction is a conversation with your data, and the goal is to lower the cost of each question to near zero. You are looking for surprises: a cluster that should not exist, an outlier that breaks a trend, a subgroup where the model behaves differently. The right tool here is whatever lets you go from hypothesis to picture fastest, which usually means a notebook, a dataframe, and one of the express style APIs.

64.3.2 3.2 Exploring embeddings and high dimensional structure

ML systems generate embeddings everywhere, from word vectors to image features to user representations. A two dimensional projection from UMAP or t-SNE turns those vectors into something you can look at, and interactivity turns looking into investigating. Hovering reveals the source record, color encodes a label or cluster id, and zoom lets you separate a dense region into its constituents.

import plotly.express as px

fig = px.scatter(
    proj_df, x="x", y="y", color="cluster",
    hover_data={"text": True, "x": False, "y": False},
    opacity=0.6,
)
fig.update_traces(marker=dict(size=4))

The decisive feature is the tooltip. Seeing that a tight cluster of “neutral” sentiment points all contain sarcasm, or that an embedding outlier is a corrupted record, is the kind of discovery that static plots cannot surface because the identity of each point is invisible until you ask.

64.3.3 3.3 Linked views and brushing

In practice, the linked selection formalized in Section 1.4 means that selecting points in one view filters or highlights them in every other view. This lets you pose conditional questions directly. Brush the high error region of a residual plot, and watch which feature values, which classes, and which time periods light up across the other panels. Altair expresses this with a shared selection parameter, and Bokeh with a shared ColumnDataSource. Either way, the analyst is no longer reading separate charts but interrogating one dataset from several angles at once.

64.3.4 3.4 Interactive error analysis

Error analysis is where interactivity earns its keep for model builders. Build a confusion matrix where clicking a cell lists the misclassified examples in that cell, then displays each example with its true label, predicted label, and probability. Add a threshold slider and watch the matrix recompute live. Add filters for metadata such as image source, text length, or acquisition date, and you can isolate the conditions under which the model fails. This workflow converts an aggregate metric into a stack of concrete, inspectable mistakes, which is what actually drives the next modeling decision.
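The data structure behind a clickable confusion matrix is simply an index from cell to examples. The sketch below is a minimal, library-free illustration; the function name, the example ids, and the scores are hypothetical. It groups examples by (true label, predicted label) so that a click handler only has to look up one key.

```python
from collections import defaultdict


def index_errors_by_cell(example_ids, y_true, scores, threshold=0.5):
    """Group (id, score) pairs by confusion-matrix cell (true label, predicted label)."""
    cells = defaultdict(list)
    for ex_id, truth, score in zip(example_ids, y_true, scores):
        predicted = int(score >= threshold)
        cells[(truth, predicted)].append((ex_id, score))
    return cells


# Hypothetical examples, invented for this sketch.
ids = ["a", "b", "c", "d", "e"]
y_true = [1, 0, 1, 0, 1]
scores = [0.9, 0.7, 0.3, 0.2, 0.45]

cells = index_errors_by_cell(ids, y_true, scores, threshold=0.5)
false_negatives = cells[(1, 0)]  # what a click on the FN cell would list
false_positives = cells[(0, 1)]
print(false_negatives, false_positives)
```

Rebuilding the index when the threshold slider moves is a single pass over the data, and metadata filters can be applied to the per-cell lists before display.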

64.3.5 3.5 Worked example: a live threshold explorer

Consider a binary classifier that outputs a score $s_i \in [0, 1]$ for each example, with true label $y_i \in \{0, 1\}$. A decision threshold $t$ converts scores into predictions $\hat{y}_i(t) = \mathbb{1}[s_i \ge t]$. Sliding $t$ is the canonical example of parameter sensitivity, and it is worth making the dependence explicit because it explains exactly what a threshold widget recomputes.

For a fixed $t$, the four confusion-matrix counts are

\[ \begin{aligned} \mathrm{TP}(t) &= \textstyle\sum_i \mathbb{1}[s_i \ge t]\,\mathbb{1}[y_i = 1], & \mathrm{FP}(t) &= \textstyle\sum_i \mathbb{1}[s_i \ge t]\,\mathbb{1}[y_i = 0], \\ \mathrm{FN}(t) &= \textstyle\sum_i \mathbb{1}[s_i < t]\,\mathbb{1}[y_i = 1], & \mathrm{TN}(t) &= \textstyle\sum_i \mathbb{1}[s_i < t]\,\mathbb{1}[y_i = 0], \end{aligned} \]

from which precision and recall follow:

\[ \mathrm{Prec}(t) = \frac{\mathrm{TP}(t)}{\mathrm{TP}(t) + \mathrm{FP}(t)}, \qquad \mathrm{Rec}(t) = \frac{\mathrm{TP}(t)}{\mathrm{TP}(t) + \mathrm{FN}(t)}. \]

Two structural facts make the slider feel coherent and guide the implementation. First, every count is monotone in $t$: as $t$ increases, $\mathrm{TP}$ and $\mathrm{FP}$ can only decrease while $\mathrm{TN}$ and $\mathrm{FN}$ can only increase, so recall is non-increasing in $t$. Precision is not monotone in general, which is precisely why a reader benefits from steering it rather than reading a single number. (It is also undefined when no example is predicted positive, that is, when $\mathrm{TP}(t) + \mathrm{FP}(t) = 0$.) Second, the counts change only at the observed score values. With $n$ examples there are at most $n + 1$ distinct confusion matrices, one for each gap between consecutive sorted scores plus the two extremes, so the entire family $\{(\mathrm{Prec}(t), \mathrm{Rec}(t))\}$ can be precomputed once by sorting the scores in $O(n \log n)$ time. The slider then indexes into the precomputed table with a binary search in $O(\log n)$ per move, which keeps each update comfortably inside the 0.1 s direct-manipulation budget even for very large $n$. This is the general lesson in miniature: precompute the response surface offline, and let interaction be a cheap lookup.
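The precompute-then-look-up idea can be sketched with the standard library alone. This is one possible implementation under stated assumptions, not a reference one: the function names and sample data are hypothetical, and suffix sums over the sorted scores are one of several equivalent ways to store the table. The counts follow the definitions above, with a prediction of 1 whenever $s_i \ge t$.

```python
from bisect import bisect_left


def build_threshold_table(scores, labels):
    """Sort once, O(n log n), and store suffix counts of positives."""
    pairs = sorted(zip(scores, labels))
    sorted_scores = [s for s, _ in pairs]
    n = len(pairs)
    pos_suffix = [0] * (n + 1)  # pos_suffix[k] = positives among sorted items k..n-1
    for k in range(n - 1, -1, -1):
        pos_suffix[k] = pos_suffix[k + 1] + pairs[k][1]
    return sorted_scores, pos_suffix


def confusion_at(table, t):
    """Look up (TP, FP, FN, TN) for threshold t with one binary search, O(log n)."""
    sorted_scores, pos_suffix = table
    n = len(sorted_scores)
    k = bisect_left(sorted_scores, t)  # sorted items k..n-1 satisfy s_i >= t
    tp = pos_suffix[k]
    fp = (n - k) - tp
    fn = pos_suffix[0] - tp
    tn = k - fn
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Return (precision, recall); None marks a ratio with a zero denominator."""
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


# Hypothetical scores and labels, invented for this sketch.
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]
labels = [0, 0, 1, 1, 0, 1]
table = build_threshold_table(scores, labels)

tp, fp, fn, tn = confusion_at(table, 0.5)
print((tp, fp, fn, tn), precision_recall(tp, fp, fn))
```

A slider callback would call only `confusion_at`, so its cost does not grow with the number of slider moves, and the table can be built once when the data loads.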

A Streamlit sketch of the live view, building on the counts above, appears in Section 5.1.

64.3.6 3.6 Keeping exploration honest

Exploratory freedom invites overfitting your eyes. When you slice a dataset many ways, some apparent pattern will look striking by chance. The risk is quantifiable: if you inspect $m$ independent slices, each null with a per-slice false-positive rate $\alpha$, the probability that at least one looks “significant” by chance is $1 - (1 - \alpha)^m$, which approaches certainty as $m$ grows. An interactive tool is a machine for raising $m$ cheaply, so the false-discovery risk climbs roughly in step with how fun the tool is to use. Treat exploratory findings as hypotheses, not conclusions. Before acting on anything important, confirm it on a held-out split or with a statistical test, ideally one chosen before you looked. Interactivity makes it trivial to manufacture a compelling but spurious story, so the discipline of confirmation matters more, not less.
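The formula is easy to evaluate, and seeing a few values makes the risk concrete. The sketch below assumes independent null slices and a per-slice rate of $\alpha = 0.05$; both are simplifying assumptions, since real slices of one dataset usually overlap.

```python
def chance_of_false_discovery(m, alpha=0.05):
    """P(at least one of m independent null slices looks significant) = 1 - (1 - alpha)^m."""
    return 1.0 - (1.0 - alpha) ** m


for m in (1, 5, 14, 50, 100):
    print(f"m = {m:>3}: {chance_of_false_discovery(m):.3f}")
```

Under these assumptions the chance of at least one spurious "finding" passes one half at fourteen slices, a number an analyst with a responsive filter panel can reach in a few minutes.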

64.4 4. Building Explanatory Interactive Views

64.4.1 4.1 From discovery to communication

Once you know the message, the design problem inverts. You are no longer minimizing the cost of asking questions; you are guiding a reader to a conclusion while letting them verify it. The best explanatory interactives are mostly static. They present a clear default view that carries the message on its own, then offer a small number of controls for the questions you anticipate the reader will have.

64.4.2 4.2 Annotation and guided defaults

A reader landing on your chart has none of your context, so the default state must do the heavy lifting. Title the chart with the takeaway rather than the variables. Annotate the specific points that matter, such as the threshold you chose and why. Set the initial zoom, the default filter, and the highlighted series so that the message is visible before any interaction. Every control you expose should answer a question a thoughtful reader would actually ask, not merely a question the tool makes easy to enable.

64.4.3 4.3 Explaining model behavior interactively

Interactive explanation shines for model behavior that is inherently conditional. A partial dependence explorer that lets a stakeholder pick a feature and see its modeled effect makes a black box legible. A threshold widget that shows precision, recall, and the resulting count of false positives and false negatives lets a product owner feel the tradeoff in business terms rather than reading it off a table. A local explanation view, where clicking a prediction reveals the feature attributions that drove it, turns “the model said no” into “the model said no because these three inputs pushed it over the line.” In each case the interactivity is tightly scoped to the one degree of freedom the audience cares about.
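What a partial dependence explorer recomputes when the stakeholder picks a feature is small enough to sketch. The code below follows the standard definition (overwrite one feature with a grid value in every row, then average the predictions). The toy linear scoring function, the feature names, and the rows are hypothetical stand-ins for a fitted model and real data.

```python
def partial_dependence(predict, rows, feature, grid):
    """For each grid value, overwrite `feature` in every row and average the predictions."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row, **{feature: value})  # copy of the row with one field replaced
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve


def hypothetical_model(row):
    """Toy scoring function standing in for a fitted model (an assumption for illustration)."""
    return 2.0 * row["income"] - 0.5 * row["debt"]


rows = [
    {"income": 1.0, "debt": 2.0},
    {"income": 3.0, "debt": 0.0},
    {"income": 2.0, "debt": 4.0},
]

curve = partial_dependence(hypothetical_model, rows, "income", grid=[0.0, 1.0, 2.0])
print(curve)
```

Because each curve costs one model call per row per grid value, an explanatory view would typically compute the curves for the handful of features on offer ahead of time and let the dropdown select among them.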

64.4.4 4.4 Performance and accessibility

Explanatory views reach people on varied hardware and networks, so weight matters. Aggregate or sample before rendering, prefer WebGL or canvas backends for large point counts, and lazy-load content below the fold. Accessibility matters too. Color must not be the only channel carrying meaning, hover-only information must have a non-hover fallback for keyboard and touch users, and text should remain legible at the sizes your audience will actually use. An interactive chart that excludes part of its audience has failed at the one job, communication, that justified building it.

64.5 5. Dashboards

When several linked views, controls, and live computations belong together as a tool rather than a single figure, you have a dashboard. Two Python frameworks dominate, and they differ in how much control they hand you.

64.5.1 5.1 Streamlit: scripts that become apps

Streamlit turns a plain Python script into a web app by rerunning the whole script top to bottom on every interaction. A widget call returns its current value, and you use that value as an ordinary variable. The mental model is delightfully simple: there are no callbacks, just a script that reads its inputs and draws its outputs.

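import streamlit as st
from sklearn.metrics import precision_score, recall_score

# scores, y: NumPy arrays of model scores and true labels, loaded earlier in the script
threshold = st.slider("Decision threshold", 0.0, 1.0, 0.5)
preds = (scores >= threshold).astype(int)
st.metric("Precision", f"{precision_score(y, preds, zero_division=0):.3f}")
st.metric("Recall", f"{recall_score(y, preds, zero_division=0):.3f}")
# confusion_figure: a user-defined helper that returns a Plotly figure
st.plotly_chart(confusion_figure(y, preds))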

This model makes Streamlit the fastest way to wrap a model or analysis in a shareable interface, which is why it dominates internal ML demos and prototypes. The cost of rerunning everything is managed with caching decorators that memoize expensive steps such as loading data or running inference. Streamlit’s simplicity becomes a limit when you need fine grained layout control or complex stateful interactions that resist the rerun model.
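The interplay between full reruns and caching can be illustrated without Streamlit itself. The sketch below uses `functools.lru_cache` from the standard library as a stand-in for Streamlit's caching decorators (such as `st.cache_data`); the `run_script` function, the file path, and the scores are hypothetical. Each call to `run_script` plays the role of one top-to-bottom rerun.

```python
from functools import lru_cache

load_calls = 0


@lru_cache(maxsize=None)
def load_scores(path):
    """Pretend-expensive load; the counter records how often the body really runs."""
    global load_calls
    load_calls += 1
    return (0.1, 0.4, 0.35, 0.8, 0.65, 0.9)  # hypothetical scores


def run_script(threshold):
    """One top-to-bottom rerun, as would follow each widget interaction."""
    scores = load_scores("scores.csv")  # hypothetical path; never opened here
    return sum(1 for s in scores if s >= threshold)


# Three slider moves trigger three reruns, but the load body executes once.
outputs = [run_script(t) for t in (0.3, 0.5, 0.7)]
print(outputs, load_calls)
```

The cheap, threshold-dependent work runs on every rerun while the expensive, threshold-independent work is memoized, which is the division of labor the rerun model relies on.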

64.5.2 5.2 Dash: declarative apps with explicit callbacks

Dash, built on Plotly and Flask, takes the callback approach. You declare a layout of components, each with an id, then write callback functions that name their inputs and outputs explicitly. Only the affected components recompute when an input changes, which gives precise control over what updates and when.

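from dash import Dash, Input, Output

app = Dash(__name__)
# app.layout must contain components with ids "roc" and "model-dropdown";
# results and roc_figure are placeholders for stored metrics and a figure builder.

@app.callback(
    Output("roc", "figure"),
    Input("model-dropdown", "value"),
)
def update_roc(model_name):
    return roc_figure(results[model_name])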

That explicitness is more verbose than Streamlit but scales better to large, multi page applications with intricate dependencies between controls. Dash is the stronger choice when a dashboard becomes a maintained product with many users rather than a quick internal tool, and when you need the layout and update behavior to be exactly as specified.

64.5.3 5.3 Choosing a framework

Pick Streamlit when speed of construction and a simple mental model matter most, which covers the majority of internal ML tooling and rapid prototypes. Pick Dash when you need production grade structure, granular update control, and multi page complexity. Both can render the same Plotly figures, so the visualization skills transfer; what differs is the application scaffolding around them.

64.5.4 5.4 Dashboards as ML interfaces

For ML teams, dashboards become the connective tissue between models and humans. A monitoring dashboard tracks prediction distributions, input drift, and live performance against a baseline, alerting when the world shifts away from the training data. An evaluation dashboard compares candidate models across slices so a reviewer can see not just which model wins on average but where each one wins and loses. A human in the loop labeling or review interface surfaces low confidence predictions for a person to correct, feeding the corrections back into the training set. In each case the dashboard is not a report but a workplace, and the same principles apply: a strong default view, scoped interactivity, and ruthless attention to load time.

64.6 6. Practical Guidance

64.6.1 6.1 A decision checklist

Before adding interactivity, ask whether a static figure conveys the message. If it does, stop. If the data has high cardinality, conditional structure, or parameter sensitivity that a single frame cannot hold, choose a tool by the dominant need: express plotting for exploration, linked views for relational analysis, and a dashboard framework when controls and computation must live together. Match polish to audience, keeping exploratory views rough and fast and explanatory views focused and annotated.

64.6.2 6.2 Performance habits that scale

Most interactive performance problems come from sending too much data to the browser. Aggregate or sample before plotting, since a reader cannot perceive a million overlapping points anyway. Use WebGL backends, available in Plotly and Bokeh, when you genuinely need tens of thousands of marks. Cache expensive computations so interaction recomputes only what changed. Precompute projections and summaries offline rather than on every page load. These habits keep an interactive view responsive, and responsiveness is what makes interaction feel like thinking rather than waiting.
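Aggregating before plotting can be as simple as counting points per grid cell and sending the counts instead of the points. The following sketch is a plain-Python illustration; the lattice of synthetic points and the bin size are assumptions, and in practice a dataframe or array library would do the same job faster.

```python
from collections import Counter


def bin_points(points, bin_size):
    """Aggregate (x, y) points into square bins; returns {(bin_x, bin_y): count}."""
    counts = Counter()
    for x, y in points:
        counts[(int(x // bin_size), int(y // bin_size))] += 1
    return counts


# Hypothetical data: 100,000 points on a deterministic lattice in the unit square.
points = [((i % 1000) / 1000, (i // 1000) / 100) for i in range(100_000)]

bins = bin_points(points, bin_size=0.1)
marks_to_send = len(bins)  # at most 100 rectangles instead of 100,000 circles
print(marks_to_send, sum(bins.values()))
```

The browser then draws a heatmap whose color encodes the count, and a zoom interaction can request finer bins for the visible region only.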

64.6.3 6.3 Reproducibility and embedding

Interactive artifacts complicate reproducibility because they bundle data, code, and a JavaScript runtime whose versions can drift. Pin library versions, and for figures destined for a paper or a Quarto book, export a self contained HTML file or fall back to a static image so the artifact survives independent of a running server. A figure that renders today but breaks on a dependency bump next quarter has a short and frustrating life, so treat the export format as part of the design, not an afterthought.
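Pinning starts with knowing what is installed. The sketch below records the versions of the visualization stack next to an exported figure, using only the standard library's `importlib.metadata`. The package list and the idea of storing the result as a JSON manifest are assumptions for illustration, and packages that are not installed are recorded as `None` rather than raising.

```python
import json
from importlib.metadata import PackageNotFoundError, version


def snapshot_versions(packages):
    """Return {package: installed version, or None if the package is absent}."""
    snapshot = {}
    for name in packages:
        try:
            snapshot[name] = version(name)
        except PackageNotFoundError:
            snapshot[name] = None
    return snapshot


# The package list is an assumption; substitute whatever the figure actually imports.
versions = snapshot_versions(["plotly", "bokeh", "altair", "streamlit", "dash"])
manifest = json.dumps(versions, indent=2, sort_keys=True)
print(manifest)
```

Saving a manifest like this alongside each exported HTML file makes it possible to rebuild the environment that produced the figure when a later dependency bump breaks it.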

64.6.4 6.4 Common pitfalls

A few failure modes recur often enough to name explicitly.

Interaction as decoration. Controls that no reader needs add load time, fragility, and cognitive cost while removing no ambiguity. If the default view already carries the message, delete the control.
Overplotting masquerading as data. Tens of thousands of overlapping semi-transparent marks read as a uniform smear. Aggregate, bin, or sample so the visible density reflects the real density, and let zoom recover detail.
Color as the sole channel. Roughly one in twelve men has a color-vision deficiency, so any meaning carried only by hue is invisible to part of the audience. Redundantly encode with shape, position, or direct labels.
Hover-only information. Tooltips do not exist for keyboard and touch users. Anything load-bearing must have a non-hover fallback.
Mining the test set with your eyes. Slicing held-out data interactively until a pattern appears is multiple comparisons by another name (Section 3.6). Confirm on data you have not already inspected.
Latency creep. A control wired to a heavy server round trip that drifts past one second breaks the loop between question and answer. Precompute the response surface and keep the per-interaction cost a lookup.

64.6.5 6.5 When to use which view

As a closing heuristic: stay static for a single fixed comparison; reach for an exploratory notebook view (Plotly Express, or Altair and Bokeh for linked selection) when you are the consumer and the goal is discovery; build an explanatory interactive with strong defaults and one or two scoped controls when you have found the message and must communicate it; and stand up a dashboard (Streamlit for speed, Dash for production structure) only when several linked views, controls, and live computation genuinely belong together as a tool rather than a figure.

64.7 References

1. Plotly Python Open Source Graphing Library. https://plotly.com/python/
2. Bokeh Documentation. https://docs.bokeh.org/en/latest/
3. Vega-Altair: Declarative Visualization in Python. https://altair-viz.github.io/
4. Vega-Lite: A Grammar of Interactive Graphics. https://vega.github.io/vega-lite/
5. Streamlit Documentation. https://docs.streamlit.io/
6. Dash Documentation by Plotly. https://dash.plotly.com/
7. Satyanarayan, A., Moritz, D., Wongsuphasawat, K., Heer, J. Vega-Lite: A Grammar of Interactive Graphics. IEEE Transactions on Visualization and Computer Graphics, 2017. https://ieeexplore.ieee.org/document/7539624
8. Wilke, C. O. Fundamentals of Data Visualization. https://clauswilke.com/dataviz/
9. McInnes, L., Healy, J., Melville, J. UMAP: Uniform Manifold Approximation and Projection. https://arxiv.org/abs/1802.03426
10. Wexler, J., et al. The What-If Tool: Interactive Probing of Machine Learning Models. https://arxiv.org/abs/1907.04135
11. Shneiderman, B. The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. Proceedings of the IEEE Symposium on Visual Languages, 1996. https://doi.org/10.1109/VL.1996.545307
12. Card, S. K., Robertson, G. G., Mackinlay, J. D. The Information Visualizer, an Information Workspace. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), 1991. https://doi.org/10.1145/108844.108874
13. Nielsen, J. Usability Engineering. Morgan Kaufmann, 1993. https://doi.org/10.1016/B978-0-08-052029-2.50007-3
14. Becker, R. A., Cleveland, W. S. Brushing Scatterplots. Technometrics, 1987. https://doi.org/10.1080/00401706.1987.10488204

# Interactive Visualization Tools Static charts answer questions you already know to ask. A bar chart of model accuracy by class tells you which classes are hard, but it cannot tell you why a particular image was misclassified, how confidence shifts as you slide a decision threshold, or whether a cluster of errors hides inside a feature subspace you never plotted. Interactivity closes that gap. It turns a single rendered answer into a surface you can probe, where the next view is one hover, brush, or filter away. For machine learning work, where the interesting structure lives in high dimensional spaces and long tails, that capability is not a luxury. This chapter surveys the modern Python ecosystem for interactive visualization, explains when interactivity earns its cost, and shows how to build both exploratory and explanatory interactive views for ML. ### What this chapter covers We proceed from principle to practice. Section 1 frames interactivity through a small amount of theory: Shneiderman's information-seeking mantra, a set-theoretic account of brushing and linking, and a latency budget grounded in human perception. Section 2 compares the three dominant Python libraries by their underlying philosophy. Sections 3 and 4 treat the two distinct jobs of interactive views, exploration and explanation, with a worked threshold-and-confusion-matrix example. Section 5 covers dashboards, and Section 6 distills practical guidance on performance, reproducibility, and embedding. ## 1. When Interactivity Helps ### 1.1 The two jobs of a visualization Visualizations do one of two jobs, and confusing them is the most common reason interactive tools get misused. An exploratory visualization helps you, the analyst, discover something you did not already know. An explanatory visualization communicates something you already understand to an audience that does not. The audience, the level of polish, and the acceptable complexity differ sharply between the two. 
Exploration rewards breadth and speed. You want to slice, zoom, recolor, and recompute quickly, accepting rough edges because you are the only consumer. Explanation rewards focus and restraint. You have found the insight, and now every interactive control you add is a question you are asking your reader to answer for themselves. Sometimes that is exactly right, because the reader genuinely has different questions. Often it is a sign you have not finished thinking. ### 1.2 What interactivity actually buys you Interactivity pays off when the data has more structure than a single static frame can hold, and when the viewer benefits from steering. Three patterns recur in ML work. The first is high cardinality. A scatter plot of ten thousand embeddings is an ink blob until you can zoom into a region and hover to read the underlying record. The second is conditional structure, where the relationship you care about only appears after you filter or facet, such as a calibration curve that looks fine overall but falls apart for one customer segment. The third is parameter sensitivity, where the question is how an output moves as an input changes, such as precision and recall as a function of a threshold, or a partial dependence curve as you swap the feature being held fixed. ### 1.3 The information-seeking mantra A useful organizing principle for interactive design is Shneiderman's *visual information-seeking mantra*: overview first, zoom and filter, then details on demand [11]. Each clause names a task that interactivity supports and that a static frame cannot. - **Overview** gives the gestalt of the whole dataset, the shape, the gross outliers, the major clusters. - **Zoom** narrows the spatial extent to a region of interest while preserving context. - **Filter** removes records that fail a predicate, so the remaining structure is no longer occluded. 
- **Details on demand** retrieves the full record for a single mark, typically via hover or click, without permanently cluttering the view.

The same taxonomy [11] adds three further tasks: *relate* (show relationships among items), *history* (keep an undoable trail of actions), and *extract* (save a sub-collection for further work). Together these seven tasks form a practical checklist. When you cannot articulate which of them an interactive control serves, the control is probably decoration. The patterns of Section 1.2 map cleanly onto the mantra: high cardinality demands zoom and details, conditional structure demands filter and relate, and parameter sensitivity demands a continuous control bound to a recomputed view.

### 1.4 Brushing and linking, formally

The single most powerful exploratory interaction, linked selection, has a clean set-theoretic description that clarifies what the tool is actually doing. The technique, often called brushing, dates to early interactive statistical graphics [14].

Let the dataset be a set of records $D = \{r_1, \dots, r_n\}$. A *view* is a function that maps each record to a visual mark in some coordinate space, and a *brush* is a region $B$ drawn in one view's coordinate space. The brush induces a selection

$$
S \;=\; \{\, r_i \in D \;:\; \pi(r_i) \in B \,\},
$$

where $\pi$ is the projection that places record $r_i$ in that view's space (for a scatter plot, $\pi(r_i) = (x_i, y_i)$). *Linking* means that every other view $V_k$ renders the same partition of $D$ into the selected set $S$ and its complement $D \setminus S$, typically by emphasizing $S$ and de-emphasizing $D \setminus S$ rather than discarding the latter, so context is retained.

The reason this is powerful is that the brush in one feature space answers questions in *every* other feature space at once.
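
In dataframe terms, the selection $S$ is a boolean mask over the rows, and linking means every view reads that same mask. The following is a minimal pandas sketch under stated assumptions: a rectangular brush, a toy table with hypothetical column names (`fitted`, `residual`, `label`), and a helper `brush_selection` that is illustrative rather than part of any plotting library.

```python
import pandas as pd


def brush_selection(df, x, y, x_range, y_range):
    """Boolean mask for S = {r in D : pi(r) in B}, with B an axis-aligned box.

    Hypothetical helper: pi(r) = (r[x], r[y]); both range ends are inclusive.
    """
    (x0, x1), (y0, y1) = x_range, y_range
    return df[x].between(x0, x1) & df[y].between(y0, y1)


# Toy data, invented purely for illustration.
df = pd.DataFrame({
    "fitted":   [0.2, 0.5, 0.9, 1.4, 1.8, 2.1],
    "residual": [0.1, 1.6, -0.2, 1.9, 0.3, 2.2],
    "label":    ["a", "b", "a", "b", "a", "b"],
})

# Brush the high-residual band of the residuals-versus-fitted view.
in_S = brush_selection(df, "fitted", "residual", (0.0, 2.5), (1.5, 3.0))

# Linking: keep every record and carry the partition S vs. D \ S along,
# so other views can emphasize S without discarding the rest.
linked = df.assign(selected=in_S)

# What a linked bar chart of `label` would show for the selected set.
label_counts_in_S = linked.loc[linked["selected"], "label"].value_counts()
```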

Brush the high-residual band of a residuals-versus-fitted plot, defining $S$ by a predicate on residual magnitude, and the linked histograms reveal the marginal distributions of class, feature value, and timestamp *conditional on membership in $S$*. Each linked view is, in effect, displaying the conditional distribution $P(\text{attribute} \mid r \in S)$. Static small multiples can show marginals; only linking shows conditionals chosen interactively at query time.

### 1.5 When to stay static

Interactivity has real costs. It adds JavaScript payload, slows page loads, complicates reproducibility, and can fail silently when a dependency drifts. If the message is a single comparison, a static figure is faster to make, faster to read, and trivially embeddable in a paper or slide.

A good rule is that interactivity should remove ambiguity, not add decoration. If a reader can extract the full message without touching a control, the controls are noise. Reserve interaction for the moments where the reader genuinely needs to ask a question you cannot answer for them in advance.

### 1.6 The latency budget

Interaction feels like thinking only when the response is fast enough that the loop between question and answer stays unbroken. Three thresholds from the human-computer interaction literature set the budget [12, 13].

| Response time | Subjective effect | Design implication |
|---|---|---|
| Under ~0.1 s | Feels instantaneous; perceived as direct manipulation | Target for hover, pan, brush, and zoom |
| Up to ~1 s | Noticeable, but thought flow is uninterrupted | Acceptable for filter-driven recomputation |
| Beyond ~10 s | Attention wanders; the task is abandoned mid-thought | Requires a progress indicator and usually a redesign |

These numbers convert directly into engineering constraints.
A brush callback that must stay under 0.1 s cannot ship a million rows to the browser and re-layout them on every mouse move, which is why aggregation, sampling, and precomputed indices (Section 6.2) are not optional polish but the difference between a tool that supports thought and one that interrupts it. The budget also explains the architectural split between client-side and server-side interaction: a control wired to a JavaScript callback updating an in-browser data structure can hit the 0.1 s target, whereas a round trip to a Python server for recomputation lives in the 1 s tier and should be reserved for interactions that genuinely require it.

```{mermaid}
%%| label: fig-mantra
%%| fig-cap: "The information-seeking loop. Each step maps to an interaction the reader can drive, and each must respect the latency budget."
flowchart LR
    A["Overview of full dataset"] --> B["Zoom and filter to a region"]
    B --> C["Details on demand for one mark"]
    C --> D["Form a hypothesis"]
    D --> A
    B -. "linked selection" .-> E["Other views update conditionally"]
    E --> D
```

## 2. The Library Landscape

Three plotting libraries dominate interactive work in Python, and they embody three different philosophies. Understanding the philosophy matters more than memorizing the API, because it predicts where each tool will feel natural and where it will fight you.

### 2.1 Plotly: imperative and batteries included

Plotly builds figures imperatively. You construct traces, attach them to a figure, and tune a layout dictionary. Its `plotly.express` module offers a high-level interface that produces a richly interactive chart from a single call, complete with hover tooltips, zoom, pan, and a legend that toggles series on click.

```python
import plotly.express as px

# df is assumed to hold one row per sample with the columns named below.
fig = px.scatter(
    df,
    x="pc1",
    y="pc2",
    color="label",
    hover_data=["sample_id", "confidence"],  # extra fields shown in the tooltip
    title="Embedding projection by predicted label",
)
fig.update_layout(legend_title_text="Class")
fig.show()
```

Plotly's strengths are coverage and polish. It handles 3D surfaces, geographic maps, and animation, and its output works in notebooks, exported HTML, and dashboards without changes. Its cost is that complex customization means navigating a large and sometimes inconsistent configuration surface, and large datasets can produce heavy HTML unless you downsample or switch to its WebGL backends.

### 2.2 Bokeh: a model for building interactive applications

Bokeh is less a chart library than a framework for browser-based interactive graphics. Its core abstraction is the `ColumnDataSource`, a shared data model that glyphs render from and that widgets and callbacks mutate. Because the data source is explicit and shared, Bokeh excels when multiple views must stay linked, such as a brush on one plot that highlights the same points on three others.

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show

# df is assumed to have "fitted" and "residual" columns.
source = ColumnDataSource(df)  # shared data model; other plots can reuse it
p = figure(title="Residuals vs fitted", tools="box_select,lasso_select,reset")
p.scatter("fitted", "residual", source=source, size=6, alpha=0.5)
show(p)
```

Bokeh can run with a live Python server, which means callbacks execute real Python rather than JavaScript running in the browser. That unlocks interactions backed by arbitrary computation, including rerunning a model on the selected subset. The tradeoff is that a server adds deployment weight, and standalone HTML output limits you to `CustomJS` callbacks that run in the browser.

### 2.3 Altair: declarative grammar of graphics

Altair takes the opposite stance from Plotly. You do not describe how to draw; you declare what the data means by mapping columns to visual channels such as x, y, color, and size.
It compiles to Vega-Lite, a JSON specification that a JavaScript runtime renders. The declarative style makes Altair concise and composable, and its selection and binding primitives express linked filtering and cross-highlighting with remarkable economy.

```python
import altair as alt

# df is assumed to have "feature_x", "feature_y", and "label" columns.
brush = alt.selection_interval()
base = alt.Chart(df)

# The brush is attached to the scatter plot only; the bar chart reads it as a filter.
points = base.mark_circle().encode(
    x="feature_x",
    y="feature_y",
    color=alt.condition(brush, "label:N", alt.value("lightgray")),
).add_params(brush)
bars = base.mark_bar().encode(x="count()", y="label:N").transform_filter(brush)

points | bars
```

Altair's discipline is its gift and its limit. Composed views and linked selections that would take pages of callback code elsewhere fall out of a few operators. But Altair by default embeds the data directly in the Vega-Lite spec, so very large datasets need aggregation, sampling, or an external data URL before they render comfortably.

### 2.4 Choosing among them

For fast exploratory plotting and presentation-ready figures with minimal fuss, reach for Plotly. For applications where linked views, custom widgets, and server-side computation are central, reach for Bokeh. For analysis where the visualization is a precise statement about data relationships and you value reproducible, composable specifications, reach for Altair. None is wrong; they optimize for different parts of the workflow, and mature teams often use more than one.

## 3. Building Exploratory Interactive Views

### 3.1 The exploratory mindset

Exploratory interaction is a conversation with your data, and the goal is to lower the cost of each question to near zero. You are looking for surprises: a cluster that should not exist, an outlier that breaks a trend, a subgroup where the model behaves differently. The right tool here is whatever lets you go from hypothesis to picture fastest, which usually means a notebook, a dataframe, and one of the express-style APIs.

### 3.2 Exploring embeddings and high-dimensional structure

ML systems generate embeddings everywhere, from word vectors to image features to user representations. A two-dimensional projection from UMAP or t-SNE turns those vectors into something you can look at, and interactivity turns looking into investigating. Hovering reveals the source record, color encodes a label or cluster id, and zoom lets you separate a dense region into its constituents.

```python
import plotly.express as px

# proj_df is assumed to hold the 2D projection ("x", "y"), a "cluster" id,
# and the source "text" of each embedded record.
fig = px.scatter(
    proj_df,
    x="x",
    y="y",
    color="cluster",
    hover_data={"text": True, "x": False, "y": False},  # show text, hide coordinates
    opacity=0.6,
)
fig.update_traces(marker=dict(size=4))
```

The decisive feature is the tooltip. Seeing that a tight cluster of "neutral" sentiment points all contain sarcasm, or that an embedding outlier is a corrupted record, is the kind of discovery that static plots cannot surface because the identity of each point is invisible until you ask.

### 3.3 Linked views and brushing

Linked selection, formalized in Section 1.4, is the workhorse of exploratory analysis: selecting points in one view filters or highlights them in every other view. This lets you pose conditional questions directly. Brush the high-error region of a residual plot, and watch which feature values, which classes, and which time periods light up across the other panels.

Altair expresses this with a shared selection parameter, and Bokeh with a shared `ColumnDataSource`. Either way, the analyst is no longer reading separate charts but interrogating one dataset from several angles at once.

### 3.4 Interactive error analysis

Error analysis is where interactivity earns its keep for model builders. Build a confusion matrix where clicking a cell lists the examples in that cell and displays each one with its true label, predicted label, and probability. Add a threshold slider and watch the matrix recompute live.
Add filters for metadata such as image source, text length, or acquisition date, and you can isolate the conditions under which the model fails. This workflow converts an aggregate metric into a stack of concrete, inspectable mistakes, which is what actually drives the next modeling decision.

### 3.5 Worked example: a live threshold explorer

Consider a binary classifier that outputs a score $s_i \in [0, 1]$ for each example, with true label $y_i \in \{0, 1\}$. A decision threshold $t$ converts scores into predictions $\hat{y}_i(t) = \mathbb{1}[s_i \ge t]$. Sliding $t$ is the canonical example of parameter sensitivity, and it is worth making the dependence explicit because it explains exactly what a threshold widget recomputes.

For a fixed $t$, the four confusion-matrix counts are

$$
\begin{aligned}
\mathrm{TP}(t) &= \textstyle\sum_i \mathbb{1}[s_i \ge t]\,\mathbb{1}[y_i = 1], &
\mathrm{FP}(t) &= \textstyle\sum_i \mathbb{1}[s_i \ge t]\,\mathbb{1}[y_i = 0], \\
\mathrm{FN}(t) &= \textstyle\sum_i \mathbb{1}[s_i < t]\,\mathbb{1}[y_i = 1], &
\mathrm{TN}(t) &= \textstyle\sum_i \mathbb{1}[s_i < t]\,\mathbb{1}[y_i = 0],
\end{aligned}
$$

from which precision and recall follow:

$$
\mathrm{Prec}(t) = \frac{\mathrm{TP}(t)}{\mathrm{TP}(t) + \mathrm{FP}(t)},
\qquad
\mathrm{Rec}(t) = \frac{\mathrm{TP}(t)}{\mathrm{TP}(t) + \mathrm{FN}(t)}.
$$

Two structural facts make the slider feel coherent and guide the implementation. First, every count is *monotone* in $t$: as $t$ increases, $\mathrm{TP}$ and $\mathrm{FP}$ can only decrease while $\mathrm{TN}$ and $\mathrm{FN}$ can only increase, so recall is non-increasing in $t$. Precision is *not* monotone in general, which is precisely why a reader benefits from steering it rather than reading a single number. Second, the counts change only at the observed score values.
With $n$ examples there are at most $n + 1$ distinct confusion matrices (one for each distinct score value, plus the all-negative case when $t$ exceeds every score), so the entire family $\{(\mathrm{Prec}(t), \mathrm{Rec}(t))\}$ can be precomputed once by sorting the scores in $O(n \log n)$ time. The slider then indexes into a precomputed table in $O(\log n)$ per move, which keeps the lookup itself well inside the 0.1 s direct-manipulation budget even for very large $n$. This is the general lesson in miniature: precompute the response surface offline, and let interaction be a cheap lookup.

A Streamlit sketch of the live view, building on the counts above, appears in Section 5.1.

### 3.6 Keeping exploration honest

Exploratory freedom invites overfitting your eyes. When you slice a dataset many ways, some apparent pattern will look striking by chance. The risk is quantifiable: if you inspect $m$ independent slices, each null with a per-slice false-positive rate $\alpha$, the probability that at least one looks "significant" by chance is $1 - (1 - \alpha)^m$, which approaches certainty as $m$ grows. An interactive tool is a machine for raising $m$ cheaply, so the risk of a chance finding grows with how enjoyable the tool is to use.

Treat exploratory findings as hypotheses, not conclusions. Before acting on anything important, confirm it on a held-out split or with a statistical test, ideally one chosen before you looked. Interactivity makes it trivial to manufacture a compelling but spurious story, so the discipline of confirmation matters more, not less.

## 4. Building Explanatory Interactive Views

### 4.1 From discovery to communication

Once you know the message, the design problem inverts. You are no longer minimizing the cost of asking questions; you are guiding a reader to a conclusion while letting them verify it. The best explanatory interactives are mostly static. They present a clear default view that carries the message on its own, then offer a small number of controls for the questions you anticipate the reader will have.
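
The threshold explorer of Section 3.5 is a convenient case for seeing how a precomputed response surface can feed a single scoped control, whether that control sits in an exploratory notebook or on an explanatory page. The NumPy sketch below is one possible implementation, not reference code for this chapter: `threshold_table` and `lookup` are hypothetical names, and it assumes scores and binary labels arrive as equal-length array-likes.

```python
import numpy as np


def threshold_table(scores, y):
    """Precompute confusion counts at every distinct score (hypothetical helper).

    Row k holds the counts for predictions 1[s >= threshold[k]],
    with thresholds sorted in ascending order.
    """
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(y, dtype=int)
    order = np.argsort(-scores, kind="stable")      # O(n log n), descending
    s, yy = scores[order], y[order]
    tp_cum = np.cumsum(yy)                          # positives among the top-k scores
    fp_cum = np.cumsum(1 - yy)                      # negatives among the top-k scores
    # Index of the last occurrence of each distinct score (handles ties).
    last = np.r_[np.nonzero(np.diff(s))[0], len(s) - 1]
    n_pos = int(yy.sum())
    n_neg = len(yy) - n_pos
    tp, fp = tp_cum[last][::-1], fp_cum[last][::-1]
    return {
        "threshold": s[last][::-1],
        "tp": tp, "fp": fp, "fn": n_pos - tp, "tn": n_neg - fp,
        "n_pos": n_pos, "n_neg": n_neg,
    }


def lookup(table, t):
    """Return (TP, FP, FN, TN) at threshold t with an O(log n) binary search."""
    th = table["threshold"]
    i = int(np.searchsorted(th, t, side="left"))    # first distinct score >= t
    if i == len(th):                                # t exceeds every score
        return 0, 0, table["n_pos"], table["n_neg"]
    return tuple(int(table[k][i]) for k in ("tp", "fp", "fn", "tn"))


# Toy scores and labels, invented for illustration.
table = threshold_table([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
counts_at_half = lookup(table, 0.5)
```

A slider callback would then call `lookup(table, t)` on each move and derive precision and recall from the returned counts, guarding the case where nothing is predicted positive and precision is undefined.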

### 4.2 Annotation and guided defaults

A reader landing on your chart has none of your context, so the default state must do the heavy lifting. Title the chart with the takeaway rather than the variables. Annotate the specific points that matter, such as the threshold you chose and why. Set the initial zoom, the default filter, and the highlighted series so that the message is visible before any interaction. Every control you expose should answer a question a thoughtful reader would actually ask, not merely a question the tool makes easy to enable.

### 4.3 Explaining model behavior interactively

Interactive explanation shines for model behavior that is inherently conditional. A partial dependence explorer that lets a stakeholder pick a feature and see its modeled effect makes a black box legible. A threshold widget that shows precision, recall, and the resulting count of false positives and false negatives lets a product owner feel the tradeoff in business terms rather than reading it off a table. A local explanation view, where clicking a prediction reveals the feature attributions that drove it, turns "the model said no" into "the model said no because these three inputs pushed it over the line." In each case the interactivity is tightly scoped to the one degree of freedom the audience cares about.

### 4.4 Performance and accessibility

Explanatory views reach people on varied hardware and networks, so weight matters. Aggregate or sample before rendering, prefer WebGL or canvas backends for large point counts, and lazy-load below the fold. Accessibility matters too. Color must not be the only channel carrying meaning, hover-only information must have a non-hover fallback for keyboard and touch users, and text should remain legible at the sizes your audience will actually use. An interactive chart that excludes part of its audience has failed at the one job, communication, that justified building it.

## 5. Dashboards

When several linked views, controls, and live computations belong together as a tool rather than a single figure, you have a dashboard. Two Python frameworks dominate, and they differ in how much control they hand you.

### 5.1 Streamlit: scripts that become apps

Streamlit turns a plain Python script into a web app by rerunning the whole script top to bottom on every interaction. A widget call returns its current value, and you use that value as an ordinary variable. The mental model is delightfully simple: there are no callbacks, just a script that reads its inputs and draws its outputs.

```python
import streamlit as st

# scores, y, precision, recall, and confusion_figure are assumed to be
# defined earlier in the script (model outputs, labels, and small helpers).
threshold = st.slider("Decision threshold", 0.0, 1.0, 0.5)
preds = (scores >= threshold).astype(int)

st.metric("Precision", f"{precision(y, preds):.3f}")
st.metric("Recall", f"{recall(y, preds):.3f}")
st.plotly_chart(confusion_figure(y, preds))
```

This model makes Streamlit the fastest way to wrap a model or analysis in a shareable interface, which is why it dominates internal ML demos and prototypes. The cost of rerunning everything is managed with caching decorators that memoize expensive steps such as loading data or running inference. Streamlit's simplicity becomes a limit when you need fine-grained layout control or complex stateful interactions that resist the rerun model.

### 5.2 Dash: declarative apps with explicit callbacks

Dash, built on Plotly and Flask, takes the callback approach. You declare a layout of components, each with an id, then write callback functions that name their inputs and outputs explicitly. Only the affected components recompute when an input changes, which gives precise control over what updates and when.

```python
from dash import Dash, Input, Output

app = Dash(__name__)
# app.layout is assumed to contain a graph with id "roc" and a dropdown with
# id "model-dropdown"; results and roc_figure are assumed to exist elsewhere.

@app.callback(
    Output("roc", "figure"),
    Input("model-dropdown", "value"),
)
def update_roc(model_name):
    return roc_figure(results[model_name])
```

That explicitness is more verbose than Streamlit but scales better to large, multi-page applications with intricate dependencies between controls.
Dash is the stronger choice when a dashboard becomes a maintained product with many users rather than a quick internal tool, and when you need the layout and update behavior to be exactly as specified.

### 5.3 Choosing a framework

Pick Streamlit when speed of construction and a simple mental model matter most, which covers the majority of internal ML tooling and rapid prototypes. Pick Dash when you need production-grade structure, granular update control, and multi-page complexity. Both can render the same Plotly figures, so the visualization skills transfer; what differs is the application scaffolding around them.

### 5.4 Dashboards as ML interfaces

For ML teams, dashboards become the connective tissue between models and humans. A monitoring dashboard tracks prediction distributions, input drift, and live performance against a baseline, alerting when the world shifts away from the training data. An evaluation dashboard compares candidate models across slices so a reviewer can see not just which model wins on average but where each one wins and loses. A human-in-the-loop labeling or review interface surfaces low-confidence predictions for a person to correct, feeding the corrections back into the training set. In each case the dashboard is not a report but a workplace, and the same principles apply: a strong default view, scoped interactivity, and ruthless attention to load time.

## 6. Practical Guidance

### 6.1 A decision checklist

Before adding interactivity, ask whether a static figure conveys the message. If it does, stop. If the data has high cardinality, conditional structure, or parameter sensitivity that a single frame cannot hold, choose a tool by the dominant need: express plotting for exploration, linked views for relational analysis, and a dashboard framework when controls and computation must live together. Match polish to audience, keeping exploratory views rough and fast and explanatory views focused and annotated.

### 6.2 Performance habits that scale

Most interactive performance problems come from sending too much data to the browser. Aggregate or sample before plotting, since a reader cannot perceive a million overlapping points anyway. Use WebGL backends, available in Plotly and Bokeh, when you genuinely need tens of thousands of marks. Cache expensive computations so interaction recomputes only what changed. Precompute projections and summaries offline rather than on every page load. These habits keep an interactive view responsive, and responsiveness is what makes interaction feel like thinking rather than waiting.

### 6.3 Reproducibility and embedding

Interactive artifacts complicate reproducibility because they bundle data, code, and a JavaScript runtime whose versions can drift. Pin library versions, and for figures destined for a paper or a Quarto book, export a self-contained HTML file or fall back to a static image so the artifact survives independent of a running server. A figure that renders today but breaks on a dependency bump next quarter has a short and frustrating life, so treat the export format as part of the design, not an afterthought.

### 6.4 Common pitfalls

A few failure modes recur often enough to name explicitly.

- **Interaction as decoration.** Controls that no reader needs add load time, fragility, and cognitive cost while removing no ambiguity. If the default view already carries the message, delete the control.
- **Overplotting masquerading as data.** Tens of thousands of overlapping semi-transparent marks read as a uniform smear. Aggregate, bin, or sample so the visible density reflects the real density, and let zoom recover detail.
- **Color as the sole channel.** Roughly one in twelve men has a color-vision deficiency, so any meaning carried only by hue is invisible to part of the audience. Redundantly encode with shape, position, or direct labels.
- **Hover-only information.** Tooltips do not exist for keyboard and touch users. Anything load-bearing must have a non-hover fallback.
- **Mining the test set with your eyes.** Slicing held-out data interactively until a pattern appears is multiple comparisons by another name (Section 3.6). Confirm on data you have not already inspected.
- **Latency creep.** A control wired to a heavy server round trip that drifts past one second breaks the loop between question and answer. Precompute the response surface and keep the per-interaction cost a lookup.

### 6.5 When to use which view

As a closing heuristic: stay static for a single fixed comparison; reach for an exploratory notebook view (Plotly Express, or Altair and Bokeh for linked selection) when you are the consumer and the goal is discovery; build an explanatory interactive with strong defaults and one or two scoped controls when you have found the message and must communicate it; and stand up a dashboard (Streamlit for speed, Dash for production structure) only when several linked views, controls, and live computation genuinely belong together as a tool rather than a figure.

## References

1. Plotly Python Open Source Graphing Library. https://plotly.com/python/
2. Bokeh Documentation. https://docs.bokeh.org/en/latest/
3. Vega-Altair: Declarative Visualization in Python. https://altair-viz.github.io/
4. Vega-Lite: A Grammar of Interactive Graphics. https://vega.github.io/vega-lite/
5. Streamlit Documentation. https://docs.streamlit.io/
6. Dash Documentation by Plotly. https://dash.plotly.com/
7. Satyanarayan, A., Moritz, D., Wongsuphasawat, K., Heer, J. Vega-Lite: A Grammar of Interactive Graphics. IEEE Transactions on Visualization and Computer Graphics, 2017. https://ieeexplore.ieee.org/document/7539624
8. Wilke, C. O. Fundamentals of Data Visualization. https://clauswilke.com/dataviz/
9. McInnes, L., Healy, J., Melville, J. UMAP: Uniform Manifold Approximation and Projection. https://arxiv.org/abs/1802.03426
10. Wexler, J., et al. The What-If Tool: Interactive Probing of Machine Learning Models. https://arxiv.org/abs/1907.04135
11. Shneiderman, B. The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. Proceedings of the IEEE Symposium on Visual Languages, 1996. https://doi.org/10.1109/VL.1996.545307
12. Card, S. K., Robertson, G. G., Mackinlay, J. D. The Information Visualizer, an Information Workspace. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), 1991. https://doi.org/10.1145/108844.108874
13. Nielsen, J. Usability Engineering. Morgan Kaufmann, 1993. https://doi.org/10.1016/B978-0-08-052029-2.50007-3
14. Becker, R. A., Cleveland, W. S. Brushing Scatterplots. Technometrics, 1987. https://doi.org/10.1080/00401706.1987.10488204