8 Computational Thinking for AI

8.1 1. Introduction

Computational thinking is the discipline of formulating problems and their solutions so that the solutions can be carried out by an information processing agent, whether that agent is a human, a machine, or a combination of the two. The phrase was popularized by Jeannette Wing, who argued that it represents a universally applicable skill on par with reading, writing, and arithmetic. For practitioners of artificial intelligence and machine learning, computational thinking is not optional polish. It is the connective tissue between a messy human aspiration (“understand our customers better”) and a precise object that a learning system can actually optimize.

This chapter treats computational thinking as the craft of moving from a vague real-world goal to a well-posed computational problem, and from there to a tractable solution. The central claim is that most failures in applied AI are not failures of model architecture or compute. They are failures of formulation. A team that frames a recommendation task as next-click prediction will build something very different from a team that frames it as long-horizon satisfaction, even if both reach for the same transformer library. The work of deciding which problem you are solving happens before any gradient descends.

We organize the discussion around four classical pillars: decomposition, pattern recognition, abstraction, and algorithm design. We then connect these pillars to three concerns that are especially sharp in AI: the role of representation, awareness of computational complexity and tractability, and the practice of problem framing. Throughout, we ground the ideas in worked examples drawn from machine learning practice.

The flow of work in this chapter runs in one direction, from an underspecified human goal to a procedure whose guarantees and costs are understood.

flowchart LR
    A["Vague real world goal"] --> B["Well posed problem: spaces, target, loss"]
    B --> C["Representation: features or learned embeddings"]
    C --> D["Hypothesis class and inductive bias"]
    D --> E["Algorithm: learning and inference"]
    E --> F["Bounded solution with known cost and error"]
    F -. "feedback and reframing" .-> A

The dashed arrow matters as much as the solid ones. Framing is rarely correct on the first pass, and the discipline is iterative: deployment surfaces a leak, a distribution shift, or an objective mismatch that sends the practitioner back to redefine the problem rather than to retune the model.

8.2 2. The Four Pillars in an AI Context

8.2.1 2.1 Decomposition

Decomposition is the act of breaking a large, intimidating problem into smaller subproblems that can be reasoned about and solved independently. In AI systems this happens at several levels at once.

At the pipeline level, a deployed model is rarely a single artifact. Consider a fraud detection system. It decomposes naturally into ingestion (collecting transaction events), feature computation (aggregating spending velocity, device fingerprints, merchant history), a scoring model (estimating the probability of fraud), a decision policy (choosing to allow, challenge, or block given the score and the cost structure), and a feedback loop (collecting confirmed labels to retrain). Each component has its own inputs, outputs, failure modes, and metrics. Treating the whole thing as “the fraud model” obscures the fact that a precision problem in scoring and a threshold problem in the decision policy require entirely different fixes.

At the learning level, decomposition shows up as the division of a hard prediction task into easier ones. A complex translation system may be decomposed into tokenization, encoding, alignment, decoding, and detokenization. Reinforcement learning decomposes long-horizon control into the estimation of value functions and the improvement of policies. Knowing where the natural seams are, and resisting the urge to cut where there is no seam, is a learned skill.

A useful test for a good decomposition is whether the interfaces between parts are narrow and stable. If two subproblems must constantly exchange large amounts of state, they were probably one problem. This is the same principle of low coupling and high cohesion that governs modular software design, and it has a precise payoff in machine learning: components joined by a narrow interface can be evaluated, retrained, and replaced independently, whereas tightly entangled components fail together and resist diagnosis.

There is a subtle hazard unique to learned systems. When a decomposition feeds the output of one learned component into the input of another, the consuming component begins to depend on the idiosyncrasies of its upstream supplier, a coupling that Sculley and colleagues named an entanglement antipattern, summarized as the maxim that changing anything changes everything. A decomposition that looks clean on a whiteboard can still be tightly coupled in practice if the interface carries a learned, drifting signal rather than a stable contract. Good decomposition in AI therefore demands that interfaces be not only narrow but also versioned and monitored.

8.2.2 2.2 Pattern Recognition

Pattern recognition, in the computational thinking sense, means spotting that a new problem is an instance of a problem class you already know how to solve. This is distinct from the statistical pattern recognition that models perform on data, though the two rhymes.

Much of the leverage in applied AI comes from recognizing the abstract shape of a task. “Predict whether this email is spam,” “predict whether this tumor is malignant,” and “predict whether this user will churn” are all binary classification under a fixed feature representation, and they share an entire toolkit: loss functions, calibration methods, threshold tuning, and evaluation under class imbalance. Recognizing that an apparently novel business question is really ranking, or really anomaly detection, or really sequence labeling, immediately imports decades of theory and tooling.

The danger is false pattern matching. A practitioner who reflexively casts every problem as supervised classification may force a labeling regime onto a situation that is genuinely unsupervised, or may ignore that the real objective is a sequential decision under uncertainty rather than a one-shot prediction. Good pattern recognition includes recognizing when the pattern does not fit.

8.2.3 2.3 Abstraction

Abstraction is the deliberate suppression of detail. It is arguably the most important and the most difficult pillar, because the quality of an AI system is bounded by the quality of the abstractions it is built on.

In machine learning, abstraction governs what we choose to represent and what we choose to ignore. A model of housing prices abstracts a physical building into a vector of features: square footage, location encoded somehow, number of bedrooms, age. Every one of those choices discards information (the smell of the kitchen, the quality of morning light) and asserts that what remains is sufficient for the prediction. The art is choosing abstractions that retain the signal relevant to the objective while discarding the rest.

Abstraction also operates over models themselves. The notion of “a classifier” as a function from a feature space to a probability simplex lets us reason about logistic regression and a deep network with the same vocabulary. Layered abstraction, where each level hides the complexity of the level below, is what makes large systems comprehensible. A practitioner using an embedding model does not reason about attention weights; they reason about a function that maps text to a vector in which semantic similarity corresponds to geometric proximity. That abstraction is a contract, and like all contracts it can leak.

8.2.4 2.4 Algorithm Design

Algorithm design is the specification of a precise, finite procedure that transforms inputs into the desired outputs. In AI the relevant algorithms come in two flavors that are easy to conflate.

The first flavor is the learning algorithm: the procedure that consumes data and produces a model, such as stochastic gradient descent, expectation maximization, or a tree-boosting routine. The second is the inference algorithm: the procedure that consumes a trained model and a new input and produces a prediction or a decision, such as beam search over a language model or value iteration in a planner.

Designing these procedures well requires the same considerations as any algorithm design: correctness (does it converge to what we want), resource cost (time and memory as functions of input size), and robustness (how it behaves on degenerate or adversarial inputs). What is distinctive about AI is that correctness is often statistical rather than exact. We do not demand that a classifier be right on every input; we demand that its expected loss over a distribution be small, and we design and analyze algorithms with that probabilistic notion of success in mind.

8.3 3. From Vague Goal to Well-Posed Problem

8.3.1 3.1 The Anatomy of a Well-Posed Learning Problem

A real-world goal such as “reduce customer churn” is not yet a computational problem. To make it well-posed in the machine learning sense, we must specify several elements precisely.

First, the input space, often written as a set of objects $x$ drawn from a domain $\mathcal{X}$. Second, the output space $\mathcal{Y}$, the set of permissible answers. Third, an unknown target relationship, usually framed as a distribution $\mathcal{D}$ over pairs $(x, y)$, that we wish to approximate. Fourth, a hypothesis class $\mathcal{H}$, the family of candidate functions we are willing to consider. Fifth, a loss function $\ell(\hat{y}, y)$ that quantifies the cost of predicting $\hat{y}$ when the truth is $y$.

We can now state the object of learning precisely. For a hypothesis $h$, define its risk (true error) as the expected loss

\[ R(h) \;=\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(h(x), y)\big]. \]

Because $\mathcal{D}$ is unknown, we cannot evaluate $R(h)$ directly. Instead we observe a finite training sample $S = \{(x_i, y_i)\}_{i=1}^{n}$ drawn independently from $\mathcal{D}$ and compute the empirical risk

\[ \hat{R}_S(h) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i), y_i). \]

The well-posed problem is to find $h \in \mathcal{H}$ with small true risk $R(h)$, given only the ability to measure $\hat{R}_S(h)$. The standard strategy, empirical risk minimization, returns the hypothesis $\hat{h}_S = \arg\min_{h\in\mathcal{H}} \hat{R}_S(h)$. The entire theory of generalization is the study of when minimizing the proxy $\hat{R}_S$ also makes the target $R$ small.

Definition: a well-posed learning problem

A learning problem is well-posed once the tuple $(\mathcal{X}, \mathcal{Y}, \mathcal{D}, \mathcal{H}, \ell)$ is fully specified together with the sampling protocol that produces $S$. Every element is a commitment. Leaving any one of them implicit means the team is, in effect, optimizing an objective no one has written down.

This template, which Tom Mitchell crystallized as the idea that a program learns from experience E with respect to task T and performance measure P if its performance at T, measured by P, improves with E, forces a set of decisions that the original vague goal left implicit. What counts as a customer? Over what time window do we measure churn? Is the loss of a false negative (a churning customer we failed to flag) symmetric with the loss of a false positive? Answering these questions is most of the work, and the answers are not discoverable by any algorithm. They are commitments made by the people framing the problem.

8.3.2 3.2 Surfacing Hidden Assumptions

Turning a goal into a problem is an exercise in making the implicit explicit. A few recurring questions sharpen the formulation.

What is the unit of prediction? Per customer, per session, per transaction? The choice determines how data is grouped and how leakage between training and test can occur. What information is actually available at decision time? A feature that is only known after the outcome is useless for live prediction, even though it correlates beautifully in a historical table. This is the trap of target leakage, and it is most often introduced during framing, not modeling. What is the cost structure of errors? A medical screening problem and a spam filter both produce binary predictions, but their loss functions are wildly different, and that difference should drive the threshold, the metric, and even the choice of model.

8.3.3 3.3 A Framing Checklist

The following non-executable sketch captures a disciplined framing pass that a practitioner can run before touching data.

FRAME(real_world_goal):
    objective        <- what decision or action does this inform?
    unit             <- what is one example? (customer, image, sentence)
    input_space      <- what is observable at decision time?
    output_space     <- what are we predicting? (class, value, ranking, sequence)
    target           <- how is the ground-truth label defined and obtained?
    loss             <- what does a wrong answer cost, asymmetrically?
    horizon          <- one-shot prediction or sequential decision?
    constraints      <- latency, fairness, interpretability, privacy budgets
    baseline         <- what is the simplest thing that could possibly work?
    success_metric   <- offline proxy AND online business measure
    return well_posed_problem(...)

The final two lines deserve emphasis. Insisting on a baseline before any sophisticated model is a guard against complexity for its own sake. Distinguishing the offline proxy metric (such as area under the ROC curve) from the online business measure (such as retained revenue) keeps the team honest about the fact that the thing we can optimize is rarely identical to the thing we care about.

8.4 4. The Role of Representation

8.4.1 4.1 Why Representation Is Half the Battle

Once a problem is well-posed, the next decisive choice is how to represent the objects in the input space. Representation is the bridge between raw reality and the mathematics of learning, and it frequently determines success more than the choice of model. A linearly inseparable dataset becomes trivially separable under the right feature transformation; a poorly represented one defeats even the most expressive network.

Classically, representation meant feature engineering: a domain expert handcrafting the variables that expose the signal. The deep learning era shifted much of this burden onto representation learning, where the model itself discovers useful intermediate representations from raw inputs. Bengio and colleagues frame this as the search for representations that disentangle the underlying factors of variation in the data. Yet the shift did not eliminate the responsibility for representation. It relocated it. Choosing to tokenize text in a particular way, to resize images to a particular resolution, or to encode a categorical variable as an embedding rather than a one-hot vector are all representational commitments that precede and shape what the model can learn.

8.4.2 4.2 Representation as an Inductive Bias

Every representation encodes an inductive bias, an assumption about which inputs should be treated as similar. Representing words as one-hot vectors asserts that every word is equidistant from every other, which is why distributional embeddings, where semantically related words sit near one another, were such a leap. Representing an image as a grid of pixels processed by convolution asserts that translation should not change the meaning of a local pattern. These biases are not neutral. They are exactly what allows a model to generalize from finite data, because they constrain the hypothesis class to functions that respect the assumed structure.

A computational thinker therefore chooses representations the way an algorithm designer chooses data structures: with full awareness that the choice privileges some operations and penalizes others. The right representation can turn an intractable search into an easy one, and the wrong representation can hide the answer in plain sight.

8.5 5. Complexity and Tractability Awareness

8.5.1 5.1 Two Kinds of Complexity

AI practitioners must hold two distinct notions of complexity in mind. The first is computational complexity in the classical sense: how the time and memory required by an algorithm grow with the size of the input, expressed in big-O notation. Exact inference in a general probabilistic graphical model is NP-hard, which is precisely why approximate methods such as variational inference and sampling exist. Optimal planning over long horizons suffers the curse of dimensionality, where the state space grows exponentially in the number of variables. Recognizing that a naively posed problem is intractable is what motivates the search for structure (independence assumptions, factorization, locality) that restores tractability.

The second is sample complexity: how much data is required to learn a hypothesis of a given accuracy with a given confidence. Statistical learning theory, through tools such as the Vapnik-Chervonenkis (VC) dimension, relates the richness of a hypothesis class to the number of examples needed to avoid overfitting. The VC dimension $\mathrm{VC}(\mathcal{H})$ is the size of the largest set of points that $\mathcal{H}$ can label in all possible ways, a combinatorial measure of how expressive the class is. A central result of the theory makes the trade-off quantitative: with probability at least $1 - \delta$ over the draw of a sample of size $n$, every hypothesis $h \in \mathcal{H}$ satisfies a generalization bound of the form

\[ R(h) \;\le\; \hat{R}_S(h) \;+\; O\!\left(\sqrt{\frac{\mathrm{VC}(\mathcal{H})\,\log n \;+\; \log(1/\delta)}{n}}\right). \]

The structure of this inequality is the entire lesson, even setting the constants aside. The true risk we care about is bounded by the empirical risk we can measure, plus a penalty that grows with the capacity $\mathrm{VC}(\mathcal{H})$ of the hypothesis class and shrinks as the sample size $n$ grows. A richer class can drive the first term down but inflates the second, and only more data can pay that inflation back. This is the formal heart of the bias-variance trade-off, and it explains why throwing a billion-parameter model at a thousand examples is usually folly: the empirical risk can be driven to zero while the bound on true risk stays vacuous. The same inequality also explains why useful inductive biases, discussed below, are not a luxury. By shrinking the effective capacity of $\mathcal{H}$ to functions that respect the structure of the problem, they tighten the generalization gap without requiring more labels.

8.5.2 5.2 Reasoning About Tractability Before Building

Complexity awareness is a design-time discipline, not a post-mortem one. Before committing to an approach, a practitioner should estimate the order of growth of both training and inference, and confront it with the operational budget.

estimate cost of a candidate approach:
    n  = number of training examples
    d  = dimensionality of representation
    k  = number of model parameters or model "size"

    training_time   ~ epochs * n * cost_per_example(d, k)
    inference_time  ~ cost_per_query(d, k)        must fit latency budget
    memory          ~ f(k, batch, sequence_length)
    sample_need     ~ grows with capacity of H    do we have enough labels?

    if inference_time > latency_budget:  reconsider model size or representation
    if sample_need   > available labels: shrink H, add bias, or get more data

The point of this reasoning is not to compute exact numbers; it is to detect, early, that a proposed solution is on the wrong side of a feasibility boundary. A model that achieves excellent accuracy but cannot return a prediction within the required latency is a non-solution, and no amount of later tuning will rescue a formulation that ignored the constraint from the start. Tractability is part of correctness.

8.5.3 5.3 The Discipline of Approximation

Much of modern AI is the art of trading exactness for tractability in a controlled way. Mini-batch gradient descent approximates the true gradient with a cheap, noisy estimate. Nearest-neighbor search at scale uses approximate methods that trade a small recall loss for orders-of-magnitude speedups. Sampling approximates intractable integrals. The computational thinker treats approximation not as a compromise but as a first-class design move, and reasons explicitly about the error each approximation introduces and whether that error matters for the objective.

8.6 6. Worked Examples of Computational Framing

8.6.1 6.1 From “Improve Search Quality” to a Ranking Problem

Suppose a product team wants to “make search better.” This is a goal, not a problem. Applying the framing checklist, we ask what decision the system informs: given a query, order a set of candidate documents. The unit of prediction is a query-document pair. The output is not a class but a relative ordering, which immediately signals that this is learning to rank rather than classification, importing pairwise and listwise loss functions designed for ordering.

The target relationship is subtle. We rarely have absolute relevance labels, but we do have implicit signals such as clicks, which are biased toward whatever was shown at the top. Recognizing this bias is a framing insight: optimizing for raw click-through risks a feedback loop that simply reinforces past rankings. The loss must account for position bias, and the offline metric (such as normalized discounted cumulative gain) must be distinguished from the online measure (such as successful sessions). What looked like one problem decomposes into candidate retrieval (a tractability move, since scoring every document is infeasible) followed by precise re-ranking of a small set.

A small numerical illustration makes the choice of metric concrete. Suppose a query returns five documents whose graded relevances, in the order the system presents them, are $(3, 2, 3, 0, 1)$ on a scale where larger is better. Discounted cumulative gain weights each relevance by a logarithmic position discount,

\[ \mathrm{DCG} \;=\; \sum_{i=1}^{5} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)}. \]

For this ranking the gains are $\tfrac{7}{1.000}, \tfrac{3}{1.585}, \tfrac{7}{2.000}, \tfrac{0}{2.322}, \tfrac{1}{2.585}$, summing to roughly $12.78$. The ideal ranking sorts the same relevances as $(3, 3, 2, 1, 0)$, giving an ideal DCG of about $13.71$, so the normalized score is $\mathrm{nDCG} = 12.78 / 13.71 \approx 0.93$. The exponential numerator $2^{\mathrm{rel}_i} - 1$ rewards placing a highly relevant document near the top far more than a marginally relevant one, and the logarithmic discount encodes the empirical fact that users attend to early positions. A plain accuracy metric, which treats all positions alike, would be blind to exactly the structure that makes a ranking good. Choosing nDCG over accuracy is therefore not a tuning detail; it is part of posing the problem correctly.

8.6.2 6.2 From “Detect Anomalies” to a Well-Posed Detection Problem

A platform team wants to “catch unusual behavior.” The first framing question is whether labels exist. If confirmed anomalies are rare and unlabeled, this is not supervised classification; it is more naturally posed as density estimation or one-class modeling, where we learn a model of normal behavior and flag low-probability events. That recognition (a pattern-recognition act) changes everything downstream.

The representation choice is decisive here. Anomalies that are obvious in one representation are invisible in another, so the framing must specify what “behavior” is measured over: per-event features, or aggregated sequences over time. The loss is profoundly asymmetric and the base rate is extreme, so the evaluation cannot rely on accuracy and must use precision at low recall or the area under the precision-recall curve. Finally, the decision policy is a separate subproblem from the scoring model: where to set the alert threshold depends on the cost of investigating a false alarm versus the cost of a missed incident, a business decision that no learning algorithm can make for us.

8.6.3 6.3 From “Build a Helpful Assistant” to a Sequential Decision Problem

Consider the goal of an assistant that “helps users accomplish tasks.” Framed naively as next-token prediction, the problem is supervised sequence modeling, and that framing produces a fluent but myopic system that optimizes local plausibility rather than task success. Reframing the goal reveals a horizon: helpfulness unfolds over a multi-turn interaction, which is a sequential decision problem under uncertainty about the user’s intent.

This reframing surfaces an objective mismatch that is central to modern practice. The quantity we can cheaply optimize (likelihood of human-written text) is a proxy for the quantity we care about (useful, truthful, harmless assistance). Bridging that gap is exactly the motivation for learning from human feedback, where a separate reward model abstracts human preference judgments into a scalar that a policy can be optimized against. The decomposition here is illuminating: a representation-learning component (the base model), a preference-abstraction component (the reward model), and a policy-improvement algorithm (reinforcement learning or a related optimizer), each a recognizable subproblem with its own theory.

8.7 7. Synthesis and Practice

The through-line of this chapter is that the hardest and highest-leverage work in AI happens before optimization begins. Decomposition tells us where the seams of a system lie. Pattern recognition lets us see a business question as an instance of a solved problem class, and, just as importantly, lets us notice when it is not. Abstraction governs what we choose to represent and what we discard, setting a ceiling on achievable quality. Algorithm design turns the formulation into a precise procedure whose costs we can analyze.

Representation, complexity, and framing are the places where these pillars meet the specific texture of machine learning. Representation determines what is learnable; complexity determines what is feasible; framing determines whether we are solving the right problem at all. A practitioner who internalizes these habits will, faced with a vague mandate, instinctively reach not for a model but for a notebook on which to write down the input space, the output space, the loss, the constraints, and the simplest baseline. The model comes later, and by then most of the important decisions have already been made well.

8.7.1 7.1 When to Lean on These Habits, and How They Fail

Computational thinking pays the largest dividends at the start of a project, when the problem is still soft and a single framing decision will echo through everything built afterward. It is most valuable when the mandate arrives as a business sentence rather than a dataset, when the cost of errors is asymmetric or hard to name, when labels are scarce or themselves suspect, and when the system must run inside a real latency, memory, or fairness budget. In each of these situations the deciding move is made before any model is chosen, and a deliberate framing pass is the highest-leverage hour the team will spend.

The discipline also has characteristic failure modes worth naming. The first is analysis paralysis: framing can become an excuse never to build, when in truth a quick baseline often teaches more about the problem than another round of whiteboard abstraction. The second is premature or false abstraction, forcing a problem into a familiar class (almost always supervised classification) because the toolkit is comfortable, when the true structure is sequential, unsupervised, or causal. The third is optimizing a proxy whose gap to the real objective is never revisited, so the team faithfully improves an offline metric while the online measure stagnates or declines. The fourth is treating the formulation as fixed once written, when the honest practice is to revisit $(\mathcal{X}, \mathcal{Y}, \mathcal{D}, \mathcal{H}, \ell)$ whenever the data, the costs, or the deployment context shift. The remedy for all four is the same: hold framing and building in tension, let each correct the other, and treat the problem statement as a living artifact rather than a one-time gate.

Computational thinking, in the end, is the refusal to let a problem stay vague. It is the conviction that any goal worth pursuing with a machine can be sharpened, decomposed, represented, and bounded until what remains is a problem a procedure can solve, and a clear-eyed account of what that procedure can and cannot promise.

8.8 References

Wing, J. M. (2006). Computational Thinking. Communications of the ACM, 49(3), 33-35. https://www.cs.cmu.edu/~15110-s13/Wing06-ct.pdf
Wing, J. M. (2008). Computational Thinking and Thinking about Computing. Philosophical Transactions of the Royal Society A, 366(1881), 3717-3725. https://royalsocietypublishing.org/doi/10.1098/rsta.2008.0118
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill. https://www.cs.cmu.edu/~tom/mlbook.html
Domingos, P. (2012). A Few Useful Things to Know about Machine Learning. Communications of the ACM, 55(10), 78-87. https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
Bengio, Y., Courville, A., & Vincent, P. (2013). Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798-1828. https://arxiv.org/abs/1206.5538
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. https://www.deeplearningbook.org/
Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press. https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/
Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press. https://mitpress.mit.edu/9780262013192/probabilistic-graphical-models/
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf
Liu, T.-Y. (2009). Learning to Rank for Information Retrieval. Foundations and Trends in Information Retrieval, 3(3), 225-331. https://www.nowpublishers.com/article/Details/INR-016
Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems 35. https://arxiv.org/abs/2203.02155
Sculley, D., et al. (2015). Hidden Technical Debt in Machine Learning Systems. Advances in Neural Information Processing Systems 28. https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf

# Computational Thinking for AI ## 1. Introduction Computational thinking is the discipline of formulating problems and their solutions so that the solutions can be carried out by an information processing agent, whether that agent is a human, a machine, or a combination of the two. The phrase was popularized by Jeannette Wing, who argued that it represents a universally applicable skill on par with reading, writing, and arithmetic. For practitioners of artificial intelligence and machine learning, computational thinking is not optional polish. It is the connective tissue between a messy human aspiration ("understand our customers better") and a precise object that a learning system can actually optimize. This chapter treats computational thinking as the craft of moving from a vague real-world goal to a well-posed computational problem, and from there to a tractable solution. The central claim is that most failures in applied AI are not failures of model architecture or compute. They are failures of formulation. A team that frames a recommendation task as next-click prediction will build something very different from a team that frames it as long-horizon satisfaction, even if both reach for the same transformer library. The work of deciding which problem you are solving happens before any gradient descends. We organize the discussion around four classical pillars: decomposition, pattern recognition, abstraction, and algorithm design. We then connect these pillars to three concerns that are especially sharp in AI: the role of representation, awareness of computational complexity and tractability, and the practice of problem framing. Throughout, we ground the ideas in worked examples drawn from machine learning practice. The flow of work in this chapter runs in one direction, from an underspecified human goal to a procedure whose guarantees and costs are understood. ```{mermaid} flowchart LR A["Vague real world goal"] --> B["Well posed problem: spaces, target, loss"] B --> C["Representation: features or learned embeddings"] C --> D["Hypothesis class and inductive bias"] D --> E["Algorithm: learning and inference"] E --> F["Bounded solution with known cost and error"] F -. "feedback and reframing" .-> A ``` The dashed arrow matters as much as the solid ones. Framing is rarely correct on the first pass, and the discipline is iterative: deployment surfaces a leak, a distribution shift, or an objective mismatch that sends the practitioner back to redefine the problem rather than to retune the model. ## 2. The Four Pillars in an AI Context ### 2.1 Decomposition Decomposition is the act of breaking a large, intimidating problem into smaller subproblems that can be reasoned about and solved independently. In AI systems this happens at several levels at once. At the pipeline level, a deployed model is rarely a single artifact. Consider a fraud detection system. It decomposes naturally into ingestion (collecting transaction events), feature computation (aggregating spending velocity, device fingerprints, merchant history), a scoring model (estimating the probability of fraud), a decision policy (choosing to allow, challenge, or block given the score and the cost structure), and a feedback loop (collecting confirmed labels to retrain). Each component has its own inputs, outputs, failure modes, and metrics. Treating the whole thing as "the fraud model" obscures the fact that a precision problem in scoring and a threshold problem in the decision policy require entirely different fixes. At the learning level, decomposition shows up as the division of a hard prediction task into easier ones. A complex translation system may be decomposed into tokenization, encoding, alignment, decoding, and detokenization. Reinforcement learning decomposes long-horizon control into the estimation of value functions and the improvement of policies. Knowing where the natural seams are, and resisting the urge to cut where there is no seam, is a learned skill. A useful test for a good decomposition is whether the interfaces between parts are narrow and stable. If two subproblems must constantly exchange large amounts of state, they were probably one problem. This is the same principle of low coupling and high cohesion that governs modular software design, and it has a precise payoff in machine learning: components joined by a narrow interface can be evaluated, retrained, and replaced independently, whereas tightly entangled components fail together and resist diagnosis. There is a subtle hazard unique to learned systems. When a decomposition feeds the output of one learned component into the input of another, the consuming component begins to depend on the idiosyncrasies of its upstream supplier, a coupling that Sculley and colleagues named an entanglement antipattern, summarized as the maxim that changing anything changes everything. A decomposition that looks clean on a whiteboard can still be tightly coupled in practice if the interface carries a learned, drifting signal rather than a stable contract. Good decomposition in AI therefore demands that interfaces be not only narrow but also versioned and monitored. ### 2.2 Pattern Recognition Pattern recognition, in the computational thinking sense, means spotting that a new problem is an instance of a problem class you already know how to solve. This is distinct from the statistical pattern recognition that models perform on data, though the two rhymes. Much of the leverage in applied AI comes from recognizing the abstract shape of a task. "Predict whether this email is spam," "predict whether this tumor is malignant," and "predict whether this user will churn" are all binary classification under a fixed feature representation, and they share an entire toolkit: loss functions, calibration methods, threshold tuning, and evaluation under class imbalance. Recognizing that an apparently novel business question is really ranking, or really anomaly detection, or really sequence labeling, immediately imports decades of theory and tooling. The danger is false pattern matching. A practitioner who reflexively casts every problem as supervised classification may force a labeling regime onto a situation that is genuinely unsupervised, or may ignore that the real objective is a sequential decision under uncertainty rather than a one-shot prediction. Good pattern recognition includes recognizing when the pattern does not fit. ### 2.3 Abstraction Abstraction is the deliberate suppression of detail. It is arguably the most important and the most difficult pillar, because the quality of an AI system is bounded by the quality of the abstractions it is built on. In machine learning, abstraction governs what we choose to represent and what we choose to ignore. A model of housing prices abstracts a physical building into a vector of features: square footage, location encoded somehow, number of bedrooms, age. Every one of those choices discards information (the smell of the kitchen, the quality of morning light) and asserts that what remains is sufficient for the prediction. The art is choosing abstractions that retain the signal relevant to the objective while discarding the rest. Abstraction also operates over models themselves. The notion of "a classifier" as a function from a feature space to a probability simplex lets us reason about logistic regression and a deep network with the same vocabulary. Layered abstraction, where each level hides the complexity of the level below, is what makes large systems comprehensible. A practitioner using an embedding model does not reason about attention weights; they reason about a function that maps text to a vector in which semantic similarity corresponds to geometric proximity. That abstraction is a contract, and like all contracts it can leak. ### 2.4 Algorithm Design Algorithm design is the specification of a precise, finite procedure that transforms inputs into the desired outputs. In AI the relevant algorithms come in two flavors that are easy to conflate. The first flavor is the learning algorithm: the procedure that consumes data and produces a model, such as stochastic gradient descent, expectation maximization, or a tree-boosting routine. The second is the inference algorithm: the procedure that consumes a trained model and a new input and produces a prediction or a decision, such as beam search over a language model or value iteration in a planner. Designing these procedures well requires the same considerations as any algorithm design: correctness (does it converge to what we want), resource cost (time and memory as functions of input size), and robustness (how it behaves on degenerate or adversarial inputs). What is distinctive about AI is that correctness is often statistical rather than exact. We do not demand that a classifier be right on every input; we demand that its expected loss over a distribution be small, and we design and analyze algorithms with that probabilistic notion of success in mind. ## 3. From Vague Goal to Well-Posed Problem ### 3.1 The Anatomy of a Well-Posed Learning Problem A real-world goal such as "reduce customer churn" is not yet a computational problem. To make it well-posed in the machine learning sense, we must specify several elements precisely. First, the input space, often written as a set of objects $x$ drawn from a domain $\mathcal{X}$. Second, the output space $\mathcal{Y}$, the set of permissible answers. Third, an unknown target relationship, usually framed as a distribution $\mathcal{D}$ over pairs $(x, y)$, that we wish to approximate. Fourth, a hypothesis class $\mathcal{H}$, the family of candidate functions we are willing to consider. Fifth, a loss function $\ell(\hat{y}, y)$ that quantifies the cost of predicting $\hat{y}$ when the truth is $y$. We can now state the object of learning precisely. For a hypothesis $h$, define its **risk** (true error) as the expected loss $$ R(h) \;=\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(h(x), y)\big]. $$ Because $\mathcal{D}$ is unknown, we cannot evaluate $R(h)$ directly. Instead we observe a finite training sample $S = \{(x_i, y_i)\}_{i=1}^{n}$ drawn independently from $\mathcal{D}$ and compute the **empirical risk** $$ \hat{R}_S(h) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i), y_i). $$ The well-posed problem is to find $h \in \mathcal{H}$ with small true risk $R(h)$, given only the ability to measure $\hat{R}_S(h)$. The standard strategy, **empirical risk minimization**, returns the hypothesis $\hat{h}_S = \arg\min_{h\in\mathcal{H}} \hat{R}_S(h)$. The entire theory of generalization is the study of when minimizing the proxy $\hat{R}_S$ also makes the target $R$ small. ::: {.callout-note} ## Definition: a well-posed learning problem A learning problem is well-posed once the tuple $(\mathcal{X}, \mathcal{Y}, \mathcal{D}, \mathcal{H}, \ell)$ is fully specified together with the sampling protocol that produces $S$. Every element is a commitment. Leaving any one of them implicit means the team is, in effect, optimizing an objective no one has written down. ::: This template, which Tom Mitchell crystallized as the idea that a program learns from experience E with respect to task T and performance measure P if its performance at T, measured by P, improves with E, forces a set of decisions that the original vague goal left implicit. What counts as a customer? Over what time window do we measure churn? Is the loss of a false negative (a churning customer we failed to flag) symmetric with the loss of a false positive? Answering these questions is most of the work, and the answers are not discoverable by any algorithm. They are commitments made by the people framing the problem. ### 3.2 Surfacing Hidden Assumptions Turning a goal into a problem is an exercise in making the implicit explicit. A few recurring questions sharpen the formulation. What is the unit of prediction? Per customer, per session, per transaction? The choice determines how data is grouped and how leakage between training and test can occur. What information is actually available at decision time? A feature that is only known after the outcome is useless for live prediction, even though it correlates beautifully in a historical table. This is the trap of target leakage, and it is most often introduced during framing, not modeling. What is the cost structure of errors? A medical screening problem and a spam filter both produce binary predictions, but their loss functions are wildly different, and that difference should drive the threshold, the metric, and even the choice of model. ### 3.3 A Framing Checklist The following non-executable sketch captures a disciplined framing pass that a practitioner can run before touching data. ```text FRAME(real_world_goal): objective <- what decision or action does this inform? unit <- what is one example? (customer, image, sentence) input_space <- what is observable at decision time? output_space <- what are we predicting? (class, value, ranking, sequence) target <- how is the ground-truth label defined and obtained? loss <- what does a wrong answer cost, asymmetrically? horizon <- one-shot prediction or sequential decision? constraints <- latency, fairness, interpretability, privacy budgets baseline <- what is the simplest thing that could possibly work? success_metric <- offline proxy AND online business measure return well_posed_problem(...) ``` The final two lines deserve emphasis. Insisting on a baseline before any sophisticated model is a guard against complexity for its own sake. Distinguishing the offline proxy metric (such as area under the ROC curve) from the online business measure (such as retained revenue) keeps the team honest about the fact that the thing we can optimize is rarely identical to the thing we care about. ## 4. The Role of Representation ### 4.1 Why Representation Is Half the Battle Once a problem is well-posed, the next decisive choice is how to represent the objects in the input space. Representation is the bridge between raw reality and the mathematics of learning, and it frequently determines success more than the choice of model. A linearly inseparable dataset becomes trivially separable under the right feature transformation; a poorly represented one defeats even the most expressive network. Classically, representation meant feature engineering: a domain expert handcrafting the variables that expose the signal. The deep learning era shifted much of this burden onto representation learning, where the model itself discovers useful intermediate representations from raw inputs. Bengio and colleagues frame this as the search for representations that disentangle the underlying factors of variation in the data. Yet the shift did not eliminate the responsibility for representation. It relocated it. Choosing to tokenize text in a particular way, to resize images to a particular resolution, or to encode a categorical variable as an embedding rather than a one-hot vector are all representational commitments that precede and shape what the model can learn. ### 4.2 Representation as an Inductive Bias Every representation encodes an inductive bias, an assumption about which inputs should be treated as similar. Representing words as one-hot vectors asserts that every word is equidistant from every other, which is why distributional embeddings, where semantically related words sit near one another, were such a leap. Representing an image as a grid of pixels processed by convolution asserts that translation should not change the meaning of a local pattern. These biases are not neutral. They are exactly what allows a model to generalize from finite data, because they constrain the hypothesis class to functions that respect the assumed structure. A computational thinker therefore chooses representations the way an algorithm designer chooses data structures: with full awareness that the choice privileges some operations and penalizes others. The right representation can turn an intractable search into an easy one, and the wrong representation can hide the answer in plain sight. ## 5. Complexity and Tractability Awareness ### 5.1 Two Kinds of Complexity AI practitioners must hold two distinct notions of complexity in mind. The first is computational complexity in the classical sense: how the time and memory required by an algorithm grow with the size of the input, expressed in big-O notation. Exact inference in a general probabilistic graphical model is NP-hard, which is precisely why approximate methods such as variational inference and sampling exist. Optimal planning over long horizons suffers the curse of dimensionality, where the state space grows exponentially in the number of variables. Recognizing that a naively posed problem is intractable is what motivates the search for structure (independence assumptions, factorization, locality) that restores tractability. The second is sample complexity: how much data is required to learn a hypothesis of a given accuracy with a given confidence. Statistical learning theory, through tools such as the Vapnik-Chervonenkis (VC) dimension, relates the richness of a hypothesis class to the number of examples needed to avoid overfitting. The VC dimension $\mathrm{VC}(\mathcal{H})$ is the size of the largest set of points that $\mathcal{H}$ can label in all possible ways, a combinatorial measure of how expressive the class is. A central result of the theory makes the trade-off quantitative: with probability at least $1 - \delta$ over the draw of a sample of size $n$, every hypothesis $h \in \mathcal{H}$ satisfies a generalization bound of the form $$ R(h) \;\le\; \hat{R}_S(h) \;+\; O\!\left(\sqrt{\frac{\mathrm{VC}(\mathcal{H})\,\log n \;+\; \log(1/\delta)}{n}}\right). $$ The structure of this inequality is the entire lesson, even setting the constants aside. The true risk we care about is bounded by the empirical risk we can measure, plus a penalty that grows with the capacity $\mathrm{VC}(\mathcal{H})$ of the hypothesis class and shrinks as the sample size $n$ grows. A richer class can drive the first term down but inflates the second, and only more data can pay that inflation back. This is the formal heart of the bias-variance trade-off, and it explains why throwing a billion-parameter model at a thousand examples is usually folly: the empirical risk can be driven to zero while the bound on true risk stays vacuous. The same inequality also explains why useful inductive biases, discussed below, are not a luxury. By shrinking the effective capacity of $\mathcal{H}$ to functions that respect the structure of the problem, they tighten the generalization gap without requiring more labels. ### 5.2 Reasoning About Tractability Before Building Complexity awareness is a design-time discipline, not a post-mortem one. Before committing to an approach, a practitioner should estimate the order of growth of both training and inference, and confront it with the operational budget. ```text estimate cost of a candidate approach: n = number of training examples d = dimensionality of representation k = number of model parameters or model "size" training_time ~ epochs * n * cost_per_example(d, k) inference_time ~ cost_per_query(d, k) must fit latency budget memory ~ f(k, batch, sequence_length) sample_need ~ grows with capacity of H do we have enough labels? if inference_time > latency_budget: reconsider model size or representation if sample_need > available labels: shrink H, add bias, or get more data ``` The point of this reasoning is not to compute exact numbers; it is to detect, early, that a proposed solution is on the wrong side of a feasibility boundary. A model that achieves excellent accuracy but cannot return a prediction within the required latency is a non-solution, and no amount of later tuning will rescue a formulation that ignored the constraint from the start. Tractability is part of correctness. ### 5.3 The Discipline of Approximation Much of modern AI is the art of trading exactness for tractability in a controlled way. Mini-batch gradient descent approximates the true gradient with a cheap, noisy estimate. Nearest-neighbor search at scale uses approximate methods that trade a small recall loss for orders-of-magnitude speedups. Sampling approximates intractable integrals. The computational thinker treats approximation not as a compromise but as a first-class design move, and reasons explicitly about the error each approximation introduces and whether that error matters for the objective. ## 6. Worked Examples of Computational Framing ### 6.1 From "Improve Search Quality" to a Ranking Problem Suppose a product team wants to "make search better." This is a goal, not a problem. Applying the framing checklist, we ask what decision the system informs: given a query, order a set of candidate documents. The unit of prediction is a query-document pair. The output is not a class but a relative ordering, which immediately signals that this is learning to rank rather than classification, importing pairwise and listwise loss functions designed for ordering. The target relationship is subtle. We rarely have absolute relevance labels, but we do have implicit signals such as clicks, which are biased toward whatever was shown at the top. Recognizing this bias is a framing insight: optimizing for raw click-through risks a feedback loop that simply reinforces past rankings. The loss must account for position bias, and the offline metric (such as normalized discounted cumulative gain) must be distinguished from the online measure (such as successful sessions). What looked like one problem decomposes into candidate retrieval (a tractability move, since scoring every document is infeasible) followed by precise re-ranking of a small set. A small numerical illustration makes the choice of metric concrete. Suppose a query returns five documents whose graded relevances, in the order the system presents them, are $(3, 2, 3, 0, 1)$ on a scale where larger is better. Discounted cumulative gain weights each relevance by a logarithmic position discount, $$ \mathrm{DCG} \;=\; \sum_{i=1}^{5} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)}. $$ For this ranking the gains are $\tfrac{7}{1.000}, \tfrac{3}{1.585}, \tfrac{7}{2.000}, \tfrac{0}{2.322}, \tfrac{1}{2.585}$, summing to roughly $12.78$. The ideal ranking sorts the same relevances as $(3, 3, 2, 1, 0)$, giving an ideal DCG of about $13.71$, so the normalized score is $\mathrm{nDCG} = 12.78 / 13.71 \approx 0.93$. The exponential numerator $2^{\mathrm{rel}_i} - 1$ rewards placing a highly relevant document near the top far more than a marginally relevant one, and the logarithmic discount encodes the empirical fact that users attend to early positions. A plain accuracy metric, which treats all positions alike, would be blind to exactly the structure that makes a ranking good. Choosing nDCG over accuracy is therefore not a tuning detail; it is part of posing the problem correctly. ### 6.2 From "Detect Anomalies" to a Well-Posed Detection Problem A platform team wants to "catch unusual behavior." The first framing question is whether labels exist. If confirmed anomalies are rare and unlabeled, this is not supervised classification; it is more naturally posed as density estimation or one-class modeling, where we learn a model of normal behavior and flag low-probability events. That recognition (a pattern-recognition act) changes everything downstream. The representation choice is decisive here. Anomalies that are obvious in one representation are invisible in another, so the framing must specify what "behavior" is measured over: per-event features, or aggregated sequences over time. The loss is profoundly asymmetric and the base rate is extreme, so the evaluation cannot rely on accuracy and must use precision at low recall or the area under the precision-recall curve. Finally, the decision policy is a separate subproblem from the scoring model: where to set the alert threshold depends on the cost of investigating a false alarm versus the cost of a missed incident, a business decision that no learning algorithm can make for us. ### 6.3 From "Build a Helpful Assistant" to a Sequential Decision Problem Consider the goal of an assistant that "helps users accomplish tasks." Framed naively as next-token prediction, the problem is supervised sequence modeling, and that framing produces a fluent but myopic system that optimizes local plausibility rather than task success. Reframing the goal reveals a horizon: helpfulness unfolds over a multi-turn interaction, which is a sequential decision problem under uncertainty about the user's intent. This reframing surfaces an objective mismatch that is central to modern practice. The quantity we can cheaply optimize (likelihood of human-written text) is a proxy for the quantity we care about (useful, truthful, harmless assistance). Bridging that gap is exactly the motivation for learning from human feedback, where a separate reward model abstracts human preference judgments into a scalar that a policy can be optimized against. The decomposition here is illuminating: a representation-learning component (the base model), a preference-abstraction component (the reward model), and a policy-improvement algorithm (reinforcement learning or a related optimizer), each a recognizable subproblem with its own theory. ## 7. Synthesis and Practice The through-line of this chapter is that the hardest and highest-leverage work in AI happens before optimization begins. Decomposition tells us where the seams of a system lie. Pattern recognition lets us see a business question as an instance of a solved problem class, and, just as importantly, lets us notice when it is not. Abstraction governs what we choose to represent and what we discard, setting a ceiling on achievable quality. Algorithm design turns the formulation into a precise procedure whose costs we can analyze. Representation, complexity, and framing are the places where these pillars meet the specific texture of machine learning. Representation determines what is learnable; complexity determines what is feasible; framing determines whether we are solving the right problem at all. A practitioner who internalizes these habits will, faced with a vague mandate, instinctively reach not for a model but for a notebook on which to write down the input space, the output space, the loss, the constraints, and the simplest baseline. The model comes later, and by then most of the important decisions have already been made well. ### 7.1 When to Lean on These Habits, and How They Fail Computational thinking pays the largest dividends at the start of a project, when the problem is still soft and a single framing decision will echo through everything built afterward. It is most valuable when the mandate arrives as a business sentence rather than a dataset, when the cost of errors is asymmetric or hard to name, when labels are scarce or themselves suspect, and when the system must run inside a real latency, memory, or fairness budget. In each of these situations the deciding move is made before any model is chosen, and a deliberate framing pass is the highest-leverage hour the team will spend. The discipline also has characteristic failure modes worth naming. The first is analysis paralysis: framing can become an excuse never to build, when in truth a quick baseline often teaches more about the problem than another round of whiteboard abstraction. The second is premature or false abstraction, forcing a problem into a familiar class (almost always supervised classification) because the toolkit is comfortable, when the true structure is sequential, unsupervised, or causal. The third is optimizing a proxy whose gap to the real objective is never revisited, so the team faithfully improves an offline metric while the online measure stagnates or declines. The fourth is treating the formulation as fixed once written, when the honest practice is to revisit $(\mathcal{X}, \mathcal{Y}, \mathcal{D}, \mathcal{H}, \ell)$ whenever the data, the costs, or the deployment context shift. The remedy for all four is the same: hold framing and building in tension, let each correct the other, and treat the problem statement as a living artifact rather than a one-time gate. Computational thinking, in the end, is the refusal to let a problem stay vague. It is the conviction that any goal worth pursuing with a machine can be sharpened, decomposed, represented, and bounded until what remains is a problem a procedure can solve, and a clear-eyed account of what that procedure can and cannot promise. ## References 1. Wing, J. M. (2006). Computational Thinking. Communications of the ACM, 49(3), 33-35. https://www.cs.cmu.edu/~15110-s13/Wing06-ct.pdf 2. Wing, J. M. (2008). Computational Thinking and Thinking about Computing. Philosophical Transactions of the Royal Society A, 366(1881), 3717-3725. https://royalsocietypublishing.org/doi/10.1098/rsta.2008.0118 3. Mitchell, T. M. (1997). Machine Learning. McGraw-Hill. https://www.cs.cmu.edu/~tom/mlbook.html 4. Domingos, P. (2012). A Few Useful Things to Know about Machine Learning. Communications of the ACM, 55(10), 78-87. https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf 5. Bengio, Y., Courville, A., & Vincent, P. (2013). Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798-1828. https://arxiv.org/abs/1206.5538 6. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. https://www.deeplearningbook.org/ 7. Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press. https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/ 8. Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press. https://mitpress.mit.edu/9780262013192/probabilistic-graphical-models/ 9. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf 10. Liu, T.-Y. (2009). Learning to Rank for Information Retrieval. Foundations and Trends in Information Retrieval, 3(3), 225-331. https://www.nowpublishers.com/article/Details/INR-016 11. Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems 35. https://arxiv.org/abs/2203.02155 12. Sculley, D., et al. (2015). Hidden Technical Debt in Machine Learning Systems. Advances in Neural Information Processing Systems 28. https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf