Fraud Frameworks & Foundations Part II: Defining & Strengthening the Core
Learning Loops, Decision Points, & Strengthening Our Core Abuse Detection Tech w/CoRDRA
In our last article on fraud frameworks, we looked at an overview of the CARTO framework, a maturity model for Digital Fraud Programs. The CARTO framework is one piece of what I covered in my BSides Knoxville presentation (that event was recorded, so if you want the benefit of a talk track check it out here) - or contact me if you'd like me to walk through this material directly with you & your team in a discussion or brainstorming session.
In the CARTO framework, we’re taking a high-level view of an anti-fraud/anti-abuse program - looking across people, process, and technology. Let’s now zoom into the heart of an anti-fraud/anti-abuse system - the components that work together to detect bad behaviors, connected into a learning loop. Every effective detection system incorporates some kind of learning loop, so that the system can adjust and be re-tuned to maintain performance even as incoming trends and techniques change.
I started talking about this design of detection systems and the importance of feedback loops back in 2012 (2012.09 A Million Mousetraps: Using Big Data and Little Loops to Build Better Defenses), and while the design has been elaborated over time (for example, adding in the connections to customer service, or third-party services/data in real time), the core has stayed pretty much the same. This Core Risk Detection Reference Architecture (CoRDRA) has also been a useful reference architecture when designing solution tech for addressing abuse cases in fintech/payments, e-commerce/merchants, social media, video games, adtech…well, pretty much everywhere that’s had abuse problems requiring automated detection and prevention solutions.
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/5746b87a-01cb-468a-b165-5a38d166dbba/image.png?t=1729788853)
Core of a Fraud Detection System - Reference Architecture (aka CoRDRA)
Here’s a high level depiction of the reference architecture (above). The stripes represent system layers (UX, Back-end, Back-office), and generally we read from left to right (and follow the arrows) to understand the flow of events happening in time.
Start with the User: UX/Front-End
The top layer of the CoRDRA represents the User Experience (UX) which we can think of as what the end-user / customer sees. For those of you who work with product managers or designers, this probably looks a little bit like a wireframe to you - and if you have wireframes, those are a great starting point. When a user/customer attempts an action of interest to our detection system, this triggers the risk evaluation.
Event attempt: The end-user attempts an event (this could be a login, a checkout, or making a post).
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0085f680-1c8d-4911-b59f-75b616bec840/image.png?t=1729874347)
UX or “Front End” layer of the Core Risk Detection Reference Architecture (CoRDRA)
All Your Base Are Belong to Us (An Opinionated Back-End drives the Front-End)
While the user is in the product flow, our detection system kicks in. The UX starts interacting with the Back-End (middle layer) of the platform (below), and whether the detection system is running in real time or near real time, it is generally “in production” and invisible to the end user.
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e19f78a9-92c3-4934-80c5-73eaf262908f/image.png?t=1729874902)
Zooming in to the “decisioning” component in CoRDRA’s Platform or “Back End” Layer
Rules/Models: The details of the event are processed by a back-end system (often running in real time as part of the flow) that will run rules/heuristics and maybe even scores/models. I imagine this as the point where the detection system asks itself “what do I think about this attempted event?”. Effective models & rules will take the context of the event into consideration, compare it against patterns associated with bad-actor activity, and also incorporate what the system “knows” about the end-user’s behavior over time.
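To make that concrete, here's a minimal sketch of the rules/scoring step in Python. Everything here - the Event fields, the rule names, the weights - is a hypothetical assumption for illustration, not a prescribed schema or any particular product's logic.

```python
from dataclasses import dataclass

@dataclass
class Event:
    """Hypothetical event payload; real systems carry far more context."""
    user_id: str
    event_type: str          # e.g. "login", "checkout", "post"
    amount: float = 0.0
    ip_country: str = ""
    account_age_days: int = 0
    txns_last_24h: int = 0   # an aggregate refreshed by the back office (more on that below)

def run_rules(event: Event) -> list[str]:
    """Return the names of any rules/heuristics that fire for this event."""
    hits = []
    if event.event_type == "checkout" and event.txns_last_24h > 10:
        hits.append("velocity_checkout_24h")
    if event.account_age_days < 1 and event.amount > 500:
        hits.append("new_account_high_value")
    if event.event_type == "login" and event.ip_country not in ("US", "CA"):
        hits.append("login_from_unusual_geo")
    return hits

def score_event(event: Event, rule_hits: list[str]) -> float:
    """Toy risk score in [0, 1]; in practice this is usually a trained model."""
    score = 0.05
    score += 0.25 * len(rule_hits)                # pattern matches against bad-actor activity
    score += min(event.amount / 10_000, 0.3)      # event context: larger amounts carry more risk
    return min(score, 1.0)
```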
Decision Strategy: There’s a companion system and question to be answered here, which is “Now that I know what I think about the attempted event ...what am I going to do about it?”. In cybersecurity, many of our detection systems are passive - they log and alert. But in user-facing platforms, we need to convert our detection into a response - and in many cases, the response will affect the customer experience.
Note: In almost all systems the Rules/Models and Decision Strategy are combined into one engine. However, there’s still (in most cases) an acknowledgement that not all decisions yield the same result - some will be approved, some approved and queued for review, some might be shown as approved to the customer but also put in a pending state until a second condition has been met.
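Continuing the same toy sketch, a decision strategy might map the score and rule hits onto differentiated outcomes, including the "looks approved to the customer but is actually held pending" case. The thresholds and outcome names below are illustrative assumptions, not recommendations.

```python
from enum import Enum

class Decision(Enum):
    APPROVE = "approve"
    APPROVE_AND_REVIEW = "approve_and_queue_for_review"
    APPROVE_PENDING = "show_approved_but_hold_pending"
    DECLINE = "decline"

def decide(score: float, rule_hits: list[str]) -> Decision:
    """Map 'what do I think about this event?' onto 'what am I going to do about it?'."""
    if "new_account_high_value" in rule_hits and score >= 0.5:
        return Decision.APPROVE_PENDING        # held until a second condition (e.g. verification) is met
    if score >= 0.8:
        return Decision.DECLINE
    if score >= 0.4:
        return Decision.APPROVE_AND_REVIEW     # approved now, but queued for back-office review
    return Decision.APPROVE
```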
It’s Not Over When It’s Over: The Importance of Post-Event Data
We flip back up to the user experience layer, as the decision made by the risk system affects what the user sees next.
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/92aa61dd-5eb5-4da0-bb60-7f368d164883/image.png?t=1729875851)
Highlighting the additional experience and telemetry post-decision, post-event.
Post-Decision UX: A simple example is “approve/allow” vs “decline/error”, which we experience in both online contexts as well as standing at a point of sale when paying with a card. More sophisticated systems might ask for additional information, or trigger some kind of out-of-band confirmation loop. Whatever the system “decides” affects the post-decision UX.
For the most part, this is where our detection system stops working in “real time”, but the post-event activity is still happening elsewhere in our production environments.
Post-Txn Experience: We get a lot of additional information after a transaction or event has taken place. For example, we might get a credit card chargeback, an incoming contact from the user that conducted the transaction - or some kind of complaint or report from another user or external system.
While some of this Post-Txn experience lives with the customer, some may also be generated through our operational work (customer service interactions, refunds, disputes, reports, complaints, etc). It’s the job of the system to capture these elements of experience as well as the initial transactions and on-platform behaviors.
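One way to think about that capture job is as a stream of post-event signals keyed back to the original decisioned event, so offline analytics can later join outcomes to decisions. The schema below is an illustrative assumption, not a standard.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PostEventSignal:
    """A single piece of post-txn experience, tied back to the original event."""
    event_id: str            # the decisioned event this feedback refers to
    source: str              # "chargeback", "customer_contact", "user_report", "refund", ...
    outcome_label: str       # e.g. "confirmed_fraud", "confirmed_good", "unknown"
    received_at: datetime

def record_signal(store: list[PostEventSignal], signal: PostEventSignal) -> None:
    """Append post-event feedback so the back office can fold it into the learning loop."""
    store.append(signal)
```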
Watch Out for the Quiet Ones: Harnessing the Silent Workhorses of Offline Analytics
Although a lot of the dynamic action happens in the top two layers, the foundation of the system is actually the bottom layer. This is where we connect the learning loop: everything we learn from the events we’ve seen, and everything we learn thereafter, gets folded back into our data platforms.
Note: especially with the advent of cloud, the barrier between “back end” and “back office” gets blurry in some organizations. Still, most organizations have a separation between tools and systems that are in production versus those that are used for analytic and other business purposes.
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/99a98dfa-0e1a-4929-84f7-99d4dc7ec15a/image.png?t=1729877024)
Offline analytics that happen in the “back office” will drive the performance of risk detection capabilities and are a foundational element of CoRDRA
Post-Event Data: This corpus of data is going to be a combination of our logs (anything from Layer 7 down, roughly, for you cyber folks), records (sometimes described as Layer 8) - and any useful contextual data available from the ecosystem and third parties (for example - IP geolocation or consortium-generated blocklists).
We can talk more about the distinction between logs and records later, but a simple example: perimeter security tech generates logs, while financial transactions and account history typically live in records. Some very, very, very old thoughts on the difference/sameness of those data types here.
As you can imagine, a database or data lake of this type is incredibly valuable to the business and can serve many purposes for many teams, but in the spirit of filling in the learning loop I want to focus on just two:
The frequent (often automated) (re)processing of the entire corpus to calculate aggregates and refresh the data needed to make the rules run, typically pushed into production systems supporting the rules/model scoring and the decision strategy (see the sketch after this list).
The regular analysis and rule writing/model building work done to refresh and improve the performance of the rules and models running in the rules engine.
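Here's a minimal sketch of the first job: periodically reprocess the corpus, recompute per-user aggregates, and push them to wherever the production rules/models read their data. The field names and the `feature_store` object are stand-ins assumed for illustration.

```python
from collections import defaultdict

def refresh_aggregates(post_event_corpus: list[dict]) -> dict[str, dict]:
    """Recompute per-user aggregates (velocity counts, chargeback history) over the full corpus."""
    aggregates: dict[str, dict] = defaultdict(lambda: {"txns_last_24h": 0, "chargebacks": 0})
    for record in post_event_corpus:
        user = record["user_id"]
        if record.get("within_last_24h"):
            aggregates[user]["txns_last_24h"] += 1
        if record.get("outcome_label") == "chargeback":
            aggregates[user]["chargebacks"] += 1
    return dict(aggregates)

def push_to_production(aggregates: dict[str, dict], feature_store) -> None:
    """Write refreshed aggregates where the real-time rules/models can read them."""
    for user_id, features in aggregates.items():
        feature_store.put(user_id, features)   # feature_store is a placeholder for your own system
```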
By injecting the data, and the analytic learnings derived from it, from the back office back into the production back-end systems, we close the learning loop - and in a healthy system, this cycle keeps looping. As bad actors change their techniques, detective defenders need to keep refining their data and rulesets. A truism is that the performance of any rule, and any model (yes, even the fancy AI ones), starts degrading as soon as it goes live. So truly - the data science behind this is forever.
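A sketch of what "the data science is forever" looks like in practice: track how each rule's precision drifts week over week, using labels that only arrive later through the post-event corpus. The field names here are entirely illustrative.

```python
def rule_precision_by_week(decisioned_events: list[dict], rule_name: str) -> dict[str, float]:
    """Precision = confirmed-bad hits / total hits, grouped by the week the rule fired."""
    hits: dict[str, int] = {}
    bad: dict[str, int] = {}
    for event in decisioned_events:
        if rule_name not in event.get("rule_hits", []):
            continue
        week = event["week"]                              # e.g. "2024-W31"
        hits[week] = hits.get(week, 0) + 1
        if event.get("outcome_label") == "confirmed_fraud":
            bad[week] = bad.get(week, 0) + 1
    # A falling precision trend is the signal to refresh the rule or retrain the model.
    return {week: bad.get(week, 0) / hits[week] for week in hits}
```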
What’s Your Point(s)?
We also start to notice the advantage of decisioning systems focused on a point problem versus a sequence or graph problem.
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/7045b779-569f-42ad-a29f-9e0c8be593f2/image.png?t=1729876253)
When decisions are driven to a point, the detection / decisioning capabilities can get really targeted and precise, and optimize for the best outcome.
With payment fraud, traditionally we were able to focus all of our detective capability on a single decision (note: when we talk about detection systems in fraud we often use the term decisioning interchangeably with detection). When you’re focused on a single point and you have a lot of contextual data and feedback-loop data (i.e. you’ve got tagged examples of “bad” behavior), you can get so precise that you can pursue optimization of decisions - for example, to maximize profitability.
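As a toy illustration of what "optimize for profitability" can mean at a single decision point: with a decent fraud-probability estimate, approval becomes an expected-value calculation. The margin and loss numbers below are made up for the example.

```python
def expected_profit_if_approved(amount: float, p_fraud: float,
                                margin: float = 0.03, loss_multiplier: float = 1.0) -> float:
    """Approve when the expected margin outweighs the expected fraud loss."""
    expected_gain = (1 - p_fraud) * amount * margin       # revenue margin on good transactions
    expected_loss = p_fraud * amount * loss_multiplier    # chargeback / write-off on bad ones
    return expected_gain - expected_loss

# Example: a $200 checkout with a 2% fraud probability
# (0.98 * 200 * 0.03) - (0.02 * 200) = 5.88 - 4.00 = 1.88 > 0, so approving is +EV
```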
With cybersecurity, one of the advantages of the perimeter model (as leaky as it was) is that it sought to drive all ingress/egress decisions to a single point, so that they could be examined by technology at the border between “inside” and “outside”.
That said, in cybersecurity we typically have never had a great feedback loop on our own data. Where we get feedback, it tends to be sparse (or incomplete), often through incident management or hunt efforts.
Also, cybersecurity problems refuse to stick to a single point - they will hop up and down the stack and across applications and systems in unpredictable ways - making it difficult to understand sequencing and context of actions across the threat surface.
In my slides I show this as a point shifting to a line (sequence) to a graph (surface) to a hot mess up and down the stack (a solid), and it definitely makes it difficult not only to introduce decisions, but to capture useful post-event data sets.
While once we could really invest in honing our decisions at the point of a critical event/transaction, what we’ve been seeing over the past decade or two is that both fraud and anti-abuse teams have seen a degradation in their ability to confine decisions to a nice, neat single point. There are a couple of reasons for this, but a big driver has been the move to digital, and account takeover (ATO) as a method for bad actors to launder their reputation/credentials by acting as more trusted users.
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/a0e2d40f-861b-4f9e-aff9-1c5a487181c5/image.png?t=1729876350)
Potential abuse events (and decision points) keep popping-up on the lifecycle.
Whose Line is It Anyway: Layers of Defense Meet Graphs of Decisions
This leads to a question: how might different teams refactor their approaches when facing a pair of threats (ATO, plus an abusive behavior like spam or payment fraud)? They can:
Add to the sequence: Add login as a standalone decision point, and load all of the detective capabilities that had been used for money movement into the authentication flow.
Double down on the primary abuse-vector detection point: Continue to reserve checkout (for example) as the primary decision point, but ensure all of the context around authentication is available at money movement (see the sketch after this list).
Mind the gap: Introduce an independent orchestration capability that tries to bridge the decision points/systems.
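Here's a sketch of the second option, under the assumption that the platform can carry authentication-time signals forward to the money-movement decision. The AuthContext fields and weights are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AuthContext:
    """Signals captured at login, carried forward to the checkout decision."""
    new_device: bool
    password_reset_within_7d: bool
    login_geo_matches_history: bool

def checkout_risk_score(base_score: float, auth: AuthContext) -> float:
    """Blend the checkout-time score with context from the earlier authentication event."""
    score = base_score
    if auth.new_device:
        score += 0.15
    if auth.password_reset_within_7d:
        score += 0.20          # a classic ATO precursor
    if not auth.login_geo_matches_history:
        score += 0.10
    return min(score, 1.0)
```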
Managing one decision point is difficult, and managing two is a bit harder - so what happens when more abuse cases pop up and one needs to start building detection capabilities lengthwise across customer flows and account lifecycles? And then also crosswise between products and environments?
The benefit of a mature learning loop is that the approach is extensible - meaning it can expand to cover a number of decision points, or be repeated so there’s a mini decisioning system/strategy for each point of interest. However, the added dimensionality of a decisioning strategy increases complexity and makes optimization harder - which is what we’re seeing in fraud specifically, and in anti-abuse systems in general.
Now that we’ve given ourselves an overview of the design, care, and feeding of detection/decisioning systems, we can start drilling down again to understand implementation. Specifically, we can start examining how this design is applied to different abuse cases/threat vectors - whether singly or in a sequence. Come back next time for fun with threat modeling at Layer 8.
The CoRDRA has been a useful reference architecture when designing solution tech for addressing abuse cases (everything from spam and fraud over to more cybersecurity use cases like account takeover and phishing/malware detection), and I find I’ve been using it more and more to help teams with both threat modeling and solution design. If you or your team would like to discuss this topic in more depth or collaborate on incorporating this model into your team’s tech and operational roadmaps, contact me directly at [email protected].
We are grateful that you read this far, hope you enjoy this set of lab notes, and appreciate if you spread the word about what we’re doing at the Futurecast. Like all learning systems, Cartomancy Labs loves feedback. Let us know - do you have questions? Got an opinion on topics we should cover? Have an idea for a better set of acronyms or puns? Don’t leave those points hanging, drop us a line.