Research Paper

Standalone web page / LaTeX comparison / Full source

This document is a Nota-fied version of a programming languages research paper, "Modular Information Flow through Ownership" published at PLDI 2022. This example includes:

Academic paper structure (sections, subsections, figures, abstracts)
Academic paper style (@nota-lang/nota-theme-acm)
Math (inline TeX, block TeX, theorems, definitions)
Charts (Vega-Lite)

Modular Information Flow through Ownership

Will Crichton, Stanford University

Marco Patrignani, University of Trento

Maneesh Agrawala, Stanford University

Pat Hanrahan, Stanford University

Statically analyzing information flow, or how data influences other data within a program, is a challenging task in imperative languages. Analyzing pointers and mutations requires access to a program's complete source. However, programs often use pre-compiled dependencies where only type signatures are available. We demonstrate that ownership types can be used to soundly and precisely analyze information flow through function calls given only their type signature. From this insight, we built Flowistry, a system for analyzing information flow in Rust, an ownership-based language. We prove the system's soundness as a form of noninterference using the Oxide formal model of Rust. Then we empirically evaluate the precision of Flowistry, showing that modular flows are identical to whole-program flows in 94% of cases drawn from large Rust codebases. We illustrate the applicability of Flowistry by using it to implement prototypes of a program slicer and an information flow control system.

1 Introduction

Information flow describes how data influences other data within a program. Information flow has applications to security, such as information flow control [ $Sabelfeld and Myers 2003$ ], and to developer tools, such as program slicing [ $Weiser 1981$ ]. Our goal is to build a practical system for analyzing information flow, meaning:

Applicable to common language features: the language being analyzed should support widely used features like pointers and in-place mutation.
Zero configuration to run on existing code: the analyzer must integrate with an existing language and existing unannotated programs. It must not require users to adopt a new language designed for information flow.
No dynamic analysis: to reduce integration challenges and costs, the analyzer must be purely static — no modifications to runtimes or binaries are needed.
Modular over dependencies: programs may not have source available for dependencies. The analyzer must have reasonable precision without whole-program analysis.

As a case study on the challenges imposed by these requirements, consider analyzing the information that flows to the return value in this C++ function:

// Copy elements 0 to max into a new vector

vector<int> copy_to(vector<int>& v, size_t max) {

vector<int> v2; size_t i = 0;

for (auto x(v.begin()); x != v.end(); ++x) {

if (i == max) { break; }

v2.push_back(*x); ++i;

}

return v2;

}

Here, a key flow is that v2 is influenced by v: (1) push_back mutates v2 with *x as input, and (2) x points to data within v. But how could an analyzer statically deduce these facts? For C++, the answer is by looking at function implementations. The implementation of push_back mutates v2, and the implementation of begin returns a pointer to data in v.

However, analyzing such implementations violates our fourth requirement, since these functions may only have their type signature available. In C++, given only a function's type signature, not much can be inferred about its behavior, since the type system does not contain information relevant to pointer analysis.

Our key insight is that ownership types can be leveraged to modularly analyze pointers and mutation using only a function's type signature. Ownership has emerged from several intersecting lines of research on linear logic [ $Jean-Yves Girard 1987$ ], class-based alias management [ $Clarke et al. 1998$ ], and region-based memory management [ $Grossman et al. 2002$ ]. The fundamental law of ownership is that data cannot be simultaneously aliased and mutated. Ownership-based type systems enforce this law by tracking which entities own which data, allowing ownership to be transferred between entities, and flagging ownership violations like mutating immutably-borrowed data.

Today, the most popular ownership-based language is Rust. Consider the information flows in this Rust implementation of copy_to:

fn copy_to(v: &Vec<i32>, max: usize) -> Vec<i32> {

let mut v2 = Vec::new();

for (i, x) in v.iter().enumerate() {

if i == max { break; }

v2.push(*x);

}

return v2;

}

Focus on the two methods push and iter. For a Vec<i32>, these methods have the following type signatures:

fn push(&mut self, value: i32);

fn iter<'a>(&'a self) -> Iter<'a, i32>;

To determine that push mutates v2, we leverage mutability modifiers. All references in Rust are either immutable (i.e. the type is &T) or mutable (the type is &mut T). Therefore iter does not mutate v because it takes &self as input (excepting interior mutability, discussed in Section 4.3), while push may mutate v2 because it takes &mut self as input.

To determine that x points to v, we leverage lifetimes. All references in Rust are annotated with a lifetime, either explicitly (such as 'a) or implicitly. Shared lifetimes indicate aliasing: because &self in iter has lifetime 'a, and because the returned Iter structure shares that lifetime, then we can determine that Iter may contain pointers to self.

Inspired by this insight, we built Flowistry, a system for analyzing information flow in the safe subset of Rust programs. Flowistry satisfies our four design criteria: (1) Rust supports pointers and mutation, (2) Flowistry does not require any change to the Rust language or to Rust programs, (3) Flowistry is a purely static analysis, and (4) Flowistry uses ownership types to analyze function calls without needing their definition. This paper presents a theoretical and empirical investigation into Flowistry in five parts:

We provide a precise description of how Flowistry computes information flow by embedding its definition within Oxide [ $Weiss et al. 2019$ ], a formal model of Rust (Section 2).
We prove the soundness of our information flow analysis as a form of noninterference (Section 3).
We describe the implementation of Flowistry that bridges the theory of Oxide to the practicalities of Rust (Section 4).
We evaluate the precision of the modular analysis on a dataset of large Rust codebases, finding that modular flows are identical to whole-program flows in 94% of cases, and are on average 7% larger in the remaining cases (Section 5).
We demonstrate the utility of Flowistry by using it to prototype a program slicer and an IFC checker (Section 6).

We conclude by presenting related work (Section 7) and discussing future directions for Flowistry (Section 8). Due to space constraints, we omit many formal details, all auxiliary lemmas, and all proofs. The interested reader can find them in the appendix. Flowistry and our applications of it are publicly available, open-source, MIT-licensed projects at https://github.com/willcrichton/flowistry.

2 Analysis

Inspired by the dependency calculus of $Abadi et al. [1999]$ , our analysis represents information flow as a set of dependencies for each variable in a given function. The analysis is flow-sensitive, computing a different dependency set at each program location, and field-sensitive, distinguishing between dependencies for fields of a data structure.

While the analysis is implemented in and for Rust, our goal here is to provide a description of it that is both concise (for clarity of communication) and precise (for amenability to proof). We therefore base our description on Oxide [ $Weiss et al. 2019$ ], a formal model of Rust. At a high level, Oxide provides three ingredients:

A syntax of Rust-like programs with expressions $e$ and types $τ$ .
A type-checker, expressed with the judgment $Σ; Δ; Γ ⊢ e : τ \Rightarrow Γ^{'}$ using the contexts $Γ$ for types and lifetimes, $Γ$ for type variables, and $Γ$ for global functions.
An interpreter, expressed by a small-step operational semantics with the judgment $Σ ⊢ (σ; e) \to (σ^{'}; e^{'})$ using $σ$ for a runtime stack.

We extend this model by assuming that each expression in a program is automatically labeled with a unique location $ℓ$ . Then for a given expression $e$ , our analysis computes the set of dependencies $κ ::= {\overline{ℓ}}$ . Because expressions have effects on persistent memory, we further compute a dependency context $Θ ::= {\overline{p \mapsto κ}}$ from memory locations $p$ to dependencies $κ$ . The computation of information flow is intertwined with type-checking, represented as a modified type-checking judgment (additions highlighted in red):

Σ; Δ; Γ; Θ ⊢ e_{ℓ} : τ ∙ κ \Rightarrow Γ^{'}; Θ^{'}

This judgment is read as, "with type contexts $Σ, Δ, Γ$ and dependency context $Θ$ , $e$ at location $ℓ$ has type $τ$ and dependencies $κ$ , producing a new dependency context $Θ^{'}$ ."

Oxide is a large language — describing every feature, judgment, and inference rule would exceed our space constraints. Instead, in this section we focus on a few key rules that demonstrate the novel aspects of our system. We first lay the foundations for dealing with variables and mutation (Section 2.1), and then describe how we modularly analyze references (Section 2.2) and function calls (Section 2.3). The remaining rules can be found in the appendix.

2.1 Variables and Mutation

The core of Oxide is an imperative calculus with constants and variables. The abstract syntax for these features is below:

Variable Number Path Place Constant Base Type Sized Type Expression x n q π c τ^{B} τ^{SI} e ::= ::= ::= ::= ::= ::= ∣ ε ∣ x . q ∣ x . q ∣ () ∣ n ∣ true ∣ false ∣ unit ∣ u32 ∣ \dots ∣ τ^{B} ∣ (τ^{SI}_{1}, \dots, τ^{SI}_{n}) ∣ \dots ∣ c ∣ π ∣ let x : τ^{SI} = e_{1}; e_{2} ∣ π := e ∣ e_{1}; e_{2}

Constants are Oxide's atomic values and also the base-case for information flow. A constant's dependency is simply itself, expressed through the T-u32 rule:


	T-u32
$Σ; Δ; Γ; Θ ⊢ n_{ℓ} : u32 ∙ κ \Rightarrow Γ; Θ$

Variables and mutation are introduced through let-bindings and assignment expressions, respectively. For example, this (location-annotated) program mutates a field of a tuple:

let t : (u32, u32) = (1_{ℓ_{1}}, 2_{ℓ_{2}}); t .1 := 3_{ℓ_{3}}

Here, $t$ is a variable and $t .1$ is a place, or a description of a specific region in memory. For information flow, the key idea is that let-bindings introduce a set of places into $Θ$ , and then assignment expressions change a place's dependencies within $Θ$ . In the above example, after binding $t$ , then $Θ$ is:

Θ = {t, t .0, t .1 \mapsto {ℓ_{1}, ℓ_{2}}}

After checking " $t .1 := 3$ ", then $ℓ_{3}$ is added to $Θ (t)$ and $Θ (t .1)$ , but not $Θ (t .0)$ . This is because the values of $t$ and $t .1$ have changed, but the value of $t .0$ has not. Formally, the let-binding rule is:

$Σ; Δ; Γ; Θ ⊢ e_{1} : τ^{SI}_{1} ∙ κ_{1} \Rightarrow Γ; Θ_{1}$ $Δ; Γ_{1} ⊢ τ^{SI}_{1} ≲ τ^{SI}_{a} \Rightarrow Γ_{1}^{'}$ $Θ_{1}^{'} = Θ_{1} [\forall π^{□} [x] . π \mapsto κ_{1}]$ $Σ; Δ; gc - loans (Γ_{1}^{'}, x : τ^{SI}_{a}); Θ_{1}^{'} ⊢ e_{2} : τ^{SI}_{2} ∙ κ_{2} \Rightarrow Γ_{2}, x : τ^{SD}; Θ_{2}$
	T-let
$Σ; Δ; Γ; Θ ⊢ let x : τ^{SI}_{a} = e_{1}; e_{2} : τ^{SI}_{2} ∙ κ_{2} \Rightarrow Γ_{2}; Θ_{2}$

Again, this rule (and many others) contain aspects of Oxide that are not essential for understanding information flow such as the subtyping judgment $τ^{SI}_{1} ≲ τ^{SI}_{a}$ or the metafunction $gc - loans$ . For brevity we will not cover these aspects here, and instead refer the interested reader to $Weiss et al. [2019]$ . We have deemphasized (in grey) the judgments which are not important to understanding our information flow additions.

The key concept is the formula $Θ_{1} [\forall π^{□} [x] . π \mapsto κ_{1}]$ . This introduces two shorthands: first, $π^{□} [x]$ means "a place $π$ with root variable $x$ in a context $π^{□}$ ", used to decompose a place. In T-let, the update to $Θ_{1}$ happens for all places with a root variable $x$ . Second, $Θ_{1} [π \mapsto κ_{1}]$ means "set $π$ to $κ_{1}$ in $Θ_{1}$ ". So this rule specifies that when checking $e_{2}$ , all places within $x$ are initialized to the dependencies $Θ_{1}$ of $e_{1}$ .

Next, the assignment expression rule is defined as updating all the conflicts of a place $π$ :

$Σ; Δ; Γ; Θ ⊢ e : τ^{SI} ∙ κ \Rightarrow Γ_{1}; Θ_{1}$ $Γ_{1} (π) = τ^{SX}$ $(τ^{SX} = τ^{SD} \lor Δ; Γ_{1} ⊢_{uniq} π \Rightarrow {^{uniq} π})$ $Δ; Γ_{1} ⊢ τ^{SI} ≲ τ^{SX} \Rightarrow Γ^{'}$ $Θ_{2} = Θ_{1} [update - conflicts (Θ_{1}, π, κ)]$
	T-assign
$Σ; Δ; Γ; Θ ⊢ π := e : unit ∙ \emptyset \Rightarrow Γ^{'} [π \mapsto τ^{SI}] ⊳ π; Θ_{2}$

If you conceptualize a type as a tree and a path as a node in that tree, then a node's conflicts are its ancestors and descendants (but not siblings). Semantically, conflicts are the set of places whose value change if a given place is mutated. Recall from the previous example that $t .1$ conflicts with $t$ and $t .1$ , but not $t .0$ . Formally, we say two places are disjoint ( $#$ ) or conflict ( $⊓$ ) when:

x_{1} . q_{1} # x_{2} . q_{2} π_{1} ⊓ π_{2} = def x_{1} \neq = x_{2} \lor (= def \neg (π_{1} # π_{2}) (q_{1} is not a prefix of q_{2}) \land (q_{2} is not a prefix of q_{1}))

Then to update a place's conflicts in $Θ$ , we define the metafunction $update - conflicts$ to add $κ$ to all conflicting places $p^{'}$ . (Note that this rule is actually defined over place expressions $p$ , which are explained in the next subsection.)

update - conflicts (Θ, p, κ) = def \forall p^{'} \mapsto κ_{p^{'}} \in Θ_{cfl} . p^{'} \mapsto κ_{p^{'}} \cup κ where Θ_{cfl} = {p^{'} \mapsto κ_{p^{'}} \in Θ ∣ p ⊓ p^{'}}

Finally, the rule for reading places is simply to look up the place's dependencies in $Θ$ :

$Δ; Γ ⊢_{uniq} π \Rightarrow {^{uniq} π}$ $Γ (π) = τ^{SI}$ $noncopyable_{Σ} τ^{SI}$
	T-move
$Σ; Δ; Γ; Θ ⊢ π : τ^{SI} ∙ Θ (π) \Rightarrow Γ [π \mapsto τ^{SI}^{†}]; Θ$

2.2 References

Beyond concrete places in memory, Oxide also contains references that point to places. As in Rust, these references have both a lifetime (called a "provenance") and a mutability qualifier (called an "ownership qualifier"). Their syntax is:

Concrete Provenance Abstract Provenance Place Expression Provenance Ownership Qualifier Sized Type Expression r ϱ p ρ ω τ^{SI} e ::= ::= ::= ::= ::= ∣ x ∣ p . n ∣ * p ∣ r ∣ ϱ ∣ shrd ∣ uniq ∣ \dots ∣ & ρ ω τ^{XI} ∣ \dots ∣ & r ω e ∣ * e ∣ p := e ∣ letprov ⟨ r ⟩ e

Provenances are created via a $letprov$ expression, and references are created via a borrow expression $& r ω p$ that has an initial concrete provenance $r$ (abstract provenances are just used for types of function parameters). References are used in conjunction with place expressions $p$ that are places whose paths contain dereferences. For example, this program creates, reborrows, and mutates a reference:

letprov ⟨ r_{1}, r_{2}, r_{3}, r_{4} ⟩ let x : (u32, u32) = (0, 0); let y : & r_{2} uniq (u32, u32) = & r_{1} uniq x; let z : & r_{4} uniq u32 = & r_{3} uniq (* y) .1; * z := 1_{ℓ}

Consider the information flow induced by $* z := 1_{ℓ}$ . We need to compute all places that $z$ could point-to, in this case $x .1$ , so $ℓ$ can be added to the conflicts of $x .1$ . Essentially, we must perform a pointer analysis [ $Smaragdakis and Balatsouras 2015$ ].

The key idea is that Oxide already does a pointer analysis! Performing one is an essential task in ensuring ownership-safety. All we have to do is extract the relevant information with Oxide's existing judgments. This is represented by the information flow extension to the reference-mutation rule:

$Σ; Δ; Γ; Θ ⊢ e : τ^{SI}_{n} ∙ κ \Rightarrow Γ_{1}; Θ_{1}$ $Δ; Γ_{1} ⊢_{uniq} p : τ^{SI}_{o}$ $Δ; Γ_{1} ⊢_{uniq} p \Rightarrow {\overline{l}}$ $Δ; Γ_{1} ⊢ τ^{SI}_{n} ≲ τ^{SI}_{o} \Rightarrow Γ^{'}$ $Θ_{2} = Θ_{1} [\forall^{ω} p^{'} \in {\overline{l}} . update - conflicts (Θ_{1}, p^{'}, κ)]$
	T-assignderef
$Σ; Δ; Γ; Θ ⊢ p := e : unit ∙ \emptyset \Rightarrow Γ^{'} ⊳ p; Θ_{2}$

Here, the important concept is Oxide's ownership safety judgment: $Δ; Γ ⊢_{ω} p \Rightarrow {\overline{l}}$ , read as "in the contexts $Δ$ and $Γ$ , $p$ can be used $ω$ -ly and points to a loan in ${\overline{l}}$ ." A loan $l ::=^{ω} p$ is a place expression with an ownership-qualifier. In Oxide, this judgment is used to ensure that a place is used safely at a given level of mutability. For instance, in the example at the top of this column, if $* z := 1$ was replaced with $x .1 := 1$ , then this would violate ownership-safety because $x$ is already borrowed by $y$ and $z$ .

In the example as written, the ownership-safety judgment for $* z$ would compute the loan set:

{\overline{l}} = {^{uniq} (* z),^{uniq} (* y) .1,^{uniq} x .1}

Note that $x .1$ is in the loan set of $* z$ . That suggests the loan set can be used as a pointer analysis. The complete details of computing the loan set can be found in $Weiss et al. [2019, p. 12]$ , but the summary for this example is:

Checking the borrow expression " $& r_{1} uniq x$ " gets the loan set for $x$ , which is just ${^{uniq} x}$ , and so sets $Γ (r_{1}) = {^{uniq} x}$ .
Checking the assignment " $y = & r_{1} uniq x$ " requires that $& r_{1} uniq (u32, u32)$ is a subtype of $& r_{2} uniq (u32, u32)$ , which requires that $r_{1}$ "outlives" $r_{2}$ , denoted $r_{1} :> r_{2}$ .
The constraint $r_{1} :> r_{2}$ adds $Γ (r_{1})$ to $Γ (r_{2})$ , so $Γ (r_{2}) = {^{uniq} x}$ .
Checking " $& r_{3} uniq (* y) .1$ " gets the loan set for $(* y) .1$ , which is:
${^{uniq} p .1 ∣^{uniq} p \in Γ (r_{2})} \cup {^{uniq} (* y) .1} = {^{uniq} x .1,^{uniq} (* y) .1}$
That is, the loans for $r_{2}$ are looked up in $Γ$ (to get ${x}$ ), and then the additional projection $_.1$ is added on-top of each loan (to get ${x .1}$ ).
Then $Γ (r_{4}) = Γ (r_{3})$ because $r_{3} :> r_{4}$ .
Finally, the loan set for $* z$ is:
$Γ (r_{4}) \cup {^{uniq} (* z)} = {^{uniq} x .1,^{uniq} (* y) .1,^{uniq} (* z)}$

Applying this concept to the T-assignderef rule, we compute information flow for reference-mutation as: when mutating $p$ with loans ${\overline{l}}$ , add $κ_{e}$ to all the conflicts for every loan $^{uniq} p^{'} \in {\overline{l}}$ .

2.3 Function Calls

Finally, we examine how to modularly compute information flow through function calls, starting with syntax:

Frame Var Function Type Var Expression Global Entries Global Environment φ f α e ε Σ ::= ::= ::= ∣ \dots ∣ f ⟨ \overline{Φ}, \overline{ρ}, \overline{τ^{SI}} ⟩ (π) ∣ fn f ⟨ \overline{ψ}, \overline{ϱ}, \overline{α} ⟩ (x : τ^{SI}) \to τ^{SI} where \overline{ϱ_{1} : ϱ_{2}} {e} ∣ ∙ ∣ Σ, ε

Oxide functions are parameterized by frame variables $φ$ (for closures), abstract provenances $ϱ$ (for provenance polymorphism), and type variables $α$ (for type polymorphism). Unlike Oxide, we restrict to functions with one argument for simplicity in the formalism. Calling a function $f$ requires an argument $π$ and any type-level parameters $Φ, ρ$ and $τ$ .

The key question is: without inspecting its definition, what is the most precise assumption we can make about a function's information flow while still being sound? By "precise" we mean "if the analysis says there is a flow, then the flow actually exists", and by "sound" we mean "if a flow actually exists, then the analysis says that flow exists." For example consider this program:

fn f ⟨ ϱ_{1}, ϱ_{2} ⟩ (x : (& ϱ_{1} uniq u32, & ϱ_{2} shrd u32)) {???} let x : u32 = 1_{ℓ_{1}}; let y : u32 = 2_{ℓ_{2}}; letprov ⟨ r_{1}, r_{2} ⟩ let t : (& r_{1} uniq u32, & r_{2} shrd u32) = (& r_{1} uniq x, & r_{2} shrd y); f ⟨ r_{1}, r_{2} ⟩ (t)

First, what can $f (t)$ mutate? Any data behind a shared reference is immutable, so only $* t .0$ could possibly be mutated, not $* t .1$ . More generally, the argument's transitive mutable references must be assumed to be mutated.

Second, what are the inputs to the mutation of $* t .0$ ? This could theoretically be any possible value in the input, so both $* t .0$ and $* t .1$ . More generally, every transitively readable place from the argument must be assumed to be to be an input to the mutation. So in this example, a modular analysis of the information flow from calling $cp$ would add ${ℓ_{1}, ℓ_{2}}$ to $Θ (x)$ but not $Θ (y)$ .

To formalize these concepts, we first need to describe the transitive references of a place. The $ω - refs (p, τ)$ metafunction computes a place expression for every reference accessible from $p$ . If $ω = uniq$ then this just includes unique references, otherwise it includes unique and shared ones.

ω - refs (p, τ^{B}) ω - refs (p, (τ^{SI}_{1}, \dots, τ^{SI}_{n})) ω - refs (p, & ρ ω^{'} τ^{XI}) = \emptyset = i ⋃ ω - refs (p . i, τ^{SI}_{i}) = {{* p} \cup ω - refs (* p, τ^{XI}) \emptyset if ω ≲ ω^{'} otherwise

Here, $ω ≲ ω^{'}$ means "a loan at $ω$ can be used as a loan at $ω^{'}$ ", defined as $uniq \neq ≲ shrd$ and $ω ≲ ω^{'}$ otherwise. Then $ω - loans (p, τ, Δ, Γ)$ can be defined as the set of concrete places accessible from those transitive references:

ω - loans (p, τ, Δ, Γ) = def p_{1} \in ω - refs (p, τ) ⋃ {p_{2} ∣^{ω} p_{2} \in {\overline{l}}} where Δ; Γ ⊢_{ω} p_{1} \Rightarrow {\overline{l}}

Finally, the function application rule can be revised to include information flow as follows:

$\overline{Σ; Δ; Γ ⊢ Φ}$ $\overline{Δ; Γ ⊢ ρ}$ $\overline{Σ; Δ; Γ ⊢ τ^{SI}}$ $Σ (f) = fn f ⟨ \overline{φ}, \overline{ϱ}, \overline{α} ⟩ (π : τ^{SI}_{a}) \to τ^{SI}_{r} where \overline{ϱ_{1} :> ϱ_{2}} {e}$ $Σ; Δ; Γ; Θ ⊢ π : τ^{SI}_{a} \overline{[Φ / φ]} \overline{[ρ / ϱ]} \overline{[τ^{SI} / α]} ∙ κ \Rightarrow Γ_{1}; Θ$ $Δ; Γ_{1} ⊢ \overline{ϱ_{2} \overline{[ρ / ϱ]} :> ϱ_{1} \overline{[ρ / ϱ]}}$ $κ_{arg} = κ \cup ⋃_{p \in shrd - loans (π, τ^{SI}, Δ, Γ_{2})} Θ (p)$ $Θ^{'} = Θ [\forall p \in uniq - loans (π, τ^{SI}, Δ, Γ_{2}) . update - conflicts (Θ, p, κ_{arg})]$
	T-app
$Σ; Δ; Γ; Θ ⊢ f ⟨ \overline{Φ}, \overline{ρ}, \overline{τ^{SI}} ⟩ (π) : τ^{SI}_{r} \overline{[Φ / φ]} \overline{[ρ / ϱ]} \overline{[τ^{SI} / α]} ∙ κ_{arg} \Rightarrow Γ_{2}; Θ^{'}$

The collective dependencies of the input $π$ are collected into $κ_{arg}$ , and then every unique reference is updated with $κ_{arg}$ . Additionally, the function's return value is assumed to be influenced by any input, and so has dependencies $κ_{arg}$ .

Note that this rule does not depend on the body $e$ of the function $f$ , only its type signature in $Σ$ . This is the key to the modular approximation. Additionally, it means that this analysis can trivially handle higher-order functions. If $f$ were a parameter to the function being analyzed, then no control-flow analysis is needed to guess its definition.

Remaining grammar rules ⬇

Maybe Dead Type Dead Type Maybe Unsized Type Type Frame Expression Frame Typing τ^{SX} τ^{SD} τ^{XI} τ Φ F ::= ::= ::= ::= ::= ::= ∣ τ^{SI} ∣ τ^{SD} ∣ (τ^{SX}_{1}, \dots, τ^{SX}_{n}) ∣ τ^{SI}^{†} ∣ (τ^{SD}_{1}, \dots, τ^{SD}_{n}) ∣ τ^{SI} ∣ [τ^{SI}] ∣ τ^{XI} ∣ τ^{SX} ∣ φ ∣ F ∣ ∙ ∣ F, x : τ^{SX} ∣ F, r \mapsto {\overline{l}}

3 Soundness

To characterize the correctness of our analysis, we seek to prove its soundness: if a true information flow exists in a program, then the analysis computes that flow. The standard soundness theorem for information flow systems is noninterference [ $Goguen and Meseguer 1982$ ]. At a high level, noninterference states that for a given program and its dependencies, and for any two execution contexts, if the dependencies are equal between contexts, then the program will execute with the same observable behavior in both cases. For our analysis, we focus just on values produced by the program, instead of other behaviors like termination or timing.

To formally define noninterference within Oxide, we first need to explore Oxide's operational semantics. Oxide programs are executed in the context of a stack of frames that map variables to values:

Value Stack Stack Frame v σ ς ::= ::= ∣ ∙ ∣ σ ♮ ς ∣ ∙ ∣ ς, x \mapsto v

For example, in the empty stack $∙$ , the expression " $let x : u32 = 1; x := 2$ " would first add $x \mapsto 1$ to the stack. Then executing $x := 2$ would update $σ (x) = 2$ . More generally, we use the shorthand $σ (p)$ to mean ``reduce $p$ to a concrete location $π$ , then look up the value of $π$ in $σ$ .''

The key ingredient for noninterference is the equivalence of dependencies between stacks. That is, for two stacks $σ_{1}$ and $σ_{2}$ and a set of dependencies $κ$ in a context $Θ$ , we say those stacks are the same up to $κ$ if all $p$ with $Θ (p) \subseteq κ$ are the same between stacks. Formally, the dependencies of $κ$ and equivalence of heaps are defined as:

deps (Θ, κ) σ_{1} \sim_{P} σ_{2} σ_{1} \sim_{κ}^{Θ} σ_{2} = def {p ∣ p \mapsto κ_{p} \in Θ \land κ_{p} \subseteq κ} = def \forall p \in P . σ_{1} (p) = σ_{2} (p) = def σ_{1} \sim_{deps (Θ, κ)} σ_{2}

Then we define noninterference as follows:

Theorem 1: Noninterference

Let $e$ such that:

Σ; ∙; Γ; Θ ⊢ e : τ ∙ κ \Rightarrow Γ^{'}; Θ^{'}

For $i \in {1, 2}$ , let $σ_{i}$ such that:

Σ ⊢ σ_{i} : Γ and Σ ⊢ (σ_{i}; e) * (σ_{i}^{'}; v_{i})

Then:

$σ_{1} \sim_{κ}^{Θ} σ_{2} ⟹ v_{1} = v_{2}$
$\forall p \mapsto κ_{p} \in Θ^{'} . σ_{1} \sim_{κ_{p}}^{Θ} σ_{2} ⟹ σ_{1}^{'} (p) = σ_{2}^{'} (p)$

This theorem states that given a well-typed expression $e$ and corresponding stacks $σ_{i}$ , then its output $v_{i}$ should be equal if the expression's dependencies $κ$ are initially equal. Moreover, for any place expression $p$ , if its dependencies in the output context $Θ^{'}$ are initially equal then the stack value will be the same after execution.

Note that the context $Δ$ is required to be empty because an expression $e$ can only evaluate if it does not contain abstract type or provenance variables. The judgment $Σ ⊢ σ_{i} : Γ$ means "the stack $σ_{i}$ is well-typed under $σ$ and $Γ$ ". That is, for all places $π$ in $Γ$ , then $π \in σ$ and $σ (π)$ has type $Γ (π)$ .

The proof of Theorem 1, found in the appendix, guarantees that we can soundly compute information flow for Oxide programs.

4 Implementation

Our formal model provides a sound theoretical basis for analyzing information flow in Oxide. However, Rust is a more complex language than Oxide, and the Rust compiler uses many intermediate representations beyond its surface syntax. Therefore in this section, we describe the key details of how our system, Flowistry, bridges theory to practice. Specifically:

Rust computes lifetime-related information on a control-flow graph (CFG) program representation, not the high-level AST. So we translate our analysis to work for CFGs (Section 4.1).
Rust does not compute the loan set for lifetimes directly like in Oxide. So we must reconstruct the loan sets given the information exported by Rust (Section 4.2).
Rust contains escape hatches for ownership-unsafe code that cannot be analyzed using our analysis. So we describe the situations in which our analysis is unsound for Rust programs (Section 4.3).

4.1 Analyzing Control-Flow Graphs

fn get_count(

h: &mut HashMap<String, u32>,

k: String

) -> u32 {

if !h.contains_key(&k) {

h.insert(k, 0); 0

} else {

*h.get(&k).unwrap()

}

Figure 1:

Example of how Flowistry computes information flow. On the left is a Rust function get_count that finds a value in a hash map for a given key, and inserts 0 if none exists. On the right get_count is lowered into Rust's MIR control-flow graph, annotated with information flow. Each rectangle is a basic block, named at the top. Arrows indicate control flow (panics omitted).

Beside each instruction is the result of the information flow analysis, which maps place expressions to locations in the CFG (akin to $Θ$ in Section 2). For example, the insert function call adds dependencies to *h because it is assumed to be mutated, since it is a mutable reference. Additionally, the switch instructions and _4 variable are added as dependencies to h because the call to insert is control-dependent on the switch.

The Rust compiler lowers programs into a "mid-level representation", or MIR, that represents programs as a control-flow graph. Essentially, expressions are flattened into sequences of instructions (basic blocks) which terminate in instructions that can jump to other blocks, like a branch or function call. Figure 1 shows an example CFG and its information flow.

To implement the modular information flow analysis for MIR, we reused standard static analysis techniques for CFGs, i.e., a flow-sensitive, forward dataflow analysis pass where:

At each instruction, we maintain a mapping from place expressions to a set of locations in the CFG on which the place is dependent, comparable to $Θ$ in Section 2.
A transfer function updates $Θ$ for each instruction, e.g. $p := e$ follows the same rules as in T-assignderef by adding the dependencies of $e$ to all conflicts of aliases of $p$ .
The input $Θ^{in}$ to a basic block is the join of each of the output $Θ_{i}^{out}$ for each incoming edge, i.e. $Θ^{in} = ⋁_{i} Θ_{i}^{out}$ . The join operation is key-wise set union, or more precisely:
$Θ_{1} \lor Θ_{2} = def {x \mapsto Θ_{1} (x) \cup Θ_{2} (x) ∣ x \in Θ_{1} \lor x \in Θ_{2}}$
We iterate this analysis to a fixpoint, which we are guaranteed to reach because $⟨ Θ, \lor ⟩$ forms a join-semilattice.

To handle indirect information flows via control flow, such as the dependence of h on contains_key in Figure 1, we compute the control-dependence between instructions. We define control-dependence following $Ferrante et al. [1987]$ : an instruction $Y$ is control-dependent on $X$ if there exists a directed path $P$ from $X$ to $Y$ such that any $Z$ in $P$ is post-dominated by $Y$ , and $X$ is not post-dominated by $Y$ . An instruction $X$ is post-dominated by $Y$ if $Y$ is on every path from $X$ to a return node. We compute control-dependencies by generating the post-dominator tree and frontier of the CFG using the algorithms of $Cooper et al. [2001]$ and $Cytron et al. [1989]$ , respectively.

Besides a return, the only other control-flow path out of a function in Rust is a panic. For example, each function call in Figure 1 actually has an implicit edge to a panic node (not depicted). Unlike exceptions in other languages, panics are designed to indicate unrecoverable failure. Therefore we exclude panics from our control-dependence analysis.

4.2 Computing Loan Sets from Lifetimes

To verify ownership-safety (perform "borrow-checking"), the Rust compiler does not explicitly build the loan sets of lifetimes (or provenances in Oxide terminology). The borrow checking algorithm performs a sort of flow-sensitive dataflow analysis that determines the range of code during which a lifetime is valid, and then checks for conflicts e.g. in overlapping lifetimes (see the non-lexical lifetimes RFC [ $Niko Matsakis 2017$ ].

However, Rust's borrow checker relies on the same fundamental language feature as Oxide to verify ownership-safety: outlives-constraints. For a given Rust function, Rust can output the set of outlives-constraints between all lifetimes in the function. These lifetimes are generated in the same manner as in Oxide, such as from inferred subtyping requirements or user-provided outlives-constraints. Then given these constraints, we compute loan sets via a process similar to the ownership-safety judgment described in Section 2.2. In short, for all instances of borrow expressions $& r ω p$ in the MIR program, we initialize $Γ (r) = {p}$ . Then we propagate loans via $Γ (r) = ⋃_{r^{'} :> r} Γ (r^{'})$ until $Γ$ reaches a fixpoint.

4.3 Handling Ownership-Unsafe Code

Rust has a concept of raw pointers whose behavior is comparable to pointers in C. For a type T, an immutable reference has type &T, while an immutable raw pointer has type *const T. Raw pointers are not subject to ownership restrictions, and they can only be used in blocks of code demarcated as unsafe. They are primarily used to interoperate with other languages like C, and to implement primitives that cannot be proved as ownership-safe via Rust's rules.

Our pointer and mutation analysis fundamentally relies on ownership-safety for soundness. We do not try to analyze information flowing directly through unsafe code, as it would be subject to the same difficulties of C++ in Section 1. While this limits the applicability of our analysis, empirical studies have shown that most Rust code does not (directly) use unsafe blocks [ $Astrauskas et al. 2020$ ][ $Evans et al. 2020$ ]. We further discuss the impact and potential mitigations of this limitation in Section 8.

5 Evaluation

Section 3 established that our analysis is sound. The next question is whether it is precise: how many spurious flows are included by our analysis? We evaluate two directions:

What if the analysis had more information? If we could analyze the definitions of called functions, how much more precise are whole-program flows vs. modular flows?
What if the analysis had less information? If Rust's type system was more like C++, i.e. lacking ownership, then how much less precise do the modular flows become?

To answer these questions, we created three modifications to Flowistry:

The analysis recursively analyzes information flow within the definitions of called functions. For example, if calling a function f(&mut x, y) where f does not actually modify x, then the Whole-Program analysis will not register a flow from y to x.
The analysis does not distinguish between mutable and immutable references. For example, if calling a function f(&x), then the analysis assumes that x can be modified.
The analysis does not use lifetimes to reason about references, and rather assumes all references of the same type can alias. For example, if a function takes as input f(x: &mut i32, y: &mut i32) then x and y are assumed to be aliases.

The Whole-Program modification represents the most precise information flow analysis we can feasibly implement. The Mut-Blind and Pointer-Blind modifications represent an ablation of the precision provided by ownership types. Each modification can be combined with the others, representing $2^{3} = 8$ possible conditions for evaluation.

To better understand Whole-Program, say we are analyzing the information flow for an expression f(&mut x, y) where f is defined as f(a, b) { (*a).1 = b; }. After analyzing the implementation of f, we translate flows to parameters of f into flows on arguments of the call to f. So the flow b -> (*a).1 is translated intoy -> x.1.

Additionally, if the definition of f is not available, then we fall back to the modular analysis. Importantly, due to the architecture of the Rust compiler, the only available definitions are those within the package being analyzed. Therefore even with Whole-Program, we cannot recurse into e.g. the standard library.

With these three modifications, we compare the number of flows computed from a dataset of Rust projects (Section 5.1) to quantitatively (Section 5.2) and qualitatively (Section 5.3) evaluate the precision of our analysis.

5.1 Dataset

To empirically compare these modifications, we curated a dataset of Rust packages (or "crates") to analyze. We had two selection criteria:

To mitigate the single-crate limitation of Whole-Program, we preferred large crates so as to see a greater impact from the Whole-Program modification. We only considered crates with over 10,000 lines of code as measured by the cloc utility [ $Al Danial 2021$ ].
To control for code styles specific to individual applications, we wanted crates from a wide range of domains.

After a manual review of large crates in the Rust ecosystem, we selected 10 crates, shown in Figure 2. We built each crate with as many feature flags enabled as would work on our Ubuntu 16.04 machine. Details like the specific flags and commit hashes can be found in the appendix.

Project	Crate	Purpose	LOC	# Vars	# Funcs	Avg. Instrs/Func
rayon		Data parallelism library	15,524	10,607	1,079	16.6
rustls	rustls	TLS implementation	16,866	23,407	868	42.4
sccache		Distributed build cache	23,202	23,987	643	62.1
nalgebra		Numerics library	31,951	35,886	1,785	26.7
image		Image processing library	20,722	39,077	1,096	56.8
hyper		HTTP server	15,082	44,900	790	82.9
rg3d		3D game engine	54,426	59,590	3,448	25.7
rav1e		Video encoder	50,294	76,749	931	115.4
RustPython	vm	Python interpreter	47,927	97,637	3,315	51.0
		Total:	286,682	435,979	14,696

Figure 2:

Dataset of crates used to evaluate information flow precision, ordered in increasing number of variables analyzed. Each project often contains many crates, so a sub-crate is specified where applicable, and the root crate is analyzed otherwise. Metrics displayed are LOC (lines of code), number of variables, number of functions, and the average number of MIR instructions per function (size of CFG).

For each crate, we ran the information flow analysis on every function in the crate, repeated under each of the 8 conditions. Within a function, for each local variable $x$ , we compute the size of $Θ (x)$ at the exit of the CFG — in terms of program slicing, we compute the size of the variable's backward slice at the function's return instructions. The resulting dataset then has four independent variables (crate, function, condition, variable name) and one dependent variable (size of dependency set) for a total of 3,487,832 data points.

Our main goal in this evaluation is to analyze precision, not performance. Our baseline implementation is reasonably optimized — the median per-function execution time was 370.24 $μ s$ . But Whole-Program is designed to be as precise as possible, so its naive recursion is sometimes extremely slow. For example, when analyzing the GameEngine::render function of the rg3d crate (with thousands of functions in its call graph), the modular analysis takes 0.13s while the recursive analysis takes 23.18s, a 178 $\times$ slowdown. Future work could compare our modular analysis to whole-program analyses across the precision/performance spectrum, such as in the extensive literature on context-sensitivity [ $Smaragdakis and Balatsouras 2015$ ].

5.2 Quantitative Results

We observed no meaningful patterns from the interaction of modifications — for example, in a linear regression of the interaction of Mut-Blind and Pointer-Blind against the size of the dependency set, each condition is individually statistically significant ( $p < 0.001$ ) while their interaction is not ( $p = 0.337$ ). So to simplify our presentation, we focus only on four conditions: three for each modification individually active with the rest disabled, and one for all modifications disabled, referred to as Baseline.

5.2.1 Whole-Program

Save as SVG Save as PNG View Source View Compiled Vega Open in Vega Editor

Figure 3:

Distribution in differences of dependency set size between Whole-Program and Baseline analyses. The x-axis is a log-scale with 0 added for comparison. Most sets are the same, so 0 dominates (left). A log-scale (right) shows the tail more clearly.

For Whole-Program, we compare against Baseline to answer our first evaluation question: how much more precise is a whole-program analysis than a modular one? To quantify precision, we compare the percentage increase in size of dependency sets for a given variable between two conditions. For instance, if Whole-Program computes $∣ Θ (x) ∣ = 2$ and Baseline computes $∣ Θ (x) ∣ = 5$ for some $x$ , then the difference is $(5 - 2) /2 = 1.5 = 150%$ .

Figure 3-Left shows a histogram of the differences between Whole-Program and Baseline for all variables. In 94% of all cases, the Whole-Program and Baseline conditions produce the same result and hence have a difference of 0. In the remaining 6% of cases with a non-zero difference, visually enhanced with a log-scale in Figure 3-Right, the metric follows a right-tailed log-normal distribution. We can summarize the log-normal by computing its median, which is 7%. This means that within the 6% of non-zero cases, the median difference is an increase in size by 7%. Thus, the modular approximation does not significantly increase the size of dependency sets in the vast majority of cases.

5.2.2 Mut-Blind and Pointer-Blind

Save as SVG Save as PNG View Source View Compiled Vega Open in Vega Editor

Figure 4:

Distribution in differences between Baseline and each alternative condition, with zeros excluded to highlight the shape of each distribution. Mut-Blind and Pointer-Blind both reduce the precision more often and more severely than Baseline does vs. Whole-Program.

Next, we address our second evaluation question: how much less precise is an analysis with weaker assumptions about the program than the Baseline analysis? For this question, we compare the size of dependency sets between the Mut-Blind and Pointer-Blind conditions versus Baseline. Figure 4 shows the corresponding histograms of differences, with the Whole-Program vs. Baseline histogram included for comparison.

First, the Mut-Blind and Pointer-Blind modifications reduce the precision of the analysis more often and with a greater magnitude than Baseline does vs. Whole-Program. 39% of Mut-Blind cases and 17% of Pointer-Blind cases have a non-zero difference. Of those cases, the median difference in size is 50% for Mut-Blind and 56% for Pointer-Blind.

Therefore, the information from ownership types is valuable in increasing the precision of our information flow analysis. Dependency sets are often larger without access to information about mutability or lifetimes.

5.3 Qualitative Results

The statistics convey a sense of how often each condition influences precision. But it is equally valuable to understand the kind of code that leads to such differences. For each condition, we manually inspected a sample of cases with non-zero differences vs. Baseline.

5.3.1 Modularity

One common source of imprecision in modular flows is when functions take a mutable reference as input for the purposes of passing the mutable permission off to an element of the input.

fn crop<I: GenericImageView>(

image: &mut I, x: u32, y: u32,

width: u32, height: u32

) -> SubImage<&mut I> {

let (x, y, width, height) =

crop_dimms(image, x, y, width, height);

SubImage::new(image, x, y, width, height)

}

For example, the function crop from the image crate returns a mutable view on an image. No data is mutated, only the mutable permission is passed from whole image to sub-image. However, a modular analysis on the image input would assume that image is mutated by crop.

Another common case is when a value only depends on a subset of a function's inputs. The modular approximation assumes all inputs are relevant to all possible mutations, but this is naturally not always the case.

fn solve_lower_triangular_with_diag_mut<R2,C2,S2>(

&self, b: &mut Matrix<N, R2, C2, S2>, diag: N,

) -> bool {

if diag.is_zero() { return false; }

// logic mutating b...

true

}

For example, this function from nalgebra returns a boolean whose value solely depends on the argument diag. However, a modular analysis of a call to this function would assume that self and b is relevant to the return value as well.

5.3.2 Mutability

The reason Mut-Blind is less precise than Baseline is quite simple — many functions take immutable references as inputs, and so many more mutations have to be

fn read_until<R, F>(io: &mut R, func: F)

-> io::Result<Vec<u8>>

where R: Read, F: Fn(&[u8]) -> bool

{

let mut buf = vec![0; 8192]; let mut pos = 0;

loop {

let n = io.read(&mut buf[pos..])?; pos += n;

if func(&buf[..pos]) { break; }

// ...

}

For instance, this function from hyper repeatedly calls an input function func with segments of an input buffer. Without a control-flow analysis, it is impossible to know what functions read_until will be called with. And so Mut-Blind must always assume that func could mutate buf. However, Baseline can rely on the immutability of shared references and deduce that func could not mutate buf.

5.3.3 Lifetimes

Without lifetimes, our analysis has to make more conservative assumptions about objects that could possibly alias. We observed many cases in the Pointer-Blind condition where two references shared different lifetimes but the same type, and so had to be classified as aliases.

fn link_child_with_parent_component(

parent: &mut FbxComponent,

child: &mut FbxComponent,

child_handle: Handle<FbxComponent>,

) { match parent {

FbxComponent::Model(model) => {

model.geoms.push(child_handle),

// ..

}}

For example, the link_child_with_parent_component function in rg3d takes mutable references to a parent and child. These references are guaranteed not to alias by the rules of ownership, but a naive pointer analysis must assume they could, so modifying parent could modify child.

5.4 Threats to Validity

Finally, we address the issue: how meaningful are the results above? How likely would they generalize to arbitrary code rather than just our selected dataset? We discuss a few threats to validity below.

Are the results due to only a few crates?
If differences between techniques only arose in a small number of situations that happen to be in our dataset, then our technique would not be as generally applicable. To determine the variation between crates, we generated a histogram of non-zero differences for the Baseline vs. Mut-Blind comparison, broken down by crate in Figure 5.

Save as SVG Save as PNG View Source View Compiled Vega Open in Vega Editor
Figure 5:
Distribution of non-zero differences between Baseline and Mut-Blind, broken down by crate.
As expected, the larger code bases (e.g. rav1e and RustPython) have more non-zero differences than smaller codebases — in general the correlation between non-zero differences and total number of variables analyzed is strong, $R^{2} = 0.79$ . However variation also exists for crates with roughly the same number of variables like image and hyper. Mut-Blind reduces precision for variables in hyper more often than image. A qualitative inspection of the respective codebases suggests this may be because hyper simply makes greater use of immutable references in its API.
These findings suggest that the impact of ownership types and the modular approximation likely do vary with code style, but a broader trend is still observable across all code.
Would Whole-Program be more precise with access to dependencies?
A limitation of our whole-program analysis is our inability to access function definitions outside the current crate. Without this limitation, it may be that the Baseline analysis would be significantly worse than Whole-Program. So for each variable analyzed by Whole-Program, we additionally computed whether the information flow for that variable involved a function call across a crate boundary.
Overall 96% of cases reached at least one crate boundary, suggesting that this limitation does occur quite often in practice. However, the impact of the limitation is less clear. Of the 96% of cases that hit a crate boundary, 6.6% had a non-zero difference between Baseline and Whole-Program. Of the 4% that did not hit a crate boundary, 0.6% had a non-zero difference. One would expect that Whole-Program would be the most precise when the whole program is available (no boundary), but instead it was much closer to Baseline.
Ultimately it is not clear how much more precise Whole-Program would be given access to all a crate's dependencies, but it would not necessarily be a significant improvement over the benchmark presented.
Is ownership actually important for precision?
The finding that Pointer-Blind only makes a difference in 17% of cases may seem surprisingly small. For instance, $Shapiro and Horwitz [1997]$ found in a empirical study of slices on C programs that "using a pointer analysis with an average points-to set size twice as large as a more precise pointer analysis led to an increase of about 70% in the size of [slices]."
A limitation of our ablation is that the analyzed programs were written to satisfy Rust's ownership safety rules. Disabling lifetimes does not change the structure of the programs to become more C-like — Rust generally encourages a code style with fewer aliases to avoid dealing with lifetimes. A fairer comparison would be to implement an application idiomatically in both Rust and Rust-but-without-feature-X, but such an evaluation is not practical. It is therefore likely that our results understate the true impact of ownership types on precision given this limitation.

6 Applications

Figure 6a:

A program slicer integrated into VSCode. Above, the user selects a slicing criterion like the variable f. Then the slicer highlights the criterion in green, and fades out lines that are not part of the backward slice on f. For example, write_all mutates the file so it is in the slice, while metadata reads the file so it is not in the slice.

Below, the user can manipulate aspects of a program such as commenting out code related to timing. The user computes a forward slice on start, adds this slice to their selection (in blue), then tells the IDE to comments out all lines in the selection.

Figure 6b:

An IFC checker. Above, the ifc_traits library exports a Secure for users to mark sensitive data, like Password, and insecure operations like insecure_print. Below, a compiler plugin invoked on the program checks for information flows from data with a type implementing Secure to insecure operations. Here insecure_print is conditionally executed based on a read from PASSWORD, so this flow is flagged.

Figure 6:

Two applications of information flow built using Flowistry.

We have demonstrated that ownership can be leveraged to build an information flow analysis that is static, modular, sound, and precise. Our hope is that this analysis can serve as a building block for future static analyses. To bootstrap this future, we have used Flowistry to implement prototypes of a program slicer and an IFC checker, shown in Figure 6.

The program slicer in Figure 6a is a VSCode extension that fades out all lines of code that are not relevant to the user's selection, i.e. not part of the modular slice. Rather than present a slice of the entire program like in prior slicing tools, we can use Flowistry's modular analysis to present lightweight slices of just within a given function. Users can apply the slicer for comprehension tasks such as reducing the scope of a bug, or for refactoring tasks such as removing an aspect of a program like timing or logging.

The IFC checker in Figure 6b is a Rust library and compiler plugin. It provides the user a library with the traits Secure and Insecure to indicate the relative security of data types and operations. Then the compiler plugin uses Flowistry to determine whether information flows from Insecure variables to Secure variables. Users can apply the IFC checker to catch sensitive data leaks in an application. This prototype is purely intraprocedural, but future work could build an interprocedural analysis by using Flowistry's output as procedure summaries in a larger information flow graph.

7 Related Work

Our work draws on three core concepts: information flow, modular static analysis, and ownership types.

7.1 Information Flow

Information flow has been historically studied in the context of security, such as ensuring low-security users of a program cannot infer anything about its high-security internals. Security-focused information flow analyses have been developed for Java [ $Myers 1999$ ], Javascript [ $Austin and Flanagan 2009$ ], OCaml [ $Pottier and Simonet 2003$ ], Haskell [ $Stefan et al. 2011$ ], and many other languages.

Each analysis satisfies some, but not all, of our requirements from Section 1. For instance, the JFlow [ $Myers 1999$ ] and Flow-Caml [ $Pottier and Simonet 2003$ ] languages required adding features to the base language, violating our second requirement. Some methods like that of $Austin and Flanagan [2009]$ for Javascript rely on dynamic analysis, violating our third requirement. And Haskell only supports effects like mutation through monads, violating our first requirement.

Nonetheless, we draw significant inspiration from mechanisms in prior work. Our analysis resembles the slicing calculus of $Abadi et al. [1999]$ . The use of lifetimes for modular analysis of functions is comparable to security annotations in Flow-Caml [ $Pottier and Simonet 2003$ ]. The CFG analysis draws on techniques used in program slicers, such as the LLVM dg slicer [ $Marek Chalupa 2016$ ].

7.2 Modular Static Analysis

The key technique to making static analysis modular (or "compositional" or "separate") is symbolically summarizing a function's behavior, so that the summary can be used without the function's implementation. Starting from $Rountev et al. [1999]$ and $Cousot and Cousot [2002]$ , one approach has been to design a system of "procedure summaries" understood by the static analyzer and distinct from the language being analyzed. This approach has been widely applied for static analysis of null pointer dereferences [ $Yorsh et al. 2008$ ], pointer aliases [ $Dillig et al. 2011$ ], data dependencies [ $Tang et al. 2015$ ], and other properties.

Another approach, like ours, is to leverage the language's type system to summarize behavior. $Yan Mei Tang and Pierre Jouvelot [1994]$ showed that an effect system could be used for a modular control-flow analysis. Later work in Haskell used its powerful type system and monadic effects to embed many forms of information flow control into the language [ $Peng Li and Zdancewic 2006$ ][ $Russo et al. 2008$ ][ $Stefan et al. 2011$ ][ $Buiras et al. 2015$ ].

7.3 Ownership Types

Rust and Oxide's conceptions of ownership derive from $Clarke et al. [1998]$ and $Grossman et al. [2002]$ . For instance, the Cyclone language of Grossman et al. uses regions to restrict where a pointer can point-to, and uses region variables to express relationships between regions in a function's input and output types. A lifetime is similar in that it annotates the types of pointers, but differs in how it is analyzed.

Recent works have demonstrated innovative applications of Rust's type system for modular program analysis. $Astrauskas et al. [2019]$ embed Rust programs into a separation logic to verify pre/post conditions about functions. $Jung et al. [2019]$ use Rust's ownership-based guarantees to implement more aggressive program optimizations.

Closer to our domain, $Balasubramanian et al. [2017]$ implemented a prototype IFC system for Rust by lowering programs to LLVM and verifying them with SMACK [ $Zvonimir Rakamaric and Michael Emmi 2014$ ], although their system is hard to contrast with ours given the high-level description in their paper. $Njor and G \overset{u}{ˊ} stafsson [2021]$ implemented a static taint analysis for Rust, although it is not field-sensitive, alias-sensitive, or modular.

8 Discussion

Looking forward, two interesting avenues for future work on Flowistry are improving its precision and addressing soundness in unsafe code. For instance, the lifetime-based pointer analysis is sound but imprecise in some respects. Lifetimes often lose information about part-whole relationships. Consider the function that returns a mutable pointer to a specific index in a vector:

fn get_mut<'a>(&'a mut self, i: usize)

-> Option<&'a mut T>;

These lifetimes indicate only that the return value points to something in the input vector. The expressions v.get_mut(i) and v.get_mut(i + 1) are considered aliases even though they are not. Future work could integrate Flowistry with verification tools like Prusti [ $Astrauskas et al. 2019$ ] to use abstract interpretation for a more precise pointer analysis in such cases.

Additionally, Rust has many libraries built on unsafe code that can lose annotations essential to information flow, such as interior mutability. For example, shared-memory concurrency in Rust looks like this:

let n: Arc<Mutex<i32>> = Arc::new(Mutex::new(0));

let n2: Arc<Mutex<i32>> = Arc::clone(&n);

*n2.lock().unwrap() = 1;

Arc::clone does not share a lifetime between its input and output, so a lifetime-based pointer analysis therefore cannot deduce that n2 is an alias of n, and Flowistry would not recognize that the mutation on line 3 affects n. Future work can explore how unsafe libraries could be annotated with the necessary metadata needed to analyze information flow, similar to how RustBelt [ $Jung et al. 2017$ ] identifies the pre/post-conditions needed to ensure type safety within unsafe code.

Overall, we are excited by the possibilities created by having a practical information flow analysis that can run today on any Rust program. Many exciting systems for tasks like debugging [ $Ko and Myers 2004$ ], example generation [ $Head et al. 2018$ ], and program repair [ $Wen et al. 2018$ ] rely on information flow in some form, and we hope that Flowistry can support the development of these tools.

Acknowledgments

This work was partially supported by the Italian Ministry of Education through funding for the Rita Levi Montalcini grant (call of 2019).

References

Jeanne Ferrante

Karl J. Ottenstein

, and

Joe D. Warren

. 1987. The Program Dependence Graph and Its Use in Optimization. ACM Trans. Program. Lang. Syst. (3).

Keith D Cooper

Timothy J Harvey

, and

Ken Kennedy

. 2001. A simple, fast dominance algorithm. Software Practice \& Experience (1-10).

Mark Weiser

. 1981. Program Slicing. Proceedings of the 5th International Conference on Software Engineering.

David G. Clarke

John M. Potter

, and

James Noble

. 1998. Ownership Types for Flexible Alias Protection. Proceedings of the 13th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications.

Dan Grossman

Greg Morrisett

Trevor Jim

Michael Hicks

Yanling Wang

, and

James Cheney

. 2002. Region-Based Memory Management in Cyclone. Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation.

Jean-Yves Girard

. 1987. Linear logic. Theoretical Computer Science (1).

Aaron Weiss

Olek Gierczak

Daniel Patterson

, and

Amal Ahmed

. 2019. Oxide: The Essence of Rust.

Atanas Rountev

Barbara G. Ryder

, and

William Landi

. 1999. Data-Flow Analysis of Program Fragments. Proceedings of the 7th European Software Engineering Conference Held Jointly with the 7th ACM SIGSOFT International Symposium on Foundations of Software Engineering.

Patrick Cousot

, and

Radhia Cousot

. 2002. Modular Static Program Analysis. Proceedings of the 11th International Conference on Compiler Construction.

Greta Yorsh

Eran Yahav

, and

Satish Chandra

. 2008. Generating Precise and Concise Procedure Summaries. Proceedings of the 35th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages.

Hao Tang

Xiaoyin Wang

Lingming Zhang

Bing Xie

Lu Zhang

, and

Hong Mei

. 2015. Summary-Based Context-Sensitive Data-Dependence Analysis in Presence of Callbacks. Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages.

Niko Matsakis

. 2017. Non-lexical lifetimes.

Ralf Jung

Jacques-Henri Jourdan

Robbert Krebbers

, and

Derek Dreyer

. 2017. RustBelt: Securing the Foundations of the Rust Programming Language. Proc. ACM Program. Lang. (POPL).

R. Cytron

J. Ferrante

B. K. Rosen

M. N. Wegman

, and

F. K. Zadeck

. 1989. An Efficient Method of Computing Static Single Assignment Form. Proceedings of the 16th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages.

Al Danial

. 2021. cloc: Count Lines of Code.

Marek Chalupa

. 2016. Slicing of LLVM bitcode.

Mart \overset{ı}{ˊ} n Abadi

Anindya Banerjee

Nevin Heintze

, and

Jon G. Riecke

. 1999. A Core Calculus of Dependency. Proceedings of the 26th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages.

Fran \overset{c}{¸} ois Pottier

, and

Vincent Simonet

. 2003. Information Flow Inference for ML. ACM Trans. Program. Lang. Syst. (1).

Yannis Smaragdakis

, and

George Balatsouras

. 2015. Pointer analysis. Foundations and Trends in Programming Languages (1).

Vytautas Astrauskas

Peter M \overset{u}{¨} ller

Federico Poli

, and

Alexander J. Summers

. 2019. Leveraging Rust Types for Modular Specification and Verification. Proc. ACM Program. Lang. (OOPSLA).

Ralf Jung

Hoang-Hai Dang

Jeehoon Kang

, and

Derek Dreyer

. 2019. Stacked Borrows: An Aliasing Model for Rust. Proc. ACM Program. Lang. (POPL).

Vytautas Astrauskas

Christoph Matheja

Federico Poli

Peter M \overset{u}{¨} ller

, and

Alexander J. Summers

. 2020. How Do Programmers Use Unsafe Rust?. Proc. ACM Program. Lang. (OOPSLA).

Isil Dillig

Thomas Dillig

Alex Aiken

, and

Mooly Sagiv

. 2011. Precise and Compact Modular Procedure Summaries for Heap Manipulating Programs. Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation.

A. Sabelfeld

, and

A.C. Myers

. 2003. Language-based information-flow security. IEEE Journal on Selected Areas in Communications (1).

Andrew C. Myers

. 1999. JFlow: Practical Mostly-Static Information Flow Control. Proceedings of the 26th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages.

J.A. Goguen

, and

J. Meseguer

. 1982. Security Policies and Security Models. 1982 {IEEE} Symposium on Security and Privacy.

Deian Stefan

Alejandro Russo

John C. Mitchell

, and

David Mazi \overset{e}{ˋ} res

. 2011. Flexible Dynamic Information Flow Control in Haskell. Proceedings of the 4th ACM Symposium on Haskell.

Thomas H. Austin

, and

Cormac Flanagan

. 2009. Efficient Purely-Dynamic Information Flow Analysis. Proceedings of the ACM SIGPLAN Fourth Workshop on Programming Languages and Analysis for Security.

Marc Shapiro

, and

Susan Horwitz

. 1997. The Effects of the Precision of Pointer Analysis. Proceedings of the 4th International Symposium on Static Analysis.

Yan Mei Tang

, and

Pierre Jouvelot

. 1994. Separate abstract interpretation for control-flow analysis. Lecture Notes in Computer Science.

Peng Li

, and

S. Zdancewic

. 2006. Encoding information flow in Haskell. 19th IEEE Computer Security Foundations Workshop (CSFW'06).

Alejandro Russo

Koen Claessen

, and

John Hughes

. 2008. A Library for Light-Weight Information-Flow Security in Haskell. Proceedings of the First ACM SIGPLAN Symposium on Haskell.

Pablo Buiras

Dimitrios Vytiniotis

, and

Alejandro Russo

. 2015. HLIO: Mixing Static and Dynamic Typing for Information-Flow Control in Haskell. Proceedings of the 20th ACM SIGPLAN International Conference on Functional Programming.

Abhiram Balasubramanian

Marek S. Baranowski

Anton Burtsev

Aurojit Panda

Zvonimir Rakamari \overset{c}{ˊ}

, and

Leonid Ryzhyk

. 2017. System Programming in Rust: Beyond Safety. Proceedings of the 16th Workshop on Hot Topics in Operating Systems.

Zvonimir Rakamaric

, and

Michael Emmi

. 2014. SMACK: Decoupling Source Language Details from Verifier Implementations. Proceedings of the 26th International Conference on Computer Aided Verification (CAV).

Amy J. Ko

, and

Brad A. Myers

. 2004. Designing the Whyline: A Debugging Interface for Asking Questions about Program Behavior. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Andrew Head

Elena L. Glassman

Bj \overset{o}{¨} rn Hartmann

, and

Marti A. Hearst

. 2018. Interactive Extraction of Examples from Existing Code. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems.

Ming Wen

Junjie Chen

Rongxin Wu

Dan Hao

, and

Shing-Chi Cheung

. 2018. Context-Aware Patch Generation for Better Automated Program Repair. Proceedings of the 40th International Conference on Software Engineering.

Ana Nora Evans

Bradford Campbell

, and

Mary Lou Soffa

. 2020. Is Rust Used Safely by Software Developers?. Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering.

Emil J ø rgensen Njor

, and

Hilmar G \overset{u}{ˊ} stafsson

. 2021. Static Taint Analysis in Rust.

%%%

import "@nota-lang/nota-theme-acm/dist/index.css"

import bibtex from "./bibliography.bib"

import {rust} from "@codemirror/lang-rust"

import {cpp} from "@codemirror/lang-cpp"

import _ from "lodash";

let sysname = @Smallcaps{Flowistry};

%%%

@ListingConfigure[language: rust()]

@Title: Modular Information Flow through Ownership

%let stanford = @Affiliation{@Institution{Stanford University}}

@Authors:

@Author: @Name{Will Crichton} #stanford

@Author:

@Name: Marco Patrignani

@Affiliation:

@Institution: University of Trento

@Author: @Name{Maneesh Agrawala} #stanford

@Author: @Name{Pat Hanrahan} #stanford

@Abstract:

@Section[name: "sec_intro"]: Introduction

Information flow describes how data influences other data within a program. Information flow has applications to security, such as information flow control &sabelfeld2003language, and to developer tools, such as program slicing &weiser1984program. Our goal is to build a practical system for analyzing information flow, meaning:

* **Applicable to common language features:** the language being analyzed should support widely used features like pointers and in-place mutation.

* **Zero configuration to run on existing code:** the analyzer must integrate with an existing language and existing unannotated programs. It must not require users to adopt a new language designed for information flow.

* **No dynamic analysis:** to reduce integration challenges and costs, the analyzer must be purely static --- no modifications to runtimes or binaries are needed.

* **Modular over dependencies:** programs may not have source available for dependencies. The analyzer must have reasonable precision without whole-program analysis.