Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Symbolic Execution of Core-LLZK Programs

The goal of symbolic execution of core-LLZK is to generate an SMT2 formula that faithfully represents the big-step semantics of the witness program (also referred to simply as the witness or the program). This formula can then be used, for example, to prove that the witness and the corresponding circuit are equivalent — that is, that they compute the same function.

To illustrate the idea, consider a program that consists of two instructions: y = felt.mul y x followed by z = felt.add y z. Assuming that x, y and z are represented by constraint variables , , and , the first instruction is encoded as , where is a fresh variable representing the updated value of y. The second instruction is then encoded as . The formula that represents the program is then

Note that, for clarity, we write constraints using standard (mathematical) format logical formulas, where the arithmetic operations are interpreted over a finite field. Translating them into SMT2 format is straightforward.

In what follows we will describe the encoding of the different instructions of the language, how they are composed, and how a function can be encoded in a modular way as SMT2 macros. Then, in the last section, we explain where the implementation of the symbolic execution can be found, and how it can be used.

[!NOTE]

All concrete values and operations used in the rest of this document are over a finite-field with respect to a given prime p that fits in exactly k bits, unless stated otherwise. We will thus sometimes drop the term finite-field.

Assigning constraint variables to program variables

The first thing we need to model is how to relate program variables to their corresponding constraint variables. For this we will use a mapping (referred to as a symbolic environment) such that , where is a program variable, returns one of the following values:

  • A concrete finite-field value.
  • A constraint variable.
  • A symbolic (array) environment such that represents the symbolic value of the -th position of the array (the value can be either a concrete finite-field value or a constraint variable; it cannot be an array again).

Why we need the concrete finite-field value in the symbolic environment ?

This is needed in order to achieve one of the powerful features of the translation, which executes the instructions when all variables have concrete values and thus avoid generating new constraint variables and corresponding formulas. It also handles aliasing, i.e., if we have an instruction like x = y, then we will not generate a new variable for x and add a formula stating the equality, but rather will assign to x the current value of y and avoid generating a new formula.

We will use the syntax for a symbolic environment that is obtained from by setting to , replacing its current value if any.

For simplicity, when we have a simple expression sexp, which can be a variable or a value, then abusing notation we assume that returns sexp itself when it is a finite-field value.

Encoding of expressions

Encoding of an expression, in the context of a symbolic environment , generates a formula that encodes the result of evaluating the expression.

Encoding of expressions does not modify the symbolic environment , it simply uses the constraint variables of and binds the result to a given output variable . Note that all variables that appear in an expression do not correspond to arrays, otherwise the program is ill-typed (array accesses are handled using dedicated instructions).

Next we describe the encodings of the different expressions as they are defined in the core-LLZK language.

Arithmetic

sexp (Identity)

The encoding is the formula

felt.neg sexp (Negation)

The encoding is the formula

felt.add sexp1 sexp2 (Addition)

The encoding is the formula

felt.sub sexp1 sexp2 (Subtraction)

The encoding is the formula

felt.mul sexp1 sexp2 (Multiplication)

The encoding is the formula

felt.div sexp1 sexp2 (Multiplication by modular inverse)

The encoding is the formula

Bitwise

Encoding of bitwise operations heavily relies on the binary expansion of a given constraint variable (it also works when is a finite-field value). This operation is denoted by , i.e., the binary expansion of using bits. We assume that it generates a formula that is a conjunction of the following constraints:

  1. , where are fresh finite-field variables representing the bits of .
  2. to state that the bits can bey either or . The constraint can also be replaced by when range constraints are allowed.

Recall the the finite-field is with respect to a prime that fits in bits.

bit.and sexp1 sexp2 (Bitwise AND)

Let , , and be the encodings corresponding to , , and . Let denote the -th bit of and the -th bit of , produced by the respective calls.

The encoding is

As an optimization, when sexp1 is a constant that fits in bits, the encoding of and can be done with respect to bits instead of bits, and then use the following encoding

This is valid because all bits with are . Here we save variables, which can be important for scalability during the verification phase. We can apply a similar optimization for the case when sexp2 is a constant.

bit.or sexp1 sexp2 (Bitwise OR)

Let , , and be the formulas corresponding to , , and . Let denote the -th bit of and the -th bit of , produced by the respective calls.

The encodingis

bit.xor sexp1 sexp2 (Bitwise XOR)

Let , , and be the formulas corresponding to , , and . Let denote the -th bit of and the -th bit of , produced by the respective calls.

The encoding is

bit.not sexp (Bitwise NOT)

Let and be the formulas corresponding to and . Let denote the -th bit of , produced by the respective call.

The encoding is

bit.shl sexp1 sexp2 (Left shift)

The encoding of left-shift considers two cases: the first handles the case when sexp2 is a value, and the other when it is not. In practice, we rarely find the second case. Next we explain the two cases.

The case when sexp2 is a constant

Let and be the formulas corresponding to and . Let denote the -th bit of , produced by the respective call. Let the value of sexp2 be . The encoding is

The general case

The second case is more elaborated, and is based computing the binary expansion of using bits, i.e., , and then iteratively left-shift by position when the corresponding bit of is 1.

bit.shr sexp1 sexp2 (Right shift)

The encoding of right-shift considers two cases: the first handles the case when sexp2 is a value, and the other when it is not. In practice, we rarely find the second case. Next we explain the two cases.

The case when sexp2 is a constant

Let and be the formulas corresponding to and . Let denote the -th bit of , produced by the respective call. Let the value of sexp2 be . The encoding is

The general case

It is based on the same idea as the general case of bit.shl.

Boolean

We will rely on formulas of the form , interpreted as: if holds then otherwise . Note that when is a bit variable this can be expressed arithmetically as , however, keeping the form may provide important explicit information during the verification process (the one that uses the encoding of the witness program).

Note that Boolean values are simulated using finite-field values, where represents false and any other value is true.

bool.eq sexp1 sexp2 (Equality)

The encoding is

bool.neq sexp1 sexp2 (Inequality)

The encoding is

bool.and sexp1 sexp2 (Logical AND)

The encoding is

bool.or sexp1 sexp2 (Logical OR)

The encoding is

bool.not sexp (Logical NOT)

The encoding is

bool.lt sexp1 sexp2 (Signed less than)

First recall that we deal with signed values. Thus, comparisons interpret field elements as signed integers. The order of the field elements is defined as

where . The idea is that represent negative numbers.

There are two special cases that we consider, which improve the overall performance of the verification process. They arise when sexp1 or sexp2 are constant values. We describe these cases first, followed by the general case when both are variables. Note that the case when both sexp1 and sexp2 are constant is handled when executing the corresponding command (since both are constant, the comparison is simply evaluated).

The case when sexp1 is a constant

Let be the value of sexp1, so we want to encode the signed comparison . The encoding is divided into several cases; in all of them the encoding is

where is:

  • if , then is false, because represents the largest non-negative value.

  • if , then is , because can be any positive value larger than .

  • if , then is , because can be any non-negative number.

  • if , then is , because is negative, so can be any positive or negative value larger than .

The case when sexp2 is a constant

Let be the value of sexp2, so we want to encode the signed comparison . The encoding is divided into several cases; in all of them the encoding is

where is defined separately for each case:

  • if , then is false, because represents the smallest negative value.

  • if , then is , because is negative and thus is a negative number smaller than .

  • if , then is , because must be negative.

  • if , then is , because is positive so can be negative or non-negative smaller than .

The general case

We assume that both sexp1 and sexp2 are not constant values, so we want to encode the signed comparison . Let and be the formulas corresponding to and . Let denote the -th bit of and the -th bit of , produced by the respective calls.

The idea is to compare the bits from the most to the least significant, until we find such that , in which case the comparison is true, otherwise it is false. This can be done using the encoding

where is recursively defined as:

bool.gt sexp1 sexp2 (Signed greater than)

This is done by using the encoding of bool.lt sexp2 sexp1.

bool.le sexp1 sexp2 (Signed less or equal)

It is computed as the negation of bool.lt sexp2 sexp1. Suppose is the encoding of bool.lt sexp2 sexp1 using an auxiliary output variable , then the encoding is

bool.ge sexp1 sexp2 (Signed greater or equal)

This is done by using the encoding of bool.le sexp2 sexp1.

Encoding of commands

Next we describe how commands and lists of commands are encoded. Any encoding of a command receives as input a command and a symbolic environment , and produces where is the corresponding formula, and is a new symbolic environment that results from by modifying the values of some variables (due to the symbolic execution of ).

Executing a list of commands is done recursively as follows:

  • The symbolic execution of an empty list generates .
  • The symbolic execution of is done in two steps. We first execute using and obtain , then recursively execute using and obtain ; the overall encoding is then .

The symbolic execution of a function is supposed to generate a macro that we denote as

where and are sequences of constraint variables obtained from the input and output parameters of function , and is a sequence of local variables used in the formula (i.e., all variables used in that do not appear in or ). We will explain how this encoding is generated later, but for now a brief description suffices since we will rely on it when encoding function calls.

Next we describe the encodings of the different commands as they are defined in the core-LLZK language.

Assignment

The encoding of an assignment id = exp starts by trying to concretely evaluate exp, and if all variables used in exp have constant values in , the evaluation succeeds and results in a value . We then generate , and the encoding is .

If exp cannot be concretely evaluated, we symbolically evaluate exp using and a fresh output variable and obtain the encoding . Then we generate , and the encoding is .

Arrays

Creating a new array

Creating an array is done using the command array.new sexp id.

To symbolically execute this command, we first evaluate sexp to a concrete value that represents the size of the array (the size of an array must be known during symbolic execution). Then we generate a new symbolic array environment such that for all , and set . The encoding is then .

Accessing an array element

Accessing an array element is done using the command array.read id1[sexp] id2, which retrieves the value at position sexp from array id1, and stores it in variable id2.

To symbolically execute this command, we first let , which is the symbolic environment of the array id1. Then we handle two cases separately: the first when the index is constant, and the other when it is not.

The case of a constant index

If evaluates to a constant index , we generate , and the encoding is then .

The case of a non-constant index

If the index evaluates to a variable , we have to consider all possible values for the index. We let be the size of the array, which is supposed to be known during symbolic execution (it is part of the environment ).

Considering all possible values for the index can be done using where:

where is a fresh variable. Note that this simulates an if-then-else to identify which index was accessed.

Next we generate the output symbolic environment , and let the encoding be .

Updating an array element

Updating an array element is done using the command array.write sexp1 id[sexp2], which updates the value at position sexp2 to the value of sexp1.

To symbolically execute this command, we first let , which is the symbolic environment of the array id. Then we handle two cases separately: the first when the index is constant, and the other when it is not.

The case of a constant index

If evaluates to a constant index , we generate , then , and finally the encoding is .

The case of a non-constant index

If evaluates to a variable , we have to consider all values for the index. We let be the size of the array, which is supposed to be known during symbolic execution.

We first generate new fresh variables for all positions of the array, to represent the values after the update. Let us name them n1. Let be a new array environment such that for all .

We denote by a formula that simulates an update to the -th position of the array, i.e., assigns to , and the rest of position keep their old values. This can be modeled as follows:

Then, to consider all possible cases, we can use an if-then-else structure as in the following recursive definition:

The encoding is then .

Copying an array

Copying an array from one variable to another is done using the command array.copy id1 id2. The encoding simply updates the value of id2 (in ) to that of id1. Let , then the encoding is .

Conditionals

A conditional statement is of the form if sexp1==sexp2 { tb } else { te }, where tb and te are sequences of commands. The encoding is done by combining the encodings of tb and te.

Let and be the encodings of tb and te respectively. The encoding starts by creating a new environment that merges and for the variables that are live immediately after the if-statement (we infer live variables using liveness analysis). For each such live variable : if and agree, then ; otherwise we introduce a fresh variable , add to and to , and set . Assuming that at the end of this process we obtain , , and , the encoding is .

As an important optimization, if the condition sexp1==sexp2 can be concretely evaluated to , i.e., all used variables have concrete values, then we can use the encoding of tb or eb depending on .

Bounded Loops

A bounded loop is of the form repeat sexp { body }, and executes body for sexp iterations. Note that the value of sexp must be known statically.

Assume that , i.e., the loop is executed times. The encoding of the loop is computed using the following recursive definition of , which represents the execution of the loop for iterations:

  • is simply since nothing is executed.
  • For , we first compute the encoding of body starting from , which results in ; then we compute with respect to , which results in ; and the value of is .

The encoding of the loop is then defined as the result of .

Function Calls

A function call is of the form call foo(sexp1, ..., sexpn) to id1,...,idm, where sexp1, ..., sexpn are the input parameters and id1,...,idm are the output parameters. Recall that we have assumed that a function is encoded as a macro of the form

where is a sequence of constraint variables corresponding to the formal input parameters of , is a sequence of constraint variables corresponding to the formal output parameters of , and is a sequence of auxiliary variables (those used in that are not in or ).

The function call is encoded as call to the above macro according to the following steps:

  • We generate the actual input variables by concatenating the values of . If any is an array, then all its elements are inserted into .

  • We generate from by inserting a fresh variable for each output variable idi. For an output variable that is of array type, it is assigned an array of fresh variables.

  • We generate the actual output variables by concatenating the values of . If any is an array, then all its elements are inserted into .

  • We generate a sequence of fresh variables of the same length as (these are, in principle, existential variables).

The encoding of the call is then . Note that we keep it as a call to a macro, which is important when translating the formulas into SMT2 format to allow modular verification.

Encoding of functions

A function is of the form

def foo(x1:t, ...,xn:t) -> y1:t, ..., ym:t {
  body
}

and as explained earlier, its encoding generates a corresponding macro according to the following steps:

  • Generate an initial symbolic environment , where each xi is mapped to a fresh variable or an array of fresh variables, depending on its type. Let be the sequence of constraint variables corresponding to x1,...,xn.

  • Symbolically execute body starting from , which results in .

  • Generate from by inserting a fresh variable for each output variable yi. For an output variable of array type, it is assigned an array of fresh variables. Let be a conjunction of equalities of the form (or when yi is an array of size ) for . Let be the sequence of constraint variables corresponding to y1,...,ym taken from .

  • Let be the sequence of all variables used in that are not in or .

The encoding is then:

Encoding of a program

The encoding of a program is done by encoding all its functions, as macros, and adding a top-level formula that simulates a call to the main function.

Implementation

A symbolic execution engine has been implemented in Lean following the ideas described above, and can be found under translator/lean/llzk/Llzk/SymExec.

To compile it, first move to the directory translator/lean/llzk and run lake build. It can then be executed using the following command:

.lake/build/bin/llzk_cli -zk g64 -se -o output.smt2 input.core

This generates the SMT2 encoding of input.core into output.smt2. The g64 parameter selects the prime 18446744069414584321 with 64 bits. For debugging purposes, it can be replaced by f11 to use the prime 11 with 4 bits. Omitting the -o option prints the result to standard output.

The following command pretty-prints the input program:

.lake/build/bin/llzk_cli -zk g64 -pp -o output.smt2 input.core

and is useful for debugging. Full usage information can be obtained with:

.lake/build/bin/llzk_cli --help