FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving

Introduction

Formal verification (FV) has witnessed growing significance with current emerging program synthesis by the evolving large language models (LLMs). However, current formal verification mainly resorts to symbolic verifiers or hand-craft rules, resulting in limitations for extensive and flexible verification. On the other hand, formal languages for automated theorem proving, such as Isabelle, as another line of rigorous verification, are maintained with comprehensive rules and theorems. In this project, we propose FVEL, an interactive Formal Verification Environment with LLMs. Specifically, FVEL transforms a given code to be verified into Isabelle, and then conducts verification via neural automated theorem proving with an LLM. The joined paradigm leverages the rigorous yet abundant formulated and organized rules in Isabelle and is also convenient for introducing and adjusting cutting-edge LLMs. To achieve this goal, we extract a large-scale FVELer. The FVELer dataset includes code dependencies and verification processes that are formulated in Isabelle, containing 758 theories, 29,125 lemmas, and 200,646 proof steps in total with in-depth dependencies. We benchmark FVELer in the FVEL environment by first fine-tuning LLMs with FVELer and then evaluating them on Code2Inv and SV-COMP. The results show that FVEL with FVELer fine-tuned Llama3-8B solves 17.39% (69→81) more problems, and Mistral-7B 12% (75→84) more problems in SV-COMP. And the proportion of proof errors is reduced..

Workflow

FVEL workflow.

The main idea of the Logo FVEL environment is to provide an interactive environment with large language models (LLMs) that leverage rigorous theorem-proving processes. The input of Logo FVEL environment is a code to be verified. Specifically, we currently verify C code and conduct a pilot study on our new framework. Moreover, the input format is flexible as one can choose to input an ensemble of C code and its corresponding SIMPL and/or Isabelle content as supplements. The output of Logo FVEL is the code verification result.

Logo FVEL interacts with a large language model to achieve the verification. At the initial step of interaction (S₀), Logo FVEL transforms the input C code into facts, and then provides the facts to the LLM. The LLM then generates a lemma in Isabelle as a formal description of the code specification. In this step, a code verification problem is transformed into an ATP problem. As a result, Logo FVEL can leverage the LLMs theorem-proving techniques and rigorous ATP validation. At the follow-up interaction steps (S_i, i ≥ 1), the LLM is prompted to generate proof steps, while Logo FVEL incorporates an Isabell prover to provide feedback such as error messages to the LLM. The process terminates until a whole proof is generated. If the proof success in proving the lemma, Logo FVEL outputs "success", otherwise outputs "failure".

Overview

FVELer construction.

Logo FVELer contains transformed Isabelle theories and lemmas from C codes that support the FVEL environment for C code verification. It has two main components:

Theories dependencies. A resource for dependencies among theories, lemmas, and c code specified by SeL4 verification. These data provide the ground-truth seL4 premises for proving the current lemma and enable a model to retrieve related statements or proof context at both the training and testing stages.
Lemmas from theories with their Isabelle proof states. The step-wise lemmas with multiple proof states that support the Isabelle proving process in FVEL. These data on the one hand enhance LLMs with search-based/step-wise ATP while interacting with FVEL, and on the other hand, provide a benchmark for interactive formal verification.

You can download the dataset on GitHub.

Dataset statistics

The statistics of Logo FVELer dataset.

Logo FVELer contains contains 758 theories, 29,125 lemmas, and 200,646 proof steps in total with in-depth dependencies. We randomly split FVELer according to lemmas, resulting in a training set, a validation set, a test set, and an especially selected test-hard set. The test-hard set is selected from those lemmas in the three sessions SysInit, SysInitExamples, LibTest. Such lemmas are in higher depths in the dependency relationship, therefore they have less import relationships by other theories.

Figures below shows the distribution of extracted theories and lemmas by it's depth. where depth is the degree of the theory dependency graph by the import dependency relationship among the theory files.

Distribution of dependency by theory

Distribution of dependency by lemma.

Figure below demonstrates the distribution of intermediate proof steps of the 29,125 lemmas. It is indicated that the number of proof steps are dramatically different amount the lemmas. 12,089 out of the 29,125 lemmas can be proved via one proof step. Proof steps between 2 and 10 there are 12,957 lemmas. Therefore, over 85% of the lemmas in FVELER can be proved within 10 steps. Moreover, 28,954 out of the 29,125 lemmas can be proved within 100 steps

The FVELer lemma distribution over step intervals.

Dataset Examples

sel4_extraction/ is a folder that has the same structure as the sel4 verification project (l4v). Each file is the extracted step-wise proof state of the corresponding l4v theory files. For example, “sel4_extraction/proof/invariant-abstract/AInvs.json” is the proof state of the file l4v/proof/invariant-abstract/AInvs.thy.
dataset_lemma_split.json contains all lemmas proof steps and states, and splits them into the train, val, test, and test-hard set.
sel4_thy_info.json contains information of all theory files, including their names, dependency relations, and lemmas.
sel4_session_info.json contains all session information, including dependent sessions, theories, and directories.

sel4_extraction/ folder example.

dataset_lemma_split.json example.

sel4_thy_info.json example.

sel4_session_info.json example.

Given a formal verificaiton case that is converted to an Isabelle lemma as the input, The upper row shows a case in which FVEL-Mistral-7B correctly applies the lemma learned from fine-tuning, thus correcting and simplifying the proof. Contrastively, Mistral-7B generates common not_eq_complement without considering a reasonable proof strategy, resulting in a failed proof. In the second case, Mistral-7B rewrites the lemma statement into “assumes” and “shows” statements, according to which gives an incorrect proof. FVEL-Mistral-7B, on the other hand, expands the brackets in the equation and then is able to derive contradiction according to “(bAND NOT b)”, and completes the proof via the contradiction of the right-hand side of the equation.

Generated proofs by Mistral-7B and FVEL-Mistral-7B.

BibTeX


      @inproceedings{lin2024fvel,
        title={{FVEL}: Interactive Formal Verification Environment with Large Language Models via Theorem Proving},
        author={Xiaohan Lin* and Qingxing Cao* and Yinya Huang* and Haiming Wang* and Jianqiao Lu and Zhengying Liu and Linqi Song and Xiaodan Liang},
        booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
        year={2024},
        url={https://openreview.net/forum?id=d0gMFgrYFB}
    }