🌲 LLVM Pass Development I: Branch-Pointer Trace

Gloria Su

2023-12-12

Coding

Frame 2 _3_.png
In the world of making computer programs run faster, it’s crucial to really understand how a program behaves while it’s running. Traditional methods mostly look at the code itself, trying to estimate how it will perform. However, this approach often misses the important influence of different inputs on how a program acts when it’s actually running.

The Key Points Detection project steps in to fill this gap by introducing a profiling tool. This tool is like a detective during the program’s execution, carefully keeping track of important events, especially focusing on decision points and function choices. This project is a big step forward in truly grasping how programs behave, which is crucial for making smart improvements to the code.

Key Points Detection

Let’s dive into a C program (let’s call it “fileX”) to see how Key Points Detection works in capturing branch and function pointer traces. The program has a function pointer and a loop, which can lead to different behaviors.

Line 1: int main() {
Line 2:     void (*fun_ptr)(int) = &fun;
Line 3:     int c = (*fun_ptr)(10);
Line 4:     
Line 5:     for (int i = 0; i < 3; i++) {
Line 6:          c = c + 1;
Line 7:     }
Line 8:     return c;
Line 9: }

The expected trace shows the occurrences of branches and function pointer values during the program’s execution. It looks like this:

*func_0x32576867
br_2
br_2
br_2
br_3

In this trace, “*func_0x32576867” represents a function pointer value, and “br_x” represents branch IDs. These serve as a detailed record of what’s happening in the program.
For example, the tool’s dictionary file could explain branch IDs like this:

1 2	br_2: fileX, 5, 6 br_3: fileX, 5, 8

Breaking it down, “br_2” indicates a branch in “fileX” starting at line 5 and going to line 6. Similarly, “br_3” is another branch in “fileX,” starting from line 5 and reaching line 8. The line numbers tell you where the branch starts, and the target line number shows where it ends.

In the provided code example, within the loop structure (lines 5-7), two branches are encountered due to the iterative nature of the loop. The trace appropriately captures these branches (br_2) multiple times, reflecting the dynamic execution of the program during each iteration of the loop.

Before Implementation

Learn LLVM here: LLVM

LLVM (Low-Level Virtual Machine) is a widely used compiler infrastructure project that provides a collection of modular and reusable compiler and toolchain technologies. LLVM IR (Intermediate Representation) serves as an essential intermediate step during the compilation process. Profiling program behavior, particularly counting branch executions, is crucial for understanding control flow decisions.

ID Generation:
- A unique branch ID is crafted by combining the opcode name and branch number.
- This ID serves as a distinct label for each encountered branch.
Dictionary Update and Counter Increment:
- The branch dictionary is updated with branch details, including file name, start line, and target line.
- The branch counter is incremented for generating a unique identifier for subsequent branches.
IR Instrumentation for Logging:
- Instrumentation code is inserted into the LLVM IR to call a logging function with the branch ID.
- The logging function captures and records information about branch executions.

Challenges & Solutions

Challenge:
LLVM, as a static analysis tool, does not inherently execute code during analysis. Consequently, it cannot directly count how many times a branch is taken during program execution.

Solution:
To address this challenge, the profiling tool must insert instrumentation into the LLVM IR to increment counters each time a branch is executed.

1
2
3

IRBuilder<> builder (&*branch->getFirstInsertionPt ());
Value* args [] = {builder.CreateGlobalStringPtr (branchID)};
builder.CreateCall (logFunc, args);

Challenge:
Printing branch IDs during Intermediate Representation (IR) analysis poses a significant challenge as LLVM primarily focuses on static analysis and does not inherently support direct printing functionalities during this phase.

Solution:
To address the challenge of printing branch IDs within the LLVM IR, a well-defined solution involves the incorporation of a custom logging function. The solution code defines a logPrint function that takes a string as input and prints it. This function is then declared and instantiated within the LLVM context, creating a bridge between the static analysis process and the ability to generate runtime-like outputs.

// ...
LLVMContext &Ctx = F.getContext();
std::vector<Type *> paramTypes = {Type::getInt8PtrTy(Ctx)};
Type *retType = Type::getVoidTy(Ctx);
FunctionType *logFuncType = FunctionType::get(retType, paramTypes, false);
FunctionCallee logFunc = F.getParent()->getOrInsertFunction("logPrint", logFuncType);
// ...
builder.CreateCall(logFunc, args);
// ...

Code Walkthrough

Check out my code in github.

Let’s break down the key points in the LLVM Pass code for branch execution counting, emphasizing the understanding of the critical code blocks:

Dictionary and Counter Initialization

1 2	std::map<std::string, std::tuple<std::string, int, int>> branchDictionary; int branchNumber = 0;

branchDictionary: A data structure to store information about each branch, identified by a unique branch ID. It holds details like the source file name, starting line, and target line.
branchNumber: An integer counter initialized to zero, responsible for assigning unique IDs to each encountered branch.

Function and Logging Setup

for (Function &F : M.functions ()) {
  // ...
  FunctionType *logFuncType = FunctionType::get(retType, paramTypes, false);
  FunctionCallee logFunc = F.getParent()->getOrInsertFunction("logPrint", logFuncType);
  // ...
}

The LLVM Pass iterates over each function in the module.
It sets up a logging function (logPrint) with the appropriate function type, preparing to log information about executed branches.

Branch Instrumentation

for (BasicBlock &B : F) {
  // ...
  for (Instruction &I : B) {
    if (auto *BI = dyn_cast<BranchInst>(&I)) {
      if (BI->isConditional()) {
        const DebugLoc &Loc = BI->getDebugLoc();
        // ...
      }
    }
  }
  // ...
}

Nested loops traverse basic blocks and instructions within functions.
Conditional branch instructions (BranchInst) are identified for instrumentation.
Debug location information is extracted to identify the source code location of the branch.

ID Generation and Logging

std::string branchID = opcodeName + "_" + std::to_string(branchNumber);
// ...
IRBuilder<> builder (&*branch->getFirstInsertionPt ());
Value* args[] = {builder.CreateGlobalStringPtr(branchID)};
builder.CreateCall(logFunc, args);
branchNumber += 1;

A unique branch ID is generated by combining the opcode name and the branch number.
The branchDictionary is then updated with information related to the current branch. This includes the source file name, starting line, and target line. Simultaneously, the branchNumber is incremented, ensuring the generation of a distinct identifier for the next encountered branch.
The LLVM IR is dynamically modified to insert instrumentation code. Using the IRBuilder, a call to the logging function (logFunc) is added at the beginning of the basic block containing the branch instruction. This call includes the branch ID as a parameter, facilitating the logging of branch-related information.

Input & Output

Input:

#include <stdio.h>

int c;

void fun(int a) {
  printf("Value of a is %d\n", a);
}

int main() {
  void (*fun_ptr)(int) = &fun;
  (*fun_ptr)(10);

  int b;
  for (c = 0; c < 3; c++) {
    b = c + 1;
  }

  return c;
}

Output:

main: func_0x147e4af98
Branch Dictionary:
br_0: fileX.c, 14, 15
br_1: fileX.c, 14, 18

Value of a is 10
br_0
br_0
br_0
br_1

Compile & Run

# Step 1: Create and navigate to the build directory
mkdir build
cd build

# Step 2: Run CMake and make with Debug flags
cmake -DCMAKE_BUILD_TYPE=Debug ..
make

# Step 3: Navigate back to the parent directory
cd ..

# Step 4: Compile logPrint.c with debug information and no optimization
cc -g -O0 -c logPrint.c

# Step 5: Compile fileX.c with the clang plugin, including debug information and no optimization
clang -fno-discard-value-names -fpass-plugin='build/skeleton/SkeletonPass.dylib' -c test2.c -g -O0

# Step 6: Link the object files and create the executable with debug information
cc -g test2.o logPrint.o -o a.out

# Step 7: Run the executable
./a.out

Conclusion

By successfully capturing and recording critical runtime factors including conditional branching points and function pointer calls, our profiling tool provides valuable insights into program executions. The generated Branch-Pointer Trace and associated dictionary offer a comprehensive overview of control flow dynamics, laying the groundwork for in-depth program analysis.

Looking ahead, the next phase of this project involves the implementation of a static analysis tool. This tool will build upon the existing codebase, aiming to automatically identify the inputs to a C program that specifically determine its behavior at key points. Focusing on conditional branching and function pointer calls, our goal is to streamline the identification process, facilitating a more automated and efficient understanding of program behaviors.

Frame 1 _3_.png

References:

LLVM for Grad Students

🔍 Check out my code in github.
📮 If find any errors, please feel free to discuss and correct them: bsu5@ncsu.edu.