Loop-Closed SSA (LCSSA)

You’ll use LCSSA to make loop exits “well-formed”: any value defined inside a loop and used outside it flows through a PHI in an exit block. This isolates loop internals and simplifies or strengthens LICM, loop-rotate, vectorization, SCEV-based analyses, etc. It’s also very handy around OpenMP parallel for and lastprivate/reduction lowering.

Below is a practical, LLVM-20-style tutorial that goes from concept ⇒ IR ⇒ algorithms ⇒ opt pipelines ⇒ OpenMP specifics ⇒ integration into your own pass.

0) What LCSSA is

LCSSA property: For every instruction I defined in loop L, every use of I is either inside L or is a PHI node in an exit block of L. Any other code after the loop reads the value through that PHI, never I directly.

Intuition: Replace “random uses after the loop” by “exactly one PHI on the loop exit path,” so later passes reason locally at exits.

Key aspects of the `phi` instruction:

SSA Requirement: SSA form dictates that every variable is assigned a value exactly once. When control flow merges from multiple paths (e.g., after an if-else statement or a loop), a single variable might have different values depending on which path was taken. The phi instruction resolves this by creating a new, unique variable that “selects” the correct value based on the predecessor basic block.
Selection of Value: A phi instruction takes a list of pairs, where each pair consists of a value and the basic block from which that value originates. When the phi instruction is executed, it chooses the value associated with the basic block that was the immediate predecessor in the control flow.
Example:
```
    %x = phi i32 [ %inc, %then ], [ %dec, %else ]
```
In this example, if the control flow came from the %then basic block, %x will be assigned the value of %inc. If the control flow came from the %else basic block, %x will be assigned the value of %dec.

Placement: phi instructions must be grouped at the very beginning of a basic block, before any non-phi instruction in that block.

Purpose: phi instructions enable the compiler to perform various optimizations by ensuring a clear and unambiguous definition for every variable, even across control flow merges. They are essential for accurate dataflow analysis and transformations.
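
The selection rule described above can be sketched in a few lines of C++. This is only a toy model, not LLVM's PHINode class: ToyPhi, select, and evalX are hypothetical names, and the incoming values are assumed to be already-computed integers.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy model of a phi node: a list of (incoming value, predecessor block) pairs.
struct ToyPhi {
  std::vector<std::pair<int, std::string>> incoming;

  // Return the value paired with the block control flow just came from.
  int select(const std::string &pred) const {
    for (const auto &entry : incoming)
      if (entry.second == pred)
        return entry.first;
    throw std::invalid_argument("no incoming value for predecessor " + pred);
  }
};

// Mirrors: %x = phi i32 [ %inc, %then ], [ %dec, %else ]
// with %inc and %dec already evaluated to concrete integers.
int evalX(int inc, int dec, const std::string &cameFrom) {
  ToyPhi x;
  x.incoming = {{inc, "then"}, {dec, "else"}};
  return x.select(cameFrom);
}
```

Calling evalX(11, 9, "then") yields the %then value and evalX(11, 9, "else") yields the %else value; a predecessor that is not listed is rejected, which matches the rule that a phi needs one entry per predecessor.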

1) Before/After example (by hand)

int f(int *a, int n) {
  int s = 0;
  for (int i = 0; i < n; ++i)
    s += a[i];         // value defined inside loop
  return s;            // used outside loop
}

Pre-LCSSA

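; simplified: assumes n > 0, so entry branches straight into for.body
for.body:
  %i      = phi i32 [ 0, %entry ], [ %i.next, %latch ]
  %s.cur  = phi i32 [ 0, %entry ], [ %s.next, %latch ]
  %aptr   = getelementptr inbounds i32, ptr %a, i32 %i
  %val    = load i32, ptr %aptr
  %s.next = add i32 %s.cur, %val
  %i.next = add i32 %i, 1
  %cond   = icmp slt i32 %i.next, %n
  br i1 %cond, label %latch, label %exit

latch:
  br label %for.body

exit:
  ret i32 %s.next          ; cross-loop use of %s.next (bad for LCSSA)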

Post-LCSSA

exit:
  %s.next.lcssa = phi i32 [ %s.next, %for.body ] ; single cross-loop use is the PHI
  ret i32 %s.next.lcssa

2) The algorithm (the minimal version)

Given a loop L, DT, LI:

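Collect every instruction I ∈ L that has at least one use outside L.
For each such I and each exit block E of L that the definition of I dominates:
- Create %I.lcssa = phi [I, P1], [I, P2], ... in E, with one incoming entry per predecessor P of E (with dedicated exits, every such P is inside L).
Redirect every use of I outside L (other than the new PHIs) to the %I.lcssa that reaches it. When several exits can reach the same use, the PHIs are merged by ordinary SSA construction (LLVM uses SSAUpdater for this).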
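
The steps above can be sketched on a toy IR for the simplest case: one value and one exit block. This is an illustration under stated assumptions, not LLVM's implementation: ToyUse, ToyPhi, and closeValue are hypothetical names, blocks and values are plain strings, and the definition is assumed to dominate every exiting predecessor.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy IR: a use records the block it sits in and the value it reads.
struct ToyUse {
  std::string userBlock;
  std::string operand;
};

struct ToyPhi {
  std::string name;
  std::string block;
  std::vector<std::pair<std::string, std::string>> incoming; // (value, pred)
};

// Close one value `def` over a loop with a single exit block.
// Assumes `def` dominates every in-loop predecessor of `exitBlock`.
// Returns true and fills `phi` if any out-of-loop use was rewritten.
bool closeValue(const std::string &def,
                const std::set<std::string> &loopBlocks,
                const std::string &exitBlock,
                const std::vector<std::string> &exitingPreds,
                std::vector<ToyUse> &uses, ToyPhi &phi) {
  // Step 1: does `def` have a use outside the loop?
  bool escapes = false;
  for (const ToyUse &u : uses)
    if (u.operand == def && loopBlocks.count(u.userBlock) == 0)
      escapes = true;
  if (!escapes)
    return false;

  // Step 2: one PHI in the exit block, one incoming entry per exiting pred.
  phi.name = def + ".lcssa";
  phi.block = exitBlock;
  phi.incoming.clear();
  for (const std::string &pred : exitingPreds)
    phi.incoming.push_back({def, pred});

  // Step 3: redirect every out-of-loop use to the new PHI.
  for (ToyUse &u : uses)
    if (u.operand == def && loopBlocks.count(u.userBlock) == 0)
      u.operand = phi.name;
  return true;
}
```

Applied to the earlier example (loop blocks for.body and latch, exit block exit), the use of s.next in exit is rewritten to s.next.lcssa while the in-loop use is left alone. Multiple exits and nested loops need the dominance and SSA-merging logic that this sketch deliberately leaves out.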

3) Verifying LCSSA

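A loop L is in LCSSA form if no instruction from L has an out-of-loop use except LCSSA PHIs directly in exit blocks. LLVM exposes this as L->isLCSSAForm(DT) for a single loop and L->isRecursivelyLCSSAForm(DT, LI) for a loop together with its subloops. Note that the generic verify pass checks IR well-formedness (including SSA dominance), not the LCSSA property itself.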
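
As a simplified restatement of that property (not LLVM's implementation, which works on real use lists and looks at a PHI's incoming block), the check can be sketched as follows. UseSite and isToyLCSSA are hypothetical names, and the caller is assumed to have already collected the uses of loop-defined values.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical record of one use of a value that is defined inside the loop.
struct UseSite {
  std::string block;  // block containing the using instruction
  bool isPhi;         // true if the using instruction is a phi
};

// Returns true if every use is either inside the loop or a phi that sits
// in one of the loop's exit blocks.
bool isToyLCSSA(const std::set<std::string> &loopBlocks,
                const std::set<std::string> &exitBlocks,
                const std::vector<UseSite> &uses) {
  for (const UseSite &u : uses) {
    bool inLoop = loopBlocks.count(u.block) != 0;
    bool exitPhi = u.isPhi && exitBlocks.count(u.block) != 0;
    if (!inLoop && !exitPhi)
      return false;
  }
  return true;
}
```

For the hand-written example, the pre-LCSSA use by ret in exit fails the check, and the post-LCSSA use by the PHI in exit passes it.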

4) Using `opt` (new PassManager style)

Common pipelines used in the wild:

# Minimal: canonicalize loops, then LCSSA
opt -passes='loop-simplify,lcssa' -S in.ll -o out.ll

# With verification (handy while debugging)
opt -passes='loop-simplify,lcssa,verify' -S in.ll -o out.ll

# A loop-friendly starter pack
opt -passes='mem2reg,instcombine,simplifycfg,loop-simplify,lcssa,indvars,licm' -S in.ll -o out.ll

Notes:

loop-simplify gives preheaders + dedicated exits; it makes LCSSA formation straightforward.
Many loop passes preserve LCSSA or materialize it on demand, but doing it explicitly avoids surprises.

5) Writing a pass that assumes LCSSA (LLVM 20, new PassManager)

If you want to rely on LCSSA inside your loop/function pass, either:

Run lcssa before your pass in the pipeline, or
Call formLCSSA yourself per loop.

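#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/IR/Dominators.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Transforms/Utils/LoopUtils.h"   // formLCSSA, formLCSSARecursively
using namespace llvm;

struct MyLoopXform : PassInfoMixin<MyLoopXform> {
  PreservedAnalyses run(Function &F, FunctionAnalysisManager &FAM) {
    auto &LI = FAM.getResult<LoopAnalysis>(F);
    auto &DT = FAM.getResult<DominatorTreeAnalysis>(F);
    // SE is optional (may be null); if present, formLCSSA keeps it up to date.
    auto *SE = FAM.getCachedResult<ScalarEvolutionAnalysis>(F);

    // Iterating LI visits top-level loops only, so close each nest recursively.
    for (Loop *L : LI) {
      if (!L->isRecursivelyLCSSAForm(DT, LI))
        formLCSSARecursively(*L, DT, &LI, SE);
      // ... your loop optimization that relies on LCSSA ...
    }
    return PreservedAnalyses::none();
  }
};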

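If you write a loop pass instead, note that the new PassManager’s function-to-loop adaptor already runs loop-simplify and lcssa before your pass, so you can assume both forms on entry. If your transform breaks LCSSA, call formLCSSA(L, AR.DT, &AR.LI, &AR.SE) before returning.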

6) OpenMP constructs: what changes and what to do

Clang lowers #pragma omp parallel for by outlining the parallel region (loop included) into a helper function and calling into the OpenMP runtime (e.g., __kmpc_fork_call; OpenMPIRBuilder plays this role on the LLVM side). Consequences:

The for-loop you care about usually lives inside the outlined worker function.
If you run your pipeline only on the parent function, you will miss the loop.
lastprivate and reduction create exactly the kind of “value defined in loop, used after region” scenarios LCSSA excels at (per-thread partials flow to a merge point).

OpenMP example: `lastprivate`

int g(int *a, int n) {
  int last = -1;
  #pragma omp parallel for lastprivate(last)
  for (int i = 0; i < n; ++i)
    last = a[i] * 2;

  return last; // needs the "last iteration’s" value after region
}

Lowering will:

Outline the parallel region, loop included, into a worker function.
Inside the worker, last is a thread-private, loop-local value; at region exit, a runtime-controlled write-back copies the result of the sequentially last iteration into the shared last.
Running LoopSimplify + LCSSA in the outlined function ensures the loop’s exiting edges produce clean PHIs; this makes downstream opts (e.g., LICM, indvars, vectorization for omp simd) reliable.
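
To make the shape of that lowering concrete, here is a sequential C++ emulation. It is a sketch under assumptions, not what Clang emits: outlinedWorker and lastprivateEmulated are hypothetical names, the "threads" run one after another, and the isLastChunk flag stands in for the runtime's "this thread ran the final iteration" indicator.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for the outlined worker: runs iterations [lo, hi).
void outlinedWorker(const int *a, int lo, int hi, bool isLastChunk,
                    int *sharedLast) {
  int last = -1;  // thread-private copy of `last`
  for (int i = lo; i < hi; ++i)
    last = a[i] * 2;
  // `last` is defined in the loop and read after it: once promoted to SSA,
  // this read is where an LCSSA PHI would sit.
  if (isLastChunk)
    *sharedLast = last;
}

// Emulates g() with a static block schedule over numThreads workers.
// Assumes numThreads >= 1.
int lastprivateEmulated(const int *a, int n, int numThreads) {
  int last = -1;
  int chunk = (n + numThreads - 1) / numThreads;
  for (int t = 0; t < numThreads; ++t) {
    int lo = std::min(n, t * chunk);
    int hi = std::min(n, lo + chunk);
    // Only the worker whose non-empty chunk ends at n writes back.
    bool isLastChunk = lo < hi && hi == n;
    outlinedWorker(a, lo, hi, isLastChunk, &last);
  }
  return last;
}
```

Whatever the thread count, the result equals the value from iteration n - 1, and an empty loop leaves last at its initial value.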

Practical steps to apply LCSSA to OpenMP code:

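1. Produce IR with OpenMP (host example). At -O0 Clang marks every function optnone, which makes opt skip optional passes such as lcssa, so disable that:

clang -O0 -Xclang -disable-O0-optnone -fopenmp -emit-llvm -S omp_lastprivate.c -o omp_lastprivate.ll

2. Run on all functions (including outlined workers). Put mem2reg first, because -O0 IR keeps locals in allocas and there are no SSA values to close yet:

opt -passes='mem2reg,loop-simplify,lcssa,verify' -S omp_lastprivate.ll -o omp_lastprivate.lcssa.ll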

3. Optional: inspect loops to confirm:

opt -passes='print<loops>' -disable-output omp_lastprivate.lcssa.ll

4. For device/offload builds (`-fopenmp-targets=...`)

You will get additional device modules; ensure your pipeline runs on those as well (same passes, same reasoning).

5. In an O2-ish pipeline that keeps OpenMP transforms friendly:

opt -passes='mem2reg,sroa,instcombine,simplifycfg,loop-simplify,lcssa,indvars,licm,loop-rotate,gvn,simplifycfg' -S omp.ll -o omp.opt.ll

7) Common pitfalls (esp. with OpenMP)

Not simplified loops: Without preheaders/dedicated exits, LCSSA insertion is messy. Always run loop-simplify first.
Exceptional exits: If your loop can unwind (e.g., C++ exceptions), there may be EH exits; LCSSA PHIs appear in those exit blocks as well.
Debug uses: Debug intrinsics can look like out-of-loop uses. LLVM’s utilities typically ignore them; if you’re rolling your own, filter dbg uses.
Reduction vs lastprivate:
- Reductions are typically handled via per-thread partials that the runtime merges; the hot inner loop inside the worker is still just a normal loop—LCSSA helps classic opts there.
- lastprivate models an actual data flow from the loop’s “last iteration” to the region’s continuation. LCSSA at loop exits keeps that clean.
Outlined function blind spot: Ensure your pass manager/pipeline runs over all functions, not just main/call sites.
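
The reduction case from the list above can be emulated the same way. This is a sequential sketch with hypothetical names (partialSum, reduceEmulated); the real runtime merges partials through its own reduction machinery, which is not modeled here.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical outlined worker for reduction(+:s): a plain loop over
// [lo, hi) that returns this thread's partial sum.
int partialSum(const int *a, int lo, int hi) {
  int partial = 0;  // per-thread private accumulator
  for (int i = lo; i < hi; ++i)
    partial += a[i];
  return partial;   // loop-defined value read after the loop
}

// Stand-in for the runtime merge: combine the partials in thread order.
// Assumes numThreads >= 1.
int reduceEmulated(const int *a, int n, int numThreads) {
  int chunk = (n + numThreads - 1) / numThreads;
  int s = 0;
  for (int t = 0; t < numThreads; ++t) {
    int lo = std::min(n, t * chunk);
    int hi = std::min(n, lo + chunk);
    s += partialSum(a, lo, hi);
  }
  return s;
}
```

The loop inside partialSum is the "normal loop" the pitfall refers to: the accumulator is the only value that leaves it, so it is the one LCSSA would close.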

8) Minimal test you can run today

Source (omp_sum.c)

#include <omp.h>
int sum_last(int *a, int n) {
  int s = 0, last = -1;
  #pragma omp parallel for reduction(+:s) lastprivate(last)
  for (int i = 0; i < n; ++i) {
    s    += a[i];
    last  = a[i] * 2;
  }
  return s + last;
}

IR + LCSSA

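clang -O0 -Xclang -disable-O0-optnone -fopenmp -emit-llvm -S omp_sum.c -o omp_sum.ll
opt -passes='mem2reg,loop-simplify,lcssa,verify' -S omp_sum.ll -o omp_sum.lcssa.ll

Here -disable-O0-optnone keeps opt from skipping the passes on -O0 output, and mem2reg promotes allocas so there are SSA values to close. Open omp_sum.lcssa.ll and locate the outlined worker (Clang typically gives it a name containing omp_outlined). For loop-defined values that mem2reg managed to promote, you should see .lcssa-suffixed PHIs in the loop exit blocks feeding the code that follows the loop.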
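
When checking that a pipeline did not change behavior, a serial reference for sum_last is handy to compare against. The sketch below is only that: sumLastSerial is a hypothetical helper name, and it simply runs the same loop without the pragma.

```cpp
#include <cassert>

// Serial reference for sum_last: same loop body, no OpenMP.
int sumLastSerial(const int *a, int n) {
  int s = 0, last = -1;
  for (int i = 0; i < n; ++i) {
    s += a[i];        // what reduction(+:s) must add up to
    last = a[i] * 2;  // what lastprivate(last) must deliver
  }
  return s + last;
}
```

For any input, the OpenMP build of sum_last should return the same value as this reference, before and after running the LCSSA pipeline.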

9) Dropping LCSSA on the floor (and fixing it)

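Some transforms break LCSSA (e.g., aggressive CFG changes). If your pass needs LCSSA after such transforms, simply call:

formLCSSA(*L, DT, &LI, SE);             // for one loop (subloops must already be in LCSSA form)
// or
formLCSSARecursively(*L, DT, &LI, SE);  // for a loop and all of its subloops

Here SE is a ScalarEvolution pointer and may be null. To cover a whole function, call formLCSSARecursively on each top-level loop in LI, then continue with the transforms that assume LCSSA.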

10) Quick checklist

Ensure SSA (mem2reg / sroa).
loop-simplify (preheaders, single backedge, dedicated exits).
lcssa (or formLCSSA in your pass).
verify while developing.
Run the pipeline on outlined OpenMP worker functions (host and device).
For passes that may disrupt exits, re-formLCSSA before analyses/opts that assume it.
