TVM
Expression
TE: Tensor Expression, a DSL (domain-specific language)
Relay: TVM's high-level model language
IR: intermediate representation
TIR: Tensor Intermediate Representation, TVM's low-level intermediate representation
Schedule: how to execute the computation
Stage: the schedule for one operation
TOPI: TVM Operator Inventory, numpy-style generic operations and schedules
Demo
from tvm.driver import tvmc
# Step 1: Load the ONNX model
model = tvmc.load('resnet50-v2-7.onnx', shape_dict={'data': [1, 3, 224, 224]})
# Step 1.5: Tune (optional)
tuning_records = tvmc.tune(model, target="llvm")
# Step 2: Compile
package = tvmc.compile(model, target="llvm", tuning_records=tuning_records)
# Step 3: Run
result = tvmc.run(package, device="cpu")
print(result)
- IRModule: relay.Function + tir.PrimFunc
- tvmc.compile: relay::Function --> tir::PrimFunc
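For reference, a minimal sketch of what the Relay half of that IRModule looks like when the same ONNX model is imported directly through the Relay frontend (assumes the onnx package is installed and the model file is present):
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load("resnet50-v2-7.onnx")
mod, params = relay.frontend.from_onnx(onnx_model, shape={"data": [1, 3, 224, 224]})
print(mod)  # an IRModule whose "main" is a relay.Function spanning the whole graph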
Why TVM?
tvmc.tune
makes the model run faster.
How?
Through all of the optimizations below, especially
AutoTVM (template-based) or AutoScheduler (Ansor, template-free auto-tuning).
TVM optimizing compiler workflow
- TF/PyTorch/ONNX
- Relay (High-level IR)
- TE (Computation definition)
- AutoTVM/AutoScheduler (Auto-tuning module)
- TE + Schedule (Optimization specification)
- TIR (Low-level IR)
- Machine Code
Workflow walkthrough
- Input data format: ONNX is preferred
- TVM's high-level API operates on the computation graph
- Relay generates subgraphs via the FuseOps pass; schedule primitives optimize the low-level loops, and the TVM Operator Inventory (TOPI) handles common ops; the result is lowered to TE
- AutoTVM (template-based) or AutoScheduler (template-free auto-tuning) searches for the best schedule
- Tuning records are written in JSON format, containing the best schedules
- TIR is generated, with support for mainstream backends via LLVM/NVCC
- Machine code is generated
low-level loop optimizations: tiling, vectorization, parallelization, unrolling, and fusion
TVM Auto-scheduler (a.k.a. Ansor)
package tvm.auto_scheduler
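A minimal sketch of driving the auto-scheduler from Python; the matmul workload, trial budget, and log-file name below are illustrative choices, not part of these notes:
import tvm
from tvm import te, auto_scheduler

@auto_scheduler.register_workload
def matmul(N, M, K):
    A = te.placeholder((N, K), name="A")
    B = te.placeholder((K, M), name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    return [A, B, C]

target = tvm.target.Target("llvm")
task = auto_scheduler.SearchTask(func=matmul, args=(128, 128, 128), target=target)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=64,  # small budget, for illustration only
    measure_callbacks=[auto_scheduler.RecordToFile("matmul.json")],  # JSON tuning records
    verbose=0,
)
task.tune(tune_option)
sch, args = task.apply_best("matmul.json")  # best schedule found so far
print(tvm.lower(sch, args, simple_mode=True))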
Schedule Primitives
How do we get a high-performance kernel?
TE
import tvm
from tvm import te

# Symbolic shapes
m = te.var('m')
n = te.var('n')
# Input tensors and an element-wise multiply
a = te.placeholder((m, n), name='A')
b = te.placeholder((m, n), name='B')
c = te.compute((m, n), lambda i, j: a[i, j] * b[i, j], name='C')
# Default (naive) schedule
s = te.create_schedule([c.op])
tgt = tvm.target.Target(target="llvm", host="llvm")
# Build a runnable module, then inspect the generated code and the lowered TIR
mult = tvm.build(s, [a, b, c], target=tgt, name="mult")
print(mult.get_source())
print(tvm.lower(s, [a, b, c], simple_mode=True))
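Continuing the example, a sketch of applying a few schedule primitives to the default schedule and lowering again to see their effect (the split factor 4 is arbitrary, not tuned):
xo, xi = s[c].split(c.op.axis[0], factor=4)  # split the i loop into outer/inner
s[c].reorder(xo, c.op.axis[1], xi)           # move the small inner chunk inside j
s[c].parallel(xo)                            # parallelize the outer loop
s[c].unroll(xi)                              # unroll the small inner loop
print(tvm.lower(s, [a, b, c], simple_mode=True))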
tvm.build
tvm.te.schedule.Schedule, tvm.tir.PrimFunc, IRModule, Mapping[str, IRModule] --> tvm.runtime.Module
A module that combines both host and device code
tvm.lower
tvm.te.schedule.Schedule, tvm.tir.PrimFunc, IRModule --> IRModule
Demo: IRModule transformation
import tvm
from tvm.ir.module import IRModule
from tvm.script import tir as T
import numpy as np
from tvm import te
A = te.placeholder((8,), dtype="float32", name="A")
B = te.compute((8,), lambda *i: A(*i) + 1.0, name="B")
func = te.create_prim_func([A, B])
ir_module = IRModule({"main": func})
print(ir_module.script())
"""
# from tvm.script import tir as T
@tvm.script.ir_module
class Module:
@T.prim_func
def main(A: T.Buffer[8, "float32"], B: T.Buffer[8, "float32"]) -> None:
# function attr dict
T.func_attr({"global_symbol": "main", "tir.noalias": True})
# body
# with T.block("root")
for i0 in T.serial(8):
with T.block("B"):
i0_1 = T.axis.spatial(8, i0)
T.reads(A[i0_1])
T.writes(B[i0_1])
B[i0_1] = A[i0_1] + T.float32(1)
"""
# <class 'tvm.driver.build_module.OperatorModule'>
mod = tvm.build(ir_module, target="llvm") # The module for CPU backends.
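# Quick sanity run of the CPU module (illustrative, not part of the original demo):
# the PrimFunc "main" takes the A and B buffers of shape (8,).
a_nd = tvm.nd.array(np.arange(8, dtype="float32"))
b_nd = tvm.nd.array(np.zeros(8, dtype="float32"))
mod["main"](a_nd, b_nd)
print(b_nd.numpy())  # -> [1. 2. 3. 4. 5. 6. 7. 8.]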
# <class 'tvm.tir.schedule.schedule.Schedule'>
sch = tvm.tir.Schedule(ir_module)
block_b = sch.get_block("B")
(i,) = sch.get_loops(block_b)
i_0, i_1, i_2 = sch.split(i, factors=[2, 2, 2])
print(sch.mod.script())
"""
# from tvm.script import tir as T
@tvm.script.ir_module
class Module:
@T.prim_func
def main(A: T.Buffer[8, "float32"], B: T.Buffer[8, "float32"]) -> None:
# function attr dict
T.func_attr({"global_symbol": "main", "tir.noalias": True})
# body
# with T.block("root")
for i0_0, i0_1, i0_2 in T.grid(2, 2, 2):
with T.block("B"):
i0 = T.axis.spatial(8, i0_0 * 4 + i0_1 * 2 + i0_2)
T.reads(A[i0])
T.writes(B[i0])
B[i0] = A[i0] + T.float32(1)
"""
sch.reorder(i_0, i_2, i_1)
print(sch.mod.script())
"""
# from tvm.script import tir as T
@tvm.script.ir_module
class Module:
@T.prim_func
def main(A: T.Buffer[8, "float32"], B: T.Buffer[8, "float32"]) -> None:
# function attr dict
T.func_attr({"global_symbol": "main", "tir.noalias": True})
# body
# with T.block("root")
for i0_0, i0_2, i0_1 in T.grid(2, 2, 2):
with T.block("B"):
i0 = T.axis.spatial(8, i0_0 * 4 + i0_1 * 2 + i0_2)
T.reads(A[i0])
T.writes(B[i0])
B[i0] = A[i0] + T.float32(1)
"""
sch.bind(i_0, "blockIdx.x")
sch.bind(i_2, "threadIdx.x")
print(sch.mod.script())
"""
# from tvm.script import tir as T
@tvm.script.ir_module
class Module:
@T.prim_func
def main(A: T.Buffer[8, "float32"], B: T.Buffer[8, "float32"]) -> None:
# function attr dict
T.func_attr({"global_symbol": "main", "tir.noalias": True})
# body
# with T.block("root")
for i0_0 in T.thread_binding(2, thread="blockIdx.x"):
for i0_2 in T.thread_binding(2, thread="threadIdx.x"):
for i0_1 in T.serial(2):
with T.block("B"):
i0 = T.axis.spatial(8, i0_0 * 4 + i0_1 * 2 + i0_2)
T.reads(A[i0])
T.writes(B[i0])
B[i0] = A[i0] + T.float32(1)
"""
Relay
relay.build() returns
- the execution graph in JSON format
- a TVM module library of compiled functions specifically for this graph on the target hardware
- the parameter blobs of the model
During compilation, Relay performs the graph-level optimizations and TVM performs the tensor-level optimizations, resulting in an optimized runtime module for model serving.
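To make that concrete, a sketch of compiling and serving a Relay IRModule with the graph executor; it assumes mod and params came from a frontend importer such as relay.frontend.from_onnx:
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)  # graph + compiled lib + params

dev = tvm.cpu(0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("data", np.random.rand(1, 3, 224, 224).astype("float32"))
module.run()
out = module.get_output(0).numpy()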
Relay vs. TE
tvm.tir.sqrt(x: tvm.ir.PrimExpr) -> tvm.ir.PrimExpr
# Alias of tvm.tir.sqrt()
tvm.te.sqrt(x: tvm.ir.PrimExpr) -> tvm.ir.PrimExpr
tvm.relay.sqrt(data: tvm.ir.RelayExpr) -> tvm.ir.RelayExpr
For tvm.ir.BaseExpr,
- PrimExpr is the base class of all primitive expressions, used in low-level code optimization and integer analysis.
- RelayExpr is the base class of all non-primitive expressions.
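A tiny illustration of the two expression kinds (the variable names are made up for the example):
import tvm
from tvm import te, relay

x = te.var("x", dtype="float32")
print(type(tvm.tir.sqrt(x)))    # a PrimExpr (tir.Call) over a scalar value

data = relay.var("data", shape=(4,), dtype="float32")
print(type(relay.sqrt(data)))   # a RelayExpr (relay.Call) over a whole tensor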
Build
tvm.build
vs. tvm.relay.build
tvm.relay.build(ir_mod: IRModule, target, target_host, executor=Executor("graph"), runtime=Runtime("cpp"))
-> tvm.relay.backend.executor_factory.ExecutorFactoryModule
tvm.build(inputs: Union[tvm.te.schedule.Schedule, tvm.tir.function.PrimFunc, tvm.ir.module.IRModule, Mapping[str, tvm.ir.module.IRModule]], args, target, target_host, runtime, binds)
-> tvm.driver.build_module.OperatorModule
Optimization
Blocking, Cache
Splitting the computation into blocks lets intermediate results stay in the cache and be reused, which improves performance.
Vectorization, Array Packing
Contiguous memory access is more efficient; vectorize achieves exactly that. Array packing of a matrix performs the same kind of optimization.
Parallelization
When there are no data dependencies, parallelization can be used to improve overall performance; the sketch below combines blocking, vectorization, and parallelization.
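A compact sketch tying the three ideas together on a 1024x1024 matmul in TE; the block size and split factor are reasonable defaults, not tuned values:
import tvm
from tvm import te

M = N = K = 1024
bn = 32  # block (tile) size

A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)
# Blocking: keep a bn x bn tile of C resident in cache while reducing over k
io, jo, ii, ji = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
ko, ki = s[C].split(k, factor=4)
s[C].reorder(io, jo, ko, ki, ii, ji)
# Vectorization: the innermost j loop walks contiguous memory
s[C].vectorize(ji)
# Parallelization: outer row blocks are independent of each other
s[C].parallel(io)

func = tvm.build(s, [A, B, C], target="llvm", name="matmul_blocked")
print(tvm.lower(s, [A, B, C], simple_mode=True))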
TOPI
import tvm
from tvm import te, topi

n = te.var("n")
m = te.var("m")
A = te.placeholder((n, m), name="A")
Plain TE version
k = te.reduce_axis((0, m), "k")
B = te.compute((n,), lambda i: te.sum(A[i, k], axis=k), name="B")
s = te.create_schedule(B.op)
# print(tvm.lower(s, [A], simple_mode=True))
TOPI version
C = topi.sum(A, axis=1)
ts = te.create_schedule(C.op)
# print(tvm.lower(ts, [A], simple_mode=True))
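Building and checking the TOPI version (a sketch; the 4x5 test shape is arbitrary):
import numpy as np
func = tvm.build(ts, [A, C], target="llvm", name="row_sum")
a_np = np.random.rand(4, 5).astype("float32")
c_nd = tvm.nd.array(np.zeros(4, dtype="float32"))
func(tvm.nd.array(a_np), c_nd)
np.testing.assert_allclose(c_nd.numpy(), a_np.sum(axis=1), rtol=1e-5)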
Develop
- write a pass with IR
- glue to lowering
The transformation performed in each phase is as follows:
- Phase 0 generates the raw IR and loop levels.
- Phase 1 flattens the array storage.
- Phase 2 transforms loops, like unroll, vectorization and thread-binding.
- Phase 3 does some cleanup work.
So, for example, a custom vectorize pass is best inserted after phase 1.
with tvm.transform.PassContext(config={"tir.add_lower_pass": [(1, vectorize)]}):
print(tvm.lower(sch, [a, b, c]))
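A self-contained sketch of the mechanism: the count_loops pass below is a made-up placeholder that only inspects the lowered PrimFunc; a real custom vectorizer would rewrite f.body with tir.stmt_functor.ir_transform instead of returning it unchanged.
import tvm
from tvm import te, tir

@tvm.tir.transform.prim_func_pass(opt_level=0)
def count_loops(f, mod, ctx):
    # Count For nodes in the PrimFunc lowered so far, then return it unchanged.
    loops = [0]
    def visit(stmt):
        if isinstance(stmt, tir.For):
            loops[0] += 1
    tir.stmt_functor.post_order_visit(f.body, visit)
    print("PrimFunc has", loops[0], "loop(s) after phase 1")
    return f

n = te.var("n")
a = te.placeholder((n,), name="a")
b = te.placeholder((n,), name="b")
c = te.compute((n,), lambda i: a[i] + b[i], name="c")
sch = te.create_schedule(c.op)

# Hook the pass in after lowering phase 1 (i.e. after array storage is flattened).
with tvm.transform.PassContext(config={"tir.add_lower_pass": [(1, count_loops)]}):
    print(tvm.lower(sch, [a, b, c]))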