TVM
Expression
TE: Tensor Expression, a DSL (domain-specific language)
Relay: TVM's high-level model language
IR: intermediate representation
TIR: Tensor Intermediate Representation, TVM's low-level intermediate representation
Schedule: how to execute the computation
Stage: the schedule for one operation
TOPI: TVM Operator Inventory, numpy-style generic operations and schedules
Demo
from tvm.driver import tvmc
# Step 1: Load the ONNX model
model = tvmc.load('resnet50-v2-7.onnx', shape_dict={'data': [1, 3, 224, 224]})
# Step 1.5: Tune (optional)
tuning_records = tvmc.tune(model, target="llvm")
# Step 2: Compile
package = tvmc.compile(model, target="llvm", tuning_records=tuning_records)
# Step 3: Run
result = tvmc.run(package, device="cpu")
print(result)
- IRModule: relay.Function + tir.PrimFunc
- tvmc.compile: relay::Function --> tir::PrimFunc
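For reference, a minimal sketch of what the Relay half of that IRModule looks like when the same ONNX model is imported directly through the Relay frontend (assumes the onnx package is installed and the model file is present):
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load("resnet50-v2-7.onnx")
mod, params = relay.frontend.from_onnx(onnx_model, shape={"data": [1, 3, 224, 224]})
print(mod)  # an IRModule whose "main" is a relay.Function spanning the whole graph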
Why TVM?
tvmc.tune
makes the model run faster.
How?
Through all of the optimizations below, especially
AutoTVM (template-based) or AutoScheduler (Ansor, template-free auto-tuning).
TVM optimizing compiler workflow
- TF/PyTorch/ONNX
- Relay (High-level IR)
- TE (Computation definition)
- AutoTVM/AutoScheduler (Auto-tuning module)
- TE + Schedule (Optimization specification)
- TIR (Low-level IR)
- Machine Code
Workflow walkthrough
- Input data format: ONNX is preferred
- TVM's high-level API operates on the computation graph
- Relay generates subgraphs via the FuseOps pass; schedule primitives optimize the low-level loops, and the TVM Operator Inventory (TOPI) handles common ops; the result is lowered to TE
- AutoTVM (template-based) or AutoScheduler (template-free auto-tuning) searches for the best schedule
- Tuning records are written in JSON format, containing the best schedules
- TIR is generated, with support for mainstream backends via LLVM/NVCC
- Machine code is generated
low-level loop optimizations: tiling, vectorization, parallelization, unrolling, and fusion
TVM Auto-scheduler (a.k.a. Ansor)
package tvm.auto_scheduler
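A minimal sketch of driving the auto-scheduler from Python; the matmul workload, trial budget, and log-file name below are illustrative choices, not part of these notes:
import tvm
from tvm import te, auto_scheduler

@auto_scheduler.register_workload
def matmul(N, M, K):
    A = te.placeholder((N, K), name="A")
    B = te.placeholder((K, M), name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    return [A, B, C]

target = tvm.target.Target("llvm")
task = auto_scheduler.SearchTask(func=matmul, args=(128, 128, 128), target=target)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=64,  # small budget, for illustration only
    measure_callbacks=[auto_scheduler.RecordToFile("matmul.json")],  # JSON tuning records
    verbose=0,
)
task.tune(tune_option)
sch, args = task.apply_best("matmul.json")  # best schedule found so far
print(tvm.lower(sch, args, simple_mode=True))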
Schedule Primitives
How do we get a high-performance kernel?
TE
import tvm
from tvm import te

# Symbolic shapes
m = te.var('m')
n = te.var('n')
# Input tensors and an element-wise multiply
a = te.placeholder((m, n), name='A')
b = te.placeholder((m, n), name='B')
c = te.compute((m, n), lambda i, j: a[i, j] * b[i, j], name='C')
# Default (naive) schedule
s = te.create_schedule([c.op])
tgt = tvm.target.Target(target="llvm", host="llvm")
# Build a runnable module, then inspect the generated code and the lowered TIR
mult = tvm.build(s, [a, b, c], target=tgt, name="mult")
print(mult.get_source())
print(tvm.lower(s, [a, b, c], simple_mode=True))
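Continuing the example, a sketch of applying a few schedule primitives to the default schedule and lowering again to see their effect (the split factor 4 is arbitrary, not tuned):
xo, xi = s[c].split(c.op.axis[0], factor=4)  # split the i loop into outer/inner
s[c].reorder(xo, c.op.axis[1], xi)           # move the small inner chunk inside j
s[c].parallel(xo)                            # parallelize the outer loop
s[c].unroll(xi)                              # unroll the small inner loop
print(tvm.lower(s, [a, b, c], simple_mode=True))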
tvm.build
tvm.te.schedule.Schedule, tvm.tir.PrimFunc, IRModule, Mapping[str, IRModule] --> tvm.runtime.Module
A module that combines both host and device code
tvm.lower
tvm.te.schedule.Schedule, tvm.tir.PrimFunc, IRModule --> IRModule
Demo: IRModule transformation
import tvm
from tvm.ir.module import IRModule
from tvm.script import tir as T
import numpy as np
from tvm import te
A = te.placeholder((8,), dtype="float32", name="A")
B = te.compute((8,), lambda *i: A(*i) + 1.0, name="B")
func = te.create_prim_func([A, B])
ir_module = IRModule({"main": func})
print(ir_module.script())
"""
# from tvm.script import tir as T
@tvm.script.ir_module
class Module:
@T.prim_func
def main(A: T.Buffer[8, "float32"], B: T.Buffer[8, "float32"]) -> None:
# function attr dict
T.func_attr({"global_symbol": "main", "tir.noalias": True})
# body
# with T.block("root")
for i0 in T.serial(8):
with T.block("B"):
i0_1 = T.axis.spatial(8, i0)
T.reads(A[i0_1])
T.writes(B[i0_1])
B[i0_1] = A[i0_1] + T.float32(1)
"""
# <class 'tvm.driver.build_module.OperatorModule'>
mod = tvm.build(ir_module, target="llvm") # The module for CPU backends.
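# Quick sanity run of the CPU module (illustrative, not part of the original demo):
# the PrimFunc "main" takes the A and B buffers of shape (8,).
a_nd = tvm.nd.array(np.arange(8, dtype="float32"))
b_nd = tvm.nd.array(np.zeros(8, dtype="float32"))
mod["main"](a_nd, b_nd)
print(b_nd.numpy())  # -> [1. 2. 3. 4. 5. 6. 7. 8.]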
# <class 'tvm.tir.schedule.schedule.Schedule'>
sch = tvm.tir.Schedule(ir_module)
block_b = sch.get_block("B")
(i,) = sch.get_loops(block_b)
i_0, i_1, i_2 = sch.split(i, factors=[2, 2, 2])
print(sch.mod.script())
"""
# from tvm.script import tir as T
@tvm.script.ir_module
class Module:
@T.prim_func
def main(A: T.Buffer[8, "float32"], B: T.Buffer[8, "float32"]) -> None:
# function attr dict
T.func_attr({"global_symbol": "main", "tir.noalias": True})
# body
# with T.block("root")
for i0_0, i0_1, i0_2 in T.grid(2, 2, 2):
with T.block("B"):
i0 = T.axis.spatial(8, i0_0 * 4 + i0_1 * 2 + i0_2)
T.reads(A[i0])
T.writes(B[i0])
B[i0] = A[i0] + T.float32(1)
"""
sch.reorder(i_0, i_2, i_1)
print(sch.mod.script())
"""
# from tvm.script import tir as T
@tvm.script.ir_module
class Module:
@T.prim_func
def main(A: T.Buffer[8, "float32"], B: T.Buffer[8, "float32"]) -> None:
# function attr dict
T.func_attr({"global_symbol": "main", "tir.noalias": True})
# body
# with T.block("root")
for i0_0, i0_2, i0_1 in T.grid(2, 2, 2):
with T.block("B"):
i0 = T.axis.spatial(8, i0_0 * 4 + i0_1 * 2 + i0_2)
T.reads(A[i0])
T.writes(B[i0])
B[i0] = A[i0] + T.float32(1)
"""
sch.bind(i_0, "blockIdx.x")
sch.bind(i_2, "threadIdx.x")
print(sch.mod.script())
"""
# from tvm.script import tir as T
@tvm.script.ir_module
class Module:
@T.prim_func
def main(A: T.Buffer[8, "float32"], B: T.Buffer[8, "float32"]) -> None:
# function attr dict
T.func_attr({"global_symbol": "main", "tir.noalias": True})
# body
# with T.block("root")
for i0_0 in T.thread_binding(2, thread="blockIdx.x"):
for i0_2 in T.thread_binding(2, thread="threadIdx.x"):
for i0_1 in T.serial(2):
with T.block("B"):
i0 = T.axis.spatial(8, i0_0 * 4 + i0_1 * 2 + i0_2)
T.reads(A[i0])
T.writes(B[i0])
B[i0] = A[i0] + T.float32(1)
"""
Relay
relay.build() returns
- the execution graph in JSON format
- a TVM module library of compiled functions specifically for this graph on the target hardware
- the parameter blobs of the model
During compilation, Relay performs the graph-level optimizations and TVM performs the tensor-level optimizations, resulting in an optimized runtime module for model serving.
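To make that concrete, a sketch of compiling and serving a Relay IRModule with the graph executor; it assumes mod and params came from a frontend importer such as relay.frontend.from_onnx:
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)  # graph + compiled lib + params

dev = tvm.cpu(0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("data", np.random.rand(1, 3, 224, 224).astype("float32"))
module.run()
out = module.get_output(0).numpy()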
Relay vs. TE
tvm.tir.sqrt(x: tvm.ir.PrimExpr) -> tvm.ir.PrimExpr
# Alias of tvm.tir.sqrt()
tvm.te.sqrt(x: tvm.ir.PrimExpr) -> tvm.ir.PrimExpr
tvm.relay.sqrt(data: tvm.ir.RelayExpr) -> tvm.ir.RelayExpr
For tvm.ir.BaseExpr,
- PrimExpr is the base class of all primitive expressions, used in low-level code optimization and integer analysis.
- RelayExpr is the base class of all non-primitive expressions.
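A tiny illustration of the two expression kinds (the variable names are made up for the example):
import tvm
from tvm import te, relay

x = te.var("x", dtype="float32")
print(type(tvm.tir.sqrt(x)))    # a PrimExpr (tir.Call) over a scalar value

data = relay.var("data", shape=(4,), dtype="float32")
print(type(relay.sqrt(data)))   # a RelayExpr (relay.Call) over a whole tensor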
Build
tvm.build
vs. tvm.relay.build
tvm.relay.build(ir_mod: IRModule, target, target_host, executor=Executor("graph"), runtime=Runtime("cpp"))
-> tvm.relay.backend.executor_factory.ExecutorFactoryModule
tvm.build(inputs: Union[tvm.te.schedule.Schedule, tvm.tir.function.PrimFunc, tvm.ir.module.IRModule, Mapping[str, tvm.ir.module.IRModule]], args, target, target_host, runtime, binds)
-> tvm.driver.build_module.OperatorModule
Optimization
Blocking, Cache
Splitting the computation into blocks lets intermediate results stay in the cache and be reused, which improves performance.
Vectorization, Array Packing
Contiguous memory access is more efficient; vectorize achieves exactly that. Array packing of a matrix performs the same kind of optimization.
Parallelization
When there are no data dependencies, parallelization can be used to improve overall performance; the sketch below combines blocking, vectorization, and parallelization.
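A compact sketch tying the three ideas together on a 1024x1024 matmul in TE; the block size and split factor are reasonable defaults, not tuned values:
import tvm
from tvm import te

M = N = K = 1024
bn = 32  # block (tile) size

A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)
# Blocking: keep a bn x bn tile of C resident in cache while reducing over k
io, jo, ii, ji = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
ko, ki = s[C].split(k, factor=4)
s[C].reorder(io, jo, ko, ki, ii, ji)
# Vectorization: the innermost j loop walks contiguous memory
s[C].vectorize(ji)
# Parallelization: outer row blocks are independent of each other
s[C].parallel(io)

func = tvm.build(s, [A, B, C], target="llvm", name="matmul_blocked")
print(tvm.lower(s, [A, B, C], simple_mode=True))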
TOPI
import tvm
from tvm import te, topi

n = te.var("n")
m = te.var("m")
A = te.placeholder((n, m), name="A")
Plain TE version
k = te.reduce_axis((0, m), "k")
B = te.compute((n,), lambda i: te.sum(A[i, k], axis=k), name="B")
s = te.create_schedule(B.op)
# print(tvm.lower(s, [A], simple_mode=True))
TOPI version
C = topi.sum(A, axis=1)
ts = te.create_schedule(C.op)
# print(tvm.lower(ts, [A], simple_mode=True))
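Building and checking the TOPI version (a sketch; the 4x5 test shape is arbitrary):
import numpy as np
func = tvm.build(ts, [A, C], target="llvm", name="row_sum")
a_np = np.random.rand(4, 5).astype("float32")
c_nd = tvm.nd.array(np.zeros(4, dtype="float32"))
func(tvm.nd.array(a_np), c_nd)
np.testing.assert_allclose(c_nd.numpy(), a_np.sum(axis=1), rtol=1e-5)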
Develop
- write a pass with IR
- glue to lowering
The transformation performed in each phase is as follows:
- Phase 0 generates the raw IR and loop levels.
- Phase 1 flattens the array storage.
- Phase 2 transforms loops, like unroll, vectorization and thread-binding.
- Phase 3 does some cleanup work.
So, for example, a custom vectorize pass is best inserted after phase 1.
with tvm.transform.PassContext(config={"tir.add_lower_pass": [(1, vectorize)]}):
print(tvm.lower(sch, [a, b, c]))
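A self-contained sketch of the mechanism: the count_loops pass below is a made-up placeholder that only inspects the lowered PrimFunc; a real custom vectorizer would rewrite f.body with tir.stmt_functor.ir_transform instead of returning it unchanged.
import tvm
from tvm import te, tir

@tvm.tir.transform.prim_func_pass(opt_level=0)
def count_loops(f, mod, ctx):
    # Count For nodes in the PrimFunc lowered so far, then return it unchanged.
    loops = [0]
    def visit(stmt):
        if isinstance(stmt, tir.For):
            loops[0] += 1
    tir.stmt_functor.post_order_visit(f.body, visit)
    print("PrimFunc has", loops[0], "loop(s) after phase 1")
    return f

n = te.var("n")
a = te.placeholder((n,), name="a")
b = te.placeholder((n,), name="b")
c = te.compute((n,), lambda i: a[i] + b[i], name="c")
sch = te.create_schedule(c.op)

# Hook the pass in after lowering phase 1 (i.e. after array storage is flattened).
with tvm.transform.PassContext(config={"tir.add_lower_pass": [(1, count_loops)]}):
    print(tvm.lower(sch, [a, b, c]))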