
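After the previous chapters on tuning Triton on NVIDIA GPUs, this chapter looks at one of Triton's third_party libraries, which exists to let Triton support more backends. The project's address is given below. It is already officially supported as a third_party component on Triton's main branch, so when cloning Triton you only need to pass the recursive flag to get triton-shared.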

What is triton-shared?

The official implementation of triton-shared lives in the following GitHub repo:

GitHub - microsoft/triton-shared: Shared Middle-Layer for Triton Compilation (github.com/microsoft/triton-shared)

The official description of triton-shared reads as follows:

A shared middle-layer for the Triton Compiler.

Currently the middle layer is not complete but has enough functionality to demonstrate how it can work. The general idea is that Triton IR is lowered into an MLIR core dialect to allow it to be both shared across Triton targets as well as allow back-ends to be shared with other languages.

The basic intended architecture looks like this:

[Triton IR] -> [Middle Layer] -> [HW specific IR]

The middle-layer uses MLIR's Linalg and Tensor Dialects for operations on Triton block values. Operations on Triton pointers use the Memref Dialect.

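triton-shared is essentially a glue-like intermediate layer: its middle-layer design makes it easier for a programming language or compiler to plug into different downstream hardware ecosystems. Triton itself already implements the two most common GPU backends, NVIDIA and AMD, so for a third-party vendor that wants to reuse Triton's frontend to build a compilation flow for its own chip, triton-shared plays a decisive role. The figure below shows the vision that the Triton codebase hopes to support. The branch running straight down the middle is Triton's optimization path for NVIDIA GPUs. The Triton DSL written by the user is first translated into a Python AST and then from the AST into the triton dialect; from this step on, what the user wrote by hand lives in the MLIR ecosystem. The triton dialect is then optimized further into the triton gpu dialect, and from there the flow follows fairly standard LLVM code generation: LLVM IR is lowered to PTX and then to SASS, which finally runs on the NVIDIA GPU. Compared with other compiler frameworks such as TVM, this codegen path is more aggressive: it bypasses the nvcc compiler entirely, which makes the whole process transparent and opens up more possibilities for performance optimization.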

[Figure: the compilation paths that the Triton codebase aims to support]
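The first step of the middle branch, turning Python source into a Python AST, can be illustrated with the standard library alone. This is only a sketch of what a Python AST looks like, not Triton's actual frontend: the function `add_one` and its parameters are made-up names, and Triton's own code generator does considerably more than list node types.

```python
import ast
import textwrap

# Hypothetical kernel-like source; `add_one`, `x_ptr` and `n` are made-up names.
SOURCE = textwrap.dedent("""
    def add_one(x_ptr, n):
        offsets = range(n)
        return [x_ptr[i] + 1 for i in offsets]
""")


def node_type_names(source: str) -> list[str]:
    """Parse `source` into a Python AST and list node type names in walk order."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


names = node_type_names(SOURCE)
# The tree is rooted at a Module node that contains one FunctionDef.
print(names[0], names[1])
# Expression nodes such as BinOp are what a frontend would map to IR operations.
print("BinOp" in names)
```

A DSL frontend would visit nodes like these one by one and emit the matching dialect operation for each, which is where the hand-written Python ends and the MLIR ecosystem begins.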

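triton-shared mainly covers the rightmost branch. As anyone familiar with MLIR knows, the Linalg dialect on that branch is a very important dialect that can feed many different backends. In the compilation and optimization pipelines of several mainstream backends, Linalg serves as the main dialect for converting between, and connecting, the dialects upstream and downstream of it.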

Installing triton-shared

Installing triton-shared is also simple. Clone Triton's main branch with the recursive flag, then use

export TRITON_CODEGEN_TRITON_SHARED=1

to indicate that the build of the whole Triton project should include the triton-shared third-party library. After that, follow the README of the official Triton repo step by step. For LLVM, I used a build compiled by hand at a specific commit id:

LLVM commit id: b1115f8ccefb380824a9d997622cc84fc0d84a89
Triton commit id: 1c2d2405bf04dca2de140bccd65480c3d02d995e
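Before building, it can help to confirm that both checkouts really sit on the pinned commits. The helper below is a minimal sketch, assuming the two repositories are cloned into directories named `triton` and `llvm-project` (those directory names are assumptions, not something the build requires); it relies only on `git rev-parse HEAD`.

```python
import subprocess

# Commit ids copied from the text above.
LLVM_COMMIT = "b1115f8ccefb380824a9d997622cc84fc0d84a89"
TRITON_COMMIT = "1c2d2405bf04dca2de140bccd65480c3d02d995e"


def matches_pin(head: str, pin: str) -> bool:
    """True if `head` (a full or abbreviated hash) names the pinned commit."""
    head = head.strip().lower()
    # Require at least 7 characters so a very short prefix cannot match by accident.
    return len(head) >= 7 and pin.lower().startswith(head)


def read_head(repo_dir: str) -> str:
    """Return the HEAD commit hash of the git checkout at `repo_dir`."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # "triton" and "llvm-project" are assumed checkout directories.
    for repo_dir, pin in (("triton", TRITON_COMMIT), ("llvm-project", LLVM_COMMIT)):
        try:
            status = "ok" if matches_pin(read_head(repo_dir), pin) else "MISMATCH"
        except (OSError, subprocess.CalledProcessError):
            status = "no git checkout found here"
        print(f"{repo_dir}: {status}")
```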

Why these two fixed commit ids? The reason is simple: my earlier development work on Triton and LLVM was all based on them, so every tutorial and example that follows uses these two commit ids. If you do not know how to build Triton from scratch, see my earlier tutorial:

科研敗犬丶: OpenAI/Triton MLIR Chapter 0: Building from Source

Using triton-shared

Having covered what triton-shared is and how to install it, let's look at how to use the compiled triton-shared. Once you have built Triton following the steps above, you will find, at this path:

/triton/build/tools/triton-shared-opt

an executable named triton-shared-opt. Readers familiar with MLIR will quickly recognize it as MLIR's basic opt tool: the binary lowers one dialect to another. Run it with --help to see everything triton-shared-opt can do. If the terminal prints information like the following, your triton-shared installation is complete.

OVERVIEW: Triton-Shared test driver

Available Dialects: arith, builtin, cf, gpu, math, scf, triton_gpu, tt
USAGE: triton-shared-opt [options]

OPTIONS:

Color Options:

  --color - Use colors in output (default=autodetect)

General options:

  --abort-on-max-devirt-iterations-reached - Abort when the max iterations for devirtualization CGSCC repeat pass is reached
  --allow-unregistered-dialect - Allow operation with no registered dialects
  Compiler passes to run
    Passes:
      --affine-data-copy-generate - Generate explicit copying for affine memory operations
        --fast-mem-capacity= - Set fast memory space capacity in KiB (default: unlimited)
        --fast-mem-space= - Fast memory space identifier for copy generation (default: 1)
        --generate-dma - Generate DMA instead of point-wise copy
        --min-dma-transfer= - Minimum DMA transfer size supported by the target in bytes
        --skip-non-unit-stride-loops - Testing purposes: avoid non-unit stride loop choice depths for copy placement
        --slow-mem-space= - Slow memory space identifier for copy generation (default: 0)
        --tag-mem-space= - Tag memory space identifier for copy generation (default: 0)
      --affine-expand-index-ops - Lower affine operations operating on indices into more fundamental operations
      --affine-loop-coalescing - Coalesce nested loops with independent bounds into a single loop
      --affine-loop-fusion - Fuse affine loop nests
...
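The help text is long, so checking it by eye for one pass is tedious. The sketch below scans help output for a given pass flag; `lists_pass` and `read_help` are helper names made up for this example, and the sample text is a shortened stand-in for the real output rather than a verbatim copy.

```python
import subprocess


def lists_pass(help_text: str, pass_flag: str) -> bool:
    """True if some line of `help_text` starts with exactly `pass_flag`."""
    for line in help_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0] == pass_flag:
            return True
    return False


def read_help(opt_path: str) -> str:
    """Run `<opt_path> --help` and return what it prints to stdout."""
    result = subprocess.run(
        [opt_path, "--help"], check=True, capture_output=True, text=True
    )
    return result.stdout


# Shortened stand-in for the real help text, for illustration only.
SAMPLE_HELP = """\
      --affine-loop-fusion - Fuse affine loop nests
      --triton-to-linalg - Convert Triton to Linalg dialect
"""

print(lists_pass(SAMPLE_HELP, "--triton-to-linalg"))  # prints True
print(lists_pass(SAMPLE_HELP, "--triton-to"))         # prints False
```

With a real build, `lists_pass(read_help("/triton/build/tools/triton-shared-opt"), "--triton-to-linalg")` would perform the same check against the actual binary.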

Let's start with the

--triton-to-linalg - Convert Triton to Linalg dialect

pass, since this conversion is what triton-shared is mainly for. It takes the triton dialect as input and lowers it to semantically equivalent linalg dialect code. Where does the triton dialect input come from? Don't worry: the triton-shared repo ships many files in MLIR format that make the pass easy to try. They are located at:

/triton/third_party/triton_shared/test/Conversion/TritonToLinalg/*

This tutorial uses dot.mlir as the example. Its contents are:

// RUN: triton-shared-opt --triton-to-linalg %s | FileCheck %s
module {
  tt.func @kernel(
    %arg0 : !tt.ptr<bf16>,
    %arg1 : !tt.ptr<bf16>,
    %arg2 : !tt.ptr<bf16>
  )
  {
    // Offsets of the 128x64 tile read through %arg0: row * 128 + col.
    %0 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
    %c64 = arith.constant 128 : i32
    %1 = tt.splat %c64 : (i32) -> tensor<128xi32>
    %2 = arith.muli %0, %1 : tensor<128xi32>
    %3 = tt.expand_dims %2 {axis = 1 : i32} : (tensor<128xi32>) -> tensor<128x1xi32>
    %4 = tt.broadcast %3 : (tensor<128x1xi32>) -> tensor<128x64xi32>
    %5 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
    %6 = tt.expand_dims %5 {axis = 0 : i32} : (tensor<64xi32>) -> tensor<1x64xi32>
    %7 = tt.broadcast %6 : (tensor<1x64xi32>) -> tensor<128x64xi32>
    %8 = arith.addi %4, %7 : tensor<128x64xi32>
    // Offsets of the 256x64 tile read through %arg1: row + col * 256.
    %10 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
    %11 = tt.expand_dims %10 {axis = 1 : i32} : (tensor<256xi32>) -> tensor<256x1xi32>
    %12 = tt.broadcast %11 : (tensor<256x1xi32>) -> tensor<256x64xi32>
    %13 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32>
    %c256 = arith.constant 256 : i32
    %14 = tt.splat %c256 : (i32) -> tensor<64xi32>
    %15 = arith.muli %13, %14 : tensor<64xi32>
    %16 = tt.expand_dims %15 {axis = 0 : i32} : (tensor<64xi32>) -> tensor<1x64xi32>
    %17 = tt.broadcast %16 : (tensor<1x64xi32>) -> tensor<256x64xi32>
    %18 = arith.addi %12, %17 : tensor<256x64xi32>
    // Offsets of the 128x256 tile accessed through %arg2: row * 256 + col.
    %20 = tt.splat %c256 : (i32) -> tensor<128xi32>
    %21 = arith.muli %0, %20 : tensor<128xi32>
    %22 = tt.expand_dims %21 {axis = 1 : i32} : (tensor<128xi32>) -> tensor<128x1xi32>
    %23 = tt.broadcast %22 : (tensor<128x1xi32>) -> tensor<128x256xi32>
    %24 = tt.expand_dims %10 {axis = 0 : i32} : (tensor<256xi32>) -> tensor<1x256xi32>
    %25 = tt.broadcast %24 {axis = 0 : i32} : (tensor<1x256xi32>) -> tensor<128x256xi32>
    %26 = arith.addi %23, %25 : tensor<128x256xi32>
    // Load the three tiles; the second one is transposed before the dot.
    %30 = tt.splat %arg0 : (!tt.ptr<bf16>) -> tensor<128x64x!tt.ptr<bf16>>
    %31 = tt.addptr %30, %8 : tensor<128x64x!tt.ptr<bf16>>, tensor<128x64xi32>
    %32 = tt.load %31 {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<128x64xbf16>
    %40 = tt.splat %arg1 : (!tt.ptr<bf16>) -> tensor<256x64x!tt.ptr<bf16>>
    %41 = tt.addptr %40, %18 : tensor<256x64x!tt.ptr<bf16>>, tensor<256x64xi32>
    %42 = tt.load %41 {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<256x64xbf16>
    %43 = tt.trans %42 : (tensor<256x64xbf16>) -> tensor<64x256xbf16>
    %50 = tt.splat %arg2 : (!tt.ptr<bf16>) -> tensor<128x256x!tt.ptr<bf16>>
    %51 = tt.addptr %50, %26 : tensor<128x256x!tt.ptr<bf16>>, tensor<128x256xi32>
    %52 = tt.load %51 {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<128x256xbf16>
    // Multiply-accumulate, then store the result back through %arg2.
    %60 = tt.dot %32, %43, %52 {allowTF32 = false, maxNumImpreciseAcc = 0 : i32} : tensor<128x64xbf16> * tensor<64x256xbf16> -> tensor<128x256xbf16>
    tt.store %51, %60 : tensor<128x256xbf16>
    tt.return
  }
}

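The MLIR above is easy to read. The lines from %0 to %10, for instance, are mostly triton dialect operations (plus a few arith ops for the index arithmetic); they are what the upper-level Triton DSL becomes once it is lowered to the triton dialect. The tt prefix marks an operation as belonging to the triton dialect, and each tt.xxx is one of the operations that dialect supports. How to define an MLIR dialect will get a tutorial of its own.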
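The chain of make_range, splat, muli, expand_dims, broadcast, and addi at the top of the kernel only builds a 2-D grid of element offsets for each tile. The plain-Python sketch below reproduces that arithmetic under my reading of dot.mlir; `tile_offsets` is a made-up helper, not part of Triton.

```python
def tile_offsets(rows: int, cols: int, row_stride: int, col_stride: int) -> list[list[int]]:
    """Offset grid of a rows x cols tile: offset[r][c] = r * row_stride + c * col_stride.

    This mirrors make_range (the r and c ranges), splat + muli (the scaling),
    expand_dims + broadcast (forming the 2-D grid) and addi (the final sum).
    """
    return [[r * row_stride + c * col_stride for c in range(cols)] for r in range(rows)]


a_offsets = tile_offsets(128, 64, 128, 1)   # corresponds to %8
b_offsets = tile_offsets(256, 64, 1, 256)   # corresponds to %18
c_offsets = tile_offsets(128, 256, 256, 1)  # corresponds to %26

print(a_offsets[1][2])  # prints 130
print(b_offsets[1][2])  # prints 513
print(c_offsets[1][2])  # prints 258
```

Seen this way, each tile is just a strided view of a flat buffer, with (row stride, column stride) pairs of (128, 1), (1, 256), and (256, 1).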

Next, just run the following in a terminal:

./triton-shared-opt --triton-to-linalg /triton/third_party/triton_shared/test/Conversion/TritonToLinalg/dot.mlir

and you get the result of converting the triton dialect to the linalg dialect:

#map = affine_map<(d0, d1) -> (d0, d1)>
module {
  func.func @kernel(%arg0: memref<*xbf16>, %arg1: memref<*xbf16>, %arg2: memref<*xbf16>, %arg3: i32, %arg4: i32, %arg5: i32, %arg6: i32, %arg7: i32, %arg8: i32) {
    %c256 = arith.constant 256 : index
    %c128 = arith.constant 128 : index
    %reinterpret_cast = memref.reinterpret_cast %arg0 to offset: [0], sizes: [128, 64], strides: [%c128, 1] : memref<*xbf16> to memref<128x64xbf16, strided<[?, 1]>>
    %alloc = memref.alloc() : memref<128x64xbf16>
    memref.copy %reinterpret_cast, %alloc : memref<128x64xbf16, strided<[?, 1]>> to memref<128x64xbf16>
    %0 = bufferization.to_tensor %alloc restrict writable : memref<128x64xbf16>
    %reinterpret_cast_0 = memref.reinterpret_cast %arg1 to offset: [0], sizes: [256, 64], strides: [1, %c256] : memref<*xbf16> to memref<256x64xbf16, strided<[1, ?]>>
    %alloc_1 = memref.alloc() : memref<256x64xbf16>
    memref.copy %reinterpret_cast_0, %alloc_1 : memref<256x64xbf16, strided<[1, ?]>> to memref<256x64xbf16>
    %1 = bufferization.to_tensor %alloc_1 restrict writable : memref<256x64xbf16>
    %2 = tensor.empty() : tensor<64x256xbf16>
    %transposed = linalg.transpose ins(%1 : tensor<256x64xbf16>) outs(%2 : tensor<64x256xbf16>) permutation = [1, 0]
    %reinterpret_cast_2 = memref.reinterpret_cast %arg2 to offset: [0], sizes: [128, 256], strides: [%c256, 1] : memref<*xbf16> to memref<128x256xbf16, strided<[?, 1]>>
    %alloc_3 = memref.alloc() : memref<128x256xbf16>
    memref.copy %reinterpret_cast_2, %alloc_3 : memref<128x256xbf16, strided<[?, 1]>> to memref<128x256xbf16>
    %3 = bufferization.to_tensor %alloc_3 restrict writable : memref<128x256xbf16>
    %4 = tensor.empty() : tensor<128x256xbf16>
    %5 = linalg.matmul ins(%0, %transposed : tensor<128x64xbf16>, tensor<64x256xbf16>) outs(%4 : tensor<128x256xbf16>) -> tensor<128x256xbf16>
    %6 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%5, %3 : tensor<128x256xbf16>, tensor<128x256xbf16>) outs(%5 : tensor<128x256xbf16>) {
    ^bb0(%in: bf16, %in_4: bf16, %out: bf16):
      %7 = arith.addf %in, %in_4 : bf16
      linalg.yield %7 : bf16
    } -> tensor<128x256xbf16>
    memref.tensor_store %6, %reinterpret_cast_2 : memref<128x256xbf16, strided<[?, 1]>>
    return
  }
}
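To confirm from a script that the lowering produced linalg operations and left no triton ones behind, the tool can be driven from Python and its textual output scanned. This is a rough sketch: `build_command` and `op_names` are helper names invented here, the regex is a heuristic over textual IR rather than a real MLIR parser, and only the `--triton-to-linalg` flag and the paths shown earlier in this article are used.

```python
import re
import subprocess


def build_command(opt_path: str, mlir_path: str) -> list[str]:
    """Command line for lowering one file with the triton-to-linalg pass."""
    return [opt_path, "--triton-to-linalg", mlir_path]


def op_names(ir_text: str, dialect: str) -> set[str]:
    """Collect distinct `<dialect>.<op>` names found in textual IR.

    The lookbehind skips types such as `!tt.ptr` and longer identifiers
    that merely end in the dialect name.
    """
    pattern = re.compile(r"(?<![\w.!%@#])" + re.escape(dialect) + r"\.[a-z_]+")
    return set(pattern.findall(ir_text))


if __name__ == "__main__":
    # Paths as given in this article; adjust them to your own build tree.
    opt = "/triton/build/tools/triton-shared-opt"
    src = "/triton/third_party/triton_shared/test/Conversion/TritonToLinalg/dot.mlir"
    try:
        out = subprocess.run(
            build_command(opt, src), check=True, capture_output=True, text=True
        ).stdout
    except (OSError, subprocess.CalledProcessError) as err:
        print(f"could not run triton-shared-opt: {err}")
    else:
        print("linalg ops:", sorted(op_names(out, "linalg")))
        print("remaining tt ops:", sorted(op_names(out, "tt")))
```

For the output shown above, the linalg set would contain linalg.generic, linalg.matmul, linalg.transpose, and linalg.yield, and the tt set would be empty.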

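Other, more specific operators can be handled with the same workflow. If your compiler framework is built on MLIR and can be lowered cleanly to Linalg, then plugging in your own backend and adapting to a particular ISA becomes much easier. This also shows, from another angle, why the current trend is to rebuild compilers on top of MLIR: the most important reason is that it lets you connect to all kinds of software and hardware ecosystems at minimal development cost.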

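Afterword

I have been studying Triton for a while now. I learned it the hard way, step by step from the source code, and there is no good Chinese-language tutorial for it. So in my spare time I plan to write up in detail what I know about using Triton for codegen: the optimization methods (passes for different backends and at different IR levels) and the details (the design of the underlying layout), to help more people who want to use Triton for codegen.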

Reviewing editor: 湯梓紅


Original title: OpenAI/Triton MLIR Chapter 3: Unboxing Triton-shared

Source: WeChat official account GiantPandaCV (WeChat ID: GiantPandaCV). Please credit the source when reposting.
