Technical Guide

PagedAttention and vLLM

PagedAttention is a memory-management technique that stores a language model's attention (KV) cache in small, reusable blocks instead of one large contiguous chunk.

Overview

PagedAttention powers vLLM, an open-source serving engine that dramatically increases how many requests a single GPU can handle.

PagedAttention and vLLM are core engineering topics that affect model quality, infrastructure cost, latency, and reliability at scale.

Deep Dive

When a language model generates text, it stores a "KV cache" (key and value vectors) for every token it has seen, so that the next token can attend to the full context. Traditionally, each request reserved one large contiguous region of GPU memory sized for its maximum possible length, which wasted a great deal of memory when sequences were short or varied in length. PagedAttention, introduced in the 2023 vLLM paper from UC Berkeley, borrows the idea of virtual-memory paging from operating systems: it splits the KV cache into fixed-size blocks that can live anywhere in memory and are allocated on demand. A lookup table maps logical token positions to physical blocks. This nearly eliminates memory fragmentation and allows blocks to be shared, for example across multiple outputs generated from the same prompt.
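The mapping from logical token positions to physical blocks can be sketched in a few lines of Python. This is a toy model, not vLLM's implementation: the names `BlockAllocator`, `Sequence`, and `locate`, and the block size of 16 tokens, are assumptions made for illustration, and no real KV tensors are stored.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, not taken from the text)


class BlockAllocator:
    """Hands out physical block ids from a fixed pool (hypothetical helper)."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


@dataclass
class Sequence:
    """Tracks one request's logical-to-physical block mapping."""

    allocator: BlockAllocator
    block_table: list[int] = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self) -> None:
        # A new physical block is requested only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def locate(self, position: int) -> tuple[int, int]:
        """Return (physical block id, offset inside that block) for a token position."""
        if not 0 <= position < self.num_tokens:
            raise IndexError(position)
        return self.block_table[position // BLOCK_SIZE], position % BLOCK_SIZE


allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(20):          # 20 tokens need two blocks of 16
    seq.append_token()
print(seq.block_table, seq.locate(17))
```

The physical block ids in `block_table` need not be adjacent or ordered, which is the point: the sequence only ever holds as many blocks as its current length requires.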

Technical Insight

The KV cache is divided into fixed-size pages, each holding the keys and values for a fixed number of tokens. A per-sequence block table maps token positions to physical page locations, so a sequence's cache does not have to be contiguous. Because identical prefixes (a shared system prompt, or the branches of a beam search) can point to the same physical pages through copy-on-write, memory is reused instead of duplicated, cutting waste from over 60% to a few percent.
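Copy-on-write sharing can be illustrated with reference counts on physical blocks. The sketch below is a simplified model under stated assumptions: `SharedBlockPool`, `fork`, and `writable_last_block` are hypothetical names, and the step that would copy the actual KV data is only marked by a comment.

```python
class SharedBlockPool:
    """Physical KV blocks with reference counts (toy model, no real tensors)."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))
        self.ref_count: dict[int, int] = {}

    def allocate(self) -> int:
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> int:
        self.ref_count[block] += 1
        return block

    def release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            del self.ref_count[block]
            self.free.append(block)


def fork(pool: SharedBlockPool, block_table: list[int]) -> list[int]:
    """Create a child block table that points at the parent's physical blocks."""
    return [pool.share(block) for block in block_table]


def writable_last_block(pool: SharedBlockPool, block_table: list[int]) -> int:
    """Copy-on-write: make the last block private before the sequence appends to it."""
    last = block_table[-1]
    if pool.ref_count[last] > 1:
        private = pool.allocate()   # a real engine would also copy the KV data here
        pool.release(last)
        block_table[-1] = private
    return block_table[-1]


pool = SharedBlockPool(num_blocks=8)
parent = [pool.allocate(), pool.allocate()]   # the prompt occupies two blocks
child = fork(pool, parent)                    # a second sample of the same prompt
writable_last_block(pool, child)              # the child diverges on its first new token
print(parent, child, pool.ref_count)
```

After the fork and the first divergent write, the two sequences still share the first block and hold separate copies of the last one, so three physical blocks are in use instead of four.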

Mastering PagedAttention and vLLM

To build a deep understanding, treat PagedAttention and vLLM as an applied system rather than a single concept: define the requirements, make assumptions explicit, and separate what reliable automation can handle from what still needs expert judgment.

In practice, strong teams working with PagedAttention and vLLM weigh architecture, data, and infrastructure choices against reliability and cost. They write down explicit success criteria, test against real data and workflows, and iterate on observed failure modes rather than one-off benchmark wins. This is where theoretical understanding turns into durable capability across product, policy, and operations.

Architecture choices drive performance and cloud cost. At the same time, optimizing for a single benchmark can hide broader system weaknesses. A sound approach pairs rapid experimentation with governance: run pilots, collect evidence, publish decision logs, and keep improving safeguards as model behavior, user expectations, and regulatory requirements evolve.

Strategic Impact

Architecture choices drive performance and cloud cost.

Technical education helps teams choose the right stack, not just the newest one.

Better engineering decisions reduce reliability incidents in production.

In production-grade deployments, these points translate into measurable operating rules, clear ownership boundaries, and repeatable review practices, so that teams scale confidence rather than ambiguity.

The Future of PagedAttention and vLLM

vLLM has become a default open-source inference backbone, and PagedAttention's ideas now appear across many serving stacks. Expect deeper prefix caching (reusing cached system prompts across all users), disaggregated prefill and decode on separate machines, smarter eviction policies, and tighter integration with quantization and speculative decoding. As context windows grow toward millions of tokens, efficient paged KV management becomes ever more central to keeping serving affordable.

Real-World Implementation

Hosting an open-source LLM API where vLLM serves many concurrent chat users from a single GPU at high throughput.

Sharing a long system prompt across thousands of requests through prefix caching, so that it is processed once rather than repeatedly.

Running beam search or multiple sampled completions that share the KV blocks of a common prompt through copy-on-write.

Cutting the GPU memory wasted to fragmentation, so that a provider can pack more concurrent sessions onto the same hardware.
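Of the use cases above, prefix caching is the easiest to trace by hand. The sketch below assumes a simple scheme in which each full block of prompt tokens is keyed by the entire token prefix up to and including that block; `PrefixCache` and `lookup_or_fill` are hypothetical names, the block size of 4 is chosen only to keep the trace short, and a production engine would use compact hashes rather than whole tuples as keys.

```python
BLOCK_SIZE = 4  # tokens per block; kept tiny so the example is easy to trace


class PrefixCache:
    """Maps each full prompt block, identified by its whole prefix, to a block id."""

    def __init__(self) -> None:
        self.blocks: dict[tuple, int] = {}
        self.next_block_id = 0

    def lookup_or_fill(self, token_ids: list[int]) -> tuple[list[int], int]:
        """Return (block table, number of prompt tokens served from the cache)."""
        table: list[int] = []
        cached_tokens = 0
        full = len(token_ids) - len(token_ids) % BLOCK_SIZE  # ignore a partial tail block
        for end in range(BLOCK_SIZE, full + 1, BLOCK_SIZE):
            key = tuple(token_ids[:end])  # the block plus everything before it
            if key in self.blocks:
                cached_tokens += BLOCK_SIZE
            else:
                # Stands in for running prefill and storing the KV data.
                self.blocks[key] = self.next_block_id
                self.next_block_id += 1
            table.append(self.blocks[key])
        return table, cached_tokens


cache = PrefixCache()
system_prompt = list(range(100, 108))  # 8 tokens, i.e. two full blocks
first_table, first_hits = cache.lookup_or_fill(system_prompt + [1, 2, 3, 4])
second_table, second_hits = cache.lookup_or_fill(system_prompt + [5, 6, 7, 8])
print(first_table, first_hits)
print(second_table, second_hits)
```

The second request reuses the two blocks that hold the system prompt and only has to fill one new block for its own suffix, which is the saving that prefix caching is meant to deliver.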

Implementation Patterns

PagedAttention and vLLM in practice

Across the four use cases listed under Real-World Implementation, teams usually get better results when they define quality criteria upfront, keep a human escalation path for edge cases, and track both product outcomes and the cost of errors over time.
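To put rough numbers on the memory-waste use case, the following sketch compares reserving a maximum-length slot per request with holding only as many fixed-size blocks as each request needs. The sequence lengths, the 2048-token maximum, and the block size of 16 are made-up values for illustration, not measurements.

```python
import math

BLOCK_SIZE = 16  # tokens per block (assumed)


def contiguous_waste(lengths: list[int], max_len: int) -> float:
    """Fraction of reserved slots left unused when every request reserves max_len slots."""
    reserved = max_len * len(lengths)
    return 1 - sum(lengths) / reserved


def paged_waste(lengths: list[int], block_size: int = BLOCK_SIZE) -> float:
    """Fraction unused when each request holds ceil(length / block_size) blocks."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in lengths)
    return 1 - sum(lengths) / reserved


lengths = [120, 300, 45, 800, 210]  # hypothetical sequence lengths in tokens
print(f"contiguous: {contiguous_waste(lengths, max_len=2048):.1%}")
print(f"paged:      {paged_waste(lengths):.1%}")
```

With these hypothetical lengths the contiguous scheme leaves roughly 86% of its reservation unused, while the paged scheme wastes about 2%, all of it in the partially filled last block of each request.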

Risks & Guardrails

Optimizing for a single benchmark can hide broader system weaknesses.

Infrastructure and maintenance costs are often underestimated.

Security and observability gaps can grow as systems become more complex.

Implementation Roadmap

1. Define latency, quality, and cost targets before rollout.

2. Benchmark under realistic load and data conditions.

3. Instrument monitoring for errors, drift, and user behavior.

4. Prepare rollback and incident-response paths before scaling.

Treat each step as an evidence gate: if the criteria are not met, pause the rollout, close the gap, and only then expand usage.
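One way to make the evidence gate concrete is a small check that compares measured results with the targets agreed in step 1. This is a minimal sketch under stated assumptions: the `Targets` fields, the `gate` function, and all threshold values are hypothetical, and the measurements would come from your own benchmark runs.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class Targets:
    p95_latency_ms: float
    max_error_rate: float
    max_cost_per_1k_requests: float


def gate(latencies_ms: list[float], errors: int, cost_per_1k: float,
         targets: Targets) -> list[str]:
    """Return the unmet criteria; an empty list means the gate passes."""
    p95 = quantiles(latencies_ms, n=20)[-1]  # last of 19 cut points = 95th percentile
    error_rate = errors / len(latencies_ms)
    failures = []
    if p95 > targets.p95_latency_ms:
        failures.append(f"p95 latency {p95:.0f} ms exceeds {targets.p95_latency_ms:.0f} ms")
    if error_rate > targets.max_error_rate:
        failures.append(f"error rate {error_rate:.2%} exceeds {targets.max_error_rate:.2%}")
    if cost_per_1k > targets.max_cost_per_1k_requests:
        failures.append(f"cost {cost_per_1k:.2f} exceeds {targets.max_cost_per_1k_requests:.2f}")
    return failures


targets = Targets(p95_latency_ms=500, max_error_rate=0.01, max_cost_per_1k_requests=1.0)
healthy = gate([100.0] * 100, errors=0, cost_per_1k=0.5, targets=targets)
slow_tail = gate([100.0] * 90 + [900.0] * 10, errors=0, cost_per_1k=0.5, targets=targets)
print(healthy)
print(slow_tail)
```

Returning the list of unmet criteria, rather than a single pass or fail flag, keeps the reason for a paused rollout visible in whatever log or dashboard records the decision.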
