LANGUAGE AI GUIDE

Multi-Query Attention

Multi-Query Attention (MQA) is a memory-saving variant of transformer attention that shares a single set of keys and values across all attention heads.

Overview

By shrinking the amount of memory the model has to move around at each decoding step, MQA substantially speeds up text generation.

Multi-Query Attention is part of the language-AI stack used to read, generate, organize, and transform text and speech at scale.

Deep Dive

Standard multi-head attention gives every head its own query, key, and value projections. During generation, the keys and values of all previous tokens must be cached and reloaded at every step. This KV cache becomes the main bottleneck, because reading it from memory is slower than the arithmetic itself. Multi-Query Attention, proposed by Noam Shazeer in 2019, keeps a separate query projection per head but collapses the keys and values into a single shared head. This shrinks the KV cache by a factor equal to the number of heads, typically 8x to 64x. The result is much faster autoregressive decoding and a lighter memory footprint, with only a modest drop in quality. A middle ground, Grouped-Query Attention, balances the trade-off.
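To make the size reduction concrete, the sketch below estimates the KV cache for one sequence under multi-head attention and under MQA. The layer count, head count, head dimension, and sequence length are hypothetical values chosen for illustration, and `bytes_per_value=2` assumes 16-bit storage; none of these figures come from a specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes of KV cache for one sequence: keys plus values, across all layers."""
    # The leading factor of 2 covers the separate K and V tensors.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model shape, chosen only for illustration.
NUM_LAYERS, NUM_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

mha_bytes = kv_cache_bytes(NUM_LAYERS, NUM_HEADS, HEAD_DIM, SEQ_LEN)  # one KV head per query head
mqa_bytes = kv_cache_bytes(NUM_LAYERS, 1, HEAD_DIM, SEQ_LEN)          # a single shared KV head

print(f"MHA: {mha_bytes / 2**20:.0f} MiB per sequence")
print(f"MQA: {mqa_bytes / 2**20:.0f} MiB per sequence")
print(f"Reduction: {mha_bytes // mqa_bytes}x")
```

Under these assumptions the reduction factor equals the number of heads, which matches the scaling described above.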

Technical Insight

In MQA, the query weights still produce H separate query vectors, but a single key projection and a single value projection are shared across all heads. Each head computes attention using its own query against the same keys and values. Because the cached K and V tensors no longer scale with the number of heads, the memory bandwidth needed during decoding drops sharply, and bandwidth, not compute, is what gates generation speed on modern accelerators.
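The mechanism can be sketched in plain Python as a single decoding step. This is a minimal illustration, not a production implementation: the function names, the toy weight matrices, and the sizes (model width 4, head dimension 2, H = 3) are all assumptions made for the example, and the final output projection of a real attention layer is omitted.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mqa_decode_step(x, w_q_heads, w_k, w_v, k_cache, v_cache):
    """One multi-query attention decoding step for a single token vector x.

    w_q_heads: list of H query projection matrices, one per head.
    w_k, w_v:  the single shared key and value projections.
    k_cache, v_cache: one key/value vector per past token, shared by
                      every head (both lists are extended in place).
    """
    # Only one key and one value are cached per token, regardless of H.
    k_cache.append(matvec(w_k, x))
    v_cache.append(matvec(w_v, x))
    head_dim = len(k_cache[0])

    outputs = []
    for w_q in w_q_heads:
        q = matvec(w_q, x)  # per-head query
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(head_dim)
                  for k in k_cache]
        weights = softmax(scores)
        # Weighted sum over the shared cached values.
        out = [sum(w * v[d] for w, v in zip(weights, v_cache))
               for d in range(head_dim)]
        outputs.append(out)
    return outputs  # H vectors of length head_dim

# Toy weights (hypothetical): model width 4, head_dim 2, H = 3 heads.
w_q_heads = [[[0.1 * (h + 1), 0.0, 0.2, 0.0],
              [0.0, 0.1, 0.0, 0.3 * (h + 1)]] for h in range(3)]
w_k = [[0.5, 0.0, 0.0, 0.1], [0.0, 0.5, 0.1, 0.0]]
w_v = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

k_cache, v_cache = [], []
for token in ([1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]):
    heads_out = mqa_decode_step(token, w_q_heads, w_k, w_v, k_cache, v_cache)

print(len(heads_out), "head outputs;", len(k_cache), "cached keys for 2 tokens")
```

Note that the cache grows by one key and one value per token no matter how many query heads there are; in standard multi-head attention it would grow by H of each.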

Mastering Multi-Query Attention

To build a deeper understanding, treat Multi-Query Attention as a pattern of use rather than a single component: define the requirements, make the assumptions explicit, and separate what a reliable system can handle from what still needs expert judgment.

In practice, strong teams working with Multi-Query Attention design prompts, retrieval, and evaluation loops as one integrated communication system. They write down explicit success criteria, test against real data and real workflows, and iterate on observed failure modes rather than on one-off benchmark wins. This is where theoretical understanding turns into durable capability across product, policy, and operations.

Language workflows can move faster without sacrificing consistency. At the same time, hallucinated facts can quietly slip into reports, support flows, or research outputs. A steady approach is to pair rapid experimentation with governance: run pilots, gather evidence, publish decision logs, and keep updating safeguards as model behavior, user expectations, and regulatory requirements evolve.

Strategic Impact

Language workflows can move faster without sacrificing consistency.

It broadens access across languages and communication styles.

Teams can spend more time on judgment while automation handles the repetition.

In high-quality deployments, each of these translates into measurable operating rules, clear ownership boundaries, and repeatable review rituals, so that teams scale trust rather than ambiguity.

The Future of Multi-Query Attention

MQA established that redundant key/value heads can be pruned with little damage, and that insight now shapes nearly every fast-inference LLM. The field has largely converged on Grouped-Query Attention (GQA), used in Llama 2 70B, Llama 3, and many others, which uses a small number of KV groups to recover some quality while keeping most of the speedup. Future work combines these ideas with KV-cache compression, quantization, and multi-head latent attention to push toward longer contexts and cheaper serving.
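One way to see how GQA sits between the two extremes is to look at which KV head each query head reads from. The sketch below assumes contiguous grouping (consecutive query heads share a KV head), which is a common convention but not the only possible one; the function name and the head count of 8 are hypothetical.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Index of the KV head that a given query head reads from under GQA."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

NUM_QUERY_HEADS = 8  # hypothetical head count

for name, num_kv_heads in [("MHA", 8), ("GQA", 2), ("MQA", 1)]:
    mapping = [kv_head_for_query_head(h, NUM_QUERY_HEADS, num_kv_heads)
               for h in range(NUM_QUERY_HEADS)]
    print(f"{name}: {num_kv_heads} KV head(s), query->KV mapping {mapping}")
```

With as many KV heads as query heads the mapping is the identity (standard multi-head attention), and with a single KV head every query head maps to index 0 (MQA), so both are special cases of the same scheme.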

Real-World Implementation

Speeding up token-by-token generation in conversational assistants, where the KV cache, not raw compute, limits throughput.

Google's PaLM, which used Multi-Query Attention to make large-scale inference efficient.

Serving many concurrent users on a single GPU by reducing the per-request KV cache memory.

Grouped-Query Attention in Llama 2 70B and Llama 3, a direct descendant that balances MQA's speed against full multi-head attention quality.
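The multi-user serving case comes down to simple arithmetic: a fixed memory budget divided by the per-request KV cache. The sketch below is a rough estimate under stated assumptions; the 16 GiB budget, the model shape, and the 16-bit storage are hypothetical, and a real serving stack would also account for weights, activations, and fragmentation.

```python
def max_concurrent_sequences(cache_budget_bytes, num_layers, num_kv_heads,
                             head_dim, seq_len, bytes_per_value=2):
    """How many full-length sequences fit in a given KV cache budget."""
    # Per-sequence cache: K and V (factor 2) for every layer, KV head, and token.
    per_sequence = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value
    return cache_budget_bytes // per_sequence

BUDGET = 16 * 2**30  # hypothetical: 16 GiB of GPU memory left for the KV cache
SHAPE = dict(num_layers=32, head_dim=128, seq_len=4096)  # hypothetical model shape

for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    n = max_concurrent_sequences(BUDGET, num_kv_heads=kv_heads, **SHAPE)
    print(f"{name} ({kv_heads} KV heads): up to {n} sequences of 4096 tokens")
```

Because the per-request cache scales linearly with the number of KV heads, cutting the KV heads raises the number of requests that fit in the same budget by the same factor.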

Implementation Patterns

Multi-Query Attention in practice

Across the use cases listed above, teams tend to get better results when they define quality criteria up front, keep a human escalation path for edge cases, and track both product gains and the cost of errors over time.

Risks & Guardrails

Hallucinated facts can quietly slip into reports, support flows, or research.

Prompt sensitivity can produce inconsistent results for similar requests.

Sensitive text data can leak if access controls are weak.

Implementation Roadmap

1. Define the output format, tone, and quality criteria before launch.

2. Ground responses in trusted sources wherever it matters.

3. Keep a human review step for high-stakes outputs.

4. Track failure modes and periodically revise prompts or workflows.

Treat each step as an evidence gate: if the criteria are not met, pause the rollout, close the gap, and only then expand usage.

Keep Exploring