LANGUAGE AI GUIDE

Multi-Query Attention

Multi-Query Attention (MQA) is a memory-saving variant of transformer attention that shares a single set of keys and values across all attention heads.

Summary

MQA speeds up text generation substantially because it reduces the amount of memory the model has to move for each generated token.

Multi-Query Attention is part of the language-AI stack used to understand, generate, translate, and transform text and speech at scale.

Deep Dive

Standard multi-head attention gives each head its own query, key, and value projections. During generation, the keys and values of all previous tokens must be cached and reloaded at every step. This KV cache becomes the main bottleneck, because reading it from memory takes longer than the arithmetic itself. Multi-Query Attention, proposed by Noam Shazeer in 2019, keeps a separate query projection for each head but collapses the keys and values into a single shared head. That shrinks the KV cache by a factor equal to the number of heads, often 8x to 64x. The net result is faster autoregressive decoding and a lighter memory footprint, at a small cost in quality. Grouped-Query Attention is the middle ground that balances this trade-off.
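The cache reduction described above can be checked with simple arithmetic. The sketch below is a rough estimate only: the function name `kv_cache_bytes` and the model configuration (32 layers, 32 heads, head dimension 128, 4096 tokens, 2 bytes per value) are hypothetical values chosen for illustration, not figures for any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # Factor of 2: one tensor for keys and one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical configuration, chosen only for illustration.
config = dict(num_layers=32, head_dim=128, seq_len=4096, bytes_per_value=2)
num_heads = 32

mha = kv_cache_bytes(num_kv_heads=num_heads, **config)  # one KV head per query head
mqa = kv_cache_bytes(num_kv_heads=1, **config)          # one KV head shared by all

print(f"MHA cache: {mha / 2**20:.0f} MiB")  # MHA cache: 2048 MiB
print(f"MQA cache: {mqa / 2**20:.0f} MiB")  # MQA cache: 64 MiB
print(f"Reduction: {mha // mqa}x")          # Reduction: 32x
```

The reduction factor equals the number of heads because the head count is the only term that changes between the two calls.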

Technical View

In MQA, the query weights still produce H distinct query vectors, but a single key projection and a single value projection are shared across all heads. Each head computes attention with its own query against the same keys and values. Because the cached K and V tensors no longer scale with the number of heads, the memory bandwidth needed during decoding drops sharply, and bandwidth, not compute, is what limits token-generation speed on modern accelerators.
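A minimal pure-Python sketch of one decoding step may make the sharing concrete. It assumes the queries, keys, and values have already been projected, uses plain lists instead of tensors, and the names `softmax` and `mqa_decode_step` are hypothetical helpers written for this illustration rather than any library's API.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mqa_decode_step(queries, shared_keys, shared_values):
    """One decoding step of multi-query attention.

    queries:       H query vectors for the new token, one per head
    shared_keys:   T cached key vectors, shared by all heads
    shared_values: T cached value vectors, shared by all heads
    Returns H output vectors, one per head.
    """
    head_dim = len(shared_keys[0])
    scale = 1.0 / math.sqrt(head_dim)
    outputs = []
    for q in queries:
        # Every head scores its own query against the same cached keys.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in shared_keys]
        weights = softmax(scores)
        # The weighted sum also reads the same cached values.
        out = [sum(w * v[j] for w, v in zip(weights, shared_values))
               for j in range(len(shared_values[0]))]
        outputs.append(out)
    return outputs

# Toy inputs (hypothetical): 2 heads, 2 cached tokens, head dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
shared_keys = [[1.0, 0.0], [0.0, 1.0]]
shared_values = [[1.0, 0.0], [0.0, 1.0]]

outputs = mqa_decode_step(queries, shared_keys, shared_values)
for head, out in enumerate(outputs):
    print(f"head {head}: {[round(x, 3) for x in out]}")
```

Only `queries` has a per-head dimension here. In standard multi-head attention, `shared_keys` and `shared_values` would each need one copy per head, which is exactly the part of the cache that MQA removes.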

Mastering Multi-Query Attention

To build deep understanding, treat Multi-Query Attention as a workflow rather than a single capability: define the outcomes you want, state your assumptions, and then separate what the system can do reliably from what still needs expert judgment.

In practice, strong teams using Multi-Query Attention design prompts, retrieval, and review loops as one integrated communication system. They write clear success criteria, test them against realistic data and tasks, and iterate on observed failure patterns rather than one-off benchmark wins. This is where theoretical understanding turns into durable capability in products, policy, and operations.

Language workflows can become faster without sacrificing consistency. At the same time, hallucinated content can slip into reports, support flows, or research outputs. The strongest approach combines fast learning with governance discipline: run pilots, capture evidence, publish decisions, and keep updating safeguards as deployment patterns, user expectations, and regulations evolve.

Strategic Impact

Language workflows can become faster without sacrificing consistency. In high-quality deployments, this translates into measurable operating rules, clear ownership boundaries, and frequent review cycles, so that teams can build confidence as complexity grows.

It broadens access across languages and communication modes. In high-quality deployments, this translates into measurable operating rules, clear ownership boundaries, and frequent review cycles, so that teams can build confidence as complexity grows.

Teams can spend more time on judgment while automation handles repetitive work. In high-quality deployments, this translates into measurable operating rules, clear ownership boundaries, and frequent review cycles, so that teams can build confidence as complexity grows.

The Future of Multi-Query Attention

MQA showed that most key/value heads can be cut with little loss in quality, and that insight now shapes nearly every fast LLM. Practice has largely moved to Grouped-Query Attention (GQA), used in Llama 2/3 and many other models, which keeps a small number of KV groups rather than a single one to recover quality while retaining most of the speedup. Ongoing work combines these ideas with KV-cache compression, quantization, and multi-head latent attention to support longer contexts and cheaper serving.
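The relationship between multi-head attention, GQA, and MQA can be sketched as a mapping from query heads to KV groups. The helper name `kv_group_for_head` is hypothetical, and assigning contiguous blocks of heads to each group is an assumption made for this illustration; real implementations may lay the groups out differently.

```python
def kv_group_for_head(head_index, num_heads, num_kv_groups):
    """Index of the shared KV group that a given query head reads from."""
    if num_heads % num_kv_groups != 0:
        raise ValueError("num_heads must be divisible by num_kv_groups")
    heads_per_group = num_heads // num_kv_groups
    # Assumption: contiguous blocks of query heads share one KV group.
    return head_index // heads_per_group

num_heads = 8
mappings = {}
for num_kv_groups in (8, 2, 1):
    mappings[num_kv_groups] = [
        kv_group_for_head(h, num_heads, num_kv_groups) for h in range(num_heads)
    ]
    print(num_kv_groups, mappings[num_kv_groups])
# 8 [0, 1, 2, 3, 4, 5, 6, 7]   one KV head per query head (multi-head attention)
# 2 [0, 0, 0, 0, 1, 1, 1, 1]   a few KV groups (GQA)
# 1 [0, 0, 0, 0, 0, 0, 0, 0]   a single shared KV head (MQA)
```

Seen this way, MQA and standard multi-head attention are the two endpoints of the same design, and the number of KV groups is the knob that trades cache size against quality.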

Real-World Applications

Speeding up token-by-token generation in chat assistants where the KV cache, not raw compute, limits throughput.

Google's PaLM, which used Multi-Query Attention to enable efficient large-scale inference.

Serving more users per GPU by shrinking the KV cache memory needed for each request.

Grouped-Query Attention in Llama 2 70B and Llama 3, a close relative that balances MQA's speed with the quality of full multi-head attention.
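The "more users per GPU" point can be illustrated with a back-of-the-envelope capacity estimate. Everything here is an assumption made for illustration: the helper name `max_concurrent_requests`, the 16 GiB cache budget, and the model shape (32 layers, head dimension 128, 4096-token contexts, 2 bytes per value). It also ignores model weights, activations, and allocator overhead.

```python
def max_concurrent_requests(cache_budget_bytes, num_layers, num_kv_heads,
                            head_dim, seq_len, bytes_per_value=2):
    """How many full-length requests fit in a fixed KV cache budget."""
    # Keys plus values for every layer and every cached token of one request.
    per_request = (2 * num_layers * num_kv_heads * head_dim
                   * seq_len * bytes_per_value)
    return cache_budget_bytes // per_request

budget = 16 * 2**30  # hypothetical: 16 GiB of GPU memory reserved for the cache
shape = dict(num_layers=32, head_dim=128, seq_len=4096)

mha_requests = max_concurrent_requests(budget, num_kv_heads=32, **shape)
gqa_requests = max_concurrent_requests(budget, num_kv_heads=8, **shape)
mqa_requests = max_concurrent_requests(budget, num_kv_heads=1, **shape)

print(mha_requests, gqa_requests, mqa_requests)  # 8 32 256
```

Under these assumptions the same memory budget holds 32 times as many requests with a single shared KV head as with 32 separate ones, and an 8-group GQA configuration sits in between.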

Use Case Patterns

Multi-Query Attention in Practice

Across the applications listed above, teams typically get better results when they set clear thresholds up front, keep a human escalation path for hard cases, and track product value and the cost of errors over time.

Risks and Guardrails

Hallucinated content can enter reports, support workflows, or research outputs.

Prompt sensitivity can produce different results for similar queries.

Sensitive data can be exposed if usage monitoring is weak.

Implementation Roadmap

1. Define the output format, tone, and quality standards before deployment.

2. Ground critical answers in trusted sources whenever accuracy matters.

3. Add human review checkpoints for high-impact outputs.

4. Track failure modes and regularly refine prompts or workflows.

Treat each step as an evidence gate: if the criteria are not met, pause the rollout, close the gap, and only then expand usage.
