Technical guide

Disaggregated prefill and decode serving

A serving architecture that splits large language model inference into two distinct phases, prefill and decode, and runs them on separate pools of GPUs.

Summary

Disaggregated serving matters because prefill and decode place different demands on hardware; forcing them onto the same machine wastes capacity and hurts latency.

Disaggregated prefill and decode serving is a building block that affects perceived model quality, infrastructure cost, latency, and reliability at scale.

Deep dive

When an LLM responds, it works in two stages. Prefill reads the whole prompt at once and builds the key-value (KV) cache; this is a large, parallel, compute-bound burst that saturates the GPU with math. Decode generates tokens one at a time, and every step reads the entire KV cache: a thin, memory-bandwidth-bound trickle. When the two run together, a long prefill stalls every in-flight decode (head-of-line blocking), and batching the two kinds of work together creates conflicts. Disaggregation places prefill on one GPU pool and decode on another, and transfers the KV cache between them over fast interconnects such as NVLink or InfiniBand. Each pool is tuned and scaled independently, which improves throughput, smooths tail latency, and lets operators hit time-to-first-token and time-per-output-token targets at the same time.
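The head-of-line blocking described above can be illustrated with a toy timeline. This is a minimal sketch under simplifying assumptions: one worker, a non-preemptive prefill with no chunking, and made-up step times. The function names and the millisecond values are hypothetical and are not taken from any serving system.

```python
def decode_token_times(num_tokens, decode_step_ms,
                       prefill_arrival_ms=None, prefill_ms=0.0):
    """Return the completion time (ms) of each decode step on one worker.

    If a prefill has arrived by the time a decode step is about to start,
    it runs to completion first (non-preemptive), stalling the stream.
    """
    times = []
    now = 0.0
    prefill_pending = prefill_arrival_ms is not None
    for _ in range(num_tokens):
        if prefill_pending and now >= prefill_arrival_ms:
            now += prefill_ms          # the long prefill occupies the GPU
            prefill_pending = False
        now += decode_step_ms          # one decode step emits one token
        times.append(now)
    return times


def max_gap(times):
    """Largest time between consecutive tokens (worst inter-token latency)."""
    gaps = [times[0]] + [b - a for a, b in zip(times, times[1:])]
    return max(gaps)


# Colocated: a 400 ms prefill lands on the same worker 50 ms into the stream.
colocated = decode_token_times(10, 20.0, prefill_arrival_ms=50.0, prefill_ms=400.0)
# Disaggregated: the prefill runs on another pool, so decode is undisturbed.
disaggregated = decode_token_times(10, 20.0)

print("worst gap colocated (ms):", max_gap(colocated))
print("worst gap disaggregated (ms):", max_gap(disaggregated))
```

With these made-up numbers the colocated stream sees one 420 ms gap between tokens, while the disaggregated stream stays at 20 ms per token. Real schedulers can soften the stall with chunked prefill, so treat this as an intuition aid rather than a performance model.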

Technical insight

The two phases differ in their bottlenecks. Prefill processes all prompt tokens in parallel, so FLOPs scale with prompt length and keep the tensor cores busy. Decode is autoregressive: each new token needs a forward pass that rereads the entire KV cache from HBM, so throughput is bounded by memory bandwidth, not compute. Disaggregation exploits this difference by choosing sizing, batching, and even parallelism strategy separately for each pool, and by shipping the KV cache from prefill workers to decode workers.
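A rough roofline-style estimate makes the bottleneck difference concrete. The sketch below assumes a dense model, the common rule of thumb of about 2 FLOPs per parameter per token, and weights read from memory once per forward pass. The parameter count, the peak FLOP rate, the memory bandwidth, and the KV-cache read size are illustrative placeholders, not specifications of any real GPU or model.

```python
def arithmetic_intensity(tokens_per_pass, params, bytes_per_param=2, kv_bytes_read=0):
    """Approximate FLOPs per byte moved for one forward pass of a dense model."""
    flops = 2 * params * tokens_per_pass                  # ~2 FLOPs per param per token
    bytes_moved = params * bytes_per_param + kv_bytes_read  # weights plus KV cache reads
    return flops / bytes_moved


def classify(intensity, peak_flops, mem_bandwidth):
    """Compare intensity with the hardware ridge point (FLOPs per byte)."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical numbers, chosen only for illustration.
PARAMS = 7_000_000_000        # 7B-parameter dense model
PEAK_FLOPS = 1e15             # 1 PFLOP/s of matrix math
MEM_BANDWIDTH = 3e12          # 3 TB/s of HBM bandwidth

prefill_ai = arithmetic_intensity(2048, PARAMS)           # 2048 prompt tokens at once
decode_ai = arithmetic_intensity(1, PARAMS, kv_bytes_read=1_000_000_000)  # one token

print("prefill:", prefill_ai, classify(prefill_ai, PEAK_FLOPS, MEM_BANDWIDTH))
print("decode:", round(decode_ai, 3), classify(decode_ai, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions prefill lands well above the ridge point and decode far below it. Batching many decode requests raises `tokens_per_pass`, which is one reason the decode pool benefits from a different batching policy than the prefill pool.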

Mastering disaggregated prefill and decode serving

To build deep understanding, treat disaggregated prefill and decode serving as a practice rather than a single capability: state the benefits you want, state your assumptions, and separate what the system can do reliably from what still needs expert judgment.

In practice, strong teams that adopt disaggregated prefill and decode serving optimize their architecture, data, and infrastructure choices for reliability and cost. They write clear success criteria, measure them on real data and real workloads, and iterate on observed failures rather than on one-off benchmark wins. This is where theoretical knowledge turns into durable capability in products, policies, and operations.

Architecture decisions carry benefits and operating costs for years afterward. Over that period, optimizing for a single benchmark can hide wider weaknesses in the system. The most robust approach combines fast iteration with governance discipline: run pilots, capture evidence, document decisions, and keep updating safeguards as usage patterns, user expectations, and regulations evolve.

Strategic implications

Architecture decisions carry benefits and operating costs for years afterward.

Technical literacy helps teams choose what is best rather than settling for whatever is newest.

Better engineering choices reduce reliability problems in operation.

In high-quality deployments, each of these translates into measurable operating standards, clear ownership, and regular review cycles, so that teams can build confidence as their environments grow more complex.

The future of disaggregated prefill and decode

Expect disaggregation to become the default in serving stacks. Systems such as DistServe, Splitwise, and Mooncake popularized it, and vLLM and NVIDIA Dynamo now ship their own variants. Research is pushing toward more efficient KV-cache transfer, cache pooling and reuse across requests, dynamic balancing of prefill/decode ratios as traffic shifts, and tighter integration with prefix caching and chunked prefill. As context windows grow toward millions of tokens, separating the phases becomes more important for cost-efficient serving.
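To see why KV-cache transfer becomes a concern as context windows grow, the cache size can be estimated from the model shape. The sketch below uses the standard per-token formula (keys plus values, for every layer and KV head). The model shape and the 50 GB/s link speed are hypothetical values picked for illustration, and the estimate ignores compression, layer-by-layer overlap, and protocol overhead.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache stored per token: K and V for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_transfer_seconds(context_tokens, per_token_bytes, link_bytes_per_s):
    """Time to move a request's whole KV cache over a link of the given speed."""
    return context_tokens * per_token_bytes / link_bytes_per_s


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128, fp16.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)

for context in (8_000, 128_000, 1_000_000):
    total_gb = context * per_token / 1e9
    seconds = kv_transfer_seconds(context, per_token, link_bytes_per_s=50e9)
    print(f"{context:>9} tokens: {total_gb:8.2f} GB, {seconds:6.3f} s at 50 GB/s")
```

For this assumed shape each token costs 128 KiB of cache, so a million-token context is on the order of 131 GB. At that scale the handoff from prefill to decode is no longer negligible, which is the motivation for the transfer, pooling, and reuse work mentioned above.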

Real-world applications

A chat assistant routes long documents to a compute-heavy prefill cluster, then streams responses from a memory-optimized decode cluster to keep output latency smooth.

NVIDIA Dynamo and vLLM let operators separate prefill and decode worker groups so that a burst of long prompts does not stall ongoing generations.

Mooncake (used by Moonshot AI's Kimi) disaggregates prefill and decode and adds a distributed KV-cache pool to cut redundant recomputation at scale.

A code-completion service dedicates a small prefill pool to short prompts and a large decode pool to generation, because most of its cost comes from streaming many output tokens.
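The sizing logic behind these examples can be sketched as a back-of-the-envelope calculation: prefill demand follows prompt tokens per second, and decode demand follows output tokens per second. Everything in the block below is an assumption made for illustration, including the `Workload` class, the per-GPU throughput figures, and the 70% target utilization; measured numbers from your own hardware should replace them.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    requests_per_s: float
    avg_prompt_tokens: int
    avg_output_tokens: int


def pool_sizes(w, prefill_tok_per_s, decode_tok_per_s, target_util=0.7):
    """Return (prefill_gpus, decode_gpus) needed to stay under target utilization."""
    prefill_demand = w.requests_per_s * w.avg_prompt_tokens   # prompt tokens/s
    decode_demand = w.requests_per_s * w.avg_output_tokens    # output tokens/s
    prefill_gpus = math.ceil(prefill_demand / (prefill_tok_per_s * target_util))
    decode_gpus = math.ceil(decode_demand / (decode_tok_per_s * target_util))
    return prefill_gpus, decode_gpus


# Hypothetical per-GPU throughputs: prefill is far faster per token than decode.
PREFILL_TOK_PER_S = 20_000
DECODE_TOK_PER_S = 2_000

code_completion = Workload(requests_per_s=50, avg_prompt_tokens=200, avg_output_tokens=400)
long_doc_chat = Workload(requests_per_s=5, avg_prompt_tokens=8_000, avg_output_tokens=300)

print("code completion:", pool_sizes(code_completion, PREFILL_TOK_PER_S, DECODE_TOK_PER_S))
print("long-document chat:", pool_sizes(long_doc_chat, PREFILL_TOK_PER_S, DECODE_TOK_PER_S))
```

With these placeholder figures the code-completion workload needs 1 prefill GPU and 15 decode GPUs, while the long-document chat workload needs 3 and 2. The point is the shape of the result: the ratio between pools depends on the prompt-to-output mix, which is why the pools are scaled independently.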

Usage patterns

Disaggregated prefill and decode serving in practice

The chat assistant, NVIDIA Dynamo and vLLM, Mooncake, and code-completion examples above share one pattern: a prefill pool sized for prompt volume and a decode pool sized for output streaming. Teams often get better results when they set clear thresholds up front, keep a human escalation path for failures, and track gains and errors over time.

Risks and guardrails

Optimizing for a single benchmark can hide wider weaknesses in the system.

Infrastructure and maintenance costs are often underestimated.

As systems become harder to understand, safety and monitoring problems can multiply.

Implementation roadmap

1. Baseline latency, quality, and cost before adoption.

2. Benchmark under real serving conditions with real data.

3. Instrument monitoring for errors, drift, and user impact.

4. Prepare rollback and incident-response paths before scaling.

Treat each step as an evidence gate: if the criteria are not met, pause the rollout, close the gap, and only then expand usage.
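An evidence gate of this kind can be expressed as a small check on measured latencies. The sketch below assumes the gate is defined on p99 time-to-first-token (TTFT) and p99 time-per-output-token (TPOT); the budgets, the sample values, and the function names are hypothetical, and a real gate would also cover quality and cost criteria.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def gate_passes(ttft_ms, tpot_ms, ttft_p99_budget_ms, tpot_p99_budget_ms):
    """True only if both p99 latencies are within their budgets."""
    return (percentile(ttft_ms, 99) <= ttft_p99_budget_ms
            and percentile(tpot_ms, 99) <= tpot_p99_budget_ms)


# Hypothetical pilot measurements in milliseconds.
ttft_samples = [180, 200, 220, 250, 900]
tpot_samples = [20, 22, 25, 24, 30]

if gate_passes(ttft_samples, tpot_samples, ttft_p99_budget_ms=500, tpot_p99_budget_ms=50):
    print("criteria met: expand usage")
else:
    print("criteria not met: pause the rollout and close the gap")
```

With these sample values the 900 ms outlier pushes p99 TTFT over the 500 ms budget, so the gate fails even though typical requests look healthy. Gating on tail percentiles rather than averages keeps that kind of regression visible.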
