Summary
Expert parallelism splits the many feed-forward "expert" networks of a model across different GPUs so that each device holds only part of the parameters. It is key to serving trillion-parameter MoE models at an affordable cost, because only a few experts process each token.
Expert parallelism for MoE serving is a building block that affects model quality, infrastructure cost, latency, and reliability at scale.
Deep dive
A Mixture-of-Experts (MoE) layer replaces one large feed-forward network with many smaller ones (experts), plus a router that selects the top-k experts (often 1 or 2) for each token. Expert parallelism (EP) places different experts on different GPUs. At inference, the router decides which experts each token needs, then an all-to-all communication step shuffles the tokens to the GPUs that hold their selected experts, the FFN runs there, and a second all-to-all shuffles the results back. This lets a model carry a very large number of parameters (it is sparse) while activating only a small fraction of them per token (low FLOPs). Models such as Mixtral 8x7B, DeepSeek-V3, and GPT-OSS use it. The hard parts are balancing load across experts and paying for the two expensive all-to-all hops in every layer.
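The route, dispatch, compute, combine sequence described above can be sketched in a single process. This is only an illustration: the function names are hypothetical, the softmax over the selected logits is one common weighting choice rather than what every model does, experts are assumed to be placed on GPUs in contiguous, equal-sized groups, and plain Python lists stand in for the two all-to-all collectives.

```python
import math
from collections import defaultdict

def top_k_route(logits, k):
    """Pick the k highest-scoring experts for one token.

    Weights are a softmax over the selected logits only (an assumption;
    models differ in how they normalize router scores).
    """
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    peak = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - peak) for e in top]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]

def dispatch(router_logits, k, num_gpus):
    """Group (token, expert, weight) triples by the GPU that owns the expert.

    Stands in for the first all-to-all. Assumes contiguous, even placement.
    """
    num_experts = len(router_logits[0])
    experts_per_gpu = num_experts // num_gpus
    per_gpu = defaultdict(list)
    for token_id, logits in enumerate(router_logits):
        for expert_id, weight in top_k_route(logits, k):
            per_gpu[expert_id // experts_per_gpu].append((token_id, expert_id, weight))
    return dict(per_gpu)

def combine(per_gpu, token_values, expert_fn):
    """Run each expert on its tokens and sum the weighted outputs per token.

    Stands in for the expert FFN plus the second all-to-all.
    """
    out = [0.0] * len(token_values)
    for items in per_gpu.values():
        for token_id, expert_id, weight in items:
            out[token_id] += weight * expert_fn(expert_id, token_values[token_id])
    return out

# Three tokens, four experts, two GPUs (experts 0-1 on GPU 0, 2-3 on GPU 1).
router_logits = [
    [2.0, 0.1, 0.3, 1.5],
    [0.0, 3.0, 0.2, 2.5],
    [1.0, 0.9, 4.0, 0.1],
]
buckets = dispatch(router_logits, k=2, num_gpus=2)
identity_out = combine(buckets, [1.0, 2.0, 3.0], lambda expert_id, x: x)
```

With identity experts the combined output equals the input, which is a quick check that the routing weights for each token sum to one. In a real deployment the per-GPU buckets would be exchanged with a collective rather than built in one process.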
Technical insight
The core mechanism is a pair of all-to-all collectives in every MoE layer: dispatch (send tokens to their experts) and combine (gather the outputs). Because routing depends on the data, the number of tokens that land on each expert varies, which causes load imbalance and stragglers. Serving systems add capacity factors, per-expert buffers, and token dropping or padding to keep the GEMMs (matrix multiplies) evenly shaped, and they often overlap the all-to-all communication with computation to hide latency.
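A capacity factor can be illustrated with a small sketch. The formula below, capacity = ceil(capacity_factor * tokens * top_k / experts), is a common formulation and is assumed here rather than taken from any particular serving system; the function names and the padding value are hypothetical.

```python
import math

def expert_capacity(num_tokens, num_experts, top_k, capacity_factor):
    """Slots per expert: the even share of routed tokens, scaled by a safety factor."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)

def fill_buffers(assignments, num_experts, capacity, pad=-1):
    """Place (token_id, expert_id) pairs into fixed-size per-expert buffers.

    Tokens that arrive after an expert is full are dropped; short buffers are
    padded so every expert processes a batch of the same shape.
    """
    buffers = [[] for _ in range(num_experts)]
    dropped = []
    for token_id, expert_id in assignments:
        if len(buffers[expert_id]) < capacity:
            buffers[expert_id].append(token_id)
        else:
            dropped.append((token_id, expert_id))
    for buf in buffers:
        buf.extend([pad] * (capacity - len(buf)))
    return buffers, dropped

# Eight tokens, top-1 routing, four experts; expert 0 is heavily favored.
assignments = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 0), (5, 2), (6, 0), (7, 3)]
cap = expert_capacity(num_tokens=8, num_experts=4, top_k=1, capacity_factor=1.5)
buffers, dropped = fill_buffers(assignments, num_experts=4, capacity=cap)
```

Here the hot expert fills its three slots and two tokens overflow, while the other experts are mostly padding. That is the trade-off a capacity factor controls: a larger factor drops fewer tokens but wastes more compute on padding.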
Understanding expert parallelism for MoE serving
To build a deep understanding, treat expert parallelism for MoE serving as a discipline rather than a single capability: define the outcomes you want, state your assumptions, and separate what the system can do reliably from what still requires expert judgment.
In practice, strong teams that use expert parallelism for MoE serving optimize architecture, data, and infrastructure choices together for reliability and cost. They write clear success criteria, measure them against realistic data and workloads, and iterate with observability in place rather than chasing one-off benchmark wins. This is where theoretical knowledge turns into lasting capability in products, policies, and operations.
Architecture decisions drive capability and operating cost for years afterward. At the same time, optimizing for a single benchmark can hide broader weaknesses in the system. The most robust operating model combines learning speed with governance discipline: run pilots, capture evidence, document decisions, and keep updating safeguards as usage patterns, user expectations, and regulations evolve.
Strategic impact
Architecture decisions drive capability and operating cost for years afterward.
Technical literacy helps teams choose what is best rather than simply what is newest.
Better engineering choices reduce reliability problems in operation.
In high-quality deployments, each of these points is translated into measurable operating policies, clear ownership boundaries, and frequent review cycles, so that teams can scale with confidence as complexity grows.
Real-world applications
Serving Mixtral 8x7B across 2-4 GPUs by placing 2-4 of the 8 experts on each device
DeepSeek-V3 uses node-limited routing to cap how many nodes a token's experts can span, which cuts inter-node all-to-all traffic
Using the expert-parallel mode of vLLM or SGLang to serve a 200B+ sparse model on a single 8-GPU node
Combining expert parallelism with tensor parallelism for attention in a hybrid EP+TP deployment
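The idea behind node-limited routing can be sketched as follows. This is a simplified illustration, not DeepSeek-V3's actual implementation: the function name, the contiguous expert-to-node layout, and the rule of ranking nodes by the sum of their best few scores are assumptions made for the example.

```python
def node_limited_top_k(scores, experts_per_node, max_nodes, k):
    """Choose k experts for one token while touching at most max_nodes nodes.

    Step 1: rank nodes by the sum of their highest local scores.
    Step 2: take the top-k experts among the experts on the allowed nodes.
    Assumes experts are laid out contiguously, experts_per_node per node.
    """
    num_nodes = len(scores) // experts_per_node
    per_node_top = max(1, k // max_nodes)  # how many local scores count per node

    def node_score(node):
        start = node * experts_per_node
        local = sorted(scores[start:start + experts_per_node], reverse=True)
        return sum(local[:per_node_top])

    allowed = sorted(range(num_nodes), key=node_score, reverse=True)[:max_nodes]
    candidates = [e for e in range(len(scores)) if e // experts_per_node in allowed]
    chosen = sorted(candidates, key=lambda e: scores[e], reverse=True)[:k]
    return sorted(chosen), sorted(allowed)

# Eight experts on four nodes (two per node); the token may use at most two nodes.
scores = [0.9, 0.1, 0.2, 0.3, 0.8, 0.7, 0.85, 0.05]
chosen, nodes = node_limited_top_k(scores, experts_per_node=2, max_nodes=2, k=4)

# For comparison: unrestricted top-4 and the nodes it would touch.
plain = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:4]
plain_nodes = sorted({e // 2 for e in plain})
```

In this example the unrestricted top-4 would send the token to three nodes, while the node-limited choice stays on two at the price of swapping one high-scoring expert for a lower-scoring one on an allowed node.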
Use cases
Expert parallelism for MoE serving in practice
The four deployments listed above (Mixtral 8x7B split across 2-4 GPUs, node-limited routing in DeepSeek-V3, the expert-parallel mode of vLLM or SGLang for a 200B+ sparse model on one 8-GPU node, and hybrid EP+TP with tensor parallelism for attention) share the same operating advice. Teams typically get better outcomes when they define sound thresholds up front, keep a human escalation path for edge cases, and track product outcomes and the cost of errors over time.
Risks and guardrails
Optimizing for a single benchmark can hide broader weaknesses in the system.
Infrastructure and maintenance costs are often underestimated.
As systems become harder to interpret, safety and oversight problems can grow.
Implementation roadmap
Baseline latency, quality, and cost before deployment.
Benchmark under realistic load and with realistic data.
Instrument monitoring for errors, drift, and user impact.
Prepare rollback and incident-response paths before scaling up.
Treat each step as an evidence gate: if the criteria are not met, pause the rollout, close the gap, and only then expand the deployment.
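An evidence gate of this kind can be expressed as a small check that compares measured values with agreed criteria. The sketch below is hypothetical throughout: the metric names, thresholds, and the "expand" and "pause" labels are placeholders, not values from any real deployment.

```python
def evidence_gate(measured, criteria):
    """Decide whether a rollout step may expand.

    criteria maps a metric name to (kind, threshold), where kind is "max"
    (value must not exceed the threshold) or "min" (value must reach it).
    A metric that was never measured counts as a failure.
    """
    failures = []
    for metric, (kind, threshold) in criteria.items():
        value = measured.get(metric)
        if value is None:
            failures.append(f"{metric}: not measured")
        elif kind == "max" and value > threshold:
            failures.append(f"{metric}: {value} exceeds {threshold}")
        elif kind == "min" and value < threshold:
            failures.append(f"{metric}: {value} below {threshold}")
    return ("expand" if not failures else "pause"), failures

# Placeholder criteria agreed before the rollout.
criteria = {
    "p99_latency_ms": ("max", 250),
    "quality_score": ("min", 0.90),
    "cost_per_1k_tokens": ("max", 0.02),
}

decision, reasons = evidence_gate(
    {"p99_latency_ms": 310, "quality_score": 0.93, "cost_per_1k_tokens": 0.015},
    criteria,
)
ok_decision, ok_reasons = evidence_gate(
    {"p99_latency_ms": 200, "quality_score": 0.93, "cost_per_1k_tokens": 0.015},
    criteria,
)
```

Treating a missing measurement as a failure keeps the gate conservative: a step cannot pass simply because nobody collected the evidence.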