Technical Guide

Expert Parallelism for MoE Serving

Expert parallelism splits the many "experts" of a Mixture-of-Experts model across different GPUs so that each device holds only a slice of the parameters.

Overview

Expert parallelism is key to serving trillion-parameter MoE models affordably, since only a few experts are active for each token.

Expert parallelism for MoE serving is a technical building block that affects model quality, infrastructure cost, latency, and reliability at scale.

Deep Dive

A Mixture-of-Experts (MoE) layer replaces one large feed-forward network (FFN) with many smaller ones (the experts) plus a router that picks the top-k experts (commonly 1 or 2) for each token. Expert parallelism (EP) places different experts on different GPUs. At inference, the router decides which experts each token needs, a communication step shuffles tokens to the GPUs that hold their chosen experts, each GPU runs the expert FFN, and a second communication step shuffles the results back. This lets a model carry a very large total parameter count (sparse capacity) while activating only a small fraction of it per token (low FLOPs). Models such as Mixtral 8x7B, DeepSeek-V3, and GPT-OSS use this design. The hard parts are load balancing across experts and the two expensive all-to-all hops in every MoE layer.
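The route, dispatch, compute, and combine flow described above can be sketched in a few lines of plain Python. This is a toy, single-process illustration: tokens are scalars, the experts are stand-in functions, and the names `route_top_k` and `moe_layer` are hypothetical. In a real deployment the `inbox` grouping is where the first all-to-all happens and the final accumulation is where the second one happens.

```python
import math
from collections import defaultdict


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return [(expert_id, gate_weight)] for one token.

    Picks the k highest-probability experts and renormalizes
    their weights so they sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    norm = sum(probs[e] for e in top)
    return [(e, probs[e] / norm) for e in top]


def moe_layer(tokens, logits_per_token, experts, k=2):
    """Toy MoE layer: route, dispatch, run experts, combine."""
    # Dispatch: group (token index, gate weight) pairs by chosen expert.
    inbox = defaultdict(list)
    for t, logits in enumerate(logits_per_token):
        for e, w in route_top_k(logits, k):
            inbox[e].append((t, w))

    # Expert compute + combine: weighted sum of expert outputs per token.
    out = [0.0] * len(tokens)
    for e, items in inbox.items():
        for t, w in items:
            out[t] += w * experts[e](tokens[t])

    loads = {e: len(items) for e, items in inbox.items()}
    return out, loads


# Four stand-in "experts": expert i multiplies its input by (i + 1).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
tokens = [1.0, 2.0]
logits = [[0.0, 0.0, 5.0, 5.0],   # token 0 prefers experts 2 and 3
          [5.0, 5.0, 0.0, 0.0]]   # token 1 prefers experts 0 and 1
outputs, loads = moe_layer(tokens, logits, experts, k=2)
```

The `loads` dictionary is the quantity that load balancing cares about: when routing is skewed, some experts receive far more tokens than others.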

Technical Insight

The core mechanism is a pair of all-to-all collectives in every MoE layer: dispatch (send tokens to their experts) and combine (gather the results back). Because routing is data-dependent, the number of tokens that lands on each expert varies, which causes load imbalance and "stragglers." Serving systems add capacity factors, expert buffers, and token dropping or padding to keep the GEMMs (matrix multiplies) uniform in shape, and they typically overlap the all-to-all communication with expert computation to hide latency.
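As a minimal sketch of how a capacity factor, fixed-size expert buffers, and token dropping or padding fit together, consider the following. The formula used here (capacity factor times the average number of routed tokens per expert, rounded up) is one common convention and is an assumption, as are the function names. Real systems differ in how they handle overflow.

```python
import math


def expert_capacity(num_tokens, num_experts, top_k, capacity_factor):
    """Slots per expert: capacity_factor x the average routed load, rounded up."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)


def dispatch_with_capacity(assignments, num_experts, capacity):
    """Fill fixed-size expert buffers from (token_id, expert_id) pairs.

    Pairs that arrive after an expert's buffer is full are dropped.
    Buffers that end up short are padded with None so every expert
    sees the same batch shape (uniform GEMMs).
    """
    buffers = [[] for _ in range(num_experts)]
    dropped = []
    for token_id, expert_id in assignments:
        if len(buffers[expert_id]) < capacity:
            buffers[expert_id].append(token_id)
        else:
            dropped.append((token_id, expert_id))
    padded = [b + [None] * (capacity - len(b)) for b in buffers]
    return padded, dropped


# 8 tokens, 4 experts, top-1 routing, 25% headroom -> 3 slots per expert.
capacity = expert_capacity(num_tokens=8, num_experts=4, top_k=1, capacity_factor=1.25)

# Skewed routing: tokens 0-4 all pick expert 0, the rest spread out.
assignments = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 1), (6, 2), (7, 3)]
padded, dropped = dispatch_with_capacity(assignments, num_experts=4, capacity=capacity)
```

With this skewed input, expert 0 fills its three slots and two tokens overflow, while the other experts each carry two padding slots. A higher capacity factor drops fewer tokens at the cost of more padded compute.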

Mastering Expert Parallelism for MoE Serving

To build a deep understanding, treat expert parallelism for MoE serving as an operating model rather than a single feature: define the desired outcomes, state your assumptions, and separate what the system can do reliably from what still needs expert judgment.

In practice, strong teams that run expert parallelism for MoE serving weigh architecture, data, and infrastructure choices against reliability and cost. They write clear success criteria, evaluate against real data and real workflows, and iterate on observed failure patterns instead of chasing one-off benchmark wins. This is where theoretical understanding turns into durable capability across product, policy, and operations.

Architecture decisions drive performance and operating cost for years. At the same time, optimizing for a single benchmark can hide broader system weaknesses. The most robust approach combines experimentation speed with governance discipline: run pilots, capture evidence, publish decision logs, and keep updating safeguards as model behavior, user expectations, and regulatory requirements change.

Strategic Impact

Architecture decisions drive performance and operating cost for years.

In high-quality deployments, this translates into measurable operating rules, clear ownership boundaries, and repeatable review practices, so that teams scale confidence rather than ambiguity.

Technical literacy helps teams choose the right stack, not just the newest one.

Better engineering choices reduce reliability incidents in production.

The Future of Expert Parallelism for MoE Serving

Expect tighter co-design of routing and hardware: fused dispatch-compute-combine kernels, grouped GEMMs that batch many experts together, and NVLink/InfiniBand-aware all-to-all. Techniques such as DeepSeek's auxiliary-loss-free balancing and node-limited routing improve load balance and reduce cross-node traffic, respectively. Disaggregated serving will dedicate separate GPUs to experts and to attention, and larger expert counts (hundreds) with fine-grained top-k will push MoE toward greater sparsity while keeping per-token cost low.
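To make the idea of bias-based, auxiliary-loss-free balancing more concrete, here is a rough sketch of the general mechanism, not DeepSeek's implementation. It assumes a per-expert bias that is added to the routing scores only when selecting experts, and that is nudged down for overloaded experts and up for underloaded ones. The function names, the fixed step size, and the compare-to-mean rule are illustrative assumptions.

```python
def select_top_k(scores, bias, k):
    """Choose k experts by (score + bias).

    The bias steers selection only; gate weights would still be
    computed from the unbiased scores.
    """
    order = sorted(range(len(scores)),
                   key=lambda e: scores[e] + bias[e],
                   reverse=True)
    return order[:k]


def update_bias(bias, loads, step):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    mean_load = sum(loads) / len(loads)
    new_bias = []
    for b, load in zip(bias, loads):
        if load > mean_load:
            new_bias.append(b - step)
        elif load < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


# Every token in this toy batch has the same scores, so without a bias
# all of them pick expert 0.
scores = [0.4, 0.3, 0.2, 0.1]
bias = [0.0, 0.0, 0.0, 0.0]
batch_size = 4

for _ in range(10):
    loads = [0] * len(scores)
    for _ in range(batch_size):
        for e in select_top_k(scores, bias, k=1):
            loads[e] += 1
    bias = update_bias(bias, loads, step=0.05)
```

Over successive batches the bias on the popular expert falls and the others rise, so selection spreads out without adding a balancing term to the loss.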

Real-World Implementations

Serving Mixtral 8x7B across 2-4 GPUs by placing 2-4 of its 8 experts on each device.

DeepSeek-V3 uses node-limited routing to cap how many nodes a token's experts can span, which cuts inter-node all-to-all traffic.

Using the expert-parallel mode in vLLM or SGLang to host a 200B+ sparse model on a single 8-GPU node.

Combining expert parallelism with tensor parallelism for the attention layers in a hybrid EP+TP deployment.
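The placement examples above come down to a mapping from expert ID to device. The sketch below assumes a simple contiguous, evenly divided layout and uses hypothetical helper names. Serving frameworks may place experts differently, for example by replicating hot experts.

```python
def place_experts(num_experts, num_gpus):
    """Map expert_id -> gpu_rank using contiguous, equal-sized blocks."""
    if num_experts % num_gpus != 0:
        raise ValueError("this sketch assumes experts divide evenly across GPUs")
    per_gpu = num_experts // num_gpus
    return {e: e // per_gpu for e in range(num_experts)}


def gpus_touched(chosen_experts, placement):
    """Set of GPU ranks one token must be dispatched to."""
    return {placement[e] for e in chosen_experts}


# 8 experts (as in Mixtral 8x7B) on 4 GPUs -> 2 experts per GPU.
placement_4 = place_experts(num_experts=8, num_gpus=4)
# {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}

# The same model on 2 GPUs -> 4 experts per GPU.
placement_2 = place_experts(num_experts=8, num_gpus=2)

# A token routed to experts 1 and 6 must reach two different GPUs on the
# 4-GPU layout, while experts 0 and 1 live on the same GPU.
remote = gpus_touched([1, 6], placement_4)   # {0, 3}
local = gpus_touched([0, 1], placement_4)    # {0}
```

Counting how many devices or nodes each token touches is the quantity that node-limited routing caps: the fewer distinct destinations per token, the less traffic the all-to-all has to carry.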

Implementation Patterns

Expert Parallelism for MoE Serving in Practice

Teams typically get better results when they define quality thresholds up front, keep a human escalation path for edge cases, and track both productivity gains and error costs over time.

Risks & Guardrails

Optimizing for a single benchmark can hide broader system weaknesses.

Infrastructure and maintenance costs are often underestimated.

Security and observability gaps can widen as systems grow more complex.

Getting Started Guide

1. Define latency, quality, and cost targets before rollout.

2. Benchmark under realistic load and data conditions.

3. Instrument monitoring for errors, drift, and user impact.

4. Prepare rollback and incident procedures before scaling.

Treat each step as an evidence gate: if the criteria are not met, pause the rollout, close the gap, and only then expand usage.