Technical Guide

Group Relative Policy Optimization

Group Relative Policy Optimization (GRPO) is a reinforcement learning method for fine-tuning language models. It judges each response against a group of sibling responses to the same prompt, which removes the separate value network that PPO relies on.

Overview

GRPO rose to prominence as the core training technique behind DeepSeek's reasoning models.

Group Relative Policy Optimization is a technical building block that affects model quality, infrastructure cost, latency, and reliability at scale.

Deep Dive

GRPO is a policy-gradient reinforcement learning variant designed to make RL fine-tuning of large language models cheaper and more stable. Standard PPO needs a learned "critic" (a value model), roughly as large as the policy itself, to estimate how good each token is. GRPO removes that critic entirely. For each prompt it samples a group of completions (say 8-64), scores all of them with a reward signal, and then computes each completion's advantage by normalizing its reward against the group's mean and standard deviation. Above-average responses are reinforced and below-average ones are suppressed. A KL-divergence term keeps the model close to a reference policy. Introduced by DeepSeek, GRPO powered DeepSeekMath and the DeepSeek-R1 reasoning model.
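The group normalization step can be illustrated with a minimal sketch. The function name `group_relative_advantages`, the small `eps` guard against a zero standard deviation, and the choice of the population standard deviation are assumptions made for this example; real implementations differ on these details.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # assumption: population std; some code uses sample std
    # eps (an assumption) avoids division by zero when every reward is identical
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for 8 completions sampled from one prompt (1.0 = correct).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

# Correct completions get a positive advantage, incorrect ones a negative one.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f}  advantage={advantage:+.3f}")
```

Note that when every completion in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.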

Technical Insight

The key idea is to replace PPO's value baseline with a Monte Carlo group baseline. For a group of outputs with rewards r_i, each advantage is A_i = (r_i - mean(r)) / std(r). That normalized score multiplies a clipped probability ratio, exactly as in PPO, and a KL penalty against a frozen reference model limits drift. Because no critic is trained, memory and compute are roughly halved, and the on-the-fly normalization yields advantages that are naturally scaled and low in variance.
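To make the objective concrete, here is a simplified sketch of the clipped surrogate with a KL penalty. It works with one log-probability per completion rather than per token, and the defaults `clip_eps=0.2` and `beta=0.04` are illustrative assumptions, not values taken from this guide. The KL term uses the non-negative estimator exp(x) - x - 1 with x = log pi_ref - log pi_new; treat that choice as an assumption as well.

```python
import math


def grpo_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, beta=0.04):
    """Simplified per-completion GRPO loss (to be minimized)."""
    total = 0.0
    for lp_new, lp_old, lp_ref, adv in zip(logp_new, logp_old, logp_ref, advantages):
        # Probability ratio between the current policy and the sampling policy.
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # PPO-style pessimistic bound: take the smaller of the two surrogates.
        surrogate = min(ratio * adv, clipped * adv)
        # Non-negative KL estimate against the frozen reference model.
        log_ref_ratio = lp_ref - lp_new
        kl = math.exp(log_ref_ratio) - log_ref_ratio - 1.0
        total += surrogate - beta * kl
    # Negate because optimizers minimize, while the objective is maximized.
    return -total / len(advantages)


# Hypothetical log-probabilities for two completions in one group.
loss = grpo_loss(
    logp_new=[-10.0, -12.5],
    logp_old=[-10.2, -12.0],
    logp_ref=[-10.1, -12.2],
    advantages=[1.0, -1.0],
)
print(f"loss={loss:.4f}")
```

When the new, old, and reference policies agree, the ratio is 1 and the KL term is 0, so the loss reduces to the negative mean advantage.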

Mastering Group Relative Policy Optimization

To build a deep understanding, treat Group Relative Policy Optimization as an operating model rather than a single feature: define the desired outcomes, make the reasoning explicit, and separate what the system can do reliably from what still requires expert judgment.

In practice, strong teams using Group Relative Policy Optimization optimize their architecture, data, and infrastructure choices against reliability and cost. They write explicit success criteria, evaluate against real data and workflows, and iterate based on observed failure patterns rather than one-off benchmark wins. This is where theoretical understanding turns into durable capability across product, policy, and operations.

Architecture decisions drive performance and operating costs for years. At the same time, optimizing for a single benchmark can hide broader system weaknesses. The most robust approach combines experimentation speed with governance discipline: run pilots, capture evidence, publish decision logs, and update safeguards continuously as model behavior, user expectations, and regulatory requirements change.

Strategic Impact

Architecture decisions drive performance and operating costs for years.

Technical literacy helps teams choose the right stack, not just the newest one.

Better engineering choices reduce reliability incidents in production.

In high-quality deployments, each of these translates into measurable operating rules, clear ownership boundaries, and repeatable review practices, so teams can scale confidence rather than ambiguity.

The Future of Group Relative Policy Optimization

GRPO has become the default recipe for training open reasoning models, and labs are iterating on its weak points. Researchers are exploring corrections for length and difficulty bias (such as Dr. GRPO), token-level instead of sequence-level normalization, and removing or reshaping the KL term. Expect tighter integration with verifiable rewards (math, code, tool use), better handling of sparse signals, and hybrids that combine a group baseline with lightweight critics for long, multi-step tasks.
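The difference between sequence-level and token-level normalization can be seen in a toy comparison. This is only an illustration of how the two averaging schemes weight tokens; it does not reproduce any specific published variant, and the per-token loss values are made up.

```python
def sequence_level_mean(token_losses):
    """Average within each completion first, then across the group."""
    per_sequence = [sum(seq) / len(seq) for seq in token_losses]
    return sum(per_sequence) / len(per_sequence)


def token_level_mean(token_losses):
    """Pool every token in the group and average once."""
    total = sum(sum(seq) for seq in token_losses)
    count = sum(len(seq) for seq in token_losses)
    return total / count


# Hypothetical per-token losses: one short completion and one long completion.
group = [[1.0, 1.0], [0.5] * 8]

seq_mean = sequence_level_mean(group)  # each completion weighs the same
tok_mean = token_level_mean(group)     # each token weighs the same
print(f"sequence-level={seq_mean:.2f}  token-level={tok_mean:.2f}")
```

Under sequence-level averaging, a token in the long completion carries less weight than a token in the short one, which is one way a length bias can enter the gradient.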

Real-World Applications

Training DeepSeek-R1 and DeepSeekMath to produce long chains of reasoning, using rule-based accuracy rewards on math problems.

Fine-tuning code-generation models, where each sampled solution is scored by whether it passes unit tests and the group is normalized to pick out the winners.

Open-source RLHF pipelines (e.g., in the TRL and verl libraries) that use GRPO to align chat models without paying for a separate value network.

Improving instruction following or safety behavior by sampling several responses per prompt and rewarding those that a reward model scores higher relative to their peers.
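The rule-based and unit-test rewards mentioned above can be sketched as two small scoring functions. Everything here is hypothetical: the `\boxed{...}` answer convention, the function names, the `solve` entry point, and the sample completions are assumptions for illustration, and a real pipeline would run candidate code in a sandbox rather than calling `exec` directly.

```python
import re


def math_reward(completion, reference):
    # 1.0 if the last \boxed{...} answer equals the reference string, else 0.0.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == reference.strip() else 0.0


def unit_test_reward(source, tests, func_name="solve"):
    # 1.0 if the candidate defines func_name and passes every (args, expected) case.
    namespace = {}
    try:
        exec(source, namespace)  # illustration only: sandbox untrusted code in practice
        func = namespace[func_name]
        passed = all(func(*args) == expected for args, expected in tests)
        return 1.0 if passed else 0.0
    except Exception:
        # Syntax errors, missing functions, and runtime failures all score zero.
        return 0.0


# Hypothetical group of sampled solutions to "add two numbers".
candidates = [
    "def solve(a, b):\n    return a + b",
    "def solve(a, b):\n    return a - b",
    "def solve(a, b):\n    return a +",
]
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
code_rewards = [unit_test_reward(src, tests) for src in candidates]

math_rewards = [
    math_reward("So the total is \\boxed{42}.", "42"),
    math_reward("The answer is 42", "42"),
]
print(code_rewards, math_rewards)
```

Binary rewards like these are then normalized within each group, so only prompts with a mix of passing and failing samples produce non-zero advantages.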

Usage Patterns

Group Relative Policy Optimization in action

Across the applications listed above, teams typically get better results when they define quality metrics up front, keep a human escalation path for edge cases, and track both productivity gains and error costs over time.

Risks & Guardrails

Optimizing for a single benchmark can hide broader system weaknesses.

Infrastructure and maintenance costs are often underestimated.

Security and observability gaps can grow as systems become more complex.

Getting Started Guide

Treat each step as an evidence gate: if the criteria are not met, pause the rollout, close the gap, and only then expand usage.

1. Define latency, quality, and cost targets before rollout.

2. Benchmark under realistic load and data conditions.

3. Instrument monitoring for errors, drift, and user impact.

4. Prepare rollback and incident procedures before scaling.
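The evidence-gate idea in the steps above can be sketched as a small check that compares measured metrics with predefined targets. The metric names, thresholds, and the `gate_passes` helper are all hypothetical and chosen only for illustration.

```python
def gate_passes(measured, targets):
    """Return (ok, failures) after comparing measured metrics with targets.

    targets maps a metric name to (direction, threshold), where "max" means
    the value must not exceed the threshold and "min" means it must reach it.
    """
    failures = []
    for name, (direction, threshold) in targets.items():
        value = measured.get(name)
        if value is None:
            failures.append(name)  # a missing measurement counts as a failure
        elif direction == "max" and value > threshold:
            failures.append(name)
        elif direction == "min" and value < threshold:
            failures.append(name)
    return (not failures, failures)


# Hypothetical targets defined before rollout, and one benchmark run.
targets = {
    "p95_latency_ms": ("max", 800),
    "pass_rate": ("min", 0.90),
    "cost_per_1k_requests_usd": ("max", 2.50),
}
measured = {"p95_latency_ms": 640, "pass_rate": 0.87, "cost_per_1k_requests_usd": 1.90}

ok, failures = gate_passes(measured, targets)
print("expand rollout" if ok else f"pause rollout, close gaps in: {failures}")
```

Keeping the targets in data rather than in code makes it easier to review and log each gate decision.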
