Language AI GUIDE

Proximal Policy Optimization

Proximal Policy Optimization (PPO) is a reinforcement learning algorithm most closely associated with fine-tuning language models from human feedback.

Overview

PPO improves the policy in small, careful steps to avoid the instability that plagues naive policy-gradient methods.

Proximal Policy Optimization is part of the language-AI stack used to understand, generate, classify, and transform text and speech at scale.

Deep Dive

PPO was introduced by OpenAI in 2017 and became the workhorse behind RLHF in systems such as InstructGPT and ChatGPT. The core challenge in policy-gradient RL is that a single overly large update can collapse performance. PPO addresses this with a 'clipped surrogate objective': it measures how much more (or less) likely an action has become compared with the old policy, multiplies that ratio by the advantage (how much better the action was than expected), and then clips the ratio to a narrow range such as 0.8 to 1.2. This bounds how far the policy can move in each update, keeping learning stable while still allowing steady improvement. In RLHF for a language model, the 'action' is generating a token or response, the reward comes from a reward model, and a KL-divergence penalty keeps the model from drifting too far from its original behavior.
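The clipping step described above can be sketched for a single sampled action. This is a minimal illustration, not a reference implementation: the function name `clipped_surrogate` and its arguments are hypothetical, and eps = 0.2 is assumed because it yields the 0.8 to 1.2 range mentioned in the text.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate value (a quantity to maximize).

    logp_new / logp_old: log-probability of the sampled action under the
    current and the old policy. eps=0.2 is an assumed clip width.
    """
    # Probability ratio new / old, computed from log-probabilities.
    ratio = math.exp(logp_new - logp_old)
    # Restrict the ratio to [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # The action became 1.5x more likely and had a positive advantage:
    # the gain is capped by the clipped term.
    print(clipped_surrogate(math.log(1.5), 0.0, advantage=2.0))
    # Same ratio with a negative advantage: the unclipped term is lower,
    # so the penalty is not softened by clipping.
    print(clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0))
```

With a ratio of 1.5 and an advantage of 2.0, the clipped term (1.2 times 2.0) is the smaller one, so the objective stops rewarding further movement in that direction. Taking the minimum keeps the estimate pessimistic: clipping can limit a gain but never hides a loss.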

Technical Insight

PPO maximizes the clipped objective min(ratio * advantage, clip(ratio, 1 - eps, 1 + eps) * advantage), where ratio is the probability of the action under the new policy divided by its probability under the old policy. Advantages are typically estimated with Generalized Advantage Estimation and a learned value network (the critic). In RLHF, the total reward combines the reward model score with a per-token KL penalty against a reference policy, balancing reward gain against staying close to the original model.
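The reward shaping and advantage estimation mentioned above can be sketched as follows. This is a hedged illustration under stated assumptions: the function names, `beta`, `gamma`, and `lam` are hypothetical choices, the per-token KL penalty is approximated by the log-probability difference between policy and reference (one common convention), and the reward model score is assumed to be added on the final token only.

```python
def kl_shaped_rewards(logp_policy, logp_ref, rm_score, beta=0.1):
    """Per-token rewards for one response.

    Each token gets -beta * (logp_policy - logp_ref) as a KL penalty
    estimate; the scalar reward model score is added to the last token.
    """
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards


def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one finished episode.

    values[t] is the critic's estimate at step t; the value after the
    final step is assumed to be 0 (the episode has ended).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the accumulated future term.
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    rewards = kl_shaped_rewards([-1.0, -2.0], [-1.5, -2.0], rm_score=1.0)
    print(rewards)
    print(gae(rewards, values=[0.5, 0.5]))
```

Setting `lam` to 1.0 reduces the estimate to the full return minus the critic's value, while `lam` of 0.0 uses only the one-step TD error; values in between trade variance against bias.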

Mastering Proximal Policy Optimization

To build a deeper understanding, treat Proximal Policy Optimization as an operating model rather than a single feature: define the outcomes you want, make the reasoning explicit, and separate what the system can do reliably from what still requires expert judgment.

In practice, strong teams treat Proximal Policy Optimization, prompt design, retrieval, and review loops as one coordinated system. They write clear success criteria, evaluate against real data and workflows, and iterate on observed failure patterns rather than one-off benchmark wins. That is where theoretical understanding turns into lasting capability across product, policy, and operations.

Language workflows can move faster without sacrificing consistency. At the same time, hallucinated facts can silently enter reports, support flows, or research outputs. The strongest approach combines experimentation speed with governance discipline: run pilots, capture evidence, publish decision logs, and update safeguards continuously as model behavior, user expectations, and regulatory requirements change.

Strategic Impact

Language workflows can move faster without sacrificing consistency.

Access expands across languages and communication styles.

Teams can spend more time on judgment while automation handles repetition.

In high-quality deployments, each of these benefits translates into measurable operating rules, clear ownership boundaries, and repeatable review practices, so that teams scale confidence rather than ambiguity.

The Future of Proximal Policy Optimization

PPO remains robust but notoriously finicky: it requires a separate value network, careful hyperparameter tuning, and substantial compute. Simpler alternatives are gaining ground, including DPO (which uses no RL at all) and GRPO, which drops the value network by estimating advantages from groups of sampled responses and powers recent reasoning models. PPO will persist where on-policy exploration genuinely helps, but the field is steadily trading its complexity for cheaper methods.
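The idea of estimating advantages from a group of sampled responses, with no value network, can be sketched as below. This is an illustration under assumptions rather than a definitive GRPO implementation: the function name is hypothetical, rewards are standardized by the group's mean and population standard deviation (implementations differ on the exact normalization), and `eps` is an assumed guard against division by zero.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Advantages for several responses sampled from the same prompt.

    Each response is scored relative to its own group: subtract the
    group's mean reward and divide by the group's standard deviation.
    No learned critic is involved.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to one prompt, scored 1.0 (good) or 0.0 (bad).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that beat their group's average receive positive advantages and the rest receive negative ones. When every response in the group gets the same reward, all advantages are zero and that prompt contributes no learning signal.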

Real-World Applications

Fine-tuning InstructGPT and ChatGPT to follow instructions and human preferences via RLHF

Training game-playing and robot-control agents, PPO's original domain before language models

Reducing toxicity or improving helpfulness by maximizing a reward model score under a KL constraint

Improving tool use or multi-step agent behavior where the model is rewarded for completing tasks correctly

Usage Patterns

Proximal Policy Optimization in action

Across all four applications listed above, teams generally get better results when they define quality standards upfront, keep a human escalation path for edge cases, and track both productivity gains and error costs over time.

Risks & Guardrails

Hallucinated facts can silently enter reports, support flows, or research outputs.

Prompt sensitivity can produce inconsistent results across similar requests.

Sensitive text data may be exposed if access controls are weak.

Implementation Guide

1. Define the output format, tone, and quality standards before rollout.

2. Ground responses in trusted sources whenever accuracy matters.

3. Keep a human review checkpoint for high-stakes outputs.

4. Track failure patterns and retune prompts or workflows regularly.

Treat each step as an evidence gate: if the criteria are not met, pause the rollout, close the gap, and only then expand usage.

Keep Exploring