Technical Guide

PagedAttention and vLLM

PagedAttention is a memory-management technique that stores a language model's attention key-value (KV) cache in small, reusable blocks instead of one large contiguous region.

Overview

PagedAttention powers vLLM, an open-source serving engine that increases the number of requests a single GPU can handle.

Together, PagedAttention and vLLM form a technical building block that affects model quality, infrastructure cost, latency, and reliability at scale.

Deep Dive

When a language model generates text, it keeps a KV cache (keys and values) for every token it has seen, so that the next token can attend to the full context. Traditionally, each request reserves one large contiguous region of GPU memory sized for its maximum possible length, which wastes a great deal of memory when sequences are short or vary in length. PagedAttention, introduced in the 2023 vLLM paper from UC Berkeley, borrows the idea of paged virtual memory from operating systems: it splits the KV cache into fixed-size blocks that can live anywhere in memory and are allocated on demand. A lookup table, the block table, maps logical token positions to physical blocks. This nearly eliminates memory fragmentation and makes it possible to share blocks, for example across multiple outputs sampled from the same prompt.
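The block-table idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not vLLM's implementation: the names `BlockAllocator`, `Sequence`, and `locate`, the block size of 16 tokens, and the 2048-token maximum length are hypothetical choices for illustration, and no real KV tensors are stored.

```python
import math

BLOCK_SIZE = 16  # tokens per KV block (assumed value for illustration)


class BlockAllocator:
    """Hands out physical block ids from a fixed pool, in no particular order."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)


class Sequence:
    """Tracks one request's block table: logical block index -> physical block id."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is claimed only when the previous one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def locate(self, position):
        """Translate a logical token position into (physical block, offset)."""
        if not 0 <= position < self.num_tokens:
            raise IndexError(position)
        return self.block_table[position // BLOCK_SIZE], position % BLOCK_SIZE


def reserved_slots_contiguous(max_len):
    # Contiguous scheme: reserve for the longest sequence the request may reach.
    return max_len


def reserved_slots_paged(actual_len):
    # Paged scheme: only whole blocks actually touched are reserved.
    return math.ceil(actual_len / BLOCK_SIZE) * BLOCK_SIZE


allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(40):
    seq.append_token()

print(seq.block_table)   # [7, 6, 5]
print(seq.locate(37))    # (5, 5)
print(reserved_slots_contiguous(2048), reserved_slots_paged(40))  # 2048 48
```

In this run the 40-token sequence occupies three blocks (48 token slots), whereas a contiguous reservation sized for a 2048-token maximum would hold 2048 slots for the same request.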

Technical Insight

The KV cache is divided into fixed-size pages, each holding the keys and values for a set number of tokens. A per-sequence block table maps logical block indices to physical page locations, so a sequence's cache does not have to be contiguous. Because identical prefixes (a shared system prompt, or the branches of a beam search) can point to the same physical pages via copy-on-write, memory is reused rather than duplicated, cutting waste from over 60% to a few percent.
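Copy-on-write sharing is usually built on reference counts. The sketch below is a minimal illustration of that mechanism, assuming hypothetical names (`BlockPool`, `fork`, `writable_last_block`) and integer block ids in place of real GPU pages; it is not taken from vLLM's code.

```python
class BlockPool:
    """Physical KV blocks with reference counts (toy model, no real tensors)."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.ref_count = {}

    def allocate(self):
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block):
        self.ref_count[block] += 1
        return block

    def release(self, block):
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            del self.ref_count[block]
            self.free.append(block)


def fork(pool, block_table):
    """Create a child sequence that shares every block of its parent."""
    return [pool.share(block) for block in block_table]


def writable_last_block(pool, block_table):
    """Return a block the caller may write to, copying it first if it is shared."""
    last = block_table[-1]
    if pool.ref_count[last] == 1:
        return last               # sole owner: write in place
    fresh = pool.allocate()       # shared: copy-on-write
    # A real engine would copy the KV contents of `last` into `fresh` here.
    pool.release(last)
    block_table[-1] = fresh
    return fresh


pool = BlockPool(num_blocks=8)
parent = [pool.allocate(), pool.allocate()]   # the prompt occupies two blocks
child = fork(pool, parent)                    # second sample of the same prompt

written = writable_last_block(pool, child)    # the child diverges from the parent
print(parent, child)      # [7, 6] [7, 5]
print(pool.ref_count)     # {7: 2, 6: 1, 5: 1}
```

After the fork and one divergent write, the two sequences use three physical blocks instead of four: the first block stays shared, and only the block being written was copied.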

A Guide to PagedAttention and vLLM

To build a deeper understanding, treat PagedAttention and vLLM as an operating design rather than a single feature: define the desired outcome, state your assumptions, and separate what the system can do reliably from what still requires expert judgment.

In practice, strong teams using PagedAttention and vLLM optimize their architecture, data, and infrastructure choices for reliability and cost. They write down explicit success criteria, test against real data and workflows, and iterate on observed failure patterns rather than one-off wins. This is where theoretical understanding turns into durable capability across products, policies, and operations.

Architecture decisions shape performance and operating cost for years. At the same time, optimizing a single metric can hide larger weaknesses in the system. The most effective approach combines fast experimentation with operational discipline: run pilots, capture evidence, publish decision logs, and keep safeguards up to date as model behavior, user expectations, and regulatory requirements evolve.

Strategic Impact

Architecture decisions shape performance and operating cost for years.

Technical literacy helps teams choose the right stack, not just the newest one.

Better engineering choices reduce reliability problems in production.

In effective teams, each of these points translates into measurable operating standards, clear ownership boundaries, and a regular review cadence, so that teams build confidence rather than lingering doubt.

The Future of PagedAttention and vLLM

vLLM has become a default backbone for open-source model serving, and PagedAttention's ideas now appear in most serving stacks. Expect deeper prefix caching (reusing the cached KV state of shared system prompts across users), disaggregation of prefill and decode onto separate machines, better eviction policies, and integration with quantization and speculative decoding. As context windows grow toward millions of tokens, efficient KV cache management becomes essential to keeping serving affordable.
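One common way to implement prefix caching is to key each full block by a hash of its own tokens chained with the key of the block before it, so that two requests share a key only if everything up to and including that block is identical. The sketch below illustrates the idea under simplifying assumptions: `block_keys`, `PrefixCache`, and `admit` are hypothetical names, the block size of 4 is deliberately tiny, token ids are made up, and eviction is not modeled. It is not a description of vLLM's internals.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; tiny on purpose so the example is readable


def block_keys(token_ids):
    """Yield one key per full block; each key depends on all tokens before it."""
    previous = b""
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        chunk = token_ids[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(previous + repr(chunk).encode()).digest()
        yield digest
        previous = digest


class PrefixCache:
    """Maps chained block keys to physical block ids (no eviction in this toy)."""

    def __init__(self):
        self.blocks = {}
        self.next_block = 0

    def admit(self, token_ids):
        """Return (block ids, number of prompt tokens served from the cache)."""
        table, reused = [], 0
        for key in block_keys(token_ids):
            if key in self.blocks:
                reused += BLOCK_SIZE              # KV state already computed
            else:
                self.blocks[key] = self.next_block  # would be prefilled now
                self.next_block += 1
            table.append(self.blocks[key])
        return table, reused


system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]          # two full blocks
cache = PrefixCache()

first = cache.admit(system_prompt + [9, 10, 11, 12, 13])
second = cache.admit(system_prompt + [20, 21, 22, 23])

print(first)    # ([0, 1, 2], 0)
print(second)   # ([0, 1, 3], 8)
```

The second request reuses the two blocks that hold the shared system prompt and only needs fresh computation for its own suffix. Trailing tokens that do not fill a whole block are left out of the cache in this sketch.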

Real-World Applications

Serving an open-source LLM API where vLLM handles many concurrent chat users from a single GPU at high throughput.

Sharing a long system prompt across thousands of requests through prefix caching, so it is processed once rather than repeatedly.

Running beam search or parallel sampling in which multiple candidates share the KV blocks of a common prompt via copy-on-write.

Cutting GPU memory waste from fragmentation so a provider can pack more concurrent sessions onto the same hardware.

Implementation Patterns

PagedAttention and vLLM in practice

For each of the applications listed above, teams typically get the best results when they define quality metrics up front, keep a human escalation path for edge cases, and track both production wins and the cost of errors over time.

Risks & Guardrails

Optimizing a single metric can hide larger weaknesses in the system.

Infrastructure and maintenance costs are often underestimated.

Security and observability gaps can widen as the system scales.

Roadmap

1. Define latency, quality, and cost targets before deployment.

2. Benchmark under realistic load and data conditions.

3. Instrument monitoring for errors, drift, and user impact.

4. Prepare rollback and incident response paths before scaling.

Treat each step as an evidence gate: if the criteria are not met, pause the rollout, close the gap, and only then expand usage.
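An evidence gate of this kind can be expressed as a small check that runs after each benchmark. The following is only a sketch: the function names, the thresholds (300 ms at the 95th percentile, 1% error rate), and the latency samples are hypothetical placeholders rather than recommended targets or measured results.

```python
import statistics


def percentile_95(samples):
    """95th percentile, using the inclusive method of statistics.quantiles."""
    return statistics.quantiles(samples, n=100, method="inclusive")[94]


def evidence_gate(latencies_ms, errors, requests, max_p95_ms, max_error_rate):
    """Return (passed, report) for one rollout stage."""
    report = {
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": percentile_95(latencies_ms),
        "error_rate": errors / requests,
    }
    passed = (
        report["p95_ms"] <= max_p95_ms
        and report["error_rate"] <= max_error_rate
    )
    return passed, report


# Synthetic latency samples standing in for a real benchmark run.
latencies_ms = list(range(100, 300, 10))

passed, report = evidence_gate(
    latencies_ms,
    errors=1,
    requests=200,
    max_p95_ms=300.0,      # assumed latency target
    max_error_rate=0.01,   # assumed error budget
)
print(passed, report)
```

If `passed` is false, the rollout pauses at the current stage until the gap is closed. Cost and quality targets can be added to the report in the same way.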
