




NVIDIA LLM Full-Stack Solution: Usage and Optimization Best Practices

Agenda
- NVIDIA Full-Stack Solution for LLM
- Best Practices of NVIDIA Megatron-Core for LLM Training
- Best Practices of NVIDIA TensorRT-LLM for LLM Inference
- Best Practices of NVIDIA Triton Inference Server for LLM

NVIDIA Full-Stack Solution for LLM
- NVIDIA Megatron-Core (M-core) for LLM Training
- NVIDIA TensorRT-LLM for LLM Inference

Overview of NVIDIA's Large Language Model Offerings for Training
Solutions at each level of the stack:
- NeMo Framework: easy-to-use, out-of-the-box framework for large models
- Megatron-LM: a lightweight reference framework for using Megatron-Core
- Megatron-Core: library of GPU-optimized techniques for LLM training
- Transformer Engine: Hopper-accelerated Transformer models

Why We Need NVIDIA Megatron-Core?

NVIDIA TensorRT-LLM
- Builds on FasterTransformer to leverage its optimized kernels for performance
- Other components for the customization of LLM inference, such as CUTLASS

Key Features in NVIDIA TensorRT-LLM

What is NVIDIA Triton Inference Server?
Features of Triton Inference Server

Agenda: Best Practices of NVIDIA Megatron-Core for LLM Training

Best Practice for NVIDIA Megatron-Core
- Enable the distributed optimizer (--use-distributed-optimizer) to shard optimizer states across data-parallel ranks
- Enable Transformer Engine (--transformer-impl transformer_engine)
- Enable FlashAttention (--use-flash-attn)
- Enable communication overlapping
- Enable kernel fusions

[Architecture diagram: Megatron-LM provides the training loop and model definitions on top of Megatron-Core, which supplies the building blocks: embeddings, attention, normalization, MLP, transformer layer and transformer block, pipeline schedule and communication, distributed checkpointing, activation recompute, sequence parallelism, the distributed optimizer, and config/spec-based customization.]

Agenda: Best Practices of NVIDIA TensorRT-LLM for LLM Inference

How to Use NVIDIA TensorRT-LLM
- Designed to ease the effort of use

# Convert the Hugging Face llama-7b model to a TRT-LLM checkpoint,
# optionally with tensor and/or pipeline parallelism, e.g., tp=2
python examples/llama/convert_checkpoint.py \
    --model_dir llama-7b-hf \
    --dtype float16 \
    --tp_size 2 \
    --output_dir tllm_ckpt/llama-7b-fp16-tp2

# Quantize the Hugging Face llama-7b model and export it to a TRT-LLM
# checkpoint, optionally with tensor and/or pipeline parallelism, e.g., tp=2
python examples/quantization/quantize.py \
    --model_dir llama-7b-hf \
    --dtype float16 \
    --qformat fp8 \
    --tp_size 2 \
    --output_dir tllm_ckpt/llama-7b-fp8-tp2

# Build TRT-LLM engines from a TRT-LLM checkpoint,
# optionally enabling/disabling build options
trtllm-build --checkpoint_dir tllm_ckpt/llama-7b-fp8-tp2 \
    --gemm_plugin float16 \
    --output_dir tllm_engines/llama-7b-fp8-tp2 \
    --workers 2

# Run inference with the TRT-LLM engines
mpirun -n 2 --allow-run-as-root python examples/run.py \
    --engine_dir tllm_engines/llama-7b-fp8-tp2 \
    --tokenizer_dir llama-7b-hf \
    --max_output_len 30 \
    --input_text "Born in north-east France, Soyer trained as a"

# Example generated output
Output [Text 0 Beam 0]: "chef in Paris and London before moving to New York in 1850. He was the first chef to be hired by the newly"

TRT-LLM checkpoint format:
- One or more safetensors files storing the per-rank weights
- Each file saves a dict mapping tensor names to weights, for example:

{
    'transformer.vocab_embedding.weight': torch.Tensor(...),
    'transformer.layers.0.attention.qkv.weight': torch.Tensor(...),
    'transformer.layers.0.attention.dense.weight': torch.Tensor(...),
    'transformer.layers.0.mlp.fc.weight': torch.Tensor(...),
    'transformer.layers.0.mlp.proj.weight': torch.Tensor(...),
    'lm_head.weight': torch.Tensor(...),
}
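Such a checkpoint can be inspected directly with the safetensors library. Below is a minimal sketch (not from the slides) that lists the tensors in one rank's shard; the file name rank0.safetensors follows TRT-LLM's convention for the rank-0 shard, and the path assumes the tp=2 conversion example above.

# List the tensors stored in one rank of a TRT-LLM checkpoint.
# Requires the safetensors and torch packages.
from safetensors import safe_open

with safe_open("tllm_ckpt/llama-7b-fp16-tp2/rank0.safetensors",
               framework="pt", device="cpu") as f:
    for name in f.keys():
        t = f.get_tensor(name)
        print(f"{name}: shape={tuple(t.shape)}, dtype={t.dtype}")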
How to Use NVIDIA TensorRT-LLM: Build Options
- In-flight batching is enabled by default with trtllm-build; it requires the GPT attention plugin, the paged KV cache, and input-padding removal
- Custom AllReduce plugin: recommended for NVLink-based nodes
- Embedding parallelism and sharing: recommended to improve throughput and reduce memory usage

How to Use NVIDIA TensorRT-LLM: Runtime Options
- gpt_model_type: use inflight_fused_batching to increase throughput and reduce latency
- batch_scheduler_policy: start with guaranteed_no_evict, then switch to max_utilization for possibly higher throughput
- kv_cache_free_gpu_mem_fraction (default 0.9) is preferred over max_tokens_in_paged_kv_cache for its ease of use; both bound the GPU memory available to the paged KV cache
- enable_trt_overlap: start with false
These options are set in the tensorrt_llm model's config.pbtxt; a configuration sketch follows the build example below.

Performance Best Practices: Quantization
- Weight-only quantization: improves latency; the quantization scales can be obtained from external libraries
- Weight and activation quantization (e.g., FP8, as in the quantize.py example above)

Agenda: Best Practices of NVIDIA Triton Inference Server for LLM

How to Use NVIDIA Triton Inference Server
- Option 2: build via dockerfile; the dockerfile can be modified easily

# Update the submodules
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive

# Use the Dockerfile to build the backend in a container
# For x86_64
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
# For aarch64
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm --build-arg TORCH_INSTALL_TYPE="src_non_cxx11_abi" -f dockerfile/Dockerfile.trt_llm_backend .

# Prepare the TRT-LLM base image using the dockerfile from tensorrtllm_backend
cd tensorrtllm_backend

# Specify the build args for the dockerfile
BASE_IMAGE=nvcr.io/nvidia/tritonserver:24.01-py3-min
TRT_VERSION=9.2.0.5
TRT_URL_x86=/downloads/compute/machine-learning/tensorrt/9.2.0/tensorrt-${TRT_VERSION}.linux.x86_64-gnu.cuda-12.2.tar.gz
TRT_URL_ARM=/downloads/compute/machine-learning/tensorrt/9.2.0/tensorrt-${TRT_VERSION}.Ubuntu-22.04.aarch64-gnu.cuda-12.2.tar.gz

docker build -t trtllm_base \
    --build-arg BASE_IMAGE="${BASE_IMAGE}" \
    --build-arg TRT_VER="${TRT_VERSION}" \
    --build-arg RELEASE_URL_TRT_x86="${TRT_URL_x86}" \
    --build-arg RELEASE_URL_TRT_ARM="${TRT_URL_ARM}" \
    -f dockerfile/Dockerfile.triton.trt_llm_backend .

# Run the build script from the Triton server repo.
# The flags for some features or endpoints can be removed if not needed.
TRTLLM_BASE_IMAGE=trtllm_base
cd server
./build.py -v --no-container-interactive --enable-logging --enable-stats --enable-tracing \
    --enable-metrics --enable-gpu-metrics --enable-cpu-metrics \
    --filesystem=gcs --filesystem=s3 --filesystem=azure_storage \
    --endpoint=http --endpoint=grpc --endpoint=sagemaker --endpoint=vertex-ai \
    --backend=ensemble --enable-gpu --endpoint=http --endpoint=grpc \
    --image=base,${TRTLLM_BASE_IMAGE} \
    --backend=tensorrtllm:${TENSORRTLLM_BACKEND_REPO_TAG} \
    --backend=python:${PYTHON_BACKEND_REPO_TAG}

# Go to the tensorrt_llm/examples/llama directory
cd tensorrt_llm/examples/llama

# Convert the LLaMA model into TensorRT-LLM checkpoint format
python convert_checkpoint.py --model_dir /path/to/llama-7b-hf \
    --output_dir ./tllm_checkpoint_1gpu_fp16 \
    --dtype float16

# Build the LLaMA 7B model using a single GPU and FP16
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
    --output_dir ./llama_model/fp16/1-gpu \
    --gemm_plugin float16 \
    --context_fmha enable \
    --max_beam_width 1 \
    --max_batch_size 8 \
    --max_input_len ... \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --remove_input_padding enable
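A hedged sketch of setting the runtime options listed earlier in the tensorrt_llm model's config.pbtxt, using the fill_template.py helper that ships in the tensorrtllm_backend repo. The parameter names (gpt_model_type, gpt_model_path, batch_scheduler_policy, kv_cache_free_gpu_mem_fraction, enable_trt_overlap) follow the 23.10-era config template and may differ in newer releases; the engine path is a placeholder.

# Fill the runtime options into the tensorrt_llm model's config.pbtxt.
# Parameter names follow the 23.10-era tensorrtllm_backend template.
cd tensorrtllm_backend
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
gpt_model_type:inflight_fused_batching,\
gpt_model_path:/path/to/engines,\
batch_scheduler_policy:guaranteed_no_evict,\
kv_cache_free_gpu_mem_fraction:0.9,\
enable_trt_overlap:False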
How to Use NVIDIA Triton Inference Server: Model Repository
The inflight_batcher_llm model repository chains several models together:
- preprocessing: tokenization, converting prompts (string) to input_ids (list of ints)
- tensorrt_llm: runs the TRT-LLM engine for inference
- postprocessing: de-tokenization, converting output_ids (list of ints) back to outputs (string)
- ensemble: handles the connection of input and output tensors between the preprocessing, tensorrt_llm and postprocessing models, chaining them together and reducing the number of requests that must be sent to Triton; the backend also supports more features than are shown here

How to Use NVIDIA Triton Inference Server: Launch and Query the Server

# Enter the Triton NGC container
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 \
    --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend \
    nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 bash

# Launch the Triton server
cd /tensorrtllm_backend
# --world_size is the number of GPUs you want to use for serving
python3 scripts/launch_triton_server.py --world_size=4 \
    --model_repo=/tensorrtllm_backend/all_models/inflight_batcher_llm

# Expected output once the server is up:
+--------------+---------+--------+
| Model        | Version | Status |
+--------------+---------+--------+
| <model_name> | <v>     | READY  |
| ...          | ...     | ...    |
+--------------+---------+--------+
I0919 14:52:10.475738 293 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0919 14:52:10.475968 293 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0919 14:52:10.517138 293 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002

# Use inflight_batcher_llm_client.py to send a request
cd /tensorrtllm_backend
python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200 \
    --tokenizer-dir /path/to/llama/tokenizer \
    --text "Born in north-east France, Soyer trained as a"
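Besides the Python client, the server can also be queried over HTTP. A minimal sketch, assuming Triton's generate endpoint and the text_input/max_tokens/text_output tensor names used by the tensorrtllm_backend ensemble; adjust the prompt and token budget as needed.

# Query the ensemble model through Triton's HTTP generate endpoint.
import requests

resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json={
        "text_input": "Born in north-east France, Soyer trained as a",
        "max_tokens": 30,
        "bad_words": "",
        "stop_words": "",
    },
)
resp.raise_for_status()
print(resp.json()["text_output"])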