LLM
Model
LLM
| Affiliation | Model | Github | Size | Train | Infer | More |
|---|---|---|---|---|---|---|
| META | LLaMa | github | 7B/13B/33B/65B | | | hf |
| Databricks | Dolly | Dolly | 12B | v | | |
| LAION.ai | | | | | | |
| Stability.ai | | | | | | |
| Eleuther.AI | | | | | | |
| BigScience | BLOOM | | 176B | | | |
Variation
| Affiliation | Model | Size | Base |
|---|---|---|---|
| | Baize | 7B | LLaMA |
- LLaMA arXiv
- OPT https://github.com/facebookresearch/metaseq/tree/main/projects/OPT
- BLOOM arXiv
- BLOOM huggingface
Tech
Megatron-DeepSpeed
- https://github.com/NVIDIA/Megatron-LM
- https://github.com/microsoft/DeepSpeed
- https://github.com/microsoft/Megatron-DeepSpeed
- https://github.com/bigscience-workshop/Megatron-DeepSpeed
- HF LLaMA modeling_llama
- Stanford Alpaca stanford_alpaca
- Transformer Reinforcement Learning X trlx
- 3D parallelism
DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.
- ZeRO sharding
- pipeline parallelism
DeepSpeed supports (see the config sketch after this list):
- Optimizer state partitioning (ZeRO stage 1)
- Gradient partitioning (ZeRO stage 2)
- Parameter partitioning (ZeRO stage 3)
- Custom mixed precision training handling
- A range of fast CUDA-extension-based optimizers
- ZeRO-Offload to CPU and Disk/NVMe
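A minimal sketch of how these features are switched on through the DeepSpeed config dict passed to `deepspeed.initialize` (the model, batch sizes, and learning rate below are illustrative placeholders, not tuned values):

```python
import torch
import deepspeed

# Illustrative ZeRO config: stage 1 shards optimizer states, stage 2 adds
# gradients, stage 3 adds parameters; the offload_* entries move those
# shards to CPU (ZeRO-Offload). All numbers here are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
}

model = torch.nn.Linear(4096, 4096)  # stand-in for a real transformer

# Launch with the deepspeed launcher (e.g. `deepspeed train.py`) so the
# distributed environment exists before initialize() is called.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```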
Megatron-LM
Megatron-LM is a large, powerful transformer model framework developed by the Applied Deep Learning Research team at NVIDIA.
- Tensor Parallelism (see the toy column-parallel sketch below)
- main_grad (the fp32 gradient buffer Megatron attaches to each parameter for mixed-precision gradient accumulation)
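Megatron's tensor parallelism shards each weight matrix across GPUs: every rank holds a slice, computes a partial result, and a collective (all-gather / all-reduce) reassembles the activation. A toy single-process sketch of the column-parallel idea (the real implementation is Megatron's `ColumnParallelLinear`; the function and shapes below are made up for illustration):

```python
import torch
import torch.nn.functional as F

def column_parallel_linear(x, weight, bias, world_size):
    """Toy column parallelism on one device: split the weight along the
    output dimension into `world_size` shards, let each "rank" compute its
    slice of the output, then concatenate (in a real setup this is an
    all-gather across GPUs, or skipped if a row-parallel layer follows)."""
    weight_shards = weight.chunk(world_size, dim=0)
    bias_shards = bias.chunk(world_size, dim=0)
    partials = [F.linear(x, w, b) for w, b in zip(weight_shards, bias_shards)]
    return torch.cat(partials, dim=-1)

x = torch.randn(2, 16)        # (batch, in_features)
weight = torch.randn(32, 16)  # (out_features, in_features)
bias = torch.zeros(32)

out_tp = column_parallel_linear(x, weight, bias, world_size=4)
out_ref = F.linear(x, weight, bias)
assert torch.allclose(out_tp, out_ref, atol=1e-5)  # same result as the unsharded layer
```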
Workflow
- https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat
- https://github.com/EleutherAI/gpt-neox
- https://github.com/CarperAI/trlx/blob/main/examples/summarize_rlhf/reward_model/reward_model.py
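The trlx `reward_model.py` linked above trains a scalar reward head on human preference pairs; a rough, self-contained sketch of that idea (the tiny model and names below are illustrative stand-ins, not the repo's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Stand-in reward model: encoder + scalar value head."""
    def __init__(self, vocab_size=1000, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.value_head = nn.Linear(hidden, 1)  # one scalar reward per sequence

    def forward(self, input_ids):
        h, _ = self.encoder(self.embed(input_ids))
        return self.value_head(h[:, -1]).squeeze(-1)  # reward from final hidden state

def pairwise_reward_loss(model, chosen_ids, rejected_ids):
    """Ranking loss: the chosen response should score higher than the rejected one."""
    return -F.logsigmoid(model(chosen_ids) - model(rejected_ids)).mean()

model = TinyRewardModel()
chosen = torch.randint(0, 1000, (4, 12))    # tokenized preferred responses
rejected = torch.randint(0, 1000, (4, 12))  # tokenized dispreferred responses
pairwise_reward_loss(model, chosen, rejected).backward()
```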
Question
- How is the loss calculated in SFT? What is the difference from pretraining?
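One common answer (e.g. the Stanford Alpaca recipe): SFT uses the same next-token cross-entropy as pretraining, but typically only the response tokens contribute to the loss; the prompt/instruction tokens are masked by setting their labels to -100, whereas pretraining scores every token. A minimal sketch of that masking (assuming the usual PyTorch/Hugging Face -100 ignore-index convention):

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # ignored by F.cross_entropy; the usual HF/Alpaca convention

def sft_labels(input_ids, prompt_len):
    """Pretraining: labels == input_ids, every token is scored.
    SFT: copy input_ids but mask the prompt so only the response is scored."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = IGNORE_INDEX
    return labels

def next_token_loss(logits, labels):
    """Standard causal-LM shift: position t predicts token t+1."""
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )

# Toy batch: 2 sequences of 10 tokens, the first 4 tokens are the prompt.
vocab, seq_len, prompt_len = 100, 10, 4
input_ids = torch.randint(0, vocab, (2, seq_len))
logits = torch.randn(2, seq_len, vocab)  # stand-in for model(input_ids).logits

pretrain_loss = next_token_loss(logits, input_ids)                      # all tokens
sft_loss = next_token_loss(logits, sft_labels(input_ids, prompt_len))   # response only
```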