# Scaling distributed training with adaptive summation

Saeed Maleki et al., Microsoft Research

## Key point

AdaSum combines two gradients $g_1$ and $g_2$ into a single update direction with the adaptive-summation rule
$$
g = \left(1 - \frac{g_1 \cdot g_2}{2\,|g_1|^2}\right) g_1 + \left(1 - \frac{g_1 \cdot g_2}{2\,|g_2|^2}\right) g_2
$$
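To make the rule concrete, here is a minimal NumPy sketch of the two-gradient operator (the helper name `adasum_pair` is made up for this note and is not the paper's or Horovod's API). When the gradients are orthogonal the rule reduces to a plain sum; when they are parallel it reduces to their average, so a shared direction is never double-counted. In the paper this pairwise operator is applied recursively to combine gradients from more than two workers.

```python
import numpy as np

def adasum_pair(g1: np.ndarray, g2: np.ndarray) -> np.ndarray:
    """Combine two flattened gradients with the adaptive-summation rule.

    Orthogonal gradients: g1 . g2 = 0, so the result is the plain sum g1 + g2.
    Parallel gradients:   the result collapses to their average, avoiding
                          double-counting of the shared direction.
    """
    dot = np.dot(g1, g2)
    return (1.0 - dot / (2.0 * np.dot(g1, g1))) * g1 + \
           (1.0 - dot / (2.0 * np.dot(g2, g2))) * g2

# Toy check of the two extreme cases.
print(adasum_pair(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # orthogonal -> [1. 1.]
print(adasum_pair(np.array([2.0, 0.0]), np.array([4.0, 0.0])))  # parallel   -> [3. 0.]
```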
## Reference

- AdaSum with Horovod
- arXiv