Falcon

Architecture

  • Positional embeddings: rotary (Su et al., 2021);
  • Attention: multiquery (Shazeer et al., 2019) and FlashAttention (Dao et al., 2022);
  • Decoder-block: parallel attention/MLP with a single layer norm (a minimal sketch of how these pieces fit together follows below).
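The sketch below is an illustrative PyTorch rendering of these three choices, not Falcon's actual implementation: dimensions, the GELU MLP, and bias-free projections are assumptions chosen for brevity. It shows rotary embeddings applied to queries and keys, a single shared key/value head (multiquery), and the parallel attention/MLP block where one layer norm feeds both paths. `F.scaled_dot_product_attention` stands in for FlashAttention, since PyTorch dispatches it to fused FlashAttention kernels where available.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotary_embed(x, base=10000):
    # x: (batch, seq, heads, head_dim). Rotate channel pairs by
    # position-dependent angles (Su et al., 2021), "rotate-half" variant.
    b, t, h, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, device=x.device, dtype=torch.float32) / half)
    angles = torch.arange(t, device=x.device, dtype=torch.float32)[:, None] * freqs[None, :]
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class ParallelDecoderBlock(nn.Module):
    """Hypothetical Falcon-style block: one layer norm feeds both the
    multiquery attention and the MLP; their outputs are summed into the
    residual stream instead of being applied sequentially."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.ln = nn.LayerNorm(d_model)  # the single layer norm
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        # multiquery: one shared K/V head instead of one per query head
        self.kv_proj = nn.Linear(d_model, 2 * self.head_dim, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model, bias=False),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model, bias=False),
        )

    def forward(self, x):
        b, t, d = x.shape
        h = self.ln(x)  # shared normalized input to both paths
        q = self.q_proj(h).view(b, t, self.n_heads, self.head_dim)
        k, v = self.kv_proj(h).split(self.head_dim, dim=-1)  # (b, t, head_dim) each
        q = rotary_embed(q)
        k = rotary_embed(k.unsqueeze(2)).squeeze(2)
        # expand the single K/V head across all query heads
        k = k[:, :, None, :].expand(b, t, self.n_heads, self.head_dim)
        v = v[:, :, None, :].expand(b, t, self.n_heads, self.head_dim)
        # dispatches to fused/FlashAttention kernels where available
        attn = F.scaled_dot_product_attention(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
            is_causal=True,
        ).transpose(1, 2).reshape(b, t, d)
        # parallel residual: attention and MLP outputs are added
        return x + self.o_proj(attn) + self.mlp(h)

block = ParallelDecoderBlock()
y = block(torch.randn(2, 16, 512))  # (batch=2, seq=16, d_model=512)
```

Beyond the throughput win of sharing one K/V head, multiquery attention shrinks the inference-time K/V cache by a factor of `n_heads`, and the parallel formulation lets the attention and MLP matmuls overlap during training.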

Spec

  • AWS SageMaker, on 384 A100 40GB GPUs in P4d instances.
  • 40B parameters
  • 1,000B tokens of RefinedWeb
  • Falcon-40B-Instruct = Falcon-40B + 150M tokens (Baize + 5% RefinedWeb).
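A quick way to try the released checkpoints, using the published Hugging Face Hub repos `tiiuae/falcon-40b` and `tiiuae/falcon-40b-instruct`. This is a minimal sketch; the dtype, device mapping, and prompt are illustrative choices, and sharding a 40B-parameter model across your GPUs is left to `accelerate`.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b-instruct"  # or "tiiuae/falcon-40b" for the base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 40B params will not fit in fp32 on most hardware
    device_map="auto",           # shard across available GPUs via accelerate
    trust_remote_code=True,      # needed on older transformers; recent versions support Falcon natively
)

inputs = tokenizer("Write a haiku about falcons.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```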