@@ -38,6 +38,18 @@ cd torchscale
pip install -e .
```

For faster training, install [Flash Attention](https://github.com/Dao-AILab/flash-attention) for Turing, Ampere, Ada, or Hopper GPUs:
```
pip install flash-attn
```
or install [xFormers](https://github.com/facebookresearch/xformers) for Volta, Turing, Ampere, Ada, or Hopper GPUs:
```
# CUDA 11.8 version
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu118
# CUDA 12.1 version
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121
```

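To sanity-check whichever backend you installed, a quick option is to query it from the command line:
```
# Flash Attention: prints the installed version if the extension imports cleanly
python -c "import flash_attn; print(flash_attn.__version__)"
# xFormers: prints build info and which fused kernels are available
python -m xformers.info
```
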
## Getting Started

It takes only a few lines of code to create a model with the fundamental research features above enabled. Here is how to quickly obtain a BERT-like encoder:
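For reference, here is a minimal sketch of that pattern, assuming the `EncoderConfig` and `Encoder` classes in `torchscale.architecture.config` and `torchscale.architecture.encoder` (the same construction style the RetNet and LongNet examples below use):
```python
>>> from torchscale.architecture.config import EncoderConfig
>>> from torchscale.architecture.encoder import Encoder

# A BERT-like bidirectional encoder; vocab_size matches the other examples in this README
>>> config = EncoderConfig(vocab_size=64000)
>>> model = Encoder(config)

>>> print(model)
```
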
@@ -86,6 +98,21 @@ It takes only several lines of code to create a RetNet model:
>>> print(retnet)
```

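Only the tail of the RetNet example appears above. The full construction follows the same pattern; a minimal sketch, assuming `RetNetConfig` in `torchscale.architecture.config` and `RetNetDecoder` in `torchscale.architecture.retnet`:
```python
>>> from torchscale.architecture.config import RetNetConfig
>>> from torchscale.architecture.retnet import RetNetDecoder

# Retentive-network decoder with the same vocabulary size as the other examples
>>> config = RetNetConfig(vocab_size=64000)
>>> retnet = RetNetDecoder(config)

>>> print(retnet)
```
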
For LongNet models ([Flash Attention](https://github.com/Dao-AILab/flash-attention) is required):
```python
>>> import torch
>>> from torchscale.architecture.config import EncoderConfig, DecoderConfig
>>> from torchscale.model.longnet import LongNetEncoder, LongNetDecoder

# Create a LongNet encoder with the dilated pattern segment_length=[2048,4096] and dilated_ratio=[1,2]
>>> config = EncoderConfig(vocab_size=64000, segment_length='[2048,4096]', dilated_ratio='[1,2]', flash_attention=True)
>>> longnet = LongNetEncoder(config)

# Create a LongNet decoder with the same dilated pattern
>>> config = DecoderConfig(vocab_size=64000, segment_length='[2048,4096]', dilated_ratio='[1,2]', flash_attention=True)
>>> longnet = LongNetDecoder(config)
```

## Key Features

- [DeepNorm to improve the training stability of Post-LayerNorm Transformers](https://arxiv.org/abs/2203.00555)
@@ -231,6 +258,24 @@ If you find this repository useful, please consider citing our work:
}
```

```
@article{longnet,
  author  = {Jiayu Ding and Shuming Ma and Li Dong and Xingxing Zhang and Shaohan Huang and Wenhui Wang and Nanning Zheng and Furu Wei},
  title   = {{LongNet}: Scaling Transformers to 1,000,000,000 Tokens},
  journal = {ArXiv},
  volume  = {abs/2307.02486},
  year    = {2023}
}
```

```
@article{longvit,
  title   = {When an Image is Worth 1,024 x 1,024 Words: A Case Study in Computational Pathology},
  author  = {Wenhui Wang and Shuming Ma and Hanwen Xu and Naoto Usuyama and Jiayu Ding and Hoifung Poon and Furu Wei},
  journal = {ArXiv},
  volume  = {abs/2312.03558},
  year    = {2023}
}
```

## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a