LLM Roadmap
A high-level overview of the training process behind ChatGPT.
ChatGPT is trained in three steps:
1. Pre-Training:
• ChatGPT undergoes an initial phase called pre-training.
• During this phase, the base Large Language Model (LLM) behind ChatGPT, such as GPT-3, is trained on an extensive dataset sourced from the internet.
• The data is cleaned, preprocessed, and tokenized.
• Transformer architectures, the standard in modern natural language processing, are used during this phase.
• The primary objective is to teach the model to predict the next word in a given sequence of text.
• This phase equips the model with an understanding of language patterns, but it does not yet enable it to follow instructions or answer questions (a minimal sketch of the objective follows this list).
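To make the objective concrete, here is a minimal sketch of a single pre-training step, assuming a PyTorch-style setup. The names `model` (a causal language model returning next-token logits), `optimizer`, and `token_ids` (a batch of tokenized text from the cleaned corpus) are hypothetical placeholders, not references to any specific implementation.

```python
import torch
import torch.nn.functional as F

def pretrain_step(model, optimizer, token_ids):
    # Next-token prediction: the target at position t is the token at t+1.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)  # assumed shape: (batch, seq_len, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (batch*seq, vocab)
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Repeating this step over a vast corpus is what gives the model its grasp of language patterns; note that nothing in this loss says anything about following instructions.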
2. Supervised Fine-Tuning or Instruction Tuning:
• The next step is supervised fine-tuning, also called instruction tuning.
• During this stage, the model is given user messages as input and AI trainer responses as targets.
• The model learns to generate responses by minimizing the difference between its predictions and the provided responses.
• This phase marks the model's transition from merely modeling language patterns to understanding and responding to instructions (see the sketch after this list).
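A sketch of how instruction tuning can reuse the same loss while changing the data, again with hypothetical placeholders (`model`, `optimizer`, and tokenized `prompt_ids` / `response_ids`). One common choice, assumed here, is to mask out the prompt tokens so the loss is computed only on the trainer's response:

```python
import torch
import torch.nn.functional as F

def sft_step(model, optimizer, prompt_ids, response_ids):
    token_ids = torch.cat([prompt_ids, response_ids], dim=1)
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:].clone()
    # Mask prompt positions: the model is graded only on reproducing
    # the AI trainer's response, not on predicting the user's message.
    targets[:, : prompt_ids.size(1) - 1] = -100  # matches ignore_index below
    logits = model(inputs)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```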
3. Reinforcement Learning from Human Feedback (RLHF):
• Reinforcement Learning from Human Feedback (RLHF) is employed as a subsequent fine-tuning step.
• RLHF aims to align the model's behavior with human preferences, with a focus on being helpful, honest, and harmless (HHH).
• RLHF consists of two crucial sub-steps:
• Training a Reward Model Using Human Feedback: In this sub-step, multiple model outputs for the same prompt are generated and ranked by human labelers. A reward model is then trained on these rankings, learning human preferences for HHH content.
• Replacing Humans with the Reward Model for Large-Scale Training: Once the reward model is trained, it can stand in for human labelers, streamlining the feedback loop. Feedback from the reward model is used to further fine-tune the LLM at scale.
• RLHF plays a pivotal role in shaping the model's behavior and aligning it with human values, promoting useful, truthful, and safe responses (see the sketches after this list).
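For the first sub-step, a common formulation (used in InstructGPT-style setups) trains the reward model with a pairwise ranking loss: for each prompt, the labelers' preferred response should score higher than the rejected one. The sketch below assumes a hypothetical `reward_model` that maps a (prompt + response) token sequence to a scalar score:

```python
import torch
import torch.nn.functional as F

def reward_model_step(reward_model, optimizer, chosen_ids, rejected_ids):
    r_chosen = reward_model(chosen_ids)      # (batch,) scalar scores
    r_rejected = reward_model(rejected_ids)
    # Pairwise (Bradley-Terry) ranking loss: push the human-preferred
    # response's score above the rejected response's score.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```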
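For the second sub-step, production systems typically use PPO; the sketch below is a deliberately simplified REINFORCE-style update with a KL penalty, just to convey the shape of the loop. All names (`policy`, `reference`, `reward_model`, the `generate_with_logprobs` and `logprobs` helpers, and the 0.02 KL coefficient) are assumptions for illustration:

```python
import torch

def rlhf_step(policy, reference, reward_model, optimizer, prompt_ids):
    # Sample responses from the current policy, keeping their log-probs.
    response_ids, logprobs = policy.generate_with_logprobs(prompt_ids)
    with torch.no_grad():
        ref_logprobs = reference.logprobs(prompt_ids, response_ids)
        reward = reward_model(torch.cat([prompt_ids, response_ids], dim=1))
    # KL penalty keeps the policy close to the SFT model so text stays fluent.
    kl = (logprobs - ref_logprobs).sum(dim=1)
    advantage = reward - 0.02 * kl  # 0.02 is an assumed KL coefficient
    # REINFORCE-style objective: raise log-prob of high-advantage responses.
    loss = -(advantage.detach() * logprobs.sum(dim=1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```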

One Hundred Years of Solitude by Gabriel García Márquez
Our story begins with Colonel Aureliano Buendía reflecting on the early years of Macondo, a secluded village founded by his father, José Arcadio Buendía. Macondo is isolated from the outside world, only occasionally visited by gypsies bringing technological marvels that captivate its inhabitants. José Arcadio Buendía, fascinated by these innovations, immerses himself in scientific study with supplies from Melquíades, the gypsies' leader. He becomes increasingly solitary in his quest for knowledge. Meanwhile, his wife, Úrsula Iguarán, is more practical and is frustrated by her husband's obsession.