GONE WORDS: East Asian Languages Chapter | by Henry Heng LUO | Jun, 2024

This article analyzes long-token words in the GPT-4o model across ten languages and identifies the themes specific to each. It then examines the 100 longest token words per language to investigate the “gone words” phenomenon and the reasons these words go missing, especially in East Asian languages. The impact differs by language: Chinese is affected significantly more than English, Japanese, and Korean.
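
The selection step described above (picking the longest token words per language) can be sketched roughly as follows. This is a minimal illustration, not the author's actual procedure: the helper names `script_of` and `longest_tokens`, the toy vocabulary, and the Unicode-name heuristic for assigning a token to a language are all assumptions made here. In practice the vocabulary would come from the model's tokenizer rather than a hand-written list.

```python
import unicodedata


def script_of(token: str) -> str:
    """Assign a rough language label from Unicode character names.

    Heuristic and hypothetical: a Japanese word written only in kanji
    is labeled "chinese", and any Latin-script word is labeled "english".
    """
    names = [unicodedata.name(ch, "") for ch in token.strip()]
    if not names:
        return "other"
    if any(n.startswith(("HIRAGANA", "KATAKANA")) for n in names):
        return "japanese"
    if any(n.startswith("HANGUL") for n in names):
        return "korean"
    if all(n.startswith("CJK UNIFIED IDEOGRAPH") for n in names):
        return "chinese"
    if all(n.startswith("LATIN") for n in names):
        return "english"
    return "other"


def longest_tokens(vocab, top_n=100):
    """Group tokens by language label and keep the top_n longest in each group."""
    buckets = {}
    for tok in vocab:
        buckets.setdefault(script_of(tok), []).append(tok.strip())
    # Sort by descending character length, then alphabetically for stable output.
    return {
        lang: sorted(toks, key=lambda t: (-len(t), t))[:top_n]
        for lang, toks in buckets.items()
    }


# Toy vocabulary for illustration only; not taken from the real tokenizer.
toy_vocab = [" the", " international", "中国", "人工智能",
             "ありがとうございました", "감사합니다", "123"]
for lang, toks in longest_tokens(toy_vocab, top_n=1).items():
    print(lang, toks)
```

Length here is counted in characters. Counting in UTF-8 bytes instead would rank East Asian tokens differently, so the choice of measure is worth stating explicitly in any real analysis.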

The article also examines the reasons behind the “gone words” phenomenon: a lack of training corpora, contextual limitations, and infrequent word usage. It further explores why responses sometimes include foreign languages, attributing this to factors such as corpora containing advertisements from foreign countries and insufficient training.

The article also discusses special words that prompt the GPT-4o model to translate the input into specific foreign languages. It highlights the challenges facing multi-language large language models: data pollution, a lack of high-quality contextual corpora, multi-language misalignment, and limitations in semantic understanding. Overcoming these challenges is essential if multi-language models are to provide accurate and diverse responses.

