Multi-Language Training Systems

Explore top LinkedIn content from expert professionals.

Summary

Multi-language training systems are platforms or methods designed to teach, recognize, or generate content across multiple languages, making AI and technology accessible and useful for global users. These systems use advanced models and datasets to handle the complexities of different languages, scripts, and cultural contexts.

  • Broaden language coverage: Choose platforms or models that support a wide range of languages to reach more users and accommodate regional diversity.
  • Maintain intent accuracy: Focus on preserving meaning and context when training or prompting, rather than relying on direct translations, to ensure clarity and natural communication.
  • Test for reliability: Evaluate your system's performance across each language separately, as errors and challenges may differ by language or script.
Summarized by AI based on LinkedIn member posts
  • View profile for Tom Aarsen

    🤗 Sentence Transformers & NLTK maintainer, MLE @ Hugging Face

    20,369 followers

    ModernBERT goes MULTILINGUAL! One of the most requested models I've seen: The Johns Hopkins University's CLSP has trained state-of-the-art massively multilingual encoders using the ModernBERT architecture: mmBERT.

    Model details:
    - 2 model sizes: 42M non-embedding parameters (140M total) and 110M non-embedding parameters (307M total)
    - Uses the ModernBERT architecture, but with the Gemma2 multilingual tokenizer (so: flash attention, alternating global/local attention, unpadding/sequence packing, etc.)
    - Maximum sequence length of 8192 tokens, on the high end for encoders
    - Trained on 1833 languages using DCLM, FineWeb2, and many more sources
    - 3 training phases: 2.3T tokens of pretraining on 60 languages, 600B tokens of mid-training on 110 languages, and 100B tokens of decay training on all 1833 languages
    - Also uses model merging and clever transitions between the three training phases
    - Both models are MIT licensed, and the full datasets and intermediary checkpoints are also publicly released

    Evaluation details:
    - Very competitive with ModernBERT at equivalent sizes on English (GLUE, MTEB v2 English after finetuning)
    - Consistently outperforms equivalently sized models on all multilingual tasks (XTREME, classification, MTEB v2 Multilingual after finetuning)
    - In short: beats commonly used multilingual base models like mDistilBERT, XLM-R (multilingual RoBERTa), multilingual MiniLM, etc.
    - Additionally: the ModernBERT-based mmBERT is much faster than the alternatives due to its architectural benefits, easily up to 2x throughput in common scenarios

    Check out the full blogpost with more details. It's super dense & gets straight to the point: https://lnkd.in/ebqTK3JS

    Based on these results, mmBERT should be the new go-to multilingual encoder base model at 300M parameters and below. Do note that the mmBERT models are "base" models, i.e. they're currently only trained to perform mask filling. They'll need to be finetuned for downstream tasks like semantic search, classification, clustering, etc. I'm very much looking forward to seeing embedding models based on mmBERT!

    Great work by Marc Marone, Orion Weller, and the rest of the team at JHU!
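    Since the released checkpoints are mask-filling base models, the quickest way to try one is the standard fill-mask pipeline. A minimal sketch, assuming the checkpoint id jhu-clsp/mmBERT-base (verify the exact name on the Hugging Face Hub):

    ```python
    # Minimal mask-filling sketch for an mmBERT base checkpoint.
    # The model id "jhu-clsp/mmBERT-base" is an assumption; check the Hub.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="jhu-clsp/mmBERT-base")

    # Use the tokenizer's own mask token (mmBERT uses the Gemma2 tokenizer,
    # so the mask string may differ from BERT's "[MASK]").
    masked = f"Paris is the {fill_mask.tokenizer.mask_token} of France."
    for candidate in fill_mask(masked):
        print(f"{candidate['token_str']!r}: {candidate['score']:.3f}")
    ```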

  • View profile for Shayne Longpre

    PhD @ MIT, AI researcher, Data Provenance Initiative Lead

    4,826 followers

    🌍 I’m excited to share my largest research project to date: 𝗔𝗧𝗟𝗔𝗦 🗺️ - 𝗔𝗱𝗮𝗽𝘁𝗶𝘃𝗲 𝗧𝗿𝗮𝗻𝘀𝗳𝗲𝗿 𝗦𝗰𝗮𝗹𝗶𝗻𝗴 𝗟𝗮𝘄𝘀 𝗳𝗼𝗿 𝗠𝘂𝗹𝘁𝗶𝗹𝗶𝗻𝗴𝘂𝗮𝗹 𝗣𝗿𝗲𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴

    This is the largest public multilingual pretraining experiment to date, laying the scientific foundation for scaling beyond English. Across 774 experiments, with models up to 8B parameters and 400+ languages, we explicitly model the curse of 𝘮𝘶𝘭𝘵𝘪𝘭𝘪𝘯𝘨𝘶𝘢𝘭𝘪𝘵𝘺. Our key contributions:

    🔹 𝗔𝗧𝗟𝗔𝗦 𝗦𝗰𝗮𝗹𝗶𝗻𝗴 𝗟𝗮𝘄𝘀 – Modeling monolingual and multilingual mixtures, we outperform prior work by large margins (often 30%+ R²) through explicit modeling of cross-lingual transfer, data repetition, and multilingual capacity limits.
    🔹 𝟯𝟴×𝟯𝟴 𝗖𝗿𝗼𝘀𝘀-𝗟𝗶𝗻𝗴𝘂𝗮𝗹 𝗧𝗿𝗮𝗻𝘀𝗳𝗲𝗿 𝗠𝗮𝘁𝗿𝗶𝘅 – The most comprehensive empirical map of how language A helps or hinders language B during training (1,444 pairs). We find that shared scripts (e.g., Latin vs Cyrillic) often matter more than shared families (e.g., Indo-European vs Sino-Tibetan).
    🔹 𝗗𝗲𝗰𝗼𝗱𝗶𝗻𝗴 𝘁𝗵𝗲 𝗖𝘂𝗿𝘀𝗲 𝗼𝗳 𝗠𝘂𝗹𝘁𝗶𝗹𝗶𝗻𝗴𝘂𝗮𝗹𝗶𝘁𝘆 – We explicitly model how compute (C), model size (N), data (D), and language count (K) interact. Our fitted formulas tell practitioners how much to scale (N, D) when doubling the number of languages (K) to maintain the same performance.
    🔹 𝗣𝗿𝗲𝘁𝗿𝗮𝗶𝗻 𝘃𝘀 𝗙𝗶𝗻𝗲𝘁𝘂𝗻𝗲 𝗚𝘂𝗶𝗱𝗮𝗻𝗰𝗲 – For builders training language-specific models (e.g., Greek), we identify the crossover points where, given enough compute, training from scratch surpasses starting from a multilingual checkpoint.

    [See the paper link in comments!]

    Grateful to collaborate with an amazing team, without which this work wouldn’t have been possible: Sneha Kudugunta, Niklas Muennighoff, I-Hung Hsu, Isaac Caswell, Alex 'Sandy' Pentland, Sercan Arık, Chen-Yu Lee, and Sayna Ebrahimi across Google, MIT, UW, and Stanford.
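    The "how much to scale when K doubles" question can be made concrete with a schematic, Chinchilla-style loss that splits data across languages and credits back some cross-lingual transfer. This is an illustrative sketch only, not the fitted ATLAS formula; every constant and the form of the K term are assumptions:

    ```python
    # Schematic multilingual scaling law L(N, D, K). NOT the ATLAS fit;
    # the constants and functional form are illustrative assumptions.
    def loss(N: float, D: float, K: int,
             E: float = 1.7, A: float = 400.0, B: float = 1800.0,
             alpha: float = 0.34, beta: float = 0.28, tau: float = 0.3) -> float:
        # Each language sees roughly D/K tokens; the K**tau factor credits
        # back part of that loss via cross-lingual transfer.
        effective_D = (D / K) * K**tau
        return E + A / N**alpha + B / effective_D**beta

    # Holding N fixed, keeping effective_D constant when K doubles requires
    # D to grow by 2**(1 - tau): solve D' * (2K)**(tau-1) = D * K**(tau-1).
    base = loss(N=1e9, D=1e11, K=10)
    doubled = loss(N=1e9, D=1e11 * 2 ** (1 - 0.3), K=20)
    print(f"{base:.4f} vs {doubled:.4f}")  # the two losses match
    ```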

  • View profile for Armand Ruiz

    building AI systems @meta

    206,812 followers

    Disclosing the full list of datasets used to train IBM LLMs Granite 3.0. This is true transparency - no other LLM provider shares such detailed information about their training datasets.

    WEB DATA
    - FineWeb: More than 15T tokens of cleaned and deduplicated English data from CommonCrawl.
    - Webhose: Unstructured web content in English converted into machine-readable data.
    - DCLM-Baseline: A 4T token / 3B document pretraining dataset that achieves strong performance on language model benchmarks.

    CODE
    - Code Pile: Sourced from publicly available datasets like GitHub Code Clean and StarCoderdata.
    - FineWeb-Code: Contains programming/coding-related documents filtered from the FineWeb dataset using annotation.
    - CodeContests: Competitive programming dataset with problems, test cases, and human solutions in multiple languages.

    DOMAIN
    - USPTO: Collection of US patents granted from 1975 to 2023.
    - Free Law: Public-domain legal opinions from US federal and state courts.
    - PubMed Central: Biomedical and life sciences papers.
    - EDGAR Filings: Annual reports from US publicly traded companies over 25 years.

    MULTILINGUAL
    - Multilingual Wikipedia: Data from 11 languages to support multilingual capabilities.
    - Multilingual Webhose: Multilingual web content converted into machine-readable data feeds.
    - MADLAD-12: Document-level multilingual dataset covering 12 languages.

    INSTRUCTIONS
    - Code Instructions Alpaca: Instruction-response pairs about code generation problems.
    - Glaive Function Calling: Dataset focused on function calling in real scenarios.

    ACADEMIC
    - peS2o: A collection of 40M open-access academic papers for pre-training.
    - arXiv: Scientific paper pre-prints posted to arXiv. Full author acknowledgement can be found here.
    - IEEE: Technical content from IEEE acquired by IBM.

    TECHNICAL
    - Wikipedia: Technical articles sourced from Wikipedia.
    - Library of Congress Public Domain Books: More than 140,000 public domain English books.
    - Directory of Open Access Books: Publicly available technical books from the Directory of Open Access Books.
    - Cosmopedia: Synthetic textbooks, blog posts, stories, and WikiHow articles.

    MATH
    - OpenWebMath: Mathematical text from the internet, filtered from 200B HTML files.
    - Algebraic-Stack: Mathematical code dataset including numerical computing and formal mathematics.
    - Stack Exchange: User-contributed content from the Stack Exchange network.
    - MetaMathQA: Dataset of rewritten mathematical questions.
    - StackMathQA: A curated collection of 2 million mathematical questions from Stack Exchange.
    - MathInstruct: Focused on chain-of-thought (CoT) and program-of-thought (PoT) rationales for mathematical reasoning.
    - TemplateGSM: Collection of over 7 million grade-school math problems with code and natural language solutions.

    BOOM!

  • View profile for Sneha Vijaykumar

    Data Scientist @ Takeda | Ex-Shell | Gen AI | LLM | RAG | AI Agents | Azure | NLP | AWS

    25,181 followers

    You’re in an AI Engineer interview. The interviewer asks: how do you handle multi-language prompting effectively? Most people jump to translation APIs. A strong answer goes deeper:

    1. Detect language first. Never assume; identify the user’s language and script before prompting.
    2. Preserve intent, not just words. Literal translation often breaks tone, context, and business meaning.
    3. Prompt in the user’s language when possible. Models usually respond better when the instructions and output language align.
    4. Use English for complex reasoning, then localize the output. For harder logic tasks, reasoning in English plus a final response in the target language often works better.
    5. Handle mixed-language inputs. Real users switch languages mid-sentence; your system should too.
    6. Keep terminology consistent, especially for healthcare, finance, legal, and product names.
    7. Test by language, not globally. Kannada, Hindi, Tamil, Japanese, Arabic, and Spanish all fail differently.
    8. Build fallback layers. If confidence is low, ask clarifying questions instead of hallucinating. (A minimal sketch of steps 1 and 8 follows below.)

    What interviewers want to hear: you understand that multilingual AI is a product problem, not just a translation problem.

    #AI #GenerativeAI #PromptEngineering #LLM #AIEngineer #MachineLearning #NLP #AIEngineering Follow Sneha Vijaykumar for more... 😊
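    A minimal sketch of the detect-then-fallback flow (steps 1 and 8), using the langdetect package for identification; call_llm is a hypothetical stand-in for whatever chat-completion client you actually use:

    ```python
    # Detect the user's language, prompt in that language when detection is
    # confident, otherwise ask a clarifying question instead of guessing.
    # `call_llm` is a hypothetical stand-in for a real chat-completion client.
    from langdetect import detect_langs

    CONFIDENCE_THRESHOLD = 0.90

    def build_prompt(user_message: str) -> str:
        guesses = detect_langs(user_message)  # e.g. [hi:0.83, mr:0.17]
        best = guesses[0]
        if best.prob < CONFIDENCE_THRESHOLD:
            # Fallback layer: ask rather than hallucinate a language.
            return ("The user's language is ambiguous. Briefly and politely "
                    "ask which language they would like to continue in.")
        # Align the instruction language with the user's language (step 3).
        return (f"Respond in the user's language (ISO code: {best.lang}), "
                f"preserving intent and tone rather than translating "
                f"literally.\n\nUser: {user_message}")

    # print(call_llm(build_prompt("¿Cómo actualizo mi dirección de facturación?")))
    ```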

  • View profile for Allys Parsons

    Co-Founder at techire ai. ICASSP ‘26 Sponsor. Hiring in AI since ’19 ✌️ Speech AI, TTS, LLMs, Multimodal AI & more! Top 200 Women Leaders in Conversational AI ‘23 | No.1 Conversational AI Leader ‘21

    17,994 followers

    Latest research from KAIST and Imperial College London introduces Zero-AVSR, an innovative framework that enables audio-visual speech recognition across languages without requiring training data in the target languages. By learning language-agnostic speech representations through romanisation and leveraging LLMs, it can recognise speech even in languages never seen during training.

    What makes this approach interesting is the scale of language support. The team created MARC, a dataset spanning 2,916 hours of audio-visual speech across 82 languages, far beyond the 9 languages typical systems support. Their results show performance comparable to traditional multilingual systems while supporting this vastly larger language inventory.

    Zero-AVSR represents a significant advancement for speech tech in low-resource languages, potentially democratising access across thousands of languages without requiring extensive labelled datasets for each. The approach particularly excels when recognising languages from families similar to those in the training data, suggesting promising pathways for further expansion.

    Paper: https://lnkd.in/dnw_V7XK
    Authors: Jeong Hun Yeo, Minsu Kim, Chae Won Kim, Stavros Petridis, Yong Man Ro

    #SpeechRecognition #MultilingualAI #SpeechAI
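    The romanisation idea at the core of Zero-AVSR, mapping every script into a shared Roman inventory so the model learns language-agnostic units, is easy to illustrate. This sketch uses the unidecode package as a simple stand-in romaniser; it is an assumption, not the tool or pipeline used in the paper:

    ```python
    # Romanisation as a language-agnostic text representation.
    # `unidecode` is a stand-in romaniser, not the Zero-AVSR pipeline.
    from unidecode import unidecode

    samples = {
        "ko": "안녕하세요",      # Korean
        "ru": "Привет, мир",     # Russian
        "el": "Γειά σου κόσμε",  # Greek
    }

    # All scripts collapse into one Roman character inventory, so a single
    # model can predict romanised units even for languages unseen in training.
    for lang, text in samples.items():
        print(f"{lang}: {text} -> {unidecode(text)}")
    ```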

  • View profile for Kuldeep Singh Sidhu

    Senior Data Scientist @ Walmart | BITS Pilani

    16,026 followers

    Exciting breakthrough in multilingual embedding models! A team of researchers from HIT and Tongji University have developed KaLM-Embedding, setting a new standard for models under 1B parameters.

    What makes this model special? It leverages cleaner, more diverse training data and introduces three game-changing techniques:
    1. Persona-based synthetic data generation using Qwen2-72B-Instruct, creating 550k diverse examples across 6 task types
    2. Ranking consistency filtering to remove noise and improve data quality by ensuring positive examples rank within the top-k matches
    3. Semi-homogeneous task batching that balances negative-sample hardness against false-negative risk

    Under the hood, KaLM-Embedding uses Qwen2-0.5B as its foundation and implements Matryoshka Representation Learning for flexible embedding dimensions (896 down to 64). The model excels in Chinese and English while showing strong performance across other languages.

    The results? KaLM-Embedding achieves state-of-the-art performance on the MTEB benchmark, outperforming larger models with scores of 64.13 on Chinese and 64.94 on English tasks. This work demonstrates how thoughtful data curation and innovative training techniques can push the boundaries of what's possible with compact models. The team has open-sourced their work for the research community.
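    Matryoshka Representation Learning trains the embedding so that prefixes of the full vector remain useful on their own; at inference you simply truncate and re-normalise. A minimal sketch with dummy vectors (the 896/64 dimensions follow the post; the random vectors are stand-ins, not KaLM-Embedding outputs):

    ```python
    import numpy as np

    def truncate_embedding(v: np.ndarray, dim: int) -> np.ndarray:
        """Keep the first `dim` Matryoshka dimensions and re-normalise."""
        prefix = v[:dim]
        return prefix / np.linalg.norm(prefix)

    rng = np.random.default_rng(0)
    query, doc = rng.normal(size=(2, 896))  # stand-ins for model outputs

    # Cosine similarity at full and reduced dimensionality; with an
    # MRL-trained model the 64-d score tracks the 896-d score at a
    # fraction of the storage and compute cost.
    for dim in (896, 256, 64):
        q, d = truncate_embedding(query, dim), truncate_embedding(doc, dim)
        print(dim, float(q @ d))
    ```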

  • View profile for Rajan Agarwal

    Research Engineer

    5,818 followers

    Excited to share some research I've worked on under Cohere Labs! We present LLINK, a compute-efficient method to extend LLM multilinguality without changing the tokenizer or pre-training. In the same way LLaVA teaches models to understand images, we treat languages as modalities.

    We align a multilingual encoder with LLaMA-1B, significantly improving retrieval and conversational performance in low-resource languages. We further find that the improvements can be attributed to reduced tokenization inflation and stronger cross-lingual alignment.

    ArXiv Paper: https://lnkd.in/eY6ibngS
    Code: https://lnkd.in/eGbVhsUZ
    AlphaXiv: https://lnkd.in/e_YU8X2T

    Built with Aarush Gupta
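    The LLaVA-style alignment described here typically boils down to a small trainable projector that maps encoder outputs into the LLM's embedding space while both backbones stay frozen. A schematic sketch under those assumptions; the dimensions and two-layer MLP are illustrative, not the LLINK architecture spec:

    ```python
    import torch
    import torch.nn as nn

    class LanguageProjector(nn.Module):
        """Maps frozen multilingual-encoder outputs into the LLM's embedding
        space, LLaVA-style. Dimensions are illustrative assumptions."""
        def __init__(self, enc_dim: int = 768, llm_dim: int = 2048):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(enc_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, enc_states: torch.Tensor) -> torch.Tensor:
            # enc_states: (batch, seq, enc_dim) from the multilingual encoder.
            # The output plugs into the LLM as "soft tokens", the way image
            # patches do in LLaVA; only this projector is trained.
            return self.proj(enc_states)

    projector = LanguageProjector()
    fake_encoder_output = torch.randn(1, 16, 768)
    print(projector(fake_encoder_output).shape)  # torch.Size([1, 16, 2048])
    ```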

  • View profile for Kriti Aggarwal

    Research@HippocraticAI | Microsoft | Adobe | UCSD | DCE

    2,933 followers

    🌟 Excited to share our latest research on enhancing multilingual capabilities in large language models! 🌟

    Introducing SPHINX, a novel multilingual synthetic instruction tuning dataset created to address the performance gap in non-English languages. By translating instruction-response pairs from English into 50 languages, we achieved impressive results: in our study, fine-tuning PHI-3-SMALL and MISTRAL-7B on SPHINX led to significant performance improvements, surpassing other multilingual datasets on benchmarks. Incorporating N-shot examples further boosted performance, showcasing the effectiveness and efficiency of SPHINX.

    This advancement marks a significant step forward in making large language models more inclusive and effective across diverse languages. Our research highlights the importance of sample efficiency and diversity while minimizing dataset creation costs. Excited for further discussions and collaborations in the realm of NLP, Multilingual AI, Machine Learning, and Artificial Intelligence! 🚀

    Link to the paper: https://lnkd.in/g5CP9EZc

    Sanchit Ahuja Kumar Tanmay Hardik Chauhan Barun Patra Vishrav Chaudhary Monojit Choudhury Arindam Mitra Luciano Del Corro Tejas Indulal Dhamecha Ahmed Awadallah Sunayana Sitaram

    #NLP #MultilingualAI #MachineLearning #ArtificialIntelligence #Research #Innovation
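    The construction recipe, translating English instruction-response pairs into many target languages, is straightforward to sketch. The translate function below is a hypothetical stub for whatever MT system you use; this is not the SPHINX pipeline itself:

    ```python
    # Sketch of building a SPHINX-style multilingual IFT set by translating
    # English pairs into target languages. `translate` is a hypothetical
    # stub; swap in a real MT model or API.
    from typing import Callable

    def translate(text: str, target_lang: str) -> str:  # hypothetical stub
        raise NotImplementedError("plug in your MT system here")

    def expand_to_languages(pairs: list[dict], languages: list[str],
                            mt: Callable[[str, str], str] = translate) -> list[dict]:
        dataset = []
        for pair in pairs:
            for lang in languages:
                dataset.append({
                    "lang": lang,
                    "instruction": mt(pair["instruction"], lang),
                    "response": mt(pair["response"], lang),
                })
        return dataset

    english_pairs = [{"instruction": "Summarize this paragraph.",
                      "response": "..."}]
    # expand_to_languages(english_pairs, ["hi", "sw", "ar"])  # 50 langs in SPHINX
    ```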

  • View profile for Asif Razzaq

    Founder @ Marktechpost (AI Dev News Platform) | 1 Million+ Monthly Readers

    35,060 followers

    CMU Researchers Release Pangea-7B: A Fully Open Multimodal Large Language Model (MLLM) for 39 Languages

    A team of researchers from Carnegie Mellon University introduced PANGEA, a multilingual multimodal LLM designed to bridge linguistic and cultural gaps in visual understanding tasks. PANGEA is trained on a newly curated dataset, PANGEAINS, which contains 6 million instruction samples across 39 languages. The dataset is specifically crafted to improve cross-cultural coverage by combining high-quality English instructions, machine-translated instructions, and culturally relevant multimodal tasks. In addition, to evaluate PANGEA's capabilities, the researchers introduced PANGEABENCH, an evaluation suite spanning 14 datasets covering 47 languages. This comprehensive evaluation provides insight into the model's performance on both multimodal and multilingual tasks, showing that PANGEA outperforms many existing models in multilingual scenarios.

    PANGEA was developed using PANGEAINS, a rich and diverse dataset that includes instructions for general visual understanding, document and chart question answering, image captioning, and more. The dataset was designed to address the major challenges of multilingual multimodal learning: data scarcity, cultural nuances, catastrophic forgetting, and evaluation complexity. To build PANGEAINS, the researchers employed several strategies: translating high-quality English instructions, generating culturally aware tasks, and incorporating existing open-source multimodal datasets. The researchers also developed a sophisticated pipeline to filter culturally diverse images and generate detailed multilingual and cross-cultural captions, ensuring that the model understands and responds appropriately in different linguistic and cultural contexts.

    Read the full article here: https://lnkd.in/gV5Bnac5
    Paper: https://lnkd.in/gcxNAQVy
    Model on Hugging Face: https://lnkd.in/gMKtJ83z
    Project Page: https://lnkd.in/gP7vWBkb

    Listen to the podcast on Pangea-7B, created with the help of NotebookLM and, of course, our team, who generated the prompts and entered the right information: https://lnkd.in/gTBvt9NG

    Machine Learning Department at CMU Xiang Yue Yueqi Song Seungone Kim Jean de Dieu Nyandwi Simran Khanuja Lintang Sutawika Sathyanarayanan Ramamoorthy Graham Neubig

  • View profile for Aishwarya Naresh Reganti

    Founder & CEO @ LevelUp Labs | Ex-AWS | Consulting, Training & Investing in AI

    123,794 followers

    Always excited about research in the multilingual space that can help transfer LLMs' amazing capabilities to other languages! Here's a new IFT dataset that supports 70 languages and is fully synthetic 🙀 😎

    M2Lingual is a fully synthetic, multilingual, multi-turn Instruction Fine-Tuning (IFT) dataset comprising 182K evenly distributed instruction-response pairs across 70 languages, resulting in competitive performance across multilingual evaluation benchmarks and MT-Bench.

    ⛳ IFT plays a super important role in ensuring these LLMs can effectively follow instructions across various tasks. However, the authors note that existing datasets have the following issues:
    👉 Limited multilingual support: current IFT datasets are predominantly focused on English
    👉 Single-turn conversations: many existing datasets are not multi-turn
    👉 Task diversity: there's a shortage of datasets covering a wide range of NLP tasks in multilingual settings

    ⛳ Here's how the dataset addresses these issues:
    👉 M2Lingual spans 70 languages and includes 182K instruction-response pairs across 17 diverse NLP tasks, ensuring LLMs are trained on a broad spectrum of language understanding and generation challenges.
    👉 Guided by the popular Evol taxonomy, M2Lingual leverages seed samples from human-generated datasets to create machine-generated instructions. This approach ensures a balanced representation across languages and tasks. (A schematic sketch of this kind of instruction evolution follows below.)
    👉 Unlike many existing datasets, M2Lingual incorporates a multi-turn component in its instruction-response pairs, enabling LLMs to handle complex, extended conversations across different languages effectively.

    The authors note that LLMs fine-tuned with M2Lingual consistently outperform those fine-tuned on existing multilingual datasets in various evaluation benchmarks, demonstrating superior performance in multilingual task scenarios!

    Link to the paper: https://lnkd.in/euvMPedT
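    Evol-style pipelines grow a dataset by repeatedly rewriting seed instructions into harder or more varied ones with an LLM. A schematic sketch of that loop, assuming a hypothetical llm completion function; it mirrors the general Evol-Instruct recipe rather than M2Lingual's exact taxonomy and prompts:

    ```python
    # Schematic Evol-style instruction evolution. `llm` is a hypothetical
    # completion function; the evolution prompts follow the general
    # Evol-Instruct recipe, not M2Lingual's exact taxonomy.
    import random

    EVOLUTIONS = [
        "Rewrite the instruction to require one extra reasoning step.",
        "Add a concrete constraint (format, length, or audience).",
        "Rewrite the instruction in {lang}, preserving the task intent.",
    ]

    def llm(prompt: str) -> str:  # hypothetical stub for any chat LLM
        raise NotImplementedError("plug in a real model here")

    def evolve(seed_instruction: str, lang: str, rounds: int = 2) -> str:
        instruction = seed_instruction
        for _ in range(rounds):
            op = random.choice(EVOLUTIONS).format(lang=lang)
            instruction = llm(f"{op}\n\nInstruction: {instruction}")
        return instruction

    # evolve("Summarize the article in two sentences.", lang="Swahili")
    ```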
