Navigating the open-source AI era: a new paradigm in technology

Research & Development

Unless you’ve been living under a rock, you know that AI is quickly changing the world as we know it. We have previously covered the power of NLP techniques and the promise of LLMs. In this instalment of our AI series, we explore the transformative impact of open-source architecture for LLM applications.


A shifting technology landscape

2023 marked a significant year in the evolution of Large Language Models (LLMs). Interest in LLMs surged among both tech and non-tech companies. Initially, big names like OpenAI and Google dominated. However, open-source models are gaining ground. In July 2023, Meta released Llama2, one of the most advanced open-source models so far, able to rival the advanced GPT models from OpenAI. Mistral AI released their first open-source model, which outperforms Llama2 despite being a smaller, more efficient model.

The strategic importance of open-source AI

At BASE life science, we believe that open-source AI will play a major role in the development and advancement of these applications. And we are not alone. Mark Zuckerberg’s talk at Llama2’s release highlighted the value of community-driven development in enhancing software quality and problem-solving. He claimed:

“When software is open, more people can scrutinize it to identify and fix potential issues.”

Using open-source tools like the Haystack framework and models from Hugging Face, we gain access to AI advancements while bolstering data privacy and security. The latter is particularly important in the life sciences industry, where keeping control of proprietary data is paramount.

Additions to your tech stack: Haystack and Hugging Face

We have integrated the Haystack framework with Hugging Face models to perform deep semantic search over extensive datasets. Hugging Face has revolutionized the NLP landscape with their Transformers library, featuring models like BERT, GPT-2, Llama2, and Mistral.

At the pace at which LLMs are advancing, we needed a reliable and secure place from which to source our foundation models. Hugging Face provides an easy-to-navigate hub where individual users, big tech companies, and research institutions publish their LLMs as open-source models.

You can find multiple models that allow for a diverse set of tasks. At BASE life science, we have invested in creating applications for analysing and extracting information, in addition to generating content. We focus on two main LLM applications:

Information retrieval

Leveraging vector search and retrieval augmentation, we can answer natural-language queries directly from texts. This approach is vital for scanning unstructured documents and extracting the data needed for migrations.
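To make the vector-search idea concrete, here is a minimal, self-contained sketch in plain Python: documents and the query are turned into toy bag-of-words vectors and ranked by cosine similarity. The document texts are invented for illustration, and a production system would of course use dense embeddings from a Hugging Face model rather than word counts.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; real systems use dense model embeddings."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query: str, docs: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by similarity to the query and return the best matches."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

docs = [
    "The batch record lists the manufacturing date and lot number.",
    "Our cafeteria menu changes every Friday.",
]
print(search("Which document mentions the lot number?", docs))
```

The same retrieve-by-similarity pattern is what a vector database performs at scale over millions of embedded passages.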

Context generation and summarisation

This more complex application generates new content or summarises existing documents while adhering to regulated-document guidelines. We fine-tune models for specific tasks, which is crucial in life sciences, where content is highly industry-specific.
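Fine-tuned generative summarisation cannot fit in a short snippet, but the core idea of condensing a text can be sketched with a naive extractive baseline: score each sentence by the frequency of its words and keep the top-scoring ones. This is purely illustrative (the example text is invented) and is not the LLM-based approach described above, which generates new phrasing rather than selecting sentences.

```python
import re
from collections import Counter

def summarise(text: str, n_sentences: int = 1) -> str:
    """Naive extractive summary: keep the highest-scoring sentences, in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    scored = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in re.findall(r"\w+", sentences[i].lower())),
        reverse=True,
    )
    keep = sorted(scored[:n_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)

text = (
    "The clinical trial protocol defines the trial endpoints. "
    "The trial endpoints are reviewed by the trial sponsor. "
    "Lunch was served at noon."
)
print(summarise(text, n_sentences=1))
```

A generative model replaces the sentence-selection step with decoding new text, but the input/output shape of the task is the same: long document in, short summary out.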

Haystack: simplifying LLM implementation

Scikit-learn is a well-established Python library for machine learning. Within scikit-learn you find the concept of pipelines: well-defined railways that take you from the data source to a trained model in a few lines of code.

Functioning like Scikit-learn for LLMs, Haystack offers a user-friendly pipeline that integrates well with Hugging Face. This facilitates efficient content processing and accurate query answering.

These are the main components we use when building an LLM application with Haystack:

Retrievers and Retrieval Augmented Generation (RAG)

A combination of retrieval-based models and seq2seq models that use semantic and vector search to scan content and find the answer to a query. The goal of this step is to narrow down the context and so reduce the risk of hallucination.

Readers

This step takes the retrieved content and “learns” from it. The task of the model here is to provide the answer based on the original question and any additional context provided. These models are generally complex LLMs, such as those derived from BERT or GPT, fine-tuned for question-answering tasks.
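The retriever and reader steps above can be sketched as a toy two-stage pipeline in plain Python. Everything here is illustrative: the documents are invented, the retriever ranks by simple keyword overlap instead of dense vectors, and the reader picks the best-matching sentence where a real Haystack pipeline would run a fine-tuned LLM over the retrieved context.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word set, used for crude overlap scoring."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, docs: list[str], top_k: int = 1) -> list[str]:
    """Retriever stage: rank documents by keyword overlap with the query."""
    q = tokens(query)
    return sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)[:top_k]

def read(query: str, context: str) -> str:
    """Reader stage: pick the context sentence that best matches the query.
    A real reader is an LLM fine-tuned for question answering."""
    q = tokens(query)
    sentences = re.split(r"(?<=[.!?])\s+", context)
    return max(sentences, key=lambda s: len(q & tokens(s)))

def answer(query: str, docs: list[str]) -> str:
    """Full pipeline: narrow the context with the retriever, then answer with the reader."""
    context = " ".join(retrieve(query, docs, top_k=1))
    return read(query, context)

docs = [
    "The study started in March 2021. Enrollment closed in June.",
    "Invoices are due within 30 days.",
]
print(answer("When did the study start?", docs))
```

Narrowing the context first is what keeps the reader grounded: it only ever answers from the retrieved passages, which is the hallucination-reduction property described above.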

Haystack’s compatibility with platforms like Elasticsearch, AWS, and Google Cloud ensures our AI solutions are scalable and adaptable, which is crucial for deployment across diverse systems.


Forward-looking AI approaches

Embracing open-source AI is about pioneering new solutions in AI, especially in areas where data security and precision are vital. It enables the exploration of innovative paths in data analysis and application development.

Ultimately, the combination of Haystack and Hugging Face serves as a robust foundation for businesses aiming to make sense of their unstructured data. With constant advancements in the realm of NLP and search, the capabilities of these frameworks will only increase. It’s an exciting time for developers, data scientists, and businesses to leverage these tools and build applications that were once thought impossible.

Want to know more?

If you’re interested in how we can deploy a similar solution tailored to your needs, reach out to Manfredi Miraula. We are happy to guide you on your data journey.