
Navigating the open-source AI era: a new paradigm in technology

 

In July 2023, Meta released Llama 2, one of the most advanced open-source models to date, able to rival the advanced GPT models from OpenAI. Shortly afterwards, Mistral AI released their first open-source model, which outperforms Llama 2 despite being a significantly smaller model.


The strategic importance of open-source AI

At BASE life science, we believe that open-source AI will play a major role in the development and advancement of AI applications. And we are not alone.

Mark Zuckerberg’s talk at Llama 2’s release highlights the value of community-driven development in enhancing software quality and problem-solving. He claims:

“When software is open, more people can scrutinize it to identify and fix potential issues.”

Using open-source tools like the Haystack framework and models from Hugging Face, we gain access to AI advancements while bolstering data privacy and security. The latter is particularly important in the life sciences industry, where keeping control of proprietary data is paramount.

Additions to your tech stack: Haystack and Hugging Face

We have integrated the Haystack framework with Hugging Face models to perform deep semantic search over extensive datasets. Hugging Face has revolutionised the NLP landscape with their Transformers library, featuring models like BERT, GPT-2, Llama 2, and Mistral.

At the pace at which LLMs are advancing, we needed a reliable and secure place from which to source our foundation models. Hugging Face provides an easy-to-navigate hub where individual users as well as big tech companies and research institutions publish their LLMs as open-source models.

You can find models suited to a diverse set of tasks. At BASE life science, we have invested in creating applications for analysing and extracting information, in addition to generating content. We focus on two main LLM applications:

Information retrieval:

Leveraging vector search and retrieval augmentation, we can answer natural-language queries directly from source texts. This approach is vital for scanning unstructured documents and extracting the data needed for migrations.
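The core mechanic can be sketched in a few lines of plain Python. This is an invented miniature example, not our production code: the "embedding" here is a simple bag-of-words count vector, whereas a real system would use a dense embedding model from Hugging Face. The idea is the same: rank documents by the similarity of their vectors to the query vector.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector. A real system
    # would use a dense embedding model from Hugging Face instead.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

documents = [
    "The batch record lists the expiry date of the compound.",
    "Our office in Copenhagen hosts the annual meeting.",
    "Storage conditions are described in the stability report.",
]

def search(query: str, docs: list[str], top_k: int = 1) -> list[str]:
    # Rank documents by cosine similarity to the query vector.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:top_k]

print(search("When does the compound expire?", documents))
```

Even this crude vector comparison surfaces the batch-record document first; dense embeddings add the semantic matching (for example, linking "expire" to "expiry date") that makes the approach work at scale.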

Context generation and summarisation:

This more complex application generates new content or summarises existing content, adhering to regulated-document guidelines. We fine-tune models for specific tasks, which is crucial in life sciences, where content is highly industry-specific.
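The generative side is hard to show without a large model, but the extractive variant of summarisation can be sketched briefly: score each sentence by how frequent its words are across the whole document and keep the highest-scoring ones. This toy heuristic (with an invented example document) is a stand-in for what a fine-tuned LLM does far more fluently.

```python
import re
from collections import Counter

def summarise(text: str, n_sentences: int = 1) -> str:
    # Split into sentences and count word frequencies across the document.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    # Score each sentence by the total document frequency of its words.
    def score(sentence: str) -> int:
        return sum(freq[w] for w in re.findall(r"[a-z]+", sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Preserve the original order of the selected sentences.
    return " ".join(s for s in sentences if s in top)

doc = (
    "The stability study covers three batches. "
    "Each batch was stored at 25 degrees. "
    "Results show the batches remain stable for 24 months."
)
print(summarise(doc))
```

A fine-tuned LLM replaces this frequency heuristic with learned judgement about which statements matter in a regulated document, and can rewrite rather than merely extract.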

Haystack: simplifying LLM implementation

Scikit-learn is a reliable Python library widely used in machine learning. Within scikit-learn you find the concept of pipelines: well-defined railways that take you from the data source to a trained model in a few lines of code.

Functioning like Scikit-learn for LLMs, Haystack offers a user-friendly pipeline that integrates well with Hugging Face. This facilitates efficient content processing and accurate query answering.
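To make the analogy concrete, here is what such a railway looks like in scikit-learn: a toy text classifier assembled from two steps. The documents and labels are invented for illustration; the point is that one object carries the data from raw text to a prediction.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny invented training set: label documents as "regulatory" or "marketing".
texts = [
    "submission dossier for the health authority",
    "compliance report for the regulatory agency",
    "new campaign slogan for the product launch",
    "social media banner for the spring promotion",
]
labels = ["regulatory", "regulatory", "marketing", "marketing"]

# The pipeline chains vectorisation and classification into one object:
# fit() trains both steps, predict() runs raw text through both.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
pipe.fit(texts, labels)

print(pipe.predict(["annual report for the agency"]))
```

Haystack brings this same "few lines from source to answer" ergonomics to LLM applications, with document stores, retrievers, and readers as the pipeline steps.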

These are the main components we use when building an LLM application with Haystack:

Retrievers and Retrieval Augmented Generation (RAG):

A combination of retrieval-based models and seq2seq models that use semantic and vector search to scan content and find the answer to a query. The goal of this step is to narrow down the context and reduce the risk of hallucination.

Readers:

This step takes the retrieved content and “learns” from it. The model’s task here is to provide an answer based on the original question and the additional context, when provided. These models are generally complex LLMs, such as those derived from BERT or GPT, fine-tuned for question-answering tasks.
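Conceptually, the two components compose like this. The sketch below is a deliberately simplified pure-Python illustration with an invented corpus, not the actual Haystack API: the retriever narrows the corpus to a few candidate documents, and the reader extracts the most relevant sentence from them (where a real reader is a fine-tuned LLM returning an exact answer span).

```python
import re

def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    # Retriever: keep the documents sharing the most terms with the
    # query (a crude stand-in for BM25 or vector search).
    q_terms = set(re.findall(r"[a-z]+", query.lower()))

    def overlap(doc: str) -> int:
        return len(q_terms & set(re.findall(r"[a-z]+", doc.lower())))

    return sorted(docs, key=overlap, reverse=True)[:top_k]

def read(query: str, contexts: list[str]) -> str:
    # Reader: extract the single sentence most relevant to the query.
    # A real reader is a fine-tuned LLM answering from this context.
    q_terms = set(re.findall(r"[a-z]+", query.lower()))
    sentences = [s for c in contexts for s in re.split(r"(?<=[.!?])\s+", c)]
    return max(
        sentences,
        key=lambda s: len(q_terms & set(re.findall(r"[a-z]+", s.lower()))),
    )

corpus = [
    "The product is stored at 5 degrees. Shelf life is 36 months.",
    "The annual meeting takes place in Copenhagen.",
    "Adverse events must be reported within 15 days.",
]

query = "How long is the shelf life?"
contexts = retrieve(query, corpus)   # narrow the corpus first...
print(read(query, contexts))         # ...then answer from the remainder
```

The two-stage shape is the important part: by answering only from the retrieved context, the reader has far less room to hallucinate than a model prompted against nothing.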

Haystack’s compatibility with platforms like Elasticsearch, AWS, and Google Cloud ensures our AI solutions are scalable and adaptable, which is crucial for deployment across diverse systems.

Forward-looking AI approaches

Embracing open-source AI is about pioneering new solutions in AI, especially in areas where data security and precision are vital. It enables the exploration of innovative paths in data analysis and application development.

Ultimately, the combination of Haystack and Hugging Face serves as a robust foundation for businesses aiming to make sense of their unstructured data. 

With constant advancements in the realm of NLP and search, the capabilities of these frameworks will only increase. It’s an exciting time for developers, data scientists, and businesses to leverage these tools and build applications that were once thought impossible.

Want to know more?

If you’re interested in how we can deploy a similar solution tailored to your needs, reach out to Manfredi Miraula. We are happy to guide you on your data journey.

Author

Manfredi Miraula

Senior Data Engineer
mmir@baselifescience.com
