Artificial intelligence (AI) is used across the financial sector, but how it is used, and the strategies that data scientists and computational linguists in financial institutions employ, are starting to change. Over the last several years, significant effort has gone into developing large-scale models (complex architectures with billions of parameters), but these advances can be misleading, and their effects on financial processes may be negative.
Why? AI systems generally comprise two main components: data and code. Over the last decade, AI work has mainly focused on enhancing algorithms. However, Andrew Ng,1 one of the most influential AI pioneers, now advocates and leads a movement in the AI community toward data-centric AI.2 Ng also forecasts that the biggest shift in AI will probably be moving the focus from the code (models and parameters) to improving the data used in models.
AI models are only as good as the data you feed them. In machine learning (ML), models depend heavily on the data on which they are trained. The maturity and complexity of deep learning models and algorithms allow machines to take on more challenging tasks, such as language translation, speech-to-text and object detection, that power chat bots, robotic process automation (RPA), and other advanced automations. These capabilities leverage supervised learning,3 a process in which the quality of the labelled data used is critical.4 Moreover, AI and ML systems by nature learn patterns from their training data. For example, an NLP model that was trained to analyze medical documents will generally not work well in other domains, such as finance.
Because of this, AI and ML systems can easily break down or deliver sub-optimal results when they encounter any of these common scenarios: the training data has a knowledge-domain gap relative to the problem to be solved; the data is biased or under-represents certain groups; or the data is inadequate or of poor quality, is permuted inappropriately, relies on poor proxies, or even contains errors in the ground truth.5
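As an illustration of how some of these scenarios might be detected early, the following is a minimal sketch of a data check, not a definitive implementation. The function name `diagnose`, the field names (`segment`, `income`, `default`), the sample records and the 20 percent representation threshold are all assumptions made for this example.

```python
from collections import Counter


def diagnose(records, label_key, group_key, min_group_share=0.2):
    """Return simple data-quality indicators for a list of dict records.

    Hypothetical helper: the threshold and the indicators chosen here are
    illustrative assumptions, not an established standard.
    """
    n = len(records)
    # Share of records with at least one missing (None) field.
    missing = sum(1 for r in records if any(v is None for v in r.values()))
    # Relative frequency of each group, used to spot under-represented groups.
    group_counts = Counter(r[group_key] for r in records)
    group_share = {g: c / n for g, c in group_counts.items()}
    # Label distribution, used to spot class imbalance.
    label_counts = Counter(r[label_key] for r in records)
    return {
        "missing_rate": missing / n,
        "group_share": group_share,
        "underrepresented": sorted(
            g for g, s in group_share.items() if s < min_group_share
        ),
        "label_counts": dict(label_counts),
    }


# Made-up sample data: 6 retail, 3 SME and 1 corporate customer.
records = (
    [{"segment": "retail", "income": 50000, "default": 0} for _ in range(5)]
    + [{"segment": "retail", "income": None, "default": 1}]
    + [{"segment": "sme", "income": 120000, "default": 0} for _ in range(3)]
    + [{"segment": "corporate", "income": 900000, "default": 0}]
)

report = diagnose(records, label_key="default", group_key="segment")
print(report["underrepresented"])  # groups below the assumed 20% threshold
print(report["missing_rate"], report["label_counts"])
```

A check like this only surfaces representation gaps, missing values and label imbalance. Domain gaps and ground-truth errors usually need review by subject-matter experts.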
These scenarios can play out in a financial institution’s AI in the following ways:
Let’s look at a typical lifecycle of developing and deploying an AI/ML product. Most AI processes can be broken down into five general continuous steps: data collection and labelling, experimentation and development, testing, deployment, and monitoring feedback.
In general, around 70 to 80 percent of development effort relates to the model’s underlying data,9 from data collection, labelling, and data preparation/augmentation through to monitoring data drift and gathering feedback from the production environment. Because of the nature of AI and ML systems, models are often trained on a limited set of data during initial training. AI teams will probably need to repeat these steps over several iterations before the model can be deployed into production, and must then continue to monitor it, gather feedback, and fulfill governance requirements after deployment, a process called continuous training (CT).10
Considering the time spent on data-related tasks in the AI product development lifecycle, ensuring data quality is key to controlling technical debt, which in turn drives the success of an AI/ML platform. Data quality is often neglected because the focus is mostly on the code and algorithm. It is challenging to fully define the “correct” behavior for an ML system upfront11 until it is tested with end users and data in production. There are also many unknowns even after an AI model is deployed in production, which can be addressed with an agile development process.12
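Monitoring data drift can be made concrete with a simple statistic. The sketch below uses the population stability index (PSI), one common choice for comparing a feature's distribution at training time with its distribution in production. The article does not prescribe this method, so treat it as an assumption. The function name `psi`, the equal-width binning and the sample values are also illustrative.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population stability index between two numeric samples.

    Bin edges are equal-width over the range of the reference sample;
    values outside that range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Made-up feature values: training-time sample vs. a shifted production sample.
training = list(range(100))
production = [v + 30 for v in training]

print(psi(training, training))    # identical samples give 0.0
print(psi(training, production))  # the shifted sample gives a clearly positive value
```

A commonly cited rule of thumb treats a PSI above roughly 0.25 as a significant shift worth investigating, but the right threshold depends on the feature and the use case.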
Building data-centric AI capability will be key for companies in the financial sector to thrive in the next wave of the AI revolution. There are two main aspects that AI and data science teams must focus on: Data Diagnosis and Data Labelling.
Data Diagnosis – the capability to explore, understand and validate data – tactics are:
Data Labelling – the capability to accelerate and systematically obtain quality labelled data – tactics are:
In conclusion, although the race to build large-scale models isn’t slowing down, data-centric AI is a rising trend that is gaining wider recognition in the AI community. While powerful models are exciting, without enough of the right data to run them, the advancements are stunted. Furthermore, data-centric AI supports compliance with the regulatory requirements that govern AI applications in finance.
Data science teams at financial institutions must allocate resources to adapt to data-centric approaches, focusing on solid data diagnosis and labelling techniques, so that the power of AI models is realized.