existence: the internet. Bayesian inference engines are great at giving a statistically significant response to a given set of inputs. Practicality or accuracy? Caveat emptor.

This is not a new concern. For several years at least, the academic community has been warning of various types of attacks, including those using adversarial examples. What is the security concern here? Imagine the following scenario:

· Users start asking the latest cool chatbot to write code for building a new web service.
· The chatbot is trained on data from a popular forum where developers like to share code snippets.
· The code snippets contain exploitable security weaknesses in, say, authentication code (a sketch of such a snippet appears below).
· The snippets are upvoted because they work and are fast, precisely because they skip proper security checks.
· Users of the forum are effectively unauthenticated; anyone can sign up with nothing more than a unique e-mail address.
· The forum operators don't do much to prevent upvoting services from influencing the recommendation algorithm.

Any problems in the training material, including vulnerabilities, propagate through to the output. Effectively, an unreliable source, perhaps even an attacker, has influenced the configuration of the application, and a savvy attacker can watch for the signs of that common vulnerability being stamped into the AI's output. It's almost as if this were the design and threat model: malicious or untrustworthy users configure the system. A code-writing AI trained on the above will happily crank out new copies of the vulnerable code whenever asked. After all, it's the most popular instance of that code, so statistically speaking it should be the best answer.

This doesn't even need to be a deliberately malicious scenario; coincidence will do just fine to start. The data's inherent bias toward insecurity goes unnoticed, and that bias propagates into whichever system uses the generated output.

What if it were intentional? How could an attacker exert influence? There is no guarantee the upvotes come from real people. The forums aren't well authenticated, so creating a large set of fake accounts is trivial, and services exist solely to generate upvotes that push recommendation algorithms toward a target product or answer. Short of careful curation of the data used to train the model, little stands in the way. When the service is free or low cost, you get what you pay for.

Here are some things the technology industry can do about it. There needs to be honesty about the accuracy and biases of the data sources used to train AI models. Expert systems should use authenticated, quality data sources reviewed by knowledgeable staff. There is a lot of data on the internet, and only some of it is accurate; for models to be trained for the best accuracy, that data must be carefully curated. Even popular answers are not always correct.

AI should also be used to check its own output. The source code produced in the example above should be sent through other tools that check the quality of the first AI's work, since AI has already been shown to happily generate fake or inaccurate data. AI-generated code that gets checked in should still go through normal evaluations such as static application security testing (a sketch of such a gate also appears below). Skipping that step would just inject further bias toward blindly trusting the machine's output.
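To make the forum scenario concrete, here is a minimal, hypothetical sketch of the kind of "fast" login check that gets upvoted because it works. The names, the storage scheme, and the table layout are invented for illustration; nothing here is drawn from any real forum or any real model's output.

```python
import sqlite3

# Hypothetical "works and is fast" login check of the kind a forum might
# upvote and a code-writing model might then learn as the best answer.
def check_login(db: sqlite3.Connection, username: str, password: str) -> bool:
    # Weakness 1: the query is built by string interpolation, so it is
    # injectable. A username of "admin' --" skips the password check.
    query = (
        "SELECT 1 FROM users "
        f"WHERE username = '{username}' AND password = '{password}'"
    )
    # Weakness 2: passwords are stored and compared in plaintext rather
    # than as salted hashes (bcrypt, scrypt, argon2).
    return db.execute(query).fetchone() is not None


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (username TEXT, password TEXT)")
    db.execute("INSERT INTO users VALUES ('admin', 'hunter2')")
    # The attacker never needs the real password.
    print(check_login(db, "admin' --", "anything"))  # prints True
```

A model trained on thousands of snippets like this has no notion that the injection or the plaintext comparison is a defect; statistically, they are simply part of the most popular answer.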
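And here is a minimal sketch of the gate the last recommendation implies: run generated code through a static analysis pass before it gets anywhere near a check-in. It assumes the generated code is Python and that the open-source Bandit scanner is installed; the function and file names are made up for the example, and any SAST tool the team already trusts would serve the same purpose.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def generated_code_is_clean(code: str) -> bool:
    """Run a SAST pass (Bandit) over AI-generated Python before check-in.

    Returns True only if the scan reports no findings. Assumes the
    `bandit` CLI is available (pip install bandit).
    """
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "generated.py"
        target.write_text(code)
        # Bandit exits with a non-zero status when it reports findings,
        # such as string-built SQL queries or hard-coded credentials.
        result = subprocess.run(
            ["bandit", str(target)],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            print(result.stdout, file=sys.stderr)
            return False
        return True


if __name__ == "__main__":
    # Usage: python gate.py <file-containing-generated-code>
    snippet = Path(sys.argv[1]).read_text()
    sys.exit(0 if generated_code_is_clean(snippet) else 1)
```

The point is the gate, not the tool; the same role could be filled by whichever scanners the team already runs on human-written code.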
Using one tool to assess the output of another also helps correct for the intentionally malicious scenario in which an attacker has deliberately fed bad data into the training set. These concepts apply whether the model is a widely available commercial service or a private, in-house model used to generate any kind of output. If we want to rely on systems such as AI, we must make sure they are trustworthy. The security and bias of AI model training data are no exception. AI application security weaknesses need to be watched for and corrected just like the output of the humans the machines are lauded to replace.