In the Amazon store, we strive to deliver the product recommendations most relevant to customers’ queries. Often, that requires commonsense reasoning. For example, if a customer submits a query for “shoes for pregnant women”, the recommendation engine should be able to infer that pregnant women may want slip-resistant shoes.
To help Amazon’s recommendation engine draw these kinds of commonsense conclusions, we build a knowledge graph that captures the relationships between products in the Amazon store and the human contexts in which they play a role: their functions, their audiences, the events at which they’re used, and so on. For example, the knowledge graph might use the relation used_for_audience to link slip-resistant shoes and pregnant women.
In a paper we presented at the Association for Computing Machinery’s annual conference on the management of data (SIGMOD) in June 2024, we describe COSMO, a framework that uses large language models (LLMs) to distill the commonsense relationships implicit in customer interaction data from the Amazon store.
COSMO involves a recursive procedure in which an LLM generates hypotheses about the commonsense implications of query-purchase and co-purchase data; a combination of human annotation and machine learning models filters out low-quality hypotheses; human reviewers extract guiding principles from the high-quality hypotheses; and instructions based on those principles are used to prompt the LLM anew.
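That loop can be summarized in a few lines of Python. This is a minimal sketch of the procedure as described above; the function bodies are hypothetical stand-ins, not the actual COSMO implementation:

```python
def llm_explain(pair, instructions):
    # Hypothetical stand-in for an LLM call that explains why a
    # query-purchase or co-purchase pair co-occurs.
    query, product = pair
    return {"pair": pair, "relation": "usedFor",
            "rationale": f"{product} serves the need behind '{query}'"}

def passes_filters(hypothesis):
    # Stand-in for the heuristic and ML filtering described later.
    return len(hypothesis["rationale"]) > 0

def extract_principles(hypotheses):
    # Stand-in for human reviewers distilling prompt instructions
    # from the high-quality hypotheses.
    return sorted({f"prefer {h['relation']} explanations" for h in hypotheses})

# Two rounds: generate hypotheses, filter, distill instructions, re-prompt.
pairs = [("shoes for pregnant women", "slip-resistant shoes")]
instructions = []
for _ in range(2):
    hypotheses = [llm_explain(p, instructions) for p in pairs]
    kept = [h for h in hypotheses if passes_filters(h)]
    instructions = extract_principles(kept)
```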
To evaluate COSMO, we used the Shopping Queries Data Set we created for the KDD Cup 2022, a competition held at the 2022 Conference on Knowledge Discovery and Data Mining (KDD). The dataset consists of queries and product listings, with the products rated according to their relevance to each query.
In our experiments, three models were tasked with finding the products most relevant to each query: a bi-encoder, or two-tower, model; a cross-encoder, or unified, model; and a cross-encoder enhanced with relationship information from the COSMO knowledge graph. We measured performance using two different F1 scores: macro F1 is an average of F1 scores across product categories, and micro F1 is the overall F1 score irrespective of category.
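As a quick illustration of the two metrics, here is how macro and micro F1 diverge on a toy multi-category example, computed with scikit-learn (the labels are invented for illustration):

```python
from sklearn.metrics import f1_score

y_true = ["shoes", "shoes", "camera", "camera", "books"]
y_pred = ["shoes", "camera", "camera", "camera", "shoes"]

# Macro F1: compute F1 per category, then average the per-category
# scores, weighting rare and common categories equally.
macro = f1_score(y_true, y_pred, average="macro")

# Micro F1: pool all decisions across categories into one overall score.
micro = f1_score(y_true, y_pred, average="micro")
print(f"macro={macro:.3f}, micro={micro:.3f}")
```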
When the models’ encoders were frozen, so that the only difference between the cross-encoders was that one took COSMO relationships as additional input and the other didn’t, the COSMO-based model dramatically outperformed the best-performing baseline, achieving a 60% improvement in macro F1 score. When the encoders were fine-tuned on a subset of the test dataset, the performance of all three models improved markedly, but the COSMO-based model still had a 28% edge in macro F1 and a 22% edge in micro F1 over the best-performing baseline.
COSMO
The COSMO knowledge graph construction procedure begins with two types of data: query-purchase pairs, which combine queries with purchases made within a fixed time span or a fixed number of clicks, and co-purchase pairs, which combine purchases made during the same shopping session. We do an initial pruning of the dataset to mitigate noise, for example, removing co-purchase pairs in which the product categories of the purchased products are too far apart in the Amazon product graph.
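As an illustration of the pruning step, the sketch below drops co-purchase pairs whose categories are far apart in a toy taxonomy; the real Amazon product graph and the distance threshold here are assumptions:

```python
import networkx as nx

# A toy product-category taxonomy standing in for the Amazon product graph.
taxonomy = nx.Graph()
taxonomy.add_edges_from([
    ("root", "electronics"), ("electronics", "cameras"),
    ("electronics", "screen protectors"),
    ("root", "grocery"), ("grocery", "snacks"),
])

def keep_pair(cat_a, cat_b, max_hops=3):
    # Drop co-purchase pairs whose categories are too far apart in the
    # taxonomy to plausibly share a commonsense rationale.
    return nx.shortest_path_length(taxonomy, cat_a, cat_b) <= max_hops

print(keep_pair("cameras", "screen protectors"))  # True: 2 hops apart
print(keep_pair("cameras", "snacks"))             # False: 4 hops apart
```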
Then we feed the data pairs to an LLM and ask it to describe the relationship between the paired items using one of four relations: usedFor, capableOf, isA, and cause. From the results we extract a finer-grained set of frequently recurring relations, which we codify using canonical formulations such as used_for_function, used_for_event, and used_for_audience. Then we repeat the process, asking the LLM to formulate its descriptions using our new, larger set of relations.
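A hypothetical prompt template for this relation-labeling step might look like the following; the actual prompts and LLM interface used in COSMO are not shown in this sketch:

```python
SEED_RELATIONS = ["usedFor", "capableOf", "isA", "cause"]

def relation_prompt(query, product, relations=SEED_RELATIONS):
    # Ask the LLM to pick exactly one relation from a fixed vocabulary.
    return (
        f"A customer searched for '{query}' and bought '{product}'. "
        f"Describe the relationship between them using exactly one of "
        f"these relations: {', '.join(relations)}."
    )

# In the second pass, the vocabulary is widened with the finer-grained
# canonical relations mined from the LLM's own outputs.
REFINED_RELATIONS = SEED_RELATIONS + [
    "used_for_function", "used_for_event", "used_for_audience",
]
print(relation_prompt("shoes for pregnant women", "slip-resistant shoes"))
```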
LLMs, when given this kind of task, tend to generate empty rationales, such as “customers bought these items together because they like them”. So after the LLM has generated a set of candidate relationships, we use various heuristics to winnow them down. For example, if the LLM’s answer to one of our queries is semantically too similar to the query itself, we filter the answer out on the assumption that the LLM is simply paraphrasing the query.
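One way to implement the paraphrase filter is with sentence embeddings; this sketch uses the sentence-transformers library, with an illustrative model choice and similarity threshold:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def is_paraphrase(query, rationale, threshold=0.9):
    # Flag rationales that are nearly identical in meaning to the query.
    q_emb, r_emb = model.encode([query, rationale])
    return util.cos_sim(q_emb, r_emb).item() > threshold

# An empty rationale like this adds no commonsense content and is filtered.
print(is_paraphrase("shoes for pregnant women",
                    "pregnant women searched for shoes"))
```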
From the candidates that survive the filtering process, we select a representative subset that we send to human annotators for assessment according to two criteria: plausibility, or whether the posited inferential relationship is reasonable, and typicality, or whether the target product is one that is commonly associated with either the query or the source product.
Using the annotated data, we train a machine-learning-based classifier that assigns plausibility and typicality scores to the remaining candidates, and we keep only those that exceed a certain threshold. From these candidates, we extract syntactic and semantic patterns that can be codified as instructions to an LLM, such as “generate explanations for search-purchase behavior in the domain 𝑑 using the capableOf relation”. Then we run through all our data pairs again, prompting the LLM with the relevant instructions.
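A minimal sketch of that filtering classifier, with randomly generated stand-in embeddings and labels in place of the real annotated data (the features, model choice, and threshold are all assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_annotated = rng.normal(size=(200, 32))    # stand-in rationale embeddings
y_annotated = rng.integers(0, 2, size=200)  # 1 = plausible and typical

# Train on the human-annotated subset.
clf = LogisticRegression().fit(X_annotated, y_annotated)

# Score the remaining candidates and keep only the high scorers.
X_candidates = rng.normal(size=(1000, 32))
scores = clf.predict_proba(X_candidates)[:, 1]
kept = X_candidates[scores > 0.8]
```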
The result is a set of high-quality relational triples, such as <camera strap and screen protector, capableOf, camera protection>, from which we assemble the knowledge graph.
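Assembling triples into a graph can be sketched with networkx, using the triple above plus the slip-resistant-shoes example from the introduction (the graph library and schema are illustrative choices, not the paper’s):

```python
import networkx as nx

triples = [
    ("camera strap and screen protector", "capableOf", "camera protection"),
    ("slip-resistant shoes", "used_for_audience", "pregnant women"),
]

# Store each triple as a directed edge with the relation as an attribute.
kg = nx.MultiDiGraph()
for head, relation, tail in triples:
    kg.add_edge(head, tail, relation=relation)

# Neighbors of a product node, with the commonsense relation on each edge:
for _, tail, data in kg.out_edges("slip-resistant shoes", data=True):
    print(tail, data["relation"])
```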
Evaluation and application
The bi-encoder model we used in our experiments had two separate encoders, one for the customer query and one for a product. The outputs of the two encoders were concatenated and fed to a neural network that produced a relevance score.
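Here is a minimal PyTorch sketch of that two-tower design, with toy embedding-bag encoders standing in for the real text encoders:

```python
import torch
import torch.nn as nn

class BiEncoder(nn.Module):
    def __init__(self, vocab_size=10000, dim=128):
        super().__init__()
        # Stand-ins for the real text encoders (e.g., transformers).
        self.query_encoder = nn.EmbeddingBag(vocab_size, dim)
        self.product_encoder = nn.EmbeddingBag(vocab_size, dim)
        self.scorer = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, query_ids, product_ids):
        q = self.query_encoder(query_ids)
        p = self.product_encoder(product_ids)
        # Concatenate the two towers' outputs and score relevance.
        return self.scorer(torch.cat([q, p], dim=-1))

model = BiEncoder()
score = model(torch.tensor([[1, 2, 3]]), torch.tensor([[4, 5]]))
```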
In the cross-encoder, all the relevant features of both the query and the product description pass through the same encoder. In general, cross-encoders work better than bi-encoders, so this is the architecture we used to test the added power of COSMO data.
In the first phase of the experiment, with frozen encoders, the baseline models received the query-product pairs; the second cross-encoder received the query-product pairs along with related triples from the COSMO knowledge graph, such as <camera strap and screen protector, capableOf, camera protection>. In this case, the COSMO-seeded model dramatically outperformed the cross-encoder baseline, which in turn outperformed the bi-encoder baseline on both F1 measures.
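One simple way the COSMO signal can reach the second cross-encoder is by appending a retrieved triple to the encoder’s text input; the separator convention below is an assumption for illustration:

```python
def cross_encoder_input(query, product, triple=None):
    # Baseline input is the query-product pair; the COSMO variant also
    # appends a knowledge-graph triple as a third text segment.
    parts = [query, product]
    if triple is not None:
        head, relation, tail = triple
        parts.append(f"{head} {relation} {tail}")
    return " [SEP] ".join(parts)

baseline_input = cross_encoder_input(
    "camera accessories", "screen protector")
cosmo_input = cross_encoder_input(
    "camera accessories", "screen protector",
    ("camera strap and screen protector", "capableOf", "camera protection"))
print(cosmo_input)
```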
In the second phase of the experiment, we fine-tuned the baseline models on a subset of the Shopping Queries Data Set and fine-tuned the second cross-encoder on the same subset together with the COSMO data. The performance of all three models jumped dramatically, but the COSMO model retained an edge of more than 20% on both F1 metrics.