Duplicate values

Problem

Your documents may have duplicate values and you want to find out which values have been duplicated.

Solution

Create an index where the values definition includes the field that may contain duplicate values. In this case, we want to find out where the elements field in a "spells" document has been duplicated:

try
{
    Value result = await client.Query(
        CreateIndex(
            Obj(
                "name", "spell_elements",
                "source", Collection("spells"),
                "values", Arr(
                    Obj("field", Arr("data", "element")),
                    Obj("field", "ref")
                )
            )
        )
    );
    Console.WriteLine(result);
}
catch (Exception e)
{
    Console.WriteLine($"ERROR: {e.Message}");
}
res, err := client.Query(
	f.CreateIndex(
		f.Obj{
			"name": "spell_elements",
			"source": f.Collection("spells"),
			"values": f.Arr{
				f.Obj{"field": f.Arr{"data", "element"}},
				f.Obj{"field": "ref" },
			},
		},
	))

if err != nil {
	fmt.Fprintln(os.Stderr, err)
} else {
	fmt.Println(res)
}
System.out.println(
    client.query(
        CreateIndex(
            Obj(
                "name", Value("spell_elements"),
                "source", Collection("spells"),
                "values", Arr(
                    Obj("field", Arr(Value("data"), Value("element"))),
                    Obj("field", Value("ref"))
                )
            )
        )
    ).get());
client.query(
  q.CreateIndex({
    name: 'spell_elements',
    source: q.Collection('spells'),
    values: [
      { field: ['data', 'element'] },
      { field: 'ref' },
    ],
  })
)
.then((ret) => console.log(ret))
.catch((err) => console.error(
  'Error: [%s] %s: %s',
  err.name,
  err.message,
  err.errors()[0].description,
))
result = client.query(
  q.create_index({
    "name": "spell_elements",
    "source": q.collection("spells"),
    "values": [
      { "field": ["data", "element"] },
      { "field": "ref" },
    ],
  })
)
print(result)
CreateIndex({
  name: "spell_elements",
  source: Collection("spells"),
  values: [
    { field: ["data", "element"] },
    { field: "ref" },
  ],
})
Query metrics:
  •    bytesIn:   164

  •   bytesOut:   318

  • computeOps:     1

  •    readOps:     0

  •   writeOps:     5

  •  readBytes: 1,835

  • writeBytes: 1,504

  •  queryTime:  54ms

  •    retries:     0

Then we can run a query that produces an object keyed on the element names whose values are Arrays of the associated document References:

try
{
    Value result = await client.Query(
        Reduce(
            Lambda(
                Arr("acc", "val"),
                Let(
                    "element", Select(0, Var("val")),
                    "ref", Select(1, Var("val")),
                    "ref_list", If(
                        ContainsField(Var("element"), Var("acc")),
                        Append(
                            Var("ref"),
                            Select(Var("element"), Var("acc"))
                        ),
                        Arr(Var("ref"))
                    ),
                    "mergeit", Arr(
                        Arr(Var("element"), Var("ref_list"))
                    )
                ).In(
                    Merge(
                        Var("acc"),
                        ToObject(Var("mergeit"))
                    )
                )
            ),
            Obj(),
            Match(Index("spell_elements"))
        )
    );
    Console.WriteLine(result);
}
catch (Exception e)
{
    Console.WriteLine($"ERROR: {e.Message}");
}
ObjectV(air: Arr(RefV(id = "181388642046968320", collection = RefV(id = "spells", collection = RefV(id = "collections"))), RefV(id = "181388642581742080", collection = RefV(id = "spells", collection = RefV(id = "collections")))),earth: Arr(RefV(id = "181388642088911360", collection = RefV(id = "spells", collection = RefV(id = "collections")))),fire: Arr(RefV(id = "181388642046968320", collection = RefV(id = "spells", collection = RefV(id = "collections"))), RefV(id = "181388642071085568", collection = RefV(id = "spells", collection = RefV(id = "collections")))),water: Arr(RefV(id = "181388642071085568", collection = RefV(id = "spells", collection = RefV(id = "collections"))), RefV(id = "181388642088911360", collection = RefV(id = "spells", collection = RefV(id = "collections")))))
res, err := client.Query(
	f.Reduce(
		f.Lambda(
			f.Arr{"acc", "val"},
			f.Let().Bind(
				"element", f.Select(0, f.Var("val")),
			).Bind(
				"ref", f.Select(1, f.Var("val")),
			).Bind(
				"ref_list", f.If(
					f.ContainsField(f.Var("element"), f.Var("acc")),
					f.Append(
						f.Var("ref"),
						f.Select(f.Var("element"), f.Var("acc")),
					),
					f.Arr{f.Var("ref")},
				),
			).Bind(
				"mergeit", f.Arr{
					f.Arr{f.Var("element"), f.Var("ref_list")},
				},
			).In(
				f.Merge(
					f.Var("acc"),
					f.ToObject(f.Var("mergeit")),
				),
			),
		),
		f.Obj{},
		f.Match(f.Index("spell_elements")),
	))

	if err != nil {
	fmt.Fprintln(os.Stderr, err)
} else {
	fmt.Println(res)
}
map[air:[{181388642046968320 0xc000180570 0xc000180570 <nil>} {181388642581742080 0xc000180720 0xc000180720 <nil>}] earth:[{181388642088911360 0xc0001808d0 0xc0001808d0 <nil>}] fire:[{181388642046968320 0xc000180a80 0xc000180a80 <nil>} {181388642071085568 0xc000180c30 0xc000180c30 <nil>}] water:[{181388642071085568 0xc000180de0 0xc000180de0 <nil>} {181388642088911360 0xc000180f90 0xc000180f90 <nil>}]]
System.out.println(
    client.query(
        Reduce(
            Lambda(
                Arr(Value("acc"), Value("val")),
                Let(
                    "element", Select(Value(0), Var("val")),
                    "ref", Select(Value(1), Var("val")),
                    "ref_list", If(
                        ContainsField(Var("element"), Var("acc")),
                        Append(
                            Var("ref"),
                            Select(Var("element"), Var("acc"))
                        ),
                        Arr(Var("ref"))
                    ),
                    "mergeit", Arr(
                        Arr(Var("element"), Var("ref_list"))
                    )
                ).in(
                    Merge(
                        Var("acc"),
                        ToObject(Var("mergeit"))
                    )
                )
            ),
            Obj(),
            Match(Index("spell_elements"))
        )
    ).get());
{air: [ref(id = "181388642046968320", collection = ref(id = "spells", collection = ref(id = "collections"))), ref(id = "181388642581742080", collection = ref(id = "spells", collection = ref(id = "collections")))], earth: [ref(id = "181388642088911360", collection = ref(id = "spells", collection = ref(id = "collections")))], fire: [ref(id = "181388642046968320", collection = ref(id = "spells", collection = ref(id = "collections"))), ref(id = "181388642071085568", collection = ref(id = "spells", collection = ref(id = "collections")))], water: [ref(id = "181388642071085568", collection = ref(id = "spells", collection = ref(id = "collections"))), ref(id = "181388642088911360", collection = ref(id = "spells", collection = ref(id = "collections")))]}
client.query(
  q.Reduce(
    q.Lambda(
      ['acc', 'val'],
      q.Let(
        {
          element: q.Select(0, q.Var('val')),
          ref: q.Select(1, q.Var('val')),
          ref_list: q.If(
            q.ContainsField(q.Var('element'), q.Var('acc')),
            q.Append(q.Var('ref'), q.Select(q.Var('element'), q.Var('acc'))),
            [ q.Var('ref') ]
          ),
          mergeit: [[q.Var('element'), q.Var('ref_list')]],
        },
        q.Merge(
          q.Var('acc'),
          q.ToObject(q.Var('mergeit'))
        )
      )
    ),
    {},
    q.Match(q.Index('spell_elements'))
  )
)
.then((ret) => console.log(ret))
.catch((err) => console.error(
  'Error: [%s] %s: %s',
  err.name,
  err.message,
  err.errors()[0].description,
))
{
  air: [
    Ref(Collection("spells"), "181388642046968320"),
    Ref(Collection("spells"), "181388642581742080")
  ],
  earth: [ Ref(Collection("spells"), "181388642088911360") ],
  fire: [
    Ref(Collection("spells"), "181388642046968320"),
    Ref(Collection("spells"), "181388642071085568")
  ],
  water: [
    Ref(Collection("spells"), "181388642071085568"),
    Ref(Collection("spells"), "181388642088911360")
  ]
}
result = client.query(
  q.reduce(
    q.lambda_(
      ["acc", "val"],
      q.let(
        {
          "element": q.select(0, q.var("val")),
          "ref": q.select(1, q.var("val")),
          "ref_list": q.if_(
            q.contains_field(q.var("element"), q.var("acc")),
            q.append(q.var("ref"), q.select(q.var("element"), q.var("acc"))),
            [ q.var("ref") ]
          ),
          "mergeit": [[q.var("element"), q.var("ref_list")]]
        },
        q.merge(
          q.var("acc"),
          q.to_object(q.var("mergeit"))
        )
      )
    ),
    {},
    q.match(q.index("spell_elements"))
  )
)
print(result)
{'air': [Ref(id=181388642046968320, collection=Ref(id=spells, collection=Ref(id=collections))), Ref(id=181388642581742080, collection=Ref(id=spells, collection=Ref(id=collections)))], 'earth': [Ref(id=181388642088911360, collection=Ref(id=spells, collection=Ref(id=collections)))], 'fire': [Ref(id=181388642046968320, collection=Ref(id=spells, collection=Ref(id=collections))), Ref(id=181388642071085568, collection=Ref(id=spells, collection=Ref(id=collections)))], 'water': [Ref(id=181388642071085568, collection=Ref(id=spells, collection=Ref(id=collections))), Ref(id=181388642088911360, collection=Ref(id=spells, collection=Ref(id=collections)))]}
Reduce(
  Lambda(
    ["acc", "val"],
    Let(
      {
        element: Select(0, Var("val")),
        ref: Select(1, Var("val")),
        ref_list: If(
          ContainsField(Var("element"), Var("acc")),
          Append(Var("ref"), Select(Var("element"), Var("acc"))),
          [ Var("ref") ]
        ),
        mergeit: [[Var("element"), Var("ref_list")]]
      },
      Merge(
        Var("acc"),
        ToObject(Var("mergeit"))
      )
    )
  ),
  {},
  Match(Index("spell_elements"))
)
{
  air: [
    Ref(Collection("spells"), "181388642046968320"),
    Ref(Collection("spells"), "181388642581742080")
  ],
  earth: [ Ref(Collection("spells"), "181388642088911360") ],
  fire: [
    Ref(Collection("spells"), "181388642046968320"),
    Ref(Collection("spells"), "181388642071085568")
  ],
  water: [
    Ref(Collection("spells"), "181388642071085568"),
    Ref(Collection("spells"), "181388642088911360")
  ]
}
Query metrics:
  •    bytesIn:  540

  •   bytesOut:  877

  • computeOps:    1

  •    readOps:    8

  •   writeOps:    0

  •  readBytes:  550

  • writeBytes:    0

  •  queryTime: 27ms

  •    retries:    0

Discussion

For the "spells" documents, the element field is expressed as an Array. When an array is indexed, an index entry for each item in the array is created. That makes it easy to discover every element value: we can just use Match(Index("spell_elements")) and every element name (and its associated Reference) is returned.

The index we created does not specify any terms so that we can find all values that may have been duplicated. You can define terms if you need to find duplicate values in a subset of your documents. If you do define terms, be sure to update the Match expression to specify the term(s) you want to search for.

The query may look complex, but the Reduce function does all the hard work. It iterates over all of the items in the set returned by the Match expression. At the beginning, the accumulator is an empty object. As Reduce works its way through the set, the element values are treated as field names in the accumulator, and the value is an array of document Reference for the associated documents.

In the Lambda function, we use Let to locate the element value and the document Reference. If the element already exists in the accumulator (acc), we Append the Reference to the already-existing list. Otherwise, we create a new array item that can be merged into the accumulator.

The mergeit array is structured so that we can use ToObject to create an Object that has a computed field name and an array value.

The Merge expression takes advantage of the fact that when the second object contains a field that exists in the first object, the second object’s field value is used. This means that the accumulator evolves as expected during processing.

Be aware that with many documents and/or with high cardinality fields, the query could surpass a limit that causes the query to fail.

Is this article helpful? 

Tell Fauna how the article can be improved:
Visit Fauna's forums or email docs@fauna.com

Thank you for your feedback!