> ## Documentation Index
> Fetch the complete documentation index at: https://braintrust.dev/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Use `_xact_id` to dedupe exports and determine update time

> Deduplicate cloud export rows and track updated records using stable IDs and _xact_id.

export const plans_0 = "Any"

export const deployments_0 = "Any"

export const data_plane_version_0 = undefined

export const use_case_0 = "Use case - Ingest Braintrust exported Parquet/JSON into Databricks and Datadog while correctly identifying updated rows and assigning the correct update/write timestamp for metrics"

<Note>
  **Applies to:**

  * Plan - {plans_0}
  * Deployment - {deployments_0}
  * {data_plane_version_0}
  * {use_case_0}
</Note>

## Summary

`created` is the original row timestamp and is not changed when a span/trace is updated. Exports can contain multiple files with duplicate ids. Use the stable id plus `_xact_id` to dedupe and detect the newest exported version. Do not treat `_xact_id` as a first-class `updated_at` timestamp.

## What is happening

Exported rows keep their original `created` timestamp. When a trace or span is rewritten the export produces a new row rather than replacing the old file. That produces duplicate rows in downstream tables. Each exported row includes a stable record id (`id` or `span_id`) and an internal transaction id (`_xact_id`). `_xact_id` increases on each write, so higher `_xact_id` means a newer exported version of the same record.

Exports do not include an `updated_at` column today. Rely on `id` + `_xact_id` to find the latest version and maintain a downstream high-water mark.

## Fix or suggestion

### Option 1: dedupe on ingest (most common)

1. Read the exported parquet files into a staging table.
2. Keep one row per id by choosing the row with the highest `_xact_id`.
3. Write the deduped result to your final table or overwrite the id partition.

Minimal SQL (Databricks / Spark SQL):

```sql theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY _xact_id DESC) rn
  FROM production.braintrust.traces
) t
WHERE rn = 1;
```

<Note>
  * Use `id` for trace-level exports and `span_id` (or `id`, depending on shape) for span exports.
  * Keep `_xact_id` in the final table for future comparisons.
</Note>

### Option 2: incremental processing by `_xact_id`

1. After each export run, store the highest `_xact_id` you processed.
2. On the next ingest run, process only rows where `_xact_id > last_exported_xact_id`.
3. Process only these rows (then dedupe by id if needed) and update `last_exported_xact_id`.

Example incremental filter:

```sql theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
SELECT * FROM production.braintrust.traces
WHERE date >= '2026-06-01'
  AND _xact_id > 123456789;
```

Pattern benefits:

* Avoids reprocessing stable rows.
* Ensures you capture writes in monotonic order.

## Datadog / metric timestamp guidance

* Use `created` as the metric timestamp when you want the metric to reflect the original event time (historical backfill).
* Use `_xact_id` ordering to decide which row is the latest when a trace is updated.
* If you must send a metric that reflects when your pipeline observed an update, choose one of:
  * Use the export run time (processing timestamp) for updated rows detected by `_xact_id`.
  * Or send only rows where `_xact_id > last_exported_xact_id` and set metric timestamp = the export run time.
* Datadog APIs accept historical timestamps; supply the desired timestamp when submitting metrics.

## How to confirm it worked

* Count duplicates before and after dedupe: confirm no more than one row per id.
* Verify the kept row per id has the highest `_xact_id`.
* For incremental runs, confirm `last_exported_xact_id` increases and only new `_xact_id` values are processed.

## Notes

* Exports are partitioned by `date`, based on the row's transaction date from `_xact_id`, not the original `created` timestamp or the export run time. Updated rows can appear in a later `date=` partition than their original `created` date, so incremental pipelines should scan new `date=` partitions and use `_xact_id` as the high-water mark.
* There is no exported `updated_at` field currently. If you require a per-row update timestamp, either derive/update it in your ingest pipeline or request a product feature to add an `updated_at` export column.
