Use `_xact_id` to dedupe exports and determine update time

Applies to:

Plan -
Deployment -

Summary

created is the original row timestamp and is not changed when a span/trace is updated. Exports can contain multiple files with duplicate ids. Use the stable id plus _xact_id to dedupe and detect the newest exported version. Do not treat _xact_id as a first-class updated_at timestamp.

What is happening

Exported rows keep their original created timestamp. When a trace or span is rewritten the export produces a new row rather than replacing the old file. That produces duplicate rows in downstream tables. Each exported row includes a stable record id (id or span_id) and an internal transaction id (_xact_id). _xact_id increases on each write, so higher _xact_id means a newer exported version of the same record. Exports do not include an updated_at column today. Rely on id + _xact_id to find the latest version and maintain a downstream high-water mark.

Fix or suggestion

Option 1: dedupe on ingest (most common)

Read the exported parquet files into a staging table.
Keep one row per id by choosing the row with the highest _xact_id.
Write the deduped result to your final table or overwrite the id partition.

Minimal SQL (Databricks / Spark SQL):

SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY _xact_id DESC) rn
  FROM production.braintrust.traces
) t
WHERE rn = 1;

Use id for trace-level exports and span_id (or id, depending on shape) for span exports.
Keep _xact_id in the final table for future comparisons.

Option 2: incremental processing by `_xact_id`

After each export run, store the highest _xact_id you processed.
On the next ingest run, process only rows where _xact_id > last_exported_xact_id.
Process only these rows (then dedupe by id if needed) and update last_exported_xact_id.

Example incremental filter:

SELECT * FROM production.braintrust.traces
WHERE date >= '2026-06-01'
  AND _xact_id > 123456789;

Pattern benefits:

Avoids reprocessing stable rows.
Ensures you capture writes in monotonic order.

Datadog / metric timestamp guidance

Use created as the metric timestamp when you want the metric to reflect the original event time (historical backfill).
Use _xact_id ordering to decide which row is the latest when a trace is updated.
If you must send a metric that reflects when your pipeline observed an update, choose one of:
- Use the export run time (processing timestamp) for updated rows detected by _xact_id.
- Or send only rows where _xact_id > last_exported_xact_id and set metric timestamp = the export run time.
Datadog APIs accept historical timestamps; supply the desired timestamp when submitting metrics.

How to confirm it worked

Count duplicates before and after dedupe: confirm no more than one row per id.
Verify the kept row per id has the highest _xact_id.
For incremental runs, confirm last_exported_xact_id increases and only new _xact_id values are processed.

Notes

Exports are partitioned by date, based on the row’s transaction date from _xact_id, not the original created timestamp or the export run time. Updated rows can appear in a later date= partition than their original created date, so incremental pipelines should scan new date= partitions and use _xact_id as the high-water mark.
There is no exported updated_at field currently. If you require a per-row update timestamp, either derive/update it in your ingest pipeline or request a product feature to add an updated_at export column.

​Summary

​What is happening

​Fix or suggestion

​Option 1: dedupe on ingest (most common)

​Option 2: incremental processing by _xact_id

​Datadog / metric timestamp guidance

​How to confirm it worked

​Notes

Summary

What is happening

Fix or suggestion

Option 1: dedupe on ingest (most common)

Option 2: incremental processing by `_xact_id`

Datadog / metric timestamp guidance

How to confirm it worked

Notes