Algorithm Engineering — L04
Start with std::vector. Use something else when the workload clearly asks for it.
Same sixteen integers, two containers — two very different shapes in memory.


| Level | Latency | Ratio |
|---|---|---|
| L1 cache | ~1 ns | 1× |
| L2 cache | ~4 ns | 4× |
| L3 cache | ~12 ns | 12× |
| Main memory | ~100 ns | 100× |
Order-of-magnitude numbers; exact values are hardware-specific.
A std::vector traversal lets the prefetcher keep data in cache.
A std::list traversal risks main memory on every hop.
Asymptotically identical, empirically one hundred times slower.
std::list actually win?Three conditions, all at once:
If you cannot name all three, reach for std::vector.
std::deque — the Swiss army knife nobody should openA map of pointers to blocks. Every iterator increment risks crossing a block boundary; every implementation is slightly different and usually slower than its specification suggests.


















Trade a little memory (the husks) for contiguous layout, a single allocation, and one cache line per hop.
Worth reaching for: absl::InlinedVector, boost::small_vector, llvm::SmallVector.
std::vector. Contiguous, predictable, cache-friendly.std::list. Only when splicing, stable iterators, or expensive moves demand it.std::deque. Prefer std::vector with a read index.absl::InlinedVector and friends when most instances hold only a handful of elements.A core workhorse — with performance that depends heavily on the implementation.
std::vector + linear scan — cache beats asymptotics.std::vector<Value> indexed directly — no hash needed.std::unordered_map is a linked list in disguise
The bucket array holds pointers to heap-allocated nodes. Every insert allocates; every lookup chases a pointer off the array.

One allocation. Collisions probed in adjacent slots. A miss stays in the same cache line.
But: wide values → few slots per cache line. Every probe step = fresh cache line.

Tiny metadata array scanned before any key is touched — 16 slots per cache line, regardless of value width.
A single SIMD instruction checks all 16 tags at once; only matches pay to read the full key-value pair.
std::hash<int> is typically the identity. Swap in xxHash or ankerl::unordered_dense::hash for hot string keys.std::vector. Ordered queries → sorted container. Tiny N → linear scan.std::unordered_map. Chained by standard mandate — 2× slower on hits, 3–4× on misses and unreserved build.ankerl::unordered_dense, absl::flat_hash_map, or boost::unordered_flat_map.reserve before bulk inserts. Leave load factor alone. Hash quality matters for non-trivial keys.When order itself is part of the query, not just membership.
If none of these apply, you want a hash map, not a sorted container.
std::vectorBuild once, binary-search many times.
Read-mostly ordered map? Start here. Simple and fast.
std::map is a red-black tree30 keys → 30 separate allocations, 6 levels deep. std::map is std::list with a sorting invariant — per-node malloc, plus rebalancing on every insert.
For one million keys, a balanced binary tree is ~20 levels deep.
Each level is a pointer hop to somewhere else in memory. In the bad case, 20 cache misses per lookup — and the work inside each node is negligible.
Cache misses come from tree height, not from node work. Reduce height by widening the branching factor.
30 keys → 13 nodes, 3 levels (order 5). Every key still carries its own value pointer (gray stubs).
Node sized to one cache line ⇒ one cache miss per level ⇒ logarithm in the branching factor, not in two.
Same structure, different constant. The “latency you are minimizing” decides the node size.
Internal nodes = pure routing. All values live in leaves. Leaves linked for range scans.
Range query = one tree descent + a linear walk over the leaf chain. This is why every database uses B+.
std::vector + lower_bound. Build once, search many — but mid-insert is O(N): on a million-key vector a single insert shifts megabytes.absl::btree_map. Drop-in for std::map, ~3× faster lookup, per-op cost stays bounded by log N.std::map. Red-black tree with per-node malloc — slower than absl::btree_map on every workload. No shape where it wins.The simple binary heap is the right default — with one useful tuning knob.
If you only need the ordering once at the end, just sort. A priority queue earns its place when you pop and keep inserting.

No pointers, no per-node allocations. The tree is index arithmetic on a std::vector.









A binary heap over one million keys is ~20 levels deep. Each level is a potential cache miss per push / pop. Widen the tree, shrink the height.




d-ary heap: each node has d children instead of 2. Same array, same index arithmetic — just d·i + k instead of 2i + 1. The d children sit in one cache line; height drops from log₂ N to log_d N.
Heaps support no efficient way to change a key already in the heap. Work around it:
// min-heap: smallest priority first
using Entry = std::pair<int, int>; // (priority, id)
std::priority_queue<Entry, std::vector<Entry>, std::greater<>> pq;
pq.push({new_priority, id}); // duplicate, not update
while (!pq.empty()) {
auto [p, id] = pq.top(); pq.pop();
if (p != priority[id]) continue; // stale — skip
process(id);
}Lazy deletion. The heap grows by at most a constant factor per update — a cheap price for avoiding the reverse-index bookkeeping a real decrease-key needs.
std::sort.std::priority_queue — binary heap on a vector. Cache-friendly, no allocations.d-ary heap (d = 4 or 8) if the heap is large.Asymptotic ≠ fast — the Fibonacci heap. Better textbook bound for Dijkstra, but pointer-chasing and cache-hostile consolidation make it lose to a flat binary heap in practice. Lazy deletion removes the last reason to use it.
A case where the theory translates well into practice: a compact implementation with very low amortized cost.
Maintain a partition of {0, …, 7} under:
find(x) — which set is x in?unite(x, y) — merge the two sets.



















Union by rank: attach the shorter tree under the taller. Tree height stays logarithmic.
Path compression: during find, re-parent every node on the path directly to the root.
#include <numeric>
#include <vector>
struct UnionFind {
std::vector<int> parent, rank_;
UnionFind(int n) : parent(n), rank_(n, 0) {
std::iota(parent.begin(), parent.end(), 0);
}
int find(int x) {
while (parent[x] != x) {
parent[x] = parent[parent[x]]; // path halving
x = parent[x];
}
return x;
}
void unite(int x, int y) {
int rx = find(x), ry = find(y);
if (rx == ry) return;
if (rank_[rx] < rank_[ry]) std::swap(rx, ry);
parent[ry] = rx;
if (rank_[rx] == rank_[ry]) ++rank_[rx];
}
};Amortized cost per operation: \[O(\alpha(n))\]
For any n a physical computer can represent: α(n) ≤ 4.
In practice, treat both operations as O(1). The constant factor is dominated by the single cache line that holds the parent array hot.
For graphs, the representation often matters as much as the algorithm.

Graphs are arguably the dominant abstraction of combinatorial optimization.
They model concrete entities — people in a social network, intersections in a street network, machines on a factory floor.
…but just as naturally abstract ones — states of a system and the transitions between them, configurations of a puzzle, positions in a game.
Once a problem wears this shape, a huge library of algorithms becomes available. The rest of this chapter is about making those algorithms run fast on real hardware.
std::vector<std::vector<int>> places each vertex’s neighbors in a separate heap allocation: one cache miss per vertex before reading a single edge.
CSR is to graphs what std::vector is to linear structures — the default unless you have a specific reason otherwise.
std::vector<T>; no indirection.weights array of length E, aligned with targets. weights[i] is the weight of the edge whose target is targets[i]. Two parallel arrays become three; everything still streams.struct Edge { int dst; float weight; } array. Better if the algorithm always touches both fields together; cache line carries what you need.The annotation layout is the same AoS vs SoA question you already know. Fields the inner loop reads together should sit together; cold fields (labels, timestamps, provenance) go in their own array.
Right for small V with a dense graph, when edge-existence queries dominate, or when the algorithm is expressed as linear algebra (SpMV, matrix powers, GraphBLAS).



std::vector<std::vector<int>> as a long-lived representation — same pathology as std::list.Queries in space, not by key. That usually calls for different structures.
The structure is chosen by the query, not the data. Range → grid/R-tree. Nearest-neighbor → kd-tree. Ray → BVH. Collision → BVH/grid. The rest of this section is a menu.




\(O(1)\) expected for near-neighbor queries when points are roughly uniform. One array of bucket vectors, one divide-and-floor to find a cell.





Degrades to linear scan as dimensionality rises. By about twenty dimensions, brute-force is competitive. This is the “curse of dimensionality” for data structures.













Not a query accelerator — a graph builder. Turns a point cloud into a sparse planar graph so downstream graph algorithms have something to walk.
A closing party trick: give up exactness for tiny memory — simple, surprising, rarely taught.













\[\Pr[\text{a fixed bit stays 0 after one hash}] \;=\; 1 - \tfrac{1}{m}\]
\[\Pr[\text{it stays 0 after all } kn \text{ hashes}] \;=\; \left(1 - \tfrac{1}{m}\right)^{kn} \;\approx\; e^{-kn/m}\]
\[\Pr[\text{bit is 1}] \;\approx\; 1 - e^{-kn/m}\]
\(m\): size of bit array
\(k\): number of hashes
\(n\): inserted elements
A non-member query hits \(k\) bits independently — false positive iff all are 1: \[\Pr[\text{false positive}] \;\approx\; \left(1 - e^{-kn/m}\right)^{k}\]
Minimize over \(k\) for fixed \(m/n\): optimum at \(k^\star = \tfrac{m}{n}\ln 2\).
| bits / element | optimal \(k\) | FPR | for \(n = 10{,}000{,}000\) |
|---|---|---|---|
| 4 | 3 | 14.7 % | 4.8 MB |
| 6 | 4 | 5.6 % | 7.2 MB |
| 8 | 6 | 2.2 % | 9.5 MB |
| 10 | 7 | 0.82 % | 12 MB |
| 12 | 8 | 0.31 % | 14 MB |
| 16 | 11 | 0.046 % | 19 MB |
| 20 | 14 | 0.0068 % | 24 MB |
Rule of thumb: each ~5 bits per element buys roughly a 10× lower FPR. Ten bits (~1 %) is the default most libraries pick; go higher only when a false positive is genuinely expensive.
d rows, w columns. Hash each item into one cell per row; increment. Query returns the min across rows.









\[w = \lceil e/\varepsilon \rceil, \quad d = \lceil \ln(1/\delta) \rceil \;\Rightarrow\; \Pr[\,\widehat{f}(x) - f(x) \le \varepsilon N\,] \ge 1-\delta\]
The estimate is always an overcount — never an undercount. The error is bounded by total stream mass, not by the item’s true frequency.
Hash each item → count leading zeros → the more distinct items, the higher the maximum.








\(2^{14}\) registers = 12 KB \(\Rightarrow\) ~0.8 % error. Redis PFCOUNT, BigQuery APPROX_COUNT_DISTINCT.
| Bloom | Count-Min | HyperLogLog | |
|---|---|---|---|
| Question | “seen it?” | “how often?” | “how many distinct?” |
| Answer | bit vector |
min of d counters
|
leading-zero stats |
| Error | false positives (no FN) | one-sided overcount \(\le \varepsilon N\) | rel. std. err. \(\approx 1.04/\sqrt{m}\) |
| Classic use | LSM read skip | heavy hitters |
COUNT(DISTINCT …)
|
Same recipe, three questions. Reach for them when an exact structure would blow the memory budget — otherwise just use a hash set.
The data structure is the performance story.
For lookup after the lecture. Not covered in the recorded walkthrough.
| Topic | C++ | Python |
|---|---|---|
| Dynamic array | std::vector |
list, numpy.ndarray |
| Small / inline vector | absl::InlinedVector, boost::small_vector, llvm::SmallVector |
— (fall back to list) |
| Deque / FIFO | std::vector with read index |
collections.deque |
| Hash map (default) | absl::flat_hash_map, ankerl::unordered_dense, boost::unordered_flat_map |
dict (already flat, open-addressed) |
| Hash map (std) | std::unordered_map — chained, avoid when possible |
— |
| Sorted lookup (static) | sorted std::vector + std::lower_bound |
bisect on a sorted list |
| Sorted (mutation-heavy) | absl::btree_map, absl::btree_set |
sortedcontainers.SortedList / SortedDict |
| Sorted (std) | std::map, std::set — red-black tree, avoid |
— |
| Priority queue | std::priority_queue, boost::d_ary_heap |
heapq on a list |
| Union-find | 20-line inline implementation | scipy.cluster.hierarchy.DisjointSet, networkx.utils.UnionFind |
| Graph (CSR) | hand-rolled CSR; GraphBLAS, Galois, Ligra | scipy.sparse.csgraph, igraph, rustworkx |
| Graph (dict-of-dicts) | — (avoid) | networkx — fine below ~10⁵ edges |
| Spatial (kd-tree, BVH, R-tree) | CGAL, nanoflann, Intel Embree, boost::geometry | scipy.spatial.KDTree, scipy.spatial.cKDTree, rtree |
| Bloom filter | folly, RocksDB, Abseil | pybloom-live, rbloom |
| Count-Min / HyperLogLog | datasketches (Apache) |
datasketches, probables |
Rule of thumb: in Python, reach for a native-code library (NumPy, SciPy, Abseil-via-pybind11) before a pure-Python container when the workload is the bottleneck. In C++, reach for Abseil/Boost flat containers before the std:: ordered or chained defaults.
Algorithm Engineering SS 2026 — L04 Data Structures