A library that provides an embeddable, persistent key-value store for fast storage.
Summary: Unify the MultiScan and regular iterator codepaths in BlockBasedTableIterator by introducing a MultiScanIndexIterator that implements InternalIteratorBase. During Prepare(), the original index iterator is swapped out for a MultiScanIndexIterator that wraps the prefetched block handles and scan range metadata. This allows SeekImpl() and FindBlockForward() to use the same code flow for both regular and MultiScan operations, eliminating the need for the separate MultiScan-specific methods (SeekMultiScan, FindBlockForwardInMultiScan, MultiScanSeekTargetFromBlock, MultiScanUnexpectedSeekTarget, MultiScanLoadDataBlock, MarkPreparedRangeExhausted).

Key changes:
- New MultiScanIndexIterator class that manages scan range tracking, block handle iteration, forward-only seek enforcement, and wasted-block counting
- InitDataBlock() loads blocks from ReadSet when MultiScan is active
- FindBlockForward() detects scan range boundaries via IsScanRangeExhausted() after index_iter_->Next()
- Disabled the reseek optimization for MultiScan so MultiScanIndexIterator::Seek() is always called to update scan range tracking state
- Removed the MultiScanState struct and all MultiScan-specific methods from BlockBasedTableIterator
- No changes to CheckDataBlockWithinUpperBound or CheckOutOfBound; they work as-is through iterate_upper_bound
- multi_scan_status_ is intentionally not checked in the Valid() hot path to avoid a performance regression; when the status is non-OK, block_iter_points_to_real_block_ is already false
- Fixed a pre-existing bug in ReadSet::SyncRead() that used the base decompressor without the compression dictionary, causing ZSTD data corruption when blocks with dictionary compression needed synchronous fallback reads

Test plan:
- All MultiScan tests pass (buck test):
  - BlockBasedTableReaderMultiScan: 2688/2688 pass (including compression dictionary tests)
  - DBMultiScanIteratorTest: 32/32 pass
- db_bench (release build, 5M keys, no compression, L5 = 8 files/60MB + L6 = 49 files/317MB, 3 runs each):

  | Benchmark  | Before (parent) | After (this diff) | Delta |
  |------------|-----------------|-------------------|-------|
  | readseq    | 2,426K ops/sec  | 2,454K ops/sec    | +1.2% |
  | seekrandom | 49,053 ops/sec  | 50,008 ops/sec    | +1.9% |
  | multiscan  | 6,907 ops/sec   | 7,001 ops/sec     | +1.4% |

  Setup: --batch_size=10 --seek_nexts=50 --multiscan_size=10 --multiscan_stride=100 --cache_size=512MB --duration=30

  No regression in any benchmark.
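The unified flow can be illustrated with a self-contained toy model. Everything below (ToyIndexIterator, MultiScanIndexIteratorSketch, and their bookkeeping) is a hypothetical stand-in for the real BlockBasedTableIterator machinery, not RocksDB code; it only sketches the idea of wrapping an index iterator with scan-range tracking, forward-only seek enforcement, and an exhaustion check consulted after Next().

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for an index iterator over (first_key, block_id) entries,
// sorted by first_key. A real index positions at the block that may
// contain the target; this simplified version seeks to first_key >= target.
struct ToyIndexIterator {
  std::vector<std::pair<std::string, int>> entries;
  std::size_t pos = 0;
  bool Valid() const { return pos < entries.size(); }
  void Seek(const std::string& target) {
    pos = 0;
    while (pos < entries.size() && entries[pos].first < target) ++pos;
  }
  void Next() { ++pos; }
  int block() const { return entries[pos].second; }
};

// Hypothetical sketch of the wrapper idea: track prepared [start, limit)
// scan ranges, enforce forward-only seeks (MultiScan prepares ranges in
// order), and expose an exhaustion check the table iterator can use
// after advancing the index.
class MultiScanIndexIteratorSketch {
 public:
  MultiScanIndexIteratorSketch(
      ToyIndexIterator base,
      std::vector<std::pair<std::string, std::string>> ranges)
      : base_(std::move(base)), ranges_(std::move(ranges)) {}

  void Seek(const std::string& target) {
    assert(target >= last_target_);  // forward-only seek enforcement
    last_target_ = target;
    // Advance to the scan range whose limit is past the target.
    while (range_idx_ < ranges_.size() &&
           ranges_[range_idx_].second <= target) {
      ++range_idx_;
    }
    base_.Seek(target);
  }
  void Next() { base_.Next(); }
  bool Valid() const { return base_.Valid() && !IsScanRangeExhausted(); }
  // True once the current index key moves past the current range's limit.
  bool IsScanRangeExhausted() const {
    return range_idx_ >= ranges_.size() ||
           (base_.Valid() &&
            base_.entries[base_.pos].first >= ranges_[range_idx_].second);
  }
  int block() const { return base_.block(); }

 private:
  ToyIndexIterator base_;
  std::vector<std::pair<std::string, std::string>> ranges_;
  std::size_t range_idx_ = 0;
  std::string last_target_;
};
```

With ranges ["a","d") and ["e","h"), iterating stops (becomes invalid) as soon as the index key reaches "d", and the next Seek("e") picks up the second range, mirroring how FindBlockForward() can detect a range boundary after index_iter_->Next().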
Summary: As a follow-up to #13736, allow a "quiet" DB to react much sooner to time-based compaction triggers. For details see the DBImpl::ComputeTriggerCompactionPeriod() implementation. Also, based on review feedback, fixed a bug where time-based compaction was only triggered for column families that set periodic compaction, rather than for any time-based compaction. Test Plan: extended and added unit tests to cover much of the logic.
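A minimal sketch of the idea, under assumed option fields and an assumed divisor; the real DBImpl::ComputeTriggerCompactionPeriod() works on the actual ColumnFamilyOptions and has its own scaling. The point illustrated is the bug fix: the check period must track the smallest enabled time-based trigger of any kind (periodic compaction or TTL), not only periodic compaction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-CF time-based compaction settings (0 = disabled).
struct CfTimedCompaction {
  uint64_t periodic_compaction_secs;
  uint64_t ttl_secs;
};

// Sketch: derive a wake-up period from the smallest enabled time-based
// trigger across all column families, divided down so a quiet DB notices
// an elapsed trigger soon after it fires. The divisor of 10 is an
// illustrative choice, not RocksDB's.
uint64_t ComputeTriggerPeriodSketch(const std::vector<CfTimedCompaction>& cfs,
                                    uint64_t divisor = 10) {
  uint64_t min_trigger = UINT64_MAX;
  for (const auto& cf : cfs) {
    if (cf.periodic_compaction_secs > 0) {
      min_trigger = std::min(min_trigger, cf.periodic_compaction_secs);
    }
    if (cf.ttl_secs > 0) {  // the fix: TTL counts as time-based too
      min_trigger = std::min(min_trigger, cf.ttl_secs);
    }
  }
  if (min_trigger == UINT64_MAX) return 0;  // no time-based triggers at all
  return std::max<uint64_t>(1, min_trigger / divisor);
}
```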
Summary: Add a callback-based memory allocator to the C API, enabling language bindings to plug in custom memory allocators (mimalloc, tcmalloc, snmalloc) at runtime without requiring them to be linked at RocksDB build time. Resolves #14367

Motivation: The C API currently only exposes . On Windows (MSVC), jemalloc is not supported, and the CRT allocator never decommits freed heap segments, causing unbounded memory growth (#4112, #12364). Language bindings (Rust, Go, Python) have no way to plug in an alternative.

Implementation: Follows the exact callback pattern used by , , and :
- struct in inheriting
- Function pointers: , , (optional), (optional), plus for user-managed lifetime
- Wrapped in via existing

Changes:
- : Add declaration
- : Add proxy class and factory function
- : Add test creating a custom allocator with malloc/free callbacks, wiring it to an LRU cache
- : Release note

Test Plan: Added test in . The test creates the allocator → sets it on LRU cache options → creates the cache → destroys everything in order, and verifies no crashes or leaks through the callback dispatch path.
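Since this excerpt elides the actual C symbols, the callback + state pattern the summary describes can only be sketched with invented stand-ins; toy_allocator_t, CountingAlloc, and CountingFree below are hypothetical names, not the PR's API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical C-style callback allocator, mirroring the pattern the C API
// uses elsewhere: an opaque state pointer plus function pointers, with
// optional entries and a destructor for user-managed lifetime.
extern "C" {
typedef struct toy_allocator_t {
  void* state;
  void* (*allocate)(void* state, std::size_t size);
  void (*deallocate)(void* state, void* ptr);
  std::size_t (*usable_size)(void* state, void* ptr,
                             std::size_t alloc_size);  // optional
  void (*destructor)(void* state);                     // optional
} toy_allocator_t;
}

// Example callbacks backed by plain malloc/free, counting live allocations
// through the state pointer (roughly what a unit test wiring the allocator
// to an LRU cache would observe).
static void* CountingAlloc(void* state, std::size_t size) {
  ++*static_cast<int*>(state);
  return std::malloc(size);
}
static void CountingFree(void* state, void* ptr) {
  --*static_cast<int*>(state);
  std::free(ptr);
}
```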
Summary: The C API currently only exposes for creating a . There is no way to create a custom, user-defined through the C API using function pointers, even though the C++ API fully supports this via subclassing . This prevents language bindings (Rust, Go, Python, Node.js) from plugging in alternative allocators like mimalloc, tcmalloc, or snmalloc, which is critical on Windows where jemalloc is not supported.

Motivation: The Windows memory problem
On Windows (MSVC), RocksDB links against the CRT allocator (/). The Windows CRT never decommits freed heap segments: once committed virtual-memory pages are allocated, marks them as available but does not call . This means:
- Any large allocation (e.g., iterating over a column family) permanently inflates the process's private bytes
- Block cache evictions free memory internally, but the OS-level committed memory never shrinks
- Over time, processes using RocksDB on Windows exhibit unbounded memory growth

This is the root cause behind #4112 (189 comments), #12364, and #12579. On Linux, jemalloc or tcmalloc solve this by returning memory to the OS via . But:
- jemalloc does not support Windows MSVC ( explicitly excludes )
- tcmalloc requires , which doesn't exist on Windows
- mimalloc (Microsoft's own allocator) works on Windows and returns memory, but there's no way to plug it into RocksDB through the C API

Demand from language bindings
The C API is the foundation for all non-C++ RocksDB bindings:
- Rust (): Issues rust-rocksdb#749, rust-rocksdb#926, rust-rocksdb#260 request custom cache/allocator support
- Go (/): No allocator control available
- Java (): Users in #4112 report memory growth with JNI and resort to hacks
- Python (): No allocator control

All of these would benefit from a callback-based memory allocator in the C API.

Proposed API: Add to : This follows the same callback + state pattern already used throughout the C API for comparators, merge operators, compaction filters, etc.
Implementation sketch: Add to :

Scope: Covers allocations that go through :
- Block cache (typically the largest memory consumer)
- Compressed block cache
- Blob cache

Does **not** cover iterator temporary buffers, compaction buffers, or memtable allocations (unless is configured with the same cache). A follow-up effort to route more internal allocations through would further improve control.

Related issues:
- #4112: Memory grows without limit (189 comments)
- #12364: Unexplained sudden increase in memory usage
- #12579: High memory usage / LRU cache size not respected
- #1442: RocksDB shouldn't determine at build time whether to use jemalloc/tcmalloc
- #4437: Original PR introducing CacheAllocator (now MemoryAllocator)

Happy to submit a PR implementing this if there's interest.
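The C++ proxy side of such a design could look roughly like this. AllocatorInterface and CallbackAllocatorProxy are illustrative names (the proposal's actual declarations are elided above); the sketch only shows how a proxy class forwards each virtual call to user-supplied C callbacks and honors an optional destructor for user-managed state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Stand-in for a MemoryAllocator-like C++ interface.
class AllocatorInterface {
 public:
  virtual ~AllocatorInterface() = default;
  virtual const char* Name() const = 0;
  virtual void* Allocate(std::size_t size) = 0;
  virtual void Deallocate(void* p) = 0;
};

// Proxy that dispatches every call to C function pointers plus an opaque
// state pointer, the same shape the C API uses for comparators and
// merge operators.
class CallbackAllocatorProxy : public AllocatorInterface {
 public:
  CallbackAllocatorProxy(void* state, void* (*alloc)(void*, std::size_t),
                         void (*dealloc)(void*, void*),
                         void (*destructor)(void*))
      : state_(state),
        alloc_(alloc),
        dealloc_(dealloc),
        destructor_(destructor) {}
  ~CallbackAllocatorProxy() override {
    if (destructor_ != nullptr) destructor_(state_);  // user-managed lifetime
  }
  const char* Name() const override { return "CallbackAllocatorProxy"; }
  void* Allocate(std::size_t size) override { return alloc_(state_, size); }
  void Deallocate(void* p) override { dealloc_(state_, p); }

 private:
  void* state_;
  void* (*alloc_)(void*, std::size_t);
  void (*dealloc_)(void*, void*);
  void (*destructor_)(void*);
};
```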
When the UDI wrapper's OnKeyAdded() encountered a non-Put key type (e.g. Delete or Merge), it set its internal status_ to non-OK and stopped forwarding OnKeyAdded() to the wrapped internal index builder. However, AddIndexEntry() was always forwarded unconditionally. This asymmetry left the internal ShortenedIndexBuilder's current_block_first_internal_key_ empty, triggering an assertion failure in GetFirstInternalKey() during the buffered-block replay in MaybeEnterUnbuffered().

The crash required three conditions to co-occur:
1. UDI enabled (use_trie_index=1)
2. Compression dictionary enabled (triggering kBuffered mode)
3. Non-Put entries in the data (Delete, Merge, etc.)

Fix: move the internal_index_builder_->OnKeyAdded() call before the status_ guard so the internal builder always receives every key, matching the unconditional forwarding in AddIndexEntry().
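The ordering fix can be reproduced with toy stand-ins (these classes are not RocksDB's actual UDI or ShortenedIndexBuilder code): forwarding to the inner builder before the status guard keeps its first-key bookkeeping populated even after a non-Put key poisons the wrapper's status.

```cpp
#include <cassert>
#include <string>

// Toy stand-in for the internal index builder: it must see every key so
// it can record the first key of the current block.
struct ToyInternalIndexBuilder {
  std::string current_block_first_internal_key;
  void OnKeyAdded(const std::string& key) {
    if (current_block_first_internal_key.empty()) {
      // Needed later by the GetFirstInternalKey() analogue during replay.
      current_block_first_internal_key = key;
    }
  }
};

// Toy stand-in for the UDI wrapper, with the fixed ordering applied.
struct ToyUdiWrapper {
  ToyInternalIndexBuilder inner;
  bool status_ok = true;
  void OnKeyAdded(const std::string& key, bool is_put) {
    // The fix: forward to the internal builder *before* the status guard,
    // matching the unconditional forwarding in AddIndexEntry().
    inner.OnKeyAdded(key);
    if (!status_ok) return;
    if (!is_put) {
      status_ok = false;  // UDI only handles Put keys
      return;
    }
    // ... UDI-specific trie handling would go here ...
  }
};
```

In the buggy version the `inner.OnKeyAdded(key)` call sat after the guard, so once a Delete or Merge was seen, the inner builder stopped receiving keys and its first-key field stayed empty, tripping the assertion during buffered-block replay.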