Glossary of terms

Glossary of miscellaneous topics


Many programming concepts and their descriptions: https://dev.to/chhunneng/100-computer-science-concepts-you-should-know-2pgk

Many DE terms: https://dagster.io/glossary


Two Generals problem

About the Two Generals problem: https://linuxblog.io/the-two-generals-problem


Race condition

A race condition in programming occurs when multiple threads or processes concurrently access and modify shared resources, and the final outcome depends on the unpredictable timing and order of these operations. This can lead to non-deterministic behavior, where the program's output varies between executions even with the same input, making such bugs difficult to reproduce and debug. Key characteristics of race conditions:

  • Shared Resources: The presence of a shared resource (e.g., a variable, data structure, file, or database) that multiple threads or processes can access.

  • Concurrent Access: Multiple threads or processes attempt to access or modify the shared resource simultaneously.

  • Unpredictable Timing: The relative timing of these accesses is not guaranteed, and the operating system or runtime environment can schedule threads in an arbitrary order.

  • Non-deterministic Outcome: The final state of the shared resource and the program's behavior can vary depending on the precise order of operations, leading to incorrect or unexpected results.

Example: Consider a shared counter being incremented by multiple threads. If two threads read the current value of the counter, both increment it, and then both write the new value back, the final value might be incorrect if the writes overlap in an unfavorable way. For instance, if the counter is 10, both threads read 10, Thread A writes 11, and then Thread B writes 11: the final value is 11 instead of the expected 12.

Mitigation techniques: To prevent race conditions, synchronization mechanisms are employed to ensure that only one thread or process can access a shared resource at a time, or that operations on shared resources are atomic:

  • Locks/Mutexes: Provide exclusive access to a critical section of code, ensuring only one thread can execute it at a time.

  • Semaphores: Control access to a limited number of resources, allowing a specified number of threads to proceed concurrently.

  • Atomic Operations: Use hardware-supported atomic instructions for simple operations like increments or decrements, guaranteeing they are indivisible.

  • Critical Sections: Identify and protect code blocks that access shared resources, ensuring mutual exclusion.
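The shared-counter scenario above can be sketched in Python. This is a minimal illustration using threading.Lock as the mitigation; the thread and increment counts are arbitrary:

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        # Without the lock, the "read, add, write" sequence could
        # interleave between threads and lose updates.
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 with the lock; without it, possibly less
```

Removing the `with lock:` line reintroduces the race: the final count may fall short of 40000, and by a different amount on each run.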


CQRS

CQRS (Command Query Responsibility Segregation) is an architectural pattern that separates an application's write operations (commands) from its read operations (queries). This separation allows different data models, scaling strategies, and data stores to be used for each side, leading to improved performance, scalability, and flexibility, especially in complex and high-performance systems.

  • Database Design Implications: While CQRS doesn't dictate a specific database technology, it often leads to the use of different database designs or even entirely separate databases optimized for either reads or writes.

    • Write-optimized databases (Command side): These often prioritize data integrity, transactional consistency, and normalized schemas to facilitate updates and prevent data duplication. Relational databases are a common choice here.

    • Read-optimized databases (Query side): These might employ denormalized schemas, materialized views, or even different database types (e.g., NoSQL databases) to achieve high read performance and cater to specific query patterns.

  • Flexibility in Database Choices: CQRS allows for the use of different database technologies for the command and query sides, enabling you to choose the best tool for each specific need (e.g., a relational database for commands and a document database for queries).

In essence, CQRS is a higher-level architectural pattern that informs and guides database design decisions to achieve optimized performance and scalability for both read and write operations.
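A toy in-memory sketch of the split (all names are hypothetical; a real system would typically use separate data stores and an event or messaging layer to update the read side asynchronously):

```python
# Command side: normalized "write model" keyed by order id.
write_model = {}   # order_id -> {"customer": ..., "items": [...]}
# Query side: denormalized "read model" shaped for one query pattern.
read_model = {}    # customer -> order count

def place_order(order_id, customer, items):
    """Command: mutates the write model, then updates the projection."""
    write_model[order_id] = {"customer": customer, "items": items}
    read_model[customer] = read_model.get(customer, 0) + 1

def order_count(customer):
    """Query: reads only the denormalized view, never the write model."""
    return read_model.get(customer, 0)

place_order(1, "alice", ["book"])
place_order(2, "alice", ["pen"])
print(order_count("alice"))  # 2
```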


RBAC - Role-Based Access Control


🔐 RBAC (Role-Based Access Control)

The Core Concept: Instead of giving permission to specific people, you give permissions to specific job titles (Roles). People are then assigned those titles.

The Analogy: The Hospital Badge

Imagine a hospital security system.

  • Without RBAC: You have to program every single door to open for "Dr. Smith," "Nurse Jones," and "Janitor Bob" individually. If Dr. Smith quits, you have to find every door she had access to and remove her.

  • With RBAC: You create a "Doctor" badge. You program the doors to open for anyone holding a "Doctor" badge. When Dr. Smith is hired, you just hand her the badge. If she quits, you take it back. You never touch the door programming.

The Three Pillars

RBAC separates "Who you are" from "What you can do" using a middle layer.

  1. User (Who): The individual person (e.g., alice@company.com).

  2. Role (The Bridge): A label that groups permissions (e.g., Admin, Editor, Viewer).

  3. Permission (What): The specific action allowed (e.g., READ table, DELETE file, EXECUTE query).

The Flow:

User ➔ Assigned to ➔ Role ➔ Has ➔ Permissions

Why use it? (The "Scale" Argument)

  • Efficiency: If you hire 50 new Junior Engineers, you don't assign 50 sets of permissions. You just assign the Junior_Eng role 50 times.

  • Least Privilege: It makes it easier to ensure users only have the access they strictly need for their job function (a core security principle).

  • Auditing: It is easier to answer "Who can delete production data?" by looking at the Admin role than by checking every single user account.

Example: A Database Setup

  • Role A: Data_Analyst

    • Permissions: SELECT on tables. (Can look, but cannot touch).

  • Role B: Data_Engineer

    • Permissions: SELECT, INSERT, UPDATE, CREATE TABLE. (Can build and change things).

  • Scenario: Alice is promoted from Analyst to Engineer.

    • Action: Revoke Data_Analyst role ➔ Grant Data_Engineer role.

    • Result: Her permissions update instantly across the entire system.
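The database setup above might be sketched in Python roughly like this (role and permission names are illustrative, not tied to any real database):

```python
# Role -> set of permissions (the "What").
ROLE_PERMISSIONS = {
    "Data_Analyst": {"SELECT"},
    "Data_Engineer": {"SELECT", "INSERT", "UPDATE", "CREATE TABLE"},
}

# User -> set of roles (the "Bridge").
user_roles = {"alice@company.com": {"Data_Analyst"}}

def can(user, permission):
    # A user may perform an action if any of their roles grants it.
    return any(permission in ROLE_PERMISSIONS[r]
               for r in user_roles.get(user, ()))

assert can("alice@company.com", "SELECT")
assert not can("alice@company.com", "INSERT")

# Promotion: revoke Data_Analyst, grant Data_Engineer.
user_roles["alice@company.com"] = {"Data_Engineer"}
assert can("alice@company.com", "INSERT")  # updated everywhere at once
```

Note that no per-user permission was ever touched: only the user-to-role assignment changed.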


Big-O Notation

https://blog.algomaster.io/p/big-o-notation-explained-in-8-minutes

📝 Big O Notation: Crash Course

The Core Idea: Big O Notation doesn't tell you the speed in seconds. It tells you how the number of operations grows as the input size (n) grows. It measures the worst-case scenario.

The Analogy: Simple Search vs. Binary Search

Imagine you have a list of 100 items.

  • Simple Search: You check every single item one by one. In the worst case, you check 100 items. If the list doubles to 200, you check 200. This is linear.

  • Binary Search: You split the list in half every time. For 100 items, it takes ~7 steps. If the list doubles to 200, it only takes 1 more step (8 steps). This is logarithmic.
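The step counts above can be checked with a small Python sketch (a simplified model that only counts worst-case comparisons, not a full search implementation):

```python
def simple_search_steps(n):
    # Worst case for linear search: check every item.
    return n

def binary_search_steps(n):
    # Worst case for binary search: halve the remaining
    # range until it is empty, counting each halving.
    lo, hi, steps = 0, n - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        lo = mid + 1  # assume the target sits in the upper half
    return steps

print(simple_search_steps(100), binary_search_steps(100))  # 100 7
print(simple_search_steps(200), binary_search_steps(200))  # 200 8
```

Doubling the input doubles the linear count but adds only one halving step, which is the whole point of O(log n).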

Common Big O Run Times (Fastest to Slowest)

  • O(1), Constant Time: accessing an array index; takes the same time regardless of size. Growth: flat line.

  • O(log n), Logarithmic Time: Binary Search, the "Divide and Conquer" approach. Growth: very slow.

  • O(n), Linear Time: Simple Search (looping through a list); reading every page of a book. Growth: steady.

  • O(n log n), Log-Linear Time: Quicksort or Mergesort, the fast sorting algorithms. Growth: slightly steeper than O(n).

  • O(n²), Quadratic Time: Selection Sort; nested loops (a loop inside a loop). Growth: fast; dangerous for big data.

  • O(n!), Factorial Time: the Traveling Salesperson Problem, checking every possible route. Growth: explodes immediately; impractical for large n.

Image Source: "Grokking Algorithms" book by Aditya Y. Bhargava

[Chart omitted: it visualizes how the run times above grow with increasing workload, contrasting efficient algorithms (the calm "Fast" duck) with inefficient ones (the sweating "Slow" duck).]

Key Takeaways

  • Ignore the Constants: Big O focuses on growth. O(2n) and O(100n) are both just O(n) because the curve shape is the same.

  • Worst-Case Matters: When comparing algorithms, we usually care about the worst-case scenario (e.g., searching for an item that is at the very end of the list).

  • Space Complexity: Algorithms also take up memory. Big O can measure memory usage (space) just like it measures time.

Note: average-case run time is also important, not only worst-case run time.


💡 Visual Mnemonic

  • O(log n) is like repeatedly folding a piece of paper in half: even a huge sheet takes only a few folds to shrink.

  • O(n) is like reading a book page by page.

  • O(n²) is like a handshake line where everyone shakes hands with everyone else.

Source: "Grokking Algorithms" book by Aditya Y. Bhargava


RFC

RFC process: https://medium.com/juans-and-zeroes/a-thorough-team-guide-to-rfcs-8aa14f8e757c


RACI

https://en.wikipedia.org/wiki/Responsibility_assignment_matrix


First-class citizens

https://en.wikipedia.org/wiki/First-class_citizen


CDN - Content Delivery Network

A CDN is a geographically distributed network of servers that caches and serves content (static assets, media, scripts) from locations close to end users, reducing latency and load on the origin server.


Heredoc

A heredoc (short for "here document") is a way to write multi-line strings in programming without dealing with a bunch of quote marks and escape characters. Think of it as a cleaner way to handle longer blocks of text.

Instead of writing something messy like this:
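For instance, embedding newlines and quotes with escape sequences (a contrived sketch):

```shell
# Every newline and quote has to be escaped by hand:
message="Line one.\nLine two with \"quotes\".\nLine three."
printf "%b\n" "$message"   # %b interprets the \n escapes
```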

You can use a heredoc to write it more naturally:
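The same text as a heredoc, with no escaping at all:

```shell
cat <<EOF
Line one.
Line two with "quotes".
Line three.
EOF
```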

How It Works

The heredoc uses a special marker (the exact syntax depends on the language):

  1. You start with a marker that says "here comes a multi-line string"

  2. You write your content across multiple lines, exactly as you want it to appear

  3. You end with a closing marker

In Shell (Bash/sh), heredocs use a special syntax with << followed by a delimiter. Here's how it works:

Basic Syntax
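A minimal example:

```shell
# << starts the heredoc; everything up to a line containing only
# the delimiter (here EOF) is fed to the command's standard input.
cat <<EOF
Hello from a heredoc!
EOF
```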

The EOF (End Of File) is just a marker - you can use any word you want, but EOF is the most common convention.

Common Uses

Assigning to a variable:
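One common way is via command substitution:

```shell
# Capture the heredoc's contents in a variable.
message=$(cat <<EOF
First line
Second line
EOF
)
echo "$message"
```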

Writing to a file:
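For instance (the file path is just an example):

```shell
# Redirect the heredoc into a file instead of the terminal.
cat <<EOF > /tmp/heredoc_example.txt
line 1
line 2
EOF
```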

Piping to a command:
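Here the heredoc becomes the command's standard input:

```shell
# grep reads the heredoc body and prints the matching line.
grep "error" <<EOF
info: all good
error: something failed
EOF
```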

Useful Variations

Suppress leading tabs (use <<-):
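A sketch (note that the indentation inside the heredoc body below must be real tab characters; <<- does not strip spaces):

```shell
# <<- strips leading TAB characters from each body line (and may also
# be used on the closing delimiter), so heredocs can be indented to
# match surrounding code.
if true; then
	cat <<-EOF
	This line is indented with a tab in the script,
	but printed without it.
EOF
fi
```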

Prevent variable expansion (quote the delimiter):
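Quoting the delimiter makes the body completely literal:

```shell
# <<'EOF' turns off $variable and $(command) expansion:
# the body is passed through byte-for-byte.
cat <<'EOF'
$HOME and $(date) appear literally here.
EOF
```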

Allow variable expansion (default behavior):
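With an unquoted delimiter, variables are substituted inside the body:

```shell
# Unquoted EOF: $name and $HOME are expanded before cat sees the text.
name="Alice"
cat <<EOF
Hello, $name!
Your home directory is $HOME.
EOF
```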

This prints: Hello, Alice! and your actual home directory path.


TDD - Test Driven Development

https://en.wikipedia.org/wiki/Test-driven_development


Data Residency and Data Sovereignty

https://www.splunk.com/en_us/blog/learn/data-sovereignty-vs-data-residency.html


