Skip to content

7. Scheduling

Chapter 6 is cooperative: a process runs until it yields or exits. A CPU-bound loop without ecall holds the CPU indefinitely. This chapter fixes that with a timer quantum that forcibly preempts after 10 ms.

It also adds SYS_SLEEP and brings secondary harts online so multiple processes run in parallel.

spawned pid 0
spawned pid 1
hart 1 online
A1
B1
A2
pid 0: exited
B2
pid 1: exited
all processes exited

The demo programs now use sleep instead of yield. A sleeps 100 ms between prints, B sleeps 200 ms. “hart 1 online” may appear at a different position depending on timing.

FileStatus
.cargo/config.tomlchange -smp 1 to -smp 2
src/sched/timer.rsnew — rdtime, TICKS_PER_MS, handle_interrupt
src/sched/mod.rsadd quantum timer, sleep handling, preemption
src/sched/process.rsadd Sleeping state, sleep_until, wake_sleepers
src/sched/syscall.rsadd SYS_SLEEP
src/utils/sbi.rsadd HSM extension, hart_start
src/boot.rsstore hart IDs, start secondary harts
src/trap/mod.rstimer dispatch to sched::timer
src/main.rscall start_secondary_harts
src/demo.rsupdate programs to use sleep
boot.Sadd _secondary_entry

In .cargo/config.toml, change -smp 1 to -smp 2:

[build]
target = "riscv64gc-unknown-none-elf"
[target.riscv64gc-unknown-none-elf]
runner = "qemu-system-riscv64 -machine virt -m 256M -smp 2 -nographic -serial mon:stdio -bios default -kernel"
rustflags = [
"-C", "link-arg=-Tlink.ld",
]

Create src/sched/timer.rs:

pub const TICKS_PER_MS: u64 = 10_000; // 10 MHz QEMU virt timer
pub fn rdtime() -> u64 {
let val: u64;
unsafe { core::arch::asm!("rdtime {}", out(reg) val) };
val
}
/// Clear the pending timer interrupt.
pub fn handle_interrupt() {
crate::utils::sbi::set_timer(u64::MAX);
}

QEMU’s virt machine has a 10 MHz timer. rdtime reads the current tick count. handle_interrupt disarms the timer by setting it to u64::MAX.

Update src/sched/process.rs. The state machine gains a Sleeping variant:

use crate::memory::{PAGE_SIZE, alloc_frame, alloc_zeroed_frame};
use crate::paging::*;
use crate::trap::TrapFrame;
use alloc::vec::Vec;
use spin::Mutex;
const CODE_VA: usize = 0x10000;
const USER_STACK_BASE: usize = 0x0100_0000;
const USER_STACK_PAGES: usize = 4;
enum State {
Ready,
Running(usize),
Sleeping(u64),
Exited,
}
struct Process {
state: State,
tf: *mut TrapFrame,
satp: usize,
tf_va: usize,
}
unsafe impl Send for Process {}
#[derive(Clone, Copy)]
pub(crate) struct Runnable {
pub pid: usize,
pub tf: *mut TrapFrame,
pub satp: usize,
pub tf_va: usize,
}
pub(crate) enum Next {
Runnable(Runnable),
SleepUntil(u64),
Done,
Empty,
}
static TABLE: Mutex<ProcessTable> = Mutex::new(ProcessTable::new());
pub(crate) fn spawn_with_args(code: &[u8], _args: [usize; 4]) -> usize {
assert!(code.len() <= PAGE_SIZE, "user program too large");
// Code page is zeroed: any bytes past `code.len()` decode to a
// deterministic instruction (illegal-instruction trap on RV64) instead
// of stale frame contents from a previous owner.
let code_pa = alloc_zeroed_frame().expect("oom");
let tf_pa = alloc_frame().expect("oom");
unsafe {
core::ptr::copy_nonoverlapping(
code.as_ptr(),
code_pa as *mut u8,
code.len(),
);
}
let tf_va: usize = TRAMPOLINE - PAGE_SIZE;
let pt = PageTable::alloc();
pt.map_at(CODE_VA, code_pa, PTE_U | PTE_R | PTE_X, 0);
let stack_top = crate::memory::map_stack(
pt,
USER_STACK_BASE,
0,
USER_STACK_PAGES,
PTE_U,
);
pt.map_trampoline();
pt.map_at(tf_va, tf_pa, PTE_R | PTE_W, 0);
let tf_ptr = tf_pa as *mut TrapFrame;
let tf = TrapFrame::new(CODE_VA, stack_top);
unsafe { tf_ptr.write(tf) };
let pid = TABLE.lock().add(tf_ptr, pt.satp(), tf_va);
println!("spawned pid {}", pid);
pid
}
pub(crate) fn pick_next(now: u64, hartid: usize) -> Next {
let mut table = TABLE.lock();
table.wake_sleepers(now);
table
.pick_next(hartid)
.map(Next::Runnable)
.or_else(|| table.next_sleep_deadline().map(Next::SleepUntil))
.or_else(|| table.all_exited().then_some(Next::Done))
.unwrap_or(Next::Empty)
}
pub(crate) fn ready(pid: usize) {
TABLE.lock().ready(pid);
}
pub(crate) fn sleep_until(pid: usize, deadline: u64) {
TABLE.lock().sleep_until(pid, deadline);
}
pub(crate) fn exit(pid: usize) {
TABLE.lock().exit(pid);
}
impl Process {
fn new(tf: *mut TrapFrame, satp: usize, tf_va: usize) -> Self {
Self {
state: State::Ready,
tf,
satp,
tf_va,
}
}
fn to_runnable(&self, pid: usize) -> Runnable {
Runnable {
pid,
tf: self.tf,
satp: self.satp,
tf_va: self.tf_va,
}
}
fn sleeping_deadline(&self) -> Option<u64> {
match self.state {
State::Sleeping(deadline) => Some(deadline),
_ => None,
}
}
fn wake_if_due(&mut self, now: u64) {
if self
.sleeping_deadline()
.is_some_and(|deadline| now >= deadline)
{
self.state = State::Ready;
}
}
}
struct ProcessTable {
processes: Vec<Process>,
cursor: usize,
}
impl ProcessTable {
const fn new() -> Self {
Self {
processes: Vec::new(),
cursor: 0,
}
}
fn add(&mut self, tf: *mut TrapFrame, satp: usize, tf_va: usize) -> usize {
let pid = self.processes.len();
self.processes.push(Process::new(tf, satp, tf_va));
pid
}
fn wake_sleepers(&mut self, now: u64) {
for proc in &mut self.processes {
proc.wake_if_due(now);
}
}
fn pick_next(&mut self, hartid: usize) -> Option<Runnable> {
let len = self.processes.len();
(0..len).find_map(|_| {
let idx = self.cursor % len;
self.cursor = (idx + 1) % len;
let proc = &mut self.processes[idx];
if matches!(proc.state, State::Ready) {
proc.state = State::Running(hartid);
Some(proc.to_runnable(idx))
} else {
None
}
})
}
fn next_sleep_deadline(&self) -> Option<u64> {
self.processes
.iter()
.filter_map(Process::sleeping_deadline)
.min()
}
fn all_exited(&self) -> bool {
!self.processes.is_empty()
&& self
.processes
.iter()
.all(|proc| matches!(proc.state, State::Exited))
}
fn ready(&mut self, pid: usize) {
self.processes[pid].state = State::Ready;
}
fn sleep_until(&mut self, pid: usize, deadline: u64) {
self.processes[pid].state = State::Sleeping(deadline);
}
fn exit(&mut self, pid: usize) {
self.processes[pid].state = State::Exited;
}
}

Changes from Chapter 6:

  • Sleeping(u64) holds a deadline in timer ticks. wake_if_due transitions to Ready when now >= deadline.
  • pick_next now takes now and calls wake_sleepers first.
  • SleepUntil(u64) in Next tells the scheduler to sleep until the nearest deadline when no process is ready.
  • next_sleep_deadline scans for the earliest sleeping deadline.

Update src/sched/syscall.rs:

use crate::trap::TrapFrame;
use super::{process, timer};
const SYS_EXIT: usize = 1;
const SYS_SLEEP: usize = 2;
const SYS_YIELD: usize = 3;
const SYS_PUTCHAR: usize = 100;
pub(super) fn handle(pid: usize, tf: &mut TrapFrame) {
match tf.a7() {
SYS_EXIT => {
println!("pid {}: exited", pid);
process::exit(pid);
}
SYS_SLEEP => {
let ms = tf.a0() as u64;
process::sleep_until(
pid,
timer::rdtime() + ms * timer::TICKS_PER_MS,
);
}
SYS_YIELD => {
process::ready(pid);
}
SYS_PUTCHAR => {
print!("{}", tf.a0() as u8 as char);
process::ready(pid);
}
nr => {
println!("pid {}: unknown syscall {nr}", pid);
process::exit(pid);
}
}
}

SYS_SLEEP converts milliseconds to ticks and sets a deadline. The process is not runnable until the timer reaches that deadline. Each syscall sets the next state for pid (ready, sleeping, or exited); the scheduler decides which process to run next.

Update src/sched/mod.rs:

pub mod process;
mod syscall;
pub mod timer;
use crate::trap::{self, TrapCause};
use crate::utils::sbi;
use process::{Next, Runnable};
const QUANTUM: u64 = 10 * timer::TICKS_PER_MS; // 10 ms
pub fn run(hartid: usize) -> ! {
loop {
match process::pick_next(timer::rdtime(), hartid) {
Next::Runnable(proc) => {
sbi::set_timer(timer::rdtime() + QUANTUM);
let cause = enter_process(proc);
handle_trap(proc, cause);
}
Next::SleepUntil(deadline) => sleep_until_timer(deadline),
Next::Done => halt(hartid == crate::boot::boot_hartid()),
Next::Empty => {
sbi::set_timer(timer::rdtime() + QUANTUM);
unsafe { core::arch::asm!("wfi") };
}
}
}
}
fn enter_process(proc: Runnable) -> TrapCause {
trap::enter_usermode(unsafe { &mut *proc.tf }, proc.satp, proc.tf_va)
}
fn handle_trap(proc: Runnable, cause: TrapCause) {
match cause {
TrapCause::Syscall => syscall::handle(proc.pid, unsafe { &mut *proc.tf }),
TrapCause::TimerInterrupt => {
timer::handle_interrupt();
process::ready(proc.pid);
}
TrapCause::Exception {
scause,
stval,
sepc,
} => {
println!(
"pid {}: exception scause={scause:#x} stval={stval:#x} sepc={sepc:#x}",
proc.pid
);
process::exit(proc.pid);
}
}
}
fn sleep_until_timer(deadline: u64) {
disable_interrupts();
sbi::set_timer(deadline);
unsafe { core::arch::asm!("wfi") };
enable_interrupts();
}
fn halt(boot: bool) -> ! {
sbi::set_timer(u64::MAX);
if boot {
println!("all processes exited");
}
loop {
disable_interrupts();
unsafe { core::arch::asm!("wfi") };
}
}
fn disable_interrupts() {
unsafe { core::arch::asm!("csrci sstatus, 0x2") };
}
fn enable_interrupts() {
unsafe { core::arch::asm!("csrsi sstatus, 0x2") };
}

Key additions over Chapter 6:

  • Quantum timer: sbi::set_timer(rdtime() + QUANTUM) before entering each process. If the process does not trap within 10 ms, the timer fires and the trampoline returns TrapCause::TimerInterrupt. The process goes back to Ready and the scheduler picks the next one.
  • SleepUntil: when no process is ready but some are sleeping, the hart programs the timer for the nearest deadline, executes wfi, and retries. Interrupts are disabled before wfi so the timer handler does not consume the wakeup before wfi actually sleeps.
  • halt: only the boot hart prints “all processes exited.” All harts park with interrupts disabled.
  • timer::handle_interrupt() replaces the inline set_timer(u64::MAX).

Each iteration of the scheduler loop enters one process exactly once. When the process traps back, handle_trap decides its next state — ready, sleeping, or exited — and pick_next chooses what to run next. Putchar simply marks the process Ready again; if it is the only runnable one, pick_next will pick it on the next iteration, but multi-character prints from one process can interleave with another runnable process whenever a quantum expires.

In src/trap/mod.rs, update handle_kernel_trap to dispatch through the timer module:

#[unsafe(no_mangle)]
extern "C" fn handle_kernel_trap(sepc: usize, scause: usize, stval: usize) {
if scause & INTERRUPT_BIT != 0 {
match scause & !INTERRUPT_BIT {
SUPERVISOR_TIMER_INTERRUPT => {
crate::sched::timer::handle_interrupt()
}
cause => panic!(
"unhandled kernel interrupt: cause={cause}, sepc={sepc:#x}"
),
}
} else {
panic!(
"kernel exception: scause={scause:#x}, stval={stval:#x}, sepc={sepc:#x}"
);
}
}

Update src/utils/sbi.rs — add the HSM (Hart State Management) extension:

#[repr(usize)]
enum SbiExt {
ConsolePutchar = 0x01,
Hsm = 0x0048_534D,
Timer = 0x5449_4D45,
}
fn sbi_call(ext_id: SbiExt, fun_id: usize, args: [usize; 3]) -> (isize, usize) {
let mut error: isize;
let mut value: usize;
unsafe {
core::arch::asm!(
"ecall",
in("a7") ext_id as usize,
in("a6") fun_id,
inlateout("a0") args[0] => error,
inlateout("a1") args[1] => value,
in("a2") args[2],
)
}
(error, value)
}
pub fn console_write(buf: &[u8]) {
for &byte in buf {
sbi_call(SbiExt::ConsolePutchar, 0, [byte as usize, 0, 0]);
}
}
pub fn set_timer(val: u64) {
sbi_call(SbiExt::Timer, 0, [val as usize, 0, 0]);
}
pub fn hart_start(
hartid: usize,
start_addr: usize,
opaque: usize,
) -> (isize, usize) {
sbi_call(SbiExt::Hsm, 0, [hartid, start_addr, opaque])
}

hart_start asks OpenSBI to boot a secondary hart at start_addr with opaque passed in a1. The secondary hart starts in S-mode with paging disabled.

Update src/boot.rs. The boot hart discovers all harts from the DTB, then starts each secondary hart with its own kernel stack and the kernel page table.

use crate::memory::alloc_frame;
use crate::utils::symbols::*;
use core::sync::atomic::{AtomicUsize, Ordering};
use fdt::Fdt;
const MAX_HARTS: usize = 16;
const INVALID_HARTID: usize = usize::MAX;
const KSTACK_PAGES: usize = 4;
const KSTACK_VA_BASE: usize = 0xFFFF_FFD0_0000_0000;
static BOOT_HARTID: AtomicUsize = AtomicUsize::new(0);
static HART_COUNT: AtomicUsize = AtomicUsize::new(1);
static HART_IDS: [AtomicUsize; MAX_HARTS] =
[const { AtomicUsize::new(INVALID_HARTID) }; MAX_HARTS];
pub fn boot_hartid() -> usize {
BOOT_HARTID.load(Ordering::Relaxed)
}
pub fn init(hartid: usize, dtb_ptr: usize) -> usize {
BOOT_HARTID.store(hartid, Ordering::Relaxed);
println!("jikei: booted under OpenSBI");
print_kernel_info();
let fdt = unsafe { Fdt::from_ptr(dtb_ptr as *const u8) }
.unwrap_or_else(|e| panic!("failed to parse DTB: {e}"));
let mut cpu_count = 0;
for cpu in fdt.cpus() {
if cpu_count == MAX_HARTS {
panic!("too many harts in DTB");
}
HART_IDS[cpu_count].store(cpu.ids().first(), Ordering::Release);
cpu_count += 1;
}
HART_COUNT.store(cpu_count, Ordering::Release);
println!("CPUs: {}", cpu_count);
crate::memory::init(kernel_end());
{
let mut mem = crate::memory::lock();
fdt.memory()
.regions()
.filter_map(|r| {
r.size
.filter(|&s| s > 0)
.map(|s| (r.starting_address as usize, s))
})
.for_each(|(start, size)| {
println!(
"RAM: {:#x} - {:#x} ({} MB)",
start,
start + size,
size >> 20
);
mem.add_region(start, start + size);
});
let alloc_end = mem.end_ptr();
let dtb_end = dtb_ptr + fdt.total_size();
mem.init(&[(text_start(), alloc_end), (dtb_ptr, dtb_end)]);
println!(
"Memory: {} free frames ({} MB), {} total",
mem.free_count,
(mem.free_count * 4096) >> 20,
mem.num_frames,
);
}
crate::memory::freeze_regions();
crate::paging::init();
crate::memory::heap::init();
alloc_kernel_stack(hart_index(hartid))
}
// ── Secondary hart startup ──────────────────────────────────────
/// Layout must match boot.S: stack_top at offset 0, satp at offset 8.
#[repr(C)]
struct HartBootInfo {
stack_top: usize,
satp: usize,
}
unsafe extern "C" {
fn _secondary_entry();
}
pub fn start_secondary_harts() {
let count = HART_COUNT.load(Ordering::Acquire);
let boot = boot_hartid();
let satp = crate::paging::kernel_satp();
for hi in 0..count {
let hartid = HART_IDS[hi].load(Ordering::Acquire);
if hartid == boot {
continue;
}
let stack_top = alloc_kernel_stack(hi);
let info_pa = alloc_frame().expect("oom for hart boot info");
let info = unsafe { &mut *(info_pa as *mut HartBootInfo) };
info.stack_top = stack_top;
info.satp = satp;
let (err, _) = crate::utils::sbi::hart_start(
hartid,
_secondary_entry as *const () as usize,
info_pa,
);
if err != 0 {
println!("failed to start hart {}: error {}", hartid, err);
}
}
}
#[unsafe(no_mangle)]
pub extern "C" fn secondary_hart_main(hartid: usize) -> ! {
crate::trap::init_hart();
println!("hart {} online", hartid);
crate::sched::run(hartid)
}
// ── Helpers ─────────────────────────────────────────────────────
fn alloc_kernel_stack(hart_index: usize) -> usize {
crate::paging::with_kernel_pt_mut(|pt| {
crate::memory::map_stack(pt, KSTACK_VA_BASE, hart_index, KSTACK_PAGES, 0)
})
}
fn hart_index(hartid: usize) -> usize {
let count = HART_COUNT.load(Ordering::Acquire);
(0..count)
.find(|&idx| HART_IDS[idx].load(Ordering::Acquire) == hartid)
.expect("boot hart not found in DTB")
}
fn print_kernel_info() {
let kb = |a: usize, b: usize| (b - a) / 1024;
println!(
"Kernel: {:#x} - {:#x} ({} KB)",
text_start(),
bss_end(),
kb(text_start(), bss_end()),
);
println!(
" .text: {} KB, .rodata: {} KB, .data: {} KB, .bss: {} KB",
kb(text_start(), text_end()),
kb(rodata_start(), rodata_end()),
kb(rodata_end(), data_end()),
kb(data_end(), bss_end()),
);
}

The boot hart discovers all hart IDs from the DTB and stores them in HART_IDS. start_secondary_harts iterates them, skipping the boot hart, and for each:

  1. Allocates a kernel stack at a unique slot in KSTACK_VA_BASE.
  2. Allocates a HartBootInfo struct in a physical frame with the stack top and kernel satp.
  3. Calls sbi::hart_start with the entry address _secondary_entry and the info pointer.

The secondary hart boots into _secondary_entry, enables paging, and calls secondary_hart_main which sets up traps and enters the scheduler loop.

Add to boot.S, after the _switch_to_stack section:

# secondary hart entry point (called by SBI HSM hart_start)
# a0 = hartid, a1 = ptr to HartBootInfo
.globl _secondary_entry
_secondary_entry:
mv tp, a0 # tp = hartid
ld sp, 0(a1) # HartBootInfo.stack_top
ld t0, 8(a1) # HartBootInfo.satp
csrw satp, t0 # enable paging
sfence.vma # flush TLB
call secondary_hart_main # a0 still has hartid
1:
wfi
j 1b

The secondary hart arrives with paging disabled. It loads the kernel satp from HartBootInfo, enables paging, and calls into Rust. After that, it runs the same scheduler loop as the boot hart.

Update src/demo.rs — replace yield with sleep:

use crate::sched::process;
pub fn spawn() {
process::spawn_with_args(prog_bytes(_prog_a_start, _prog_a_end), [0; 4]);
process::spawn_with_args(prog_bytes(_prog_b_start, _prog_b_end), [0; 4]);
}
fn prog_bytes(
start: unsafe extern "C" fn() -> u8,
end: unsafe extern "C" fn() -> u8,
) -> &'static [u8] {
let s = start as usize;
let len = end as usize - s;
unsafe { core::slice::from_raw_parts(s as *const u8, len) }
}
unsafe extern "C" {
fn _prog_a_start() -> u8;
fn _prog_a_end() -> u8;
fn _prog_b_start() -> u8;
fn _prog_b_end() -> u8;
}
core::arch::global_asm!(
r#"
.pushsection .rodata.user_prog, "a"
.globl _prog_a_start
_prog_a_start:
li a7, 100
li a0, 'A'
ecall
li a7, 100
li a0, '1'
ecall
li a7, 100
li a0, '\n'
ecall
li a7, 2
li a0, 100
ecall
li a7, 100
li a0, 'A'
ecall
li a7, 100
li a0, '2'
ecall
li a7, 100
li a0, '\n'
ecall
li a7, 1
ecall
.globl _prog_a_end
_prog_a_end:
.globl _prog_b_start
_prog_b_start:
li a7, 100
li a0, 'B'
ecall
li a7, 100
li a0, '1'
ecall
li a7, 100
li a0, '\n'
ecall
li a7, 2
li a0, 200
ecall
li a7, 100
li a0, 'B'
ecall
li a7, 100
li a0, '2'
ecall
li a7, 100
li a0, '\n'
ecall
li a7, 1
ecall
.globl _prog_b_end
_prog_b_end:
.popsection
"#
);

Program A: print “A1\n”, sleep 100 ms, print “A2\n”, exit. Program B: print “B1\n”, sleep 200 ms, print “B2\n”, exit.

#![no_std]
#![no_main]
extern crate alloc;
#[macro_use]
mod utils;
mod boot;
mod demo;
mod memory;
mod paging;
mod sched;
mod trap;
core::arch::global_asm!(include_str!("../boot.S"));
#[unsafe(no_mangle)]
pub extern "C" fn kernel_main(hartid: usize, dtb_ptr: usize) -> ! {
let stack_top = boot::init(hartid, dtb_ptr);
unsafe { _switch_to_stack(stack_top, kernel_main_on_stack, hartid) }
}
unsafe extern "C" {
fn _switch_to_stack(
stack_top: usize,
entry: extern "C" fn(usize) -> !,
arg0: usize,
) -> !;
}
extern "C" fn kernel_main_on_stack(hartid: usize) -> ! {
trap::init_hart();
demo::spawn();
boot::start_secondary_harts();
sched::run(hartid)
}

The only change from Chapter 6: boot::start_secondary_harts() is called after spawning the demo processes. Secondary harts join the scheduler immediately.

Terminal window
cargo run --release
jikei: booted under OpenSBI
Kernel: 0x80200000 - ...
...
CPUs: 2
RAM: 0x80000000 - 0x90000000 (256 MB)
Memory: ... free frames (... MB), ... total
Paging enabled
Heap: 0xffffffc000000000 (1024 KiB)
spawned pid 0
spawned pid 1
hart 1 online
A1
B1
A2
pid 0: exited
B2
pid 1: exited
all processes exited

With two harts, both processes can run truly in parallel. The sequence:

  1. Both processes are spawned. Hart 1 comes online.
  2. One hart runs A, another runs B. Both print their first line.
  3. A sleeps 100 ms, B sleeps 200 ms. Both harts idle.
  4. A wakes first, prints “A2”, exits.
  5. B wakes, prints “B2”, exits.
  6. Boot hart prints “all processes exited.”

If you change -smp 2 back to -smp 1, the same output appears — the scheduler handles both processes on a single hart through preemption and sleep.

OpenSBI
-> boot::init
-> discover 2 harts from DTB
-> memory, paging, heap
-> alloc boot hart kernel stack
-> _switch_to_stack
-> kernel_main_on_stack
-> trap::init_hart
-> demo::spawn (pid 0, pid 1)
-> start_secondary_harts
-> alloc stack, sbi::hart_start
-> secondary_hart_main -> init_hart -> sched::run
-> sched::run (boot hart)
-> pick_next -> set quantum -> enter_usermode
-> timer preemption or syscall -> handle -> repeat
-> all exited -> halt

The kernel now has preemptive scheduling, sleep, and SMP. A misbehaving process that loops without syscalls gets preempted after 10 ms. Multiple harts run the same scheduler loop with a shared process table protected by a spin mutex.

Processes can print characters but cannot talk to each other. Chapter 8 adds IPC: rendezvous endpoints where one process sends a message and another receives it. This is the foundation for building services on top of the microkernel.