# PiF: In-Flash Acceleration for Data-Intensive Applications

Myoungjun Chun<sup>1</sup>, Jaeyong Lee<sup>1</sup>, Sanggu Lee<sup>1</sup>, Myungsuk Kim<sup>2</sup>, and Jihong Kim<sup>1</sup>

<sup>1</sup>Seoul National University, <sup>2</sup>Kyungpook National University

The 14<sup>th</sup> ACM Workshop on Hot Topics in Storage and File Systems, June 28, 2022

### Processing-in-Storage Architectures



**Traditional architecture** 



**Processing-in-Storage (PiS) architecture** 

## Limitations of the PiS Technique



Figure x-axis: number of chips/channel

- Large data movements from flash to an accelerator
- Unscalable acceleration capability over the number of flash chips
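The scaling limit can be illustrated with a back-of-the-envelope model. It assumes that all chips on a channel share one channel bus, so a controller-side accelerator never receives data faster than the channel delivers it. The function name and the bandwidth figures below are hypothetical placeholders, not measurements from this work.

```python
def pis_input_bandwidth(chips_per_channel, chip_read_mbps, channel_mbps):
    """Data rate (MB/s) reaching a controller-side accelerator on one channel.

    Assumed model: chips read in parallel, but every page must cross the
    shared channel, so the channel bandwidth caps the aggregate rate.
    """
    return min(chips_per_channel * chip_read_mbps, channel_mbps)


# Hypothetical numbers for illustration only.
CHIP_READ_MBPS = 400
CHANNEL_MBPS = 1200

for chips in (1, 2, 4, 8):
    rate = pis_input_bandwidth(chips, CHIP_READ_MBPS, CHANNEL_MBPS)
    print(f"{chips} chips/channel -> {rate} MB/s")
```

Under these assumed numbers the rate stops growing once three chips saturate the channel, which is the kind of plateau the second point describes.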










## An Opportunity for In-Flash Processing






## **Processing-in-Flash Architectures**





# Challenges in Designing a PiF SSD

- **?** What is the power budget for CoX flash chips?
  - In typical flash chips, Power<sub>program</sub> > Power<sub>read</sub>
  - Allocate Power<sub>program</sub> − Power<sub>read</sub> to the CoX flash chips
- **?** Can reliable in-flash reads be supported without a controller-side ECC module?
  - Alternatively, design a weak but low-complexity ECC module
- **?** Which computation stage should ideally be offloaded to the CoX flash chips?
  - Requirements of ideal candidates:
    - A high data reduction ratio
    - Suitable for data-parallel processing
    - Low implementation overhead (i.e., within a power/area budget)







### Use case: A Pattern Matching Enabled PiF Architecture



### Use Case: PiF-PM with a Pattern Matcher



### **PiF-PM: Operational Overview**

- Two additional commands support PiF-PM:
  - *set\_pattern*: configures the pattern matcher (PM) to search for the specified patterns
  - *read\_when\_matched*: outputs only the pages that contain the specified patterns
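The behavior of the two commands can be sketched with a toy software model. This is only an illustration under stated assumptions: the class `PifPmChip`, its method signatures, and byte-substring matching within a single page are hypothetical, since the slides do not specify the command encoding or the matching granularity.

```python
class PifPmChip:
    """Toy model of a flash chip with an on-chip pattern matcher (PM)."""

    def __init__(self, pages):
        self.pages = list(pages)  # page contents as bytes
        self.patterns = []        # patterns currently configured in the PM

    def set_pattern(self, patterns):
        """Model of set_pattern: configure the PM with the patterns to search for."""
        self.patterns = [bytes(p) for p in patterns]

    def read_when_matched(self, page_no):
        """Model of read_when_matched: output the page only if a pattern occurs in it."""
        page = self.pages[page_no]
        if any(p in page for p in self.patterns):
            return page
        return None  # unmatched page: nothing is sent over the channel


chip = PifPmChip([b"error: disk full", b"all good", b"warning: retry"])
chip.set_pattern([b"error", b"warning"])
matched = [n for n in range(len(chip.pages))
           if chip.read_when_matched(n) is not None]
print(matched)  # [0, 2]
```

In this model only pages 0 and 2 leave the chip, so the unmatched page never consumes channel bandwidth.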



# Challenge 1: Reliable In-flash Read

Directly implementing a controller-side ECC engine on the flash chip incurs high power and area overhead




In-flash read with

# Challenge 2: Bandwidth Degradation

- Simple structure: Bandwidth degradation due to extra operations
- Pipelined structure of PiF-PM: Fully exploiting the chip bandwidth by overlapping all operations
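The difference between the two structures can be sketched with a simple timing model. It assumes an ideal pipeline in which each stage has its own hardware resource and never stalls. The stage breakdown and the latencies are hypothetical, chosen only to make the arithmetic visible.

```python
def simple_time(num_pages, stage_us):
    """Non-overlapped structure: each page finishes every stage before the next starts."""
    return num_pages * sum(stage_us)


def pipelined_time(num_pages, stage_us):
    """Ideal pipeline: once filled, one page completes every max(stage_us) microseconds."""
    if num_pages == 0:
        return 0
    return sum(stage_us) + (num_pages - 1) * max(stage_us)


# Hypothetical per-page latencies in microseconds:
# (array read, extra in-flash processing, data-out)
STAGES_US = (50, 20, 30)

print(simple_time(100, STAGES_US))     # 10000
print(pipelined_time(100, STAGES_US))  # 5050
```

With overlapping, the steady-state cost per page drops to the slowest stage alone (here the array read), so the extra operations no longer lengthen the per-page time.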



Figure panels: Simple structure; PiF-PM: Pipelined structure






### **Experimental Setup**

- Evaluation platform
  - Cosmos+ OpenSSD
- Comparison schemes
  - Baseline: Processing-in-Storage scheme
  - Proposed: Processing-in-Flash with CoX-PM
- Workloads
  - Grep
  - **SQL\_Query** (TPC-H benchmark)
- Metrics
  - All values are normalized to the baseline

### Result 1: Performance Improvement



Observation 1: The performance improvement is nearly scalable across varying numbers of chips per channel

Observation 2: The performance improvement differs between workloads because their data reduction ratios differ (Grep: 93.7%, SQL\_Query: 83.3%)
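A short calculation shows what these ratios mean for channel traffic. It assumes that the data reduction ratio is the fraction of data filtered out inside the flash chips, so the remainder is what still crosses the channel. The helper name is hypothetical.

```python
def channel_traffic_fraction(data_reduction_ratio):
    """Fraction of the data read inside the chips that still crosses the channel."""
    return 1.0 - data_reduction_ratio


# Data reduction ratios reported in Observation 2.
REDUCTION = {"Grep": 0.937, "SQL_Query": 0.833}

for workload, ratio in REDUCTION.items():
    remaining = channel_traffic_fraction(ratio)
    print(f"{workload}: {remaining:.1%} of the data crosses the channel")
```

Under this reading, Grep sends about 6.3% of its data over the channel and SQL\_Query about 16.7%, so SQL\_Query leaves roughly 2.7 times as much channel traffic per byte read.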

## Result 2: Energy Efficiency



**Observation: High energy efficiency**, owing to the performance improvement and the reduced data transfer along channels

### Conclusion

- Investigated the limitation of existing processing-in-storage (PiS) schemes caused by slow internal bandwidth
- Proposed a processing-in-flash (PiF) scheme that moves computation inside the flash chips, where the data physically reside
- Demonstrated that a PiF-based SSD is very promising, outperforming a PiS-based SSD by several times in execution time and power efficiency

## Thank You!