The Code Property Graph (CPG) is a breakthrough innovation in static code analysis that powers ShiftLeft CORE.
The CPG combines multiple representations of source code into one queryable graph database. For example, it merges graphs produced by the compiler, such as the Abstract Syntax Tree (AST), the Control Flow Graph (CFG), and the Program Dependence Graph (PDG), into a single joint data structure.
This enables the CPG to understand the full flow of information across an application or service, from ultimate source to ultimate sink and all the security transforms and sanitization steps in between. Moreover, the CPG was designed with modern, modular applications in mind. It can map routes across custom code, open source libraries, SDKs, APIs and even microservices.
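The merge described above can be pictured as one graph whose edges are labeled by the representation they came from. The following is a minimal illustrative sketch, not ShiftLeft's actual implementation; all node and label names are invented for the example:

```python
from collections import defaultdict

class PropertyGraph:
    """Tiny property graph: nodes carry key/value properties, edges carry
    a label naming the source representation (AST, CFG, PDG, ...)."""
    def __init__(self):
        self.nodes = {}                 # node id -> property dict
        self.edges = defaultdict(list)  # (src id, label) -> [dst ids]

    def add_node(self, nid, **props):
        self.nodes[nid] = props

    def add_edge(self, src, dst, label):
        self.edges[(src, label)].append(dst)

    def out(self, nid, label):
        return self.edges[(nid, label)]

# Three compiler graphs over the same code, merged into one structure.
g = PropertyGraph()
g.add_node("read_req",  kind="CALL", name="readRequest")
g.add_node("call_exec", kind="CALL", name="exec")
g.add_node("arg_cmd",   kind="ARG",  name="cmd")
g.add_edge("call_exec", "arg_cmd", "AST")  # syntax: exec(cmd)
g.add_edge("read_req", "call_exec", "CFG") # control flow: read, then exec
g.add_edge("read_req", "arg_cmd", "PDG")   # data dependence: cmd derives from the request

# A single query can now mix layers: which values are data-dependent
# on the untrusted readRequest call?
tainted = g.out("read_req", "PDG")
```

Because all three layers live in one structure, a query can hop from a syntactic pattern to the data dependence that makes it exploitable, which is exactly what separate per-representation tools cannot do.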
The CPG is an intermediate representation (IR) that is independent of the programming language. This has the added benefit of making queries language-independent as well: once a programming language is supported, that is, once its translation into the CPG is complete, any query written for another language also applies to the new one.
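Language independence can be sketched as follows: because every frontend normalizes its source into the same node vocabulary, one query function works unchanged across languages. The node shapes and the `calls_named` helper below are illustrative assumptions, not the real CPG schema:

```python
def calls_named(cpg, name):
    """Language-agnostic query: all CALL nodes invoking `name`."""
    return [nid for nid, props in cpg.items()
            if props["kind"] == "CALL" and props["name"] == name]

# A CPG built from Java source...
java_cpg = {
    "n1": {"kind": "CALL",   "name": "exec", "language": "JAVA"},
    "n2": {"kind": "METHOD", "name": "main", "language": "JAVA"},
}
# ...and one built from JavaScript source: same query, no changes.
js_cpg = {
    "m1": {"kind": "CALL", "name": "exec", "language": "JAVASCRIPT"},
}

java_hits = calls_named(java_cpg, "exec")
js_hits = calls_named(js_cpg, "exec")
```

The query never inspects the `language` property; only the frontends that build the graph need to know the source language.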
The workhorse of the CPG is a state-of-the-art data-flow tracker. The tracker is interprocedural, flow-sensitive, context-sensitive, and field-sensitive, and it operates on an intermediate code representation (see semantic code property graphs). The engine performs on-the-fly points-to analysis to resolve call sites and benefits from the results of constant propagation, control-flow-graph pruning, and framework analysis passes. Framework analysis passes can process configuration files if present. The data-flow engine provides a configurable set of heuristics that keep reporting within an acceptable time frame: for example, it can limit the number of branches considered in strongly connected components, the maximum path length, and the total number of computation steps.
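One of the heuristics named above, capping the maximum path length, can be sketched as a bounded search over a data-flow graph. This is a simplified illustration under assumed names (`flows`, `max_path_length`), not the engine's real API:

```python
def flows(graph, source, sink, max_path_length):
    """Return all source->sink data-flow paths no longer than the cap."""
    results, stack = [], [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == sink:
            results.append(path)
            continue
        if len(path) >= max_path_length:  # heuristic cut-off: give up on long paths
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:           # avoid revisiting nodes on this path
                stack.append((nxt, path + [nxt]))
    return results

# Data-flow edges: src -> a -> sink directly, and src -> a -> b -> sink.
dfg = {"src": ["a"], "a": ["b", "sink"], "b": ["sink"]}
short_only = flows(dfg, "src", "sink", max_path_length=3)  # longer path pruned
both = flows(dfg, "src", "sink", max_path_length=4)
```

Tightening the cap trades completeness for speed: the four-hop path through `b` disappears once the limit drops below its length, which is exactly the knob that keeps findings arriving within an acceptable time frame.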
High-level information flows are the second core ingredient contributing to the precision of our analysis. The idea is simple: for high-level programming languages such as Java, tracking single data flows between APIs is not sufficient to understand the high-level flow of information. Instead, information from multiple low-level flows must be combined: the primary data flow, and all flows that initialize sources, sinks, and transformations. The approach is inspired by UNIX file descriptors: just as a file descriptor is merely an integer whose initialization determines whether data goes to a file, a socket, or a terminal, we mark those parameters of sources, sinks, and transformations that indicate where data comes from, where it goes, or how it is transformed. By combining the primary data flow with its descriptor flows, we can derive high-level data flows and formulate rules for their classification.
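The combination of a primary flow with a descriptor flow can be sketched like this. The flow representations and the `classify` helper are invented for illustration; the point is only that neither low-level flow alone reveals where the data actually ends up:

```python
def classify(primary_flow, descriptor_flow):
    """Derive a high-level flow: where the data originates (from the
    primary flow) plus where it really goes (from how the sink's
    descriptor argument was initialized)."""
    data_origin = primary_flow[0]    # first element: the ultimate source
    destination = descriptor_flow[0] # first element: descriptor initialization
    return f"{data_origin} -> {destination}"

# Primary flow: request data passes a transform and reaches write(fd, data).
primary = ["http-request", "sanitize", "write"]
# Descriptor flow: the fd passed to write() was created by opening a socket.
descriptor = ["socket-open", "write"]

high_level = classify(primary, descriptor)
```

The primary flow alone says only "request data reaches `write`"; only the descriptor flow reveals that `write` is sending it over a socket rather than to a log file, and that distinction is what a classification rule keys on.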
The modern software development lifecycle (SDLC) has become much more efficient, enabled by cloud computing, virtual machines/containers and DevOps. However, the resulting increase in release velocity has left AppSec behind: it’s not surprising to see an organization release weekly but run code analysis monthly! To catch up, AppSec needs to become fast, accurate, and comprehensive.
Delivering meaningful automation requires speed. Not only must the start of security processes be automated, but also the results and decisions. For example, automating the start of a SAST scan via a pull request does little good if the SAST scan is still running hours after the pull request is completed.
Whether in dev, test or prod, accuracy has long been the Achilles’ heel of application security. To automate security decisions, the results must be reliable. For example, if pull requests are regularly rejected because of false positives, developers will start to ignore the security feedback.
The modern software supply chain depends increasingly on third-party libraries, SDKs, frameworks and APIs, so securing custom code alone is no longer sufficient. Furthermore, attackers can exploit both technical and business-logic flaws. Modern AppSec must therefore secure all components against all types of vulnerabilities.
While most code analysis tools focus strictly on finding vulnerabilities, the CPG can also identify code weaknesses that affect performance and efficiency, such as methods with too many parameters, improperly validated inputs, duplicate code, and inconsistent naming conventions.
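A weakness query of this kind reduces to a simple predicate over method nodes. The node shape and threshold below are illustrative assumptions, not the product's real rule format:

```python
def too_many_params(methods, limit=5):
    """Flag methods whose parameter count exceeds the threshold,
    a classic maintainability weakness rather than a vulnerability."""
    return [m["name"] for m in methods if len(m["params"]) > limit]

methods = [
    {"name": "render", "params": ["a", "b", "c", "d", "e", "f", "g"]},
    {"name": "init",   "params": ["config"]},
]

flagged = too_many_params(methods)
```

Because weaknesses and vulnerabilities are queried over the same graph, one scan can report both without a separate linting pass.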
The earliest incarnation of the CPG is the PhD thesis of Dr. Fabian Yamaguchi. This early version was used to find 18 vulnerabilities in the Linux kernel, all of which were accepted and remediated. It was recently reborn as Joern and remains completely open source. Furthermore, the CPG schema itself is open source and published so that anyone can write support for a new programming language.