Unique properties of a subset of human pluripotent stem cells with high capacity for self-renewal
Nature Communications volume 11, Article number: 2420 (2020) Cite this article
Archetypal human pluripotent stem cells (hPSC) are widely considered to be equivalent in developmental status to mouse epiblast stem cells, which correspond to pluripotent cells at a late post-implantation stage of embryogenesis. Heterogeneity within hPSC cultures complicates this interspecies comparison. Here we show that a subpopulation of archetypal hPSC enriched for high self-renewal capacity (ESR) has distinct properties relative to the bulk of the population, including a cell cycle with a very low G1 fraction and a metabolomic profile that reflects a combination of oxidative phosphorylation and glycolysis. ESR cells are pluripotent and capable of differentiation into primordial germ cell-like cells. Global DNA methylation levels in the ESR subpopulation are lower than those in mouse epiblast stem cells. Chromatin accessibility analysis revealed a unique set of open chromatin sites in ESR cells. RNA-seq at the subpopulation and single cell levels shows that, unlike mouse epiblast stem cells, the ESR subset of hPSC displays no lineage priming, and that it can be clearly distinguished from gastrulating and extraembryonic cell populations in the primate embryo. ESR hPSC correspond to an earlier stage of post-implantation development than mouse epiblast stem cells.
The successful application of human pluripotent stem cells (hPSC) in research and cell therapy relies on the ability to maintain, expand, and differentiate these cells in vitro in a tightly controlled and efficient fashion. Our understanding of the regulation of pluripotent stem cell self-renewal and lineage specification is in turn built largely on embryological paradigms, with developmental roadmaps providing critical knowledge of the key transitional stages, and pinpointing the extrinsic and internal molecular pathways drive cell fate decisions. In the mouse, we now have a fairly clear understanding of the states of pluripotency that span the developmental stages between the blastocyst and the late gastrula in vivo 1 . The characterization of mouse naive embryonic stem cells (ESC) 2 and epiblast stem cells (EpiSC) 3 , 4 as in vitro equivalents of the preimplantation epiblast and the anterior primitive streak, respectively 5 , has shed considerable light on the properties of the cultured cells. Recently, a stage between these two pluripotent states called formative pluripotency, corresponding to the early post-implantation epiblast, has been described 6 . Defined by specific molecular and biological features, including an absence of lineage priming and a rapid response to induction of lineage specification (including the germline lineage), formative mouse pluripotent stem cells have yet to be successfully serially propagated in vitro.
The derivation of mouse EpiSC, their characterization as epithelial cells dependent upon activin and FGF2 for maintenance in the pluripotent state 3 , 4 , and their co-expression of lineage-specific and pluripotency genes 4 , 5 , 7 , 8 , 9 , 10 , led many researchers to the conclusion that hPSC derived and maintained under conventional culture conditions (archetypal hPSC), which share these features, equate to the primed state of mouse pluripotency. This in turn led to a search for conditions that would support long-term maintenance of hPSC in a naive state 11 , 12 . Several culture systems have been described that support a cell with molecular features quite similar to the human preimplantation epiblast 13 , 14 . Extended propagation of diploid hPSC in these systems remains challenging, however.
Population heterogeneity complicates the interpretation of stem cell phenotype. To dissect heterogeneous populations in archetypal hPSC cultures, we used monoclonal antibodies to cell surface antigens to define subsets of stem cells that exist in a hierarchical continuum of cell states. When we subjected these cells to transcriptional profiling, we observed co-expression of pluripotency genes with lineage specific transcription factors, particularly in subpopulations of cells with lower levels of stem cell surface marker and pluripotency gene expression 15 . Importantly though, we also found that cells which expressed pluripotency markers at high levels were less likely to display lineage priming.
Subsequently by analyzing cells grown under defined conditions, and using a more refined sorting strategy to isolate the subpopulation enriched for high self-renewal (ESR) followed by medium throughput single-cell RT-QPCR, we were able to show that the cells at the top of the hierarchy expressed very uniform and high levels of pluripotency markers, and showed no lineage priming 16 .
As cells traverse through stages of pluripotency in vitro and in vivo, they undergo changes in cell cycle regulation and metabolic activity, they restructure their epigenome, and their gene expression profile changes. The ability to sort highly purified populations of hPSC with high self-renewal capacity, and to analyse transcription in these subpopulations at the genomic level using RNA-Seq for subpopulation and single cell analysis, coupled with the availability of new single-cell gene expression data from preimplantation and post-implantation primate embryos, prompted us to re-examine the properties of the ESR subpopulation of archetypal hPSC, and to reconsider its developmental status. The results show that the ESR subpopulation resembles the primate early post-implantation epiblast, similar to the mouse formative state of pluripotency.
Self-renewal of subsets of hPSC
We have previously used flow cytometry to isolate cell subpopulations, followed by assay of colony forming ability as an indicator of self-renewal 15 , 17 . hPSC survival after dissociation to single cells and flow cytometry is poor. We therefore developed a simple methodology that would allow comparison of self-renewal of defined cell populations at reasonable levels of initial survival after flow cytometry, through sorting small aggregates of cells, to maintain cell–cell contacts and enhance post-sort survival 18 .
Fluorescence activated cell sorting using antibodies GCTM-2 and TG30 (recognizing CD9) enabled us to recover small cell aggregates (chiefly doublets; singlets, 17%; doublets, 61%; triplets, 16%; quadruplets, 6%) (Fig. 1a ). We seeded wells with equal cell numbers as aggregates or singlets. Although both aggregates and single cells attached to the plate after sorting, a much larger fraction of the aggregates had begun to spread 1 h after plating, indicative of high viability; the difference was more evident after 24 h (Fig. 1b ). Initial attachment, spreading and survival of the GCTM-2highCD9high subpopulation was similar to the GCTM-2midCD9mid subpopulation. Flow cytometry re-analysis of both subpopulations 72 h later showed that they had largely retained their cell surface phenotypes though as shown previously 19 , the GCTM-2highCD9high population had begun to reconstruct the entire cell state continuum (Fig. 1c ). By 4 days, cells plated as aggregates displayed a higher colony forming efficiency than single cells, and the GCTM-2highCD9high subpopulation had formed a much larger number of microcolonies (Fig. 1d–e ). Colonies formed from GCTM-2highCD9high aggregates were larger and showed a higher proportion of cells bearing stem cell markers compared with colonies formed by GCTM-2midCD9mid cells (Fig. 1f ). Time-lapse video microscopy confirmed that the initial cell numbers attaching to the dish were similar for the two subpopulations. However, subsequent monitoring showed that while both subpopulations were migratory and underwent cell division, the GCTM-2highCD9high subpopulation persisted to form colonies of 4–32 cells several days later, whereas the GCTM-2midCD9mid colonies suffered abortive expansion and, in many cases, extinction (Supplementary Movies 1 and 2 ).
Fig. 1: Assay of self-renewal of subpopulations of hPSC under conditions that maintain cell–cell contacts.
a Isolation of aggregates of subsets of hPSC by flow cytometry. Left panel, side and forward scatter; middle panel, separation of single cells using GCTM-2 and anti-CD9; right panel, separation of cell aggregates using GCTM-2 and anti-CD9. b Phase contrast image of aggregates and single cells 1 and 24 h after plating. At 1 h, both populations of aggregates (GCTM-2highCD9high, HIGH and GCTM-2midCD9mid, MID) have attached and spread onto the substrate. Scale bar = 100 micron. c Flow cytometry analysis of cell surface antigen expression in aggregate cultures prepared as in a 72 h after plating. d Microcolony formation after 4 days by single cells (200, 1000, or 2000) or aggregates (100, 500, or 1000) of GCTM-2highCD9high (HS, HA) and GCTM-2midCD9mid (MS, MA) subpopulations. e Numbers of microcolonies formed at 4 days by GCTM-2highCD9high or GCTM-2midCD9mid single cells or aggregates (AGG). Values represent the mean ± standard deviation from eight wells from one experiment; see Supplementary Table 5 for biological replicates of this assay. f Immunostaining with stem cell surface marker antibody GCTM-2 of colonies formed by aggregates of GCTM-2midCD9mid and GCTM-2highCD9high subpopulations. Scale bar = 100 micron. Results in b, d, and f display representative outcomes from three experiments.
Full size image
We showed previously that the minority ESR subpopulation of hPSC can be isolated with cell surface markers and identified by colony formation assay. Colony formation measures both survival and self-renewal, and dissociation to single cells and flow cytometry compromises survival. We have used several assay strategies to avoid conflation of survival and self-renewal 16 , 17 , but the approach described in this study of isolating aggregates is simple, and yields defined subpopulations with high initial survival for subsequent analysis. Although the initial survival of aggregates of the hPSC subpopulations was similar, further development of viable stem cell colonies was observed predominantly in the fraction bearing the highest level of stem cell markers. Time-lapse video microscopy revealed that cells in the GCTM-2highCD9high fraction formed microcolonies that persisted during extended propagation, whereas microcolonies of cells in the lower fraction underwent gradual extinction. This is similar to the observations of Barbaric et al. 46 , who found that only a subset of SSEA3-positive hPSC formed microcolonies that persisted and grew. Their findings and ours suggest that self-renewal might be a function of the formation of a critical mass of cells expressing survival or growth factors at a sufficiently high local level. We have shown that cells in the GCTM-2highCD9high or GCTM-2highCD9highEPCAMhigh fractions express the highest levels of components of the NODAL signaling pathway 16 , 17 . Recent results in zebrafish indicate that cell–cell contacts are key to Nodal signaling 47 , suggesting the possibility that hPSC might similarly depend on a positive feedback loop of NODAL signaling and cell–cell adhesion to drive self-renewal.
In the mouse, under appropriate culture conditions, naive cells pass through an intermediate state between naive and primed pluripotency, and in this transient state, are competent to undergo germline differentiation 21 . It was previously reported that archetypal hPSC can form primordial germ cell-like cells 22 . We confirm this finding and show that ESR hPSC have the capacity for germline differentiation. This distinguishes these cells from the naive and primed states in the mouse.
Cells in the GCTM-2highCD9highEPCAMhigh fraction displayed a cell cycle with a very limited G1 fraction relative to other cells in the population. It has been shown that hPSC pause in G1 when preparing to embark on differentiation 25 , 26 , 28 . It is possible that ESR stem cells do not execute such a differentiation checkpoint and continue in a self-renewing loop, until a pause in the cell cycle is activated, possibly through an RB-dependent mechanism 48 .
Cells in the GCTM-2highCD9highEPCAMhigh fraction show higher mitochondrial membrane potential and increased oxidative phosphorylation compared with the general population. Comprehensive metabolite profiling and 13C-glucose labeling studies confirmed increased rates of catabolism of pyruvate in the mitochondrial TCA cycle. These studies also showed that increased TCA cycle flux in the GCTM-2highCD9highEPCAMhigh cells is used to generate key intermediates including citrate and α-ketoglutarate which are subsequently exported from the mitochondria and used for synthesis of lipids and amino acids/proteins. Elevated rates of amino acid synthesis in the GCTM-2highCD9highEPCAMhigh cells was also supported by increased flux of glycolytic intermediates into serine and glycine synthesis. The high rate of de novo synthesis of cholesterol in hPSC is unusual and further highlights the importance of TCA cycle in generating precursors (i.e., citrate) for this pathway. Overall, these data indicate that the GCTM-2highCD9highEPCAMhigh cells retain a highly active anabolic metabolism.
DNA methylation levels in mouse naive ES are generally low, similar to the pattern in the preimplantation epiblast, and rise in vivo as development progresses toward gastrulation. In the mouse or primate post-implantation embryo, DNMT3A and DNMT3B, and TET1 are activated shortly after implantation 36 . Levels of methylation in the GCTM-2highCD9highEPCAMhigh and general population were similar in this study, and considerably lower than levels reported for mouse epiblast stem cells. In human hPSC, co-expression of DNA methyltransferases and TET enzymes could account for the dynamic nature of DNA methylation, exemplified by the dramatic response of these cells to the presence of ascorbic acid, an activator of TET enzymes, in the cell culture medium 49 .
In contrast to DNA methylation levels, chromatin accessibility varied strikingly in the cell subpopulations that we studied. Chromatin regions that showed high accessibility in the GCTM-2highCD9highEPCAMhigh mapped to sites of previously identified pluripotency transcription factor binding in hPSC, and to sites of DNAse hypersensitivity that were found to be unique to stem cells. This pattern changed markedly in the GCTM-2midCD9mid population. These findings suggest that targets of pluripotency transcription factors in the GCTM-2highCD9highEPCAMhigh cell fraction might be involved in the regulation of self-renewal. We found no evidence for enrichment of binding sites for TFAP2C in open chromatin in our GCTM-2highCD9highEPCAMhigh cells, thus distinguishing this population from the naive state of hPSC described earlier 50 .
The properties of the ESR subpopulation of hPSC discussed above are consistent with those of the mouse early post-implantation epiblast. The ability of archetypal hPSC to undergo differentiation into amnion 51 , 52 , 53 or the germline 22 , 54 is also consistent with a phenotype closer to an earlier post-implantation state rather than to the EpiSC. We have confirmed the capacity of archetypal hPSC for germline differentiation, a capacity which is found in the majority of the population of archetypal hPSC grown in defined conditions; further study with refinements to this assay will be required to confirm a higher capacity for germline differentiation in GCTM-2highCD9highEPCAMhigh cells. We previously used embryoid body assays 16 to show that the ESR subpopulation is pluripotent, as did another study using the teratoma assay 19 . These findings are not surprising, since flow cytometry analyses of replated populations of the ESR subpopulation reported here (Fig. 1a vs. 1c ) and in a more extensive previous work 19 indicate that this subpopulation can eventually regenerate the entire hierarchy of archetypal hPSC. Here quantitative assessment of directed differentiation into three germ layer lineages in adherent cultures confirmed that both the ESR and the remaining population are pluripotent. Together, our current data and the previous work are consistent with a developmental equivalence of the ESR with early post-implantation epiblast, and not naive or primed pluripotent states.
Previous studies have indicated that gene expression in archetypal hPSC aligns with the primate post-implantation but not preimplantation epiblast 44 , 45 . Our studies on the gene expression in the subpopulation of hPSC with a high capacity for self-renewal are consistent with these findings. The post-implantation epiblast persists longer in the primate than in the mouse, and it is difficult to resolve changes in epiblast gene expression during the post-implantation period until gastrulation, when pluripotency genes turn off and the activation of lineage-specific programs occurs. Our results here and elsewhere demonstrate that the cell cycle and metabolic profile, and the lack of lineage priming in the archetypal hPSC subpopulation showing a high capacity for self-renewal, are all consistent with early post-implantation epiblast, but not an mouse EpiSC-like state. Whether a subpopulation within mouse EpiSC cultures with high self-renewal capacity exists is unknown.
Recently Nakanishi et al. 55 also described a subpopulation of hPSC with high self-renewal capacity. These cells, which reside at the periphery of hPSC colonies, were isolated on the basis of cell surface expression of NCAD, and display priming toward the primitive endoderm lineage. We previously identified a subset of cells at the periphery of hPSC colonies which co-isolated with the self-renewing subpopulation and similarly co-expressed pluripotency and primitive endoderm genes (GATA4/GATA6/HNF4A/BMP2/FN1) 16 . We found that these cells were more abundant in cultures grown in medium supplemented with serum replacement on a feeder cell layer relative to defined conditions; they were quite rare in the HSR subpopulation isolated from cultures grown in defined medium 16 . We did not observe cells expressing markers of primitive endoderm in the current study. It is possible that these lineage primed cells are descendants of the subpopulation we describe here. Determination of the relationship between the NCAD+ cells and the HSR fraction we have identified will require further analysis, but it is clear that the majority of HSR cells in hPSC cultures grown under defined conditions resemble the early post-implantation epiblast, and not primitive endoderm. Cornacchia et al. 56 recently reported that hPSC cultured in E8 media display features that are intermediate between naive and primed hPSC. We have used E8 and mTeSR interchangeably in these studies and have not noted differences between the two with respect to the parameters we have studied. Several features of the E8 cultures described by Cornacchia et al. 56 including increased capacity for self-renewal, pluripotency associated marker and gene expression, and the metabolome, are similar to the properties of the ESR population we describe. A simple interpretation of both sets of data and consistent with prior observations 16 , 43 is that culture in defined media shifts the overall population toward the ESR state.
The self-renewing subpopulations of archetypal hPSC that we have identified here and elsewhere represent a minority of the culture with distinct properties. The majority of the archetypal hPSC population shows features closer to mouse EpiSC, as noted by others. Even under the defined culture conditions used in this study that enhance self-renewal, transcriptional evidence of neural lineage priming is evident in the general population. The development of methods to propagate pure populations of self-renewing archetypal hPSC populations in a state similar to the formative stage in mouse might enhance the efficiency of cloning and differentiation protocols, and reduce the variability in differentiation efficiency often observed between hPSC lines from different genetic backgrounds. Further elucidation of the molecular regulatory pathways that maintain cells in this state will enhance our understanding of human development and guide efforts to model embryology in a dish.
hPSC culture, differentiation, and marker expression
Experimental procedures for culture and differentiation of human embryonic stem cells, indirect immunofluorescence microscopy and flow cytometry followed minor modifications to established protocols.
Human embryonic stem cell stocks (WA09 and FUCCI-G1) were maintained as described previously 57 . The FUCCI cells were a gift from Prof Jonathan S. Draper 29 . Routine maintenance of hPSC was carried out in serum-supplemented medium with fibroblast feeder cell support and subculture was performed using Dispase or collagenase to harvest fragments of colonies dissected manually.
For experiments, cells were transferred to defined feeder-free conditions (mTeSR conditions) using either mTeSR1 medium or Essential 8™ medium on Matrigel or recombinant vitronectin and subculture performed with Dispase or Tryple Express according to the manufacturers’ protocols.
For dissociation to single cells, cultures were treated with 10 μM of InSolution™Blebbistatin (Merck Millipore cat. no. 203389) for 1 h prior to dissociation to single cells using TrypLE™ (Life Tech, cat. no. 12605). Cultures were incubated in media supplement with 10 μM of InSolution™Blebbistatin overnight after which Blebbistatin was removed.
Routine tests confirmed the absence of mycoplasma contamination and a diploid karyotype (20/20 cells on G-banding) in the cell lines used.
Fluorescence activated cell sorting for colony formation assays
Cells grown for 24 h in the presence of Y-27632 Rho kinase inhibitor were dissociated using Accutase, harvested, and examined under the microscope to ensure that most of the cells had not completely dissociated into single cells. Immunolabeling was carried out by incubation in a primary antibody cocktail containing TG30 anti-CD9 antibody (mouse IgG2a) and GCTM-2 (mouse IgM) (neat supernatants, this laboratory) for 20 min at 4 °C, followed by incubation in a secondary antibody cocktail containing goat anti-mouse IgG2a Alexa Fluor 488 antibody and goat anti-mouse IgM Alexa Fluor 647 antibody (A21131, A21238, both from Thermofisher), diluted in 2% FBS in DMEM-F12 flow cytometry buffer, again at 4 °C for 20 min. The cells were resuspended the cells in 1x DAPI in cold mTeSR1 before filtering the suspension through a 35-µm cell strainer to remove any larger aggregates and debris. Commercial antibodies were used at the dilutions recommended by the manufacturer.
All cell sorting was performed on a BD FACSAria III, using the 100-µm nozzle at 17 psi. A custom cytometer configuration was created that used the 17 psi pressure as opposed to the typical 20 psi pressure. This was performed to minimize aggregates from dissociating into single cells as they were deposited into the 96-well plates. To isolate the GCTM-2highCD9high and GCTM-2midCD9mid fractions of aggregates and single cells, the gating strategy in Fig. 1a was used (forward and side scatter were employed to gate out debris and to identify single cells and aggregates; propidium iodide or AAD-7 was used to identify dead cells). The various fractions were deposited into 96-well tissue culture plates coated with Matrigel via the automated cell deposition unit on the Aria III. Subsequent to sorting, tissue culture plates were centrifuged at 200 × g for 2 min at room temperature. The sheath solution and media containing mixture was discarded and replaced with 150 µL of mTeSR1 medium.
Single cells from each of two FACS populations (GCTM-2highCD9high and GCTM-2midCD9mid) were plated at a density between 100 and 4000 cells per well of a 96-well Matrigel-coated plate. Aggregates from each of two FACS populations were placed between 50 and 2000 aggregates per well of a 96-well Matrigel-coated plate (Corning, Costar 3603). Cultures were maintained in mTESR1 with medium changed once after 48 h post-plating. After 72 h, wells were fixed and stained for colony counting as follows. The culture media was removed from cells and each well rinsed once with PBS prior to fixing with 100% ethanol for 10 min at room temperature. Ethanol was removed and wells allowed to air dry for 30 min prior to staining with haematoxylin for 10 min. Wells were then washed four times with Milli-Q water. In all, 0.08% aqueous ammonia solution was added into each well and incubated at room temperature for 2 min. After that, the wells were also washed three times with Milli-Q water and allowed to dry overnight. For time-lapse video microscopy, cells were imaged during 1–24 h post-plating in an environmental chamber (Clear State Solutions TCH 885-9G) to maintain a humid atmosphere with 5% CO2 in air while phase contrast exposures were recorded every 15 min with a Leica SP8 Confocal microscope.
Differentiation potential of hPSC subpopulations
Differentiation into primordial germ cell-like cells was carried out essentially as described 22 . Cultures grown in defined, feeder-free conditions (mTeSR1 medium, above) were sorted into GCTM-2highCD9high and GCTM-2midCD9mid fractions and plated onto fibronectin-coated dishes in Glasgow Minimal Essential Medium supplemented with Activin A (50 ng/ml), CHIR99021 (3 μM), and Y-27632 Rho kinase inhibitor (10 μM) for induction of incipient mesoderm-like cells. Two days later, cultures were harvested and replated as aggregates into Glasgow Minimal Essential Medium supplemented with LIF (10 ng/ml), BMP4 (200 ng/ml), KITLG (100 ng/ml), and EGF (50 ng/ml), or without these supplements in low attachment 96-well plates. Samples were taken at Day 0, 2, 4, and 6 and analyzed by flow cytometry for expression of cell surface ITGA6 and EPCAM using anti-human and mouse ITGA6 BV421 (Rat IgG2a, Biolegend 313624) and mouse anti-human CD236-PerCP (Mouse IgG2b, Biolegend 324213). On Day 4, some aggregates were fixed with 4% paraformaldehyde in PBSA for 30 min, permeabilized in 0.2% TritonX100 and 10% bovine serum albumin in PBSA for 30 min, stained overnight with rabbit antisera against PRDM1 (rabbit monoclonal IgG clone 9115 from Cell Signaling Technology) or NANOS3 (rabbit antisera, Abcamab70001), then incubated 1 h with goat anti-rabbit IgG Alexa488 conjugate, and counterstained with DAPI (Life Tech, Cat. #D1306), prior to examination under indirect immunofluorescence microscopy. Commercial antibodies were used at the concentrations recommended by the manufacturer.
To examine the potential for differentiation into somatic lineages in directed differentiation assays, we used the StemCell Technologies StemDiff Trilineage Differentiation Kit. WA09 or WA01 cells were separated by flow cytometry and seeded onto Matrigel-coated 8-well chamber slides, then incubated in three differentiation media for 5 days following the manufacturer’s instructions. Then the cultures were fixed in 4% paraformaldehyde in PBSA for 15 min, incubated with primary antibodies (rabbit anti-PAX6 polyclonal IgG Biolegend Poly19013; goat anti-T polyclonal IgG R&D AF2085; goat anti-SOX17polyclonal IgG R&D AF1924) for 3 h, then in secondary antibodies (donkey anti-rabbit IgG Alexa Fluor Plus 488 A21206; donkey anti-goat IgG Alexa Fluor Plus 488 A11070, both from Thermofisher) for 1 h followed by counterstaining with DAPI and examination under indirect immunofluorescence. Sufficient cells from each group were counted to attain a 95% confidence interval of 6) at 105 cells per well in 180 μl Seahorse XF Assay Medium (supplemented with 10 mM glucose, 2 mM glutamine and 1 mM sodium pyruvate). The mitochondrial electron transport chain was challenged using the Seahorse XF Mito Stress Test Kit (Agilent). A total of 12 measurement cycles were carried out, each cycle consisting of 3 min mixing and 3 min measurement of the oxygen consumption rate and extracellular acidification rate. The first three cycles determined the basal rate and 3 × 3 additional cycles were measured after addition of oligomycin (2 μM final), carbonyl cyanide-4-(trifluoromethoxy)phenylhydrazone (FCCP, 2 μM final) and rotenone/antimycin A (0.5 μM final each). The data were exported and analyzed using Excel (Microsoft) and Wave Software (Agilent).
Reduced representational bisulfite sequencing
FACS sorted cells were snap frozen in buffer RLT plus (Qiagen) before DNA extraction using the AllPrep DNA/RNA Mini Kit (Qiagen) as per the manufacturer’s instructions. DNA was treated with RNase A then purified through the DNA Clean and Concentrator column (Zymo). RRBS libraries were made from 100 ng of purified DNA using the Ovation RRBS Methyl-Seq System (NuGEN), according to the manufacturers recommendations, which includes the Qiagen Epitect kit for bisulfite conversion. Sequencing was performed on a NextSeq 500 with a 75 bp single-end sequencing protocol 60 . Sequence quality control was performed using FastQC 61 . Trimming of adapters and low-quality base calls was performed with trim_galore 62 . Trimmed reads were filtered for true RRBS reads (which contain an MspI cut site at the 5ʹ) using trimRRBSdiversityAdaptCustomers.py (NuGEN). Reads were aligned to a bisulfite converted human genome (hg38), using Bismark 63 Methylation calls were made with bismark_methylation_extractor 63 . Analysis of methylation over CpG islands was performed using Seqmonk 64 where only CGIs with 10 or more informative CpG sites were considered. FastQC, trim_galore, and Seqmonk are all available from www.bioninformatics.babraham.ac.uk .
DNA accessibility was measured using the FAST-ATAC protocol 65 with 50,000 sorted cells for each of three biological replicates of both GCTM-2highCD9highEPCAMhigh and GCTM-2midCD9mid populations. Sorted cells were pelleted by centrifugation for 5 min at 4 °C, supernatant was removed, and the cell pellet was washed one time with PBS pH 7.2. Cells were resuspended in 50 µl of transposase mixture consisting of 25 µl of 2x TD buffer (Illumina), 2.5 µl of TDE1 enzyme (Illumina), 0.5 µl of 1% digitonin (G9441, Promega), and 22 µl of nuclease-free water. Transposase reactions were incubated at 37 °C for 30 min with constant shaking in an Eppendorf ThermoMixer. Following transposition, DNA was purified using the Zymo DNA Clean and Concentrator-5 Kit (#D4014) following manufactures protocol and eluted in 20 µl of 10 mM Tris-HCL, pH 8. Transposase samples were barcoded using duel indexes during library amplification using NEBNext Ultra II Q5 Master Mix (#M0544L), 25 µM of each primer, and 20 µl of transposed DNA. Libraries were amplified for a total of nine cycles (98 °C for 10 s, 63 °C or 30 s, 72 °C for 1 min), followed by purification using AMPure beads (Beckman Coulter #A63881). ATAC-seq libraries were visualized on Bioanalyzer for typical nucleosome banding followed by sequencing on the Illumina Next-Seq platform. Libraries were trimmed using trimmomatic 66 and aligned to Ensemble (build hg19) using bwa 67 with default settings. Duplicates reads were removed using Picard tools MarkDuplicates, and each aligned read was shifted toward the Tn5 cut site as described 37 . As reported, we found that FAST-ATAC resulted in low percentages of reads mapping to mitochondria (Supplementary Fig. 4a ). Regions of open chromatin were determined by combining alignment files across replicates and using MACS v.1.4.3 68 . Open chromatin regions were merged between both high and low samples to form a universal set of peaks (peakome), and regions from ENCODE blacklist removed. Reads were quantified for each interval in the peakome using bedtools 69 and normalized using TMM method 70 . We found that each sample had a high fraction of reads in the peakome, showed strong enrichment for open chromatin at transcription start sites, and characteristic distribution of fragments showing nucleosome free regions and mono- and di-nucleosome patterns, indicating high-quality ATAC libraries (Supplementary Fig. 4b–d ). Data exploration using principle component analysis (PCA) found that the first PCA, explaining 43% of the variance, separated high and low populations, while the second PCA represented potential batch effects (Supplementary Fig. 4e, f ). Based on these observations, a general linearized model in edgeR 70 was used to identify differences in DNA accessibility between populations including batch as a covariate. Peak set annotations and enrichments were calculated using LOLA 39 by comparing peaks more accessible in either high or low populations (FDR < 0.01, Supplementary Table 1 ) with the universal set of DNAse hypersensitivity sites as the universe against the LOLA core (Supplementary Data 2 ). Statistical analysis, data exploration, and visualization was performed using R ( http://www.R-project.org ).
Single-cell RNA sequencing
Single cells were flow sorted into a chilled 384-well PCR plate containing 1.2 μl of primer/lysis mix [20 nM indexed polydT primer, 1:6,000,000 dilution of ERCC RNA spike-in mix (Ambion, 4456740), 1 mM dNTPs, 1.2 units SUPERaseIN Rnase Inhibitor (Thermo Fisher, AM2696), DEPC water (Thermo Fisher, AM9920)] using a BD FACSAria III flow cytometer (BD Biosciences, San Jose, CA, USA) and the protocol described above. Sorted plates were sealed, centrifuged for 1 min at 3000 rpm and immediately frozen upside down at −80 °C until further processing using an adapted CelSeq2 protocol 71 .
Sorted plates were thawed on ice and briefly centrifuged. To lyse the cells and anneal the mRNA capture primer the plate was incubated at 65 °C for 5 min and immediately chilled on ice for at least 2 min before adding 0.8 μl reverse transcription reaction mix [in 2 μl RT reaction: 1x Fist Strand buffer (Invitrogen, 18064-014), 20 mM DTT (Invitrogen, 18064-014), 4 units RNaseOUT (Invitrogen, 10777-019), 10 units SuperScript II (Invitrogen, 18064-014)]. The plate was incubated at 42 °C for 1 h, 70 °C for 10 min and chilled to 4 °C to generate first strand cDNA. For second strand cDNA synthesis 6 μl of second strand reaction mix were added [1x NEBNext Second Strand Synthesis buffer (NEB #E6111S), NEBNext Second Strand Synthesis Enzyme Mix: 2.4 units DNA Polymerase I (E. coli), 2 units RNase H, 10 units E. coli DNA Ligase (NEB #E6111S), DEPC water (Thermo Fisher, AM9920)]. The plate was incubated at 16 °C for 2 h to generate double stranded cDNA.
All samples were pooled and cleaned using a 1.2X NucleoMag NGS Clean-up and Size select magnetic beads (Macherey-Nagel, 7449970.5) according to manufactures instruction. To reduce the amount of beads for each 100 μl pooled sample, 20 μl beads and 100 μl bead binding buffer (20% PEG8000, 2.5 M NaCl, pH 5.5) was added. The cDNA was eluted in 6.4 μl DEPC water and kept with beads for the following IVT reaction were 9.6 μl of IVT reaction mix [1.6 μl of each of the following: A, G, C, U, 10X T7 buffer, T7 enzyme (MEGAscript T7 transcription kit (Ambion, AM1334))] was added and incubated at 37 °C for 13 h and then chilled and kept at 4 °C. To remove the leftover primers, 6 μl ExoSAP-IT For PCR Product Clean-Up (Affymetrix, 78200) was added and the sample was incubated at 37 °C for 15 min and then chilled and kept at 4 °C.
Chemical heat fragmentation was performed by adding 5.5 μl of 10X Fragmentation buffer (RNA fragmentation reagents, AM8740) to the sample and incubation in pre-heated thermal cycler at 94 °C for 2.5 min followed by immediately chill on ice and addition of 2.75 μl of Fragmentation Stop buffer (RNA fragmentation reagents, AM8740). The fragmented amplified RNA was purified using 1.8X RNAClean XP beads (Beckman Coulter, A63987) according to manufactures instruction and eluted in 6 µl DEPC water of which 5 μl (no beads) were transferred to a fresh tube for library preparation.
The fragmented RNA was transcribed into cDNA using 5ʹ-tagged random hexamer primers (GCCTTGGCACCCGAGAATTCCANNNNNN) introducing a partial Illumina adapter as also described in ref. 71 . To remove RNA secondary structure and anneal the mRNA capture primer 1 μl of tagged random hexamer (100 µM) and 0.5 μl of 10 mM dNTPs (dNTP solution set NEB, N0446S) were added to the sample and incubated at 65 °C for 5 min and immediately chilled on ice for at least 2 min before adding 4 μl reverse transcription reaction mix [in 10 μl RT reaction: 1x First Strand buffer (Invitrogen, 18064-014), 20 mM DTT (Invitrogen, 18064-014), 4 units RNaseOUT (Invitrogen, 10777-019), 10 units SuperScript II (Invitrogen, 18064-014)].
The PCR primers introduce the full-length adaptor sequence required for Illumina sequencing (for details see Illumina small RNA PCR primers). PCR was performed in 12.5 μl using half of the ranhexRT sample as a template [1X KAPA HiFi HotStart ReadyMix (KapaBiosystems KK2602), 400 nM each primer].
The final PCR amplified Library was submitted to two consecutive 1x NucleoMag NGS Clean-up and Size select magnetic beads (Macherey-Nagel, 7449970.5) according to manufactures instruction. The final library was eluted in 20 μl of 10 mM Tris-HCl solution (Sigma-Aldrich, T2319-1L).
RNA-seq data analysis
CelSeq2 scRNA-sequencing reads were mapped to the GRCh38 human genome using the Subread aligner 72 and assigned to genes using scPipe 73 with ENSEMBL v86 annotation. Gene counts were exported as a matrix by scPipe with UMI-aware counting and imported into R. Cells were removed from further analysis if they fewer than 12,500 total counts, or >60,000 total counts, or 4500 total genes detected. Genes were filtered out if they failed to achieve 1 count in at least 10% of a particular cell condition group. Heatmaps were generated on normalized expression values using heatmap2 from the gplots package with row normalization. Dimensionality reduction was performed on normalized log2-cpm expression values with size factors from computeSumFactors in scran 74 . Single-cell RNA-seq data are available through the Gene Expression Omnibus under accession number #:GSE119323.
The non-human primate data were generated on the SC3-seq platform 75 and are publicly available under GEO accession number GSE74767 . In order to compare between human and macaque expression levels, we defined a set of high confidence orthologous metaexons between the two species as in ref. 76 . Briefly, we used BLAT 77 to compare every annotated human exon in ENSEMBL release 86 (737,982 unique exons across 63,305 genes) against the human (hg38) and Macaca fascicularis (macFas5) genomes, retaining those that matched the macaque genome with at least 92% sequence identity and mapped back to their annotated location in humans. We then excluded all exons that had a second match with >90% sequence similarity in either genome, to control for interspecies differences in mappability. Overlapping exons from the same gene (associated with different isoforms) were collapsed into a single “metaexon”.
We discarded overlapping exons associated with more than one ENSEMBL gene ID, exons associated with any gene annotated to two or more chromosomes in either species, and exons where the difference in intron size between the two species is ≥10,000 bp, suggestive of poor genome assembly or annotation. After applying these quality control criteria, we ultimately retained 198,172 unique metaexons in humans and macaques across 34,142 annotated ENSEMBL human genes. The final table, as well as code and additional documentation for metaexon identification is available at http://www.bitbucket.org/ee_reh_neh/orthoexon
We processed all 2526 files from ref. 44 (from 390 cells) with sickle [ https://github.com/najoshi/sickle ] to remove bad reads and trim low-quality bases from the 3ʹ end. We then mapped all reads to macFas5 using Rsubread 1.20.6 and R 3.2.2, allowing up to 2 mismatches and 2 indels per 50 bp read, which is proportional to our setting of 5 mismatches or indels for a 100 bp read. Mapped reads were assigned to the orthologous metaexon list using featureCounts at both the single metaexon and whole-gene level, and summed within individuals in R 3.2.2.
Human scRNA-seq data processing
Sequence analysis was performed on the Illumina Next-Seq 500 platform.
Quality control of raw data was assessed using FASTQC and visualized using MultiQC. The scPipe package v1.0 for R was used to count genes based on UMI profile. Gene expression was normalized using scaterv1.6.1 and scran v1.6.6packages for R. FASTQ files were aligned to hg38 using the Subread package v1.26.1 for R statistical software, aligned reads were re-annotated to exons using ENSEMBL v86 transcriptome to define the exon/intron mapping rate. Cells with