[personal profile] amazonv
Accessibility Rating Framework: A Simple Way to Compare Accessibility Quality Across a Portfolio of Apps & Websites
Type: Breakout
Track: Wildcard
For non-specialists, accessibility ratings are a simple, intuitive way to understand the accessibility quality of the apps and websites their organization publishes. For the accessibility program team, compliance, or legal, the rating framework is an effective tool to set priorities, thresholds, tolerance, and goals. The framework emphasizes end-user experience and task completion over technical conformance. It allows for comparisons between apps and websites that may not all be running the same automated test tools and uses inputs beyond test results, such as customer feedback.
 
[notes]
 
It's hard to give a yes/no answer to "is it accessible?" - you could be 99% compliant with one minor item missing and would still have to say no. This talk proposes a star rating, like restaurants or hotels, to better convey where on the journey you are.
 
It's important to track time to remediation as a metric of quality
 
[closed captions]
[refreshed page :(]
What was going to happen after that? The executive will be like, wow, that's great. And she'll walk out at her floor and probably never ask more questions about that website. It sounds like an answer that says we made it, we reached it, end of job. That's a real problem: tying the customer experience of millions of customers simply to one metric, conformance to a standard.
So the WCAG standard, you know, doesn't help us here as of 2.0, but there's a lot of optimism for where 3.0 will go. If you're not familiar with the conformance model, it says: choose your scope, which I set in my question for you as the website. And within that scope, if you're going to claim conformance, as it's worded on the slide in WCAG speak, to conform to WCAG you need to satisfy the success criteria; that is, there is no content which violates the success criteria. Gotta love that, right? No content at all. So what that is saying is: if you have a 10,012-page website and 10,011 pages totally conform, but there's missing alternative text on one of those pages, you cannot strictly say that the website conforms.
So the standard doesn't give us a lot of tools or techniques to describe the more real-world situation where we're very close but not, you know, perfect. 
So enter the accessibility ratings, which I will take you through now. Which will give us a few more ways to describe quality but not too many. And they're based on star ratings. So these star ratings are around us everywhere. You want to have takeout, you choose a restaurant based maybe on how many stars it has. If you take ride share you're asked to rate the driver. And you're kind of a little bit concerned if you give too low a rating, they may lose their position in the ride share. 
They're all around us, and what I've discovered is it's so intuitive. But just to prove that, I'm going to take you through a little example: I would like you to think of a hotel. Not just any hotel, but the best hotel you've ever stayed at. If nothing comes to mind, imagine the best hotel you could ever stay at. If you like the beach, it's on the beach. If you like exploring around town, it's in the best neighborhood. Beautiful rooms, the staff greet you by name. A wonderful hotel. So imagine now if we cross a threshold down to a four star hotel, what would that be like for you? I know for me, I used to do a lot of traveling with work -- consulting work, which is where I met a lot of wonderful people at Deque. And for me at that time, a four star hotel would be a nice hotel by the airport, you know, with a nice breakfast. Nothing was wrong with it; everything was right about it. It didn't have a beach. Maybe it wasn't in a great neighborhood, it was on the side of the highway, but for me it met what I needed.
Now if we cross down to the threshold below, which takes us into three star, I want you to think of a three star hotel experience. What might that be for you? See if something comes to mind. My thinking here is that once we're into the three star, something is taken away. For example, a nice hotel, but beside it is a railway line, and in the middle of the night a freight train just barrels through all night long and you can't sleep. Or it's a nice hotel, the staff are so friendly, but the walls are paper thin and you can't sleep. So there's some good, some bad. And now I would like you to imagine the worst hotel you have ever stayed at. When I think of that, I somehow seem to think of trips when I was in high school down to South Carolina with my family. It's quite a drive from Toronto; it takes a few days, so we would stop along the way. And some of those roadside hotels were what I would think of as traumatically unclean. Things on the walls that just shouldn't be on walls, and it was something I really remember. So that's the worst. And you might have your own sense of where that is. So we have kind of created a scale there, but I'm not going to peg that at any number of stars. We'll come back to that.
Now you're joining this and thinking, where is this going? I wanted to hear about websites and apps, and we'll be turning to that in just a moment. But before we do, I want you to think of your five star hotel. And I want you to think that you've been there a few days. Gone out in the morning, come back to your room, and you walk in and it's beautifully been serviced by housekeeping. You walk into the bathroom, and on the floor -- on the black marble floor there's a towel. A clean towel on the floor of the bathroom in your five star hotel. So my question to you is, is this now this moment of imperfection, is this now making it a four star hotel? Or is it still a five star hotel for you? Your favorite hotel, which happens to have some imperfection lying there on the bathroom floor? 
So we'll come back to the idea of whether we tolerate the imperfection or whether we cross the threshold. Just wanted to spend a moment illustrating that, which I hope also reinforces how easily people take to thinking through the stars. So now of course we'll be turning to websites and apps. 
So the star ratings have five star, four star, three star, two star, and one star. It's as simple as it gets. We could give names to them. We don't really need to, but we could say five star is awesome, four star is great, three star is meh, and two star is very poor. For one star, we're not going to say it's extremely very poor. What we're going to do instead is a bit of a twist: the framework reserves this level for unknown. So at the one star level, it's not necessarily a reflection of poor or good quality. It's just unknown. Maybe we haven't had a chance to review anything. And that could be the default rating. So it's very useful.
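For anyone who thinks in code, here is a minimal, hypothetical sketch of that scale in Python; the names (StarRating, DEFAULT_RATING) are mine, not from the talk, but they capture the twist that one star means unknown and is the default:

```python
from enum import IntEnum

class StarRating(IntEnum):
    """Hypothetical encoding of the five-star scale described above."""
    UNKNOWN = 1     # reserved: no review yet, not a judgement of quality
    VERY_POOR = 2
    MEH = 3
    GREAT = 4       # at or above the "good to go" line
    AWESOME = 5     # excellent, though not "perfectly flawless"

DEFAULT_RATING = StarRating.UNKNOWN  # every product starts here until reviewed
```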
I want to point out a couple of other things. At the five star level we can say awesome, but not perfectly flawless, and that's important. In my view we cannot have a top level that expects perfection, because, as anyone who works in software development or website content development knows, at the scale we're operating it's unrealistic to expect that products are absolutely flawless every day.
So we need to have some space for imperfection, some tolerance for it. We need to know: is it at a level we tolerate, or does it drop down through a threshold to a lower level? One other thing I want to point out on the slide is that there's a line. This is probably the most graphical element of the presentation. There's a yellow line between 4 star and 3 star, and that line is an important threshold. In this framework, what it's signifying is that things above the line -- so 5 star and 4 star -- would be considered something the organization feels proud enough to put out there. It's solid. Below the line is unacceptable; it doesn't have an acceptable level of quality. Now you might wonder, why do we have a line and then two steps above it? Why not just say it's above the line and you're done?
Well, this is something I would put down to human nature, or what I think of as product manager nature: do the minimum effort to get the maximum result. And that's a good thing. I mean, they're operating on an excellent principle, I think. But what is very possible is that they will just do enough to get their product over the line into the 4 star, and they're done. So having an extra level above that allows us to set excellence as the organization's goal and not just adequate.
So hopefully that can be simple for a lot of people. In an elevator, we can get asked the question: is the website accessible? And I can answer, well, it's at the 4 star level. And if the executive asks, what's the 4 star level?, I can then say, oh well, compliance and some others have a new way of explaining quality, and I'll send you something about that. This becomes a very handy way for people to talk about quality. For this to work, however, we have to define the levels of accessibility quite specifically. And that's going to be the work of, among others, the accessibility program team, but also other people who really live and breathe the requirements of your organization, the policies, and those kinds of things. So in a financial organization it's going to include people in compliance, maybe risk, definitely legal. If in your organization those areas are less involved at the moment in day-to-day accessibility questions, maybe it starts with the accessibility team and then goes over to others to review and approve.
This is why collaboration is so important, and I'm so fortunate at TD to have that. At TD I work with, just to name a few of the people who have been instrumental in making this come to life, Burt Floyd, David Burrows, Lauren Kravetski, Lori Peters and others. They represent their specific areas, but also bring an expertise in accessibility: experts in legal, experts in compliance for accessibility. What this also does, beyond improving the framework itself, is give it credibility. If it's just Aidan in an elevator saying it's 4 stars, it's like, what the heck is that? But if it has the stamp of approval of other areas in the business, it starts to take on a life of its own. What you would hope to do is gather a small group of people and reach a consensus with them about what information we should be looking at for a product, and then how to align it to say it's 4 star, 5 star, whatever the star rating is.
 
 
We need this to be information that is quite readily at hand. We're not going to go to a product and say, go do an audit. No, no. It's just whatever information they would normally have, and the ratings account for the fact that they might not have it. So it's just a reporting exercise for the products. And this group is deciding what they need to report.
And we're deciding on the thresholds. Now, in case it's not clear, I just want to really stress that the idea of this framework I'm proposing is not to define what quality is for the whole of the internet. It's not that at all. It's meant as a very specific tool for an organization to talk better about accessibility in order to work quicker and create better products. And it will very much be tied to the context of your organization.
So let me show you how you would tie that in. What I'm going to do is go through four things that we could put into the definitions. I won't go into all the details that are on these slides at the moment, but I'll then do another pass through with a more specific example. So we could ask each product to give us an inventory of the screens or flows they have. In the case of that 10,012-page website, a list of the pages. In the case of an app, pages or screens sometimes don't really make sense, so flows: the user can authenticate and then check their balance, whatever the app's main activities are.
We can then put on this scale how complete what they provided is, and we'll go through some examples of that. The other thing we can look at, which is a little bit related, is the testing coverage: how much of the website or the app was tested for accessibility? This can be very helpful for understanding quality in two senses. If it has not been tested, we cannot draw conclusions about quality. That simplifies the activity of this rating a lot: we can rule out a lot of apps because they haven't been tested -- not tested improperly, they just haven't been tested. The other thing is, when were they last tested? I think in Glenda Sims's presentation yesterday she asked, how long are the results valid? Sometimes while you're still testing the results are no longer valid because things are changing so much. So we can put a time stamp on the results and give them a best-before date. We'll do an example with that.
For sites where you do have test data, what we want to be doing is looking at it from the context of users. Can they do the things the app or website is for? If your issues are categorized by user impact at high and critical, then that's going to be very simple. If your highs and criticals are instead just aligned to particular success criteria, you might need to do a bit of a pass through to see whether they stop a user: are they a complete barrier or a partial barrier? Then we can put those across the scale.
 
 
An interesting thing that we can collect here, which is not really technical, is unresolved customer accessibility complaints. This brings together two quite interesting concepts. We get feedback from customers, in this case complaints, but we can also measure how quickly the product resolves those complaints. Because in a widely used product, I don't think we can fault any product for having a complaint. It may actually be really helpful to get that complaint, so as to make it better for everyone. But by combining the fact that they had a complaint with the speed with which they fix it, it gives us a sense not only of the quality of the product but also of the product team -- the culture of that team perhaps, or the challenges they face with technology.
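Pulling those four inputs together, a minimal sketch of the self-reported data each product might hand over could look like the following; the field names and types are illustrative assumptions, not anything defined in the talk:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ProductReport:
    """Self-reported inputs for one app or website, roughly as sketched in the talk.
    Field names and types are illustrative assumptions."""
    name: str
    inventory_completeness: float              # 0.0-1.0: share of screens/flows inventoried
    testing_coverage: float                    # 0.0-1.0: share of the inventory tested for accessibility
    last_tested: Optional[date]                # None if never tested or unknown
    open_high_or_critical_issues: int          # barriers that stop users completing core tasks
    complaints_unresolved_over_6_months: int   # customer accessibility complaints still open after six months
```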
So if we go through these four possible definitions, we could look at a couple examples. Maybe three if there's time. So I didn't mention before, but all the values on these slides are just totally arbitrary. Like you just make them up yourself. So the values I mean are, for example, at 5 stars, we're going to say the inventory has to be 100 percent complete. So we know all the screens and any related questions that we've asked the product about that. At the 4 star level we could say, well, there's some tolerance for not having all of that information. Maybe we expect 90 percent. Then going down. But your group would decide what is the best for your organization. So let's think of an example. 
Imagine a website that was built in 2011 as a marketing microsite. It's a contest where you can win a ticket to a concert that happened in 2011. For reasons unknown, it's still out there on the internet. Customers could go there; potentially they do go there. So we find out there's this website. We find out who the owner is. We go to them and we say, tell us about your website. How many screens does it have? And they'll be, like, um, what website? I don't know what website. Well, your name is next to it. They're like, yeah, well, that's an old website. I think we're going to get rid of that. So that's very valuable information for this framework, because it allows us to look at the definitions: they have basically given us zero information, and they know nothing about the pages that are there. So, looking through the definitions starting from the bottom, 1 star is 0-49 percent. Well, zero percent fits there. Or maybe we'll give it the benefit of 1 percent because we know the home page. This is 1 star. This is where the framework really becomes efficient. Because once we identify a good reason -- and this is a clear reason why something is 1 star -- we can stop. We can stop. So I've oversimplified it. Maybe the product owner comes back and checks it all. But maybe they don't know. Maybe they can't get behind the first screen for some reason, and it's going to be the owner's job to figure out what's there. We know it's 1 star; we can stop.
We probably can ask the product owner those subsequent questions, but really the answers at this point don't matter. So for the interest of time, we can leave that aside and we've rated it. 
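To make the walkthrough concrete, here is a rough sketch of how those definitions could be evaluated in code, reusing the ProductReport sketch above. The cut-offs (100/90/75 percent inventory, everything tested within three years at the top levels, no highs or criticals, no complaints open past six months) are the arbitrary placeholder values mentioned in the talk, not fixed rules, and the overall rating is simply capped by the weakest dimension, which is what makes the early stop possible:

```python
from datetime import date, timedelta
from typing import Optional

THREE_YEARS = timedelta(days=3 * 365)

# Placeholder thresholds only -- in practice your own group decides these values.
def rate_inventory(completeness: float) -> int:
    if completeness >= 1.00:
        return 5
    if completeness >= 0.90:
        return 4
    if completeness >= 0.75:
        return 3
    if completeness >= 0.50:
        return 2
    return 1  # 0-49 percent known: effectively unknown

def rate_testing(coverage: float, last_tested: Optional[date], today: date) -> int:
    if last_tested is None:
        return 1  # never tested: no conclusions possible
    recent = (today - last_tested) <= THREE_YEARS
    if recent and coverage >= 1.00:
        return 5
    if recent and coverage >= 0.80:
        return 4
    if recent:
        return 3  # core tasks and flows tested within the last three years
    return 2      # tested, but the results are past their best-before date

def rate_issues(open_high_or_critical: int) -> int:
    # The talk only pins down the top levels (no highs or criticals);
    # the lower levels are collapsed here for brevity.
    return 5 if open_high_or_critical == 0 else 2

def rate_complaints(unresolved_over_6_months: int) -> int:
    return 5 if unresolved_over_6_months == 0 else 2

def rate_product(report: ProductReport, today: Optional[date] = None) -> int:
    """Overall rating is capped by the weakest dimension. Once a dimension
    pins the rating at the floor there is nothing left to learn, which
    mirrors the 'we can stop' short-circuit in the walkthrough above."""
    today = today or date.today()
    rating = 5
    for dimension_rating in (
        rate_inventory(report.inventory_completeness),
        rate_testing(report.testing_coverage, report.last_tested, today),
        rate_issues(report.open_high_or_critical_issues),
        rate_complaints(report.complaints_unresolved_over_6_months),
    ):
        rating = min(rating, dimension_rating)
        if rating == 1:
            break  # already at the floor -- no need to ask further questions
    return rating
```

A design note on this sketch: taking the minimum across dimensions means a product can never score above its least-known aspect, which is exactly the behaviour the 2011 microsite example relies on.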
Let's think of an app that was launched last month, and it's a really small app. All it does is one important thing, but it's very simple: it's an authentication app. So instead of getting a text message to do two-factor authentication -- you know, when you put your password in and it asks for your phone number -- it gives you the option to have a code come up in an app. You might be familiar with some of these, especially in work situations. So we go to the product owner, and she says, yeah, I'll give you the inventory. Just take this Confluence page; it's all there. There are three pages, or three main flows, I should say. So we've got that full information, and that aligns with 5 stars so far. Then we ask, tell us about your testing coverage. The product owner is right on it and says, here's the JIRA link. So they've tested everything. And so far it's tracking 5 stars, because we've defined that you need to have tested everything at 100 percent to be at the top level. Then we look through the issues that are in that repository, which usually is not too hard to do. You can do a filter and look at them quickly. And we see that there are no highs. There are no criticals. There are a few lows that they have put in a backlog to resolve later. At this point we don't need to know anything more about those issues. We have to trust that QA did their testing and that they are in fact lows. And then we can look here: well, where would that fit? At 5 star we can have no highs or criticals, so that seems to align with what we just found. So it could be 5 star. At 4 star we might expect the same quality, but we've so far established everything else was 5 star, so we're still tracking at 5 star.
Now the fourth of these possible definitions is unresolved customer complaints. Well, it just launched, so there are no complaints older than six months. So it looks like it tracked everything to 5 star, and it's 5 star. Just briefly, I'll do one more. Let's imagine a kind of mixed app-and-website that has some new pages and some really old legacy screens, and it's a fairly large website. If we ask for the inventory, the product owner has a good sense of what's in the newer parts, but is a little unclear about some of the pages from the 2010 era that are in the website. There's no list of them, and they would have to do a lot of work just to create an inventory.
So let's say that puts them at around 75 percent of the information we would want in order to be comfortable rating it. If I look in my rating definitions, 5 star expects 100 percent; 4 star, 90; 3 star, 75. So this aligns with 3 star. Going back to the product owner: well, give us your test coverage. It turns out all they test is some of the flows at the front of the website, the stuff that's new. So if we look at where this aligns, it's not 100 percent. It's not 80 percent, which is 4 star. It's core tasks and flows within the last three years, maybe. We go back to the product owner or check the documentation again, and in fact some of that testing happened four years ago. Some happened more recently, but not everything, even in the core flows, has been tested in the last three years for accessibility. So in fact we have to go down a notch to 2 star, because we don't meet the criterion of core tasks and flows tested within three years.
 
 
Can the users complete the core tasks? Well, it turns out in the results there are some high issues that have been sitting there. So we could look at where this fits, but we already know the best they can do is 2 star. In the interest of time, I won't take you through that, because you can see it. This kind of skipping over things is what is useful, and it's how you will actually use the framework: once you know, you can just kind of gloss over the rest. Unresolved customer complaints? In fact, it turns out they do have customer complaints, and they don't know how old they are. If they don't know, then we can assume they're not from the last six months; they're older than that.
But we really don't have to dwell too much on this in terms of deciding the rating, because we've already established the best it can be is 2 star. So we will call that 2 star. That's kind of the flow. Automated scan results would be amazing, but the thing there is you want to make sure you have enough applications running scans across the portfolio to make the most use of that. You would definitely not want to penalize the products that use automated scanning, which are potentially finding issues, especially if they're older websites. It's actually a good thing that they're using the scanning, so you want to find a way to factor that in. The ideal scenario would be where the whole portfolio is running some kind of automated scanning, and then you put a metric from the scanning into the definitions. Age of issues can be really helpful if you have the ability to pull that information: how long does a product take to fix an accessibility issue?
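Before moving on, here is how the three walkthroughs might look as data fed to the sketch above; the numbers are rough stand-ins for what the product owners reported, and the expected output is 1, 5 and 2 stars respectively:

```python
from datetime import date, timedelta

today = date.today()

microsite_2011 = ProductReport(
    name="2011 contest microsite",
    inventory_completeness=0.01,                  # only the home page is known
    testing_coverage=0.0,
    last_tested=None,
    open_high_or_critical_issues=0,               # unknown, but it no longer matters
    complaints_unresolved_over_6_months=0,
)

auth_app = ProductReport(
    name="two-factor authentication app",
    inventory_completeness=1.0,                   # three flows, fully documented
    testing_coverage=1.0,
    last_tested=today - timedelta(days=30),       # launched and tested last month
    open_high_or_critical_issues=0,               # only low-severity issues in the backlog
    complaints_unresolved_over_6_months=0,
)

mixed_legacy_site = ProductReport(
    name="mixed new/legacy website",
    inventory_completeness=0.75,                  # legacy pages never inventoried
    testing_coverage=0.60,                        # only the newer front-of-site flows
    last_tested=today - timedelta(days=4 * 365),  # some core flows last tested four years ago
    open_high_or_critical_issues=3,
    complaints_unresolved_over_6_months=2,        # age unknown, so assumed older than six months
)

for product in (microsite_2011, auth_app, mixed_legacy_site):
    print(f"{product.name}: {rate_product(product, today)} star(s)")
```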
Usability testing would be interesting to factor in, particularly in an organization that is making a lot of progress with the basics. You could say at the 5 star level you need to do usability testing once every year or two, whatever you name it. That means that products that would otherwise just try to get over the line really have to step up and go meet customers, whether virtually or not, and hear back from them whether it is really accessible.
 
 
External targeted audits are another area to factor in as a stretch goal at the 5 star level. This is not a regular audit where you would do an end-to-end check of everything. It's just getting someone outside, like a specialist, to come in and test a few pages and see if that matches up with what the product is reporting. Because a lot of this information we are taking at face value; it is self-reported. But it's like tax, right? You submit your taxes, and there's always the possibility that someone will come in and check them. So external targeted audits would be an excellent way to really be sure that the self-reported information is accurate.
So there are a few concepts that underpin the ratings, and that you're doing a lot of testing is one of the main ones. The more testing you're doing, and the more efficient the testing is, such as automated testing, the better the quality of information coming in, and the easier it is to make good distinctions between the levels.
Now, the important threshold is the 4 star threshold. Imagine a website is about to launch, brand new. It's what I would call the "good to go" line: anything 4 star or 5 star gets a thumbs up; it's ready to go, it meets what the organization needs. Anything else needs more work before it's ready to go.
What is so helpful here is if you get a genuine consensus and understanding from the different stakeholders as to what that looks like, and can explain it in ways that let product owners see what they really need to meet. It cannot, in my view, be absolute perfection. And I found the part about JT11 in Preety's keynote interesting, for those who caught that yesterday. The idea being that rather than saying everything is required -- and hopefully someone from Deque can explain it better than me -- JT11, coming from Jim Thatcher, is 11 things that have been identified internally by the organization and need to be met in order to go. So that illustrates to some extent this concept: people have thought ahead of time and defined it and said, these are the things you must do, but that's not absolute perfection.
The people who would assign the ratings initially need to be the accessibility team, but once it gets up and running, because it's just pulling in information, it doesn't necessarily need specialists to validate it. It could be done almost automatically, as long as the data's accurate.
We need to communicate the definitions to the product owners so we're transparent. And the product owners need to understand that it's not an individual specialist forming an opinion. This is where it's not like the hotel: these ratings are just aligning data with a definition, not an opinion from one tester. Well, they're not a tester, and that's the key thing. I want to make it explicitly clear: the accessibility specialist doesn't need to go to the website. They don't need to log in. They don't need to do anything. They're not testing. That would not scale.
So once that's all set up (and I'll be shifting gears to questions in just a moment), we need to rate the portfolio. The steps would be to find out what is in the portfolio, maybe taking a chunk of it to start. Go out and ask the product owners to provide the information. And then once you've collected it, before you can really start assigning ratings, what you are probably going to discover is that what seems like one website is actually made up of pieces that are owned by different groups. It could be that the customer sees one website and there are actually 12 product teams building it. That's not an exaggeration; it could be more. So it's painstaking to go through that and figure it all out, but it's worth it. If you want to improve quality, you have to get to the team, and you have to get to the owner of the product. And if we can identify the pieces and rate the pieces, then they have some benchmark to see if they can improve over time. Where next? Well, one of the things would be what you do with low-rated applications. That's a whole other topic, really: how does the organization move forward? One of the things that's going to come up is how often to do it again. And what I would love to do -- and then I'll turn it to questions just now -- what I would love to do is be able to validate the thresholds with customers with disabilities. And get a sense from them: what is it for you that makes something drop from being awesome to just meh? What kinds of barriers, or what has to happen, for that to happen? Or alternatively, what is it that makes a very good experience awesome? What are those little things that make it just a cut above?
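And, as a final hypothetical sketch building on the pieces above, the portfolio pass could amount to little more than a loop in which anything that never reports keeps the default one-star unknown rating:

```python
from datetime import date
from typing import Dict, Iterable, Optional

def rate_portfolio(known_pieces: Iterable[str],
                   reports: Iterable[ProductReport],
                   today: Optional[date] = None) -> Dict[str, int]:
    """Every known piece starts at the one-star 'unknown' default; only the
    pieces whose owners sent back a report get a data-driven rating."""
    ratings: Dict[str, int] = {piece: int(StarRating.UNKNOWN) for piece in known_pieces}
    for report in reports:
        ratings[report.name] = rate_product(report, today)
    return ratings
```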
So I'll turn it back to Noah for questions. 
 
 
>> Fantastic. Thank you so much, Aidan. This was a tremendous presentation. I know personally, everything you just ran through was amazing. Lots of great questions in the Q&A; thanks to everybody for the engagement. They've been stack ranked by popularity. Question from Anonymous: Interesting that you talk about the last time a test was completed. How often do you recommend doing testing? Is it every time a page or flow is updated, or only when major changes are completed on a page?
>> Aidan: Yeah, that's the kind of strategy you would hopefully set in your organization, if it's that cohesive. But the thing about this rating framework is it has to accommodate whatever you're doing right now. It's not trying to set the ideal for how to develop accessible applications. So I'll kind of punt that question, in the sense that you would find what you're doing now and then rate to just that. It's trying to solve for the communication only, and not all the good stuff underneath.
>> Interesting, that makes sense. Control what you can control. Another question from Anonymous: I work at a 3,000 employee international company. We do not have an accessibility team like I hear some of the speakers state. There is one person who among other responsibilities manages to make sure development teams self-report a VPAT on their projects. How can I pitch this idea to my management? 
>> Aidan: I think what's typical at a large company is that at a certain point there will be an accessibility team. At TD there are actually multiple teams, and our team is focused entirely on customer experience: customer-facing large websites and apps. I see companies even larger than 3,000 that don't have that role. So to me the first step seems like getting executive buy-in that it's important enough to have one group look after it. But I think you could also use some of the concepts here even without that, and maybe come at it a different way and say, you know, we've discovered a lot of the products haven't been tested. Is there a legal risk with that? Or, we've discovered a lot of the products have complaints that haven't been resolved. There might be ways in to get some leverage with others in the organization when you show a more simplified view of what's going on with accessibility.
>> Yeah, makes sense. Will WCAG 3.0 include any similar spirit kind of rating system? I thought I heard about that, but I'm not sure. 
>> Aidan: Yeah, I haven't looked at the latest draft that's out there. There's been a lot of discussion on the discussion board, and the best minds in the industry are working on that. I don't know where it'll end up. But I think they're definitely considering two parts of this. One is how we have good, very good, amazing -- maybe it could be platinum, gold, silver or something. But also, how do we test a gigantic website when we know we're not going to be able to test every screen? Is there a way to have a more unified agreement that certain targeted testing or template testing is acceptable? Both of those are needed in order to do that. I hope they come up with something that works, but also my hope is that none of it expects absolute perfection, because then we in the accessibility teams are left glossing over it in that elevator: yeah, it conforms, wink wink, there's lots of ARIA misuse but it's an amazing experience. I hope they would address that. That would be my biggest hope.
 
 
>> We've been getting this question a lot today. Where can I get one of those T-shirts? I can answer that one. There will be a post-conference e-mail survey. And if you fill out the survey you'll be entered in a raffle. And I think we're giving away a thousand or more T-shirts. Look out for that e-mail. We'll probably take this as our last question. 
Should persona testing be used to determine an accessibility rating? Do you see different personas as a vital part of this rating system?
>> Aidan: I'm a big fan of testing early, and I do a lot of talking about testing wireframes for accessibility. But I don't see that as fitting here, and I'm glad the question came in. My view is that this framework is best approached from what the customer is actually seeing, which is production, if you like. We may not be literally testing production, but we're testing what the customer sees. So all the great work that happens before is not necessarily relevant to that, because I've seen a lot of projects where they do all the work up front and in the end it still has a lot of issues for customers. And obviously most of the time it's the other way: they do all the work up front and it's smooth sailing to a great experience. But I personally don't think there's value in putting those process things in there. You, in your organization, may find that they're an indicator of quality, and then why not add them?
>> Yeah, I love the kind of framework you've established here being so context specific, and what works best for you. I think that's great. 